Why Kiswahili content moderation is a nightmare for social media giants


Social media giants are finding it increasingly difficult to moderate harmful content uploaded to their platforms in the Kiswahili language and other low-resource and indigenous languages spoken in the Global South due to underlying complexities that hamper the development of automated moderation systems, a new report shows.

The Washington-based Center for Democracy and Technology (CDT), in its report titled Moderating Kiswahili Content on Social Media, which focuses on Kenya and Tanzania, says the situation is compounded by the fact that social media platforms have a limited physical presence in Africa, operating only a few offices and employing minimal staff.

CDT identifies itself as a non-profit organisation that advocates for digital rights and freedom of expression, and seeks to promote legislation that enables individuals to use the internet for well-intentioned purposes while reducing its potential for harm.

“The scarcity of the public domain corpora for low-resource languages, such as Kiswahili, hinders the development of machine learning models, which continue to be optimised primarily for high-resource languages like English,” notes CDT.

Further, the report says, the evolution of Kiswahili in the post-colonial era has given rise to two linguistically challenging phenomena that complicate the development of accurate Artificial Intelligence (AI) models: code-mixing and Sheng.

Code-mixing, also referred to as code-switching, happens when a speaker incorporates two or more languages within a single sentence or clause.

The practice is common among Kenyan and Tanzanian Kiswahili speakers, as many are bilingual and have a working knowledge of English.

“Given the variations, complexities, and scarcity of training datasets, the current content moderation systems used to regulate Kiswahili content online have significant shortcomings that affect Kiswahili speakers,” wrote CDT.

Another notable challenge, the report notes, is that the same word can carry different cultural meanings across regional variations of the language, which can lead to misinterpretation.

“For example, shoga could mean ‘friend’ in Tanzania but is more likely to mean ‘homosexual’ or ‘gay’ in Kenya. Moderating such variations on social media is challenging,” reads the report.

The study further notes that tech companies are using a combination of classifiers and automated multilingual language models to detect and moderate harmful content in low-resource languages, to compensate for the lack of digitised training data in these languages.

“Data for low-resource languages is frequently of low quality, often suffering from mistranslations or sourced from a handful of specific sources like religious texts and Wikipedia,” says CDT in the report.

“Multilingual language models seek to bridge these data gaps by applying semantic rules inferred from higher resource languages onto lower-resource ones, but they typically depend on English texts, which can lead to the introduction of unsuitable values and assumptions into other languages.”

The organisation adds that better datasets are needed for Kiswahili and other low-resource languages to achieve more accurate automated content moderation, but notes that collecting and annotating such datasets comes with significant limitations, including the continuous evolution of the language.

“We conducted a roundtable attended by 10 Kiswahili Natural Language Processing (NLP) and linguistic researchers from Kenya, Uganda, and Tanzania. The discussions revealed significant linguistic complexities, including the evolution of Kiswahili, Sheng, and the integration of Code-Mixing with local dialects,” wrote CDT.

“These factors complicate data collection and model generalisation, especially given the diversity in cultures across East Africa.”

Sheng, for example, was cited as a rapidly evolving language, with diverse variations presenting unique challenges for NLP models compared to other more stable dialects.

In addition to the linguistic challenges, the process of data collection and annotation was also found to impact the quality of models, with participants identifying critical data access issues that led them to rely on community-based resources to collect training data.

Data annotation was cited as costly, with participants indicating that they relied on students or friends, while others indicated that they turned to AI tools like ChatGPT for annotation.

According to the report's survey findings, TikTok leads in popularity among Kiswahili users, with Meta platforms Facebook and Instagram trailing closely. Others that have gained significant traction are YouTube and X (formerly Twitter).

Seventy-eight percent of Kiswahili users who participated in the survey shared concerns about the spread of misleading content and inciting material online, while 70 percent indicated that they had at least once reported content they believed violated a platform’s policy.
