Tokenisation bias: How AI language gap raises digitisation cost

Artificial intelligence (AI) platforms are emerging as a new frontier of digital disparity, with African users paying more to use global systems that process commands in English far more efficiently than in local languages such as Swahili.

The difference stems from how leading AI developers bill their services in tokens, the small text fragments representing words or parts of words that a model uses to process commands and generate responses.

Global models, including those built by multinationals OpenAI and Google DeepMind, have been primarily trained on vast English-language datasets scraped from the internet, academic publications, and books, giving the systems native efficiency in interpreting and generating English text.

When the same requests are made in Swahili, however, the models require between 30 and 50 percent more tokens to deliver the same output, according to open research data from American AI firm Hugging Face.
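
That overhead can be observed directly with openly available tokenisers. The sketch below uses OpenAI's open-source tiktoken library to count tokens for a pair of roughly equivalent sentences; the sentences are illustrative examples rather than data from the Hugging Face research, and the exact gap varies by model and text.

```python
# A minimal sketch of how the token gap can be measured with OpenAI's
# open-source "tiktoken" tokeniser. The sentences are illustrative, not
# taken from the Hugging Face research cited above.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5/4-era models

english = "Hello, how can I help you with your bank account today?"
swahili = "Habari, naweza kukusaidia vipi na akaunti yako ya benki leo?"

en_count = len(enc.encode(english))
sw_count = len(enc.encode(swahili))

print(f"English: {en_count} tokens")
print(f"Swahili: {sw_count} tokens")
print(f"Swahili overhead: {sw_count / en_count - 1:.0%}")
```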

The difference translates into higher operating costs for the developers, companies and users who run AI tools or chatbots in African languages, since most commercial platforms charge per token processed.

In more practical terms, a Nairobi-based fintech firm building a Swahili-language virtual assistant could pay up to half as much again in API fees as a firm offering an English-only version of the same service.
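
To see how that premium compounds at scale, consider a rough, back-of-the-envelope calculation. All prices and traffic volumes in the sketch below are hypothetical round numbers rather than any provider's actual rates; only the 30-50 percent token overhead is drawn from the research cited above.

```python
# Back-of-the-envelope cost comparison under per-token billing.
# All prices and volumes are hypothetical round numbers; only the
# 30-50 percent token overhead comes from the research cited above.
PRICE_PER_1K_TOKENS = 0.002      # assumed USD rate per 1,000 tokens
REQUESTS_PER_MONTH = 1_000_000   # assumed chatbot traffic
EN_TOKENS_PER_REQUEST = 100      # assumed average English exchange
SWAHILI_OVERHEAD = 0.45          # upper end of the 30-50 percent range

def monthly_cost(tokens_per_request: float) -> float:
    """Monthly API spend at the assumed per-token rate."""
    total_tokens = REQUESTS_PER_MONTH * tokens_per_request
    return total_tokens / 1_000 * PRICE_PER_1K_TOKENS

english = monthly_cost(EN_TOKENS_PER_REQUEST)
swahili = monthly_cost(EN_TOKENS_PER_REQUEST * (1 + SWAHILI_OVERHEAD))

print(f"English-only service: ${english:,.0f} per month")
print(f"Swahili service:      ${swahili:,.0f} per month")
# At a 45 percent overhead, the Swahili service pays nearly half as much
# again for exactly the same number of customer interactions.
```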

This charging structure, described by African language researchers as a ‘tokenisation bias’, reflects an underlying gap in how global AI systems are trained and designed, rather than any deliberate pricing discrimination.

“Language models learn statistical patterns from the data they are exposed to, and with English accounting for over 60 percent of the internet’s text content, most systems naturally optimise for it,” asserts IT specialist David Waithaka.

“African languages, which make up less than one percent of the world’s digitised text corpus, are therefore broken down into smaller sub-units by the model to match existing English-based patterns, increasing the number of tokens processed.”
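
That fragmentation can be seen directly in an English-heavy tokeniser's output. The sketch below runs a few English words and rough Swahili equivalents through the openly available GPT-2 tokeniser from Hugging Face's transformers library; the word pairs are illustrative, and exact splits vary from tokeniser to tokeniser.

```python
# Illustration of subword fragmentation: a tokeniser trained mostly on
# English splits Swahili words into more pieces than their English
# equivalents. Uses the openly available GPT-2 tokeniser; word pairs
# are illustrative and exact splits vary between tokenisers.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

pairs = [("payment", "malipo"), ("customer", "mteja"), ("account", "akaunti")]
for english_word, swahili_word in pairs:
    for word in (english_word, swahili_word):
        pieces = tok.tokenize(word)
        print(f"{word:10s} -> {len(pieces)} sub-units: {pieces}")

# Swahili words tend to be split into more sub-units, so the same
# message consumes more billable tokens.
```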

That structural inefficiency, analysts argue, has direct commercial implications for the continent’s growing market, where language localisation is becoming central to digital government, e-commerce and customer support systems.

African computational linguists have begun work to correct the imbalance by building open-source language datasets and training models directly in local languages, through initiatives such as the pan-African Masakhane project, South Africa’s Lelapa AI and AI4Afrika.

Masakhane has developed translation datasets for over 200 African languages, while Lelapa AI is training foundational models that natively handle local dialects and idioms without breaking them into inefficient token fragments.

Experts say these homegrown efforts are critical to ensuring that Africa’s digital transformation does not depend entirely on external systems that were not designed for its linguistic landscape.

“We don’t have to translate who we are to be understood. AI shouldn’t charge extra for being African,” observes AI trainer Nyandia Gachago.

Limited research funding and computing resources, however, continue to constrain the progress of these homegrown solutions, leaving the continent dependent on imported models whose performance and costs it cannot fully control.

If the imbalance persists, industry analysts warn, African businesses could face a long-term structural disadvantage in adopting generative AI technologies, compared with regions whose languages align with how global models are trained.

Kenya’s rapid digitisation of public services, including Swahili-language chatbots for citizen engagement, could also face cost implications unless local language models are developed to match global standards.

In May last year, tech giant Microsoft announced that it would support the development of an AI model in Swahili as part of its grand plan to invest up to $1 billion (Sh129.2 billion at current conversion rates) in Kenya’s digital ecosystem.

At the time, the firm said the initiative would be geared toward supporting Kenya’s unique cultural and linguistic needs, a development touted as the catalyst that would drive AI uptake among local-language communities.

Earlier, in July 2023, Swahili had become the first African language to be added to Google’s conversational generative AI chatbot, Bard, alongside 40 other languages including Chinese, German, Spanish, Arabic, and Hindi.

Bard is Google’s experimental AI chat service, whose function closely mirrors that of ChatGPT; the main difference is that Bard pulls its information from the live web.
