As an expat working in an international environment, I’m used to conducting nearly all my daily and professional tasks in a language that isn’t my own. As a linguist, I find that both fascinating and rewarding, yet it feels unsettling how this one particular language—English—has come to dominate almost every part of my life, distancing me from other languages and cultures that shaped me or that I’ve grown to love. This very blog post, the sources I consulted and questions I googled or asked ChatGPT in order to write it, but also my private searches for recipes or workout videos all happen in English. Most of the content I consume online is in English, despite not living in an English-speaking country. This quiet dominance reflects something larger than personal habit: the global linguistic hegemony of English, which mirrors deep social, political, and economic inequities. It privileges those of us fluent in it while excluding countless others—and even entire communities—from access to knowledge, participation in technology, and influence in the conversations that define our shared future. With the rapid spread of generative AI into nearly every aspect of daily life, this imbalance is only amplified.
Even though there are over 7,000 languages spoken worldwide, the internet is primarily written in English and a small group of other mostly European or Asian languages, such as Spanish and Mandarin. Since generative AI tools are mostly trained on internet data—websites, books, Wikipedia, news articles, forums, code repositories, and social media—access to these tools may be limited for individuals or communities who are not fluent in these languages. To make matters worse, this limited access leaves speakers of under-resourced languages with a smaller digital footprint, making it less likely that their languages are included in web-scraped training data and creating a downward spiral: without sufficient data to train usable language-based systems, most of the world’s AI applications will under-represent billions of people, further deepening existing economic and political inequities.
Of all the languages spoken worldwide, only about twenty are considered “high-resource.” The rest, which is to say nearly all of the world’s languages, are “low-resource.” This term refers not only to the limited amount of textual data available to train language-based systems effectively, but also to the lack of computational infrastructure needed to model and process these languages: things like keyboards, Unicode support, or even basic digital tools. It also covers the absence of researchers with relevant expertise and the lack of financial or political support for institutions researching and modeling these languages. Together, these factors pose a major challenge for building models that work well with low-resource languages and often lead to their significantly poorer performance.
Large-scale evaluations of generative AI reveal a striking imbalance in how well these systems handle the world’s languages. In the MEGA benchmark (Ahuja et al., 2023), which tested GPT-3.5, GPT-4, and other multilingual models across seventy languages, English and a few other high-resource languages consistently outperformed the rest by a wide margin. GPT-4, for instance, achieved over 96 percent accuracy on an English reasoning task but dropped to around 77 percent for Burmese, while similar disparities appeared for Tamil, Haitian Creole, and other low-resource or non-Latin-script languages. Even when prompts and translation strategies were adjusted, the gap persisted—showing how the dominance of English in training data continues to shape the very boundaries of what AI can understand. This inequity becomes even more pronounced within a single language family. In the SADID evaluation dataset (Abid, 2020), translation models handled Modern Standard Arabic—the formal, well-documented variety—far better than the spoken dialects of Egyptian or Levantine Arabic, where translation quality fell by more than half. These results expose a deeper layer of inequality: not only do low-resource languages struggle to be recognized by AI, but even within a shared linguistic tradition, the everyday voices of millions are systematically left out.
The consequences of this inequity, also known as the digital language divide, go far beyond making it harder for speakers of Indigenous or low-resource languages to navigate the internet as chatbots, translation devices, and voice assistants become a crucial way to do so. It has a direct impact on the quality, reliability, safety, and even pricing of the generated content. Since LLMs learn their values from their training data, this data should be carefully selected, filtered, and curated to make a chatbot helpful, human-sounding, and free of racism and sexism. Unfortunately, the texts readily available in low-resource languages are often of poor quality, badly translated, or of limited use. For years, the main sources of text for many low-resource languages in Africa were translations of the Bible or missionary websites, such as those of Jehovah’s Witnesses. While these texts are of great historical and linguistic value, they are not the most useful base for an application that should help you tutor your child, draft work memos, summarize books, conduct research, manage a calendar, book a vacation, fill out tax forms, surf the web, and so on.
Another issue lies in a vicious circle: models trained on poor data produce content of questionable quality, which is then used to train them further, resulting in ever worse, unreliable, or even unintelligible output. Since most websites make money through advertisements and subscriptions, which rely on attracting clicks and attention, an enormous portion of the web consists of content with limited literary or informational merit: an endless ocean of junk that exists only because it might be clicked on. To widen its reach for profit, this (poor) content is very often (poorly) machine-translated into multiple languages by freely available AI programs of questionable accuracy. The same content is then scraped by AI developers to train their models further. Since there is still a lot of high-quality data available for high-resource languages, especially for English, given that roughly half of all websites are written in it, the problem is far less pronounced there than for low-resource languages, where data is generally scarce.
Another issue is that the pricing of models, and even their usage limits, is based on the linguistic features of English. While ChatGPT is free to use, many other LLMs charge users by the number of tokens. For languages such as Armenian or Burmese, whose scripts and morphology require more tokens than English to express the same meaning, text generation becomes considerably more expensive. The same applies to limits on the length of prompts and responses. Due to prompt limits, some tasks that require more elaborate instructions may be impossible in languages like Malayalam, whose token usage is 15.69 times higher than that of English. GPT-3, for instance, can return only up to 4,000 tokens for the combined prompt and response; that token count might correspond to a short tweet in one language but a medium-sized blog post in another.
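The token inflation described above is easy to see for yourself. A minimal sketch, with two important caveats: the greeting phrases below are my own illustrative examples, not benchmark data, and I use UTF-8 byte counts as a crude proxy for token counts, since byte-level BPE tokenizers tend to fall back toward one token per byte for scripts that received few learned merges during training. Real tokenizers will give different absolute numbers, but the direction of the disparity is the same.

```python
# Sketch: why the "same" message can cost several times more tokens
# in a non-Latin script. UTF-8 byte length is used here as a rough
# worst-case proxy for token count; ASCII letters take 1 byte each,
# while Malayalam characters take 3 bytes each.

def byte_count(text: str) -> int:
    """Length of the UTF-8 encoding of the text."""
    return len(text.encode("utf-8"))

english = "Hello, how are you?"       # Latin script, all ASCII
malayalam = "നമസ്കാരം, സുഖമാണോ?"      # Malayalam script (illustrative phrase)

e = byte_count(english)
m = byte_count(malayalam)
print(f"English: {e} bytes, Malayalam: {m} bytes, ratio ≈ {m / e:.2f}")
```

Under per-token pricing, a user writing in Malayalam would pay that ratio as a multiplier on every exchange, and would exhaust a fixed context window correspondingly faster.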
Recent research also shows that large language models are not only less capable in low-resource languages; they are also less safe. Many studies (Yong et al., 2023; Deng et al., 2023; Shen et al., 2024) reveal that models like GPT-4 are three times more likely to produce harmful or unsafe content when prompted in low-resource languages than in English or other high-resource ones. The reason lies in the unequal depth of training and alignment: safety fine-tuning and moderation data exist mostly for English, so the guardrails that prevent toxic or biased outputs in one language simply fail to generalize to others. In addition, moderation systems themselves are often designed for English-like morphology and miss toxic phrasing in languages with richer or more complex structures. The result is a troubling paradox: the very languages most underrepresented in AI are also the ones most exposed to harm when they are finally included.
When AI fails to function well in under-resourced tongues, the consequences go far beyond convenience and safety—they strike at identity. Many speakers of low-visibility languages already live in digital invisibility, and younger generations may see no reason to learn a language that no app, chatbot, or search engine understands. For languages with rich literary and artistic traditions—like Arabic, Persian, or many Indigenous and African languages—this effect is magnified: if people turn online for music, poetry, storytelling, or cultural content, and those forms aren’t supported in their language, the culture itself becomes harder to access and perpetuate. Languages that seem invisible to AI risk being abandoned—and with them, the songs, poems, and narratives that give them life.
The onus is on AI developers to listen to the actual needs of speakers, not just to generalize “big-tech” solutions for everyone. True linguistic inclusion requires more than adding translation features or tokenizing more languages—it means engaging with the communities who speak them. Speakers know best how their languages live and change, how they are taught, sung, and used in everyday life. Without their participation, even well-intentioned AI efforts risk reproducing the same hierarchies they claim to overcome. Building models that genuinely support linguistic diversity means designing with context, not just for convenience—acknowledging that a language is not only a system of words but a vessel of culture, memory, and identity. Only by centering real voices, not abstract datasets, can AI become a tool that strengthens rather than erases the world’s linguistic richness.