
Generative AI and Low-Resource Languages: The Quest for Linguistic Equity

As an expat working in an international environment, I’m used to conducting nearly all my daily and professional tasks in a language that isn’t my own. As a linguist, I find that both fascinating and rewarding, yet it feels unsettling how this one particular language—English—has come to dominate almost every part of my life, distancing me from other languages and cultures that shaped me or that I’ve grown to love. This very blog post, the sources I consulted and the questions I googled or asked ChatGPT in order to write it, but also my private searches for recipes or workout videos, all happen in English. Most of the content I consume online is in English, despite not living in an English-speaking country. This quiet dominance reflects something larger than personal habit: the global linguistic hegemony of English, which mirrors deep social, political, and economic inequities. It privileges those of us fluent in it while excluding countless others—and even entire communities—from access to knowledge, participation in technology, and influence in the conversations that define our shared future. With the rapid spread of generative AI into nearly every aspect of daily life, this imbalance is amplified even further.

Even though there are over 7,000 languages spoken worldwide, the internet is written primarily in English and a small group of other, mostly European or Asian, languages such as Spanish and Mandarin. Since generative AI tools are mostly trained on internet data—websites, books, Wikipedia, news articles, forums, code repositories, and social media—access to these tools may be limited for individuals or communities who are not fluent in these languages. To make matters worse, this limited access gives speakers of under-resourced languages a smaller digital footprint, making it less likely that their languages are included in web-scraped training data and creating a downward spiral: without sufficient data to train usable language-based systems, most of the world’s AI applications will under-represent billions of people, further deepening existing economic and political inequities.

Of all the languages spoken worldwide, only about twenty are considered “high-resource.” The rest—that is, nearly all of the world’s languages—are “low-resource.” This term refers not only to the limited amount of textual data available to train language-based systems effectively, but also to the lack of computational infrastructure needed to model and process these languages: things like keyboards, Unicode support, or even basic digital tools. It also covers the absence of researchers with relevant expertise and the lack of financial or political support for institutions researching and modeling these languages. Together, these factors pose a major challenge for building models that work well with low-resource languages and often result in significantly poorer performance for them.

Large-scale evaluations of generative AI reveal a striking imbalance in how well these systems handle the world’s languages. In the MEGA benchmark (Ahuja et al., 2023), which tested GPT-3.5, GPT-4, and other multilingual models across seventy languages, English and a few other high-resource languages consistently outperformed the rest by a wide margin. GPT-4, for instance, achieved over 96 percent accuracy on an English reasoning task but dropped to around 77 percent for Burmese, while similar disparities appeared for Tamil, Haitian Creole, and other low-resource or non-Latin-script languages. Even when prompts and translation strategies were adjusted, the gap persisted—showing how the dominance of English in training data continues to shape the very boundaries of what AI can understand. This inequity becomes even more pronounced within a single language family. In the SADID evaluation dataset (Abid, 2020), translation models handled Modern Standard Arabic—the formal, well-documented variety—far better than the spoken dialects of Egyptian or Levantine Arabic, where translation quality fell by more than half. These results expose a deeper layer of inequality: not only do low-resource languages struggle to be recognized by AI, but even within a shared linguistic tradition, the everyday voices of millions are systematically left out.

The consequences of this inequity, also known as the digital language divide, go far deeper than making it harder for speakers of Indigenous or low-resource languages to navigate the Internet as chatbots, translation tools, and voice assistants become a crucial way to do so. It has a direct impact on the quality, reliability, safety, and even pricing of generated content. Since LLMs learn their values from their training data, that data must be carefully selected, filtered, and curated to make a chatbot helpful, human-sounding, and neither racist nor sexist. Unfortunately, the texts readily available in low-resource languages are often of poor quality, badly translated, or of limited use. For years, the main sources of text for many such low-resource languages in Africa were translations of the Bible or missionary websites, such as those of the Jehovah’s Witnesses. While these texts are of great historical and linguistic value, they are not the most useful base for an application that is supposed to help you tutor your child, draft work memos, summarize books, conduct research, manage a calendar, book a vacation, fill out tax forms, surf the web, and so on.

Another issue lies in a vicious circle: models trained on flawed data produce content of questionable quality, which is then used to train them further, resulting in ever worse, unreliable, or even unintelligible output. Since most websites make money through advertisements and subscriptions, which rely on attracting clicks and attention, an enormous portion of the web consists of content with limited literary or informational merit—an endless ocean of junk that exists only because it might be clicked on. To reach a wider audience, and driven by profit, this (poor) content is very often (poorly) machine-translated into multiple languages by freely available AI programs of questionable accuracy. The same content is then scraped by AI developers to train their models further. Since there is still a lot of high-quality data available for high-resource languages (especially English, given that roughly half of all websites are written in it), this problem is not as accentuated as in the case of low-resource languages, where data is generally scarce.

Another issue is that the pricing of models—and even their usage limits—is based on the linguistic features of English. While ChatGPT is free, many other LLMs charge users according to the number of tokens processed. For morphologically rich languages such as Armenian or Burmese, which require more tokens than English to express the same meaning, the price of text generation is therefore much higher. The same applies to limits on the length of prompts and responses. Due to prompt limits, some tasks that require more elaborate instructions might be impossible in languages like Malayalam, whose token usage is 15.69 times higher than that of English. GPT-3, for instance, allows only up to 4,000 tokens for the combined prompt and response. That token count might correspond to a short tweet in one language but a medium-sized blog post in another.
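To make this difference tangible, here is a minimal Python sketch of how one might compare token counts, and therefore costs, across languages. It assumes the tiktoken package with its cl100k_base encoding (the tokenizer used by GPT-3.5 and GPT-4) purely as an example; the price per thousand tokens and the sample sentences are illustrative placeholders, not actual vendor rates or real translations.

```python
# Minimal sketch: measure how many tokens (and how much money) the "same"
# sentence costs in different languages. Assumptions: tiktoken is installed,
# cl100k_base is used as an example encoding, and the price below is
# purely illustrative, not an actual vendor rate.
import tiktoken

PRICE_PER_1K_TOKENS = 0.002  # hypothetical price in USD, for illustration only

enc = tiktoken.get_encoding("cl100k_base")

def token_cost(text: str) -> tuple[int, float]:
    """Return the token count and the estimated cost of encoding `text`."""
    n_tokens = len(enc.encode(text))
    return n_tokens, n_tokens / 1000 * PRICE_PER_1K_TOKENS

# Compare an English sentence with an equivalent sentence in another language.
# The second string is a placeholder: substitute a real translation into
# Malayalam, Armenian, Burmese, etc. to see the ratio for that language.
samples = {
    "English": "Large language models charge by the token, not by the word.",
    "Other language": "<equivalent sentence in a morphologically rich language>",
}

baseline, _ = token_cost(samples["English"])
for lang, text in samples.items():
    n, cost = token_cost(text)
    print(f"{lang}: {n} tokens, ~${cost:.5f}, {n / baseline:.2f}x the English count")
```

Run on real translations of the same sentence, a comparison like this makes the token tax on morphologically rich or non-Latin-script languages immediately visible.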

Recent research also shows that large language models are not only less capable in low-resource languages — they are also less safe. Many studies (Yong et al., 2023; Deng et al., 2023; Shen et al., 2024) reveal that models like GPT-4 are three times more likely to produce harmful or unsafe content when prompted in low-resource languages than in English or other high-resource ones. The reason lies in the unequal depth of training and alignment: safety fine-tuning and moderation data exist mostly for English, so the guardrails that prevent toxic or biased outputs in one language simply fail to generalize to others. In addition, moderation systems themselves are often designed for English-like morphology and miss toxic phrasing in languages with richer or more complex structures. The result is a troubling paradox — the very languages most underrepresented in AI are also the ones most exposed to harm when they are finally included.

When AI fails to function well in under-resourced tongues, the consequences go far beyond convenience and safety—they strike at identity. Many speakers of low-resource languages already live in digital invisibility, and younger generations may see no reason to learn a language that no app, chatbot, or search engine understands. For languages with rich literary and artistic traditions—like Arabic, Persian, or many Indigenous and African languages—this effect is magnified: if people turn online for music, poetry, storytelling, or cultural content, and those forms aren’t supported in their language, the culture itself becomes harder to access and perpetuate. Languages that seem invisible to AI risk being abandoned—and with them, the songs, poems, and narratives that give them life.

The onus is on AI developers to listen to the actual needs of speakers, not just to generalize “big-tech” solutions for everyone. True linguistic inclusion requires more than adding translation features or tokenizing more languages—it means engaging with the communities who speak them. Speakers know best how their languages live and change, how they are taught, sung, and used in everyday life. Without their participation, even well-intentioned AI efforts risk reproducing the same hierarchies they claim to overcome. Building models that genuinely support linguistic diversity means designing with context, not just for convenience—acknowledging that a language is not only a system of words but a vessel of culture, memory, and identity. Only by centering real voices, not abstract datasets, can AI become a tool that strengthens rather than erases the world’s linguistic richness.

The Future of Our Digital Heritage?

Human Heritage in the Digital Era: A Fragile Legacy

In our rapidly advancing digital age, the concept of heritage—the legacy of physical artifacts and intangible attributes of a group or society passed down from previous generations—has taken on new dimensions. Traditionally, heritage encompasses tangible elements like monuments and artifacts, as well as intangible aspects such as traditions, languages, and knowledge. These elements serve as vital links to our past, shaping our identities and guiding our future.

Why Do We Produce Heritage?

Heritage production is an intrinsic human endeavor. It serves as a repository of collective memory, a means to understand our origins, and a foundation upon which we build our identities. By preserving stories, customs, and knowledge, we ensure that future generations have access to the wisdom and experiences of the past.

The Role of Heritage in Human Development

Heritage plays a pivotal role in the evolution of humankind. It fosters social cohesion, provides a sense of belonging, and contributes to economic development. Engagement with cultural heritage has been linked to improved mental health and well-being, highlighting its importance beyond mere historical interest.

The Heritage We Produce Today

In contemporary society, much of our heritage is being created and stored digitally. From social media interactions to digital art, scientific research, and governmental records, our collective memory is increasingly housed in digital formats. This shift presents both opportunities and challenges for preservation.

The Invisible Backbone of Modern Society

Our digital infrastructure is the backbone of modern society. We rely on it for communication, commerce, education, and even the production of food and warmth. Yet this infrastructure is largely invisible and often taken for granted. For our very survival, we rely—often without question—on invisible infrastructures, whether too distant to see or entirely digital. The knowledge and systems that underpin our daily lives are stored digitally, making them susceptible to various risks.

The Fragility of Digital Knowledge

Despite the conveniences of digital storage, the knowledge we depend on is at risk of disappearing. Factors as simple as the introduction of new data formats, hardware obsolescence, or cyber-attacks—and even less probable, yet still possible, events like natural disasters—can render digital information inaccessible. While measures like backups and security systems are in place, they are not foolproof. The challenges of digital preservation include data loss, file format obsolescence, the fragility of storage media, rapid technological evolution, and a lack of funding.

This leads us to a surprisingly simple question: what has a greater chance of being discovered in 1,000 years—a meticulously stored digital dataset from a decade-long, two-million-euro research project, or a little girl’s diary?

Event Report: Exploratory Visualizations of Cultural Heritage. Introduction and Hands-on.

Cultural heritage is much more than mere objects placed side by side for isolated viewing in a display case or their digital counterparts presented together in online collections. Their value largely lies in something not directly visible: their relations to each other, mutual influences, tangled origins, and parallel developments. In short, in the stories they tell. The question is, how can one offer viewers adequate access to this history? How does one visualize what cannot be directly seen? The workshop “Explorative Visualisierungen von Kulturgut. Einführung und Hands-on” (Exploratory Visualizations of Cultural Heritage), held on January 14 at the SUB Hamburg, addressed precisely this question.

Source: Sabine de Günther

But first, let’s take a few steps back. Online catalogs as such are already a small revolution. They have enabled cultural participation for all by making countless collections permanently accessible to a wide audience. However, it is the method of isolated object representation that is increasingly subject to criticism. The lack of a sense of scale, the loss of relations between objects, limited contextualization, and the frequent absence of narrative are some of the points that can impair the interpretation of cultural heritage.

But the good news is that many are already trying to do better. One example was the British Museum with its “Museum of the World” (via Wayback Machine), an animated timeline (developed in collaboration with Weir+Wong and with technological support from the Google Cultural Institute) that allowed interactive exploration of the collections along the parameters of geography and time. Another example is the Museum of Modern Art in New York with its virtual exhibition on abstract art, “Inventing Abstraction 1910-1925,” where relationships between individual artists were made tangible through interactive networks. Not only museums but also research institutes are working on innovative solutions. These include the Fachhochschule Potsdam and the “VIKUS Viewer”, a web-based visualization tool developed there, which arranges cultural artifacts on a dynamic canvas and supports the exploration of thematic and temporal patterns in large collections. What connects these approaches is the attempt to unlock cultural heritage in its continuity and interconnectedness and to let viewers explore it freely as a whole.

This innovative approach is also followed by the project “Restaging Fashion” (ReFa), dedicated to the cultural history of clothing, in which the workshop speaker Dr. Sabine de Günther participates as a research associate. In the project, vestimentary sources are presented in a graph-based visualization that combines narration with exploration, aiming to offer the audience a guided collection entry as well as a freely explorable collection view along the thematic connections.

Illustration from ReFa

Building on the experiences gathered in the project, Dr. de Günther guided the workshop participants through the process of creating visualizations, from conveying the basic principles and analyzing existing examples to the collaborative design of their own visualizations. Hundreds of cards with images of historical clothing from the project’s collection were spread out on the tables, a materialized dataset on which the participants could unleash their visualization creativity. Unlike other workshops in the “Digital Humanities – Wie geht das?” series, the tasks were tackled in analog form, without computers, as the focus was on understanding the logic of the visualization process and on developing creative approaches to visualization. Participants were, for example, asked to collage a mock-up visualization of the aspect of the collection that interested them most, or to find their own unique approach to the collection and make it visually understandable for others.

Source: Sabine de Günther
Source: Sabine de Günther

The impressive variety of ideas and approaches clearly demonstrated that there is much potential in unconventional data representation. How can one depict the temporal development of male headwear without losing sight of regional differences? How can one tell the story of a piece of jewelry that appears in several images simultaneously? Or more generally, how does one present a dataset to allow the viewer to freely explore the network of information according to their interests? Even if the answer seems complex, the workshop showed the variety of creative solutions that exploratory visualization of cultural heritage offers.

This contribution was simultaneously published in German on the blog “DH³ – Digital Humanities in der Hansestadt Hamburg” operated by the Referat für Digitale Forschungsdienste of the Staats- und Universitätsbibliothek (SUB) Hamburg.

Workshop: How to Preserve Diverse Data in a Monolingual Environment: Introducing the Project Closing the Gap in Non-Latin-Script Data (14.02)

Wednesday, February 14, 9:30 – 12:30
at HG154 (Vortragsraum), VMP3

Registration: [email protected]

In our era of vast technological developments, digital methods have unlocked a broad spectrum of new research possibilities, not only in the natural sciences but also in the social sciences and the humanities. Digital preservation, new tools for distant reading, and quantitative text analysis have revolutionized knowledge extraction from texts. However, as these fields are largely dominated by the Global North, research involving materials in languages from beyond that sphere often faces limitations that hinder the utilization of novel technologies.

The workshop “How to Preserve Diverse Data in a Monolingual Environment,” to be held on February 14 at the Staats- und Universitätsbibliothek Hamburg Carl von Ossietzky, is part of an initiative to address this asymmetry. The research project Closing the Gap in Non-Latin-Script Data (based at the Freie Universität Berlin), in cooperation with the Referat für Digitale Forschungsdienste at the SUB Hamburg, has been conducting a survey and analysis of the field of Digital Humanities with a focus on low-resource and non-Latin-script (NLS) languages. The aim is to identify technical and structural limitations that may arise across various stages of projects working with such languages, particularly in terms of data analysis and sustainable data preservation. Furthermore, Closing the Gap strives to set an example for multilingual DH research aligned with FAIR principles, offering its workflows and solutions as guidelines for the community.

The goal of this workshop is twofold. First, members of the Closing the Gap team will present some of the data that the project has collected and the workflows that have been developed, as well as preliminary insights from this research—thereby providing an overview of challenges that are commonly faced in multilingual DH. Second, the workshop is intended to create a space for open discussion and exchange of ideas among DH practitioners, librarians, and others who are interested in improving the conditions for working with NLS textual data.

Teaching multilingual DH?

With the increasing prevalence of algorithms and artificial intelligence in the transfer of information and knowledge, proficiency in digital research methods has become an indispensable skill for most natural and social scientists. The humanities, too, are increasingly recognizing the importance of keeping up with the possibilities offered by digital research and technological developments in general, as demonstrated by the emergence and growing popularity of the Digital Humanities. Yet here arises the crucial question: how can data literacy be enhanced among both humanities students and researchers? This question is even more pressing in the field of Multilingual Digital Humanities, which, as a subfield of the humanities more broadly, struggles not only with insufficient institutional support and a shortage of resources but also with the technical challenges posed by the multiscriptuality and variety of its target languages, many of which have only limited digital resources or research tools available.

But there is no need to be too pessimistic! The Multilingual DH community is actively working on strategies to address these problems, and with the emergence of DH-related courses in Area Studies at some universities, there’s hope on the horizon. A great opportunity for exchange and discussion, the prerequisites of any progress, arose in the workshop “Digital Literacy in der multilingualen und -skriptualen Lehre” (Digital Literacy in Multilingual and Multiscriptual Teaching), organized by the Staats- und Universitätsbibliothek Hamburg in cooperation with the DHd AG Multilingual DH, the Universitätsbibliothek and the Institute of Arabic Studies at the Freie Universität Berlin, as well as the Philipps-Universität Marburg. The workshop took place on May 8, 2023, and brought together 14 participants from German, Austrian, and Swiss universities, who were tasked with creating a position paper that outlined the current status of digital training in the multilingual humanities, identified the factors hindering its progress, and proposed concrete solutions. The diversity of the workshop participants, including researchers, librarians, and students, allowed for a multifaceted exploration of the topic, ensuring that the needs and experiences of different academic groups were thoroughly considered.

Workshop Digital Literacy in der multilingualen und -skriptualen Lehre Source: M. Xenia Kudela

The workshop addressed a range of issues, from formal questions related to the integration of digital competencies into curricula and teaching infrastructures to practical considerations such as teaching methods, the accessibility of teaching resources, and the formulation of key competencies. The most discussed topics, however, were twofold: first, the problem of limited digital expertise among humanities researchers, which results in a dependence on a few individuals with these skills for DH training and research; and second, the challenges specific to Multilingual DH, such as limited data availability for low-resource languages or the insufficient functionality of existing programs for non-European datasets. These challenges can make it difficult, and at times even impossible, to integrate digital methods into specific fields of study, even when students are already familiar with general DH practices.

Source: M. Xenia Kudela

Many potential solutions were proposed. Organizing online co-teaching and cross-university teach-ins was seen as a way to encourage cooperation between universities and to push back against institute-exclusive expertise. Switching to formats better suited to DH requirements, such as block seminars and hackathons, inviting technical support assistants to classes, establishing DH help desks, enhancing cooperation with libraries and DH centers, and offering tool training for teachers are just a few of the many ideas the group came up with. In the context of multilingualism and DH in Area Studies, a strong emphasis was placed on collaboration and exchange with universities and researchers from the respective countries. Generally speaking, developing a solid Multilingual DH network within and between institutes was seen as the point of departure for any change and improvement, and the workshop was meant as an initiative to establish such a community. Besides finalizing the position paper, the participants will continue to work together to set up a platform that will serve as a knowledge hub for Multilingual DH, providing living handbooks and opportunities for exchange and collaboration.

As mentioned before, DH is fortunately already making its way into teaching curricula. To name some examples: the University of Hamburg, with its Cluster of Excellence “Understanding Written Artefacts,” has become an important hub for DH in multilingual and cross-historical contexts, making training more accessible for students. Since 2021, the Institute for Africa-Asia Studies has been offering a DH course covering an introduction to a range of digital tools and techniques, from programming and data collection to text analysis, mapping, and social network analysis. Another center for Multilingual DH is the Institute for Arabic Studies at Freie Universität Berlin, which hosts a long-term project on the digital edition of Kalila wa-Dimna called “AnonymClassic”. Here, students are provided with research-based DH training that directly benefits from the knowledge and expertise gained in the course of the project. In the summer semester of 2023, the researchers of “AnonymClassic” offered a course on “Scholarly Text Editing in Arabic,” which gave students a theoretical introduction to critical and digital editions as well as practical skills such as XML and TEI, all with reference to the edition of Kalila wa-Dimna.

Overall, for a humanist, learning and teaching quantitative and technology-based research methods can be very challenging, and doing so in a multilingual context at least doubles the challenge. However, it also doubles the fun and the satisfaction of getting better! Combined with collaboration and a supportive community, transparency and open communication about one’s struggles can be a powerful motivator and a source of empowerment, particularly for students, who may often be intimidated by the intricacy of quantitative and digital methods. Knowing that their teachers are also walking the path of trial and error can only encourage them to try it out themselves.