
Generative AI and Low-Resource Languages: The Quest for Linguistic Equity

As an expat working in an international environment, I’m used to conducting nearly all my daily and professional tasks in a language that isn’t my own. As a linguist, I find that both fascinating and rewarding, yet I also find it unsettling how this one particular language—English—has come to dominate almost every part of my life, distancing me from other languages and cultures that shaped me or that I’ve grown to love. This very blog post, the sources I consulted and the questions I googled or asked ChatGPT while writing it, and even my private searches for recipes or workout videos all happen in English. Most of the content I consume online is in English, even though I don’t live in an English-speaking country. This quiet dominance reflects something larger than personal habit: the global linguistic hegemony of English, which mirrors deep social, political, and economic inequities. It privileges those of us fluent in it while excluding countless others—even entire communities—from access to knowledge, participation in technology, and influence in the conversations that define our shared future. With the rapid spread of generative AI into nearly every aspect of daily life, this imbalance is amplified even further.

Even though there are over 7,000 languages spoken worldwide, the internet is written primarily in English and a small group of other, mostly European or Asian, languages such as Spanish and Mandarin. Since generative AI tools are mostly trained on internet data—websites, books, Wikipedia, news articles, forums, code repositories, and social media—access to these tools may be limited for individuals or communities who are not fluent in these languages. To make matters worse, this limited access gives speakers of under-resourced languages a smaller digital footprint, making it less likely that their languages are included in web-scraped training data and creating a downward spiral: without sufficient data to train usable language-based systems, most of the world’s AI applications will under-represent billions of people, further deepening existing economic and political inequities.

Of all the languages spoken worldwide, only about twenty are considered “high-resource.” The rest—practically all of the world’s languages—are “low-resource.” This term refers not only to the limited amount of textual data available to train language-based systems effectively, but also to the lack of computational infrastructure needed to model and process these languages: things like keyboards, Unicode support, or even basic digital tools. It also includes the absence of researchers with relevant expertise and the lack of financial or political support for institutions researching and modeling these languages. Together, these factors pose a major challenge for building models that work well with low-resource languages and often lead to significantly poorer performance for these languages.

Large-scale evaluations of generative AI reveal a striking imbalance in how well these systems handle the world’s languages. In the MEGA benchmark (Ahuja et al., 2023), which tested GPT-3.5, GPT-4, and other multilingual models across seventy languages, English and a few other high-resource languages consistently outperformed the rest by a wide margin. GPT-4, for instance, achieved over 96 percent accuracy on an English reasoning task but dropped to around 77 percent for Burmese, while similar disparities appeared for Tamil, Haitian Creole, and other low-resource or non-Latin-script languages. Even when prompts and translation strategies were adjusted, the gap persisted—showing how the dominance of English in training data continues to shape the very boundaries of what AI can understand. This inequity becomes even more pronounced within a single language family. In the SADID evaluation dataset (Abid, 2020), translation models handled Modern Standard Arabic—the formal, well-documented variety—far better than the spoken dialects of Egyptian or Levantine Arabic, where translation quality fell by more than half. These results expose a deeper layer of inequality: not only do low-resource languages struggle to be recognized by AI, but even within a shared linguistic tradition, the everyday voices of millions are systematically left out.

The results of this inequity, also known as the digital language divide, go far deeper than making it harder for speakers of Indigenous or low-resource languages to navigate the internet as chatbots, translation devices, and voice assistants become a crucial way to do so. It has a direct impact on the quality, reliability, safety, and even pricing of generated content. Since LLMs learn their values from their training data, this data should be carefully selected, filtered, and curated to make a chatbot helpful, human-sounding, and neither racist nor sexist. Unfortunately, the texts readily available in low-resource languages are often of poor quality, badly translated, or of limited use. For years, the main sources of text for many low-resource languages in Africa were translations of the Bible or missionary websites, such as those of Jehovah’s Witnesses. While these texts are of great historical and linguistic value, they are not the most useful base for an application that should help you tutor your child, draft work memos, summarize books, conduct research, manage a calendar, book a vacation, fill out tax forms, surf the web, and so on.

Another issue lies in a vicious circle: models trained on flawed data produce content of questionable quality, which is then used to train them further, resulting in ever worse, unreliable, or even unintelligible output. Since most websites make money through advertisements and subscriptions, which rely on attracting clicks and attention, an enormous portion of the web consists of content with limited literary or informational merit—an endless ocean of junk that exists only because it might be clicked on. For wider, profit-driven outreach, this (poor) content is very often (poorly) machine translated into multiple languages by freely available AI programs of questionable accuracy. This same content is then scraped by AI developers to train their models further. Since there is still a lot of high-quality data available for high-resource languages (especially for English, given that roughly half of all websites are written in it), this problem is not as pronounced as in the case of low-resource languages, where data is generally scarce.

Another issue is that both the pricing of models and their usage limitations are based on the linguistic features of English. While ChatGPT is free to use, many other LLM services charge users according to the number of tokens. For morphologically rich languages such as Armenian or Burmese, which require more tokens than English to express the same meaning, the price of text generation will be much higher. The same applies to limitations on the length of prompts or responses. Due to prompt limits, some tasks that require more elaborate instructions might be impossible in languages like Malayalam, whose token usage is 15.69 times higher than that of English. GPT-3, for instance, allows only up to 4,000 tokens for the combined prompt and response. That token count might correspond to a short tweet in one language but a medium-sized blog post in another.
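
This token inflation is easy to verify empirically. Below is a minimal sketch using the openly available tiktoken tokenizer; the sample sentences are illustrative placeholders (any parallel sentences will do), and sentences in non-Latin scripts or morphologically rich languages will typically yield far higher counts:

```python
# Minimal sketch: compare how many tokens the same message costs in
# different languages. The tokenizer name is real; the sample sentences
# are placeholders chosen only for illustration.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer family used by GPT-4-era models

samples = {
    "English": "How do I fill out this tax form?",
    "German": "Wie fülle ich dieses Steuerformular aus?",
    # Add parallel sentences in Malayalam, Burmese, Armenian, etc. here;
    # their token counts tend to be several times the English count.
}

for language, sentence in samples.items():
    tokens = enc.encode(sentence)
    print(f"{language}: {len(sentence)} characters -> {len(tokens)} tokens")
```

Since API pricing and context limits are both denominated in tokens, a ratio like Malayalam’s 15.69x translates directly into a 15.69x higher bill and a correspondingly shorter effective prompt window.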

Recent research also shows that large language models are not only less capable in low-resource languages — they are also less safe. Many studies (Yong et al., 2023; Deng et al., 2023; Shen et al., 2024) reveal that models like GPT-4 are three times more likely to produce harmful or unsafe content when prompted in low-resource languages than in English or other high-resource ones. The reason lies in the unequal depth of training and alignment: safety fine-tuning and moderation data exist mostly for English, so the guardrails that prevent toxic or biased outputs in one language simply fail to generalize to others. In addition, moderation systems themselves are often designed for English-like morphology and miss toxic phrasing in languages with richer or more complex structures. The result is a troubling paradox — the very languages most underrepresented in AI are also the ones most exposed to harm when they are finally included.

When AI fails to function well in under-resourced tongues, the consequences go far beyond convenience and safety—they strike at identity. Many speakers of low-visibility languages already live in digital invisibility, and younger generations may see no reason to learn a language that no app, chatbot, or search engine understands. For languages with rich literary and artistic traditions—like Arabic, Persian, or many Indigenous and African languages—this effect is magnified: if people turn to the internet for music, poetry, storytelling, or cultural content, and those forms aren’t supported in their language, the culture itself becomes harder to access and perpetuate. Languages that seem invisible to AI risk being abandoned—and with them, the songs, poems, and narratives that give them life.

The onus is on AI developers to listen to the actual needs of speakers, not just to generalize “big-tech” solutions for everyone. True linguistic inclusion requires more than adding translation features or tokenizing more languages—it means engaging with the communities who speak them. Speakers know best how their languages live and change, how they are taught, sung, and used in everyday life. Without their participation, even well-intentioned AI efforts risk reproducing the same hierarchies they claim to overcome. Building models that genuinely support linguistic diversity means designing with context, not just for convenience—acknowledging that a language is not only a system of words but a vessel of culture, memory, and identity. Only by centering real voices, not abstract datasets, can AI become a tool that strengthens rather than erases the world’s linguistic richness.

People Behind the Interface: Sustainability as a Social Process in DH


People Sustain Projects—But Who Sustains the People?

In Digital Humanities (DH), sustainability is often discussed in terms of infrastructure: servers, standards, and repositories. But as I began to explore what sustains—or quietly erodes—a DH project over time, it became increasingly clear that infrastructure alone is not enough. Projects are not only built on servers, schemas, and software, but also on people: their communication, their turnover, their sense of purpose, and their evolving relationships with one another and with the work itself.

Recent scholarship, such as Claire Battershill’s The Stories We Tell, invites us to see DH projects as narrative and emotional spaces—held together not just by code and metadata, but by human intention and care. Taking inspiration from this perspective, I was eager to explore sustainability not as a technical checklist, but as a lived experience within a team.

To ground this exploration, I turned to a long-running DH project that combines rigorous philology with innovative digital practice. I spoke with three members of the team—each at a different career stage—to hear how they navigate continuity, change, and the question of what remains once the funding ends.

The following sections share insights from these three vantage points—not as isolated anecdotes, but as interdependent reflections on what it takes to sustain a project not only technically, but relationally.

Insights from the Principal Investigator: Leading for Longevity

The interview with the Principal Investigator (PI) illuminated how the sustainability of a Digital Humanities (DH) project is shaped not only by technical foresight, but also by leadership choices, institutional negotiation, and the evolving social fabric of a research team. From the outset, the project was grounded in a critical response to the underrepresentation of non-Western texts—particularly Arabic wisdom literature—within the prevailing frameworks of world literature. What began as a curricular intervention gradually developed into a long-term DH initiative, made possible by the collaborative funding structures of the German academic landscape.

While the PI brought a strong scholarly vision to the project, the transition into team leadership was initially marked by uncertainty. “I was scared,” they admitted. “I had never led a team before. I was learning on the job.” Over time, however, a leadership model emerged that prioritized intellectual trust and distributed agency. By deliberately avoiding micromanagement, the PI fostered an environment in which team members could internalize the project’s goals and develop their own methodological approaches. Weekly one-on-one meetings—often unstructured and occasionally lasting several hours—provided a consistent framework for communication, mentoring, and mutual learning.

Despite this intentional structure, staff turnover presented recurring challenges. The departure of key team members—whether for academic appointments or professional advancement—was described not only as a logistical concern but also as a form of emotional and epistemic loss. Individuals carried with them deeply embodied forms of project knowledge that could not easily be documented or replaced. The PI acknowledged the uniqueness of each team member’s contributions, noting, “There’s no one like anyone else. You can’t replace a person, only reconfigure the team.”

In response, the PI adopted a practice of promoting from within. Junior researchers were provided opportunities to assume greater responsibility, and in doing so, many exceeded expectations. What initially emerged from necessity evolved into a strategic approach that balanced continuity with capacity building. This form of internal promotion addressed immediate gaps and supported professional development across career stages—an outcome the PI framed as both intellectually and ethically valuable.

Yet human dynamics were only one axis of sustainability. The PI’s concern extended equally to the preservation of digital outputs—particularly in relation to non-Latin scripts and complex textual traditions. Standard editing models, such as TEI/XML, were deemed ill-suited to the structure of Arabic manuscripts. The team therefore opted to develop custom tools that responded more intuitively to the demands of right-to-left script encoding and fluid textuality. While such decisions enhanced usability and philological rigour during the project, they also introduced new risks regarding long-term interoperability and institutional adoption.

“I knew from the beginning,” the PI explained, “you don’t work on an 8th-century tradition and let it disappear after ten years.”

Although sustainability planning was not explicitly required by the funding agency at the time of application, the PI took proactive steps to secure the long-term viability of the project’s outputs. These included negotiating with the university for post-project hosting, advocating for the integration of the edition into broader research infrastructure, and insisting on detailed internal documentation—ranging from GitHub repositories to graduate theses outlining software architecture.

Nevertheless, structural limitations persist. Institutional uncertainty, shifting IT policies, and ongoing budget constraints complicate efforts to formalize preservation pathways. The PI described this situation with a tempered sense of realism: “It’s an uphill battle. But it’s one worth fighting.”

Perhaps most notably, the interview underscored a broad and inclusive definition of sustainability—one that encompasses intellectual, technical, and relational dimensions. The project has cultivated a diverse team in terms of disciplinary background, gender, and academic rank. Credit is shared generously, including co-authorships with student assistants. Mentorship is embedded in daily routines, and collaboration is understood as a mutual investment in both knowledge production and professional growth.

In this sense, the project has come to function not only as a research endeavour but also as a sustained community of practice. Its continuity does not rest solely on software or servers, but on the relationships, values, and adaptive strategies that allow it to evolve in response to change. 

Insights from the Research Associate: Carrying the Technical Legacy

The RA’s journey offers an on-the-ground view of how knowledge transfer, role expansion, and emotional investment intersect to shape project sustainability—often in invisible but critical ways.

Originally hired as a student assistant to convert TEI-encoded XML files and upload them to the platform, they eventually developed a deep familiarity with the technical infrastructure of the database. When a key collaborator—their closest programming partner—left the team, they faced a sudden and overwhelming shift in responsibility. Much of the work had not been documented, which necessitated intense daily meetings to transfer knowledge. While this gave the RA greater control and understanding of the project, it also diverted them from their regular tasks and concentrated essential knowledge in a single person: themselves.

The weight of this experience was not just technical but psychological. While the RA valued the learning process and grew into their expanded role, they now worry about what will happen when they, too, eventually leave. The project’s workflows have grown increasingly complex, making it difficult to break tasks into smaller, trainable components for newcomers. As they explained, “it was also very hard for me to create tasks for a new programmer because it’s now so interconnected.”

The RA’s reflections also touched on the broader infrastructure of digital sustainability. They noted that while the team had successfully published parts of the edition on the University server, true longevity would require a long-term maintenance strategy—something they felt was missing. Without clear institutional plans for post-project preservation, and without standardized workflows across DH projects, the future usability of the data remains uncertain.

Yet, despite these systemic limits, their narrative remains grounded in pragmatic optimism. The RA is currently mentoring a new student on programming tasks, trying to rebuild a more shareable, modular workflow. Their reflections call for a deeper institutional responsibility: “Maybe the university should provide tools or standardized ways to make these projects more sustainable—not just leave it to researchers who have to reinvent everything.”

Insights from the Student Assistant: Fragmentation and Emotional Distance

The perspective of the student assistant (SA) offers a valuable lens into how early-career contributors engage with and interpret sustainability from their unique position within a Digital Humanities project. Over the course of their three years on the team, their responsibilities centred around preparing and uploading XML files to the platform, based on the team’s transcriptions and segmentations of Arabic manuscripts. These contributions, though often seen as peripheral, play a crucial role in the overall functionality and accessibility of the digital edition.

What emerges from the SA’s experience is a sense of structured participation: while their tasks were clearly assigned and executed with autonomy, they remained distinct from the project’s overarching design or decision-making processes. Still, their account demonstrates the ways in which student assistants become deeply woven into the fabric of a project, particularly through interpersonal connections. They noted the strong bonds formed with earlier team members, and how their departure left a noticeable absence—emotionally and professionally. The early phase of their involvement was also marked by team-building activities, which facilitated a strong sense of belonging and familiarity among members. As those activities tapered off, newer team members came to play a smaller role in their day-to-day work, subtly reshaping the social dynamic.

Interestingly, while the concept of sustainability had not been explicitly discussed during team meetings that included them, the interview itself prompted them to reflect on its importance for the first time. When invited to imagine a scenario in which all project data vanished without lasting output or publication, they responded with genuine concern. The SA expressed a strong desire to see the project reach its conclusion and hoped to witness its results publicly realized—suggesting that a deeper sense of investment does exist, even if not always made visible in daily tasks.

Their reaction to the interview underscores an important insight: sustainability awareness often emerges not only from formal training or top-down directives but also through moments of dialogue, reflection, and contextualization. That they were thankful for the interview initiating such reflection illustrates the transformative potential of including all team members in discussions about a project’s long-term goals and outcomes.

Sustaining the Human Infrastructure

Across the three interviews, a clear picture emerged: the long-term sustainability of a DH project is never solely a technical or institutional issue. It is also deeply interpersonal. The PI emphasized vision, delegation, and the emotional labour of letting go; the RA described the burden of undocumented knowledge and the fragility of continuity in the face of staff turnover; the SA highlighted how task specialization and clear roles can support workflow efficiency while also expressing appreciation for a team culture that welcomes new ideas and encourages personal growth.

Taken together, their perspectives reveal a project that is held together not just by platforms or preservation plans, but by a human infrastructure—an evolving network of relationships, expectations, mentorships, and affective investments. When this infrastructure is strong and transparent, transitions become opportunities, not crises. When it is neglected, technical robustness alone cannot guarantee sustainability.

Digital Humanities often prides itself on collaboration, yet too often the emotional and structural conditions of that collaboration remain invisible. If we are serious about sustaining the outputs of our field, we must also sustain the people who produce them—through inclusive planning, reflexive leadership, and honest conversations about what will happen after the funding runs out.

In the end, digital sustainability is not simply a matter of keeping data alive. It is about making sure that the stories, skills, and people behind those data are not forgotten.

The Future of Our Digital Heritage?

Human Heritage in the Digital Era: A Fragile Legacy

In our rapidly advancing digital age, the concept of heritage—the legacy of physical artifacts and intangible attributes of a group or society passed down from previous generations—has taken on new dimensions. Traditionally, heritage encompasses tangible elements like monuments and artifacts, as well as intangible aspects such as traditions, languages, and knowledge. These elements serve as vital links to our past, shaping our identities and guiding our future.

Why Do We Produce Heritage?

Heritage production is an intrinsic human endeavor. It serves as a repository of collective memory, a means to understand our origins, and a foundation upon which we build our identities. By preserving stories, customs, and knowledge, we ensure that future generations have access to the wisdom and experiences of the past.

The Role of Heritage in Human Development

Heritage plays a pivotal role in the evolution of humankind. It fosters social cohesion, provides a sense of belonging, and contributes to economic development. Engagement with cultural heritage has been linked to improved mental health and well-being, highlighting its importance beyond mere historical interest.

The Heritage We Produce Today

In contemporary society, much of our heritage is being created and stored digitally. From social media interactions to digital art, scientific research, and governmental records, our collective memory is increasingly housed in digital formats. This shift presents both opportunities and challenges for preservation.

The Invisible Backbone of Modern Society

Our digital infrastructure functions as the invisible backbone of modern society. We rely on it for communication, commerce, education, and even the production of food and warmth. Yet this infrastructure is largely invisible and often taken for granted. For our very survival, we depend—often without question—on infrastructures we cannot see, whether too distant to observe or entirely digital. The knowledge and systems that underpin our daily lives are stored digitally, making them susceptible to various risks.

The Fragility of Digital Knowledge

Despite the conveniences of digital storage, the knowledge we depend on is at risk of disappearing. Factors as simple as the introduction of new data formats, hardware obsolescence, or cyber-attacks—and even less probable, yet still possible events like natural disasters—can render digital information inaccessible. While measures like backups and security systems are in place, they are not foolproof. The challenges of digital preservation include data loss, file format challenges, fragility of storage media, rapid technological evolution, and lack of funding.

This leads us to a surprisingly simple question: what has a greater chance of being discovered in 1,000 years—a meticulously stored digital dataset from a decade-long, two-million-euro research project, or a little girl’s diary?

Essential Digital Resources for Students of Arabic Studies

In the past, studying Arabic literature meant spending hours in archives, flipping through bulky dictionaries, and deciphering various calligraphic styles in the manuscripts. While physical books in the library remain invaluable, the rise of digital humanities has given us access to powerful tools that support research, enhance analysis, and provide deep insights into Arabic literary heritage. 

The research process in Arabic studies is not always linear—sometimes, you start with a text and then consult dictionaries, while other times, you might begin with a dictionary to clarify a word before selecting a text. Depending on your focus, you may also need computational tools for text analysis or reference works for historical and literary context. To help students navigate Arabic studies more effectively, this post introduces essential digital resources, including text repositories, dictionaries, digital analysis tools, and academic databases.

1. Text Resources (Poetry & Qur’ānic Texts) → Find and Explore Primary Sources

Before diving into linguistic details, we need access to primary sources—poetry, historical texts, and often the Qur’ān. These digital repositories make it easier to find and work with Arabic texts.

Poetry & Arabic Literary Texts

  • Al-Diwan: A comprehensive database of classical and modern Arabic poetry, allowing users to search by poet, theme, era, and country. It provides valuable insights into the evolution of Arabic poetic forms and styles, making it an essential resource for students of Arabic literature.
  • Al-Maktaba Al-Shāmila: A comprehensive digital library housing thousands of classical Arabic works, including poetry, prose, and Islamic texts. Its powerful search functionality allows users to locate specific texts, keywords, and phrases across a vast collection, making it an essential resource for academic research. The library is accessible both on computers and mobile devices, with categorised browsing options for Islamic sciences, literature, and historical sources, facilitating efficient navigation and exploration.
  • OpenITI: A digital corpus of pre-modern Arabic texts designed for computational research. It enables students to conduct large-scale text mining, linguistic analysis, and comparative studies across thousands of Arabic texts. OpenITI is especially valuable for those interested in digital humanities, allowing for the exploration of stylistic trends, textual variations, and intertextuality in Arabic literature and historical sources.

Qur’ānic Resources

Understanding and analysing the Qur’ān requires access to accurate, well-structured digital resources. Whether you are studying its linguistic features, theological interpretations, or textual variations, these platforms provide essential tools for both beginners and advanced researchers.

  • The Noble Qur’an: A widely used online platform offering translations, tafsīr, and recitations of the Qur’ān in multiple languages. This resource is beneficial for comparative studies and linguistic analysis.
  • Tanzil: A high-quality, verified digital Qur’ānic text that ensures accuracy for academic reference and software development. It provides both Uthmani and Imlaei script versions, allowing students to study different orthographic styles. Its advanced search functionality, inclusion of pause marks, and customisable diacritic options make it a highly flexible tool for Qur’ānic studies, Arabic linguistics, and digital humanities research.
  • Corpus Coranicum:  A comprehensive digital project that compiles Qur’ānic manuscripts, variant readings, and historical texts related to the Qur’ān. It features a searchable database with access to early manuscripts, transliterations, and variant readings, along with a philological commentary examining the historical development of the Qur’ānic text. 
  • Quranic Arabic Annotated Corpus: A linguistically annotated database that provides morphological and syntactic analysis of the Qur’ān, allowing users to explore grammatical structures and lexical patterns. It also features a syntactic treebank, semantic ontology, and detailed word-by-word analysis, which makes it easier to explore grammar, syntax, and Qur’anic linguistic structures.

2. Dictionaries & Lexicons → Understand Word Meanings

Once you have found a text, you may need dictionaries and lexicons to interpret complex words and phrases. These resources not only help with translation but also provide insights into historical meanings, etymology, and linguistic variations.

  • Ejtaal: A searchable collection of Arabic dictionaries, including both Hans Wehr and Lane’s Lexicon. While navigating between dictionaries can be challenging, the platform allows users to compare definitions across multiple lexicons within a single interface.
  • Lane’s Lexicon: A historical Arabic-English dictionary that provides rich etymological insights. It is particularly useful for research dedicated to classical Arabic literature and historical texts. Two versions of this lexicon are available, allowing you to choose the one that best suits your needs. (link I, link II)
  • Almaany: A modern dictionary offering contextual meanings and quick translations. Its user-friendly interface and extensive database make it an excellent choice for students working with contemporary Arabic texts.

3. Digital Humanities Tools → Computational Approaches to Text Analysis

While digital tools are not always necessary, they provide valuable insights by identifying linguistic patterns, tracking word frequency, and analysing textual structures. Computational methods enable efficient comparative analysis, visualisation of textual relationships, and deeper engagement with Arabic texts. These tools are especially useful for students exploring digital humanities, computational linguistics, and advanced text analysis techniques; a minimal hands-on example follows the list below.

  • Voyant Tools: A web-based text analysis and visualisation tool that allows users to explore word frequency, collocations, and thematic trends in Arabic texts. It provides visualisations such as word clouds, frequency graphs, and keyword-in-context analysis, supporting both quantitative and qualitative research approaches.
  • Farasa: A suite of natural language processing (NLP) tools for Arabic text analysis, developed by the Qatar Computing Research Institute (QCRI). It offers tokenization, part-of-speech tagging, named entity recognition, and sentiment analysis, making it essential for Arabic text processing and linguistic research. The tools are accessible via an online demo page or through Web API services.
  • Arabic Romanization ALA-LC: A tool that converts Arabic script into standardised Latin transliteration, ensuring accuracy and consistency in academic work that involves transliterated Arabic terms. 
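
For readers who want a sense of what such tools automate, here is a minimal sketch of the kind of word-frequency analysis Voyant performs, applied to a plain-text Arabic file (the file name is a hypothetical placeholder, and the regex tokenizer is deliberately crude; real work would use a segmenter such as Farasa):

```python
# Minimal sketch of the word-frequency analysis that tools like Voyant
# automate. "diwan.txt" is a hypothetical placeholder file.
from collections import Counter
import re

with open("diwan.txt", encoding="utf-8") as f:
    text = f.read()

# Remove vowel diacritics (U+064B..U+0652) so they do not split words,
# then keep runs of base Arabic letters (U+0621..U+064A).
text = re.sub(r"[\u064B-\u0652]", "", text)
words = re.findall(r"[\u0621-\u064A]+", text)

for word, count in Counter(words).most_common(20):
    print(f"{word}\t{count}")
```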

4. Reference Works & Academic Databases → Contextualise Research

Once a text is analysed, you may need secondary sources to support your arguments and understand historical, literary, or religious contexts.

  • Encyclopaedia of Islam II: a leading academic resource on Islamic history, culture, and linguistics, offering in-depth articles on historical figures, legal traditions, religious practices, and social structures. It provides authoritative, well-referenced information on Islamic civilization, with critical insights into both historical developments and contemporary interpretations.
  • Encyclopaedia of the Qur’ān: a comprehensive reference work covering Qur’ānic terms, concepts, personalities, place names, cultural history, and exegesis. It also includes essays on key themes, making it an essential resource for those exploring the Qur’ān’s content, historical context, and interpretative traditions.

Both Encyclopaedia of Islam II and Encyclopaedia of the Qur’ān can be accessed through the university network or via the VPN service.

  • Encyclopaedia Iranica: a comprehensive academic resource covering Iranian history, culture, literature, and its intersections with Arabic and Islamic studies. It is valuable for researching cross-cultural influences between Persian and Arabic traditions, providing in-depth articles by experts in the field on historical figures, literary movements, and cultural exchanges that shaped the region. 

Digital resources have transformed Arabic studies, making it easier to access, analyse, and contextualise Arabic texts. By integrating these tools into the workflow, you can enhance your understanding of Arabic literature, historical texts, and the Qur’ān.

Online Academic Projects

A better way of publishing

Academic publishing stands at a crossroads. The traditional system of scholarly journals, born in the age of printing presses and horse-drawn carriages, seems increasingly ill-suited to our digital world. While these venerable institutions have served scholarship well for centuries, they now represent a bottleneck in the free flow of academic knowledge. Digital projects can and should replace most traditional academic publishing, offering a more efficient, accessible, and transparent way to share scholarly work.

The Limits of Academic Publishing

The current academic publishing system is, to put it bluntly, obsolete. Academic journals emerged in 1665, with the nearly simultaneous appearance of the Journal des sçavans on January 5th and the Philosophical Transactions of the Royal Society on March 6th. These publications filled a crucial need in their time, enabling scholars to communicate when no other reliable means existed. Before telegraphs, telephones, radio, television, or the internet, printed journals were indispensable for their ability to spread knowledge across borders and continents.

It’s worth remembering that early academic publishing was often the domain of wealthy hobbyists. Printing has always been an expensive endeavor, and the costs of production and distribution created natural barriers to entry. The Royal Society itself was initially composed largely of gentleman scientists, who could afford to pursue research as a passion rather than a profession. This historical context helps explain some of the traditions and structures that persist in academic publishing today, even though the social and technological landscape has changed dramatically. Today we thoughtlessly ape the behavior of noblemen from centuries past, but without the same goals and limitations that originally motivated it.

To be clear, there are still many cases where printing makes sense. No one is suggesting that we should stop printing books. But few scholars sit down to read academic articles as they would a novel. Instead, we jump from abstract to conclusion and back, skim for specific content and references, and generally dissect the paper as a data source instead of digesting it as a prose narrative. Digital formats are almost always sufficient for this purpose and often better suited to it. Carrying around reams of paper just to have access to the few pages relevant to one’s research is impractical, while an iPad can hold an entire library’s worth of academic papers in a smaller package than even one single print journal. More importantly, the ability to search, annotate, and cross-reference digital texts makes them better tools for research and scholarship than their printed counterparts.

More troublingly, the current system takes more than it gives. Academic publishing has evolved into a $25 billion industry with profit margins around 40%—numbers that would be impressive if not for the fact that most of the labor is provided for free. These profit margins exceed those of tech giants like Apple and Google, despite the publishers contributing relatively little to the actual creation of content. Researchers write articles without compensation (though one could argue they receive indirect benefits through career advancement), peer reviewers donate their time and expertise, and academics often handle editing duties themselves.

The structure of academic publishing is particularly problematic from an economic perspective. Most purchases are made by institutions on an ongoing basis without the opportunity to choose an alternative at any given moment, creating a market with limited to non-existent competition. University libraries, the primary customers for academic publishers, often find themselves locked into expensive subscription packages that eat up increasing portions of their budgets. Competition is crucial for market efficiency, yet decision-makers at institutions lack both the tools and incentives to reduce prices. The result is a system where prices continue to rise while services remain largely unchanged.

The traditional publishing model also inadvertently promotes fraud and questionable research practices. By making prose (with some figures and tables) the main focus, it becomes easier to bury unsupported hypotheses, engage in p-hacking, and in extreme cases, manipulate data. This problem is more widespread than many realize. A concerning number of published papers contain images that show signs of manipulation, statistical analyses that don’t stand up to scrutiny, or methodologies that are impossible to replicate. Oversight, where it exists at all, is thankless and often personally risky work. Short of outright fraud, the system encourages softer forms of academic misconduct, such as CV padding through unnecessary publication splitting, selective reporting of results, and the famous “least publishable unit” approach to research. Goodhart’s law applies to academics as well: “When a measure becomes a target, it ceases to be a good measure.” Gaming the system is rampant—openly practiced by many scholars and encouraged by a volume-driven publishing industry.

The emphasis on publishing in high-impact journals has created perverse incentives that work against good science. Researchers feel pressure to produce dramatic or counterintuitive results, knowing that null findings or incremental advances are less likely to be published. This pressure can lead to corner-cutting, overstatement of results, or worse. The replication crisis in psychology and other fields can be partly attributed to these systemic pressures.

The lack of transparency in traditional academic publishing is another significant concern. Research projects naturally develop over time, with false starts, dead ends, and changes in methodology. Single publications obscure this process, presenting a sanitized version that fails to capture the true nature of scientific inquiry. This “snapshot” approach to research documentation can make it difficult for other researchers to understand the full context of the work and can hide important details about how conclusions were reached.

We should remember that current publishing practices are not synonymous with good scientific practice, though they’re often presented this way. Peer review, now considered a cornerstone of academic publishing, wasn’t the norm until the mid-20th century. The review process and other forms of quality control don’t depend on paper articles or even the traditional article format. Alternative review processes already exist and thrive in various contexts (consider the rigorous standards maintained by online communities like r/AskHistorians), or they might be allowed to rise to prominence in a new structure. For instance, partial review processes developed for interdisciplinary work allow experts from multiple domains to evaluate research as a team effort, even when no single reviewer can address the entire topic alone.

The history of peer review is itself instructive. Early scientific journals published submissions with minimal review, relying on post-publication discourse for validation. The modern peer review system arose gradually, primarily after World War II, as research became more specialized and the volume of submissions increased. This history reminds us that our current system is not the only possible approach to ensuring research quality.

Perhaps most frustratingly, the current system consumes vast amounts of grant money. Public funds support research that results in academic articles, which are then published by for-profit companies at high markups. In essence, we’re paying premium prices for services that websites can provide almost for free or with minimal advertising support. This represents a significant misallocation of resources that could be better spent on actual research, education, or public outreach.

The inefficiency extends beyond direct costs. Researchers spend countless hours reformatting papers to meet different journals’ submission requirements, navigating complex submission systems, and responding to reviews that sometimes seem more focused on formatting than content. All of this represents time that could be spent on actual research or other scholarly activities.

The Value of Digital Projects

It’s easy to complain about academic publishing, and we often do, but the best way to complain is to make things better. Digital academic projects offer a compelling alternative to traditional publishing. They can function exactly like academic papers when desired—consider ISAW Papers or, for that matter, open-access papers, which are already online digital publications. But they can also do so much more.

Digital projects can incorporate elements impossible in traditional publishing: non-sequential or higher dimensional structures for argumentation, hyperlinked text, multimedia content, interactive visualizations, and raw data repositories. Imagine a research paper where readers can interact with the data, running different analyses or exploring alternative hypotheses, in essence, replicating the results as they read about them. Or consider a historical study where primary sources are directly linked and searchable, allowing readers to verify claims and explore the material themselves.

The ability to update content continuously, while maintaining archives of previous versions, represents a fundamental improvement over traditional publishing. This approach aligns better with how research actually progresses—through iteration, refinement, and occasional correction. When errors are discovered or new data become available, digital publications can be updated quickly while maintaining a transparent record of changes. It also provides a way to “show one’s work” without cranking out small (often vacuous) papers, giving researchers more time to pursue big ideas, which naturally need time to develop.

The technical capabilities of modern web platforms far exceed the requirements for academic publishing. We can create rich, interactive experiences that enhance understanding while maintaining the rigorous standards of academic work. Digital projects can incorporate data repositories, computational notebooks, and interactive visualizations that make research more transparent and reproducible. Tools like Jupyter notebooks allow readers to not just read about methods but actually execute them, promoting reproducibility and deeper understanding.

Perhaps most importantly, digital projects can be made permanently accessible. Static webpages (of which Closing the Gap in Non-Latin-Script Data is one example) can already do everything an academic journal can do, and they require virtually no maintenance expenditure. Platforms like GitHub Pages make hosting scholarly content essentially free. This democratizes access to knowledge, allowing people from disadvantaged backgrounds or without institutional affiliations to access scholarly material.

The cost advantages of digital publishing are substantial. While there are still expenses involved in the scholarly communication process—copyediting, typesetting, server maintenance—these costs are orders of magnitude lower than traditional publishing. Moreover, many of these functions can be automated or streamlined using modern tools and workflows. (ProWritingAid is doing an excellent job of copyediting this blog post in real time as I write it.)

Digital publishing also enables new forms of scholarly communication that weren’t possible before. Researchers can share work in progress, receive feedback from a broader community, and iterate on their ideas in public. This more open approach to scholarship can lead to better research outcomes and more rapid advancement of knowledge.

The potential for integration with other digital tools is another significant advantage. Digital publications can be easily indexed by search engines, making them more discoverable. They can incorporate modern reference management tools, making citation and bibliography management more efficient. (Hello Zotero, goodbye pedantic hair splitting over citation formatting.) They can include direct links to data and code repositories and other supplementary materials.

Looking to the Future

The transition to digital academic publishing won’t happen overnight, and some forms of traditional publishing will likely persist where they make sense. However, the bulk of academic communication can and should move to digital platforms. This shift would reduce costs, increase access, improve transparency, and better serve the primary goal of academic publishing: the advancement and dissemination of knowledge. (It may even better serve the secondary goal of academic publishing: the advancement and dissemination of academics.)

Several challenges need to be addressed in this transition. We need robust methods for preserving digital content over the long term. We need new ways of evaluating the impact and quality of digital scholarship. We need to ensure that digital publications are properly credited in academic hiring and promotion decisions. However, none of these challenges are insurmountable, and many organizations are already working on solutions. (There is much more to be said about best practices in digital publication, such as the importance of FAIRness, but this is not the subject of this post.)

The academic community has already begun this transition. The rise of preprint servers like arXiv and bioRxiv, the success of open access journals, and the growing acceptance of alternative forms of scholarly communication all point to a future where digital publishing is the norm rather than the exception. The humanities have been slow to catch up, but this is also changing.

Conclusion

The time has come for digital projects to replace most traditional academic publishing. The current system, born in the age of sailing ships and hand-set type, has served its purpose but now impedes rather than advances the spread of knowledge. Digital platforms offer all the benefits of traditional publishing—peer review, permanent archives, scholarly rigor—while adding capabilities that better serve modern research needs.

This transition represents more than just a change in format. It’s an opportunity to reimagine how scholarly communication works in the digital age. We can create systems that are more open, more efficient, and more effective at spreading knowledge. We can build tools that make research more reproducible and transparent. We can make scholarship more accessible to people around the world.

The technology exists. The benefits are clear. All that remains is for the academic community to embrace this change and begin building the publishing system of the future. The sooner we make this transition, the sooner we can redirect resources from maintaining an obsolete system to advancing the frontiers of knowledge.

How to Be FAIR?

Managing and Archiving Digital Corpora – Theoretical Workflows and Practical Examples


The question of how to ensure the long-term FAIRness of our research data can never be discussed too much. The multitude of new technologies and a broad palette of tools and infrastructures make it easier to create more sustainable and interoperable solutions. However, opinions still differ significantly on which ones to implement and how exactly to do so. It is crucial to address these issues within the academic community to develop standardized solutions, and the workshop “Creating, Managing and Archiving Textual Corpora in Under-resourced Languages” was held with exactly this purpose in mind.

The workshop was conceived by the DARIAH Working Groups Research Data Management and Multilingual DH, financed by the DARIAH-EU Funding Scheme for Working Group Activities 2023-25, and hosted by the University of Hamburg from 28 to 30 August 2024. It brought together a large number of experts on multiple languages and various aspects of digital scholarship, including members of our project Closing the Gap in Non-Latin-Script Data (CtG), and resulted in the elaboration of standardized workflows for building, managing, archiving, and annotating multilingual corpora. Although the focus was on low-resource and endangered languages, these workflows are valuable to any scholar aiming to ensure the FAIRness of their work. The following paragraphs outline the key aspects of the workflows, along with practical examples from our project.

  • Data source, preparation, and format

First, check whether the research question you have in mind can be answered with data already available in digital form. If it cannot, you need to gather the materials and digitize them yourself, using methods such as OCR, HTR, or ASR. A crucial step is to check the ethical and legal considerations regarding your source materials and, if possible, select those that can be shared later, as this will enhance the reusability and impact of your research. Another key decision concerns the format of your data. Avoid proprietary file formats like MS Office DOCX. From the outset, work with open file formats (e.g., TXT, JSON, CSV), which are widely supported and accessible without requiring specific software or licenses. Finally, regardless of the specific purpose or any modifications made to your data, always retain the basic textual data and metadata for archiving and documentation purposes.
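
As a concrete illustration of the format recommendation, here is a minimal sketch of extracting plain text from a proprietary DOCX file into open TXT, using the python-docx library (the file names are hypothetical placeholders):

```python
# Minimal sketch: rescue text from a proprietary DOCX into an open TXT
# file. Requires the python-docx package; file names are placeholders.
from docx import Document

doc = Document("fieldnotes.docx")
text = "\n".join(paragraph.text for paragraph in doc.paragraphs)

with open("fieldnotes.txt", "w", encoding="utf-8") as f:
    f.write(text)
```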

  • Metadata 

Metadata is crucial for ensuring the FAIR principles because it makes data findable and accessible by clearly describing the content and context of your research. Good-quality metadata is also essential for interoperability and reusability, allowing others to understand, interpret, and reuse your data correctly. Be sure to create your metadata in a simple and consistent format, such as JSON or CSV. Ideally, it should encompass the following aspects: provenance, intellectual property, ethical issues, access and reuse (licensing), as well as structural, descriptive, and technical information.

Gathering extensive metadata is of great importance for our project. Each data entry consists of three metadata sections: 1) record metadata, containing the project-specific UUID, the name of the person who added the entry, and the dates of its creation and last modification; 2) project metadata, which includes technical and descriptive information for each added project, from basic details such as title, hosting institutions, project duration, and involved researchers, to detailed records on research objectives, methodologies, technology stack, and licensing; and 3) metadata on the relations of the project, i.e. the titles and UUIDs of related entries. This comprehensive metadata approach is crucial for ensuring that the data remains well-documented, easily searchable, and fully transparent, which facilitates long-term accessibility, reproducibility, and collaborative research.
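
To make the three-part structure concrete, here is an illustrative sketch of how such an entry could be assembled and serialized to JSON; the field names are simplified approximations for illustration, not the project’s actual schema:

```python
# Illustrative sketch of a three-part metadata entry (record / project /
# relations). Field names are simplified approximations, not CtG's schema.
import json
import uuid
from datetime import date

entry = {
    "record": {
        "uuid": str(uuid.uuid4()),
        "added_by": "Jane Researcher",            # hypothetical editor
        "created": "2024-08-28",
        "last_modified": str(date.today()),
    },
    "project": {
        "title": "Example NLS Corpus Project",    # hypothetical project
        "hosting_institution": "University of Hamburg",
        "duration": "2023-2025",
        "researchers": ["Jane Researcher"],
        "methodologies": ["OCR", "TEI encoding"],
        "license": "CC BY 4.0",
    },
    "relations": [
        {"title": "A Related Project", "uuid": str(uuid.uuid4())},
    ],
}

with open("entry.json", "w", encoding="utf-8") as f:
    json.dump(entry, f, indent=2, ensure_ascii=False)
```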

  • Documentation and versioning 

Develop comprehensive documentation (guidelines) for the entire corpus, individual texts, your annotation processes, research objectives, and the overall research project. Ensure that this document is readily accessible alongside your corpus during the archiving process, and retain a secure copy for future reference. Make sure to document all changes when updating the corpus and preserve previous versions to maintain a complete record of the evolution of your research, facilitating transparency and reproducibility.

The project implements a Git-based file database to manage and version control the corpus data. This approach ensures that all changes are tracked, providing a clear history of modifications and facilitating collaborative efforts. In addition, the project team conducts detailed documentation of all dependencies and technologies used, as well as any changes made to the dataset. The database is hosted on a public GitHub repository, promoting transparency by making all data and changes publicly accessible, which encourages community engagement. This openness not only builds trust but also allows for peer review and validation of the data and methodologies employed.

  • Standardized vocabularies

To prevent ambiguity, use controlled vocabularies and authority files connected to recognized community repositories. This approach will promote accurate analyses and enhance the clarity of your information.

The project extensively utilizes authority files to link all entities representing institutions, locations, and individuals to identifiers such as VIAF, Wikidata, GND, or Geonames. To ensure optimal searchability and future retrieval of the data, the project team has developed a taxonomy system encompassing all concepts relevant to NLS-specific research. Adhering to the principles of Open Data and Open Science, the taxonomy is grounded in existing controlled vocabularies, including the DHA Taxonomy and TaDiRAH.
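
A minimal sketch of how such linking can be bootstrapped, querying Wikidata’s public search API for candidate identifiers (the search term is an example; matches still require manual disambiguation):

```python
# Minimal sketch: fetch candidate Wikidata identifiers for an entity
# via the public MediaWiki API. The search term is an example; results
# must still be disambiguated by hand before being recorded.
import requests

params = {
    "action": "wbsearchentities",
    "search": "University of Hamburg",
    "language": "en",
    "format": "json",
}
resp = requests.get("https://www.wikidata.org/w/api.php", params=params, timeout=10)
resp.raise_for_status()

for hit in resp.json().get("search", [])[:5]:
    print(hit["id"], "|", hit.get("label"), "|", hit.get("description", ""))
```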

  • Licensing 

Make your corpus available in the most open manner possible while respecting any necessary restrictions. Keeping the FAIR principles in mind throughout the process, license your data under Creative Commons (CC), provided that your data providers permit it from legal and ethical standpoints.

Our data sources are twofold: we either gather information that is openly accessible online or directly contact researchers to obtain more detailed insights through interviews, during which we request explicit permission to share the information with the scientific community. This approach enables us to make not only all our workflows but also the entire dataset available under open access for further use on GitHub, licensed under the CC BY 4.0 license.

  • Archiving

Once you have assembled your corpus, metadata, and comprehensive documentation, it’s time to focus on the long-term archiving of your data. You can opt for certified data centers (e.g., CLARIN B-centre), data repositories affiliated with your research institution, or inter-institutional repositories like Zenodo. Institutions like CLARIN offer a robust, distributed network of 70 centers across Europe, providing not only long-term archiving but also tools to ensure the FAIR principles are applied to research data. CLARIN centers, especially those certified as B-centers, host repositories that ensure data sustainability and accessibility for future research projects. Be aware that some institutional repositories may have specific format requirements, which could necessitate migrating your dataset. In such a case, again, steer clear of migrating it into proprietary file formats. Always ensure that you deposit the most recent version of your data and documentation.

Hosting the data on GitHub and employing a Git-based management system offers a lightweight solution that does not require extensive infrastructure or resources. This makes it an ideal choice for projects with limited funding or technical support. The sustainability of this approach is ensured through the use of widely adopted tools and platforms, which are likely to remain supported and updated in the long term. Additionally, regular snapshots via the Web Archive, releases, and backups to Zenodo, along with the decentralized nature of Git repositories, further enhance the reliability and durability of the archived data.

CtG provides an example of radical FAIRness and openness, which, in the case of projects working with more sensitive data, might not be entirely possible. Additionally, the lightweight structure of a file-based database may pose challenges for projects with more complex, relational data. It is therefore important that each project develops, from the outset, a detailed data management plan that maximizes openness and sustainability given the nature of its data.

There are many initiatives and organizations that support researchers in developing FAIR data management strategies. One such resource is the SSH Open Marketplace, a platform where researchers can access and share workflows, as well as create and customize their own workflows for specific research projects. This platform enhances the discoverability and contextualization of research tools, datasets, and workflows, fostering collaboration and knowledge-sharing within the digital humanities community. Researchers are furthermore encouraged to license their images using CC BY, which allows free use with creator credit, or CC0, which places the work in the public domain for unrestricted use without attribution.

Another valuable resource is the DARIAH Transformations journal, which emphasizes the documentation of methodological and research activities in the arts and humanities. The journal provides a platform for detailed documentation of data gathering, processing, and annotation, ensuring transparency and comprehensive record-keeping. It also requires structured metadata to support proper archiving and reusability of data, and its overlay model ensures that research and accompanying documentation are immediately accessible through open repositories, enhancing the reliability and availability of scholarly work.

More detailed information about the workflows for building, managing, and archiving multilingual corpora, as elaborated during the workshop, can be found here, here, and here.

Optimizing Research Workflows for Students: A Software-Based Approach

Note: This post is primarily directed at students. It is based on a presentation I made recently for the students of Dr. Theodore Beers at his digital humanities seminar, Digital Philological Methods, during the 2024 Summer semester.

___

Preliminary Notes

Optimizing research workflows through efficient software use influences the kind of research questions one decides to work on. For example, one might choose between analyzing the thematic configuration of a single prophetic ḥadīth or tracing the development of a specific theme across a large corpus of ḥadīth material. The former would involve a close reading of the text, perhaps with pencil and paper, yielding results that, though relevant in their own way, can hardly offer generalizations. The latter would cover a larger corpus, reveal thematic trends and their evolution (either as a function of time or genre), and offer a path toward more concrete general statements backed up by a large body of evidence.

Embarking on the second research question—in addition to requiring extensive use of specialized software—entails a structured, organized approach to research and analysis. Such structure cannot afford to be bogged down by inefficient research practices, which encompass various crucial but potentially mundane and cumbersome tasks, including:

  • Building research bibliographies and conducting literature reviews
  • Taking and managing notes 
  • Inserting correct citations, tracking them, and updating them according to various criteria
  • Compiling a correct final bibliography
  • Consulting one’s own research in the future

If not managed properly, these tasks accrue “research debt.”[1] These are unaddressed aspects of the research process that accumulate over time and become cumbersome to address. This builds resentment and affects quality down the line. The degree of resentment and the amount of chaos a poorly managed system can introduce are a function of the size of the research question—and of bad habits! Students who would like to embark on ambitious projects should address this head-on and early in their careers.

This post aims to summarize key aspects of developing good research practices and communicate general ideas about software-optimized research.

These tips are based on my experience. While many of the points mentioned below are common knowledge to experienced researchers, they may be unfamiliar to students, especially those studying in the so-called “Global South”. In these regions, formal instruction on proper research methods is often lacking. I personally faced this challenge and had to teach myself. If students are to have a smooth academic journey and if they are to pursue academia as a career choice, they need to optimize their workflows.

Optimizing Research Workflows

Optimizing one’s research workflows ensures that one frees up important mental bandwidth for more important tasks, without being bogged down by inefficient collateral tasks that waste time and build resentment.

Efficient research requires automating and managing various aspects of the process, including note-taking and citation management. This can be achieved using various tools that can be adopted and incorporated into one’s routine quite easily. Without a well-implemented, reproducible system, research turns into mundane manual labor rather than fulfilling mental work, and the quality of the research suffers. To avoid this, students may keep in mind the following principles:

  • All processes can be optimized for efficiency.
  • One can always save time and focus on what matters without losing control or quality.
  • Simple, reproducible systems are desired.
  • A simple system is designed to automate repetitive tasks and reduce the chaos they introduce. The system itself should not be complex or difficult to use.
  • A simple system is one you set up once and use repeatedly. 
  • A habit of creating systems and workflows is important to ensure quality and keep away resentment—in itself, this process is also fun and fulfilling.

Once a system is in place, students can minimize the number of tasks they need to actively manage. This allows them to iterate more quickly on ideas and their implementations, follow multiple threads, and catch more errors in their thinking. As a result, they can correct mistakes more easily and write more compellingly and confidently. The upfront time investment in designing such a system and building the correct habits to implement it is always justified in the long term.

Below is a brief proposal for a system that addresses some of the collateral tasks mentioned above.

Essential Tools for Efficient Research: A Simple System

By way of example, and to help students get started with optimizing their research workflows, I propose a toolkit of software that can easily be incorporated into their work. In my experience interacting with fellow students over the years, few (if any) had a system in place to help them with their writing projects.

Software Toolkit:

Here I include only open-source and freely available software, focusing on reference management and note-taking. Zotero, for instance, offers a great solution for managing bibliographies and inserting citations into drafts. As a reference manager, Zotero comes with a browser integration (in the form of an extension that supports many browsers), which allows the user to easily add items to their library.

With time, Zotero evolves into a personal library. It is important, however, to develop a habit of reviewing every bibliographic entry as soon as it is imported into Zotero, because uncaught errors accumulate over time, and any neglect in addressing them would defeat the purpose of implementing such a system. A correct bibliography serves the student well in their writing, in conjunction with the official Zotero integration for MS Word or Google Docs, which allows for quick insertion and updating of citations in different styles. Furthermore, Zotero doubles as a PDF reader and a note repository, helping students manage not only references but also the associated PDFs and notes. The notes can be inserted into one’s writing just as citations are, and they need to be properly managed as well. The Better Notes Add-on can help with that.

Note-Taking 

Note-taking is a critical part of the thinking process. Richard Feynman once made sure to correct his interviewer, who characterized a set of notebooks as a “record of the day-to-day work,” responding: “I actually did the work on the paper.” According to Feynman, the notes are “not a record, not really, it’s working. You have to work on paper and this is the paper. OK?”[2]

The above highlights the quality of writing—or in our case, note-taking—as an integral aspect of the thought process itself; notes are not a record of work already done—they themselves are the work. Therefore, organizing those notes ensures organized thinking and efficient retrieval of information when writing a first draft, and particularly when revising and editing the final draft. 

There are various ways one can optimize their note-taking habits, such as the Zettelkasten method. Obsidian can be used to digitally implement some of the principles Niklas Luhmann popularized. One key principle is focusing on one idea, thought, argument, or example per note. Obsidian’s linking capabilities can be leveraged to create (using tags and backlinks) a network of interconnected notes and visualize them through the graph feature.
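For instance, a minimal note following the one-idea-per-note principle might look like this in Obsidian’s Markdown (a hypothetical template: [[double brackets]] create backlinks between notes, and #tags support filtering and the graph view):

# One idea per note: thematic motif X in early ḥadīth
#hadith #motifs

A single thought, argument, or example goes here, stated in your own words.

Related: [[another-note-on-the-same-theme]]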

This method, or any method for that matter, can be integrated right into the reading process in Zotero. Using the Better Notes Add-on for Zotero, any notes taken while reading the PDF file of a resource can easily be synced with Obsidian and vice versa, as both Obsidian and Better Notes use the Markdown file format to store notes locally on the computer. To get the integration working, you only need to point both Obsidian and Better Notes to the same folder or Markdown file.

The implementation of a system that uses the three components above is straightforward.

Implementation

Here’s a brief outline to guide you in setting up an optimized research workflow based on the principles described above. I encourage students to be curious and follow through with each point individually by reading the relevant documentation and watching tutorials shared by numerous researchers on YouTube.

For this context, here is an actionable plan:

  • Set up Zotero:
    • Install Zotero and the browser connectors.
    • Create a folder structure that mirrors your research topics.
    • Develop a habit of immediately reviewing and organizing new entries.
  • Integrate Obsidian:
    • Set up a vault for your research project.
    • Settle on a note-taking method that works for you.
    • Understand the linking capabilities of Obsidian and incorporate them into your note-taking habits.
  • Integrate Better Notes:
    • Install the Add-on inside Zotero.
    • Sync your notes with Obsidian.
  • Regular maintenance and review:
    • Schedule regular reviews of your Zotero library and Obsidian vault to keep them organized and ensure that they adhere to your system.
    • Conduct regular, more in-depth reviews, looking for connections between notes and potential research directions.
    • Be consistent, but not obsessive.

By consistently applying these practices, you’ll develop a robust, efficient research workflow that allows you to focus on what truly matters—generating insights and advancing your field of study. 


[1] I employ this term here in a way that is analogous to what software engineers refer to as “technical debt.” https://en.wikipedia.org/wiki/Technical_debt

[2] https://www.aip.org/history-programs/niels-bohr-library/oral-histories/5020-5

How to Check Thousands of URLs for Broken Links

Introduction

At the core of “Closing the Gap” is a database of JSON files, each containing information relating to a research project in the Multilingual/Non-Latin-Script Digital Humanities. The files look like the following (a snippet; see the full file if you’re interested):

{
  "schema_version": "0.2.3",
  "record_metadata": {
    "uuid": "d1e6d69b-5e9a-4b4a-85ad-09aac56ed2d9",
    "record_created_on": "2021-11-08",
    "record_created_by": "Kudela, Xenia Monika",
    "last_edited_on": "2022-02-18"
  },
  "project": {
    "title": "Kalila and Dimna – AnonymClassic",
    "abbr": "",
    "type": "project",
    "ref": [],
    "date": [
      {
        "from": "2018-01-01",
        "to": "2022-12-31"
      }
    ],
    "maintained": null,
    "websites": [
      "https://www.geschkult.fu-berlin.de/en/e/kalila-wa-dimna/index.html",
      "https://kalila-and-dimna.fu-berlin.de/"
    ],
...

There are many places in our database schema where URLs are recorded. Projects have websites. We also include information about host institutions, the researchers who lead or are affiliated with each project, etc.—and their website URLs are generally listed.

With our project database already containing over 165 entries, the number of URLs can be expected to be several times larger. This is in fact the case: as will be described below, we currently have over 1,300 unique URLs. A challenge is that these links are scattered across different parts of hundreds of JSON files.

If we want to check the URLs in our database automatically for breakage—which we do—then we need some way of extracting them, along with, obviously, sending HTTP requests to all of them and tracking the responses. That is what this blog post is meant to explain: our usage of a tool called lychee to find and check links.

Sidebar: Why do this?

It is worth asking why we go to the trouble of checking hundreds of URLs and replacing them as they break. After all, one of the harsh realities of the Internet is that no URL will remain valid forever. With 1,300+ links to check (and counting), we find at least some breakage on a weekly basis. It often seems to be the case that, within a few years of the end of the funding period of a given DH project, the host university will allow the project website to go offline. With this in mind, why bother struggling against entropy?

Our answer is that part of the mission of “Closing the Gap” is to contribute to the improvement of standards of practice in the Digital Humanities. Research projects, and the institutions that host them, should strive to assign stable URLs and to maintain their validity for as long as is practical. How long is long enough? This is a matter of some subjectivity, but we think it is safe to say that any DH project that was active within the last decade should have an accessible website. (Relatedly, we advocate the use of static websites and web apps, which are easier to keep online over longer periods—and easier for the Internet Archive to snapshot.)

There have been cases in which we noticed that URLs at a given institution were breaking at high rates, and we were able to notify them and to see an actual improvement in the situation. So there is some method to the madness. We go through all the URLs in our database on a regular basis; find broken links; fix/replace them to the best of our ability; and notify site owners when we see larger problems. This allows us to keep our data relatively clean and to perform a kind of community service among DH researchers.

Extracting URLs

lychee is a modern, performant command-line utility for checking links. It is implemented in Rust, a relatively new and popular programming language that is designed to make it easier for developers to write correct, memory-safe software without sacrificing performance. So many excellent CLI tools are written in Rust that even non-programmers may reasonably take a tool’s use of the language as a suggestion of high quality.

At any rate, the easiest use case for link-checking is an HTML document, since one can at least parse the HTML and look for URLs in <a> elements. lychee does this nicely. It can, in fact, be pointed at a webpage online, in which case it will find and check all of the links on that page. We can use as an example the homepage of the Kalīla and Dimna Project at the Freie Universität Berlin (and you will notice a few broken links if you run this command):

lychee https://www.geschkult.fu-berlin.de/en/e/kalila-wa-dimna/

The lychee documentation explains more about the options that can be set, the file formats supported, etc. But, again, a simple approach will not work for the “Closing the Gap” database. We have links in hundreds of JSON files, sometimes deeply nested. There are also many duplicate URLs. What we need is to iterate over the files, extract links from each, and generate a unified list, which can then be checked.

There would no doubt be many different ways of accomplishing this. The approach that we chose is to couple lychee to another CLI tool (also implemented in Rust), fd-find. The following command, when run in the root of our repository, recursively identifies all JSON files:

fd -e json

And the other preparatory command that we need, with lychee, takes a file (JSON or otherwise) and “dumps” a list of all URLs found therein. The list can then be written to an output file, which we will use later to check the links:

lychee --dump [some_file.json] > links_list.md

We can connect the “find” command to the “dump all links” command by using the -x flag in fd-find. That is, we ask for all JSON files, and for the links to be extracted from each, collecting them in a single list:

fd -e json -x lychee --dump > links_list.md

Now, as you can imagine, it will be easy to point lychee’s link-checking function at this list. In the case of the “Closing the Gap” database, the “dump” process yields a list of nearly 2,200 URLs. By the time that duplicates are weeded out and various invalid links are skipped over (see below), we end up with more like 1,325 URLs to check.
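Weeding out the duplicates, incidentally, requires nothing fancy. Since lychee’s “dump” output is one URL per line, one plausible approach (a minimal sketch, not necessarily how every project will want to do it) is the standard sort utility:

sort -u links_list.md > links_deduped.md

The -u flag keeps only one copy of each line, so the resulting file contains each unique URL exactly once.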

Checking Links

This is, in a way, the easy part. Assuming that we’re in the root of the “Closing the Gap” repository and have generated the links_list.md file, we can run the following command (with a few options set, to be explained below):

lychee --max-concurrency 16 -m 3 links_list.md

The option --max-concurrency 16 tells lychee not to attempt to check more than 16 URLs at a time. We set this limit after finding that, by default, lychee would try to work too quickly, generating spurious errors. You can feel free to remove this option or adjust the value to something that works on your machine. Just be on the lookout for link-checking errors caused by the submission of too many HTTP requests at once.

As for the option -m 3, it sets a limit on the number of times that lychee will allow for a link to be redirected. This is somewhat subjective. It is normal for one URL to redirect to another, and even for a chain of several redirects to occur before the web client is given a final, substantive response. At the same time, we have found that a large number of redirects is sometimes indicative of an actual problem. For example, the website of a DH project may have been taken offline, but the host university set things up so that links to that site are redirected to the university homepage. These are in fact broken links, but they will be considered valid by lychee because they eventually lead to a successful response (albeit for a different resource). We can test for problems like this, to an extent, by limiting redirects.

How much redirection is too much? Again, this is subjective, but for the links in our database, we have found that we can fairly easily set a limit of 3. If we lower this to 2, lychee errors on more innocuous redirects—but we can manage this by updating the URLs in question. In fact, we have been using a limit of 2 internally, since we prefer to err on the side of strictness and to accept the occasional tedium that it produces. A limit of 3 is what we can more comfortably recommend to others.

Dealing with Errors

If you manage a website or a database that contains a substantial number of links, then you will soon encounter “errors that aren’t really errors,” or “errors that aren’t our fault.” Pages sometimes go down temporarily. An HTTP request can fail for any of a huge variety of reasons. And, it seems to us, a growing number of servers are configured to blanket-deny requests from command-line tools, presumably for security reasons and/or to make scraping more difficult. (The same libraries that allow lychee to check hundreds or thousands of URLs in a matter of seconds could be used by bad actors to launch DDoS attacks.) So there are links that you will find impractical to check programmatically.

There are ways to mitigate this problem—basically, to avoid sending requests to URLs that are guaranteed to fail, and to spare yourself the hassle of being bombarded with error messages that you have no way of fixing. With lychee, you can specify in a configuration file which links it should skip over. Since this post is already more than long enough, we will not go into the details here; but you can read about this in the lychee documentation, and look at the lychee.toml file in the “Closing the Gap” repository. You will see, for example, that we do not bother checking any links to the website of the British Library, since it denies access to command-line tools.
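To give a rough idea, the skip rules and the options discussed above can live together in a lychee.toml along these lines (a minimal sketch modeled on the example configuration in the lychee repository; the exclusion pattern shown is illustrative):

# limit parallel requests and redirects, as discussed above
max_concurrency = 16
max_redirects = 3

# regex patterns for URLs that should not be checked
exclude = [
  "https://www.bl.uk",
]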

The inevitability of encountering a large number and variety of errors with link-checking also makes it challenging to automate the process. For the time being, we are checking links in the “Closing the Gap” database manually—that is, manually triggering a command, which then runs automatically and reports a list of errors. As long as we make sure to do this once in a while, we can keep up with actual cases of link breakage, while ignoring errors that are not actionable. We do hope to add, at some point, a further degree of automation to this process: lychee can be run in GitHub Actions. Even then, however, we would have the link-checking workflow run at a certain interval (weekly?), simply to generate a list of errors. It would still be up to us to determine which errors we can fix, and to do so.
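For readers who want to experiment with that, a scheduled workflow might look roughly like the following, using the lychee-action published by the lychee maintainers (the cron schedule and arguments here are our assumptions, mirroring the command shown earlier, and it presumes that links_list.md is kept up to date in the repository):

name: Weekly link check
on:
  schedule:
    - cron: "0 6 * * 1"   # every Monday morning (UTC)
jobs:
  check-links:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # run lychee against the collected URL list and report errors
      - uses: lycheeverse/lychee-action@v1
        with:
          args: --max-concurrency 16 -m 3 links_list.md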

Conclusions

As has been explained above, our dual modus operandi in checking URLs is to keep our database clean and up to date (to the extent feasible), and to contribute to best practices in the community of Multilingual DH researchers. The Internet will always be a chaotic world, and that’s ok. We should just invest a bit of effort so that the projects on which we work remain findable and accessible over a long enough period that the public can benefit from them.

We encourage you to follow the work of “Closing the Gap in Non-Latin-Script Data” via our GitHub repository, our website, and our page under the website of the Seminar for Semitic and Arabic Studies at the Freie Universität Berlin. This post was written by Dr. Theodore Beers, who sometimes discusses topics relevant to Digital Humanities research on X/Twitter.

Event Report: Exploratory Visualizations of Cultural Heritage. Introduction and Hands-on.

Cultural heritage is much more than mere objects placed side by side for isolated viewing in a display case or their digital counterparts presented together in online collections. Their value largely lies in something not directly visible: their relations to each other, mutual influences, tangled origins, and parallel developments. In short, in the stories they tell. The question is, how can one offer viewers adequate access to this history? How does one visualize what cannot be directly seen? The workshop “Explorative Visualisierungen von Kulturgut. Einführung und Hands-on” (Exploratory Visualizations of Cultural Heritage), held on January 14 at the SUB Hamburg, addressed precisely this question.

Source: Sabine de Günther

But first, let’s take a few steps back. Online catalogs as such are already a small revolution. They have enabled cultural participation for all by making countless collections permanently accessible to a wide audience. However, the method of isolated object representation is increasingly subject to criticism: lack of scale perception, loss of relations between objects, limited contextualization, and an often missing narrative can all impair the interpretation of cultural heritage.

But the good news is that many are already trying to do better. One example was the British Museum with its “Museum of the World” (via Wayback Machine), an animated timeline (developed in collaboration with Weir+Wong and technological support from the Google Cultural Institute), which allowed interactive exploration of collections along parameters of geography and time. Another example is the Museum of Modern Art in New York with its virtual exhibition on abstract art “Inventing Abstraction 1910-1925,” where relationships between individual artists were made tangible through interactive networks. Not only museums themselves but also research institutes are working on innovative solutions. This includes the Fachhochschule Potsdam and the “Vikus Viewer”, a web-based visualization tool developed there, which arranges cultural artifacts on a dynamic canvas and supports the exploration of thematic and temporal patterns in large collections. What connects these approaches is the attempt to present cultural heritage in its continuity and interconnectedness and to enable viewers to explore those connections freely.

This innovative approach is also followed by the project “Restaging Fashion” (ReFa), dedicated to the cultural history of clothing, in which the workshop speaker Dr. Sabine de Günther participates as a research associate. In the project, vestimentary sources are presented in a graph-based visualization that combines narration with exploration, aiming to offer the audience both a guided entry point into the collection and a freely explorable view of it along thematic connections.

Illustration from ReFa

Building on the experiences gathered in the project, Dr. de Günther accompanied the workshop participants through the process of creating visualizations, from conveying the basic principles and analyzing existing examples to the collaborative design process of their own visualizations. Hundreds of notes with images of historical clothing from the project’s collection were distributed on the tables: a materialized dataset on which the participants could unleash their visualization creativity. Unlike other workshops in the “Digital Humanities – Wie geht das?” (“Digital Humanities: How Does That Work?”) series, the tasks were to be tackled in analog form, without computers, as the focus was on understanding the logic of the visualization process and on the development of creative visualization approaches. Participants were, for example, asked to create a collaged mock-up visualization of an aspect of the collection that interested them the most, or to find their own unique approach to the collection and make it visually understandable for others.

Source: Sabine de Günther
Source: Sabine de Günther

The impressive variety of ideas and approaches clearly demonstrated that there is much potential in unconventional data representation. How can one depict the temporal development of male headwear without losing sight of regional differences? How can one tell the story of a piece of jewelry that appears in several images simultaneously? Or more generally, how does one present a dataset to allow the viewer to freely explore the network of information according to their interests? Even if the answer seems complex, the workshop showed the variety of creative solutions that exploratory visualization of cultural heritage offers.

This contribution was simultaneously published in German on the blog “DH³ – Digital Humanities in der Hansestadt Hamburg” operated by the Referat für Digitale Forschungsdienste of the Staats- und Universitätsbibliothek (SUB) Hamburg.

Workshop: How to Preserve Diverse Data in a Monolingual Environment: Introducing the Project Closing the Gap in Non-Latin-Script Data (14.02)

Wednesday, February 14, 9:30 – 12:30
at HG154 (lecture room), VMP3

Registration: [email protected]

In our era of vast technological developments, digital methods have unlocked a broad spectrum of new research possibilities, not only in the natural sciences but also in the social sciences and the humanities. Digital preservation, new tools for distant reading, and quantitative text analysis have revolutionized knowledge extraction from texts. However, as these fields are largely dominated by the Global North, research involving materials in languages from beyond that sphere often faces limitations that hinder the utilization of novel technologies.

The workshop “How to Preserve Diverse Data in a Monolingual Environment,” to be held on February 14 at the Staats- und Universitätsbibliothek Hamburg Carl von Ossietzky, is part of an initiative to address this asymmetry. The research project Closing the Gap in Non-Latin-Script Data (based at the Freie Universität Berlin), in cooperation with the Referat für Digitale Forschungsdienste at the SUB Hamburg, has been conducting a survey and analysis of the field of Digital Humanities with a focus on low-resource and non-Latin-script (NLS) languages. The aim is to identify technical and structural limitations that may arise across various stages of projects working with such languages, particularly in terms of data analysis and sustainable data preservation. Furthermore, Closing the Gap strives to set an example for multilingual DH research aligned with FAIR principles, offering its workflows and solutions as guidelines for the community.

The goal of this workshop is twofold. First, members of the Closing the Gap team will present some of the data that the project has collected and the workflows that have been developed, as well as preliminary insights from this research—thereby providing an overview of challenges that are commonly faced in multilingual DH. Second, the workshop is intended to create a space for open discussion and exchange of ideas among DH practitioners, librarians, and others who are interested in improving the conditions for working with NLS textual data.