
Essential Digital Resources for Students of Arabic Studies

In the past, studying Arabic literature meant spending hours in archives, flipping through bulky dictionaries, and deciphering the varied calligraphic styles of manuscripts. While physical books in the library remain invaluable, the rise of digital humanities has given us access to powerful tools that support research, enhance analysis, and provide deep insights into Arabic literary heritage.

The research process in Arabic studies is not always linear—sometimes, you start with a text and then consult dictionaries, while other times, you might begin with a dictionary to clarify a word before selecting a text. Depending on your focus, you may also need computational tools for text analysis or reference works for historical and literary context. To help students navigate Arabic studies more effectively, this post introduces essential digital resources, including text repositories, dictionaries, digital analysis tools, and academic databases.

1. Text Resources (Poetry & Qur’ānic Texts) → Find and Explore Primary Sources

Before diving into linguistic details, we need access to primary sources—poetry, historical texts, and often the Qur’ān. These digital repositories make it easier to find and work with Arabic texts.

Poetry & Arabic Literary Texts

  • Al-Diwan: A comprehensive database of classical and modern Arabic poetry, allowing users to search by poet, theme, era, and country. It provides valuable insights into the evolution of Arabic poetic forms and styles, making it an essential resource for students of Arabic literature.
  • Al-Maktaba Al-Shāmila: A vast digital library housing thousands of classical Arabic works, including poetry, prose, and Islamic texts. Its powerful search functionality allows users to locate specific texts, keywords, and phrases across the whole collection, making it an essential resource for academic research. The library is accessible on both computers and mobile devices, with categorised browsing options for Islamic sciences, literature, and historical sources, facilitating efficient navigation and exploration.
  • OpenITI: A digital corpus of pre-modern Arabic texts designed for computational research. It enables students to conduct large-scale text mining, linguistic analysis, and comparative studies across thousands of Arabic texts. OpenITI is especially valuable for those interested in digital humanities, allowing for the exploration of stylistic trends, textual variations, and intertextuality in Arabic literature and historical sources.
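Because OpenITI texts are plain files with a structured metadata header, even a few lines of code can get you started with corpus exploration. The Python sketch below counts word frequencies in a single locally downloaded text; the filename is hypothetical, and the header delimiter reflects OpenITI's mARkdown convention at the time of writing, so check the corpus documentation before relying on it.

```python
import re
from collections import Counter

# Load a locally downloaded OpenITI text (filename is hypothetical).
# OpenITI's mARkdown format places a metadata header before the text,
# conventionally ending with the line "#META#Header#End#".
with open("0255Jahiz.Hayawan-ara1.txt", encoding="utf-8") as f:
    raw = f.read()

# Strip the metadata header so only the text body is counted.
body = raw.split("#META#Header#End#", 1)[-1]

# Drop mARkdown structural lines (starting with '#') and tokenise
# with a rough Arabic-letter regex.
text = "\n".join(line for line in body.splitlines() if not line.startswith("#"))
tokens = re.findall(r"[\u0621-\u064A]+", text)

# Print the twenty most frequent word forms.
for word, count in Counter(tokens).most_common(20):
    print(word, count)
```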

Qur’ānic Resources

Understanding and analysing the Qur’ān requires access to accurate, well-structured digital resources. Whether you are studying its linguistic features, theological interpretations, or textual variations, these platforms provide essential tools for both beginners and advanced researchers.

  • The Noble Qur’an: A widely used online platform offering translations, tafsīr, and recitations of the Qur’ān in multiple languages. This resource is beneficial for comparative studies and linguistic analysis.
  • Tanzil: A high-quality, verified digital Qur’ānic text that ensures accuracy for academic reference and software development. It provides both Uthmani and Imlaei script versions, allowing students to study different orthographic styles. Its advanced search functionality, inclusion of pause marks, and customisable diacritic options make it a highly flexible tool for Qur’ānic studies, Arabic linguistics, and digital humanities research (a short parsing sketch follows this list).
  • Corpus Coranicum: A research project that compiles Qur’ānic manuscripts, variant readings, and historical texts related to the Qur’ān. It features a searchable database of early manuscripts, transliterations, and variant readings, along with a philological commentary examining the historical development of the Qur’ānic text.
  • Quranic Arabic Corpus: A linguistically annotated database that provides morphological and syntactic analysis of the Qur’ān, allowing users to explore grammatical structures and lexical patterns. It also features a syntactic treebank, a semantic ontology, and detailed word-by-word analysis of every token in the text.
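To give a concrete sense of how approachable these resources are computationally: Tanzil's plain-text downloads put one verse per line in a simple pipe-delimited layout (sura|aya|text). The minimal Python sketch below assumes a downloaded file named quran-simple.txt and that the layout matches Tanzil's documentation at the time of writing; verify both against the edition you actually download.

```python
# Parse a Tanzil plain-text download (e.g. "quran-simple.txt"; the exact
# filename depends on the edition). Each content line has the form
#   sura|aya|verse text
verses = {}
with open("quran-simple.txt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        # Skip blank lines and the comment lines at the end of the file.
        if not line or line.startswith("#"):
            continue
        sura, aya, text = line.split("|", 2)
        verses[(int(sura), int(aya))] = text

# Look up al-Fātiḥa 1:1 and report the total count.
print(verses[(1, 1)])
print(f"{len(verses)} verses loaded")
```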

2. Dictionaries & Lexicons → Understand Word Meanings

Once you have found a text, you may need dictionaries and lexicons to interpret complex words and phrases. These resources not only help with translation but also provide insights into historical meanings, etymology, and linguistic variations.

  • Ejtaal: A searchable collection of Arabic dictionaries, including Hans Wehr and Lane’s Lexicon. While switching between separate dictionaries can be cumbersome, the platform allows users to compare definitions across multiple lexicons within a single interface.
  • Lane’s Lexicon: A historical Arabic–English dictionary that provides rich etymological insights. It is particularly useful for research on classical Arabic literature and historical texts. Two versions of this lexicon are available, allowing you to choose the one that best suits your needs. (link I, link II)
  • Almaany: A modern dictionary offering contextual meanings and quick translations. Its user-friendly interface and extensive database make it an excellent choice for students working with contemporary Arabic texts.

3. Digital Humanities Tools → Computational Approaches to Text Analysis

While digital tools are not always necessary, they provide valuable insights by identifying linguistic patterns, tracking word frequency, and analysing textual structures. Computational methods enable efficient comparative analysis, visualisation of textual relationships, and deeper engagement with Arabic texts. These tools are especially useful for students exploring digital humanities, computational linguistics, and advanced text analysis techniques.

  • Voyant Tools: A web-based text analysis and visualisation tool that allows users to explore word frequency, collocations, and thematic trends in Arabic texts. It provides visualisations such as word clouds, frequency graphs, and keyword-in-context analysis, supporting both quantitative and qualitative research approaches.
  • Farasa: A suite of natural language processing (NLP) tools for Arabic text analysis, developed by the Qatar Computing Research Institute (QCRI). It offers tokenization, part-of-speech tagging, named entity recognition, and sentiment analysis, making it essential for Arabic text processing and linguistic research. The tools are accessible via an online demo page or through Web API services.
  • Arabic Romanization ALA-LC: A tool that converts Arabic script into standardised Latin transliteration, ensuring accuracy and consistency in academic work that involves transliterated Arabic terms. 
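To illustrate what such a transliteration tool does under the hood, here is a deliberately simplified Python sketch. It maps individual consonants only; genuine ALA-LC romanization depends on context-sensitive rules (short vowels, tāʾ marbūṭa, sun-letter assimilation of al-) that no character table can capture, so treat this strictly as an illustration.

```python
# A deliberately partial, character-level sketch of ALA-LC romanization.
# Real ALA-LC requires context-sensitive rules that a simple mapping
# cannot capture; this table covers base consonants and alif only.
ALA_LC = {
    "ا": "ā", "ب": "b", "ت": "t", "ث": "th", "ج": "j", "ح": "ḥ",
    "خ": "kh", "د": "d", "ذ": "dh", "ر": "r", "ز": "z", "س": "s",
    "ش": "sh", "ص": "ṣ", "ض": "ḍ", "ط": "ṭ", "ظ": "ẓ", "ع": "ʿ",
    "غ": "gh", "ف": "f", "ق": "q", "ك": "k", "ل": "l", "م": "m",
    "ن": "n", "ه": "h", "و": "w", "ي": "y", "ء": "ʾ", " ": " ",
}

def romanize(text: str) -> str:
    # Characters without a mapping (diacritics, digits) are dropped.
    return "".join(ALA_LC.get(ch, "") for ch in text)

print(romanize("كتاب"))  # -> "ktāb" (unvocalised input lacks short vowels)
```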

4. Reference Works & Academic Databases → Contextualise Research

Once a text is analysed, you may need secondary sources to support your arguments and understand historical, literary, or religious contexts.

  • Encyclopaedia of Islam II: a leading academic resource on Islamic history, culture, and linguistics, offering in-depth articles on historical figures, legal traditions, religious practices, and social structures. It provides authoritative, well-referenced information on Islamic civilization, with critical insights into both historical developments and contemporary interpretations.
  • Encyclopaedia of the Qur’ān: a comprehensive reference work covering Qur’ānic terms, concepts, personalities, place names, cultural history, and exegesis. It also includes essays on key themes, making it an essential resource for those exploring the Qur’ān’s content, historical context, and interpretative traditions.

Both Encyclopaedia of Islam II and Encyclopaedia of the Qur’ān can be accessed through the university network or via the VPN service.

  • Encyclopaedia Iranica: a comprehensive academic resource covering Iranian history, culture, and literature, and their intersections with Arabic and Islamic studies. It is valuable for researching cross-cultural influences between Persian and Arabic traditions, providing in-depth articles by experts in the field on historical figures, literary movements, and cultural exchanges that shaped the region.

Digital resources have transformed Arabic studies, making it easier to access, analyse, and contextualise Arabic texts. By integrating these tools into your workflow, you can enhance your understanding of Arabic literature, historical texts, and the Qur’ān.

How to Be FAIR? Managing and Archiving Digital Corpora – Theoretical Workflows and Practical Examples


The question of how to ensure the long-term FAIRness of our research data can never be discussed too much. The multitude of new technologies and a broad palette of tools and infrastructures make it easier to create more sustainable and interoperable solutions. However, opinions still differ significantly on which ones to implement and how exactly to do so. It is crucial to address these issues within the academic community to develop standardized solutions, and the workshop “Creating, Managing and Archiving Textual Corpora in Under-resourced Languages” was held with exactly this purpose in mind.

The workshop was conceived by the DARIAH Working Groups Research Data Management and Multilingual DH, financed by the DARIAH-EU Funding Scheme for Working Group Activities 2023–25, and hosted by the University of Hamburg from 28 to 30 August 2024. It brought together a large number of experts on multiple languages and various aspects of digital scholarship, including members of our project Closing the Gap in Non-Latin-Script Data (CtG), and resulted in standardized workflows for building, managing, archiving, and annotating multilingual corpora. Although the focus was on low-resource and endangered languages, these workflows are valuable to any scholar aiming to ensure the FAIRness of their work. The following paragraphs outline the key aspects of the workflows, along with practical examples from our project.

  • Data source, preparation, and format

First, make sure that the research question you have in mind cannot be answered with data already available in digital form. If it cannot, you need to gather the materials and digitize them yourself, using methods such as OCR, HTR, or ASR. A crucial step is to check the ethical and legal considerations regarding your source materials and, if possible, select those that can be shared later, as this will enhance the reusability and impact of your research. Another key decision concerns the format of your data. Avoid proprietary file formats like MS Office DOCX. From the outset, work with open file formats (e.g., TXT, JSON, CSV), which are widely supported and accessible without requiring specific software or licenses, as in the sketch below. Finally, regardless of the specific purpose or any modifications made to your data, always retain the basic textual data and metadata for archiving and documentation purposes.
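As a minimal illustration of this advice, the following Python sketch stores a digitized text as plain TXT alongside a JSON metadata "sidecar"; all filenames and field values are hypothetical.

```python
import json
from pathlib import Path

# Illustrative only: store OCR/HTR output as plain text with a JSON
# metadata "sidecar" instead of a proprietary format.
text = "... digitized text from OCR/HTR ..."
metadata = {
    "source": "Manuscript XYZ, fol. 12r (hypothetical example)",
    "digitization_method": "HTR",
    "date_processed": "2024-08-28",
    "license": "CC BY 4.0",
}

out = Path("corpus/manuscript_xyz")
out.mkdir(parents=True, exist_ok=True)
(out / "text.txt").write_text(text, encoding="utf-8")
(out / "metadata.json").write_text(
    json.dumps(metadata, ensure_ascii=False, indent=2), encoding="utf-8"
)
```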

  • Metadata 

Metadata is crucial for ensuring the FAIR principles because it makes data findable and accessible by clearly describing the content and context of your research. Good-quality metadata is also essential for interoperability and reusability, allowing others to understand, interpret, and reuse your data correctly. Be sure to create your metadata in a simple and consistent format, such as JSON or CSV. Ideally, it should encompass the following aspects: provenance, intellectual property, ethical issues, access and reuse (licensing), as well as structural, descriptive, and technical information.

Gathering extensive metadata is central to our project. Each data entry consists of three metadata sections: 1) record metadata, containing the project-specific UUID, the name of the person who added the entry, and the dates of its creation and last modification; 2) project metadata, which includes technical and descriptive information for each added project, from basic details such as title, hosting institutions, project duration, and involved researchers, to detailed records on research objectives, methodologies, technology stack, and licensing; and 3) relational metadata, i.e. the titles and UUIDs of related entries. This comprehensive approach ensures that the data remains well-documented, easily searchable, and fully transparent, which facilitates long-term accessibility, reproducibility, and collaborative research. A sketch of such an entry follows.
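The Python sketch below reconstructs what such a three-part entry might look like as JSON. The field names are illustrative assumptions based on the description above, not the project's actual schema.

```python
import json
import uuid
from datetime import date

# Illustrative reconstruction of the three-part entry structure described
# above; the actual CtG schema and field names may differ.
entry = {
    "record": {
        "uuid": str(uuid.uuid4()),
        "added_by": "Jane Researcher",            # hypothetical name
        "created": str(date.today()),
        "last_modified": str(date.today()),
    },
    "project": {
        "title": "Example NLS Project",           # hypothetical project
        "hosting_institutions": ["University of Hamburg"],
        "duration": "2023-2025",
        "researchers": ["Jane Researcher"],
        "objectives": "Corpus building for an under-resourced language",
        "tech_stack": ["Python", "Git"],
        "license": "CC BY 4.0",
    },
    "relations": [
        {"title": "Related Project A", "uuid": str(uuid.uuid4())},
    ],
}
print(json.dumps(entry, indent=2, ensure_ascii=False))
```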

  • Documentation and versioning 

Develop comprehensive documentation (guidelines) for the entire corpus, individual texts, your annotation processes, research objectives, and the overall research project. Ensure that this document is readily accessible alongside your corpus during the archiving process, and retain a secure copy for future reference. Make sure to document all changes when updating the corpus and preserve previous versions to maintain a complete record of the evolution of your research, facilitating transparency and reproducibility.

The project implements a Git-based file database to manage and version control the corpus data. This approach ensures that all changes are tracked, providing a clear history of modifications and facilitating collaborative efforts. In addition, the project team conducts detailed documentation of all dependencies and technologies used, as well as any changes made to the dataset. The database is hosted on a public GitHub repository, promoting transparency by making all data and changes publicly accessible, which encourages community engagement. This openness not only builds trust but also allows for peer review and validation of the data and methodologies employed.
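A Git-based workflow of this kind can be scripted in a few lines. The Python sketch below stages, commits, and tags a corpus update via the git command line; the directory names are hypothetical, and it assumes Git is installed and the repository is already initialized.

```python
import subprocess
from datetime import date

def commit_corpus_update(message: str) -> None:
    """Stage and commit all corpus changes with a descriptive message,
    so every modification remains traceable in the Git history."""
    subprocess.run(["git", "add", "corpus/", "metadata/"], check=True)
    subprocess.run(["git", "commit", "-m", message], check=True)
    # Tag dated snapshots so earlier corpus versions stay citable.
    subprocess.run(
        ["git", "tag", f"corpus-snapshot-{date.today().isoformat()}"],
        check=True,
    )

commit_corpus_update("Add ten newly transcribed texts; update metadata")
```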

  • Standardized vocabularies

To prevent ambiguity, utilize controlled vocabularies and authority files connected to recognized community repositories. This will promote accurate analyses and enhance the clarity of your data.

The project extensively utilizes authority files to link all entities representing institutions, locations, and individuals to identifiers such as VIAF, Wikidata, GND, or Geonames. To ensure optimal searchability and future retrieval of the data, the project team has developed a taxonomy system encompassing all concepts relevant to NLS-specific research. Adhering to the principles of Open Data and Open Science, the taxonomy is grounded in existing controlled vocabularies, including the DHA Taxonomy and TaDiRAH.
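Linking entities to authority files can be semi-automated. The sketch below queries Wikidata's public search API for candidate identifiers matching an institution name; the API action shown (wbsearchentities) is part of Wikidata's documented MediaWiki API, but the returned candidates always need manual disambiguation before being recorded.

```python
import requests

def wikidata_candidates(name: str, limit: int = 5) -> list[dict]:
    """Query the public Wikidata search API for candidate QIDs
    matching an entity name (manual verification still required)."""
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbsearchentities",
            "search": name,
            "language": "en",
            "format": "json",
            "limit": limit,
        },
        timeout=10,
    )
    resp.raise_for_status()
    return [
        {"id": hit["id"], "label": hit.get("label", ""),
         "description": hit.get("description", "")}
        for hit in resp.json().get("search", [])
    ]

for candidate in wikidata_candidates("University of Hamburg"):
    print(candidate)
```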

  • Licensing 

Make your corpus available in the most open manner possible while respecting any necessary restrictions. Keeping the FAIR principles in mind throughout the process, license your data under Creative Commons (CC), provided that your data providers permit it from legal and ethical standpoints.

Our data sources are twofold: we either gather information that is openly accessible online or directly contact researchers to obtain more detailed insights through interviews, during which we request explicit permission to share the information with the scientific community. This approach enables us to make not only all our workflows but also the entire dataset available under open access for further use on GitHub, licensed under the CC BY 4.0 license.

  • Archiving

Once you have assembled your corpus, metadata, and comprehensive documentation, it’s time to focus on the long-term archiving of your data. You can opt for certified data centers (e.g., CLARIN B-centre), data repositories affiliated with your research institution, or inter-institutional repositories like Zenodo. Institutions like CLARIN offer a robust, distributed network of 70 centers across Europe, providing not only long-term archiving but also tools to ensure the FAIR principles are applied to research data. CLARIN centers, especially those certified as B-centers, host repositories that ensure data sustainability and accessibility for future research projects. Be aware that some institutional repositories may have specific format requirements, which could necessitate migrating your dataset. In such a case, again, steer clear of migrating it into proprietary file formats. Always ensure that you deposit the most recent version of your data and documentation.

Hosting the data on GitHub and employing a Git-based management system offers a lightweight solution that does not require extensive infrastructure or resources. This makes it an ideal choice for projects with limited funding or technical support. The sustainability of this approach is ensured through the use of widely adopted tools and platforms, which are likely to remain supported and updated in the long term. Additionally, regular snapshots via the Web Archive, releases, and backups to Zenodo, along with the decentralized nature of Git repositories, further enhance the reliability and durability of the archived data.
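Depositing a release on Zenodo can likewise be scripted. The following sketch follows the two-step flow described in Zenodo's public REST API documentation (create a deposition, then upload into its file bucket); endpoints and response fields may change, so check the current documentation, and test against the Zenodo sandbox first.

```python
import requests

# Sketch of depositing a corpus snapshot via Zenodo's REST deposit API
# (flow based on Zenodo's public documentation at the time of writing).
ACCESS_TOKEN = "your-zenodo-token"  # personal token, kept out of Git!

# 1. Create an empty deposition.
r = requests.post(
    "https://zenodo.org/api/deposit/depositions",
    params={"access_token": ACCESS_TOKEN},
    json={},
    timeout=30,
)
r.raise_for_status()
bucket_url = r.json()["links"]["bucket"]

# 2. Upload the archive into the deposition's file bucket.
with open("corpus-snapshot.zip", "rb") as fp:
    requests.put(
        f"{bucket_url}/corpus-snapshot.zip",
        data=fp,
        params={"access_token": ACCESS_TOKEN},
        timeout=300,
    ).raise_for_status()
```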

CtG provides an example of radical FAIRness and openness, which, in the case of projects working with more sensitive data, might not be entirely possible. Additionally, the lightweight structure of a file-based database may pose challenges for projects with more complex, relational data. It is therefore important that each project develops, from the outset, a detailed data management plan that maximizes openness and sustainability, given the nature of its data.

Many initiatives and organizations support researchers in developing FAIR data management strategies. One such resource is the SSH Open Marketplace, a platform where researchers can access and share workflows, as well as create and customize their own workflows for specific research projects. This platform enhances the discoverability and contextualization of research tools, datasets, and workflows, fostering collaboration and knowledge-sharing within the digital humanities community. Researchers are furthermore encouraged to license their images under CC BY, which allows free use with creator credit, or CC0, which places the work in the public domain for unrestricted use without attribution.

Another valuable resource is the DARIAH Transformations journal, which emphasizes the documentation of methodological and research activities in the arts and humanities. The journal provides a platform for detailed documentation of data gathering, processing, and annotation, ensuring transparency and comprehensive record-keeping. It also requires structured metadata to support proper archiving and reusability of data, and its overlay model ensures that research and accompanying documentation are immediately accessible through open repositories, enhancing the reliability and availability of scholarly work.

More detailed information about the workflows for building, managing, and archiving multilingual corpora, as elaborated during the workshop, can be found here, here, and here.