MANAGING AND ARCHIVING DIGITAL CORPORA – THEORETICAL WORKFLOWS AND PRACTICAL EXAMPLES
The question of how to ensure the long-term FAIRness of our research data can never be discussed too much. The multitude of new technologies and a broad palette of tools and infrastructures make it easier to create more sustainable and interoperable solutions. However, opinions still differ significantly on which ones to implement and how exactly to do so. It is crucial to address these issues within the academic community to develop standardized solutions, and the workshop “Creating, Managing and Archiving Textual Corpora in Under-resourced Languages” was held with exactly this purpose in mind.
The workshop was conceived by the DARIAH Working Groups Research Data Management and Multilingual DH, financed through the DARIAH-EU Funding Scheme for Working Group Activities 2023-25, and hosted by the University of Hamburg from 28 to 30 August 2024. It brought together a large number of experts on multiple languages and various aspects of digital scholarship, including members of our project Closing the Gap in Non-Latin-Script Data (CtG). The workshop resulted in standardized workflows for building, managing, archiving, and annotating multilingual corpora. Although the focus was on low-resource and endangered languages, these workflows are valuable to any scholar aiming to ensure the FAIRness of their work. The following paragraphs outline the key aspects of the workflows, along with practical examples from our project.
- Data source, preparation, and format
Before gathering new material, check whether the research question you have in mind can be answered with data already available in digital form. If it cannot, you will need to collect the materials and digitize them yourself, using methods such as optical character recognition (OCR), handwritten text recognition (HTR), or automatic speech recognition (ASR). A crucial step is to check the ethical and legal considerations regarding your source materials and, if possible, select those that can be shared later, as this will enhance the reusability and impact of your research. Another key decision concerns the format of your data. Avoid proprietary file formats such as MS Office DOCX. From the outset, work with open, non-proprietary file formats (e.g., TXT, JSON, CSV), which are widely supported and accessible without requiring specific software or licenses. Finally, regardless of the specific purpose or any modifications made to your data, always retain the basic textual data and metadata for archiving and documentation purposes.
CtG conducts meta-research on the field of Multilingual Digital Humanities, addressing the challenges of managing and preserving diverse linguistic data, particularly in non-Latin scripts. The meta-corpus consists of data on various digital projects that focus on Arabic and related languages. The core principle of the project is to offer all of its data in open access for further reuse. To this end, it is ensured that all information included in the corpus, whether sourced online or obtained through interviews, can be shared openly. The data is stored in JSON, an open, human-readable format that is understandable to people with varying levels of technical expertise. This allows users to easily access, modify, and contribute to the corpus, fostering a collaborative research environment. Moreover, JSON is widely used and supported by most systems, which helps ensure the sustainability and interoperability of the data.
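As a minimal sketch of what this looks like in practice, the following Python snippet writes and re-reads a corpus entry as UTF-8 JSON. The field names are illustrative assumptions, not the project's actual schema.

```python
import json

# Illustrative entry for a JSON-based meta-corpus; the field names are
# hypothetical examples, not the project's actual schema.
entry = {
    "title": "Example Digital Edition Project",
    "languages": ["Arabic"],
    "data_formats": ["TEI-XML", "JSON"],
    "license": "CC BY 4.0",
}

# Write the entry as UTF-8 JSON; ensure_ascii=False keeps non-Latin
# characters human-readable instead of escaping them.
with open("entry.json", "w", encoding="utf-8") as f:
    json.dump(entry, f, ensure_ascii=False, indent=2)

# The file can be re-opened and inspected without any special software.
with open("entry.json", encoding="utf-8") as f:
    print(json.load(f)["title"])
```

Because the file is plain text, it can be opened in any editor, diffed in Git, and parsed by virtually every programming language.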
- Metadata
Metadata is crucial for meeting the FAIR principles because it makes data findable and accessible by clearly describing the content and context of your research. Good-quality metadata is also essential for interoperability and reusability, allowing others to understand, interpret, and reuse your data correctly. Be sure to create your metadata in a simple and consistent format, such as JSON or CSV. Ideally, it should cover the following aspects: provenance, intellectual property, ethical issues, access and reuse (licensing), as well as structural, descriptive, and technical information.
Gathering extensive metadata is central to our project. Each data entry consists of three metadata sections: 1) record metadata, containing the project-specific UUID, the name of the person who added the entry, and the dates of its creation and last modification; 2) project metadata, which includes technical and descriptive information for each added project, from basic details such as title, hosting institutions, project duration, and involved researchers to detailed records on research objectives, methodologies, technology stack, and licensing; and 3) metadata on the relations of the project, i.e., the titles and UUIDs of related entries (a sketch of this structure follows below). This comprehensive approach ensures that the data remains well-documented, easily searchable, and fully transparent, which facilitates long-term accessibility, reproducibility, and collaborative research.
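The following Python sketch shows how such a three-part record might be assembled; all field names and values are hypothetical placeholders rather than the project's actual schema.

```python
import json
import uuid
from datetime import date

# Hypothetical sketch of the three metadata sections described above.
record = {
    "record_metadata": {
        "uuid": str(uuid.uuid4()),            # project-specific identifier
        "added_by": "Jane Researcher",        # hypothetical name
        "created": date.today().isoformat(),
        "last_modified": date.today().isoformat(),
    },
    "project_metadata": {
        "title": "Example Corpus Project",
        "hosting_institutions": ["University of Example"],
        "duration": "2023-2025",
        "researchers": ["Jane Researcher"],
        "objectives": "Build an annotated example corpus",
        "methodology": ["OCR", "manual correction"],
        "tech_stack": ["Python", "TEI"],
        "license": "CC BY 4.0",
    },
    "relations": [
        # Links to related entries; the UUID is left as a placeholder.
        {"title": "A Related Project", "uuid": "..."},
    ],
}

print(json.dumps(record, indent=2, ensure_ascii=False))
```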
- Documentation and versioning
Develop comprehensive documentation (guidelines) covering the entire corpus, individual texts, your annotation processes, research objectives, and the overall research project. Ensure that this documentation is readily accessible alongside your corpus during the archiving process, and retain a secure copy for future reference. Document all changes when updating the corpus and preserve previous versions to maintain a complete record of the evolution of your research, facilitating transparency and reproducibility.
The project implements a Git-based file database to manage and version-control the corpus data. This approach ensures that all changes are tracked, providing a clear history of modifications and facilitating collaborative efforts. In addition, the project team maintains detailed documentation of all dependencies and technologies used, as well as of any changes made to the dataset. The database is hosted in a public GitHub repository, promoting transparency by making all data and changes publicly accessible, which encourages community engagement. This openness not only builds trust but also allows for peer review and validation of the data and methodologies employed.
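In practice, versioning a corpus file in such a setup can be as simple as staging and committing it. The sketch below assumes Git is installed and the working directory is a repository; the file path and commit message are illustrative.

```python
import subprocess

def commit_change(path: str, message: str) -> None:
    """Stage a modified data file and record the change in the Git history."""
    subprocess.run(["git", "add", path], check=True)
    subprocess.run(["git", "commit", "-m", message], check=True)

# Hypothetical example: record an update to one corpus entry.
commit_change("data/entry.json", "Update licensing information for entry")

# The full modification history of the file remains queryable, e.g.:
#   git log --follow data/entry.json
```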
- Standardized vocabularies
To avoid ambiguity, use controlled vocabularies and authority files linked to recognized community repositories. This promotes accurate analyses and makes your information easier to interpret.
The project extensively utilizes authority files to link all entities representing institutions, locations, and individuals to identifiers such as VIAF, Wikidata, GND, or GeoNames. To ensure optimal searchability and future retrieval of the data, the project team has developed a taxonomy system encompassing all concepts relevant to non-Latin-script (NLS) research. Adhering to the principles of Open Data and Open Science, the taxonomy is grounded in existing controlled vocabularies, including the DHA Taxonomy and TaDiRAH.
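A minimal sketch of what such a link can look like in an entry, assuming a simple key-value layout; the URI patterns follow the respective services, but concrete identifiers should always be verified against them.

```python
# Illustrative entity record linking a place name to authority files.
# Verify concrete identifiers against the services before use.
place = {
    "name": "Hamburg",
    "identifiers": {
        "wikidata": "https://www.wikidata.org/entity/Q1055",
        "geonames": "https://sws.geonames.org/2911298/",
    },
}

# Resolvable URIs disambiguate the entity regardless of the spelling or
# script used elsewhere in the corpus.
print(place["identifiers"]["wikidata"])
```

Because the identifiers resolve to stable records, retrieving the entity no longer depends on any particular spelling or transliteration.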
- Licensing
Make your corpus available in the most open manner possible while respecting any necessary restrictions. Keeping the FAIR principles in mind throughout the process, license your data under a Creative Commons (CC) license, provided that your data providers permit it from legal and ethical standpoints.
Our data sources are twofold: we either gather information that is openly accessible online or contact researchers directly to obtain more detailed insights through interviews, during which we request explicit permission to share the information with the scientific community. This approach enables us to make not only all our workflows but also the entire dataset openly available for further use on GitHub, under the CC BY 4.0 license.
- Archiving
Once you have assembled your corpus, metadata, and comprehensive documentation, it’s time to focus on the long-term archiving of your data. You can opt for certified data centers (e.g., a CLARIN B-centre), data repositories affiliated with your research institution, or inter-institutional repositories like Zenodo. Infrastructures like CLARIN offer a robust, distributed network of 70 centers across Europe, providing not only long-term archiving but also tools to apply the FAIR principles to research data. CLARIN centers, especially those certified as B-centres, host repositories that ensure data sustainability and accessibility for future research projects. Be aware that some institutional repositories have specific format requirements, which may necessitate migrating your dataset; if so, again steer clear of proprietary file formats. Always ensure that you deposit the most recent version of your data and documentation.
Hosting the data on GitHub and employing a Git-based management system offers a lightweight solution that does not require extensive infrastructure or resources, making it an ideal choice for projects with limited funding or technical support. The sustainability of this approach rests on widely adopted tools and platforms, which are likely to remain supported and updated in the long term. Additionally, regular snapshots via the Internet Archive’s Wayback Machine, tagged releases, and backups to Zenodo, along with the decentralized nature of Git repositories, further enhance the reliability and durability of the archived data.
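As an illustration of the snapshot step, the sketch below requests a Wayback Machine capture of a hypothetical repository URL via the public Save Page Now endpoint; it assumes the requests library is installed and that the endpoint remains available in this form.

```python
import requests

REPO_URL = "https://github.com/example-org/example-corpus"  # hypothetical repository

# Ask the Internet Archive's Save Page Now service to capture the page.
resp = requests.get(f"https://web.archive.org/save/{REPO_URL}", timeout=120)
resp.raise_for_status()
print("Snapshot requested; archived copy at or near:", resp.url)
```

For the Zenodo backups, Zenodo’s GitHub integration can archive each tagged release automatically and assign it a DOI, so no custom upload code is needed.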
CtG provides an example of radical FAIRness and openness, which might not be entirely possible for projects working with more sensitive data. Additionally, the lightweight structure of a file-based database may pose challenges for projects with more complex, relational data. It is therefore important that each project develops, from the outset, a detailed data management plan that maximizes openness and sustainability given the nature of its data.
Many initiatives and organizations support researchers in developing FAIR data management strategies. One such resource is the SSH Open Marketplace, a platform where researchers can access and share workflows, as well as create and customize their own workflows for specific research projects. The platform enhances the discoverability and contextualization of research tools, datasets, and workflows, fostering collaboration and knowledge-sharing within the digital humanities community. Researchers are furthermore encouraged to license their images under CC BY, which allows free use with creator credit, or CC0, which places the work in the public domain for unrestricted use without attribution.
Another valuable resource is the DARIAH Transformations journal, which emphasizes the documentation of methodological and research activities in the arts and humanities. The journal provides a platform for detailed documentation of data gathering, processing, and annotation, ensuring transparency and comprehensive record-keeping, and it requires structured metadata to support proper archiving and reusability of data. Its overlay model ensures that research and accompanying documentation are immediately accessible through open repositories, enhancing the reliability and availability of scholarly work.
More detailed information about the workflows for building, managing, and archiving multilingual corpora, as elaborated during the workshop, can be found here, here, and here.