What happened? Although the Barcelona Declaration on Open Research Information was officially launched a couple of weeks ago, the recording of the webinar introducing it has just been published online. The Declaration is one of the most recent efforts to push for the open availability of data describing research, such as the data gathered in nationwide CRIS systems. It has been signed by several universities (including the University of Bologna), funders, national and regional governments, and other institutions and organisations, and it is supported by plenty of infrastructures (including OpenCitations) that already provide open research information. The webinar lasts over one hour but deserves a watch, particularly if you are an Open Science supporter.
What I did. In the past weeks, I have worked with colleagues (Arianna, Ivan and Sebastian) to finalise a paper describing our experience in using and adapting an existing ontology development methodology to create application profiles of given standards. The case study focused entirely on a project concerning the creation of a digital twin of a temporary exhibition (which closed in May 2023) dedicated to the figure of Ulisse Aldrovandi. Our goal was to obtain a data model, based on appropriate international standards, that we can use to record descriptive information about the original physical objects included in the exhibition, the digital models crafted from them, and the process followed for the digitisation.
What to read. A few days ago, I finished reading a lovely book on Open Science by Sabina Leonelli, which Ludo Waltman suggested a few months ago. It is a short, beautifully written book that introduces a new perspective on Open Science, proposing a philosophy of openness in which “research is understood first and foremost as an effort to foster collective agency” instead of “openness as sharing” of the outcomes of research. Published by Cambridge University Press, the book is available in Open Access.
A bit more than one year ago, we (Giovanni Colavizza, Gianmarco Spinaci, and myself) wanted to measure, to some extent, the relationship that Digital Humanities (DH) research has with other scholarly disciplines. We based our work on quantitative data from several bibliographic and citation data sources (including OpenCitations). We discovered that DH publications in journals are connected primarily to Computer Science, Linguistics, Psychology, and Pedagogical & Educational Research, and we also speculated that DH research might act as a hub for CS methods to diffuse in Social Sciences and Humanities (SSH) disciplines.
I do not want to expand on that discussion here; if you are curious, the article introducing the whole work has been published in open access in Digital Scholarship in the Humanities. However, there is one collateral outcome of that research that I want to re-introduce here: the list we created to identify DH publications (in journals), i.e. the List of DH Journals.
As of 27 April 2024, this list (shown below) contains 145 journals classified according to four distinct categories: exclusively (meaning that the journal is wholly devoted to publishing DH articles), significantly (a significant set of publications dealing with DH), marginally (only a tiny set of publications are DH), and mega-journal (for those vast venues that may publish articles relevant to the DH but not primarily).
A screenshot of the first seven items of the List of DH Journals, available at https://dhjournals.github.io/list/.
The list was created through three iterations: first, DH experts participated in a crowdsourcing activity to define an initial set of DH journals; then, the set was enriched by following citation links; finally, it was merged with additional lists available online.
While the list is a valuable asset, e.g. as a place to search for a suitable venue in which to publish DH research, it has not been updated in about a year. It may also miss some relevant journals: besides the crowdsourcing activity, the majority of the journals were added via an automatic process relying on data in bibliographic and citation indexes that we know miss several relevant venues, particularly in the SSH domain, so some local journals may not have surfaced in the analysis.
Thus, here is my call to the DH community: what do you think about extending it with more titles by creating a new issue on the GitHub repository?
(And, honestly, why not push some international DH associations, e.g. ADHO, to adopt and maintain it?)
In the scholarly ecosystem, a bibliographic citation is a conceptual directional link from a citing entity to a cited entity, used to acknowledge or ascribe credit for the contribution made by the author(s) of the cited entity. Citations are one of the core elements of scholarly communication. They enable the integration of our independent research endeavours into a global graph of relationships that can be used, for instance, to analyse how scholarly knowledge develops over time, assess scholars’ influence, and make wise decisions about research investment.
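To make the graph idea concrete, a citation network can be modelled as a minimal directed graph in which each edge runs from a citing entity to a cited one. The sketch below uses invented placeholder identifiers, not real DOIs or records:

```python
# A minimal sketch of a citation network as a directed graph.
# Each key is a citing entity; its list holds the entities it cites.
# The identifiers are invented placeholders, not real DOIs.
citations = {
    "10.1000/paper-a": ["10.1000/paper-b", "10.1000/paper-c"],
    "10.1000/paper-b": ["10.1000/paper-c"],
    "10.1000/paper-c": [],
}

def times_cited(graph, entity):
    """Count incoming citation links, i.e. how often `entity` is cited."""
    return sum(cited == entity for targets in graph.values() for cited in targets)

print(times_cited(citations, "10.1000/paper-c"))  # → 2
```

Counting incoming links like this is the basis of the simplest citation metrics; richer analyses (influence, knowledge flows over time) traverse the same directed structure.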
A copyrighted fact
However, as citation data, i.e. pieces of factual information aimed at identifying entities and the relationships among them, are of great value to the scholarly community, it has been a “scandal” that they have not been recognised as part of the commons. Indeed, only recently have we seen efforts – such as the Initiative for Open Citations (I4OC) – that try to change the behind-the-paywall status quo enforced by the companies controlling the major citation indexes used worldwide, by convincing scholarly publishers to support the unrestricted availability of scholarly citation data through their publication in suitable open infrastructures, such as Crossref and DataCite.
Of course, as for many other kinds of data, putting bibliographic and citation data behind a paywall is a threat to the full reproducibility of research studies based on them (e.g. in the bibliometrics, scientometrics, and science of science domains), even when such studies are published as open access articles. For instance, the results of a recent open access article published in Digital Scholarship in the Humanities, which analysed the citation behaviour of Digital Humanities (DH) research across different proprietary and open citation databases, are not fully reproducible, since the majority of the databases used – namely Scopus, Web of Science, and Dimensions – do not make their bibliographic and citation data openly available.
In addition, the coverage of publications and related citations in specific disciplines, particularly those within the Social Sciences and the Humanities (SSH), is inadequate compared to other fields. Usually, this is due to the limited availability of born-digital publications, together with a wide variety of publication languages, publication types (e.g. monographs), and complex referencing practices that may hinder their automatic processing and citation extraction. As a side effect, such partial coverage may introduce considerable bias when analysing SSH disciplines compared to STEM disciplines, which usually enjoy better coverage in existing citation databases.
Reforming research assessment
All these scenarios have at least one other negative effect on the area strictly concerned with research assessment, which often uses quantitative metrics based on citation data to evaluate articles, people, and institutions. Indeed, the unavailability and partial coverage of bibliographic and citation data create an artificial barrier to the transparency of the processes used to decide scholars’ careers in terms of research, funding, and promotions.
Several recent initiatives pushing for a reform of research assessment call on such systems:
- to be open and transparent by providing machine-readable, unrestricted and reusable data and methods for calculating the metrics used in research assessment exercises, and
- to leave to the research community, instead of commercial players, the control and ownership of the crucial infrastructures and tools used to retrieve, use and analyse such data.
Thus, the leading guideline that can be abstracted is to follow Open Science practices even when assessing research and not only when performing research.
Some initiatives pushing for reforming the principles behind research assessment systems.
Introducing OpenCitations
Within this context, OpenCitations (full disclosure: I am one of its directors) plays an important role as a key infrastructure component for global Open Science, and pushes for the active involvement of universities, scholarly libraries and publishers, infrastructures, governments and international organisations, research funders, developers, academic policy-makers, independent scholars and ordinary citizens. The mission of OpenCitations is to harvest and openly publish accurate and comprehensive metadata describing the world’s academic publications and the scholarly citations that link them, with the greatest possible global coverage and subject scope, encompassing both traditional and non-traditional publications, and with a breadth and depth that surpasses existing sources of such metadata. It maintains the highest standards of accuracy, accompanies all its records with rich provenance information, and provides this information, both in human-readable form and in interoperable machine-readable Linked Open Data formats, under open licenses, at zero cost and without restriction for third-party analysis and re-use.
For OpenCitations, open is the crucial value and the final purpose. It is the distinctive mark and founding principle that everything OpenCitations provides – data, services and software – is open and free and will always remain so. OpenCitations fully espouses the aims and vision of the UNESCO Recommendations on Open Science, complies with the FAIR data principles, and promotes and practises the Initiative for Open Citations recommendation that citation data, in particular, should be Structured, Separable, and Open.
The most important collection of such open citation data is COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations. The last release, dated August 2022, contains more than 1.36 billion citation links between more than 75 million bibliographic entities that can be accessed programmatically using its REST API, queried via the related SPARQL endpoint, and downloaded in full as dumps in different formats (CSV, JSON, and RDF).
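As a sketch of programmatic access, the snippet below builds a COCI REST API request for the citations pointing to a given DOI. The endpoint pattern follows the OpenCitations API documentation; the DOI is just a placeholder, and `incoming_citations` needs a network connection, so only the URL construction is exercised here:

```python
import json
from urllib.request import urlopen

# Base URL of the COCI REST API (see the OpenCitations API documentation).
API = "https://opencitations.net/index/coci/api/v1"

def citations_url(doi):
    """Build the COCI endpoint returning all citations pointing to `doi`."""
    return f"{API}/citations/{doi}"

def incoming_citations(doi):
    """Fetch the citing DOIs for `doi`; each JSON record is expected to
    carry 'citing' and 'cited' fields (requires a network connection)."""
    with urlopen(citations_url(doi)) as response:
        return [record["citing"] for record in json.load(response)]

# Placeholder DOI, for illustration only.
print(citations_url("10.1000/example"))
```

The same base URL also exposes, among others, a `references/{doi}` route for outgoing citations, so the sketch can be adapted to walk the citation graph in either direction.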
Collaborations between OpenCitations and other Open Science infrastructures and services.
In addition to the publication of citation data, a considerable effort has been dedicated to collaborating with other Open Science infrastructures working in the scholarly ecosystem, as summarised in the figure above. Since 2020, OpenCitations has significantly benefited from community support following its 2019 selection by the Global Sustainability Coalition for Open Science Services (SCOSS) as a scholarly infrastructure worthy of financial support. The community funding permitted the appointment of people dedicated to the administration, communication, community development, and maintenance and improvement of the OpenCitations software and the computational infrastructure on which it runs. In addition, OpenCitations started its involvement with OpenAIRE and the European Open Science Cloud (EOSC), and it is collaborating with other funded projects such as RISIS2, OutCite, OPTIMETA and B!SON.
While OpenCitations currently provides a good set of citation data, which is approaching parity with commercial citation databases and has already been used in a few studies for research purposes, there is still room for improvement. Currently, the citations included in the OpenCitations Indexes come mainly from Crossref, one of the biggest open reference providers. However, Crossref does not cover all the publishers of DOI-based resources: other DOI providers, such as DataCite, in some cases expose citation relations in their metadata. In addition, DOI-based publications represent just a limited subset of all the bibliographic entities published in the scholarly ecosystem. Other identifier schemes have been used to identify bibliographic entities – and, for some publications, no identifiers exist at all!
Thus, to address these two issues, OpenCitations is working on expanding its coverage in two different directions. On the one hand, OpenCitations is developing two new citation indexes of open references based on the holdings of DataCite and the National Institutes of Health Open Citation Collection, which, together with COCI, will be cross-searchable through the Unifying OpenCitations REST API.
On the other hand, OpenCitations has started working on a new database entitled OpenCitations Meta, which will provide three major benefits. First, it will permit storing in-house the bibliographic metadata for the citing and cited entities involved in all the OpenCitations Indexes, including author identifiers using the ORCID and VIAF identifier schemes where available. Second, it will provide better query performance than the present API system, which obtains bibliographic metadata on the fly via live API calls to external services, such as the Crossref and DataCite APIs. Finally, it will permit indexing citations involving entities lacking DOIs, by providing them with OpenCitations Meta Identifiers.
This last collection, combined with automatic tools for citation extraction from digital formats, is crucial for increasing the coverage of disciplines and fields underrepresented in bibliographic databases, such as SSH publications. One of OpenCitations’ goals is to reduce this gap in citation coverage by setting up crowdsourcing workflows for ingesting missing citation data from the scholarly community (e.g. libraries and publishers). In the future, another contribution will be to set up tools for the automatic extraction of citations that can also support small and local publishers, crucial assets for SSH research, which may struggle to carry out citation extraction on their own, since using and maintaining such a tool (or paying a company to perform those tasks on the publisher’s behalf) entails costs beyond many publishers’ means.
To conclude: OpenCitations is one piece of a puzzle that is working to change existing scholarly practices to create an open and inclusive future for science and research in which the scholarly community owns and is responsible for its own data.