Change Dataverse / Dublin Core mapping to improve OAI-PMH harvesting #8129

@philippconzett

Note: dc:rights is being handled in #5920 and #4176 but the original description of this issue has been preserved.

Based on a semi-systematic survey of how DataverseNO metadata is harvested in Bielefeld Academic Search Engine (BASE; https://www.base-search.net/Search/Advanced), a major search engine for research outputs, we have noticed some issues related to the way the Dataverse software provides Dublin Core metadata for OAI-PMH harvesting.

dc:type
BASE harvests multiple types of research output, e.g. publications and datasets. When searching BASE, you can limit the results to datasets by selecting Dataset in the Document Type section of the advanced search:
[Screenshot: BASE advanced search, Document Type section with Dataset selected]

However, very few of the metadata records harvested directly from DataverseNO are marked as Document Type = Dataset.
In the oai_dc format, which BASE uses for harvesting, Document Type appears to be based on the dc:type field. According to the Dataverse Metadata Crosswalk, dc:type corresponds to the Dataverse metadata field Kind of Data. But this field may contain very different values, e.g., “survey data”, “survey”, “observations” etc. Dublin Core (see https://www.dublincore.org/specifications/dublin-core/dcmi-terms/#http://purl.org/dc/terms/type) recommends “to use a controlled vocabulary such as the DCMI Type Vocabulary” for dc:type. The DCMI Type Vocabulary has “dataset” as one of its values. I therefore suggest changing the Dataverse / DC Element (oai_dc) mapping so that dc:type is hard-coded as “dataset” for all dataset metadata in Dataverse.
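To make the dc:type suggestion concrete, here is a minimal sketch in Python that builds an oai_dc record both ways: with dc:type taken from Kind of Data, and with dc:type set to a fixed value. It is not Dataverse's export code. The function names, the `hard_code_type` flag, and the sample values are assumptions for illustration; only the two namespace URIs come from the OAI-PMH and Dublin Core specifications. The sketch uses the capitalised label "Dataset", which is how the term is written in the DCMI Type Vocabulary.

```python
import xml.etree.ElementTree as ET
from typing import List

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC_NS)
ET.register_namespace("dc", DC_NS)

# Term label as written in the DCMI Type Vocabulary.
DCMI_TYPE_DATASET = "Dataset"


def oai_dc_record(title: str, kind_of_data: List[str], hard_code_type: bool) -> ET.Element:
    """Build a tiny oai_dc record (hypothetical helper, not Dataverse code)."""
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    ET.SubElement(root, f"{{{DC_NS}}}title").text = title
    # Mapping described in the crosswalk: dc:type <- Kind of Data.
    # Mapping suggested in this issue: dc:type <- fixed controlled value.
    types = [DCMI_TYPE_DATASET] if hard_code_type else kind_of_data
    for value in types:
        ET.SubElement(root, f"{{{DC_NS}}}type").text = value
    return root


def dc_values(record: ET.Element, element: str) -> List[str]:
    """Return the text of every dc:<element> child of the record."""
    return [e.text for e in record.findall(f"{{{DC_NS}}}{element}")]


if __name__ == "__main__":
    record = oai_dc_record("Example dataset", ["survey data"], hard_code_type=True)
    print(ET.tostring(record, encoding="unicode"))
```

With the fixed value, a harvester that filters on dc:type sees the same controlled term for every dataset, regardless of what was entered in Kind of Data.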

dc:date
The Dataverse metadata field Publication Date is available as dcterms:issued, but it doesn’t seem to be among the oai_dc fields Dataverse exposes for OAI-PMH harvesting. According to the Dataverse Metadata Crosswalk, dc:date corresponds to the Dataverse metadata field Deposit Date. However, BASE uses dc:date as input for its metadata field Year of Publication, and all the random samples I tested in BASE indicate that dc:date actually corresponds to the Dataverse field Date of Production. I suggest changing the Dataverse / DC Element (oai_dc) mapping so that dc:date is mapped to Publication Date. This is also in line with citation recommendations: the publication date is the preferred date when citing research data; see, e.g., page 12 in The Tromsø Recommendations for Citation of Research Data in Linguistics; https://doi.org/10.15497/rda00040.
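The effect of the date mapping on a harvester can be sketched as follows. This is an illustration only: the dictionary keys, the sample dates, and both function names are made up for the example, and the way BASE derives Year of Publication from dc:date is assumed here to be "take the year of the date".

```python
from datetime import date

# Hypothetical dataset; keys and dates are illustrative, not real metadata.
dataset = {
    "production_date": date(2015, 6, 1),
    "deposit_date": date(2021, 9, 20),
    "publication_date": date(2021, 9, 28),
}


def proposed_dc_date(metadata: dict) -> str:
    """Suggested mapping: dc:date <- Publication Date, as an ISO 8601 string."""
    return metadata["publication_date"].isoformat()


def year_of_publication(dc_date: str) -> int:
    """Assumed harvester behaviour: read the year from the start of dc:date."""
    return int(dc_date[:4])


if __name__ == "__main__":
    value = proposed_dc_date(dataset)
    print(value, year_of_publication(value))
```

If dc:date carried Date of Production instead, the same harvester logic would report the production year as the year of publication, which is the mismatch observed in the samples.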

dc:rights
For some of the sources included in BASE, the degree of Open Access is indicated. Among them are some Dataverse-based repositories. For DataverseNO and other Dataverse-based repositories, on the other hand, this information is not available / unknown (“unbekannt”):
[Screenshot: BASE source information showing the Open Access status as “unbekannt”]

The Open Access information in BASE is based on the Dublin Core field dc:rights. Dataverse does not provide the field dc:rights. A correct value in this field would enable BASE to indicate the degree of Open Access (see more information at https://www.base-search.net/about/en/faq_oai.php#dc-rights). For datasets without access restriction, the dc:rights field could look like this: info:eu-repo/semantics/openAccess (see more information at https://guidelines.openaire.eu/en/latest/data/field_rights.html#rightsuri-ma).
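A minimal sketch of how such a dc:rights value could be attached to a record is shown below. The helper names and the `has_access_restrictions` flag are assumptions for illustration, not Dataverse internals. The issue only proposes a value for datasets without access restrictions, so the sketch leaves dc:rights out in every other case rather than guessing.

```python
import xml.etree.ElementTree as ET
from typing import Optional

DC_NS = "http://purl.org/dc/elements/1.1/"
OPEN_ACCESS = "info:eu-repo/semantics/openAccess"


def dc_rights_value(has_access_restrictions: bool) -> Optional[str]:
    """Return the dc:rights value for unrestricted datasets, else None.

    What to emit for restricted datasets is not specified in this issue,
    so no value is produced for them (assumption of this sketch).
    """
    if has_access_restrictions:
        return None
    return OPEN_ACCESS


def add_dc_rights(record: ET.Element, has_access_restrictions: bool) -> None:
    """Append a dc:rights element to the record when a value is available."""
    value = dc_rights_value(has_access_restrictions)
    if value is not None:
        ET.SubElement(record, f"{{{DC_NS}}}rights").text = value


if __name__ == "__main__":
    record = ET.Element("record")
    add_dc_rights(record, has_access_restrictions=False)
    print(ET.tostring(record, encoding="unicode"))
```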
