Add support for OAI-harvesting from DataCite #10909

@landreev

DataCite maintains an OAI server (https://oai.datacite.org/oai) serving records for every DOI they have registered. There is a lot of interest in being able to harvest from it. Since these are all registered DOIs, they redirect to the original archival location of the actual studies, datasets, etc.

There are a couple of issues that must be addressed before our OAI client implementation can do that.

  1. The oai_dc import code in Dataverse expects the metadata fragment to be self-contained and, most importantly, to have the main persistent identifier (the DOI in this case) present in the <dc:identifier> field. DataCite, however, does not include the main DOI in the oai_dc metadata. Since they use these DOIs as the OAI identifiers as well, they assume it is enough to include them in the OAI record header, in the <identifier> field, like this:
<record>
  <header>
    <identifier>doi:10.7910/dvn/tjclkp</identifier>
    <datestamp>2023-01-03T21:08:00Z</datestamp>
    <setSpec>HARVARDU</setSpec>
    <setSpec>GDCC.HARVARD-DV</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
      <dc:title>Open Source at Harvard</dc:title>
      <dc:creator>Durbin, Philip</dc:creator>
      <dc:publisher>Harvard Dataverse</dc:publisher>
      <dc:date>2017</dc:date>
      <dc:date>Issued: 2017</dc:date>
      <dc:description>The tabular file contains information ...</dc:description>
      <dc:contributor>Durbin, Philip</dc:contributor>
      <dc:type>Dataset</dc:type>
    </oai_dc:dc>
  </metadata>
</record>

Without the <dc:identifier>, our code in its current form fails to import the record above.
All we need to do is add some logic to use the identifier from the OAI header in situations like this. (We actually used to do that in one of the previous iterations of the harvester.)
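
The fallback described above can be sketched as follows. This is only an illustration in Python, not the actual Dataverse import code (which is Java); the function name `resolve_identifier` and the trimmed sample record are made up for the example. The sample declares the OAI namespace on <record> itself, on the assumption that a full OAI-PMH response declares it on the root element.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Namespace prefixes used in the lookups below.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def resolve_identifier(record_xml: str) -> Optional[str]:
    """Return the persistent identifier for one OAI <record>.

    Hypothetical helper: prefer a non-empty <dc:identifier> from the
    metadata, otherwise fall back to the <identifier> in the OAI header.
    """
    record = ET.fromstring(record_xml)
    for element in record.findall(".//dc:identifier", NS):
        if element.text and element.text.strip():
            return element.text.strip()
    header_id = record.find("oai:header/oai:identifier", NS)
    if header_id is not None and header_id.text and header_id.text.strip():
        return header_id.text.strip()
    return None


# Trimmed version of the DataCite record shown above (no <dc:identifier>).
SAMPLE_RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>doi:10.7910/dvn/tjclkp</identifier>
  </header>
  <metadata>
    <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/"
               xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">
      <dc:title>Open Source at Harvard</dc:title>
    </oai_dc:dc>
  </metadata>
</record>"""

print(resolve_identifier(SAMPLE_RECORD))  # doi:10.7910/dvn/tjclkp
```

Whether the header identifier should be used only as a fallback, or always for a given client, is a design choice this sketch does not settle.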

  2. The DataCite OAI implementation offers a very promising feature: it accepts arbitrary search queries as OAI set names (https://support.datacite.org/docs/datacite-oai-pmh#arbitrary-queries). This would make it possible to harvest individual records by their DOIs (something we've been asked for specifically) or any other subset of their offerings.
    Example:
echo "doi%3A10.7910/DVN/TJCLKP" | base64 
ZG9pJTNBMTAuNzkxMC9EVk4vVEpDTEtQCg==

Now you can harvest this "set" made up of one dataset above, as in
https://oai.datacite.org/oai?verb=ListRecords&metadataPrefix=oai_dc&set=~ZG9pJTNBMTAuNzkxMC9EVk4vVEpDTEtQCg==
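
For illustration, the same set name can be built programmatically. The sketch below is Python and the helper names `query_to_set` and `list_records_url` are hypothetical. Note that `echo` appends a newline, so the base64 string in the example above ends in `Cg==` (an encoded "\n"); encoding the query without the newline gives a shorter string. Whether DataCite treats the two forms identically has not been verified here.

```python
import base64
from urllib.parse import urlencode

DATACITE_OAI = "https://oai.datacite.org/oai"


def query_to_set(query: str) -> str:
    """Encode a (percent-encoded) query as a '~'-prefixed base64 set name."""
    encoded = base64.b64encode(query.encode("utf-8")).decode("ascii")
    return "~" + encoded


def list_records_url(query: str, metadata_prefix: str = "oai_dc") -> str:
    """Build a ListRecords URL for the set defined by the query."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": query_to_set(query),
    }
    # urlencode percent-encodes any '=', '+' or '/' in the base64 string.
    return DATACITE_OAI + "?" + urlencode(params)


query = "doi%3A10.7910/DVN/TJCLKP"
print(query_to_set(query))         # ~ZG9pJTNBMTAuNzkxMC9EVk4vVEpDTEtQ
print(query_to_set(query + "\n"))  # ~ZG9pJTNBMTAuNzkxMC9EVk4vVEpDTEtQCg==
print(list_records_url(query))
```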
Unfortunately, for whatever reason, the above notation only works in ListRecords, but not in ListIdentifiers, which is what Dataverse actually uses. From talking to DataCite, they may be able to fix it eventually, but not in an instant, "oh yeah, we just had this one line commented out" way.
We should go ahead and implement support for harvesting using ListRecords. It should be faster, if nothing else: we currently harvest via ListIdentifiers followed by GetRecord, one record at a time, for various historical reasons. It may also come in handy in other situations to have both modes supported (and configurable, per client maybe?).
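
A ListRecords-based harvest could look roughly like the sketch below. It is Python for illustration only, not the Dataverse harvester; `harvest_list_records` and the injected `fetch` callable are hypothetical names. The paging follows the OAI-PMH flow-control rules: follow-up requests carry only the verb and the resumptionToken, and a missing or empty token ends the list.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Optional
from urllib.parse import urlencode

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def harvest_list_records(
    base_url: str,
    fetch: Callable[[str], str],
    metadata_prefix: str = "oai_dc",
    set_spec: Optional[str] = None,
) -> Iterator[ET.Element]:
    """Yield every <record> element from a paged ListRecords harvest.

    `fetch` takes a URL and returns the response body as text; it is
    passed in so the loop can be exercised without network access.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    while True:
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        list_records = root.find(OAI + "ListRecords")
        if list_records is None:
            return  # e.g. an <error> response such as noRecordsMatch
        yield from list_records.findall(OAI + "record")
        token = list_records.find(OAI + "resumptionToken")
        if token is None or not (token.text or "").strip():
            return  # no token, or an empty one: this was the last page
        # resumptionToken is an exclusive argument: send it with the verb only.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

Each record arrives with its header and metadata in the same response, so there is no per-record GetRecord round trip. A real `fetch` could be as simple as a wrapper around `urllib.request.urlopen`; error handling and retries are left out here.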

Clearly, we don't want to touch the current, JSF-based harvesting clients UI. But making the changes above, in the import and harvesting back end code, and then making it possible to set up or configure a client via the /api/harvest/clients API to take advantage of these improvements should be both useful and sufficient.
