Add support for OAI-harvesting from DataCite

DataCite maintains an OAI server (https://oai.datacite.org/oai) serving records for every DOI they have registered. There is a lot of interest in being able to harvest from it: since these are all registered DOIs, they redirect to the original archival location of the actual studies, datasets, etc.

There are a couple of issues that must be addressed before our OAI client implementation is able to do that.

1. The oai_dc import code in Dataverse expects the metadata fragment to be self-contained and, most importantly, to have the main persistent identifier (the DOI in this case) present in the `<dc:identifier>` field. DataCite, however, does not include the main DOI in the oai_dc metadata. Since they use these DOIs as the OAI identifiers as well, they assume that it is enough to include them in the OAI record header, in the `<identifier>` field, like this:  
```
<record>
<header>
      <identifier>doi:10.7910/dvn/tjclkp</identifier>
      <datestamp>2023-01-03T21:08:00Z</datestamp>
      <setSpec>HARVARDU</setSpec>
      <setSpec>GDCC.HARVARD-DV</setSpec>
</header>
<metadata>
      <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
         <dc:title>Open Source at Harvard</dc:title>
         <dc:creator>Durbin, Philip</dc:creator>
         <dc:publisher>Harvard Dataverse</dc:publisher>
         <dc:date>2017</dc:date>
         <dc:date>Issued: 2017</dc:date>
         <dc:description>The tabular file contains information ...</dc:description>
         <dc:contributor>Durbin, Philip</dc:contributor>
         <dc:type>Dataset</dc:type>
     </oai_dc:dc>
</metadata>
</record>
```
  Without the `<dc:identifier>`, our code in its current form fails to import the record above.
All that needs to be done is to add some logic that uses the identifier from the OAI header in situations like this. (We actually used to do that in one of the previous iterations of the harvester.)

2. DataCite's OAI implementation offers a very promising feature: it accepts arbitrary search queries as OAI set names (https://support.datacite.org/docs/datacite-oai-pmh#arbitrary-queries). This would make it possible to harvest individual records by their DOIs (something we've been asked for specifically) or any possible subset of their offerings.
Example: 
```
echo "doi%3A10.7910/DVN/TJCLKP" | base64 
ZG9pJTNBMTAuNzkxMC9EVk4vVEpDTEtQCg==
```
Now you can harvest this "set", made up of the single dataset above, as in
https://oai.datacite.org/oai?verb=ListRecords&metadataPrefix=oai_dc&set=~ZG9pJTNBMTAuNzkxMC9EVk4vVEpDTEtQCg==
_Unfortunately_, for whatever reason, the above notation only works in `ListRecords`, but not in `ListIdentifiers`, which Dataverse actually uses. From talking to DataCite, they may be able to fix it eventually, but not in an instant, "oh yeah, we just had this one line commented out" way.
We should go ahead and implement support for harvesting using `ListRecords`. It should be faster, if nothing else: for various historical reasons, we currently harvest via `ListIdentifiers` followed by `GetRecord`, one record at a time. It may also come in handy in other situations to have both modes supported (and configurable, per client maybe?).
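
To make the set notation from item 2 concrete, here is a minimal Python sketch that builds the `~`-prefixed set name and the corresponding `ListRecords` URL. It is an illustration only, not Dataverse code: the function names and the `trailing_newline` flag are made up for this example. Note that the shell `echo` above appends a newline before encoding, which is why the encoded value ends in `Cg==`; the flag exists only to reproduce that exact string, and whether DataCite treats the two forms identically has not been checked here.

```python
import base64
from urllib.parse import urlencode

DATACITE_OAI = "https://oai.datacite.org/oai"


def datacite_query_set(query: str, trailing_newline: bool = False) -> str:
    """Encode a search query as a '~'-prefixed, base64-encoded OAI set name.

    trailing_newline=True mimics `echo "..." | base64`, which encodes the
    newline that echo appends (hypothetical flag, for illustration only).
    """
    raw = query + ("\n" if trailing_newline else "")
    return "~" + base64.b64encode(raw.encode("utf-8")).decode("ascii")


def list_records_url(set_spec: str, metadata_prefix: str = "oai_dc") -> str:
    """Build a ListRecords request URL for the given set name."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
    }
    # Keep '~' and '=' literal so the URL matches the form used above.
    return DATACITE_OAI + "?" + urlencode(params, safe="~=")


if __name__ == "__main__":
    set_spec = datacite_query_set("doi%3A10.7910/DVN/TJCLKP", trailing_newline=True)
    print(set_spec)  # ~ZG9pJTNBMTAuNzkxMC9EVk4vVEpDTEtQCg==
    print(list_records_url(set_spec))
```

With `trailing_newline=True` this reproduces the URL shown above; with the default, the encoded value is the same string without the final `Cg==`.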

Clearly, we don't want to touch the current, JSF-based harvesting clients UI. But making the changes above, in the import and harvesting back end code, and then making it possible to set up or configure a client via the `/api/harvest/clients` API to take advantage of these improvements should be both useful and sufficient. 
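
For the import-side change described in item 1, the fallback could look roughly like the Python sketch below. This is not the actual Dataverse import code (which is Java); the function name, the trimmed sample record, and the rule of accepting a header identifier only when it starts with `doi:` or `hdl:` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Trimmed version of the DataCite record above, with the OAI namespace declared.
SAMPLE = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>doi:10.7910/dvn/tjclkp</identifier>
    <datestamp>2023-01-03T21:08:00Z</datestamp>
  </header>
  <metadata>
    <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/"
               xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">
      <dc:title>Open Source at Harvard</dc:title>
    </oai_dc:dc>
  </metadata>
</record>"""

# Assumption: only header identifiers that look like persistent IDs are usable.
PID_PREFIXES = ("doi:", "hdl:")


def persistent_identifier(record_xml: str):
    """Return the record's persistent identifier, or None if there is none.

    Prefers the first non-empty <dc:identifier>; otherwise falls back to the
    OAI header <identifier> when it looks like a persistent ID.
    """
    record = ET.fromstring(record_xml)
    for element in record.iterfind(".//dc:identifier", NS):
        text = (element.text or "").strip()
        if text:
            return text
    header_id = record.findtext(
        "oai:header/oai:identifier", default="", namespaces=NS
    ).strip()
    if header_id.lower().startswith(PID_PREFIXES):
        return header_id
    return None


if __name__ == "__main__":
    print(persistent_identifier(SAMPLE))  # doi:10.7910/dvn/tjclkp
```

The prefix check is there because many OAI servers use header identifiers such as `oai:example.org:123` that are not persistent IDs, so blindly trusting the header would not be safe outside the DataCite case.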



Add support for OAI-harvesting from DataCite #10909
