Reconcile the PID of a Dataset (If Multiple PID Providers Are Enabled) #10567

johannes-darms · 2024-05-17T14:33:12Z

What this PR does / why we need it:

Please refer to this issue:

Reconcile the PID of a Dataset (If Multiple PID Providers Are Enabled) #10501

Which issue(s) this PR closes:

Closes Reconcile the PID of a Dataset (If Multiple PID Providers Are Enabled) #10501

Special notes for your reviewer / Suggestions on how to test this:
Configure two PID Providers via the docker-compose file.
Alter this line:

  -Ddataverse.pid.providers=fake,perma

Add these lines to the dataverse env:

      
        -Ddataverse.pid.perma.type=perma
        -Ddataverse.pid.perma.label=PermaProvider
        -Ddataverse.pid.perma.permalink.base-url=https://example.org/
        -Ddataverse.pid.perma.permalink.separator=-
        -Ddataverse.pid.perma.authority=identifier

Start up dataverse.
Create a new dataset with a datafile.
Then change the pidProvider of the dataset.

curl -X PUT --location "http://localhost:8080/api/v1/datasets/{id}/pidReconcile/perma" \
    -H "X-Dataverse-key: $KEY" -d 'perma'

Then reconcile the pid via API call:

curl -X PUT --location "http://localhost:8080/api/v1/datasets/{id}/pidReconcile" \
    -H "X-Dataverse-key: $KEY"

Verify that the PID changes in the UI and that a notification appears.
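The verification step can also be scripted instead of checked in the UI. The following is only a sketch under assumptions that do not come from this PR: that `jq` is installed, and that the dataset JSON returned by `GET /api/v1/datasets/{id}` exposes the PID as `.data.persistentUrl`. The helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: compare a dataset's PID before and after reconciliation.
# Assumptions (not from this PR): jq is available, and the dataset JSON
# carries the PID URL in .data.persistentUrl.

SERVER_URL="${SERVER_URL:-http://localhost:8080}"

# Build the URL of the dataset with the given database id.
dataset_url() {
  printf '%s/api/v1/datasets/%s' "$SERVER_URL" "$1"
}

# Print the dataset's current persistent URL (needs $KEY, curl and jq).
current_pid() {
  curl -sS -H "X-Dataverse-key: $KEY" "$(dataset_url "$1")" \
    | jq -r '.data.persistentUrl'
}

# Succeed only if both values are non-empty and differ.
pid_changed() {
  [ -n "$1" ] && [ -n "$2" ] && [ "$1" != "$2" ]
}

# Usage (run manually against a test instance):
#   before=$(current_pid 42)
#   ... call the pidReconcile endpoint ...
#   after=$(current_pid 42)
#   pid_changed "$before" "$after" && echo "PID changed: $before -> $after"
```

The comparison is deliberately strict about empty values, so a failed request does not get reported as a changed PID.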

Does this PR introduce a user interface change? If mockups are available, please link/include them here:
No.
Is there a release notes update needed for this change?:

Included.

Preview docs at https://dataverse-guide--10567.org.readthedocs.build/en/10567/api/native-api.html#reconcile-the-pid-of-a-dataset-if-multiple-pid-providers-are-enabled


coveralls · 2024-05-17T14:43:02Z

coverage: 22.831% (+0.08%) from 22.751%
when pulling ee4fdee on johannes-darms:feat/reconcilepid
into 6be4f20 on IQSS:develop.

src/main/java/edu/harvard/iq/dataverse/engine/command/impl/AbstractDatasetCommand.java

src/main/java/edu/harvard/iq/dataverse/DvObject.java

src/main/java/edu/harvard/iq/dataverse/DataFile.java

src/main/java/edu/harvard/iq/dataverse/api/Datasets.java

src/main/java/edu/harvard/iq/dataverse/engine/command/impl/ReconcileDatasetPidCommand.java

src/test/java/edu/harvard/iq/dataverse/engine/command/impl/ReconcilePIDCommandTest.java

qqmyers · 2024-05-21T18:11:54Z

@johannes-darms - I made several related comments about avoiding mixing the effective Pid generator and PidProvider for the existing PID concepts. If that's not clear, let's have a call or slack, etc.

src/test/java/edu/harvard/iq/dataverse/engine/command/impl/ReconcilePIDCommandTest.java

johannes-darms · 2024-05-22T06:06:08Z

> I made several related comments about avoiding mixing the effective Pid generator and PidProvider for the existing PID concepts. If that's not clear, let's have a call or slack, etc.

Thanks for the explanation; I wasn't aware of the distinction. I'll update the code accordingly.

…nd not effectivePidProvider as suggested by qqmeyers.

…e PidProvider as another API can be used for this.

johannes-darms · 2024-06-05T06:58:20Z

@qqmyers : I've adapted the PR according to your feedback. Could you have another look?

src/main/java/edu/harvard/iq/dataverse/engine/command/impl/AbstractDatasetCommand.java

src/main/java/edu/harvard/iq/dataverse/engine/command/impl/ReconcileDatasetPidCommand.java

pdurbin · 2025-02-07T14:06:57Z

For the record, this run yesterday was successful: https://jenkins.dataverse.org/job/IQSS-Dataverse-Develop-PR/job/PR-10567/12/testReport/ . All API tests passed in about 17 minutes.

pdurbin · 2025-02-07T14:49:55Z

Ok, yes, in the latest run at https://jenkins.dataverse.org/job/IQSS-Dataverse-Develop-PR/job/PR-10567/13/console I'm seeing this:

[ERROR] Errors: 
[ERROR]   XmlMetadataTemplateTest.testDataCiteXMLCreationAllFields:248 » NullPointer Cannot invoke "String.equals(Object)" because "name" is null
[ERROR]   XmlMetadataTemplateTest.testDataCiteXMLCreationAllFieldsMultipleGeoLocations:313 » NullPointer Cannot invoke "String.equals(Object)" because "name" is null

So yeah, same territory as this PR:

Flaky XmlMetadataTemplateTest fix #11225

pdurbin

Just some quick feedback. For context, I'm responding to @johannes-darms saying that this PR is high priority (2 of 3) in the discussion on Zulip about which PRs we should consider for Dataverse 6.6: https://dataverse.zulipchat.com/#narrow/channel/375707-community/topic/Release.206.2E6.20Timeline/near/498207634

I gave this PR a size of 20 rather than 10 because I think it still needs some thinking, design work, and buy-in from the core team. Maybe I'll move it down once I understand the PR better. I'm leaving some questions below.

Overall it seems like a nice feature! I'm curious... what's the real world use case? I assume it's something like "We give all of our datasets Permalinks by default but along the way we decide that some datasets deserve an actual DOI. We want to move these datasets to a collection where DOIs are configured. Once moved, we give them a DOI (and keep the old Permalink around as an alternative PID)." Is that close?

pdurbin · 2025-02-07T14:51:41Z

doc/release-notes/10567-feat-reconcilepid.md

@@ -0,0 +1,4 @@
+Added a new API for persistent identifier reconciliation. An unpublished dataset can be updated with a new


I don't love "PID reconcile" for the name of this feature. Maybe "PID swap"? "PID switch"? "PID update" or "PID change"? ("Update" and "change" are the words used in the bundle.)

I'm also happy with any other wording. However, this API call does not take any argument and just synchronises/reconciles the current configuration.

Let's stick with reconcile. "Reassign PID" might also work. 🤷

pdurbin · 2025-02-07T14:54:01Z

doc/release-notes/10567-feat-reconcilepid.md

This pull request says it closes the following issue:

Feature Request/Idea: Move draft Datasets should update their PID configuration #10501

However, if one moves a dataset, does the PID get updated? From a quick look at the code, it seems like two steps are needed:

1. move the dataset
2. run the PID reconcile command

Am I missing something?

You're right. Initially, the idea was to combine this action with the move command. However, after discussions with @qqmyers (see issue comments), I decided to implement it as a separate command. This command only performs an action if the configured pidGenerator differs from the existing PID of the dataset or datafile.
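To illustrate the "only act when they differ" idea, here is a rough sketch. It is not the command's actual logic: it assumes a PID string of the form `protocol:authority...` and simply checks whether the existing PID already starts with the protocol and authority of the configured generator. The function name and PID formats are hypothetical.

```shell
#!/usr/bin/env bash
# Illustration only, not the logic of ReconcileDatasetPidCommand.
# Assumption: a PID looks like "<protocol>:<authority>...", for example
# "doi:10.5072/FK2/ABC" or "perma:identifier-ABC".

# needs_reconcile PID PROTOCOL AUTHORITY
# Succeeds (exit 0) when the PID does NOT match the configured generator.
needs_reconcile() {
  local pid="$1" protocol="$2" authority="$3"
  case "$pid" in
    "$protocol:$authority"*) return 1 ;;  # already matches, nothing to do
    *) return 0 ;;                         # mismatch, reconciliation would apply
  esac
}

# Usage:
#   needs_reconcile "doi:10.5072/FK2/ABC" perma identifier && echo "would reconcile"
```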

This means that if a superuser changes the PID provider using the existing API call, they can run this command to ensure the new pidGenerator is applied—effectively reconciling the PID generator with the actual PID configuration.

The workflow is as follows:

1. Change the PID Provider of a dataset (https://guides.dataverse.org/en/latest/api/native-api.html#configure-the-pid-generator-a-dataset-uses-if-enabled)
2. Reconcile the PID (basically this PR)
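The two steps can be chained in a small script. This is a sketch, assuming the `pidGenerator` endpoint and its plain-text request body work as described in the linked guide section; the reconcile path is the one from the test notes in this PR. The function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch of the two-step workflow: 1) set the PID generator, 2) reconcile.
# Assumption: the pidGenerator endpoint takes the generator id as the request
# body, as described in the linked guide section.

SERVER_URL="${SERVER_URL:-http://localhost:8080}"

# URL builders for a dataset database id.
pid_generator_url() { printf '%s/api/v1/datasets/%s/pidGenerator' "$SERVER_URL" "$1"; }
reconcile_url()     { printf '%s/api/v1/datasets/%s/pidReconcile' "$SERVER_URL" "$1"; }

# change_and_reconcile DATASET_ID GENERATOR_ID   (needs $KEY)
# Stops after step 1 if the generator could not be set.
change_and_reconcile() {
  local id="$1" generator="$2"
  curl -sS --fail -X PUT -H "X-Dataverse-key: $KEY" \
       -d "$generator" "$(pid_generator_url "$id")" || return 1
  curl -sS --fail -X PUT -H "X-Dataverse-key: $KEY" "$(reconcile_url "$id")"
}

# Usage (against a test instance with a draft dataset):
#   KEY=... change_and_reconcile 42 perma
```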

The additional docs you added help a lot. Thanks.

pdurbin · 2025-02-07T14:55:29Z

doc/sphinx-guides/source/api/native-api.rst

+a draft dataset's PID by creating a new PID supported by the PID Provider and assigning the original PID as an
+alternativePersistentIdentifier for the dataset. The API is restricted to datasets that have not already been published.
+(It does not make any changes to any PID Provider.) Note that this change does not affect the storage repository where the
+old identifier is still used. (An administrator could move the files manually and set the storagelocationdesignator to


Having to move the files manually seems like a pretty big deal to me. Perhaps this could be emphasized more. 🤷

That makes sense. What about something like:
"Warning: This change does not affect the storage repository, where the old PID is still in use. A technical administrator could manually move the files offline and remove the old identifier from the database (by setting storagelocationdesignator to false for the old identifier in the alternativepersistentidentifier table). However, this step is not required for Dataverse to function correctly."

Better, let me try to iterate on what you wrote:

"Warning: This change does not affect the storage repository, where the old PID is still used in the name of where files are stored for the dataset. If you want to remove the PID from the name used in storage, you could manually move the files offline and remove the old identifier from the database (by setting storagelocationdesignator to false for the old identifier in the alternativepersistentidentifier table). However, this step is not required for Dataverse to function correctly."

I dunno, maybe your version is better. 😅 I do like having more explanation of some sort.

I've integrated your snippet and tried to improve the explanation above.

The extra docs are great. Thanks.

src/main/java/edu/harvard/iq/dataverse/engine/command/impl/ReconcileDatasetPidCommand.java

johannes-darms · 2025-02-10T07:02:38Z

> Overall it seems like a nice feature! I'm curious... what's the real world use case? I assume it's something like "We give all of our datasets Permalinks by default but along the way we decide that some datasets deserve an actual DOI. We want to move these datasets to a collection where DOIs are configured. Once moved, we give them a DOI (and keep the old Permalink around as an alternative PID)." Is that close?

We offer different PID systems for our users, but asking them to choose one upfront often leads to confusion and may result in the data capture or publication process being abandoned. To address this, we have moved the selection to the final step before publication. (See screenshot below) At this stage, users can choose the appropriate PID system, after which the system moves the dataset into the corresponding collection, updates the PID, and initiates the publication request.

Initially, I considered this as a reconciliation operation that should be part of the move process to ensure dataset consistency with the collection configuration. However, after discussing it with Jim, I revised the approach. Now, the PR focuses solely on reconciliation without moving datasets.

Another use case would be test or demo instances that start with a fake DOI or permalink and later need to transition to a proper PID system. Currently, this is not possible because certain safeguards prevent the reconciliation of already published records. However, this functionality could be easily implemented if needed.

# Conflicts: # doc/sphinx-guides/source/api/native-api.rst

pdurbin · 2025-03-20T19:30:35Z

@johannes-darms phew! 6.6 is out! It looks like @vera merged the latest from develop (thanks!) but some of the questions from my last review are unanswered. Can you please take a look? Thanks!

pdurbin

Tests are passing. I haven't run the code myself but it makes sense and I'm moving this to "ready for QA".

I found one typo in the doc and made a tweak to the release notes.

Also, the issue was called "Feature Request/Idea: Move draft Datasets should update their PID configuration" but I renamed it and the PR to "Reconcile the PID of a Dataset (If Multiple PID Providers Are Enabled)" because the PR is closing the issue and the PR does not do what the original issue title was asking for. You have to move the dataset and then run the reconcile command.

doc/sphinx-guides/source/api/native-api.rst

pdurbin · 2025-04-08T13:14:32Z

doc/release-notes/10567-feat-reconcilepid.md



Let's stick with reconcile. "Reassign PID" might also work. 🤷

pdurbin · 2025-04-08T13:15:15Z

doc/release-notes/10567-feat-reconcilepid.md

The additional docs you added help a lot. Thanks.

pdurbin · 2025-04-08T13:17:59Z

doc/sphinx-guides/source/api/native-api.rst



The extra docs are great. Thanks.

doc/release-notes/10567-feat-reconcilepid.md

ofahimIQSS · 2025-04-15T17:43:25Z

Looks good, merging.

feature(API.PID.reconcile): Added command and API to reconcile the PI…

2d216ea

…D of an unpublished dataset.

johannes-darms mentioned this pull request May 17, 2024

Reconcile the PID of a Dataset (If Multiple PID Providers Are Enabled) #10501

Closed

feature(API.PID.reconcile): Added release notes and API documentation

95f3e40

johannes-darms changed the title ~~WIP PID reconcile command~~ PID reconcile command May 21, 2024

johannes-darms commented May 21, 2024

src/main/java/edu/harvard/iq/dataverse/engine/command/impl/AbstractDatasetCommand.java

johannes-darms commented May 21, 2024

src/main/java/edu/harvard/iq/dataverse/DvObject.java

johannes-darms commented May 21, 2024

src/main/java/edu/harvard/iq/dataverse/DataFile.java

fix(API.PID.reconcile.doc): Fixed documentation

71f84ee