-
Notifications
You must be signed in to change notification settings - Fork 531
Reconcile the PID of a Dataset (If Multiple PID Providers Are Enabled) #10567
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…D of an unpublished dataset.
src/main/java/edu/harvard/iq/dataverse/engine/command/impl/AbstractDatasetCommand.java
Outdated
Show resolved
Hide resolved
src/main/java/edu/harvard/iq/dataverse/engine/command/impl/ReconcileDatasetPidCommand.java
Outdated
Show resolved
Hide resolved
src/main/java/edu/harvard/iq/dataverse/engine/command/impl/ReconcileDatasetPidCommand.java
Show resolved
Hide resolved
src/main/java/edu/harvard/iq/dataverse/engine/command/impl/ReconcileDatasetPidCommand.java
Outdated
Show resolved
Hide resolved
src/main/java/edu/harvard/iq/dataverse/engine/command/impl/ReconcileDatasetPidCommand.java
Outdated
Show resolved
Hide resolved
src/test/java/edu/harvard/iq/dataverse/engine/command/impl/ReconcilePIDCommandTest.java
Outdated
Show resolved
Hide resolved
src/test/java/edu/harvard/iq/dataverse/engine/command/impl/ReconcilePIDCommandTest.java
Outdated
Show resolved
Hide resolved
|
@johannes-darms - I made several related comments about avoiding mixing the effective Pid generator and PidProvider for the existing PID concepts. If that's not clear, let's have a call or slack, etc. |
src/test/java/edu/harvard/iq/dataverse/engine/command/impl/ReconcilePIDCommandTest.java
Outdated
Show resolved
Hide resolved
Thanks for the explanation I wasn't aware of the distinction. I'll update the code accordingly. |
…nd not effectivePidProvider as suggested by qqmeyers.
…e PidProvider as another API can be used for this.
|
@qqmyers : I've adapted the PR according to your feedback. Could you have another look? |
src/main/java/edu/harvard/iq/dataverse/engine/command/impl/AbstractDatasetCommand.java
Outdated
Show resolved
Hide resolved
src/main/java/edu/harvard/iq/dataverse/engine/command/impl/ReconcileDatasetPidCommand.java
Outdated
Show resolved
Hide resolved
src/main/java/edu/harvard/iq/dataverse/engine/command/impl/ReconcileDatasetPidCommand.java
Show resolved
Hide resolved
src/main/java/edu/harvard/iq/dataverse/engine/command/impl/ReconcileDatasetPidCommand.java
Outdated
Show resolved
Hide resolved
src/main/java/edu/harvard/iq/dataverse/engine/command/impl/ReconcileDatasetPidCommand.java
Outdated
Show resolved
Hide resolved
|
For the record, this run yesterday was successful: https://jenkins.dataverse.org/job/IQSS-Dataverse-Develop-PR/job/PR-10567/12/testReport/ . All API tests passed in about 17 minutes. |
|
Ok, yes, in the latest run at https://jenkins.dataverse.org/job/IQSS-Dataverse-Develop-PR/job/PR-10567/13/console I'm seeing this: So yeah, same territory as this PR: |
pdurbin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some quick feedback. For context, I'm responding to @johannes-darms saying that this PR is high priority (2 of 3) in the discussion on Zulip about which PRs we should consider for Dataverse 6.6: https://dataverse.zulipchat.com/#narrow/channel/375707-community/topic/Release.206.2E6.20Timeline/near/498207634
I gave this PR a size of 20 rather than 10 because I think it still needs some thinking and design work buy in from the core team. Maybe I'll move it down once I understand the PR better. I'm leaving some questions below.
Overall it seems like a nice feature! I'm curious... what's the real world use case? I assume it's something like "We give all of our datasets Permalinks by default but along the way we decide that some datasets deserve an actual DOI. We want to move these datasets to a collection where DOIs are configured. Once moved, we give them a DOI (and keep the old Permalink around as an alternative PID)." Is that close?
| @@ -0,0 +1,4 @@ | |||
| Added a new API for persistent identifier reconciliation. An unpublished dataset can be updated with a new | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love "PID reconcile" for the name of this feature. Maybe "PID swap"? "PID switch?" "PID update" or PID change"? ("Update" and "change" are the words used in the bundle.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also happy with any another wording. However, this API call does not take any argument and just synchronises/reconcile the current configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's stick with reconcile. "Reassign PID" might also work. 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pull request says it closes the following issue:
Feature Request/Idea: Move draft Datasets should update their PID configuration #10501
However, if one moves a dataset, does the PID get updated? From a quick look at the code, it seems like two steps are needed:
- move the dataset
- run the PID reconcile command
Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. Initially, the idea was to combine this action with the move command. However, after discussions with @qqmyers (see issue comments), I decided to implement it as a separate command. This command only performs an action if the configured pidGenerator differs from the existing PID of the dataset or datafile.
This means that if a superuser changes the PID provider using the existing API call, they can run this command to ensure the new pidGenerator is applied—effectively reconciling the PID generator with the actual PID configuration.
The workflow is as follows:
- Change the PID Provider of a dataset (https://guides.dataverse.org/en/latest/api/native-api.html#configure-the-pid-generator-a-dataset-uses-if-enabled)
- Reconcile the PID (basically this PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The additional docs you added help a lot. Thanks.
| a draft dataset's PID by creating a new PID supported by the PID Provider and assigning the original PID as an | ||
| alternativePersistentIdentifier for the dataset. The API is restricted to datasets that have not already been published. | ||
| (It does not make any changes to any PID Provider.) Note that this change does not affect the storage repository where the | ||
| old identifier is still used. (An administrator could move the files manually and set the storagelocationdesignator to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having to move the files manually seems like a pretty big deal to me. Perhaps this could be emphasized more. 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. What about something like:
"Warning: This change does not affect the storage repository, where the old PID is still in use. A technical administrator could manually move the files offline and remove the old identifier from the database (by setting storagelocationdesignator to false for the old identifier in the alternativepersistentidentifier table). However, this step is not required for Dataverse to function correctly."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better, let me try to iterate on what you wrote:
"Warning: This change does not affect the storage repository, where the old PID is still used in the name of where files are stored for the dataset. If you want to remove the PID from the name used in storage, you could manually move the files offline and remove the old identifier from the database (by setting storagelocationdesignator to false for the old identifier in the alternativepersistentidentifier table). However, this step is not required for Dataverse to function correctly."
I dunno, maybe your version is better. 😅 I do like having more explanation of some sort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've integrated your snipped and tried to improve the explanation above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The extra docs are great. Thanks.
src/main/java/edu/harvard/iq/dataverse/engine/command/impl/ReconcileDatasetPidCommand.java
Show resolved
Hide resolved
We offer different PID systems for our users, but asking them to choose one upfront often leads to confusion and may result in the data capture or publication process being abandoned. To address this, we have moved the selection to the final step before publication. (See screenshot below) At this stage, users can choose the appropriate PID system, after which the system moves the dataset into the corresponding collection, updates the PID, and initiates the publication request. Initially, I considered this as a reconciliation operation that should be part of the move process to ensure dataset consistency with the collection configuration. However, after discussing it with Jim, I revised the approach. Now, the PR focuses solely on reconciliation without moving datasets. Another use case would be test or demo instances that start with a fake DOI or permalink and later need to transition to a proper PID system. Currently, this is not possible because certain safeguards prevent the reconciliation of already published records. However, this functionality could be easily implemented if needed. |
# Conflicts: # doc/sphinx-guides/source/api/native-api.rst
|
@johannes-darms phew! 6.6 is out! It looks like @vera merged the latest from develop (thanks!) but some of the questions from my last review are unanswered. Can you please take a look? Thanks! |
pdurbin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test are passing. I haven't run the code myself but it makes sense and I'm moving this to "ready for QA".
I found one typo in the doc and made a tweak to the release notes.
Also, the issue was called "[-]Feature Request/Idea: Move draft Datasets should update their PID configuration" but I renamed it and the PR to "Reconcile the PID of a Dataset (If Multiple PID Providers Are Enabled)" because the PR is closing the issue and the PR does not do what the original issue title was asking for. You have to move the dataset and then run the reconcile command.
| @@ -0,0 +1,4 @@ | |||
| Added a new API for persistent identifier reconciliation. An unpublished dataset can be updated with a new | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's stick with reconcile. "Reassign PID" might also work. 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The additional docs you added help a lot. Thanks.
| a draft dataset's PID by creating a new PID supported by the PID Provider and assigning the original PID as an | ||
| alternativePersistentIdentifier for the dataset. The API is restricted to datasets that have not already been published. | ||
| (It does not make any changes to any PID Provider.) Note that this change does not affect the storage repository where the | ||
| old identifier is still used. (An administrator could move the files manually and set the storagelocationdesignator to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The extra docs are great. Thanks.
|
Looks good, merging. |

What this PR does / why we need it:
Please refer to this issue:
Which issue(s) this PR closes:
Special notes for your reviewer/ Suggestions on how to test this:**:
Configure two PID Provider via docker-compose-file
Alter this line:
add those lines to the dataverse env:
Start up dataverse.
Create a new dataset with a datafile.
Then change the pidProvider of the dataset.
Then reconcile the pid via API call:
Verify that the PID changes in the UI and a notification apprears.
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
No.
Is there a release notes update needed for this change?:
Included.
Preview docs at https://dataverse-guide--10567.org.readthedocs.build/en/10567/api/native-api.html#reconcile-the-pid-of-a-dataset-if-multiple-pid-providers-are-enabled