Skip to content

Conversation

@johannes-darms
Copy link
Contributor

@johannes-darms johannes-darms commented May 17, 2024

What this PR does / why we need it:

Please refer to this issue:

Which issue(s) this PR closes:

Special notes for your reviewer/ Suggestions on how to test this:**:
Configure two PID Provider via docker-compose-file
Alter this line:

  -Ddataverse.pid.providers=fake,perma

add those lines to the dataverse env:

      
        -Ddataverse.pid.perma.type=perma
        -Ddataverse.pid.perma.label=PermaProvider
        -Ddataverse.pid.perma.permalink.base-url=https://example.org/
        -Ddataverse.pid.perma.permalink.separator=-
        -Ddataverse.pid.perma.authority=identifier

Start up dataverse.
Create a new dataset with a datafile.
Then change the pidProvider of the dataset.

curl -X PUT --location "http://localhost:8080/api/v1/datasets/{id}/pidReconcile/perma" \
    -H "X-Dataverse-key: $KEY" -d 'perma'

Then reconcile the pid via API call:

curl -X PUT --location "http://localhost:8080/api/v1/datasets/{id}/pidReconcile" \
    -H "X-Dataverse-key: $KEY"

Verify that the PID changes in the UI and a notification apprears.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:
No.
Is there a release notes update needed for this change?:

Included.

Preview docs at https://dataverse-guide--10567.org.readthedocs.build/en/10567/api/native-api.html#reconcile-the-pid-of-a-dataset-if-multiple-pid-providers-are-enabled

@coveralls
Copy link

coveralls commented May 17, 2024

Coverage Status

coverage: 22.831% (+0.08%) from 22.751%
when pulling ee4fdee on johannes-darms:feat/reconcilepid
into 6be4f20 on IQSS:develop.

@johannes-darms johannes-darms changed the title WIP PID reconcile command PID reconcile command May 21, 2024
@qqmyers
Copy link
Member

qqmyers commented May 21, 2024

@johannes-darms - I made several related comments about avoiding mixing the effective Pid generator and PidProvider for the existing PID concepts. If that's not clear, let's have a call or slack, etc.

@johannes-darms
Copy link
Contributor Author

I made several related comments about avoiding mixing the effective Pid generator and PidProvider for the existing PID concepts. If that's not clear, let's have a call or slack, etc.

Thanks for the explanation I wasn't aware of the distinction. I'll update the code accordingly.

@johannes-darms
Copy link
Contributor Author

@qqmyers : I've adapted the PR according to your feedback. Could you have another look?

@pdurbin
Copy link
Member

pdurbin commented Feb 7, 2025

For the record, this run yesterday was successful: https://jenkins.dataverse.org/job/IQSS-Dataverse-Develop-PR/job/PR-10567/12/testReport/ . All API tests passed in about 17 minutes.

@pdurbin pdurbin added the Size: 20 A percentage of a sprint. 14 hours. label Feb 7, 2025
@pdurbin
Copy link
Member

pdurbin commented Feb 7, 2025

Ok, yes, in the latest run at https://jenkins.dataverse.org/job/IQSS-Dataverse-Develop-PR/job/PR-10567/13/console I'm seeing this:

[ERROR] Errors: 
[ERROR]   XmlMetadataTemplateTest.testDataCiteXMLCreationAllFields:248 » NullPointer Cannot invoke "String.equals(Object)" because "name" is null
[ERROR]   XmlMetadataTemplateTest.testDataCiteXMLCreationAllFieldsMultipleGeoLocations:313 » NullPointer Cannot invoke "String.equals(Object)" because "name" is null

So yeah, same territory as this PR:

Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some quick feedback. For context, I'm responding to @johannes-darms saying that this PR is high priority (2 of 3) in the discussion on Zulip about which PRs we should consider for Dataverse 6.6: https://dataverse.zulipchat.com/#narrow/channel/375707-community/topic/Release.206.2E6.20Timeline/near/498207634

I gave this PR a size of 20 rather than 10 because I think it still needs some thinking and design work buy in from the core team. Maybe I'll move it down once I understand the PR better. I'm leaving some questions below.

Overall it seems like a nice feature! I'm curious... what's the real world use case? I assume it's something like "We give all of our datasets Permalinks by default but along the way we decide that some datasets deserve an actual DOI. We want to move these datasets to a collection where DOIs are configured. Once moved, we give them a DOI (and keep the old Permalink around as an alternative PID)." Is that close?

@@ -0,0 +1,4 @@
Added a new API for persistent identifier reconciliation. An unpublished dataset can be updated with a new
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love "PID reconcile" for the name of this feature. Maybe "PID swap"? "PID switch?" "PID update" or PID change"? ("Update" and "change" are the words used in the bundle.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also happy with any another wording. However, this API call does not take any argument and just synchronises/reconcile the current configuration.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's stick with reconcile. "Reassign PID" might also work. 🤷

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pull request says it closes the following issue:

Feature Request/Idea: Move draft Datasets should update their PID configuration #10501

However, if one moves a dataset, does the PID get updated? From a quick look at the code, it seems like two steps are needed:

  • move the dataset
  • run the PID reconcile command

Am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. Initially, the idea was to combine this action with the move command. However, after discussions with @qqmyers (see issue comments), I decided to implement it as a separate command. This command only performs an action if the configured pidGenerator differs from the existing PID of the dataset or datafile.

This means that if a superuser changes the PID provider using the existing API call, they can run this command to ensure the new pidGenerator is applied—effectively reconciling the PID generator with the actual PID configuration.

The workflow is as follows:

  1. Change the PID Provider of a dataset (https://guides.dataverse.org/en/latest/api/native-api.html#configure-the-pid-generator-a-dataset-uses-if-enabled)
  2. Reconcile the PID (basically this PR)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The additional docs you added help a lot. Thanks.

a draft dataset's PID by creating a new PID supported by the PID Provider and assigning the original PID as an
alternativePersistentIdentifier for the dataset. The API is restricted to datasets that have not already been published.
(It does not make any changes to any PID Provider.) Note that this change does not affect the storage repository where the
old identifier is still used. (An administrator could move the files manually and set the storagelocationdesignator to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having to move the files manually seems like a pretty big deal to me. Perhaps this could be emphasized more. 🤷

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. What about something like:
"Warning: This change does not affect the storage repository, where the old PID is still in use. A technical administrator could manually move the files offline and remove the old identifier from the database (by setting storagelocationdesignator to false for the old identifier in the alternativepersistentidentifier table). However, this step is not required for Dataverse to function correctly."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better, let me try to iterate on what you wrote:

"Warning: This change does not affect the storage repository, where the old PID is still used in the name of where files are stored for the dataset. If you want to remove the PID from the name used in storage, you could manually move the files offline and remove the old identifier from the database (by setting storagelocationdesignator to false for the old identifier in the alternativepersistentidentifier table). However, this step is not required for Dataverse to function correctly."

I dunno, maybe your version is better. 😅 I do like having more explanation of some sort.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've integrated your snipped and tried to improve the explanation above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The extra docs are great. Thanks.

@johannes-darms
Copy link
Contributor Author

johannes-darms commented Feb 10, 2025

Overall it seems like a nice feature! I'm curious... what's the real world use case? I assume it's something like "We give all of our datasets Permalinks by default but along the way we decide that some datasets deserve an actual DOI. We want to move these datasets to a collection where DOIs are configured. Once moved, we give them a DOI (and keep the old Permalink around as an alternative PID)." Is that close?

We offer different PID systems for our users, but asking them to choose one upfront often leads to confusion and may result in the data capture or publication process being abandoned. To address this, we have moved the selection to the final step before publication. (See screenshot below) At this stage, users can choose the appropriate PID system, after which the system moves the dataset into the corresponding collection, updates the PID, and initiates the publication request.

Initially, I considered this as a reconciliation operation that should be part of the move process to ensure dataset consistency with the collection configuration. However, after discussing it with Jim, I revised the approach. Now, the PR focuses solely on reconciliation without moving datasets.

Another use case would be test or demo instances that start with a fake DOI or permalink and later need to transition to a proper PID system. Currently, this is not possible because certain safeguards prevent the reconciliation of already published records. However, this functionality could be easily implemented if needed.

409592737-6bdfba2d-ec88-44e5-b2d0-3e47734955b7

# Conflicts:
#	doc/sphinx-guides/source/api/native-api.rst
@pdurbin
Copy link
Member

pdurbin commented Mar 20, 2025

@johannes-darms phew! 6.6 is out! It looks like @vera merged the latest from develop (thanks!) but some of the questions from my last review are unanswered. Can you please take a look? Thanks!

@johannes-darms johannes-darms changed the title PID reconcile command change PID command Apr 1, 2025
@pdurbin pdurbin added this to the 6.7 milestone Apr 2, 2025
@pdurbin pdurbin moved this to In Review 🔎 in IQSS Dataverse Project Apr 2, 2025
@pdurbin pdurbin self-assigned this Apr 2, 2025
@cmbz cmbz added the FY25 Sprint 20 FY25 Sprint 20 (2025-03-26 - 2025-04-09) label Apr 3, 2025
@pdurbin pdurbin changed the title change PID command Reconcile the PID of a Dataset (If Multiple PID Providers Are Enabled) Apr 8, 2025
Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test are passing. I haven't run the code myself but it makes sense and I'm moving this to "ready for QA".

I found one typo in the doc and made a tweak to the release notes.

Also, the issue was called "[-]Feature Request/Idea: Move draft Datasets should update their PID configuration" but I renamed it and the PR to "Reconcile the PID of a Dataset (If Multiple PID Providers Are Enabled)" because the PR is closing the issue and the PR does not do what the original issue title was asking for. You have to move the dataset and then run the reconcile command.

@@ -0,0 +1,4 @@
Added a new API for persistent identifier reconciliation. An unpublished dataset can be updated with a new
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's stick with reconcile. "Reassign PID" might also work. 🤷

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The additional docs you added help a lot. Thanks.

a draft dataset's PID by creating a new PID supported by the PID Provider and assigning the original PID as an
alternativePersistentIdentifier for the dataset. The API is restricted to datasets that have not already been published.
(It does not make any changes to any PID Provider.) Note that this change does not affect the storage repository where the
old identifier is still used. (An administrator could move the files manually and set the storagelocationdesignator to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The extra docs are great. Thanks.

@github-project-automation github-project-automation bot moved this from In Review 🔎 to Ready for QA ⏩ in IQSS Dataverse Project Apr 8, 2025
@pdurbin pdurbin removed their assignment Apr 8, 2025
@cmbz cmbz added the FY25 Sprint 21 FY25 Sprint 21 (2025-04-09 - 2025-04-23) label Apr 9, 2025
@ofahimIQSS ofahimIQSS self-assigned this Apr 14, 2025
@ofahimIQSS ofahimIQSS moved this from Ready for QA ⏩ to QA ✅ in IQSS Dataverse Project Apr 14, 2025
@ofahimIQSS
Copy link
Contributor

Looks good, merging.

@ofahimIQSS ofahimIQSS merged commit 8599993 into IQSS:develop Apr 15, 2025
19 checks passed
@github-project-automation github-project-automation bot moved this from QA ✅ to Merged 🚀 in IQSS Dataverse Project Apr 15, 2025
@ofahimIQSS ofahimIQSS removed their assignment Apr 15, 2025
@scolapasta scolapasta moved this from Merged 🚀 to Done 🧹 in IQSS Dataverse Project Apr 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

FY25 Sprint 20 FY25 Sprint 20 (2025-03-26 - 2025-04-09) FY25 Sprint 21 FY25 Sprint 21 (2025-04-09 - 2025-04-23) Size: 20 A percentage of a sprint. 14 hours. Type: Feature a feature request

Projects

Status: Done 🧹

Development

Successfully merging this pull request may close these issues.

Reconcile the PID of a Dataset (If Multiple PID Providers Are Enabled)

7 participants