Skip to content

Conversation

@qqmyers
Copy link
Member

@qqmyers qqmyers commented Feb 12, 2025

What this PR does / why we need it: This PR fixes a bug in the ExternalIdentifier class that caused ORCIDs in URL form to not be recognized, breaking part of the external vocab functionality w.r.t. ORCID.

Which issue(s) this PR closes:

Special notes for your reviewer: As the fix is generic, I reverted changes in the ROR recognition to avoid having two separate versions. Also added tests for formatting or ORCID and ROR.
Also - I noted that it looks like some tests in DatasetFieldValueValidatorTest duplicate those in ExternalIdentifier. If a reviewer confirms that, I'd be happy to remove them (and the isValidAuthorIdentifier() method that is only used in those tests).

Suggestions on how to test this: Verify that publishing a dataset with an ORCID in numeric form or URL form as an author ID, the ORCID appears in the DataCite XML. Verify that either form of ORCID/ROR show up as valid links in the page display (without external vocab scripts turned on).

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?: included.

Additional documentation:

@qqmyers qqmyers added the Size: 3 A percentage of a sprint. 2.1 hours. label Feb 12, 2025

public enum ExternalIdentifier {
ORCID("ORCID", "https://orcid.org/%s", "^\\d{4}-\\d{4}-\\d{4}-(\\d{4}|\\d{3}X)$"),
ORCID("ORCID", "https://orcid.org/%s", "^(https:\\/\\/orcid.org\\/)?\\d{4}-\\d{4}-\\d{4}-(\\d{4}|\\d{3}X)$"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of b9d0146 we say the following:

When entering author identifiers, select the type from the dropdown (e.g. "ORCID") and under "Identifier" enter just the unique identifier (e.g. "0000-0002-1825-0097") rather than the full URL (e.g. "https://orcid.org/0000-0002-1825-0097").

Should this be re-written?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code will support both forms, but if you aren't using the external vocab scripts, the short form may still be preferable? That's why I didn't change it, but happy to change it if we want to allow a choice for manual entry. (AFAIK, there is no validation for input so people can still enter whatever they want.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey all. @pdurbin pinged me about this.

The short form, or the non-URL form, is preferable for ORCIDs that users type in or paste in the Author Identifier field, because that's what makes the link clickable, right? So the same will hold true for RORs that users type in or paste in the Author Identifier field, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The earlier fix changed ROR so the short form would be a link. This PR makes it so that either form of ORCID or ROR should result in a valid link.

Copy link
Contributor

@jggautier jggautier Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay. Also, when you wrote "happy to change it if we want to allow a choice for manual entry", I thought we can already allow a choice for manual entry of identifiers. That's how it's set up on Demo Dataverse and planned for Harvard Dataverse.

Or did you mean when installations configure an external vocab script to allow for manual entry, that is that users can enter the identifier in the Identifier field that appears when they enter their own name instead of selecting one from the suggestions?

About the user guide text being added in b9d0146, it's written to apply to all identifiers, so I'm testing on Demo Dataverse to review how the other identifiers that users enter in the Author Identifier field are displayed, especially when the identifier has a URL form.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created that GitHub issue in the ext-vocab repo. It's at gdcc/dataverse-external-vocab-support#33. There are some details I couldn't be sure of because:

  • I'm testing on Demo Dataverse and the change won't happen on Demo Dataverse's Author fields until v6.6 is released and applied to Demo Dataverse, like how RORs are displayed on the dataset page
  • the Author fields on Demo Dataverse are behaving in ways I didn't expect, like how the dataset page displays an ORCID logo even when the user enters the ORCID by using the Author Identifier Type and Author Identifier fields
  • I'm not sure about what changes were made so that the dataset page displays IDs as links when users enter IDs in Author fields where the cvoc script isn't applied. This is what I'm trying to clarify in my last comment.

I can help update that GitHub issue as I understand more.

Copy link
Member Author

@qqmyers qqmyers Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the identifier types have both a short and URI form (FWIW DAI was not handled before - I've added it). Entering either as the author identifier will cause it to display as a link in the metadata display (except DAI which is not a URL). The DataCite XML generator adds the identifier, in URI form, regardless of the identifier type, so yes they are all included regardless of the short or URI input form (including DAI).

Copy link
Contributor

@jggautier jggautier Feb 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay thanks @qqmyers!

So @pdurbin, so should we rewrite what's written in the User Guides, at b9d0146, about how to enter identifiers in the Author Identifier field?

Or maybe that addition can be removed, since identifiers that Dataverse can display as link will be displayed as links when the cvoc script is used and when it's not used, and when users enter the short forms of the IDs and when they enter the URL forms. If all of the ways that we expect users to enter identifiers in that field will be supported - displayed as links and included in the expected metadata exports - then we don't need to guidance about how to enter identifiers in that field, right?

This PR is "Ready for review" but should it be moved back to "In progress" until we agree on what to do about that addition to the User Guides?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something should be written somewhere about this confusing situation.

What if you run Dataverse for several years with out the cvoc scripts? Your database has lots of "just identifiers" values in it?

Then, you turn on cvoc scripts and run for another couple years. Now your database has a mix of (older) "just identifiers" and newer "full URL" values?

Do I have this right? Is this what we want? Shouldn't the database be consistent?

Apologies if I'm misunderstanding.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For installation managers and curators who expect at least some users to have to enter or paste identifiers into the Author Identifier field, I think they would be happy to know about this change. I imagine that they'll no longer have to tell their users to enter the identifiers in a particular way, since there will be better support for both ways that we expect users to enter identifiers when they have to enter them.

So maybe a release note could help those installation managers and curators know this? If we agree, I'd be happy to write one or help write one.

What if you run Dataverse for several years with out the cvoc scripts? Your database has lots of "just identifiers" values in it?

Then, you turn on cvoc scripts and run for another couple years. Now your database has a mix of (older) "just identifiers" and newer "full URL" values?

Yeah, I agree. And this has been the situation with Dataverse installations for a while, even those whose managers haven't used the cvoc scripts. Their databases have combinations of "just identifiers" (or the "short forms"), "full URL" values, and the wrong Author Identifier Type chosen or no Author Identifier Type chosen.

Do I have this right? Is this what we want? Shouldn't the database be consistent?

From my understanding, the inconsistency in the database has complicated our ability to ensure that when most of the identifiers are displayed on the dataset page, the identifiers are links that users can click on to learn more about the person or organization. And its complicated our ability to include identifiers in metadata exports in order to improve discovery and re-use of those identifiers in Dataverse repositories and in other systems, like DataCite.

From @qqmyers's work in this PR, I've gotten the impression that this complication, having a mix of "just identifiers" and "full URL" values in the database, has been manageable, at least for the Author Identifier field, even if the inconsistency isn't ideal.

In other cases, like licenses and terms of use metadata, I think we've decided that inconsistencies in the database aren't manageable, so we've added field validation, "multiple license support", and included ways for repositories to make existing values in their databases more consistent, like the "Flyway scripts".

I think of field validation and ways to change database values in bulk as ways to address inconsistencies in other parts of the database.

The cvoc scripts, and the ability to associate ORCIDs with more user accounts and prefill Author Identifier fields with those ORCIDs, also address this by making it less likely that users will need to enter anything in the Author Identifier field, and giving installation managers more control over what gets in the database. Those things are new, and won't work all of the time, like in cases where users need and are allowed to use other types of identifiers.

Maybe before considering other ways of ensuring that the identifiers are displayed as links and help make deposits more findable within and outside of Dataverse repositories, it might be helpful to learn how well what's been done so far is working.

@qqmyers qqmyers added this to the 6.6 milestone Feb 12, 2025
@qqmyers qqmyers added the GDCC:ORCID Priority for the ORCID Global Participation Fund Grant label Feb 18, 2025
@coveralls
Copy link

coveralls commented Feb 18, 2025

Coverage Status

coverage: 22.729% (+0.002%) from 22.727%
when pulling d6731a6 on QualitativeDataRepository:IQSS/11242-ExtVocabFix
into f5f2c72 on IQSS:develop.

@cmbz cmbz added the FY25 Sprint 17 FY25 Sprint 17 (2025-02-12 - 2025-02-26) label Feb 26, 2025
@sekmiller sekmiller self-assigned this Feb 26, 2025
@sekmiller sekmiller moved this from Ready for Review ⏩ to In Review 🔎 in IQSS Dataverse Project Feb 26, 2025

/* The next 7 tests below appear to duplicate tests in the ExternalIdentifierTest class.
* The ones here use the isValidAuthorIdentifier method which is only used in testing (and probabkly could be static).
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, these tests duplicate those in the ExternalIdentifierTest and can probably be removed. The isValidAuthorIdentifier method is actually used in the DatasetAuthor class (around line 100).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think DatasetAuthor uses isValidIdentifier() without 'Author' . I'll go ahead and remove the tests.

@cmbz cmbz added the FY25 Sprint 18 FY25 Sprint 18 (2025-02-26 - 2025-03-12) label Feb 27, 2025
@jggautier
Copy link
Contributor

@qqmyers I saw that in the release notes you wrote that "the shorter form is still recommended when doing manual entry."

But on Tuesday I wrote that I thought you're changes meant that installation managers and curators "will no longer have to tell their users to enter the identifiers in a particular way, since there will be better support for both ways that we expect users to enter identifiers when they have to enter them."

But I'm not sure why the shorter form is still recommended when doing manual entry. Could you write more about that here?

@qqmyers
Copy link
Member Author

qqmyers commented Feb 27, 2025

Just depends on what you want to see:
image

or
image

I assumed the former is still preferable, but feel free to edit the instructions as you see fit.

@jggautier
Copy link
Contributor

Ah thanks. So when a user adds an Author with ORCID "manually," on the dataset page the ORCID logo is displayed, followed by either the "short form" or the URL form, depending on what the user entered in the Author Identifier field, like your screenshots show.

And when the cvos script it used, only the ORCID logo is displayed, like what I see on Demo Dataverse now:
Screenshot 2025-02-27 at 11 47 35 AM

Is that right?

@jggautier
Copy link
Contributor

Folks from ORCID recommend displaying their IDs as their URL forms, as well as folks from ROR about ROR IDs and folks from DataCite about DOIs.

So I've thought that if we have to display these other identifier strings, and the identifiers have URL forms, we should display their URL forms. And because we're showing what users type into the Author Identifier field, we would prefer that depositors enter the URL forms so that it's the URL forms that are displayed. I've thought and heard from some that it's easier for most depositors to enter the URL forms, too, or at least for folks who aren't curators or don't work in repositories.

Does that make sense? If it does, I'm happy to change the release note to read that the URL form is recommended when doing manual entry.

There are other things we can do of course, but I think they would increase the scope of this issue even more. And we'll learn more during usability testing.

@qqmyers
Copy link
Member Author

qqmyers commented Mar 3, 2025

I don't have a strong opinion on what the recommendation should be. That said:

FWIW: ORCID also says the short form should be used "When human readability is more important than technical correctness. ". One of the criticisms of the short form is that "omitting the URL scheme and domain can lead to confusion elsewhere" but we display the type of the identifier and the logo, and we include the long form in machine-readable formats. All of them appear to be concerned that the short forms, when displayed as text, hide the fact that these are links, but our UI is HTML and shows them as links and "Copy URL" works to get the URL form.

Also FWIW: The ORCID/ROR scripts use the inline form, and, for ORCIDs, it's also true that all of our entries are unauthenticated and should probably be displayed in that format instead.

We also have legacy records that are in the short form (as Phil pointed out), which means changing the recommendation to long form will result in differences between old/new records.

@github-project-automation github-project-automation bot moved this from In Review 🔎 to Ready for QA ⏩ in IQSS Dataverse Project Mar 3, 2025
@sekmiller sekmiller removed their assignment Mar 3, 2025
@jggautier
Copy link
Contributor

Thanks @qqmyers!

FWIW: ORCID also says the short form should be used "When human readability is more important than technical correctness. ".

I think they mean that when the page can't display the full URL, it should display the short form, like 0000-0002-2771-9344, instead of displaying part of the full URL, like https://orcid.org/0000-000.

One of the criticisms of the short form is that "omitting the URL scheme and domain can lead to confusion elsewhere" but we display the type of the identifier and the logo, and we include the long form in machine-readable formats. All of them appear to be concerned that the short forms, when displayed as text, hide the fact that these are links, but our UI is HTML and shows them as links and "Copy URL" works to get the URL form.

I agree. I'm glad we're getting more specific about the why.

I've also assumed that displaying the full URL makes it easy for others to copy and paste it to other places, and that's why a lot of depositors have been pasting full URL forms of identifiers into several identifiers fields. That's been my experience when I've had to copy and paste ORCIDs and DOIs. Although we could also say that because on Dataverse pages the text are links, people can just as easily copy the text links' URLs to paste them in other places.

Also FWIW: The ORCID/ROR scripts use the inline form, and, for ORCIDs, it's also true that all of our entries are unauthenticated and should probably be displayed in that format instead.

This is interesting! But I think outside of the scope of this PR, right?

We also have legacy records that are in the short form (as Phil pointed out), which means changing the recommendation to long form will result in differences between old/new records.

Thanks yeah. There are already differences in the records in Dataverse repositories, which I wrote is unfortunate but manageable.

I think that the only reason why we've recommended the short form was because Dataverse depended on that form to display IDs and include them in metadata exports. That's changed in this PR, so I think we should recommend the URL form to support what many people are already doing and because the dataset page has the room to display the full URL forms.

@jggautier
Copy link
Contributor

@ofahimIQSS, in my last comment I'm suggesting that we change what we write in this PR's release notes. I don't think we've agreed so I'm waiting to hear back from @qqmyers and possibly @sekmiller and @pdurbin before this is merged.

@ofahimIQSS ofahimIQSS self-assigned this Mar 3, 2025
@ofahimIQSS ofahimIQSS moved this from Ready for QA ⏩ to QA ✅ in IQSS Dataverse Project Mar 3, 2025
@ofahimIQSS
Copy link
Contributor

I tried testing this PR this morning in internal but am getting internal server error on creation/accessing datasets/collections.
image

@ofahimIQSS
Copy link
Contributor

Had an observation with this PR. ROR url is not displayed on datacite after it is added on a second publish.

Steps to reproduce:

  1. Create a new dataset with all required fields, include just the ORCID.
  2. Publish the Dataset
  3. Check the datacite XML file. The ORCID is displayed
  4. Now edit the metadata, add a ROR ID
  5. Publish, check datacity XML File
    The ROR ID is missing from the XML.

image

Test Data: https://dataverse-internal.iq.harvard.edu/api/datasets/export?exporter=Datacite&persistentId=doi%3A10.70122/FK2/G1HUDB

@qqmyers
Copy link
Member Author

qqmyers commented Mar 5, 2025

Step 5 hasn't been done/didn't work - the dataset is still draft.

@ofahimIQSS
Copy link
Contributor

ofahimIQSS commented Mar 5, 2025

Step 5 hasn't been done/didn't work - the dataset is still draft.

Looks like I just forgot to hit refresh before checking the Datacite XML. PR looks good.
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

FY25 Sprint 17 FY25 Sprint 17 (2025-02-12 - 2025-02-26) FY25 Sprint 18 FY25 Sprint 18 (2025-02-26 - 2025-03-12) GDCC:ORCID Priority for the ORCID Global Participation Fund Grant Size: 3 A percentage of a sprint. 2.1 hours.

Projects

Status: Done 🧹

Development

Successfully merging this pull request may close these issues.

Dataverse External Author Vocabulary error when sending ORCID values to DataCite

7 participants