@qqmyers qqmyers commented Sep 26, 2025

What this PR does / why we need it:
This PR adds a Big Data Admin guide that tries to gather information from other parts of the guides into a more coherent guide for managing a Dataverse instance being used for larger data files, more files per dataset, and/or more datasets.

A work in progress, but hopefully useful.

Preview at https://dataverse-guide--11850.org.readthedocs.build/en/11850/admin/big-data-administration.html

Which issue(s) this PR closes:

  • Closes #

Special notes for your reviewer:

Suggestions on how to test this:

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

@qqmyers qqmyers added the TDL (of interest to the Texas Digital Library) label Sep 26, 2025
@qqmyers qqmyers marked this pull request as ready for review November 12, 2025 15:11
@qqmyers qqmyers moved this to Ready for Triage in IQSS Dataverse Project Nov 12, 2025
@qqmyers qqmyers added this to the 6.9 milestone Nov 12, 2025
@qqmyers qqmyers added the Size: 3 (A percentage of a sprint. 2.1 hours.) label Nov 12, 2025
@scolapasta scolapasta moved this from Ready for Triage to Ready for Review ⏩ in IQSS Dataverse Project Nov 12, 2025
@pdurbin pdurbin moved this from Ready for Review ⏩ to In Review 🔎 in IQSS Dataverse Project Nov 13, 2025
@pdurbin pdurbin self-assigned this Nov 13, 2025

@pdurbin pdurbin left a comment

Wow, this guide is incredible. 🎉

I know I'm leaving many, many annoying nitpicky comments, but I hope they are outweighed by some useful ones! 71 comments total! Sorry! 😅

Docs like this set Dataverse apart from other platforms. How are people supposed to use your software if you don't tell them how??! Great job!

- DatasetChecksumValidationSizeLimit - by default, Dataverse checks fixity (assuring the file contents match the recorded checksum) as part of publication. This setting specifies a maximum aggregate dataset size, above which validation will not be done.
- DataFileChecksumValidationSizeLimit - by default, Dataverse checks fixity (assuring the file contents match the recorded checksum) as part of publication. This setting specifies a maximum file size, above which validation will not be done.
- FilePIDsEnabled - false is recommended when datasets have many files. Related settings allow file PIDS to be enabled/disabled per collection and per file
- CustomZipDownloadServiceUrl - allows use of a separate process/machine to handle zipping up multi-file downloads. Requires installation of the separate Zip Download app.

Suggested change
- CustomZipDownloadServiceUrl - allows use of a separate process/machine to handle zipping up multi-file downloads. Requires installation of the separate Zip Download app.
- CustomZipDownloadServiceUrl - allows use of a separate process/machine to handle zipping up multi-file downloads. Requires installation of the separate Zip Download app

for consistency

- DataFileChecksumValidationSizeLimit - by default, Dataverse checks fixity (assuring the file contents match the recorded checksum) as part of publication. This setting specifies a maximum file size, above which validation will not be done.
- FilePIDsEnabled - false is recommended when datasets have many files. Related settings allow file PIDS to be enabled/disabled per collection and per file
- CustomZipDownloadServiceUrl - allows use of a separate process/machine to handle zipping up multi-file downloads. Requires installation of the separate Zip Download app.
- WebloaderUrl - enables use of an installed DVWebloader (by specifying it's web location) which is more efficient for uploading many files

Suggested change
- WebloaderUrl - enables use of an installed DVWebloader (by specifying it's web location) which is more efficient for uploading many files
- WebloaderUrl - enables use of an installed DVWebloader (by specifying its web location) which is more efficient for uploading many files

- DisableSolrFacets - disables facets, which are costly to generate, in search results (including the main collection page)
- DisableSolrFacetsForGuestUsers - only disable facets for guests
- DisableSolrFacetsWithoutJsession - disables facets for users who have disabled cookies (e.g. for bots)
- DisableUncheckedTypesFacet -only disables the facet showing the number of collections, datasets, files matching the query (this facet is potentially less useful than others)

Suggested change
- DisableUncheckedTypesFacet -only disables the facet showing the number of collections, datasets, files matching the query (this facet is potentially less useful than others)
- DisableUncheckedTypesFacet - only disables the facet showing the number of collections, datasets, files matching the query (this facet is potentially less useful than others)

- Investigate performance tuning options for Payara, Solr, and Postgres
- Coordinate with others in the community - there is a lot of aggregate knowledge
- Consider contributing to software design changes - Dataverse scaling has improved dramatically over the past several years, but more can be done
- Watch for the new single page application (SPA) front-end for Dataverse. It includes features such as infinite scrolling through files with much faster initial page load times

Just a thought, we could have a section at the end called "Resources" or "Getting Involved" that links to https://www.gdcc.io/working-groups/large-data-support.html and #large-data. It could also invite people to contribute to this guide.

Member Author

Seems like something to keep centralized?

@qqmyers qqmyers removed their assignment Nov 13, 2025

qqmyers commented Nov 13, 2025

Thanks for the detailed read. Hopefully I addressed everything in some way or other.

@cmbz cmbz added the FY26 Sprint 10 (2025-11-05 - 2025-11-19) label Nov 20, 2025
@cmbz cmbz added the FY26 Sprint 11 (2025-11-20 - 2025-12-03) label Nov 22, 2025

@pdurbin pdurbin left a comment

Looks good! Merging! Great work, @qqmyers!

@github-project-automation github-project-automation bot moved this from In Review 🔎 to Ready for QA ⏩ in IQSS Dataverse Project Nov 25, 2025
@pdurbin pdurbin merged commit 79f5cf5 into IQSS:develop Nov 25, 2025
7 of 8 checks passed
@github-project-automation github-project-automation bot moved this from Ready for QA ⏩ to Merged 🚀 in IQSS Dataverse Project Nov 25, 2025
@pdurbin pdurbin removed their assignment Nov 25, 2025
@scolapasta scolapasta moved this from Merged 🚀 to Done 🧹 in IQSS Dataverse Project Dec 1, 2025