TDL: Provide guidance for site admins w.r.t. big data #11850

Merged

pdurbin merged 22 commits into IQSS:develop from GlobalDataverseCommunityConsortium:TDL-BigDataDocs

Nov 25, 2025

Member

qqmyers commented Sep 26, 2025 (edited by pdurbin)

What this PR does / why we need it:
This PR adds a Big Data Admin guide that gathers information scattered across other parts of the guides into a single, more coherent guide for managing a Dataverse instance that hosts larger data files, more files per dataset, and/or more datasets.

A work in progress, but hopefully useful.

Preview at https://dataverse-guide--11850.org.readthedocs.build/en/11850/admin/big-data-administration.html

Which issue(s) this PR closes:

Closes #

Special notes for your reviewer:

Suggestions on how to test this:

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

qqmyers added 5 commits

September 8, 2025 14:37


          initial edits

56e91ff


          more edits about S3

d0c656e


          remote, globus info

46f170b


          more changes

3f63e27


          add a big data admin guide

67b0fcb

qqmyers added the TDL label

qqmyers added 5 commits

September 26, 2025 15:04


          restore big data dev guide

67977c2


          fix lists /errs in rst

158ea2a


          fix strategy table

75de0ea


          typos, updates, references

fbd4ed2


          Merge remote-tracking branch 'IQSS/develop' into TDL-BigDataDocs

c9134c6

qqmyers added this to IQSS Dataverse Project

qqmyers marked this pull request as ready for review

November 12, 2025 15:11

qqmyers moved this to Ready for Triage in IQSS Dataverse Project

qqmyers added this to the 6.9 milestone

qqmyers added the Size: 3 label

scolapasta moved this from Ready for Triage to Ready for Review ⏩ in IQSS Dataverse Project

pdurbin moved this from Ready for Review ⏩ to In Review 🔎 in IQSS Dataverse Project

pdurbin self-assigned this

pdurbin reviewed

Member

pdurbin left a comment

Wow, this guide is incredible. 🎉

I know I'm leaving many, many annoying nitpicky comments, but I hope they are outweighed by some useful ones! 71 comments total! Sorry! 😅

Docs like this set Dataverse apart from other platforms. How are people supposed to use your software if you don't tell them how??! Great job!

doc/sphinx-guides/source/admin/big-data-administration.rst Outdated (five resolved review threads, collapsed)

doc/sphinx-guides/source/admin/big-data-administration.rst Outdated

    
              - DatasetChecksumValidationSizeLimit - by default, Dataverse checks fixity (assuring the file contents match the recorded checksum) as part of publication. This setting specifies a maximum aggregate dataset size, above which validation will not be done.

              - DataFileChecksumValidationSizeLimit - by default, Dataverse checks fixity (assuring the file contents match the recorded checksum) as part of publication. This setting specifies a maximum file size, above which validation will not be done.

              - FilePIDsEnabled - false is recommended when datasets have many files. Related settings allow file PIDS to be enabled/disabled per collection and per file

              - CustomZipDownloadServiceUrl - allows use of a separate process/machine to handle zipping up multi-file downloads. Requires installation of the separate Zip Download app.

Member

pdurbin Nov 13, 2025

Suggested change

      
            - CustomZipDownloadServiceUrl - allows use of a separate process/machine to handle zipping up multi-file downloads. Requires installation of the separate Zip Download app.
          
            - CustomZipDownloadServiceUrl - allows use of a separate process/machine to handle zipping up multi-file downloads. Requires installation of the separate Zip Download app

for consistency

doc/sphinx-guides/source/admin/big-data-administration.rst Outdated

    
              - DataFileChecksumValidationSizeLimit - by default, Dataverse checks fixity (assuring the file contents match the recorded checksum) as part of publication. This setting specifies a maximum file size, above which validation will not be done.

              - FilePIDsEnabled - false is recommended when datasets have many files. Related settings allow file PIDS to be enabled/disabled per collection and per file

              - CustomZipDownloadServiceUrl - allows use of a separate process/machine to handle zipping up multi-file downloads. Requires installation of the separate Zip Download app.

              - WebloaderUrl - enables use of an installed DVWebloader (by specifying it's web location) which is more efficient for uploading many files

Member

pdurbin Nov 13, 2025

Suggested change

      
            - WebloaderUrl - enables use of an installed DVWebloader (by specifying it's web location) which is more efficient for uploading many files 
          
            - WebloaderUrl - enables use of an installed DVWebloader (by specifying its web location) which is more efficient for uploading many files
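The settings discussed in these threads look like Dataverse database settings, which are normally changed with a PUT to the admin settings API. As a minimal sketch, assuming that holds for these names and that the admin API is reachable at `http://localhost:8080` (both assumptions, not stated in this PR), the requests could be prepared as below. The helper name `build_setting_request` and the example values are hypothetical and purely illustrative. Nothing is sent.

```python
from urllib.request import Request

# Assumption: the admin API is served from localhost on port 8080.
BASE_URL = "http://localhost:8080/api/admin/settings"


def build_setting_request(name: str, value: str) -> Request:
    """Build (but do not send) a PUT request that sets one database setting."""
    return Request(
        url=f"{BASE_URL}/:{name}",
        data=value.encode("utf-8"),
        method="PUT",
    )


# Illustrative values only; choose limits that fit your own installation.
EXAMPLE_SETTINGS = {
    "FilePIDsEnabled": "false",
    "DataFileChecksumValidationSizeLimit": "2000000000",
}

requests_to_send = [
    build_setting_request(name, value) for name, value in EXAMPLE_SETTINGS.items()
]

# Print each prepared request so it can be reviewed before anything is sent.
for req in requests_to_send:
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

To apply a setting, each prepared request could be passed to `urllib.request.urlopen` on a machine that is allowed to reach the admin API.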

doc/sphinx-guides/source/admin/big-data-administration.rst Outdated

    
              - DisableSolrFacets - disables facets, which are costly to generate, in search results (including the main collection page)

              - DisableSolrFacetsForGuestUsers - only disable facets for guests

              - DisableSolrFacetsWithoutJsession - disables facets for users who have disabled cookies (e.g. for bots)

              - DisableUncheckedTypesFacet -only disables the facet showing the number of collections, datasets, files matching the query (this facet is potentially less useful than others)

Member

pdurbin Nov 13, 2025

Suggested change

      
            - DisableUncheckedTypesFacet -only disables the facet showing the number of collections, datasets, files matching the query (this facet is potentially less useful than others)
          
            - DisableUncheckedTypesFacet - only disables the facet showing the number of collections, datasets, files matching the query (this facet is potentially less useful than others)

doc/sphinx-guides/source/admin/big-data-administration.rst

    
              - Investigate performance tuning options for Payara, Solr, and Postgres

              - Coordinate with others in the community - there is a lot of aggregate knowledge

              - Consider contributing to software design changes - Dataverse scaling has improved dramatically over the past several years, but more can be done

              - Watch for the new single page application (SPA) front-end for Dataverse. It includes features such as infinite scrolling through files with much faster initial page load times

Member

pdurbin Nov 13, 2025

Just a thought, we could have a section at the end called "Resources" or "Getting Involved" that links to https://www.gdcc.io/working-groups/large-data-support.html and #large-data. It could also invite people to contribute to this guide.

Member Author

qqmyers Nov 13, 2025

Seems like something to keep centralized?

doc/sphinx-guides/source/admin/index.rst (resolved review thread)

pdurbin assigned qqmyers

qqmyers and others added 5 commits

November 13, 2025 15:31


          Merge remote-tracking branch 'IQSS/develop' into TDL-BigDataDocs

9ba3977


          Apply suggestions from code review

2c5b5b0

Co-authored-by: Philip Durbin


          Merge branch 'TDL-BigDataDocs' of https://github.com/GlobalDataverseCommunityConsortium/dataverse.git into TDL-BigDataDocs

cbb64ce


          updates per review

1e719cb


          missed ,

db72308

qqmyers removed their assignment

Member Author

qqmyers commented Nov 13, 2025

Thanks for the detailed read. Hopefully I addressed everything in some way or other.


          add /index to doc link

758f13f

cmbz added the FY26 Sprint 10 label

pdurbin mentioned this pull request

docs: update link to dataverse-globus externaltool #11993

Closed


          update link

e47a947

cmbz added the FY26 Sprint 11 label

pdurbin added 5 commits

November 25, 2025 12:33


          add cross links

1c08eeb


          Merge branch 'develop' into TDL-BigDataDocs

c8488d0


          missed a feature flag


          typo

86fbca9


          remove dataverse.exports.schema-dot-org.max-files-for-download-entries

252c9b3

Not supported yet. It's in this fork:
https://github.com/QualitativeDataRepository/dataverse/blob/1493abdad47eee23b208616bfa8e040cccd72b3a/src/main/java/edu/harvard/iq/dataverse/DatasetVersion.java#L2132

pdurbin approved these changes

Member

pdurbin left a comment

Looks good! Merging! Great work, @qqmyers!

github-project-automation bot moved this from In Review 🔎 to Ready for QA ⏩ in IQSS Dataverse Project

pdurbin merged commit 79f5cf5 into IQSS:develop

7 of 8 checks passed

github-project-automation bot moved this from Ready for QA ⏩ to Merged 🚀 in IQSS Dataverse Project

pdurbin removed their assignment

scolapasta moved this from Merged 🚀 to Done 🧹 in IQSS Dataverse Project

Labels

FY26 Sprint 10, FY26 Sprint 11, Size: 3, TDL