-
Notifications
You must be signed in to change notification settings - Fork 531
External Search Refactor #11281
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
External Search Refactor #11281
Conversation
pdurbin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found a couple more hard-coded "solr" instances.
Also, I played around with trying to keep the "oldies" and "oddly" stuff out of the production Docker app. See #11281 (comment)
src/main/java/edu/harvard/iq/dataverse/search/SolrSearchServiceBean.java
Outdated
Show resolved
Hide resolved
Conflicts: doc/sphinx-guides/source/installation/config.rst
pdurbin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests are passing. I resolved merge conflicts in the docs. I didn't test the feature at all but I think this is ready for QA. Thanks for resolving the issue I found where the example search services were being loaded by default in Docker.
|
Testing passed - merging |
* update dvObject featured item response * fix doc * let doc writers know to pip install new requirements IQSS#10618 * improve release note IQSS#10618 * move file count limit from dvObject to dvObjectContainer * make use of "version" string in "making releases" doc IQSS#10618 * various tweaks to docs IQSS#10618 * typo IQSS#10618 * mention alpha tag is going away IQSS#10618 * clarify * IQSS#11542 get all type counts after first page * ci(ct): re-enable conditional workflows in upstream repository Removed temporary testing condition for `gdcc/wip-base-image` and reinstated the `if` condition to restrict workflow runs to the upstream `IQSS` repository. * docs(ct): add README for backports directory and usage guidelines Introduced documentation detailing the structure, purpose, and usage of the `src/backports` directory to support consistent patch management across multiple Dataverse release versions. * Added: getSuperuserOrAssigneeDataverseRolesFor method to RoleAssigneeServiceBean * Refactor: RoleAssigneeServiceBean error message extracted to Bundle string * Added: getAssignedRoles API endpoint * keep initial dataverse on invalid input * release not change according to code * adding datasetFileUploadsAvailable to response * fix flyway script name * fix replace * fix null pointer * fix superuser ignore limits * Added: tweaks and IT for getUserSelectableRoles * Added: IT cases for testGetUserSelectableRoles * Added: docs for roles/userSelectable * Added: docs for IQSS#11434 * add missing params to api doc * add checks for restricted files and unpublished dataverse and dataset * fix unit test * fix it test by piblishing dataverse and dataset * allow for limit of 0 to make dataset effectivly read only * fix unit test * fix(ct): don't push images with buildx we haven't build * fix(ct): backporting skip push tags to old releases * fix(ct): remove typos in patch file headers * add ui integration with file count limit * Update src/main/java/edu/harvard/iq/dataverse/util/json/JsonPrinter.java Co-authored-by: Guillermo Portas <[email protected]> * fix datafile not found error and added tests * add comments to new IT tests * build(deps): bump org.postgresql:postgresql in /modules/dataverse-parent Bumps [org.postgresql:postgresql](https://github.com/pgjdbc/pgjdbc) from 42.7.4 to 42.7.7. - [Release notes](https://github.com/pgjdbc/pgjdbc/releases) - [Changelog](https://github.com/pgjdbc/pgjdbc/blob/master/CHANGELOG.md) - [Commits](pgjdbc/pgjdbc@REL42.7.4...REL42.7.7) --- updated-dependencies: - dependency-name: org.postgresql:postgresql dependency-version: 42.7.7 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * add dvObject display name to json * add new tests * change flyway script from 6.6.02 to 6.6.01 * rename flyway script from 6.6.0.1 to 6.6.0.2 * Migration from forgemia.inra.fr to forge.inrae.fr * fix: add datasetType to JSON serialization of DatasetVersion THe reason to add this here is that some client tools might request specific versions of a dataset, but need the information about the type, too. For example some client might want to display a specific version of a software instead of a dataset differently. It is debatable if this should be included here, as so far there is no consensus yet wether or not the type of a dataset should be allowed to change between versions of it. On the other hand, the global ID should also never change, but it is included anyway for the sake of getting back to the dataset via the API if necessary. This should be revisited as a JSON API design issue at a later point, especially as when creating a dataset version, the type is at the root level and the version itself is at a sublevel, rendering the API inconsistent from the start. * Update apps.rst Added dataverse-metadata-crawler (https://github.com/scholarsportal/dataverse-metadata-crawler) to the API Guide/Apps for the documentation. * fix picky sphinx error * Create 11581-metadata-crawler.md * add crawler to reporting tools * use WSL not cmd.exe for Windows dev, simplify IQSS#10606 * add release note IQSS#10606 * fix links IQSS#11549 * IQSS#11542 update comments * re-work the deleteion of datafile featured items when publishing dataset * fix tests by setting publication date on datafile * fix typo in doc * changes per review * Apply suggestions from code review Co-authored-by: Copilot <[email protected]> * Changes per review * adding limit check on multi file upload * address code review feedback: add OPS, say "3.x" IQSS#11518 * add reference to LC mdb * remove unused imports, info->fine per review * improve release note IQSS#11360 * add note about the executor resource * add note about XML serialization changes * improve Search Services docs IQSS#11281 * Update doc/release-notes/TKLabels.md Co-authored-by: Omer Fahim <[email protected]> * Update doc/release-notes/TKLabels.md Co-authored-by: Philip Durbin <[email protected]> * fix after class name change * Update src/main/java/edu/harvard/iq/dataverse/search/SearchService.java Co-authored-by: Philip Durbin <[email protected]> * PIDs in addition to DOIs * add defaultService to API * static solr string * add settings to release note/docs * more name changes * test: add unit test for datasetType inclusion in DatasetVersion JSON serialization * docs: clarify dataset fields consistency across dataset versions in API guide IQSS#11573 Added a note explaining the constancy of `datasetId`, `datasetPersistentId`, and `datasetType` fields for all dataset versions in the API documentation. The first two have been around since v4.18, see also issue IQSS#6397 * partial update * fix broken page - name change, remove legacy section * exclude ext search classes * IQSS#11546 fix compare separator logic * use constant in more places * Remove the doc/sphinx-guides/HOWTO-SPHINX-INSTALL.txt file, because it is out-of-date and the updated information is part of the docs at doc/sphinx-guides/source/contributor/documentation.md. * Add a feature flag to optionally enable permission check in LC api * disable/don't load any non-solr search services in war * fix indenting IQSS#11281 * add comments to keep excludes in sync IQSS#11281 * tweak release note IQSS#11281 * finish windows instructions * fix list * minor tweaks IQSS#10606 * combine snippets (so far, with issue numbers) into 6.7 release notes IQSS#11367 * fix heading level * start reording, other tweaks * Add WSL preparation steps and package installation instructions * Update WSL installation instructions for package management and SDKMAN usage * Update WSL installation instructions to use powershell code blocks for package installation * Clarify WSL preparation steps and SDKMAN usage in documentation * minor clean-up * reordering and tweaking IQSS#11367 * more reordering and tweaking, add highlights IQSS#11367 * rework through end of API updates IQSS#11367 * edited up to upgrade steps IQSS#11367 * clean up upgrade instructions IQSS#11367 * update guides links to 6.7 IQSS#11367 * tweaks IQSS#11367 * mention MCP IQSS#11367 * add Rclone to list of integrations IQSS#11608 * make limit update a superuser only function * tweak highlights IQSS#11367 * rework video subtitles section IQSS#11367 * reword settings IQSS#11367 * curation labels updates, Solr indexing IQSS#11367 * various tweaks IQSS#11367 * add the rest of backward incompats IQSS#11367 * throw exception if non superuser changes the limit * throw exception if non superuser changes the limit * fix tests * consolidate creating DocumentBuilder * cleanup * missing import * handle XMLReader * update XMLInputStream, final closes * refactor to reduce dupe code * remove unused imports * tests * style issue * typo from earlier update * simple release note * add release note and link to announcement and PR IQSS#11608 * avoid parallel testing with cache * transformer, validator updates * avoid props/attr that don't exist in our versions * move Rclone from "getting data in" to "analysis and computation" IQSS#11608 * clarify Rclone support is read only IQSS#11608 * Update doc/release-notes/6.7-release-notes.md typo fix * Update doc/release-notes/6.7-release-notes.md typo fix * Update doc/release-notes/6.7-release-notes.md fixed typo * Update doc/release-notes/6.7-release-notes.md added a space * add windows dev to 6.7 releases notes IQSS#11367 * add configurable search services to 6.7 release notes IQSS#11367 * add Security Updates section IQSS#11367 * Updated the locally-built xoai jars with the cherry-picked commit securing b0ea5e5 XMLInputFactory there. * rebuilt the xoai jar files with a modified version, to make the project build on Jenkins without breaking all the other PR builds. * ... and the updated pom file to use the new local version * add xml parsing update IQSS#11367 * add rclone support to release notes IQSS#11367 * clarify that changes can be made later IQSS#11367 * simplify IQSS#11367 * one line per paragraph IQSS#11367 * add note about Solr dir for Docker users IQSS#11367 * Update doc/release-notes/6.7-release-notes.md typo fix * typo: fix URL Co-authored-by: Omer Fahim <[email protected]> * bump to 6.7 and update base.image.version IQSS#11371 * Update doc/release-notes/6.7-release-notes.md added word * switch base.image.version back to non-revision IQSS#11651 --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: Steven Winship <[email protected]> Co-authored-by: Philip Durbin <[email protected]> Co-authored-by: Omer Fahim <[email protected]> Co-authored-by: Stephen Kraffmiller <[email protected]> Co-authored-by: Oliver Bertuch <[email protected]> Co-authored-by: GPortas <[email protected]> Co-authored-by: jo-pol <[email protected]> Co-authored-by: qqmyers <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Leonid Andreev <[email protected]> Co-authored-by: Steven Ferey <[email protected]> Co-authored-by: Ken Lui <[email protected]> Co-authored-by: qqmyers <[email protected]> Co-authored-by: Copilot <[email protected]> Co-authored-by: Jan van Mansum <[email protected]> Co-authored-by: Juan Pablo Tosca Villanueva <[email protected]>
* let doc writers know to pip install new requirements IQSS#10618 * improve release note IQSS#10618 * move file count limit from dvObject to dvObjectContainer * make use of "version" string in "making releases" doc IQSS#10618 * various tweaks to docs IQSS#10618 * typo IQSS#10618 * mention alpha tag is going away IQSS#10618 * clarify * IQSS#11542 get all type counts after first page * ci(ct): re-enable conditional workflows in upstream repository Removed temporary testing condition for `gdcc/wip-base-image` and reinstated the `if` condition to restrict workflow runs to the upstream `IQSS` repository. * docs(ct): add README for backports directory and usage guidelines Introduced documentation detailing the structure, purpose, and usage of the `src/backports` directory to support consistent patch management across multiple Dataverse release versions. * Added: getSuperuserOrAssigneeDataverseRolesFor method to RoleAssigneeServiceBean * Refactor: RoleAssigneeServiceBean error message extracted to Bundle string * Added: getAssignedRoles API endpoint * keep initial dataverse on invalid input * release not change according to code * adding datasetFileUploadsAvailable to response * fix flyway script name * fix replace * fix null pointer * fix superuser ignore limits * Added: tweaks and IT for getUserSelectableRoles * Added: IT cases for testGetUserSelectableRoles * Added: docs for roles/userSelectable * Added: docs for IQSS#11434 * add missing params to api doc * add checks for restricted files and unpublished dataverse and dataset * fix unit test * fix it test by piblishing dataverse and dataset * allow for limit of 0 to make dataset effectivly read only * fix unit test * fix(ct): don't push images with buildx we haven't build * fix(ct): backporting skip push tags to old releases * fix(ct): remove typos in patch file headers * add ui integration with file count limit * Update src/main/java/edu/harvard/iq/dataverse/util/json/JsonPrinter.java Co-authored-by: Guillermo Portas <[email protected]> * fix datafile not found error and added tests * add comments to new IT tests * build(deps): bump org.postgresql:postgresql in /modules/dataverse-parent Bumps [org.postgresql:postgresql](https://github.com/pgjdbc/pgjdbc) from 42.7.4 to 42.7.7. - [Release notes](https://github.com/pgjdbc/pgjdbc/releases) - [Changelog](https://github.com/pgjdbc/pgjdbc/blob/master/CHANGELOG.md) - [Commits](pgjdbc/pgjdbc@REL42.7.4...REL42.7.7) --- updated-dependencies: - dependency-name: org.postgresql:postgresql dependency-version: 42.7.7 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * add dvObject display name to json * add new tests * change flyway script from 6.6.02 to 6.6.01 * rename flyway script from 6.6.0.1 to 6.6.0.2 * Migration from forgemia.inra.fr to forge.inrae.fr * fix: add datasetType to JSON serialization of DatasetVersion THe reason to add this here is that some client tools might request specific versions of a dataset, but need the information about the type, too. For example some client might want to display a specific version of a software instead of a dataset differently. It is debatable if this should be included here, as so far there is no consensus yet wether or not the type of a dataset should be allowed to change between versions of it. On the other hand, the global ID should also never change, but it is included anyway for the sake of getting back to the dataset via the API if necessary. This should be revisited as a JSON API design issue at a later point, especially as when creating a dataset version, the type is at the root level and the version itself is at a sublevel, rendering the API inconsistent from the start. * Update apps.rst Added dataverse-metadata-crawler (https://github.com/scholarsportal/dataverse-metadata-crawler) to the API Guide/Apps for the documentation. * fix picky sphinx error * Create 11581-metadata-crawler.md * add crawler to reporting tools * use WSL not cmd.exe for Windows dev, simplify IQSS#10606 * add release note IQSS#10606 * fix links IQSS#11549 * IQSS#11542 update comments * re-work the deleteion of datafile featured items when publishing dataset * fix tests by setting publication date on datafile * fix typo in doc * changes per review * Apply suggestions from code review Co-authored-by: Copilot <[email protected]> * Changes per review * adding limit check on multi file upload * address code review feedback: add OPS, say "3.x" IQSS#11518 * add reference to LC mdb * remove unused imports, info->fine per review * improve release note IQSS#11360 * add note about the executor resource * add note about XML serialization changes * improve Search Services docs IQSS#11281 * Update doc/release-notes/TKLabels.md Co-authored-by: Omer Fahim <[email protected]> * Update doc/release-notes/TKLabels.md Co-authored-by: Philip Durbin <[email protected]> * fix after class name change * Update src/main/java/edu/harvard/iq/dataverse/search/SearchService.java Co-authored-by: Philip Durbin <[email protected]> * PIDs in addition to DOIs * add defaultService to API * static solr string * add settings to release note/docs * more name changes * test: add unit test for datasetType inclusion in DatasetVersion JSON serialization * docs: clarify dataset fields consistency across dataset versions in API guide IQSS#11573 Added a note explaining the constancy of `datasetId`, `datasetPersistentId`, and `datasetType` fields for all dataset versions in the API documentation. The first two have been around since v4.18, see also issue IQSS#6397 * partial update * fix broken page - name change, remove legacy section * exclude ext search classes * IQSS#11546 fix compare separator logic * use constant in more places * Remove the doc/sphinx-guides/HOWTO-SPHINX-INSTALL.txt file, because it is out-of-date and the updated information is part of the docs at doc/sphinx-guides/source/contributor/documentation.md. * Add a feature flag to optionally enable permission check in LC api * disable/don't load any non-solr search services in war * fix indenting IQSS#11281 * add comments to keep excludes in sync IQSS#11281 * tweak release note IQSS#11281 * finish windows instructions * fix list * minor tweaks IQSS#10606 * combine snippets (so far, with issue numbers) into 6.7 release notes IQSS#11367 * fix heading level * start reording, other tweaks * Add WSL preparation steps and package installation instructions * Update WSL installation instructions for package management and SDKMAN usage * Update WSL installation instructions to use powershell code blocks for package installation * Clarify WSL preparation steps and SDKMAN usage in documentation * minor clean-up * reordering and tweaking IQSS#11367 * more reordering and tweaking, add highlights IQSS#11367 * rework through end of API updates IQSS#11367 * edited up to upgrade steps IQSS#11367 * clean up upgrade instructions IQSS#11367 * update guides links to 6.7 IQSS#11367 * tweaks IQSS#11367 * mention MCP IQSS#11367 * add Rclone to list of integrations IQSS#11608 * make limit update a superuser only function * tweak highlights IQSS#11367 * rework video subtitles section IQSS#11367 * reword settings IQSS#11367 * curation labels updates, Solr indexing IQSS#11367 * various tweaks IQSS#11367 * add the rest of backward incompats IQSS#11367 * throw exception if non superuser changes the limit * throw exception if non superuser changes the limit * fix tests * consolidate creating DocumentBuilder * cleanup * missing import * handle XMLReader * update XMLInputStream, final closes * refactor to reduce dupe code * remove unused imports * tests * style issue * typo from earlier update * simple release note * add release note and link to announcement and PR IQSS#11608 * avoid parallel testing with cache * transformer, validator updates * avoid props/attr that don't exist in our versions * move Rclone from "getting data in" to "analysis and computation" IQSS#11608 * clarify Rclone support is read only IQSS#11608 * Update doc/release-notes/6.7-release-notes.md typo fix * Update doc/release-notes/6.7-release-notes.md typo fix * Update doc/release-notes/6.7-release-notes.md fixed typo * Update doc/release-notes/6.7-release-notes.md added a space * add windows dev to 6.7 releases notes IQSS#11367 * add configurable search services to 6.7 release notes IQSS#11367 * add Security Updates section IQSS#11367 * Updated the locally-built xoai jars with the cherry-picked commit securing b0ea5e5 XMLInputFactory there. * rebuilt the xoai jar files with a modified version, to make the project build on Jenkins without breaking all the other PR builds. * ... and the updated pom file to use the new local version * add xml parsing update IQSS#11367 * add rclone support to release notes IQSS#11367 * clarify that changes can be made later IQSS#11367 * simplify IQSS#11367 * one line per paragraph IQSS#11367 * add note about Solr dir for Docker users IQSS#11367 * Update doc/release-notes/6.7-release-notes.md typo fix * typo: fix URL Co-authored-by: Omer Fahim <[email protected]> * bump to 6.7 and update base.image.version IQSS#11371 * Update doc/release-notes/6.7-release-notes.md added word * switch base.image.version back to non-revision IQSS#11651 * catch failing pubworkflow * Fix for curation status indication. --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: Philip Durbin <[email protected]> Co-authored-by: Omer Fahim <[email protected]> Co-authored-by: Steven Winship <[email protected]> Co-authored-by: Stephen Kraffmiller <[email protected]> Co-authored-by: Oliver Bertuch <[email protected]> Co-authored-by: GPortas <[email protected]> Co-authored-by: jo-pol <[email protected]> Co-authored-by: qqmyers <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Leonid Andreev <[email protected]> Co-authored-by: Steven Ferey <[email protected]> Co-authored-by: Ken Lui <[email protected]> Co-authored-by: qqmyers <[email protected]> Co-authored-by: Copilot <[email protected]> Co-authored-by: Jan van Mansum <[email protected]> Co-authored-by: Juan Pablo Tosca Villanueva <[email protected]> Co-authored-by: Florian Fritze <[email protected]>
What this PR does / why we need it: This PR implements an experimental capability to have several configurable search services. It's a minimalistic implementation of the general design that's been discussed, but, with the example GetExternalSearchServiceBean and PostExternalSearchServiceBean classes it is intended to support developing/demonstrating a separate tool, and specifically the AI dataset search coming from Harvard.
Which engine is used is controlled by a Mprofile/jvm option: -Ddataverse.search.default-service=
where name is 'getExternalSearch', 'postExternalSearch', etc. as defined in the classes.
(The *ExternalSearchServiceBeans send all of the search parameters from Dataverse to a third party web service (via GET or POST) and expect to get back an ordered JsonArray of matching objects/datasets with "DOI"s and "Distance"s. These classes then send a query to Solr via Dataverse's internal SolrSearchServiceBean that will generate the output needed for the display. (Solr doesn't preserve the order in the query. The beans reorder the results based on Distance (smallest first).)
Both of the *ExternalSearchServiceBean classes also rely on a new/temporary
:<Get/Post>ExternalSearchUrlparameter (easier to change than a JvmSetting) and an:<Get/Post>ExternalSearchNamesetting.The GetExternalSearchServiceBean has been 'tested' with :GetExternalSearchUrl pointed at a static page returning a list of valid DOIs and Distances - can see in the logs that all the search params are sent as query params, can verify that final the search results contain only those entities, that the order is the same as what's sent, etc. The POST version is being tested on the QA machine with @siacus's AI search via POST.
There are other example beans: goldenOldies that only returns entries with entityIds <1000, and oddlyEnough that only returns entityIds that are odd. Neither of these is practical, but they allow some testing w/o any external service and, for oddlyEnough, show how one can deal partially deal with paging, changing the number of total results, etc. that come from solr.
This PR also restores the query_entities parameter in the search api. It remains defaulted to true (it was hardcoded true previously), but setting query_entities=false will now avoid the post-search lookup of additional metadata in postgres.
Which issue(s) this PR closes:
Special notes for your reviewer:
The ability to find/enable SearchService implementations packaged in separate jars works, but they can't be built separately from the main Dataverse code (since they use core classes like Dataverse, SolrSearchServiceBean, SettingsService, etc.).
They way maven is comfigured now, the war doesn't include any new search services. To build the separate jars, you can call, for example,
mvn clean package -DskipTests=true -Pexternal-search-get -Pexternal-search-post -Ptrivial-search-examplesI'm not sure how/where the separate jars should be distributed (additional resources for the release?) Hopefully someone with more maven skills than me can separate jar builds more cleanly (even if we don't immediately drop references to internal Dataverse classes).
The SolrSearchServiceBean is @stateless - as it has been. In early dev, I switched it to @singleton since @stateless is injected as a proxy (which implements SearchService but is not a SolrSearchServiceBean instance.) I think the current code is handling that OK.
Suggestions on how to test this: The main test would be to try this on the QA machine which is configured it to use Stefano's AI search. That can either be tested by API, soon via the SPA, or by setting the default to the postExternalSearch service. However, the latter, since it excludes collections and files, will mess up the JSF UI in ways that might be confusing. (If the SPA is deployed there, the UI should allow you to switch between installed engines.)
Aside from validating that that works, some regression/performance testing should be done to verify the design doesn't break or slow current solr searches (shouldn't break, could slow).
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Is there a release notes update needed for this change?: included
Additional documentation: the api change is documented in the api guide, there's a new search-services developer page with details on building/configuring.