Skip to content

Fix author search name sort#12809

Merged
cdrini merged 5 commits into
internetarchive:masterfrom
tfmorris:12794/fix/solr-author-name-sort
May 27, 2026
Merged

Fix author search name sort#12809
cdrini merged 5 commits into
internetarchive:masterfrom
tfmorris:12794/fix/solr-author-name-sort

Conversation

@tfmorris
Copy link
Copy Markdown
Contributor

Closes #12794. Repairs damage from #11211

fix

Technical

Uses a solr.ICUCollationField for sorting instead of a string, which has the advantage of being compact, since only the collation key is stored, and broadly internationalized, to the extent that can be done in a cross language fashion. The primary strength collator will ignore both case and diacritics.

This requires the analysis-extras module which is configured inconsistently across the environments. I made that more consistent in 37881245e4a65a9cbea128575803376ab84450b8 in my original PR, but have omitted it here to streamline things.

Testing

Screenshot

Screenshot 2026-05-25 at 19 26 03

Stakeholders

@cdrini

Copilot AI review requested due to automatic review settings May 25, 2026 23:43
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR introduces a new Solr sort field for author names (name_sort) using ICU collation, and updates author search sorting to use the new field (with a temporary fallback during reindex).

Changes:

  • Add name_sort to Solr schema and Solr document typing.
  • Add an ICU collation-based field type for internationalized sorting and copy name into name_sort.
  • Update author search scheme to sort by name_sort when using “next” Solr, otherwise fall back to name_str.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
openlibrary/solr/solr_types.py Adds name_sort (and TODOs) to the typed Solr document shape for authors.
openlibrary/plugins/worksearch/schemes/authors.py Switches the "name" sort to prefer name_sort when using next Solr.
conf/solr/conf/managed-schema.xml Defines name_sort, adds ICU collation field type, and updates copyField/sort-related schema entries.

Comment thread openlibrary/solr/solr_types.py Outdated
text: Optional[list[str]]
seed: Optional[list[str]]
name: Optional[str]
# TODO: name_str can be removed when removed when reindex is complete
Comment thread openlibrary/plugins/worksearch/schemes/authors.py Outdated
Comment thread conf/solr/conf/managed-schema.xml
Comment thread openlibrary/plugins/worksearch/schemes/authors.py Outdated
@mekarpeles
Copy link
Copy Markdown
Member

Thanks for the PR, @tfmorris!

Copilot has been assigned for an initial review.

@cdrini is assigned to this PR and currently has:

  • 29 open PR(s) of equal or higher priority to review first

Possible improvements for this PR

  • CI is currently failing — please check the failing checks and resolve them before review. The pre-commit guide covers common fixes.
PR triage checklist (maintainers / Pam)
  • PR description — not empty; explains what the change does and how to verify it
  • References an issue — PR body contains a #NNN reference
    • Linked issue is triaged — has a Priority: * label (not just Needs: Triage)
    • Linked issue is assigned — has at least one assignee
  • Commit history clean — no WIP/fixup/conflict noise; commit messages are meaningful
  • CI passing — no failing check-runs
  • Test cases present — if the change touches substantive logic, test coverage exists or is explained
  • Proof of testing — PR body includes a description of what was tested, a screenshot, or a video

Note

This comment was automatically generated by Pam, Open Library's Project AI Manager, on behalf of @mekarpeles. Pam is designed to provide status visibility, perform basic project management functions and relevant codebase research, and provide actionable feedback so contributors aren't left waiting.

@mekarpeles mekarpeles added the Needs: Submitter Input Waiting on input from the creator of the issue/pr [managed] label May 25, 2026
tfmorris added 2 commits May 25, 2026 20:21
- create a new type text_international_sort of type ICUCollationField
  which is not stored
- create a name_sort field which uses this type
- update search schema to switch from name_str to name_sort
- delete obsolete name_str field
- also add TODOs for post-reindex tasks
@tfmorris tfmorris force-pushed the 12794/fix/solr-author-name-sort branch from deeeb0d to 31a8451 Compare May 26, 2026 00:24
@github-actions github-actions Bot removed the Needs: Submitter Input Waiting on input from the creator of the issue/pr [managed] label May 26, 2026
@tfmorris tfmorris force-pushed the 12794/fix/solr-author-name-sort branch from cbd5412 to e3d3348 Compare May 26, 2026 02:44
@tfmorris tfmorris force-pushed the 12794/fix/solr-author-name-sort branch from e3d3348 to 8bd33c9 Compare May 26, 2026 03:02
@tfmorris
Copy link
Copy Markdown
Contributor Author

@cdrini I think this is ready for review, but I can't tell what's going on with the failing tests. They don't seem to have anything to do with code that I touched. Any ideas?

@github-actions github-actions Bot added the Needs: Response Issues which require feedback from lead label May 26, 2026
@cdrini
Copy link
Copy Markdown
Collaborator

cdrini commented May 26, 2026

How odd... oh I think I see the issue; calling get_solr_next here is causing the conf/openlibrary.yml file to be loaded where previously it wasn't -- resulting in the amazon tag being read-in. That's a bit of a tangle ; updating the test to include the tag seems like the easiest fix.

@tfmorris
Copy link
Copy Markdown
Contributor Author

Thanks for the tip. I'll have a look. There's an additional issue in that the CI infrastructure insists on keeping the Solr schema and the types definitions in lock step. In addition, I'm not sure how to effectively test that the fallback works correctly since it effectively requires two different versions of the schema.

I'll clean up the tag stuff first and perhaps that'll also give a hint as to the memcache failures (different memcache definitions in the config?).

Copy link
Copy Markdown
Collaborator

@cdrini cdrini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thank you Tom! This looks great to me. ICUCollationField makes sense and seems to the most performant option for this too! Which should address the strain we saw that led to #11211. Strength=primary also seems reasonable. I could see it potentially useful to use secondary so that accented names are grouped together, but seeing as most such cases are likely going to missing accents in our data set, primary seems good.

Tested by:

Copying in the docs from the Tavernier search

docker compose exec -e 'PYTHONPATH=.' web ./scripts/copydocs.py $(curl -s 'https://openlibrary.org/search/authors.json?q=Tavernier&sort=name&fields=key' | jq -r '.docs[].key | "/authors/" + .')

And confirming things appear in the correct order:

Image

(Note: Once the full reindex is deployed, we should take author sorting out of beta ; this will make it very stable.)

If we can find a way to sort that test out, then should be good to go. I also just noticed that there are a lot of other failing tests too, not just the amazon tag one :/ Let me see if I can find a quick workaround...

Comment thread conf/solr/conf/managed-schema.xml Outdated
Comment thread conf/solr/conf/managed-schema.xml Outdated
enumsConfig = ET.parse(os.path.join(root, "../../conf/solr/conf/", enumsConfigFile))
enum_values = [el.text for el in enumsConfig.findall(f".//enum[@name='{field_type.get('enumName')}']/value")]
python_type = f"Literal[{', '.join(map(repr, enum_values))}]"
elif field_class == "solr.ICUCollationField":
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note we might want to revisit this file in a future issue/PR and consider making stored="false" values excluded, since they can't be read. And maybe having a separate type for SolrInputDocument which allows setting some of the write-only feeds. Not a blocker ; current solution is good.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forcing the types to be in lock step with a single version of the schema also makes testing for the type of migration done here difficult.

Comment on lines +35 to +36
# TODO: fallback can be removed after reindex is complete
"name": "name_sort asc" if get_solr_next() else "name_str asc",
Copy link
Copy Markdown
Collaborator

@cdrini cdrini May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah got it! The issue get_solr_next is called here effectively at import-time, causing the config to be loaded in early and having side effects for the other tests. If we put this in a lambda causes get_solr_next to be called much later when expected, and fixes the bug.

Suggested change
# TODO: fallback can be removed after reindex is complete
"name": "name_sort asc" if get_solr_next() else "name_str asc",
# TODO: fallback can be removed after reindex is complete
"name": lambda: "name_sort asc" if get_solr_next() else "name_str asc",

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for figuring out the workaround. Having tests which fail when the standard config file is included seems fragile and error-prone to me, but I'm happy to have this merged and let someone else deal with that mess.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the way our openlibrary.yml is loaded in causes a lot of race conditions.

@cdrini cdrini merged commit 38ac59c into internetarchive:master May 27, 2026
4 checks passed
@cdrini cdrini changed the title Fix author search name sort. Fixes #12794 Fix author search name sort May 27, 2026
@cdrini cdrini removed the Needs: Response Issues which require feedback from lead label May 27, 2026
@tfmorris
Copy link
Copy Markdown
Contributor Author

Thanks! Please have a look at #11463 too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Author alphabetical search sort no longer works since #11211

4 participants