Fix author search name sort by tfmorris · Pull Request #12809 · internetarchive/openlibrary

tfmorris · 2026-05-25T23:43:27Z

Closes #12794. Repairs damage from #11211

fix

Technical

Uses a solr.ICUCollationField for sorting instead of a string, which has the advantage of being compact, since only the collation key is stored, and broadly internationalized, to the extent that can be done in a cross language fashion. The primary strength collator will ignore both case and diacritics.

This requires the analysis-extras module which is configured inconsistently across the environments. I made that more consistent in 37881245e4a65a9cbea128575803376ab84450b8 in my original PR, but have omitted it here to streamline things.

Testing

Screenshot

Stakeholders

@cdrini

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR introduces a new Solr sort field for author names (name_sort) using ICU collation, and updates author search sorting to use the new field (with a temporary fallback during reindex).

Changes:

Add name_sort to Solr schema and Solr document typing.
Add an ICU collation-based field type for internationalized sorting and copy name into name_sort.
Update author search scheme to sort by name_sort when using “next” Solr, otherwise fall back to name_str.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File	Description
openlibrary/solr/solr_types.py	Adds `name_sort` (and TODOs) to the typed Solr document shape for authors.
openlibrary/plugins/worksearch/schemes/authors.py	Switches the `"name"` sort to prefer `name_sort` when using next Solr.
conf/solr/conf/managed-schema.xml	Defines `name_sort`, adds ICU collation field type, and updates copyField/sort-related schema entries.

    text: Optional[list[str]]
    seed: Optional[list[str]]
    name: Optional[str]
+    # TODO: name_str can be removed when removed when reindex is complete


mekarpeles · 2026-05-25T23:45:54Z

Thanks for the PR, @tfmorris!

Copilot has been assigned for an initial review.

@cdrini is assigned to this PR and currently has:

29 open PR(s) of equal or higher priority to review first

Possible improvements for this PR

CI is currently failing — please check the failing checks and resolve them before review. The pre-commit guide covers common fixes.

PR triage checklist (maintainers / Pam)

PR description — not empty; explains what the change does and how to verify it
References an issue — PR body contains a #NNN reference
- Linked issue is triaged — has a Priority: * label (not just Needs: Triage)
- Linked issue is assigned — has at least one assignee
Commit history clean — no WIP/fixup/conflict noise; commit messages are meaningful
CI passing — no failing check-runs
Test cases present — if the change touches substantive logic, test coverage exists or is explained
Proof of testing — PR body includes a description of what was tested, a screenshot, or a video

Note

This comment was automatically generated by Pam, Open Library's Project AI Manager, on behalf of @mekarpeles. Pam is designed to provide status visibility, perform basic project management functions and relevant codebase research, and provide actionable feedback so contributors aren't left waiting.

- create a new type text_international_sort of type ICUCollationField which is not stored - create a name_sort field which uses this type - update search schema to switch from name_str to name_sort - delete obsolete name_str field

- also add TODOs for post-reindex tasks

tfmorris · 2026-05-26T03:05:32Z

@cdrini I think this is ready for review, but I can't tell what's going on with the failing tests. They don't seem to have anything to do with code that I touched. Any ideas?

cdrini · 2026-05-26T18:10:39Z

How odd... oh I think I see the issue; calling get_solr_next here is causing the conf/openlibrary.yml file to be loaded where previously it wasn't -- resulting in the amazon tag being read-in. That's a bit of a tangle ; updating the test to include the tag seems like the easiest fix.

tfmorris · 2026-05-26T19:16:35Z

Thanks for the tip. I'll have a look. There's an additional issue in that the CI infrastructure insists on keeping the Solr schema and the types definitions in lock step. In addition, I'm not sure how to effectively test that the fallback works correctly since it effectively requires two different versions of the schema.

I'll clean up the tag stuff first and perhaps that'll also give a hint as to the memcache failures (different memcache definitions in the config?).

cdrini

Awesome, thank you Tom! This looks great to me. ICUCollationField makes sense and seems to the most performant option for this too! Which should address the strain we saw that led to #11211. Strength=primary also seems reasonable. I could see it potentially useful to use secondary so that accented names are grouped together, but seeing as most such cases are likely going to missing accents in our data set, primary seems good.

Tested by:

Copying in the docs from the Tavernier search

docker compose exec -e 'PYTHONPATH=.' web ./scripts/copydocs.py $(curl -s 'https://openlibrary.org/search/authors.json?q=Tavernier&sort=name&fields=key' | jq -r '.docs[].key | "/authors/" + .')

And confirming things appear in the correct order:

(Note: Once the full reindex is deployed, we should take author sorting out of beta ; this will make it very stable.)

If we can find a way to sort that test out, then should be good to go. I also just noticed that there are a lot of other failing tests too, not just the amazon tag one :/ Let me see if I can find a quick workaround...

cdrini · 2026-05-26T18:44:02Z

                enumsConfig = ET.parse(os.path.join(root, "../../conf/solr/conf/", enumsConfigFile))
                enum_values = [el.text for el in enumsConfig.findall(f".//enum[@name='{field_type.get('enumName')}']/value")]
                python_type = f"Literal[{', '.join(map(repr, enum_values))}]"
+            elif field_class == "solr.ICUCollationField":


Note we might want to revisit this file in a future issue/PR and consider making stored="false" values excluded, since they can't be read. And maybe having a separate type for SolrInputDocument which allows setting some of the write-only feeds. Not a blocker ; current solution is good.

Forcing the types to be in lock step with a single version of the schema also makes testing for the type of migration done here difficult.

cdrini · 2026-05-27T12:18:20Z

+            # TODO: fallback can be removed after reindex is complete
+            "name": "name_sort asc" if get_solr_next() else "name_str asc",


Ah got it! The issue get_solr_next is called here effectively at import-time, causing the config to be loaded in early and having side effects for the other tests. If we put this in a lambda causes get_solr_next to be called much later when expected, and fixes the bug.

Suggested change

# TODO: fallback can be removed after reindex is complete

"name": "name_sort asc" if get_solr_next() else "name_str asc",

# TODO: fallback can be removed after reindex is complete

"name": lambda: "name_sort asc" if get_solr_next() else "name_str asc",

Thanks for figuring out the workaround. Having tests which fail when the standard config file is included seems fragile and error-prone to me, but I'm happy to have this merged and let someone else deal with that mess.

Yeah, the way our openlibrary.yml is loaded in causes a lot of race conditions.

tfmorris · 2026-05-27T18:44:02Z

Thanks! Please have a look at #11463 too.

Copilot AI review requested due to automatic review settings May 25, 2026 23:43

github-actions Bot assigned cdrini May 25, 2026

Copilot AI reviewed May 25, 2026

View reviewed changes

mekarpeles added the Needs: Submitter Input Waiting on input from the creator of the issue/pr [managed] label May 25, 2026

tfmorris added 2 commits May 25, 2026 20:21

Fix author search name collating order

1225ae7

- create a new type text_international_sort of type ICUCollationField which is not stored - create a name_sort field which uses this type - update search schema to switch from name_str to name_sort - delete obsolete name_str field

Address review comments

31a8451

- also add TODOs for post-reindex tasks

tfmorris force-pushed the 12794/fix/solr-author-name-sort branch from deeeb0d to 31a8451 Compare May 26, 2026 00:24

github-actions Bot removed the Needs: Submitter Input Waiting on input from the creator of the issue/pr [managed] label May 26, 2026

tfmorris force-pushed the 12794/fix/solr-author-name-sort branch from cbd5412 to e3d3348 Compare May 26, 2026 02:44

tfmorris added 2 commits May 25, 2026 22:49

Add support for ICUCollationField type

5b07bc0

Add docValues=true, even though it's the default

8bd33c9

tfmorris force-pushed the 12794/fix/solr-author-name-sort branch from e3d3348 to 8bd33c9 Compare May 26, 2026 03:02

This was referenced May 26, 2026

Fix librarian work title sorting not ignoring articles #11463

Merged

Fix author name sort field #12803

Closed

github-actions Bot added the Needs: Response Issues which require feedback from lead label May 26, 2026

cdrini approved these changes May 27, 2026

View reviewed changes

cdrini reviewed May 27, 2026

View reviewed changes

Apply code review feedback

1026a3f

cdrini merged commit 38ac59c into internetarchive:master May 27, 2026
4 checks passed

cdrini mentioned this pull request May 27, 2026

2026 H1 Full Solr Reindex #11650

Open

cdrini changed the title ~~Fix author search name sort. Fixes #12794~~ Fix author search name sort May 27, 2026

cdrini removed the Needs: Response Issues which require feedback from lead label May 27, 2026

		# TODO: fallback can be removed after reindex is complete
		"name": "name_sort asc" if get_solr_next() else "name_str asc",

Uh oh!

Conversation

tfmorris commented May 25, 2026

Technical

Testing

Screenshot

Stakeholders

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mekarpeles commented May 25, 2026

Possible improvements for this PR

Uh oh!

tfmorris commented May 26, 2026

Uh oh!

cdrini commented May 26, 2026

Uh oh!

tfmorris commented May 26, 2026

Uh oh!

cdrini left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

cdrini May 26, 2026

Choose a reason for hiding this comment

Uh oh!

tfmorris May 27, 2026

Choose a reason for hiding this comment

Uh oh!

cdrini May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tfmorris May 27, 2026

Choose a reason for hiding this comment

Uh oh!

cdrini May 27, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

tfmorris commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

cdrini left a comment •

edited

Loading

cdrini May 27, 2026 •

edited

Loading