Fix author search name sort#12809
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR introduces a new Solr sort field for author names (name_sort) using ICU collation, and updates author search sorting to use the new field (with a temporary fallback during reindex).
Changes:
- Add
name_sortto Solr schema and Solr document typing. - Add an ICU collation-based field type for internationalized sorting and copy
nameintoname_sort. - Update author search scheme to sort by
name_sortwhen using “next” Solr, otherwise fall back toname_str.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| openlibrary/solr/solr_types.py | Adds name_sort (and TODOs) to the typed Solr document shape for authors. |
| openlibrary/plugins/worksearch/schemes/authors.py | Switches the "name" sort to prefer name_sort when using next Solr. |
| conf/solr/conf/managed-schema.xml | Defines name_sort, adds ICU collation field type, and updates copyField/sort-related schema entries. |
| text: Optional[list[str]] | ||
| seed: Optional[list[str]] | ||
| name: Optional[str] | ||
| # TODO: name_str can be removed when removed when reindex is complete |
|
Thanks for the PR, @tfmorris! Copilot has been assigned for an initial review. @cdrini is assigned to this PR and currently has:
Possible improvements for this PR
PR triage checklist (maintainers / Pam)
Note This comment was automatically generated by Pam, Open Library's Project AI Manager, on behalf of @mekarpeles. Pam is designed to provide status visibility, perform basic project management functions and relevant codebase research, and provide actionable feedback so contributors aren't left waiting. |
- create a new type text_international_sort of type ICUCollationField which is not stored - create a name_sort field which uses this type - update search schema to switch from name_str to name_sort - delete obsolete name_str field
- also add TODOs for post-reindex tasks
deeeb0d to
31a8451
Compare
cbd5412 to
e3d3348
Compare
e3d3348 to
8bd33c9
Compare
|
@cdrini I think this is ready for review, but I can't tell what's going on with the failing tests. They don't seem to have anything to do with code that I touched. Any ideas? |
|
How odd... oh I think I see the issue; calling |
|
Thanks for the tip. I'll have a look. There's an additional issue in that the CI infrastructure insists on keeping the Solr schema and the types definitions in lock step. In addition, I'm not sure how to effectively test that the fallback works correctly since it effectively requires two different versions of the schema. I'll clean up the tag stuff first and perhaps that'll also give a hint as to the memcache failures (different memcache definitions in the config?). |
There was a problem hiding this comment.
Awesome, thank you Tom! This looks great to me. ICUCollationField makes sense and seems to the most performant option for this too! Which should address the strain we saw that led to #11211. Strength=primary also seems reasonable. I could see it potentially useful to use secondary so that accented names are grouped together, but seeing as most such cases are likely going to missing accents in our data set, primary seems good.
Tested by:
Copying in the docs from the Tavernier search
docker compose exec -e 'PYTHONPATH=.' web ./scripts/copydocs.py $(curl -s 'https://openlibrary.org/search/authors.json?q=Tavernier&sort=name&fields=key' | jq -r '.docs[].key | "/authors/" + .')And confirming things appear in the correct order:
(Note: Once the full reindex is deployed, we should take author sorting out of beta ; this will make it very stable.)
If we can find a way to sort that test out, then should be good to go. I also just noticed that there are a lot of other failing tests too, not just the amazon tag one :/ Let me see if I can find a quick workaround...
| enumsConfig = ET.parse(os.path.join(root, "../../conf/solr/conf/", enumsConfigFile)) | ||
| enum_values = [el.text for el in enumsConfig.findall(f".//enum[@name='{field_type.get('enumName')}']/value")] | ||
| python_type = f"Literal[{', '.join(map(repr, enum_values))}]" | ||
| elif field_class == "solr.ICUCollationField": |
There was a problem hiding this comment.
Note we might want to revisit this file in a future issue/PR and consider making stored="false" values excluded, since they can't be read. And maybe having a separate type for SolrInputDocument which allows setting some of the write-only feeds. Not a blocker ; current solution is good.
There was a problem hiding this comment.
Forcing the types to be in lock step with a single version of the schema also makes testing for the type of migration done here difficult.
| # TODO: fallback can be removed after reindex is complete | ||
| "name": "name_sort asc" if get_solr_next() else "name_str asc", |
There was a problem hiding this comment.
Ah got it! The issue get_solr_next is called here effectively at import-time, causing the config to be loaded in early and having side effects for the other tests. If we put this in a lambda causes get_solr_next to be called much later when expected, and fixes the bug.
| # TODO: fallback can be removed after reindex is complete | |
| "name": "name_sort asc" if get_solr_next() else "name_str asc", | |
| # TODO: fallback can be removed after reindex is complete | |
| "name": lambda: "name_sort asc" if get_solr_next() else "name_str asc", |
There was a problem hiding this comment.
Thanks for figuring out the workaround. Having tests which fail when the standard config file is included seems fragile and error-prone to me, but I'm happy to have this merged and let someone else deal with that mess.
There was a problem hiding this comment.
Yeah, the way our openlibrary.yml is loaded in causes a lot of race conditions.
|
Thanks! Please have a look at #11463 too. |
Closes #12794. Repairs damage from #11211
fix
Technical
Uses a
solr.ICUCollationFieldfor sorting instead of a string, which has the advantage of being compact, since only the collation key is stored, and broadly internationalized, to the extent that can be done in a cross language fashion. The primary strength collator will ignore both case and diacritics.This requires the
analysis-extrasmodule which is configured inconsistently across the environments. I made that more consistent in 37881245e4a65a9cbea128575803376ab84450b8 in my original PR, but have omitted it here to streamline things.Testing
Screenshot
Stakeholders
@cdrini