feat: Add sort_by option to CREATE INDEX syntax#3988
Merged
Conversation
51998d0 to
85b7afe
Compare
stuhood
reviewed
Jan 25, 2026
Adds a new WITH clause option `sort_by` that configures how Tantivy sorts documents within index segments. This enables optimizations for queries with ORDER BY clauses matching the segment sort order. Syntax: CREATE INDEX ... WITH (key_field='id', sort_by='field ASC|DESC [NULLS FIRST|LAST]') CREATE INDEX ... WITH (key_field='id', sort_by='none') -- disable sorting CREATE INDEX ... WITH (key_field='id') -- defaults to ctid ASC Features: - Parse and validate sort_by syntax during CREATE INDEX - Pass sort configuration to Tantivy IndexSettings - Validate sort field exists and is a fast field - Support ASC/DESC direction and NULLS FIRST/LAST Tests: - Unit tests for parsing (18 tests in options.rs) - Unit tests for Tantivy integration (7 tests in build.rs) - pg_regress tests covering syntax, sorted fetch, sawtooth patterns, NULLS handling, default behavior, multi-field warning, and error cases Closes #3984
Previously, data inserted AFTER index creation went into mutable segments that used IndexSettings::default(), ignoring the sort_by configuration. This caused mutable segment data to be returned in insertion order instead of the configured sort order. Changes: - SerialIndexWriter::in_memory() now reads sort_by from rd_options - build_sort_by_field() made pub(crate) for internal reuse - Added Section 6 test to verify mutable segment sorting
Replace manual index loop with idiomatic iterator using skip pattern. This is cleaner, less error-prone, and more Rust-idiomatic.
85b7afe to
15c03fb
Compare
NULLS ordering is now required to avoid confusion with PostgreSQL defaults. Tantivy's NULL ordering is fixed and differs from PostgreSQL: - ASC: only NULLS FIRST supported (PostgreSQL defaults to NULLS LAST) - DESC: only NULLS LAST supported (PostgreSQL defaults to NULLS FIRST) Valid syntax: - sort_by='field ASC NULLS FIRST' - sort_by='field DESC NULLS LAST' - sort_by='none' Invalid (will error): - sort_by='field ASC' (missing NULLS) - sort_by='field ASC NULLS LAST' (wrong NULLS for ASC) - sort_by='field DESC NULLS FIRST' (wrong NULLS for DESC)
rebasedming
reviewed
Jan 26, 2026
rebasedming
reviewed
Jan 26, 2026
rebasedming
reviewed
Jan 26, 2026
Read IndexSettings from MetaPage instead of rd_options when creating mutable segments. This ensures backward compatibility: - Old indexes (before sort_by feature): sort_by_field = None - New indexes with default: sort_by_field = Some(ctid ASC) - New indexes with explicit sort_by: sort_by_field = Some(<field>) The MetaPage is the source of truth for index settings. Mutable segments are only materialized after the index is valid, so settings always exist.
Address PR review comments: - Move multi-field validation to options.rs (parse_sort_by_string) - Move build_sort_by_field from build.rs to SearchIndexSchema - Update tests accordingly
Fall back to IndexSettings::default() when settings bytes are empty (legacy indexes). Return descriptive error for invalid JSON instead of panicking.
- Change default sort_by from 'ctid ASC' to None (no segment sorting) - Simplify mutable segment IndexSettings to use rd_options instead of MetaPage - Remove unused CTID_FIELD constant - Combine redundant test_parse_sort_by_none tests - Rename test to reflect explicit ctid sorting vs default
stuhood
approved these changes
Jan 28, 2026
Collaborator
stuhood
left a comment
There was a problem hiding this comment.
Thanks! Shipit with comments.
- Prune behavior tests, keeping only syntax validation (per PR review) - Add tests for case insensitivity, ctid field, composite TYPE, pdb.alias - Add error case tests for invalid syntax, unsupported combinations - Reject whitespace-only sort_by with clear error message
stuhood
added a commit
that referenced
this pull request
Jan 28, 2026
) ## What This change swaps from the hash join execution method which was added in #3930 to explicitly using DataFusion's hash join. Future changes will introduce DataFusion's optimizer (by producing logical nodes rather than physical nodes) so that it can take advantage of the sorted segments which will be provided by #3988. ## Why As described in #3930, the implementation there was explicitly temporary. We will be leaning in to using DataFusion to execute columnar joins. ## Tests The regression tests and proptests pass, with one exception: numeric columns cannot safely be pulled up from fast fields currently (see #2968). Additionally, improved `qgen`'s handling of panics.
stuhood
added a commit
that referenced
this pull request
Feb 2, 2026
# Ticket(s) Closed - Closes #3987. ## What This change migrates the `joinscan` from manual physical plan construction to using DataFusion logical plans, which are then optimized into physical plans. To do so, it adds a TableProvider implementation to allow DataFusion to natively scan and filter using the [previously extracted scan over fast fields](#3981). It additionally rearranges the code a bit to ensure separation between planning and execution: execution state and DataFusion logical planning are isolated to `scan_state.rs` (over time we might frontload and serialize more of DataFusion's logical plan during planning time, but for now none of that information is needed in the `CustomPath` that we produce). `JoinSideInfo` was lifted up into `pg_search/src/scan` as `ScanInfo`. ## Why In followup changes, we will give DataFusion's optimizer a lot more to work with: * Sorted segments (via #3988) * Predicate pushdown (see the TODO in `supports_filters_pushdown`), and dynamic filtering * Introducing use of parallel workers ## Tests Existing tests pass, and were lightly expanded to cover some new edge cases. One join benchmark is marginally faster, one is massively slower, others are unaffected. This is expected for now: more performance work is on the way.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Adds support for
sort_byoption inCREATE INDEX ... USING bm25syntax to configure index-time sorting.Syntax:
sort_byomitted'none')sort_by = 'none'sort_by = 'field ASC NULLS FIRST'sort_by = 'field DESC NULLS LAST'Constraints:
ASCrequiresNULLS FIRSTDESCrequiresNULLS LASTGrammar:
Why
Part of M1 for #3984. Enables:
Implementation
parse_sort_by_string()inoptions.rs- Parses and validates the sort_by syntaxbuild_sort_by_field()inschema/mod.rs- Validates field exists and is a fast fieldALTER INDEX ... SET (sort_by='...')+REINDEXsupported for changing sort orderTests
Closes #3984 (partial)