feat: Add sort_by option to CREATE INDEX syntax by mithuncy · Pull Request #3988 · paradedb/paradedb

mithuncy · 2026-01-22T16:25:48Z

What

Adds support for sort_by option in CREATE INDEX ... USING bm25 syntax to configure index-time sorting.

Syntax:

-- Sort by score descending
CREATE INDEX idx ON orders USING bm25 (id, customer_id, score)
WITH (
    key_field = 'id',
    sort_by = 'score DESC NULLS LAST'
);

-- Disable sorting (default behavior)
CREATE INDEX idx ON orders USING bm25 (id, customer_id, score)
WITH (
    key_field = 'id',
    sort_by = 'none'
);

Option	Behavior
`sort_by` omitted	No sorting (equivalent to `'none'`)
`sort_by = 'none'`	Explicitly disable sorting
`sort_by = 'field ASC NULLS FIRST'`	Sort ascending, NULLs first
`sort_by = 'field DESC NULLS LAST'`	Sort descending, NULLs last

Constraints:

Single field only (multi-field not yet supported)
Field must be a fast field (numeric, date, or boolean)
NULLS ordering must be explicit and match Tantivy's behavior:
- ASC requires NULLS FIRST
- DESC requires NULLS LAST

Grammar:

sort_by_value  ::= 'none' | sort_key
sort_key       ::= field_name ( 'ASC' 'NULLS FIRST' | 'DESC' 'NULLS LAST' )
field_name     ::= identifier

Why

Part of M1 for #3984. Enables:

TopN query acceleration - Early cutoff on sorted fast fields
Join acceleration - Sorting by primary/foreign keys enables merge/zig-zag joins
Query locality - Sorting by filtering/partitioning properties

Implementation

parse_sort_by_string() in options.rs - Parses and validates the sort_by syntax
build_sort_by_field() in schema/mod.rs - Validates field exists and is a fast field
Settings stored in MetaPage and used for both persistent and mutable segments
ALTER INDEX ... SET (sort_by='...') + REINDEX supported for changing sort order

Tests

Syntax validation (ASC/DESC, NULLS FIRST/LAST)
Single segment sorted fetch
Multi-segment sawtooth pattern
NULL value handling
Different field types (TIMESTAMP, REAL, BIGINT)
REINDEX behavior
Error cases (non-existent field, non-fast field, invalid syntax)

Closes #3984 (partial)

stuhood

Thanks a lot!

Adds a new WITH clause option `sort_by` that configures how Tantivy sorts documents within index segments. This enables optimizations for queries with ORDER BY clauses matching the segment sort order. Syntax: CREATE INDEX ... WITH (key_field='id', sort_by='field ASC|DESC [NULLS FIRST|LAST]') CREATE INDEX ... WITH (key_field='id', sort_by='none') -- disable sorting CREATE INDEX ... WITH (key_field='id') -- defaults to ctid ASC Features: - Parse and validate sort_by syntax during CREATE INDEX - Pass sort configuration to Tantivy IndexSettings - Validate sort field exists and is a fast field - Support ASC/DESC direction and NULLS FIRST/LAST Tests: - Unit tests for parsing (18 tests in options.rs) - Unit tests for Tantivy integration (7 tests in build.rs) - pg_regress tests covering syntax, sorted fetch, sawtooth patterns, NULLS handling, default behavior, multi-field warning, and error cases Closes #3984

Previously, data inserted AFTER index creation went into mutable segments that used IndexSettings::default(), ignoring the sort_by configuration. This caused mutable segment data to be returned in insertion order instead of the configured sort order. Changes: - SerialIndexWriter::in_memory() now reads sort_by from rd_options - build_sort_by_field() made pub(crate) for internal reuse - Added Section 6 test to verify mutable segment sorting

Replace manual index loop with idiomatic iterator using skip pattern. This is cleaner, less error-prone, and more Rust-idiomatic.

NULLS ordering is now required to avoid confusion with PostgreSQL defaults. Tantivy's NULL ordering is fixed and differs from PostgreSQL: - ASC: only NULLS FIRST supported (PostgreSQL defaults to NULLS LAST) - DESC: only NULLS LAST supported (PostgreSQL defaults to NULLS FIRST) Valid syntax: - sort_by='field ASC NULLS FIRST' - sort_by='field DESC NULLS LAST' - sort_by='none' Invalid (will error): - sort_by='field ASC' (missing NULLS) - sort_by='field ASC NULLS LAST' (wrong NULLS for ASC) - sort_by='field DESC NULLS FIRST' (wrong NULLS for DESC)

Read IndexSettings from MetaPage instead of rd_options when creating mutable segments. This ensures backward compatibility: - Old indexes (before sort_by feature): sort_by_field = None - New indexes with default: sort_by_field = Some(ctid ASC) - New indexes with explicit sort_by: sort_by_field = Some(<field>) The MetaPage is the source of truth for index settings. Mutable segments are only materialized after the index is valid, so settings always exist.

Address PR review comments: - Move multi-field validation to options.rs (parse_sort_by_string) - Move build_sort_by_field from build.rs to SearchIndexSchema - Update tests accordingly

Fall back to IndexSettings::default() when settings bytes are empty (legacy indexes). Return descriptive error for invalid JSON instead of panicking.

- Change default sort_by from 'ctid ASC' to None (no segment sorting) - Simplify mutable segment IndexSettings to use rd_options instead of MetaPage - Remove unused CTID_FIELD constant - Combine redundant test_parse_sort_by_none tests - Rename test to reflect explicit ctid sorting vs default

stuhood

Thanks! Shipit with comments.

- Prune behavior tests, keeping only syntax validation (per PR review) - Add tests for case insensitivity, ctid field, composite TYPE, pdb.alias - Add error case tests for invalid syntax, unsupported combinations - Reject whitespace-only sort_by with clear error message

) ## What This change swaps from the hash join execution method which was added in #3930 to explicitly using DataFusion's hash join. Future changes will introduce DataFusion's optimizer (by producing logical nodes rather than physical nodes) so that it can take advantage of the sorted segments which will be provided by #3988. ## Why As described in #3930, the implementation there was explicitly temporary. We will be leaning in to using DataFusion to execute columnar joins. ## Tests The regression tests and proptests pass, with one exception: numeric columns cannot safely be pulled up from fast fields currently (see #2968). Additionally, improved `qgen`'s handling of panics.

# Ticket(s) Closed - Closes #3987. ## What This change migrates the `joinscan` from manual physical plan construction to using DataFusion logical plans, which are then optimized into physical plans. To do so, it adds a TableProvider implementation to allow DataFusion to natively scan and filter using the [previously extracted scan over fast fields](#3981). It additionally rearranges the code a bit to ensure separation between planning and execution: execution state and DataFusion logical planning are isolated to `scan_state.rs` (over time we might frontload and serialize more of DataFusion's logical plan during planning time, but for now none of that information is needed in the `CustomPath` that we produce). `JoinSideInfo` was lifted up into `pg_search/src/scan` as `ScanInfo`. ## Why In followup changes, we will give DataFusion's optimizer a lot more to work with: * Sorted segments (via #3988) * Predicate pushdown (see the TODO in `supports_filters_pushdown`), and dynamic filtering * Introducing use of parallel workers ## Tests Existing tests pass, and were lightly expanded to cover some new edge cases. One join benchmark is marginally faster, one is massively slower, others are unaffected. This is expected for now: more performance work is on the way.

mithuncy added the Do Not Cherry Pick PR should not be cherry-picked to other branches label Jan 22, 2026

mithuncy force-pushed the feat/sort-by-syntax branch 3 times, most recently from 51998d0 to 85b7afe Compare January 24, 2026 16:18

mithuncy marked this pull request as ready for review January 24, 2026 18:27

mithuncy requested review from mdashti, philippemnoel, rebasedming and stuhood as code owners January 24, 2026 18:27

stuhood reviewed Jan 25, 2026

View reviewed changes

Comment thread pg_search/src/postgres/build.rs Outdated

Comment thread pg_search/src/postgres/options.rs

Comment thread pg_search/src/postgres/options.rs Outdated

Comment thread pg_search/tests/pg_regress/sql/sort_by.sql Outdated

mithuncy added 3 commits January 25, 2026 18:32

refactor: Use iterator pattern in parse_sort_key

15c03fb

Replace manual index loop with idiomatic iterator using skip pattern. This is cleaner, less error-prone, and more Rust-idiomatic.

mithuncy force-pushed the feat/sort-by-syntax branch from 85b7afe to 15c03fb Compare January 26, 2026 14:34

rebasedming reviewed Jan 26, 2026

View reviewed changes

Comment thread pg_search/src/postgres/build.rs Outdated

rebasedming reviewed Jan 26, 2026

View reviewed changes

Comment thread pg_search/src/postgres/build.rs Outdated

rebasedming reviewed Jan 26, 2026

View reviewed changes

Comment thread pg_search/src/postgres/options.rs Outdated

stuhood mentioned this pull request Jan 27, 2026

chore: Swap join scan execution to an unoptimized DataFusion plan #4003

Merged

mithuncy added 4 commits January 27, 2026 13:34

refactor: Move build_sort_by_field to SearchIndexSchema

8e39f6b

Address PR review comments: - Move multi-field validation to options.rs (parse_sort_by_string) - Move build_sort_by_field from build.rs to SearchIndexSchema - Update tests accordingly

fix: Handle missing IndexSettings for legacy indexes

8d81ccc

Fall back to IndexSettings::default() when settings bytes are empty (legacy indexes). Return descriptive error for invalid JSON instead of panicking.

Merge branch 'main' into feat/sort-by-syntax

93224a7

mithuncy mentioned this pull request Jan 27, 2026

Investigate ctid as default sort_by for heap access optimization #4008

Open

stuhood approved these changes Jan 28, 2026

View reviewed changes

Comment thread pg_search/src/postgres/options.rs Outdated

Comment thread pg_search/tests/pg_regress/sql/sort_by.sql

mithuncy added 3 commits January 28, 2026 12:46

Merge branch 'main' into feat/sort-by-syntax

c1f24f7

fix: Improve sort_by error messages

a3d207e

mithuncy merged commit 7684acc into main Jan 28, 2026
18 checks passed

mithuncy deleted the feat/sort-by-syntax branch January 28, 2026 15:07

stuhood mentioned this pull request Jan 29, 2026

chore: Use DataFusion's optimizer in the join scan #4009

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: Add sort_by option to CREATE INDEX syntax#3988

feat: Add sort_by option to CREATE INDEX syntax#3988
mithuncy merged 12 commits intomainfrom
feat/sort-by-syntax

mithuncy commented Jan 22, 2026 •

edited

Loading

Uh oh!

stuhood left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

stuhood left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

mithuncy commented Jan 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Why

Implementation

Tests

Uh oh!

stuhood left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

stuhood left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

mithuncy commented Jan 22, 2026 •

edited

Loading