Skip to content

feat: Add sort_by option to CREATE INDEX syntax#3988

Merged
mithuncy merged 12 commits intomainfrom
feat/sort-by-syntax
Jan 28, 2026
Merged

feat: Add sort_by option to CREATE INDEX syntax#3988
mithuncy merged 12 commits intomainfrom
feat/sort-by-syntax

Conversation

@mithuncy
Copy link
Copy Markdown
Contributor

@mithuncy mithuncy commented Jan 22, 2026

What

Adds support for sort_by option in CREATE INDEX ... USING bm25 syntax to configure index-time sorting.

Syntax:

-- Sort by score descending
CREATE INDEX idx ON orders USING bm25 (id, customer_id, score)
WITH (
    key_field = 'id',
    sort_by = 'score DESC NULLS LAST'
);

-- Disable sorting (default behavior)
CREATE INDEX idx ON orders USING bm25 (id, customer_id, score)
WITH (
    key_field = 'id',
    sort_by = 'none'
);
Option Behavior
sort_by omitted No sorting (equivalent to 'none')
sort_by = 'none' Explicitly disable sorting
sort_by = 'field ASC NULLS FIRST' Sort ascending, NULLs first
sort_by = 'field DESC NULLS LAST' Sort descending, NULLs last

Constraints:

  • Single field only (multi-field not yet supported)
  • Field must be a fast field (numeric, date, or boolean)
  • NULLS ordering must be explicit and match Tantivy's behavior:
    • ASC requires NULLS FIRST
    • DESC requires NULLS LAST

Grammar:

sort_by_value  ::= 'none' | sort_key
sort_key       ::= field_name ( 'ASC' 'NULLS FIRST' | 'DESC' 'NULLS LAST' )
field_name     ::= identifier

Why

Part of M1 for #3984. Enables:

  1. TopN query acceleration - Early cutoff on sorted fast fields
  2. Join acceleration - Sorting by primary/foreign keys enables merge/zig-zag joins
  3. Query locality - Sorting by filtering/partitioning properties

Implementation

  • parse_sort_by_string() in options.rs - Parses and validates the sort_by syntax
  • build_sort_by_field() in schema/mod.rs - Validates field exists and is a fast field
  • Settings stored in MetaPage and used for both persistent and mutable segments
  • ALTER INDEX ... SET (sort_by='...') + REINDEX supported for changing sort order

Tests

  • Syntax validation (ASC/DESC, NULLS FIRST/LAST)
  • Single segment sorted fetch
  • Multi-segment sawtooth pattern
  • NULL value handling
  • Different field types (TIMESTAMP, REAL, BIGINT)
  • REINDEX behavior
  • Error cases (non-existent field, non-fast field, invalid syntax)

Closes #3984 (partial)

@mithuncy mithuncy added the Do Not Cherry Pick PR should not be cherry-picked to other branches label Jan 22, 2026
@mithuncy mithuncy force-pushed the feat/sort-by-syntax branch 3 times, most recently from 51998d0 to 85b7afe Compare January 24, 2026 16:18
@mithuncy mithuncy marked this pull request as ready for review January 24, 2026 18:27
Copy link
Copy Markdown
Collaborator

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot!

Comment thread pg_search/src/postgres/build.rs Outdated
Comment thread pg_search/src/postgres/options.rs
Comment thread pg_search/src/postgres/options.rs Outdated
Comment thread pg_search/tests/pg_regress/sql/sort_by.sql Outdated
Adds a new WITH clause option `sort_by` that configures how Tantivy
sorts documents within index segments. This enables optimizations for
queries with ORDER BY clauses matching the segment sort order.

Syntax:
  CREATE INDEX ... WITH (key_field='id', sort_by='field ASC|DESC [NULLS FIRST|LAST]')
  CREATE INDEX ... WITH (key_field='id', sort_by='none')  -- disable sorting
  CREATE INDEX ... WITH (key_field='id')  -- defaults to ctid ASC

Features:
- Parse and validate sort_by syntax during CREATE INDEX
- Pass sort configuration to Tantivy IndexSettings
- Validate sort field exists and is a fast field
- Support ASC/DESC direction and NULLS FIRST/LAST

Tests:
- Unit tests for parsing (18 tests in options.rs)
- Unit tests for Tantivy integration (7 tests in build.rs)
- pg_regress tests covering syntax, sorted fetch, sawtooth patterns,
  NULLS handling, default behavior, multi-field warning, and error cases

Closes #3984
Previously, data inserted AFTER index creation went into mutable segments
that used IndexSettings::default(), ignoring the sort_by configuration.
This caused mutable segment data to be returned in insertion order instead
of the configured sort order.

Changes:
- SerialIndexWriter::in_memory() now reads sort_by from rd_options
- build_sort_by_field() made pub(crate) for internal reuse
- Added Section 6 test to verify mutable segment sorting
Replace manual index loop with idiomatic iterator using skip pattern.
This is cleaner, less error-prone, and more Rust-idiomatic.
@mithuncy mithuncy force-pushed the feat/sort-by-syntax branch from 85b7afe to 15c03fb Compare January 26, 2026 14:34
NULLS ordering is now required to avoid confusion with PostgreSQL defaults.
Tantivy's NULL ordering is fixed and differs from PostgreSQL:
- ASC: only NULLS FIRST supported (PostgreSQL defaults to NULLS LAST)
- DESC: only NULLS LAST supported (PostgreSQL defaults to NULLS FIRST)

Valid syntax:
- sort_by='field ASC NULLS FIRST'
- sort_by='field DESC NULLS LAST'
- sort_by='none'

Invalid (will error):
- sort_by='field ASC' (missing NULLS)
- sort_by='field ASC NULLS LAST' (wrong NULLS for ASC)
- sort_by='field DESC NULLS FIRST' (wrong NULLS for DESC)
Comment thread pg_search/src/postgres/build.rs Outdated
Comment thread pg_search/src/postgres/build.rs Outdated
Comment thread pg_search/src/postgres/options.rs Outdated
Read IndexSettings from MetaPage instead of rd_options when creating
mutable segments. This ensures backward compatibility:

- Old indexes (before sort_by feature): sort_by_field = None
- New indexes with default: sort_by_field = Some(ctid ASC)
- New indexes with explicit sort_by: sort_by_field = Some(<field>)

The MetaPage is the source of truth for index settings. Mutable segments
are only materialized after the index is valid, so settings always exist.
Address PR review comments:
- Move multi-field validation to options.rs (parse_sort_by_string)
- Move build_sort_by_field from build.rs to SearchIndexSchema
- Update tests accordingly
Fall back to IndexSettings::default() when settings bytes are empty
(legacy indexes). Return descriptive error for invalid JSON instead
of panicking.
- Change default sort_by from 'ctid ASC' to None (no segment sorting)
- Simplify mutable segment IndexSettings to use rd_options instead of MetaPage
- Remove unused CTID_FIELD constant
- Combine redundant test_parse_sort_by_none tests
- Rename test to reflect explicit ctid sorting vs default
Copy link
Copy Markdown
Collaborator

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Shipit with comments.

Comment thread pg_search/src/postgres/options.rs Outdated
Comment thread pg_search/tests/pg_regress/sql/sort_by.sql
- Prune behavior tests, keeping only syntax validation (per PR review)
- Add tests for case insensitivity, ctid field, composite TYPE, pdb.alias
- Add error case tests for invalid syntax, unsupported combinations
- Reject whitespace-only sort_by with clear error message
@mithuncy mithuncy merged commit 7684acc into main Jan 28, 2026
18 checks passed
@mithuncy mithuncy deleted the feat/sort-by-syntax branch January 28, 2026 15:07
stuhood added a commit that referenced this pull request Jan 28, 2026
)

## What

This change swaps from the hash join execution method which was added in
#3930 to explicitly using DataFusion's hash join.

Future changes will introduce DataFusion's optimizer (by producing
logical nodes rather than physical nodes) so that it can take advantage
of the sorted segments which will be provided by #3988.

## Why

As described in #3930, the implementation there was explicitly
temporary. We will be leaning in to using DataFusion to execute columnar
joins.

## Tests

The regression tests and proptests pass, with one exception: numeric
columns cannot safely be pulled up from fast fields currently (see
#2968).

Additionally, improved `qgen`'s handling of panics.
stuhood added a commit that referenced this pull request Feb 2, 2026
# Ticket(s) Closed

- Closes #3987.

## What

This change migrates the `joinscan` from manual physical plan
construction to using DataFusion logical plans, which are then optimized
into physical plans.

To do so, it adds a TableProvider implementation to allow DataFusion to
natively scan and filter using the [previously extracted scan over fast
fields](#3981).

It additionally rearranges the code a bit to ensure separation between
planning and execution: execution state and DataFusion logical planning
are isolated to `scan_state.rs` (over time we might frontload and
serialize more of DataFusion's logical plan during planning time, but
for now none of that information is needed in the `CustomPath` that we
produce). `JoinSideInfo` was lifted up into `pg_search/src/scan` as
`ScanInfo`.

## Why

In followup changes, we will give DataFusion's optimizer a lot more to
work with:
* Sorted segments (via #3988)
* Predicate pushdown (see the TODO in `supports_filters_pushdown`), and
dynamic filtering
* Introducing use of parallel workers

## Tests

Existing tests pass, and were lightly expanded to cover some new edge
cases.

One join benchmark is marginally faster, one is massively slower, others
are unaffected. This is expected for now: more performance work is on
the way.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Do Not Cherry Pick PR should not be cherry-picked to other branches

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Sortedness: M1: Single column sorts

3 participants