Skip to content

feat: Add TopN Scan Validation#3816

Merged
stuhood merged 11 commits intoparadedb:mainfrom
vrn21:main
Jan 16, 2026
Merged

feat: Add TopN Scan Validation#3816
stuhood merged 11 commits intoparadedb:mainfrom
vrn21:main

Conversation

@vrn21
Copy link
Copy Markdown
Contributor

@vrn21 vrn21 commented Dec 27, 2025

Ticket(s) Closed

What

Added paradedb.check_topn_scan GUC that emits a warning when a query with LIMIT should use TopN scan but falls back to a slower execution method.

Files changed:

  • pg_search/src/gucs.rs (+17 lines) — GUC definition
  • pg_search/src/postgres/customscan/pdbscan/mod.rs (+96 lines) — Validation logic
  • pg_search/tests/pg_regress/sql/topn_validation.sql (+116 lines) — Tests

Why

ParadeDB silently falls back to slower execution methods when TopN cannot be used (missing fast: true, normalizer mismatch, too many ORDER BY columns). Queries work fine on small dev datasets but become 100-1000x slower in production with no warning.

How

Added validate_topn_expectation() function in choose_exec_method() that checks if TopN was expected (has LIMIT + search operator + no GROUP BY) but not chosen. If validation is enabled and TopN was missed, emits a warning with the reason.

Default: false (opt-in)

Usage:

SET paradedb.check_topn_scan = true;

SELECT * FROM products WHERE cat @@@ 'electronics' ORDER BY cat LIMIT 10;
-- WARNING: Query has LIMIT 10 but is not using TopN scan (using FastFieldMixed instead).
-- Reason: ORDER BY columns cannot be pushed down to the index.

Tests

pg_search/tests/pg_regress/sql/topn_validation.sql covers 8 scenarios:

Test Scenario Expected
1 Validation OFF, non-fast ORDER BY No warning
2 Validation ON, non-fast ORDER BY WARNING
3 Validation ON, valid TopN query No warning
4 Too many ORDER BY columns WARNING
5a ORDER BY lower() matching index No warning
5b ORDER BY not matching lower() index WARNING
6 Query without LIMIT No validation
7 EXPLAIN with validation WARNING
8 Disable validation mid-session No warning

Note

Introduces an opt-in validator to catch missed TopN optimizations on LIMIT queries using ParadeDB search.

  • New GUC paradedb.check_topn_scan (default false) in pg_search/src/gucs.rs
  • Adds validate_topn_expectation() in pdbscan/mod.rs; invoked from choose_exec_method() to log a warning when TopN is expected (LIMIT + search + no GROUP BY) but ExecMethodType is not TopN, with reasons derived from PathKeyInfo (e.g., unusable pathkeys, prefix-only, >MAX_TOPN_FEATURES, normalizer mismatch)
  • Minor refactor to choose method then validate before returning
  • Regression tests tests/pg_regress/sql/topn_validation.sql (+ expected output) cover non-fast ORDER BY, valid TopN, too many ORDER BY cols, lower() normalizer mismatch, no LIMIT, EXPLAIN logging, and toggling the GUC

Written by Cursor Bugbot for commit 9b41430. This will update automatically on new commits. Configure here.

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Dec 27, 2025

CLA assistant check
All committers have signed the CLA.

@philippemnoel philippemnoel added the Do Not Cherry Pick PR should not be cherry-picked to other branches label Dec 31, 2025
@philippemnoel
Copy link
Copy Markdown
Member

This failing test runs need to pass before we can review: https://github.com/paradedb/paradedb/actions/runs/20538587650/job/58999930420?pr=3816. You can ignore the SchemaBot CI failure

@vrn21
Copy link
Copy Markdown
Contributor Author

vrn21 commented Dec 31, 2025

yo sorry, thought that failed test was irrelevant for my changes, anyways looking into it rn

@vrn21 vrn21 requested a review from philippemnoel as a code owner December 31, 2025 15:59
@vrn21
Copy link
Copy Markdown
Contributor Author

vrn21 commented Dec 31, 2025

@philippemnoel lgtm?

@philippemnoel
Copy link
Copy Markdown
Member

@philippemnoel lgtm?

Please rebase, your code shows 220 files changed. Then we'll need either Ming, Stu or Moe reviewing

@vrn21
Copy link
Copy Markdown
Contributor Author

vrn21 commented Jan 1, 2026

Sorry for the mishap, have updated it @philippemnoel

@philippemnoel
Copy link
Copy Markdown
Member

philippemnoel commented Jan 2, 2026

Sorry for the mishap, have updated it @philippemnoel

Thanks. I'll update your branch, as we have now fixed SchemaBot to work for external contributors

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is being reviewed by Cursor Bugbot

Details

You are on the Bugbot Free tier. On this plan, Bugbot will review limited PRs each billing cycle.

To receive Bugbot reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.

Comment thread pg_search/src/postgres/customscan/pdbscan/mod.rs Outdated
Copy link
Copy Markdown
Collaborator

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot! Just for comments, and then I think that we can merge this.

Comment thread pg_search/src/postgres/customscan/pdbscan/mod.rs
Comment thread pg_search/src/postgres/customscan/pdbscan/mod.rs Outdated
Comment thread pg_search/src/postgres/customscan/pdbscan/mod.rs Outdated
Comment thread pg_search/src/postgres/customscan/pdbscan/mod.rs Outdated
Comment thread pg_search/src/gucs.rs Outdated
Comment thread pg_search/src/postgres/customscan/pdbscan/mod.rs Outdated
Copy link
Copy Markdown
Collaborator

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These look great: thanks for the help!

I'll handle the rest of getting this landed: it looks like there is one regress test that gets a complete false positive, but the rest are accurate.

Comment thread pg_search/tests/pg_regress/expected/fast_fields_options.out
Comment thread pg_search/tests/pg_regress/expected/fast_fields_options.out
Comment thread pg_search/tests/pg_regress/expected/issue_2688.out
Comment thread pg_search/tests/pg_regress/expected/issue_2932.out
Comment thread pg_search/tests/pg_regress/expected/json_aggregate.out
Comment thread pg_search/tests/pg_regress/expected/top_n_scan.out
Comment thread pg_search/tests/pg_regress/expected/fast_fields_options.out
Comment thread pg_search/tests/pg_regress/expected/fast_fields_options.out
@vrn21
Copy link
Copy Markdown
Contributor Author

vrn21 commented Jan 7, 2026

fixed issues with some .out files

@stuhood
Copy link
Copy Markdown
Collaborator

stuhood commented Jan 7, 2026

fixed issues with some .out files

Not to worry: we'll take it from here. Thanks a lot!

@stuhood stuhood assigned mdashti and mithuncy and unassigned mdashti Jan 7, 2026
vrn21 and others added 3 commits January 16, 2026 00:06
Add is_minmax_implicit_limit() to detect when LIMIT 1 is injected by
PostgreSQL's MIN/MAX rewrite, preventing spurious warnings for users.
Copy link
Copy Markdown
Collaborator

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Add COSTS OFF to EXPLAIN statements and ANALYZE to setup.
Add ORDER BY to INTERSECT query for consistent row ordering.
Ensures PostgreSQL has accurate statistics for all test tables,
preventing plan variations across different environments.
@stuhood stuhood merged commit 462740b into paradedb:main Jan 16, 2026
19 checks passed
mdashti pushed a commit that referenced this pull request Feb 13, 2026
# Ticket(s) Closed

- Closes #3687

## What

Added `paradedb.check_topn_scan` GUC that emits a warning when a query
with LIMIT should use TopN scan but falls back to a slower execution
method.

**Files changed:**
-
[pg_search/src/gucs.rs](file:///Users/vrn21/Developer/rust/paradedb/pg_search/src/gucs.rs)
(+17 lines) — GUC definition
-
[pg_search/src/postgres/customscan/pdbscan/mod.rs](file:///Users/vrn21/Developer/rust/paradedb/pg_search/src/postgres/customscan/pdbscan/mod.rs)
(+96 lines) — Validation logic
-
[pg_search/tests/pg_regress/sql/topn_validation.sql](file:///Users/vrn21/Developer/rust/paradedb/pg_search/tests/pg_regress/sql/topn_validation.sql)
(+116 lines) — Tests

## Why

ParadeDB silently falls back to slower execution methods when TopN
cannot be used (missing `fast: true`, normalizer mismatch, too many
ORDER BY columns). Queries work fine on small dev datasets but become
100-1000x slower in production with no warning.

## How

Added
[validate_topn_expectation()](file:///Users/vrn21/Developer/rust/paradedb/pg_search/src/postgres/customscan/pdbscan/mod.rs#1344-1445)
function in
[choose_exec_method()](file:///Users/vrn21/Developer/rust/paradedb/pg_search/src/postgres/customscan/pdbscan/mod.rs#1446-1488)
that checks if TopN was expected (has LIMIT + search operator + no GROUP
BY) but not chosen. If validation is enabled and TopN was missed, emits
a warning with the reason.

**Default**: `false` (opt-in)

**Usage:**
```sql
SET paradedb.check_topn_scan = true;

SELECT * FROM products WHERE cat @@@ 'electronics' ORDER BY cat LIMIT 10;
-- WARNING: Query has LIMIT 10 but is not using TopN scan (using FastFieldMixed instead).
-- Reason: ORDER BY columns cannot be pushed down to the index.
```

## Tests


[pg_search/tests/pg_regress/sql/topn_validation.sql](file:///Users/vrn21/Developer/rust/paradedb/pg_search/tests/pg_regress/sql/topn_validation.sql)
covers 8 scenarios:

| Test | Scenario | Expected |
|------|----------|----------|
| 1 | Validation OFF, non-fast ORDER BY | No warning |
| 2 | Validation ON, non-fast ORDER BY | WARNING |
| 3 | Validation ON, valid TopN query | No warning |
| 4 | Too many ORDER BY columns | WARNING |
| 5a | ORDER BY lower() matching index | No warning |
| 5b | ORDER BY not matching lower() index | WARNING |
| 6 | Query without LIMIT | No validation |
| 7 | EXPLAIN with validation | WARNING |
| 8 | Disable validation mid-session | No warning |

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Introduces an opt-in validator to catch missed TopN optimizations on
LIMIT queries using ParadeDB search.
> 
> - New GUC `paradedb.check_topn_scan` (default `false`) in
`pg_search/src/gucs.rs`
> - Adds `validate_topn_expectation()` in `pdbscan/mod.rs`; invoked from
`choose_exec_method()` to log a warning when TopN is expected (LIMIT +
search + no GROUP BY) but `ExecMethodType` is not `TopN`, with reasons
derived from `PathKeyInfo` (e.g., unusable pathkeys, prefix-only,
>`MAX_TOPN_FEATURES`, normalizer mismatch)
> - Minor refactor to choose method then validate before returning
> - Regression tests `tests/pg_regress/sql/topn_validation.sql` (+
expected output) cover non-fast ORDER BY, valid TopN, too many ORDER BY
cols, `lower()` normalizer mismatch, no LIMIT, EXPLAIN logging, and
toggling the GUC
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
9b41430. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Stu Hood <[email protected]>
Co-authored-by: Mithun Chicklore Yogendra <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Do Not Cherry Pick PR should not be cherry-picked to other branches

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make it harder to accidentally miss a TopN scan

6 participants