feat: Add TopN Scan Validation#3816
Conversation
|
This failing test runs need to pass before we can review: https://github.com/paradedb/paradedb/actions/runs/20538587650/job/58999930420?pr=3816. You can ignore the SchemaBot CI failure |
|
yo sorry, thought that failed test was irrelevant for my changes, anyways looking into it rn |
|
@philippemnoel lgtm? |
Please rebase, your code shows 220 files changed. Then we'll need either Ming, Stu or Moe reviewing |
|
Sorry for the mishap, have updated it @philippemnoel |
Thanks. I'll update your branch, as we have now fixed SchemaBot to work for external contributors |
There was a problem hiding this comment.
This PR is being reviewed by Cursor Bugbot
Details
You are on the Bugbot Free tier. On this plan, Bugbot will review limited PRs each billing cycle.
To receive Bugbot reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.
stuhood
left a comment
There was a problem hiding this comment.
Thanks a lot! Just for comments, and then I think that we can merge this.
stuhood
left a comment
There was a problem hiding this comment.
These look great: thanks for the help!
I'll handle the rest of getting this landed: it looks like there is one regress test that gets a complete false positive, but the rest are accurate.
|
fixed issues with some .out files |
Not to worry: we'll take it from here. Thanks a lot! |
Add is_minmax_implicit_limit() to detect when LIMIT 1 is injected by PostgreSQL's MIN/MAX rewrite, preventing spurious warnings for users.
Add COSTS OFF to EXPLAIN statements and ANALYZE to setup. Add ORDER BY to INTERSECT query for consistent row ordering.
Ensures PostgreSQL has accurate statistics for all test tables, preventing plan variations across different environments.
# Ticket(s) Closed - Closes #3687 ## What Added `paradedb.check_topn_scan` GUC that emits a warning when a query with LIMIT should use TopN scan but falls back to a slower execution method. **Files changed:** - [pg_search/src/gucs.rs](file:///Users/vrn21/Developer/rust/paradedb/pg_search/src/gucs.rs) (+17 lines) — GUC definition - [pg_search/src/postgres/customscan/pdbscan/mod.rs](file:///Users/vrn21/Developer/rust/paradedb/pg_search/src/postgres/customscan/pdbscan/mod.rs) (+96 lines) — Validation logic - [pg_search/tests/pg_regress/sql/topn_validation.sql](file:///Users/vrn21/Developer/rust/paradedb/pg_search/tests/pg_regress/sql/topn_validation.sql) (+116 lines) — Tests ## Why ParadeDB silently falls back to slower execution methods when TopN cannot be used (missing `fast: true`, normalizer mismatch, too many ORDER BY columns). Queries work fine on small dev datasets but become 100-1000x slower in production with no warning. ## How Added [validate_topn_expectation()](file:///Users/vrn21/Developer/rust/paradedb/pg_search/src/postgres/customscan/pdbscan/mod.rs#1344-1445) function in [choose_exec_method()](file:///Users/vrn21/Developer/rust/paradedb/pg_search/src/postgres/customscan/pdbscan/mod.rs#1446-1488) that checks if TopN was expected (has LIMIT + search operator + no GROUP BY) but not chosen. If validation is enabled and TopN was missed, emits a warning with the reason. **Default**: `false` (opt-in) **Usage:** ```sql SET paradedb.check_topn_scan = true; SELECT * FROM products WHERE cat @@@ 'electronics' ORDER BY cat LIMIT 10; -- WARNING: Query has LIMIT 10 but is not using TopN scan (using FastFieldMixed instead). -- Reason: ORDER BY columns cannot be pushed down to the index. ``` ## Tests [pg_search/tests/pg_regress/sql/topn_validation.sql](file:///Users/vrn21/Developer/rust/paradedb/pg_search/tests/pg_regress/sql/topn_validation.sql) covers 8 scenarios: | Test | Scenario | Expected | |------|----------|----------| | 1 | Validation OFF, non-fast ORDER BY | No warning | | 2 | Validation ON, non-fast ORDER BY | WARNING | | 3 | Validation ON, valid TopN query | No warning | | 4 | Too many ORDER BY columns | WARNING | | 5a | ORDER BY lower() matching index | No warning | | 5b | ORDER BY not matching lower() index | WARNING | | 6 | Query without LIMIT | No validation | | 7 | EXPLAIN with validation | WARNING | | 8 | Disable validation mid-session | No warning | <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Introduces an opt-in validator to catch missed TopN optimizations on LIMIT queries using ParadeDB search. > > - New GUC `paradedb.check_topn_scan` (default `false`) in `pg_search/src/gucs.rs` > - Adds `validate_topn_expectation()` in `pdbscan/mod.rs`; invoked from `choose_exec_method()` to log a warning when TopN is expected (LIMIT + search + no GROUP BY) but `ExecMethodType` is not `TopN`, with reasons derived from `PathKeyInfo` (e.g., unusable pathkeys, prefix-only, >`MAX_TOPN_FEATURES`, normalizer mismatch) > - Minor refactor to choose method then validate before returning > - Regression tests `tests/pg_regress/sql/topn_validation.sql` (+ expected output) cover non-fast ORDER BY, valid TopN, too many ORDER BY cols, `lower()` normalizer mismatch, no LIMIT, EXPLAIN logging, and toggling the GUC > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 9b41430. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Co-authored-by: Stu Hood <[email protected]> Co-authored-by: Mithun Chicklore Yogendra <[email protected]>
Ticket(s) Closed
What
Added
paradedb.check_topn_scanGUC that emits a warning when a query with LIMIT should use TopN scan but falls back to a slower execution method.Files changed:
Why
ParadeDB silently falls back to slower execution methods when TopN cannot be used (missing
fast: true, normalizer mismatch, too many ORDER BY columns). Queries work fine on small dev datasets but become 100-1000x slower in production with no warning.How
Added validate_topn_expectation() function in choose_exec_method() that checks if TopN was expected (has LIMIT + search operator + no GROUP BY) but not chosen. If validation is enabled and TopN was missed, emits a warning with the reason.
Default:
false(opt-in)Usage:
Tests
pg_search/tests/pg_regress/sql/topn_validation.sql covers 8 scenarios:
Note
Introduces an opt-in validator to catch missed TopN optimizations on LIMIT queries using ParadeDB search.
paradedb.check_topn_scan(defaultfalse) inpg_search/src/gucs.rsvalidate_topn_expectation()inpdbscan/mod.rs; invoked fromchoose_exec_method()to log a warning when TopN is expected (LIMIT + search + no GROUP BY) butExecMethodTypeis notTopN, with reasons derived fromPathKeyInfo(e.g., unusable pathkeys, prefix-only, >MAX_TOPN_FEATURES, normalizer mismatch)tests/pg_regress/sql/topn_validation.sql(+ expected output) cover non-fast ORDER BY, valid TopN, too many ORDER BY cols,lower()normalizer mismatch, no LIMIT, EXPLAIN logging, and toggling the GUCWritten by Cursor Bugbot for commit 9b41430. This will update automatically on new commits. Configure here.