feat: Add composite type support for BM25 indexes (>32 fields) #3776

Merged
mithuncy merged 17 commits into main from feat/composite-types-support on Jan 9, 2026

Contributor

@mithuncy mithuncy commented Dec 15, 2025

Summary

Fixes #3686

  • Adds support for indexing more than 32 columns using composite types with ROW(...)::type expressions
  • Composite fields are searchable individually by their field names from the type definition

Changes

Core Module (pg_search/src/postgres/composite.rs)

  • CompositeSlotValues - Unpacks composite values upfront during construction for field extraction
  • CompositeFieldInfo - Metadata struct for composite type fields
  • CompositeError - Validation errors for nested composites, anonymous ROW, and domain types
  • Helper functions: is_composite_type, get_composite_type_fields, get_composite_fields_for_index
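
The three rejection cases named for CompositeError can be illustrated with a small standalone sketch. This is not the code from composite.rs: `FieldKind`, `CompositeCheckError`, and `check_composite` are hypothetical names, the type information is modelled with plain Rust values rather than Postgres catalog lookups, and the domain-type rule is assumed here to apply to the types of individual fields.

```rust
/// Hypothetical, simplified classification of a composite field's type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FieldKind {
    Scalar,    // e.g. TEXT or INT
    Composite, // another composite type nested as a field
    Domain,    // a domain defined over some base type
}

/// Hypothetical error type mirroring the three cases listed in the PR summary.
#[derive(Debug, PartialEq)]
enum CompositeCheckError {
    AnonymousRow,
    NestedComposite { field: String },
    DomainType { field: String },
}

/// Returns the number of indexable fields, or the first validation error.
/// `type_name` is `None` for a bare `ROW(...)` with no `::type` cast.
fn check_composite(
    type_name: Option<&str>,
    fields: &[(&str, FieldKind)],
) -> Result<usize, CompositeCheckError> {
    if type_name.is_none() {
        return Err(CompositeCheckError::AnonymousRow);
    }
    for (name, kind) in fields {
        match kind {
            FieldKind::Composite => {
                return Err(CompositeCheckError::NestedComposite { field: name.to_string() })
            }
            FieldKind::Domain => {
                return Err(CompositeCheckError::DomainType { field: name.to_string() })
            }
            FieldKind::Scalar => {}
        }
    }
    Ok(fields.len())
}
```

Failing at index creation time rather than at insert time keeps the error close to the DDL statement that caused it.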

Field Matching (pg_search/src/api/operator.rs)

  • expr_matches_node - New helper function for matching WHERE clause expressions against indexed expressions
  • Extended field_name_from_node to detect composite type fields and match them by field name

Integration (pg_search/src/postgres/utils.rs)

  • FieldSource::CompositeField variant for composite-derived fields
  • get_field_value helper to extract individual fields from unpacked composites
  • Extended extract_field_attributes to detect and expand composite expressions
  • Duplicate field name detection across composites and regular columns
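
As a rough illustration of the duplicate check in the last bullet, the sketch below walks a list of indexed expressions and reports the first field name that appears twice. `IndexedExpr` and `find_duplicate_field` are hypothetical names chosen for this example; the real logic in utils.rs works on Postgres index metadata and may differ.

```rust
use std::collections::HashSet;

/// Hypothetical model of one entry in the index definition.
enum IndexedExpr {
    /// A regular column, searchable under its own name.
    Column(String),
    /// A `ROW(...)::type` expression, searchable under the type's field names.
    Composite(Vec<String>),
}

/// Returns the first field name that is contributed more than once,
/// whether by two columns, two composites, or a column and a composite.
fn find_duplicate_field(exprs: &[IndexedExpr]) -> Option<String> {
    let mut seen: HashSet<&str> = HashSet::new();
    for expr in exprs {
        let names: Vec<&str> = match expr {
            IndexedExpr::Column(name) => vec![name.as_str()],
            IndexedExpr::Composite(fields) => fields.iter().map(String::as_str).collect(),
        };
        for name in names {
            // `insert` returns false when the name was already present.
            if !seen.insert(name) {
                return Some(name.to_string());
            }
        }
    }
    None
}
```

With this model, indexing `id` plus a composite with fields `name` and `description` passes, while indexing a plain `name` column next to a composite that also has a `name` field is reported as a clash.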

Parallel Build & Insert

  • build_parallel.rs and insert.rs updated to use CompositeSlotValues
  • mvcc.rs updated for MVCC-aware composite unpacking

Usage Example

-- Create a composite type with the fields you want to index
CREATE TYPE product_search AS (name TEXT, description TEXT, category TEXT);

-- Create table with regular columns
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name TEXT,
    description TEXT,
    category TEXT
);

-- Create BM25 index using composite type
CREATE INDEX idx_products ON products USING bm25 (
    id,
    (ROW(name, description, category)::product_search)
) WITH (key_field = 'id');

-- Search by individual field names from the composite type
SELECT * FROM products WHERE name @@@ 'Widget';
SELECT * FROM products WHERE description @@@ 'amazing';

-- Or with pdb functions for more control
SELECT * FROM products WHERE name @@@ pdb.term('Widget');
SELECT * FROM products WHERE name @@@ pdb.fuzzy_term('Widgt', distance => 1);
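
For tables that are actually wider than 32 columns, writing the type and index definitions by hand gets tedious. The helper below is a minimal sketch that generates the same two statements as the example above from a list of column names. `composite_index_ddl` is a hypothetical helper, not part of pg_search; it assumes every column is TEXT and does no identifier quoting, so it is only suitable for trusted, simple column names.

```rust
/// Builds the `CREATE TYPE` and `CREATE INDEX` statements for indexing
/// `columns` through a single composite type. All columns are assumed
/// to be TEXT (an assumption made for this sketch only).
fn composite_index_ddl(
    table: &str,
    key_field: &str,
    type_name: &str,
    columns: &[&str],
) -> (String, String) {
    // One "col TEXT" entry per column for the composite type definition.
    let type_fields: Vec<String> = columns.iter().map(|c| format!("{} TEXT", c)).collect();
    let create_type = format!(
        "CREATE TYPE {ty} AS ({fields});",
        ty = type_name,
        fields = type_fields.join(", ")
    );
    // The index lists the key field plus one ROW(...)::type expression.
    let create_index = format!(
        "CREATE INDEX idx_{t} ON {t} USING bm25 ({k}, (ROW({cols})::{ty})) WITH (key_field = '{k}');",
        t = table,
        k = key_field,
        cols = columns.join(", "),
        ty = type_name
    );
    (create_type, create_index)
}
```

Called with `("products", "id", "product_search", &["name", "description", "category"])`, it reproduces the statements shown in the usage example, reformatted onto one line each.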

Test plan

  • 47 pg_regress test sections covering:
    • composite.sql (39 test sections): Basic indexing, >32 fields, 100 fields, JSON/array fields, tokenizers, error cases, parallel builds, MVCC, and more
    • composite_advanced.sql (8 test sections): Field-level queries with field @@@ query syntax, pdb functions, scoring, snippets
  • cargo fmt and cargo clippy clean

@mithuncy mithuncy added the "Do Not Cherry Pick" label (PR should not be cherry-picked to other branches) on Dec 15, 2025
@mithuncy mithuncy force-pushed the feat/composite-types-support branch 2 times, most recently from d6aca04 to 3071050 on December 15, 2025 05:57
Support indexing >32 columns using ROW(...)::type expressions with
field-level search granularity via composite.rs module and tests.
@mithuncy mithuncy force-pushed the feat/composite-types-support branch from 3071050 to be575d6 on December 15, 2025 09:44
Outdated comment threads: pg_search/src/postgres/composite.rs (7), pg_search/src/postgres/insert.rs (1)
Move all #[pg_test] composite tests to pg_regress golden tests using v2 query syntax (paradedb.parse, paradedb.boolean). Remove implementation-dependent assertions and update expected outputs.
Outdated comment threads: pg_search/tests/pg_regress/sql/composite.sql (2), pg_search/tests/pg_regress/sql/composite_advanced.sql (1)
Extract field name from composite type definition when column is on LHS of @@@ operator. Update composite tests to use v2 pdb APIs with EXPLAIN and TopN queries.
@mithuncy mithuncy force-pushed the feat/composite-types-support branch 2 times, most recently from 64df5e2 to 1bd5189 on December 17, 2025 23:10
@mithuncy mithuncy force-pushed the feat/composite-types-support branch from 1bd5189 to b2b77cc on December 17, 2025 23:42
Outdated comment threads: pg_search/src/api/operator.rs (1), pg_search/src/postgres/composite.rs (4), pg_search/src/postgres/utils.rs (2), tests/tests/composite.rs (1)
Comment threads: pg_search/tests/pg_regress/sql/composite.sql (2)
- Use .is_some() instead of if let Some(_) for nodecast! check
- Use crate::api::HashMap instead of std::collections::HashMap
- Remove self-explanatory comments
- Replace on-demand caching with upfront unpacking via from_composites()
- Add collect_composites_for_unpacking() helper in utils.rs
- Simplify get_field_value() to use immutable reference and simple lookup
- Update all callers: insert.rs, build_parallel.rs, mvcc.rs
- Remove integration tests (covered by pg_regress tests)
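
The move from on-demand caching to upfront unpacking described in this commit can be pictured with the following sketch. It is a simplified model, not the code from composite.rs or utils.rs: `UnpackedRow` and its methods are hypothetical names, and field values are modelled as optional strings instead of Postgres datums.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a row whose composite value has already
/// been split into named fields. `None` models a SQL NULL.
struct UnpackedRow {
    values: HashMap<String, Option<String>>,
}

impl UnpackedRow {
    /// Unpacks every field once, at construction time.
    fn from_composite(field_names: &[&str], datums: Vec<Option<String>>) -> Self {
        let values = field_names
            .iter()
            .map(|name| name.to_string())
            .zip(datums)
            .collect();
        Self { values }
    }

    /// Plain lookup through an immutable reference. No cache is filled
    /// lazily here, so `&self` is enough and callers can share the row.
    fn get_field_value(&self, name: &str) -> Option<&str> {
        self.values.get(name).and_then(|value| value.as_deref())
    }
}
```

Because all the work happens in the constructor, the lookup never needs `&mut self`, which is the property the commit message highlights for `get_field_value()`.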
@mithuncy mithuncy force-pushed the feat/composite-types-support branch from ab8f6b2 to 74ac551 on December 18, 2025 22:00
@mithuncy mithuncy requested a review from rebasedming December 18, 2025 22:01
…hing

Bug: Refactoring moved expr_no += 1 outside the tokenizer check and
removed the early return None for non-tokenizable types.
Resolved conflicts:
- pg_search/src/api/operator.rs: Keep composite type support, add type_is_alias guard to expr_matches_node (#3760 fix)
Collaborator

@stuhood stuhood left a comment

This looks really, really clean. Thanks a lot @mithuncy!

Comment thread pg_search/src/postgres/utils.rs Outdated
Comment thread pg_search/src/postgres/composite.rs
Avoid intermediate Vec allocation; clarify safety contract for lazy pointer dereferencing.
@mithuncy mithuncy added the "benchmark-stressgres" label (Add this label to a PR to request that the Stressgres benchmarks run. Automatically removed.) on Jan 9, 2026
@github-actions github-actions bot removed the "benchmark-stressgres" label on Jan 9, 2026
Contributor

@paradedb-bot paradedb-bot left a comment

pg_search single-server.toml Performance - TPS

| Benchmark suite | Current: 1d01c3b | Previous: ac76a85 | Ratio |
|---|---|---|---|
| Custom Scan - Primary - tps | 581.6928221528699 median tps | 541.6051734873952 median tps | 0.93 |
| Delete values - Primary - tps | 3088.6080709462467 median tps | 3010.4362011538165 median tps | 0.97 |
| Index Only Scan - Primary - tps | 594.8230011332421 median tps | 622.8590893387328 median tps | 1.05 |
| Index Scan - Primary - tps | 501.01156127041315 median tps | 452.78617553248864 median tps | 0.90 |
| Insert value - Primary - tps | 3328.9979710095527 median tps | 3201.977317603253 median tps | 0.96 |
| Update random values - Primary - tps | 2145.9159451854002 median tps | 2089.9038650329157 median tps | 0.97 |
| Vacuum - Primary - tps | 156.04670488067885 median tps | 111.76418077854224 median tps | 0.72 |

This comment was automatically generated by workflow using github-action-benchmark.

Contributor

@paradedb-bot paradedb-bot left a comment

pg_search single-server.toml Performance - Other Metrics

| Benchmark suite | Current: 1d01c3b | Previous: ac76a85 | Ratio |
|---|---|---|---|
| Custom Scan - Primary - cpu | 4.6647234 median cpu | 4.678363 median cpu | 1.00 |
| Custom Scan - Primary - mem | 57.712890625 median mem | 57.34375 median mem | 1.01 |
| Delete values - Primary - cpu | 4.655674 median cpu | 4.669261 median cpu | 1.00 |
| Delete values - Primary - mem | 33.1796875 median mem | 33.5390625 median mem | 0.99 |
| Index Only Scan - Primary - cpu | 4.6647234 median cpu | 4.669261 median cpu | 1.00 |
| Index Only Scan - Primary - mem | 58.234375 median mem | 57.796875 median mem | 1.01 |
| Index Scan - Primary - cpu | 4.64666 median cpu | 4.655674 median cpu | 1.00 |
| Index Scan - Primary - mem | 57.43359375 median mem | 57.203125 median mem | 1.00 |
| Insert value - Primary - cpu | 4.660194 median cpu | 4.6647234 median cpu | 1.00 |
| Insert value - Primary - mem | 45.8046875 median mem | 45.85546875 median mem | 1.00 |
| Monitor Index Size - Primary - block_count | 1767 median block_count | 1702 median block_count | 1.04 |
| Monitor Index Size - Primary - segment_count | 12 median segment_count | 7 median segment_count | 1.71 |
| Update random values - Primary - cpu | 4.6511626 median cpu | 4.6875 median cpu | 0.99 |
| Update random values - Primary - mem | 48.421875 median mem | 48.48828125 median mem | 1.00 |
| Vacuum - Primary - cpu | 0 median cpu | 4.673807 median cpu | 0 |
| Vacuum - Primary - mem | 49.8515625 median mem | 50.8125 median mem | 0.98 |

Contributor

@paradedb-bot paradedb-bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'pg_search single-server.toml Performance - Other Metrics'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.10.

| Benchmark suite | Current: 1d01c3b | Previous: ac76a85 | Ratio |
|---|---|---|---|
| Monitor Index Size - Primary - segment_count | 12 median segment_count | 7 median segment_count | 1.71 |

CC: @mithuncy
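
For readers checking the alert by hand, the sketch below reproduces the arithmetic. The direction of each ratio is inferred from the numbers posted in this thread: the TPS rows match previous / current, and the size and resource rows match current / previous, so a ratio above 1 always means "worse". The function names are hypothetical, and 1.10 is the threshold quoted in the alert.

```rust
/// Ratio for metrics where a larger value is better (e.g. tps).
/// Inferred from the tables: 541.6051734873952 / 581.6928221528699 rounds to 0.93.
fn ratio_bigger_is_better(current: f64, previous: f64) -> f64 {
    previous / current
}

/// Ratio for metrics where a smaller value is better (e.g. segment_count).
/// Inferred from the tables: 12 / 7 rounds to 1.71.
fn ratio_smaller_is_better(current: f64, previous: f64) -> f64 {
    current / previous
}

/// An alert fires when the ratio exceeds the configured threshold.
fn exceeds_threshold(ratio: f64, threshold: f64) -> bool {
    ratio > threshold
}
```

With the values from the table above, `ratio_smaller_is_better(12.0, 7.0)` is about 1.71, which exceeds 1.10 and triggers the alert, while the Custom Scan TPS ratio of about 0.93 does not.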

@paradedb-bot
Contributor

single-server result: stressgres-single-server-9e4496d9d9963576177da569ba390b8674e9d0e3.png

Contributor

@paradedb-bot paradedb-bot left a comment

pg_search bulk-updates.toml Performance - TPS

| Benchmark suite | Current: 1d01c3b | Previous: ac76a85 | Ratio |
|---|---|---|---|
| Bulk Update - Primary - tps | 7.4167670507693435 median tps | 7.440606065943415 median tps | 1.00 |
| Count Query - Primary - tps | 5.481560146066513 median tps | 5.430168528742451 median tps | 0.99 |

Contributor

@paradedb-bot paradedb-bot left a comment

pg_search bulk-updates.toml Performance - Other Metrics

| Benchmark suite | Current: 1d01c3b | Previous: ac76a85 | Ratio |
|---|---|---|---|
| Bulk Update - Primary - cpu | 23.233301 median cpu | 23.166023 median cpu | 1.00 |
| Bulk Update - Primary - mem | 232.51171875 median mem | 232.1953125 median mem | 1.00 |
| Count Query - Primary - cpu | 23.323614 median cpu | 23.233301 median cpu | 1.00 |
| Count Query - Primary - mem | 172.30078125 median mem | 171.9921875 median mem | 1.00 |
| Monitor Index Size - Primary - block_count | 48506 median block_count | 48783 median block_count | 0.99 |
| Monitor Index Size - Primary - segment_count | 88 median segment_count | 89 median segment_count | 0.99 |

@paradedb-bot
Contributor

bulk-updates result: stressgres-bulk-updates-9e4496d9d9963576177da569ba390b8674e9d0e3.png

Contributor

@paradedb-bot paradedb-bot left a comment

pg_search wide-table.toml Performance - TPS

| Benchmark suite | Current: 1d01c3b | Previous: ac76a85 | Ratio |
|---|---|---|---|
| Bulk Update - Primary - tps | 1089.4493537883277 median tps | 1129.8923655292951 median tps | 1.04 |
| Single Insert - Primary - tps | 1219.6374030035308 median tps | 1229.8000758847982 median tps | 1.01 |
| Single Update - Primary - tps | 1826.849307784978 median tps | 1926.2613579019683 median tps | 1.05 |
| Top N - Primary - tps | 5.3612485573338535 median tps | 5.725964068646208 median tps | 1.07 |

Contributor

@paradedb-bot paradedb-bot left a comment

pg_search wide-table.toml Performance - Other Metrics

| Benchmark suite | Current: 1d01c3b | Previous: ac76a85 | Ratio |
|---|---|---|---|
| Background Merger - Primary - background_merging | 0 median background_merging | 0 median background_merging | 1 |
| Background Merger - Primary - cpu | 4.660194 median cpu | 4.660194 median cpu | 1 |
| Background Merger - Primary - mem | 22.77734375 median mem | 22.90625 median mem | 0.99 |
| Bulk Update - Primary - cpu | 4.660194 median cpu | 4.669261 median cpu | 1.00 |
| Bulk Update - Primary - mem | 165.8359375 median mem | 165.80859375 median mem | 1.00 |
| Monitor Index Size - Primary - block_count | 66871 median block_count | 64565 median block_count | 1.04 |
| Monitor Index Size - Primary - segment_count | 46 median segment_count | 47 median segment_count | 0.98 |
| Single Insert - Primary - cpu | 4.6647234 median cpu | 4.655674 median cpu | 1.00 |
| Single Insert - Primary - mem | 120.6796875 median mem | 123.1640625 median mem | 0.98 |
| Single Update - Primary - cpu | 4.660194 median cpu | 4.660194 median cpu | 1 |
| Single Update - Primary - mem | 165.29296875 median mem | 165.23828125 median mem | 1.00 |
| Top N - Primary - cpu | 23.369036 median cpu | 23.346306 median cpu | 1.00 |
| Top N - Primary - mem | 160.04296875 median mem | 160.0078125 median mem | 1.00 |

@paradedb-bot
Contributor

wide-table result: stressgres-wide-table-9e4496d9d9963576177da569ba390b8674e9d0e3.png

Contributor

@paradedb-bot paradedb-bot left a comment

pg_search background-merge.toml Performance - TPS

| Benchmark suite | Current: 1d01c3b | Previous: ac76a85 | Ratio |
|---|---|---|---|
| Custom scan - Primary - tps | 32.3171187074876 median tps | 32.52606751548291 median tps | 1.01 |
| Delete value - Primary - tps | 237.33447161497622 median tps | 238.48276200379138 median tps | 1.00 |
| Insert value - Primary - tps | 1902.59970177451 median tps | 1868.3480100972972 median tps | 0.98 |
| Update random values - Primary - tps | 154.4303790030059 median tps | 153.8506542306622 median tps | 1.00 |
| Vacuum - Primary - tps | 14.939362952411155 median tps | 14.697187851368014 median tps | 0.98 |

Contributor

@paradedb-bot paradedb-bot left a comment

pg_search background-merge.toml Performance - Other Metrics

| Benchmark suite | Current: 1d01c3b | Previous: ac76a85 | Ratio |
|---|---|---|---|
| Custom scan - Primary - cpu | 18.60465 median cpu | 18.60465 median cpu | 1 |
| Custom scan - Primary - mem | 163.08984375 median mem | 151.4453125 median mem | 1.08 |
| Delete value - Primary - cpu | 4.6511626 median cpu | 4.6511626 median cpu | 1 |
| Delete value - Primary - mem | 117.12109375 median mem | 118.33984375 median mem | 0.99 |
| Insert value - Primary - cpu | 4.6421666 median cpu | 4.6421666 median cpu | 1 |
| Insert value - Primary - mem | 124.9140625 median mem | 125.390625 median mem | 1.00 |
| Monitor Segment Count - Primary - block_count | 14031 median block_count | 14027 median block_count | 1.00 |
| Monitor Segment Count - Primary - cpu | 4.6376815 median cpu | 4.6332045 median cpu | 1.00 |
| Monitor Segment Count - Primary - mem | 99.01171875 median mem | 98.64453125 median mem | 1.00 |
| Monitor Segment Count - Primary - segment_count | 26 median segment_count | 26 median segment_count | 1 |
| Update random values - Primary - cpu | 9.248554 median cpu | 9.239654 median cpu | 1.00 |
| Update random values - Primary - mem | 159.5390625 median mem | 150.65625 median mem | 1.06 |
| Vacuum - Primary - cpu | 13.859479 median cpu | 13.913043 median cpu | 1.00 |
| Vacuum - Primary - mem | 171.921875 median mem | 171.8671875 median mem | 1.00 |

@paradedb-bot
Contributor

background-merge result: stressgres-background-merge-9e4496d9d9963576177da569ba390b8674e9d0e3.png

Contributor

@paradedb-bot paradedb-bot left a comment

pg_search logical-replication.toml Performance - TPS

| Benchmark suite | Current: 1d01c3b | Previous: ac76a85 | Ratio |
|---|---|---|---|
| Custom Scan - Subscriber - tps | 558.4758186580741 median tps | 546.2631357068778 median tps | 0.98 |
| Index Only Scan - Subscriber - tps | 620.9670331744552 median tps | 660.1542280056016 median tps | 1.06 |
| Parallel Custom Scan - Subscriber - tps | 85.68014104349551 median tps | 86.85447420211887 median tps | 1.01 |
| Top N - Subscriber - tps | 109.46626996065463 median tps | 109.53068986286965 median tps | 1.00 |

Contributor

@paradedb-bot paradedb-bot left a comment

pg_search logical-replication.toml Performance - Other Metrics

| Benchmark suite | Current: 1d01c3b | Previous: ac76a85 | Ratio |
|---|---|---|---|
| Custom Scan - Subscriber - cpu | 4.5714283 median cpu | 4.5714283 median cpu | 1 |
| Custom Scan - Subscriber - mem | 47.08984375 median mem | 47.40234375 median mem | 0.99 |
| Delete values - Publisher - cpu | 4.58891 median cpu | 4.5584044 median cpu | 1.01 |
| Delete values - Publisher - mem | 29.94140625 median mem | 29.8828125 median mem | 1.00 |
| Find by ctid - Subscriber - cpu | 9.099526 median cpu | 9.108159 median cpu | 1.00 |
| Find by ctid - Subscriber - mem | 49.61328125 median mem | 50.16015625 median mem | 0.99 |
| Index Only Scan - Subscriber - cpu | 4.5714283 median cpu | 4.567079 median cpu | 1.00 |
| Index Only Scan - Subscriber - mem | 46.78515625 median mem | 47.17578125 median mem | 0.99 |
| Index Size Info - Subscriber - cpu | 4.567079 median cpu | 4.567079 median cpu | 1 |
| Index Size Info - Subscriber - mem | 30.7421875 median mem | 30.86328125 median mem | 1.00 |
| Index Size Info - Subscriber - pages | 1127 median pages | 1122 median pages | 1.00 |
| Index Size Info - Subscriber - relation_size:MB | 8.8046875 median relation_size:MB | 8.765625 median relation_size:MB | 1.00 |
| Index Size Info - Subscriber - segment_count | 8 median segment_count | 7 median segment_count | 1.14 |
| Insert value A - Publisher - cpu | 4.567079 median cpu | 4.524034 median cpu | 1.01 |
| Insert value A - Publisher - mem | 27.7421875 median mem | 27.41796875 median mem | 1.01 |
| Insert value B - Publisher - cpu | 4.549763 median cpu | 4.5540795 median cpu | 1.00 |
| Insert value B - Publisher - mem | 27.62890625 median mem | 27.4453125 median mem | 1.01 |
| Parallel Custom Scan - Subscriber - cpu | 4.597701 median cpu | 4.58891 median cpu | 1.00 |
| Parallel Custom Scan - Subscriber - mem | 44.9453125 median mem | 45.25 median mem | 0.99 |
| `SELECT pid, pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replication_lag, application_name::text, state::text FROM pg_stat_replication; - Publisher - replication_lag:MB` | 0 median replication_lag:MB | 0 median replication_lag:MB | 1 |
| Top N - Subscriber - cpu | 4.5714283 median cpu | 4.5714283 median cpu | 1 |
| Top N - Subscriber - mem | 45.68359375 median mem | 46.0390625 median mem | 0.99 |
| Update 1..9 - Publisher - cpu | 4.5801525 median cpu | 4.5933013 median cpu | 1.00 |
| Update 1..9 - Publisher - mem | 30.5859375 median mem | 30.55859375 median mem | 1.00 |
| Update 10,11 - Publisher - cpu | 4.567079 median cpu | 4.567079 median cpu | 1 |
| Update 10,11 - Publisher - mem | 30.484375 median mem | 30.65625 median mem | 0.99 |

Contributor

@paradedb-bot paradedb-bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'pg_search logical-replication.toml Performance - Other Metrics'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.10.

| Benchmark suite | Current: 1d01c3b | Previous: ac76a85 | Ratio |
|---|---|---|---|
| Index Size Info - Subscriber - segment_count | 8 median segment_count | 7 median segment_count | 1.14 |

CC: @mithuncy

@paradedb-bot
Contributor

logical-replication result: stressgres-logical-replication-9e4496d9d9963576177da569ba390b8674e9d0e3.png

@mithuncy mithuncy merged commit f89302e into main Jan 9, 2026
22 checks passed
@mithuncy mithuncy deleted the feat/composite-types-support branch January 9, 2026 12:42
Labels

Do Not Cherry Pick: PR should not be cherry-picked to other branches

Development

Successfully merging this pull request may close these issues.

Allow indexes to take more than 32 columns

4 participants