Skip to content

Add MergeTree skip index support for JSON paths using JSONAllPaths with bloom_filter, tokenbf_v1, ngrambf_v1, and text (inverted) index types#98886

Open
Avogar wants to merge 13 commits intomasterfrom
json-skip-index-support
Open

Add MergeTree skip index support for JSON paths using JSONAllPaths with bloom_filter, tokenbf_v1, ngrambf_v1, and text (inverted) index types#98886
Avogar wants to merge 13 commits intomasterfrom
json-skip-index-support

Conversation

@Avogar
Copy link
Copy Markdown
Member

@Avogar Avogar commented Mar 5, 2026

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Add MergeTree skip index support for JSON columns using JSONAllPaths with bloom_filter, tokenbf_v1, ngrambf_v1, and text (inverted) index types, enabling granule skipping based on the set of JSON paths present in each granule.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Note

Medium Risk
Touches MergeTree skip-index condition building for multiple index types, so any mismatch could lead to incorrect granule skipping or missed optimizations; mitigated by explicit safety checks for missing-path defaults and extensive new stateless tests.

Overview
Adds MergeTree skip-index support for JSON path presence by allowing indexes on JSONAllPaths(json_col) to accelerate filters on JSON subcolumns.

Extends bloom-filter-based indexes (bloom_filter, tokenbf_v1, ngrambf_v1) and the text index to recognize JSON subcolumn predicates (including CAST/typed subcolumns), support equals, IN (bloom only), and IS NOT NULL where safe, and avoid unsafe skipping when missing paths would produce default values or when using ^ sub-object access.

Includes a new shared helper (MergeTreeIndexJSONSubcolumnHelper) plus comprehensive stateless tests and documentation updates describing JSON path-based skipping and examples.

Written by Cursor Bugbot for commit 2603d47. This will update automatically on new commits. Configure here.

…th bloom_filter, tokenbf_v1, ngrambf_v1, and text (inverted) index types, enabling granule skipping based on the set of JSON paths present in each granule.
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 5, 2026

Workflow [PR], commit [a1492d1]

Summary:

job_name test_name status info comment
Stateless tests (arm_binary, parallel) failure
03668_shard_join_in_reverse_order FAIL cidb, issue
Stress test (arm_msan) failure
MemorySanitizer: use-of-uninitialized-value (STID: 1003-358c) FAIL cidb, issue

AI Review

Summary

This PR adds skip-index support for JSON subcolumn path existence via JSONAllPaths(...) across bloom_filter, tokenbf_v1, ngrambf_v1, and text index conditions, plus extensive stateless coverage and docs. I reviewed the changed MergeTree condition-building logic and safety guards for missing-path/default-value semantics; I did not find high-confidence correctness, safety, or compatibility issues that require code changes.

Missing context
  • ⚠️ No CI runtime logs or benchmark/perf reports were provided in this job context, so runtime/perf validation here is code-and-test based.
ClickHouse Rules
Item Status Notes
Deletion logging
Serialization versioning
Core-area scrutiny
No test removal
Experimental gate
No magic constants
Backward compatibility
SettingsChangesHistory.cpp
PR metadata quality
Safe rollout
Compilation time
Final Verdict
  • Status: ✅ Approve

@clickhouse-gh clickhouse-gh bot added the pr-feature Pull request with new product feature label Mar 5, 2026
Avogar and others added 4 commits March 5, 2026 20:50
The CI docs check requires all headings to have explicit anchor IDs.
Added IDs to 5 headings in the JSON skip index documentation section.

#98886

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@Avogar Avogar marked this pull request as ready for review March 6, 2026 17:56
@Avogar Avogar mentioned this pull request Mar 6, 2026
56 tasks
Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

@ahmadov ahmadov self-assigned this Mar 10, 2026
@Avogar Avogar requested a review from ahmadov March 30, 2026 16:52
@CurtizJ
Copy link
Copy Markdown
Member

CurtizJ commented Mar 30, 2026

In general LGTM

FROM (EXPLAIN indexes = 1 SELECT * FROM t_json_text WHERE json.a.b::Int64 = 1)
WHERE explain LIKE '%Parts:%' OR explain LIKE '%Granules:%' OR explain LIKE '%Skip%';

-- 1e: CAST ::Int64 = 0 — unsafe
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reader wonders why json.a.b::Int64 = 0 is "unsafe"?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we cannot skip granules without path a.b with this condition, values of this path in granules without this path will be read as Null and in cast to Int64 that will be converted to 0.

@CurtizJ CurtizJ self-assigned this Mar 31, 2026
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 31, 2026

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.20% 84.20% +0.00%
Functions 90.90% 90.90% +0.00%
Branches 76.70% 76.80% +0.10%

Changed lines: 96.70% (205/212) · Uncovered code

Full report · Diff report

@Avogar
Copy link
Copy Markdown
Member Author

Avogar commented Apr 1, 2026

@CurtizJ can you approve so I can merge it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants