Skip to content

Introduce new "combined" subcolumn for JSON data type that combines literal and sub-object values#98788

Merged
Avogar merged 12 commits intomasterfrom
json-new-subcolumn
Mar 30, 2026
Merged

Introduce new "combined" subcolumn for JSON data type that combines literal and sub-object values#98788
Avogar merged 12 commits intomasterfrom
json-new-subcolumn

Conversation

@Avogar
Copy link
Copy Markdown
Member

@Avogar Avogar commented Mar 4, 2026

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Add combined subcolumn syntax json.@path for the JSON type that returns the literal value as Dynamic if the path holds a scalar, the sub-object as Dynamic if the path holds a nested object, or NULL if the path is absent.

Documentation entry for user-facing changes

Reading JSON combined sub-columns

The JSON type supports reading a path as a combined sub-column using the special syntax [email protected].
A combined sub-column for a given path returns:

  • The literal value stored at that path as Dynamic, if the path has a literal value.
  • A JSON sub-object at that path as Dynamic, if the path has no literal value but has nested sub-paths.
  • NULL, if neither a literal value nor any sub-paths exist for that path.

This is useful when a path may hold either a scalar value or a nested object across different rows, and is more convenient than separately querying the literal sub-column (json.a) and the sub-object sub-column (json.^a).

The following example compares all three sub-column types for path a:

CREATE TABLE test (json JSON) ENGINE = Memory;
INSERT INTO test VALUES ('{"a" : 42, "b" : {"c" : 1, "d" : "Hello"}}'), ('{"a" : {"x": 1, "y": 2}, "b" : {"c" : 1}}'), ('{"c" : "World"}');
SELECT json FROM test;
┌─json────────────────────────────┐
│ {"a":42,"b":{"c":1,"d":"Hello"}}│
│ {"a":{"x":1,"y":2},"b":{"c":1}}│
│ {"c":"World"}                   │
└─────────────────────────────────┘
SELECT
    json.a,
    dynamicType(json.a),
    json.^a,
    toTypeName(json.^a),
    json.@a,
    dynamicType(json.@a)
FROM test;
┌─json.a─┬─dynamicType(json.a)─┬─json.^a───────┬─toTypeName(json.^a)─┬─json.@a───────┬─dynamicType(json.@a)─┐
│ 42     │ Int64               │ {}            │ JSON                │ 42            │ Int64                │
│ NULL   │ None                │ {"x":1,"y":2} │ JSON                │ {"x":1,"y":2} │ JSON                 │
│ NULL   │ None                │ {}            │ JSON                │ NULL          │ None                 │
└────────┴─────────────────────┴───────────────┴─────────────────────┴───────────────┴──────────────────────┘
  • Row 1: a holds a literal 42. json.a returns it as Dynamic(Int64), json.^a returns an empty sub-object {} (no nested keys under a), and json.@a returns the literal 42.

  • Row 2: a holds a nested object. json.a returns NULL (no literal at that path), json.^a returns the sub-object as JSON, and json.@a also returns the sub-object as Dynamic(JSON).

  • Row 3: a is absent entirely. Both json.a and json.@a return NULL, while json.^a returns an empty {}.

  • Documentation is written (mandatory for new features)

Avogar added 2 commits March 4, 2026 20:31
path subcolumn and the sub-object subcolumn into a single access:
- returns the literal value as `Dynamic` if the path holds a scalar,
- returns the sub-object as `Dynamic` if the path holds a nested object,
- returns `NULL` if the path is absent entirely.

Implementation:
- Parser: handle `$`-prefixed identifiers in `ParserCompoundIdentifier`
  via the new `is_dollar_bareword` flag in `ExpressionElementParsers`.
- In-memory (Memory engine): `extractCombinedColumn` in `DataTypeObject`
  merges the literal and sub-object columns row by row.
- MergeTree (on-disk): new `SerializationObjectCombinedPath` reads the
  literal and sub-object substreams and merges them during deserialization,
  using `insert_only_rows_in_current_range_from_substreams_cache` to avoid
  stale cache interactions between the two substreams.
- `tupleElement` and `JSONExtract` function families extended to support
  the combined subcolumn path prefix.

Tests added in `03467`–`03470` covering parser round-trips, prefix
correctness, Memory/Compact/Wide MergeTree variants, and `Array(JSON)`.

Documentation added to `newjson.md` with a comparison example of all
three subcolumn types (`json.a`, `json.^a`, `json.`) side by side.
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 4, 2026

Workflow [PR], commit [f6525f5]

Summary:


AI Review

Summary

This PR introduces a new JSON combined subcolumn syntax (json.@path) that merges literal and sub-object reads, adds dedicated combined-path serialization logic, updates parser support, and adds broad stateless coverage for Memory and MergeTree variants. I did not find additional high-confidence correctness, safety, or performance issues beyond already-posted inline discussion; overall implementation looks coherent and sufficiently tested for the introduced behavior.

Missing context
  • ⚠️ No CI run logs or benchmark reports were provided in the review context, so I could not independently validate runtime/perf behavior beyond code and test diff inspection.
ClickHouse Rules
Item Status Notes
Deletion logging
Serialization versioning
Core-area scrutiny
No test removal
Experimental gate JSON functionality remains under enable_json_type.
No magic constants
Backward compatibility
SettingsChangesHistory.cpp
PR metadata quality
Safe rollout
Compilation time
Final Verdict
  • Status: ✅ Approve

@clickhouse-gh clickhouse-gh bot added the pr-feature Pull request with new product feature label Mar 4, 2026
@UnamedRus
Copy link
Copy Markdown
Contributor

This is useful when a path may hold either a scalar value or a nested object across different rows, and is more convenient than separately querying the literal sub-column (json.a) and the sub-object sub-column (json.^a).

Does, it also apply to root json as well?
Ie, json could return 42 or NULL or "null"?

@Avogar
Copy link
Copy Markdown
Member Author

Avogar commented Mar 5, 2026

Does, it also apply to root json as well?
Ie, json could return 42 or NULL or "null"?

No. Root JSON column can contain only objects.

In theory, you can already use Dynamic type for this purpose:

create table test (json_value Dynamic) engine=MergeTree order by tuple();
insert into test select 42::Int64;
insert into test select null;
insert into test select 'data';
insert into test select '{"a" : {"b" : 42}}'::JSON;

select json_value, json_value.Int64, json_value.String, json_value.JSON, json_value.JSON.a.b from test;
   ┌─json_value─────┬─json_value.Int64─┬─json_value.String─┬─json_value.JSON─┬─json_value.JSON.a.b─┐
1. │ data           │             ᴺᵁᴸᴸ │ data              │ ᴺᵁᴸᴸ            │ ᴺᵁᴸᴸ                │
2. │ 42             │               42 │ ᴺᵁᴸᴸ              │ ᴺᵁᴸᴸ            │ ᴺᵁᴸᴸ                │
3. │ ᴺᵁᴸᴸ           │             ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ              │ ᴺᵁᴸᴸ            │ ᴺᵁᴸᴸ                │
4. │ {"a":{"b":42}} │             ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ              │ {"a":{"b":42}}  │ 42                  │
   └────────────────┴──────────────────┴───────────────────┴─────────────────┴─────────────────────┘

But for this to work with JSON input formats we need to add support for JSON type in schema inference (right now objects are inferred as named tuples).

Or you can always wrap your JSON value into an object and just use JSON type: {"data" : {"json_value" : <anything>}}

@Avogar Avogar marked this pull request as ready for review March 6, 2026 14:56
Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

@Avogar Avogar requested a review from bharatnc March 12, 2026 19:02
auto subcolumn_name = subcolumn_name_col->getValue<String>();
return wrapInArrays(object->getSubcolumnType(subcolumn_name), count_arrays);
/// Use combined `$` subcolumn that merges literal value and sub-object.
auto combined_name = String(1, DataTypeObject::COMBINED_SUBCOLUMN_PREFIX) + "`" + subcolumn_name + "`";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ This change drops previously covered behavior for tupleElement on JSON typed subcolumns (for example tupleElement(json, 'c.:Array(JSON)')), and the corresponding test was removed in this PR.

Because combined_name is always built as $`<element_name>` , any element_name containing a dynamic-type suffix (or embedded backquotes) no longer maps to the original subcolumn syntax. This is a backward-compatibility regression and effectively relaxes an existing test instead of preserving behavior.

Please either preserve support for these existing paths in tupleElement or keep the old behavior behind compatibility handling and add explicit tests for both forms.

@nickitat
Copy link
Copy Markdown
Member

nickitat commented Mar 18, 2026

#99920
If you are already looking at this failure, pls let me know )

@Avogar
Copy link
Copy Markdown
Member Author

Avogar commented Mar 18, 2026

If you are already looking at this failure, pls let me know )

No, I didn't look at it yet. You can go ahead (or assign me to the issue, I can also take a look)

…fier ambiguity; add serialization pooling support for SerializationObjectCombinedPath
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 25, 2026

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.10% 84.10% +0.00%
Functions 24.60% 24.40% -0.20%
Branches 76.70% 76.60% -0.10%

PR changed lines: PR changed-lines coverage: 88.70% (259/292, 0 noise lines excluded)
Diff coverage report
Uncovered code

@Avogar Avogar added this pull request to the merge queue Mar 30, 2026
Merged via the queue into master with commit 3e9e721 Mar 30, 2026
153 checks passed
@Avogar Avogar deleted the json-new-subcolumn branch March 30, 2026 17:02
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label Mar 30, 2026
Desel72 pushed a commit to Desel72/ClickHouse that referenced this pull request Mar 30, 2026
Introduce new "combined" subcolumn for JSON data type that combines literal and sub-object values
clickgapai added a commit to clickgapai/ClickHouse that referenced this pull request Mar 30, 2026
@clickgapai
Copy link
Copy Markdown
Contributor

Hi @Avogar @bharatnc — while reviewing this PR I found the following:

Could you review the test PR and add the can be tested label if it looks good?
Happy to discuss — close anything that's wrong or already addressed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-feature Pull request with new product feature pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants