Validate ArrowStream first bytes to prevent huge allocations during format detection#98893

Merged
alexey-milovidov merged 2 commits into master from fix-format-confusion-allocation
Mar 6, 2026
Conversation

@thevar1able thevar1able commented Mar 6, 2026

Summary

  • During format auto-detection, the ArrowStream reader interprets the first 4 bytes of non-Arrow data as a metadata length in the Arrow IPC framing protocol. For example, JSON starting with `{\n  ` (hex 7b 0a 20 20) is read as a ~514 MiB metadata length, causing Arrow to allocate that much memory before discovering the data is invalid.
  • Add a pre-validation check in createStreamReader that peeks at the first 4 bytes and rejects data that is clearly not Arrow IPC: the value must be either the continuation token (0xFFFFFFFF) or a positive metadata length under 256 MiB.

Closes #65036

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix excessive memory usage (~514 MiB) during format auto-detection when reading non-Arrow data (e.g. JSON from url() or file() without explicit format), caused by the ArrowStream reader misinterpreting the first bytes as a huge metadata length.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

No documentation changes needed — this is a bug fix with no user-facing API changes.


Note

Medium Risk
Adds stricter pre-validation to ArrowStream parsing that changes failure behavior and could reject edge-case/legacy inputs if they don’t match the expected IPC framing. Otherwise it’s a targeted guardrail plus a regression test to prevent runaway memory allocations on misdetected data.

Overview
Pre-validates the first 4 bytes of ArrowStream input in createStreamReader and rejects streams that clearly aren’t Arrow IPC (empty input, invalid continuation token/metadata length, or an unreasonably large metadata length), preventing Arrow from attempting huge allocations during format auto-detection on non-Arrow data.

Adds a stateless regression test that reads a JSON file without an extension via file() under a tight max_memory_usage limit to ensure autodetection no longer spikes memory.

Written by Cursor Bugbot for commit bf4a614.

During format auto-detection, the ArrowStream reader would interpret the
first 4 bytes of non-Arrow data as a metadata length in the Arrow IPC
framing protocol. For example, JSON starting with `{\n  ` (hex 7b 0a 20 20)
was read as little-endian int32 0x20200a7b = ~514 MiB, causing Arrow to
allocate that much memory before discovering the data was invalid.

Add a pre-validation check in `createStreamReader` that peeks at the first
4 bytes and rejects data that is clearly not Arrow IPC: the value must be
either the continuation token (0xFFFFFFFF) or a positive metadata length
under 256 MiB.

Closes #65036

Assisted-by: Claude Opus 4.6 via GitHub Copilot

clickhouse-gh bot commented Mar 6, 2026

Workflow [PR], commit [bf4a614]


@clickhouse-gh clickhouse-gh bot added the pr-bugfix Pull request with bugfix, not backported by default label Mar 6, 2026

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Arrow IPC uses little-endian byte order on the wire. Add
`fromLittleEndian` conversion after reading the first 4 bytes so the
validation works correctly on big-endian platforms (e.g. s390x).

Assisted-by: Claude Opus 4.6 via GitHub Copilot

@alexey-milovidov alexey-milovidov left a comment


Amazing!

Also, should we order the formats we try during auto-detection so that the more obscure ones come last?
We could extend the format properties to include this ordering.

@alexey-milovidov alexey-milovidov self-assigned this Mar 6, 2026
@alexey-milovidov alexey-milovidov added this pull request to the merge queue Mar 6, 2026
Merged via the queue into master with commit d9fb9d5 Mar 6, 2026
294 of 295 checks passed
@alexey-milovidov alexey-milovidov deleted the fix-format-confusion-allocation branch March 6, 2026 08:44
@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Mar 6, 2026
alexey-milovidov added a commit that referenced this pull request Mar 26, 2026
…Stream memory fix

- Add entry for bucketed serialization for Map columns (#99200) under New Feature.
- Move #98893 (ArrowStream excessive memory during format auto-detection) from Bug Fix to Performance Improvement, since it is a performance/resource usage improvement rather than a correctness bug fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Desel72 pushed a commit to Desel72/ClickHouse that referenced this pull request Mar 30, 2026
…Stream memory fix

- Add entry for bucketed serialization for Map columns (ClickHouse#99200) under New Feature.
- Move ClickHouse#98893 (ArrowStream excessive memory during format auto-detection) from Bug Fix to Performance Improvement, since it is a performance/resource usage improvement rather than a correctness bug fix.

Labels

  • pr-bugfix: Pull request with bugfix, not backported by default
  • pr-synced-to-cloud: The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Surprisingly high memory usage for SELECT ... FROM url

3 participants