Validate ArrowStream first bytes to prevent huge allocations during format detection#98893

Merged
alexey-milovidov merged 2 commits into master from fix-format-confusion-allocation
Mar 6, 2026
Conversation

@thevar1able thevar1able commented Mar 6, 2026

Summary

  • During format auto-detection, the ArrowStream reader interprets the first 4 bytes of non-Arrow data as a metadata length in the Arrow IPC framing protocol. For example, JSON starting with `{\n  ` (hex 7b 0a 20 20) is read as a ~514 MiB metadata length, causing Arrow to allocate that much memory before discovering the data is invalid.
  • Add a pre-validation check in createStreamReader that peeks at the first 4 bytes and rejects data that is clearly not Arrow IPC: the value must be either the continuation token (0xFFFFFFFF) or a positive metadata length under 256 MiB.

Closes #65036

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix excessive memory usage (~514 MiB) during format auto-detection when reading non-Arrow data (e.g. JSON from url() or file() without explicit format), caused by the ArrowStream reader misinterpreting the first bytes as a huge metadata length.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

No documentation changes needed — this is a bug fix with no user-facing API changes.


Note

Medium Risk
Adds stricter pre-validation to ArrowStream parsing that changes failure behavior and could reject edge-case/legacy inputs if they don’t match the expected IPC framing. Otherwise it’s a targeted guardrail plus a regression test to prevent runaway memory allocations on misdetected data.

Overview
Pre-validates the first 4 bytes of ArrowStream input in createStreamReader and rejects streams that clearly aren’t Arrow IPC (empty input, invalid continuation token/metadata length, or an unreasonably large metadata length), preventing Arrow from attempting huge allocations during format auto-detection on non-Arrow data.

Adds a stateless regression test that reads a JSON file without an extension via file() under a tight max_memory_usage limit to ensure autodetection no longer spikes memory.

Written by Cursor Bugbot for commit bf4a614.

During format auto-detection, the ArrowStream reader would interpret the
first 4 bytes of non-Arrow data as a metadata length in the Arrow IPC
framing protocol. For example, JSON starting with `{\n  ` (hex 7b 0a 20 20)
was read as little-endian int32 0x20200a7b = ~514 MiB, causing Arrow to
allocate that much memory before discovering the data was invalid.

Add a pre-validation check in `createStreamReader` that peeks at the first
4 bytes and rejects data that is clearly not Arrow IPC: the value must be
either the continuation token (0xFFFFFFFF) or a positive metadata length
under 256 MiB.

Closes #65036

Assisted-by: Claude Opus 4.6 via GitHub Copilot

clickhouse-gh bot commented Mar 6, 2026

Workflow [PR], commit [bf4a614]


@clickhouse-gh clickhouse-gh bot added the pr-bugfix Pull request with bugfix, not backported by default label Mar 6, 2026

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Arrow IPC uses little-endian byte order on the wire. Add
`fromLittleEndian` conversion after reading the first 4 bytes so the
validation works correctly on big-endian platforms (e.g. s390x).

Assisted-by: Claude Opus 4.6 via GitHub Copilot

@alexey-milovidov alexey-milovidov left a comment


Amazing!

Also, should we order the formats we try during auto-detection so that the more obscure ones come last?
We could extend the format properties to include this ordering.

@alexey-milovidov alexey-milovidov self-assigned this Mar 6, 2026
@alexey-milovidov alexey-milovidov added this pull request to the merge queue Mar 6, 2026
Merged via the queue into master with commit d9fb9d5 Mar 6, 2026
294 of 295 checks passed
@alexey-milovidov alexey-milovidov deleted the fix-format-confusion-allocation branch March 6, 2026 08:44
@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Mar 6, 2026
alexey-milovidov added a commit that referenced this pull request Mar 26, 2026
…Stream memory fix

- Add entry for bucketed serialization for Map columns (#99200) under New Feature.
- Move #98893 (ArrowStream excessive memory during format auto-detection) from Bug Fix to Performance Improvement, since it is a performance/resource usage improvement rather than a correctness bug fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Desel72 pushed a commit to Desel72/ClickHouse that referenced this pull request Mar 30, 2026
…Stream memory fix

- Add entry for bucketed serialization for Map columns (ClickHouse#99200) under New Feature.
- Move ClickHouse#98893 (ArrowStream excessive memory during format auto-detection) from Bug Fix to Performance Improvement, since it is a performance/resource usage improvement rather than a correctness bug fix.

Labels

  • pr-bugfix: Pull request with bugfix, not backported by default
  • pr-synced-to-cloud: The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Surprisingly high memory usage for SELECT ... FROM url

3 participants