Validate ArrowStream first bytes to prevent huge allocations during format detection #98893
Merged
alexey-milovidov merged 2 commits into master on Mar 6, 2026
Conversation
During format auto-detection, the ArrowStream reader would interpret the
first 4 bytes of non-Arrow data as a metadata length in the Arrow IPC
framing protocol. For example, JSON starting with `{\n  ` (hex 7b 0a 20 20)
was read as little-endian int32 0x20200a7b = ~514 MiB, causing Arrow to
allocate that much memory before discovering the data was invalid.
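The arithmetic behind the ~514 MiB figure can be checked with a small sketch (the helper name is hypothetical, not code from this PR):

```cpp
#include <cstdint>

// Decode 4 bytes as a little-endian int32, the way Arrow's stream reader
// interprets the metadata-length field at the start of an IPC message.
uint32_t le32(const unsigned char b[4])
{
    return static_cast<uint32_t>(b[0])
         | (static_cast<uint32_t>(b[1]) << 8)
         | (static_cast<uint32_t>(b[2]) << 16)
         | (static_cast<uint32_t>(b[3]) << 24);
}
```

Feeding it the JSON prefix `{`, `\n`, ` `, ` ` (7b 0a 20 20) yields 0x20200a7b = 538,970,747 bytes, i.e. about 514 MiB.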
Add a pre-validation check in `createStreamReader` that peeks at the first
4 bytes and rejects data that is clearly not Arrow IPC: the value must be
either the continuation token (0xFFFFFFFF) or a positive metadata length
under 256 MiB.
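A minimal sketch of the check described above, assuming the first 4 bytes have already been decoded as a little-endian value (names and the exact cap are illustrative, not the actual ClickHouse implementation):

```cpp
#include <cstdint>

// Hypothetical pre-validation mirroring the PR description: the first
// 4 bytes of an Arrow IPC stream must be either the continuation token
// 0xFFFFFFFF or a plausible (positive, bounded) metadata length.
constexpr uint32_t kContinuationToken = 0xFFFFFFFFu;
constexpr uint32_t kMaxMetadataLength = 256u * 1024 * 1024; // 256 MiB cap

bool looksLikeArrowIPC(uint32_t first_word_le)
{
    if (first_word_le == kContinuationToken)
        return true;
    return first_word_le > 0 && first_word_le < kMaxMetadataLength;
}
```

Under this check, the JSON prefix value 0x20200a7b (~514 MiB) is rejected before Arrow ever tries to allocate for it.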
Closes #65036
Assisted-by: Claude Opus 4.6 via GitHub Copilot
Contributor
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Arrow IPC uses little-endian byte order on the wire. Add a `fromLittleEndian` conversion after reading the first 4 bytes so the validation works correctly on big-endian platforms (e.g. s390x).
Assisted-by: Claude Opus 4.6 via GitHub Copilot
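The endianness concern can be illustrated with a sketch (helper names are hypothetical stand-ins; `fromLittleEndian` is ClickHouse's own utility):

```cpp
#include <cstdint>
#include <cstring>

// Naive read: copies raw bytes into a host-order integer, so the same
// input produces different values on x86 (little-endian) and s390x
// (big-endian) hosts.
uint32_t loadHostOrder(const void * p)
{
    uint32_t v;
    std::memcpy(&v, p, sizeof(v));
    return v;
}

// Endian-safe read: assembles the value byte by byte, giving the same
// little-endian interpretation on every host -- the effect the review
// asks for via fromLittleEndian.
uint32_t loadLittleEndian(const void * p)
{
    unsigned char b[4];
    std::memcpy(b, p, sizeof(b));
    return static_cast<uint32_t>(b[0])
         | (static_cast<uint32_t>(b[1]) << 8)
         | (static_cast<uint32_t>(b[2]) << 16)
         | (static_cast<uint32_t>(b[3]) << 24);
}
```

Without the conversion, a big-endian host would byte-swap every metadata length, so valid Arrow streams could be rejected and garbage could pass the size check.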
alexey-milovidov approved these changes on Mar 6, 2026
alexey-milovidov (Member) left a comment:
Amazing!
Also, should we order the formats we try so that more obscure formats come last? We can extend the format properties to include this ordering.
alexey-milovidov added a commit that referenced this pull request on Mar 26, 2026
…Stream memory fix
- Add entry for bucketed serialization for Map columns (#99200) under New Feature.
- Move #98893 (ArrowStream excessive memory during format auto-detection) from Bug Fix to Performance Improvement, since it is a performance/resource-usage improvement rather than a correctness bug fix.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Desel72 pushed a commit to Desel72/ClickHouse that referenced this pull request on Mar 30, 2026
Summary
During format auto-detection, non-Arrow data such as JSON beginning with `{\n` is read as a ~514 MiB metadata length, causing Arrow to allocate that much memory before discovering the data is invalid. Adds a pre-validation check in `createStreamReader` that peeks at the first 4 bytes and rejects data that is clearly not Arrow IPC: the value must be either the continuation token (0xFFFFFFFF) or a positive metadata length under 256 MiB.
Closes #65036
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fix excessive memory usage (~514 MiB) during format auto-detection when reading non-Arrow data (e.g. JSON from `url()` or `file()` without an explicit format), caused by the ArrowStream reader misinterpreting the first bytes as a huge metadata length.
Documentation entry for user-facing changes
No documentation changes needed — this is a bug fix with no user-facing API changes.
Note
Medium Risk
Adds stricter pre-validation to `ArrowStream` parsing that changes failure behavior and could reject edge-case/legacy inputs if they don't match the expected IPC framing. Otherwise it's a targeted guardrail plus a regression test to prevent runaway memory allocations on misdetected data.
Overview
Pre-validates the first 4 bytes of `ArrowStream` input in `createStreamReader` and rejects streams that clearly aren't Arrow IPC (empty input, an invalid continuation token/metadata length, or an unreasonably large metadata length), preventing Arrow from attempting huge allocations during format auto-detection on non-Arrow data.
Adds a stateless regression test that reads a JSON file without an extension via `file()` under a tight `max_memory_usage` limit to ensure autodetection no longer spikes memory.
Written by Cursor Bugbot for commit bf4a614.