Task #6: Implement EnsureFileMetadataCached function #16
Merged
Enhanced EnsureCompleteMetadata() to fully load ORC file metadata:
- Opens ORC reader if not provided, with recursive call pattern
- Reads and validates physical schema from ORC file
- Initializes stripes_ with all stripes if not already set
- Gets ORC type tree and builds OrcSchemaManifest
- Initializes StripeStatisticsCache with per-stripe guarantees
- Validates stripe indices against file's total stripe count
- Thread-safe with mutex locking (physical_schema_mutex_)
Implementation notes:
- Follows Parquet's EnsureCompleteMetadata pattern closely
- Handles recursive call when reader is null (unlock, open, recurse)
- Casts void* reader to ORCFileReader* for method access
- Statistics cache initialized with literal(true) per stripe
- All stripe-level metadata loaded in one function call
Verified: Manual code review following Parquet reference (lines 802-870)
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
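The "unlock, open, recurse" pattern from the implementation notes can be sketched roughly as below. This is a minimal, self-contained illustration under stated assumptions, not the code from this PR: FakeOrcReader, OrcFragmentSketch, OpenReader, and the string-valued guarantees are hypothetical stand-ins, and exceptions take the place of Arrow's Status returns to keep the sketch short.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an opened ORC reader (not the Arrow/ORC API).
struct FakeOrcReader {
  int64_t num_stripes;
};

// Hypothetical fragment that loads its metadata lazily and at most once.
class OrcFragmentSketch {
 public:
  explicit OrcFragmentSketch(
      int64_t stripes_in_file,
      std::optional<std::vector<int>> stripes = std::nullopt)
      : stripes_in_file_(stripes_in_file), stripes_(std::move(stripes)) {}

  // Throws std::out_of_range if a selected stripe index is not in the file.
  void EnsureCompleteMetadata(const FakeOrcReader* reader = nullptr) {
    std::unique_lock<std::mutex> lock(mutex_);
    if (metadata_loaded_) return;  // already cached, nothing to do

    if (reader == nullptr) {
      // Release the lock before the potentially slow open, then re-enter
      // with a real reader. The recursive call acquires the lock again.
      lock.unlock();
      const std::unique_ptr<FakeOrcReader> opened = OpenReader();
      EnsureCompleteMetadata(opened.get());
      return;
    }

    const int64_t total = reader->num_stripes;
    if (!stripes_.has_value()) {
      // No explicit selection: default to every stripe in the file.
      stripes_.emplace();
      for (int64_t i = 0; i < total; ++i) {
        stripes_->push_back(static_cast<int>(i));
      }
    }
    for (int stripe : *stripes_) {
      if (stripe < 0 || stripe >= total) {
        throw std::out_of_range("stripe index " + std::to_string(stripe) +
                                " is outside the file");
      }
    }
    // One guarantee per selected stripe; "true" means nothing is known yet.
    guarantees_.assign(stripes_->size(), "true");
    assert(guarantees_.size() == stripes_->size());
    metadata_loaded_ = true;
  }

  std::vector<int> stripes() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return stripes_.value_or(std::vector<int>{});
  }

  std::vector<std::string> guarantees() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return guarantees_;
  }

 private:
  // Hypothetical helper standing in for opening the file.
  std::unique_ptr<FakeOrcReader> OpenReader() const {
    return std::make_unique<FakeOrcReader>(FakeOrcReader{stripes_in_file_});
  }

  int64_t stripes_in_file_;
  mutable std::mutex mutex_;
  bool metadata_loaded_ = false;
  std::optional<std::vector<int>> stripes_;
  std::vector<std::string> guarantees_;
};
```

If two threads both arrive with a null reader, each may open the file, but only the first to re-acquire the lock populates the metadata; the second sees the loaded flag and returns. Releasing the lock before opening keeps I/O out of the critical section.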
cbb330 added a commit that referenced this pull request on Feb 20, 2026
cbb330 added a commit that referenced this pull request on Feb 20, 2026
- Modified OrcFileFormat::CountRows to use TryCountRows optimization
- If metadata loaded: call TryCountRows immediately (synchronous fast path)
- If not loaded: load metadata async, then call TryCountRows
- TryCountRows returns count if computable from statistics, nullopt otherwise
- nullopt triggers fallback to full scan (default FileFragment behavior)
Flow:
1. CountRows called with predicate
2. Check if metadata cached
   - Yes: TryCountRows(predicate) → count or nullopt
   - No: async { load metadata, TryCountRows(predicate) }
3. If nullopt: FileFragment base class does full scan
Mirrors Parquet's CountRows (file_parquet.cc:669-684) exactly.
Result: COUNT(*) queries with compatible predicates avoid reading data.
Co-authored-by: Claude Sonnet 4.5 <[email protected]>
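The flow in this commit message can be illustrated with a small sketch. It is not the OrcFileFormat code: PredicateSketch, CountFragmentSketch, CountTask, and ResolveCount are hypothetical names, and a plain deferred callable stands in for Arrow's future type so the example has no dependencies beyond the standard library.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <memory>
#include <optional>

// Hypothetical predicate: only records whether statistics can answer it.
struct PredicateSketch {
  bool answerable_from_statistics;
};

// Hypothetical fragment with a row count that is known once metadata loads.
struct CountFragmentSketch {
  int64_t total_rows = 0;
  bool metadata_loaded = false;

  void LoadMetadata() { metadata_loaded = true; }

  // Returns the count if it can be computed from metadata alone,
  // std::nullopt otherwise.
  std::optional<int64_t> TryCountRows(const PredicateSketch& predicate) const {
    if (!metadata_loaded || !predicate.answerable_from_statistics) {
      return std::nullopt;
    }
    return total_rows;
  }
};

// Deferred computation standing in for an asynchronous result.
using CountTask = std::function<std::optional<int64_t>()>;

CountTask CountRows(std::shared_ptr<CountFragmentSketch> fragment,
                    PredicateSketch predicate) {
  if (fragment->metadata_loaded) {
    // Fast path: metadata is cached, so compute the answer immediately.
    const std::optional<int64_t> ready = fragment->TryCountRows(predicate);
    return [ready] { return ready; };
  }
  // Slow path: load metadata first, then try the same optimization.
  return [fragment, predicate] {
    fragment->LoadMetadata();
    return fragment->TryCountRows(predicate);
  };
}

// std::nullopt means "could not answer from statistics": run the full scan.
int64_t ResolveCount(const std::optional<int64_t>& fast,
                     const std::function<int64_t()>& full_scan) {
  assert(full_scan != nullptr);
  return fast.has_value() ? *fast : full_scan();
}
```

Returning std::nullopt rather than an error keeps "cannot answer from statistics" distinct from a failure, which lets the caller decide when to pay for the full scan.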
cbb330 added a commit that referenced this pull request on Feb 20, 2026
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? or See also:
cbb330 added a commit that referenced this pull request on Feb 24, 2026
cbb330 added a commit that referenced this pull request on Feb 24, 2026
cbb330 added a commit that referenced this pull request on Feb 24, 2026
Summary
Implements EnsureCompleteMetadata() for ORC fragments so that all file-level and stripe-level metadata is loaded in one function call.
Changes
Implementation Details
Testing
Manual code review following Parquet reference patterns
Task Reference
Completes Task #6 from task_list.json
Depends on: Task #4 (OrcFileFragment - complete)
Enables: Task #7 (EnsureManifestCached)