Task #6: Implement EnsureFileMetadataCached function #16

Merged
cbb330 merged 1 commit into main from task-6-ensure-file-metadata-cached
Feb 20, 2026

@cbb330 cbb330 commented Feb 20, 2026

Summary

Implements EnsureCompleteMetadata() for ORC fragments so that all file metadata is loaded in a single call.

Changes

  • Enhanced EnsureCompleteMetadata() to load all ORC file metadata
  • Reads and validates physical schema from ORC reader
  • Initializes stripes_ with all stripes if not specified
  • Builds OrcSchemaManifest from ORC type tree
  • Initializes StripeStatisticsCache with per-stripe guarantees
  • Validates stripe indices against file's stripe count
  • Opens the ORC reader and calls itself recursively when no reader is provided
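
The stripe handling listed above (default to all stripes, then validate the indices against the file) can be illustrated with a small standalone sketch. It is not the PR's code: `ResolveStripes` and its signature are hypothetical, and an invalid selection is reported as `std::nullopt` where Arrow code would normally return a `Status`.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <optional>
#include <vector>

// Hypothetical helper: returns the stripe indices to use, or std::nullopt if
// any requested index falls outside [0, num_stripes_in_file).
std::optional<std::vector<int>> ResolveStripes(
    const std::optional<std::vector<int>>& requested, int num_stripes_in_file) {
  assert(num_stripes_in_file >= 0);  // precondition of this sketch

  if (!requested.has_value()) {
    // No explicit selection: use every stripe in the file, in order.
    std::vector<int> all(static_cast<std::size_t>(num_stripes_in_file));
    std::iota(all.begin(), all.end(), 0);
    return all;
  }

  for (int index : *requested) {
    if (index < 0 || index >= num_stripes_in_file) {
      return std::nullopt;  // selection does not match this file
    }
  }
  return *requested;  // selection is valid, keep it unchanged
}
```

For a file with three stripes, an empty selection resolves to stripes 0, 1 and 2, while a selection containing index 3 is rejected.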

Implementation Details

  • Follows Parquet's EnsureCompleteMetadata pattern (lines 802-870)
  • Thread-safe with physical_schema_mutex_ from Fragment base class
  • Unlocks mutex before recursive call to avoid deadlock
  • Statistics cache initialized with literal(true) per stripe
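
The unlock-before-recursing step is easiest to see in isolation. The following is a minimal sketch under stated assumptions, not the PR's implementation: `SketchFragment`, `FakeReader`, `OpenReader` and `open_count` are hypothetical stand-ins, and only the mutex name is taken from the description above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <numeric>
#include <vector>

// Hypothetical stand-in for an opened ORC reader.
struct FakeReader {
  int num_stripes;
};

// Hypothetical fragment; this is not Arrow's OrcFileFragment.
class SketchFragment {
 public:
  explicit SketchFragment(int stripes_in_file) : stripes_in_file_(stripes_in_file) {}

  void EnsureCompleteMetadata(const FakeReader* reader = nullptr) {
    std::unique_lock<std::mutex> lock(physical_schema_mutex_);
    if (metadata_loaded_) return;  // already cached, nothing to do

    if (reader == nullptr) {
      // std::mutex is not recursive: locking it again on this thread would be
      // undefined behavior (in practice a deadlock), so release it first.
      lock.unlock();
      FakeReader opened = OpenReader();
      EnsureCompleteMetadata(&opened);
      return;
    }

    // A reader is available and the lock is held: fill in the cached state.
    assert(reader->num_stripes >= 0);
    stripes_.resize(static_cast<std::size_t>(reader->num_stripes));
    std::iota(stripes_.begin(), stripes_.end(), 0);
    metadata_loaded_ = true;
  }

  int open_count() const { return open_count_.load(); }
  const std::vector<int>& stripes() const { return stripes_; }

 private:
  FakeReader OpenReader() {
    ++open_count_;  // records how often the file was opened
    return FakeReader{stripes_in_file_};
  }

  int stripes_in_file_;
  std::mutex physical_schema_mutex_;
  bool metadata_loaded_ = false;
  std::vector<int> stripes_;
  std::atomic<int> open_count_{0};
};
```

Because the lock is dropped while the reader is opened, two threads may both open the file; the `metadata_loaded_` check at the top of the recursive call makes the slower one discard its work instead of overwriting the cache.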

Testing

Manual code review following Parquet reference patterns

Task Reference

Completes Task #6 from task_list.json
Depends on: Task #4 (OrcFileFragment - complete)
Enables: Task #7 (EnsureManifestCached)

- Enhanced EnsureCompleteMetadata() to fully load ORC file metadata
- Opens ORC reader if not provided, with recursive call pattern
- Reads and validates physical schema from ORC file
- Initializes stripes_ with all stripes if not already set
- Gets ORC type tree and builds OrcSchemaManifest
- Initializes StripeStatisticsCache with per-stripe guarantees
- Validates stripe indices against file's total stripe count
- Thread-safe with mutex locking (physical_schema_mutex_)

Implementation notes:
- Follows Parquet's EnsureCompleteMetadata pattern closely
- Handles recursive call when reader is null (unlock, open, recurse)
- Casts void* reader to ORCFileReader* for method access
- Statistics cache initialized with literal(true) per stripe
- All stripe-level metadata loaded in one function call
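
Initializing every stripe's guarantee to literal(true) means "nothing is known about this stripe yet", so no stripe can be pruned until real statistics replace the placeholder. A simplified sketch of that idea follows; `Guarantee`, `SketchStatisticsCache`, `Refine` and `CanSkipGreaterThan` are hypothetical names, and a single integer column with a min/max range stands in for Arrow's expression type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical guarantee about one integer column within one stripe.
// With both bounds absent it plays the role of literal(true).
struct Guarantee {
  std::optional<int64_t> min;
  std::optional<int64_t> max;
  bool IsTrivial() const { return !min.has_value() && !max.has_value(); }
};

struct SketchStatisticsCache {
  std::vector<Guarantee> per_stripe;

  // One trivial guarantee per stripe, mirroring "literal(true) per stripe".
  explicit SketchStatisticsCache(std::size_t num_stripes) : per_stripe(num_stripes) {}

  // Replace the placeholder once statistics for a stripe have been read.
  void Refine(std::size_t stripe, int64_t min, int64_t max) {
    assert(stripe < per_stripe.size());
    per_stripe[stripe] = Guarantee{min, max};
  }

  // A stripe can be skipped for the predicate `x > bound` only when its
  // guarantee proves that every value is <= bound.
  bool CanSkipGreaterThan(std::size_t stripe, int64_t bound) const {
    assert(stripe < per_stripe.size());
    const Guarantee& g = per_stripe[stripe];
    return g.max.has_value() && *g.max <= bound;
  }
};
```

A freshly built cache never allows skipping, which keeps results correct even if statistics are never loaded; pruning only becomes possible after `Refine` is called for a stripe.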

Verified: Manual code review following Parquet reference (lines 802-870)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
@cbb330 cbb330 merged commit 3851eab into main Feb 20, 2026
31 of 50 checks passed
@cbb330 cbb330 deleted the task-6-ensure-file-metadata-cached branch February 20, 2026 22:34
cbb330 added a commit that referenced this pull request Feb 20, 2026
- Modified OrcFileFormat::CountRows to use TryCountRows optimization
- If metadata loaded: call TryCountRows immediately (synchronous fast path)
- If not loaded: load metadata async, then call TryCountRows
- TryCountRows returns count if computable from statistics, nullopt otherwise
- nullopt triggers fallback to full scan (default FileFragment behavior)
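
One way to read "count if computable from statistics, nullopt otherwise" is sketched below. It is an assumption about the intended semantics rather than the PR's code: the predicate is fixed to `x >= threshold` on a single integer column without nulls, and `StripeStats` is a hypothetical type.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical per-stripe statistics for one integer column without nulls.
struct StripeStats {
  int64_t num_rows;
  int64_t min_x;
  int64_t max_x;
};

// Counts rows matching `x >= threshold` using statistics only. Returns
// std::nullopt as soon as one stripe cannot be decided from its min/max.
std::optional<int64_t> TryCountRows(const std::vector<StripeStats>& stripes,
                                    int64_t threshold) {
  int64_t total = 0;
  for (const StripeStats& s : stripes) {
    assert(s.min_x <= s.max_x);
    if (s.min_x >= threshold) {
      total += s.num_rows;  // every row in the stripe matches
    } else if (s.max_x < threshold) {
      continue;  // no row in the stripe matches
    } else {
      return std::nullopt;  // stripe matches partially, data must be read
    }
  }
  return total;
}
```

The function gives up on the first undecidable stripe because a partial count is of no use to the caller, which then falls back to a full scan.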

Flow:
1. CountRows called with predicate
2. Check if metadata cached
   - Yes: TryCountRows(predicate) → count or nullopt
   - No: async { load metadata, TryCountRows(predicate) }
3. If nullopt: FileFragment base class does full scan
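
The three steps of this flow can be sketched as follows. This is a simplified model, not Arrow's API: all types and names except `CountRows` and `TryCountRows` are hypothetical, the predicate is reduced to a flag saying whether it is always true, and the asynchronous branch is written as a plain function call to keep the sketch dependency-free.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>

// Which branch of step 2 produced the result (for illustration only).
enum class Path { kSyncFastPath, kLoadedThenCounted };

struct CountResult {
  std::optional<int64_t> count;  // nullopt means "fall back to a full scan"
  Path path;
};

// Hypothetical fragment holding the row count its metadata would report.
struct SketchFragment {
  int64_t rows_in_metadata;
  bool metadata_loaded = false;

  void LoadMetadata() { metadata_loaded = true; }

  // Stand-in for TryCountRows: only an always-true predicate is answerable.
  std::optional<int64_t> TryCountRows(bool predicate_is_always_true) const {
    if (!metadata_loaded || !predicate_is_always_true) return std::nullopt;
    return rows_in_metadata;
  }
};

// Steps 1 and 2: choose the synchronous fast path when metadata is cached.
CountResult CountRows(SketchFragment& fragment, bool predicate_is_always_true) {
  if (fragment.metadata_loaded) {
    return {fragment.TryCountRows(predicate_is_always_true), Path::kSyncFastPath};
  }
  // The commit message describes this branch as asynchronous; here it is a plain call.
  fragment.LoadMetadata();
  return {fragment.TryCountRows(predicate_is_always_true), Path::kLoadedThenCounted};
}

// Step 3: fall back to scanning when the count is not computable.
int64_t CountRowsOrScan(SketchFragment& fragment, bool predicate_is_always_true,
                        const std::function<int64_t()>& full_scan) {
  assert(static_cast<bool>(full_scan));  // a fallback must be supplied
  CountResult result = CountRows(fragment, predicate_is_always_true);
  return result.count.has_value() ? *result.count : full_scan();
}
```

The first call on a fragment takes the load-then-count branch and later calls take the fast path; `full_scan` is invoked only when the statistics cannot answer the query.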

Mirrors Parquet's CountRows (file_parquet.cc:669-684) exactly.

Result: COUNT(*) queries with compatible predicates avoid reading data.

Co-authored-by: Claude Sonnet 4.5 <[email protected]>
@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

cbb330 added a commit that referenced this pull request Feb 24, 2026