Task #4: Create OrcFileFragment class #13

Merged
cbb330 merged 1 commit into main from task-4-create-orc-file-fragment
Feb 20, 2026
@cbb330 cbb330 commented Feb 20, 2026

Summary

Implements OrcFileFragment class with ORC-specific predicate pushdown capabilities.

Changes

  • Added OrcFileFragment class extending FileFragment
  • Added StripeStatisticsCache for caching stripe-level statistics
  • Added OrcCacheStatus enum for metadata loading state tracking
  • Implemented public API for stripe operations and metadata access
  • Added private methods as stubs for future tasks (#5-13)

Implementation Details

  • Mirrors ParquetFileFragment structure (stripes vs row_groups)
  • Uses void* for orc::Reader to avoid header dependencies
  • Thread-safe with inherited physical_schema_mutex_
  • Forward declarations for ORC types in header

Testing

Manual code review following Parquet reference patterns

Task Reference

Completes Task #4 from task_list.json
Depends on: Task #1 (OrcSchemaManifest - complete)
Enables: Tasks #5, #6 (next in dependency chain)
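
To make the shape of the public API above easier to picture, here is a minimal, self-contained sketch. It is not the code from this PR: `ToyOrcFragment`, the two `OrcCacheStatus` values, the string-based cache contents, and the `num_stripes_in_file` parameter are all hypothetical stand-ins, and the real class would extend Arrow's `FileFragment` and read the stripe count from the file footer.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Hypothetical state names; the PR only says the enum tracks loading state.
enum class OrcCacheStatus { kUnloaded, kLoaded };

// Hypothetical layout: one cached guarantee per stripe. A string stands in
// for whatever expression type the real cache holds.
struct StripeStatisticsCache {
  std::vector<std::string> stripe_guarantees;
};

class ToyOrcFragment {
 public:
  // An empty `stripes` list means "all stripes, not yet resolved".
  explicit ToyOrcFragment(std::string path, std::vector<int> stripes = {})
      : path_(std::move(path)), stripes_(std::move(stripes)) {
    assert(!path_.empty());
  }

  // Resolves metadata once; later calls return early. The stripe count is
  // passed in here instead of being read from a file.
  void EnsureCompleteMetadata(std::size_t num_stripes_in_file) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (status_ == OrcCacheStatus::kLoaded) return;
    if (stripes_.empty()) {
      stripes_.resize(num_stripes_in_file);
      std::iota(stripes_.begin(), stripes_.end(), 0);
    }
    cache_.stripe_guarantees.assign(stripes_.size(), "true");
    status_ = OrcCacheStatus::kLoaded;
  }

  std::vector<int> stripes() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return stripes_;
  }

  OrcCacheStatus status() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return status_;
  }

  // New fragment over the same file, restricted to the given stripes.
  std::shared_ptr<ToyOrcFragment> Subset(std::vector<int> stripe_ids) const {
    return std::make_shared<ToyOrcFragment>(path_, std::move(stripe_ids));
  }

  // One single-stripe fragment per currently selected stripe.
  std::vector<std::shared_ptr<ToyOrcFragment>> SplitByStripe() const {
    std::vector<std::shared_ptr<ToyOrcFragment>> out;
    for (int id : stripes()) out.push_back(Subset({id}));
    return out;
  }

 private:
  std::string path_;
  mutable std::mutex mutex_;  // plays the role of the inherited mutex
  std::vector<int> stripes_;
  StripeStatisticsCache cache_;
  OrcCacheStatus status_ = OrcCacheStatus::kUnloaded;
};
```

In this sketch, calling `EnsureCompleteMetadata(3)` on a fragment built without a stripe list selects stripes 0, 1 and 2, and `SplitByStripe()` then yields three fragments with one stripe each.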

- Add OrcFileFragment class extending FileFragment with ORC-specific predicate pushdown
- Add StripeStatisticsCache structure for caching stripe-level statistics expressions
- Add OrcCacheStatus enum for tracking metadata loading state
- Implement public API: stripes(), metadata(), EnsureCompleteMetadata(), Subset(), SplitByStripe()
- Add private methods: FilterStripes(), TestStripes(), TryCountRows() (stubs for later tasks)
- Mirror ParquetFileFragment structure adapted for ORC stripes
- Use void* for orc::Reader to avoid exposing ORC headers in public API
- Thread-safe with mutex locking following Fragment base class pattern

Implementation notes:
- stripes_ replaces Parquet's row_groups_
- statistics_cache_ stores per-stripe guarantee expressions
- Forward declarations for ORC types to minimize header dependencies
- Most methods are stubs marked with TODO for subsequent tasks

Verified: Manual code review following Parquet reference patterns

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
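
The `void*` choice mentioned in the commit message can be illustrated with a small sketch. It shows the general opaque-pointer idea only, under the assumption that the header stores an untyped pointer and the source file casts it back. `OpaqueReaderHolder` and `FakeReader` are hypothetical names, and `FakeReader` stands in for `orc::Reader`, which is deliberately not included here.

```cpp
#include <cassert>
#include <cstdint>

// ---- What would sit in the public header ----
// No reader type is visible here, so users of the header need no ORC includes.
class OpaqueReaderHolder {
 public:
  OpaqueReaderHolder();
  ~OpaqueReaderHolder();
  OpaqueReaderHolder(const OpaqueReaderHolder&) = delete;
  OpaqueReaderHolder& operator=(const OpaqueReaderHolder&) = delete;

  std::uint64_t NumStripes() const;

 private:
  void* reader_;  // actually a FakeReader*, known only to the source file
};

// ---- What would sit in the .cc file ----
// Stand-in for the real reader type; returns a fixed stripe count.
struct FakeReader {
  std::uint64_t NumberOfStripes() const { return 4; }
};

OpaqueReaderHolder::OpaqueReaderHolder() : reader_(new FakeReader()) {}

OpaqueReaderHolder::~OpaqueReaderHolder() {
  // Cast back to the concrete type before deleting; deleting a void* is
  // not valid C++.
  delete static_cast<FakeReader*>(reader_);
}

std::uint64_t OpaqueReaderHolder::NumStripes() const {
  assert(reader_ != nullptr);
  return static_cast<const FakeReader*>(reader_)->NumberOfStripes();
}
```

The trade-off is that every access needs a cast in the source file and the compiler cannot check the pointee type, which is the price paid for keeping ORC headers out of the public API.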
@cbb330 cbb330 force-pushed the task-4-create-orc-file-fragment branch from 02686fe to 79710d0 Compare February 20, 2026 22:31
@cbb330 cbb330 merged commit e245233 into main Feb 20, 2026
45 of 75 checks passed
@cbb330 cbb330 deleted the task-4-create-orc-file-fragment branch February 20, 2026 22:32
cbb330 added a commit that referenced this pull request Feb 20, 2026
- Added BuildOrcSchemaManifest function to build schema manifest from Arrow schema
- Walks Arrow schema using depth-first pre-order traversal
- Assigns ORC column indices starting from 1 (column 0 is root struct)
- Handles container types: STRUCT, LIST, LARGE_LIST, MAP (marked as non-leaf)
- Handles leaf types: primitives (marked as leaf with column index)
- Recursively processes children for nested types
- Foundation for GetOrcColumnIndex (Task 3)

Verified: Code compiles with manifest builder function.

Co-authored-by: Claude Sonnet 4.5 <[email protected]>
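
The traversal described in this commit can be sketched as follows. This is an illustration and not `BuildOrcSchemaManifest` itself: `ToyField`, `ManifestEntry` and `BuildToyManifest` are hypothetical, and a simple name/children tree stands in for an Arrow schema. The sketch assumes container nodes consume a column index as well, which matches how ORC numbers its type tree; the commit message does not say whether the manifest records that index for non-leaf entries.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy schema node: a field with no children is treated as a primitive leaf.
struct ToyField {
  std::string name;
  std::vector<ToyField> children;
};

struct ManifestEntry {
  std::string path;  // dotted path from the top-level field
  int orc_column;    // pre-order index; 0 is reserved for the root struct
  bool is_leaf;
};

// Pre-order visit: record the node first, then recurse into its children.
void Visit(const ToyField& field, const std::string& prefix, int* next_index,
           std::vector<ManifestEntry>* out) {
  const std::string path =
      prefix.empty() ? field.name : prefix + "." + field.name;
  const bool is_leaf = field.children.empty();
  out->push_back({path, (*next_index)++, is_leaf});
  for (const ToyField& child : field.children) {
    Visit(child, path, next_index, out);
  }
}

std::vector<ManifestEntry> BuildToyManifest(
    const std::vector<ToyField>& top_level_fields) {
  std::vector<ManifestEntry> out;
  int next_index = 1;  // column 0 is the implicit root struct
  for (const ToyField& field : top_level_fields) {
    Visit(field, "", &next_index, &out);
  }
  return out;
}
```

For top-level fields `a`, `b` (a struct with children `c` and `d`) and `e` (a list with child `item`), the sketch assigns `a`=1, `b`=2, `b.c`=3, `b.d`=4, `e`=5, `e.item`=6, with `b` and `e` marked as non-leaf.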
cbb330 added a commit that referenced this pull request Feb 20, 2026
Main entry point for stripe filtering with predicate pushdown.
Mirrors Parquet's FilterRowGroups pattern.

Implementation:
1. Ensure complete metadata is loaded (file, manifest, statistics cache)
2. Call TestStripes to evaluate predicate against stripe statistics
3. Filter results to include only stripes where:
   - Predicate is satisfiable (not literal(false))
   - Stripe is non-empty (num_rows > 0)
4. Return vector of selected stripe indices

Stripes are skipped if:
- The predicate simplifies to literal(false) given statistics
- The stripe contains zero rows

This function is called by:
- ScanBatchesAsync (for scan optimization)
- Subset (for fragment splitting)
- TryCountRows (for count optimization)

Verified: Mirrors cpp/src/arrow/dataset/file_parquet.cc FilterRowGroups (lines 918-931)

Co-authored-by: Claude Sonnet 4.5 <[email protected]>
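
The selection rule in steps 2 to 4 of this commit can be sketched in a few lines. This is a simplified illustration, not the merged implementation: `ToyStripe`, `Verdict` and the free functions below are hypothetical, the metadata-loading step is omitted, and the predicate is hard-coded to `x > lower_bound` on a single integer column, whereas the real code simplifies an arbitrary expression against cached per-stripe guarantees.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy per-stripe statistics for a single integer column `x`.
struct ToyStripe {
  std::int64_t num_rows;
  std::int64_t min_x;
  std::int64_t max_x;
};

// kAlwaysFalse plays the role of "predicate simplifies to literal(false)".
enum class Verdict { kAlwaysFalse, kMaybeTrue };

// Evaluates `x > lower_bound` against each stripe's min/max statistics.
std::vector<Verdict> TestStripes(const std::vector<ToyStripe>& stripes,
                                 std::int64_t lower_bound) {
  std::vector<Verdict> verdicts;
  verdicts.reserve(stripes.size());
  for (const ToyStripe& stripe : stripes) {
    // If even the maximum fails the predicate, no row in the stripe can pass.
    verdicts.push_back(stripe.max_x <= lower_bound ? Verdict::kAlwaysFalse
                                                   : Verdict::kMaybeTrue);
  }
  return verdicts;
}

// Returns indices of stripes that are non-empty and might contain matches.
std::vector<int> FilterStripes(const std::vector<ToyStripe>& stripes,
                               std::int64_t lower_bound) {
  const std::vector<Verdict> verdicts = TestStripes(stripes, lower_bound);
  assert(verdicts.size() == stripes.size());
  std::vector<int> selected;
  for (std::size_t i = 0; i < stripes.size(); ++i) {
    if (verdicts[i] == Verdict::kAlwaysFalse) continue;  // unsatisfiable
    if (stripes[i].num_rows <= 0) continue;              // empty stripe
    selected.push_back(static_cast<int>(i));
  }
  return selected;
}
```

Note that `kMaybeTrue` only means the statistics cannot rule the stripe out; rows in a selected stripe still have to be filtered during the scan.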
cbb330 added a commit that referenced this pull request Feb 20, 2026
cbb330 added a commit that referenced this pull request Feb 24, 2026
cbb330 added a commit that referenced this pull request Feb 24, 2026
cbb330 added a commit that referenced this pull request Feb 24, 2026