
Session 2 progress notes #6

Merged
cbb330 merged 1 commit into main from session-1-progress
Feb 20, 2026

Conversation

cbb330 (Owner) commented Feb 20, 2026

Documents progress from Session 2:

cbb330 merged commit 141899d into main on Feb 20, 2026
4 of 6 checks passed
cbb330 deleted the session-1-progress branch on February 20, 2026 22:18
cbb330 added a commit that referenced this pull request Feb 20, 2026
cbb330 added a commit that referenced this pull request Feb 20, 2026
- Enhanced EnsureCompleteMetadata() to fully load ORC file metadata
- Opens ORC reader if not provided, with recursive call pattern
- Reads and validates physical schema from ORC file
- Initializes stripes_ with all stripes if not already set
- Gets ORC type tree and builds OrcSchemaManifest
- Initializes StripeStatisticsCache with per-stripe guarantees
- Validates stripe indices against file's total stripe count
- Thread-safe with mutex locking (physical_schema_mutex_)

Implementation notes:
- Follows Parquet's EnsureCompleteMetadata pattern closely
- Handles recursive call when reader is null (unlock, open, recurse)
- Casts void* reader to ORCFileReader* for method access
- Statistics cache initialized with literal(true) per stripe
- All stripe-level metadata loaded in one function call

Verified: Manual code review following Parquet reference (lines 802-870)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
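
The "unlock, open, recurse" flow described in the commit message above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the code from the commit: `ReaderSketch`, `FragmentSketch`, `OpenReader`, and `SelectStripes` are hypothetical names, a plain `bool` stands in for the `literal(true)` guarantee expression, and exceptions replace Arrow's `Status` returns so the example needs only the C++17 standard library. Releasing the mutex before opening the reader is presumably meant to avoid holding the lock across file I/O.

```cpp
#include <cstdint>
#include <memory>
#include <mutex>
#include <optional>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for an opened ORC reader; only the stripe count matters.
struct ReaderSketch {
  int64_t num_stripes;
};

class FragmentSketch {
 public:
  explicit FragmentSketch(int64_t stripes_in_file)
      : stripes_in_file_(stripes_in_file) {}

  // Restrict the fragment to a subset of stripes before metadata is loaded.
  void SelectStripes(std::vector<int> stripes) { stripes_ = std::move(stripes); }

  // Loads metadata once. With no reader supplied, the lock is released,
  // a reader is opened, and the function calls itself with that reader.
  void EnsureCompleteMetadata(std::shared_ptr<ReaderSketch> reader = nullptr) {
    std::unique_lock<std::mutex> lock(mutex_);
    if (loaded_) return;
    if (reader == nullptr) {
      lock.unlock();  // do not hold the mutex while opening the file
      EnsureCompleteMetadata(OpenReader());
      return;
    }
    if (!stripes_.has_value()) {
      // No explicit selection: default to every stripe in the file.
      stripes_.emplace();
      for (int i = 0; i < reader->num_stripes; ++i) stripes_->push_back(i);
    }
    // Validate the selection against the file's total stripe count.
    for (int s : *stripes_) {
      if (s < 0 || s >= reader->num_stripes) {
        throw std::out_of_range("stripe index exceeds file stripe count");
      }
    }
    // One trivially true guarantee per selected stripe (placeholder for literal(true)).
    guarantees_.assign(stripes_->size(), true);
    loaded_ = true;
  }

  // Only meaningful after EnsureCompleteMetadata() or SelectStripes().
  const std::vector<int>& stripes() const { return *stripes_; }
  const std::vector<bool>& guarantees() const { return guarantees_; }

 private:
  // Hypothetical helper: the real code would open the ORC file here.
  std::shared_ptr<ReaderSketch> OpenReader() const {
    return std::make_shared<ReaderSketch>(ReaderSketch{stripes_in_file_});
  }

  int64_t stripes_in_file_;
  std::mutex mutex_;
  bool loaded_ = false;
  std::optional<std::vector<int>> stripes_;
  std::vector<bool> guarantees_;
};
```

In this sketch two threads may both open a reader if they race, but only the first to re-acquire the lock populates the metadata; the second sees the loaded flag and returns.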
cbb330 added a commit that referenced this pull request Feb 20, 2026
cbb330 added a commit that referenced this pull request Feb 20, 2026
cbb330 added a commit that referenced this pull request Feb 20, 2026
cbb330 mentioned this pull request Feb 20, 2026
cbb330 added a commit that referenced this pull request Feb 20, 2026
cbb330 added a commit that referenced this pull request Feb 20, 2026
Implemented the OrcFileFragment class structure with ORC-specific predicate
pushdown capabilities. This mirrors ParquetFileFragment design but adapted
for ORC's stripe-based organization.

Added structures:
- OrcFileMetadata: Holds file-level metadata (stripe info, row counts, schema)
  Unlike Parquet's rich FileMetaData, ORC metadata is accessed through reader
  methods, so we cache essentials here.
- StripeStatisticsCache: Cache for stripe-level statistics and guarantees.
  Stores derived guarantee expressions, tracks processed fields, and maintains
  per-column completion flags.
- CacheStatus enum: Tracks lazy metadata loading state (uncached/loading/cached)
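
As a rough picture of the two caching structures listed above, the following sketch shows one possible shape. The member names, the `Init` and `MarkProcessed` helpers, the use of strings as placeholders for guarantee expressions, and the treatment of a field index as a column index are all assumptions made for illustration; the commit itself only states what the structures store.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Lazy metadata loading state, following the three states named in the commit.
enum class CacheStatus { kUncached, kLoading, kCached };

// Hypothetical layout for a stripe statistics cache.
struct StripeStatisticsCacheSketch {
  std::vector<std::string> stripe_guarantees;  // one placeholder expression per stripe
  std::unordered_set<int> processed_fields;    // fields whose statistics were consumed
  std::vector<bool> column_complete;           // per-column completion flags

  // Start with a trivially true guarantee for every stripe.
  void Init(std::size_t num_stripes, std::size_t num_columns) {
    stripe_guarantees.assign(num_stripes, "true");
    processed_fields.clear();
    column_complete.assign(num_columns, false);
  }

  // Returns true only the first time a field is processed.
  bool MarkProcessed(int field_index) {
    bool first_time = processed_fields.insert(field_index).second;
    if (first_time) {
      column_complete.at(static_cast<std::size_t>(field_index)) = true;
    }
    return first_time;
  }
};
```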

OrcFileFragment class:
- Extends FileFragment with stripe-level filtering
- Fields: stripes_ (optional selected indices), metadata_, manifest_,
  statistics_cache_, cache_status_
- Public methods: SplitByStripe, stripes(), metadata(), EnsureCompleteMetadata,
  ClearCachedMetadata, Subset (two overloads)
- Private methods: SetMetadata, FilterStripes, TestStripes, TryCountRows
- Method implementations are stubs with TODO references to future tasks

Key design decisions:
- Mirrors Parquet's row_groups → stripes mapping
- Thread safety via physical_schema_mutex_ (inherited from FileFragment)
- Lazy metadata loading with cache status tracking
- Forward declarations for adapters::orc types to avoid circular dependencies
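
Stripe-level filtering of the kind FilterStripes is intended to provide is usually based on min/max pruning. The methods in this commit are stubs, so the sketch below is not their implementation; it is a generic illustration, with the hypothetical names `StripeRange` and `FilterStripesSketch`, of keeping only stripes whose value range can overlap a predicate range on a single integer column.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-stripe statistics for one integer column.
struct StripeRange {
  int64_t min;
  int64_t max;
  bool has_min_max;  // false when the stripe carries no usable statistics
};

// Returns the indices of stripes that may contain a value in [lo, hi].
// Stripes without statistics are kept, since they cannot be ruled out.
std::vector<int> FilterStripesSketch(const std::vector<StripeRange>& stats,
                                     int64_t lo, int64_t hi) {
  std::vector<int> kept;
  for (std::size_t i = 0; i < stats.size(); ++i) {
    const StripeRange& s = stats[i];
    if (!s.has_min_max || (s.max >= lo && s.min <= hi)) {
      kept.push_back(static_cast<int>(i));
    }
  }
  return kept;
}
```

Pruning of this kind is conservative: it can keep a stripe that turns out to contain no matching rows, but it never drops one that does.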

All public methods documented with detailed comments explaining parameters,
return values, and intended usage patterns.

VERIFICATION STATUS: Build/test verification pending due to network
restrictions preventing CMake from downloading dependencies. Implementation
follows ParquetFileFragment patterns exactly. Methods are stubs that will
be implemented in subsequent tasks (#6-#18).

Co-authored-by: Claude Sonnet 4.5 <[email protected]>
@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

cbb330 added a commit that referenced this pull request Feb 24, 2026
- Extended ConvertColumnStatistics to handle Boolean, Date, Timestamp, and Decimal types
- Boolean: populated num_values from getTrueCount + getFalseCount (no min/max)
- Date: Date32Scalar from int32_t days since epoch
- Timestamp: TimestampScalar with nanosecond precision (millis * 1,000,000 + nanos)
- Decimal: Decimal128Scalar from ORC Int128 value with scale
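
The timestamp and boolean conversions listed above reduce to simple arithmetic, sketched here. The function names are hypothetical, and the sketch assumes the nanos component is the sub-millisecond remainder (0 to 999,999) rather than a full nanosecond count; neither point is confirmed by the commit message.

```cpp
#include <cstdint>

constexpr int64_t kNanosPerMilli = 1000000;

// Combines a millisecond timestamp with its sub-millisecond nanosecond
// remainder (assumed to lie in [0, 999999]) into nanoseconds since epoch.
constexpr int64_t TimestampToNanos(int64_t millis, int32_t sub_milli_nanos) {
  return millis * kNanosPerMilli + sub_milli_nanos;
}

// Boolean columns have no min/max; the value count is the sum of both counters.
constexpr uint64_t BooleanNumValues(uint64_t true_count, uint64_t false_count) {
  return true_count + false_count;
}
```

A signed 64-bit nanosecond count covers roughly 292 years on either side of the epoch, so timestamps outside that window would overflow in this representation.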

Added 3 new passing tests:
- GetColumnStatisticsBoolean: verifies num_values, has_min_max=false
- GetColumnStatisticsDate: verifies Date32Scalar min/max conversion
- GetColumnStatisticsTimestamp: verifies TimestampScalar min/max conversion

Verified: build and tests pass
Types supported: int, double, string, bool, date, timestamp (6 types)

Co-authored-by: Claude Sonnet 4.5 <[email protected]>
cbb330 added a commit that referenced this pull request Feb 24, 2026