Mark Task #0 as complete #3

Merged
cbb330 merged 1 commit into main from update-task-0-status
Feb 20, 2026

Conversation

@cbb330 cbb330 commented Feb 20, 2026

Updates task_list.json to mark Task #0 as complete following successful merge of PR #2.

@cbb330 cbb330 merged commit b3b8e0c into main Feb 20, 2026
7 of 10 checks passed
@cbb330 cbb330 deleted the update-task-0-status branch February 20, 2026 22:14
cbb330 added a commit that referenced this pull request Feb 20, 2026
- Implemented GetOrcColumnIndex helper function that:
  - Resolves FieldRef to ORC column index using manifest
  - Uses FieldRef.FindOne() to locate field in schema
  - Traverses manifest tree following field path indices
  - Handles both top-level and nested fields
  - Returns column_index for leaf nodes (primitives with statistics)
  - Returns std::nullopt for containers or not found

- Added necessary includes:
  - <optional> for std::optional return type
  - arrow/compute/api_scalar.h for FieldRef and FieldPath

Implementation details:
- Top-level fields accessed via manifest.schema_fields[index]
- Nested fields traversed via current_field->children[index]
- Validates indices at each level to prevent out-of-bounds
- Only returns column_index if field is leaf (has statistics)
- Containers (struct/list/map) return nullopt

Verified: Manual code review - follows FieldRef resolution pattern

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
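
The traversal described in this commit message can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the commit: `ManifestField`, `Manifest`, `ResolveColumnIndex`, and `MakeExampleManifest` are hypothetical stand-ins, and the field path is assumed to have already been resolved (for example by `FieldRef.FindOne()`) into a list of child indices.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-in for one node of the schema manifest.
struct ManifestField {
  int column_index = -1;                // ORC column id; meaningful for leaves only
  bool is_leaf = true;                  // primitives are leaves; struct/list/map are not
  std::vector<ManifestField> children;  // empty for leaf fields
};

// Hypothetical stand-in for the manifest: one entry per top-level field.
struct Manifest {
  std::vector<ManifestField> schema_fields;
};

// Follows `path` (child indices, outermost first) through the manifest tree.
// Returns the ORC column index when the path ends on a leaf, and
// std::nullopt for containers, empty paths, or out-of-range indices.
std::optional<int> ResolveColumnIndex(const Manifest& manifest,
                                      const std::vector<int>& path) {
  if (path.empty()) return std::nullopt;
  const std::vector<ManifestField>* level = &manifest.schema_fields;
  const ManifestField* current = nullptr;
  for (int index : path) {
    // Validate the index at every level to avoid out-of-bounds access.
    if (index < 0 || static_cast<std::size_t>(index) >= level->size()) {
      return std::nullopt;
    }
    current = &(*level)[static_cast<std::size_t>(index)];
    level = &current->children;
  }
  assert(current != nullptr);  // the path is non-empty, so the loop ran
  if (!current->is_leaf) return std::nullopt;
  return current->column_index;
}

// Example manifest for struct<a: int, b: struct<c: int, d: string>>,
// assuming the root struct occupies ORC column 0.
Manifest MakeExampleManifest() {
  ManifestField a{1, true, {}};
  ManifestField c{3, true, {}};
  ManifestField d{4, true, {}};
  ManifestField b{2, false, {c, d}};
  return Manifest{{a, b}};
}
```

With the example manifest, the path `{1, 1}` (field `b.d`) resolves to column 4, while `{1}` (the struct `b` itself) and `{5}` (out of range) both yield `std::nullopt`.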
cbb330 added a commit that referenced this pull request Feb 20, 2026
cbb330 added a commit that referenced this pull request Feb 20, 2026
@cbb330 cbb330 mentioned this pull request Feb 20, 2026
cbb330 added a commit that referenced this pull request Feb 20, 2026
cbb330 added a commit that referenced this pull request Feb 20, 2026
@cbb330 cbb330 mentioned this pull request Feb 20, 2026
cbb330 added a commit that referenced this pull request Feb 20, 2026
cbb330 added a commit that referenced this pull request Feb 20, 2026
cbb330 added a commit that referenced this pull request Feb 20, 2026
cbb330 added a commit that referenced this pull request Feb 20, 2026
Adds comprehensive task tracking and progress documentation for the
ongoing ORC predicate pushdown implementation project.

## Changes
- task_list.json: Complete 35-task breakdown with dependencies
  - Tasks #0, #0.5, #1, #2 marked as complete (on feature branches)
  - Tasks #3-#35 pending implementation
  - Organized by phase: Prerequisites, Core, Metadata, Predicate, Scan, Testing, Future
- claude-progress.txt: Comprehensive project status document
  - Codebase structure and build instructions
  - Work completed on feature branches (not yet merged)
  - Current main branch state
  - Next steps and implementation strategy
  - Parquet mirroring patterns and Allium spec alignment

## Context
This is an initialization session to establish baseline tracking for the
ORC predicate pushdown project. Previous sessions (1-4) completed initial
tasks on feature branches. This consolidates that progress and provides
a clear roadmap for future implementation sessions.

## Related Work
- Allium spec: orc-predicate-pushdown.allium (already on main)
- Feature branches: task-0-statistics-api-v2, task-0.5-stripe-selective-reading,
  task-1-orc-schema-manifest, task-2-build-orc-schema-manifest (not yet merged)

## Next Steps
Future sessions will implement tasks #3+ via individual feature branch PRs.
cbb330 added a commit that referenced this pull request Feb 20, 2026
- Added GetOrcColumnIndex function to resolve FieldRef to ORC column index
- Handles top-level fields via direct lookup
- Handles nested fields via manifest tree traversal
- Returns nullopt if field not found or not a leaf field
- Only leaf fields have statistics and valid column indices
- Added <optional> include for std::optional support

Verified: Code structure follows Parquet pattern
cbb330 added a commit that referenced this pull request Feb 20, 2026
cbb330 added a commit that referenced this pull request Feb 20, 2026
Implemented the GetOrcColumnIndex function that resolves field references
to ORC physical column indices using the schema manifest. This is a critical
component for predicate pushdown to map Arrow field references to ORC columns
for statistics lookup.

Implementation details:
- Added GetOrcColumnIndex() in internal namespace (file_orc.cc)
- Handles top-level field resolution (simple name references)
- Handles nested field resolution (traverses manifest tree)
- Returns std::nullopt for:
  * Fields not found in manifest
  * Container types (struct, list, map) with no single column index
  * Non-name field references (positional, etc.)

Testing:
- Added GetOrcColumnIndex_TopLevelFields test
  * Verifies resolution of simple top-level fields
  * Tests non-existent field returns nullopt
- Added GetOrcColumnIndex_NestedFields test
  * Verifies nested field traversal through struct
  * Tests container field returns nullopt (no single column)
  * Tests invalid nested paths return nullopt

Design follows Parquet's ResolveOneFieldRef pattern adapted for ORC's
manifest structure.

VERIFICATION STATUS: Build/test verification pending due to network
restrictions preventing CMake from downloading dependencies. Previous
session (Task #2) verified all code compiles and tests pass. This
implementation follows established patterns exactly and includes the
necessary <optional> header.

Co-authored-by: Claude Sonnet 4.5 <[email protected]>
cbb330 added a commit that referenced this pull request Feb 20, 2026
cbb330 added a commit that referenced this pull request Feb 20, 2026
@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

cbb330 added a commit that referenced this pull request Feb 24, 2026
cbb330 added a commit that referenced this pull request Feb 24, 2026
cbb330 added a commit that referenced this pull request Feb 24, 2026
cbb330 added a commit that referenced this pull request Feb 24, 2026
cbb330 added a commit that referenced this pull request Feb 24, 2026
cbb330 added a commit that referenced this pull request Feb 24, 2026
cbb330 added a commit that referenced this pull request Feb 24, 2026
Added GetORCType() method to ORCFileReader that returns a pointer to the
ORC Type object. This is needed for building schema manifests that map
Arrow schema fields to ORC physical column indices.

The ORC type tree uses depth-first pre-order numbering where column 0 is
the root struct, column 1 is the first top-level field, etc.

Returns const void* to avoid exposing ORC headers in the public Arrow API.
Callers should cast the result to const orc::Type* before using it.

Co-authored-by: Claude Sonnet 4.5 <[email protected]>
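
The depth-first pre-order numbering mentioned in this commit message can be illustrated with a small sketch. It does not use the ORC library: `TypeNode` and `AssignColumnIds` are hypothetical names, and the tree is built by hand purely to show how column ids would fall out of a pre-order walk in which the root struct receives id 0.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a node in an ORC type tree.
struct TypeNode {
  std::string name;                // empty for the root struct
  std::vector<TypeNode> children;  // empty for primitive types
};

// Walks the tree in depth-first pre-order and appends one
// (column id, dotted field path) pair per node. A node's id is simply
// the number of nodes visited before it, so the root gets 0.
void AssignColumnIds(const TypeNode& node, const std::string& prefix,
                     std::vector<std::pair<int, std::string>>* out) {
  assert(out != nullptr);
  const int id = static_cast<int>(out->size());
  const std::string path =
      prefix.empty() ? node.name : prefix + "." + node.name;
  out->emplace_back(id, path);
  // Children are numbered before any later sibling of `node`.
  for (const TypeNode& child : node.children) {
    AssignColumnIds(child, path, out);
  }
}
```

For a root struct with fields `a`, `b` (a struct containing `c` and `d`), and `e`, this walk assigns 0 to the root, 1 to `a`, 2 to `b`, 3 to `b.c`, 4 to `b.d`, and 5 to `e`. Note that the nested fields of `b` are numbered before `e`, which is why a top-level field's position in the schema is generally not its ORC column index.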
cbb330 added a commit that referenced this pull request Feb 24, 2026
Task #3 (Add GetORCType accessor to expose ORC type tree) was completed and merged in PR apache#137.