Task #2: Implement BuildOrcSchemaManifest function #7

Merged
cbb330 merged 1 commit into main from task-2-build-orc-schema-manifest
Feb 20, 2026
@cbb330 (Owner) commented Feb 20, 2026

Summary

Implement the schema manifest building logic that maps Arrow schema fields to ORC physical column indices.

Changes

  • Added BuildSchemaFieldRecursive helper function that:

    • Walks Arrow schema and ORC type tree in parallel using depth-first traversal
    • Assigns column indices following ORC's depth-first pre-order convention
    • Marks leaf nodes (primitives) with column_index for statistics lookup
    • Marks container nodes (struct/list/map) with column_index = -1
    • Builds lookup maps for fast column access
  • Fully implemented OrcSchemaManifest::Make:

    • Validates ORC root type is STRUCT
    • Initializes manifest with schema and empty collections
    • Processes each top-level field recursively
    • Builds column_index_to_field and child_to_parent maps
    • Returns Status::OK() on success
  • Added #include "orc/Type.hh" for ORC type information access
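As a rough illustration of the flow listed above, here is a minimal sketch using toy stand-in types. `OrcNode`, `SchemaField`, `Manifest`, `BuildField`, and `MakeManifest` are hypothetical names chosen for this example, not the actual code from this PR, and the Arrow side of the parallel walk is omitted for brevity.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the ORC type tree and the manifest (not the real API).
enum class Kind { kInt, kString, kDouble, kStruct, kList, kMap };

struct OrcNode {
  Kind kind;
  std::vector<OrcNode> children;
};

struct SchemaField {
  int column_index = -1;  // -1 marks a container node
  std::vector<SchemaField> children;
};

struct Manifest {
  std::vector<SchemaField> fields;
};

bool IsContainer(Kind kind) {
  return kind == Kind::kStruct || kind == Kind::kList || kind == Kind::kMap;
}

// Visits `node` in pre-order. Every node consumes one id from `next_id`,
// but only leaf nodes record it in `column_index`.
SchemaField BuildField(const OrcNode& node, int* next_id) {
  SchemaField field;
  const int id = (*next_id)++;
  if (!IsContainer(node.kind)) {
    field.column_index = id;
  }
  for (const OrcNode& child : node.children) {
    field.children.push_back(BuildField(child, next_id));
  }
  return field;
}

// Returns false when the root is not a struct; otherwise fills `out`.
bool MakeManifest(const OrcNode& root, Manifest* out) {
  assert(out != nullptr);
  if (root.kind != Kind::kStruct) {
    return false;
  }
  out->fields.clear();
  int next_id = 1;  // id 0 belongs to the root struct itself
  for (const OrcNode& child : root.children) {
    out->fields.push_back(BuildField(child, &next_id));
  }
  return true;
}
```

In this sketch a boolean return stands in for the `Status` result mentioned above.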

Implementation Details

ORC Column Indexing:

  • Column 0 = root struct (not used for statistics)
  • User columns start at index 1
  • Uses depth-first pre-order traversal
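To make the numbering concrete, the sketch below assigns pre-order ids to a toy tree for `struct<a:int, b:struct<c:string, d:double>, e:list<int>>`. `TypeNode`, `Node`, `AssignColumnIds`, and `MakeExampleSchema` are hypothetical helpers written for this example; the ORC C++ `Type` class exposes the equivalent values through `getColumnId()` and `getMaximumColumnId()`.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy type tree (not the ORC API).
struct TypeNode {
  std::string name;
  std::vector<TypeNode> children;
  int column_id = -1;      // pre-order id of this node
  int max_column_id = -1;  // largest id found in this node's subtree
};

TypeNode Node(std::string name, std::vector<TypeNode> children = {}) {
  TypeNode node;
  node.name = std::move(name);
  node.children = std::move(children);
  return node;
}

// Assigns ids in depth-first pre-order starting at `next_id`.
// Returns the first id that is still unused after this subtree.
int AssignColumnIds(TypeNode* node, int next_id) {
  assert(node != nullptr);
  node->column_id = next_id++;
  for (TypeNode& child : node->children) {
    next_id = AssignColumnIds(&child, next_id);
  }
  node->max_column_id = next_id - 1;
  return next_id;
}

// struct<a:int, b:struct<c:string, d:double>, e:list<int>>
TypeNode MakeExampleSchema() {
  return Node("root", {Node("a"),
                       Node("b", {Node("c"), Node("d")}),
                       Node("e", {Node("element")})});
}
```

Calling `AssignColumnIds(&root, 0)` on this schema gives root = 0, a = 1, b = 2, c = 3, d = 4, e = 5, and the list element = 6. Containers consume an id even though they carry no leaf statistics, which is why the leaf ids are not contiguous.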

Type Handling:

  • Leaf nodes (primitives): INT, LONG, DOUBLE, STRING, etc. → have statistics
  • Container nodes: STRUCT, LIST, MAP, UNION → no direct statistics, recursively process children
  • Struct: Match Arrow fields by position
  • List: Single value field
  • Map: Key field (index 0) and item field (index 1)
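The classification above could be expressed roughly as follows. This is a sketch under assumptions: `OrcKind`, `IsLeaf`, `HasValidArity`, and the two index constants are hypothetical names for illustration and cover only the kinds named in the list.

```cpp
#include <cstddef>

// Hypothetical subset of ORC type kinds (illustration only).
enum class OrcKind { kInt, kLong, kDouble, kString, kStruct, kList, kMap, kUnion };

// Positions of the map children, as described in the list above.
constexpr std::size_t kMapKeyIndex = 0;
constexpr std::size_t kMapItemIndex = 1;

// Leaf kinds carry statistics; container kinds only delegate to their children.
bool IsLeaf(OrcKind kind) {
  switch (kind) {
    case OrcKind::kStruct:
    case OrcKind::kList:
    case OrcKind::kMap:
    case OrcKind::kUnion:
      return false;
    default:
      return true;
  }
}

// Checks that a node of `kind` has a plausible number of children.
bool HasValidArity(OrcKind kind, std::size_t num_children) {
  switch (kind) {
    case OrcKind::kList:
      return num_children == 1;  // single value field
    case OrcKind::kMap:
      return num_children == 2;  // key and item
    case OrcKind::kStruct:
    case OrcKind::kUnion:
      return true;  // any number of children
    default:
      return num_children == 0;  // primitives have no children
  }
}
```

A check like `HasValidArity` is one way to reject a mismatched tree before recursing into children; whether the PR performs such a check is not stated here.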

Lookup Maps:

  • column_index_to_field: Fast column → field lookup
  • child_to_parent: Parent traversal support
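One way the two maps might be used together is sketched below: look up a field by column index, then follow parent links to its top-level ancestor. The map names come from the description above, but the `Field` and `Lookup` types, the key and value types of the maps, and `TopLevelAncestor` are assumptions made for this example.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical field record (illustration only).
struct Field {
  std::string name;
  int column_index = -1;
};

struct Lookup {
  // Leaf column index -> field.
  std::unordered_map<int, const Field*> column_index_to_field;
  // Child field -> parent field; top-level fields have no entry.
  std::unordered_map<const Field*, const Field*> child_to_parent;
};

// Returns the top-level ancestor of the field stored at `column_index`,
// or nullptr when the index is not in the map.
const Field* TopLevelAncestor(const Lookup& lookup, int column_index) {
  auto it = lookup.column_index_to_field.find(column_index);
  if (it == lookup.column_index_to_field.end()) {
    return nullptr;
  }
  const Field* field = it->second;
  auto parent = lookup.child_to_parent.find(field);
  while (parent != lookup.child_to_parent.end()) {
    field = parent->second;
    parent = lookup.child_to_parent.find(field);
  }
  return field;
}
```

The stored pointers are non-owning, so in this sketch the `Field` objects must outlive the `Lookup`.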

Testing

  • Manual code review completed - follows ORC type system patterns
  • Build verification pending (environment issues)
  • Unit tests will be added in future tasks

Task Reference

Completes Task #2 from task_list.json - Core Data Structures phase
Depends on Task #1 (complete)
Enables Task #3 (GetOrcColumnIndex function)

Co-Authored-By: Claude Sonnet 4.5 [email protected]

- Implemented BuildSchemaFieldRecursive helper for depth-first traversal
- Walks Arrow schema and ORC type tree in parallel
- Assigns column indices using ORC depth-first pre-order (col 0 = root struct)
- Marks leaf nodes (primitives) with column_index for statistics
- Marks container nodes (struct/list/map) with column_index = -1
- Builds column_index_to_field and child_to_parent lookup maps
- Handles struct, list, and map types with proper child matching
- Added orc/Type.hh include for ORC type information

Implementation details:
- Column indexing starts at 1 (column 0 is root struct)
- Leaf nodes are primitives that have statistics
- Container types recursively process children
- Validates ORC root type is STRUCT

Verified: Manual code review - follows ORC depth-first pre-order pattern

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
@cbb330 cbb330 merged commit eed6e86 into main Feb 20, 2026
44 of 74 checks passed
@cbb330 cbb330 deleted the task-2-build-orc-schema-manifest branch February 20, 2026 22:20
@cbb330 cbb330 mentioned this pull request Feb 20, 2026
cbb330 added a commit that referenced this pull request Feb 20, 2026
@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:
