
Task #4: Create OrcFileFragment class#12

Closed
cbb330 wants to merge 137 commits into main from task-4-create-orc-file-fragment

Conversation


@cbb330 cbb330 commented Feb 20, 2026

Summary

Implements OrcFileFragment class with ORC-specific predicate pushdown capabilities.

Changes

  • Added OrcFileFragment class extending FileFragment
  • Added StripeStatisticsCache for caching stripe-level statistics
  • Added OrcCacheStatus enum for metadata loading state tracking
  • Implemented public API for stripe operations and metadata access
  • Added private methods as stubs for future tasks (Tasks #5-13)

Implementation Details

  • Mirrors ParquetFileFragment structure (stripes vs row_groups)
  • Uses void* for orc::Reader to avoid header dependencies
  • Thread-safe with inherited physical_schema_mutex_
  • Forward declarations for ORC types in header
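
The described structure can be sketched as a toy Python outline of the public surface (names follow the description above; the actual implementation is C++, so everything here is a hypothetical illustration):

```python
from enum import Enum

class OrcCacheStatus(Enum):
    """Hypothetical states for tracking how much ORC metadata is loaded."""
    UNLOADED = 0
    PARTIAL = 1
    COMPLETE = 2

class StripeStatisticsCache:
    """Caches per-stripe guarantee expressions derived from statistics."""
    def __init__(self):
        self.guarantees = {}                 # stripe index -> cached expression
        self.status = OrcCacheStatus.UNLOADED

class OrcFileFragment:
    """Mirrors ParquetFileFragment, with stripes in place of row groups."""
    def __init__(self, stripes=None):
        self._stripes = stripes              # None until metadata is loaded
        self._statistics_cache = StripeStatisticsCache()

    def stripes(self):
        return self._stripes

    def ensure_complete_metadata(self):
        # Stub: a later task loads stripe statistics from the ORC reader.
        self._statistics_cache.status = OrcCacheStatus.COMPLETE
```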

Testing

Manual code review following Parquet reference patterns

Task Reference

Completes Task #4 from task_list.json
Depends on: Task #1 (OrcSchemaManifest - complete)
Enables: Tasks #5, #6 (next in dependency chain)

raulcd and others added 30 commits January 26, 2026 11:08
… CFlightInfo to nullptr instead of NULL (apache#48968)

### Rationale for this change

Cython built code is currently failing to compile on free threaded wheels due to:
```
/arrow/python/build/temp.linux-x86_64-cpython-313t/_flight.cpp: In function ‘PyObject* __pyx_gb_7pyarrow_7_flight_12FlightClient_9do_action_2generator2(__pyx_CoroutineObject*, PyThreadState*, PyObject*)’:
/arrow/python/build/temp.linux-x86_64-cpython-313t/_flight.cpp:43068:110: error: call of overloaded ‘unique_ptr(NULL)’ is ambiguous
43068 |           __pyx_t_3 = (__pyx_cur_scope->__pyx_v_result->result == ((std::unique_ptr< arrow::flight::Result> )NULL));
      |                            
```

### What changes are included in this PR?

Update comparing `unique_ptr[CFlightResult]` and `unique_ptr[CFlightInfo]` from `NULL` to `nullptr`.

### Are these changes tested?

Yes via archery.

### Are there any user-facing changes?

No

* GitHub Issue: apache#48965

Authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
…apache#48925)

### What changes are included in this PR?

Bug fixes and robustness improvements in the IPC file reader:
* Fix bug reading variadic buffers with pre-buffering enabled
* Fix bug reading dictionaries with pre-buffering enabled
* Validate IPC buffer offsets and lengths

Testing improvements:
* Exercise pre-buffering in IPC tests
* Actually exercise variadic buffers in IPC tests, by ensuring non-inline binary views are generated
* Run fuzz targets on golden IPC integration files in ASAN/UBSAN CI job
* Exercise pre-buffering in the IPC file fuzz target

Miscellaneous:
* Add convenience functions for integer overflow checking

### Are these changes tested?

Yes, by existing and improved tests.

### Are there any user-facing changes?

Bug fixes.

**This PR contains a "Critical Fix".** Fixes a potential crash reading variadic buffers with pre-buffering enabled.

* GitHub Issue: apache#48924

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…river and the Flight Client (apache#48967)

### Rationale for this change

The bug breaks a Flight SQL server that refreshes the auth token when cookie authentication is enabled.

### What changes are included in this PR?

1. In the ODBC layer, removed the code that adds a second ClientCookieMiddlewareFactory to the client options (the first one is registered in `BuildFlightClientOptions`). This fixes the duplicate header cookie fields.
2. In the Flight client layer, switched the cookie cache, which is an unordered map, from the case-insensitive less-than comparator to the case-insensitive equality comparator. This fixes the duplicate cookie keys.
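
The comparator mismatch can be illustrated with a minimal Python sketch (hypothetical class name): a hash-based map needs case-insensitive *equality* plus a consistent hash, not an ordering relation, or logically-equal keys like `Cookie` and `cookie` land in separate buckets as duplicates.

```python
class CaseInsensitiveKey:
    """Key wrapper giving a dict case-insensitive equality and hashing."""
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return (isinstance(other, CaseInsensitiveKey)
                and self.name.lower() == other.name.lower())

    def __hash__(self):
        # Hash must agree with __eq__, so hash the normalized form.
        return hash(self.name.lower())

cookies = {}
cookies[CaseInsensitiveKey("Session-Id")] = "abc"
cookies[CaseInsensitiveKey("session-id")] = "xyz"   # overwrites; no duplicate key
```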

### Are these changes tested?
Manually on Windows, and CI

### Are there any user-facing changes?

No
* GitHub Issue: apache#48966

Authored-by: jianfengmao <[email protected]>
Signed-off-by: David Li <[email protected]>
…e buffer is empty (apache#48692)

### Rationale for this change
WriteArrowSerialize unconditionally read values from the Arrow array, even for null rows. Since the caller could provide a zero-sized dummy buffer for an all-null array, this caused an ASAN heap-buffer-overflow.

### What changes are included in this PR?
Check early that the array is not all-null before serializing it.

### Are these changes tested?

Added tests.
### Are there any user-facing changes?

No

* GitHub Issue: apache#48691

Authored-by: rexan <[email protected]>
Signed-off-by: Gang Wu <[email protected]>
…r.msix to fix docker rebuild on Windows wheels (apache#48948)

### Rationale for this change

As soon as we have to rebuild our Windows docker images they will fail installing python-manager-25.0.msix

### What changes are included in this PR?

- Use `pymanager.msi` to install python version instead of `pymanager.msix` which has problems on Docker.
- Update `pymanager install` command to use newer API (old command fails with missing flags)
- Update default python command to use the free-threaded required suffix if free-threaded wheels

### Are these changes tested?

Yes via archery

### Are there any user-facing changes?

No
* GitHub Issue: apache#48947

Authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
)

### Rationale for this change

There are date32 and date64 variants for date arrays.

### What changes are included in this PR?

* Add `ArrowFormat::DateType#to_flatbuffers`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: apache#48990

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…che#48993)

### Rationale for this change

It's a large variant of UTF-8 array.

### What changes are included in this PR?

* Add `ArrowFormat::LargeUTF8Type#to_flatbuffers`
* Add support for large UTF-8 array of `#values` and `#raw_records`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: apache#48992

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…::FileReader::ReadRowGroup(s) (apache#48982)

### Rationale for this change
`FileReader::ReadRowGroup(s)` previously returned `Status` and required callers to pass an `out` parameter.
### What changes are included in this PR?
Introduce `Result<std::shared_ptr<Table>>` returning APIs to allow clearer error propagation:
  - Add new Result-returning `ReadRowGroup()` / `ReadRowGroups()` methods
  - Deprecate the old Status/out-parameter overloads
  - Update C++ callers and R/Python/GLib bindings to use the new API
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
Status versions of FileReader::ReadRowGroup(s) have been deprecated.
```cpp
virtual ::arrow::Status ReadRowGroup(int i, const std::vector<int>& column_indices,
                                     std::shared_ptr<::arrow::Table>* out);
virtual ::arrow::Status ReadRowGroup(int i, std::shared_ptr<::arrow::Table>* out);

virtual ::arrow::Status ReadRowGroups(const std::vector<int>& row_groups,
                                      const std::vector<int>& column_indices,
                                      std::shared_ptr<::arrow::Table>* out);
virtual ::arrow::Status ReadRowGroups(const std::vector<int>& row_groups,
                                      std::shared_ptr<::arrow::Table>* out);
```
* GitHub Issue: apache#48949

Lead-authored-by: fenfeng9 <[email protected]>
Co-authored-by: fenfeng9 <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Gang Wu <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…essions (apache#48989)

### Rationale for this change

Some node options and expressions miss arguments reference. If they miss, arguments may be freed by GC.

### What changes are included in this PR?

* Refer arguments of `garrow_filter_node_options_new()`
* Refer arguments of `garrow_project_node_options_new()`
* Refer arguments of `garrow_aggregate_node_options_new()`
* Refer arguments of `garrow_literal_expression_new()`
* Refer arguments of `garrow_call_expression_new()`
 
### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: apache#48985

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…s found on emscripten jobs (apache#49007)

### Rationale for this change

When looking for the wheel the script was falling back to returning a 404 even when the wheel was found:
```
 + python scripts/run_emscripten_tests.py dist/pyarrow-24.0.0.dev31-cp312-cp312-pyodide_2024_0_wasm32.whl --dist-dir=/pyodide --runtime=chrome
127.0.0.1 - - [27/Jan/2026 01:14:50] code 404, message File not found
```
This caused the job to time out and fail.

### What changes are included in this PR?

Correct logic and only return 404 if the file requested wasn't found.

### Are these changes tested?

Yes via archery

### Are there any user-facing changes?

No
* GitHub Issue: apache#47692

Authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
…king (apache#48974)

### Rationale for this change

Benchmark failing since C++20 upgrade due to lack of C++20 configuration

### What changes are included in this PR?

Changes entirely from 🤖 (Claude) with discussion from me regarding optimal approach.  

Description as follows:

> conda-forge's R package doesn't have CXX20 configured in Makeconf, even though the compiler (gcc 14.3.0) supports C++20. This causes Arrow R package installation to fail with "a C++20 compiler is required" because `R CMD config CXX20` returns empty. 
>
> This PR adds CXX20 configuration to R's Makeconf before building the Arrow R package in the benchmark hooks, if not already present.                                                               

### Are these changes tested?

I got 🤖  to try it locally in a container but I'm not convinced we'll know for sure til we try it out properly.

>  Tested in Docker container with Amazon Linux 2023 + conda-forge R - confirmed `R CMD config CXX20` returns empty before patch and `g++` after patch.
>
> The only thing we didn't test end-to-end was actually building Arrow R, but that would have taken much longer and the configure check (R CMD config CXX20 returning non-empty) is exactly what Arrow's configure script tests before proceeding.                                       

### Are there any user-facing changes?

Nope
* GitHub Issue: apache#48912

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Nic Crane <[email protected]>
…ch is empty (apache#48718)

### Rationale for this change

Fixes apache#36889

When writing CSV from a table where the first batch is empty, the header gets written twice:

```python
table = pa.table({"col1": ["a", "b", "c"]})
combined = pa.concat_tables([table.schema.empty_table(), table])
write_csv(combined, buf)
# Result: "col1"\n"col1"\n"a"\n"b"\n"c"\n  <-- header appears twice
```

### What changes are included in this PR?

The bug happens because:
1. Header is written to `data_buffer_` and flushed during `CSVWriterImpl` initialization
2. The buffer is not cleared after flush
3. When the next batch is empty, `TranslateMinimalBatch` returns early without modifying `data_buffer_`
4. The write loop then writes `data_buffer_` which still contains stale content

The fix introduces a `WriteAndClearBuffer()` helper that writes the buffer to sink and clears it. This helper is used in all write paths:
- `WriteHeader()`
- `WriteRecordBatch()`
- `WriteTable()`

This ensures the buffer is always clean after any flush, making it impossible for stale content to be written again.
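
The failure mode and the fix can be sketched with a toy Python writer (hypothetical names; the real code is C++):

```python
class ToyCsvWriter:
    """Toy writer reproducing the stale-buffer pattern described above."""
    def __init__(self, sink, header):
        self.sink = sink
        self.buffer = header + "\n"
        self._write_and_clear()          # the fix: flush *and* clear

    def _write_and_clear(self):
        self.sink.append(self.buffer)
        self.buffer = ""                 # without this line, the header would
                                         # be re-written by the next flush

    def write_batch(self, rows):
        if not rows:                     # empty batch leaves the buffer untouched,
            self._write_and_clear()      # but flushing is safe: it is already empty
            return
        self.buffer = "".join(r + "\n" for r in rows)
        self._write_and_clear()

sink = []
w = ToyCsvWriter(sink, '"col1"')
w.write_batch([])                        # empty first batch: no duplicate header
w.write_batch(['"a"', '"b"'])
```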

### Are these changes tested?

Yes. Added C++ tests in `writer_test.cc` and Python tests in `test_csv.py`:
- Empty batch at start of table
- Empty batch in middle of table

### Are there any user-facing changes?

No API changes. This is a bug fix that prevents duplicate headers when writing CSV from tables with empty batches.

* GitHub Issue: apache#36889

Lead-authored-by: Ruiyang Wang <[email protected]>
Co-authored-by: Ruiyang Wang <[email protected]>
Co-authored-by: Gang Wu <[email protected]>
Signed-off-by: Gang Wu <[email protected]>
…DBC Nightly Package (apache#48933)

### Rationale for this change
apache#48932
### What changes are included in this PR?
- Fix `rsync` build error in the ODBC Nightly Package
### Are these changes tested?
- tested in CI
### Are there any user-facing changes?
- After fix, users should be able to get Nightly ODBC package release

* GitHub Issue: apache#48932

Authored-by: Alina (Xi) Li <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…he#48952)

### Rationale for this change

Add guidance re AI tooling

### What changes are included in this PR?

Updates to main docs and links to it from new contributor's guide

### Are these changes tested?

No, but I'll build the docs

### Are there any user-facing changes?

Just docs

:robot: Changes generated using Claude Code - I took the discussion from the mailing list, asked it to add the original text and then apply suggested changes one at a time, made a few of my own tweaks, and then instructed it to edit things down a bit for clarity and conciseness.
* GitHub Issue: apache#48951

Lead-authored-by: Nic Crane <[email protected]>
Co-authored-by: Rok Mihevc <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Nic Crane <[email protected]>
### Rationale for this change

`sphinx-build` allows for parallel operation, but it builds serially by default and that can be very slow on our docs given the amount of documents (many of them auto-generated from API docs).

### Are these changes tested?

By existing CI jobs.

### Are there any user-facing changes?

No.
* GitHub Issue: apache#49029

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
### Rationale for this change

This functionality is unused now that we have a proper atfork facility.

### Are these changes tested?

By existing CI tests.

### Are there any user-facing changes?

Removing an API that was always meant for internal use (though we didn't flag it explicitly as internal).

* GitHub Issue: apache#33450

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…t& return types (apache#48956)

### Rationale for this change

The TODO comment in `vector_array_sort.cc`, asking whether `DictionaryArray::dictionary()` and `DictionaryArray::indices()` should return `const&`, is obsolete.

It was added in commit 6ceb12f when dictionary array sorting was implemented. At that time, these methods returned `std::shared_ptr<Array>` by value, causing unnecessary copies.

The issue was fixed in commit 95a8bfb, which changed both methods to return `const std::shared_ptr<Array>&`, removing the copies. However, the TODO comment was never removed.

### What changes are included in this PR?

Removed the outdated TODO comment that referenced apacheGH-35437.

### Are these changes tested?

I did not test.

### Are there any user-facing changes?

No.
* GitHub Issue: apache#35437

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…che#49008)

### Rationale for this change

When running the python-sdist job we are currently not uploading the build artifact to the job.

### What changes are included in this PR?

Upload artifact as part of building the job so it's easier to test and validate contents if necessary.

### Are these changes tested?

Yes via archery.

### Are there any user-facing changes?

No

* GitHub Issue: apache#48586

Authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
### Rationale for this change

CI needs updating to test old R package versions

### What changes are included in this PR?

Add 22.0.0.1

### Are these changes tested?

Nah, it's CI stuff

### Are there any user-facing changes?

No

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
)

### Rationale for this change
See issue apache#48961
Pandas 3.0.0 string storage type changes https://github.com/pandas-dev/pandas/pull/62118/changes
and https://pandas.pydata.org/docs/whatsnew/v3.0.0.html#dedicated-string-data-type-by-default

### What changes are included in this PR?
Updating several doctest examples from `string` to `large_string`.

### Are these changes tested?
Yes, locally.

### Are there any user-facing changes?
No.

Closes apache#48961 
* GitHub Issue: apache#48961

Authored-by: Tadeja Kadunc <[email protected]>
Signed-off-by: AlenkaF <[email protected]>
…nchmarking (apache#49038)

### Rationale for this change

Slow benchmarks due to conda duckdb building from source

### What changes are included in this PR?

Try ditching conda and installing R via rig and using PPM binaries

### Are these changes tested?

I'll try running

### Are there any user-facing changes?
 
Nope
* GitHub Issue: apache#49037

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Nic Crane <[email protected]>
### Rationale for this change

This patch was integrated upstream in microsoft/mimalloc#1139

### Are these changes tested?

By existing CI.

### Are there any user-facing changes?

No.
* GitHub Issue: apache#49042

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
### Rationale for this change

Default Debian version in `.env` now maps to oldstable, we should use stable instead.
Also prune entries that are not used anymore.

### Are these changes tested?

By existing CI jobs.

### Are there any user-facing changes?

No.
* GitHub Issue: apache#49024

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
)

### Rationale for this change

There are 32/64 bit and second/millisecond/microsecond/nanosecond variants for time arrays.

### What changes are included in this PR?

* Add `ArrowFormat::TimeType#to_flatbuffers`
* Add bit width information to `ArrowFormat::TimeType`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: apache#49027

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
apache#49031)

### Rationale for this change

It's a fixed size variant of binary array.

### What changes are included in this PR?

* Add `ArrowFormat::FixedSizeBinaryType#to_flatbuffers`
* Add `ArrowFormat::FixedSizeBinaryArray#each_buffer`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: apache#49030

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…s in `castTIMESTAMP_utf8` and `castTIME_utf8` (apache#48867)

### Rationale for this change

Fixes apache#48866. The Gandiva precompiled time functions `castTIMESTAMP_utf8` and `castTIME_utf8` currently reject timestamp and time string literals with more than 3 subsecond digits (beyond millisecond precision), throwing an "Invalid millis" error. This behavior is inconsistent with other implementations.

### What changes are included in this PR?

- Fixed `castTIMESTAMP_utf8` and `castTIME_utf8` functions to truncate subseconds beyond 3 digits instead of throwing an error
- Updated tests. Replaced error-expecting tests with truncation verification tests and added edge cases
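
The truncation behavior can be sketched in Python (hypothetical helper name; the actual functions are Gandiva precompiled C++):

```python
def parse_subseconds_to_millis(frac_digits: str) -> int:
    """Truncate a subsecond digit string to millisecond precision
    instead of rejecting it, per the described fix."""
    truncated = frac_digits[:3]              # keep at most 3 subsecond digits
    if not truncated:
        return 0
    return int(truncated.ljust(3, "0"))      # pad short fractions: "5" -> 500 ms
```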

### Are these changes tested?

Yes

### Are there any user-facing changes?

No
* GitHub Issue: apache#48866

Authored-by: Arkadii Kravchuk <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…d+ pattern before removing lines (apache#48674)

### Rationale for this change

This PR proposes to fix the TODO at https://github.com/apache/arrow/blob/7ebc88c8fae62ed97bc30865c845c8061132af7e/cpp/src/arrow/status.cc#L131-L134, allowing better parsing of line numbers.

I could not find the relevant example to demonstrate within this project but assume that we have a test such as:

(Generated by ChatGPT)

```cpp
TEST(BlockParser, ErrorMessageWithColonsPreserved) {
  Status st(StatusCode::Invalid,
            "CSV parse error: Row #2: Expected 2 columns, got 3: 12:34:56,key:value,data\n"
            "Error details: Time format: 12:34:56, Key: value\n"
            "parser_test.cc:940  Parse(parser, csv, &out_size)");

  std::string expected_msg =
      "Invalid: CSV parse error: Row #2: Expected 2 columns, got 3: 12:34:56,key:value,data\n"
      "Error details: Time format: 12:34:56, Key: value";

  ASSERT_RAISES_WITH_MESSAGE(Invalid, expected_msg, st);
}

// Test with URL-like data (another common case with colons)
TEST(BlockParser, ErrorMessageWithURLPreserved) {
  Status st(StatusCode::Invalid,
            "CSV parse error: Row #2: Expected 1 columns, got 2: http://arrow.apache.org:8080/api,data\n"
            "URL: http://arrow.apache.org:8080/api\n"
            "parser_test.cc:974  Parse(parser, csv, &out_size)");

  std::string expected_msg =
      "Invalid: CSV parse error: Row #2: Expected 1 columns, got 2: http://arrow.apache.org:8080/api,data\n"
      "URL: http://arrow.apache.org:8080/api";

  ASSERT_RAISES_WITH_MESSAGE(Invalid, expected_msg, st);
}
```

then it fails.

### What changes are included in this PR?

Fixed `Status::ToStringWithoutContextLines()` to only remove context lines matching the `filename:line` pattern (`:\d+`), preventing legitimate error messages containing colons from being incorrectly stripped.
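
A Python sketch of the tightened matching (the real implementation is C++ in `status.cc`): only lines that begin with a `filename:line` prefix, i.e. a token ending in `:` plus digits, are treated as context lines, so error text containing colons survives.

```python
import re

# A context line looks like "parser_test.cc:940  Parse(parser, csv, &out_size)":
# a file name, a colon, a line number, then the failing expression.
CONTEXT_LINE = re.compile(r"^\S+:\d+\s")

def strip_context_lines(message: str) -> str:
    kept = [ln for ln in message.split("\n") if not CONTEXT_LINE.match(ln)]
    return "\n".join(kept)

msg = ("CSV parse error: Row #2: got 3: 12:34:56,key:value\n"
       "parser_test.cc:940  Parse(parser, csv, &out_size)")
```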

### Are these changes tested?

Manually tested, and unittests were added, with `cmake .. --preset ninja-debug -DARROW_EXTRA_ERROR_CONTEXT=ON`.

### Are there any user-facing changes?

No, test-only.

* GitHub Issue: apache#48673

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…dding required user-agent on urllib request (apache#49052)

### Rationale for this change

See: apache#49044

### What changes are included in this PR?

urllib requests now include the header `"user-agent": "pyarrow"`

### Are these changes tested?

It's a CI fix.

### Are there any user-facing changes?

No, just a CI test fix.
* GitHub Issue: apache#49044

Authored-by: Rok Mihevc <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
…d and add check to validate LICENSE.txt and NOTICE.txt are part of the wheel contents (apache#48988)

### Rationale for this change

Currently the files are missing from the published wheels.

### What changes are included in this PR?

- Ensure the license and notice files are part of the wheels
- Use build frontend to build wheels
- Build wheel from sdist

### Are these changes tested?

Yes, via archery.
I've validated all wheels will fail with the new check if LICENSE.txt or NOTICE.txt are missing:
```
 AssertionError: LICENSE.txt is missing from the wheel.
```

### Are there any user-facing changes?

No

* GitHub Issue: apache#48983

Lead-authored-by: Raúl Cumplido <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Rok Mihevc <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
…che#49060)

### Rationale for this change

Fix two issues found by OSS-Fuzz in the IPC reader:

* a controlled abort on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5301064831401984
* a nullptr dereference on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5091511766417408

Neither of these issues is a security issue.

### Are these changes tested?

Yes, by new unit tests and new fuzz regression files.

### Are there any user-facing changes?

No.

**This PR contains a "Critical Fix".** Fixes a controlled abort and a nullptr dereference, i.e. crashes on invalid IPC metadata even when the API contract is upheld.

* GitHub Issue: apache#49059

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
cbb330 and others added 9 commits February 20, 2026 14:25
- Added OrcSchemaField struct to map Arrow fields to ORC column indices
- Added OrcSchemaManifest struct for schema mapping infrastructure
- Includes GetColumnField() and GetParent() helper methods
- Added stub Make() implementation (full logic in Task #2)
- Mirrors Parquet SchemaManifest design adapted for ORC type system

Verified: Code structure matches Parquet pattern

Co-authored-by: Claude Sonnet 4.5 <[email protected]>
- Implemented BuildSchemaFieldRecursive helper for depth-first traversal
- Walks Arrow schema and ORC type tree in parallel
- Assigns column indices using ORC depth-first pre-order (col 0 = root struct)
- Marks leaf nodes (primitives) with column_index for statistics
- Marks container nodes (struct/list/map) with column_index = -1
- Builds column_index_to_field and child_to_parent lookup maps
- Handles struct, list, and map types with proper child matching
- Added orc/Type.hh include for ORC type information

Implementation details:
- Column indexing starts at 1 (column 0 is root struct)
- Leaf nodes are primitives that have statistics
- Container types recursively process children
- Validates ORC root type is STRUCT

Verified: Manual code review - follows ORC depth-first pre-order pattern

Co-authored-by: Claude Sonnet 4.5 <[email protected]>
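The depth-first pre-order numbering described above can be sketched with a toy tree in Python (hypothetical names; the real code walks Arrow schema and ORC type trees in C++): every node, leaf or container, consumes one column id as the tree is walked, the root struct is id 0, and the manifest records container entries as -1.

```python
def assign_orc_indices(node, counter=None, out=None):
    """Walk a toy schema tree in ORC depth-first pre-order.
    Each node is (name, children). Every node consumes one column id
    (root struct is id 0); leaves record their id, containers record -1."""
    if counter is None:
        counter, out = [0], {}
    my_id = counter[0]
    counter[0] += 1
    name, children = node
    out[name] = my_id if not children else -1
    for child in children:
        assign_orc_indices(child, counter, out)
    return out

# root struct with leaves a, b and a nested struct s{x, y}
schema = ("root", [("a", []), ("s", [("x", []), ("y", [])]), ("b", [])])
```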
- Implemented GetOrcColumnIndex helper function that:
  - Resolves FieldRef to ORC column index using manifest
  - Uses FieldRef.FindOne() to locate field in schema
  - Traverses manifest tree following field path indices
  - Handles both top-level and nested fields
  - Returns column_index for leaf nodes (primitives with statistics)
  - Returns std::nullopt for containers or not found

- Added necessary includes:
  - <optional> for std::optional return type
  - arrow/compute/api_scalar.h for FieldRef and FieldPath

Implementation details:
- Top-level fields accessed via manifest.schema_fields[index]
- Nested fields traversed via current_field->children[index]
- Validates indices at each level to prevent out-of-bounds
- Only returns column_index if field is leaf (has statistics)
- Containers (struct/list/map) return nullopt

Verified: Manual code review - follows FieldRef resolution pattern

Co-authored-by: Claude Sonnet 4.5 <[email protected]>
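The resolution logic above can be sketched against a toy manifest in Python (hypothetical structure; the real code resolves `FieldRef`s through `OrcSchemaManifest` in C++):

```python
def get_orc_column_index(manifest_fields, path):
    """Follow a field path (list of child indices) through a toy manifest.
    Returns the leaf's ORC column index, or None for containers,
    out-of-bounds indices, or an empty path."""
    fields = manifest_fields
    field = None
    for i in path:
        if not (0 <= i < len(fields)):       # validate each level
            return None
        field = fields[i]
        fields = field["children"]
    if field is None or field["children"]:   # empty path, or a container
        return None
    return field["column_index"]

manifest = [
    {"column_index": 1, "children": []},         # top-level leaf "a"
    {"column_index": -1, "children": [           # struct "s" (container)
        {"column_index": 3, "children": []},     # nested leaf "s.x"
    ]},
]
```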
- Add OrcFileFragment class extending FileFragment with ORC-specific predicate pushdown
- Add StripeStatisticsCache structure for caching stripe-level statistics expressions
- Add OrcCacheStatus enum for tracking metadata loading state
- Implement public API: stripes(), metadata(), EnsureCompleteMetadata(), Subset(), SplitByStripe()
- Add private methods: FilterStripes(), TestStripes(), TryCountRows() (stubs for later tasks)
- Mirror ParquetFileFragment structure adapted for ORC stripes
- Use void* for orc::Reader to avoid exposing ORC headers in public API
- Thread-safe with mutex locking following Fragment base class pattern

Implementation notes:
- stripes_ replaces Parquet's row_groups_
- statistics_cache_ stores per-stripe guarantee expressions
- Forward declarations for ORC types to minimize header dependencies
- Most methods are stubs marked with TODO for subsequent tasks

Verified: Manual code review following Parquet reference patterns

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

cbb330 commented Feb 20, 2026

Closing this PR to recreate with clean base. Will recreate immediately.

@cbb330 cbb330 closed this Feb 20, 2026
cbb330 added a commit that referenced this pull request Feb 20, 2026
cbb330 added a commit that referenced this pull request Feb 20, 2026
)

- Created FoldingAnd inline helper function for combining expressions with AND
- Optimizes for literal(true) case: avoids creating redundant 'true AND expr'
- Will be used by TestStripes (Task #12) to build guarantee expressions
- Follows same pattern as Parquet implementation
- SimplifyWithGuarantee is already available from compute::Expression

Verified:
- Optimizes true AND right -> right
- General case: left AND right for non-true left expressions
- Uses compute::and_ for logical AND combination
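The folding behavior can be sketched in Python with a throwaway tuple representation of expressions (the real helper operates on `compute::Expression` via `compute::and_`):

```python
def folding_and(left, right):
    """Combine two predicate expressions with AND, skipping the
    redundant 'true AND expr' when the left side is literal true."""
    if left is True:                     # literal(true) case: return right as-is
        return right
    return ("and", left, right)          # general case: left AND right
```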
cbb330 added a commit that referenced this pull request Feb 20, 2026
Core statistics evaluation function for ORC predicate pushdown.
This mirrors Parquet's TestRowGroups pattern.

Algorithm:
1. Lock mutex for thread-safe cache access
2. Simplify predicate with partition-level guarantees
3. Resolve field references to ORC column indices via manifest
4. For each unprocessed field, load stripe statistics from ORC file
5. Derive guarantee expressions from min/max statistics
6. Combine guarantees incrementally into stripe-level cache
7. Simplify predicate with each stripe's guarantees

Returns per-stripe simplified expressions:
- literal(true) = must scan this stripe
- literal(false) = can skip this stripe
- Mixed expression = partial information available

Thread-safe with mutex protection. Statistics cached incrementally
as predicates reference different fields.

Verified: Mirrors cpp/src/arrow/dataset/file_parquet.cc lines 933-983

Co-authored-by: Claude Sonnet 4.5 <[email protected]>
cbb330 added a commit that referenced this pull request Feb 20, 2026
cbb330 added a commit that referenced this pull request Feb 24, 2026
cbb330 added a commit that referenced this pull request Feb 24, 2026
cbb330 added a commit that referenced this pull request Feb 24, 2026
cbb330 added a commit that referenced this pull request Feb 24, 2026