
Task #4: Create OrcFileFragment class#12

Closed
cbb330 wants to merge 137 commits into main from task-4-create-orc-file-fragment

Conversation


@cbb330 cbb330 commented Feb 20, 2026

Summary

Implements OrcFileFragment class with ORC-specific predicate pushdown capabilities.

Changes

  • Added OrcFileFragment class extending FileFragment
  • Added StripeStatisticsCache for caching stripe-level statistics
  • Added OrcCacheStatus enum for metadata loading state tracking
  • Implemented public API for stripe operations and metadata access
  • Added private methods as stubs for future tasks (Tasks #5-13)

Implementation Details

  • Mirrors ParquetFileFragment structure (stripes vs row_groups)
  • Uses void* for orc::Reader to avoid header dependencies
  • Thread-safe with inherited physical_schema_mutex_
  • Forward declarations for ORC types in header
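
The described structure can be sketched as a toy Python outline of the public surface (names follow the description above; the actual implementation is C++, so everything here is a hypothetical illustration):

```python
from enum import Enum

class OrcCacheStatus(Enum):
    """Hypothetical states for tracking how much ORC metadata is loaded."""
    UNLOADED = 0
    PARTIAL = 1
    COMPLETE = 2

class StripeStatisticsCache:
    """Caches per-stripe guarantee expressions derived from statistics."""
    def __init__(self):
        self.guarantees = {}                 # stripe index -> cached expression
        self.status = OrcCacheStatus.UNLOADED

class OrcFileFragment:
    """Mirrors ParquetFileFragment, with stripes in place of row groups."""
    def __init__(self, stripes=None):
        self._stripes = stripes              # None until metadata is loaded
        self._statistics_cache = StripeStatisticsCache()

    def stripes(self):
        return self._stripes

    def ensure_complete_metadata(self):
        # Stub: a later task loads stripe statistics from the ORC reader.
        self._statistics_cache.status = OrcCacheStatus.COMPLETE
```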

Testing

Manual code review following Parquet reference patterns

Task Reference

Completes Task #4 from task_list.json
Depends on: Task #1 (OrcSchemaManifest - complete)
Enables: Tasks #5, #6 (next in dependency chain)

raulcd and others added 30 commits January 26, 2026 11:08
… CFlightInfo to nullptr instead of NULL (apache#48968)

### Rationale for this change

Cython built code is currently failing to compile on free threaded wheels due to:
```
/arrow/python/build/temp.linux-x86_64-cpython-313t/_flight.cpp: In function ‘PyObject* __pyx_gb_7pyarrow_7_flight_12FlightClient_9do_action_2generator2(__pyx_CoroutineObject*, PyThreadState*, PyObject*)’:
/arrow/python/build/temp.linux-x86_64-cpython-313t/_flight.cpp:43068:110: error: call of overloaded ‘unique_ptr(NULL)’ is ambiguous
43068 |           __pyx_t_3 = (__pyx_cur_scope->__pyx_v_result->result == ((std::unique_ptr< arrow::flight::Result> )NULL));
      |                            
```

### What changes are included in this PR?

Update comparing `unique_ptr[CFlightResult]` and `unique_ptr[CFlightInfo]` from `NULL` to `nullptr`.

### Are these changes tested?

Yes via archery.

### Are there any user-facing changes?

No

* GitHub Issue: apache#48965

Authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
…apache#48925)

### What changes are included in this PR?

Bug fixes and robustness improvements in the IPC file reader:
* Fix bug reading variadic buffers with pre-buffering enabled
* Fix bug reading dictionaries with pre-buffering enabled
* Validate IPC buffer offsets and lengths

Testing improvements:
* Exercise pre-buffering in IPC tests
* Actually exercise variadic buffers in IPC tests, by ensuring non-inline binary views are generated
* Run fuzz targets on golden IPC integration files in ASAN/UBSAN CI job
* Exercise pre-buffering in the IPC file fuzz target

Miscellaneous:
* Add convenience functions for integer overflow checking

### Are these changes tested?

Yes, by existing and improved tests.

### Are there any user-facing changes?

Bug fixes.

**This PR contains a "Critical Fix".** Fixes a potential crash reading variadic buffers with pre-buffering enabled.

* GitHub Issue: apache#48924

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…river and the Flight Client (apache#48967)

### Rationale for this change

The bug breaks a Flight SQL server that refreshes the auth token when cookie authentication is enabled.

### What changes are included in this PR?

1. In the ODBC layer, removed the code that adds a second ClientCookieMiddlewareFactory to the client options (the first one is registered in `BuildFlightClientOptions`). This fixes the duplicate header cookie fields.
2. In the Flight client layer, switched the cookie cache, which is an unordered map, from the case-insensitive less-than comparator to the case-insensitive equality comparator. This fixes the duplicate cookie keys.
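
The comparator mismatch can be illustrated with a minimal Python sketch (hypothetical class name): a hash-based map needs case-insensitive *equality* plus a consistent hash, not an ordering relation, or logically-equal keys like `Cookie` and `cookie` land in separate buckets as duplicates.

```python
class CaseInsensitiveKey:
    """Key wrapper giving a dict case-insensitive equality and hashing."""
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return (isinstance(other, CaseInsensitiveKey)
                and self.name.lower() == other.name.lower())

    def __hash__(self):
        # Hash must agree with __eq__, so hash the normalized form.
        return hash(self.name.lower())

cookies = {}
cookies[CaseInsensitiveKey("Session-Id")] = "abc"
cookies[CaseInsensitiveKey("session-id")] = "xyz"   # overwrites; no duplicate key
```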

### Are these changes tested?
Manually on Windows, and CI

### Are there any user-facing changes?

No
* GitHub Issue: apache#48966

Authored-by: jianfengmao <[email protected]>
Signed-off-by: David Li <[email protected]>
…e buffer is empty (apache#48692)

### Rationale for this change
WriteArrowSerialize unconditionally read values from the Arrow array, even for null rows. Since the caller could provide a zero-sized dummy buffer for an all-null array, this caused an ASAN heap-buffer-overflow.

### What changes are included in this PR?
Check early that the array is not all-null before serializing it.

### Are these changes tested?

Added tests.
### Are there any user-facing changes?

No

* GitHub Issue: apache#48691

Authored-by: rexan <[email protected]>
Signed-off-by: Gang Wu <[email protected]>
…r.msix to fix docker rebuild on Windows wheels (apache#48948)

### Rationale for this change

As soon as we have to rebuild our Windows docker images they will fail installing python-manager-25.0.msix

### What changes are included in this PR?

- Use `pymanager.msi` to install python version instead of `pymanager.msix` which has problems on Docker.
- Update `pymanager install` command to use newer API (old command fails with missing flags)
- Update default python command to use the free-threaded required suffix if free-threaded wheels

### Are these changes tested?

Yes via archery

### Are there any user-facing changes?

No
* GitHub Issue: apache#48947

Authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
)

### Rationale for this change

There are date32 and date64 variants for date arrays.

### What changes are included in this PR?

* Add `ArrowFormat::DateType#to_flatbuffers`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: apache#48990

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…che#48993)

### Rationale for this change

It's a large variant of UTF-8 array.

### What changes are included in this PR?

* Add `ArrowFormat::LargeUTF8Type#to_flatbuffers`
* Add support for large UTF-8 array of `#values` and `#raw_records`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: apache#48992

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…::FileReader::ReadRowGroup(s) (apache#48982)

### Rationale for this change
`FileReader::ReadRowGroup(s)` previously returned `Status` and required callers to pass an `out` parameter.
### What changes are included in this PR?
Introduce `Result<std::shared_ptr<Table>>` returning APIs to allow clearer error propagation:
  - Add new Result-returning `ReadRowGroup()` / `ReadRowGroups()` methods
  - Deprecate the old Status/out-parameter overloads
  - Update C++ callers and R/Python/GLib bindings to use the new API
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
Status versions of FileReader::ReadRowGroup(s) have been deprecated.
```cpp
virtual ::arrow::Status ReadRowGroup(int i, const std::vector<int>& column_indices,
                                     std::shared_ptr<::arrow::Table>* out);
virtual ::arrow::Status ReadRowGroup(int i, std::shared_ptr<::arrow::Table>* out);

virtual ::arrow::Status ReadRowGroups(const std::vector<int>& row_groups,
                                      const std::vector<int>& column_indices,
                                      std::shared_ptr<::arrow::Table>* out);
virtual ::arrow::Status ReadRowGroups(const std::vector<int>& row_groups,
                                      std::shared_ptr<::arrow::Table>* out);
```
* GitHub Issue: apache#48949

Lead-authored-by: fenfeng9 <[email protected]>
Co-authored-by: fenfeng9 <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Gang Wu <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…essions (apache#48989)

### Rationale for this change

Some node options and expressions miss arguments reference. If they miss, arguments may be freed by GC.

### What changes are included in this PR?

* Refer arguments of `garrow_filter_node_options_new()`
* Refer arguments of `garrow_project_node_options_new()`
* Refer arguments of `garrow_aggregate_node_options_new()`
* Refer arguments of `garrow_literal_expression_new()`
* Refer arguments of `garrow_call_expression_new()`
 
### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: apache#48985

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…s found on emscripten jobs (apache#49007)

### Rationale for this change

When looking for the wheel the script was falling back to returning a 404 even when the wheel was found:
```
 + python scripts/run_emscripten_tests.py dist/pyarrow-24.0.0.dev31-cp312-cp312-pyodide_2024_0_wasm32.whl --dist-dir=/pyodide --runtime=chrome
127.0.0.1 - - [27/Jan/2026 01:14:50] code 404, message File not found
```
This caused the job to time out and fail.

### What changes are included in this PR?

Correct logic and only return 404 if the file requested wasn't found.

### Are these changes tested?

Yes via archery

### Are there any user-facing changes?

No
* GitHub Issue: apache#47692

Authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
…king (apache#48974)

### Rationale for this change

Benchmark failing since C++20 upgrade due to lack of C++20 configuration

### What changes are included in this PR?

Changes entirely from 🤖 (Claude) with discussion from me regarding optimal approach.  

Description as follows:

> conda-forge's R package doesn't have CXX20 configured in Makeconf, even though the compiler (gcc 14.3.0) supports C++20. This causes Arrow R package installation to fail with "a C++20 compiler is required" because `R CMD config CXX20` returns empty. 
>
> This PR adds CXX20 configuration to R's Makeconf before building the Arrow R package in the benchmark hooks, if not already present.                                                               

### Are these changes tested?

I got 🤖  to try it locally in a container but I'm not convinced we'll know for sure til we try it out properly.

>  Tested in Docker container with Amazon Linux 2023 + conda-forge R - confirmed `R CMD config CXX20` returns empty before patch and `g++` after patch.
>
> The only thing we didn't test end-to-end was actually building Arrow R, but that would have taken much longer and the configure check (R CMD config CXX20 returning non-empty) is exactly what Arrow's configure script tests before proceeding.                                       

### Are there any user-facing changes?

Nope
* GitHub Issue: apache#48912

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Nic Crane <[email protected]>
…ch is empty (apache#48718)

### Rationale for this change

Fixes apache#36889

When writing CSV from a table where the first batch is empty, the header gets written twice:

```python
table = pa.table({"col1": ["a", "b", "c"]})
combined = pa.concat_tables([table.schema.empty_table(), table])
write_csv(combined, buf)
# Result: "col1"\n"col1"\n"a"\n"b"\n"c"\n  <-- header appears twice
```

### What changes are included in this PR?

The bug happens because:
1. Header is written to `data_buffer_` and flushed during `CSVWriterImpl` initialization
2. The buffer is not cleared after flush
3. When the next batch is empty, `TranslateMinimalBatch` returns early without modifying `data_buffer_`
4. The write loop then writes `data_buffer_` which still contains stale content

The fix introduces a `WriteAndClearBuffer()` helper that writes the buffer to sink and clears it. This helper is used in all write paths:
- `WriteHeader()`
- `WriteRecordBatch()`
- `WriteTable()`

This ensures the buffer is always clean after any flush, making it impossible for stale content to be written again.
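
The failure mode and the fix can be sketched with a toy Python writer (hypothetical names; the real code is C++):

```python
class ToyCsvWriter:
    """Toy writer reproducing the stale-buffer pattern described above."""
    def __init__(self, sink, header):
        self.sink = sink
        self.buffer = header + "\n"
        self._write_and_clear()          # the fix: flush *and* clear

    def _write_and_clear(self):
        self.sink.append(self.buffer)
        self.buffer = ""                 # without this line, the header would
                                         # be re-written by the next flush

    def write_batch(self, rows):
        if not rows:                     # empty batch leaves the buffer untouched,
            self._write_and_clear()      # but flushing is safe: it is already empty
            return
        self.buffer = "".join(r + "\n" for r in rows)
        self._write_and_clear()

sink = []
w = ToyCsvWriter(sink, '"col1"')
w.write_batch([])                        # empty first batch: no duplicate header
w.write_batch(['"a"', '"b"'])
```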

### Are these changes tested?

Yes. Added C++ tests in `writer_test.cc` and Python tests in `test_csv.py`:
- Empty batch at start of table
- Empty batch in middle of table

### Are there any user-facing changes?

No API changes. This is a bug fix that prevents duplicate headers when writing CSV from tables with empty batches.

* GitHub Issue: apache#36889

Lead-authored-by: Ruiyang Wang <[email protected]>
Co-authored-by: Ruiyang Wang <[email protected]>
Co-authored-by: Gang Wu <[email protected]>
Signed-off-by: Gang Wu <[email protected]>
…DBC Nightly Package (apache#48933)

### Rationale for this change
apache#48932
### What changes are included in this PR?
- Fix `rsync` build error in the ODBC Nightly Package
### Are these changes tested?
- tested in CI
### Are there any user-facing changes?
- After fix, users should be able to get Nightly ODBC package release

* GitHub Issue: apache#48932

Authored-by: Alina (Xi) Li <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…he#48952)

### Rationale for this change

Add guidance re AI tooling

### What changes are included in this PR?

Updates to main docs and links to it from new contributor's guide

### Are these changes tested?

No, but I'll build the docs

### Are there any user-facing changes?

Just docs

:robot: Changes generated using Claude Code - I took the discussion from the mailing list, asked it to add the original text and then apply suggested changes one at a time, made a few of my own tweaks, and then instructed it to edit things down a bit for clarity and conciseness.
* GitHub Issue: apache#48951

Lead-authored-by: Nic Crane <[email protected]>
Co-authored-by: Rok Mihevc <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Nic Crane <[email protected]>
### Rationale for this change

`sphinx-build` allows for parallel operation, but it builds serially by default and that can be very slow on our docs given the amount of documents (many of them auto-generated from API docs).

### Are these changes tested?

By existing CI jobs.

### Are there any user-facing changes?

No.
* GitHub Issue: apache#49029

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
### Rationale for this change

This functionality is unused now that we have a proper atfork facility.

### Are these changes tested?

By existing CI tests.

### Are there any user-facing changes?

Removing an API that was always meant for internal use (though we didn't flag it explicitly as internal).

* GitHub Issue: apache#33450

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…t& return types (apache#48956)

### Rationale for this change

The TODO comment in `vector_array_sort.cc`, asking whether `DictionaryArray::dictionary()` and `DictionaryArray::indices()` should return `const&`, is obsolete.

It was added in commit 6ceb12f when dictionary array sorting was implemented. At that time, these methods returned `std::shared_ptr<Array>` by value, causing unnecessary copies.

The issue was fixed in commit 95a8bfb, which changed both methods to return `const std::shared_ptr<Array>&`, removing the copies. However, the TODO comment was never removed.

### What changes are included in this PR?

Removed the outdated TODO comment that referenced apacheGH-35437.

### Are these changes tested?

I did not test.

### Are there any user-facing changes?

No.
* GitHub Issue: apache#35437

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…che#49008)

### Rationale for this change

When running the python-sdist job we are currently not uploading the build artifact to the job.

### What changes are included in this PR?

Upload artifact as part of building the job so it's easier to test and validate contents if necessary.

### Are these changes tested?

Yes via archery.

### Are there any user-facing changes?

No

* GitHub Issue: apache#48586

Authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
### Rationale for this change

CI needs updating to test old R package versions

### What changes are included in this PR?

Add 22.0.0.1

### Are these changes tested?

Nah, it's CI stuff

### Are there any user-facing changes?

No

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
)

### Rationale for this change
See issue apache#48961
Pandas 3.0.0 string storage type changes https://github.com/pandas-dev/pandas/pull/62118/changes
and https://pandas.pydata.org/docs/whatsnew/v3.0.0.html#dedicated-string-data-type-by-default

### What changes are included in this PR?
Updating several doctest examples from `string` to `large_string`.

### Are these changes tested?
Yes, locally.

### Are there any user-facing changes?
No.

Closes apache#48961 
* GitHub Issue: apache#48961

Authored-by: Tadeja Kadunc <[email protected]>
Signed-off-by: AlenkaF <[email protected]>
…nchmarking (apache#49038)

### Rationale for this change

Slow benchmarks due to conda duckdb building from source

### What changes are included in this PR?

Try ditching conda and installing R via rig and using PPM binaries

### Are these changes tested?

I'll try running

### Are there any user-facing changes?
 
Nope
* GitHub Issue: apache#49037

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Nic Crane <[email protected]>
### Rationale for this change

This patch was integrated upstream in microsoft/mimalloc#1139

### Are these changes tested?

By existing CI.

### Are there any user-facing changes?

No.
* GitHub Issue: apache#49042

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
### Rationale for this change

Default Debian version in `.env` now maps to oldstable, we should use stable instead.
Also prune entries that are not used anymore.

### Are these changes tested?

By existing CI jobs.

### Are there any user-facing changes?

No.
* GitHub Issue: apache#49024

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
)

### Rationale for this change

There are 32/64 bit and second/millisecond/microsecond/nanosecond variants for time arrays.

### What changes are included in this PR?

* Add `ArrowFormat::TimeType#to_flatbuffers`
* Add bit width information to `ArrowFormat::TimeType`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: apache#49027

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
apache#49031)

### Rationale for this change

It's a fixed size variant of binary array.

### What changes are included in this PR?

* Add `ArrowFormat::FixedSizeBinaryType#to_flatbuffers`
* Add `ArrowFormat::FixedSizeBinaryArray#each_buffer`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: apache#49030

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…s in `castTIMESTAMP_utf8` and `castTIME_utf8` (apache#48867)

### Rationale for this change

Fixes apache#48866. The Gandiva precompiled time functions `castTIMESTAMP_utf8` and `castTIME_utf8` currently reject timestamp and time string literals with more than 3 subsecond digits (beyond millisecond precision), throwing an "Invalid millis" error. This behavior is inconsistent with other implementations.

### What changes are included in this PR?

- Fixed `castTIMESTAMP_utf8` and `castTIME_utf8` functions to truncate subseconds beyond 3 digits instead of throwing an error
- Updated tests. Replaced error-expecting tests with truncation verification tests and added edge cases
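
The truncation behavior can be sketched in Python (hypothetical helper name; the actual functions are Gandiva precompiled C++):

```python
def parse_subseconds_to_millis(frac_digits: str) -> int:
    """Truncate a subsecond digit string to millisecond precision
    instead of rejecting it, per the described fix."""
    truncated = frac_digits[:3]              # keep at most 3 subsecond digits
    if not truncated:
        return 0
    return int(truncated.ljust(3, "0"))      # pad short fractions: "5" -> 500 ms
```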

### Are these changes tested?

Yes

### Are there any user-facing changes?

No
* GitHub Issue: apache#48866

Authored-by: Arkadii Kravchuk <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…d+ pattern before removing lines (apache#48674)

### Rationale for this change

This PR proposes to fix the TODO at https://github.com/apache/arrow/blob/7ebc88c8fae62ed97bc30865c845c8061132af7e/cpp/src/arrow/status.cc#L131-L134, allowing better parsing of line numbers.

I could not find the relevant example to demonstrate within this project but assume that we have a test such as:

(Generated by ChatGPT)

```cpp
TEST(BlockParser, ErrorMessageWithColonsPreserved) {
  Status st(StatusCode::Invalid,
            "CSV parse error: Row #2: Expected 2 columns, got 3: 12:34:56,key:value,data\n"
            "Error details: Time format: 12:34:56, Key: value\n"
            "parser_test.cc:940  Parse(parser, csv, &out_size)");

  std::string expected_msg =
      "Invalid: CSV parse error: Row #2: Expected 2 columns, got 3: 12:34:56,key:value,data\n"
      "Error details: Time format: 12:34:56, Key: value";

  ASSERT_RAISES_WITH_MESSAGE(Invalid, expected_msg, st);
}

// Test with URL-like data (another common case with colons)
TEST(BlockParser, ErrorMessageWithURLPreserved) {
  Status st(StatusCode::Invalid,
            "CSV parse error: Row #2: Expected 1 columns, got 2: http://arrow.apache.org:8080/api,data\n"
            "URL: http://arrow.apache.org:8080/api\n"
            "parser_test.cc:974  Parse(parser, csv, &out_size)");

  std::string expected_msg =
      "Invalid: CSV parse error: Row #2: Expected 1 columns, got 2: http://arrow.apache.org:8080/api,data\n"
      "URL: http://arrow.apache.org:8080/api";

  ASSERT_RAISES_WITH_MESSAGE(Invalid, expected_msg, st);
}
```

then it fails.

### What changes are included in this PR?

Fixed `Status::ToStringWithoutContextLines()` to only remove context lines matching the `filename:line` pattern (`:\d+`), preventing legitimate error messages containing colons from being incorrectly stripped.
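
A Python sketch of the tightened matching (the real implementation is C++ in `status.cc`): only lines that begin with a `filename:line` prefix, i.e. a token ending in `:` plus digits, are treated as context lines, so error text containing colons survives.

```python
import re

# A context line looks like "parser_test.cc:940  Parse(parser, csv, &out_size)":
# a file name, a colon, a line number, then the failing expression.
CONTEXT_LINE = re.compile(r"^\S+:\d+\s")

def strip_context_lines(message: str) -> str:
    kept = [ln for ln in message.split("\n") if not CONTEXT_LINE.match(ln)]
    return "\n".join(kept)

msg = ("CSV parse error: Row #2: got 3: 12:34:56,key:value\n"
       "parser_test.cc:940  Parse(parser, csv, &out_size)")
```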

### Are these changes tested?

Manually tested, and unittests were added, with `cmake .. --preset ninja-debug -DARROW_EXTRA_ERROR_CONTEXT=ON`.

### Are there any user-facing changes?

No, test-only.

* GitHub Issue: apache#48673

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…dding required user-agent on urllib request (apache#49052)

### Rationale for this change

See: apache#49044

### What changes are included in this PR?

urllib requests now include the header `"user-agent": "pyarrow"`

### Are these changes tested?

It's a CI fix.

### Are there any user-facing changes?

No, just a CI test fix.
* GitHub Issue: apache#49044

Authored-by: Rok Mihevc <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
…d and add check to validate LICENSE.txt and NOTICE.txt are part of the wheel contents (apache#48988)

### Rationale for this change

Currently the files are missing from the published wheels.

### What changes are included in this PR?

- Ensure the license and notice files are part of the wheels
- Use build frontend to build wheels
- Build wheel from sdist

### Are these changes tested?

Yes, via archery.
I've validated all wheels will fail with the new check if LICENSE.txt or NOTICE.txt are missing:
```
 AssertionError: LICENSE.txt is missing from the wheel.
```

### Are there any user-facing changes?

No

* GitHub Issue: apache#48983

Lead-authored-by: Raúl Cumplido <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Rok Mihevc <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
…che#49060)

### Rationale for this change

Fix two issues found by OSS-Fuzz in the IPC reader:

* a controlled abort on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5301064831401984
* a nullptr dereference on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5091511766417408

Neither of these issues is a security issue.

### Are these changes tested?

Yes, by new unit tests and new fuzz regression files.

### Are there any user-facing changes?

No.

**This PR contains a "Critical Fix".** Fixes a controlled abort and a nullptr dereference, i.e. crashes on invalid IPC metadata even when the API contract is upheld.

* GitHub Issue: apache#49059

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
cbb330 and others added 9 commits February 20, 2026 14:25
- Added OrcSchemaField struct to map Arrow fields to ORC column indices
- Added OrcSchemaManifest struct for schema mapping infrastructure
- Includes GetColumnField() and GetParent() helper methods
- Added stub Make() implementation (full logic in Task #2)
- Mirrors Parquet SchemaManifest design adapted for ORC type system

Verified: Code structure matches Parquet pattern

Co-authored-by: Claude Sonnet 4.5 <[email protected]>
- Implemented BuildSchemaFieldRecursive helper for depth-first traversal
- Walks Arrow schema and ORC type tree in parallel
- Assigns column indices using ORC depth-first pre-order (col 0 = root struct)
- Marks leaf nodes (primitives) with column_index for statistics
- Marks container nodes (struct/list/map) with column_index = -1
- Builds column_index_to_field and child_to_parent lookup maps
- Handles struct, list, and map types with proper child matching
- Added orc/Type.hh include for ORC type information

Implementation details:
- Column indexing starts at 1 (column 0 is root struct)
- Leaf nodes are primitives that have statistics
- Container types recursively process children
- Validates ORC root type is STRUCT

Verified: Manual code review - follows ORC depth-first pre-order pattern

Co-authored-by: Claude Sonnet 4.5 <[email protected]>
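The depth-first pre-order numbering described above can be sketched with a toy tree in Python (hypothetical names; the real code walks Arrow schema and ORC type trees in C++): every node, leaf or container, consumes one column id as the tree is walked, the root struct is id 0, and the manifest records container entries as -1.

```python
def assign_orc_indices(node, counter=None, out=None):
    """Walk a toy schema tree in ORC depth-first pre-order.
    Each node is (name, children). Every node consumes one column id
    (root struct is id 0); leaves record their id, containers record -1."""
    if counter is None:
        counter, out = [0], {}
    my_id = counter[0]
    counter[0] += 1
    name, children = node
    out[name] = my_id if not children else -1
    for child in children:
        assign_orc_indices(child, counter, out)
    return out

# root struct with leaves a, b and a nested struct s{x, y}
schema = ("root", [("a", []), ("s", [("x", []), ("y", [])]), ("b", [])])
```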
- Implemented GetOrcColumnIndex helper function that:
  - Resolves FieldRef to ORC column index using manifest
  - Uses FieldRef.FindOne() to locate field in schema
  - Traverses manifest tree following field path indices
  - Handles both top-level and nested fields
  - Returns column_index for leaf nodes (primitives with statistics)
  - Returns std::nullopt for containers or not found

- Added necessary includes:
  - <optional> for std::optional return type
  - arrow/compute/api_scalar.h for FieldRef and FieldPath

Implementation details:
- Top-level fields accessed via manifest.schema_fields[index]
- Nested fields traversed via current_field->children[index]
- Validates indices at each level to prevent out-of-bounds
- Only returns column_index if field is leaf (has statistics)
- Containers (struct/list/map) return nullopt

Verified: Manual code review - follows FieldRef resolution pattern

Co-authored-by: Claude Sonnet 4.5 <[email protected]>
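The resolution logic above can be sketched against a toy manifest in Python (hypothetical structure; the real code resolves `FieldRef`s through `OrcSchemaManifest` in C++):

```python
def get_orc_column_index(manifest_fields, path):
    """Follow a field path (list of child indices) through a toy manifest.
    Returns the leaf's ORC column index, or None for containers,
    out-of-bounds indices, or an empty path."""
    fields = manifest_fields
    field = None
    for i in path:
        if not (0 <= i < len(fields)):       # validate each level
            return None
        field = fields[i]
        fields = field["children"]
    if field is None or field["children"]:   # empty path, or a container
        return None
    return field["column_index"]

manifest = [
    {"column_index": 1, "children": []},         # top-level leaf "a"
    {"column_index": -1, "children": [           # struct "s" (container)
        {"column_index": 3, "children": []},     # nested leaf "s.x"
    ]},
]
```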
- Add OrcFileFragment class extending FileFragment with ORC-specific predicate pushdown
- Add StripeStatisticsCache structure for caching stripe-level statistics expressions
- Add OrcCacheStatus enum for tracking metadata loading state
- Implement public API: stripes(), metadata(), EnsureCompleteMetadata(), Subset(), SplitByStripe()
- Add private methods: FilterStripes(), TestStripes(), TryCountRows() (stubs for later tasks)
- Mirror ParquetFileFragment structure adapted for ORC stripes
- Use void* for orc::Reader to avoid exposing ORC headers in public API
- Thread-safe with mutex locking following Fragment base class pattern

Implementation notes:
- stripes_ replaces Parquet's row_groups_
- statistics_cache_ stores per-stripe guarantee expressions
- Forward declarations for ORC types to minimize header dependencies
- Most methods are stubs marked with TODO for subsequent tasks

Verified: Manual code review following Parquet reference patterns

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

cbb330 commented Feb 20, 2026

Closing this PR to recreate with clean base. Will recreate immediately.

@cbb330 cbb330 closed this Feb 20, 2026
cbb330 added a commit that referenced this pull request Feb 20, 2026
cbb330 added a commit that referenced this pull request Feb 20, 2026
)

- Created FoldingAnd inline helper function for combining expressions with AND
- Optimizes for literal(true) case: avoids creating redundant 'true AND expr'
- Will be used by TestStripes (Task #12) to build guarantee expressions
- Follows same pattern as Parquet implementation
- SimplifyWithGuarantee is already available from compute::Expression

Verified:
- Optimizes true AND right -> right
- General case: left AND right for non-true left expressions
- Uses compute::and_ for logical AND combination
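The folding behavior can be sketched in Python with a throwaway tuple representation of expressions (the real helper operates on `compute::Expression` via `compute::and_`):

```python
def folding_and(left, right):
    """Combine two predicate expressions with AND, skipping the
    redundant 'true AND expr' when the left side is literal true."""
    if left is True:                     # literal(true) case: return right as-is
        return right
    return ("and", left, right)          # general case: left AND right
```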
cbb330 added a commit that referenced this pull request Feb 20, 2026
Core statistics evaluation function for ORC predicate pushdown.
This mirrors Parquet's TestRowGroups pattern.

Algorithm:
1. Lock mutex for thread-safe cache access
2. Simplify predicate with partition-level guarantees
3. Resolve field references to ORC column indices via manifest
4. For each unprocessed field, load stripe statistics from ORC file
5. Derive guarantee expressions from min/max statistics
6. Combine guarantees incrementally into stripe-level cache
7. Simplify predicate with each stripe's guarantees

Returns per-stripe simplified expressions:
- literal(true) = must scan this stripe
- literal(false) = can skip this stripe
- Mixed expression = partial information available

Thread-safe with mutex protection. Statistics cached incrementally
as predicates reference different fields.

Verified: Mirrors cpp/src/arrow/dataset/file_parquet.cc lines 933-983

Co-authored-by: Claude Sonnet 4.5 <[email protected]>
cbb330 added a commit that referenced this pull request Feb 20, 2026
cbb330 added a commit that referenced this pull request Feb 24, 2026
cbb330 added a commit that referenced this pull request Feb 24, 2026
cbb330 added a commit that referenced this pull request Feb 24, 2026
cbb330 added a commit that referenced this pull request Feb 24, 2026