
GH-28859: [Doc][Python] Use only code-block directive and set up doctest for the python user guide (#48619)

Merged
AlenkaF merged 47 commits into apache:main from AlenkaF:gh-28859-python-docs-examples-testing
Jan 30, 2026
Conversation

Member

@AlenkaF AlenkaF commented Dec 22, 2025

Rationale for this change

In many places in the Python User Guide the code examples are written with the IPython directive (elsewhere code-block is used). IPython directives are executed and converted to IPython format (In/Out) during the doc build, which can slow builds down.

What changes are included in this PR?

IPython directives are converted to runnable code-block directives (with >>> and ...), and pytest doctest support for .rst files is added to the conda-python-docs CI job. This means the code in the Python User Guide is tested separately from building the documentation.

Are these changes tested?

Yes, with the CI.

Are there any user-facing changes?

Changes to the Python User Guide examples will have to be tested with `pytest --doctest-glob='*.rst' docs/source/python/file.rst`
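As a rough sketch of what that command does under the hood (the file content here is hypothetical), pytest's doctest collector extracts the >>> examples from an .rst file and checks their output, much like the stdlib doctest module:

```python
import doctest

# Hypothetical .rst content in the style the user guide now uses:
# a code-block directive whose body is written in doctest format.
rst_source = """
.. code-block:: python

    >>> import math
    >>> math.factorial(4)
    24
"""

# DocTestParser extracts the >>> examples; DocTestRunner executes them and
# compares the actual output with the expected output written below each one.
parser = doctest.DocTestParser()
test = parser.get_doctest(rst_source, globs={}, name="example",
                          filename="example.rst", lineno=0)
runner = doctest.DocTestRunner()
runner.run(test)
assert runner.failures == 0  # every example matched its expected output
```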

Member Author

AlenkaF commented Dec 23, 2025

Converting this PR to draft until I figure out the best way to run the RST doctests on the Python 3.12 Sphinx Documentation CI job and not on the Python 3.10 Sphinx & Numpydoc job.

Member Author

AlenkaF commented Jan 8, 2026

@raulcd the main reason AMD64 Conda Python 3.10 Sphinx & Numpydoc fails while AMD64 Conda Python 3.12 Sphinx Documentation succeeds is the Python version and the use of datetime.UTC, which was only added in Python 3.11, see https://docs.python.org/3/library/datetime.html#datetime.UTC.

I think the easiest solution would be to run Sphinx & Numpydoc on Python 3.11, or even Python 3.12 (I am not aware of any reason we would need the oldest Python version we support here; Sphinx Documentation runs on docs changes only, while Sphinx & Numpydoc runs on any Python or C++ changes and validates the docstrings).

Member

raulcd commented Jan 8, 2026

Thanks for checking that @AlenkaF! So currently we are providing this snippet in our documentation:

.. ipython:: python
    :okexcept:

    import datetime

    current_year = datetime.datetime.now(datetime.UTC).year
    for table_chunk in birthdays_dataset.to_batches():
        print("AGES", pc.subtract(current_year, table_chunk["years"]))

that will fail for some users, as we are still supporting Python 3.10, right? Is it worth keeping datetime.UTC in the example? Should we just use current_year = datetime.datetime.now().year for the example?
Or maybe add a comment with a note?

I am ok to just bump the Python version of the job but we probably should not provide examples that will fail on some of the supported versions.

Member Author

AlenkaF commented Jan 8, 2026

Yeah, you are right. Changing to datetime.datetime.now().year or even datetime.datetime.now(datetime.timezone.utc) makes much more sense! Will update 👍
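For reference, a minimal check of the portable spelling (datetime.UTC is just an alias for datetime.timezone.utc, added in Python 3.11):

```python
import datetime
import sys

# datetime.timezone.utc exists on every Python version still supported;
# datetime.UTC is an alias for it that was only added in Python 3.11.
current_year = datetime.datetime.now(datetime.timezone.utc).year
print(current_year)

# On 3.11+ the two spellings refer to the same object.
if sys.version_info >= (3, 11):
    assert datetime.UTC is datetime.timezone.utc
```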

Member Author

AlenkaF commented Jan 8, 2026

Ha ha, the example would fail anyways as the year changed in the meantime 🤣
Probably it is best to just hardcode it.

Member

raulcd commented Jan 8, 2026

Ha ha, the example would fail anyways as the year changed in the meantime 🤣 Probably it is best to just hardcode it.

Yes, we don't want to have to update this every year because the data changes 😄
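To illustrate why hardcoding helps (a hypothetical snippet): a doctest whose expected output is derived from the current clock goes stale every January, while a hardcoded reference year stays deterministic:

```python
import doctest

# A doctest with a hardcoded reference year: the expected output can never
# drift, no matter when CI happens to run it.
stable_example = """
>>> current_year = 2026  # hardcoded, as chosen for the user guide
>>> current_year - 1994
32
"""
test = doctest.DocTestParser().get_doctest(
    stable_example, globs={}, name="stable", filename="<doc>", lineno=0)
runner = doctest.DocTestRunner()
runner.run(test)
assert runner.failures == 0  # passes identically in any year
```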

@AlenkaF AlenkaF force-pushed the gh-28859-python-docs-examples-testing branch from f417177 to 1fb6f0a Compare January 9, 2026 08:03
Member Author

AlenkaF commented Jan 12, 2026

@github-actions crossbow submit preview-docs

@github-actions

Revision: 1fb6f0a

Submitted crossbow builds: ursacomputing/crossbow @ actions-b1a47e0770

Task Status
preview-docs GitHub Actions

@AlenkaF AlenkaF marked this pull request as ready for review January 12, 2026 09:42
Member Author

AlenkaF commented Jan 12, 2026

Hi @rmnskb @tadeja @zhengruifeng @HyukjinKwon! In case anybody fancies giving a review, it would be much appreciated.
This PR looks like a big change, but it only unifies how we write code examples in the Python User Guide (code-block instead of the ipython directive; note that >>> is needed for the examples to be tested).

Link to the preview: https://s3.amazonaws.com/arrow-data/pr_docs/48619/python/index.html

This PR also adds a doctest of the .rst files to the two existing documentation CI jobs. One job runs only with changes to the documentation, the other job runs with changes in the C++ and Python code. cc @raulcd in case you have time to look at the ci/ changes.
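For reviewers skimming the diff, the unification amounts to this kind of change (a hypothetical snippet; the abbreviated object address in the output assumes doctest's ELLIPSIS handling):

```rst
.. Before: an IPython directive, executed during the doc build

.. ipython:: python

    import pyarrow as pa
    pa.array([1, 2, 3])

.. After: a code-block in doctest format, checked by pytest --doctest-glob='*.rst'

.. code-block:: python

    >>> import pyarrow as pa
    >>> pa.array([1, 2, 3])
    <pyarrow.lib.Int64Array object at ...>
    [
      1,
      2,
      3
    ]
```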

Contributor

@rmnskb rmnskb left a comment


LGTM 🔥 Thanks for working on that! I can imagine it was a tremendous amount of work. Left some general comments about some smaller things that I picked up while looking at the PR, otherwise I think it's good to merge.

.. code-block:: python

    >>> import pyarrow as pa
    >>> import pyarrow.compute as pc
Contributor


Did you decide to opt out from the explicit imports? Does the documentation still compile?

Member Author


I decided not to duplicate imports per page; once at the top should suffice (see the lines above, approx. line 32). Yes, both the docs build and the doctest should work.

Contributor


nit: if we keep the explicit imports, the examples will be copy-pasteable, which might be friendlier to users.

Member Author


As this is a User Guide, I think the content should flow like a notebook: we start at the beginning and work our way through.

Even if duplicating imports would be beneficial, we would then need to add them to every code-block, or to every section? That makes it a bit unclear to me.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jan 12, 2026
@HyukjinKwon
Member

ack. taking a look now

Member Author

AlenkaF commented Jan 13, 2026

Thanks for the review @rmnskb! 🎈

+-------------------+-------------------+
|              naive|              aware|
+-------------------+-------------------+
|2019-01-01 00:00:00|2019-01-01 08:00:00|
|2018-12-31 23:00:00|2019-01-01 08:00:00|
+-------------------+-------------------+
Member


I think 2019-01-01 00:00:00 became 2018-12-31 23:00:00 here because I suspect you or CI (?) is somewhere in GMT+1. datetime(2019, 1, 1, 0) is assumed to be local time (yes, it's up to the system to interpret, but Spark thinks so). So Spark treated it as local time, but the timezone was set to UTC, so it decreased by one hour.

I think we should probably just skip all here, because the output now depends on the local timezone.
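The hour shift described here can be reproduced with plain datetime (assuming, as suspected, a GMT+1 local zone):

```python
from datetime import datetime, timezone, timedelta

# Assume the machine running the examples is in GMT+1, as suspected above.
local_zone = timezone(timedelta(hours=1))

# A naive timestamp is interpreted as local time...
naive = datetime(2019, 1, 1, 0, 0, 0)

# ...so converting it to UTC shifts it back by one hour.
as_utc = naive.replace(tzinfo=local_zone).astimezone(timezone.utc)
assert as_utc == datetime(2018, 12, 31, 23, 0, 0, tzinfo=timezone.utc)
```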

Member


Maybe it's better to just keep the original input/output here and skip all. I will take a separate look.

Member Author


Yeah, agree: these tests should already be skipped. I can remove the diff and keep it as it was before?
The example is good overall and shows the timezone conversion behaviour. Maybe we could add a note? Claude suggests:

.. note::
   The examples above demonstrate timezone conversion behaviour.
   The exact output may differ depending on your system's local timezone, as Spark
   interprets naive timestamps relative to the local timezone when converting to UTC.

Member

@HyukjinKwon HyukjinKwon Jan 13, 2026


I can remove the diff and keep it as it was before?

Yup, whichever is easier. I will take a separate look after this gets merged.

Member

@HyukjinKwon HyukjinKwon left a comment


TL;DR: LGTM

How much time does it save, BTW? Some might argue that IPython input/output is better and that build speed is secondary (e.g., the PySpark doc build takes super long, since it generates a lot). For myself, I prefer a faster build in any event, so I'm in favor of this change.

Member Author

AlenkaF commented Jan 13, 2026

How much time does it save, BTW? Some might argue that IPython input/output is better and that build speed is secondary (e.g., the PySpark doc build takes super long, since it generates a lot). For myself, I prefer a faster build in any event, so I'm in favor of this change.

My main aim was to unify the docs and make it possible to run doctest on the examples separately. But I am curious whether there is any change in performance, so I will try it out now 😄 (I'm not sure the number of IPython directives was that big before this change, though).

@AlenkaF AlenkaF force-pushed the gh-28859-python-docs-examples-testing branch from 8e47447 to 1209294 Compare January 20, 2026 14:37
Member Author

AlenkaF commented Jan 29, 2026

@raulcd in case you have some time: this PR only needs a review of the smaller CI-related part.

Member

@raulcd raulcd left a comment


This looks awesome! Thanks @AlenkaF! I haven't gone over all the doc changes, but the tests are passing, which is enough assurance for me.
Thanks for documenting how to use it and how to test it locally!

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting committer review Awaiting committer review labels Jan 29, 2026
Member Author

AlenkaF commented Jan 30, 2026

Thank you @raulcd! The doc changes were reviewed by others so I think this is ready to go.
Thank you all for the reviews!!

@AlenkaF AlenkaF merged commit 12cdb09 into apache:main Jan 30, 2026
48 checks passed
@AlenkaF AlenkaF removed the awaiting merge Awaiting merge label Jan 30, 2026
@AlenkaF AlenkaF deleted the gh-28859-python-docs-examples-testing branch January 30, 2026 08:09
Member

raulcd commented Jan 30, 2026

@AlenkaF I can see some failures related to this on a PR rebased from main. Do you know if this is expected? It's around string vs large_string and object vs str; see this example of a failure:
https://github.com/apache/arrow/actions/runs/21512998856/job/61984417322?pr=45854

@jorisvandenbossche
Member

I suppose those are similar failures to the ones being fixed in parallel for the doctests in #48969

Member

rok commented Jan 30, 2026

It seems CI wasn't using Pandas 3 yet when this PR was tested. Here's a proposed fix: #49088

rok added a commit that referenced this pull request Feb 5, 2026
…as 3+ (#49088)

Fixes: #49150
See #48619 (comment)

### Rationale for this change

Fix CI failures

### What changes are included in this PR?

Tests are made more general to allow for Pandas 2 and Pandas 3 style string types

### Are these changes tested?

By CI

### Are there any user-facing changes?

No
* GitHub Issue: #49150

Authored-by: Rok Mihevc <[email protected]>
Signed-off-by: Rok Mihevc <[email protected]>
cbb330 added a commit to cbb330/arrow that referenced this pull request Feb 20, 2026
* GH-48965: [Python][C++] Compare unique_ptr for CFlightResult or CFlightInfo to nullptr instead of NULL (#48968)

### Rationale for this change

Cython built code is currently failing to compile on free threaded wheels due to:
```
/arrow/python/build/temp.linux-x86_64-cpython-313t/_flight.cpp: In function ‘PyObject* __pyx_gb_7pyarrow_7_flight_12FlightClient_9do_action_2generator2(__pyx_CoroutineObject*, PyThreadState*, PyObject*)’:
/arrow/python/build/temp.linux-x86_64-cpython-313t/_flight.cpp:43068:110: error: call of overloaded ‘unique_ptr(NULL)’ is ambiguous
43068 |           __pyx_t_3 = (__pyx_cur_scope->__pyx_v_result->result == ((std::unique_ptr< arrow::flight::Result> )NULL));
      |                            
```

### What changes are included in this PR?

Update comparing `unique_ptr[CFlightResult]` and `unique_ptr[CFlightInfo]` from `NULL` to `nullptr`.

### Are these changes tested?

Yes via archery.

### Are there any user-facing changes?

No

* GitHub Issue: #48965

Authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>

* GH-48924: [C++][CI] Fix pre-buffering issues in IPC file reader (#48925)

### What changes are included in this PR?

Bug fixes and robustness improvements in the IPC file reader:
* Fix bug reading variadic buffers with pre-buffering enabled
* Fix bug reading dictionaries with pre-buffering enabled
* Validate IPC buffer offsets and lengths

Testing improvements:
* Exercise pre-buffering in IPC tests
* Actually exercise variadic buffers in IPC tests, by ensuring non-inline binary views are generated
* Run fuzz targets on golden IPC integration files in ASAN/UBSAN CI job
* Exercise pre-buffering in the IPC file fuzz target

Miscellaneous:
* Add convenience functions for integer overflow checking

### Are these changes tested?

Yes, by existing and improved tests.

### Are there any user-facing changes?

Bug fixes.

**This PR contains a "Critical Fix".** Fixes a potential crash reading variadic buffers with pre-buffering enabled.

* GitHub Issue: #48924

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

* GH-48966: [C++] Fix cookie duplication in the Flight SQL ODBC driver and the Flight Client (#48967)


### Rationale for this change

The bug breaks a Flight SQL server that refreshens the auth token when cookie authentication is enabled

### What changes are included in this PR?

1. In the ODBC layer, removed the code that adds a 2nd ClientCookieMiddlewareFactory in the client options (the 1st one is registered in `BuildFlightClientOptions`). This fixes the issue of the duplicate header cookie fields.
2. In the flight client layer, uses the case-insensitive equality comparator instead of the case-insensitive less-than comparator for the cookies cache which is an unordered map. This fixes the issue of duplicate cookie keys.

### Are these changes tested?
Manually on Windows, and CI

### Are there any user-facing changes?

No
* GitHub Issue: #48966

Authored-by: jianfengmao <[email protected]>
Signed-off-by: David Li <[email protected]>

* GH-48691: [C++][Parquet] Write serializer may crash if the value buffer is empty (#48692)

### Rationale for this change
WriteArrowSerialize could unconditionally read values from the Arrow array even for null rows. Since it's possible the caller could provided a zero-sized dummy buffer for all-null arrays, this caused an ASAN heap-buffer-overflow.

### What changes are included in this PR?
Early check the array is not all null values before serialize it

### Are these changes tested?

Added tests.
### Are there any user-facing changes?

No

* GitHub Issue: #48691

Authored-by: rexan <[email protected]>
Signed-off-by: Gang Wu <[email protected]>

* GH-48947 [CI][Python] Install pymanager.msi instead of pymanager.msix to fix docker rebuild on Windows wheels (#48948)

### Rationale for this change

As soon as we have to rebuild our Windows docker images they will fail installing python-manager-25.0.msix

### What changes are included in this PR?

- Use `pymanager.msi` to install python version instead of `pymanager.msix` which has problems on Docker.
- Update `pymanager install` command to use newer API (old command fails with missing flags)
- Update default python command to use the free-threaded required suffix if free-threaded wheels

### Are these changes tested?

Yes via archery

### Are there any user-facing changes?

No
* GitHub Issue: #48947

Authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>

* GH-48990: [Ruby] Add support for writing date arrays (#48991)

### Rationale for this change

There are date32 and date64 variants for date arrays.

### What changes are included in this PR?

* Add `ArrowFormat::DateType#to_flatbuffers`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #48990

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-48992: [Ruby] Add support for writing large UTF-8 array (#48993)

### Rationale for this change

It's a large variant of UTF-8 array.

### What changes are included in this PR?

* Add `ArrowFormat::LargeUTF8Type#to_flatbuffers`
* Add support for large UTF-8 array of `#values` and `#raw_records`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #48992

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-48949: [C++][Parquet] Add Result versions for parquet::arrow::FileReader::ReadRowGroup(s) (#48982)

### Rationale for this change
`FileReader::ReadRowGroup(s)` previously returned `Status` and required callers to pass an `out` parameter.
### What changes are included in this PR?
Introduce `Result<std::shared_ptr<Table>>` returning APIs to allow clearer error propagation:
  - Add new Result-returning `ReadRowGroup()` / `ReadRowGroups()` methods
  - Deprecate the old Status/out-parameter overloads
  - Update C++ callers and R/Python/GLib bindings to use the new API
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
Status versions of FileReader::ReadRowGroup(s) have been deprecated.
```cpp
virtual ::arrow::Status ReadRowGroup(int i, const std::vector<int>& column_indices,
                                     std::shared_ptr<::arrow::Table>* out);
virtual ::arrow::Status ReadRowGroup(int i, std::shared_ptr<::arrow::Table>* out);

virtual ::arrow::Status ReadRowGroups(const std::vector<int>& row_groups,
                                      const std::vector<int>& column_indices,
                                      std::shared_ptr<::arrow::Table>* out);
virtual ::arrow::Status ReadRowGroups(const std::vector<int>& row_groups,
                                      std::shared_ptr<::arrow::Table>* out);
```
* GitHub Issue: #48949

Lead-authored-by: fenfeng9 <[email protected]>
Co-authored-by: fenfeng9 <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Gang Wu <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-48985: [GLib][Ruby] Fix GC problems in node options and expressions (#48989)

### Rationale for this change

Some node options and expressions miss arguments reference. If they miss, arguments may be freed by GC.

### What changes are included in this PR?

* Refer arguments of `garrow_filter_node_options_new()`
* Refer arguments of `garrow_project_node_options_new()`
* Refer arguments of `garrow_aggregate_node_options_new()`
* Refer arguments of `garrow_literal_expression_new()`
* Refer arguments of `garrow_call_expression_new()`
 
### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #48985

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-47692: [CI][Python] Do not fallback to return 404 if wheel is found on emscripten jobs (#49007)

### Rationale for this change

When looking for the wheel the script was falling back to returning a 404 even when the wheel was found:
```
 + python scripts/run_emscripten_tests.py dist/pyarrow-24.0.0.dev31-cp312-cp312-pyodide_2024_0_wasm32.whl --dist-dir=/pyodide --runtime=chrome
127.0.0.1 - - [27/Jan/2026 01:14:50] code 404, message File not found
```
Timing out the job and failing.

### What changes are included in this PR?

Correct logic and only return 404 if the file requested wasn't found.

### Are these changes tested?

Yes via archery

### Are there any user-facing changes?

No
* GitHub Issue: #47692

Authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>

* GH-48912: [R] Configure C++20 in conda R on continuous benchmarking (#48974)

### Rationale for this change

Benchmark failing since C++20 upgrade due to lack of C++20 configuration

### What changes are included in this PR?

Changes entirely from :robot: (Claude) with discussion from me regarding optimal approach.  

Description as follows:

> conda-forge's R package doesn't have CXX20 configured in Makeconf, even though the compiler (gcc 14.3.0) supports C++20. This causes Arrow R package installation to fail with "a C++20 compiler is required" because `R CMD config CXX20` returns empty. 
>
> This PR adds CXX20 configuration to R's Makeconf before building the Arrow R package in the benchmark hooks, if not already present.                                                               

### Are these changes tested?

I got :robot:  to try it locally in a container but I'm not convinced we'll know for sure til we try it out properly.

>  Tested in Docker container with Amazon Linux 2023 + conda-forge R - confirmed `R CMD config CXX20` returns empty before patch and `g++` after patch.
>
> The only thing we didn't test end-to-end was actually building Arrow R, but that would have taken much longer and the configure check (R CMD config CXX20 returning non-empty) is exactly what Arrow's configure script tests before proceeding.                                       

### Are there any user-facing changes?

Nope
* GitHub Issue: #48912

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Nic Crane <[email protected]>

* GH-36889: [C++][Python] Fix duplicate CSV header when first batch is empty (#48718)

### Rationale for this change

Fixes https://github.com/apache/arrow/issues/36889

When writing CSV from a table where the first batch is empty, the header gets written twice:

```python
table = pa.table({"col1": ["a", "b", "c"]})
combined = pa.concat_tables([table.schema.empty_table(), table])
write_csv(combined, buf)
# Result: "col1"\n"col1"\n"a"\n"b"\n"c"\n  <-- header appears twice
```

### What changes are included in this PR?

The bug happens because:
1. Header is written to `data_buffer_` and flushed during `CSVWriterImpl` initialization
2. The buffer is not cleared after flush
3. When the next batch is empty, `TranslateMinimalBatch` returns early without modifying `data_buffer_`
4. The write loop then writes `data_buffer_` which still contains stale content

The fix introduces a `WriteAndClearBuffer()` helper that writes the buffer to sink and clears it. This helper is used in all write paths:
- `WriteHeader()`
- `WriteRecordBatch()`
- `WriteTable()`

This ensures the buffer is always clean after any flush, making it impossible for stale content to be written again.

### Are these changes tested?

Yes. Added C++ tests in `writer_test.cc` and Python tests in `test_csv.py`:
- Empty batch at start of table
- Empty batch in middle of table

### Are there any user-facing changes?

No API changes. This is a bug fix that prevents duplicate headers when writing CSV from tables with empty batches.

* GitHub Issue: #36889

Lead-authored-by: Ruiyang Wang <[email protected]>
Co-authored-by: Ruiyang Wang <[email protected]>
Co-authored-by: Gang Wu <[email protected]>
Signed-off-by: Gang Wu <[email protected]>

* GH-48932: [C++][Packaging][FlightRPC] Fix `rsync` build error ODBC Nightly Package (#48933)

### Rationale for this change
#48932
### What changes are included in this PR?
- Fix `rsync` build error ODBC Nightly Package 
### Are these changes tested?
- tested in CI
### Are there any user-facing changes?
- After fix, users should be able to get Nightly ODBC package release

* GitHub Issue: #48932

Authored-by: Alina (Xi) Li <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-48951: [Docs] Add documentation relating to AI tooling (#48952)

### Rationale for this change

Add guidance re AI tooling

### What changes are included in this PR?

Updates to main docs and links to it from new contributor's guide

### Are these changes tested?

No but I'll built the docs

### Are there any user-facing changes?

Just docs

:robot: Changes generated using Claude Code - I took the discussion from the mailing list, asked it to add the original text and then apply suggested changes one at a time, made a few of my own tweaks, and then instructed it to edit things down a bit for clarity and conciseness.
* GitHub Issue: #48951

Lead-authored-by: Nic Crane <[email protected]>
Co-authored-by: Rok Mihevc <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Nic Crane <[email protected]>

* GH-49029: [Doc] Run sphinx-build in parallel (#49026)

### Rationale for this change

`sphinx-build` allows for parallel operation, but it builds serially by default and that can be very slow on our docs given the amount of documents (many of them auto-generated from API docs).

### Are these changes tested?

By existing CI jobs.

### Are there any user-facing changes?

No.
* GitHub Issue: #49029

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>

* GH-33450: [C++] Remove GlobalForkSafeMutex (#49033)

### Rationale for this change

This functionality is unused now that we have a proper atfork facility.

### Are these changes tested?

By existing CI tests.

### Are there any user-facing changes?

Removing an API that was always meant for internal use (though we didn't flag it explicitly as internal).

* GitHub Issue: #33450

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

* GH-35437: [C++] Remove obsolete TODO about DictionaryArray const& return types (#48956)

### Rationale for this change

The TODO comment in `vector_array_sort.cc` asking whether `DictionaryArray::dictionary()` and `DictionaryArray::indices()` should return `const&` has been obsolete.

It was added in commit 6ceb12f700a when dictionary array sorting was implemented. At that time, these methods returned `std::shared_ptr<Array>` by value, causing unnecessary copies.

The issue was fixed in commit 95a8bfb319b which changed both methods to return `const std::shared_ptr<Array>&`, removing the copies. However, the TODO comment was left unremoved.

### What changes are included in this PR?

Removed the outdated TODO comment that referenced GH-35437.

### Are these changes tested?

I did not test.

### Are there any user-facing changes?

No.
* GitHub Issue: #35437

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

* GH-48586: [Python][CI] Upload artifact to python-sdist job (#49008)

### Rationale for this change

When running the python-sdist job we are currently not uploading the build artifact to the job.

### What changes are included in this PR?

Upload artifact as part of building the job so it's easier to test and validate contents if necessary.

### Are these changes tested?

Yes via archery.

### Are there any user-facing changes?

No

* GitHub Issue: #48586

Authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>

* MINOR: [R] Add 22.0.0.1 to compatiblity matrix (#49039)

### Rationale for this change

CI needs updating to test old R package versions

### What changes are included in this PR?

Add 22.0.0.1

### Are these changes tested?

Nah, it's CI stuff

### Are there any user-facing changes?

No

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>

* GH-48961: [Docs][Python] Doctest fails on pandas 3.0 (#48969)

### Rationale for this change
See issue #48961
Pandas 3.0.0 string storage type changes https://github.com/pandas-dev/pandas/pull/62118/changes
and https://pandas.pydata.org/docs/whatsnew/v3.0.0.html#dedicated-string-data-type-by-default

### What changes are included in this PR?
Updating several doctest examples from `string` to `large_string`.

### Are these changes tested?
Yes, locally.

### Are there any user-facing changes?
No.

Closes #48961 
* GitHub Issue: #48961

Authored-by: Tadeja Kadunc <[email protected]>
Signed-off-by: AlenkaF <[email protected]>

* GH-49037: [Benchmarking] Install R from non-conda source for benchmarking  (#49038)

### Rationale for this change

Slow benchmarks due to conda duckdb building from source

### What changes are included in this PR?

Try ditching conda and installing R via rig and using PPM binaries

### Are these changes tested?

I'll try running

### Are there any user-facing changes?
 
Nope
* GitHub Issue: #49037

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Nic Crane <[email protected]>

* GH-49042: [C++] Remove mimalloc patch (#49041)

### Rationale for this change

This patch was integrated upstream in https://github.com/microsoft/mimalloc/pull/1139

### Are these changes tested?

By existing CI.

### Are there any user-facing changes?

No.
* GitHub Issue: #49042

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49024: [CI] Update Debian version in `.env` (#49032)

### Rationale for this change

Default Debian version in `.env` now maps to oldstable, we should use stable instead.
Also prune entries that are not used anymore.

### Are these changes tested?

By existing CI jobs.

### Are there any user-facing changes?

No.
* GitHub Issue: #49024

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49027: [Ruby] Add support for writing time arrays (#49028)

### Rationale for this change

There are 32/64 bit and second/millisecond/microsecond/nanosecond variants for time arrays.

### What changes are included in this PR?

* Add `ArrowFormat::TimeType#to_flatbuffers`
* Add bit width information to `ArrowFormat::TimeType`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: #49027

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49030: [Ruby] Add support for writing fixed size binary array (#49031)

### Rationale for this change

It's a fixed size variant of binary array.

### What changes are included in this PR?

* Add `ArrowFormat::FixedSizeBinaryType#to_flatbuffers`
* Add `ArrowFormat::FixedSizeBinaryArray#each_buffer`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49030

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-48866: [C++][Gandiva] Truncate subseconds beyond milliseconds in `castTIMESTAMP_utf8` and `castTIME_utf8` (#48867)

### Rationale for this change

Fixes #48866. The Gandiva precompiled time functions `castTIMESTAMP_utf8` and `castTIME_utf8` currently reject timestamp and time string literals with more than 3 subsecond digits (beyond millisecond precision), throwing an "Invalid millis" error. This behavior is inconsistent with other implementations.

### What changes are included in this PR?

- Fixed `castTIMESTAMP_utf8` and `castTIME_utf8` functions to truncate subseconds beyond 3 digits instead of throwing an error
- Updated tests. Replaced error-expecting tests with truncation verification tests and added edge cases

### Are these changes tested?

Yes

### Are there any user-facing changes?

No
* GitHub Issue: #48866

Authored-by: Arkadii Kravchuk <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-48673: [C++] Fix ToStringWithoutContextLines to check for :\d+ pattern before removing lines (#48674)

### Rationale for this change

This PR proposes to fix the TODO at https://github.com/apache/arrow/blob/7ebc88c8fae62ed97bc30865c845c8061132af7e/cpp/src/arrow/status.cc#L131-L134, which allows more robust parsing of line numbers.

I could not find a relevant example to demonstrate within this project, but assume that we have a test such as:

(Generated by ChatGPT)

```cpp
TEST(BlockParser, ErrorMessageWithColonsPreserved) {
  Status st(StatusCode::Invalid,
            "CSV parse error: Row #2: Expected 2 columns, got 3: 12:34:56,key:value,data\n"
            "Error details: Time format: 12:34:56, Key: value\n"
            "parser_test.cc:940  Parse(parser, csv, &out_size)");

  std::string expected_msg =
      "Invalid: CSV parse error: Row #2: Expected 2 columns, got 3: 12:34:56,key:value,data\n"
      "Error details: Time format: 12:34:56, Key: value";

  ASSERT_RAISES_WITH_MESSAGE(Invalid, expected_msg, st);
}

// Test with URL-like data (another common case with colons)
TEST(BlockParser, ErrorMessageWithURLPreserved) {
  Status st(StatusCode::Invalid,
            "CSV parse error: Row #2: Expected 1 columns, got 2: http://arrow.apache.org:8080/api,data\n"
            "URL: http://arrow.apache.org:8080/api\n"
            "parser_test.cc:974  Parse(parser, csv, &out_size)");

  std::string expected_msg =
      "Invalid: CSV parse error: Row #2: Expected 1 columns, got 2: http://arrow.apache.org:8080/api,data\n"
      "URL: http://arrow.apache.org:8080/api";

  ASSERT_RAISES_WITH_MESSAGE(Invalid, expected_msg, st);
}
```

then it fails.

### What changes are included in this PR?

Fixed `Status::ToStringWithoutContextLines()` to only remove context lines matching the `filename:line` pattern (`:\d+`), preventing legitimate error messages containing colons from being incorrectly stripped.
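The fix can be sketched in Python (illustrative only; the actual change lives in C++ in `status.cc`, and the helper names here are hypothetical): a trailing line is only stripped if it matches a `filename:line` pattern, so legitimate colons in the message survive.

```python
import re

# Matches an appended context line such as
# "parser_test.cc:940  Parse(parser, csv, &out_size)".
CONTEXT_LINE = re.compile(r"^\S+:\d+\s\s\S")

def strip_context_lines(message: str) -> str:
    lines = message.split("\n")
    # Remove trailing lines only when they look like "filename:line  expr".
    while lines and CONTEXT_LINE.match(lines[-1]):
        lines.pop()
    return "\n".join(lines)

msg = ("Invalid: CSV parse error: Time format: 12:34:56\n"
       "parser_test.cc:940  Parse(parser, csv, &out_size)")
print(strip_context_lines(msg))  # context line removed, colons in the message kept
```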

### Are these changes tested?

Manually tested, and unittests were added, with `cmake .. --preset ninja-debug -DARROW_EXTRA_ERROR_CONTEXT=ON`.

### Are there any user-facing changes?

No, test-only.

* GitHub Issue: #48673

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49044: [CI][Python] Fix test_download_tzdata_on_windows by adding required user-agent on urllib request (#49052)

### Rationale for this change

See: #49044

### What changes are included in this PR?

Urllib now request with `"user-agent": "pyarrow"`

### Are these changes tested?

It's a CI fix.

### Are there any user-facing changes?

No, just a CI test fix.
* GitHub Issue: #49044

Authored-by: Rok Mihevc <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>

* GH-48983: [Packaging][Python] Build wheel from sdist using build and add check to validate LICENSE.txt and NOTICE.txt are part of the wheel contents (#48988)

### Rationale for this change

Currently the files are missing from the published wheels.

### What changes are included in this PR?

- Ensure the license and notice files are part of the wheels
- Use build frontend to build wheels
- Build wheel from sdist

### Are these changes tested?

Yes, via archery.
I've validated all wheels will fail with the new check if LICENSE.txt or NOTICE.txt are missing:
```
 AssertionError: LICENSE.txt is missing from the wheel.
```
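A check of this kind can be sketched with the standard library (the helper name and file layout are hypothetical; the actual check in the PR may differ). A wheel is a zip archive, so it is enough to inspect its name list:

```python
import io
import zipfile

REQUIRED = ("LICENSE.txt", "NOTICE.txt")

def check_wheel_contents(wheel_bytes: bytes) -> None:
    """Assert that the required license files are present in the wheel."""
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as whl:
        basenames = [name.rsplit("/", 1)[-1] for name in whl.namelist()]
        for required in REQUIRED:
            assert required in basenames, f"{required} is missing from the wheel."

# Build a minimal in-memory "wheel" containing the required files.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as whl:
    whl.writestr("pyarrow-0.0.0.dist-info/LICENSE.txt", "...")
    whl.writestr("pyarrow-0.0.0.dist-info/NOTICE.txt", "...")
check_wheel_contents(buf.getvalue())  # passes silently
```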

### Are there any user-facing changes?

No

* GitHub Issue: #48983

Lead-authored-by: Raúl Cumplido <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Rok Mihevc <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>

* GH-49059: [C++] Fix issues found by OSS-Fuzz in IPC reader (#49060)

### Rationale for this change

Fix two issues found by OSS-Fuzz in the IPC reader:

* a controlled abort on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5301064831401984
* a nullptr dereference on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5091511766417408

None of these two issues is a security issue.

### Are these changes tested?

Yes, by new unit tests and new fuzz regression files.

### Are there any user-facing changes?

No.

**This PR contains a "Critical Fix".** It fixes a controlled abort and a nullptr dereference triggered by invalid IPC metadata, i.e. crashes even when the API contract is upheld (neither is a security issue).

* GitHub Issue: #49059

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

* GH-49055: [Ruby] Add support for writing decimal128/256 arrays (#49056)

### Rationale for this change

Only decimal128/256 arrays are supported.

### What changes are included in this PR?

Add `ArrowFormat::DecimalType#to_flatbuffers`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49055

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49053: [Ruby] Add support for writing timestamp array (#49054)

### Rationale for this change

It has `unit` and `time_zone` parameters.

### What changes are included in this PR?

* Add `ArrowFormat::TimestampType#to_flatbuffers`
* Set time zone when GLib timestamp type is converted from C++ timestamp type
* Use `time_zone` not `timezone`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49053

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-28859: [Doc][Python] Use only code-block directive and set up doctest for the python user guide (#48619)

### Rationale for this change

In many places in the Python User Guide the code examples are written with the IPython directive (elsewhere code-block is used). IPython directives are converted to IPython format (`In` and `Out`) during the doc build. This can lead to slower builds.

### What changes are included in this PR?

IPython directives are converted to runnable code-blocks (with `>>>` and `...`) and pytest doctest support for `.rst` files is added to the `conda-python-docs` CI job. This means the code in the Python User Guide is tested separately from the building of the documentation.

### Are these changes tested?

Yes, with the CI.

### Are there any user-facing changes?

Changes to the Python User Guide examples will have to be tested with `pytest --doctest-glob='*.rst' docs/source/python/file.rst`
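For illustration, a converted snippet in the `.rst` sources now looks roughly like this (the exact content here is hypothetical); the same block is consumed by both the Sphinx build and `pytest --doctest-glob='*.rst'`:

```rst
.. code-block:: python

   >>> import pyarrow as pa
   >>> arr = pa.array([1, 2, 3])
   >>> arr.type
   DataType(int64)
```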

* GitHub Issue: #28859

Lead-authored-by: AlenkaF <[email protected]>
Co-authored-by: Alenka Frim <[email protected]>
Co-authored-by: tadeja <[email protected]>
Signed-off-by: AlenkaF <[email protected]>

* GH-49065: [C++] Remove unnecessary copies of shared_ptr in Type::BOOL and Type::NA at GrouperImpl (#49066)

### Rationale for this change

The grouper code was creating a `shared_ptr<DataType>` for every key type, even when it wasn't needed. This resulted in unnecessary reference counting operations. For example, `BooleanKeyEncoder` and `NullKeyEncoder` don't require a `shared_ptr` in their constructors, yet we were creating one for every key of those types.

### What changes are included in this PR?

Changed `GrouperImpl::Make()` to use `TypeHolder` references directly and only call `GetSharedPtr()` when needed by encoder constructors. This eliminates `shared_ptr` creation for `Type::BOOL` and `Type::NA` cases. Other encoder types (dictionary, fixed-width, binary) still require `shared_ptr` since their constructors take `shared_ptr<DataType>` parameters for ownership.

### Are these changes tested?

Yes, existing tests.

### Are there any user-facing changes?

No.
* GitHub Issue: #49065

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-48159 [C++][Gandiva] Projector make is significantly slower after move to OrcJIT (#49063)

### Rationale for this change
Reduces LLVM TargetMachine object creation from 3 to 1. This object is expensive to create and the extra copies weren't needed.

### What changes are included in this PR?
Refactor the Engine class to only create one target machine and pass that to the necessary functions.

Before the change (3 TargetMachines created):

First TargetMachine: In Engine::Make(), MakeTargetMachineBuilder() is called, then BuildJIT() is called. Inside LLJITBuilder::create(), when prepareForConstruction() runs, if no DataLayout was set, it calls JTMB->getDefaultDataLayoutForTarget() which creates a temporary TargetMachine just to get the DataLayout.

Second TargetMachine: Inside BuildJIT(), when setCompileFunctionCreator is used with the lambda, that lambda calls JTMB.createTargetMachine() to create a TargetMachine for the TMOwningSimpleCompiler.

Third TargetMachine: Back in Engine::Make(), after BuildJIT() returns, there's an explicit call to jtmb.createTargetMachine() to create target_machine_ for the Engine.

After the change (1 TargetMachine created):

The key changes are:

Create the TargetMachine first: the code now creates the TargetMachine explicitly at the start of Engine::Make() and passes it to BuildJIT(). In BuildJIT(), that machine's DataLayout is passed to LLJITBuilder, which prevents prepareForConstruction() from calling getDefaultDataLayoutForTarget() (which would create a temporary TargetMachine).

Use SimpleCompiler instead of TMOwningSimpleCompiler:
SimpleCompiler takes a reference to an existing TargetMachine rather than owning one, so no new TargetMachine is created.
A shared_ptr is used to ensure that TargetMachine stays around for the lifetime of the LLJIT instance.

### Are these changes tested?
Yes, unit and integration.

### Are there any user-facing changes?
No.

* GitHub Issue: #48159

Lead-authored-by: [email protected] <[email protected]>
Co-authored-by: Logan Riggs <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49043: [C++][FS][Azure] Avoid bugs caused by empty first page(s) followed by non-empty subsequent page(s) (#49049)

### Rationale for this change
Prevent bugs similar to https://github.com/apache/arrow/issues/49043

### What changes are included in this PR?
- Implement `SkipStartingEmptyPages` for various types of PagedResponses used in the `AzureFileSystem`.
- Apply `SkipStartingEmptyPages` on the response from every list operation that returns a paged response.
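The underlying pattern, sketched in Python (the helper name is hypothetical; the actual code is C++ against the Azure SDK's paged responses), is to advance past empty leading pages so that callers never see an empty first page followed by non-empty ones:

```python
def skip_starting_empty_pages(pages):
    """Yield pages from an iterator, dropping empty pages that
    appear before the first non-empty one."""
    it = iter(pages)
    for page in it:
        if page:  # first non-empty page found
            yield page
            break
    yield from it  # later pages pass through, empty or not

result = list(skip_starting_empty_pages([[], [], ["a", "b"], [], ["c"]]))
print(result)  # [['a', 'b'], [], ['c']]
```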
 
### Are these changes tested?
Ran the tests in the codebase, including the ones that need to connect to real blob storage. This makes me fairly confident that I haven't introduced a regression.

The only reproduction I've found involves reading a production Azure blob storage account. With it I've verified that this PR solves https://github.com/apache/arrow/issues/49043, but I haven't been able to reproduce the issue in any checked-in tests. I tried copying a chunk of data from around our production reproduction into Azurite, but still can't reproduce it.

### Are there any user-facing changes?
Some low-probability bugs will be gone. No interface changes.
* GitHub Issue: #49043

Authored-by: Thomas Newton <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49034 [C++][Gandiva] Fix binary_string to not trigger error for null strings (#49035)

### Rationale for this change

The binary_string function will attempt to allocate 0 bytes of memory, which results in a null pointer being returned; the function interprets that as an error.

### What changes are included in this PR?
Add kCanReturnErrors to the function definition to match other string functions. 
Move the check for 0 byte length input earlier in the binary_string function to prevent the 0 allocation.
Add a unit test.

### Are these changes tested?
Yes, unit and integration testing.

### Are there any user-facing changes?
No.

* GitHub Issue: #49034

Authored-by: Logan Riggs <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-48980: [C++] Use COMPILE_OPTIONS instead of deprecated COMPILE_FLAGS (#48981)

### Rationale for this change

Arrow requires CMake 3.25 but was still using the deprecated `COMPILE_FLAGS` property. It is recommended to use `COMPILE_OPTIONS` (introduced in CMake 3.11) instead.

### What changes are included in this PR?

Replaced `COMPILE_FLAGS` with `COMPILE_OPTIONS` across `CMakeLists.txt` files, converted space-separated strings to semicolon-separated lists, and removed obsolete TODO comments.

### Are these changes tested?

Yes, through CI build and existing tests.

### Are there any user-facing changes?

No.
* GitHub Issue: #48980

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49069: [C++] Share Trie instances across CSV value decoders (#49070)

### Rationale for this change

The CSV converter was building identical Trie data structures (for null/true/false values) in every decoder instance, causing duplicate memory allocation and initialization overhead.

### What changes are included in this PR?

- Introduced `TrieCache` struct to hold shared Trie instances (null_trie, true_trie, false_trie)
- Updated `ValueDecoder` and all decoder subclasses to accept and reference a shared `TrieCache` instead of building their own Tries
- Updated `Converter` base class to create one `TrieCache` per converter and pass it to all decoders
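The pattern is the usual one of hoisting immutable lookup structures out of per-instance construction; a rough Python analogue (class and value names are illustrative, not the actual C++ API) would be:

```python
class Trie:
    """Stand-in for Arrow's Trie: a prebuilt immutable lookup table."""
    def __init__(self, words):
        self.words = frozenset(words)

    def __contains__(self, value):
        return value in self.words

class TrieCache:
    """Built once per converter and shared by all decoders."""
    def __init__(self):
        self.null_trie = Trie(["", "NULL", "null", "N/A"])
        self.true_trie = Trie(["1", "true", "True", "TRUE"])
        self.false_trie = Trie(["0", "false", "False", "FALSE"])

class BoolDecoder:
    def __init__(self, cache):
        self.cache = cache  # a reference to the shared cache, not a fresh copy

cache = TrieCache()
decoders = [BoolDecoder(cache) for _ in range(8)]
# All decoders share the same Trie instances.
print(all(d.cache is cache for d in decoders))  # True
```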

### Are these changes tested?

Yes, all existing tests. I ran a simple benchmark showing roughly 2-4% faster converter creation, and obviously less memory usage.

### Are there any user-facing changes?

No.
* GitHub Issue: #49069

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49076: [CI] Update vcpkg baseline to newer version (#49062)

### Rationale for this change

The current version of vcpkg used is from April 2025.

### What changes are included in this PR?

Update baseline to newer version.

### Are these changes tested?

Yes on CI. I've validated for example that xsimd 14 will be pulled.

### Are there any user-facing changes?
No

* GitHub Issue: #49076

Authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49074: [Ruby] Add support for writing interval arrays (#49075)

### Rationale for this change

There are year month/day time/month day nano variants.

### What changes are included in this PR?

* Add `ArrowFormat::IntervalType#to_flatbuffers`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49074

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49071: [Ruby] Add support for writing list and large list arrays (#49072)

### Rationale for this change

They use different offset sizes.

### What changes are included in this PR?

* Add `ArrowFormat::ListType#to_flatbuffers`
* Add `ArrowFormat::LargeListType#to_flatbuffers`
* Add `ArrowFormat::VariableSizeListArray#child`
* Add `ArrowFormat::VariableSizeListArray#each_buffer`
* `garrow_array_get_null_bitmap()` returns `NULL` when null bitmap doesn't exist
* Add `garrow_list_array_get_value_offsets_buffer()`
* Add `garrow_large_list_array_get_value_offsets_buffer()`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49071

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49087 [CI][Packaging][Gandiva] Add support for LLVM 15 or earlier again (#49091)

### Rationale for this change

LLVM 15 or earlier uses `llvm::Optional` not `std::optional`.

### What changes are included in this PR?

Use `llvm::Optional` with LLVM 15 or earlier.

### Are these changes tested?

Yes, compiling.

### Are there any user-facing changes?

No

* GitHub Issue: #49087

Authored-by: [email protected] <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49100: [Docs] Broken link to Swift page in implementations.rst (#49101)

### Rationale for this change

The Swift documentation link in the implementations.rst file was broken and returned a 404 error.

### What changes are included in this PR?

Updated the Swift documentation link in https://github.com/apache/arrow/blob/235841d644d5454f7067c44f580f301446ba1cc0/docs/source/implementations.rst?plain=1#L124 from the [broken GitHub README link](https://github.com/apache/arrow-swift/blob/main/Arrow/README.md) to the [Swift Package documentation](https://swiftpackageindex.com/apache/arrow-swift/main/documentation/arrow)

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

* GitHub Issue: #49100

Lead-authored-by: ChiLin Chiu <[email protected]>
Co-authored-by: Chilin <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49096: [Ruby] Add support for writing struct array (#49097)

### Rationale for this change

It's a nested array.

### What changes are included in this PR?

* Add `ArrowFormat::StructType#to_flatbuffers`
* Add `ArrowFormat::StructArray#each_buffer`
* Add `ArrowFormat::StructArray#children`
* Fix `ArrowFormat::Array#n_nulls`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49096

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49093: [Ruby] Add support for writing duration array (#49094)

### Rationale for this change

It has a `unit` parameter.

### What changes are included in this PR?

* Add `ArrowFormat::DurationType#to_flatbuffers`
* Add duration support to `#values` and `raw_records`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49093

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49098: [Packaging][deb] Add missing libarrow-cuda-glib-doc (#49099)

### Rationale for this change

Documents for libarrow-cuda-glib are generated but they aren't packaged.

### What changes are included in this PR?

Package documents for libarrow-cuda-glib.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49098

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-48764: [C++] Update xsimd (#48765)

### Rationale for this change
Homogenize the xsimd versions used.

### What changes are included in this PR?
Move to xsimd 14 to benefit from latest improvements relevant for improvements to the integer unpacking routines.

### Are these changes tested?
Yes, with current CI.
In fact, due to the absence of a pin, part of the CI already runs xsimd 14.

### Are there any user-facing changes?
No.

* GitHub Issue: #48764

Authored-by: AntoinePrv <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>

* GH-46008: [Python][Benchmarking] Remove unused asv benchmarking files (#49047)

### Rationale for this change

As discussed on the issue, we don't seem to have run asv benchmarks on Python in recent years. The setup is probably broken.

### What changes are included in this PR?

Remove asv benchmarking related files and docs.

### Are these changes tested?

No; validated via CI and ran preview-docs to validate the docs.

### Are there any user-facing changes?

No
* GitHub Issue: #46008

Authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>

* GH-49108: [Python] SparseCOOTensor.__repr__ missing f-string prefix (#49109)

### Rationale for this change

`SparseCOOTensor.__repr__` outputs literal `{self.type}` and `{self.shape}` instead of actual values due to missing f-string prefix.

### What changes are included in this PR?

Add f prefix to the string in `SparseCOOTensor.__repr__`.

### Are these changes tested?

Yes, it works after adding the f-string prefix:
```python3
>>> import pyarrow as pa
>>> import numpy as np
>>> dense_tensor = np.array([[0, 1, 0], [2, 0, 3]], dtype=np.float32)
>>> sparse_coo = pa.SparseCOOTensor.from_dense_numpy(dense_tensor)
>>> sparse_coo
<pyarrow.SparseCOOTensor>
type: float
shape: (2, 3)
```

### Are there any user-facing changes?

Yes; this fixes a bug that caused incorrect or invalid data to be produced:

```python3
>>> import pyarrow as pa
>>> import numpy as np
>>> dense_tensor = np.array([[0, 1, 0], [2, 0, 3]], dtype=np.float32)
>>> sparse_coo = pa.SparseCOOTensor.from_dense_numpy(dense_tensor)
>>> sparse_coo
<pyarrow.SparseCOOTensor>
type: {self.type}
shape: {self.shape}
```

* GitHub Issue: #49108

Authored-by: Chilin <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>

* GH-49083: [CI][Python] Remove dask-contrib/dask-expr from the nightly dask test builds (#49126)

### Rationale for this change
Failing nightly job for dask (test-conda-python-3.11-dask-upstream_devel).

### What changes are included in this PR?
Removal of the dask-contrib/dask-expr package, as it has been included in the dask dataframe module since January 2025.

### Are these changes tested?
Yes, with the extended dask build.

### Are there any user-facing changes?
No.
* GitHub Issue: #49083

Authored-by: AlenkaF <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>

* GH-49117: [Ruby] Add support for writing union arrays (#49118)

### Rationale for this change

There are dense and sparse variants.

### What changes are included in this PR?

* Add `garrow_union_array_get_n_fields()`
* Add `ArrowFormat::UnionArray#children`
* Add `ArrowFormat::DenseUnionArray#each_buffer`
* Add `ArrowFormat::SparseUnionArray#each_buffer`
* Add `ArrowFormat::UnionType#to_flatbuffers`
* Add `Arrow::UnionArray#fields`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49117

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49119: [Ruby] Add support for writing map array (#49120)

### Rationale for this change

It's a list based array.

### What changes are included in this PR?

* Add `ArrowFormat::MapType#to_flatbuffers`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49119

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-48922: [C++] Support Status-returning callables in Result::Map (#49127)

### Rationale for this change
Currently, `Result::Map` fails to compile when the mapping function returns a `Status`, because it tries to instantiate `Result<Status>`, which is prohibited. This change allows `Map` to return `Status` directly in such cases.

### What changes are included in this PR?
- Added EnsureResult specialization to allow Map to return Status directly.
- Added unit tests to verify success/error propagation and return type resolution.

### Are these changes tested?
Yes.

### Are there any user-facing changes?
No
* GitHub Issue: #48922

Authored-by: Abhishek Bansal <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

* GH-49003: [C++] Don't consider `out_of_range` an error in float parsing (#49095)

### Rationale for this change
This PR restores the behavior prior to version 23 for floating-point parsing on overflow and subnormal underflow.

`fast_float` didn't assign an error code in these cases in version `3.10.1`: it assigned `±Inf` on overflow and `0.0` on subnormal underflow. With the update to version `8.1`, it started to assign `std::errc::result_out_of_range` instead.
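For comparison, Python's own float parsing shows the restored behaviour: out-of-range magnitudes saturate to infinity and tiny magnitudes underflow to zero instead of raising an error:

```python
# Overflow saturates to +/- infinity rather than failing.
print(float("10E617"))   # inf
print(float("-10E617"))  # -inf
# Underflow below the smallest subnormal becomes 0.0.
print(float("10E-617"))  # 0.0
```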

### What changes are included in this PR?
Ignore `std::errc::result_out_of_range` and produce `±Inf` / `0.0` as appropriate instead of failing the conversion.

### Are these changes tested?
Yes. Created tests for overflow with positive and negative signed mantissa, and also created tests for subnormal, all of them for binary{16,32,64}.

### Are there any user-facing changes?
It's a user-facing change. The CSV reader in `libarrow==23` was reading these values as strings, while before it parsed them as `0` or `±inf`.

With this patch, the CSV reader in PyArrow outputs:

```python
>>> import pyarrow
>>> import pyarrow.csv
>>> import io
>>> table = pyarrow.csv.read_csv(io.BytesIO("data\n10E-617\n10E617\n-10E617".encode()))
>>> print(table)
pyarrow.Table
data: double
----
data: [[0,inf,-inf]]
```

Closes #49003 

* GitHub Issue: #49003

Authored-by: Alvaro-Kothe <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

* GH-48941: [C++] Generate proper UTF-8 strings in JSON test utilities (#48943)

### Rationale for this change

The JSON test utility `GenerateAscii` was only generating ASCII characters. We should have better test coverage for proper UTF-8 and Unicode handling.

### What changes are included in this PR?

Replaced ASCII-only generation with proper UTF-8 string generation that produces valid Unicode scalar values across all planes (BMP, SMP, SIP, planes 3-16), correctly encoded per RFC 3629.
Added that function as a utility.

### Are these changes tested?

There are existing tests for JSON.

### Are there any user-facing changes?

No, test-only.
* GitHub Issue: #48941

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

* GH-49067: [R] Disable GCS on macos (#49068)

### Rationale for this change
Builds that complete on CRAN

### What changes are included in this PR?
Disable GCS by default

### Are these changes tested?

### Are there any user-facing changes?
Hopefully not 

* GitHub Issue: #49067

---------

Co-authored-by: Nic Crane <[email protected]>

* GH-49115: [CI][Packaging][Python] Update vcpkg baseline for our wheels (#49116)

### Rationale for this change

Current wheels are failing to be built due to old version of vcpkg failing with our latest main.

### What changes are included in this PR?

- Update vcpkg version.
- Update patches
- Add `perl-Time-Piece` to some images as required to build newer OpenSSL.

### Are these changes tested?

Yes on CI

### Are there any user-facing changes?

No

* GitHub Issue: #49115

Authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-48954: [C++] Add test for null-type dictionary sorting and clarify XXX comment (#48955)

### Rationale for this change

Null-type dictionaries (e.g., `dictionary(int8(), null())`) are valid Arrow constructs supported from day one, but the sorting code had an uncertain `XXX Should this support Type::NA?` comment. We should explicitly support and test this because other functions already support this:

```python
import pyarrow as pa
import pyarrow.compute as pc

pc.array_sort_indices(pa.array([None, None, None, None], type=pa.int32()))
# [0, 1, 2, 3]
pc.array_sort_indices(pa.DictionaryArray.from_arrays(
    indices=pa.array([None, None, None, None], type=pa.int8()),
    dictionary=pa.array([], type=pa.null())
))
# [0, 1, 2, 3]
```

I believe it does not make sense to specifically disallow this in dictionaries at this point.

### What changes are included in this PR?

Added a unittest for null sorting behaviour.

### Are these changes tested?

Yes, the unittest was added.

### Are there any user-facing changes?

No.
* GitHub Issue: #48954

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

* GH-36193: [R] arm64 binaries for R  (#48574)

### Rationale for this change

Issues building on ARM

### What changes are included in this PR?

CI job and nixlibs update

### Are these changes tested?

On CI

### Are there any user-facing changes?

No

AI changes :robot:: Claude decided where to make the changes and helped debug failing builds, but I updated most of it (e.g. rstudio -> posit, choice of runners etc) 

* GitHub Issue: #36193

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Nic Crane <[email protected]>

* GH-48397: [R] Update docs on how to get our libarrow builds (#48995)

### Rationale for this change

Turning off GCS on CRAN to prevent excessive build times; we need to tell people who want to work with GCS how to do that.

### What changes are included in this PR?

Update docs.

### Are these changes tested?

Will preview docs build.

### Are there any user-facing changes?

Just docs.
* GitHub Issue: #48397

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Nic Crane <[email protected]>

* GH-49104: [C++] Fix Segfault in SparseCSFIndex::Equals with mismatched dimensions (#49105)

### Rationale for this change

The `SparseCSFIndex::Equals` method can crash when comparing two sparse indices that have a different number of dimensions. The method iterates over the `indices()` and `indptr()` vectors of the current object and accesses the corresponding elements in the `other` object without first verifying that both objects have matching vector sizes. This can lead to out-of-bounds access and a segmentation fault when the dimension counts differ.

### What changes are included in this PR?

This change adds explicit size equality checks for the `indices()` and `indptr()` vectors at the beginning of the `SparseCSFIndex::Equals` method. If the dimensions do not match, the method now safely returns `false` instead of attempting invalid memory access.
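A guard of this shape can be sketched in Python (illustrative only; the real method compares C++ buffer vectors, and the names here are hypothetical):

```python
def sparse_csf_index_equals(a, b):
    # Guard: mismatched dimension counts mean "not equal",
    # never out-of-bounds access into the shorter vectors.
    if len(a["indices"]) != len(b["indices"]):
        return False
    if len(a["indptr"]) != len(b["indptr"]):
        return False
    return (a["indices"] == b["indices"]
            and a["indptr"] == b["indptr"])

x = {"indices": [[0, 1], [0, 2]], "indptr": [[0, 1, 2]]}
y = {"indices": [[0, 1]], "indptr": [[0, 1, 2]]}
print(sparse_csf_index_equals(x, y))  # False, no crash
```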

### Are these changes tested?

Yes. The fix has been validated through targeted reproduction of the crash scenario using mismatched dimension counts, ensuring the method behaves safely and deterministically.

### Are there any user-facing changes?

No. This change improves internal safety and robustness without altering public APIs or observable user behavior.

* GitHub Issue: #49104

Lead-authored-by: Alirana2829 <[email protected]>
Co-authored-by: Ali Mahmood Rana <[email protected]>
Co-authored-by: Rok Mihevc <[email protected]>
Signed-off-by: Rok Mihevc <[email protected]>

* MINOR: [Docs] Add links to AI-generated code guidance (#49131)

### Rationale for this change

Add link to AI-generated code guidance - we should make sure the docs are updated before we merge this though

### What changes are included in this PR?

Add link to AI-generated code guidance

### Are these changes tested?

No

### Are there any user-facing changes?

No

Lead-authored-by: Nic Crane <[email protected]>
Co-authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Nic Crane <[email protected]>

* MINOR: [R] Add new vignette to pkgdown config (#49145)

### Rationale for this change

CI failing on preview-docs; see #49141

### What changes are included in this PR?

Add the vignette created in #49068 to pkgdown config

### Are these changes tested?

I'll trigger CI

### Are there any user-facing changes?

Nah

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Nic Crane <[email protected]>

* GH-49150: [Doc][CI][Python] Doctests failing on rst files due to pandas 3+ (#49088)

Fixes: #49150
See https://github.com/apache/arrow/pull/48619#issuecomment-3823269381

### Rationale for this change

Fix CI failures

### What changes are included in this PR?

Tests are made more general to allow for Pandas 2 and Pandas 3 style string types
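
One common way to make doctests tolerant of such representation differences is the stdlib `doctest` ELLIPSIS option, shown here as a general technique (it may differ from the exact change made in this PR):

```python
import doctest

checker = doctest.OutputChecker()
flags = doctest.ELLIPSIS

# With ELLIPSIS enabled, "..." in the expected output matches any text,
# so one doctest accepts both the pandas 2 spelling ("string") and the
# pandas 3 spelling ("large_string") of the dtype.
print(checker.check_output("...string\n", "string\n", flags))
print(checker.check_output("...string\n", "large_string\n", flags))
```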

### Are these changes tested?

By CI

### Are there any user-facing changes?

No
* GitHub Issue: #49150

Authored-by: Rok Mihevc <[email protected]>
Signed-off-by: Rok Mihevc <[email protected]>

* GH-41990: [C++] Fix AzureFileSystem compilation on Windows (#48971)

Let me preface this pull request that I have not worked in C++ in quite a while. Apologies if this is missing modern idioms or is an obtuse fix.

### Rationale for this change

I encountered an issue trying to compile the AzureFileSystem backend in C++ on Windows. Searching the issue tracker, it appears this is already a [known](https://github.com/apache/arrow/issues/41990) but unresolved problem. This is an attempt to either address the issue or move the conversation forward for someone more experienced.

### What changes are included in this PR?

AzureFileSystem uses `unique_ptr` while the other cloud file system implementations rely on `shared_ptr`. Since this is a forward-declared Impl in the header file but the destructor was defined inline (via `= default`), we're getting compilation issues with MSVC due to it requiring the complete type earlier than GCC/Clang.

This change removes the defaulted definition from the header file and moves it into the .cc file where we have a complete type.

Unrelated, I've also wrapped 2 exception variables in `ARROW_UNUSED`. These are warnings treated as errors by MSVC at compile time. This was revealed in CI after resolving the issue above.

### Are these changes tested?

I've enabled building and running the test suite in GHA in 8dd62d62a9af022813e9c8662956740340a9473f. I believe a large portion of those tests may be skipped though since Azurite isn't present from what I can see. I'm not tied to the GHA updates being included in the PR, it's currently here for demonstration purposes. I noticed the other FS implementations are also not built and tested on Windows.

One quirk of this PR is getting WIL in place to compile the Azure C++ SDK was not intuitive for me. I've placed a dummy `wilConfig.cmake` to get the Azure SDK to build, but I'd assume there's a better way to do this. I'm happy to refine the build setup if we choose to keep it.

### Are there any user-facing changes?

Nothing here should affect user-facing code beyond fixing the compilation issues. If there are concerns for things I'm missing, I'm happy to discuss those.

* GitHub Issue: #41990

Lead-authored-by: Nate Prewitt <[email protected]>
Co-authored-by: Nate Prewitt <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49138: [Packaging][Python] Remove nightly cython install from manylinux wheel dockerfile (#49139)

### Rationale for this change

We use nightlies version of Cython for free-threaded PyArrow wheels and they are currently failing, see https://github.com/apache/arrow/issues/49138

### What changes are included in this PR?

Nightly Cython install is removed and Cython is installed via [requirements file](https://github.com/apache/arrow/blob/main/python/requirements-wheel-build.txt#L2).

### Are these changes tested?
Yes.

### Are there any user-facing changes?
No.
* GitHub Issue: #49138

Authored-by: AlenkaF <[email protected]>
Signed-off-by: AlenkaF <[email protected]>

* GH-33459: [C++][Python] Support step >= 1 in list_slice kernel (#48769)

### Rationale for this change

Closes ARROW-18281, which has been open since 2022. The `list_slice` kernel currently rejects `start == stop`, but should return empty lists instead (following Python slicing semantics).

The implementation already handles this case correctly. When ARROW-18282 added step support, `bit_util::CeilDiv(stop - start, step)` naturally returns 0 for `start == stop`, producing empty lists. The only issue was the validation check (`start >= stop`) that prevented this from working.

### What changes are included in this PR?

- Changed validation from `start >= stop` to `start > stop` 
- Updated error message
- Added test cases

### Are these changes tested?

Yes, tests were added.

### Are there any user-facing changes?

Yes.

```python
import pyarrow.compute as pc
pc.list_slice([[1,2,3]], 0, 0)
```

Before:

```
pyarrow.lib.ArrowInvalid: `start`(0) should be greater than 0 and smaller than `stop`(0)
```

After:

```
<pyarrow.lib.ListArray object at 0x1a01b8b20>
[
  []
]
```
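
For reference, built-in Python slicing shows the semantics the kernel now follows: `start == stop` yields an empty sequence, and the result length is the ceiling of `(stop - start) / step` (a sketch of the arithmetic, not the kernel itself):

```python
import math

lst = [1, 2, 3]
print(lst[0:0])  # start == stop -> empty list, not an error

def slice_len(start, stop, step=1):
    # Mirrors bit_util::CeilDiv(stop - start, step) from the C++ kernel:
    # naturally 0 when start == stop, producing empty output lists.
    return max(0, math.ceil((stop - start) / step))

print(slice_len(0, 0))     # 0
print(slice_len(0, 3, 2))  # 2: elements at positions 0 and 2
```
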
* GitHub Issue: #33459

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: AlenkaF <[email protected]>

* GH-41863: [Python][Parquet] Support lz4_raw as a compression name alias (#49135)

Closes https://github.com/apache/arrow/issues/41863

### Rationale for this change

Other tools in the parquet ecosystem distinguish between `LZ4` and `LZ4_RAW`, matching the specification: https://parquet.apache.org/docs/file-format/data-pages/compression/

`LZ4` (framing) is of course deprecated. PyArrow does not support it, and instead simplifies the user-facing API, using `LZ4` as an alias for the `LZ4_RAW` codec. 

However, PyArrow does not accept `LZ4_RAW` as a valid alias for the `LZ4_RAW` codec:

```
ArrowException: Unsupported compression: lz4_raw
```

This is a friction issue, and confusing for some users who are aware of the differences.

### What changes are included in this PR?

- Adding `LZ4_RAW` to the acceptable codec names list.
- Modifying the `LZ4->LZ4_RAW` mapping to also accept `LZ4_RAW->LZ4_RAW`.
- Adding a test

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes, an additive change to the accepted codec names.
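
A minimal sketch of the alias mapping (hypothetical names for illustration; the real change lives in PyArrow's codec lookup):

```python
# Both user-facing spellings map to the same underlying codec.
_CODEC_ALIASES = {
    "lz4": "lz4_raw",
    "lz4_raw": "lz4_raw",  # newly accepted alias
}

def resolve_codec(name: str) -> str:
    """Resolve a user-supplied compression name to a codec identifier."""
    try:
        return _CODEC_ALIASES[name.lower()]
    except KeyError:
        raise ValueError(f"Unsupported compression: {name}")
```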

* GitHub Issue: #41863

Authored-by: Nick Woolmer <[email protected]>
Signed-off-by: AlenkaF <[email protected]>

* GH-48868: [Doc] Document security model for the Arrow formats (#48870)

### Rationale for this change

Accessing Arrow data or any of the formats can have non-trivial security implications, this is an attempt at documenting those.

### What changes are included in this PR?

Add a Security Considerations page in the Format section.

**Doc preview:** https://s3.amazonaws.com/arrow-data/pr_docs/48870/format/Security.html

### Are these changes tested?

N/A

### Are there any user-facing changes?

No.
* GitHub Issue: #48868

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

* GH-49004: [C++][FlightRPC] Run ODBC tests in workflow using `cpp_test.sh` (#49005)

### Rationale for this change
#49004 

### What changes are included in this PR?
- Run tests using `cpp_test.sh` in the ODBC job of C++ Extra CI.
Note:  `find_package(Arrow)` check in `cpp_test.sh` is disabled due to blocker GH-49050

### Are these changes tested?
Yes, in CI
### Are there any user-facing changes?
N/A
* GitHub Issue: #49004

Lead-authored-by: Alina (Xi) Li <[email protected]>
Co-authored-by: Alina (Xi) Li <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49092: [C++][FlightRPC][CI] Nightly Packaging: Add `dev-yyyy-mm-dd` to ODBC MSI name (#49151)

### Rationale for this change
#49092

### What changes are included in this PR?

-  Add `dev-yyyy-mm-dd` to ODBC MSI name. This is a similar approach to R nightly.

Before: `Apache Arrow Flight SQL ODBC-1.0.0-win64.msi`. After: `Apache Arrow Flight SQL ODBC-1.0.0-dev-2026-02-04-win64.msi`.

### Are these changes tested?

Tested in CI. Successfully renamed file: https://github.com/apache/arrow/actions/runs/21686252848/job/62534629714?pr=49151#step:3:26

### Are there any user-facing changes?

Yes, the nightly ODBC file names will be changed as described above. 

* GitHub Issue: #49092

Authored-by: Alina (Xi) Li <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49156: [Python] Require GIL for string comparison (#49161)

### Rationale for this change

With Cython 3.3.0.a0 this failed. After some discussion it seems that this should have always had to require the GIL.

### What changes are included in this PR?

Moving statement out of the `with nogil` context manager.

### Are these changes tested?

Existing CI builds pyarrow.

### Are there any user-facing changes?

No
* GitHub Issue: #49156

Authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>

* GH-48575: [C++][FlightRPC] Standalone ODBC macOS CI (#48577)

### Rationale for this change
#48575

### What changes are included in this PR?
- Add new ODBC workflow for macOS Intel 15 and 14 arm64.
- Added ODBC build fixes to enable build on macOS CI.
### Are these changes tested?
Tested in CI and local macOS Intel and M1 environments.
### Are there any user-facing changes?
N/A

* GitHub Issue: #48575

Lead-authored-by: Alina (Xi) Li <[email protected]>
Co-authored-by: justing-bq <[email protected]>
Co-authored-by: Victor Tsang <[email protected]>
Co-authored-by: Alina (Xi) Li <[email protected]>
Co-authored-by: vic-tsang <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49164: [C++] Avoid invalid if() args in cmake when arrow is a subproject (#49165)

### Rationale for this change

Ref #49164: In subproject builds, `DefineOptions.cmake` sets `ARROW_DEFINE_OPTIONS_DEFAULT` to OFF, so `ARROW_SIMD_LEVEL` is never defined. The `if()` at `cpp/src/arrow/io/CMakeLists.txt:48` uses `${ARROW_SIMD_LEVEL}` and expands to empty, leading to invalid `if()` arguments.

### What changes are included in this PR?

Use the variable name directly (no `${}`).

### Are these changes tested?

Yes.

### Are there any user-facing changes?

None.
* GitHub Issue: #49164

Authored-by: Rossi Sun <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-48132: [Ruby] Add support for writing dictionary array (#49175)

### Rationale for this change

Delta dictionary message support is out of scope.

### What changes are included in this PR?

* Add `ArrowFormat::DictionaryArray#each_buffer`
* Add `ArrowFormat::DictionaryType#build_fb_type`
* Add support for dictionary message in `ArrowFormat::StreamingWriter`
* Add support for writing dictionary message blocks in footer in `ArrowFormat::FileWriter`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #48132

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49081: [C++][Parquet] Correct variant's extension name (#49082)

### Rationale for this change

Correct variant extension according to arrow's specification.

### What changes are included in this PR?

Modified variant's hardcoded extension name.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

* GitHub Issue: #49081

Authored-by: Zehua Zou <[email protected]>
Signed-off-by: Gang Wu <[email protected]>

* GH-49102: [CI] Add type checking infrastructure and CI workflow for type annotations (#48618)

### Rationale for this change

This is the first in series of PRs adding type annotations to pyarrow and resolving #32609.

### What changes are included in this PR?

This PR establishes infrastructure for type checking:

- Adds CI workflow for running mypy, pyright, and ty type checkers on linux, macos and windows
- Configures type checkers to validate stub files (excluding source files for now)
- Adds PEP 561 `py.typed` marker to enable type checking
- Updates wheel build scripts to include stub files in distributions
- Creates initial minimal stub directory structure
- Updates developer documentation with type checking workflow

### Are these changes tested?

No. This is mostly a CI change.

### Are there any user-facing changes?

This does not add any actual annotations (only `py.typed` marker) so user should not be affected.
* GitHub Issue: #32609
* GitHub Issue: #49102

Lead-authored-by: Rok Mihevc <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Rok Mihevc <[email protected]>

* GH-49190: [C++][CI] Fix `unknown job 'odbc' error` in C++ Extra Workflow (#49192)

### Rationale for this change
See #49190

### What changes are included in this PR?

Fix `unknown job 'odbc' error` caused by typo

### Are these changes tested?

Tested in CI

### Are there any user-facing changes?

N/A

* GitHub Issue: #49190

Authored-by: Alina (Xi) Li <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* MINOR: [CI] Bump docker/login-action from 3.6.0 to 3.7.0 (#49191)

Bumps [docker/login-action](https://github.com/docker/login-action) from 3.6.0 to 3.7.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/docker/login-action/releases">docker/login-action's releases</a>.</em></p>
<blockquote>
<h2>v3.7.0</h2>
<ul>
<li>Add <code>scope</code> input to set scopes for the authentication token by <a href="https://github.com/crazy-max"><code>@​crazy-max</code></a> in <a href="https://redirect.github.com/docker/login-action/pull/912">docker/login-action#912</a></li>
<li>Add support for AWS European Sovereign Cloud ECR by <a href="https://github.com/dphi"><code>@​dphi</code></a> in <a href="https://redirect.github.com/docker/login-action/pull/914">docker/login-action#914</a></li>
<li>Ensure passwords are redacted with <code>registry-auth</code> input by <a href="https://github.com/crazy-max"><code>@​crazy-max</code></a> in <a href="https://redirect…
* GH-48965: [Python][C++] Compare unique_ptr for CFlightResult or CFlightInfo to nullptr instead of NULL (#48968)

### Rationale for this change

Cython built code is currently failing to compile on free threaded wheels due to:
```
/arrow/python/build/temp.linux-x86_64-cpython-313t/_flight.cpp: In function ‘PyObject* __pyx_gb_7pyarrow_7_flight_12FlightClient_9do_action_2generator2(__pyx_CoroutineObject*, PyThreadState*, PyObject*)’:
/arrow/python/build/temp.linux-x86_64-cpython-313t/_flight.cpp:43068:110: error: call of overloaded ‘unique_ptr(NULL)’ is ambiguous
43068 |           __pyx_t_3 = (__pyx_cur_scope->__pyx_v_result->result == ((std::unique_ptr< arrow::flight::Result> )NULL));
      |                            
```

### What changes are included in this PR?

Update comparing `unique_ptr[CFlightResult]` and `unique_ptr[CFlightInfo]` from `NULL` to `nullptr`.

### Are these changes tested?

Yes via archery.

### Are there any user-facing changes?

No

* GitHub Issue: #48965

Authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>

* GH-48924: [C++][CI] Fix pre-buffering issues in IPC file reader (#48925)

### What changes are included in this PR?

Bug fixes and robustness improvements in the IPC file reader:
* Fix bug reading variadic buffers with pre-buffering enabled
* Fix bug reading dictionaries with pre-buffering enabled
* Validate IPC buffer offsets and lengths

Testing improvements:
* Exercise pre-buffering in IPC tests
* Actually exercise variadic buffers in IPC tests, by ensuring non-inline binary views are generated
* Run fuzz targets on golden IPC integration files in ASAN/UBSAN CI job
* Exercise pre-buffering in the IPC file fuzz target

Miscellaneous:
* Add convenience functions for integer overflow checking

### Are these changes tested?

Yes, by existing and improved tests.

### Are there any user-facing changes?

Bug fixes.

**This PR contains a "Critical Fix".** Fixes a potential crash reading variadic buffers with pre-buffering enabled.

* GitHub Issue: #48924

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

* GH-48966: [C++] Fix cookie duplication in the Flight SQL ODBC driver and the Flight Client (#48967)


### Rationale for this change

The bug breaks a Flight SQL server that refreshes the auth token when cookie authentication is enabled

### What changes are included in this PR?

1. In the ODBC layer, removed the code that adds a 2nd ClientCookieMiddlewareFactory in the client options (the 1st one is registered in `BuildFlightClientOptions`). This fixes the issue of the duplicate header cookie fields.
2. In the flight client layer, uses the case-insensitive equality comparator instead of the case-insensitive less-than comparator for the cookies cache which is an unordered map. This fixes the issue of duplicate cookie keys.
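
The distinction matters because a hash map needs an equality predicate, not an ordering. A case-insensitive key sketch in Python (hypothetical `CookieJar` class; the fix itself is in the C++ Flight client's cookie cache):

```python
class CookieJar:
    """Cookie cache keyed case-insensitively, so "Session" and "session"
    update the same entry instead of accumulating duplicates."""

    def __init__(self):
        self._cookies = {}

    def set(self, name, value):
        # Lower-cased key acts as the case-insensitive equality check.
        self._cookies[name.lower()] = (name, value)

    def header(self):
        return "; ".join(f"{n}={v}" for n, v in self._cookies.values())

jar = CookieJar()
jar.set("Session", "abc")
jar.set("session", "def")  # token refresh: replaces, doesn't duplicate
print(jar.header())  # session=def
```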

### Are these changes tested?
Manually on Windows, and CI

### Are there any user-facing changes?

No
* GitHub Issue: #48966

Authored-by: jianfengmao <[email protected]>
Signed-off-by: David Li <[email protected]>

* GH-48691: [C++][Parquet] Write serializer may crash if the value buffer is empty (#48692)

### Rationale for this change
WriteArrowSerialize could unconditionally read values from the Arrow array even for null rows. Since the caller could provide a zero-sized dummy buffer for all-null arrays, this caused an ASAN heap-buffer-overflow.

### What changes are included in this PR?
Check early that the array is not all-null before serializing it

### Are these changes tested?

Added tests.
### Are there any user-facing changes?

No

* GitHub Issue: #48691

Authored-by: rexan <[email protected]>
Signed-off-by: Gang Wu <[email protected]>

* GH-48947: [CI][Python] Install pymanager.msi instead of pymanager.msix to fix docker rebuild on Windows wheels (#48948)

### Rationale for this change

As soon as we have to rebuild our Windows docker images they will fail installing python-manager-25.0.msix

### What changes are included in this PR?

- Use `pymanager.msi` to install python version instead of `pymanager.msix` which has problems on Docker.
- Update `pymanager install` command to use newer API (old command fails with missing flags)
- Update the default python command to use the required free-threaded suffix when building free-threaded wheels

### Are these changes tested?

Yes via archery

### Are there any user-facing changes?

No
* GitHub Issue: #48947

Authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>

* GH-48990: [Ruby] Add support for writing date arrays (#48991)

### Rationale for this change

There are date32 and date64 variants for date arrays.

### What changes are included in this PR?

* Add `ArrowFormat::DateType#to_flatbuffers`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #48990

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-48992: [Ruby] Add support for writing large UTF-8 array (#48993)

### Rationale for this change

It's a large variant of UTF-8 array.

### What changes are included in this PR?

* Add `ArrowFormat::LargeUTF8Type#to_flatbuffers`
* Add support for large UTF-8 array of `#values` and `#raw_records`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #48992

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-48949: [C++][Parquet] Add Result versions for parquet::arrow::FileReader::ReadRowGroup(s) (#48982)

### Rationale for this change
`FileReader::ReadRowGroup(s)` previously returned `Status` and required callers to pass an `out` parameter.
### What changes are included in this PR?
Introduce `Result<std::shared_ptr<Table>>` returning APIs to allow clearer error propagation:
  - Add new Result-returning `ReadRowGroup()` / `ReadRowGroups()` methods
  - Deprecate the old Status/out-parameter overloads
  - Update C++ callers and R/Python/GLib bindings to use the new API
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
Status versions of FileReader::ReadRowGroup(s) have been deprecated.
```cpp
virtual ::arrow::Status ReadRowGroup(int i, const std::vector<int>& column_indices,
                                     std::shared_ptr<::arrow::Table>* out);
virtual ::arrow::Status ReadRowGroup(int i, std::shared_ptr<::arrow::Table>* out);

virtual ::arrow::Status ReadRowGroups(const std::vector<int>& row_groups,
                                      const std::vector<int>& column_indices,
                                      std::shared_ptr<::arrow::Table>* out);
virtual ::arrow::Status ReadRowGroups(const std::vector<int>& row_groups,
                                      std::shared_ptr<::arrow::Table>* out);
```
* GitHub Issue: #48949

Lead-authored-by: fenfeng9 <[email protected]>
Co-authored-by: fenfeng9 <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Gang Wu <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-48985: [GLib][Ruby] Fix GC problems in node options and expressions (#48989)

### Rationale for this change

Some node options and expressions don't keep references to their arguments. Without those references, the arguments may be freed by GC.

### What changes are included in this PR?

* Refer arguments of `garrow_filter_node_options_new()`
* Refer arguments of `garrow_project_node_options_new()`
* Refer arguments of `garrow_aggregate_node_options_new()`
* Refer arguments of `garrow_literal_expression_new()`
* Refer arguments of `garrow_call_expression_new()`
 
### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #48985

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-47692: [CI][Python] Do not fallback to return 404 if wheel is found on emscripten jobs (#49007)

### Rationale for this change

When looking for the wheel the script was falling back to returning a 404 even when the wheel was found:
```
 + python scripts/run_emscripten_tests.py dist/pyarrow-24.0.0.dev31-cp312-cp312-pyodide_2024_0_wasm32.whl --dist-dir=/pyodide --runtime=chrome
127.0.0.1 - - [27/Jan/2026 01:14:50] code 404, message File not found
```
Timing out the job and failing.

### What changes are included in this PR?

Correct logic and only return 404 if the file requested wasn't found.
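
The corrected control flow can be sketched with a hypothetical helper (the actual change is in `scripts/run_emscripten_tests.py`):

```python
import os

def response_status(path: str) -> int:
    """Pick an HTTP status for a requested file.

    Only fall back to 404 when the file truly doesn't exist; the bug was
    that the 404 branch also ran after the wheel had been found.
    """
    if os.path.isfile(path):
        return 200
    return 404
```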

### Are these changes tested?

Yes via archery

### Are there any user-facing changes?

No
* GitHub Issue: #47692

Authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>

* GH-48912: [R] Configure C++20 in conda R on continuous benchmarking (#48974)

### Rationale for this change

Benchmark failing since C++20 upgrade due to lack of C++20 configuration

### What changes are included in this PR?

Changes entirely from :robot: (Claude) with discussion from me regarding optimal approach.  

Description as follows:

> conda-forge's R package doesn't have CXX20 configured in Makeconf, even though the compiler (gcc 14.3.0) supports C++20. This causes Arrow R package installation to fail with "a C++20 compiler is required" because `R CMD config CXX20` returns empty. 
>
> This PR adds CXX20 configuration to R's Makeconf before building the Arrow R package in the benchmark hooks, if not already present.                                                               

### Are these changes tested?

I got :robot:  to try it locally in a container but I'm not convinced we'll know for sure til we try it out properly.

>  Tested in Docker container with Amazon Linux 2023 + conda-forge R - confirmed `R CMD config CXX20` returns empty before patch and `g++` after patch.
>
> The only thing we didn't test end-to-end was actually building Arrow R, but that would have taken much longer and the configure check (R CMD config CXX20 returning non-empty) is exactly what Arrow's configure script tests before proceeding.                                       

### Are there any user-facing changes?

Nope
* GitHub Issue: #48912

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Nic Crane <[email protected]>

* GH-36889: [C++][Python] Fix duplicate CSV header when first batch is empty (#48718)

### Rationale for this change

Fixes https://github.com/apache/arrow/issues/36889

When writing CSV from a table where the first batch is empty, the header gets written twice:

```python
table = pa.table({"col1": ["a", "b", "c"]})
combined = pa.concat_tables([table.schema.empty_table(), table])
write_csv(combined, buf)
# Result: "col1"\n"col1"\n"a"\n"b"\n"c"\n  <-- header appears twice
```

### What changes are included in this PR?

The bug happens because:
1. Header is written to `data_buffer_` and flushed during `CSVWriterImpl` initialization
2. The buffer is not cleared after flush
3. When the next batch is empty, `TranslateMinimalBatch` returns early without modifying `data_buffer_`
4. The write loop then writes `data_buffer_` which still contains stale content

The fix introduces a `WriteAndClearBuffer()` helper that writes the buffer to sink and clears it. This helper is used in all write paths:
- `WriteHeader()`
- `WriteRecordBatch()`
- `WriteTable()`

This ensures the buffer is always clean after any flush, making it impossible for stale content to be written again.
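
The write-and-clear pattern can be sketched with an in-memory buffer (a hypothetical Python analogue of the C++ `WriteAndClearBuffer()` helper, under the assumptions above):

```python
import io

class CsvWriter:
    """Toy writer mirroring the fix: every flush also clears the buffer."""

    def __init__(self, sink, header):
        self.sink = sink
        self.buf = io.StringIO()
        self.buf.write(header)
        self._write_and_clear_buffer()

    def _write_and_clear_buffer(self):
        # Write buffered text to the sink, then reset the buffer so an
        # empty batch can't re-emit stale content such as the header.
        self.sink.write(self.buf.getvalue())
        self.buf = io.StringIO()

    def write_batch(self, rows):
        for row in rows:
            self.buf.write(row + "\n")
        self._write_and_clear_buffer()

sink = io.StringIO()
writer = CsvWriter(sink, '"col1"\n')
writer.write_batch([])              # empty first batch
writer.write_batch(['"a"', '"b"'])
print(sink.getvalue())              # header appears exactly once
```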

### Are these changes tested?

Yes. Added C++ tests in `writer_test.cc` and Python tests in `test_csv.py`:
- Empty batch at start of table
- Empty batch in middle of table

### Are there any user-facing changes?

No API changes. This is a bug fix that prevents duplicate headers when writing CSV from tables with empty batches.

* GitHub Issue: #36889

Lead-authored-by: Ruiyang Wang <[email protected]>
Co-authored-by: Ruiyang Wang <[email protected]>
Co-authored-by: Gang Wu <[email protected]>
Signed-off-by: Gang Wu <[email protected]>

* GH-48932: [C++][Packaging][FlightRPC] Fix `rsync` build error ODBC Nightly Package (#48933)

### Rationale for this change
#48932
### What changes are included in this PR?
- Fix `rsync` build error ODBC Nightly Package 
### Are these changes tested?
- tested in CI
### Are there any user-facing changes?
- After fix, users should be able to get Nightly ODBC package release

* GitHub Issue: #48932

Authored-by: Alina (Xi) Li <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-48951: [Docs] Add documentation relating to AI tooling (#48952)

### Rationale for this change

Add guidance re AI tooling

### What changes are included in this PR?

Updates to main docs and links to it from new contributor's guide

### Are these changes tested?

No, but I'll build the docs

### Are there any user-facing changes?

Just docs

:robot: Changes generated using Claude Code - I took the discussion from the mailing list, asked it to add the original text and then apply suggested changes one at a time, made a few of my own tweaks, and then instructed it to edit things down a bit for clarity and conciseness.
* GitHub Issue: #48951

Lead-authored-by: Nic Crane <[email protected]>
Co-authored-by: Rok Mihevc <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Nic Crane <[email protected]>

* GH-49029: [Doc] Run sphinx-build in parallel (#49026)

### Rationale for this change

`sphinx-build` allows for parallel operation, but it builds serially by default and that can be very slow on our docs given the amount of documents (many of them auto-generated from API docs).

### Are these changes tested?

By existing CI jobs.

### Are there any user-facing changes?

No.
* GitHub Issue: #49029

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>

* GH-33450: [C++] Remove GlobalForkSafeMutex (#49033)

### Rationale for this change

This functionality is unused now that we have a proper atfork facility.

### Are these changes tested?

By existing CI tests.

### Are there any user-facing changes?

Removing an API that was always meant for internal use (though we didn't flag it explicitly as internal).

* GitHub Issue: #33450

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

* GH-35437: [C++] Remove obsolete TODO about DictionaryArray const& return types (#48956)

### Rationale for this change

The TODO comment in `vector_array_sort.cc` asking whether `DictionaryArray::dictionary()` and `DictionaryArray::indices()` should return `const&` is obsolete.

It was added in commit 6ceb12f700a when dictionary array sorting was implemented. At that time, these methods returned `std::shared_ptr<Array>` by value, causing unnecessary copies.

The issue was fixed in commit 95a8bfb319b which changed both methods to return `const std::shared_ptr<Array>&`, removing the copies. However, the TODO comment was left in place.

### What changes are included in this PR?

Removed the outdated TODO comment that referenced GH-35437.

### Are these changes tested?

I did not test.

### Are there any user-facing changes?

No.
* GitHub Issue: #35437

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

* GH-48586: [Python][CI] Upload artifact to python-sdist job (#49008)

### Rationale for this change

When running the python-sdist job we are currently not uploading the build artifact to the job.

### What changes are included in this PR?

Upload artifact as part of building the job so it's easier to test and validate contents if necessary.

### Are these changes tested?

Yes via archery.

### Are there any user-facing changes?

No

* GitHub Issue: #48586

Authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>

* MINOR: [R] Add 22.0.0.1 to compatibility matrix (#49039)

### Rationale for this change

CI needs updating to test old R package versions

### What changes are included in this PR?

Add 22.0.0.1

### Are these changes tested?

Nah, it's CI stuff

### Are there any user-facing changes?

No

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>

* GH-48961: [Docs][Python] Doctest fails on pandas 3.0 (#48969)

### Rationale for this change
See issue #48961
Pandas 3.0.0 string storage type changes https://github.com/pandas-dev/pandas/pull/62118/changes
and https://pandas.pydata.org/docs/whatsnew/v3.0.0.html#dedicated-string-data-type-by-default

### What changes are included in this PR?
Updating several doctest examples from `string` to `large_string`.

### Are these changes tested?
Yes, locally.

### Are there any user-facing changes?
No.

Closes #48961 
* GitHub Issue: #48961

Authored-by: Tadeja Kadunc <[email protected]>
Signed-off-by: AlenkaF <[email protected]>

* GH-49037: [Benchmarking] Install R from non-conda source for benchmarking  (#49038)

### Rationale for this change

Slow benchmarks due to conda duckdb building from source

### What changes are included in this PR?

Try ditching conda and installing R via rig and using PPM binaries

### Are these changes tested?

I'll try running

### Are there any user-facing changes?
 
Nope
* GitHub Issue: #49037

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Nic Crane <[email protected]>

* GH-49042: [C++] Remove mimalloc patch (#49041)

### Rationale for this change

This patch was integrated upstream in https://github.com/microsoft/mimalloc/pull/1139

### Are these changes tested?

By existing CI.

### Are there any user-facing changes?

No.
* GitHub Issue: #49042

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49024: [CI] Update Debian version in `.env` (#49032)

### Rationale for this change

Default Debian version in `.env` now maps to oldstable, we should use stable instead.
Also prune entries that are not used anymore.

### Are these changes tested?

By existing CI jobs.

### Are there any user-facing changes?

No.
* GitHub Issue: #49024

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49027: [Ruby] Add support for writing time arrays (#49028)

### Rationale for this change

There are 32/64 bit and second/millisecond/microsecond/nanosecond variants for time arrays.

### What changes are included in this PR?

* Add `ArrowFormat::TimeType#to_flatbuffers`
* Add bit width information to `ArrowFormat::TimeType`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: #49027

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49030: [Ruby] Add support for writing fixed size binary array (#49031)

### Rationale for this change

It's a fixed size variant of binary array.

### What changes are included in this PR?

* Add `ArrowFormat::FixedSizeBinaryType#to_flatbuffers`
* Add `ArrowFormat::FixedSizeBinaryArray#each_buffer`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49030

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-48866: [C++][Gandiva] Truncate subseconds beyond milliseconds in `castTIMESTAMP_utf8` and `castTIME_utf8` (#48867)

### Rationale for this change

Fixes #48866. The Gandiva precompiled time functions `castTIMESTAMP_utf8` and `castTIME_utf8` currently reject timestamp and time string literals with more than 3 subsecond digits (beyond millisecond precision), throwing an "Invalid millis" error. This behavior is inconsistent with other implementations.

### What changes are included in this PR?

- Fixed `castTIMESTAMP_utf8` and `castTIME_utf8` functions to truncate subseconds beyond 3 digits instead of throwing an error
- Updated tests. Replaced error-expecting tests with truncation verification tests and added edge cases
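The new behaviour can be sketched in Python (a hypothetical model of the truncation logic, not the actual Gandiva C++ code): subsecond digits beyond the third are dropped instead of triggering an "Invalid millis" error.

```python
# Hypothetical sketch: keep at most three subsecond digits (millisecond
# precision) and silently truncate the rest, rather than rejecting the input.
def truncate_subseconds(ts: str) -> str:
    if "." not in ts:
        return ts  # no fractional part, nothing to truncate
    head, frac = ts.rsplit(".", 1)
    return f"{head}.{frac[:3]}"

print(truncate_subseconds("2000-09-23 09:45:30.920345678"))
# 2000-09-23 09:45:30.920
```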

### Are these changes tested?

Yes

### Are there any user-facing changes?

No
* GitHub Issue: #48866

Authored-by: Arkadii Kravchuk <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-48673: [C++] Fix ToStringWithoutContextLines to check for :\d+ pattern before removing lines (#48674)

### Rationale for this change

This PR proposes to fix the TODO https://github.com/apache/arrow/blob/7ebc88c8fae62ed97bc30865c845c8061132af7e/cpp/src/arrow/status.cc#L131-L134, which allows better parsing of line numbers.

I could not find a relevant example to demonstrate within this project, but assume we have a test such as:

(Generated by ChatGPT)

```cpp
TEST(BlockParser, ErrorMessageWithColonsPreserved) {
  Status st(StatusCode::Invalid,
            "CSV parse error: Row #2: Expected 2 columns, got 3: 12:34:56,key:value,data\n"
            "Error details: Time format: 12:34:56, Key: value\n"
            "parser_test.cc:940  Parse(parser, csv, &out_size)");

  std::string expected_msg =
      "Invalid: CSV parse error: Row #2: Expected 2 columns, got 3: 12:34:56,key:value,data\n"
      "Error details: Time format: 12:34:56, Key: value";

  ASSERT_RAISES_WITH_MESSAGE(Invalid, expected_msg, st);
}

// Test with URL-like data (another common case with colons)
TEST(BlockParser, ErrorMessageWithURLPreserved) {
  Status st(StatusCode::Invalid,
            "CSV parse error: Row #2: Expected 1 columns, got 2: http://arrow.apache.org:8080/api,data\n"
            "URL: http://arrow.apache.org:8080/api\n"
            "parser_test.cc:974  Parse(parser, csv, &out_size)");

  std::string expected_msg =
      "Invalid: CSV parse error: Row #2: Expected 1 columns, got 2: http://arrow.apache.org:8080/api,data\n"
      "URL: http://arrow.apache.org:8080/api";

  ASSERT_RAISES_WITH_MESSAGE(Invalid, expected_msg, st);
}
```

then it fails.

### What changes are included in this PR?

Fixed `Status::ToStringWithoutContextLines()` to only remove context lines matching the `filename:line` pattern (`:\d+`), preventing legitimate error messages containing colons from being incorrectly stripped.
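The fixed logic can be illustrated with a hypothetical Python re-implementation (the `CONTEXT_LINE` pattern and helper name are illustrative, not the C++ code): only trailing lines matching a `filename:line` pattern are removed, so legitimate error text containing colons (times, URLs) survives.

```python
import re

# Only lines that look like "file.cc:123  expression" count as context lines.
CONTEXT_LINE = re.compile(r"^\S+:\d+\s")

def strip_context_lines(message: str) -> str:
    lines = message.split("\n")
    # Drop trailing context lines appended by ARROW_EXTRA_ERROR_CONTEXT.
    while lines and CONTEXT_LINE.match(lines[-1]):
        lines.pop()
    return "\n".join(lines)

msg = ("Invalid: CSV parse error: Time format: 12:34:56\n"
       "parser_test.cc:940  Parse(parser, csv, &out_size)")
print(strip_context_lines(msg))
# Invalid: CSV parse error: Time format: 12:34:56
```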

### Are these changes tested?

Manually tested, and unittests were added, with `cmake .. --preset ninja-debug -DARROW_EXTRA_ERROR_CONTEXT=ON`.

### Are there any user-facing changes?

No, test-only.

* GitHub Issue: #48673

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49044: [CI][Python] Fix test_download_tzdata_on_windows by adding required user-agent on urllib request (#49052)

### Rationale for this change

See: #49044

### What changes are included in this PR?

Urllib now sends requests with `"user-agent": "pyarrow"`.

### Are these changes tested?

It's a CI fix.

### Are there any user-facing changes?

No, just a CI test fix.
* GitHub Issue: #49044

Authored-by: Rok Mihevc <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>

* GH-48983: [Packaging][Python] Build wheel from sdist using build and add check to validate LICENSE.txt and NOTICE.txt are part of the wheel contents (#48988)

### Rationale for this change

Currently the files are missing from the published wheels.

### What changes are included in this PR?

- Ensure the license and notice files are part of the wheels
- Use build frontend to build wheels
- Build wheel from sdist

### Are these changes tested?

Yes, via archery.
I've validated all wheels will fail with the new check if LICENSE.txt or NOTICE.txt are missing:
```
 AssertionError: LICENSE.txt is missing from the wheel.
```

### Are there any user-facing changes?

No

* GitHub Issue: #48983

Lead-authored-by: Raúl Cumplido <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Rok Mihevc <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>

* GH-49059: [C++] Fix issues found by OSS-Fuzz in IPC reader (#49060)

### Rationale for this change

Fix two issues found by OSS-Fuzz in the IPC reader:

* a controlled abort on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5301064831401984
* a nullptr dereference on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5091511766417408

None of these two issues is a security issue.

### Are these changes tested?

Yes, by new unit tests and new fuzz regression files.

### Are there any user-facing changes?

No.

**This PR contains a "Critical Fix":** it fixes a controlled abort and a nullptr dereference triggered by invalid IPC metadata.

* GitHub Issue: #49059

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

* GH-49055: [Ruby] Add support for writing decimal128/256 arrays (#49056)

### Rationale for this change

Only decimal128/256 arrays are supported.

### What changes are included in this PR?

Add `ArrowFormat::DecimalType#to_flatbuffers`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49055

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49053: [Ruby] Add support for writing timestamp array (#49054)

### Rationale for this change

It has `unit` and `time_zone` parameters.

### What changes are included in this PR?

* Add `ArrowFormat::TimestampType#to_flatbuffers`
* Set time zone when GLib timestamp type is converted from C++ timestamp type
* Use `time_zone` not `timezone`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49053

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-28859: [Doc][Python] Use only code-block directive and set up doctest for the python user guide (#48619)

### Rationale for this change

In many places in the Python User Guide the code examples are written with the IPython directive (elsewhere code-block is used). IPython directives are converted to IPython format (`In` and `Out`) during the doc build. This can lead to slower builds.

### What changes are included in this PR?

IPython directives are converted to runnable code-blocks (with `>>>` and `...`), and pytest doctest support for `.rst` files is added to the `conda-python-docs` CI job. This means the code in the Python User Guide is tested separately from the building of the documentation.

### Are these changes tested?

Yes, with the CI.

### Are there any user-facing changes?

Changes to the Python User Guide examples will have to be tested with `pytest --doctest-glob='*.rst' docs/source/python/file.rst`
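The converted examples follow standard doctest conventions, which `pytest --doctest-glob` collects from `.rst` files. A minimal sketch of how such a `>>>` example is evaluated, using only the stdlib `doctest` module (the sample snippet is illustrative, not taken from the guide):

```python
import doctest

# A doctest-style snippet like the ones now used in the guide's code-blocks:
# each ">>>" line is executed and its output compared against the text below it.
snippet = """
>>> nums = [1, 2, 3]
>>> sum(nums)
6
"""

parser = doctest.DocTestParser()
test = parser.get_doctest(snippet, {}, "example", None, 0)
runner = doctest.DocTestRunner()
runner.run(test)
print(runner.failures)  # number of failing examples; 0 here
```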

* GitHub Issue: #28859

Lead-authored-by: AlenkaF <[email protected]>
Co-authored-by: Alenka Frim <[email protected]>
Co-authored-by: tadeja <[email protected]>
Signed-off-by: AlenkaF <[email protected]>

* GH-49065: [C++] Remove unnecessary copies of shared_ptr in Type::BOOL and Type::NA at GrouperImpl (#49066)

### Rationale for this change

The grouper code was creating a `shared_ptr<DataType>` for every key type, even when it wasn't needed. This resulted in unnecessary reference counting operations. For example, `BooleanKeyEncoder` and `NullKeyEncoder` don't require a `shared_ptr` in their constructors, yet we were creating one for every key of those types.

### What changes are included in this PR?

Changed `GrouperImpl::Make()` to use `TypeHolder` references directly and only call `GetSharedPtr()` when needed by encoder constructors. This eliminates `shared_ptr` creation for `Type::BOOL` and `Type::NA` cases. Other encoder types (dictionary, fixed-width, binary) still require `shared_ptr` since their constructors take `shared_ptr<DataType>` parameters for ownership.

### Are these changes tested?

Yes, existing tests.

### Are there any user-facing changes?

No.
* GitHub Issue: #49065

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-48159 [C++][Gandiva] Projector make is significantly slower after move to OrcJIT (#49063)

### Rationale for this change
Reduces LLVM TargetMachine object creation from 3 to 1. This object is expensive to create and the extra copies weren't needed.

### What changes are included in this PR?
Refactor the Engine class to only create one target machine and pass that to the necessary functions.

Before the change (3 TargetMachines created):

First TargetMachine: In Engine::Make(), MakeTargetMachineBuilder() is called, then BuildJIT() is called. Inside LLJITBuilder::create(), when prepareForConstruction() runs, if no DataLayout was set, it calls JTMB->getDefaultDataLayoutForTarget() which creates a temporary TargetMachine just to get the DataLayout.

Second TargetMachine: Inside BuildJIT(), when setCompileFunctionCreator is used with the lambda, that lambda calls JTMB.createTargetMachine() to create a TargetMachine for the TMOwningSimpleCompiler.

Third TargetMachine: Back in Engine::Make(), after BuildJIT() returns, there's an explicit call to jtmb.createTargetMachine() to create target_machine_ for the Engine.

After the change (1 TargetMachine created):

The key changes are:

Create the TargetMachine first: the code now creates the TargetMachine explicitly at the start of Engine::Make() and passes it to BuildJIT(). In BuildJIT(), that machine's DataLayout is passed to LLJITBuilder, which prevents prepareForConstruction() from calling getDefaultDataLayoutForTarget() (which would create a temporary TargetMachine).

Use SimpleCompiler instead of TMOwningSimpleCompiler:
SimpleCompiler takes a reference to an existing TargetMachine rather than owning one, so no new TargetMachine is created.
A shared_ptr is used to ensure that TargetMachine stays around for the lifetime of the LLJIT instance.

### Are these changes tested?
Yes, unit and integration.

### Are there any user-facing changes?
No.

* GitHub Issue: #48159

Lead-authored-by: [email protected] <[email protected]>
Co-authored-by: Logan Riggs <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49043: [C++][FS][Azure] Avoid bugs caused by empty first page(s) followed by non-empty subsequent page(s) (#49049)

### Rationale for this change
Prevent bugs similar to https://github.com/apache/arrow/issues/49043

### What changes are included in this PR?
- Implement `SkipStartingEmptyPages` for various types of PagedResponses used in the `AzureFileSystem`.
- Apply `SkipStartingEmptyPages` on the response from every list operation that returns a paged response.
 
### Are these changes tested?
Ran the tests in the codebase including the ones that need to connect to real blob storage. This makes me fairly confident that I haven't introduced a regression.

The only reproduction I've found involves reading a production Azure blob storage account. With this I've tested that this PR solves https://github.com/apache/arrow/issues/49043, but I haven't been able to reproduce it in any checked-in tests. I tried copying a chunk of data from our production reproduction into azurite, but still can't reproduce it.

### Are there any user-facing changes?
Some low probability bugs will be gone. No interface changes. 
* GitHub Issue: #49043

Authored-by: Thomas Newton <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49034 [C++][Gandiva] Fix binary_string to not trigger error for null strings (#49035)

### Rationale for this change

The binary_string function will attempt to allocate 0 bytes of memory, which results in a null pointer being returned, and the function interprets that as an error.

### What changes are included in this PR?
Add kCanReturnErrors to the function definition to match other string functions. 
Move the check for 0 byte length input earlier in the binary_string function to prevent the 0 allocation.
Add a unit test.

### Are these changes tested?
Yes, unit and integration testing.

### Are there any user-facing changes?
No.

* GitHub Issue: #49034

Authored-by: Logan Riggs <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-48980: [C++] Use COMPILE_OPTIONS instead of deprecated COMPILE_FLAGS (#48981)

### Rationale for this change

Arrow requires CMake 3.25 but was still using the deprecated `COMPILE_FLAGS` property. It is recommended to use `COMPILE_OPTIONS` (introduced in CMake 3.11) instead.

### What changes are included in this PR?

Replaced `COMPILE_FLAGS` with `COMPILE_OPTIONS` across `CMakeLists.txt` files, converted space-separated strings to semicolon-separated lists, and removed obsolete TODO comments.

### Are these changes tested?

Yes, through CI build and existing tests.

### Are there any user-facing changes?

No.
* GitHub Issue: #48980

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49069: [C++] Share Trie instances across CSV value decoders (#49070)

### Rationale for this change

The CSV converter was building identical Trie data structures (for null/true/false values) in every decoder instance, causing duplicate memory allocation and initialization overhead.

### What changes are included in this PR?

- Introduced `TrieCache` struct to hold shared Trie instances (null_trie, true_trie, false_trie)
- Updated `ValueDecoder` and all decoder subclasses to accept and reference a shared `TrieCache` instead of building their own Tries
- Updated `Converter` base class to create one `TrieCache` per converter and pass it to all decoders

### Are these changes tested?

Yes, all existing tests. I ran a simple benchmark showing roughly 2-4% faster converter creation, and obviously less memory usage.

### Are there any user-facing changes?

No.
* GitHub Issue: #49069

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49076: [CI] Update vcpkg baseline to newer version (#49062)

### Rationale for this change

The current version of vcpkg used is from April 2025.

### What changes are included in this PR?

Update baseline to newer version.

### Are these changes tested?

Yes on CI. I've validated for example that xsimd 14 will be pulled.

### Are there any user-facing changes?
No

* GitHub Issue: #49076

Authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49074: [Ruby] Add support for writing interval arrays (#49075)

### Rationale for this change

There are year-month, day-time, and month-day-nano variants.

### What changes are included in this PR?

* Add `ArrowFormat::IntervalType#to_flatbuffers`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49074

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49071: [Ruby] Add support for writing list and large list arrays (#49072)

### Rationale for this change

They use different offset sizes.

### What changes are included in this PR?

* Add `ArrowFormat::ListType#to_flatbuffers`
* Add `ArrowFormat::LargeListType#to_flatbuffers`
* Add `ArrowFormat::VariableSizeListArray#child`
* Add `ArrowFormat::VariableSizeListArray#each_buffer`
* `garrow_array_get_null_bitmap()` returns `NULL` when null bitmap doesn't exist
* Add `garrow_list_array_get_value_offsets_buffer()`
* Add `garrow_large_list_array_get_value_offsets_buffer()`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49071

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49087 [CI][Packaging][Gandiva] Add support for LLVM 15 or earlier again (#49091)

### Rationale for this change

LLVM 15 or earlier uses `llvm::Optional`, not `std::optional`.

### What changes are included in this PR?

Use `llvm::Optional` with LLVM 15 or earlier.

### Are these changes tested?

Yes, compiling.

### Are there any user-facing changes?

No

* GitHub Issue: #49087

Authored-by: [email protected] <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49100: [Docs] Broken link to Swift page in implementations.rst (#49101)

### Rationale for this change

The Swift documentation link in the implementations.rst file was broken and returned a 404 error.

### What changes are included in this PR?

Updated the Swift documentation link in https://github.com/apache/arrow/blob/235841d644d5454f7067c44f580f301446ba1cc0/docs/source/implementations.rst?plain=1#L124 from the [broken GitHub README link](https://github.com/apache/arrow-swift/blob/main/Arrow/README.md) to the [Swift Package documentation](https://swiftpackageindex.com/apache/arrow-swift/main/documentation/arrow)

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

* GitHub Issue: #49100

Lead-authored-by: ChiLin Chiu <[email protected]>
Co-authored-by: Chilin <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49096: [Ruby] Add support for writing struct array (#49097)

### Rationale for this change

It's a nested array.

### What changes are included in this PR?

* Add `ArrowFormat::StructType#to_flatbuffers`
* Add `ArrowFormat::StructArray#each_buffer`
* Add `ArrowFormat::StructArray#children`
* Fix `ArrowFormat::Array#n_nulls`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49096

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49093: [Ruby] Add support for writing duration array (#49094)

### Rationale for this change

It has a unit parameter.

### What changes are included in this PR?

* Add `ArrowFormat::DurationType#to_flatbuffers`
* Add duration support to `#values` and `raw_records`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49093

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49098: [Packaging][deb] Add missing libarrow-cuda-glib-doc (#49099)

### Rationale for this change

Documents for libarrow-cuda-glib are generated but they aren't packaged.

### What changes are included in this PR?

Package documents for libarrow-cuda-glib.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49098

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-48764: [C++] Update xsimd (#48765)

### Rationale for this change
Homogenize the xsimd versions used.

### What changes are included in this PR?
Move to xsimd 14 to benefit from the latest improvements to the integer unpacking routines.

### Are these changes tested?
Yes, with current CI.
In fact, due to the absence of a pin, part of the CI already runs xsimd 14.

### Are there any user-facing changes?
No.

* GitHub Issue: #48764

Authored-by: AntoinePrv <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>

* GH-46008: [Python][Benchmarking] Remove unused asv benchmarking files (#49047)

### Rationale for this change

As discussed on the issue we don't seem to have run asv benchmarks on Python for the last years. It is probably broken.

### What changes are included in this PR?

Remove asv benchmarking related files and docs.

### Are these changes tested?

Not directly; validated via CI and by running preview-docs to check the docs.

### Are there any user-facing changes?

No
* GitHub Issue: #46008

Authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>

* GH-49108: [Python] SparseCOOTensor.__repr__ missing f-string prefix (#49109)

### Rationale for this change

`SparseCOOTensor.__repr__` outputs literal `{self.type}` and `{self.shape}` instead of actual values due to missing f-string prefix.

### What changes are included in this PR?

Add f prefix to the string in `SparseCOOTensor.__repr__`.

### Are these changes tested?

Yes, it works after adding the f-string prefix:
```python3
>>> import pyarrow as pa
>>> import numpy as np
>>> dense_tensor = np.array([[0, 1, 0], [2, 0, 3]], dtype=np.float32)
>>> sparse_coo = pa.SparseCOOTensor.from_dense_numpy(dense_tensor)
>>> sparse_coo
<pyarrow.SparseCOOTensor>
type: float
shape: (2, 3)
```

### Are there any user-facing changes?

a bug that caused incorrect or invalid data to be produced:

```python3
>>> import pyarrow as pa
>>> import numpy as np
>>> dense_tensor = np.array([[0, 1, 0], [2, 0, 3]], dtype=np.float32)
>>> sparse_coo = pa.SparseCOOTensor.from_dense_numpy(dense_tensor)
>>> sparse_coo
<pyarrow.SparseCOOTensor>
type: {self.type}
shape: {self.shape}
```

* GitHub Issue: #49108

Authored-by: Chilin <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>

* GH-49083: [CI][Python] Remove dask-contrib/dask-expr from the nightly dask test builds (#49126)

### Rationale for this change
Failing nightly job for dask (test-conda-python-3.11-dask-upstream_devel).

### What changes are included in this PR?
Removal of the dask-contrib/dask-expr package, as it has been included in the dask dataframe module since January 2025.

### Are these changes tested?
Yes, with the extended dask build.

### Are there any user-facing changes?
No.
* GitHub Issue: #49083

Authored-by: AlenkaF <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>

* GH-49117: [Ruby] Add support for writing union arrays (#49118)

### Rationale for this change

There are dense and sparse variants.

### What changes are included in this PR?

* Add `garrow_union_array_get_n_fields()`
* Add `ArrowFormat::UnionArray#children`
* Add `ArrowFormat::DenseUnionArray#each_buffer`
* Add `ArrowFormat::SparseUnionArray#each_buffer`
* Add `ArrowFormat::UnionType#to_flatbuffers`
* Add `Arrow::UnionArray#fields`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49117

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49119: [Ruby] Add support for writing map array (#49120)

### Rationale for this change

It's a list based array.

### What changes are included in this PR?

* Add `ArrowFormat::MapType#to_flatbuffers`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49119

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-48922: [C++] Support Status-returning callables in Result::Map (#49127)

### Rationale for this change
Currently, `Result::Map` fails to compile when the mapping function returns a `Status` because it tries to instantiate `Result<Status>`, which is prohibited. This change allows `Map` to return `Status` directly in such cases.

### What changes are included in this PR?
- Added an `EnsureResult` specialization to allow `Map` to return `Status` directly.
- Added unit tests to verify success/error propagation and return type resolution.

### Are these changes tested?
Yes.

### Are there any user-facing changes?
No
* GitHub Issue: #48922

Authored-by: Abhishek Bansal <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

* GH-49003: [C++] Don't consider `out_of_range` an error in float parsing (#49095)

### Rationale for this change
This PR restores the behavior prior to version 23 for floating-point parsing of overflow and subnormal values.

`fast_float` `3.10.1` didn't assign an error code in these cases and produced `±Inf` on overflow and `0.0` on subnormal inputs. With the update to version `8.1`, it started to assign `std::errc::result_out_of_range` in such cases.

### What changes are included in this PR?
Ignore `std::errc::result_out_of_range` and produce `±Inf` / `0.0` as appropriate instead of failing the conversion.

### Are these changes tested?
Yes. Created tests for overflow with positive and negative signed mantissa, and also created tests for subnormal, all of them for binary{16,32,64}.

### Are there any user-facing changes?
Yes. The CSV reader in `libarrow==23` was treating such values as strings, while previously it parsed them as `0` or `±inf`.

With this patch, the CSV reader in PyArrow outputs:

```python
>>> import pyarrow
>>> import pyarrow.csv
>>> import io
>>> table = pyarrow.csv.read_csv(io.BytesIO(f"data\n10E-617\n10E617\n-10E617".encode()))
>>> print(table)
pyarrow.Table
data: double
----
data: [[0,inf,-inf]]
```

Closes #49003 

* GitHub Issue: #49003

Authored-by: Alvaro-Kothe <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

* GH-48941: [C++] Generate proper UTF-8 strings in JSON test utilities (#48943)

### Rationale for this change

The JSON test utility `GenerateAscii` was only generating ASCII characters. We should have better test coverage for proper UTF-8 and Unicode handling.

### What changes are included in this PR?

Replaced ASCII-only generation with proper UTF-8 string generation that produces valid Unicode scalar values across all planes (BMP, SMP, SIP, planes 3-16), correctly encoded per RFC 3629.
Added that function as a utility.

### Are these changes tested?

There are existent tests for JSON.

### Are there any user-facing changes?

No, test-only.
* GitHub Issue: #48941

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

* GH-49067: [R] Disable GCS on macos (#49068)

### Rationale for this change
Builds that complete on CRAN

### What changes are included in this PR?
Disable GCS by default

### Are these changes tested?

### Are there any user-facing changes?
Hopefully not 


* GitHub Issue: #49067

---------

Co-authored-by: Nic Crane <[email protected]>

* GH-49115: [CI][Packaging][Python] Update vcpkg baseline for our wheels (#49116)

### Rationale for this change

Current wheels are failing to be built due to old version of vcpkg failing with our latest main.

### What changes are included in this PR?

- Update vcpkg version.
- Update patches
- Add `perl-Time-Piece` to some images as required to build newer OpenSSL.

### Are these changes tested?

Yes on CI

### Are there any user-facing changes?

No

* GitHub Issue: #49115

Authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-48954: [C++] Add test for null-type dictionary sorting and clarify XXX comment (#48955)

### Rationale for this change

Null-type dictionaries (e.g., `dictionary(int8(), null())`) are valid Arrow constructs supported from day one, but the sorting code had an uncertain `XXX Should this support Type::NA?` comment. We should explicitly support and test this because other functions already support this:

```python
import pyarrow as pa
import pyarrow.compute as pc

pc.array_sort_indices(pa.array([None, None, None, None], type=pa.int32()))
# [0, 1, 2, 3]
pc.array_sort_indices(pa.DictionaryArray.from_arrays(
    indices=pa.array([None, None, None, None], type=pa.int8()),
    dictionary=pa.array([], type=pa.null())
))
# [0, 1, 2, 3]
```

I believe it does not make sense to specifically disallow this in dictionaries at this point.

### What changes are included in this PR?

Added a unit test for null sorting behaviour.

### Are these changes tested?

Yes, a unit test was added.

### Are there any user-facing changes?

No.
* GitHub Issue: #48954

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

* GH-36193: [R] arm64 binaries for R  (#48574)

### Rationale for this change

Issues building on ARM

### What changes are included in this PR?

CI job and nixlibs update

### Are these changes tested?

On CI

### Are there any user-facing changes?

No

AI changes :robot:: Claude decided where to make the changes and helped debug failing builds, but I updated most of it (e.g. rstudio -> posit, choice of runners etc) 

* GitHub Issue: #36193

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Nic Crane <[email protected]>

* GH-48397: [R] Update docs on how to get our libarrow builds (#48995)

### Rationale for this change

Turning off GCS on CRAN to prevent excessive build times, need to tell people who wanna work with GCS how to do that.

### What changes are included in this PR?

Update docs.

### Are these changes tested?

Will preview docs build.

### Are there any user-facing changes?

Just docs.
* GitHub Issue: #48397

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Nic Crane <[email protected]>

* GH-49104: [C++] Fix Segfault in SparseCSFIndex::Equals with mismatched dimensions (#49105)

### Rationale for this change

The `SparseCSFIndex::Equals` method can crash when comparing two sparse indices that have a different number of dimensions. The method iterates over the `indices()` and `indptr()` vectors of the current object and accesses the corresponding elements in the `other` object without first verifying that both objects have matching vector sizes. This can lead to out-of-bounds access and a segmentation fault when the dimension counts differ.

### What changes are included in this PR?

This change adds explicit size equality checks for the `indices()` and `indptr()` vectors at the beginning of the `SparseCSFIndex::Equals` method. If the dimensions do not match, the method now safely returns `false` instead of attempting invalid memory access.
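The pattern can be illustrated with a small pure-Python sketch; the function name and arguments here are illustrative, not the actual C++ API:

```python
# Hypothetical stand-in for SparseCSFIndex::Equals: compare the dimension
# counts first, so element-wise comparison never reads out of bounds.
def csf_index_equal(a_indices, b_indices, a_indptr, b_indptr):
    # the added guard: mismatched sizes mean "not equal", not a crash
    if len(a_indices) != len(b_indices) or len(a_indptr) != len(b_indptr):
        return False
    return (all(x == y for x, y in zip(a_indices, b_indices))
            and all(x == y for x, y in zip(a_indptr, b_indptr)))

# mismatched dimension counts now safely compare unequal
print(csf_index_equal([[0, 1]], [[0, 1], [2]], [[0, 2]], [[0, 2]]))  # -> False
```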

### Are these changes tested?

Yes. The fix has been validated through targeted reproduction of the crash scenario using mismatched dimension counts, ensuring the method behaves safely and deterministically.

### Are there any user-facing changes?

No. This change improves internal safety and robustness without altering public APIs or observable user behavior.

* GitHub Issue: #49104

Lead-authored-by: Alirana2829 <[email protected]>
Co-authored-by: Ali Mahmood Rana <[email protected]>
Co-authored-by: Rok Mihevc <[email protected]>
Signed-off-by: Rok Mihevc <[email protected]>

* MINOR: [Docs] Add links to AI-generated code guidance (#49131)

### Rationale for this change

Add link to AI-generated code guidance - we should make sure the docs are updated before we merge this though

### What changes are included in this PR?

Add link to AI-generated code guidance

### Are these changes tested?

No

### Are there any user-facing changes?

No

Lead-authored-by: Nic Crane <[email protected]>
Co-authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Nic Crane <[email protected]>

* MINOR: [R] Add new vignette to pkgdown config (#49145)

### Rationale for this change

CI failing on preview-docs; see #49141

### What changes are included in this PR?

Add the vignette created in #49068 to pkgdown config

### Are these changes tested?

I'll trigger CI

### Are there any user-facing changes?

Nah

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Nic Crane <[email protected]>

* GH-49150: [Doc][CI][Python] Doctests failing on rst files due to pandas 3+ (#49088)

Fixes: #49150
See https://github.com/apache/arrow/pull/48619#issuecomment-3823269381

### Rationale for this change

Fix CI failures

### What changes are included in this PR?

Tests are made more general to allow for Pandas 2 and Pandas 3 style string types

### Are these changes tested?

By CI

### Are there any user-facing changes?

No
* GitHub Issue: #49150

Authored-by: Rok Mihevc <[email protected]>
Signed-off-by: Rok Mihevc <[email protected]>

* GH-41990: [C++] Fix AzureFileSystem compilation on Windows (#48971)

Let me preface this pull request that I have not worked in C++ in quite a while. Apologies if this is missing modern idioms or is an obtuse fix.

### Rationale for this change

I encountered an issue trying to compile the AzureFileSystem backend in C++ on Windows. Searching the issue tracker, it appears this is already a [known](https://github.com/apache/arrow/issues/41990) but unresolved problem. This is an attempt to either address the issue or move the conversation forward for someone more experienced.

### What changes are included in this PR?

AzureFileSystem uses `unique_ptr` while the other cloud file system implementations rely on `shared_ptr`. Since the Impl is forward-declared in the header but the destructor was defined inline (via `= default`), we get compilation errors with MSVC, which requires the complete type earlier than GCC/Clang do.

This change removes the defaulted definition from the header file and moves it into the .cc file, where the complete type is available.

Unrelated, I've also wrapped 2 exception variables in `ARROW_UNUSED`. These are warnings treated as errors by MSVC at compile time. This was revealed in CI after resolving the issue above.

### Are these changes tested?

I've enabled building and running the test suite in GHA in 8dd62d62a9af022813e9c8662956740340a9473f. I believe a large portion of those tests may be skipped though since Azurite isn't present from what I can see. I'm not tied to the GHA updates being included in the PR, it's currently here for demonstration purposes. I noticed the other FS implementations are also not built and tested on Windows.

One quirk of this PR is getting WIL in place to compile the Azure C++ SDK was not intuitive for me. I've placed a dummy `wilConfig.cmake` to get the Azure SDK to build, but I'd assume there's a better way to do this. I'm happy to refine the build setup if we choose to keep it.

### Are there any user-facing changes?

Nothing here should affect user-facing code beyond fixing the compilation issues. If there are concerns for things I'm missing, I'm happy to discuss those.

* GitHub Issue: #41990

Lead-authored-by: Nate Prewitt <[email protected]>
Co-authored-by: Nate Prewitt <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49138: [Packaging][Python] Remove nightly cython install from manylinux wheel dockerfile (#49139)

### Rationale for this change

We use nightlies version of Cython for free-threaded PyArrow wheels and they are currently failing, see https://github.com/apache/arrow/issues/49138

### What changes are included in this PR?

Nightly Cython install is removed and Cython is installed via [requirements file](https://github.com/apache/arrow/blob/main/python/requirements-wheel-build.txt#L2).

### Are these changes tested?
Yes.

### Are there any user-facing changes?
No.
* GitHub Issue: #49138

Authored-by: AlenkaF <[email protected]>
Signed-off-by: AlenkaF <[email protected]>

* GH-33459: [C++][Python] Support step >= 1 in list_slice kernel (#48769)

### Rationale for this change

Closes ARROW-18281, which has been open since 2022. The `list_slice` kernel currently rejects `start == stop`, but should return empty lists instead (following Python slicing semantics).

The implementation already handles this case correctly. When ARROW-18282 added step support, `bit_util::CeilDiv(stop - start, step)` naturally returns 0 for `start == stop`, producing empty lists. The only issue was the validation check (`start >= stop`) that prevented this from working.
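
The length arithmetic can be sketched in plain Python (`slice_length` is a hypothetical stand-in for the kernel's computation, not Arrow code):

```python
# Mirrors bit_util::CeilDiv(stop - start, step) for positive step:
# ceil division via negated floor division, clamped at zero.
def slice_length(start, stop, step):
    return max(0, -((start - stop) // step))

print(slice_length(0, 0, 1))  # -> 0, i.e. an empty list, like [1, 2, 3][0:0]
print(slice_length(0, 3, 2))  # -> 2, elements at positions 0 and 2
```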

### What changes are included in this PR?

- Changed validation from `start >= stop` to `start > stop` 
- Updated error message
- Added test cases

### Are these changes tested?

Yes, tests were added.

### Are there any user-facing changes?

Yes.

```python
import pyarrow.compute as pc
pc.list_slice([[1,2,3]], 0, 0)
```

Before:

```
pyarrow.lib.ArrowInvalid: `start`(0) should be greater than 0 and smaller than `stop`(0)
```

After:

```
<pyarrow.lib.ListArray object at 0x1a01b8b20>
[
  []
]
```
* GitHub Issue: #33459

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: AlenkaF <[email protected]>

* GH-41863: [Python][Parquet] Support lz4_raw as a compression name alias (#49135)

Closes https://github.com/apache/arrow/issues/41863

### Rationale for this change

Other tools in the parquet ecosystem distinguish between `LZ4` and `LZ4_RAW`, matching the specification: https://parquet.apache.org/docs/file-format/data-pages/compression/

`LZ4` (framing) is of course deprecated. PyArrow does not support it, and instead simplifies the user-facing API, using `LZ4` as an alias for the `LZ4_RAW` codec. 

However, PyArrow does not accept `LZ4_RAW` as a valid alias for the `LZ4_RAW` codec:

```
ArrowException: Unsupported compression: lz4_raw
```

This is a friction issue, and confusing for some users who are aware of the differences.

### What changes are included in this PR?

- Adding `LZ4_RAW` to the acceptable codec names list.
- Modifying the `LZ4->LZ4_RAW` mapping to also accept `LZ4_RAW->LZ4_RAW`.
- Adding a test

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes, an additive change to the accepted codec names.

* GitHub Issue: #41863

Authored-by: Nick Woolmer <[email protected]>
Signed-off-by: AlenkaF <[email protected]>

* GH-48868: [Doc] Document security model for the Arrow formats (#48870)

### Rationale for this change

Accessing Arrow data or any of the formats can have non-trivial security implications, this is an attempt at documenting those.

### What changes are included in this PR?

Add a Security Considerations page in the Format section.

**Doc preview:** https://s3.amazonaws.com/arrow-data/pr_docs/48870/format/Security.html

### Are these changes tested?

N/A

### Are there any user-facing changes?

No.
* GitHub Issue: #48868

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

* GH-49004: [C++][FlightRPC] Run ODBC tests in workflow using `cpp_test.sh` (#49005)

### Rationale for this change
#49004 

### What changes are included in this PR?
- Run tests using `cpp_test.sh` in the ODBC job of C++ Extra CI.
Note:  `find_package(Arrow)` check in `cpp_test.sh` is disabled due to blocker GH-49050

### Are these changes tested?
Yes, in CI
### Are there any user-facing changes?
N/A
* GitHub Issue: #49004

Lead-authored-by: Alina (Xi) Li <[email protected]>
Co-authored-by: Alina (Xi) Li <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49092: [C++][FlightRPC][CI] Nightly Packaging: Add `dev-yyyy-mm-dd` to ODBC MSI name (#49151)

### Rationale for this change
#49092

### What changes are included in this PR?

-  Add `dev-yyyy-mm-dd` to ODBC MSI name. This is a similar approach to R nightly.

Before: `Apache Arrow Flight SQL ODBC-1.0.0-win64.msi`. After: `Apache Arrow Flight SQL ODBC-1.0.0-dev-2026-02-04-win64.msi`.

### Are these changes tested?

Tested in CI. Successfully renamed file: https://github.com/apache/arrow/actions/runs/21686252848/job/62534629714?pr=49151#step:3:26

### Are there any user-facing changes?

Yes, the nightly ODBC file names will be changed as described above. 

* GitHub Issue: #49092

Authored-by: Alina (Xi) Li <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49156: [Python] Require GIL for string comparison (#49161)

### Rationale for this change

With Cython 3.3.0.a0 this failed. After some discussion it seems that this should have always had to require the GIL.

### What changes are included in this PR?

Moving statement out of the `with nogil` context manager.

### Are these changes tested?

Existing CI builds pyarrow.

### Are there any user-facing changes?

No
* GitHub Issue: #49156

Authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>

* GH-48575: [C++][FlightRPC] Standalone ODBC macOS CI (#48577)

### Rationale for this change
#48575

### What changes are included in this PR?
- Add new ODBC workflow for macOS Intel 15 and 14 arm64.
- Added ODBC build fixes to enable build on macOS CI.
### Are these changes tested?
Tested in CI and local macOS Intel and M1 environments.
### Are there any user-facing changes?
N/A

* GitHub Issue: #48575

Lead-authored-by: Alina (Xi) Li <[email protected]>
Co-authored-by: justing-bq <[email protected]>
Co-authored-by: Victor Tsang <[email protected]>
Co-authored-by: Alina (Xi) Li <[email protected]>
Co-authored-by: vic-tsang <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49164: [C++] Avoid invalid if() args in cmake when arrow is a subproject (#49165)

### Rationale for this change

Ref #49164: In subproject builds, `DefineOptions.cmake` sets `ARROW_DEFINE_OPTIONS_DEFAULT` to OFF, so `ARROW_SIMD_LEVEL` is never defined. The `if()` at `cpp/src/arrow/io/CMakeLists.txt:48` uses `${ARROW_SIMD_LEVEL}` and expands to empty, leading to invalid `if()` arguments.

### What changes are included in this PR?

Use the variable name directly (no `${}`).

### Are these changes tested?

Yes.

### Are there any user-facing changes?

None.
* GitHub Issue: #49164

Authored-by: Rossi Sun <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-48132: [Ruby] Add support for writing dictionary array (#49175)

### Rationale for this change

Delta dictionary message support is out of scope.

### What changes are included in this PR?

* Add `ArrowFormat::DictionaryArray#each_buffer`
* Add `ArrowFormat::DictionaryType#build_fb_type`
* Add support for dictionary message in `ArrowFormat::StreamingWriter`
* Add support for writing dictionary message blocks in footer in `ArrowFormat::FileWriter`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #48132

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* GH-49081: [C++][Parquet] Correct variant's extension name (#49082)

### Rationale for this change

Correct variant extension according to arrow's specification.

### What changes are included in this PR?

Modified variant's hardcoded extension name.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

* GitHub Issue: #49081

Authored-by: Zehua Zou <[email protected]>
Signed-off-by: Gang Wu <[email protected]>

* GH-49102: [CI] Add type checking infrastructure and CI workflow for type annotations (#48618)

### Rationale for this change

This is the first in series of PRs adding type annotations to pyarrow and resolving #32609.

### What changes are included in this PR?

This PR establishes infrastructure for type checking:

- Adds CI workflow for running mypy, pyright, and ty type checkers on linux, macos and windows
- Configures type checkers to validate stub files (excluding source files for now)
- Adds PEP 561 `py.typed` marker to enable type checking
- Updates wheel build scripts to include stub files in distributions
- Creates initial minimal stub directory structure
- Updates developer documentation with type checking workflow

### Are these changes tested?

No. This is mostly a CI change.

### Are there any user-facing changes?

This does not add any actual annotations (only the `py.typed` marker), so users should not be affected.
* GitHub Issue: #32609
* GitHub Issue: #49102

Lead-authored-by: Rok Mihevc <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Rok Mihevc <[email protected]>

* GH-49190: [C++][CI] Fix `unknown job 'odbc' error` in C++ Extra Workflow (#49192)

### Rationale for this change
See #49190

### What changes are included in this PR?

Fix `unknown job 'odbc' error` caused by typo

### Are these changes tested?

Tested in CI

### Are there any user-facing changes?

N/A

* GitHub Issue: #49190

Authored-by: Alina (Xi) Li <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

* MINOR: [CI] Bump docker/login-action from 3.6.0 to 3.7.0 (#49191)

Bumps [docker/login-action](https://github.com/docker/login-action) from 3.6.0 to 3.7.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/docker/login-action/releases">docker/login-action's releases</a>.</em></p>
<blockquote>
<h2>v3.7.0</h2>
<ul>
<li>Add <code>scope</code> input to set scopes for the authentication token by <a href="https://github.com/crazy-max"><code>@​crazy-max</code></a> in <a href="https://redirect.github.com/docker/login-action/pull/912">docker/login-action#912</a></li>
<li>Add support for AWS European Sovereign Cloud ECR by <a href="https://github.com/dphi"><code>@​dphi</code></a> in <a href="https://redirect.github.com/docker/login-action/pull/914">docker/login-action#914</a></li>
<li>Ensure passwords are redacted with <code>registry-auth</code> input by <a href="https://github.com/crazy-max"><code>@​crazy-max</code></a> in <a href="https://redirect…