Python: pyarrow.lib.ArrowInvalid: Expected last offset >= 0 #3611

@mwinters0

What happened?

When I select a specific 30-second time range from my database, cursor.fetch_arrow_table() raises the pyarrow error shown in the stack trace. However, if I bisect this time range and query the first 15 seconds and the last 15 seconds separately, each query works.
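
The manual bisection can be automated as a stopgap while the cause is unknown. This is a minimal sketch, not a fix for the underlying error: `fetch_chunk` is a hypothetical callable standing in for whatever executes the query and calls `cursor.fetch_arrow_table()`, and `min_span` and `retry_on` are assumed parameters introduced only for illustration.

```python
from datetime import datetime, timedelta
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


def fetch_with_bisection(
    fetch_chunk: Callable[[datetime, datetime], T],
    start: datetime,
    end: datetime,
    min_span: timedelta = timedelta(seconds=1),
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> List[T]:
    """Fetch [start, end); on failure, split the range in half and retry each half.

    fetch_chunk is a hypothetical caller-supplied function, for example one that
    runs the query for the given range and returns cursor.fetch_arrow_table().
    """
    try:
        return [fetch_chunk(start, end)]
    except retry_on:
        # Stop splitting once the range is at or below the minimum span.
        if end - start <= min_span:
            raise
        mid = start + (end - start) / 2
        left = fetch_with_bisection(fetch_chunk, start, mid, min_span, retry_on)
        right = fetch_with_bisection(fetch_chunk, mid, end, min_span, retry_on)
        return left + right
```

Passing `retry_on=(pyarrow.lib.ArrowInvalid,)` would limit the retry to the error reported here, and the resulting list of tables could be combined with `pyarrow.concat_tables`.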

Additionally, if I export the full 30-second range to CSV with psql, I can import it with pyarrow.csv.read_csv() without issue.

What I've tried:

  • Disabling USE_COPY had no effect.
  • I can query this problematic range with psycopg and psql without issue, but not with ADBC.
  • If I export this time range and then re-import it into a new table containing only the exported data, ADBC can query it without issue.
  • I've examined the time range export in duckdb and can't find anything unusual. The rows have many near-duplicates because they are the result of several LEFT JOINs, which is intended.
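
Row-by-row inspection does not show how large each column is in aggregate, which may matter when the whole range is fetched as one table. The following is a rough sketch that totals the UTF-8 size of every column in the exported CSV using only the standard library. It assumes the export has a header row; the function name `column_byte_totals` is made up for illustration.

```python
import csv
from collections import Counter


def column_byte_totals(path: str) -> Counter:
    """Sum the UTF-8 encoded size of every value, per column, in a CSV with a header row."""
    totals: Counter = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            for name, value in row.items():
                if name is None:
                    # Extra fields beyond the header are skipped in this sketch.
                    continue
                totals[name] += len((value or "").encode("utf-8"))
    return totals
```

Calling `column_byte_totals("bad.csv").most_common(5)` after decompressing the attachment would list the five largest columns by total bytes.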

Zstd was able to compress the 3.0 GB psql exports of this time range down to 1.3 MB (!!), so I've attached them here. I had to add a gzip layer because GitHub doesn't accept .zst files.

gunzip bad.csv.zst.gz && zstd -d bad.csv.zst

bad.csv.zst.gz

bad.text.zst.gz

bad.bin.zst.gz

Stack Trace

  [...]
  File "/mnt/ssd/fedora/nomaste/nomaste/workflow/db.py", line 272, in _generate_normalized_time_chunks
    raw_chunk_table = fetch_raw_time_chunk(
        params.conn, chunk_start_date, chunk_end_date, params.topic
    )
  File "/mnt/ssd/fedora/nomaste/nomaste/workflow/db.py", line 248, in fetch_raw_time_chunk
    t = cur.fetch_arrow_table()
  File "/mnt/ssd/fedora/nomaste/.venv/lib/python3.13/site-packages/adbc_driver_manager/dbapi.py", line 1179, in fetch_arrow_table
    return self._results.fetch_arrow_table()
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/mnt/ssd/fedora/nomaste/.venv/lib/python3.13/site-packages/adbc_driver_manager/dbapi.py", line 1346, in fetch_arrow_table
    return _blocking_call(self.reader.read_all, (), {}, self._stmt.cancel)
  File "adbc_driver_manager/_lib.pyx", line 1749, in adbc_driver_manager._lib._blocking_call_impl
  File "adbc_driver_manager/_lib.pyx", line 1742, in adbc_driver_manager._lib._blocking_call_impl
  File "adbc_driver_manager/_reader.pyx", line 91, in adbc_driver_manager._reader.AdbcRecordBatchReader.read_all
  File "adbc_driver_manager/_reader.pyx", line 43, in adbc_driver_manager._reader._AdbcErrorHelper.check_error
  File "adbc_driver_manager/_reader.pyx", line 89, in adbc_driver_manager._reader.AdbcRecordBatchReader.read_all
  File "pyarrow/ipc.pxi", line 794, in pyarrow.lib.RecordBatchReader.read_all
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Expected last offset >= 0 but found -1976420846
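
One possible reading of the negative value, offered as an assumption rather than a confirmed diagnosis: Arrow's default string and binary arrays store offsets as signed 32-bit integers, so an offset above 2^31 - 1 bytes would wrap to a negative number. The sketch below reverses that wraparound for the value in the trace. The helper name `unwrap_int32` is made up for illustration.

```python
INT32_MAX = 2**31 - 1  # largest offset a signed 32-bit integer can represent


def unwrap_int32(wrapped: int) -> int:
    """Return the smallest non-negative value that wraps to `wrapped` as a signed int32."""
    return wrapped % 2**32


reported = -1976420846  # value from the ArrowInvalid message above
actual = unwrap_int32(reported)

print(actual)                    # 2318546450
print(actual > INT32_MAX)        # True
print(round(actual / 2**30, 2))  # 2.16 (GiB)
```

If this reading is right, it would be consistent with the two 15-second halves succeeding, since each half could stay below the limit on its own. Nothing in this report confirms it yet.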

How can we reproduce the bug?

I'm not sure. I'm open to any ideas for how to troubleshoot this / make it reproducible. I've already run a full VACUUM ANALYZE. I'm not a conda / miniforge user (I'm not even a Python dev by trade!), otherwise I would have already tried a local build. I may try it yet if I can get my head around those ecosystems.

Environment/Setup

  • Python 3.13.7 on x86_64. I've tried both the uv managed version and the system versions from Fedora and Arch.
  • Postgres is running docker tag timescale/timescaledb-ha:pg13.22-ts2.15.3-oss from this Dockerfile

% uv tree
Resolved 14 packages in 0.50ms
bus2parq v0.1.0
├── adbc-driver-postgresql v1.8.0
│   ├── adbc-driver-manager v1.8.0
│   │   └── typing-extensions v4.15.0
│   └── importlib-resources v6.5.2
├── backports-zstd v0.5.0
├── click v8.3.0
├── pyarrow v21.0.0
└── pytest v8.4.2 (group: dev)
    ├── iniconfig v2.1.0
    ├── packaging v25.0
    ├── pluggy v1.6.0
    └── pygments v2.19.2

Labels: Type: bug (Something isn't working)