Exporting SYMBOL columns to Parquet can be incompatible with strict Parquet readers #6692

@nwoolmer

Description

To reproduce

We export columns that lenient readers accept, but (IIRC) the dictionary encoding/RLE is missing some padding, and strict Parquet readers reject the file.

jorgecarleitao/parquet2#227
apache/arrow#47981

There are a couple of errors you can get (paraphrased):

  • more than one dictionary
  • invalid indices: 0

Reading works with PyArrow 21 and 23; 22 is the specific version that reproduces the failure. The file also breaks Trino.

Repro:

#!/bin/bash
set -e

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
VENV_DIR="$SCRIPT_DIR/pyarrow_bug_venv"
PARQUET_FILE="$SCRIPT_DIR/symbol_test_export.parquet"

# Ensure venv exists with pyarrow 22
if [ ! -d "$VENV_DIR" ]; then
    echo "Creating virtual environment..."
    python3 -m venv "$VENV_DIR"
    source "$VENV_DIR/bin/activate"
    pip install --quiet pyarrow==22.0.0
else
    source "$VENV_DIR/bin/activate"
fi

echo "=== Step 1: Create table with SYMBOL columns in QuestDB ==="
curl -s -G "http://localhost:9000/exec" --data-urlencode "query=DROP TABLE IF EXISTS symbol_bug_test" > /dev/null

curl -s -G "http://localhost:9000/exec" --data-urlencode "query=
CREATE TABLE symbol_bug_test (
    exchange SYMBOL,
    ticker SYMBOL,
    side SYMBOL,
    price DOUBLE,
    quantity DOUBLE,
    ts TIMESTAMP
) TIMESTAMP(ts) PARTITION BY DAY
" > /dev/null

echo "Table created."

echo "=== Step 2: Insert 10M rows with symbols ==="
# Use QuestDB's random functions to generate data efficiently
curl -s -G "http://localhost:9000/exec" --data-urlencode "query=
INSERT INTO symbol_bug_test
SELECT
    rnd_symbol('NYSE','NASDAQ','LSE','TSE','HKEX') as exchange,
    rnd_symbol('AAPL','GOOGL','MSFT','AMZN','META','NVDA','TSLA','JPM','BAC','WMT') as ticker,
    rnd_symbol('BUY','SELL') as side,
    rnd_double() * 1000 as price,
    rnd_double() * 10000 as quantity,
    timestamp_sequence('2024-01-01', 1000000) as ts
FROM long_sequence(10000000)
"

echo "Data inserted."

echo "=== Step 3: Export to Parquet ==="
rm -f "$PARQUET_FILE"
curl -s -G "http://localhost:9000/exp?fmt=parquet" \
    --data-urlencode "query=SELECT * FROM symbol_bug_test" \
    -o "$PARQUET_FILE"

echo "Exported to $PARQUET_FILE ($(du -h "$PARQUET_FILE" | cut -f1))"

echo "=== Step 4: Test reading with PyArrow 22 ==="
export PARQUET_FILE
python3 << 'PYTHON_SCRIPT'
import os
import pyarrow as pa
import pyarrow.parquet as pq

# Resolve the export path from the environment so this works from any cwd
parquet_file = os.environ["PARQUET_FILE"]
print(f"PyArrow version: {pa.__version__}")
print(f"\nReading: {parquet_file}")

# Inspect schema
print("\n--- Schema ---")
schema = pq.read_schema(parquet_file)
print(schema)

print("\n--- Attempting to read full table ---")
try:
    table = pq.read_table(parquet_file)
    print(f"SUCCESS: Read {table.num_rows} rows, {table.num_columns} columns")
except Exception as e:
    print(f"FAILED: {type(e).__name__}: {e}")

print("\n--- Testing individual columns ---")
for field in schema:
    try:
        table = pq.read_table(parquet_file, columns=[field.name])
        print(f"  ✓ {field.name} ({field.type})")
    except Exception as e:
        print(f"  ✗ {field.name} ({field.type}): {type(e).__name__}: {e}")
PYTHON_SCRIPT

echo ""
echo "=== Cleanup ==="
curl -s -G "http://localhost:9000/exec" --data-urlencode "query=DROP TABLE IF EXISTS symbol_bug_test" > /dev/null
echo "Dropped test table."

QuestDB version:

Master

OS, in case of Docker specify Docker and the Host OS:

N/A

File System, in case of Docker specify Host File System:

N/A

Full Name:

Nick Woolmer

Affiliation:

QuestDB

Have you followed Linux, macOS kernel configuration steps to increase Maximum open files and Maximum virtual memory areas limit?

  • Yes, I have

Additional context

No response

Metadata

Labels

  • Compatibility: Compatibility with third-party tools and services
  • Core: Related to storage, data type, etc.
