Exporting SYMBOL columns to Parquet can be incompatible with strict Parquet readers #6692

@nwoolmer

Description

To reproduce

We export columns that lenient readers accept, but (IIRC) the dictionary encoding/RLE is missing some padding, and strict Parquet readers reject the file.

jorgecarleitao/parquet2#227
apache/arrow#47981

There are a couple of errors you can get (paraphrased):

  • more than one dictionary
  • invalid indices: 0

Reading works with PyArrow 21 and 23; 22 is the specific version that reproduces the failure. The file also breaks Trino.

Repro:

#!/bin/bash
set -e

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
VENV_DIR="$SCRIPT_DIR/pyarrow_bug_venv"
PARQUET_FILE="$SCRIPT_DIR/symbol_test_export.parquet"

# Ensure venv exists with pyarrow 22
if [ ! -d "$VENV_DIR" ]; then
    echo "Creating virtual environment..."
    python3 -m venv "$VENV_DIR"
    source "$VENV_DIR/bin/activate"
    pip install --quiet pyarrow==22.0.0
else
    source "$VENV_DIR/bin/activate"
fi

echo "=== Step 1: Create table with SYMBOL columns in QuestDB ==="
curl -s -G "http://localhost:9000/exec" --data-urlencode "query=DROP TABLE IF EXISTS symbol_bug_test" > /dev/null

curl -s -G "http://localhost:9000/exec" --data-urlencode "query=
CREATE TABLE symbol_bug_test (
    exchange SYMBOL,
    ticker SYMBOL,
    side SYMBOL,
    price DOUBLE,
    quantity DOUBLE,
    ts TIMESTAMP
) TIMESTAMP(ts) PARTITION BY DAY
" > /dev/null

echo "Table created."

echo "=== Step 2: Insert 10M rows with symbols ==="
# Use QuestDB's random functions to generate data efficiently
curl -s -G "http://localhost:9000/exec" --data-urlencode "query=
INSERT INTO symbol_bug_test
SELECT
    rnd_symbol('NYSE','NASDAQ','LSE','TSE','HKEX') as exchange,
    rnd_symbol('AAPL','GOOGL','MSFT','AMZN','META','NVDA','TSLA','JPM','BAC','WMT') as ticker,
    rnd_symbol('BUY','SELL') as side,
    rnd_double() * 1000 as price,
    rnd_double() * 10000 as quantity,
    timestamp_sequence('2024-01-01', 1000000) as ts
FROM long_sequence(10000000)
"

echo "Data inserted."

echo "=== Step 3: Export to Parquet ==="
rm -f "$PARQUET_FILE"
curl -s -G "http://localhost:9000/exp?fmt=parquet" \
    --data-urlencode "query=SELECT * FROM symbol_bug_test" \
    -o "$PARQUET_FILE"

echo "Exported to $PARQUET_FILE ($(du -h "$PARQUET_FILE" | cut -f1))"

echo "=== Step 4: Test reading with PyArrow 22 ==="
export PARQUET_FILE
python3 << 'PYTHON_SCRIPT'
import os
import pyarrow as pa
import pyarrow.parquet as pq

# Resolve the export path from the environment so this works from any cwd
parquet_file = os.environ["PARQUET_FILE"]
print(f"PyArrow version: {pa.__version__}")
print(f"\nReading: {parquet_file}")

# Inspect schema
print("\n--- Schema ---")
schema = pq.read_schema(parquet_file)
print(schema)

print("\n--- Attempting to read full table ---")
try:
    table = pq.read_table(parquet_file)
    print(f"SUCCESS: Read {table.num_rows} rows, {table.num_columns} columns")
except Exception as e:
    print(f"FAILED: {type(e).__name__}: {e}")

print("\n--- Testing individual columns ---")
for field in schema:
    try:
        table = pq.read_table(parquet_file, columns=[field.name])
        print(f"  ✓ {field.name} ({field.type})")
    except Exception as e:
        print(f"  ✗ {field.name} ({field.type}): {type(e).__name__}: {e}")
PYTHON_SCRIPT

echo ""
echo "=== Cleanup ==="
curl -s -G "http://localhost:9000/exec" --data-urlencode "query=DROP TABLE IF EXISTS symbol_bug_test" > /dev/null
echo "Dropped test table."

QuestDB version:

Master

OS, in case of Docker specify Docker and the Host OS:

N/A

File System, in case of Docker specify Host File System:

N/A

Full Name:

Nick Woolmer

Affiliation:

QuestDB

Have you followed Linux, macOS kernel configuration steps to increase Maximum open files and Maximum virtual memory areas limit?

  • Yes, I have

Additional context

No response

Metadata

Labels

  • Compatibility: Compatibility with third-party tools and services
  • Core: Related to storage, data type, etc.
