Closed
Labels
Compatibility (compatibility with third-party tools and services), Core (related to storage, data type, etc.)
Description
To reproduce
We export dictionary-encoded columns that (iirc) are missing some padding in the dictionary encoding/RLE. Strict Parquet readers do not accept this.
jorgecarleitao/parquet2#227
apache/arrow#47981
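For context on the padding involved: in the Parquet RLE/bit-packed hybrid encoding, an RLE run is a ULEB128 varint header (`count << 1`, LSB clear) followed by the repeated value zero-padded to `ceil(bit_width / 8)` bytes. A minimal sketch of just that run layout, for illustration only (this is not QuestDB's writer):

```python
def encode_rle_run(value: int, count: int, bit_width: int) -> bytes:
    """Encode one RLE run per the Parquet RLE/bit-packed hybrid spec."""
    # Header: run length shifted left by one; LSB 0 marks an RLE run
    # (LSB 1 would mark a bit-packed run).
    header = count << 1
    out = bytearray()
    # ULEB128 varint encoding of the header.
    while True:
        byte = header & 0x7F
        header >>= 7
        if header:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            break
    # The repeated value is stored little-endian, zero-padded up to a
    # whole number of bytes: ceil(bit_width / 8). Dropping this padding
    # is the kind of thing a lenient reader tolerates and a strict one
    # rejects.
    width_bytes = (bit_width + 7) // 8
    out += value.to_bytes(width_bytes, "little")
    return bytes(out)

# Example: ten repeats of value 3 at bit width 3
print(encode_rle_run(3, 10, 3))  # → b'\x14\x03'
```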
There are a couple of errors you can get (paraphrased):
- more than one dictionary
- invalid indices: 0
Reading works in PyArrow 21 and 23; 22 is the specific version that reproduces the failure. The exported files also break on Trino.
Repro:
```bash
#!/bin/bash
set -e

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
VENV_DIR="$SCRIPT_DIR/pyarrow_bug_venv"
PARQUET_FILE="$SCRIPT_DIR/symbol_test_export.parquet"

# Ensure venv exists with pyarrow 22
if [ ! -d "$VENV_DIR" ]; then
    echo "Creating virtual environment..."
    python3 -m venv "$VENV_DIR"
    source "$VENV_DIR/bin/activate"
    pip install --quiet pyarrow==22.0.0
else
    source "$VENV_DIR/bin/activate"
fi

echo "=== Step 1: Create table with SYMBOL columns in QuestDB ==="
curl -s -G "http://localhost:9000/exec" --data-urlencode "query=DROP TABLE IF EXISTS symbol_bug_test" > /dev/null
curl -s -G "http://localhost:9000/exec" --data-urlencode "query=
CREATE TABLE symbol_bug_test (
    exchange SYMBOL,
    ticker SYMBOL,
    side SYMBOL,
    price DOUBLE,
    quantity DOUBLE,
    ts TIMESTAMP
) TIMESTAMP(ts) PARTITION BY DAY
" > /dev/null
echo "Table created."

echo "=== Step 2: Insert 10M rows with symbols ==="
# Use QuestDB's random functions to generate data efficiently
curl -s -G "http://localhost:9000/exec" --data-urlencode "query=
INSERT INTO symbol_bug_test
SELECT
    rnd_symbol('NYSE','NASDAQ','LSE','TSE','HKEX') as exchange,
    rnd_symbol('AAPL','GOOGL','MSFT','AMZN','META','NVDA','TSLA','JPM','BAC','WMT') as ticker,
    rnd_symbol('BUY','SELL') as side,
    rnd_double() * 1000 as price,
    rnd_double() * 10000 as quantity,
    timestamp_sequence('2024-01-01', 1000000) as ts
FROM long_sequence(10000000)
"
echo "Data inserted."

echo "=== Step 3: Export to Parquet ==="
rm -f "$PARQUET_FILE"
curl -s -G "http://localhost:9000/exp?fmt=parquet" \
    --data-urlencode "query=SELECT * FROM symbol_bug_test" \
    -o "$PARQUET_FILE"
echo "Exported to $PARQUET_FILE ($(du -h "$PARQUET_FILE" | cut -f1))"

echo "=== Step 4: Test reading with PyArrow 22 ==="
python3 << 'PYTHON_SCRIPT'
import pyarrow as pa
import pyarrow.parquet as pq

parquet_file = "symbol_test_export.parquet"
print(f"PyArrow version: {pa.__version__}")
print(f"\nReading: {parquet_file}")

# Inspect schema
print("\n--- Schema ---")
schema = pq.read_schema(parquet_file)
print(schema)

print("\n--- Attempting to read full table ---")
try:
    table = pq.read_table(parquet_file)
    print(f"SUCCESS: Read {table.num_rows} rows, {table.num_columns} columns")
except Exception as e:
    print(f"FAILED: {type(e).__name__}: {e}")

print("\n--- Testing individual columns ---")
for field in schema:
    try:
        table = pq.read_table(parquet_file, columns=[field.name])
        print(f"  ✓ {field.name} ({field.type})")
    except Exception as e:
        print(f"  ✗ {field.name} ({field.type}): {type(e).__name__}: {e}")
PYTHON_SCRIPT

echo ""
echo "=== Cleanup ==="
curl -s -G "http://localhost:9000/exec" --data-urlencode "query=DROP TABLE IF EXISTS symbol_bug_test" > /dev/null
echo "Dropped test table."
```

QuestDB version:
Master
OS, in case of Docker specify Docker and the Host OS:
N/A
File System, in case of Docker specify Host File System:
N/A
Full Name:
Nick Woolmer
Affiliation:
QuestDB
Have you followed the Linux/macOS kernel configuration steps to increase the maximum open files and maximum virtual memory areas limits?
- Yes, I have
Additional context
No response