Description
Versions:
- json2parquet 0.6.0 with the following Cargo packages:
- parquet = "29.0.0" (this is in the main branch but the file metadata states 23.0.0 for some reason)
- arrow = "29.0.0"
- arrow-schema = { version = "29.0.0", features = ["serde"] }
- PyArrow 10.0.1
- ClickHouse 22.13.1.1119
I downloaded the California dataset from https://github.com/microsoft/USBuildingFootprints and converted it from JSONL into Parquet with both json2parquet and ClickHouse. json2parquet came out roughly 1.4x slower than ClickHouse (43.8 s vs. 32 s, details below) at producing Snappy-compressed Parquet.
I converted the original GeoJSON into JSONL with three fields per record. The resulting JSONL file is 3 GB uncompressed and has 11,542,912 lines.
$ ogr2ogr -f GeoJSONSeq /vsistdout/ California.geojson \
| jq -c '.properties * {geom: .geometry|tostring}' \
> California.jsonl
$ head -n1 California.jsonl | jq .
{
"release": 1,
"capture_dates_range": "",
"geom": "{\"type\":\"Polygon\",\"coordinates\":[[[-114.127454,34.265674],[-114.127476,34.265839],[-114.127588,34.265829],[-114.127565,34.265663],[-114.127454,34.265674]]]}"
}
I then converted that file into Snappy-compressed Parquet with ClickHouse, which took 32 seconds and produced a 793 MB file.
$ cat California.jsonl \
| clickhouse local \
--input-format JSONEachRow \
-q "SELECT *
FROM table
FORMAT Parquet" \
> cali.snappy.pq
The following was compiled with rustc 1.66.0 (69f9c33d7 2022-12-12).
$ git clone https://github.com/domoritz/json2parquet/
$ cd json2parquet
$ RUSTFLAGS='-Ctarget-cpu=native' cargo build --release
$ /usr/bin/time -al \
target/release/json2parquet \
-c snappy \
California.jsonl \
California.snappy.pq
The above took 43.8 seconds to convert the JSONL into PQ, producing an 815 MB file with 12 row groups.
In [1]: import pyarrow.parquet as pq
In [2]: pf = pq.ParquetFile('California.snappy.pq')
In [3]: pf.schema
Out[3]:
<pyarrow._parquet.ParquetSchema object at 0x109a11380>
required group field_id=-1 arrow_schema {
optional binary field_id=-1 capture_dates_range (String);
optional binary field_id=-1 geom (String);
optional int64 field_id=-1 release;
}
In [4]: pf.metadata
Out[4]:
<pyarrow._parquet.FileMetaData object at 0x10adf09f0>
created_by: parquet-rs version 29.0.0
num_columns: 3
num_rows: 11542912
num_row_groups: 12
format_version: 1.0
serialized_size: 7969
The ClickHouse-produced PQ file has 306 row groups.
In [1]: pf = pq.ParquetFile('cali.snappy.pq')
In [2]: pf.schema
Out[2]:
<pyarrow._parquet.ParquetSchema object at 0x105ccc940>
required group field_id=-1 schema {
optional int64 field_id=-1 release;
optional binary field_id=-1 capture_dates_range;
optional binary field_id=-1 geom;
}
In [3]: pf.metadata
Out[3]:
<pyarrow._parquet.FileMetaData object at 0x1076705e0>
created_by: parquet-cpp version 1.5.1-SNAPSHOT
num_columns: 3
num_rows: 11542912
num_row_groups: 306
format_version: 1.0
serialized_size: 228389
I'm not sure if the row group sizes played into the performance delta.
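In case it's relevant, the row group layouts of the two files are easy to compare with the same PyArrow API as above (a quick sketch; I haven't dug into whether this explains anything):
import pyarrow.parquet as pq

# Compare how each writer split the ~11.5M rows into row groups.
for path in ('California.snappy.pq', 'cali.snappy.pq'):
    md = pq.ParquetFile(path).metadata
    print(path, md.num_row_groups, 'row groups')
    for i in range(md.num_row_groups):
        rg = md.row_group(i)
        print(f'  group {i}: {rg.num_rows} rows, {rg.total_byte_size} bytes')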
Is there anything I can do to my compilation settings to speed up Parquet generation?
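The only other compile-time knobs I'm aware of beyond -Ctarget-cpu=native are Cargo's release-profile settings (LTO and codegen units), e.g. something like the following, though I haven't measured whether it makes any difference here:
$ RUSTFLAGS='-Ctarget-cpu=native' \
  CARGO_PROFILE_RELEASE_LTO=true \
  CARGO_PROFILE_RELEASE_CODEGEN_UNITS=1 \
  cargo build --release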
I checked with the author of json2parquet and he's confident there aren't any issues in his code: domoritz/json2parquet#116