
Improve Performance of JSON Reader #3441

@marklit

Description

Versions:

  • json2parquet 0.6.0 with the following Cargo packages:
    • parquet = "29.0.0" (this is in the main branch but the file metadata states 23.0.0 for some reason)
    • arrow = "29.0.0"
    • arrow-schema = { version = "29.0.0", features = ["serde"] }
  • PyArrow 10.0.1
  • ClickHouse 22.13.1.1119

I downloaded the California dataset from https://github.com/microsoft/USBuildingFootprints and converted it from JSONL into Parquet with both json2parquet and ClickHouse. json2parquet was 1.5x slower than ClickHouse at converting the records into Snappy-compressed Parquet.

I converted the original GeoJSON into JSONL with three elements per record. The resulting JSONL file is 3 GB uncompressed and has 11,542,912 lines.

$ ogr2ogr -f GeoJSONSeq /vsistdout/ California.geojson \
    | jq -c '.properties * {geom: .geometry|tostring}' \
    > California.jsonl
$ head -n1 California.jsonl | jq .
{
  "release": 1,
  "capture_dates_range": "",
  "geom": "{\"type\":\"Polygon\",\"coordinates\":[[[-114.127454,34.265674],[-114.127476,34.265839],[-114.127588,34.265829],[-114.127565,34.265663],[-114.127454,34.265674]]]}"
}

I then converted that file into Snappy-compressed Parquet with ClickHouse which took 32 seconds and produced a file 793 MB in size.

$ cat California.jsonl \
    | clickhouse local \
          --input-format JSONEachRow \
          -q "SELECT *
              FROM table
              FORMAT Parquet" \
    > cali.snappy.pq

The following was compiled with rustc 1.66.0 (69f9c33d7 2022-12-12).

$ git clone https://github.com/domoritz/json2parquet/
$ cd json2parquet
$ RUSTFLAGS='-Ctarget-cpu=native' cargo build --release
$ /usr/bin/time -al \
        target/release/json2parquet \
        -c snappy \
        California.jsonl \
        California.snappy.pq

The above took 43.8 seconds to convert the JSONL into Parquet, producing an 815 MB file with 12 row groups.

In [1]: import pyarrow.parquet as pq

In [2]: pf = pq.ParquetFile('California.snappy.pq')

In [3]: pf.schema
Out[3]: 
<pyarrow._parquet.ParquetSchema object at 0x109a11380>
required group field_id=-1 arrow_schema {
  optional binary field_id=-1 capture_dates_range (String);
  optional binary field_id=-1 geom (String);
  optional int64 field_id=-1 release;
}

In [4]: pf.metadata
Out[4]: 
<pyarrow._parquet.FileMetaData object at 0x10adf09f0>
  created_by: parquet-rs version 29.0.0
  num_columns: 3
  num_rows: 11542912
  num_row_groups: 12
  format_version: 1.0
  serialized_size: 7969

The ClickHouse-produced PQ file has 306 row groups.

In [1]: pf = pq.ParquetFile('cali.snappy.pq')

In [2]: pf.schema
Out[2]: 
<pyarrow._parquet.ParquetSchema object at 0x105ccc940>
required group field_id=-1 schema {
  optional int64 field_id=-1 release;
  optional binary field_id=-1 capture_dates_range;
  optional binary field_id=-1 geom;
}

In [3]: pf.metadata
Out[3]: 
<pyarrow._parquet.FileMetaData object at 0x1076705e0>
  created_by: parquet-cpp version 1.5.1-SNAPSHOT
  num_columns: 3
  num_rows: 11542912
  num_row_groups: 306
  format_version: 1.0
  serialized_size: 228389

I'm not sure whether the row-group sizes played into the performance delta.

Is there anything I can do to my compilation settings to speed up Parquet generation?
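So far I've only tried `-Ctarget-cpu=native`; the other knobs I'm aware of are link-time optimization and codegen units. A hypothetical (untested on this workload) `[profile.release]` fragment for the json2parquet checkout's Cargo.toml:

```toml
# Hypothetical release-profile tweaks (not benchmarked on this workload):
[profile.release]
lto = "fat"         # whole-program LTO; slower builds, sometimes faster binaries
codegen-units = 1   # hand LLVM the whole crate at once for better optimization
```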

I checked with the author of json2parquet and he's certain there aren't any issues within his code: domoritz/json2parquet#116

Metadata

Labels

arrow (Changes to the arrow crate), enhancement (Any new improvement worthy of an entry in the changelog)
