
Blocked GZIP reading is slow #901

@mrocklin


GZip files are not efficiently seekable: the compressed stream has no index, so seeking forward means decompressing and discarding everything before the target offset.

In [1]: import gzip

In [2]: f = gzip.open('trip_data_1_03.csv.gz')

In [3]: %time f.seek(0)
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 26.2 µs
Out[3]: 0

In [4]: %time f.seek(1000000)
CPU times: user 14.1 ms, sys: 0 ns, total: 14.1 ms
Wall time: 14.4 ms
Out[4]: 1000000

In [5]: %time f.seek(10000000)
CPU times: user 114 ms, sys: 4 µs, total: 114 ms
Wall time: 115 ms
Out[5]: 10000000

In [6]: %time f.seek(100000000)
CPU times: user 1.27 s, sys: 15.7 ms, total: 1.29 s
Wall time: 1.29 s
Out[6]: 100000000

In [7]: %time f.seek(0)
CPU times: user 212 µs, sys: 1 µs, total: 213 µs
Wall time: 155 µs
Out[7]: 0

In [8]: %time len(f.read(10000000))
CPU times: user 126 ms, sys: 28 ms, total: 154 ms
Wall time: 156 ms
Out[8]: 10000000
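
Note that a forward seek (In [5]) costs about as much as reading the same number of bytes (In [8]). A forward seek on a compressed stream can only be implemented as read-and-discard, roughly like the sketch below (forward_seek is a hypothetical helper used only to illustrate the behaviour):

import gzip

def forward_seek(f, offset, chunksize=1024 * 1024):
    # Illustrative only: with no index into the compressed data, the only
    # way to reach `offset` is to decompress and discard everything before it.
    remaining = offset - f.tell()
    while remaining > 0:
        data = f.read(min(chunksize, remaining))
        if not data:              # hit EOF before reaching the offset
            break
        remaining -= len(data)
    return f.tell()

f = gzip.open('trip_data_1_03.csv.gz')
forward_seek(f, 10000000)         # comparable cost to f.read(10000000)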

In many places we have functions that incorrectly assume that they are, like dask.dataframe.read_csv(..., compression='gzip'), which calls textblock in an embarrassingly parallel fashion:

dsk = {('df', 0): (pd.read_csv, (textblock, filename, 0, 1000)),
       ('df', 1): (pd.read_csv, (textblock, filename, 1000, 2000)),
       ('df', 2): (pd.read_csv, (textblock, filename, 2000, 3000)),
       ...}
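
Each of those tasks independently does something like the following (a simplified stand-in, not dask's actual textblock implementation), so block k re-pays the decompression cost of blocks 0 through k-1 and the total work grows quadratically with the number of blocks per file:

import gzip

def read_block(filename, start, end):
    # Simplified stand-in for textblock: open the compressed file,
    # seek to `start`, and read up to `end`.  With gzip the seek alone
    # decompresses everything before `start`.
    with gzip.open(filename) as f:
        f.seek(start)
        return f.read(end - start)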

In these cases we should either

  1. Use larger chunkbytes settings so that we read each file as a single block
  2. Make sequential tasks depend on each other, so that we stream through the file rather than access it randomly (see the sketch after this list)
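
A rough sketch of option 2, assuming we can thread one shared decompressed stream through the graph (open_stream and read_next_block are hypothetical helpers here, not dask internals): each task depends on its predecessor and keeps reading from wherever the previous one stopped, so the file is decompressed exactly once, front to back.

import gzip
from io import BytesIO

import pandas as pd
from dask.threaded import get

BLOCKSIZE = 64 * 2 ** 20           # decompressed bytes per block (arbitrary)

def open_stream(filename):
    # One shared gzip file object: decompression happens once, in order.
    return gzip.open(filename)

def read_next_block(f, _previous=None):
    # Read the next block plus the tail of its last line, then parse it.
    # `_previous` exists only to force the scheduler to run blocks in
    # order; its value is ignored.  Header/column handling is glossed
    # over in this sketch.
    block = f.read(BLOCKSIZE) + f.readline()
    return pd.read_csv(BytesIO(block))

filename = 'trip_data_1_03.csv.gz'   # placeholder
nblocks = 3                          # really ceil(decompressed size / BLOCKSIZE)

dsk = {'stream': (open_stream, filename)}
for i in range(nblocks):
    prev = ('df', i - 1) if i else None
    dsk[('df', i)] = (read_next_block, 'stream', prev)

dfs = get(dsk, [('df', i) for i in range(nblocks)])

The trade-off is that the chain serializes the reads within each file, so parallelism only comes from having many files (or from the larger blocks of option 1).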

A quick benchmark of a dask.dataframe.read_csv call shows that we spend around 70% of our computation time seeking within a collection of twelve 150MB files.
