GZip files are not efficiently seekable
```python
In [1]: import gzip
In [2]: f = gzip.open('trip_data_1_03.csv.gz')
In [3]: %time f.seek(0)
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 26.2 µs
Out[3]: 0
In [4]: %time f.seek(1000000)
CPU times: user 14.1 ms, sys: 0 ns, total: 14.1 ms
Wall time: 14.4 ms
Out[4]: 1000000
In [5]: %time f.seek(10000000)
CPU times: user 114 ms, sys: 4 µs, total: 114 ms
Wall time: 115 ms
Out[5]: 10000000
In [6]: %time f.seek(100000000)
CPU times: user 1.27 s, sys: 15.7 ms, total: 1.29 s
Wall time: 1.29 s
Out[6]: 100000000
In [7]: %time f.seek(0)
CPU times: user 212 µs, sys: 1 µs, total: 213 µs
Wall time: 155 µs
Out[7]: 0
In [8]: %time len(f.read(10000000))
CPU times: user 126 ms, sys: 28 ms, total: 154 ms
Wall time: 156 ms
Out[8]: 10000000
```
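Seeking forward in a `GzipFile` cannot jump straight to an uncompressed offset; the stream has to be decompressed and discarded up to the target, which is why the cost above grows roughly linearly with the offset. A minimal sketch of that behaviour (illustrative only, not the actual `gzip` module source):

```python
import gzip

def forward_seek(f, target, chunksize=1024):
    """Advance an open GzipFile to `target` by decompressing and
    discarding everything before it -- there is no way to jump
    directly to an uncompressed offset."""
    remaining = target - f.tell()
    while remaining > 0:
        data = f.read(min(chunksize, remaining))
        if not data:              # reached end of file early
            break
        remaining -= len(data)
    return f.tell()
```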
In many cases we have functions that incorrectly assume that gzip files are efficiently seekable, like `dask.dataframe.read_csv(..., compression='gzip')`, which uses `textblock` in an embarrassingly parallel fashion:

```python
dsk = {('df', 0): (pd.read_csv, (textblock, filename, start=0, end=1000)),
       ('df', 1): (pd.read_csv, (textblock, filename, start=1000, end=2000)),
       ('df', 2): (pd.read_csv, (textblock, filename, start=2000, end=3000)),
       ...}
```

In these cases we should either:
- Use larger `chunkbytes` settings so that we read single files at once
- Make sequential tasks depend on each other, so that we stream through the file rather than access it randomly (see the sketch after this list)
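The second option might look roughly like the graph below: each block task takes the previous task's result as an input, so the scheduler is forced into a single front-to-back pass through the gzip stream instead of many independent seeks. The helper names (`read_block`, `block_to_df`), the state-passing convention, and the lack of newline-boundary and header handling are all simplifications for illustration, not dask's actual implementation:

```python
import gzip
from io import StringIO

import pandas as pd

filename = 'trip_data_1_03.csv.gz'   # file from the benchmark above
blocksize = 64 * 2**20               # illustrative block size

def read_block(prev, fn, nbytes):
    """Read the next `nbytes` of decompressed text, reusing the open
    file handle threaded through from the previous task (if any).
    Assumes a single-process scheduler so the handle can be shared."""
    f = prev[0] if prev is not None else gzip.open(fn, 'rt')
    return f, f.read(nbytes)

def block_to_df(state):
    # Ignores header handling and ragged line boundaries between blocks.
    return pd.read_csv(StringIO(state[1]), header=None)

# Each ('block', i) depends on ('block', i - 1), so the tasks run in order
# and the file is streamed rather than seeked into repeatedly.
dsk = {('block', 0): (read_block, None, filename, blocksize),
       ('block', 1): (read_block, ('block', 0), filename, blocksize),
       ('block', 2): (read_block, ('block', 1), filename, blocksize),
       ('df', 0): (block_to_df, ('block', 0)),
       ('df', 1): (block_to_df, ('block', 1)),
       ('df', 2): (block_to_df, ('block', 2))}
```

This serialises reads within a single file, but different files can still be processed in parallel; the first option instead sets `chunkbytes` at least as large as each compressed file so that every file becomes a single block.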
A quick benchmark of a `dask.dataframe.read_csv` call shows that we're spending around 70% of our computation time seeking through a collection of twelve 150MB files.
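One rough way to see where that time goes (the glob pattern and the triggering computation are placeholders, not the exact benchmark that produced the 70% figure):

```python
In [1]: import dask.dataframe as dd
In [2]: df = dd.read_csv('trip_data_*.csv.gz', compression='gzip')
In [3]: %prun -l 10 len(df)    # look for cumulative time in GzipFile.seek/read
```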