GZip files are not efficiently seekable
```python
In [1]: import gzip
In [2]: f = gzip.open('trip_data_1_03.csv.gz')
In [3]: %time f.seek(0)
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 26.2 µs
Out[3]: 0
In [4]: %time f.seek(1000000)
CPU times: user 14.1 ms, sys: 0 ns, total: 14.1 ms
Wall time: 14.4 ms
Out[4]: 1000000
In [5]: %time f.seek(10000000)
CPU times: user 114 ms, sys: 4 µs, total: 114 ms
Wall time: 115 ms
Out[5]: 10000000
In [6]: %time f.seek(100000000)
CPU times: user 1.27 s, sys: 15.7 ms, total: 1.29 s
Wall time: 1.29 s
Out[6]: 100000000
In [7]: %time f.seek(0)
CPU times: user 212 µs, sys: 1 µs, total: 213 µs
Wall time: 155 µs
Out[7]: 0
In [8]: %time len(f.read(10000000))
CPU times: user 126 ms, sys: 28 ms, total: 154 ms
Wall time: 156 ms
Out[8]: 10000000
```
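Seeking forward in a `GzipFile` cannot jump straight to an uncompressed offset; the stream has to be decompressed and discarded up to the target, which is why the cost above grows roughly linearly with the offset. A minimal sketch of that behaviour (illustrative only, not the actual `gzip` module source):

```python
import gzip

def forward_seek(f, target, chunksize=1024):
    """Advance an open GzipFile to `target` by decompressing and
    discarding everything before it -- there is no way to jump
    directly to an uncompressed offset."""
    remaining = target - f.tell()
    while remaining > 0:
        data = f.read(min(chunksize, remaining))
        if not data:              # reached end of file early
            break
        remaining -= len(data)
    return f.tell()
```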
In many cases we have functions that incorrectly assume that gzip files are efficiently seekable, like `dask.dataframe.read_csv(..., compression='gzip')`, which uses `textblock` in an embarrassingly parallel fashion:

```python
dsk = {('df', 0): (pd.read_csv, (textblock, filename, start=0, end=1000)),
       ('df', 1): (pd.read_csv, (textblock, filename, start=1000, end=2000)),
       ('df', 2): (pd.read_csv, (textblock, filename, start=2000, end=3000)),
       ...}
```

In these cases we should either:
- Use larger `chunkbytes` settings so that we read single files at once
- Make sequential tasks depend on each other, so that we stream through the file rather than access it randomly (see the sketch after this list)
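The second option might look roughly like the graph below: each block task takes the previous task's result as an input, so the scheduler is forced into a single front-to-back pass through the gzip stream instead of many independent seeks. The helper names (`read_block`, `block_to_df`), the state-passing convention, and the lack of newline-boundary and header handling are all simplifications for illustration, not dask's actual implementation:

```python
import gzip
from io import StringIO

import pandas as pd

filename = 'trip_data_1_03.csv.gz'   # file from the benchmark above
blocksize = 64 * 2**20               # illustrative block size

def read_block(prev, fn, nbytes):
    """Read the next `nbytes` of decompressed text, reusing the open
    file handle threaded through from the previous task (if any).
    Assumes a single-process scheduler so the handle can be shared."""
    f = prev[0] if prev is not None else gzip.open(fn, 'rt')
    return f, f.read(nbytes)

def block_to_df(state):
    # Ignores header handling and ragged line boundaries between blocks.
    return pd.read_csv(StringIO(state[1]), header=None)

# Each ('block', i) depends on ('block', i - 1), so the tasks run in order
# and the file is streamed rather than seeked into repeatedly.
dsk = {('block', 0): (read_block, None, filename, blocksize),
       ('block', 1): (read_block, ('block', 0), filename, blocksize),
       ('block', 2): (read_block, ('block', 1), filename, blocksize),
       ('df', 0): (block_to_df, ('block', 0)),
       ('df', 1): (block_to_df, ('block', 1)),
       ('df', 2): (block_to_df, ('block', 2))}
```

This serialises reads within a single file, but different files can still be processed in parallel; the first option instead sets `chunkbytes` at least as large as each compressed file so that every file becomes a single block.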
A quick benchmark of a `dask.dataframe.read_csv` call shows that we're spending around 70% of our computation time seeking through a collection of twelve 150MB files.
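One rough way to see where that time goes (the glob pattern and the triggering computation are placeholders, not the exact benchmark that produced the 70% figure):

```python
In [1]: import dask.dataframe as dd
In [2]: df = dd.read_csv('trip_data_*.csv.gz', compression='gzip')
In [3]: %prun -l 10 len(df)    # look for cumulative time in GzipFile.seek/read
```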