Conversation

@mrocklin (Member) commented Apr 23, 2016

We refactor how data gets ingested from different sources.

Locations

Systems like local, s3, and hdfs (which will be in the distributed repo) implement the functions read_bytes, open_files, and, if available, open_text_files.

>>> sample, values = read_bytes('mydata.*.csv', delimiter=b'\n', blocksize=1000)
>>> sample
b'name,amount,id\nAlice,100,1\n...'
>>> values
[[Delayed('mydata.1.csv-0-1000'), Delayed('mydata.1.csv-1000-2000')],
 [Delayed('mydata.2.csv-0-1000')],
 ...
]

>>> files = open_files('mydata.*.csv')
>>> files = compute(*files)
>>> next(files[0])
b'name,amount,id\nAlice,100,1\n...'

>>> files = open_text_files('mydata.*.csv', encoding='utf-8')
>>> files = compute(*files)
>>> next(files[0])
'name,amount,id\nAlice,100,1\n...'

There are bytes.core.{read_bytes, open_files, open_text_files} functions that intelligently dispatch to the correct location based on the protocol. For example, dd.read_csv('s3://bucket/myfiles.*.csv') correctly dispatches to bytes.s3.read_bytes.
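
A rough sketch of what that protocol dispatch could look like (the registry and the backend stand-ins below are illustrative placeholders, not the actual bytes.core implementation):

def local_read_bytes(path, **kwargs):
    raise NotImplementedError("stand-in for bytes.local.read_bytes")

def s3_read_bytes(path, **kwargs):
    raise NotImplementedError("stand-in for bytes.s3.read_bytes")

BACKENDS = {'file': local_read_bytes, 's3': s3_read_bytes}

def read_bytes(urlpath, **kwargs):
    # 's3://bucket/myfiles.*.csv' -> protocol 's3'; bare paths fall back to local files
    if '://' in urlpath:
        protocol, path = urlpath.split('://', 1)
    else:
        protocol, path = 'file', urlpath
    return BACKENDS[protocol](path, **kwargs)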

Compression

We have a variety of compression formats filled out and rely heavily on the file-like objects found in those libraries. If such a file-like object is registered, then all of the read_bytes and open_files functions support that compression automatically. If the file object is seekable, then we can use the blocksize= option to split data. Again, this all happens automatically; there is no need for downstream code to care about it.
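
As a minimal sketch of that registry idea (illustrative only; it uses standard-library compression modules rather than dask's actual registry):

import bz2
import gzip
import lzma

# Map a compression name to a callable that wraps a raw file object in a
# decompressing file-like object.
compress_files = {
    'gzip': lambda f: gzip.GzipFile(fileobj=f),
    'bz2': bz2.BZ2File,
    'xz': lzma.LZMAFile,
    None: lambda f: f,
}

def open_maybe_compressed(raw_file, compression=None):
    return compress_files[compression](raw_file)

Whether blocksize= splitting is possible then reduces to asking the wrapped object whether it is seekable.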

Formats

We've hooked this up to back both dd.read_csv and db.read_text with good success. They get the extra locations and compression support for free.
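
A brief usage sketch of what "for free" means here (the bucket and file patterns are hypothetical): the same high-level calls work against a remote store or a compressed file without any format-specific code.

import dask.dataframe as dd
from dask.bag.text import read_text

df = dd.read_csv('s3://bucket/myfiles.*.csv')       # location handled by bytes.s3
b = read_text('mydata.*.csv.xz', compression='xz')  # compression handled by the bytes layer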

cc @jcrist, cc @wabu. I'd love a review of this before I move forward much more. @wabu in particular, if you can tear holes in this I'd love to know sooner rather than later.

@mrocklin (Member Author)

I've added compression through File objects:

files2 = valmap(compress, files)
with filetexts(files2, mode='b'):
    df = read_csv('2014-01-*.csv', compression=fmt)
    eq(df, expected, check_dtype=False)
@mrocklin (Member Author)

@wabu can you try this implementation against your dataset and compare performance? I would actually be pretty interested in the result of profiling. I might do this in the following way.

import dask
from dask.dataframe.csv import read_csv

df = read_csv(path, compression='xz')

I would do this in the IPython console and then use the %prun magic

%prun df.some_column.sum().compute(get=dask.get)  # single threaded scheduler

@mrocklin (Member Author)

It looks like backports.lzma.LZMAFile doesn't support seek.
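
A quick way to check this sort of thing, assuming a file-like object with the io.IOBase-style interface (not specific to any particular LZMA implementation):

import io

def supports_seek(f):
    try:
        return f.seekable()
    except AttributeError:
        try:
            f.seek(0)
            return True
        except (OSError, io.UnsupportedOperation):
            return False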

@mrocklin (Member Author)

@martindurant this could use your review if you have time sometime next week.

This is an attempt to refactor all of the location/compression/format work we prototyped in distributed.

@mrocklin force-pushed the bytes branch 3 times, most recently from 748079f to 985d6d2, on April 24, 2016
@mrocklin (Member Author)

I've now updated this to support bag.read_text as well as dataframe.read_csv.

@mrocklin (Member Author) commented Apr 25, 2016

Playing with a small xz file

In [1]: from dask.bag.text import read_text

In [2]: %time read_text('trip_data_1_00.csv.xz').count().compute()  # single core streams through file sequentially
CPU times: user 25.7 ms, sys: 17.2 ms, total: 42.9 ms
Wall time: 2.47 s
Out[2]: 1000000

In [3]: %time read_text('trip_data_1_00.csv.xz', blocksize=1000000).count().compute()  # multiple cores take random access chunks off one at a time
CPU times: user 61.9 ms, sys: 25.9 ms, total: 87.8 ms
Wall time: 4.15 s
Out[3]: 126523

In [4]: %time read_text('trip_data_1_00.csv.xz', blocksize=10000000).count().compute()  # multiple cores take random access chunks off one at a time
CPU times: user 21 ms, sys: 13.1 ms, total: 34 ms
Wall time: 1.41 s
Out[4]: 180746

In [5]: !du trip_data_1_00.csv.xz
20328   trip_data_1_00.csv.xz

In [6]: !du trip_data_1_00.csv.xz -hs
20M trip_data_1_00.csv.xz

Presumably this would work better with a smaller ratio of xz block size to dask.bag blocksize.

@mrocklin (Member Author)

Hrm, noting now that I'm getting different outputs, which is concerning.

@mrocklin (Member Author)

Ah, I was handling getsize incorrectly. Things are accurate now, but profiling shows that we're much slower on xz files when using chunks than we should be if the random access guarantee is accurate.

@mrocklin (Member Author)

@wabu do you have a dataset I can play with or a nice way to produce a similar dataset?

Also, are you confident that the python lzma module supports random access? Do you know of ways to produce xz files with small blocks using the command line?

@mrocklin (Member Author)

@updiversity regarding #1115 can you try the solution in this branch?

$ pip install git+https://github.com/mrocklin/dask.git@bytes --upgrade

import dask.bag as db
from dask.bag.text import read_text

b = read_text('s3://bucket/keys.*.txt', **s3_params)
b.take(0)

@jcrist (Member) commented Apr 25, 2016

Also, are you confident that the python lzma module supports random access?

For this I think you need the lzmaffi module (as seen in this comment: #1096 (comment))
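
A hedged sketch of that swap: prefer lzmaffi when it is installed and fall back to the standard lzma module otherwise (lzmaffi is intended as a drop-in replacement, so the rest of the code should not need to care which one it got).

try:
    import lzmaffi as lzma
except ImportError:
    import lzma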

@mrocklin (Member Author)

Successfully replaced both db.from_filenames and dd.read_csv

@mrocklin (Member Author)

Getting some odd errors in S3 tests like this: https://travis-ci.org/dask/dask/jobs/125713226.

Some possible causes:

  1. Too many open network files (lots of warnings). I wonder if there is anything we can do in s3fs to manage these better and retry/reopen connections if they are closed.
  2. Moto not behaving correctly?

CCing @martindurant

@mrocklin (Member Author)

Yeah, something fishy is definitely going on.

(Pdb) pp compute(values[0][0])
('{"amount": 100, "name": "Alice"}\n{"amount": 200, "name": "Bob"}\n{"amount": 300, "name": "Charlie"}\n{"amount": 400, "name": "Dennis"}\n',)
(Pdb) pp compute(values[1][0])
('{"amount": 500, "name": "Alice"}\n{"amount": 600, "name": "Bob"}\n{"amount": 700, "name": "Charlie"}\n{"amount": 800, "name": "Dennis"}\n',)
(Pdb) pp compute(values[0][0], values[1][0])
('{"amount": 500, "name": "Alice"}\n{"amount": 600, "name": "Bob"}\n{"amount": 700, "name": "Charlie"}\n{"amount": 800, "name": "Dennis"}\n',
 '{"amount": 500, "name": "Alice"}\n{"amount": 600, "name": "Bob"}\n{"amount": 700, "name": "Charlie"}\n{"amount": 800, "name": "Dennis"}\n')
(Pdb) pp values[0][0].dask
{'read_block_from_s3-dee321a7-acb2-4fbb-a142-03cf788ab0a8': (<function read_block_from_s3 at 0x7f6cfd5c6c08>,
                                                             'test/test/accounts.1.json',
                                                             0,
                                                             134217728,
                                                             (<type 'dict'>,
                                                              []),
                                                             None,
                                                             None)}
(Pdb) pp values[1][0].dask
{'read_block_from_s3-4aca7fe2-75d1-432a-ad82-be2eed056a19': (<function read_block_from_s3 at 0x7f6cfd5c6c08>,
                                                             'test/test/accounts.2.json',
                                                             0,
                                                             134217728,
                                                             (<type 'dict'>,
                                                              []),
                                                             None,
                                                             None)}

@mrocklin (Member Author)

Turns out moto isn't threadsafe. getmoto/moto#313

Switching to single thread. All is well.
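
A small self-contained sketch of that single-threaded fallback, using the same get=dask.get pattern as the %prun example earlier in this thread (the toy bag here is only for illustration):

import dask
import dask.bag as db

b = db.from_sequence(range(4))
print(b.sum().compute(get=dask.get))  # synchronous scheduler: no threads involved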

Removing the WIP label

@mrocklin changed the title from "[WIP] Refactor data loading to manage blocks of bytes" to "Refactor data loading to manage blocks of bytes" on Apr 26, 2016
@mrocklin changed the title from "Refactor data loading to manage blocks of bytes" to "Refactor data ingest and read_csv" on Apr 26, 2016
@mrocklin (Member Author)

For this I think you need the lzmaffi module (as seen in this comment: #1096 (comment))

Using the xz --list command-line utility, I also learned that my file had just a single block.
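
A small helper for this check (it assumes the xz CLI is on PATH; the filename is the one from earlier in this thread): random access only helps when the archive contains more than one block.

import subprocess

def xz_list(path):
    # `xz --list --verbose` prints per-block information for an .xz archive.
    return subprocess.check_output(['xz', '--list', '--verbose', path]).decode()

print(xz_list('trip_data_1_00.csv.xz'))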


def read_text(path, blocksize=None, compression='infer',
              encoding=system_encoding, errors='strict',
              linedelimiter=os.linesep, collection=True, **kwargs):
Review comment (Member):

Isn't '\n' more likely to be the line delimiter for any non-local file?

@mrocklin (Member Author) replied:

That's a good point. Maybe we keep this as None and let the bytes backends use their own defaults.
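
A minimal sketch of that idea (the helper name and the is_local flag are illustrative, not part of dask's API): keep linedelimiter=None in read_text and let each bytes backend resolve its own default.

import os

def resolve_linedelimiter(linedelimiter, is_local):
    if linedelimiter is not None:
        return linedelimiter
    # Local files may reasonably follow the platform convention; remote stores
    # almost always use plain '\n'.
    return os.linesep if is_local else '\n'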

@mrocklin (Member Author)

Alright, merging.

@mrocklin merged commit ec6d1e0 into dask:master on Apr 28, 2016
@mrocklin deleted the bytes branch on April 28, 2016
@sinhrks added this to the 0.9.0 milestone on May 11, 2016