Description
A decent fraction of dask issues today are about `dask.dataframe.read_csv` failing on a new keyword argument. Additionally, as distributed CSV readers come online we're going to want to pull out logic from the current `read_csv` function. It might be worth spending some quality time thinking about how best to make `read_csv` robust and how to make it more generalizable.
Current state
Generally the current `dd.read_csv` algorithm works as follows (a rough code sketch follows the list):
- Use `pd.read_csv` and the provided keyword arguments to create a dataframe from the bit at the top of the CSV file.
- Use this head dataframe to produce a new set of keyword arguments that will produce the same kind of dataframe when given a block of bytes from the center of the file (given clean endline boundaries).
- Create a graph that calls `pd.read_csv` on those blocks with those keyword arguments.
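A minimal sketch of the three steps, assuming the file fits in memory and using hypothetical helper names (`split_on_newlines`, `read_csv_sketch` are illustrative, not the actual dask internals):

```python
from io import BytesIO

import pandas as pd
from dask import delayed
import dask.dataframe as dd


def split_on_newlines(data: bytes, blocksize: int):
    """Split raw bytes into ~blocksize chunks ending on newline boundaries."""
    chunks, start = [], 0
    while start < len(data):
        end = min(start + blocksize, len(data))
        if end < len(data):
            nl = data.find(b"\n", end)
            end = len(data) if nl == -1 else nl + 1
        chunks.append(data[start:end])
        start = end
    return chunks


def read_csv_sketch(path, blocksize=2**25, **kwargs):
    with open(path, "rb") as f:
        data = f.read()
    blocks = split_on_newlines(data, blocksize)

    # Step 1: parse the top of the file to learn the resulting schema.
    head = pd.read_csv(BytesIO(blocks[0]), **kwargs)

    # Step 2: derive kwargs that parse a header-less interior block into the
    # same kind of dataframe. This translation is where the fragility lives:
    # arguments like parse_dates or usecols that rename or drop columns make
    # it hard to get `names` right.
    block_kwargs = dict(kwargs, header=None, names=list(head.columns))

    # Step 3: a graph of lazy pd.read_csv calls, one per block; the first
    # block keeps the original kwargs since it still carries the header.
    parts = [delayed(pd.read_csv)(BytesIO(blocks[0]), **kwargs)]
    parts += [delayed(pd.read_csv)(BytesIO(b), **block_kwargs)
              for b in blocks[1:]]
    return dd.from_delayed(parts, meta=head.iloc[:0])
```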
Things to fix
There are some known bugs in existing issues.
There is also the possibility of rethinking how we approach this. Arguments that change the column names, like `parse_dates`, introduce a lot of headache into constructing the right kwargs. I wonder if it might be cleaner to actually pull off the header line from the CSV file and stitch it onto each of the block reads before calling `pd.read_csv` (sketched below). This would require less error-prone logic on our part and be more robust generally.
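A minimal sketch of that header-stitching alternative, with illustrative names that are not a proposed API:

```python
from io import BytesIO

import pandas as pd


def read_block_with_header(header_line: bytes, block: bytes, **kwargs):
    """Parse one interior block by stitching the file's header line onto it.

    Each block then looks like a small, complete CSV file, so the user's
    kwargs can be passed straight through without any translation.
    """
    return pd.read_csv(BytesIO(header_line + block), **kwargs)


# Example: parse_dates works unchanged because the block carries the header.
header = b"name,when\n"
block = b"alice,2015-01-01\nbob,2015-01-02\n"
df = read_block_with_header(header, block, parse_dates=["when"])
```

The appeal is that column-renaming arguments never need to be reverse-engineered: pandas sees the same header on every block that it would see on the whole file.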
We need to pull error-prone logic outside of the `dd.read_csv` function for reuse in other functions, like `distributed.hdfs.read_csv`.
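One possible shape for that factoring, assuming the current head-dataframe approach; the function names (`csv_defaults_from_sample`, `dataframe_from_block`) are hypothetical. A local `dd.read_csv` would feed these byte blocks from disk and an HDFS reader would feed them remote bytes, but both would funnel through the same shared core:

```python
from io import BytesIO

import pandas as pd


def csv_defaults_from_sample(sample: bytes, **kwargs):
    """Infer the head dataframe and per-block kwargs from a byte sample."""
    head = pd.read_csv(BytesIO(sample), **kwargs)
    block_kwargs = dict(kwargs, header=None, names=list(head.columns))
    return head, block_kwargs


def dataframe_from_block(block: bytes, **block_kwargs):
    """Turn one newline-delimited byte block into a pandas DataFrame."""
    return pd.read_csv(BytesIO(block), **block_kwargs)
```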
Depending on how tricky our solution is we might want a more extensive test suite. I hear that Pandas has a decent one for this. Perhaps we can steal something.