read_csv refactor and test suite #1022

@mrocklin

Description

A decent fraction of dask issues today are about dask.dataframe.read_csv failing on some new keyword argument. Additionally, as distributed CSV readers come online, we're going to want to pull logic out of the current read_csv function. It might be worth spending some quality time thinking about how to make read_csv robust and more generalizable.

Current state

Generally, the current dd.read_csv algorithm works as follows (a rough sketch in code appears after the list):

  1. Use pd.read_csv and the provided keyword arguments to create a dataframe from a small sample at the top of the CSV file.
  2. Use this head dataframe to derive a new set of keyword arguments that will produce the same kind of dataframe when given a block of bytes from the middle of the file (assuming clean line boundaries).
  3. Create a graph that calls pd.read_csv on each of those blocks with the derived keyword arguments.
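
In code, that looks roughly like the sketch below. This is a hand-wavy illustration, not the actual dask implementation; the function name and the sample-trimming details are made up for the example.

```python
import pandas as pd
from io import BytesIO

def read_csv_sketch(path, sample_bytes=2**16, **kwargs):
    # 1. Parse a sample from the top of the file with the user's kwargs
    with open(path, 'rb') as f:
        sample = f.read(sample_bytes)
    sample = sample[:sample.rfind(b'\n') + 1]  # trim to a clean line boundary
    head = pd.read_csv(BytesIO(sample), **kwargs)

    # 2. Derive kwargs that should reproduce head's schema on headerless
    #    blocks from the middle of the file.  This is where kwargs like
    #    parse_dates (which rename/transform columns) cause trouble.
    block_kwargs = dict(kwargs, header=None, names=list(head.columns))

    # 3. In dask proper this becomes a task graph over byte blocks; here we
    #    just show the call each task would make on one block of bytes.
    def read_block(block_bytes):
        return pd.read_csv(BytesIO(block_bytes), **block_kwargs)

    return head, read_block
```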

Things to fix

There are some known bugs already tracked in existing issues.

There is also the possibility of rethinking how we approach this. Arguments that change the column names, like parse_dates, introduce a lot of headaches into constructing the right kwargs. I wonder if it might be cleaner to actually pull the header line off the CSV file and stitch it onto each of the block reads before calling pd.read_csv. This might require less error-prone logic on our part and be generally more robust.
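
A minimal sketch of that idea, assuming blocks arrive as raw bytes already split on clean line boundaries (the function name, file name, and example column are illustrative):

```python
import pandas as pd
from io import BytesIO

def read_block(header_line, block_bytes, **kwargs):
    # Prepend the file's original header so pd.read_csv re-derives column
    # names, dtypes, parse_dates targets, etc. exactly as it did for the
    # head dataframe; no kwarg rewriting on our side.
    return pd.read_csv(BytesIO(header_line + block_bytes), **kwargs)

# Usage: pull the header line off the file once, then reuse it per block.
with open('data.csv', 'rb') as f:
    header_line = f.readline()
    block = f.read()  # stand-in for one mid-file block
df = read_block(header_line, block, parse_dates=['when'])
```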

We need to pull the error-prone logic out of the dd.read_csv function so it can be reused in other functions, like distributed.hdfs.read_csv.
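
For instance, the inference step could live in a pure bytes-in, metadata-out helper; infer_header below is a hypothetical factoring, not an existing function:

```python
import pandas as pd
from io import BytesIO

def infer_header(sample_bytes, **kwargs):
    """Return (raw header line, head dataframe) from a byte sample.

    Because this only touches bytes, a distributed reader that fetches its
    own byte ranges (local disk, HDFS, ...) could reuse it unchanged.
    """
    header_line, _, _ = sample_bytes.partition(b'\n')
    head = pd.read_csv(BytesIO(sample_bytes), **kwargs)
    return header_line + b'\n', head
```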

Depending on how tricky our solution is, we might want a more extensive test suite. I hear that Pandas has a decent one for this. Perhaps we can steal something.
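
One possible shape for such tests: parametrize over kwargs that have historically caused trouble and check dd.read_csv against pd.read_csv as ground truth (the kwarg list and column names here are just illustrative):

```python
import dask.dataframe as dd
import pandas as pd
import pytest

@pytest.mark.parametrize('kwargs', [
    {},
    {'parse_dates': ['when']},
    {'usecols': ['a', 'when']},
])
def test_read_csv_matches_pandas(tmp_path, kwargs):
    # tmp_path is pytest's built-in temporary-directory fixture
    path = tmp_path / 'data.csv'
    path.write_text('a,b,when\n1,2,2001-01-01\n3,4,2001-01-02\n')
    expected = pd.read_csv(path, **kwargs)
    result = dd.read_csv(str(path), **kwargs).compute()
    pd.testing.assert_frame_equal(
        result.reset_index(drop=True), expected.reset_index(drop=True))
```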
