Description
A decent fraction of dask issues today are about `dask.dataframe.read_csv` failing on a new keyword argument. Additionally, as distributed CSV readers come online we're going to want to pull out logic from the current `read_csv` function. It might be worth spending some quality time thinking about how best to make `read_csv` robust and how to make it more generalizable.
Current state
Generally the current `dd.read_csv` algorithm works as follows (a rough code sketch follows the list):
- Use `pd.read_csv` and the provided keyword arguments to create a dataframe from the bit at the top of the CSV file.
- Use this head dataframe to produce a new set of keyword arguments that will produce the same kind of dataframe when given a block of bytes from the center of the file (given clean endline boundaries).
- Create a graph that calls `pd.read_csv` on those blocks with those keyword arguments.
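A minimal sketch of the three steps, assuming the file fits in memory and using hypothetical helper names (`split_on_newlines`, `read_csv_sketch` are illustrative, not the actual dask internals):

```python
from io import BytesIO

import pandas as pd
from dask import delayed
import dask.dataframe as dd


def split_on_newlines(data: bytes, blocksize: int):
    """Split raw bytes into ~blocksize chunks ending on newline boundaries."""
    chunks, start = [], 0
    while start < len(data):
        end = min(start + blocksize, len(data))
        if end < len(data):
            nl = data.find(b"\n", end)
            end = len(data) if nl == -1 else nl + 1
        chunks.append(data[start:end])
        start = end
    return chunks


def read_csv_sketch(path, blocksize=2**25, **kwargs):
    with open(path, "rb") as f:
        data = f.read()
    blocks = split_on_newlines(data, blocksize)

    # Step 1: parse the top of the file to learn the resulting schema.
    head = pd.read_csv(BytesIO(blocks[0]), **kwargs)

    # Step 2: derive kwargs that parse a header-less interior block into the
    # same kind of dataframe. This translation is where the fragility lives:
    # arguments like parse_dates or usecols that rename or drop columns make
    # it hard to get `names` right.
    block_kwargs = dict(kwargs, header=None, names=list(head.columns))

    # Step 3: a graph of lazy pd.read_csv calls, one per block; the first
    # block keeps the original kwargs since it still carries the header.
    parts = [delayed(pd.read_csv)(BytesIO(blocks[0]), **kwargs)]
    parts += [delayed(pd.read_csv)(BytesIO(b), **block_kwargs)
              for b in blocks[1:]]
    return dd.from_delayed(parts, meta=head.iloc[:0])
```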
Things to fix
There are some known bugs in existing issues.
There is also the possibility of rethinking how we approach this. Arguments that change the column names, like `parse_dates`, introduce a lot of headache into constructing the right kwargs. I wonder if it might be cleaner to actually pull off the header line from the CSV file and stitch it onto each of the block reads before calling `pd.read_csv` (sketched below). This would require less error-prone logic on our part and be more robust generally.
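A minimal sketch of that header-stitching alternative, with illustrative names that are not a proposed API:

```python
from io import BytesIO

import pandas as pd


def read_block_with_header(header_line: bytes, block: bytes, **kwargs):
    """Parse one interior block by stitching the file's header line onto it.

    Each block then looks like a small, complete CSV file, so the user's
    kwargs can be passed straight through without any translation.
    """
    return pd.read_csv(BytesIO(header_line + block), **kwargs)


# Example: parse_dates works unchanged because the block carries the header.
header = b"name,when\n"
block = b"alice,2015-01-01\nbob,2015-01-02\n"
df = read_block_with_header(header, block, parse_dates=["when"])
```

The appeal is that column-renaming arguments never need to be reverse-engineered: pandas sees the same header on every block that it would see on the whole file.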
We need to pull error-prone logic outside of the `dd.read_csv` function for reuse in other functions, like `distributed.hdfs.read_csv`.
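One possible shape for that factoring, assuming the current head-dataframe approach; the function names (`csv_defaults_from_sample`, `dataframe_from_block`) are hypothetical. A local `dd.read_csv` would feed these byte blocks from disk and an HDFS reader would feed them remote bytes, but both would funnel through the same shared core:

```python
from io import BytesIO

import pandas as pd


def csv_defaults_from_sample(sample: bytes, **kwargs):
    """Infer the head dataframe and per-block kwargs from a byte sample."""
    head = pd.read_csv(BytesIO(sample), **kwargs)
    block_kwargs = dict(kwargs, header=None, names=list(head.columns))
    return head, block_kwargs


def dataframe_from_block(block: bytes, **block_kwargs):
    """Turn one newline-delimited byte block into a pandas DataFrame."""
    return pd.read_csv(BytesIO(block), **block_kwargs)
```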
Depending on how tricky our solution is we might want a more extensive test suite. I hear that Pandas has a decent one for this. Perhaps we can steal something.