Determine accurate index in read_csv with full data read #898

@mrocklin

Description

dask.dataframe.read_csv works by breaking a text file up at byte boundaries (say, every 100MB), seeking to the nearest line endings, and then handing each block of bytes to pandas.read_csv. At construction time we only read the first bit of the first block, just enough to determine dtypes and column names; we don't read the remaining blocks. This makes the call return to the user quickly.
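
For illustration, the user-facing side of that design looks roughly like this (the file name is hypothetical; `blocksize` is the keyword that controls the byte boundaries in current dask):

```python
import dask.dataframe as dd

# Hypothetical file; each ~100MB block becomes one lazy partition,
# parsed independently by pandas.read_csv at compute time.
df = dd.read_csv("data.csv", blocksize=100_000_000)

df.npartitions  # number of blocks the file was split into
df.dtypes       # inferred from a sample of the first block only
```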

However, this also means that we have no ability to determine index values. If there is an actual index (say, a timestamp column for a time series) then we can't leverage it for divisions. If there is no index then we don't have the ability to accurately add an increasing index like 1, 2, 3, .... In some cases these capabilities are really important.
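
For the second case, here is a minimal sketch of one way a user could build that accurate increasing index today (the file pattern is hypothetical, and this is a user-side recipe, not an implemented read_csv option): count rows per partition in one pass, then reindex each partition at its cumulative offset.

```python
import numpy as np
import pandas as pd
import dask
import dask.dataframe as dd

df = dd.read_csv("data-*.csv")  # hypothetical file pattern

# One pass over the data: the number of rows in each partition.
lengths = df.map_partitions(len).compute()
offsets = np.concatenate([[0], np.cumsum(lengths)])

def reindex(part: pd.DataFrame, start: int) -> pd.DataFrame:
    # Give this partition its slice of the global 0, 1, 2, ... index.
    part = part.copy()
    part.index = pd.RangeIndex(start, start + len(part))
    return part

parts = [
    dask.delayed(reindex)(part, int(start))
    for part, start in zip(df.to_delayed(), offsets[:-1])
]
# Divisions: the first index value of each partition, plus the last value.
divisions = [int(o) for o in offsets[:-1]] + [int(offsets[-1]) - 1]
# df._meta is dask's internal empty frame describing the schema.
df2 = dd.from_delayed(parts, meta=df._meta, divisions=divisions)
```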

So at times it might make sense to pay for an additional pass through the data so that we can collect index values for divisions. Presumably this could be achieved by using `read_csv` normally and then calling an appropriate `map_partitions` function to determine `(partition.index.min(), partition.index.max())` for each partition, roughly as sketched below.
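
A sketch of that extra pass, assuming a hypothetical file pattern and a sorted "time" column; `set_index(..., sorted=True, divisions=...)` is how current dask lets you hand the collected bounds back without a shuffle:

```python
import pandas as pd
import dask.dataframe as dd

# Hypothetical file pattern whose "time" column is globally sorted.
df = dd.read_csv("data-*.csv", parse_dates=["time"])

# One full pass: (min, max) of the would-be index in every partition.
bounds = df["time"].map_partitions(
    lambda s: pd.Series([s.min(), s.max()])
).compute()
mins, maxes = bounds.values[::2], bounds.values[1::2]

# If the partitions are globally ordered, the bounds form valid divisions:
# the first value of each partition, plus the overall maximum at the end.
assert (mins[1:] >= maxes[:-1]).all(), "partitions are not sorted"
divisions = list(mins) + [maxes[-1]]

# Set a real index with known divisions; no shuffle is needed.
df = df.set_index("time", sorted=True, divisions=divisions)
```

I believe current dask's `set_index("time", sorted=True)` with no explicit divisions performs essentially this pass internally to compute them.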

This has come up a few times, but most recently in #897
