Determine accurate index in read_csv with full data read #898

@mrocklin

Description

dask.dataframe.read_csv works by breaking a text file up at byte boundaries (say, every 100MB), seeking to the nearest line endings, and then handing each block of bytes to pandas.read_csv. At construction time we only read the first bit of the first block, just enough to determine dtypes and column names; we don't read the remaining blocks. This makes the call return to the user quickly.
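
For illustration, the user-facing side of that design looks roughly like this (the file name is hypothetical; `blocksize` is the keyword that controls the byte boundaries in current dask):

```python
import dask.dataframe as dd

# Hypothetical file; each ~100MB block becomes one lazy partition,
# parsed independently by pandas.read_csv at compute time.
df = dd.read_csv("data.csv", blocksize=100_000_000)

df.npartitions  # number of blocks the file was split into
df.dtypes       # inferred from a sample of the first block only
```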

However, this also means that we have no ability to determine index values. If there is an actual index (say, a timestamp column for a time series) then we can't leverage it for divisions. If there is no index then we don't have the ability to accurately add an increasing index like 1, 2, 3, .... In some cases these capabilities are really important.
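
For the second case, here is a minimal sketch of one way a user could build that accurate increasing index today (the file pattern is hypothetical, and this is a user-side recipe, not an implemented read_csv option): count rows per partition in one pass, then reindex each partition at its cumulative offset.

```python
import numpy as np
import pandas as pd
import dask
import dask.dataframe as dd

df = dd.read_csv("data-*.csv")  # hypothetical file pattern

# One pass over the data: the number of rows in each partition.
lengths = df.map_partitions(len).compute()
offsets = np.concatenate([[0], np.cumsum(lengths)])

def reindex(part: pd.DataFrame, start: int) -> pd.DataFrame:
    # Give this partition its slice of the global 0, 1, 2, ... index.
    part = part.copy()
    part.index = pd.RangeIndex(start, start + len(part))
    return part

parts = [
    dask.delayed(reindex)(part, int(start))
    for part, start in zip(df.to_delayed(), offsets[:-1])
]
# Divisions: the first index value of each partition, plus the last value.
divisions = [int(o) for o in offsets[:-1]] + [int(offsets[-1]) - 1]
# df._meta is dask's internal empty frame describing the schema.
df2 = dd.from_delayed(parts, meta=df._meta, divisions=divisions)
```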

So at times it might make sense to pay for an additional pass through the data so that we can collect index values for divisions. Presumably this could be achieved by using `read_csv` normally and then calling an appropriate `map_partitions` function to determine `(partition.index.min(), partition.index.max())` for each partition, roughly as sketched below.
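
A sketch of that extra pass, assuming a hypothetical file pattern and a sorted "time" column; `set_index(..., sorted=True, divisions=...)` is how current dask lets you hand the collected bounds back without a shuffle:

```python
import pandas as pd
import dask.dataframe as dd

# Hypothetical file pattern whose "time" column is globally sorted.
df = dd.read_csv("data-*.csv", parse_dates=["time"])

# One full pass: (min, max) of the would-be index in every partition.
bounds = df["time"].map_partitions(
    lambda s: pd.Series([s.min(), s.max()])
).compute()
mins, maxes = bounds.values[::2], bounds.values[1::2]

# If the partitions are globally ordered, the bounds form valid divisions:
# the first value of each partition, plus the overall maximum at the end.
assert (mins[1:] >= maxes[:-1]).all(), "partitions are not sorted"
divisions = list(mins) + [maxes[-1]]

# Set a real index with known divisions; no shuffle is needed.
df = df.set_index("time", sorted=True, divisions=divisions)
```

I believe current dask's `set_index("time", sorted=True)` with no explicit divisions performs essentially this pass internally to compute them.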

This has come up a few times, but most recently in #897
