Description
dask.dataframe.read_csv works by splitting a text file at byte boundaries (say, every 100MB), seeking to the nearest newline, and then handing each block of bytes to pandas.read_csv. At construction time we parse only a small sample of the first block to determine dtypes and column names; we don't read the remaining blocks. This returns control to the user quickly.
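A minimal sketch of the consequence, using a hypothetical file name: construction is cheap, but the resulting collection has no known divisions (per-partition index boundaries).

```python
import dask.dataframe as dd

# Hypothetical file; blocksize controls the byte-boundary splitting.
df = dd.read_csv("data.csv", blocksize=100_000_000)  # ~100MB blocks

# Construction is fast because only a sample of the first block is
# parsed, but the index boundaries of each partition are unknown:
print(df.known_divisions)  # False
print(df.divisions)        # (None, None, ..., None)
```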
However, this also means we never learn the index values. If the file contains a meaningful index (say, timestamps in a time series) we can't take advantage of it, and if it doesn't, we have no way to accurately attach a monotonically increasing index like 1, 2, 3, .... In some workloads both of these really matter.
So at times it might make sense to accept the additional pass through the data so that we can collect index values for divisions. Presumably this could be achieved by calling read_csv normally, and then using something like map_partitions to determine (partition.index.min(), partition.index.max()) for each partition; a rough sketch follows.
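One way this could look, assuming hypothetical files "data-*.csv" with a "ts" column that is sorted within and across files (the delayed-based pass here is an illustration of the idea, not an existing code path):

```python
import dask
import dask.dataframe as dd

# Hypothetical layout: sorted "ts" values can serve as divisions.
df = dd.read_csv("data-*.csv", parse_dates=["ts"])

# The extra pass: compute each partition's index extent in parallel.
@dask.delayed
def extent(part):
    return part["ts"].min(), part["ts"].max()

extents = dask.compute(*[extent(p) for p in df.to_delayed()])
mins = [lo for lo, _ in extents]
maxes = [hi for _, hi in extents]

# Divisions must be npartitions + 1 monotonically increasing values.
assert all(hi <= lo for hi, lo in zip(maxes, mins[1:])), "not globally sorted"
divisions = mins + [maxes[-1]]

# Rebuild the collection with each partition indexed by "ts" and the
# computed divisions attached.
parts = [p.set_index("ts") for p in df.to_delayed()]
df2 = dd.from_delayed(parts, meta=df._meta.set_index("ts"), divisions=divisions)
```

With known divisions, operations like .loc slicing on the index only touch the relevant partitions instead of all of them.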
This has come up a few times, most recently in #897.