
API: read_csv should default to ignore_index=True when no index_col #1047

@jreback

Description

xref pandas-dev/pandas#12618

In [13]: import dask

In [14]: dask.__version__
Out[14]: '0.8.0'

In [15]: pd.__version__
Out[15]: '0.18.0+7.g17a5982'
In [4]: import glob

In [5]: DataFrame(np.arange(10).reshape(-1,2)).to_csv('file1.csv')

In [6]: DataFrame(np.arange(10).reshape(-1,2)).to_csv('file2.csv')

In [7]: pd.concat([ pd.read_csv(f, index_col=0) for f in glob.glob('file*.csv') ], ignore_index=True)
Out[7]: 
   0  1
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
5  0  1
6  2  3
7  4  5
8  6  7
9  8  9

In [11]: import dask.dataframe as dd

In [12]: dd.read_csv('file*.csv', header=0, usecols=[1,2]).compute()
Out[12]: 
   0  1
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
  • the concat should use ignore_index=True; otherwise the per-file indices are kept, which implicitly sets the index of the result
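For comparison, a minimal pandas-only sketch of what ignore_index changes when concatenating per-file frames (file names follow the example above):

```python
import numpy as np
import pandas as pd

# two identical small CSVs, as in the transcript above
pd.DataFrame(np.arange(10).reshape(-1, 2)).to_csv('file1.csv')
pd.DataFrame(np.arange(10).reshape(-1, 2)).to_csv('file2.csv')

frames = [pd.read_csv(f, index_col=0) for f in ['file1.csv', 'file2.csv']]

# without ignore_index the per-file indices are kept, so labels repeat
kept = pd.concat(frames)
print(kept.index.tolist())   # [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]

# with ignore_index a fresh RangeIndex is assigned, matching Out[7]
fresh = pd.concat(frames, ignore_index=True)
print(fresh.index.tolist())  # [0, 1, ..., 9]
```

The repeated-label case is exactly what the dask Out[12] above shows; the issue asks for the ignore_index behaviour by default when no index_col is given.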

Side issue (maybe an example in the docs would help): how to ignore an index.

  • I'd really like to do:

    dd.read_csv('file*.csv', header=0, index_col=0).compute()

(see the above linked issue). This requires me to .set_index(...) afterwards on a dummy column, though.

I know the files have an index, but I end up having to read it back in as an ordinary column:

In [17]: dd.read_csv('file*.csv', header=0, names=['col',0,1]).compute()
Out[17]: 
   col  0  1
0    0  0  1
1    1  2  3
2    2  4  5
3    3  6  7
4    4  8  9
0    0  0  1
1    1  2  3
2    2  4  5
3    3  6  7
4    4  8  9

Then drop it or .reset_index
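A pandas sketch of that workaround, using the dummy 'col' name from In [17] above (header=0 plus names skips the saved header row and relabels the columns):

```python
import numpy as np
import pandas as pd

pd.DataFrame(np.arange(10).reshape(-1, 2)).to_csv('file1.csv')

# the saved index comes back as an ordinary column named 'col'
df = pd.read_csv('file1.csv', header=0, names=['col', 0, 1])

# option 1: promote the dummy column back to being the index
with_index = df.set_index('col')

# option 2: drop it and keep the default RangeIndex
without = df.drop(columns='col')
```

Either way it is an extra post-processing step that index_col=0 would make unnecessary.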
