Closed
Description
In [13]: import dask
In [14]: dask.__version__
Out[14]: '0.8.0'
In [15]: pd.__version__
Out[15]: '0.18.0+7.g17a5982'
In [4]: import glob
In [5]: DataFrame(np.arange(10).reshape(-1,2)).to_csv('file1.csv')
In [6]: DataFrame(np.arange(10).reshape(-1,2)).to_csv('file2.csv')
In [7]: pd.concat([ pd.read_csv(f, index_col=0) for f in glob.glob('file*.csv') ], ignore_index=True)
Out[7]:
   0  1
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
5  0  1
6  2  3
7  4  5
8  6  7
9  8  9
In [11]: import dask.dataframe as dd
In [12]: dd.read_csv('file*.csv', header=0, usecols=[1,2]).compute()
Out[12]:
   0  1
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
- The concat should use ignore_index=True; otherwise you are implicitly setting the index.
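To make the difference concrete, here is a minimal pandas-only sketch. It assumes in-memory CSV text with the same contents as file1.csv and file2.csv, so nothing is written to disk, and it only illustrates how pd.concat treats the index. It is not a claim about how dask concatenates partitions internally.

```python
import io
import pandas as pd

# In-memory stand-in for file1.csv / file2.csv (same contents as above).
csv_text = ",0,1\n0,0,1\n1,2,3\n2,4,5\n3,6,7\n4,8,9\n"
frames = [pd.read_csv(io.StringIO(csv_text), index_col=0) for _ in range(2)]

# Default: each frame's index labels are carried over, so 0..4 repeats.
kept = pd.concat(frames)

# ignore_index=True: the labels are discarded and a fresh 0..n-1 index is built.
fresh = pd.concat(frames, ignore_index=True)

print(kept.index.tolist())
print(fresh.index.tolist())
```

The first print shows the repeated per-file labels, which matches the dask output above; the second shows the continuous index from the pd.concat call in In [7].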
Side issue: how to ignore an index (maybe an example in the docs would help).
-
I'd really like to do:
dd.read_csv('file*.csv', header=0, index_col=0).compute()
(see the above linked issue). Without index_col, though, I have to call .set_index(...) afterwards on a dummy column.
I know I have an index, but I
In [17]: dd.read_csv('file*.csv', header=0, names=['col',0,1]).compute()
Out[17]:
   col  0  1
0    0  0  1
1    1  2  3
2    2  4  5
3    3  6  7
4    4  8  9
0    0  0  1
1    1  2  3
2    2  4  5
3    3  6  7
4    4  8  9
Then drop it or call .reset_index.
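The two follow-up options can be sketched in plain pandas. This assumes the same in-memory CSV text as a stand-in for the files, and uses pd.concat as a rough stand-in for combining the per-file results. The dask calls may behave differently (for example, in how a reset index is numbered across partitions), so treat this as an illustration rather than the dask behaviour.

```python
import io
import pandas as pd

# In-memory stand-in for file1.csv / file2.csv.
csv_text = ",0,1\n0,0,1\n1,2,3\n2,4,5\n3,6,7\n4,8,9\n"

# header=0 skips the header row in the file; names= supplies replacements,
# with "col" naming the otherwise unnamed index column.
parts = [
    pd.read_csv(io.StringIO(csv_text), header=0, names=["col", 0, 1])
    for _ in range(2)
]
df = pd.concat(parts)

# Option 1: drop the dummy column; the repeated 0..4 index remains.
dropped = df.drop(columns="col")

# Option 2: additionally replace the repeated index with a fresh 0..n-1 one.
renumbered = dropped.reset_index(drop=True)

print(dropped.index.tolist())
print(renumbered.index.tolist())
```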