Description
Consider the following script, which operates on 15 CSV files of about 16 MB each (each CSV is an array of floats with 10 columns and 100,000 rows).
import dask.dataframe as dd
df = dd.read_csv("test_data_small/*.csv")
print(df.npartitions)
print(df.mean().compute())
print(df.map_partitions(len,columns=df.columns).compute())
Running this script on my machine (OS X 10.11.1, Python 2.7.11 [Anaconda 2.3.0 (x86_64)], pandas 0.17.1) produces one of the following results, in random order and with approximately equal frequency. For each result I give a description first and then stdout.
1. npartitions completes, mean completes, map_partitions seg faults
15
0 -0.000135
1 -0.000011
10 -0.000939
2 0.000771
3 0.000309
4 0.000573
5 0.000635
6 0.000770
7 0.000713
8 0.001237
9 -0.000440
Unnamed: 0 49999.500000
dtype: float64
Segmentation fault: 11
2. npartitions completes, Fatal Python Error at mean
15
Fatal Python error: GC object already tracked
Abort trap: 6
3. npartitions completes, mean completes, Fatal Python Error at map_partitions
15
0 -0.000135
1 -0.000011
10 -0.000939
2 0.000771
3 0.000309
4 0.000573
5 0.000635
6 0.000770
7 0.000713
8 0.001237
9 -0.000440
Unnamed: 0 49999.500000
dtype: float64
Fatal Python error: GC object already tracked
Abort trap: 6
4. npartitions completes, seg fault at mean
15
Segmentation fault: 11
5. Everything completes
15
0 -0.000135
1 -0.000011
10 -0.000939
2 0.000771
3 0.000309
4 0.000573
5 0.000635
6 0.000770
7 0.000713
8 0.001237
9 -0.000440
Unnamed: 0 49999.500000
dtype: float64
(100001, 100001, 100001, 100001, 100001, 100001, 100001, 100001, 100001, 100001, 100001, 100001, 100001, 100001, 100001)
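Since the outcome varies from run to run, it may help to count how often each failure mode occurs rather than estimating by eye. The following is a minimal sketch of such a harness, not part of the original report: it assumes a POSIX system and Python 3.5+ for the harness itself (the command it launches can be any interpreter), and `describe_exit` and `tally_outcomes` are hypothetical helper names. It relies on the fact that `subprocess` reports a child killed by a signal as a negative return code, so a segmentation fault shows up as SIGSEGV and an abort trap as SIGABRT.

```python
import collections
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a readable label."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative return code means the child died from a signal.
        try:
            return signal.Signals(-returncode).name
        except ValueError:
            return "signal %d" % -returncode
    return "exit %d" % returncode


def tally_outcomes(cmd, runs=20):
    """Run cmd repeatedly and count how each run ended."""
    counts = collections.Counter()
    for _ in range(runs):
        proc = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        counts[describe_exit(proc.returncode)] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Everything after the harness name is the command to repeat.
    for label, n in tally_outcomes(sys.argv[1:]).most_common():
        print(label, n)
```

Saved under a hypothetical name such as `tally.py`, it could be invoked as `python3 tally.py python2.7 repro.py`, where `repro.py` stands for the script above. The tally only distinguishes how the process ended, not which line it reached, so results 1 and 4 above would both be counted as SIGSEGV.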
I reverted pandas to 0.17.0 based on a suggestion from Matt (http://stackoverflow.com/questions/34128540/why-is-running-a-compute-in-dask-causing-fatal-python-error-gc-object-alrea), but that hasn't noticeably changed the results.
I'm happy to debug this in dask, but there is at least some chance that this could be a pandas problem. Has anyone else seen this kind of behavior? If so, were you able to find an answer?