Unstable dask._Frame.map_partitions behavior #888

@sscondie

Description

Consider the following script operating on 15 CSV files, each about 16 MB in size (each CSV is an array of floats with 10 columns and 100,000 rows).

import dask.dataframe as dd
df = dd.read_csv("test_data_small/*.csv")
print(df.npartitions)
print(df.mean().compute())
print(df.map_partitions(len, columns=df.columns).compute())
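
For reference, here is a sketch that generates input data matching the description above (the directory name comes from the script; the file names and the random distribution are assumptions, and the index column is inferred from the "Unnamed: 0" column visible in the output below):

import os
import numpy as np
import pandas as pd

# Hypothetical data generator: 15 CSVs, each 100,000 rows x 10 float columns.
# to_csv writes the row index by default, which is what shows up as the
# "Unnamed: 0" column when dask reads the files back.
if not os.path.exists("test_data_small"):
    os.mkdir("test_data_small")
for i in range(15):
    data = np.random.randn(100000, 10)
    pd.DataFrame(data).to_csv("test_data_small/part-%02d.csv" % i)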

Running this script on my machine (OS X 10.11.1, Python 2.7.11 [Anaconda 2.3.0 (x86_64)], pandas 0.17.1) produces one of the following results, with roughly equal frequency and in random order. For each result I give a description followed by the stdout:

1. npartitions completes, mean completes, map_partitions seg faults

15
0                -0.000135
1                -0.000011
10               -0.000939
2                 0.000771
3                 0.000309
4                 0.000573
5                 0.000635
6                 0.000770
7                 0.000713
8                 0.001237
9                -0.000440
Unnamed: 0    49999.500000
dtype: float64
Segmentation fault: 11

2. npartitions completes, Fatal Python Error at mean

15
Fatal Python error: GC object already tracked
Abort trap: 6

3. npartitions completes, mean completes, Fatal Python Error at map_partitions

15
0                -0.000135
1                -0.000011
10               -0.000939
2                 0.000771
3                 0.000309
4                 0.000573
5                 0.000635
6                 0.000770
7                 0.000713
8                 0.001237
9                -0.000440
Unnamed: 0    49999.500000
dtype: float64
Fatal Python error: GC object already tracked
Abort trap: 6

4. npartitions completes, seg fault at mean

15
Segmentation fault: 11

5. Everything completes

15
0                -0.000135
1                -0.000011
10               -0.000939
2                 0.000771
3                 0.000309
4                 0.000573
5                 0.000635
6                 0.000770
7                 0.000713
8                 0.001237
9                -0.000440
Unnamed: 0    49999.500000
dtype: float64
(100001, 100001, 100001, 100001, 100001, 100001, 100001, 100001, 100001, 100001, 100001, 100001, 100001, 100001, 100001)

I reverted pandas back to 0.17.0 based on a suggestion from Matt here (http://stackoverflow.com/questions/34128540/why-is-running-a-compute-in-dask-causing-fatal-python-error-gc-object-alrea), but that hasn't noticeably changed the results.

I'm happy to debug this in dask, but there is at least some chance that this could be a pandas problem. Has anyone else seen this kind of behavior? If so, were you able to find an answer?
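
One way to narrow this down: errors like "Fatal Python error: GC object already tracked" usually point at non-thread-safe C code being driven from several threads at once, so rerunning the same computations on dask's single-threaded scheduler should tell us whether this is a thread-safety bug (most likely in pandas' C parser) rather than a logic error in dask or the script. A sketch, assuming the synchronous scheduler exposed as dask.async.get_sync:

import dask.dataframe as dd
from dask.async import get_sync  # single-threaded scheduler

df = dd.read_csv("test_data_small/*.csv")

# Run everything in one thread: with no concurrency, any remaining
# crash cannot be a thread-safety problem.
print(df.mean().compute(get=get_sync))
print(df.map_partitions(len, columns=df.columns).compute(get=get_sync))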
