Unstable dask._Frame.map_partitions behavior #888

@sscondie

Description

Consider the following script operating on 15 CSV files, each about 16 MB in size (each CSV is an array of floats with 10 columns and 100,000 rows).

import dask.dataframe as dd
df = dd.read_csv("test_data_small/*.csv")
print(df.npartitions)
print(df.mean().compute())
print(df.map_partitions(len, columns=df.columns).compute())
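
For reference, here is a sketch that generates input data matching the description above (the directory name comes from the script; the file names and the random distribution are assumptions, and the index column is inferred from the "Unnamed: 0" column visible in the output below):

import os
import numpy as np
import pandas as pd

# Hypothetical data generator: 15 CSVs, each 100,000 rows x 10 float columns.
# to_csv writes the row index by default, which is what shows up as the
# "Unnamed: 0" column when dask reads the files back.
if not os.path.exists("test_data_small"):
    os.mkdir("test_data_small")
for i in range(15):
    data = np.random.randn(100000, 10)
    pd.DataFrame(data).to_csv("test_data_small/part-%02d.csv" % i)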

Running this script on my machine (OS X 10.11.1, Python 2.7.11 [Anaconda 2.3.0 (x86_64)], pandas 0.17.1) produces one of the following results, with roughly equal frequency and in random order. For each result I give a description followed by the stdout:

1. npartitions completes, mean completes, map_partitions seg faults

15
0                -0.000135
1                -0.000011
10               -0.000939
2                 0.000771
3                 0.000309
4                 0.000573
5                 0.000635
6                 0.000770
7                 0.000713
8                 0.001237
9                -0.000440
Unnamed: 0    49999.500000
dtype: float64
Segmentation fault: 11

2. npartitions completes, Fatal Python Error at mean

15
Fatal Python error: GC object already tracked
Abort trap: 6

3. npartitions completes, mean completes, Fatal Python Error at map_partitions

15
0                -0.000135
1                -0.000011
10               -0.000939
2                 0.000771
3                 0.000309
4                 0.000573
5                 0.000635
6                 0.000770
7                 0.000713
8                 0.001237
9                -0.000440
Unnamed: 0    49999.500000
dtype: float64
Fatal Python error: GC object already tracked
Abort trap: 6

4. npartitions completes, seg fault at mean

15
Segmentation fault: 11

5. Everything completes

15
0                -0.000135
1                -0.000011
10               -0.000939
2                 0.000771
3                 0.000309
4                 0.000573
5                 0.000635
6                 0.000770
7                 0.000713
8                 0.001237
9                -0.000440
Unnamed: 0    49999.500000
dtype: float64
(100001, 100001, 100001, 100001, 100001, 100001, 100001, 100001, 100001, 100001, 100001, 100001, 100001, 100001, 100001)

I reverted pandas back to 0.17.0 based on a suggestion from Matt here (http://stackoverflow.com/questions/34128540/why-is-running-a-compute-in-dask-causing-fatal-python-error-gc-object-alrea), but that hasn't noticeably changed the results.

I'm happy to debug this in dask, but there is at least some chance that this could be a pandas problem. Has anyone else seen this kind of behavior? If so, were you able to find an answer?
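
One way to narrow this down: errors like "Fatal Python error: GC object already tracked" usually point at non-thread-safe C code being driven from several threads at once, so rerunning the same computations on dask's single-threaded scheduler should tell us whether this is a thread-safety bug (most likely in pandas' C parser) rather than a logic error in dask or the script. A sketch, assuming the synchronous scheduler exposed as dask.async.get_sync:

import dask.dataframe as dd
from dask.async import get_sync  # single-threaded scheduler

df = dd.read_csv("test_data_small/*.csv")

# Run everything in one thread: with no concurrency, any remaining
# crash cannot be a thread-safety problem.
print(df.mean().compute(get=get_sync))
print(df.map_partitions(len, columns=df.columns).compute(get=get_sync))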
