
Intermittent segfault when using dataframe.read_csv #1039

@dhirschfeld

Description


The segfault appears to happen on the compute call:

[py-dev] c:\Data\test
λ python test_dask.py
Platform:        Windows-2008ServerR2-6.1.7601-SP1
Python:  3.5.1 |Anaconda 2.4.1 (64-bit)| (default, Jan 29 2016, 15:01:46) [MSC v.1900 64 bit (AMD64)]
Fatal Python error: GC object already tracked

Thread 0x000010d4 (most recent call first):
  File "C:\Python\envs\py-dev\lib\linecache.py", line 95 in updatecache
  File "C:\Python\envs\py-dev\lib\linecache.py", line 47 in getlines
  File "C:\Python\envs\py-dev\lib\linecache.py", line 16 in getline
  File "C:\Python\envs\py-dev\lib\traceback.py", line 282 in line
  File "C:\Python\envs\py-dev\lib\traceback.py", line 358 in extract
  File "C:\Python\envs\py-dev\lib\traceback.py", line 474 in __init__
  File "C:\Python\envs\py-dev\lib\traceback.py", line 117 in format_exception
  File "C:\Python\envs\py-dev\lib\traceback.py", line 163 in format_exc
  File "C:\Python\envs\py-dev\lib\threading.py", line 924 in _bootstrap_inner
  File "C:\Python\envs\py-dev\lib\threading.py", line 882 in _bootstrap

Current thread 0x00001fd8 (most recent call first):
  File "C:\Python\envs\py-dev\lib\threading.py", line 241 in __exit__
  File "C:\Python\envs\py-dev\lib\queue.py", line 176 in get
  File "C:\Python\envs\py-dev\lib\multiprocessing\pool.py", line 376 in _handle_tasks
  File "C:\Python\envs\py-dev\lib\threading.py", line 862 in run
  File "C:\Python\envs\py-dev\lib\threading.py", line 914 in _bootstrap_inner
  File "C:\Python\envs\py-dev\lib\threading.py", line 882 in _bootstrap

Thread 0x000009b0 (most recent call first):
  File "C:\Python\envs\py-dev\lib\multiprocessing\pool.py", line 367 in _handle_workers
  File "C:\Python\envs\py-dev\lib\threading.py", line 862 in run
  File "C:\Python\envs\py-dev\lib\threading.py", line 914 in _bootstrap_inner
  File "C:\Python\envs\py-dev\lib\threading.py", line 882 in _bootstrap

Thread 0x00001348 (most recent call first):

Thread 0x00001edc (most recent call first):
  File "C:\Python\envs\py-dev\lib\threading.py", line 293 in wait

Thread 0x00001624 (most recent call first):
  File "C:\Python\envs\py-dev\lib\threading.py", line 293 in wait
  File "C:\Python\envs\py-dev\lib\queue.py", line 164 in get
  File "C:\Python\envs\py-dev\lib\site-packages\dask\async.py", line 467 in get_async
  File "C:\Python\envs\py-dev\lib\site-packages\dask\threaded.py", line 57 in get
  File "C:\Python\envs\py-dev\lib\site-packages\dask\base.py", line 110 in compute
  File "C:\Python\envs\py-dev\lib\site-packages\dask\base.py", line 37 in compute
  File "test_dask.py", line 44 in <module>

[py-dev] c:\Data\test
λ

test_dask.py

import faulthandler
faulthandler.enable()

from itertools import repeat
import sys

import dask as dsk
import dask.dataframe
from IPython import sys_info
import numpy as np
from numpy.random import randint
import pandas as pd


info = eval(sys_info())

print('Platform:\t', info['platform'])
print('Python:\t', info['sys_version'])


N = 100


dates = pd.date_range('01-Feb-2016', '29-Feb-2016')
region = np.asarray(['ABC', 'DEF', 'GHI', 'JKL'])
product = np.asarray(['A', 'B', 'C', 'D'])

header = ['DATE', 'REGION', 'MISSING', 'PRODUCT']

for date in dates:
    data = zip(
        repeat(date, N),
        region[randint(0, region.size, N)],
        repeat('', N),
        product[randint(0, product.size, N)],
    )
    with open("dummy_{:%Y%m%d}.csv".format(date), 'w') as fid:
        fid.write(','.join(header) + '\n')
        for row in data:
            fid.write(','.join(map(str, row)) + '\n')


data = dsk.dataframe.read_csv('dummy_*.csv', parse_dates=['DATE'], dayfirst=True)
max_date = data.DATE.max().compute()

print(max_date)
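Since the crash happens inside dask's threaded scheduler, a single-threaded, stdlib-only cross-check can help confirm the data itself is fine and isolate the problem to the scheduler. The sketch below is hypothetical (not from the report): it writes two tiny dummy files in the same shape and computes the max DATE without dask or any thread pool:

```python
import csv
import glob
import os
import tempfile
from datetime import date, timedelta

# Hypothetical stand-ins for the dummy_*.csv files generated above.
tmp = tempfile.mkdtemp()
header = ['DATE', 'REGION', 'MISSING', 'PRODUCT']
for i in range(2):
    d = date(2016, 2, 1) + timedelta(days=i)
    path = os.path.join(tmp, 'dummy_{:%Y%m%d}.csv'.format(d))
    with open(path, 'w', newline='') as fid:
        writer = csv.writer(fid)
        writer.writerow(header)
        writer.writerow([d.isoformat(), 'ABC', '', 'A'])

# Single-threaded max over every file: no dask, no thread pool.
# ISO-8601 date strings compare correctly lexicographically.
max_date = max(
    row['DATE']
    for path in glob.glob(os.path.join(tmp, 'dummy_*.csv'))
    for row in csv.DictReader(open(path))
)
print(max_date)  # 2016-02-02
```

If this always succeeds while the dask version crashes intermittently, that points at the threaded scheduler rather than the CSV data.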
