
Intermittent segfault when using dataframe.read_csv #1039

@dhirschfeld

Description


The segfault appears to happen on the compute call:

[py-dev] c:\Data\test
λ python test_dask.py
Platform:        Windows-2008ServerR2-6.1.7601-SP1
Python:  3.5.1 |Anaconda 2.4.1 (64-bit)| (default, Jan 29 2016, 15:01:46) [MSC v.1900 64 bit (AMD64)]
Fatal Python error: GC object already tracked

Thread 0x000010d4 (most recent call first):
  File "C:\Python\envs\py-dev\lib\linecache.py", line 95 in updatecache
  File "C:\Python\envs\py-dev\lib\linecache.py", line 47 in getlines
  File "C:\Python\envs\py-dev\lib\linecache.py", line 16 in getline
  File "C:\Python\envs\py-dev\lib\traceback.py", line 282 in line
  File "C:\Python\envs\py-dev\lib\traceback.py", line 358 in extract
  File "C:\Python\envs\py-dev\lib\traceback.py", line 474 in __init__
  File "C:\Python\envs\py-dev\lib\traceback.py", line 117 in format_exception
  File "C:\Python\envs\py-dev\lib\traceback.py", line 163 in format_exc
  File "C:\Python\envs\py-dev\lib\threading.py", line 924 in _bootstrap_inner
  File "C:\Python\envs\py-dev\lib\threading.py", line 882 in _bootstrap

Current thread 0x00001fd8 (most recent call first):
  File "C:\Python\envs\py-dev\lib\threading.py", line 241 in __exit__
  File "C:\Python\envs\py-dev\lib\queue.py", line 176 in get
  File "C:\Python\envs\py-dev\lib\multiprocessing\pool.py", line 376 in _handle_tasks
  File "C:\Python\envs\py-dev\lib\threading.py", line 862 in run
  File "C:\Python\envs\py-dev\lib\threading.py", line 914 in _bootstrap_inner
  File "C:\Python\envs\py-dev\lib\threading.py", line 882 in _bootstrap

Thread 0x000009b0 (most recent call first):
  File "C:\Python\envs\py-dev\lib\multiprocessing\pool.py", line 367 in _handle_workers
  File "C:\Python\envs\py-dev\lib\threading.py", line 862 in run
  File "C:\Python\envs\py-dev\lib\threading.py", line 914 in _bootstrap_inner
  File "C:\Python\envs\py-dev\lib\threading.py", line 882 in _bootstrap

Thread 0x00001348 (most recent call first):

Thread 0x00001edc (most recent call first):
  File "C:\Python\envs\py-dev\lib\threading.py", line 293 in wait

Thread 0x00001624 (most recent call first):
  File "C:\Python\envs\py-dev\lib\threading.py", line 293 in wait
  File "C:\Python\envs\py-dev\lib\queue.py", line 164 in get
  File "C:\Python\envs\py-dev\lib\site-packages\dask\async.py", line 467 in get_async
  File "C:\Python\envs\py-dev\lib\site-packages\dask\threaded.py", line 57 in get
  File "C:\Python\envs\py-dev\lib\site-packages\dask\base.py", line 110 in compute
  File "C:\Python\envs\py-dev\lib\site-packages\dask\base.py", line 37 in compute
  File "test_dask.py", line 44 in <module>

[py-dev] c:\Data\test
λ

test_dask.py

import faulthandler
faulthandler.enable()

from itertools import repeat
import sys

import dask as dsk
import dask.dataframe
from IPython import sys_info
import numpy as np
from numpy.random import randint
import pandas as pd


info = eval(sys_info())

print('Platform:\t', info['platform'])
print('Python:\t', info['sys_version'])


N = 100


dates = pd.date_range('01-Feb-2016', '29-Feb-2016')
region = np.asarray(['ABC', 'DEF', 'GHI', 'JKL'])
product = np.asarray(['A', 'B', 'C', 'D'])

header = ['DATE', 'REGION', 'MISSING', 'PRODUCT']

for date in dates:
    data = zip(
        repeat(date, N),
        region[randint(0, region.size, N)],
        repeat('', N),
        product[randint(0, product.size, N)],
    )
    with open("dummy_{:%Y%m%d}.csv".format(date), 'w') as fid:
        fid.write(','.join(header) + '\n')
        for row in data:
            fid.write(','.join(map(str, row)) + '\n')


data = dsk.dataframe.read_csv('dummy_*.csv', parse_dates=['DATE'], dayfirst=True)
max_date = data.DATE.max().compute()

print(max_date)
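Since the crash happens inside dask's threaded scheduler, a single-threaded, stdlib-only cross-check can help confirm the data itself is fine and isolate the problem to the scheduler. The sketch below is hypothetical (not from the report): it writes two tiny dummy files in the same shape and computes the max DATE without dask or any thread pool:

```python
import csv
import glob
import os
import tempfile
from datetime import date, timedelta

# Hypothetical stand-ins for the dummy_*.csv files generated above.
tmp = tempfile.mkdtemp()
header = ['DATE', 'REGION', 'MISSING', 'PRODUCT']
for i in range(2):
    d = date(2016, 2, 1) + timedelta(days=i)
    path = os.path.join(tmp, 'dummy_{:%Y%m%d}.csv'.format(d))
    with open(path, 'w', newline='') as fid:
        writer = csv.writer(fid)
        writer.writerow(header)
        writer.writerow([d.isoformat(), 'ABC', '', 'A'])

# Single-threaded max over every file: no dask, no thread pool.
# ISO-8601 date strings compare correctly lexicographically.
max_date = max(
    row['DATE']
    for path in glob.glob(os.path.join(tmp, 'dummy_*.csv'))
    for row in csv.DictReader(open(path))
)
print(max_date)  # 2016-02-02
```

If this always succeeds while the dask version crashes intermittently, that points at the threaded scheduler rather than the CSV data.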
