Fail to serialize dict_keys argument to read_csv #3893

@brl0

What happened:
Reading a CSV with usecols=columns.keys() while using the distributed scheduler fails with a serialization error, and the call then hangs.

distributed.protocol.core - CRITICAL - Failed to Serialize
Traceback (most recent call last):
  File "/opt/conda/envs/default/lib/python3.7/site-packages/distributed/protocol/core.py", line 40, in dumps
    for key, value in data.items()
  File "/opt/conda/envs/default/lib/python3.7/site-packages/distributed/protocol/core.py", line 41, in <dictcomp>
    if type(value) is Serialize
  File "/opt/conda/envs/default/lib/python3.7/site-packages/distributed/protocol/serialize.py", line 245, in serialize
    raise TypeError(msg, str(x)[:10000])
TypeError: ('Could not serialize object of type tuple.', "(<function check_meta at 0x7f4dbc0f1710>, (<function apply at 0x7f4de0e80cb0>, <function pandas_read_text at 0x7f4db9fd70e0>, [<function _make_parser_function.<locals>.parser_f at 0x7f4dbc7c0050>, (<function read_block_from_file at 0x7f4db9fdd7a0>, <OpenFile '/tmp/tmpgou80l9m/31fae6b6-ad97-11ea-b877-fab4a0258d35.csv'>, 0, 64000000, b'\\n'), b'a,b,c\\n', (<class 'dict'>, [['usecols', dict_keys(['a', 'c'])]]), (<class 'dict'>, [['a', dtype('int64')], ['c', dtype('int64')]]), ['a', 'c']], (<class 'dict'>, [['write_header', False], ['enforce', False], ['path', None]])), Empty DataFrame\nColumns: [a, c]\nIndex: [], 'from_delayed')")

What you expected to happen:
read_csv should work the same way it does without the distributed scheduler.

Minimal Complete Verifiable Example:

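from pathlib import Path
from tempfile import TemporaryDirectory
from uuid import uuid1

import dask.dataframe as dd
from dask.distributed import Client

# Start a local distributed cluster; the failure only appears with this scheduler.
client = Client()

# Write a small three-column CSV to a temporary directory.
tmp = TemporaryDirectory()
tmp_path = Path(tmp.name)
tmp_file = tmp_path / f"{uuid1()}.csv"
content = "a,b,c\n1,2,3"
tmp_file.write_text(content)

# Passing the dict_keys view directly as usecols triggers the serialization error.
columns = {"a": "object", "c": "object"}
dd.read_csv(tmp_file, usecols=columns.keys()).compute()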

Anything else we need to know?:
The issue seems to be related to using a dict_keys object with usecols. Wrapping in a list works.
This has previously been noted in this comment: #2597 (comment)
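
To illustrate the difference between the two forms, here is a minimal sketch that uses only the standard-library pickle module. It assumes the failure comes down to dict_keys views not being picklable, and it does not exercise distributed's own serialization path. The is_picklable helper is a hypothetical name introduced here for illustration.

```python
import pickle

columns = {"a": "object", "c": "object"}


def is_picklable(obj):
    """Return True if the standard-library pickler can serialize obj."""
    try:
        pickle.dumps(obj)
    except (TypeError, pickle.PicklingError):
        return False
    return True


keys_view = columns.keys()   # dict_keys view, as passed to usecols in the example
keys_list = list(columns)    # plain list holding the same column names

print(is_picklable(keys_view))  # False
print(is_picklable(keys_list))  # True
```

In the example above, the corresponding workaround is dd.read_csv(tmp_file, usecols=list(columns)), which hands the scheduler a plain list instead of a view.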

I am looking for opportunities to contribute, so I am happy to assist with some guidance.

Environment:

  • Dask version: 2.16, 2.18.1
  • Python version: 3.7.6
  • Operating System: Debian GNU/Linux 10 (buster)
  • Install method (conda, pip, source): conda
