Closed
Description
What happened:
Reading a CSV with usecols=columns.keys() while using the distributed scheduler raises a serialization error, and the computation then hangs.
distributed.protocol.core - CRITICAL - Failed to Serialize
Traceback (most recent call last):
File "/opt/conda/envs/default/lib/python3.7/site-packages/distributed/protocol/core.py", line 40, in dumps
for key, value in data.items()
File "/opt/conda/envs/default/lib/python3.7/site-packages/distributed/protocol/core.py", line 41, in <dictcomp>
if type(value) is Serialize
File "/opt/conda/envs/default/lib/python3.7/site-packages/distributed/protocol/serialize.py", line 245, in serialize
raise TypeError(msg, str(x)[:10000])
TypeError: ('Could not serialize object of type tuple.', "(<function check_meta at 0x7f4dbc0f1710>, (<function apply at 0x7f4de0e80cb0>, <function pandas_read_text at 0x7f4db9fd70e0>, [<function _make_parser_function.<locals>.parser_f at 0x7f4dbc7c0050>, (<function read_block_from_file at 0x7f4db9fdd7a0>, <OpenFile '/tmp/tmpgou80l9m/31fae6b6-ad97-11ea-b877-fab4a0258d35.csv'>, 0, 64000000, b'\\n'), b'a,b,c\\n', (<class 'dict'>, [['usecols', dict_keys(['a', 'c'])]]), (<class 'dict'>, [['a', dtype('int64')], ['c', dtype('int64')]]), ['a', 'c']], (<class 'dict'>, [['write_header', False], ['enforce', False], ['path', None]])), Empty DataFrame\nColumns: [a, c]\nIndex: [], 'from_delayed')")
What you expected to happen:
Expected read_csv to function as it does without distributed.
Minimal Complete Verifiable Example:
from pathlib import Path
from tempfile import TemporaryDirectory
from uuid import uuid1
import dask.dataframe as dd
from dask.distributed import Client
client = Client()
tmp = TemporaryDirectory()
tmp_path = Path(tmp.name)
tmp_file = tmp_path / f"{uuid1()}.csv"
content = "a,b,c\n1,2,3"
tmp_file.write_text(content)
columns = {"a": "object", "c": "object"}
dd.read_csv(tmp_file, usecols=columns.keys()).compute()
Anything else we need to know?:
The issue seems to be related to using a dict_keys object with usecols. Wrapping in a list works.
This has previously been noted in this comment: #2597 (comment)
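A minimal sketch of what I suspect is going on, assuming the distributed scheduler ultimately falls back to pickle-style serialization for the task graph (I have not confirmed this in the distributed source). It uses only the standard library: a dict_keys view cannot be pickled, while a list built from the same keys can. The helper name is_picklable is made up for illustration.

```python
import pickle

columns = {"a": "object", "c": "object"}


def is_picklable(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
    except (TypeError, pickle.PicklingError):
        return False
    return True


keys_view = columns.keys()   # dict_keys view, as passed to usecols above
keys_list = list(columns)    # plain list holding the same column names

view_ok = is_picklable(keys_view)
list_ok = is_picklable(keys_list)
print("dict_keys picklable:", view_ok)
print("list picklable:", list_ok)
```

In the example above, the workaround is to pass usecols=list(columns.keys()) (or simply list(columns)) to dd.read_csv.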
I am looking for opportunities to contribute, so I am happy to assist with some guidance.
Environment:
- Dask version: 2.16, 2.18.1
- Python version: 3.7.6
- Operating System: Debian GNU/Linux 10 (buster)
- Install method (conda, pip, source): conda