
Conversation

@jsignell (Member)

Adding a kwarg include_path_col to read_csv:

df = dd.read_csv('data*.csv', include_path_col=True)

This kwarg can be a bool, or a string indicating the name of the column in the resulting dataframe:

df = dd.read_csv('data*.csv', include_path_col='name_of_path_col')

Example using data1.csv and data2.csv from docs:

[screenshot: example output dataframe with a path column, 2018-08-27]

In this PR I return the same paths that we pass to pd.read_csv. That seemed like the safest/easiest option, but I'm open to suggestions for other approaches.
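
For illustration only, a minimal sketch of how the new column might be used downstream (this assumes the files sit in the working directory, so the glob yields bare file names, and that the default column name is path):

import dask.dataframe as dd

# Read every matching file and tag each row with the file it came from
df = dd.read_csv('data*.csv', include_path_col=True)

# e.g. keep only the rows that originated in data1.csv
df1 = df[df.path == 'data1.csv'].compute()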

@mrocklin (Member) left a comment

This looks great, @jsignell! Thanks!

A couple of questions below for @martindurant about how best to handle the changes in dask.bytes (he tends to play in this space the most), and a small suggestion for Windows compatibility.

I'm looking forward to seeing this in.

if include_path:
    out.append((path, values))
else:
    out.append(values)
Member

cc @martindurant. What are your thoughts on this change to dask.bytes?

Member

I'd rather build up a separate list of paths, and return that as well, rather than returning a list of tuples of (path, list).
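
As a sketch of that suggested shape (the names here are illustrative, not the actual dask.bytes internals):

# Build a flat list of block lists plus a parallel list of paths,
# instead of nesting (path, values) tuples.
out = []
out_paths = []
for path, values in per_file_results:  # hypothetical (path, blocks) pairs
    out.append(values)
    out_paths.append(path)
# callers receive two same-length lists: blocks and their source paths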

def test_read_csv_include_path_col(dd_read, files):
    with filetexts(files, mode='b'):
        df = dd_read('2014-01-*.csv', include_path_col=True)
        filenames = df.path.map(lambda x: x.split('/')[-1]).compute()
Member

This will likely fail on Windows, where the separator differs. You might want to try os.path.split instead.
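
Applied to the test above, the portable version might look like:

import os

# os.path.split honors the platform separator, unlike a literal '/' split
filenames = df.path.map(lambda x: os.path.split(x)[-1]).compute()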

Whether or not to include the path with the bytes representing a particular file.
If True, ``blocks`` is a list of tuples where the first item is the path and the
second is a list of ``dask.Delayed`` objects, each of which computes to a block
of bytes from that file.
Member

We might also consider leaving blocks as-is, but adding a new list of paths that has the same length.

I suspect that @martindurant has preferences here.

Member Author

Yeah, I couldn't figure out which approach was better, so I decided to stop waffling and do something. Definitely open to suggestions and preferences :)

@jsignell (Member Author)

Added converters parsing for path_col like this:
[screenshot: converters parsing example, 2018-08-28]

@mrocklin (Member)

Also cc @TomAugspurger, just for a sanity check on the API.

@mrocklin (Member)

This failure on Windows seems genuine:

================================== FAILURES ===================================
________________________ test_read_bytes_include_path _________________________

    def test_read_bytes_include_path():
        with filetexts(files, mode='b'):
            sample, values = read_bytes('.test.accounts.*', include_path=True)
>           assert {path.split('/')[-1] for path, _ in values} == set(files.keys())
E           AssertionError: assert {'C:\\Users\\...ounts.2.json'} == {'.test.accoun...ounts.2.json'}
E             Extra items in the left set:
E             'C:\\Users\\appveyor\\AppData\\Local\\Temp\\1\\tmpnnkzwur2\\.test.accounts.1.json'
E             'C:\\Users\\appveyor\\AppData\\Local\\Temp\\1\\tmpnnkzwur2\\.test.accounts.2.json'
E             Extra items in the right set:
E             '.test.accounts.1.json'
E             '.test.accounts.2.json'
E             Use -v to get the full diff

Maybe map os.path.abspath on each side before calling set?
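
As a sketch, that suggestion applied to the failing assertion (this assumes the test files are created in the current working directory, so abspath normalizes both sides to the same prefix):

import os

# Normalize both sides to absolute paths before comparing
left = {os.path.abspath(path) for path, _ in values}
right = {os.path.abspath(k) for k in files}
assert left == right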

@TomAugspurger (Member)

Pandas has been moving keywords from col to column, so I would suggest include_path_column.

What's the dtype of the path column? We might want it to be a Categorical, to save on memory usage.

elif columns:
    df.columns = columns
if path:
    df = df.assign(**path)
Member

We should take some care to not clobber an existing column named path here.

Typically pandas will append .1, .2, etc. to column names if there are duplicates. IIRC this is controlled by mangle_dupe_cols.

Member Author

Right, but since we are doing this after the read_csv, I don't think we really need mangle_dupe_cols. We probably just need to check for path in df.columns and append .n to it until we find something that isn't in df.columns (sketched below), or throw an error and tell the user to specify the include_path_column name.
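
A hypothetical sketch of that first option (illustrative only, not the PR's code):

# Append .1, .2, ... to the requested name until it no longer collides
new_name, n = include_path_column, 0
while new_name in df.columns:
    n += 1
    new_name = '%s.%d' % (include_path_column, n)
include_path_column = new_name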

@jsignell (Member Author)

> Pandas has been moving keywords from col to column, so I would suggest include_path_column.

Sounds good.

> What's the dtype of the path column? We might want it to be a Categorical, to save on memory usage.

I'm not explicitly setting the dtype, but logically the paths are categorical, so we should probably set it.

@jsignell (Member Author)

OK, so I tried to make path categorical. I'm not entirely sure I did it right, so please take a close look :)

elif columns:
    df.columns = columns
if path:
    df = df.assign(**{path[0]: path[1]})
Member

You would avoid building an array of objects, parsing for uniques, and making the categorical by directly doing:

df.assign(**{path[0]: pd.Categorical.from_codes(np.zeros(len(df)), [path[1]])})

(yes, I know that looks ugly!)

Member Author

That's fine if that is the recommended approach. Does that work for all versions of pandas?

Member

from_codes has been around a while. @TomAugspurger?

Member Author

Is it better to use a list rather than a numpy array? I just noticed that there are no other references to numpy in the file. So maybe:

df = df.assign(**{path[0]: pd.Categorical.from_codes([0] * len(df), [path[1]])})

Member

If we want known categories (which I think we do), all filenames should be passed to all calls to read_text to be added to their categories. So the path arg above would also need to include that information, and this would look more like:

colname, path, paths = path
code = paths.index(path)
df.assign(**{colname: pd.Categorical.from_codes(np.full(len(df), code), paths)})

> Is it better to use a list rather than a numpy array? I just noticed that there are no other references to numpy in the file.

Pandas requires numpy and imports it internally, so there's no harm in importing it here.
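
Put together, a self-contained illustration of the known-categories assignment (the file names and column name are made up):

import numpy as np
import pandas as pd

paths = ['data1.csv', 'data2.csv']  # every path, shared across all partitions
path = 'data2.csv'                  # the path for this particular partition
df = pd.DataFrame({'a': [1, 2, 3]})

# Every row of this partition gets the same code; the categories are known
# and appear in the same fixed order for every partition.
code = paths.index(path)
df = df.assign(path_col=pd.Categorical.from_codes(np.full(len(df), code), paths))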

@jcrist (Member) left a comment

Overall this looks pretty good to me. Thanks, Julia!


A dictionary of keyword arguments to be passed to ``reader``
dtypes : dict
    DTypes to assign to columns
path: tuple
@jcrist (Member) Aug 29, 2018

nit: space needed before the colon for "proper" (most but not all tools seem to accept no space as well) numpydoc formatting (see https://numpydoc.readthedocs.io/en/latest/format.html)

> The colon must be preceded by a space, or omitted if the type is absent.

if include_path_column:
    head = head.assign(**{include_path_column: 'path_to_file'})
    head[include_path_column] = head[include_path_column].astype('category')
unknown_categoricals = head.select_dtypes(include=['category']).columns
Member

The categories are known though, so you'll want to do something like (untested):

# Assumes you have a list of the paths, as returned from `read_bytes`
head = head.assign(**{include_path_column: pd.Categorical.from_codes(range(len(paths)), paths)})

More information on how dask handles categories (known/unknown) is here: http://dask.pydata.org/en/latest/dataframe-design.html#categoricals

Member Author

Hmm, wouldn't that require that the length of head be at least the number of files?

Member

Oop, good catch (hence the untested :)). Should be more like:

head = head.assign(**{include_path_column: pd.Categorical.from_codes(np.zeros(len(head)), paths)})

The intent is that head includes the categoricals in the proper order for all files; it doesn't really matter what value those rows are set to. Head just needs to include the correct dtypes.

Member Author

Cool! So based on this discussion it seems like we want a list of paths from read_bytes rather than a tuple of (path, blocklist) for each file.

else:
    lineterminator = '\n'
if include_path_column:
    include_path_column = 'path' if isinstance(include_path_column, bool) else include_path_column
Member

Slight preference to write this as (shorter lines, no change if not a bool):

if isinstance(include_path_column, bool):
    include_path_column = 'path'

sample = read_block(f, 0, nbytes, delimiter)

if include_path:
    return sample, (paths, out)
@jcrist (Member) Aug 29, 2018

I'd do this just as sample, out, paths (no nesting, just additional paths output if requested). Not a strong preference though.
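
In other words, a sketch of the flat version:

if include_path:
    return sample, out, paths  # three flat outputs, no (paths, out) nesting
return sample, out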

Member Author

Yeesh, thanks! I couldn't decide :)

@jcrist (Member) left a comment

One last nit, otherwise I think this is good to go (provided tests pass).

storage_options : dict, optional
    Extra options that make sense for a particular storage connection, e.g.
    host, port, username, password, etc.
include_path_column: bool or str, optional
Member

nit: space

@jsignell (Member Author)

Woohoo, they passed! Good to merge?

# Use sample to infer dtypes and check for presence of include_path_column
head = reader(BytesIO(b_sample), **kwargs)
if include_path_column and (include_path_column in head.columns):
    raise KeyError("Files already contain the column name: %s, so the "
Member

Oop, sorry, missed this one. Why is this a KeyError? I'd expect a ValueError instead, as there's an issue with the value the user provided, and a different value could fix it.

Member Author

Whoops. I guess I was thinking column names are like keys. Fixed.
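
Presumably the fixed check looks something like this (the message text here is illustrative, since the original string is truncated above):

if include_path_column and (include_path_column in head.columns):
    raise ValueError("Files already contain the column name: %s, so the "
                     "path column cannot use this name. Please set "
                     "`include_path_column` to a unique name."
                     % include_path_column)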

@jcrist (Member) commented Aug 30, 2018

Looks good to me, will merge once tests pass. Thanks, Julia!

@jcrist jcrist merged commit 9d4b99d into dask:master Aug 30, 2018
@jsignell (Member Author)

Thanks for all the help :)

@mrocklin (Member) commented Aug 30, 2018 via email
