
Conversation

@jsignell (Member)

Adding a kwarg include_path_col to read_csv:

df = dd.read_csv('data*.csv', include_path_col=True)

This kwarg can be a bool, or a string indicating the name of the column in the resulting dataframe:

df = dd.read_csv('data*.csv', include_path_col='name_of_path_col')

Example using data1.csv and data2.csv from docs:

[screenshot: example output dataframe with a path column, 2018-08-27]

In this PR I return the same paths that we pass to pd.read_csv. That seemed like the safest/easiest option, but I'm open to suggestions for other approaches.
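
For illustration only, a minimal sketch of how the new column might be used downstream (this assumes the files sit in the working directory, so the glob yields bare file names, and that the default column name is path):

import dask.dataframe as dd

# Read every matching file and tag each row with the file it came from
df = dd.read_csv('data*.csv', include_path_col=True)

# e.g. keep only the rows that originated in data1.csv
df1 = df[df.path == 'data1.csv'].compute()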

@mrocklin (Member) left a comment

This looks great, @jsignell! Thanks!

A couple of questions below for @martindurant about how best to handle the changes in dask.bytes (he tends to play in this space the most), and a small suggestion for Windows compatibility.

I'm looking forward to seeing this in.

if include_path:
    out.append((path, values))
else:
    out.append(values)
Member

cc @martindurant. What are your thoughts on this change to dask.bytes?

Member

I'd rather build up a separate list of paths, and return that as well, rather than returning a list of tuples of (path, list).
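
As a sketch of that suggested shape (the names here are illustrative, not the actual dask.bytes internals):

# Build a flat list of block lists plus a parallel list of paths,
# instead of nesting (path, values) tuples.
out = []
out_paths = []
for path, values in per_file_results:  # hypothetical (path, blocks) pairs
    out.append(values)
    out_paths.append(path)
# callers receive two same-length lists: blocks and their source paths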

def test_read_csv_include_path_col(dd_read, files):
    with filetexts(files, mode='b'):
        df = dd_read('2014-01-*.csv', include_path_col=True)
        filenames = df.path.map(lambda x: x.split('/')[-1]).compute()
Member

This will likely fail on Windows, where the separator differs. You might want to try os.path.split instead.
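
Applied to the test above, the portable version might look like:

import os

# os.path.split honors the platform separator, unlike a literal '/' split
filenames = df.path.map(lambda x: os.path.split(x)[-1]).compute()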

Whether or not to include the path with the bytes representing a particular file.
If True, ``blocks`` is a list of tuples where the first item is the path and the
second is a list of ``dask.Delayed`` objects, each of which computes to a block
of bytes from that file.
Member

We might also consider leaving blocks as-is, but adding a new list of paths that has the same length.

I suspect that @martindurant has preferences here.

Member Author

Yeah, I couldn't figure out which approach was better, so I decided to stop waffling and do something. Definitely open to suggestions and preferences :)

@jsignell (Member Author)

Added converters parsing for path_col like this:
[screenshot: converters parsing example, 2018-08-28]

@mrocklin (Member)

Also cc @TomAugspurger, just for a sanity check on the API.

@mrocklin (Member)

This failure on Windows seems genuine:

================================== FAILURES ===================================
________________________ test_read_bytes_include_path _________________________

    def test_read_bytes_include_path():
        with filetexts(files, mode='b'):
            sample, values = read_bytes('.test.accounts.*', include_path=True)
>           assert {path.split('/')[-1] for path, _ in values} == set(files.keys())
E           AssertionError: assert {'C:\\Users\\...ounts.2.json'} == {'.test.accoun...ounts.2.json'}
E             Extra items in the left set:
E             'C:\\Users\\appveyor\\AppData\\Local\\Temp\\1\\tmpnnkzwur2\\.test.accounts.1.json'
E             'C:\\Users\\appveyor\\AppData\\Local\\Temp\\1\\tmpnnkzwur2\\.test.accounts.2.json'
E             Extra items in the right set:
E             '.test.accounts.1.json'
E             '.test.accounts.2.json'
E             Use -v to get the full diff

Maybe map os.path.abspath on each side before calling set?
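
As a sketch, that suggestion applied to the failing assertion (this assumes the test files are created in the current working directory, so abspath normalizes both sides to the same prefix):

import os

# Normalize both sides to absolute paths before comparing
left = {os.path.abspath(path) for path, _ in values}
right = {os.path.abspath(k) for k in files}
assert left == right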

@TomAugspurger (Member)

Pandas has been moving keywords from col to column, so I would suggest include_path_column.

What's the dtype of the path column? We might want it to be a Categorical, to save on memory usage.

elif columns:
    df.columns = columns
if path:
    df = df.assign(**path)
Member

We should take some care to not clobber an existing column named path here.

Typically pandas will append .1, .2, etc. to column names if there are duplicates. IIRC this is controlled by mangle_dupe_cols.

Member Author

Right, but since we are doing this after the read_csv, I don't think we really need mangle_dupe_cols. We probably just need to check for path in df.columns and append .n to it until we find something that isn't in df.columns (sketched below), or throw an error and tell the user to specify the include_path_column name.
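
A hypothetical sketch of that first option (illustrative only, not the PR's code):

# Append .1, .2, ... to the requested name until it no longer collides
new_name, n = include_path_column, 0
while new_name in df.columns:
    n += 1
    new_name = '%s.%d' % (include_path_column, n)
include_path_column = new_name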

@jsignell (Member Author)

> Pandas has been moving keywords from col to column, so I would suggest include_path_column.

Sounds good.

> What's the dtype of the path column? We might want it to be a Categorical, to save on memory usage.

I'm not explicitly setting the dtype, but logically the paths are categorical, so we should probably set it.

@jsignell (Member Author)

OK, so I tried to make path categorical. I'm not entirely sure I did it right, so please take a close look :)

elif columns:
    df.columns = columns
if path:
    df = df.assign(**{path[0]: path[1]})
Member

You would avoid building an array of objects, parsing for uniques, and making the categorical by directly doing:

df.assign(**{path[0]: pd.Categorical.from_codes(np.zeros(len(df)), [path[1]])})

(yes, I know that looks ugly!)

Member Author

That's fine if that is the recommended approach. Does that work for all versions of pandas?

Member

from_codes has been around a while. @TomAugspurger?

Member Author

Is it better to use a list rather than a numpy array? I just noticed that there are no other references to numpy in the file. So maybe:

df = df.assign(**{path[0]: pd.Categorical.from_codes([0] * len(df), [path[1]])})

Member

If we want known categories (which I think we do), all filenames should be passed to all calls to read_text to be added to their categories. So the path arg above would also need to include that information, and this would look more like:

colname, path, paths = path
code = paths.index(path)
df.assign(**{colname: pd.Categorical.from_codes(np.full(len(df), code), paths)})

> Is it better to use a list rather than a numpy array? I just noticed that there are no other references to numpy in the file.

Pandas requires numpy and imports it internally, so there's no harm in importing it here.
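
Put together, a self-contained illustration of the known-categories assignment (the file names and column name are made up):

import numpy as np
import pandas as pd

paths = ['data1.csv', 'data2.csv']  # every path, shared across all partitions
path = 'data2.csv'                  # the path for this particular partition
df = pd.DataFrame({'a': [1, 2, 3]})

# Every row of this partition gets the same code; the categories are known
# and appear in the same fixed order for every partition.
code = paths.index(path)
df = df.assign(path_col=pd.Categorical.from_codes(np.full(len(df), code), paths))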

@jcrist (Member) left a comment

Overall this looks pretty good to me. Thanks, Julia!


A dictionary of keyword arguments to be passed to ``reader``
dtypes : dict
    DTypes to assign to columns
path: tuple
@jcrist (Member) Aug 29, 2018

nit: space needed before the colon for "proper" (most but not all tools seem to accept no space as well) numpydoc formatting (see https://numpydoc.readthedocs.io/en/latest/format.html)

> The colon must be preceded by a space, or omitted if the type is absent.

if include_path_column:
    head = head.assign(**{include_path_column: 'path_to_file'})
    head[include_path_column] = head[include_path_column].astype('category')
unknown_categoricals = head.select_dtypes(include=['category']).columns
Member

The categories are known though, so you'll want to do something like (untested):

# Assumes you have a list of the paths, as returned from `read_bytes`
head = head.assign(**{include_path_column: pd.Categorical.from_codes(range(len(paths)), paths)})

More information on how dask handles categories (known/unknown) is here: http://dask.pydata.org/en/latest/dataframe-design.html#categoricals

Member Author

Hmm, wouldn't that require that the length of head be at least the number of files?

Member

Oop, good catch (hence the untested :)). Should be more like:

head = head.assign(**{include_path_column: pd.Categorical.from_codes(np.zeros(len(head)), paths)})

The intent is that head includes the categoricals in the proper order for all files; it doesn't really matter what value those rows are set to. Head just needs to include the correct dtypes.

Member Author

Cool! So based on this discussion it seems like we want a list of paths from read_bytes rather than a tuple of (path, blocklist) for each file.

else:
    lineterminator = '\n'
if include_path_column:
    include_path_column = 'path' if isinstance(include_path_column, bool) else include_path_column
Member

Slight preference to write this as (shorter lines, no change if not a bool):

if isinstance(include_path_column, bool):
    include_path_column = 'path'

sample = read_block(f, 0, nbytes, delimiter)

if include_path:
    return sample, (paths, out)
@jcrist (Member) Aug 29, 2018

I'd do this just as sample, out, paths (no nesting, just additional paths output if requested). Not a strong preference though.
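
In other words, a sketch of the flat version:

if include_path:
    return sample, out, paths  # three flat outputs, no (paths, out) nesting
return sample, out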

Member Author

Yeesh, thanks! I couldn't decide :)

@jcrist (Member) left a comment

One last nit, otherwise I think this is good to go (provided tests pass).

storage_options : dict, optional
    Extra options that make sense for a particular storage connection, e.g.
    host, port, username, password, etc.
include_path_column: bool or str, optional
Member

nit: space

@jsignell (Member Author)

Woohoo, they passed! Good to merge?

# Use sample to infer dtypes and check for presence of include_path_column
head = reader(BytesIO(b_sample), **kwargs)
if include_path_column and (include_path_column in head.columns):
    raise KeyError("Files already contain the column name: %s, so the "
Member

Oop, sorry, missed this one. Why is this a KeyError? I'd expect a ValueError instead, as there's an issue with the value the user provided, and a different value could fix it.

Member Author

Whoops. I guess I was thinking column names are like keys. Fixed.
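
Presumably the fixed check looks something like this (the message text here is illustrative, since the original string is truncated above):

if include_path_column and (include_path_column in head.columns):
    raise ValueError("Files already contain the column name: %s, so the "
                     "path column cannot use this name. Please set "
                     "`include_path_column` to a unique name."
                     % include_path_column)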

@jcrist (Member) commented Aug 30, 2018

Looks good to me, will merge once tests pass. Thanks, Julia!

@jcrist jcrist merged commit 9d4b99d into dask:master Aug 30, 2018
@jsignell (Member Author)

Thanks for all the help :)

@mrocklin (Member) commented Aug 30, 2018 via email
