Include filename in read_csv #3908
Conversation
mrocklin left a comment:
This looks great @jsignell ! Thanks!
A couple of questions below for @martindurant about how best to handle the changes in dask.bytes (he tends to play in this space the most), and a small suggestion for Windows compatibility.
I'm looking forward to seeing this in.
dask/bytes/core.py (Outdated)

```python
if include_path:
    out.append((path, values))
else:
    out.append(values)
```
cc @martindurant . What are your thoughts on this change to dask.bytes?
I'd rather build up a separate list of paths, and return that as well, rather than returning a list of tuples of (path, list).
dask/dataframe/io/tests/test_csv.py (Outdated)

```python
def test_read_csv_include_path_col(dd_read, files):
    with filetexts(files, mode='b'):
        df = dd_read('2014-01-*.csv', include_path_col=True)
        filenames = df.path.map(lambda x: x.split('/')[-1]).compute()
```
This will likely fail on Windows, where the path separator differs. You might want to try os.path.split instead.
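A minimal sketch of the difference (the file paths here are invented for illustration): os.path.split is aware of the platform's separator, whereas splitting on a literal '/' breaks on Windows.

```python
import os

# Hypothetical paths for illustration; os.path.join uses the platform
# separator ('\\' on Windows), so a hard-coded split('/') would fail there.
paths = [os.path.join("data", "2014-01-01.csv"),
         os.path.join("data", "2014-01-02.csv")]

# os.path.split handles the correct separator on every platform
filenames = [os.path.split(p)[-1] for p in paths]
assert filenames == ["2014-01-01.csv", "2014-01-02.csv"]
```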
dask/bytes/core.py (Outdated)

```
Whether or not to include the path with the bytes representing a particular file.
If True ``blocks`` is a list of tuples where the first item is path and the second
is a list of ``dask.Delayed`` where each delayed object computes to a
block of bytes from that file.
```
We might also consider leaving blocks as-is, but adding a new list of paths that has the same length.
I suspect that @martindurant has preferences here.
Yeah I couldn't figure out which approach was better so I decided to stop waffling and do something. Definitely open to suggestions and preferences :)
Also cc @TomAugspurger just for a sanity check on the API.
This failure on Windows seems genuine. Maybe map …
Pandas has been moving keywords from …

What's the dtype of the … ?
dask/dataframe/io/csv.py (Outdated)

```python
elif columns:
    df.columns = columns
if path:
    df = df.assign(**path)
```
We should take some care to not clobber an existing column named path here.
Typically pandas will append .1, .2, etc. to column names if there are duplicates. IIRC this is controlled by mangle_dupe_cols.
Right, but since we are doing this after the read_csv, I don't think we really need mangle_dupe_cols; we probably just need to check for path in df.columns and append .n to it until we find something that isn't in df.columns, or throw an error and tell the user to specify the include_path_column name.
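The check described here could look something like the following sketch; `unique_column_name` is a hypothetical helper written for illustration, not code from the PR.

```python
def unique_column_name(name, columns):
    # Hypothetical helper: if `name` collides with an existing column,
    # append .1, .2, ... until the result is unused (mimicking the
    # behaviour of pandas' mangle_dupe_cols).
    if name not in columns:
        return name
    i = 1
    while '%s.%d' % (name, i) in columns:
        i += 1
    return '%s.%d' % (name, i)

print(unique_column_name('path', ['a', 'b']))          # path
print(unique_column_name('path', ['path', 'path.1']))  # path.2
```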
Sounds good.

I'm not explicitly setting the dtype, but logically the paths are Categoricals. So we should probably set it.

Ok so I tried to make path categorical. I'm not entirely sure I did it right, so please take a close look :)
dask/dataframe/io/csv.py (Outdated)

```python
elif columns:
    df.columns = columns
if path:
    df = df.assign(**{path[0]: path[1]})
```
You would avoid building an array of objects, parsing for uniques and making the categorical by doing directly

```python
df.assign(**{path[0]: pd.Categorical.from_codes(np.zeros(len(df)), [path[1]])})
```

(yes, I know that looks ugly!)
That's fine if that is the recommended approach. Does that work for all versions of pandas?
from_codes has been around a while. @TomAugspurger ?
Is it better to use a list rather than a numpy array? I just noticed that there are no other references to numpy in the file. So maybe:

```python
df = df.assign(**{path[0]: pd.Categorical.from_codes([0] * len(df), [path[1]])})
```
If we want known categories (which I think we do), all filenames should be passed to all calls to read_text to be added to their categories. So the path arg above would also need to include that information, and this would look more like:

```python
colname, path, paths = path
code = paths.index(path)
df.assign(**{colname: pd.Categorical.from_codes(np.full(len(df), code), paths)})
```

> Is it better to use a list rather than a numpy array? I just noticed that there are no other references to numpy in the file.

Pandas requires numpy, and imports it internally - there's no harm in importing it here.
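A small self-contained illustration of this from_codes approach (file names invented; note that the codes must be integers, so np.full is given an int fill value):

```python
import numpy as np
import pandas as pd

paths = ['data/a.csv', 'data/b.csv']   # invented file names
path = 'data/b.csv'                    # the file this partition came from

code = paths.index(path)               # 1
# np.full with an int fill value produces an integer array, as from_codes requires
col = pd.Categorical.from_codes(np.full(4, code), categories=paths)

assert list(col.categories) == paths   # every partition knows all categories
assert (col == path).all()             # every row is labelled with this file
```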
jcrist left a comment:
Overall this looks pretty good to me. Thanks Julia!
dask/dataframe/io/csv.py (Outdated)

```
A dictionary of keyword arguments to be passed to ``reader``
dtypes : dict
    DTypes to assign to columns
path: tuple
```
nit: a space is needed before the colon for "proper" numpydoc formatting (most, but not all, tools seem to accept no space as well); see https://numpydoc.readthedocs.io/en/latest/format.html:

> The colon must be preceded by a space, or omitted if the type is absent.
dask/dataframe/io/csv.py (Outdated)

```python
if include_path_column:
    head = head.assign(**{include_path_column: 'path_to_file'})
    head[include_path_column] = head[include_path_column].astype('category')
unknown_categoricals = head.select_dtypes(include=['category']).columns
```
The categories are known though, so you'll want to do something like (untested):

```python
# Assumes you have a list of the paths, as returned from `read_bytes`
head = head.assign(**{include_path_column: pd.Categorical.from_codes(range(len(paths)), paths)})
```

More information on how dask handles categories (known/unknown) is here: http://dask.pydata.org/en/latest/dataframe-design.html#categoricals
Hmm, wouldn't that require that the length of head be at least n files?
Oop, good catch (hence the untested :)). Should be more like:

```python
head = head.assign(**{include_path_column: pd.Categorical.from_codes(np.zeros(len(head)), paths)})
```

The intent is that head includes the categoricals in the proper order for all files; it doesn't really matter what the value is set to for those rows. Head just includes the correct dtypes.
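As a runnable sketch of that idea (with invented paths): one detail worth noting is that pd.Categorical.from_codes rejects float codes, so np.zeros needs an explicit integer dtype.

```python
import numpy as np
import pandas as pd

paths = ['a.csv', 'b.csv', 'c.csv']     # invented paths
head = pd.DataFrame({'x': [1.0, 2.0]})  # stand-in for the sampled head

# from_codes requires integer codes; np.zeros defaults to float64, which raises
codes = np.zeros(len(head), dtype=int)
head = head.assign(path=pd.Categorical.from_codes(codes, paths))

# All files are registered as categories up front, so dask sees a
# known categorical even though head only covers the first file.
assert list(head['path'].cat.categories) == paths
```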
Cool! So based on this discussion it seems like we want a list of paths from read_bytes rather than a tuple of (path, blocklist) for each file.
dask/dataframe/io/csv.py (Outdated)

```python
else:
    lineterminator = '\n'
if include_path_column:
    include_path_column = 'path' if isinstance(include_path_column, bool) else include_path_column
```
Slight preference to write this as (shorter lines, no change if not a bool):

```python
if isinstance(include_path_column, bool):
    include_path_column = 'path'
```
dask/bytes/core.py
Outdated
| sample = read_block(f, 0, nbytes, delimiter) | ||
|
|
||
| if include_path: | ||
| return sample, (paths, out) |
I'd do this just as sample, out, paths (no nesting, just additional paths output if requested). Not a strong preference though.
yeesh thanks! I couldn't decide :)
jcrist left a comment:
One last nit, otherwise I think this is good to go (provided tests pass).
dask/dataframe/io/csv.py (Outdated)

```
storage_options : dict, optional
    Extra options that make sense for a particular storage connection, e.g.
    host, port, username, password, etc.
include_path_column: bool or str, optional
```
nit: space
Woohoo, they passed! Good to merge?
dask/dataframe/io/csv.py (Outdated)

```python
# Use sample to infer dtypes and check for presence of include_path_column
head = reader(BytesIO(b_sample), **kwargs)
if include_path_column and (include_path_column in head.columns):
    raise KeyError("Files already contain the column name: %s, so the "
```
Oop, sorry, missed this one. Why is this a KeyError? I'd expect a ValueError instead, as there's an issue with the value the user provided, and a different value could fix it.
Whoops. I guess I was thinking column names are like keys. Fixed.
Looks good to me, will merge once tests pass. Thanks Julia!

Thanks for all the help :)
Thanks for doing all the work :)

flake8 dask

Adding a kwarg include_path_col to read_csv: this kwarg can be a bool or a string indicating the name of the column in the resulting dataframe.

Example using data1.csv and data2.csv from docs:

In this PR I return the same paths that we pass to pd.read_csv. That seemed like the safest/easiest option, but I'm open to suggestions for other approaches.