Add more tests for read_parquet(engine='pyarrow') #2822
Conversation
```python
    pandas_metadata = json.loads(schema.metadata[b'pandas'].decode('utf8'))
    index_col = pandas_metadata.get('index_columns', None)
else:
    index_col = index
```
Can any of this metadata wrangling be pushed down into pyarrow?
The retrieval of the index column should definitely be pushed down to pyarrow. Currently there is no such functionality on the ParquetDataset; I can add that in a future release. This PR mainly covers the parts where we can satisfy the tests without developing new features in pyarrow.
From my perspective this looks good. I suspect that @martindurant would have a more discerning eye here.

Not much to say, looks fair enough.

@martindurant No, I have not seen any multi-column index tests yet in the Parquet engine tests. We should have a look at them in a follow-up PR.
I plan to merge this tomorrow if there are no further comments. |
Merged. Thanks @xhochy ! |
I'm not happy with the lines added to `_read_arrow_parquet_piece`. We may need to add tests for `MultiIndex` in the future to ensure that they also work.

- [ ] `flake8 dask`
- [ ] `docs/source/changelog.rst` for all changes
- [ ] one of the `docs/source/*-api.rst` files for new API

cc @wesm @fjetter