Add more tests for read_parquet(engine='pyarrow') #2822
Conversation
```python
    pandas_metadata = json.loads(schema.metadata[b'pandas'].decode('utf8'))
    index_col = pandas_metadata.get('index_columns', None)
else:
    index_col = index
```
Can any of this metadata wrangling be pushed down into pyarrow?
The retrieval of the index column should definitely be pushed down to pyarrow. Currently there is no such functionality on the ParquetDataset; I can add that in a future release. This PR mainly covers the parts where we can satisfy the tests without developing new features in pyarrow.
From my perspective this looks good. I suspect that @martindurant would have a more discerning eye here.

Not much to say, looks fair enough.

@martindurant No, I have not seen any multi-column index tests yet in the Parquet engine tests. We should have a look at them in a follow-up PR.
I plan to merge this tomorrow if there are no further comments. |
Merged. Thanks @xhochy ! |
I'm not happy with the lines added to `_read_arrow_parquet_piece`. We may need to add tests for `MultiIndex` in the future to ensure that they also work.

- [ ] `flake8 dask`
- [ ] `docs/source/changelog.rst` for all changes
- [ ] one of the `docs/source/*-api.rst` files for new API

cc @wesm @fjetter