Parquet Support #1029
Conversation
i.e. if the dataset was initially retrieved with download_data=False, make sure to download the dataset on first get_data call.
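The lazy-download behavior described here can be sketched roughly as follows (the `Dataset` class and `_download_dataset` helper are hypothetical stand-ins for illustration, not the actual openml-python API):

```python
import os


def _download_dataset(dataset_id, target_path):
    # Hypothetical stand-in for the real download logic.
    with open(target_path, "w") as f:
        f.write(f"data for dataset {dataset_id}")


class Dataset:
    def __init__(self, dataset_id, data_file=None):
        self.dataset_id = dataset_id
        # data_file is None when the dataset was fetched with download_data=False.
        self.data_file = data_file

    def get_data(self):
        # Download lazily on the first get_data call if no local file exists yet.
        if self.data_file is None or not os.path.exists(self.data_file):
            path = f"dataset_{self.dataset_id}.dat"
            _download_dataset(self.dataset_id, path)
            self.data_file = path
        with open(self.data_file) as f:
            return f.read()
```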
mfeurer
left a comment
I just had a brief look and I think this looks good and I'm really looking forward to having this live instead of arff files :)
    arff_file = _get_dataset_arff(description) if download_data else None
    if "oml:minio_url" in description and download_data:
Could you please explain why we potentially download both files? I guess this is easier to handle at the moment?
The main reason is that the test server currently always returns a minio_url entry, regardless of whether the parquet file actually exists. I suppose you could turn it around and only download the arff file if the parquet file fails to download.
Some people might also still be interested in having the arff file (for now), and we would not have a public API for downloading that file, so in the interest of merging this before a PyPI release, I figured keeping that behavior for now is better.
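The download decision discussed above might look something like this sketch (the function name is illustrative; only the `oml:minio_url` key comes from the actual diff):

```python
def files_to_download(description, download_data):
    """Decide which data files to fetch (illustrative sketch, not the PR's code).

    Both files are fetched because the test server may advertise a
    minio_url even when no parquet file exists, and users may still
    want the arff file.
    """
    files = []
    if download_data:
        files.append("arff")
        if "oml:minio_url" in description:
            files.append("parquet")
    return files
```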
That makes sense. Do you think we should open an issue to track the move from arff to parquet?
Seems like a good idea: #1032.
Fixes a bug where the parquet files would simply be overwritten. Also, the local file paths are now only saved to members if the files actually exist.
mfeurer
left a comment
This looks good and the code looks good enough to be merged. However, it currently breaks documentation building and dataset upload, could you please have a look at that?
I was working on the last errors but was having a lunch break 😓 wasn't ready for re-review yet.
Sorry, I was too eager :)
@mfeurer I can't reproduce the doc errors locally and the given errors are not informative. Are you able to reproduce the errors locally?
No, I'm not :( And it seems that the errors are parsed wrongly in that they don't show the
Good question. Would it be possible for you to have a look at whether the tests could make sense for Windows, too?
I suppose so, for debugging on CI at least. Hopefully it explains why
I tried some simple stuff to make it cross-platform, but it's not working so far. Are you ok with it being skipped on Windows for now? Adding unit tests for a completely unrelated issue seems like a good candidate for a separate PR anyway.
* Store the minio_url from description xml
* Add minio dependency
* Add call for downloading file from minio bucket
* Allow objects to be located in directories
* Add parquet equivalent of _get_dataset_arff
* Store parquet alongside arff, if available
* Deal with unknown buckets, fix path expectation
* Update test to reflect parquet file is downloaded
* Download parquet file through lazy loading, i.e. if the dataset was initially retrieved with download_data=False, make sure to download the dataset on first get_data call
* Load data from parquet if available
* Update (doc) strings
* Cast to signify url is str
* Make cache file path generation extension agnostic; fixes a bug where the parquet files would simply be overwritten, and local files are now only saved to members if they actually exist
* Remove return argument
* Add clear test messages, update minio urls
* Debugging on CI with print
* Add pyarrow dependency for loading parquet
* Remove print
Download the Parquet file from the MinIO server if the server provides a link to it in the dataset description XML (currently only on the test server).
Unlike the ARFF equivalent, checksums are not available yet, so there's no internal file integrity check.
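Once the server does provide checksums for parquet files, the integrity check could mirror the MD5 check used for ARFF files; a minimal stdlib sketch (function name is an assumption, not this PR's code):

```python
import hashlib


def file_md5(path):
    # Compute an MD5 checksum over the file in chunks; comparing this
    # against a server-provided checksum would give the integrity check
    # that is not yet possible for parquet files.
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            md5.update(chunk)
    return md5.hexdigest()
```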
On Friday tests were green, but there are currently authentication issues with the test server, @sahithyaravi1493 is looking into it.