Skip to content

Conversation

@PGijsbers
Copy link
Collaborator

Download the Parquet file from the MinIO server, if a link to it is provided by the server in the dataset description xml (currently only on the test server).
Unlike the ARFF equivalent, checksums are not available yet, so there's no internal file integrity check.

On Friday tests were green, but there are currently authentication issues with the test server, @sahithyaravi1493 is looking into it.

Copy link
Collaborator

@mfeurer mfeurer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just had a brief look and I think this looks good and I'm really looking forward to having this live instead of arff files :)

raise

arff_file = _get_dataset_arff(description) if download_data else None
if "oml:minio_url" in description and download_data:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please explain why we potentially download both files? I guess this is easier to handle at the moment?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason is that the test server currently always returns a minio_url entry, regardless of whether or not the parquet file actually exists. I suppose you could turn it around and only download the arff file of the parquet file needed to download.
Some people also might still be interested in having the arff file (for now), and we would not have a public API for downloading that file, so in the interest of merging this before a PyPI release, I figured keeping that behavior for now is better.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. Do you think we should open an issue to track the move from arff to parquet?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a good idea: #1032.

Fixes a bug where the parquet files would simply be overwritten.
Also now only save the local files to members only if they actually
exist.
Copy link
Collaborator

@mfeurer mfeurer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good and the code looks good enough to be merged. However, it currently breaks documentation building and dataset upload, could you please ahve a look at that?

@PGijsbers
Copy link
Collaborator Author

I was working on the last errors but was having a lunch break 😓 wasn't ready for re-review yet.

@mfeurer
Copy link
Collaborator

mfeurer commented Feb 25, 2021

wasn't ready for re-review yet.

Sorry, I was too eager :)

@PGijsbers
Copy link
Collaborator Author

PGijsbers commented Mar 1, 2021

@mfeurer I can't reproduce the doc errors locally and the given errors are not informative. Are you able to reproduce the errors locally?
Appveyor seems to fail due to two tests that I think should be unix only.

@mfeurer
Copy link
Collaborator

mfeurer commented Mar 1, 2021

Are you able to reproduce the errors locally?

No, I'm not :( And it seems that the errors are parsed wrongly in that they don't show the from e. Maybe you can add a print statement and raise the original exception?

Appveyor seems to fail due to two tests that I think should be unix only.

Good question. Would it be possible that you have a look whether the tests could make sense for Windows, too?

@PGijsbers
Copy link
Collaborator Author

Maybe you can add a print statement and raise the original exception?

I suppose so, for debugging on CI at least. Hopefully it explains why from e doesn't work in this scenario (I've seen it fail locally earlier in the development, but then from e correctly showed both exceptions/tracebacks as it should).

Would it be possible that you have a look whether the tests could make sense for Windows, too?

I tried some simple stuff to make it cross-platform, but it's not working so far. Are you ok with it being skipped on Windows for now? Adding unit tests for a completely unrelated issue seems like a good candidate for a separate PR anyway.

@PGijsbers PGijsbers requested a review from mfeurer March 2, 2021 12:06
@PGijsbers PGijsbers merged commit 38f9bf0 into develop Mar 4, 2021
@PGijsbers PGijsbers deleted the parquet branch March 4, 2021 08:28
github-actions bot pushed a commit that referenced this pull request Mar 4, 2021
PGijsbers added a commit to Mirkazemi/openml-python that referenced this pull request Feb 23, 2023
* Store the minio_url from description xml

* Add minio dependency

* Add call for downloading file from minio bucket

* Allow objects to be located in directories

* add parquet equivalent of _get_dataset_arff

* Store parquet alongside arff, if available

* Deal with unknown buckets, fix path expectation

* Update test to reflect parquet file is downloaded

* Download parquet file through lazy loading

i.e. if the dataset was initially retrieved with download_data=False,
make sure to download the dataset on first get_data call.

* Load data from parquet if available

* Update (doc) strings

* Cast to signify url is str

* Make cache file path generation extension agnostic

Fixes a bug where the parquet files would simply be overwritten.
Also now only save the local files to members only if they actually
exist.

* Remove return argument

* Add clear test messages, update minio urls

* Debugging on CI with print

* Add pyarrow dependency for loading parquet

* Remove print
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants