Parquet Support #1029
Conversation
i.e. if the dataset was initially retrieved with download_data=False, make sure to download the dataset on first get_data call.
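The lazy-download behavior described here can be sketched roughly as follows (the `Dataset` class and `_download_dataset` helper are hypothetical stand-ins for illustration, not the actual openml-python API):

```python
import os


def _download_dataset(dataset_id, target_path):
    # Hypothetical stand-in for the real download logic.
    with open(target_path, "w") as f:
        f.write(f"data for dataset {dataset_id}")


class Dataset:
    def __init__(self, dataset_id, data_file=None):
        self.dataset_id = dataset_id
        # data_file is None when the dataset was fetched with download_data=False.
        self.data_file = data_file

    def get_data(self):
        # Download lazily on the first get_data call if no local file exists yet.
        if self.data_file is None or not os.path.exists(self.data_file):
            path = f"dataset_{self.dataset_id}.dat"
            _download_dataset(self.dataset_id, path)
            self.data_file = path
        with open(self.data_file) as f:
            return f.read()
```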
mfeurer
left a comment
I just had a brief look and I think this looks good and I'm really looking forward to having this live instead of arff files :)
    arff_file = _get_dataset_arff(description) if download_data else None
    if "oml:minio_url" in description and download_data:
Could you please explain why we potentially download both files? I guess this is easier to handle at the moment?
The main reason is that the test server currently always returns a minio_url entry, regardless of whether the parquet file actually exists. I suppose you could turn it around and only download the arff file if the parquet file fails to download.
Some people might also still be interested in having the arff file (for now), and we would not have a public API for downloading that file, so in the interest of merging this before a PyPI release, I figured keeping that behavior for now is better.
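The download decision discussed above might look something like this sketch (the function name is illustrative; only the `oml:minio_url` key comes from the actual diff):

```python
def files_to_download(description, download_data):
    """Decide which data files to fetch (illustrative sketch, not the PR's code).

    Both files are fetched because the test server may advertise a
    minio_url even when no parquet file exists, and users may still
    want the arff file.
    """
    files = []
    if download_data:
        files.append("arff")
        if "oml:minio_url" in description:
            files.append("parquet")
    return files
```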
That makes sense. Do you think we should open an issue to track the move from arff to parquet?
Seems like a good idea: #1032.
Fixes a bug where the parquet files would simply be overwritten. Also, the local file paths are now only saved to members if the files actually exist.
mfeurer
left a comment
This looks good and the code looks good enough to be merged. However, it currently breaks documentation building and dataset upload, could you please have a look at that?
I was working on the last errors but was having a lunch break 😓 wasn't ready for re-review yet.
Sorry, I was too eager :)
@mfeurer I can't reproduce the doc errors locally and the given errors are not informative. Are you able to reproduce the errors locally?
No, I'm not :( And it seems that the errors are parsed wrongly in that they don't show the
Good question. Would it be possible for you to have a look at whether the tests could make sense for Windows, too?
I suppose so, for debugging on CI at least. Hopefully it explains why
I tried some simple stuff to make it cross-platform, but it's not working so far. Are you ok with it being skipped on Windows for now? Adding unit tests for a completely unrelated issue seems like a good candidate for a separate PR anyway.
* Store the minio_url from description xml
* Add minio dependency
* Add call for downloading file from minio bucket
* Allow objects to be located in directories
* Add parquet equivalent of _get_dataset_arff
* Store parquet alongside arff, if available
* Deal with unknown buckets, fix path expectation
* Update test to reflect parquet file is downloaded
* Download parquet file through lazy loading, i.e. if the dataset was initially retrieved with download_data=False, make sure to download the dataset on first get_data call
* Load data from parquet if available
* Update (doc) strings
* Cast to signify url is str
* Make cache file path generation extension agnostic; fixes a bug where the parquet files would simply be overwritten, and local files are now only saved to members if they actually exist
* Remove return argument
* Add clear test messages, update minio urls
* Debugging on CI with print
* Add pyarrow dependency for loading parquet
* Remove print
Download the Parquet file from the MinIO server if the server provides a link to it in the dataset description XML (currently only on the test server).
Unlike the ARFF equivalent, checksums are not available yet, so there's no internal file integrity check.
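Once the server does provide checksums for parquet files, the integrity check could mirror the MD5 check used for ARFF files; a minimal stdlib sketch (function name is an assumption, not this PR's code):

```python
import hashlib


def file_md5(path):
    # Compute an MD5 checksum over the file in chunks; comparing this
    # against a server-provided checksum would give the integrity check
    # that is not yet possible for parquet files.
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            md5.update(chunk)
    return md5.hexdigest()
```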
On Friday tests were green, but there are currently authentication issues with the test server, @sahithyaravi1493 is looking into it.