Skip to content

Conversation

@PGijsbers
Copy link
Collaborator

@PGijsbers PGijsbers commented Jul 20, 2023

How should this PR be tested?

Execute the following code snippet for a dataset not in your cache:

import openml
DID  = 31
d = openml.datasets.get_dataset(DID)
df, *_ = d.get_data()
print(df)

on develop it currently raises (due to server issues) on get_dataset:

/Users/pietergijsbers/repositories/openml-python/openml/datasets/functions.py:437: FutureWarning: Starting from Version 0.15 `download_data`, `download_qualities`, and `download_features_meta_data` will all be ``False`` instead of ``True`` by default to enable lazy loading. To disable this message until version 0.15 explicitly set `download_data`, `download_qualities`, and `download_features_meta_data` to a bool while calling `get_dataset`.
  warnings.warn(
Traceback (most recent call last):
  File "/Users/pietergijsbers/repositories/openml-python/mwe.py", line 3, in <module>
    d = openml.datasets.get_dataset(DID)
  File "/Users/pietergijsbers/repositories/openml-python/openml/utils.py", line 403, in safe_func
    return func(*args, **kwargs)
  File "/Users/pietergijsbers/repositories/openml-python/openml/datasets/functions.py", line 497, in get_dataset
    parquet_file = _get_dataset_parquet(
  File "/Users/pietergijsbers/repositories/openml-python/openml/datasets/functions.py", line 1095, in _get_dataset_parquet
    openml._api_calls._download_minio_file(
  File "/Users/pietergijsbers/repositories/openml-python/openml/_api_calls.py", line 151, in _download_minio_file
    client.fget_object(
  File "/Users/pietergijsbers/repositories/openml-python/venv/lib/python3.10/site-packages/minio/api.py", line 1042, in fget_object
    stat = self.stat_object(
  File "/Users/pietergijsbers/repositories/openml-python/venv/lib/python3.10/site-packages/minio/api.py", line 1867, in stat_object
    response = self._execute(
  File "/Users/pietergijsbers/repositories/openml-python/venv/lib/python3.10/site-packages/minio/api.py", line 403, in _execute
    return self._url_open(
  File "/Users/pietergijsbers/repositories/openml-python/venv/lib/python3.10/site-packages/minio/api.py", line 368, in _url_open
    raise ServerError(
minio.error.ServerError: server failed with HTTP status code 503

on this branch, it instead only issues a warning and falls back to parquet arff:

/Users/pietergijsbers/repositories/openml-python/openml/datasets/functions.py:438: FutureWarning: Starting from Version 0.15 `download_data`, `download_qualities`, and `download_features_meta_data` will all be ``False`` instead of ``True`` by default to enable lazy loading. To disable this message until version 0.15 explicitly set `download_data`, `download_qualities`, and `download_features_meta_data` to a bool while calling `get_dataset`.
  warnings.warn(
WARNING:openml.datasets.functions:Could not download file from http://openml1.win.tue.nl/dataset31/dataset_31.pq: server failed with HTTP status code 503
WARNING:openml.datasets.functions:Failed to download parquet, fallback on ARFF.
    checking_status  duration                  credit_history              purpose  ...  num_dependents own_telephone foreign_worker  class
0                <0         6  critical/other existing credit             radio/tv  ...               1           yes            yes   good
1          0<=X<200        48                   existing paid             radio/tv  ...               1          none            yes    bad
2       no checking        12  critical/other existing credit            education  ...               2          none            yes   good
3                <0        42                   existing paid  furniture/equipment  ...               2          none            yes   good
4                <0        24              delayed previously              new car  ...               2          none            yes    bad
..              ...       ...                             ...                  ...  ...             ...           ...            ...    ...
995     no checking        12                   existing paid  furniture/equipment  ...               1          none            yes   good
996              <0        30                   existing paid             used car  ...               1           yes            yes   good
997     no checking        12                   existing paid             radio/tv  ...               1          none            yes   good
998              <0        45                   existing paid             radio/tv  ...               1           yes            yes    bad
999        0<=X<200        45  critical/other existing credit             used car  ...               1          none            yes   good

[1000 rows x 21 columns]

source=cast(str, url), destination=output_file_path
)
except (FileNotFoundError, urllib3.exceptions.MaxRetryError) as e:
except (FileNotFoundError, urllib3.exceptions.MaxRetryError, minio.error.ServerError) as e:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could opt to catch all exceptions here instead. It would "future proof" in case new kinds of errors are raised, though generally it is useful to know what the errors are early and evaluate if different mitigation strategies are appropriate.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we should make sure to see new error messages.

@PGijsbers
Copy link
Collaborator Author

Note that unit tests are largely unaffected by the server issues (though not entirely) as the test server does not provide parquet urls so most tests simply work on ARFF only.

@codecov-commenter
Copy link

codecov-commenter commented Jul 20, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: -0.13 ⚠️

Comparison is base (5d2128a) 85.26% compared to head (7aa87dd) 85.13%.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1272      +/-   ##
===========================================
- Coverage    85.26%   85.13%   -0.13%     
===========================================
  Files           38       38              
  Lines         5104     5107       +3     
===========================================
- Hits          4352     4348       -4     
- Misses         752      759       +7     
Impacted Files Coverage Δ
openml/datasets/functions.py 90.74% <100.00%> (+0.07%) ⬆️

... and 2 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Collaborator

@mfeurer mfeurer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. But how do we know that this will continue working when MinIO is up again?

source=cast(str, url), destination=output_file_path
)
except (FileNotFoundError, urllib3.exceptions.MaxRetryError) as e:
except (FileNotFoundError, urllib3.exceptions.MaxRetryError, minio.error.ServerError) as e:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we should make sure to see new error messages.

@PGijsbers
Copy link
Collaborator Author

Not sure what you mean, all the changes do is provide a path when ServerError is raised. If that error is not raised then the code path is not affected.

Copy link
Collaborator

@mfeurer mfeurer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. Now I get it, and this makes total sense to have.

@PGijsbers PGijsbers merged commit fb43c8f into develop Jul 20, 2023
@PGijsbers PGijsbers deleted the hotfix/minio-fallback branch July 20, 2023 19:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants