Skip to content

Conversation

@LennartPurucker
Copy link
Contributor

@LennartPurucker LennartPurucker commented Jun 13, 2023

Related to #1034 #1081 #1132
Closes #1081 #1132

In detail, in relation to #1034, I have added a deprecation warning for get_dataset, making clear that we will default to not downloading the data in the future (starting from version 1.4, but this is just my guess for now).

In relation to #1081, I added an option to disable downloading features. Moreover, I adjusted the repr method and added semi-lazy loading for features by the use of the function OpenMLDataset.load_metadata. As part of this, I also added the option to load qualities after calling get_dataset, which seems to have been missing so far.

In relation to #1132, I added the option force_refresh_cache to force refresh the cache by deleting the existing cache initially.

Note, I have also changed the docstring of get_dataset and adjusted the parts about being threading/multiprocessing safe, which were incorrect from my previous experiences and observations w.r.t. the cache. This might open a discussion to make this threading/multiprocessing safe based on an option that only downloads the data and ignores the cache entirely. I advise against this, for now, see here for more details on this.

@LennartPurucker LennartPurucker requested a review from mfeurer June 13, 2023 13:55
@LennartPurucker LennartPurucker marked this pull request as ready for review June 13, 2023 13:56
@LennartPurucker
Copy link
Contributor Author

@mfeurer I have not found a good way to put these functionalities into a general function among all OpenML entity types. Hence, I would like your opinion on the current proposed way of doing it. Does it suffice?
Feel free to remark on additional reference points (entities) that would need such download-related treatment so that I can better understand how to make these changes more abstract.

Copy link
Collaborator

@PGijsbers PGijsbers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't have a close look, had a very small amount of time, still needs a review (not necessarily by me).

@codecov-commenter
Copy link

codecov-commenter commented Jun 13, 2023

Codecov Report

Patch coverage: 92.45% and project coverage change: -2.97 ⚠️

Comparison is base (a7f2639) 85.03% compared to head (2cb3b57) 82.06%.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1256      +/-   ##
===========================================
- Coverage    85.03%   82.06%   -2.97%     
===========================================
  Files           38       38              
  Lines         5012     5075      +63     
===========================================
- Hits          4262     4165      -97     
- Misses         750      910     +160     
Impacted Files Coverage Δ
openml/runs/functions.py 74.21% <ø> (-9.40%) ⬇️
openml/datasets/data_feature.py 70.00% <75.00%> (+0.76%) ⬆️
openml/datasets/functions.py 90.90% <88.57%> (+0.74%) ⬆️
openml/datasets/dataset.py 86.63% <94.23%> (-0.65%) ⬇️
openml/tasks/functions.py 83.95% <100.00%> (+1.89%) ⬆️
openml/utils.py 92.21% <100.00%> (+0.34%) ⬆️

... and 10 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Collaborator

@mfeurer mfeurer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, I left some comments. All in all this looks great and I'm optimistic that we can merge this soonish.

Just a side note: we'd also need to update the task loading to disable dataset loading by default.

@LennartPurucker
Copy link
Contributor Author

I have also adjusted task loading. This got a bit hacky so that it is backward compatible and supports all arguments by passing kwargs. I am not sure that this is a good solution. Hence, I would be happy for feedback on how to handle the arguments best.

Moreover, I added actually lazy loading based on properties such that the user does not have to call load_metadata by hand. This is now more in line with how get_data functions as far as I understand it.

Copy link
Collaborator

@mfeurer mfeurer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good now, I only have a few minor comments, but besides that I'm pretty happy about this.

I only have two somewhat larger requests:

  1. Could you please change the deprecations to 0.15? We will make the deprecations appear in 0.14 with the next large release.
  2. Could you please add a TODO(0.15) in the code so that we can easily find the changes we need to do in the next version?

Copy link
Collaborator

@mfeurer mfeurer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 more change, then we can merge this one!

Co-authored-by: Matthias Feurer <[email protected]>
@LennartPurucker LennartPurucker merged commit 91b4bf0 into develop Jun 15, 2023
@LennartPurucker LennartPurucker deleted the download_updates branch June 15, 2023 14:37
Copy link
Collaborator

@PGijsbers PGijsbers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't look too much at the correctness of the code (since Matthias already had a look), but I think there are several things that could be improved from a code quality point of view. Please have a look at the comments.

LennartPurucker added a commit that referenced this pull request Jun 15, 2023
@LennartPurucker LennartPurucker restored the download_updates branch June 15, 2023 14:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make downloading dataset features optional

5 participants