
Conversation


@PGijsbers PGijsbers commented Jan 15, 2021

I stumbled on the dataset loading/caching code and found the flow very difficult to parse. Additionally, there was a lot of duplicated code. The main goal of this PR is to make the flow of loading and storing compressed data easier to follow and to reduce code duplication. In the process I streamlined the flow a little, which should yield some performance gains (fewer redundant data loads).

Differences in behavior:

  • When the OpenMLDataset is constructed, outdated pickle data is no longer proactively updated. Instead, the numpy pickle files are updated in _load_data.
  • When _load_data is called on a dataset which is not yet compressed (or needs to be updated), it is loaded only once instead of twice.
  • When the OpenMLDataset is constructed, the arff file is not immediately converted to a compressed format. Instead this happens the first time the data is required in _load_data.
  • OpenMLDataset's data_pickle_file, data_feather_file and feather_attribute_file members more accurately reflect the presence of the files. Previously, the attributes for the unused cache format were also set, even though those files are never generated.
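The lazy flow described above can be sketched as follows. This is a minimal illustration, not the actual openml-python implementation: the class body, the `_parse_arff` helper, and the file-extension choices are all hypothetical stand-ins.

```python
import os
import pickle
import tempfile

class OpenMLDataset:
    """Minimal sketch of lazy cache conversion (hypothetical names)."""

    def __init__(self, arff_file, cache_format="pickle"):
        self.arff_file = arff_file
        self.cache_format = cache_format
        base, _ = os.path.splitext(arff_file)
        # Only set the path for the cache format actually in use;
        # the other format's attribute stays None (it is never generated).
        self.data_pickle_file = base + ".pkl" if cache_format == "pickle" else None
        self.data_feather_file = base + ".feather" if cache_format == "feather" else None

    def _load_data(self):
        # Compressed copy already on disk: load it directly.
        if self.data_pickle_file and os.path.exists(self.data_pickle_file):
            with open(self.data_pickle_file, "rb") as f:
                return pickle.load(f)
        # First access: parse the arff only once, then store the compressed copy.
        data = self._parse_arff(self.arff_file)  # hypothetical parser hook
        if self.data_pickle_file:
            with open(self.data_pickle_file, "wb") as f:
                pickle.dump(data, f)
        return data

# Tiny demo with a stub parser standing in for real arff parsing:
tmp = tempfile.mkdtemp()
arff = os.path.join(tmp, "iris.arff")
open(arff, "w").close()
ds = OpenMLDataset(arff)
ds._parse_arff = lambda path: {"x": [1, 2]}  # stub for the real parser
first = ds._load_data()   # parses arff, writes the pickle
second = ds._load_data()  # now served from the pickle
```

Note that no conversion happens in `__init__`; the pickle file only appears once `_load_data` is first called, which is the behavior change the bullet list describes.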

All unit tests passed without modification, except for test_get_dataset_cache_format_feather, which relied on the assumption that the data was compressed and stored on disk before the first load. I also updated the pickle test that verifies data can be loaded from the compressed format; without that update it would now test loading from arff instead (the first load reads arff and stores a pickle; subsequent loads read from the pickle).
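The ordering the updated tests must respect (first load from arff while writing the pickle, subsequent loads from the pickle) can be shown with a small self-contained sketch; `load_with_cache` and its arguments are hypothetical, not part of the openml API.

```python
import os
import pickle
import tempfile

def load_with_cache(arff_path, cache_path, parse_arff):
    """Return (data, source): parse arff on first call, read the pickle after."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f), "pickle"
    data = parse_arff(arff_path)
    with open(cache_path, "wb") as f:
        pickle.dump(data, f)
    return data, "arff"

tmp = tempfile.mkdtemp()
arff, cache = os.path.join(tmp, "d.arff"), os.path.join(tmp, "d.pkl")
data, source = load_with_cache(arff, cache, lambda p: [1, 2, 3])
# first load comes from arff and writes the cache
data, source = load_with_cache(arff, cache, lambda p: [1, 2, 3])
# subsequent loads come from the pickle
```

A test asserting "loaded from the compressed format" must therefore trigger one load before making that assertion, which is exactly why the old pickle test needed updating.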

Commit messages:
  • There was a lot of code duplication, and the general flow of loading/storing the data in compressed format was hard to navigate.
  • Otherwise the data would actually be loaded from arff (first load).
  • My editor incorrectly renamed too many instances of 'data_file' to 'arff_file'.
@PGijsbers PGijsbers requested a review from mfeurer January 15, 2021 18:36

@mfeurer mfeurer left a comment


Looks good to me, I only have one minor change request.

@codecov-io

Codecov Report

Merging #1018 (77ab46d) into develop (fba6aab) will increase coverage by 0.45%.
The diff coverage is 79.16%.


@@             Coverage Diff             @@
##           develop    #1018      +/-   ##
===========================================
+ Coverage    87.62%   88.07%   +0.45%     
===========================================
  Files           36       36              
  Lines         4574     4563      -11     
===========================================
+ Hits          4008     4019      +11     
+ Misses         566      544      -22     
Impacted Files Coverage Δ
openml/datasets/dataset.py 87.92% <78.26%> (+3.48%) ⬆️
openml/utils.py 91.33% <100.00%> (+0.66%) ⬆️
openml/_api_calls.py 89.23% <0.00%> (-3.08%) ⬇️
openml/runs/functions.py 83.16% <0.00%> (+0.25%) ⬆️
openml/testing.py 84.52% <0.00%> (+0.59%) ⬆️
openml/datasets/functions.py 94.42% <0.00%> (+0.98%) ⬆️
openml/exceptions.py 96.77% <0.00%> (+9.67%) ⬆️


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fba6aab...77ab46d.

@mfeurer mfeurer merged commit e074c14 into develop Jan 19, 2021
@mfeurer mfeurer deleted the refactor branch January 19, 2021 13:27
