Refactor data loading/storing #1018
Merged
I stumbled on the dataset loading/caching code and found the flow very difficult to parse. There was also a lot of duplicate code. The main goal of this PR is to make the flow of loading and storing compressed data easier to read and to reduce the amount of duplicate code. In the process I streamlined the flow a little, which should lead to some performance gains (fewer redundant data loads).
Differences in behavior:
- When `_load_data` is called on a dataset which is not yet compressed (or needs to be updated), it is loaded only once instead of twice.
- The `data_pickle_file`, `data_feather_file` and `feather_attribute_file` members more accurately reflect the presence of the file. Previously the cache file members for the format that was not used would also be set, even though those files are never generated.

All unit tests passed without modification, except for `test_get_dataset_cache_format_feather`, which relied on the assumption that the data was compressed and stored on disk before the first load. I also updated the pickle test, which "tests" if the file can be loaded from the compressed format; otherwise it would now test loading from arff (the first load is from arff but stores to pickle, subsequent loads are from pickle).
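For readers who want the load/store flow described above in concrete form, here is a minimal sketch. It is not the code from this PR: `load_with_cache` and `parse_rows` are hypothetical names, and a plain comma-separated text file stands in for the arff source. It only illustrates the idea that the first load parses the source once, stores the result in pickle, and returns the already-parsed data without reading it back.

```python
import pickle
import tempfile
from pathlib import Path


def parse_rows(path):
    """Hypothetical stand-in for the slow source parser (arff in the PR)."""
    return [line.split(",") for line in Path(path).read_text().splitlines() if line]


def load_with_cache(source_file, cache_file, parse_source):
    """Return (data, origin), where origin is "cache" or "source".

    If the cache file exists it is used directly. Otherwise the source is
    parsed exactly once, written to the cache, and the parsed data is
    returned as-is instead of being re-read from the new cache file.
    """
    cache_file = Path(cache_file)
    if cache_file.exists():
        with cache_file.open("rb") as fh:
            return pickle.load(fh), "cache"

    data = parse_source(source_file)
    with cache_file.open("wb") as fh:
        pickle.dump(data, fh, protocol=pickle.HIGHEST_PROTOCOL)
    return data, "source"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "data.csv"
        source.write_text("1,2\n3,4\n")
        cache = Path(tmp) / "data.pkl"
        # First call parses the source and writes the cache.
        print(load_with_cache(source, cache, parse_rows)[1])
        # Second call reads from the cache.
        print(load_with_cache(source, cache, parse_rows)[1])
```

A test of the cached path under this sketch has to call the loader twice, which mirrors why the pickle test mentioned above needed updating.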