Description
When using a pre-built cache in a remote computing environment without internet access, OpenML fails while checking for (potentially non-existent) Parquet files because it cannot reach the server. The error propagates through openml.datasets.get_dataset(), which treats the entire cache as corrupt and deletes it. As a result, the existing cache is lost, so later code that would otherwise have worked perfectly fine by relying on the cached ARFF file fails as well.
Steps/Code to Reproduce
To avoid confusion, I will refer to the node with internet access as "local" and the node without internet access as "remote". If you look at the code of get_dataset() here, it tries to download the ARFF file, followed by the Parquet file, and only then sets a flag to indicate that the cache should be retained. The workflow on local is therefore:
- The ARFF file is downloaded,
- The Parquet file is NOT downloaded (either the download fails silently or no Parquet file exists for this dataset),
- The cache deletion flag is flipped and the cache is retained.
However, when I copy the cache over to remote and try accessing the dataset again, the workflow changes since remote does not have internet access. In this case, the following workflow occurs:
- The ARFF file cache is successfully read,
- The Parquet file cache is checked and not found,
- OpenML tries to access the internet to download the Parquet file and fails,
- This particular error is not handled appropriately and therefore cascades down to the `finally` block without resetting the cache deletion flag, so the entire cache gets deleted, including the perfectly valid ARFF file (a toy reproduction of this pattern follows below).
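To make the failure mode concrete, here is a runnable toy illustrating the pattern. This is not the actual OpenML source; all names are made up and a dict stands in for the on-disk cache. It shows how a `finally` block guarded by a flag wipes the cache whenever an unhandled error skips the line that resets the flag:

```python
# Toy reproduction of the suspected pattern; names are hypothetical
# and the dict stands in for the on-disk cache.
def fetch_with_cache(cache):
    remove_cache = True
    try:
        arff = cache["arff"]          # cache hit, works offline
        if "parquet" not in cache:    # no cached Parquet file...
            raise ConnectionError("cannot download Parquet: no internet")
        remove_cache = False          # never reached on remote
    finally:
        if remove_cache:
            cache.clear()             # the valid ARFF entry is lost too

cache = {"arff": "<arff contents>"}
try:
    fetch_with_cache(cache)
except ConnectionError:
    pass
print(cache)  # {} -- the whole cache is gone
```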
Thus, the later sections of the program that would otherwise have run using the cached ARFF file also fail. As a minimal working example, the following code was run on both local and remote:
```python
import openml

d = openml.datasets.get_dataset(4135)
```

The cache from local was copied to remote before executing the above code on remote and verified to be visible to openml.config.
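One way to verify this, assuming openml.config.cache_directory holds the cache root and the cache uses the org/openml/www/datasets/&lt;id&gt; layout (as it did on my setup):

```python
import os
import openml

# Assumes openml.config.cache_directory holds the cache root and the
# cache uses the org/openml/www/datasets/<id> layout.
dataset_dir = os.path.join(
    openml.config.cache_directory, "org", "openml", "www", "datasets", "4135"
)
print(os.path.exists(dataset_dir))
if os.path.isdir(dataset_dir):
    print(os.listdir(dataset_dir))  # should include the cached ARFF file
```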
Expected Results
The dataset should be successfully read on remote from the ARFF file even if the Parquet file is missing.
Actual Results
The connection error on remote (which is expected, since remote has no internet access) is not handled properly and causes the entire cache to be deleted.
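For what it is worth, a guard along these lines around the Parquet download would avoid the cascade. This is only a sketch: _download_parquet is a hypothetical stand-in for whatever helper get_dataset() actually calls, and the flag name is made up.

```python
import requests

def _download_parquet(dataset_id):
    """Hypothetical stand-in for the real helper; on remote it fails."""
    raise requests.ConnectionError("no internet access on this node")

dataset_id = 4135
remove_dataset_cache = True
try:
    parquet_file = _download_parquet(dataset_id)
except requests.ConnectionError:
    # Parquet is optional: fall back to the cached ARFF file instead of
    # letting the error propagate into the cache-deleting finally block.
    parquet_file = None
remove_dataset_cache = False  # the cache is still valid without Parquet
print(parquet_file, remove_dataset_cache)
```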
Versions
Linux-4.4.180-94.100-default-x86_64-with-SuSE-12-x86_64
Python 3.7.10 (default, Feb 26 2021, 18:47:35)
[GCC 7.3.0]
NumPy 1.19.4
SciPy 1.5.3
Scikit-Learn 0.23.2
OpenML 0.12.1
P.S.: The OpenML docs have a broken link for how to set up the apikey parameter in the config file. This threw me for a loop, since I initially thought the source of the error was an authentication failure while downloading the Parquet file, but I am not sure it warrants a whole issue of its own.
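For anyone who hits the same broken link: the key can also be set programmatically, which sidesteps the config file entirely (assuming openml.config.apikey is still the attribute name, as it is on 0.12.1):

```python
import openml

# Programmatic alternative to the config file; assumes openml.config.apikey
# is the attribute name (it is on 0.12.1). Replace the placeholder below.
openml.config.apikey = "YOUR_OPENML_API_KEY"
```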