
Cached datasets unusable without internet access due to missing Parquet files #1084

@NeoChaos12

Description

When trying to use a pre-built cache in a remote computing environment without internet access, OpenML fails because it attempts to reach the internet while checking for (potentially) non-existent Parquet files. The resulting error cascades through openml.datasets.get_dataset(), which marks the entire cache as corrupt and deletes it. As a result, the existing cache is lost, and later code that would otherwise have worked perfectly fine by relying on the cached ARFF file fails as well.

Steps/Code to Reproduce

To avoid confusion, I will refer to the node with internet access as "local" and the node without internet access as "remote". If you look at the code of get_dataset() here, it tries to download the ARFF file, then the Parquet file, and then sets a flag to indicate that the cache should be retained (a simplified sketch of this control flow follows the list below). The workflow on local is therefore this:

  1. The ARFF file is downloaded,
  2. The Parquet file is NOT downloaded, either because the download fails silently or because no Parquet file exists for this dataset,
  3. The cache-deletion flag is cleared and the cache is retained.
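
For context, the pattern described above looks roughly like the sketch below. This is a minimal, self-contained illustration of the described control flow, not OpenML's actual implementation; the function name, file names, and paths are all placeholders.

from pathlib import Path
import shutil

def get_dataset_sketch(cache_dir: Path, online: bool):
    """Illustrates the flag-in-finally pattern described above (not OpenML's real code)."""
    remove_cache = True                       # assume the cache should be purged
    try:
        arff_path = cache_dir / "dataset.arff"
        if not arff_path.exists():
            raise FileNotFoundError(arff_path)        # would normally trigger a download
        parquet_path = cache_dir / "dataset.pq"
        if not parquet_path.exists():
            if not online:
                # remote: the attempted Parquet download raises a connection error
                raise ConnectionError("no internet access")
            parquet_path = None               # local: a missing Parquet file is skipped silently
        remove_cache = False                  # only reached when nothing raised
        return arff_path, parquet_path
    finally:
        if remove_cache:
            shutil.rmtree(cache_dir, ignore_errors=True)   # wipes the valid ARFF cache too

With only the ARFF file present in cache_dir, calling this with online=True keeps the cache, while online=False wipes it, which matches the two workflows described here.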

However, when I copy the cache over to remote and try accessing the dataset again, the behaviour changes because remote has no internet access. In this case, the following happens:

  1. The ARFF file cache is successfully read,
  2. The Parquet file cache is checked and not found,
  3. OpenML tries to access the internet to download the Parquet file and fails,
  4. This particular error is not handled appropriately and therefore cascades down to the finally block without clearing the cache-deletion flag, so the entire cache gets deleted, including the perfectly valid ARFF file.
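
One possible guard, sketched below as a standalone helper rather than a patch to OpenML itself, would be to tolerate network errors during the Parquet download whenever a usable ARFF cache already exists. The function and parameter names are hypothetical.

def fetch_parquet_tolerantly(download_parquet, dataset_id, have_arff_cache):
    """Hypothetical guard: a failed Parquet download should not invalidate the cache."""
    try:
        return download_parquet(dataset_id)
    except OSError:
        # Connection failures (the built-in ConnectionError, and typically requests'
        # ConnectionError as well) are OSError subclasses.
        if have_arff_cache:
            return None   # fall back to the cached ARFF file
        raise             # nothing usable locally, so propagate the error

With such a guard in place, step 3 would hand back None instead of letting the error escape to the finally block, and the cache-deletion flag would still be cleared.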

Thus, the later sections of the program that would otherwise have run using the cached ARFF file also fail. As a minimal working example, the following code was run on both local and remote:

import openml
d = openml.datasets.get_dataset(4135)

The cache from local was copied to remote before executing the above code on remote, and was verified to be visible to openml.config.
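
For completeness, pointing the remote process at the copied cache amounted to something along the lines below; treat openml.config.cache_directory as an assumption about the relevant 0.12.1 setting, and the path as a placeholder.

import openml

# Assumption: openml.config.cache_directory is the cache-root setting in 0.12.1;
# the path is a placeholder for wherever the local cache was copied to.
openml.config.cache_directory = "/path/to/copied/cache"
print(openml.config.cache_directory)    # confirm OpenML sees the copied cache

d = openml.datasets.get_dataset(4135)   # fails on remote and deletes the cache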

Expected Results

The dataset should be successfully read on remote from the ARFF file even if the Parquet file is missing.

Actual Results

The download error on remote (which is expected, given the lack of internet access) is not handled properly and causes the entire cache to be deleted.

Versions

Linux-4.4.180-94.100-default-x86_64-with-SuSE-12-x86_64
Python 3.7.10 (default, Feb 26 2021, 18:47:35)
[GCC 7.3.0]
NumPy 1.19.4
SciPy 1.5.3
Scikit-Learn 0.23.2
OpenML 0.12.1

P.S.: The OpenML docs have a broken link for how to set up the apikey parameter in the config file. This threw me for a loop, since I initially thought the error was caused by an authentication failure while downloading the Parquet file. I'm not sure whether it warrants an issue of its own, though.
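
For anyone hitting the same thing: the key can also be set programmatically as below, and the config file usually just needs an apikey = <your key> line; the exact file location (e.g. ~/.openml/config or ~/.config/openml/config) seems to vary between versions, so treat these details as assumptions until the docs link is fixed.

import openml

# Programmatic alternative to the config file; the key string is a placeholder.
openml.config.apikey = "YOUR_OPENML_API_KEY"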
