
Cached datasets unusable without internet access due to missing Parquet files #1084

@NeoChaos12

Description

When trying to use a pre-built cache in a remote computing environment without internet access, OpenML fails because it attempts to reach the internet while checking for (potentially) non-existent Parquet files. The resulting error cascades through openml.datasets.get_dataset(), which marks the entire cache as corrupt and deletes it. As a result, the existing cache is lost, and later code that would otherwise have worked perfectly fine by relying on the cached ARFF file fails as well.

Steps/Code to Reproduce

To avoid confusion, I will refer to the node with internet access as "local" and the node without internet access as "remote". If you look at the code of get_dataset() here, it tries to download the ARFF file, then the Parquet file, and then sets a flag to indicate that the cache should be retained (a simplified sketch of this control flow follows the list below). The workflow on local is therefore this:

  1. The ARFF file is downloaded,
  2. The Parquet file is NOT downloaded, either because the download fails silently or because no Parquet file exists for this dataset,
  3. The cache-deletion flag is cleared and the cache is retained.
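
For context, the pattern described above looks roughly like the sketch below. This is a minimal, self-contained illustration of the described control flow, not OpenML's actual implementation; the function name, file names, and paths are all placeholders.

from pathlib import Path
import shutil

def get_dataset_sketch(cache_dir: Path, online: bool):
    """Illustrates the flag-in-finally pattern described above (not OpenML's real code)."""
    remove_cache = True                       # assume the cache should be purged
    try:
        arff_path = cache_dir / "dataset.arff"
        if not arff_path.exists():
            raise FileNotFoundError(arff_path)        # would normally trigger a download
        parquet_path = cache_dir / "dataset.pq"
        if not parquet_path.exists():
            if not online:
                # remote: the attempted Parquet download raises a connection error
                raise ConnectionError("no internet access")
            parquet_path = None               # local: a missing Parquet file is skipped silently
        remove_cache = False                  # only reached when nothing raised
        return arff_path, parquet_path
    finally:
        if remove_cache:
            shutil.rmtree(cache_dir, ignore_errors=True)   # wipes the valid ARFF cache too

With only the ARFF file present in cache_dir, calling this with online=True keeps the cache, while online=False wipes it, which matches the two workflows described here.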

However, when I copy the cache over to remote and try accessing the dataset again, the behaviour changes because remote has no internet access. In this case, the following happens:

  1. The ARFF file cache is successfully read,
  2. The Parquet file cache is checked and not found,
  3. OpenML tries to access the internet to download the Parquet file and fails,
  4. This particular error is not handled appropriately and therefore cascades down to the finally block without clearing the cache-deletion flag, so the entire cache gets deleted, including the perfectly valid ARFF file.
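
One possible guard, sketched below as a standalone helper rather than a patch to OpenML itself, would be to tolerate network errors during the Parquet download whenever a usable ARFF cache already exists. The function and parameter names are hypothetical.

def fetch_parquet_tolerantly(download_parquet, dataset_id, have_arff_cache):
    """Hypothetical guard: a failed Parquet download should not invalidate the cache."""
    try:
        return download_parquet(dataset_id)
    except OSError:
        # Connection failures (the built-in ConnectionError, and typically requests'
        # ConnectionError as well) are OSError subclasses.
        if have_arff_cache:
            return None   # fall back to the cached ARFF file
        raise             # nothing usable locally, so propagate the error

With such a guard in place, step 3 would hand back None instead of letting the error escape to the finally block, and the cache-deletion flag would still be cleared.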

Thus, the later sections of the program that would otherwise have run using the cached ARFF file also fail. As a minimal working example, the following code was run on both local and remote:

import openml
d = openml.datasets.get_dataset(4135)

The cache from local was copied to remote before executing the above code on remote, and was verified to be visible to openml.config.
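
For completeness, pointing the remote process at the copied cache amounted to something along the lines below; treat openml.config.cache_directory as an assumption about the relevant 0.12.1 setting, and the path as a placeholder.

import openml

# Assumption: openml.config.cache_directory is the cache-root setting in 0.12.1;
# the path is a placeholder for wherever the local cache was copied to.
openml.config.cache_directory = "/path/to/copied/cache"
print(openml.config.cache_directory)    # confirm OpenML sees the copied cache

d = openml.datasets.get_dataset(4135)   # fails on remote and deletes the cache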

Expected Results

The dataset should be successfully read on remote from the ARFF file even if the Parquet file is missing.

Actual Results

The download error on remote (which is expected, given the lack of internet access) is not handled properly and causes the entire cache to be deleted.

Versions

Linux-4.4.180-94.100-default-x86_64-with-SuSE-12-x86_64
Python 3.7.10 (default, Feb 26 2021, 18:47:35)
[GCC 7.3.0]
NumPy 1.19.4
SciPy 1.5.3
Scikit-Learn 0.23.2
OpenML 0.12.1

P.S.: The OpenML docs have a broken link for how to set up the apikey parameter in the config file. This threw me for a loop, since I initially thought the error was caused by an authentication failure while downloading the Parquet file. I'm not sure whether it warrants an issue of its own, though.
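
For anyone hitting the same thing: the key can also be set programmatically as below, and the config file usually just needs an apikey = <your key> line; the exact file location (e.g. ~/.openml/config or ~/.config/openml/config) seems to vary between versions, so treat these details as assumptions until the docs link is fixed.

import openml

# Programmatic alternative to the config file; the key string is a placeholder.
openml.config.apikey = "YOUR_OPENML_API_KEY"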
