Don't download (large) datasets by default #1034

@joaquinvanschoren

Description

In openml.datasets.get_dataset(data_id), the current default is to always download the dataset:
https://openml.github.io/openml-python/master/generated/openml.datasets.get_dataset.html#openml.datasets.get_dataset

This is problematic for large datasets: it takes a long time and may cause out-of-memory errors. Sometimes we need to look at the full meta-data (of many datasets) without downloading the data. We can do that now with the option download_data=False, but it feels like this should be the default. Some users may also be unaware of this option, or of the fact that get_dataset will actually download the data and consume resources.

A simple solution would be to make download_data=False the default.
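As a sketch of the current workaround, assuming the download_data keyword behaves as documented and that the returned OpenMLDataset exposes its meta-data (e.g. a name attribute) without touching the data files:

import openml

# Fetch only the meta-data; the data itself is not downloaded or parsed.
dataset = openml.datasets.get_dataset(41081, download_data=False)
print(dataset.name)  # meta-data such as the name is available immediately

Making download_data=False the default would give users this fast path without them having to know about the flag.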

Steps/Code to Reproduce

import openml

# With the current default, this downloads and parses the full dataset before returning.
openml.datasets.get_dataset(41081)

Expected Results

The dataset meta-data is returned within seconds.

Actual Results

A long wait while the dataset is downloaded and parsed.

Versions

macOS-10.16-x86_64-i386-64bit
Python 3.8.5 (default, Sep 4 2020, 02:22:02)
[Clang 10.0.0 ]
NumPy 1.19.5
SciPy 1.5.2
Scikit-Learn 0.23.2
OpenML 0.11.1dev
