Description
I am currently working on adding Parquet support to openml-python.
Parquet files load a lot faster than arff files, so I was wondering whether we should still aim to provide the additional speed improvement by keeping the pickle files. I did a small speed test measuring the time to load the CC-18 suite from disk; the suite has 72 datasets ranging from a few kB to 190 MB in size (parquet/pickle size).
Loading all datasets in the suite takes ~6 s with parquet, ~700 ms with pickle, and ~7 min with arff.
The relative difference between pickle and parquet is still large, but in absolute terms parquet loads the entire suite of 72 datasets in ~6 seconds. I think it's worth evaluating whether we want to keep the pickle files.
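For context, a minimal sketch of the kind of timing comparison I ran, assuming a directory that already holds one parquet and one pickle file per dataset (the `datasets/*.pq` and `datasets/*.pkl` paths are placeholders, not the actual openml-python cache layout):

```python
import glob
import time

import pandas as pd


def time_loads(pattern, loader):
    """Total wall-clock time to load every file matching the glob pattern."""
    start = time.perf_counter()
    for path in sorted(glob.glob(pattern)):
        loader(path)
    return time.perf_counter() - start


# Placeholder paths; the real openml-python cache layout differs.
parquet_time = time_loads("datasets/*.pq", pd.read_parquet)
pickle_time = time_loads("datasets/*.pkl", pd.read_pickle)
print(f"parquet: {parquet_time:.2f} s, pickle: {pickle_time:.2f} s")
```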
The obvious drawback is slower loads, though the difference might not be noticeable in most cases.
Getting rid of the pickle files would have the following benefits:
- faster first-time load (no need to pickle the loaded dataframe)
- save disk space (the pickle file is roughly as big as the parquet file, so keeping it doubles the space the cache requires)
- pickles are brittle: different Python/pickle versions may lead to errors (get_dataset pickle protocol problem #898), the package versions used for storing and loading need to be compatible (Pickle error: No module named 'pandas.core.categorical' #918), and pickling large datasets can fail outright (Process killed when pickling dataset #780), all of which requires extra code around pickle loads
- less code is less maintenance (a sketch of the simplified load path follows this list)
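To make the last point concrete, here is a rough sketch of what the parquet-only load path could look like without the pickle layer. This is not the actual openml-python code: `_parse_arff`, the file names, and the directory layout are all assumptions for illustration.

```python
from pathlib import Path

import pandas as pd


def _parse_arff(path: Path) -> pd.DataFrame:
    # Minimal arff parsing via scipy; the real openml-python parser does more.
    from scipy.io import arff

    data, _meta = arff.loadarff(path)
    return pd.DataFrame(data)


def load_cached(dataset_dir: Path) -> pd.DataFrame:
    """Load a cached dataset with parquet as the only binary cache format."""
    parquet_path = dataset_dir / "dataset.pq"
    if parquet_path.exists():
        # Fast path: read the parquet cache directly, no pickle involved.
        return pd.read_parquet(parquet_path)
    # First load: parse the downloaded arff once, then cache it as parquet.
    df = _parse_arff(dataset_dir / "dataset.arff")
    df.to_parquet(parquet_path)
    return df
```

With this shape there is a single cache format to version and invalidate, instead of keeping the pickle and parquet copies in sync.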