I wanted to investigate which unit tests fail and why (and how to get them working again).
One of the changes that broke multiple tests was the switch from ARFF files to Parquet. For example, a dataset column that held a boolean value in ARFF would have been a categorical type with two allowed values; in Parquet it is simply a pandas boolean dtype.
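
To make the dtype difference concrete, here is a minimal sketch (assuming pandas; the series values are illustrative, not taken from a real dataset) contrasting the two representations:

```python
import pandas as pd

# ARFF era: a boolean attribute was loaded as a two-valued categorical.
arff_style = pd.Series(["True", "False", "True"], dtype="category")

# Parquet era: the same column comes back as a plain pandas boolean dtype.
parquet_style = pd.Series([True, False, True], dtype="bool")

print(arff_style.dtype)     # category
print(parquet_style.dtype)  # bool

# Converting between the two representations is lossless:
as_bool = arff_style.map({"True": True, "False": False}).astype("bool")
assert as_bool.equals(parquet_style)
```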
In another example (see below), the Parquet file stored numeric data as float64, but our tests expect it to be uint8 (because there are no missing values and casting the column to uint8 loses no data).
How do we proceed with this? I propose updating the unit tests/expected behavior where reasonable (for example, adopting the bool dtype instead of holding on to a two-valued categorical). Sometimes it might make sense to instead expect changes to the Parquet file (as in the example in the image), but at the same time I don't know whether the Parquet readers in all languages can deal with the different types. Either way, we could also update our Parquet loading logic to test whether type conversions are possible, so that our unit tests remain robust to these changes.
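
For the loading-logic idea, a minimal sketch of such a conversion test might look like the following (`safe_downcast` is a hypothetical helper for illustration, not an existing openml-python function):

```python
import numpy as np
import pandas as pd

def safe_downcast(series: pd.Series, target: str) -> pd.Series:
    """Cast `series` to `target` only if the round trip loses no data;
    otherwise return the series unchanged."""
    if series.isna().any():
        return series  # missing values rule out plain integer dtypes
    cast = series.astype(target)
    if np.array_equal(cast.to_numpy(), series.to_numpy()):
        return cast
    return series

# A float64 column as it comes out of the Parquet file:
col = pd.Series([0.0, 1.0, 255.0])
print(safe_downcast(col, "uint8").dtype)   # uint8 (cast is lossless)

col2 = pd.Series([0.5, 1.0])
print(safe_downcast(col2, "uint8").dtype)  # float64 (0.5 would be truncated)
```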
Side note: I am not entirely sure why the data is stored as float64 when, AFAIK, the openml-python module was used to convert the data.
