Description
For a small set of flows, the predictions.arff files of some runs contain faulty entries. In these entries, the prediction does not correspond to the class with the highest confidence.
As far as I was able to determine, all affected flows are sklearn pipelines published/uploaded via openml-python.
Moreover, the confidences of these pipelines should, to the best of my knowledge, be representative of the prediction (unlike, for example, the decision values of an SVM).
Furthermore, the confidences are off by a large margin, so this is not a case of two or more classes having nearly equal confidences, nor a floating-point precision issue.
Example
Flow 19039 with Run 10581112 and the associated predictions file.
| row_id | predicted class in predictions.arff | confidence.1 | confidence.2 | prediction based on confidence |
|---|---|---|---|---|
| 95 | 1 | 0.2552 | 0.7448 | 2 |
| 349 | 1 | 0.0601 | 0.9399 | 2 |
| 980 | 2 | 0.6280 | 0.3720 | 1 |
Expected Results
The predicted class in predictions.arff should be the class with the highest confidence in the same file.
Actual Results
The predicted class in predictions.arff corresponds to the class with the second-highest confidence. In other cases, the prediction does not correspond to a high-confidence class at all but appears to be chosen at random.
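The inconsistency can be detected mechanically by comparing each row's predicted class against the argmax of its confidence columns. A minimal sketch, using the three example rows from the table above (the data layout and function name are illustrative, not part of openml-python):

```python
# Illustrative rows from the example table: (row_id, predicted_class,
# {class_label: confidence}). In a real check these would be parsed
# from the predictions.arff file.
rows = [
    (95,  1, {1: 0.2552, 2: 0.7448}),
    (349, 1, {1: 0.0601, 2: 0.9399}),
    (980, 2, {1: 0.6280, 2: 0.3720}),
]

def inconsistent_rows(rows):
    """Return row_ids whose prediction differs from the highest-confidence class."""
    bad = []
    for row_id, predicted, confidences in rows:
        # Class label with the maximum confidence value.
        argmax_class = max(confidences, key=confidences.get)
        if predicted != argmax_class:
            bad.append(row_id)
    return bad

print(inconsistent_rows(rows))  # → [95, 349, 980]: all three example rows are faulty
```

All three rows from the example run are flagged, confirming that the mismatch is systematic rather than a near-tie between classes.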
Affected Flows
In my research, I have found the following list of flows to run into this problem at least once: [19030, 19037, 19039, 19035, 18818, 17839, 17761].
These include sklearn pipelines using decision trees (19030, 18818), gradient boosting (19037, 19039), KNN (19035), SGD (17839), and LDA (17761).
Versions
I assume that the flows [19030, 19037, 19039, 1903] used the newest version of openml-python, based on their upload dates and feedback gathered from the original uploader. For the other flows, I am not certain which version was used.