MapieRegressor with prefit optimized model that used training and calibration data #477

@ramy90

Hi,
I would like to use a pretrained XGBoost regression model as a "prefit" estimator in MapieRegressor. However, K-fold cross-validation was already used for training and hyperparameter tuning of the XGBoost model, so there is no calibration dataset that the model has not already seen. Will there be data leakage if I pass all of the training data (the data used to train and optimize the XGBoost model) as X_train to MapieRegressor?
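
For context on the leakage question, here is a minimal NumPy-only sketch of split conformal prediction. It does not use MAPIE or XGBoost. It assumes synthetic data and a 1-nearest-neighbour regressor, chosen because it memorises the training set, and it compares calibrating on the training data with calibrating on held-out data. The helper names (`make_data`, `predict_1nn`, `conformal_quantile`) are hypothetical and not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n):
    # Synthetic 1-D regression problem (an assumption for illustration).
    x = rng.uniform(0.0, 10.0, size=n)
    y = np.sin(x) + rng.normal(0.0, 0.3, size=n)
    return x, y


x_train, y_train = make_data(500)
x_calib, y_calib = make_data(500)   # never seen by the model
x_test, y_test = make_data(2000)


def predict_1nn(x_query):
    # 1-nearest-neighbour "model": returns the target of the closest
    # training point, so it reproduces the training targets exactly.
    idx = np.abs(x_query[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[idx]


def conformal_quantile(residuals, alpha):
    # Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th
    # smallest absolute residual (clipped to the largest one).
    n = residuals.size
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.sort(residuals)[min(k, n) - 1]


def coverage(q):
    # Fraction of test targets inside [prediction - q, prediction + q].
    return float(np.mean(np.abs(y_test - predict_1nn(x_test)) <= q))


alpha = 0.1  # target coverage of 90%

# Leaky calibration: residuals computed on data the model was fit on.
q_leaky = conformal_quantile(np.abs(y_train - predict_1nn(x_train)), alpha)
coverage_leaky = coverage(q_leaky)

# Proper calibration: residuals computed on held-out data.
q_heldout = conformal_quantile(np.abs(y_calib - predict_1nn(x_calib)), alpha)
coverage_heldout = coverage(q_heldout)

print(f"leaky:    half-width={q_leaky:.3f}, coverage={coverage_leaky:.3f}")
print(f"held-out: half-width={q_heldout:.3f}, coverage={coverage_heldout:.3f}")
```

With `cv='prefit'`, the arrays passed to `fit` play the role of `x_calib` and `y_calib` in this sketch, so reusing the training data there would correspond to the leaky case. A tuned XGBoost model will usually overfit less than 1-NN, so the effect may be milder than in this deliberately extreme setup.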

In the regression example at https://github.com/scikit-learn-contrib/MAPIE/blob/044ae6977a7ed874686b78e278f0e9b433cb2f65/examples/regression/4-tutorials/plot_cqr_tutorial.py#L277 , only the training data (X_train) was used to fit the MapieRegressor; the calibration data (X_calib) was not used. Why was the calibration data excluded for MapieRegressor in this example?

Also, if I decide to fit the model anyway using the following code:
mapie_xgb = MapieRegressor(xgb_model, cv='prefit')
mapie_xgb.fit(X=X_train_transformed, y=y_train_transformed)
I get the following error:
ValueError: The two functions get_conformity_scores and get_estimation_distribution of the ConformityScore class are not consistent. The following equation must be verified: self.get_estimation_distribution(X, y_pred, self.get_conformity_scores(X, y, y_pred)) == yThe maximum conformity score is 9.5367431640625e-07.The eps attribute may need to be increased if you are sure that the two methods are consistent.
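
A note on the number in the message: 9.5367431640625e-07 is exactly 2^-20, which is the spacing between adjacent float32 values in the range 8 to 16. One possible explanation, which is an assumption and not confirmed in this thread, is that both the targets and the XGBoost predictions are float32, so that `y_pred + (y - y_pred)` can differ from `y` by one rounding step. The sketch below reproduces that effect with hand-picked values in plain NumPy; `eps = 1e-8` is an illustrative tolerance, not necessarily the value the library uses.

```python
import numpy as np

eps = 1e-8  # illustrative tolerance (assumption)

# Hand-picked float32 values that are exactly representable.
y = np.array([12.0 + 2.0 ** -20], dtype=np.float32)      # target
y_pred = np.array([2.0 + 2.0 ** -21], dtype=np.float32)  # prediction

# Round trip in float32: residual, then add the residual back.
scores32 = y - y_pred            # exact value lies halfway between two float32 values
y_back32 = y_pred + scores32     # rounding happens again here
err32 = np.max(np.abs(y_back32 - y))

# The same round trip in float64 is exact for these values.
y64 = y.astype(np.float64)
y_pred64 = y_pred.astype(np.float64)
y_back64 = y_pred64 + (y64 - y_pred64)
err64 = np.max(np.abs(y_back64 - y64))

print("float32 round-trip error:", float(err32))
print("float64 round-trip error:", float(err64))
print("float32 error exceeds eps:", bool(err32 > eps))
```

If this is what happens here, casting the targets and predictions to float64 before calibration might avoid the failed check, though that is untested against the setup described in this issue.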
