Hi,
I would like to use a pretrained xgboost regression model as a "prefit" estimator in MapieRegressor. However, K-fold cross-validation was already used for training and hyperparameter tuning of the xgboost model. This means there is no calibration dataset that the model has not seen before. Will there be data leakage if I use all the training data (the data used to train and optimize the xgboost model) as X_train in MapieRegressor?
In the regression example at https://github.com/scikit-learn-contrib/MAPIE/blob/044ae6977a7ed874686b78e278f0e9b433cb2f65/examples/regression/4-tutorials/plot_cqr_tutorial.py#L277, only the training data (X_train) is used to fit the MapieRegressor, and the calibration data (X_calib) is not used. Why was the calibration data excluded for MapieRegressor in this example?
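For context, the setup this question has in mind is split conformal prediction, where the conformity scores come from data the fitted model has never seen. Below is a minimal NumPy sketch of that idea, not MAPIE's implementation. The synthetic data, the least-squares line standing in for the tuned xgboost model, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D regression data (a stand-in for the real dataset).
X = rng.uniform(-3.0, 3.0, size=3000)
y = 2.0 * X + rng.normal(scale=1.0, size=X.shape)

# Three disjoint splits: fit the model, calibrate the intervals, evaluate.
X_fit, y_fit = X[:500], y[:500]
X_cal, y_cal = X[500:1000], y[500:1000]
X_test, y_test = X[1000:], y[1000:]

# "Prefit" model: a least-squares line standing in for the tuned xgboost model.
coef = np.polyfit(X_fit, y_fit, deg=1)


def predict(x):
    """Evaluate the fitted line at x."""
    return np.polyval(coef, x)


# Conformity scores: absolute residuals on the held-out calibration split.
alpha = 0.1
scores = np.abs(y_cal - predict(X_cal))
n = scores.size

# Finite-sample corrected quantile: the k-th smallest calibration score.
k = int(np.ceil((n + 1) * (1 - alpha)))
q_hat = np.sort(scores)[k - 1]

# Symmetric prediction intervals and their empirical coverage on the test split.
lower = predict(X_test) - q_hat
upper = predict(X_test) + q_hat
coverage = float(np.mean((y_test >= lower) & (y_test <= upper)))
print(f"q_hat={q_hat:.3f}, empirical coverage={coverage:.3f}")
```

If the calibration scores were computed on the same rows the model was fitted and tuned on, the residuals would tend to be smaller than on new data, so the resulting intervals could be too narrow. That is the leakage concern raised above.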
Also, if I decide to fit the model anyway using the following code:
mapie_xgb = MapieRegressor(xgb_model, cv='prefit')
mapie_xgb.fit(X=X_train_transformed, y=y_train_transformed)
I get the following error:
ValueError: The two functions get_conformity_scores and get_estimation_distribution of the ConformityScore class are not consistent. The following equation must be verified: self.get_estimation_distribution(X, y_pred, self.get_conformity_scores(X, y, y_pred)) == y
The maximum conformity score is 9.5367431640625e-07.
The eps attribute may need to be increased if you are sure that the two methods are consistent.
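The value in the message, 9.5367431640625e-07, is exactly 2^-20, which is the spacing between adjacent float32 numbers in the range [8, 16). One possible reading, offered as an assumption rather than a confirmed diagnosis, is that the targets and predictions are float32, so the round trip `y_pred + (y - y_pred)` does not always return exactly `y`. The sketch below illustrates this with plain NumPy. The signed-residual score and the random data are hypothetical and do not reproduce MAPIE's internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical float32 targets in [8, 16) and float32 predictions in [1, 16).
y32 = rng.uniform(8.0, 16.0, size=10_000).astype(np.float32)
pred32 = rng.uniform(1.0, 16.0, size=10_000).astype(np.float32)

# Signed residual used as the conformity score, then inverted, all in float32.
scores32 = y32 - pred32
back32 = pred32 + scores32
max_err32 = float(np.max(np.abs(back32.astype(np.float64) - y32.astype(np.float64))))

# The same round trip after casting the inputs to float64.
y64 = y32.astype(np.float64)
pred64 = pred32.astype(np.float64)
back64 = pred64 + (y64 - pred64)
max_err64 = float(np.max(np.abs(back64 - y64)))

print("float32 spacing in [8, 16):", float(np.spacing(np.float32(8.0))))
print("max round-trip error, float32:", max_err32)
print("max round-trip error, float64:", max_err64)
```

If this is what happens here, casting `y` and the model predictions to float64 before fitting, or increasing the `eps` attribute as the message suggests, may be worth trying.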