This is a supervised classification example taken from the KDD Cup 2009. A copy of the data and details can be found here: https://github.com/WinVector/PDSwR2/tree/master/KDD2009. The problem is to predict account cancellation ("churn") from very messy data (column names not given, numeric and categorical variables, many missing values, and some categorical variables with a large number of possible levels). In this example we show how to quickly use vtreat to prepare the data for modeling. vtreat takes in Pandas DataFrames and returns both a treatment plan and a clean Pandas DataFrame ready for modeling.
!pip install vtreat
!pip install wvpy
Load our packages/modules.
import pandas
import xgboost
import vtreat
import vtreat.cross_plan
import numpy.random
import wvpy.util
import scipy.sparse
Read in the explanatory variables.
# data from https://github.com/WinVector/PDSwR2/tree/master/KDD2009
dir = "../../../PracticalDataScienceWithR2nd/PDSwR2/KDD2009/"
d = pandas.read_csv(dir + 'orange_small_train.data.gz', sep='\t', header=0)
vars = [c for c in d.columns]
d.shape
(50000, 230)
Read in dependent variable we are trying to predict.
churn = pandas.read_csv(dir + 'orange_small_train_churn.labels.txt', header=None)
churn.columns = ["churn"]
churn.shape
(50000, 1)
churn["churn"].value_counts()
-1    46328
 1     3672
Name: churn, dtype: int64
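As a quick sanity check, the counts above imply a fairly unbalanced problem: only about 7.3% of accounts churn. A minimal sketch of the arithmetic (values copied from the printed counts):

```python
# churn prevalence implied by the value_counts() output above
n_churn, n_stay = 3672, 46328
prevalence = n_churn / (n_churn + n_stay)
print(prevalence)  # 0.07344
```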
Arrange test/train split.
numpy.random.seed(2020)
n = d.shape[0]
# https://github.com/WinVector/pyvtreat/blob/master/Examples/CustomizedCrossPlan/CustomizedCrossPlan.md
split1 = vtreat.cross_plan.KWayCrossPlanYStratified().split_plan(n_rows=n, k_folds=10, y=churn.iloc[:, 0])
train_idx = set(split1[0]['train'])
is_train = [i in train_idx for i in range(n)]
is_test = numpy.logical_not(is_train)
(The reported performance of this example was sensitive to the prevalence of the churn variable in the test set; we cut down on this source of evaluation variance by using the stratified split.)
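vtreat's KWayCrossPlanYStratified builds the stratified plan for us. As an illustration of the underlying idea (not vtreat's actual implementation), a y-stratified fold assignment can be sketched as:

```python
import numpy as np

def stratified_fold_ids(y, k_folds, rng):
    # assign fold ids separately within each outcome class, so every fold
    # ends up with roughly the class prevalence of the whole data set
    fold = np.zeros(len(y), dtype=int)
    for level in np.unique(y):
        idx = np.flatnonzero(y == level)
        rng.shuffle(idx)
        fold[idx] = np.arange(len(idx)) % k_folds
    return fold

rng = np.random.default_rng(2020)
y_toy = np.array([1] * 100 + [-1] * 900)
fold = stratified_fold_ids(y_toy, 10, rng)
# each of the 10 folds gets exactly 10 positives out of 100 rows
print(((fold == 0) & (y_toy == 1)).sum(), (fold == 0).sum())  # 10 100
```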
d_train = d.loc[is_train, :].copy()
churn_train = numpy.asarray(churn.loc[is_train, :]["churn"]==1)
d_test = d.loc[is_test, :].copy()
churn_test = numpy.asarray(churn.loc[is_test, :]["churn"]==1)
Take a look at the explanatory variables. They are a mess: many missing values, and categorical variables that cannot be used directly without some re-encoding.
d_train.head()
| | Var1 | Var2 | Var3 | Var4 | Var5 | Var6 | Var7 | Var8 | Var9 | Var10 | ... | Var221 | Var222 | Var223 | Var224 | Var225 | Var226 | Var227 | Var228 | Var229 | Var230 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | NaN | 1526.0 | 7.0 | NaN | NaN | NaN | ... | oslk | fXVEsaq | jySVZNlOJy | NaN | NaN | xb3V | RAYp | F2FyR07IdsN7I | NaN | NaN |
| 1 | NaN | NaN | NaN | NaN | NaN | 525.0 | 0.0 | NaN | NaN | NaN | ... | oslk | 2Kb5FSF | LM8l689qOp | NaN | NaN | fKCe | RAYp | F2FyR07IdsN7I | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | 5236.0 | 7.0 | NaN | NaN | NaN | ... | Al6ZaUT | NKv4yOc | jySVZNlOJy | NaN | kG3k | Qu4f | 02N6s8f | ib5G6X1eUxUn6 | am7c | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | 1029.0 | 7.0 | NaN | NaN | NaN | ... | oslk | 1J2cvxe | LM8l689qOp | NaN | kG3k | FSa2 | RAYp | F2FyR07IdsN7I | mj86 | NaN |
| 5 | NaN | NaN | NaN | NaN | NaN | 658.0 | 7.0 | NaN | NaN | NaN | ... | zCkv | QqVuch3 | LM8l689qOp | NaN | NaN | Qcbd | 02N6s8f | Zy3gnGM | am7c | NaN |
5 rows × 230 columns
d_train.shape
(45000, 230)
Try building a model directly off this data (this will fail).
fitter = xgboost.XGBClassifier(n_estimators=10, max_depth=3, objective='binary:logistic')
try:
    fitter.fit(d_train, churn_train)
except Exception as ex:
    print(ex)
DataFrame.dtypes for data must be int, float or bool.
Did not expect the data types in fields Var191, Var192, Var193, Var194, Var195, Var196, Var197, Var198, Var199, Var200, Var201, Var202, Var203, Var204, Var205, Var206, Var207, Var208, Var210, Var211, Var212, Var213, Var214, Var215, Var216, Var217, Var218, Var219, Var220, Var221, Var222, Var223, Var224, Var225, Var226, Var227, Var228, Var229
Let's quickly prepare a data frame with none of these issues.
We start by building our treatment plan; it has the sklearn.pipeline.Pipeline interface.
plan = vtreat.BinomialOutcomeTreatment(
    outcome_target=True,
    params=vtreat.vtreat_parameters({'filter_to_recommended':True}))
Use .fit_transform() to get a special copy of the treated training data that has cross-validated mitigations against nested model bias. We call this a "cross frame." .fit_transform() deliberately returns a different DataFrame than .fit().transform() would (the .fit().transform() result would damage the modeling effort due to nested model bias; the .fit_transform() "cross frame" uses cross-validation techniques similar to "stacking" to mitigate these issues).
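The nested model bias issue is easiest to see on impact/logit-style codes: a naive per-level encoding lets each row see its own outcome. A toy sketch of the contrast (leave-one-out here for brevity; vtreat actually uses k-fold cross plans):

```python
import numpy as np
import pandas as pd

# toy data: one categorical variable and a numeric outcome
df = pd.DataFrame({"x": ["a", "a", "b", "b"], "y": [1, 0, 1, 1]})

# fit().transform()-style encoding: each row's code includes its own outcome
in_sample = df.groupby("x")["y"].transform("mean")
print(in_sample.tolist())  # [0.5, 0.5, 1.0, 1.0]

# cross-frame-style encoding: each row's code is computed from the other rows
oof = np.array([
    df.loc[(df["x"] == xi) & (df.index != i), "y"].mean()
    for i, xi in enumerate(df["x"])])
print(oof.tolist())  # [0.0, 1.0, 1.0, 1.0]
```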
cross_frame = plan.fit_transform(d_train, churn_train)
Take a look at the new data. This frame is guaranteed to be all numeric with no missing values, with the rows in the same order as the training data.
cross_frame.head()
| | Var2_is_bad | Var3_is_bad | Var4_is_bad | Var5_is_bad | Var6_is_bad | Var7_is_bad | Var10_is_bad | Var11_is_bad | Var13_is_bad | Var14_is_bad | ... | Var227_lev_RAYp | Var227_lev_ZI9m | Var228_logit_code | Var228_prevalence_code | Var228_lev_F2FyR07IdsN7I | Var229_logit_code | Var229_prevalence_code | Var229_lev__NA_ | Var229_lev_am7c | Var229_lev_mj86 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 0.145563 | 0.654178 | 1.0 | 0.180634 | 0.568733 | 1.0 | 0.0 | 0.0 |
| 1 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 0.150727 | 0.654178 | 1.0 | 0.175825 | 0.568733 | 1.0 | 0.0 | 0.0 |
| 2 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | -0.591072 | 0.053667 | 0.0 | -0.296854 | 0.233689 | 0.0 | 1.0 | 0.0 |
| 3 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 0.150727 | 0.654178 | 1.0 | -0.292587 | 0.196044 | 0.0 | 0.0 | 1.0 |
| 4 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | -0.323715 | 0.018556 | 0.0 | -0.268261 | 0.233689 | 0.0 | 1.0 | 0.0 |
5 rows × 234 columns
cross_frame.shape
(45000, 234)
Pick a recommended subset of the new derived variables.
plan.score_frame_.head()
| | variable | orig_variable | treatment | y_aware | has_range | PearsonR | R2 | significance | vcount | default_threshold | recommended |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Var1_is_bad | Var1 | missing_indicator | False | True | 0.004328 | 0.000019 | 0.358610 | 193.0 | 0.001036 | False |
| 1 | Var2_is_bad | Var2 | missing_indicator | False | True | 0.016358 | 0.000268 | 0.000520 | 193.0 | 0.001036 | True |
| 2 | Var3_is_bad | Var3 | missing_indicator | False | True | 0.016325 | 0.000266 | 0.000534 | 193.0 | 0.001036 | True |
| 3 | Var4_is_bad | Var4 | missing_indicator | False | True | 0.020327 | 0.000413 | 0.000016 | 193.0 | 0.001036 | True |
| 4 | Var5_is_bad | Var5 | missing_indicator | False | True | 0.017267 | 0.000298 | 0.000249 | 193.0 | 0.001036 | True |
model_vars = numpy.asarray(plan.score_frame_["variable"][plan.score_frame_["recommended"]])
len(model_vars)
234
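The selection above is just a boolean filter on the score frame. A toy sketch with hypothetical rows (the real plan.score_frame_ has one row per derived variable):

```python
import numpy
import pandas

# hypothetical stand-in for plan.score_frame_
score_frame = pandas.DataFrame({
    "variable": ["Var1_is_bad", "Var2_is_bad", "Var4_is_bad"],
    "recommended": [False, True, True]})

model_vars_toy = numpy.asarray(
    score_frame["variable"][score_frame["recommended"]])
print(list(model_vars_toy))  # ['Var2_is_bad', 'Var4_is_bad']
```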
Fit the model
cross_frame.dtypes
Var2_is_bad                          float64
Var3_is_bad                          float64
Var4_is_bad                          float64
Var5_is_bad                          float64
Var6_is_bad                          float64
                               ...
Var229_logit_code                    float64
Var229_prevalence_code               float64
Var229_lev__NA_         Sparse[float64, 0.0]
Var229_lev_am7c         Sparse[float64, 0.0]
Var229_lev_mj86         Sparse[float64, 0.0]
Length: 234, dtype: object
# fails due to sparse columns
# can also work around this by setting the vtreat parameter 'sparse_indicators' to False
try:
    cross_sparse = xgboost.DMatrix(data=cross_frame.loc[:, model_vars], label=churn_train)
except Exception as ex:
    print(ex)
DataFrame.dtypes for data must be int, float or bool.
Did not expect the data types in fields Var191_lev__NA_, Var193_lev_RO12, Var193_lev_2Knk1KF, Var194_lev__NA_, Var194_lev_SEuy, Var195_lev_taul, Var200_lev__NA_, Var201_lev__NA_, Var201_lev_smXZ, Var205_lev_VpdQ, Var206_lev_IYzP, Var206_lev_zm5i, Var206_lev__NA_, Var207_lev_me75fM6ugJ, Var207_lev_7M47J5GA0pTYIFxg5uy, Var210_lev_uKAI, Var211_lev_L84s, Var211_lev_Mtgm, Var212_lev_NhsEn4L, Var212_lev_XfqtO3UdzaXh_, Var213_lev__NA_, Var214_lev__NA_, Var218_lev_cJvF, Var218_lev_UYBR, Var221_lev_oslk, Var221_lev_zCkv, Var225_lev__NA_, Var225_lev_ELof, Var226_lev_FSa2, Var227_lev_RAYp, Var227_lev_ZI9m, Var228_lev_F2FyR07IdsN7I, Var229_lev__NA_, Var229_lev_am7c, Var229_lev_mj86
# also fails
try:
    cross_sparse = scipy.sparse.csc_matrix(cross_frame[model_vars])
except Exception as ex:
    print(ex)
no supported conversion for types: (dtype('O'),)
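The working approach converts each column to a sparse matrix separately and then binds them together. The mechanics on toy columns (plain numpy arrays standing in for the treated frame's columns):

```python
import numpy
import scipy.sparse

# two toy columns standing in for derived variables
cols = [numpy.array([1.0, 0.0, 0.0]), numpy.array([0.0, 0.0, 2.0])]

# convert each column to sparse on its own, then hstack into one matrix
mat = scipy.sparse.hstack(
    [scipy.sparse.csc_matrix(c.reshape(-1, 1)) for c in cols])
print(mat.shape)  # (3, 2)
```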
# works
cross_sparse = scipy.sparse.hstack([scipy.sparse.csc_matrix(cross_frame[[vi]]) for vi in model_vars])
# https://xgboost.readthedocs.io/en/latest/python/python_intro.html
fd = xgboost.DMatrix(
    data=cross_sparse,
    label=churn_train)
x_parameters = {"max_depth":3, "objective":'binary:logistic'}
cv = xgboost.cv(x_parameters, fd, num_boost_round=100, verbose_eval=False)
cv.head()
| | train-error-mean | train-error-std | test-error-mean | test-error-std |
|---|---|---|---|---|
| 0 | 0.073300 | 0.000709 | 0.073311 | 0.001447 |
| 1 | 0.073322 | 0.000741 | 0.073333 | 0.001415 |
| 2 | 0.073344 | 0.000747 | 0.073467 | 0.001464 |
| 3 | 0.073378 | 0.000725 | 0.073467 | 0.001464 |
| 4 | 0.073356 | 0.000739 | 0.073444 | 0.001450 |
best = cv.loc[cv["test-error-mean"]<= min(cv["test-error-mean"] + 1.0e-9), :]
best
| | train-error-mean | train-error-std | test-error-mean | test-error-std |
|---|---|---|---|---|
| 83 | 0.069933 | 0.000401 | 0.072333 | 0.001093 |
ntree = best.index.values[0]
ntree
83
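The best-round selection is a tolerance-based argmin over the cross-validation table. A sketch with hypothetical error numbers:

```python
import pandas

# hypothetical stand-in for the xgboost.cv() result
cv_toy = pandas.DataFrame({"test-error-mean": [0.080, 0.072, 0.075, 0.072]})

# keep rows within 1e-9 of the minimum; the first such index is the round count
best_toy = cv_toy.loc[cv_toy["test-error-mean"] <= min(cv_toy["test-error-mean"] + 1.0e-9), :]
print(best_toy.index.values[0])  # 1
```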
fitter = xgboost.XGBClassifier(n_estimators=ntree, max_depth=3, objective='binary:logistic')
fitter
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0,
learning_rate=0.1, max_delta_step=0, max_depth=3,
min_child_weight=1, missing=None, n_estimators=83, n_jobs=1,
nthread=None, objective='binary:logistic', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=None, subsample=1, verbosity=1)
model = fitter.fit(cross_sparse, churn_train)
Apply the data transform to our held-out data.
test_processed = plan.transform(d_test)
Plot the quality of the model on training data (a biased measure of performance).
pf_train = pandas.DataFrame({"churn":churn_train})
pf_train["pred"] = model.predict_proba(cross_sparse)[:, 1]
wvpy.util.plot_roc(pf_train["pred"], pf_train["churn"], title="Model on Train")
0.778895961015585
Plot the quality of the model score on the held-out data. This AUC is not great, but in the ballpark of the original contest winners.
test_sparse = scipy.sparse.hstack([scipy.sparse.csc_matrix(test_processed[[vi]]) for vi in model_vars])
pf = pandas.DataFrame({"churn":churn_test})
pf["pred"] = model.predict_proba(test_sparse)[:, 1]
wvpy.util.plot_roc(pf["pred"], pf["churn"], title="Model on Test")
0.7472558854286449
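The number printed with each ROC plot is the AUC: the probability that a random positive example outscores a random negative one. A minimal sketch of that rank-based computation (ties ignored; the function auc_score is our illustration, not wvpy's API):

```python
import numpy

def auc_score(y_true, scores):
    # Mann-Whitney form: AUC from the rank-sum of the positive examples
    order = numpy.argsort(scores)
    ranks = numpy.empty(len(scores))
    ranks[order] = numpy.arange(1, len(scores) + 1)
    pos = numpy.asarray(y_true, dtype=bool)
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```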
Notice we dealt with many problem columns at once, and in a statistically sound manner. More on the vtreat package for Python can be found here: https://github.com/WinVector/pyvtreat. Details on the R version can be found here: https://github.com/WinVector/vtreat.
We can compare this to the R solution (link).
We can compare the above cross-frame solution to a naive "design transform and model on the same data set" solution, as we show below. Note we are leaving 'filter_to_recommended' on, to show the non-cross-validated methodology still fails even in this "easy case."
plan_naive = vtreat.BinomialOutcomeTreatment(
    outcome_target=True,
    params=vtreat.vtreat_parameters({'filter_to_recommended':True}))
plan_naive.fit(d_train, churn_train)
naive_frame = plan_naive.transform(d_train)
/Users/johnmount/opt/anaconda3/envs/ai_academy_3_7/lib/python3.7/site-packages/vtreat/vtreat_api.py:235: UserWarning: possibly called transform on same data used to fit
(this causes over-fit, please use fit_transform() instead)
"possibly called transform on same data used to fit\n" +
model_vars = numpy.asarray(plan_naive.score_frame_["variable"][plan_naive.score_frame_["recommended"]])
len(model_vars)
230
naive_sparse = scipy.sparse.hstack([scipy.sparse.csc_matrix(naive_frame[[vi]]) for vi in model_vars])
fd_naive = xgboost.DMatrix(data=naive_sparse, label=churn_train)
x_parameters = {"max_depth":3, "objective":'binary:logistic'}
cvn = xgboost.cv(x_parameters, fd_naive, num_boost_round=100, verbose_eval=False)
bestn = cvn.loc[cvn["test-error-mean"] <= min(cvn["test-error-mean"] + 1.0e-9), :]
bestn
| | train-error-mean | train-error-std | test-error-mean | test-error-std |
|---|---|---|---|---|
| 96 | 0.047833 | 0.000314 | 0.058444 | 0.001457 |
ntreen = bestn.index.values[0]
ntreen
96
fittern = xgboost.XGBClassifier(n_estimators=ntreen, max_depth=3, objective='binary:logistic')
fittern
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0,
learning_rate=0.1, max_delta_step=0, max_depth=3,
min_child_weight=1, missing=None, n_estimators=96, n_jobs=1,
nthread=None, objective='binary:logistic', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=None, subsample=1, verbosity=1)
modeln = fittern.fit(naive_sparse, churn_train)
test_processedn = plan_naive.transform(d_test)
test_processedn = scipy.sparse.hstack([scipy.sparse.csc_matrix(test_processedn[[vi]]) for vi in model_vars])
pfn_train = pandas.DataFrame({"churn":churn_train})
pfn_train["pred_naive"] = modeln.predict_proba(naive_sparse)[:, 1]
wvpy.util.plot_roc(pfn_train["pred_naive"], pfn_train["churn"], title="Overfit Model on Train")
0.9496847639151214
pfn = pandas.DataFrame({"churn":churn_test})
pfn["pred_naive"] = modeln.predict_proba(test_processedn)[:, 1]
wvpy.util.plot_roc(pfn["pred_naive"], pfn["churn"], title="Overfit Model on Test")
0.598560484633134
Note the naive test performance is worse, despite its far better training performance. This is over-fit due to the nested model bias of using the same data to build the treatment plan and model without any cross-frame mitigations.
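The effect can be reproduced on synthetic data: a naive encoding of a pure-noise high-cardinality variable looks predictive, while an out-of-fold (cross-frame style) encoding correctly shows no signal. A hedged sketch (simplified two-fold plan, regression-style outcome; not vtreat's implementation):

```python
import numpy
import pandas

rng = numpy.random.default_rng(2020)
n, n_levels = 2000, 200
df = pandas.DataFrame({
    "cat": rng.integers(0, n_levels, size=n),  # pure-noise categorical variable
    "y": rng.normal(size=n)})                  # outcome independent of cat

# naive encoding: per-level mean of y computed on the same rows it scores
naive = df.groupby("cat")["y"].transform("mean")

# cross-frame style: each row's code comes only from the other fold
fold = numpy.arange(n) % 2
oof = numpy.empty(n)
for f in (0, 1):
    means = df.loc[fold != f].groupby("cat")["y"].mean()
    oof[fold == f] = df.loc[fold == f, "cat"].map(means).fillna(0.0)

print(numpy.corrcoef(naive, df["y"])[0, 1])  # large: apparent (leaked) signal
print(numpy.corrcoef(oof, df["y"])[0, 1])    # near zero: no real signal
```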



