This is a supervised classification example taken from the KDD 2009 cup. A copy of the data and details can be found here: https://github.com/WinVector/PDSwR2/tree/master/KDD2009. The problem was to predict account cancellation ("churn") from very messy data (column names not given, numeric and categorical variables, many missing values, some categorical variables with a large number of possible levels). In this example we show how to quickly use vtreat to prepare the data for modeling. vtreat takes in Pandas DataFrames and returns both a treatment plan and a clean Pandas DataFrame ready for modeling.

To install:

!pip install vtreat
!pip install wvpy

Load our packages/modules.

import pandas
import xgboost
import vtreat
import vtreat.cross_plan
import numpy.random
import wvpy.util
import scipy.sparse

Read in explanatory variables.

# data from https://github.com/WinVector/PDSwR2/tree/master/KDD2009
dir = "../../../PracticalDataScienceWithR2nd/PDSwR2/KDD2009/"
d = pandas.read_csv(dir + 'orange_small_train.data.gz', sep='\t', header=0)
vars = [c for c in d.columns]
d.shape
(50000, 230)

Read in the dependent variable we are trying to predict.

churn = pandas.read_csv(dir + 'orange_small_train_churn.labels.txt', header=None)
churn.columns = ["churn"]
churn.shape
(50000, 1)
churn["churn"].value_counts()
-1    46328
 1     3672
Name: churn, dtype: int64

Arrange test/train split.

numpy.random.seed(2020)
n = d.shape[0]
# https://github.com/WinVector/pyvtreat/blob/master/Examples/CustomizedCrossPlan/CustomizedCrossPlan.md
split1 = vtreat.cross_plan.KWayCrossPlanYStratified().split_plan(n_rows=n, k_folds=10, y=churn.iloc[:, 0])
train_idx = set(split1[0]['train'])
is_train = [i in train_idx for i in range(n)]
is_test = numpy.logical_not(is_train)

(The performance reported in runs of this example was sensitive to the prevalence of the churn variable in the test set; we are cutting down on this source of evaluation variance by using the stratified split.)
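As a rough illustration of why stratification helps, the sketch below holds out about one tenth of each class separately, so the held-out prevalence matches the overall prevalence. It uses only numpy; the function name `stratified_holdout` and the toy label vector are hypothetical, and this is not the implementation behind `KWayCrossPlanYStratified`.

```python
import numpy as np

def stratified_holdout(y, k_folds=10, seed=2020):
    """Return a boolean mask that holds out about 1/k_folds of each class."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    is_test = np.zeros(len(y), dtype=bool)
    for level in np.unique(y):
        idx = np.flatnonzero(y == level)   # rows belonging to this class
        rng.shuffle(idx)                   # random order within the class
        n_hold = len(idx) // k_folds       # hold out the same share of every class
        is_test[idx[:n_hold]] = True
    return is_test

# Toy labels (hypothetical): 10% positives, coded 1 / -1 like the churn labels.
y = np.array([1] * 100 + [-1] * 900)
is_test = stratified_holdout(y)
test_prevalence = float(np.mean(y[is_test] == 1))
train_prevalence = float(np.mean(y[~is_test] == 1))
```

With an unstratified random split, the number of positives landing in the test set would vary from run to run; here it is fixed by construction.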

d_train = d.loc[is_train, :].copy()
churn_train = numpy.asarray(churn.loc[is_train, :]["churn"]==1)
d_test = d.loc[is_test, :].copy()
churn_test = numpy.asarray(churn.loc[is_test, :]["churn"]==1)

Take a look at the explanatory variables. They are a mess: many values are missing, and the categorical variables cannot be used directly without some re-encoding.

d_train.head()
Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10 ... Var221 Var222 Var223 Var224 Var225 Var226 Var227 Var228 Var229 Var230
0 NaN NaN NaN NaN NaN 1526.0 7.0 NaN NaN NaN ... oslk fXVEsaq jySVZNlOJy NaN NaN xb3V RAYp F2FyR07IdsN7I NaN NaN
1 NaN NaN NaN NaN NaN 525.0 0.0 NaN NaN NaN ... oslk 2Kb5FSF LM8l689qOp NaN NaN fKCe RAYp F2FyR07IdsN7I NaN NaN
2 NaN NaN NaN NaN NaN 5236.0 7.0 NaN NaN NaN ... Al6ZaUT NKv4yOc jySVZNlOJy NaN kG3k Qu4f 02N6s8f ib5G6X1eUxUn6 am7c NaN
4 NaN NaN NaN NaN NaN 1029.0 7.0 NaN NaN NaN ... oslk 1J2cvxe LM8l689qOp NaN kG3k FSa2 RAYp F2FyR07IdsN7I mj86 NaN
5 NaN NaN NaN NaN NaN 658.0 7.0 NaN NaN NaN ... zCkv QqVuch3 LM8l689qOp NaN NaN Qcbd 02N6s8f Zy3gnGM am7c NaN

5 rows × 230 columns

d_train.shape
(45000, 230)

Try building a model directly off this data (this will fail).

fitter = xgboost.XGBClassifier(n_estimators=10, max_depth=3, objective='binary:logistic')
try:
    fitter.fit(d_train, churn_train)
except Exception as ex:
    print(ex)
DataFrame.dtypes for data must be int, float or bool.
                Did not expect the data types in fields Var191, Var192, Var193, Var194, Var195, Var196, Var197, Var198, Var199, Var200, Var201, Var202, Var203, Var204, Var205, Var206, Var207, Var208, Var210, Var211, Var212, Var213, Var214, Var215, Var216, Var217, Var218, Var219, Var220, Var221, Var222, Var223, Var224, Var225, Var226, Var227, Var228, Var229

Let's quickly prepare a data frame with none of these issues.

We start by building our treatment plan, which has the sklearn.pipeline.Pipeline interfaces.

plan = vtreat.BinomialOutcomeTreatment(
    outcome_target=True,
    params=vtreat.vtreat_parameters({'filter_to_recommended':True}))

Use .fit_transform() to get a special copy of the treated training data that has cross-validated mitigations against nested model bias. We call this a "cross frame." .fit_transform() deliberately returns a different DataFrame than .fit().transform() would (.fit().transform() would damage the modeling effort due to nested model bias; the .fit_transform() "cross frame" uses cross-validation techniques similar to "stacking" to mitigate these issues).
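To make the nested model bias concrete, here is a minimal sketch in pandas of an outcome-based code for a categorical variable, built two ways. It is an illustration under stated assumptions, not vtreat's implementation: the data are synthetic, every row has its own level (an extreme high-cardinality case), the outcome is pure noise, and the fold count `k = 5` is an arbitrary choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2020)
n = 1000
df = pd.DataFrame({
    "x": [f"lev_{i}" for i in range(n)],             # one level per row: carries no signal
    "y": rng.integers(0, 2, size=n).astype(float),   # outcome is pure noise
})

# Naive: code each level by the mean outcome of the same rows the model will see.
naive_code = df.groupby("x")["y"].transform("mean")

# Cross-validated: code each row using only the rows in the other folds.
k = 5
fold = rng.permutation(n) % k
cross_code = np.zeros(n)
for f in range(k):
    train = df.loc[fold != f]
    level_means = train.groupby("x")["y"].mean()
    global_mean = train["y"].mean()                  # fallback for levels unseen in training folds
    held_out = df.loc[fold == f, "x"]
    cross_code[fold == f] = held_out.map(level_means).fillna(global_mean).to_numpy()

naive_matches_y = bool(np.allclose(naive_code, df["y"]))
cross_spread = float(cross_code.max() - cross_code.min())
```

In this toy case the naive code simply memorizes the outcome, so a downstream model fit on the same rows would see a seemingly perfect variable. The cross-validated code barely varies, which is the honest answer for a variable with no signal.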

cross_frame = plan.fit_transform(d_train, churn_train)

Take a look at the new data. This frame is guaranteed to be all numeric with no missing values, with the rows in the same order as the training data.

cross_frame.head()
Var2_is_bad Var3_is_bad Var4_is_bad Var5_is_bad Var6_is_bad Var7_is_bad Var10_is_bad Var11_is_bad Var13_is_bad Var14_is_bad ... Var227_lev_RAYp Var227_lev_ZI9m Var228_logit_code Var228_prevalence_code Var228_lev_F2FyR07IdsN7I Var229_logit_code Var229_prevalence_code Var229_lev__NA_ Var229_lev_am7c Var229_lev_mj86
0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 1.0 0.0 1.0 ... 1.0 0.0 0.145563 0.654178 1.0 0.180634 0.568733 1.0 0.0 0.0
1 1.0 1.0 1.0 1.0 0.0 0.0 1.0 1.0 0.0 1.0 ... 1.0 0.0 0.150727 0.654178 1.0 0.175825 0.568733 1.0 0.0 0.0
2 1.0 1.0 1.0 1.0 0.0 0.0 1.0 1.0 0.0 1.0 ... 0.0 0.0 -0.591072 0.053667 0.0 -0.296854 0.233689 0.0 1.0 0.0
3 1.0 1.0 1.0 1.0 0.0 0.0 1.0 1.0 0.0 1.0 ... 1.0 0.0 0.150727 0.654178 1.0 -0.292587 0.196044 0.0 0.0 1.0
4 1.0 1.0 1.0 1.0 0.0 0.0 1.0 1.0 0.0 1.0 ... 0.0 0.0 -0.323715 0.018556 0.0 -0.268261 0.233689 0.0 1.0 0.0

5 rows × 234 columns

cross_frame.shape
(45000, 234)
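The column names above (`Var2_is_bad`, `Var228_prevalence_code`, `Var227_lev_RAYp`) suggest the kinds of derived variables involved. The sketch below builds similar columns by hand in pandas on a hypothetical four-row frame. The names `num` and `cat` are made up, mean imputation is an assumption about how missing numeric values might be filled, and the outcome-aware `logit_code` columns are left out.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "num": [1526.0, np.nan, 5236.0, np.nan],     # numeric column with missing values
    "cat": ["RAYp", "RAYp", "02N6s8f", None],    # categorical column with a missing entry
})

treated = pd.DataFrame(index=toy.index)
# Missing-value indicator, plus a copy of the column with the gaps filled in.
treated["num_is_bad"] = toy["num"].isna().astype(float)
treated["num"] = toy["num"].fillna(toy["num"].mean())
# Treat "missing" as its own level, then derive prevalence and indicator columns.
cat = toy["cat"].fillna("_NA_")
treated["cat_prevalence_code"] = cat.map(cat.value_counts(normalize=True))
treated["cat_lev_RAYp"] = (cat == "RAYp").astype(float)
```

Every column of `treated` is numeric and free of missing values, which is the property the treated frame is described as guaranteeing.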

Pick a recommended subset of the new derived variables.

plan.score_frame_.head()
variable orig_variable treatment y_aware has_range PearsonR R2 significance vcount default_threshold recommended
0 Var1_is_bad Var1 missing_indicator False True 0.004328 0.000019 0.358610 193.0 0.001036 False
1 Var2_is_bad Var2 missing_indicator False True 0.016358 0.000268 0.000520 193.0 0.001036 True
2 Var3_is_bad Var3 missing_indicator False True 0.016325 0.000266 0.000534 193.0 0.001036 True
3 Var4_is_bad Var4 missing_indicator False True 0.020327 0.000413 0.000016 193.0 0.001036 True
4 Var5_is_bad Var5 missing_indicator False True 0.017267 0.000298 0.000249 193.0 0.001036 True
model_vars = numpy.asarray(plan.score_frame_["variable"][plan.score_frame_["recommended"]])
len(model_vars)
234
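The selection above can be checked on the five score-frame rows shown. The sketch below retypes those rows into a small DataFrame (only the columns needed) and compares the `recommended` flag with a plain significance cut. That the flag is exactly `significance < default_threshold` is an assumption that happens to hold on these rows, not a confirmed description of vtreat's rule.

```python
import pandas as pd

# The five rows displayed by plan.score_frame_.head(), retyped by hand.
score_head = pd.DataFrame({
    "variable": ["Var1_is_bad", "Var2_is_bad", "Var3_is_bad", "Var4_is_bad", "Var5_is_bad"],
    "significance": [0.358610, 0.000520, 0.000534, 0.000016, 0.000249],
    "default_threshold": [0.001036] * 5,
    "recommended": [False, True, True, True, True],
})

# Hypothesized rule: keep variables whose significance is below the threshold.
below = score_head["significance"] < score_head["default_threshold"]
selected = score_head.loc[score_head["recommended"], "variable"].tolist()
# selected == ['Var2_is_bad', 'Var3_is_bad', 'Var4_is_bad', 'Var5_is_bad']
```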

Fit the model.

cross_frame.dtypes
Var2_is_bad                            float64
Var3_is_bad                            float64
Var4_is_bad                            float64
Var5_is_bad                            float64
Var6_is_bad                            float64
                                  ...         
Var229_logit_code                      float64
Var229_prevalence_code                 float64
Var229_lev__NA_           Sparse[float64, 0.0]
Var229_lev_am7c           Sparse[float64, 0.0]
Var229_lev_mj86           Sparse[float64, 0.0]
Length: 234, dtype: object
# fails due to sparse columns
# can also work around this by setting the vtreat parameter 'sparse_indicators' to False
try:
    cross_sparse = xgboost.DMatrix(data=cross_frame.loc[:, model_vars], label=churn_train)
except Exception as ex:
    print(ex)
DataFrame.dtypes for data must be int, float or bool.
                Did not expect the data types in fields Var191_lev__NA_, Var193_lev_RO12, Var193_lev_2Knk1KF, Var194_lev__NA_, Var194_lev_SEuy, Var195_lev_taul, Var200_lev__NA_, Var201_lev__NA_, Var201_lev_smXZ, Var205_lev_VpdQ, Var206_lev_IYzP, Var206_lev_zm5i, Var206_lev__NA_, Var207_lev_me75fM6ugJ, Var207_lev_7M47J5GA0pTYIFxg5uy, Var210_lev_uKAI, Var211_lev_L84s, Var211_lev_Mtgm, Var212_lev_NhsEn4L, Var212_lev_XfqtO3UdzaXh_, Var213_lev__NA_, Var214_lev__NA_, Var218_lev_cJvF, Var218_lev_UYBR, Var221_lev_oslk, Var221_lev_zCkv, Var225_lev__NA_, Var225_lev_ELof, Var226_lev_FSa2, Var227_lev_RAYp, Var227_lev_ZI9m, Var228_lev_F2FyR07IdsN7I, Var229_lev__NA_, Var229_lev_am7c, Var229_lev_mj86
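The fields named in this error are the indicator columns, which the `cross_frame.dtypes` listing shows as pandas sparse columns. Besides the scipy.sparse conversion used later in this example, one possible alternative for frames that fit in memory is to densify those columns first. The sketch below is a pandas-only illustration on a hypothetical two-column frame; the helper name `densify` is made up, and whether the result suits a given xgboost version is not verified here.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): one dense column, one pandas sparse column.
frame = pd.DataFrame({
    "Var6": [1526.0, 525.0, 5236.0],
    "Var229_lev_am7c": pd.arrays.SparseArray([0.0, 0.0, 1.0], fill_value=0.0),
})

def densify(df):
    """Return a copy of df in which sparse columns are converted to dense float64."""
    out = {}
    for name in df.columns:
        col = df[name]
        if isinstance(col.dtype, pd.SparseDtype):
            col = col.sparse.to_dense()    # expand the sparse storage
        out[name] = col.astype(float)
    return pd.DataFrame(out, index=df.index)

dense = densify(frame)
```

Densifying trades memory for simplicity: a frame with many mostly-zero indicator columns can grow considerably, which is presumably why sparse indicators are the default.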
# also fails
try:
    cross_sparse = scipy.sparse.csc_matrix(cross_frame[model_vars])
except Exception as ex:
    print(ex)
no supported conversion for types: (dtype('O'),)
# works
cross_sparse = scipy.sparse.hstack([scipy.sparse.csc_matrix(cross_frame[[vi]]) for vi in model_vars])
# https://xgboost.readthedocs.io/en/latest/python/python_intro.html
fd = xgboost.DMatrix(
    data=cross_sparse, 
    label=churn_train)
x_parameters = {"max_depth":3, "objective":'binary:logistic'}
cv = xgboost.cv(x_parameters, fd, num_boost_round=100, verbose_eval=False)
cv.head()
train-error-mean train-error-std test-error-mean test-error-std
0 0.073300 0.000709 0.073311 0.001447
1 0.073322 0.000741 0.073333 0.001415
2 0.073344 0.000747 0.073467 0.001464
3 0.073378 0.000725 0.073467 0.001464
4 0.073356 0.000739 0.073444 0.001450
best = cv.loc[cv["test-error-mean"]<= min(cv["test-error-mean"] + 1.0e-9), :]
best
train-error-mean train-error-std test-error-mean test-error-std
83 0.069933 0.000401 0.072333 0.001093
ntree = best.index.values[0]
ntree
83
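The row selection above can be reproduced on the five rows printed by `cv.head()`. The sketch below retypes those rows and picks the row with the smallest mean test error, both with `idxmin()` and with the tolerance comparison used above. Because only the first five rounds are included, the answer differs from the 83 found on the full table.

```python
import pandas as pd

# First five rows of the cross-validation table shown earlier, retyped by hand.
cv_head = pd.DataFrame({
    "train-error-mean": [0.073300, 0.073322, 0.073344, 0.073378, 0.073356],
    "test-error-mean": [0.073311, 0.073333, 0.073467, 0.073467, 0.073444],
})

# First row achieving the minimum mean test error.
best_row = int(cv_head["test-error-mean"].idxmin())

# Equivalent tolerance-based selection: all rows within 1e-9 of the minimum.
tol = 1.0e-9
min_err = cv_head["test-error-mean"].min()
within_tol = cv_head.index[cv_head["test-error-mean"] <= min_err + tol].tolist()
```

If the table is indexed from zero, with row i holding the results after i + 1 boosting rounds (an assumption worth checking against the xgboost documentation), then the matching number of trees would be `best_row + 1`.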
fitter = xgboost.XGBClassifier(n_estimators=ntree, max_depth=3, objective='binary:logistic')
fitter
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=83, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)
model = fitter.fit(cross_sparse, churn_train)

Apply the data transform to our held-out data.

test_processed = plan.transform(d_test)

Plot the quality of the model on training data (a biased measure of performance).

pf_train = pandas.DataFrame({"churn":churn_train})
pf_train["pred"] = model.predict_proba(cross_sparse)[:, 1]
wvpy.util.plot_roc(pf_train["pred"], pf_train["churn"], title="Model on Train")

[Figure: ROC plot, "Model on Train"]

0.778895961015585

Plot the quality of the model score on the held-out data. This AUC is not great, but in the ballpark of the original contest winners.

test_sparse = scipy.sparse.hstack([scipy.sparse.csc_matrix(test_processed[[vi]]) for vi in model_vars])
pf = pandas.DataFrame({"churn":churn_test})
pf["pred"] = model.predict_proba(test_sparse)[:, 1]
wvpy.util.plot_roc(pf["pred"], pf["churn"], title="Model on Test")

[Figure: ROC plot, "Model on Test"]

0.7472558854286449
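The number printed under each ROC plot is read here as the area under the curve (AUC), as the surrounding text suggests. One way to interpret an AUC is as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The sketch below computes it that way with numpy on four hypothetical scores; `auc_by_pairs` is a made-up helper and not how `wvpy.util.plot_roc` computes its value.

```python
import numpy as np

def auc_by_pairs(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos = scores[labels]
    neg = scores[~labels]
    diff = pos[:, None] - neg[None, :]     # every positive score against every negative score
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Hypothetical scores: three of the four positive/negative pairs are ordered correctly.
example_auc = auc_by_pairs([0.1, 0.4, 0.35, 0.8], [False, False, True, True])
# example_auc == 0.75
```

The pairwise matrix needs memory proportional to the number of positives times the number of negatives, so this form is meant for small examples rather than large data sets.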

Notice we dealt with many problem columns at once, and in a statistically sound manner. More on the vtreat package for Python can be found here: https://github.com/WinVector/pyvtreat. Details on the R version can be found here: https://github.com/WinVector/vtreat.

We can compare this to the R solution (link).

We can compare the above cross-frame solution to a naive "design transform and model on the same data set" solution as we show below. Note we are leaving filter_to_recommended on, to show that the non-cross-validated methodology still fails in an "easy case."

plan_naive = vtreat.BinomialOutcomeTreatment(
    outcome_target=True,              
    params=vtreat.vtreat_parameters({'filter_to_recommended':True}))
plan_naive.fit(d_train, churn_train)
naive_frame = plan_naive.transform(d_train)
/Users/johnmount/opt/anaconda3/envs/ai_academy_3_7/lib/python3.7/site-packages/vtreat/vtreat_api.py:235: UserWarning: possibly called transform on same data used to fit
(this causes over-fit, please use fit_transform() instead)
  "possibly called transform on same data used to fit\n" +
model_vars = numpy.asarray(plan_naive.score_frame_["variable"][plan_naive.score_frame_["recommended"]])
len(model_vars)
230
naive_sparse = scipy.sparse.hstack([scipy.sparse.csc_matrix(naive_frame[[vi]]) for vi in model_vars])
fd_naive = xgboost.DMatrix(data=naive_sparse, label=churn_train)
x_parameters = {"max_depth":3, "objective":'binary:logistic'}
cvn = xgboost.cv(x_parameters, fd_naive, num_boost_round=100, verbose_eval=False)
bestn = cvn.loc[cvn["test-error-mean"] <= min(cvn["test-error-mean"] + 1.0e-9), :]
bestn
train-error-mean train-error-std test-error-mean test-error-std
96 0.047833 0.000314 0.058444 0.001457
ntreen = bestn.index.values[0]
ntreen
96
fittern = xgboost.XGBClassifier(n_estimators=ntreen, max_depth=3, objective='binary:logistic')
fittern
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=96, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)
modeln = fittern.fit(naive_sparse, churn_train)
test_processedn = plan_naive.transform(d_test)
test_processedn = scipy.sparse.hstack([scipy.sparse.csc_matrix(test_processedn[[vi]]) for vi in model_vars])
pfn_train = pandas.DataFrame({"churn":churn_train})
pfn_train["pred_naive"] = modeln.predict_proba(naive_sparse)[:, 1]
wvpy.util.plot_roc(pfn_train["pred_naive"], pfn_train["churn"], title="Overfit Model on Train")

[Figure: ROC plot, "Overfit Model on Train"]

0.9496847639151214
pfn = pandas.DataFrame({"churn":churn_test})
pfn["pred_naive"] = modeln.predict_proba(test_processedn)[:, 1]
wvpy.util.plot_roc(pfn["pred_naive"], pfn["churn"], title="Overfit Model on Test")

[Figure: ROC plot, "Overfit Model on Test"]

0.598560484633134

Note the naive test performance is worse, despite its far better training performance. This is over-fit due to the nested model bias of using the same data to build the treatment plan and model without any cross-frame mitigations.
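The size of the effect is easiest to see as a train-minus-test gap. The short calculation below uses only the four values printed under the ROC plots in this example, read as AUCs.

```python
# AUC values copied from the outputs under the four ROC plots.
auc = {
    "cross_frame": {"train": 0.778895961015585, "test": 0.7472558854286449},
    "naive": {"train": 0.9496847639151214, "test": 0.598560484633134},
}

# Train-minus-test gap for each approach, rounded to four decimals.
gaps = {name: round(v["train"] - v["test"], 4) for name, v in auc.items()}
# gaps == {'cross_frame': 0.0316, 'naive': 0.3511}
```

The cross-frame model loses about 0.03 going from training to held-out data, while the naive model loses about 0.35 and ends up with the lower test value of the two.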