This is a supervised classification example taken from the KDD Cup 2009. A copy of the data and details can be found here: https://github.com/WinVector/PDSwR2/tree/master/KDD2009. The problem is to predict account cancellation ("churn") from very messy data (column names not given, numeric and categorical variables, many missing values, and some categorical variables with a large number of possible levels). In this example we show how to quickly use vtreat to prepare the data for modeling. vtreat takes in Pandas DataFrames and returns both a treatment plan and a clean Pandas DataFrame ready for modeling.
!pip install vtreat
!pip install wvpy
Load our packages/modules.
import pandas
import xgboost
import vtreat
import vtreat.cross_plan
import numpy.random
import wvpy.util
import scipy.sparse
Read in the explanatory variables.
# data from https://github.com/WinVector/PDSwR2/tree/master/KDD2009
dir = "../../../PracticalDataScienceWithR2nd/PDSwR2/KDD2009/"
d = pandas.read_csv(dir + 'orange_small_train.data.gz', sep='\t', header=0)
vars = [c for c in d.columns]
d.shape
(50000, 230)
Read in dependent variable we are trying to predict.
churn = pandas.read_csv(dir + 'orange_small_train_churn.labels.txt', header=None)
churn.columns = ["churn"]
churn.shape
(50000, 1)
churn["churn"].value_counts()
-1    46328
 1     3672
Name: churn, dtype: int64
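As a quick sanity check, the counts above imply a fairly unbalanced problem: only about 7.3% of accounts churn. A minimal sketch of the arithmetic (values copied from the printed counts):

```python
# churn prevalence implied by the value_counts() output above
n_churn, n_stay = 3672, 46328
prevalence = n_churn / (n_churn + n_stay)
print(prevalence)  # 0.07344
```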
Arrange test/train split.
numpy.random.seed(2020)
n = d.shape[0]
# https://github.com/WinVector/pyvtreat/blob/master/Examples/CustomizedCrossPlan/CustomizedCrossPlan.md
split1 = vtreat.cross_plan.KWayCrossPlanYStratified().split_plan(n_rows=n, k_folds=10, y=churn.iloc[:, 0])
train_idx = set(split1[0]['train'])
is_train = [i in train_idx for i in range(n)]
is_test = numpy.logical_not(is_train)
(The reported performance of this example was sensitive to the prevalence of the churn variable in the test set; we cut down on this source of evaluation variance by using the stratified split.)
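vtreat's KWayCrossPlanYStratified builds the stratified plan for us. As an illustration of the underlying idea (not vtreat's actual implementation), a y-stratified fold assignment can be sketched as:

```python
import numpy as np

def stratified_fold_ids(y, k_folds, rng):
    # assign fold ids separately within each outcome class, so every fold
    # ends up with roughly the class prevalence of the whole data set
    fold = np.zeros(len(y), dtype=int)
    for level in np.unique(y):
        idx = np.flatnonzero(y == level)
        rng.shuffle(idx)
        fold[idx] = np.arange(len(idx)) % k_folds
    return fold

rng = np.random.default_rng(2020)
y_toy = np.array([1] * 100 + [-1] * 900)
fold = stratified_fold_ids(y_toy, 10, rng)
# each of the 10 folds gets exactly 10 positives out of 100 rows
print(((fold == 0) & (y_toy == 1)).sum(), (fold == 0).sum())  # 10 100
```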
d_train = d.loc[is_train, :].copy()
churn_train = numpy.asarray(churn.loc[is_train, :]["churn"]==1)
d_test = d.loc[is_test, :].copy()
churn_test = numpy.asarray(churn.loc[is_test, :]["churn"]==1)
Take a look at the explanatory variables. They are a mess: many missing values, and categorical variables that cannot be used directly without some re-encoding.
d_train.head()
| | Var1 | Var2 | Var3 | Var4 | Var5 | Var6 | Var7 | Var8 | Var9 | Var10 | ... | Var221 | Var222 | Var223 | Var224 | Var225 | Var226 | Var227 | Var228 | Var229 | Var230 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | NaN | 1526.0 | 7.0 | NaN | NaN | NaN | ... | oslk | fXVEsaq | jySVZNlOJy | NaN | NaN | xb3V | RAYp | F2FyR07IdsN7I | NaN | NaN |
| 1 | NaN | NaN | NaN | NaN | NaN | 525.0 | 0.0 | NaN | NaN | NaN | ... | oslk | 2Kb5FSF | LM8l689qOp | NaN | NaN | fKCe | RAYp | F2FyR07IdsN7I | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | 5236.0 | 7.0 | NaN | NaN | NaN | ... | Al6ZaUT | NKv4yOc | jySVZNlOJy | NaN | kG3k | Qu4f | 02N6s8f | ib5G6X1eUxUn6 | am7c | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | 1029.0 | 7.0 | NaN | NaN | NaN | ... | oslk | 1J2cvxe | LM8l689qOp | NaN | kG3k | FSa2 | RAYp | F2FyR07IdsN7I | mj86 | NaN |
| 5 | NaN | NaN | NaN | NaN | NaN | 658.0 | 7.0 | NaN | NaN | NaN | ... | zCkv | QqVuch3 | LM8l689qOp | NaN | NaN | Qcbd | 02N6s8f | Zy3gnGM | am7c | NaN |
5 rows × 230 columns
d_train.shape
(45000, 230)
Try building a model directly off this data (this will fail).
fitter = xgboost.XGBClassifier(n_estimators=10, max_depth=3, objective='binary:logistic')
try:
    fitter.fit(d_train, churn_train)
except Exception as ex:
    print(ex)
DataFrame.dtypes for data must be int, float or bool.
Did not expect the data types in fields Var191, Var192, Var193, Var194, Var195, Var196, Var197, Var198, Var199, Var200, Var201, Var202, Var203, Var204, Var205, Var206, Var207, Var208, Var210, Var211, Var212, Var213, Var214, Var215, Var216, Var217, Var218, Var219, Var220, Var221, Var222, Var223, Var224, Var225, Var226, Var227, Var228, Var229
Let's quickly prepare a data frame with none of these issues.
We start by building our treatment plan; it has the sklearn.pipeline.Pipeline interface.
plan = vtreat.BinomialOutcomeTreatment(
    outcome_target=True,
    params=vtreat.vtreat_parameters({'filter_to_recommended':True}))
Use .fit_transform() to get a special copy of the treated training data that has cross-validated mitigations against nested model bias. We call this a "cross frame." .fit_transform() deliberately returns a different DataFrame than .fit().transform() would (the .fit().transform() result would damage the modeling effort due to nested model bias; the .fit_transform() "cross frame" uses cross-validation techniques similar to "stacking" to mitigate these issues).
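The nested model bias issue is easiest to see on impact/logit-style codes: a naive per-level encoding lets each row see its own outcome. A toy sketch of the contrast (leave-one-out here for brevity; vtreat actually uses k-fold cross plans):

```python
import numpy as np
import pandas as pd

# toy data: one categorical variable and a numeric outcome
df = pd.DataFrame({"x": ["a", "a", "b", "b"], "y": [1, 0, 1, 1]})

# fit().transform()-style encoding: each row's code includes its own outcome
in_sample = df.groupby("x")["y"].transform("mean")
print(in_sample.tolist())  # [0.5, 0.5, 1.0, 1.0]

# cross-frame-style encoding: each row's code is computed from the other rows
oof = np.array([
    df.loc[(df["x"] == xi) & (df.index != i), "y"].mean()
    for i, xi in enumerate(df["x"])])
print(oof.tolist())  # [0.0, 1.0, 1.0, 1.0]
```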
cross_frame = plan.fit_transform(d_train, churn_train)
Take a look at the new data. This frame is guaranteed to be all numeric with no missing values, with the rows in the same order as the training data.
cross_frame.head()
| | Var2_is_bad | Var3_is_bad | Var4_is_bad | Var5_is_bad | Var6_is_bad | Var7_is_bad | Var10_is_bad | Var11_is_bad | Var13_is_bad | Var14_is_bad | ... | Var227_lev_RAYp | Var227_lev_ZI9m | Var228_logit_code | Var228_prevalence_code | Var228_lev_F2FyR07IdsN7I | Var229_logit_code | Var229_prevalence_code | Var229_lev__NA_ | Var229_lev_am7c | Var229_lev_mj86 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 0.145563 | 0.654178 | 1.0 | 0.180634 | 0.568733 | 1.0 | 0.0 | 0.0 |
| 1 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 0.150727 | 0.654178 | 1.0 | 0.175825 | 0.568733 | 1.0 | 0.0 | 0.0 |
| 2 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | -0.591072 | 0.053667 | 0.0 | -0.296854 | 0.233689 | 0.0 | 1.0 | 0.0 |
| 3 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 0.150727 | 0.654178 | 1.0 | -0.292587 | 0.196044 | 0.0 | 0.0 | 1.0 |
| 4 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | -0.323715 | 0.018556 | 0.0 | -0.268261 | 0.233689 | 0.0 | 1.0 | 0.0 |
5 rows × 234 columns
cross_frame.shape
(45000, 234)
Pick a recommended subset of the new derived variables.
plan.score_frame_.head()
| | variable | orig_variable | treatment | y_aware | has_range | PearsonR | R2 | significance | vcount | default_threshold | recommended |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Var1_is_bad | Var1 | missing_indicator | False | True | 0.004328 | 0.000019 | 0.358610 | 193.0 | 0.001036 | False |
| 1 | Var2_is_bad | Var2 | missing_indicator | False | True | 0.016358 | 0.000268 | 0.000520 | 193.0 | 0.001036 | True |
| 2 | Var3_is_bad | Var3 | missing_indicator | False | True | 0.016325 | 0.000266 | 0.000534 | 193.0 | 0.001036 | True |
| 3 | Var4_is_bad | Var4 | missing_indicator | False | True | 0.020327 | 0.000413 | 0.000016 | 193.0 | 0.001036 | True |
| 4 | Var5_is_bad | Var5 | missing_indicator | False | True | 0.017267 | 0.000298 | 0.000249 | 193.0 | 0.001036 | True |
model_vars = numpy.asarray(plan.score_frame_["variable"][plan.score_frame_["recommended"]])
len(model_vars)
234
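The selection above is just a boolean filter on the score frame. A toy sketch with hypothetical rows (the real plan.score_frame_ has one row per derived variable):

```python
import numpy
import pandas

# hypothetical stand-in for plan.score_frame_
score_frame = pandas.DataFrame({
    "variable": ["Var1_is_bad", "Var2_is_bad", "Var4_is_bad"],
    "recommended": [False, True, True]})

model_vars_toy = numpy.asarray(
    score_frame["variable"][score_frame["recommended"]])
print(list(model_vars_toy))  # ['Var2_is_bad', 'Var4_is_bad']
```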
Fit the model
cross_frame.dtypes
Var2_is_bad                          float64
Var3_is_bad                          float64
Var4_is_bad                          float64
Var5_is_bad                          float64
Var6_is_bad                          float64
                               ...
Var229_logit_code                    float64
Var229_prevalence_code               float64
Var229_lev__NA_         Sparse[float64, 0.0]
Var229_lev_am7c         Sparse[float64, 0.0]
Var229_lev_mj86         Sparse[float64, 0.0]
Length: 234, dtype: object
# fails due to sparse columns
# can also work around this by setting the vtreat parameter 'sparse_indicators' to False
try:
    cross_sparse = xgboost.DMatrix(data=cross_frame.loc[:, model_vars], label=churn_train)
except Exception as ex:
    print(ex)
DataFrame.dtypes for data must be int, float or bool.
Did not expect the data types in fields Var191_lev__NA_, Var193_lev_RO12, Var193_lev_2Knk1KF, Var194_lev__NA_, Var194_lev_SEuy, Var195_lev_taul, Var200_lev__NA_, Var201_lev__NA_, Var201_lev_smXZ, Var205_lev_VpdQ, Var206_lev_IYzP, Var206_lev_zm5i, Var206_lev__NA_, Var207_lev_me75fM6ugJ, Var207_lev_7M47J5GA0pTYIFxg5uy, Var210_lev_uKAI, Var211_lev_L84s, Var211_lev_Mtgm, Var212_lev_NhsEn4L, Var212_lev_XfqtO3UdzaXh_, Var213_lev__NA_, Var214_lev__NA_, Var218_lev_cJvF, Var218_lev_UYBR, Var221_lev_oslk, Var221_lev_zCkv, Var225_lev__NA_, Var225_lev_ELof, Var226_lev_FSa2, Var227_lev_RAYp, Var227_lev_ZI9m, Var228_lev_F2FyR07IdsN7I, Var229_lev__NA_, Var229_lev_am7c, Var229_lev_mj86
# also fails
try:
    cross_sparse = scipy.sparse.csc_matrix(cross_frame[model_vars])
except Exception as ex:
    print(ex)
no supported conversion for types: (dtype('O'),)
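The working approach converts each column to a sparse matrix separately and then binds them together. The mechanics on toy columns (plain numpy arrays standing in for the treated frame's columns):

```python
import numpy
import scipy.sparse

# two toy columns standing in for derived variables
cols = [numpy.array([1.0, 0.0, 0.0]), numpy.array([0.0, 0.0, 2.0])]

# convert each column to sparse on its own, then hstack into one matrix
mat = scipy.sparse.hstack(
    [scipy.sparse.csc_matrix(c.reshape(-1, 1)) for c in cols])
print(mat.shape)  # (3, 2)
```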
# works
cross_sparse = scipy.sparse.hstack([scipy.sparse.csc_matrix(cross_frame[[vi]]) for vi in model_vars])
# https://xgboost.readthedocs.io/en/latest/python/python_intro.html
fd = xgboost.DMatrix(
    data=cross_sparse,
    label=churn_train)
x_parameters = {"max_depth":3, "objective":'binary:logistic'}
cv = xgboost.cv(x_parameters, fd, num_boost_round=100, verbose_eval=False)
cv.head()
| | train-error-mean | train-error-std | test-error-mean | test-error-std |
|---|---|---|---|---|
| 0 | 0.073300 | 0.000709 | 0.073311 | 0.001447 |
| 1 | 0.073322 | 0.000741 | 0.073333 | 0.001415 |
| 2 | 0.073344 | 0.000747 | 0.073467 | 0.001464 |
| 3 | 0.073378 | 0.000725 | 0.073467 | 0.001464 |
| 4 | 0.073356 | 0.000739 | 0.073444 | 0.001450 |
best = cv.loc[cv["test-error-mean"]<= min(cv["test-error-mean"] + 1.0e-9), :]
best
| | train-error-mean | train-error-std | test-error-mean | test-error-std |
|---|---|---|---|---|
| 83 | 0.069933 | 0.000401 | 0.072333 | 0.001093 |
ntree = best.index.values[0]
ntree
83
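The best-round selection is a tolerance-based argmin over the cross-validation table. A sketch with hypothetical error numbers:

```python
import pandas

# hypothetical stand-in for the xgboost.cv() result
cv_toy = pandas.DataFrame({"test-error-mean": [0.080, 0.072, 0.075, 0.072]})

# keep rows within 1e-9 of the minimum; the first such index is the round count
best_toy = cv_toy.loc[cv_toy["test-error-mean"] <= min(cv_toy["test-error-mean"] + 1.0e-9), :]
print(best_toy.index.values[0])  # 1
```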
fitter = xgboost.XGBClassifier(n_estimators=ntree, max_depth=3, objective='binary:logistic')
fitter
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0,
learning_rate=0.1, max_delta_step=0, max_depth=3,
min_child_weight=1, missing=None, n_estimators=83, n_jobs=1,
nthread=None, objective='binary:logistic', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=None, subsample=1, verbosity=1)
model = fitter.fit(cross_sparse, churn_train)
Apply the data transform to our held-out data.
test_processed = plan.transform(d_test)
Plot the quality of the model on training data (a biased measure of performance).
pf_train = pandas.DataFrame({"churn":churn_train})
pf_train["pred"] = model.predict_proba(cross_sparse)[:, 1]
wvpy.util.plot_roc(pf_train["pred"], pf_train["churn"], title="Model on Train")
0.778895961015585
Plot the quality of the model score on the held-out data. This AUC is not great, but in the ballpark of the original contest winners.
test_sparse = scipy.sparse.hstack([scipy.sparse.csc_matrix(test_processed[[vi]]) for vi in model_vars])
pf = pandas.DataFrame({"churn":churn_test})
pf["pred"] = model.predict_proba(test_sparse)[:, 1]
wvpy.util.plot_roc(pf["pred"], pf["churn"], title="Model on Test")
0.7472558854286449
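The number printed with each ROC plot is the AUC: the probability that a random positive example outscores a random negative one. A minimal sketch of that rank-based computation (ties ignored; the function auc_score is our illustration, not wvpy's API):

```python
import numpy

def auc_score(y_true, scores):
    # Mann-Whitney form: AUC from the rank-sum of the positive examples
    order = numpy.argsort(scores)
    ranks = numpy.empty(len(scores))
    ranks[order] = numpy.arange(1, len(scores) + 1)
    pos = numpy.asarray(y_true, dtype=bool)
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```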
Notice we dealt with many problem columns at once, and in a statistically sound manner. More on the vtreat package for Python can be found here: https://github.com/WinVector/pyvtreat. Details on the R version can be found here: https://github.com/WinVector/vtreat.
We can compare this to the R solution (link).
We can compare the above cross-frame solution to a naive "design transform and model on the same data set" solution, as we show below. Note we are leaving 'filter_to_recommended' on, to show the non-cross-validated methodology still fails even in this "easy case."
plan_naive = vtreat.BinomialOutcomeTreatment(
    outcome_target=True,
    params=vtreat.vtreat_parameters({'filter_to_recommended':True}))
plan_naive.fit(d_train, churn_train)
naive_frame = plan_naive.transform(d_train)
/Users/johnmount/opt/anaconda3/envs/ai_academy_3_7/lib/python3.7/site-packages/vtreat/vtreat_api.py:235: UserWarning: possibly called transform on same data used to fit
(this causes over-fit, please use fit_transform() instead)
"possibly called transform on same data used to fit\n" +
model_vars = numpy.asarray(plan_naive.score_frame_["variable"][plan_naive.score_frame_["recommended"]])
len(model_vars)
230
naive_sparse = scipy.sparse.hstack([scipy.sparse.csc_matrix(naive_frame[[vi]]) for vi in model_vars])
fd_naive = xgboost.DMatrix(data=naive_sparse, label=churn_train)
x_parameters = {"max_depth":3, "objective":'binary:logistic'}
cvn = xgboost.cv(x_parameters, fd_naive, num_boost_round=100, verbose_eval=False)
bestn = cvn.loc[cvn["test-error-mean"] <= min(cvn["test-error-mean"] + 1.0e-9), :]
bestn
| | train-error-mean | train-error-std | test-error-mean | test-error-std |
|---|---|---|---|---|
| 96 | 0.047833 | 0.000314 | 0.058444 | 0.001457 |
ntreen = bestn.index.values[0]
ntreen
96
fittern = xgboost.XGBClassifier(n_estimators=ntreen, max_depth=3, objective='binary:logistic')
fittern
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0,
learning_rate=0.1, max_delta_step=0, max_depth=3,
min_child_weight=1, missing=None, n_estimators=96, n_jobs=1,
nthread=None, objective='binary:logistic', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=None, subsample=1, verbosity=1)
modeln = fittern.fit(naive_sparse, churn_train)
test_processedn = plan_naive.transform(d_test)
test_processedn = scipy.sparse.hstack([scipy.sparse.csc_matrix(test_processedn[[vi]]) for vi in model_vars])
pfn_train = pandas.DataFrame({"churn":churn_train})
pfn_train["pred_naive"] = modeln.predict_proba(naive_sparse)[:, 1]
wvpy.util.plot_roc(pfn_train["pred_naive"], pfn_train["churn"], title="Overfit Model on Train")
0.9496847639151214
pfn = pandas.DataFrame({"churn":churn_test})
pfn["pred_naive"] = modeln.predict_proba(test_processedn)[:, 1]
wvpy.util.plot_roc(pfn["pred_naive"], pfn["churn"], title="Overfit Model on Test")
0.598560484633134
Note the naive test performance is worse, despite its far better training performance. This is over-fit due to the nested model bias of using the same data to build the treatment plan and model without any cross-frame mitigations.
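The effect can be reproduced on synthetic data: a naive encoding of a pure-noise high-cardinality variable looks predictive, while an out-of-fold (cross-frame style) encoding correctly shows no signal. A hedged sketch (simplified two-fold plan, regression-style outcome; not vtreat's implementation):

```python
import numpy
import pandas

rng = numpy.random.default_rng(2020)
n, n_levels = 2000, 200
df = pandas.DataFrame({
    "cat": rng.integers(0, n_levels, size=n),  # pure-noise categorical variable
    "y": rng.normal(size=n)})                  # outcome independent of cat

# naive encoding: per-level mean of y computed on the same rows it scores
naive = df.groupby("cat")["y"].transform("mean")

# cross-frame style: each row's code comes only from the other fold
fold = numpy.arange(n) % 2
oof = numpy.empty(n)
for f in (0, 1):
    means = df.loc[fold != f].groupby("cat")["y"].mean()
    oof[fold == f] = df.loc[fold == f, "cat"].map(means).fillna(0.0)

print(numpy.corrcoef(naive, df["y"])[0, 1])  # large: apparent (leaked) signal
print(numpy.corrcoef(oof, df["y"])[0, 1])    # near zero: no real signal
```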



