Data Mining and Model Selection
Model Selection
Bob Stine
Dept of Statistics, Wharton School
University of Pennsylvania
Wharton
Department of Statistics
Questions
Over-fitting
Subset selection
Regularization (aka, shrinkage)
Averaging
Cross-validation
Model Validation
Narrow interpretation
mean & variance
Model Validation
[Figure: training and test cases; 2 RMSE]
Over-Fitting
Claimed error rate steadily falls with over-fitting
Model claims to predict new cases better than it will.
Challenge
Multiplicity
Tests    P(max |z|>1.96)
1        0.05
5        0.23
25       0.72
100      0.99
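A minimal sketch of where these probabilities come from, assuming the z statistics are independent and each test is two-sided at level 0.05, so a single |z| exceeds 1.96 with probability 0.05. The function name and the test counts 1, 5, 25, and 100 are illustrative assumptions, chosen because they reproduce the values listed above.

```python
def prob_any_exceeds(m, alpha=0.05):
    """P(at least one of m independent tests rejects), each test at level alpha."""
    return 1.0 - (1.0 - alpha) ** m

# Assumed test counts; independence of the z statistics is also an assumption.
for m in (1, 5, 25, 100):
    print(m, round(prob_any_exceeds(m), 2))
```

With correlated predictors the exact probabilities differ, but the chance of at least one spurious "significant" result still grows quickly with the number of tests.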
Model Selection
Approaches
Subset selection
Shrinkage
Model averaging
Next week
Subset Solution
Bonferroni procedure
Tests     Bonferroni z
5         2.6
25        3.1
100       3.5
100000    5.0
Cost of data-driven hypothesis testing
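The Bonferroni thresholds can be sketched as follows, assuming a two-sided test with an overall level of 0.05 split evenly across the tests. The function name and the default level are assumptions made for illustration; the calculation uses only the normal quantile function from Python's standard library.

```python
from statistics import NormalDist

def bonferroni_z(p, alpha=0.05):
    """Two-sided critical value when the level alpha is split evenly over p tests."""
    return NormalDist().inv_cdf(1.0 - alpha / (2.0 * p))

# Assumed test counts; 5 is inferred from the threshold 2.6.
for p in (5, 25, 100, 100000):
    print(p, round(bonferroni_z(p), 1))
```

The threshold grows slowly, roughly like the square root of 2 log p, which is why even 100000 candidate tests only raise the bar to about 5.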
Discussion
Flexible
Process matters
BIC: z^2 > log n
Aims to identify the true model
RIC: z^2 > 2 log p (Bonferroni)
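To make the two rules concrete, here is a small sketch that converts each into a threshold on |z|. It assumes natural logarithms, that n is the sample size, and that p is the number of candidate predictors; the function names are hypothetical.

```python
import math

def bic_z_threshold(n):
    """|z| needed to add a predictor under BIC: z^2 > log n."""
    return math.sqrt(math.log(n))

def ric_z_threshold(p):
    """|z| needed to add a predictor under RIC: z^2 > 2 log p."""
    return math.sqrt(2.0 * math.log(p))

# Hypothetical sizes, for illustration only.
print(round(bic_z_threshold(1000), 2))
print(round(ric_z_threshold(100), 2))
```

The BIC threshold depends only on the sample size, while the RIC threshold depends only on how many predictors were searched.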
Penalized Likelihood
Penalized methods
JMP output
Example
Osteo example
Results
Add variables so long as BIC decreases
Fit extra then reverts back to best
AIC vs BIC
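A rough sketch of the "fit extra, then revert to best" idea, and of how AIC and BIC can disagree. It assumes the usual Gaussian regression forms, AIC = n log(RSS/n) + 2k and BIC = n log(RSS/n) + k log n (up to constants). The residual sums of squares are made-up numbers, not the osteo results, and the function names are hypothetical.

```python
import math

def aic(n, rss, k):
    return n * math.log(rss / n) + 2 * k

def bic(n, rss, k):
    return n * math.log(rss / n) + k * math.log(n)

def best_size(n, rss_path, criterion):
    """Number of predictors k that minimizes the criterion along the path."""
    scores = [criterion(n, rss, k) for k, rss in enumerate(rss_path)]
    return min(range(len(scores)), key=scores.__getitem__)

# Hypothetical RSS for nested models with 0, 1, ..., 5 predictors.
n = 100
rss_path = [500.0, 300.0, 250.0, 240.0, 232.0, 229.0]
print("BIC picks", best_size(n, rss_path, bic))
print("AIC picks", best_size(n, rss_path, aic))
```

On these made-up numbers BIC stops at 2 predictors while AIC continues to 4, because the BIC penalty per predictor (log n) is larger than the AIC penalty (2) once n exceeds about 7.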
Shrinkage Solution
Saturated model
L2
Cross-Validation Solution
Common sense alternative to criteria
No free lunches
Trade-off
Highly variable
Results depend on which group was excluded for testing
Multi-fold cross-validation has become common
Optimistic
Only place I know of where the test cases are a random sample from the same population
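Multi-fold cross-validation can be sketched in a few lines. This is a minimal illustration using a one-predictor least-squares line as the model; the function names, the default of five folds, and the seed are all assumptions rather than anything prescribed by the slides.

```python
import random

def k_fold_splits(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k roughly equal random folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope

def cv_rmse(x, y, k=5, seed=0):
    """Root mean squared prediction error over all held-out cases."""
    sq_err = 0.0
    for train, test in k_fold_splits(len(x), k, seed):
        b0, b1 = fit_line([x[i] for i in train], [y[i] for i in train])
        sq_err += sum((y[i] - (b0 + b1 * x[i])) ** 2 for i in test)
    return (sq_err / len(x)) ** 0.5
```

Each case is held out exactly once, so every prediction error comes from a model that did not see that case. Changing the seed changes the folds, which is one source of the variability noted above.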
[Figure: folds 1 to 5]
Variability of CV
Example
Method of validation
Is assessment correct?
Osteo Example
Training
SD of pred errors
SD of residuals
CV in Data Mining
Caution
Lasso
shrinkage
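As an illustration of how lasso shrinkage differs from L2 (ridge) shrinkage, the sketch below uses the special case of an orthonormal design, where both have closed forms. It assumes the lasso objective (1/2)(b - beta)^2 + lambda |beta| and the ridge objective (1/2)(b - beta)^2 + (lambda/2) beta^2 for each least-squares coefficient b. The coefficient values and lambda are made up, and the function names are hypothetical.

```python
def soft_threshold(b, lam):
    """Lasso (L1) estimate for one coefficient under an orthonormal design."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0  # small coefficients are set exactly to zero

def ridge_shrink(b, lam):
    """Ridge (L2) estimate for one coefficient under an orthonormal design."""
    return b / (1.0 + lam)  # shrinks proportionally, never exactly to zero

# Hypothetical least-squares coefficients and penalty.
ols = [3.0, -1.5, 0.4, -0.2]
lam = 0.5
lasso = [soft_threshold(b, lam) for b in ols]
ridge = [ridge_shrink(b, lam) for b in ols]
print(lasso)
print(ridge)
```

The lasso estimates here are 2.5, -1.0, 0.0, 0.0: the two small coefficients drop out entirely, which is why the lasso performs selection as well as shrinkage. With correlated predictors there is no closed form and an iterative algorithm is needed.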
Lasso Example
Implementations
Lasso Example
Fit L1 regression, Lasso
osteo model
Where to stop adding features?
Lasso Example in R
Similar output
Repeated 10-fold CV
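A hedged sketch of what repeating 10-fold CV buys: rerunning the procedure with different random fold assignments shows how much a single CV estimate moves around. To stay short, the model here is just the training mean, a stand-in and not the lasso fit from the slides. The data are simulated and the function names are hypothetical.

```python
import random
import statistics

def cv_rmse_of_mean(y, k=10, seed=0):
    """One pass of k-fold CV where the 'model' predicts the training mean."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    sq_err = 0.0
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        mean = sum(y[j] for j in train) / len(train)
        sq_err += sum((y[j] - mean) ** 2 for j in test)
    return (sq_err / len(y)) ** 0.5

def repeated_cv(y, k=10, repeats=20):
    """Average and standard deviation of the CV estimate over random re-splits."""
    estimates = [cv_rmse_of_mean(y, k, seed=r) for r in range(repeats)]
    return statistics.mean(estimates), statistics.stdev(estimates)

# Simulated responses, for illustration only.
rng = random.Random(1)
y = [rng.gauss(0, 1) for _ in range(50)]
avg, spread = repeated_cv(y)
print(avg, spread)
```

The spread across repeats only reflects the randomness of the fold assignment. It does not capture sampling variation in the data themselves.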
Discussion of CV
Population drift
Alternatives?
Bootstrap methods
Take-Aways
Overfitting
Cross validation
Next Time
Thursday: Newberry Lab
Friday: July 4th holiday