R Tutorial
Capital One Data Mining Cup
UW Statistics Club
Saturday, March 23, 2013
Who Will Benefit From This?
Aimed at students who:
Have the statistical background but lack the (R) modelling expertise
Have never taken a linear regression course (or simply forgot the one they did!)
What We Will Be Doing Today
Walkthrough example of a statistical prediction problem using Kaggle test data (Titanic problem)
The goal is to predict who will survive given different factors such as:
Age
Ticket Fare
Sex
Cabin
Number of family aboard
R Basics
Opening R (RStudio)
Navigating to the working directory
Running commands
Installing packages
Loading packages
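The steps above might look roughly like this at the R console. This is a minimal sketch: the path in the commented-out setwd() call is a placeholder, and the package names are the two used later for exploratory analysis.

```r
# Show the current working directory
print(getwd())

# To change it, point R at the folder that holds your data files.
# "path/to/project" is a placeholder, not a real path.
# setwd("path/to/project")

# Packages are installed once per machine and loaded once per session.
pkgs <- c("ggplot2", "plyr")
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!installed]

if (length(not_installed) > 0) {
  # install.packages() needs an internet connection, so it is not run here
  message("Not installed yet: ", paste(not_installed, collapse = ", "),
          ". Use install.packages() to add them.")
} else {
  library(ggplot2)
  library(plyr)
}
```

In RStudio, a line in the script editor can be sent to the console with Ctrl+Enter (Cmd+Enter on a Mac).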
Basic Guideline to Data Analysis
1. Define the question
2. Define the ideal data set
3. Determine what data you can access
4. Obtain the data
5. Clean the data
6. Exploratory data analysis
7. Statistical prediction/modelling
8. Interpret results
9. Challenge results
10. Synthesize/write up results
11. Create reproducible code
Cleaning the Data (skipped)
Fix variable names
Merge data sets
Fix missing content
Fix inconsistent data
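Although cleaning is skipped in the session, the four tasks above can be sketched in base R. The two small data frames below are made up for illustration, and median imputation is only one simple way to handle missing values.

```r
# Made-up records for illustration only (not the Kaggle data)
passengers <- data.frame(
  PassengerId = 1:5,
  Sex = c("male", "Female", "F", "male", "female"),
  Age = c(22, NA, 35, NA, 41),
  stringsAsFactors = FALSE
)
tickets <- data.frame(
  PassengerId = 1:5,
  Ticket.Fare = c(7.25, 71.28, 8.05, 53.10, 13.00)
)

# 1. Fix variable names: lower case, underscores instead of dots
names(passengers) <- tolower(names(passengers))
names(tickets) <- gsub(".", "_", tolower(names(tickets)), fixed = TRUE)

# 2. Merge data sets on the shared key
titanic <- merge(passengers, tickets, by = "passengerid")

# 3. Fix missing content: replace missing ages with the median age
titanic$age[is.na(titanic$age)] <- median(titanic$age, na.rm = TRUE)

# 4. Fix inconsistent data: collapse the different spellings of sex
first_letter <- tolower(substr(titanic$sex, 1, 1))
titanic$sex <- ifelse(first_letter == "f", "female", "male")

print(titanic)
```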
Exploratory Data Analysis
Make use of
Aggregation
Tables
Charts
We use two different R packages here: ggplot2, plyr
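A minimal sketch of an aggregation, a table and a chart with these two packages. The data frame is made up for illustration, and the column names Survived and Sex are assumptions about how the training data is laid out.

```r
library(plyr)
library(ggplot2)

# Made-up data for illustration only
train <- data.frame(
  Survived = c(0, 1, 1, 0, 0, 0, 1, 1),
  Sex = c("male", "female", "female", "female",
          "male", "male", "female", "male")
)

# Aggregation: count and survival rate within each sex (plyr)
surv_by_sex <- ddply(train, .(Sex), summarise,
                     n = length(Survived),
                     survival_rate = mean(Survived))
print(surv_by_sex)

# Table: cross-tabulate sex against survival (base R)
print(table(train$Sex, train$Survived))

# Chart: share of survivors within each sex (ggplot2)
p <- ggplot(train, aes(x = Sex, fill = factor(Survived))) +
  geom_bar(position = "fill") +
  labs(y = "Proportion", fill = "Survived")
print(p)
```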
Testing Your Model
Before we build our model we need to have a methodology on how we will test it.
A naive analyst would use the entire data set to build the model and then test it on the same data set.
This causes overfitting!
Instead: partition the training data set into a real training set and a validation set.
To create the validation set use:
Random sub-sampling
K-fold
Leave-one-out
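The three schemes differ only in how row indices are split. A sketch in base R, assuming a hypothetical data set of n = 100 rows, a 30% hold-out and k = 5 folds (all three values are arbitrary choices for illustration):

```r
set.seed(42)   # make the random splits reproducible
n <- 100       # hypothetical number of rows in the training data

# Random sub-sampling: hold out a random 30% as the validation set
val_idx <- sample(n, size = round(0.3 * n))
train_idx <- setdiff(seq_len(n), val_idx)

# K-fold: give every row a fold label; each fold is the validation set once
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))
for (fold in seq_len(k)) {
  val_rows <- which(fold_id == fold)
  train_rows <- which(fold_id != fold)
  # fit the model on train_rows and score it on val_rows here
}

# Leave-one-out: K-fold with k = n, so every row is its own fold
loo_fold_id <- seq_len(n)
```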
What measurement do we use to compare?
Adjusted R², AIC, BIC
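All three measures can be read off a fitted lm object. A sketch on simulated data (the variables x1, x2 and y are made up): a higher adjusted R² is better, while lower AIC and BIC are better.

```r
set.seed(1)
# Simulated data for illustration: y depends on x1 only
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy$y <- 1 + 2 * toy$x1 + rnorm(50)

fit_small <- lm(y ~ x1, data = toy)
fit_large <- lm(y ~ x1 + x2, data = toy)

# Collect the three measures side by side for both candidate models
comparison <- data.frame(
  model = c("y ~ x1", "y ~ x1 + x2"),
  adj_r2 = c(summary(fit_small)$adj.r.squared,
             summary(fit_large)$adj.r.squared),
  AIC = c(AIC(fit_small), AIC(fit_large)),
  BIC = c(BIC(fit_small), BIC(fit_large))
)
print(comparison)
```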
Building Our First Model - Simple Linear Regression
Why is this a good starting point?
Easy to implement in R
Works out of the box (i.e. no tuning parameters)
Easy to interpret/explain
Disadvantage: performs poorly in non-linear settings
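A sketch of fitting such a model with lm(). The data is simulated, and the column names Survived, Sex, Age and Fare are assumptions about the training file. Because the response is 0/1, the fitted values are treated as rough survival scores and cut at 0.5, which is an arbitrary threshold chosen for illustration.

```r
set.seed(2013)
n <- 200
# Simulated passengers for illustration only (not the Kaggle data)
passengers <- data.frame(
  Sex = sample(c("female", "male"), n, replace = TRUE),
  Age = round(runif(n, 1, 70)),
  Fare = round(rexp(n, rate = 1 / 30), 2)
)
passengers$Survived <- rbinom(n, size = 1,
                              prob = ifelse(passengers$Sex == "female", 0.7, 0.2))

# Keep the last 50 rows aside as a validation set
train <- passengers[1:150, ]
valid <- passengers[151:200, ]

# Fit the linear model on the training rows only
fit <- lm(Survived ~ Sex + Age + Fare, data = train)
print(summary(fit))

# Score the validation rows and turn scores into 0/1 predictions
pred <- as.integer(predict(fit, newdata = valid) > 0.5)
accuracy <- mean(pred == valid$Survived)
print(accuracy)
```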
After we have run our first model we want to:
Examine Residuals plot
Examine Q-Q plot
Use the Model Testing process to pick a proper model
Using the step function in R
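These three checks can be sketched in base R as follows. The data is simulated (x3 is deliberately unrelated to y). Note that step() compares models by AIC by default, and direction = "both" is one choice among several.

```r
set.seed(7)
# Simulated data for illustration: x3 has no real effect on y
toy <- data.frame(x1 = rnorm(80), x2 = rnorm(80), x3 = rnorm(80))
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(80)

full_fit <- lm(y ~ x1 + x2 + x3, data = toy)

# Residuals plot: look for curvature or a funnel shape around zero
plot(fitted(full_fit), resid(full_fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Q-Q plot: points close to the line suggest roughly normal residuals
qqnorm(resid(full_fit))
qqline(resid(full_fit))

# Stepwise selection: add or drop terms while AIC keeps improving
chosen <- step(full_fit, direction = "both", trace = 0)
print(formula(chosen))
```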
Understanding Interaction (optional)
Checking for Multicollinearity (optional)
Multiple predictor variables are highly correlated
Can be caused by:
Creating a new predictor variable from existing ones
Having multiple predictors that explain the same thing
Consequence: standard errors of the coefficient estimates blow up
Use R to compute the correlation between all predictors.
If there exist sets of predictors with correlation above 0.90-0.95, then either:
Remove all but one
Combine into a new composite variable
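One way to run this check is to compute the correlation matrix and list the pairs above the cutoff. The predictors here are simulated (x2 is built as a near copy of x1), and the 0.90 threshold is the lower end of the range mentioned above.

```r
set.seed(3)
n <- 100
x1 <- rnorm(n)
# Simulated predictors for illustration: x2 nearly duplicates x1
predictors <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.1),
  x3 = rnorm(n)
)

# Correlation between all pairs of predictors
cor_mat <- cor(predictors)
print(round(cor_mat, 2))

# Flag each pair once (upper triangle only) when |correlation| > threshold
threshold <- 0.90
high <- which(abs(cor_mat) > threshold & upper.tri(cor_mat), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  correlation = cor_mat[high],
  stringsAsFactors = FALSE
)
print(flagged)

# Remedy 1: remove all but one of each flagged pair
reduced <- predictors[, setdiff(names(predictors), flagged$var2), drop = FALSE]

# Remedy 2: combine the flagged pair into one composite variable
composite <- rowMeans(predictors[, c("x1", "x2")])
```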
What Next?
Taking our Simple Linear Regression to the next level
Higher order terms
Interaction terms
Data Transformations
Check for multicollinearity
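Each of the first three extensions is a small change to the model formula. A sketch on simulated data (the variables age, fare, sex and y are made up), comparing the variants by AIC:

```r
set.seed(11)
n <- 150
# Simulated data for illustration only
toy <- data.frame(
  age = runif(n, 1, 70),
  fare = rexp(n, rate = 1 / 30) + 1,
  sex = sample(c("female", "male"), n, replace = TRUE)
)
toy$y <- 0.5 + 0.01 * toy$age - 0.0002 * toy$age^2 +
  0.05 * log(toy$fare) + rnorm(n, sd = 0.2)

base_fit <- lm(y ~ age + fare + sex, data = toy)

# Higher order term: I() protects the ^ so it means "squared"
quad_fit <- lm(y ~ age + I(age^2) + fare + sex, data = toy)

# Interaction term: age * sex expands to age + sex + age:sex
inter_fit <- lm(y ~ age * sex + fare, data = toy)

# Data transformation: use log(fare) in place of fare
log_fit <- lm(y ~ age + log(fare) + sex, data = toy)

# Lower AIC is better
aic_values <- sapply(list(base = base_fit, quadratic = quad_fit,
                          interaction = inter_fit, log_fare = log_fit), AIC)
print(aic_values)
```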
Different Types of Models (not covered here but check the R Code!)
Generalized Linear Models
Trees
Random Forest
Ensemble Methods
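As a taste of the first item, here is a sketch of a logistic regression, a generalized linear model suited to a 0/1 outcome such as survival. The data is simulated and the column names are assumptions. This is not the session's own R code, and the other model types are not shown.

```r
set.seed(23)
n <- 200
# Simulated passengers for illustration only
train <- data.frame(
  Sex = sample(c("female", "male"), n, replace = TRUE),
  Age = round(runif(n, 1, 70))
)
train$Survived <- rbinom(n, size = 1,
                         prob = ifelse(train$Sex == "female", 0.7, 0.2))

# family = binomial gives logistic regression
logit_fit <- glm(Survived ~ Sex + Age, data = train, family = binomial)
print(summary(logit_fit))

# type = "response" returns probabilities strictly between 0 and 1
prob <- predict(logit_fit, type = "response")
print(head(prob))
```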