NST 772 Data Mining and Statistical Learning I
Project II
(Due on 12/12, Monday) Instructions: While discussions with classmates are allowed and encouraged, please try to work on the project independently and direct your questions to me. Use copy-and-paste to edit the R output, and include only the necessary R results in your final report. Do not forget to include your R code in the appendix. Interpretation of the analysis results is also required.
Data
In this project, you are asked to apply several classification tools to a real problem. Let's consider the spam data available at [Link]. This data set has been used in the textbook for illustration. The description is given at [Link], and more information on this data set is available in the UCI spambase directory: [Link]

However, you may use your own dataset. In that case, you need to describe the setting and data clearly and make sure that the data set you choose is appropriate for classification problems. For example, try to avoid dependent data such as time series or repeated measures collected from clustered or longitudinal studies. Check with me if you are unsure about its appropriateness. The following websites contain rich data from a variety of application fields:

Statlib: [Link]
UCI Machine Learning Repository: [Link]
KD Nuggets: [Link]

You may also use other datasets from the textbook web site (HTF, 2009): [Link] [Link]/~tibs/ElemStatLearn/.
Analysis
1. Divide your data set into two parts, a learning set and a test set, using the same train/test indicator as HTF (2009): [Link] You might want to use v-fold cross-validation if the data set you select is only moderately sized.
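As a rough illustration of the split (not a substitute for the HTF train/test indicator, which you should download from the link above), the step can be sketched in R as follows. The data frame name `spam`, its columns, and the 2/3 training fraction are all assumptions for the sketch; a tiny synthetic placeholder is built so the code runs on its own.

```r
# `spam` stands in for your data; this synthetic placeholder should be
# replaced by the real spambase data frame (response column `type`).
spam <- data.frame(x1 = rnorm(30), x2 = rnorm(30),
                   type = rbinom(30, 1, 0.4))

set.seed(772)                                      # reproducibility
n <- nrow(spam)
train_id <- sample(seq_len(n), round(2 * n / 3))   # roughly 2/3 for training
train <- spam[train_id, ]                          # learning set
test  <- spam[-train_id, ]                         # test set
stopifnot(nrow(train) + nrow(test) == n)           # sanity check: no overlap loss
```

If you use the HTF indicator instead, subset the rows with that 0/1 vector rather than with `sample()`.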
Follow the steps below to conduct the analysis:
2. Try out the following predictive modeling tools:

- Logistic regression using the lasso;
- A single decision tree;
- Random forests;
- Boosting.

For each method, use the training set to identify the best model and then apply that model to the test set. Compute the misclassification error rate with cutoff point 0.50 and the c statistic, and plot the ROC curve, all based on test-set performance. It would be best, though not required, to plot the ROC curves on one figure for comparison. Since each method involves numerous tuning parameters, make sure that the important details of the model fitting are clearly explained in your report.
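The workflow above can be sketched in R as follows. The package choices (glmnet, rpart, randomForest, gbm, pROC) are one reasonable option, not a requirement of the assignment, and the tuning values shown are illustrative defaults, not recommendations. The `train`/`test` data frames with a 0/1 response `type` are assumptions; a synthetic placeholder is built here so the sketch runs on its own, and you would substitute your actual split.

```r
library(glmnet); library(rpart); library(randomForest)
library(gbm); library(pROC)

# Synthetic stand-in for the real train/test split -- replace with
# your own `train` and `test` data frames (0/1 response `type`).
set.seed(772)
make_df <- function(n) {
  x <- matrix(rnorm(n * 5), n, 5, dimnames = list(NULL, paste0("x", 1:5)))
  data.frame(x, type = rbinom(n, 1, plogis(x[, 1] + x[, 2])))
}
train <- make_df(300); test <- make_df(150)

x_tr <- as.matrix(train[, names(train) != "type"])
x_te <- as.matrix(test[,  names(test)  != "type"])

## Lasso logistic regression: lambda chosen by 10-fold CV
cv_las <- cv.glmnet(x_tr, train$type, family = "binomial")
p_las  <- predict(cv_las, x_te, s = "lambda.min", type = "response")[, 1]

## Single decision tree (prune via the cp table if desired)
tree   <- rpart(factor(type) ~ ., data = train, method = "class")
p_tree <- predict(tree, test, type = "prob")[, "1"]

## Random forest (factor response => classification mode)
rf   <- randomForest(factor(type) ~ ., data = train)
p_rf <- predict(rf, test, type = "prob")[, "1"]

## Boosting: number of trees chosen by internal 5-fold CV
bst   <- gbm(type ~ ., data = train, distribution = "bernoulli",
             n.trees = 2000, shrinkage = 0.01,
             interaction.depth = 2, cv.folds = 5)
p_bst <- predict(bst, test, type = "response",
                 n.trees = gbm.perf(bst, method = "cv", plot.it = FALSE))

## Error rate at cutoff 0.50 and c statistic (AUC) on the test set
probs <- list(lasso = p_las, tree = p_tree, rf = p_rf, boosting = p_bst)
for (nm in names(probs)) {
  p <- probs[[nm]]
  cat(sprintf("%-8s error = %.3f  c = %.3f\n", nm,
              mean((p > 0.5) != test$type),
              auc(roc(test$type, p, quiet = TRUE))))
}

## All four ROC curves on one figure
cols <- c("black", "red", "blue", "darkgreen")
plot(roc(test$type, probs[[1]], quiet = TRUE), col = cols[1])
for (i in 2:4) plot(roc(test$type, probs[[i]], quiet = TRUE),
                    col = cols[i], add = TRUE)
legend("bottomright", names(probs), col = cols, lty = 1)
```

Note that the c statistic equals the area under the ROC curve, so `auc()` reports both quantities at once; the 0.50 cutoff enters only through the misclassification rate.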