Final

This document provides instructions for Project II, which involves applying classification models to real spam data. Students are asked to: 1) Divide the data into training and test sets. 2) Apply four classification methods - logistic regression with lasso, single decision tree, random forests, and boosting - to the training set. Evaluate the performance of each model on the test set by calculating misclassification error, c-statistic, and plotting the ROC curve. Students must include R code in an appendix and interpret the results of the analysis. The overall goal is to gain experience applying different predictive modeling tools to a real-world classification problem.

Uploaded by

Aabizer Plumber
Copyright
© Attribution Non-Commercial (BY-NC)

NST 772 Data Mining and Statistical Learning I

Project II
(Due on 12/12, Monday)

Instructions: While discussions with classmates are allowed and encouraged, please try to work on the project independently and direct your questions to me. Use copy-and-paste to edit the R output. Only include necessary R results in your final report. Do not forget to include your R code in the appendix. Interpretation of the analysis results is also required.

Data

In this project, you are asked to apply several classification tools to a real problem. Let's consider the spam data available at [Link]. This data set has been used in the textbook for illustration. The description is given at [Link]. More information on this data set is available at the UCI spambase directory: [Link]

However, you may use your own data set. In that case, you need to describe the setting and the data clearly and make sure that the data set you choose is appropriate for a classification problem. For example, try to avoid dependent data such as time series or repeated measures collected from clustered or longitudinal studies. Check with me if you are unsure about its appropriateness. The following websites contain rich data from a variety of application fields.

Statlib: [Link]
UCI Machine Learning Repository: [Link]
KD Nuggets: [Link]

You may also use other data sets from the textbook web site (HTF, 2009): [Link] [Link]/~tibs/ElemStatLearn/.

Analysis
Follow the steps below to conduct the analysis:

1. Divide your data into two sets, the learning set and the test set, using the same indicator as HTF (2009): [Link]. You might want to use v-fold cross-validation if the data set you select is moderately sized.
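The split in step 1 can be sketched as follows. This is only an illustration in Python of the bookkeeping involved; the report itself should use R, as the instructions require. The function names are hypothetical, and the convention that an indicator value of 1 marks a test row (and 0 a learning row) is an assumption that should be checked against the indicator file you actually download.

```python
import random


def split_by_indicator(rows, indicator):
    """Split rows into (learning, test) using a 0/1 indicator.

    Assumed convention: 0 = learning set, 1 = test set.
    """
    if len(rows) != len(indicator):
        raise ValueError("rows and indicator must have the same length")
    learning = [r for r, flag in zip(rows, indicator) if flag == 0]
    test = [r for r, flag in zip(rows, indicator) if flag == 1]
    return learning, test


def v_fold_ids(n, v, seed=0):
    """Assign each of n observations to one of v folds of near-equal size.

    Fold labels 0..v-1 are dealt out evenly, then shuffled with a fixed
    seed so the assignment is random but reproducible.
    """
    ids = [i % v for i in range(n)]
    random.Random(seed).shuffle(ids)
    return ids


if __name__ == "__main__":
    learning, test = split_by_indicator(["a", "b", "c", "d"], [0, 1, 0, 1])
    print(len(learning), "learning rows,", len(test), "test rows")
    print(v_fold_ids(10, 5))
```

With v folds, each fold in turn serves as the held-out set while the remaining v - 1 folds are used for fitting, so every observation is predicted exactly once by a model that did not see it.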

2. Try out the following predictive modeling tools. For each method, use the training set to identify the best model and apply that model to the test set. Then compute the misclassification error rate with cutoff point 0.50 and the c statistic, and plot the ROC curve, all based on test set performance. It would be best, but is not required, to plot the ROC curves on one figure and compare them. The methods are: logistic regression using the lasso; a single decision tree; random forests; boosting. Since each method involves numerous parameters to tune, make sure that important details of the model fitting are clearly explained in your report.
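The three test-set summaries requested in step 2 can be computed from nothing more than the true 0/1 labels and each model's predicted probabilities. The following is a minimal sketch in Python, for illustration only (the report itself should use R); the function names are hypothetical, labels are assumed to be coded 1 for spam and 0 otherwise, and both classes are assumed to be present in the test set.

```python
def misclassification_rate(y_true, prob, cutoff=0.5):
    """Fraction of cases whose predicted class (prob >= cutoff) is wrong."""
    wrong = sum(int(p >= cutoff) != y for y, p in zip(y_true, prob))
    return wrong / len(y_true)


def roc_points(y_true, prob):
    """(false positive rate, true positive rate) pairs for the ROC curve.

    Cases are ranked from highest to lowest predicted probability and the
    threshold is lowered past one distinct score at a time.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    ranked = sorted(zip(prob, y_true), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (p, y) in enumerate(ranked):
        tp += y
        fp += 1 - y
        # Emit a point only after the last case in a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != p:
            points.append((fp / n_neg, tp / n_pos))
    return points


def c_statistic(y_true, prob):
    """Area under the ROC curve, by the trapezoidal rule."""
    pts = roc_points(y_true, prob)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


if __name__ == "__main__":
    y = [1, 0, 1, 0, 1, 0]                 # made-up labels
    p = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]     # made-up predicted probabilities
    print(misclassification_rate(y, p))
    print(c_statistic(y, p))
```

The c statistic equals the proportion of (spam, non-spam) pairs in which the spam case receives the higher predicted probability, with ties counted as one half, so it does not depend on the 0.50 cutoff used for the error rate. Plotting the points returned by `roc_points` for all four methods on the same axes gives the combined ROC figure.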
