Exam of Statistical Learning

Alessia Pini

2023-03-14

Theory
Answer the following questions by choosing the correct alternative. Note that only one alternative is true.
Motivate ALL your answers.

Question 1 (1.5 point)

We compute PCA on a sample of size n = 47 and p = 2 covariates. The sample covariance matrix is the
following:

## [,1] [,2]
## [1,] 1.5 0.7
## [2,] 0.7 1.6

a. the variance of the second principal component is 0.7195

b. the loadings of the second principal component are (0.7319, 0.6815)
c. none of the above
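
The alternatives above can be checked numerically. The following is a minimal sketch in base R, assuming that the principal components are the eigenvectors of the sample covariance matrix and that their variances are the corresponding eigenvalues; the object names (`S`, `pca`, `pc_variances`, `pc_loadings`) are illustrative and not part of the exam.

```r
# Sample covariance matrix from Question 1
S <- matrix(c(1.5, 0.7,
              0.7, 1.6), nrow = 2, byrow = TRUE)

# For a symmetric matrix, eigen() returns the eigenvalues in decreasing order
pca <- eigen(S, symmetric = TRUE)

pc_variances <- pca$values   # variances of the principal components
pc_loadings  <- pca$vectors  # column j holds the loadings of component j

print(pc_variances)
print(pc_loadings)
```

The sign of each loading vector is arbitrary, so a column of `pc_loadings` and its negative describe the same component.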

Question 2 (1.5 point)

We want to cluster the data displayed in Figure 1 (n = 450, p = 2). We assume that the data belong to two
clusters, and that within each cluster data are normally distributed.

a. k-means is expected to work well in this situation, since the two clusters seem to have a covariance
matrix that is proportional to the identity
b. we should not use k-means, since the two clusters seem to have a covariance matrix that is not
proportional to the identity
c. none of the above

Question 3 (1.5 point)

Based on a training set with sample size n = 20, we fit with least squares the polynomial regression with
response variable Y and covariate X:

$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \dots + \beta_{30} X^{30} + \varepsilon.$$

a. the training set residual sum of squares can be used for judging the validity of the model
b. the least squares problem has no solution
c. none of the above

Figure 1: Plot of the data (horizontal axis: data[,1]; vertical axis: data[,2])

Question 4 (1.5 point)

We perform k-means clustering on a sample of size n = 77 with p = 41 covariates.

a. at each iteration, cluster centers are computed as the sample means of the number of points within a
cluster
b. at each iteration, each point is assigned to the cluster with the smallest number of points
c. none of the above
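
For reference, one iteration of the standard k-means (Lloyd) algorithm can be sketched in base R as follows. The data are simulated, and the number of clusters `K = 3` is an assumption, since the question does not fix it.

```r
set.seed(1)                       # arbitrary seed, only for this toy example
n <- 77; p <- 41; K <- 3          # K is an assumption; the question does not fix it
X <- matrix(rnorm(n * p), n, p)   # simulated data standing in for the real sample

# Start from K randomly chosen observations as centers
centers <- X[sample(n, K), , drop = FALSE]

# Assignment step: squared Euclidean distance of every point to every center
d2 <- sapply(1:K, function(k) {
  rowSums((X - matrix(centers[k, ], n, p, byrow = TRUE))^2)
})
labels <- max.col(-d2, ties.method = "first")  # index of the closest center

# Update step: each center becomes the mean vector of the points assigned to it
for (k in 1:K) {
  if (any(labels == k)) {
    centers[k, ] <- colMeans(X[labels == k, , drop = FALSE])
  }
}
```

In practice the two steps are repeated until the labels stop changing, which is what `kmeans()` from the `stats` package does internally.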

Question 5 (1.5 point)

k-nearest neighbors:

a. can only be applied with Euclidean distance

b. can only be applied to classification problems with 2 classes
c. none of the above

Question 6 (1.5 point)

Consider the plot in Figure 2, reporting a sample of size 942 with two covariates and the value of a two-level
categorical outcome (color). Which method would you suggest to classify the data?

a. maximum margin classifier

b. logistic regression
c. none of the above

Figure 2: Data

Exercises
You are asked to model the dependence between the price of a flight from an Italian airport and the following
set of covariates describing the characteristics of the flight and of the purchase:

• destination: categorical, indicating the destination (levels: 0=Italy; 1=Europe (outside Italy); 2=Asia;
3=North America; 4=South America; 5=Oceania)
• type: categorical, indicating the type of ticket (levels: 1=one way; 2=round trip)
• airline: categorical, indicating the airline type (levels: 1=Low cost; 2=Mainline)
• promo: categorical, indicating whether there were special promotions on the date of purchase
(levels: 0=no promotions; 1=promotions)
• class: categorical, indicating the class of the ticket (levels: 0=economy; 1=business)
• distance: numerical, length or distance of the flight (in km)
• duration: numerical, time duration of the flight (in hours)
• flight_year: numerical, year of the flight (2014 to 2023)
• flight_day: numerical, day of the year of the flight (1 to 365)
• purchase_to_flight: numerical, elapsed time between the purchase and flight (in days)

In all exercises involving supervised methods, you should use a training set containing 70% of the
observations, whose row indexes are selected with the following code and a random seed equal to 314.

n = dim([Link])[1]
set.seed(seed)
[Link] = sample(1:n,n*7/10)
train = [Link][[Link],]
test = [Link][-[Link],]
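
A runnable version of this split might look like the sketch below. The data frame name `flights`, its simulated columns, and the index name `train_idx` are hypothetical stand-ins, since the object names used in the exam are not recoverable from the code above.

```r
# Hypothetical stand-in for the exam data set (names and values are made up)
flights <- data.frame(price = rnorm(100),
                      distance = runif(100, 200, 9000))

seed <- 314
n <- dim(flights)[1]
set.seed(seed)
train_idx <- sample(1:n, n * 7 / 10)  # 70% of the row indexes, without replacement
train <- flights[train_idx, ]
test  <- flights[-train_idx, ]
```

Negative indexing with `-train_idx` keeps exactly the rows that were not sampled, so the two sets are disjoint and together cover the whole data set.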

Answer all the following questions in the markdown file. Make sure to motivate (with code or text) all
your answers.

Exercise 1 (points: 1+2+2)

Run a hierarchical clustering on the numerical covariates based on the Euclidean distance and Ward linkage.

a. cut the tree at 2 clusters. How many units are contained in each cluster?
b. cut the tree at 2 clusters and display the scatterplot of all covariates colored according to cluster labels.
Give a comment on the clusters obtained with the suggested method.

c. Now cluster the data with the average linkage and 2 clusters. In the R script, give a comment on
which of the two proposed linkage methods you would suggest using. Would you
suggest another linkage method?
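
The workflow of this exercise can be sketched in base R as follows. The data are simulated, the covariates are left unstandardized, and `method = "ward.D2"` is an assumption about which Ward variant is intended (`hclust()` also offers `"ward.D"`).

```r
set.seed(314)
# Simulated numerical covariates; the column names mirror the exam's variables,
# the values are made up for illustration
num <- data.frame(distance = runif(60, 200, 9000),
                  duration = runif(60, 1, 12),
                  purchase_to_flight = runif(60, 0, 300))

d <- dist(num, method = "euclidean")

hc_ward <- hclust(d, method = "ward.D2")  # Ward's criterion on Euclidean distances
hc_avg  <- hclust(d, method = "average")

cl_ward <- cutree(hc_ward, k = 2)
cl_avg  <- cutree(hc_avg,  k = 2)

table(cl_ward)             # cluster sizes under Ward linkage
table(cl_avg)              # cluster sizes under average linkage
pairs(num, col = cl_ward)  # scatterplot matrix colored by cluster label
```

Comparing the two `table()` outputs is a quick way to see whether one linkage isolates a handful of units while the other produces more balanced groups.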

Exercise 2 (points: 1+2+2)

Answer the following questions.

a. Fit a Lasso regression setting a shrinkage parameter equal to λ = 2.2. Then, compute the vector of
estimated coefficients.
b. Now, compute the optimal value of the shrinkage parameter by cross validation. Use 10-fold cross
validation, a grid of 100 values of λ between $10^{-6}$ and $10^{6}$, and set a seed equal to 314. In the
markdown file, write a comment on the cross-validation results, suggesting a value for the
parameter λ.
c. Refit the model with the optimal value obtained at point b, and compute its MSE. In the markdown
file, write a comment on the coefficients obtained with such optimal value.
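
One possible way to carry out these steps is sketched below, assuming the `glmnet` package is the intended tool (the exam text does not name one). The data are simulated, and the MSE is computed on the same data used for fitting only to keep the sketch short.

```r
if (requireNamespace("glmnet", quietly = TRUE)) {
  set.seed(314)
  # Simulated data standing in for the real covariates and price
  x <- matrix(rnorm(200 * 5), 200, 5)
  y <- 3 * x[, 1] - 2 * x[, 2] + rnorm(200)

  # Lasso (alpha = 1) at a single fixed shrinkage value
  fit_fixed <- glmnet::glmnet(x, y, alpha = 1, lambda = 2.2)
  print(coef(fit_fixed))

  # 100 values of lambda between 10^-6 and 10^6, in decreasing order
  grid <- 10^seq(6, -6, length.out = 100)
  set.seed(314)
  cv_fit <- glmnet::cv.glmnet(x, y, alpha = 1, lambda = grid, nfolds = 10)
  best_lambda <- cv_fit$lambda.min

  # Refit at the selected lambda and compute the MSE on the data used here
  fit_best <- glmnet::glmnet(x, y, alpha = 1, lambda = best_lambda)
  pred <- predict(fit_best, newx = x)
  mse <- mean((y - pred)^2)
}
```

`cv_fit$lambda.1se` is a common alternative to `lambda.min` when a sparser model is preferred, and `plot(cv_fit)` shows the cross-validation curve over the grid.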

Exercise 3 (points: 1+2+2)

Using all covariates, fit a regression tree setting the following parameters:

• minimum number of observations to include in either child node: 4

• smallest allowed node size: 8
• minimum reduction of the MSE for considering a node split: $3 \times 10^{-4}$.

a. Compute the MSE on the training and test set.

b. Now the previous tree is pruned so that the final number of leaves is equal to 8. Compute the train and
test set MSE of the pruned tree. Which tree size would you suggest using?
c. Plot the resulting regression tree. Write a comment on the structure of the obtained tree.
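
The three parameters listed above resemble the arguments `mincut`, `minsize` and `mindev` of `tree.control()` in the `tree` package, so the sketch below assumes that package and that mapping; neither is confirmed by the exam text. The data are simulated, and categorical covariates are assumed to be stored as factors.

```r
if (requireNamespace("tree", quietly = TRUE)) {
  set.seed(314)
  # Simulated data standing in for the flight data set (values are made up)
  n <- 300
  dat <- data.frame(distance = runif(n, 200, 9000),
                    class = factor(sample(0:1, n, replace = TRUE)))
  dat$price <- 50 + 0.1 * dat$distance + 400 * (dat$class == "1") +
    rnorm(n, sd = 30)
  idx <- sample(1:n, n * 7 / 10)
  train <- dat[idx, ]
  test  <- dat[-idx, ]

  ctrl <- tree::tree.control(nobs = nrow(train),
                             mincut = 4, minsize = 8, mindev = 3e-4)
  fit <- tree::tree(price ~ ., data = train, control = ctrl)

  mse <- function(model, data) {
    mean((data$price - predict(model, newdata = data))^2)
  }
  mse_full <- c(train = mse(fit, train), test = mse(fit, test))

  pruned <- tree::prune.tree(fit, best = 8)  # subtree with about 8 leaves
  mse_pruned <- c(train = mse(pruned, train), test = mse(pruned, test))

  plot(pruned)
  text(pruned, pretty = 0)
}
```

Because the pruned tree is a subtree of the full one, its training MSE cannot be lower than that of the full tree; the test MSE is what distinguishes the two sizes.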

Exercise 4 (points: 1+2+2)

Run a boosting model on the data set, setting the following parameters:

• number of trees: B = 2000

• shrinkage: λ = 0.05
• interaction depth d = 6

Then, answer the following questions.

a. Compute the train and test set MSE of the boosting model.
b. What can you say about the most informative covariates? Write a comment in the
markdown file.
c. Now, recompute the MSE for d spanning from 1 to 6. Which value would you suggest for the
parameter d? What is the interpretation of such value?
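
A sketch of these steps is given below, assuming the `gbm` package with a Gaussian loss (the exam text does not name a package). The data are simulated, and the helper names `fit_boost` and `mse` are hypothetical.

```r
if (requireNamespace("gbm", quietly = TRUE)) {
  set.seed(314)
  # Simulated data standing in for the flight data set (values are made up)
  n <- 300
  dat <- data.frame(distance = runif(n, 200, 9000),
                    duration = runif(n, 1, 12))
  dat$price <- 50 + 0.1 * dat$distance + 20 * dat$duration + rnorm(n, sd = 30)
  idx <- sample(1:n, n * 7 / 10)
  train <- dat[idx, ]
  test  <- dat[-idx, ]

  # Boosted regression trees with B = 2000, shrinkage 0.05 and depth d
  fit_boost <- function(d) {
    gbm::gbm(price ~ ., data = train, distribution = "gaussian",
             n.trees = 2000, shrinkage = 0.05, interaction.depth = d)
  }
  mse <- function(fit, data) {
    pred <- predict(fit, newdata = data, n.trees = 2000)
    mean((data$price - pred)^2)
  }

  fit6 <- fit_boost(6)
  mse6 <- c(train = mse(fit6, train), test = mse(fit6, test))
  influence <- summary(fit6, plotit = FALSE)  # relative influence per covariate

  # Test set MSE for interaction depths 1 to 6
  test_mse_by_depth <- sapply(1:6, function(d) mse(fit_boost(d), test))
}
```

`which.min(test_mse_by_depth)` then gives the depth with the lowest test error; a depth of 1 corresponds to an additive model built from stumps, while larger values allow interactions among covariates.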

Exercise 5 (4 points)

Fit a linear model on the dataset, and compute the test set MSE.
In the markdown file, write a comment on the model that in your opinion performs best
on this data set among the ones that you fitted for explaining the relationship between the
outcome and the covariates. Describe this model in terms of how the covariates influence the
response. Finally, specify whether (and how) you would suggest improving the model.
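
The linear model fit and its test set MSE can be sketched in base R as follows. The data are simulated, and the response name `price` is an assumption based on the exercise description.

```r
set.seed(314)
# Simulated data standing in for the flight data set (values are made up)
n <- 200
dat <- data.frame(distance = runif(n, 200, 9000),
                  class = factor(sample(0:1, n, replace = TRUE)))
dat$price <- 50 + 0.1 * dat$distance + 400 * (dat$class == "1") +
  rnorm(n, sd = 30)
idx <- sample(1:n, n * 7 / 10)
train <- dat[idx, ]
test  <- dat[-idx, ]

# Least squares fit on the training set, evaluated on the test set
fit <- lm(price ~ ., data = train)
test_mse <- mean((test$price - predict(fit, newdata = test))^2)

summary(fit)$coefficients  # estimates, standard errors, t values and p-values
```

The resulting `test_mse` can be compared directly with the test set MSE of the other models, provided they all use the same train/test split.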
