Exam of Statistical Learning

Alessia Pini

2023-03-14

Theory
Answer the following questions by choosing the correct alternative. Note that only one alternative is true.
Motivate ALL your answers.

Question 1 (1.5 points)

We compute PCA on a sample of size n = 47 and p = 2 covariates. The sample covariance matrix is the
following:

##      [,1] [,2]
## [1,]  1.5  0.7
## [2,]  0.7  1.6

a. the variance of the second principal component is 0.7195
b. the loadings of the second principal component are (0.7319, 0.6815)
c. none of the above
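
As a minimal sketch of how the quantities in the alternatives can be checked numerically: the eigendecomposition of the sample covariance matrix gives the variances of the principal components (eigenvalues) and their loadings (eigenvectors, each defined only up to a sign flip). The names `S` and `pca` are illustrative and do not come from the exam.

```r
# Sample covariance matrix from Question 1
S <- matrix(c(1.5, 0.7,
              0.7, 1.6), nrow = 2, byrow = TRUE)

# Eigenvalues = variances of the principal components (decreasing order);
# eigenvectors = loadings, each defined only up to a sign flip
pca <- eigen(S, symmetric = TRUE)

pca$values        # variances of PC1 and PC2
pca$vectors[, 2]  # loadings of the second principal component
```

When comparing loadings with a proposed vector, both the absolute values and the relative signs of the two entries matter, since only a joint sign flip leaves a principal component unchanged.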

Question 2 (1.5 points)

We want to cluster the data displayed in Figure 1 (n = 450, p = 2). We assume that the data belong to two
clusters, and that within each cluster the data are normally distributed.

a. k-means is expected to work well in this situation, since the two clusters seem to have a covariance
matrix that is proportional to the identity
b. we should not use k-means, since the two clusters seem to have a covariance matrix that is not
proportional to the identity
c. none of the above

Question 3 (1.5 points)

Based on a training set with sample size n = 20, we fit with least squares the polynomial regression with
response variable Y and covariate X:

$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \dots + \beta_{30} X^{30} + \varepsilon.$$

a. the training set residual sum of squares can be used for judging the validity of the model
b. the least squares problem has no solution
c. none of the above

Figure 1: Plot of the data (scatterplot of data[,2] against data[,1])

Question 4 (1.5 points)

We perform k-means clustering on a sample of size n = 77 with p = 41 covariates.

a. at each iteration, cluster centers are computed as the sample means of the number of points within a
cluster
b. at each iteration, each point is assigned to the cluster with the smallest number of points
c. none of the above
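
For reference, one iteration of the standard k-means algorithm (Lloyd's algorithm) can be sketched as below. This is a minimal illustration on simulated data of the same dimensions as in the question. The number of clusters `K = 3`, the seed, and all variable names are arbitrary assumptions, not part of the exam.

```r
set.seed(1)                      # arbitrary seed for the simulated data
n <- 77; p <- 41; K <- 3         # K = 3 is an arbitrary choice for this sketch
X <- matrix(rnorm(n * p), n, p)  # simulated data, not the exam's data

# Initial centers: K randomly chosen observations
centers <- X[sample(n, K), , drop = FALSE]

# Assignment step: each point goes to the cluster with the closest center
# (d2 is an n x K matrix of squared Euclidean distances)
d2 <- sapply(1:K, function(k) rowSums(sweep(X, 2, centers[k, ])^2))
labels <- max.col(-d2, ties.method = "first")

# Update step: each center becomes the mean of the points assigned to it
for (k in 1:K) {
  if (any(labels == k)) {
    centers[k, ] <- colMeans(X[labels == k, , drop = FALSE])
  }
}
```

In practice the two steps are repeated until the labels stop changing. The built-in `kmeans()` function does this and is preferable to a hand-written loop.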

Question 5 (1.5 points)

k-nearest neighbors:

a. can only be applied with Euclidean distance
b. can only be applied to classification problems with 2 classes
c. none of the above

Question 6 (1.5 points)

Consider the plot in Figure 2, reporting a sample of size 942 with two covariates and the value of a two-level
categorical outcome (color). Which method would you suggest to classify the data?

a. maximum margin classifier
b. logistic regression
c. none of the above

Figure 2: Data (scatterplot of the two covariates, both axes from 0.0 to 1.0, with color indicating the outcome level)

Exercises
You are asked to model the dependence between the price of a flight from an Italian airport and the following
set of covariates describing the flight and purchase characteristics:

• destination: categorical, indicating the destination (levels: 0 = Italy; 1 = Europe (outside Italy); 2 = Asia; 3 = North America; 4 = South America; 5 = Oceania)
• type: categorical, indicating the type of ticket (levels: 1 = one way; 2 = round trip)
• airline: categorical, indicating the airline type (levels: 1 = low cost; 2 = mainline)
• promo: categorical, indicating whether there were special promotions on the date of purchase (levels: 0 = no promotions; 1 = promotions)
• class: categorical, indicating the class of the ticket (levels: 0 = economy; 1 = business)
• distance: numerical, length or distance of the flight (in km)
• duration: numerical, time duration of the flight (in hours)
• flight_year: numerical, year of the flight (2014 to 2023)
• flight_day: numerical, day of the year of the flight (1 to 365)
• purchase_to_flight: numerical, elapsed time between the purchase and the flight (in days)

In all exercises involving supervised methods, you should use a training set containing 70% of the
observations, whose row indices are selected with the following code and a random seed equal to 314.

n = dim([Link])[1]
set.seed(seed)
[Link] = sample(1:n,n*7/10)
train = [Link][[Link],]
test = [Link][-[Link],]
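
The split can be sketched on a toy data frame as follows. The names `flights` and `train.idx` are hypothetical placeholders, since the original object names are not legible in the code above, and the simulated data frame only stands in for the real data set.

```r
# Hypothetical stand-in for the flight data (the real data set is not shown)
flights <- data.frame(price = rnorm(200, mean = 300, sd = 50),
                      distance = runif(200, 200, 12000))

seed <- 314                            # random seed stated in the text
n <- dim(flights)[1]
set.seed(seed)
train.idx <- sample(1:n, n * 7 / 10)   # 70% of the row indices
train <- flights[train.idx, ]          # training set
test  <- flights[-train.idx, ]         # remaining 30% as test set
```

Setting the seed immediately before `sample()` is what makes the selected rows reproducible across runs.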

Answer all the following questions in the markdown file. Make sure to motivate (with code or text) all
your answers.

Exercise 1 (points: 1+2+2)

Run a hierarchical clustering on the numerical covariates based on the Euclidean distance and Ward linkage.

a. Cut the tree at 2 clusters. How many units are contained in each cluster?
b. Cut the tree at 2 clusters and display the scatterplot of all covariates colored according to cluster labels.
Comment on the clusters obtained with the suggested method.
c. Now cluster the data with the average linkage and 2 clusters. In the R script, comment on
which of the two proposed linkage methods you would suggest using. Would you
suggest another linkage method?
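
The workflow of this exercise can be sketched with base R functions as below. The data are simulated and only stand in for the numerical covariates. The variable names are illustrative, and leaving the covariates unscaled is an assumption, since the exercise does not mention standardization.

```r
set.seed(314)
# Simulated numerical covariates standing in for the real ones
num <- data.frame(distance = runif(100, 200, 12000),
                  duration = runif(100, 1, 14),
                  purchase_to_flight = rpois(100, 30))

d <- dist(num, method = "euclidean")      # Euclidean distance matrix

hc.ward <- hclust(d, method = "ward.D2")  # Ward linkage
hc.avg  <- hclust(d, method = "average")  # average linkage

cl.ward <- cutree(hc.ward, k = 2)         # cut each tree at 2 clusters
cl.avg  <- cutree(hc.avg,  k = 2)

table(cl.ward)                            # number of units per cluster
table(cl.avg)
pairs(num, col = cl.ward)                 # scatterplots colored by cluster label
```

In `hclust()`, `"ward.D2"` applies Ward's criterion to unsquared Euclidean distances, whereas `"ward.D"` expects squared distances as input.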

Exercise 2 (points: 1+2+2)

Answer the following questions.

a. Fit a Lasso regression setting a shrinkage parameter equal to λ = 2.2. Then, compute the vector of
estimated coefficients.
b. Now, compute the optimal value of the shrinkage parameter by cross-validation. Use 10-fold
cross-validation, a grid of 100 values of λ between $10^{-6}$ and $10^{6}$, and set a seed equal to 314. In the
markdown file, write a comment on the cross-validation results, suggesting a value for the
parameter λ.
c. Refit the model with the optimal value obtained at point b, and compute its MSE. In the markdown
file, write a comment on the coefficients obtained with this optimal value.
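
One possible way to carry out these steps is sketched below using the `glmnet` package. The package choice is an assumption (the exam text does not name it), and the data are simulated. With the real data, the design matrix would typically be built with `model.matrix()` so that categorical covariates are expanded into dummy variables.

```r
library(glmnet)  # assumed to be installed; not named in the exam text

set.seed(314)
# Simulated data standing in for the training set
x <- matrix(rnorm(200 * 5), 200, 5)
y <- 3 * x[, 1] - 2 * x[, 2] + rnorm(200)

# Lasso (alpha = 1) at a fixed shrinkage parameter
fit <- glmnet(x, y, alpha = 1, lambda = 2.2)
coef(fit)                           # intercept and estimated coefficients

# 10-fold cross-validation over 100 values of lambda between 10^-6 and 10^6
grid <- 10^seq(6, -6, length.out = 100)
set.seed(314)
cv <- cv.glmnet(x, y, alpha = 1, lambda = grid, nfolds = 10)
cv$lambda.min                       # lambda with the smallest CV error

# MSE at the selected lambda (here computed on the same simulated data)
pred <- predict(cv, newx = x, s = "lambda.min")
mse <- mean((y - pred)^2)
```

`plot(cv)` shows the cross-validated error along the grid, which helps when commenting on how flat the error curve is around the selected value.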

Exercise 3 (points: 1+2+2)

Using all covariates, fit a regression tree setting the following parameters:

• minimum number of observations to include in either child node: 4
• smallest allowed node size: 8
• minimum reduction of the MSE for considering a node split: $3 \times 10^{-4}$.

a. Compute the MSE on the training and test set.
b. Now prune the previous tree so that the final number of leaves is equal to 8. Compute the train and
test set MSE of the pruned tree. Which tree size would you suggest using?
c. Plot the resulting regression tree. Write a comment on the structure of the obtained tree.
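
A minimal sketch of these steps follows. It assumes the `tree` package, whose `tree.control()` arguments `mincut`, `minsize` and `mindev` appear to correspond to the three parameters listed above. This mapping is an assumption, and the data and the helper `mse()` are made up for illustration.

```r
library(tree)  # assumed package; not named in the exam text

set.seed(314)
# Simulated stand-in for the flight data
dat <- data.frame(distance = runif(300, 200, 12000),
                  class = factor(sample(0:1, 300, replace = TRUE)))
dat$price <- 50 + 0.05 * dat$distance + 200 * (dat$class == "1") +
  rnorm(300, sd = 30)
idx <- sample(1:300, 210)
train <- dat[idx, ]; test <- dat[-idx, ]

# Control parameters: child node size, node size, minimum deviance reduction
ctrl <- tree.control(nobs = nrow(train), mincut = 4, minsize = 8, mindev = 3e-4)
fit <- tree(price ~ ., data = train, control = ctrl)

# Hypothetical helper: mean squared error of a fitted tree on a data set
mse <- function(model, data) mean((data$price - predict(model, newdata = data))^2)
c(train = mse(fit, train), test = mse(fit, test))

# Prune towards 8 terminal nodes and recompute both errors
pruned <- prune.tree(fit, best = 8)
c(train = mse(pruned, train), test = mse(pruned, test))

plot(pruned); text(pruned, pretty = 0)
```

If the cost-complexity sequence contains no subtree with exactly the requested number of leaves, `prune.tree()` returns the next largest one, so the size of the pruned tree is worth checking.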

Exercise 4 (points: 1+2+2)

Run a boosting model on the data set, setting the following parameters:

• number of trees: B = 2000
• shrinkage: λ = 0.05
• interaction depth: d = 6

Then, answer the following questions.

a. Compute the train and test set MSE of the boosting model.
b. What can you say about the most informative covariates? Write a comment in the
markdown file.
c. Now, recompute the MSE for d spanning from 1 to 6. Which value would you suggest for the
parameter d? What is the interpretation of such value?
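
The steps above could be sketched as follows with the `gbm` package. The package choice is an assumption, since the exam text does not name one, and the data and the helper `mse()` are simulated or hypothetical.

```r
library(gbm)  # assumed package; not named in the exam text

set.seed(314)
# Simulated stand-in for the flight data
dat <- data.frame(distance = runif(300, 200, 12000),
                  duration = runif(300, 1, 14))
dat$price <- 50 + 0.05 * dat$distance + rnorm(300, sd = 30)
idx <- sample(1:300, 210)
train <- dat[idx, ]; test <- dat[-idx, ]

# Boosted regression trees with B = 2000, lambda = 0.05, d = 6
boost <- gbm(price ~ ., data = train, distribution = "gaussian",
             n.trees = 2000, shrinkage = 0.05, interaction.depth = 6)

# Hypothetical helper: MSE of a boosted model using all 2000 trees
mse <- function(model, data) {
  mean((data$price - predict(model, newdata = data, n.trees = 2000))^2)
}
c(train = mse(boost, train), test = mse(boost, test))

summary(boost, plotit = FALSE)   # relative influence of each covariate

# Test MSE as a function of the interaction depth d = 1, ..., 6
test.mse <- sapply(1:6, function(d) {
  b <- gbm(price ~ ., data = train, distribution = "gaussian",
           n.trees = 2000, shrinkage = 0.05, interaction.depth = d)
  mse(b, test)
})
```

The interaction depth bounds the number of splits in each tree, so `d = 1` corresponds to an additive model built from stumps, while larger values allow interactions among covariates.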

Exercise 5 (4 points)

Fit a linear model on the data set, and compute the test set MSE.
In the markdown file, write a comment on the model that in your opinion performs best
on this data set among the ones that you fitted for explaining the relationship between the
outcome and the covariates. Describe this model in terms of how the covariates influence the
response. Finally, specify whether (and how) you would suggest improving the model.
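
Fitting the linear model and computing its test set MSE can be sketched in base R as below. The data frame and its column names are simulated placeholders for the real data set.

```r
set.seed(314)
# Simulated stand-in for the flight data
dat <- data.frame(distance = runif(300, 200, 12000),
                  class = factor(sample(0:1, 300, replace = TRUE)))
dat$price <- 50 + 0.05 * dat$distance + 200 * (dat$class == "1") +
  rnorm(300, sd = 30)
idx <- sample(1:300, 210)
train <- dat[idx, ]; test <- dat[-idx, ]

# Linear model with all covariates, fitted on the training set only
fit.lm <- lm(price ~ ., data = train)
summary(fit.lm)$coefficients   # estimates, standard errors, t-tests

# Test set MSE
test.mse <- mean((test$price - predict(fit.lm, newdata = test))^2)
```

Comparing this value with the test set MSE of the other fitted models, all computed on the same test set, gives a common basis for choosing among them.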
