
Data Science Methods in Finance

R Tutorial 3

November 8, 2024

Important Instructions
• The purpose of this tutorial is for you to practise some of the key concepts we covered in the first topic.

• It should not be submitted, but we strongly encourage you to work through it.

For this exercise, NO write-up of your answers or submission is required. However, we recommend that you begin developing clean programs that you can reuse later in the group assignment and the take-home exam.

Question 1
The task in this question is to predict diamond prices. The main learning goal is to begin familiarizing you with the caret package, which contains functions to streamline the model training process for complex regression and classification problems. This package alone is often all you need for solving almost any supervised machine learning problem; in addition, it provides tools for auxiliary techniques such as:

• Data preparation (imputation, centering/scaling data, etc.)

• Data splitting

• Variable selection

• Model evaluation

1. Load the ggplot2 package and summarize the diamonds dataset (a dataset that comes built in with ggplot2), which contains the prices and other attributes of almost 54,000 diamonds. Split the data into a training and a test sample.
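A minimal sketch of this step, assuming the ggplot2 and caret packages are installed (the 80/20 split and the seed value are arbitrary choices):

```r
library(ggplot2)  # provides the built-in diamonds dataset
library(caret)

summary(diamonds)  # ~54,000 rows: price, carat, cut, color, clarity, ...

set.seed(123)  # arbitrary seed, for reproducibility
train_idx <- createDataPartition(diamonds$price, p = 0.8, list = FALSE)
train <- diamonds[train_idx, ]
test  <- diamonds[-train_idx, ]
```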

2. Estimate a linear model with lm and compare the RMSE in the test sample with the
RMSE based on the training sample.
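One way to sketch this comparison (the split repeats step 1 so the snippet runs on its own; the rmse helper is our own function, not part of caret):

```r
library(ggplot2); library(caret)
set.seed(123)
idx   <- createDataPartition(diamonds$price, p = 0.8, list = FALSE)
train <- diamonds[idx, ];  test <- diamonds[-idx, ]

fit_lm <- lm(price ~ ., data = train)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(train$price, predict(fit_lm, newdata = train))  # in-sample RMSE
rmse(test$price,  predict(fit_lm, newdata = test))   # out-of-sample RMSE
```

With a well-specified linear model the two numbers are usually close, with the test RMSE slightly higher.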

3. Fit a Lasso to the training sample by using the glmnet package. Compare the RMSE
with the RMSE from the linear model for both the training and testing sample.
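A sketch using cv.glmnet, which expects a numeric design matrix rather than a formula; lambda is chosen by the package's built-in cross-validation:

```r
library(ggplot2); library(caret); library(glmnet)
set.seed(123)
idx   <- createDataPartition(diamonds$price, p = 0.8, list = FALSE)
train <- diamonds[idx, ];  test <- diamonds[-idx, ]

# glmnet needs a numeric matrix; model.matrix expands the factor columns
x_train <- model.matrix(price ~ ., data = train)[, -1]
x_test  <- model.matrix(price ~ ., data = test)[, -1]

cv_lasso <- cv.glmnet(x_train, train$price, alpha = 1)  # alpha = 1 -> Lasso

rmse <- function(a, p) sqrt(mean((a - p)^2))
rmse(train$price, predict(cv_lasso, newx = x_train, s = "lambda.min"))
rmse(test$price,  predict(cv_lasso, newx = x_test,  s = "lambda.min"))
```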

4. We now want to repeat the exercise with the caret package. Begin by googling "trainControl caret". The function trainControl generates parameters that further control how models are created. Initialize a trainControl object that trains a model using 5-fold cross-validation.
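This step amounts to a single call:

```r
library(caret)

# 5-fold cross-validation; trainControl only stores the settings,
# the actual resampling happens later inside train()
ctrl <- trainControl(method = "cv", number = 5)
```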

5. Train an elastic-net model and set tuneLength=10. Use the trainControl object and
compare the RMSE with that from the linear model.
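A possible sketch, subsampling the training data to keep the run time short (the 5,000-row subsample is our own shortcut, not part of the exercise):

```r
library(ggplot2); library(caret)
set.seed(123)
idx   <- createDataPartition(diamonds$price, p = 0.8, list = FALSE)
train <- diamonds[idx, ]
train_small <- train[sample(nrow(train), 5000), ]  # shortcut for speed

ctrl <- trainControl(method = "cv", number = 5)
fit_enet <- train(price ~ ., data = train_small, method = "glmnet",
                  trControl = ctrl, tuneLength = 10)

fit_enet$bestTune           # chosen (alpha, lambda) pair
min(fit_enet$results$RMSE)  # best CV RMSE, to compare with the lm RMSE
```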

6. Explain how glmnet can fit both the Lasso and Ridge regressions, which you control via the alpha parameter: setting alpha = 0 gives Ridge regression, while setting alpha = 1 gives Lasso regression.

Question 2
The main task of this question is to forecast the return on the US stock market. To do so, you will need to use time-series cross-validation and the supervised learning methods we covered in class. With your permission, we plan to post the two best codes with output on Canvas.
The following steps are meant to help you get started.

1. We will use the datafile "[Link]"

2. You are free to choose any horizon, but we will use the quarterly horizon in the tutorial

3. The target you want to predict is the variable called "CRSP_SPvw" and the set of predictors are the ones in the Excel sheet. You are free to choose which ones to include, but use at least 5, and make sure to lag the predictors by one period (or lead the target)

4. Use the readxl library to get your data into R and clean it up with dplyr. Using dplyr together with readxl is part of the learning goal. The R file on Canvas, "02_start", provides an example of how to align data.
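The lag-and-align pattern can be sketched as follows. The actual file name comes from Canvas, so the snippet builds a toy tibble instead; the predictor names dp and tbl are hypothetical placeholders, and only CRSP_SPvw is from the exercise:

```r
library(dplyr)
library(tibble)

# In practice: raw <- readxl::read_excel(<file from Canvas>)
# Toy quarterly data with the given target and two hypothetical predictors:
raw <- tibble(
  date      = seq(as.Date("2000-01-01"), by = "quarter", length.out = 8),
  CRSP_SPvw = rnorm(8, 0.02, 0.08),
  dp        = rnorm(8),  # hypothetical predictor
  tbl       = rnorm(8)   # hypothetical predictor
)

df <- raw %>%
  arrange(date) %>%
  mutate(across(c(dp, tbl), lag)) %>%  # lag predictors by one period
  filter(!is.na(dp))                   # drop the first row created by the lag
```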

After having cleaned the data, you should write a program that produces the following output (but feel free to do much more):

5. Write a program that takes predictors as input and predicts the return on the market next period (feel free to play around with different forecast horizons)
6. Calculate the out-of-sample R² over the full period

7. Calculate the out-of-sample R² on a rolling basis

8. Compare the performance of your best model with that of a rolling mean. A rolling mean means that you take the average of the target over the period from time t to T and use that to predict the value of the target for period T + k.
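The out-of-sample R² and the rolling-mean benchmark from steps 6-8 can be sketched as follows. In your program, actual and preds would come from your forecasting loop; here they are simulated so the snippet runs on its own:

```r
set.seed(1)
actual <- rnorm(59, 0.02, 0.08)        # stand-in for realized returns
preds  <- actual + rnorm(59, 0, 0.06)  # stand-in for model forecasts

# Rolling-mean benchmark: expanding average of past observations
bench <- c(NA, cumsum(actual)[-length(actual)] / seq_len(length(actual) - 1))

ok <- !is.na(bench)
r2_oos <- 1 - sum((actual[ok] - preds[ok])^2) /
              sum((actual[ok] - bench[ok])^2)
r2_oos  # > 0 means the model beats the rolling-mean benchmark
```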

Solution Q1.6
The glmnet package in R provides tools for fitting generalized linear models using penalized
maximum likelihood estimation. It supports ridge regression, lasso regression, and elastic-net
regression.
The general formulation of the regression problem solved by glmnet, with predictor matrix X = [x_1, x_2, \dots, x_n], response variable y, and regularization parameters \lambda and \alpha, is:

\min_{\beta} \; \frac{1}{2N} \|y - X\beta\|_2^2 + \lambda \left[ (1-\alpha) \frac{1}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1 \right],

where:

• \frac{1}{2N} \|y - X\beta\|_2^2 represents the mean squared error (MSE) scaled by \frac{1}{2}. This scaling is commonly used to simplify the mathematical derivation of the optimization problem.

• \|\beta\|_2^2 represents the squared L2-norm of the coefficients \beta, known as the ridge penalty term.

• \|\beta\|_1 represents the L1-norm of the coefficients \beta, known as the lasso penalty term.

• N is the number of observations.

• λ is the regularization parameter controlling the amount of shrinkage: larger values of λ result in greater shrinkage.

• α is the elastic-net mixing parameter, where 0 ≤ α ≤ 1:

– α = 1: The penalty is purely the lasso penalty.

– α = 0: The penalty is purely the ridge penalty.

– 0 < α < 1: The penalty combines lasso and ridge penalties, forming the elastic net
penalty.

The parameter λ is typically chosen through cross-validation.


For specific cases:

• Setting α = 1 in the glmnet function fits a lasso model, minimizing:

\min_{\beta} \; \frac{1}{2N} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 .

• Setting α = 0 fits a ridge regression model, minimizing:

\min_{\beta} \; \frac{1}{2N} \|y - X\beta\|_2^2 + \frac{\lambda}{2} \|\beta\|_2^2 .

Lasso regression performs feature selection by setting some coefficients exactly to zero.
Ridge regression, while shrinking coefficients, retains all predictors in the model.
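A small simulated illustration of this difference (the data-generating process and the lambda value are arbitrary choices):

```r
library(glmnet)
set.seed(1)
x <- matrix(rnorm(200 * 10), 200, 10)
y <- 2 * x[, 1] + rnorm(200)  # only the first predictor matters

lasso <- glmnet(x, y, alpha = 1, lambda = 0.5)  # L1 penalty
ridge <- glmnet(x, y, alpha = 0, lambda = 0.5)  # L2 penalty

sum(coef(lasso)[-1] == 0)  # Lasso: most coefficients exactly zero
sum(coef(ridge)[-1] == 0)  # Ridge: coefficients shrunk but nonzero
```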
