Linear Regression
2022/2023
Luís Paquete
University of Coimbra
Contents
● Linear regression model
● Multiple linear regression model
● Coefficient of determination
● Assumptions of linear regression
● Transformations
Regression model
● A mathematical model that describes the behavior of a system over a range of
input values.
● A regression model makes it possible to predict how the system will perform for an
input value that was not measured.
● A linear regression model assumes a linear relationship between the input
variable and the output variable.
Regression model
● A simple linear regression model has the form
y = a + bx
where x is the input variable, y is the predicted output
variable, and a and b are the regression parameters.
● If yi is the value measured for the input value xi, then
(xi,yi) can be written as
yi = a + bxi + ei
where ei is the residual for the i-th measurement, that
is, the difference between the measured value yi and the
value that the model would have predicted.
Regression model
● To find the a and b that form the line that most closely
fits the n measured data points, minimize the sum of
squares of the residuals, SSE:

SSE = Σ ei^2 = Σ (yi - a - b xi)^2
A side note: Why the sum of squares?
● Why not the sum of absolute differences? That function is not differentiable
at 0, so its minimizers cannot be found as easily.
● The sum of squares is differentiable everywhere and convex, so any
local minimum is also a global minimum. Moreover, a and b can
be calculated by a closed formula.
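Setting the partial derivatives of SSE with respect to a and b to zero yields the
standard closed formula, where x̄ and ȳ denote the sample means of the xi and yi:

b = Σ (xi - x̄)(yi - ȳ) / Σ (xi - x̄)^2
a = ȳ - b x̄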
Example
Develop a regression model to relate the time required to perform a file-read operation to
the number of bytes read
File size in bytes Times in ms
10 3.8
50 8.1
100 11.9
500 55.6
1000 99.6
5000 500.2
10000 1006.1

y = 2.24 + 0.1002 x
Example in R
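The R session on the original slide is not reproduced here; what follows is a minimal
sketch that fits the model above, with the data typed in as vectors (the names size,
time, and fit are mine):

> size <- c(10, 50, 100, 500, 1000, 5000, 10000)
> time <- c(3.8, 8.1, 11.9, 55.6, 99.6, 500.2, 1006.1)
> fit <- lm(time ~ size)  # least-squares fit of time = a + b*size
> coef(fit)               # approximately: (Intercept) 2.24, size 0.1002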
Multiple linear regression
● Multiple linear regression extends linear regression to k > 1 independent input
variables
y = b0 + b1 x1 + b2 x2 + ... + bk xk
● Each data point (x1i, x2i, ..., xki, yi) can be expressed as
yi = b0 + b1 x1i + b2 x2i + ... + bk xki + ei
where ei is the residual
Multiple linear regression
● The sum of squared errors (SSE) is

SSE = Σ ei^2 = Σ (yi - b0 - b1 x1i - ... - bk xki)^2

● Using matrix notation, the multiple linear regression model is
Y = Xb + e
where b = (X^T X)^-1 X^T Y minimizes SSE
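As an illustration only (lm() does not form the inverse explicitly; it uses a QR
decomposition, which is numerically more stable), b can be computed directly from
this formula, assuming x1, x2, and Y hold the measured data:

> X <- cbind(1, x1, x2)                  # design matrix: a column of ones plus the inputs
> b <- solve(t(X) %*% X) %*% t(X) %*% Y  # b = (X'X)^-1 X'Y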
Example
Develop a regression model to relate the time required to perform a certain number of
input-output and memory operations
IO operations Mem. operations Times in ms
10 10 2.8
10 100 3.1
100 10 10.9
100 100 12.6
1000 10 106.2
1000 100 119.1
Example in R
> D <- read.table("...", header=TRUE)   # file name elided on the original slide
> fit <- lm(D$time ~ D$IO + D$Mem)
> summary(fit)
Call:
lm(formula = D$time ~ D$IO + D$Mem)
Residuals:
1 2 3 4 5 6
2.9144 -1.7523 0.9941 -2.2725 -3.9086 4.0248
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.779630 2.947538 -0.604 0.589
D$IO 0.111336 0.003698 30.104 8.05e-05 ***
D$Mem 0.055185 0.036737 1.502 0.230
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.049 on 3 degrees of freedom
Multiple R-squared: 0.9967, Adjusted R-squared: 0.9945
F-statistic: 454.2 on 2 and 3 DF, p-value: 0.0001888
y = -1.780 + 0.111 x1 + 0.055 x2
Multivariate linear regression
● Multivariate linear regression extends linear regression to m > 1 dependent output
variables
Y = B0 + B1 x1 + B2 x2 + ... + Bk xk
● Each data point (x1i, x2i, ..., xki, y1i, ..., ymi) can be expressed, for each output j, as
yij = b0j + b1j x1i + b2j x2i + ... + bkj xki + eij
where eij is the residual
Coefficient of determination
● Determines how much of the total variation is "explained" by the linear model.
● SST is the total variation of the measured system output,

SST = Σ (yi - ȳ)^2

which is partitioned into two components, SST = SSR + SSE:
SSR: portion of the SST that is explained by the regression model
SSE: portion of the SST that is due to the measurement error
Coefficient of determination
● The coefficient of determination r2 is the fraction of SST "explained" by the model:

r2 = SSR / SST = 1 - SSE / SST

● If r2 = 0, then SSE is as large as SST: the model explains none of the variation
● If r2 = 1, then SSE is 0: the model explains all of the variation
Coefficient of correlation
● The coefficient of determination is the squared value of the coefficient of
correlation between x and y.
● The sign of r shows whether the correlation between input and output is positive
(0 < r ≤ 1) or negative (-1 ≤ r < 0), and its magnitude indicates the strength of the
linear relation.
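A quick check in R, reusing the size and time vectors from the file-read sketch above:
cor() gives r directly, and its square matches the Multiple R-squared reported by summary().

> r <- cor(size, time)  # coefficient of correlation
> r^2                   # equals summary(lm(time ~ size))$r.squared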
Coefficient of correlation
● A side note: correlation does not imply causation
Example
Develop a regression model to relate the time required to perform a file-read operation to
the number of bytes read
File size in bytes Times in ms
10 3.8
50 8.1
100 11.9
500 55.6
1000 99.6
5000 500.2
10000 1006.1

y = 2.24 + 0.1002 x, r2 = 0.9996
Example in R
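In place of the missing session, a one-line sketch: with fit from the earlier file-read
sketch, the coefficient of determination can be read from the summary object.

> summary(fit)$r.squared  # approximately 0.9996 for the file-read data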
Assumptions of linear regression
● A more complete examination of the underlying assumptions of linear regression
may indicate whether the model can be used for prediction (inference).
● In R, the linear regression assumptions can be verified by doing
plot(<linear model>)
Assumptions of linear regression
The residuals-vs-fitted plot allows one to verify:
● Linearity: the mean residual value in every fitted-value
region (red line) should be close to 0.
● Homoskedasticity (constant variance): the
spread of the residuals should be approximately the
same across the x-axis.
● Outliers: identify extreme residuals.
The normal Q-Q plot verifies the normality of the residuals.
Example:
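The diagnostic plots on the original slide are not reproduced; a sketch of how to
generate them for the file-read fit from the earlier sketch:

> par(mfrow = c(2, 2))  # arrange the four diagnostic plots in one window
> plot(fit)             # residuals-vs-fitted, normal Q-Q, scale-location, residuals-vs-leverage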
Transformations
● A way of overcoming violations of these assumptions is to transform the data
Rule of Thumb 1: Transforming y may correct problems with the error terms.
Rule of Thumb 2: Transforming x may correct the non-linearity.
● However, a transformed model may be harder to interpret
Transformations
Example: (D. Bruce and F. X. Schumacher, 1935)
● Predict the volume of a tree (y) from its diameter (x)
y = -41.57 + 6.93 x, r2 = 0.89
Transformations
Example: (D. Bruce and F. X. Schumacher, 1935)
● Predict the log of the volume of a tree (ln y) from the log of its diameter (ln x)
ln y = -2.87 + 2.56 ln x, r2 = 0.97
Transformations
● It is also possible to deduce a possible transformation by plotting the data or having
some assumption about the process of generating y values
● For instance, if an exponential behavior is expected, such as
y = a·b^x
then, taking the logarithm of both sides,
ln y = ln a + (ln b) x
the expression has a linear form:
y' = a' + b' x
Example
Develop a regression model for the number of transistors in the following years
Year Transistors
1 9500
2 16000
3 23000
4 38000
5 62000
6 105000
Example in R
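A sketch in place of the original session: fitting a straight line to the raw counts
first, to see that it is a poor model (the variable names are mine):

> year <- 1:6
> transistors <- c(9500, 16000, 23000, 38000, 62000, 105000)
> fit.raw <- lm(transistors ~ year)
> plot(year, transistors); abline(fit.raw)  # the growth is clearly not linear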
Example
Develop a regression model for the number of transistors in the following years
Year ln(Transistors)
1 9.1590
2 9.6803
3 10.0432
4 10.5453
5 11.0349
6 11.5617
b’ = 0.474
a’ = 8.679
y' = 8.679 + 0.474x
Example in R
After the ln(transistors) transformation, r2 is much closer to 1.
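A sketch of the corresponding fit on the log scale; the coefficients and the
back-transformed values should match the slides around this one:

> fit.log <- lm(log(transistors) ~ year)
> coef(fit.log)       # approximately a' = 8.679, b' = 0.474
> exp(coef(fit.log))  # back-transform: a = e^a' ≈ 5878, b = e^b' ≈ 1.61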
Example
Develop a regression model for the number of transistors in the following years
Year Transistors
1 9500
2 16000
3 23000
4 38000
5 62000
6 105000
b' = 0.474, so b = e^b' = 1.61
a' = 8.679, so a = e^a' = 5878

y = 5878 · (1.61)^x
Example
Develop a regression model for the relation between CPU-time and number of processors
Processors CPU-time
1 100
2 54
3 25
4 18
5 15
6 12
7 10
8 12
9 8
Example in R
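Sketch of the session: a straight-line fit to the raw data, which the plot shows to
be inadequate (the variable names are mine):

> procs <- 1:9
> cputime <- c(100, 54, 25, 18, 15, 12, 10, 12, 8)
> fit.raw <- lm(cputime ~ procs)
> plot(procs, cputime); abline(fit.raw)  # the relation looks like 1/x, not a line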
Example
Reciprocal transformation:
Processors CPU-time^-1
1 0.01
2 0.02
3 0.04
4 0.06
5 0.07
6 0.08
7 0.10
8 0.08
9 0.13
Example in R
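Sketch of the reciprocal fit; I() is needed so that 1/cputime is evaluated
arithmetically inside the model formula:

> fit.inv <- lm(I(1/cputime) ~ procs)
> coef(fit.inv)  # approximately -0.002 and 0.013, as on the next slide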
Example
Develop a regression model for the relation between CPU-time and number of processors
Processors CPU-time
1 100
2 54
3 25
4 18
5 15
6 12
7 10
8 12
9 8
y = (-0.002 + 0.013 x)^-1
Example
Develop a regression model for the CPU-time of binary search given a list size
Size CPU-time
1 6.91
2 7.60
3 8.00
4 8.29
5 8.52
6 8.70
7 8.85
8 8.99
9 9.01
Example in R
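Sketch: a straight-line fit to the binary-search times, for comparison with the
logarithmic fit that follows (the variable names are mine):

> size <- 1:9
> cputime <- c(6.91, 7.60, 8.00, 8.29, 8.52, 8.70, 8.85, 8.99, 9.01)
> fit.raw <- lm(cputime ~ size)
> plot(size, cputime); abline(fit.raw)  # growth flattens out: logarithmic rather than linear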
Example
Logarithmic transformation: y = a + b log x
log Size CPU-time
0.00 6.91
0.69 7.60
1.10 8.00
1.39 8.29
1.61 8.52
1.79 8.70
1.95 8.85
2.08 8.99
2.20 9.01
Example in R
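Sketch of the logarithmic fit; the transformation can be written directly in the formula:

> fit.log <- lm(cputime ~ log(size))  # natural log, matching the table above
> coef(fit.log)                       # approximately 6.92 and 0.98, as on the next slide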
Example
Develop a regression model for the CPU-time of binary search given a list size
Size CPU-time
1 6.91
2 7.60
3 8.00
4 8.29
5 8.52
6 8.70
7 8.85
8 8.99
9 9.01
y = 6.92 + 0.98 log x
Example
Develop a regression model for the CPU-time of insertion sort
Size CPU-time
1 2
2 1
3 6
4 14
5 15
6 30
7 40
8 74
9 75
Example in R
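Sketch: fitting the raw insertion-sort times with a straight line, to contrast with
the square-root transformation that follows:

> size <- 1:9
> cputime <- c(2, 1, 6, 14, 15, 30, 40, 74, 75)
> fit.raw <- lm(cputime ~ size)
> plot(size, cputime); abline(fit.raw)  # growth is faster than linear (roughly quadratic)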
Example
Square root transformation: y^(1/2) = a + b x
Size CPU-time^(1/2)
1 1.00
2 1.41
3 2.45
4 3.74
5 3.87
6 5.48
7 6.32
8 8.60
9 8.66
Example in R
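Sketch of the square-root fit, again applying the transformation inside the formula:

> fit.sqrt <- lm(sqrt(cputime) ~ size)
> coef(fit.sqrt)  # approximately -0.49 and 1.02, as on the next slide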
Example
Develop a regression model for the CPU-time of insertion sort
Size CPU-time
1 2
2 1
3 6
4 14
5 15
6 30
7 40
8 74
9 75
y = (-0.49 + 1.02 x)^2
Recap:
● The linear regression model assumes a linear relationship between the input variable and
the output variable.
● The multiple linear regression model deals with more than one input variable.
● The coefficient of determination is the fraction of the total variation that is explained by
the linear model.
● The assumptions of linear regression need to be met in order to ensure that the model
can be used for inference (e.g., prediction).
● Transformations can be applied in order to model polynomial, exponential, or inverse
relationships, but some care must be taken in the interpretation of the resulting model.
References:
● D. J. Lilja, Measuring Computer Performance: A Practitioner's Guide, Cambridge
University Press, 2000 (see chapter 8)
● C. C. McGeoch, A Guide to Experimental Algorithmics, Cambridge University Press,
2012 (see chapter 7)
● J. Faraway, Practical Regression and Anova using R (see chapter 8)
● D. Bruce and F. X. Schumacher, Forest Mensuration, Botanical Gazette, 1935
● J. W. Tukey, Exploratory Data Analysis, Addison-Wesley, 1977
● F. Mosteller and J. W. Tukey, Data Analysis and Regression: A Second Course in
Statistics, Addison-Wesley, 1977