PLS and PLS-DA - Torino 2021 - Federico Marini
MULTIVARIATE REGRESSION
Federico Marini
Dept. Chemistry, University of Rome “La Sapienza”, Rome, Italy
Direct vs inverse calibration
• The distinction refers to the way the relationship between signal(s) and
concentration of chemical constituents is established.
– DIRECT: signal is considered to be directly proportional to the concentration
– INVERSE: concentration is considered to be directly proportional to the signal.
[Figure: univariate calibration: direct calibration plots absorbance vs concentration; inverse calibration plots concentration vs absorbance.]
Direct vs inverse calibration - 2
• UNIVARIATE CASE: no substantial difference exists between calibrating a regression line in a direct or in an inverse way.
– Inverse models are more efficient with small data sets and high noise levels
• MULTIVARIATE CASE: the difference is relevant.
– Direct models show some advantages, but present unsolvable problems regarding the variety of analytical platforms/problems they can be applied to
– Inverse models are almost always used.
[Figure: DIRECT model (absorbance spectra = concentrations × constituent profiles) vs INVERSE model (concentrations = absorbance spectra × regression coefficients), with the corresponding spectra, concentration profiles, and regression coefficients.]
Direct Calibration
• Requirements:
– Prepare mixtures of standards of all pure chemical constituents to be used as calibration samples.
– Number of mixtures ≥ number of constituents (the more mixtures, the more precise the results)
– Calibration mixtures should be as representative as possible of the combination of concentrations
in future, unknown samples
• Applicability:
– The concentrations of ALL the constituents in all calibration samples should be known.
– The constituents should be the same as in future test samples.
– Relates signals to constituents’ concentrations, but not to global properties (octane number,
sensorial attributes, iodine value, etc.)
• Limitations:
– Sensitive to spectral correlations (constituents with severely overlapped spectra cannot be reliably
determined)
– The presence of non-modeled interferents may lead to serious errors in the determination.
Inverse Calibration
• Applicability:
– Multicomponent samples where only one or a few analytes are of interest
– Concentrations, spectra, and chemical identities of the other constituents may be unknown.
– First proposed in the 1960s (connected to the development of NIR for the non-invasive analysis of
intact material*)
• Advantages:
– Allows complex mixtures to be studied while knowing only the concentrations of a limited number of constituents.
– Quantitation of an analyte in the presence of interferences (if they are properly represented in the
calibration samples, even if their concentrations or chemical identities are not known).
[Figure: inverse calibration example: concentrations = absorbance spectra × regression coefficients, with the corresponding regression-coefficient profile vs wavelength (nm).]
Why go multivariate?
• Instead of just using one of the variables, it makes sense to use all the measured
information
• There are many significant advantages in doing so:
– Noise reduction: More (redundant) measurements of the same phenomenon
– Handling of interferents: non-selective signals can be made selective by mathematics (provided that the signal profiles of the interferents are not completely identical to that of the analyte).
– Exploratory aspects: models provide a number of informative parameters + residuals
– Outlier detection: the detection of outliers is enhanced with multivariate data.
A first multivariate approach: MLR
Statement of the problem:
• Linear relationship between a response $y_i$ and all the variables $x_{ij}$ measured on a sample:
$y_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + b_3 x_{i3} + \dots + b_p x_{ip} + e_i$
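As a minimal illustration (not from the slides; synthetic data, all names and values are assumed), this model and its least-squares fit in Python/numpy:

```python
# Minimal MLR sketch (illustrative, synthetic data): fit
# y_i = b0 + b1*x_i1 + ... + bp*x_ip + e_i by least squares.
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 4                                      # 30 samples, 4 X-variables (assumed)
X = rng.normal(size=(n, p))
b_true = np.array([2.0, -1.0, 0.5, 3.0])
y = 1.5 + X @ b_true + 0.1 * rng.normal(size=n)   # intercept b0 = 1.5

X1 = np.column_stack([np.ones(n), X])             # add a column of ones for b0
b, *_ = np.linalg.lstsq(X1, y, rcond=None)        # least-squares solution
y_hat = X1 @ b                                    # fitted responses
print(b)                                          # first element estimates b0, the rest b1..bp
```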
A first multivariate approach: MLR
• The response estimates $\hat{y}_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_p x_{ip}$ lie on a p-dimensional hyperplane
[Figure: regression plane in the (X1, X2, Y) space; the residual of each sample is $e_i = y_i - \hat{y}_i$.]
Least squares as projection
• Column space of a matrix: space spanned by its columns (samples are the axes).
• Regression is a projection of y onto the column space of X:
$\hat{\mathbf{y}} = \mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y} = \mathbf{H}\mathbf{y}$
[Figure: geometric view of least squares: y is projected onto the column space of X (spanned by var1 and var2), giving $\hat{\mathbf{y}}$.]
Model predictions and confidence intervals
• The predicted response for any of the observations used for model
building is :
$\hat{y}_i = \mathbf{x}_i^{T}\mathbf{b}$
with
$\mathbf{x}_i^{T} = [x_{i1}\ x_{i2}\ \cdots\ x_{ip}\ 1]$
• The corresponding prediction uncertainty is:
$s_{\hat{y}_i}^{2} = s_e^{2}\,\mathbf{x}_i^{T}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{x}_i = s_e^{2}\,h_{ii}$
where $h_{ii}$ is the leverage of the i-th observation, which corresponds to the
(i,i) element of the Hat matrix.
• Accordingly, the (1-α) confidence interval for $\hat{y}_i$ is:
$\hat{y}_i \pm t_{1-\alpha/2,\,n-p-1}\,s_{\hat{y}_i}$
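A small sketch of these quantities (illustrative only; synthetic data, α = 0.05 assumed):

```python
# Illustrative sketch: hat matrix, leverages h_ii and confidence intervals
# for the fitted responses of an MLR model (synthetic data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 25, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.2 * rng.normal(size=n)

X1 = np.column_stack([X, np.ones(n)])            # x_i^T = [x_i1 ... x_ip 1]
H = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T         # hat matrix, y_hat = H y
y_hat = H @ y
h = np.diag(H)                                   # leverages h_ii

dof = n - p - 1                                  # residual degrees of freedom
s2_e = np.sum((y - y_hat) ** 2) / dof            # residual variance estimate
s_yhat = np.sqrt(s2_e * h)                       # s_yhat_i = sqrt(s2_e * h_ii)

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, dof)
ci_low, ci_high = y_hat - t_crit * s_yhat, y_hat + t_crit * s_yhat
print(np.column_stack([ci_low, ci_high])[:3])    # intervals for the first samples
```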
Model building and data pretreatment
• We wrote the model as: $\hat{y}_i = \mathbf{x}_i^{T}\mathbf{b}$
with $\mathbf{x}_i^{T} = [x_{i1}\ x_{i2}\ \cdots\ x_{ip}\ 1]$
and $\mathbf{b} = [b_1\ b_2\ \cdots\ b_p\ b_0]^{T}$
• The hyperplane fitting the data is bound to contain the point $(\bar{\mathbf{x}}, \bar{y})$, which is also
the point with the smallest prediction uncertainty.
• However, in the case of mean centering of both $\mathbf{X}$ and $\mathbf{y}$:
– $\mathbf{X}_{mc} = \mathbf{X} - \mathbf{1}\bar{\mathbf{x}}^{T}$
– $\mathbf{y}_{mc} = \mathbf{y} - \bar{y}$
the hyperplane fitting the data passes through the origin $(\mathbf{0}, 0)$, so the
model doesn't contain the term $b_0$:
• Accordingly, with centered data, the model is: $\hat{y}_{i,mc} = \mathbf{x}_{i,mc}^{T}\mathbf{b}_{mc}$
with $\mathbf{x}_{i,mc}^{T} = [x_{i1}\ x_{i2}\ \cdots\ x_{i,p-1}\ x_{ip}]$
and $\mathbf{b}_{mc} = [b_1\ b_2\ \cdots\ b_{p-1}\ b_p]^{T}$
Evaluating the regression
• The quality of a multivariate regression model can be evaluated using the same
figures of merit already discussed for the univariate case (bias, R2, residuals…)
[Figure: predicted Y vs measured Y, and residuals vs measured Y.]
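For illustration, a hedged sketch of how these figures of merit could be computed from measured and predicted responses (the values below are invented):

```python
# Figures of merit for a regression model (synthetic values, illustrative only).
import numpy as np

y_meas = np.array([4.1, 5.0, 5.9, 7.2, 8.1, 8.8, 9.5])     # assumed values
y_pred = np.array([4.3, 4.9, 6.1, 7.0, 8.3, 8.7, 9.4])

residuals = y_meas - y_pred
rmse = np.sqrt(np.mean(residuals ** 2))                     # root mean squared error
bias = np.mean(residuals)                                   # systematic error
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_meas - y_meas.mean()) ** 2)
print(f"RMSE={rmse:.3f}  bias={bias:.3f}  R2={r2:.3f}")
```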
Interpreting the model
• The MLR model is defined by: $\hat{\mathbf{y}} = \mathbf{X}\mathbf{b}$
• The relation between the 𝐗 and the 𝐲 is encoded in the regression coefficients:
– Their magnitude and sign reflect the contribution of the individual X-variables in determining the
value of the predicted response.
– In the absence of interferents, they correspond to the profile of the pure constituent of interest
[Figure: regression coefficients plotted against variable index and against wavelength (nm).]
Interpreting the model: Things to be aware of
• The absolute value of the regression coefficients is influenced by the scales of the
𝐗 and the 𝐲 variables:
– They can have a high magnitude just due to the relative scales of that particular X variable and the
y.
– This problem can be solved by variable scaling (as already discussed).
[Figure: regression coefficients vs variable index, showing how their magnitude depends on the relative scales of the variables.]
Interpreting the model: Things to be aware of - 2
• The presence of interferents makes interpretation of the regression coefficients
less straightforward:
– Signal of the pure constituent orthogonalized with respect to the contributions of all the
interferents
[Figure: analyte and interferent signals in the no-overlap and overlap cases (intensity vs wavelength, nm), and the corresponding regression-coefficient profiles.]
What if multiple ys?
• Each y is separately regressed on X
$y_{i1} = b_{01} + b_{11} x_{i1} + b_{21} x_{i2} + b_{31} x_{i3} + \dots + b_{p1} x_{ip} + e_{i1}$
$y_{i2} = b_{02} + b_{12} x_{i1} + b_{22} x_{i2} + b_{32} x_{i3} + \dots + b_{p2} x_{ip} + e_{i2}$
$\vdots$
$y_{iL} = b_{0L} + b_{1L} x_{i1} + b_{2L} x_{i2} + b_{3L} x_{i3} + \dots + b_{pL} x_{ip} + e_{iL}$
What if multiple ys? - 2
• Calculation of the model parameters by least squares gives the solution:
$\mathbf{B} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{Y}$
• This corresponds to each column of the regression coefficient matrix being given
by:
$\mathbf{b}_{l} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}_{l}$
• This is not necessarily the case with other regression models (e.g., Partial Least
Squares Regression).
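A quick numerical check of this property (synthetic data, illustrative only):

```python
# With ordinary least squares, B = (X'X)^-1 X'Y equals the column-by-column solutions.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 4))
Y = rng.normal(size=(20, 3))                      # three responses

B_joint = np.linalg.inv(X.T @ X) @ X.T @ Y        # all responses at once
B_cols = np.column_stack([np.linalg.inv(X.T @ X) @ X.T @ Y[:, l]
                          for l in range(Y.shape[1])])
print(np.allclose(B_joint, B_cols))               # True: same coefficients
```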
Problems with MLR
• MLR is conceptually simple and generalizes univariate LS regression.
BUT
• The core of MLR is the calculation of the model parameters by the LS
approach, according to:
$\mathbf{B} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{Y}$
– The inversion of $\mathbf{X}^{T}\mathbf{X}$ requires more samples than variables and no strong collinearity among the X-variables: conditions rarely met by spectral and other highly multivariate data.
↓
• Use of latent variables (BILINEAR MODELING):
– Few
– Orthogonal
Partial Least Squares Regression (PLS)
• Scores are extracted from the X block so to be at the same time:
– Explanatory of a large amount of the variance in X
– Predictive for y (explanatory of a large amount of the variance in y).
• These conditions can be mathematically summarized by the following
conditions/characteristics:
– The scores $\mathbf{T} = [\mathbf{t}_1\ \mathbf{t}_2\ \cdots\ \mathbf{t}_A] = \mathbf{X}\mathbf{R}$ are extracted so as to have maximum covariance with the response y: $\underset{\mathbf{r}_a}{\arg\max}\ \mathbf{t}_a^{T}\mathbf{y} \Rightarrow \underset{\mathbf{r}_a}{\arg\max}\ \mathbf{r}_a^{T}\mathbf{X}^{T}\mathbf{y}$
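As an illustrative sketch (scikit-learn's PLSRegression and synthetic data assumed; not part of the original material):

```python
# Minimal PLS sketch: the scores T = X R are built to have maximum covariance with y.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 50))                     # e.g. 40 samples x 50 "wavelengths"
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=40)

pls = PLSRegression(n_components=3)               # 3 latent variables (arbitrary choice)
pls.fit(X, y)

T = pls.x_scores_                                 # scores, one column per LV
R = pls.x_rotations_                              # weights R such that T = Xc R (Xc: centred/scaled X)
print(T.shape, R.shape)
print(np.corrcoef(T[:, 0], y)[0, 1])              # first score strongly related to y
```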
The PLS criterion graphically explained
[Figure: graphical comparison of the first PCA score direction (t1, maximum X-variance) and the first PLS score direction (tPLS,1, maximum covariance with y) with respect to the X-variables and the response y.]
PLS: Predicting the response
• The peculiarity of PLS is to extract scores which are directly explanatory also
of the y variance:
– They can be used to approximate (predict) the response: $\hat{\mathbf{y}} = \mathbf{T}\mathbf{q}^{T}$
• Scores are extracted from X as linear combinations of the variables, through
the weight matrix:
– $\mathbf{T} = \mathbf{X}\mathbf{R}$
• It is then possible to directly relate the predicted response to the predictor
matrix X as:
– $\hat{\mathbf{y}} = \mathbf{T}\mathbf{q}^{T} = \mathbf{X}\mathbf{R}\mathbf{q}^{T} = \mathbf{X}\mathbf{b}_{PLS}$
– $\mathbf{b}_{PLS}$ is a vector of regression coefficients: $\mathbf{b}_{PLS} = \mathbf{R}\mathbf{q}^{T}$
– The possibility of expressing the PLS model in the form $\hat{\mathbf{y}} = \mathbf{X}\mathbf{b}_{PLS}$ allows all the considerations already made for MLR and PCR to be extended to this method.
– Prediction of the response for a new sample is carried out according to:
$\hat{\mathbf{y}}_{new} = \mathbf{X}_{new}\mathbf{b}_{PLS}$
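A hedged sketch of this relation (synthetic data, scikit-learn assumed): recovering $\mathbf{b}_{PLS} = \mathbf{R}\mathbf{q}^{T}$ from a fitted model, with the data pre-centred and scale=False so that predictions reduce to $\mathbf{X}_{mc}\mathbf{b}_{PLS}$:

```python
# The PLS regression vector b_PLS = R q^T recovered from a fitted model.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 50))
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=40)
Xc, yc = X - X.mean(axis=0), y - y.mean()         # mean-centred data

pls = PLSRegression(n_components=3, scale=False).fit(Xc, yc)
R = pls.x_rotations_                              # weights (p x A)
q = pls.y_loadings_                               # y-loadings (1 x A)
b_pls = R @ q.T                                   # regression vector (p x 1)

y_hat_direct = Xc @ b_pls                         # y_hat = X b_PLS
y_hat_sklearn = pls.predict(Xc)
# should print True (up to numerical precision)
print(np.allclose(np.ravel(y_hat_direct), np.ravel(y_hat_sklearn)))
```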
PLS: Model building and predictions
• Building a PLS model requires choosing the number of PLS components
(Latent variables, LVs):
– Selection is generally carried out through a cross-validation procedure
– In the example, 8 components (corresponding to the minimum of RMSECV) are retained
in the final model
[Figure: RMSECV as a function of the number of latent variables; the minimum at 8 LVs determines the complexity retained.]
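An illustrative sketch of this selection procedure (synthetic data; the fold number and LV range are arbitrary choices):

```python
# Choosing the number of latent variables from the cross-validated error (RMSECV) curve.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict, KFold

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 80))
y = X[:, :10] @ rng.normal(size=10) + 0.2 * rng.normal(size=60)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
rmsecv = []
for a in range(1, 16):                                   # 1 to 15 LVs
    y_cv = cross_val_predict(PLSRegression(n_components=a), X, y, cv=cv)
    rmsecv.append(np.sqrt(np.mean((y - np.ravel(y_cv)) ** 2)))

best_a = int(np.argmin(rmsecv)) + 1                      # LVs at the minimum of RMSECV
print(best_a, np.round(rmsecv, 3))
```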
PLS: Model building and predictions - 2
• The model is calculated on the training set and validated on the test set:
– The «usual» figures of merit for regression can be used to evaluate the model quality
[Figure: predicted vs measured protein content (%w/w) for the training and test sets.]

            Calibration   Validation
RMSE        0.343         0.297
bias        0.00          -0.09
R2          0.959         0.966
PLS: A bit more on the model
• The PLS scores are built so as to capture the maximum covariance between X and y
and to be orthogonal.
• These conditions are accomplished through the introduction of the weights R.
• Once the scores are calculated, a further set of loadings P is needed:
– These loadings produce the best approximation of the variability in X given the set of
scores T: $\hat{\mathbf{X}} = \mathbf{T}\mathbf{P}^{T}$
– In PLS, these loadings are not orthogonal; they describe only the variance in X.
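A small numerical illustration of this point (synthetic data, scikit-learn assumed): the loadings P are the least-squares regression of X on the scores T.

```python
# X-loadings P as the regression of X on the scores T, so that X is approximated as T P^T.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(6)
X = rng.normal(size=(30, 20))
y = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=30)
Xc = X - X.mean(axis=0)

pls = PLSRegression(n_components=3, scale=False).fit(Xc, y - y.mean())
T = pls.x_scores_
P = Xc.T @ T @ np.linalg.inv(T.T @ T)              # regression of X on the scores
print(np.allclose(P, pls.x_loadings_))             # matches the model's loadings (numerically)
X_hat = T @ P.T                                    # approximation of X given the scores
print(np.round(np.linalg.norm(Xc - X_hat) / np.linalg.norm(Xc), 3))   # unexplained fraction
```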
[Figure: PLS weights and PLS X-loadings of the first three LVs plotted against wavelength (nm).]
PLS for a single response: What else do we get?
[Figure: additional outputs of a single-response PLS model: scores plot (LV1 vs LV2), regression-coefficient profile vs wavelength (nm), Q residuals vs Hotelling's T², and the response y vs the scores on LV1.]
Partial Least Squares for multiple responses: The idea
X block: $\mathbf{X} = \mathbf{T}\mathbf{P}^{T} + \mathbf{E}$        Y block: $\mathbf{Y} = \mathbf{U}\mathbf{Q}^{T} + \mathbf{F}$
Inner relations: $\mathbf{u}_1 = c_1\mathbf{t}_1$, $\mathbf{u}_2 = c_2\mathbf{t}_2$, …, $\mathbf{u}_A = c_A\mathbf{t}_A$
PLS for multiple Ys: A few concepts
• The responses are assumed to possess an underlying latent structure.
• Scores are built so as to have maximum covariance: $\underset{\mathbf{r}_a,\mathbf{q}_a}{\arg\max}\ \mathbf{t}_a^{T}\mathbf{u}_a \Rightarrow \underset{\mathbf{r}_a,\mathbf{q}_a}{\arg\max}\ \mathbf{r}_a^{T}\mathbf{X}^{T}\mathbf{Y}\mathbf{q}_a$
[Figure: inner relation: Y-scores u1 plotted against X-scores t1.]
PLS for multiple Ys: Summarizing the model
• The responses are approximated by the bilinear model of the block as:
$\hat{\mathbf{Y}} = \mathbf{U}\mathbf{Q}^{T}$
• The values of the Y-scores are predicted from the X-scores (inner relation):
$\hat{\mathbf{U}} = \mathbf{T}\mathbf{C}$
• X-scores are calculated from the X-variables through the X-weights:
$\mathbf{T} = \mathbf{X}\mathbf{R}$
• Also in the case of PLS regression for multiple Y, prediction of the responses can be
expressed directly in terms of the original variables, by combining the information
above into a single equation:
$\hat{\mathbf{Y}} = \hat{\mathbf{U}}\mathbf{Q}^{T} = \mathbf{T}\mathbf{C}\mathbf{Q}^{T} = \mathbf{X}\mathbf{R}\mathbf{C}\mathbf{Q}^{T} = \mathbf{X}\mathbf{B}_{PLS}$
• where the matrix of regression coefficients is equal to:
$\mathbf{B}_{PLS} = \mathbf{R}\mathbf{C}\mathbf{Q}^{T}$
PLS for multiple Ys: Considerations
• Contrary to the case of MLR and PCR, in PLS there is a substantial difference between
building a single model to calibrate all the responses and building L individual models,
one for each response.
• A single PLS-2 (multiple-y) model is recommended when the Y variables are
correlated and share an underlying structure.
• PCA of the Y matrix can help in choosing the most appropriate approach
[Figure: PCA loadings of the Y matrix (PC1 vs PC2) showing Moisture, Starch, Protein, and Oil.]
Partial Least Squares
Discriminant Analysis (PLS-DA)
Classification: Intro
• Sometimes, data are collected in order to predict a qualitative property, i.e., a response that can take only discrete values
[Figure: examples of qualitative responses: IR spectra (absorbance vs wavenumber, cm-1) of healthy vs diseased samples, and chromatograms (intensity vs retention time, min) of samples with medium vs high activity.]
Classification
• Each value that the discrete response can take is called a class or category.
• A category or class is an (ideal) group of objects sharing similar characteristics.
[Figure: classification relates the measured data X (e.g., spectra, log(1/R) vs wavelength) to a discrete response Y coding class membership.]
CLASSIFICATION:
• “To find a criterion to assign an object (sample) to one category (class)
based on a set of measurements performed on the object itself”
• In classification, categories are defined a priori (≠ cluster analysis)
Classification approaches
[Figure: two different classification approaches illustrated on the same two-class data in the (X1, X2) plane.]
Partial Least Squares-Discriminant Analysis (PLS-DA)
• Useful when the number of variables is higher than the number of
available samples and when the predictors are correlated
• Based on the PLS algorithm:
– Classification problem should be re-formulated as regression
• The class is encoded in a dummy y vector
Partial Least Squares-Discriminant Analysis (PLS-DA) – 2
[Scheme: PLS-DA as regression of the dummy y on X, ypred = X b; figure: predicted Y values for the calibration samples.]
Partial Least Squares-Discriminant Analysis (PLS-DA) – 3
• Classification is accomplished by setting a proper threshold to the
predicted Y values.
• The “natural” threshold is 0.5:
– Ypred>0.5 →class 1
– Ypred<0.5 →class 2
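A minimal PLS-DA sketch under these conventions (synthetic two-class data, scikit-learn assumed; class 1 coded as 1 and class 2 as 0, threshold at 0.5):

```python
# PLS-DA: dummy-coded y, PLS fit, and thresholding of the predicted values.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(7)
X1 = rng.normal(loc=0.0, size=(25, 30))          # class 1 samples
X2 = rng.normal(loc=0.7, size=(25, 30))          # class 2 samples (shifted)
X = np.vstack([X1, X2])
y_dummy = np.repeat([1.0, 0.0], 25)              # dummy coding: class 1 -> 1, class 2 -> 0

pls = PLSRegression(n_components=2).fit(X, y_dummy)
y_pred = np.ravel(pls.predict(X))                # real-valued predictions

assigned = np.where(y_pred > 0.5, 1, 2)          # threshold at 0.5
true_cls = np.repeat([1, 2], 25)
print(np.mean(assigned == true_cls))             # fraction correctly classified
```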
[Figure: predicted Y values for the two classes with the 0.5 threshold.]
Partial Least Squares-Discriminant Analysis (PLS-DA) – 4
• Different methods have been proposed in the literature to find alternative
“optimal” threshold values (see later).
• In the example, setting the value to 0.33 allows the correct classification of
all samples:
[Figure: predicted Y values with the threshold set at 0.33, correctly classifying all samples.]
Model complexity
• PLS-DA is a bilinear model:
– Optimal number of components (LVs) should be selected
• Error criterion in cross-validation is (usually) based on the number
(percentage) of mis-classifications:
$\%CE_{tot} = 100 - \%NER_{tot} = 100 \times \dfrac{\sum_{g=1}^{G} n_{g,\,misclassified}}{n_{tot}}$
[Figure: % classification error in cross-validation vs the number of LVs.]
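An illustrative sketch of this criterion (synthetic data; the dummy coding and the 0.5 threshold follow the previous slides):

```python
# Percentage of cross-validated misclassifications vs the number of latent variables.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0.0, size=(30, 40)), rng.normal(0.6, size=(30, 40))])
y_dummy = np.repeat([1.0, 0.0], 30)
cls = np.repeat([1, 2], 30)

cv = KFold(n_splits=6, shuffle=True, random_state=0)
for a in range(1, 9):
    y_cv = np.ravel(cross_val_predict(PLSRegression(n_components=a), X, y_dummy, cv=cv))
    assigned = np.where(y_cv > 0.5, 1, 2)
    err = 100.0 * np.mean(assigned != cls)       # %CE in cross-validation
    print(a, round(err, 1))
```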
With more than two classes
• Instead of a binary vector, a dummy binary matrix is used to code for
class belonging
• Y spans a G-1 dimensional space (G being the number of classes)
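A small sketch of the dummy coding for a three-class problem (labels are invented):

```python
# Dummy binary Y matrix for a three-class problem: one column per class,
# each row has a 1 in the column of its class.
import numpy as np

classes = np.array([1, 1, 2, 3, 2, 3, 1])         # assumed class labels
G = 3
Y_dummy = np.zeros((classes.size, G))
Y_dummy[np.arange(classes.size), classes - 1] = 1.0
print(Y_dummy)
```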
PLS-DA for more than two classes
• Model is built using PLS-2 algorithm
• A matrix of regression coefficients is obtained
[Scheme: Ypred = X B, with B the matrix of regression coefficients.]
PLS-DA for more than two classes - 2
• Predicted y is real-valued:
“true” y        predicted y
1 0 0           1.03  0.09 -0.10
1 0 0           0.68  0.21  0.08
1 0 0           0.99 -0.10  0.01
1 0 0           0.96  0.18 -0.14
0 1 0           0.14  0.94  0.07
0 1 0          -0.01  1.12  0.12
0 1 0           0.08  0.89 -0.02
PLS-DA for more than two classes - 3
• An alternative approach to perform classification based on
discriminant PLS results is to apply LDA:
- On the predicted Y (after removing one of the columns)
- On the X scores T
$\hat{\mathbf{Y}} = \mathbf{T}\mathbf{Q}^{T} = \mathbf{X}\mathbf{B}$
[Scheme: X → PLS → scores T, or X → predicted Y (one column removed) → LDA.]
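An illustrative sketch of the second option, LDA applied to the X-scores T (synthetic three-class data, scikit-learn assumed):

```python
# PLS-DA followed by LDA on the X-scores of the PLS-2 model.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(m, size=(20, 30)) for m in (0.0, 0.5, 1.0)])
cls = np.repeat([1, 2, 3], 20)
Y_dummy = np.zeros((60, 3))
Y_dummy[np.arange(60), cls - 1] = 1.0

pls = PLSRegression(n_components=4).fit(X, Y_dummy)   # PLS-2 on the dummy matrix
T = pls.x_scores_                                      # X-scores

lda = LinearDiscriminantAnalysis().fit(T, cls)         # LDA in the score space
print(np.mean(lda.predict(T) == cls))                  # fraction correct (in fitting)
```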
Digression:
Validation
The concept of validation
• Verify if valid conclusions can be formulated from a model:
– Able to generalize parsimoniously (with the smallest number of LVs)
– Able to predict accurately
• Define a proper diagnostics for characterizing the quality of the
solution:
– Calculation of some error criterion based on residuals
• Residuals can be used for:
– Assessing which model to use;
– Defining the model complexity in component-based methods;
– Evaluating the predictive ability of a regression (or classification) model;
– Checking whether overfitting is present (by comparing the results in
validation and in fitting);
– Residual analysis (model diagnostics).
The need for “new” data
• The use of fitted residuals would lead to overoptimism:
– Magnitude and structure not similar to the ones that would be obtained if the
model were used on new data.
[Scheme: the data (X, Y) are split into a calibration set (Xcal, Ycal) and a validation set (Xval, Yval).]
Test set validation
• Carried out by applying the model to new data (test set):
– Simulates the practical use of the model on future data.
– The test set should be as independent as possible from the calibration set
(collecting new samples and analysing them on different days…)
– A representative portion of the total data set can be left aside as test set.
$\hat{\mathbf{Y}}_{cal} = \mathbf{X}_{cal}\mathbf{B}$        $\hat{\mathbf{Y}}_{val} = \mathbf{X}_{val}\mathbf{B}$

[Scheme: the model built on (Xcal, Ycal) is applied to Xval, and the predicted Yval is compared with the measured Yval.]

MODEL QUALITY: $RMSEP = \sqrt{\dfrac{\sum_{i=1}^{N_{val}} \left(y_i^{val} - \hat{y}_i^{val}\right)^{2}}{N_{val}}}$
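An illustrative sketch of test-set validation and RMSEP (synthetic data; the 60/20 split is arbitrary):

```python
# Calibration on a training set, prediction of an independent test set, RMSEP as figure of merit.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(10)
X = rng.normal(size=(80, 50))
y = X[:, :6] @ rng.normal(size=6) + 0.2 * rng.normal(size=80)
idx = rng.permutation(80)
cal, val = idx[:60], idx[60:]                      # 60 calibration, 20 validation samples

pls = PLSRegression(n_components=5).fit(X[cal], y[cal])
y_val_pred = np.ravel(pls.predict(X[val]))
rmsep = np.sqrt(np.mean((y[val] - y_val_pred) ** 2))
print(round(rmsep, 3))
```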
Cross-validation
• Internal resampling method:
– Simulates test set validation by repeating a data-splitting procedure where
different objects are in turn placed in the validation set.
– Particularly useful when a limited number of samples are available.
• Schematically, it consists of the following steps:
1. Split the data, leaving one group of objects out as an internal validation segment (Xval1, Yval1) and keeping the rest as the calibration set (Xcal1, Ycal1);
2. Build the model without these data: $\hat{\mathbf{Y}}_{cal1} = \mathbf{X}_{cal1}\mathbf{B}_{1}$
3. Apply the model to the left-out values and obtain predictions: $\hat{\mathbf{Y}}_{val1} = \mathbf{X}_{val1}\mathbf{B}_{1}$
4. Calculate the corresponding residual error: $PRESS_{1} = \sum_{i=1}^{N_{val1}} \left(y_i^{val1} - \hat{y}_i^{val1}\right)^{2}$
5. Repeat steps 1-4 until each data value has been left out once; the overall error is then:
$RMSECV = \sqrt{\dfrac{\sum_{k} PRESS_{k}}{N}} = \sqrt{\dfrac{\sum_{i=1}^{N} \left(y_i - \hat{y}_{-i}\right)^{2}}{N}}$
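A hedged sketch of the whole procedure (synthetic data; 5 cancellation groups assumed), accumulating the segment-wise PRESS and computing RMSECV:

```python
# Cross-validation written out explicitly: PRESS per left-out segment, then RMSECV.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(11)
X = rng.normal(size=(50, 30))
y = X[:, :4] @ rng.normal(size=4) + 0.2 * rng.normal(size=50)

press_total, n_total = 0.0, 0
for cal_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = PLSRegression(n_components=3).fit(X[cal_idx], y[cal_idx])   # step 2
    y_val_hat = np.ravel(model.predict(X[val_idx]))                     # step 3
    press_total += np.sum((y[val_idx] - y_val_hat) ** 2)                # step 4: PRESS_k
    n_total += val_idx.size
rmsecv = np.sqrt(press_total / n_total)                                 # step 5
print(round(rmsecv, 3))
```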
Cross-validation
• Number of objects is limited
• Understand the inherent structure of the system ↔ Estimating
model complexity
• Objects in a data table can be stratified into groups based on
background information:
– Across instrumental replicates (repeatability)
– Reproducibility (analyst, instrument, reagent...)
– Sampling site and time
– Across treatment/origin (year, raw material, batch…)