PLS and PLS-DA - Torino 2021 - Federico Marini
MULTIVARIATE REGRESSION
Federico Marini
Dept. Chemistry, University of Rome “La Sapienza”, Rome, Italy
Direct vs inverse calibration
• The distinction refers to the way the relationship between signal(s) and
concentration of chemical constituents is established.
– DIRECT: signal is considered to be directly proportional to the concentration
– INVERSE: concentration is considered to be directly proportional to the signal.
[Figure: univariate calibration: direct calibration plots absorbance vs concentration; inverse calibration plots concentration vs absorbance.]
Direct vs inverse calibration - 2
• UNIVARIATE CASE: no substantial difference exists between calibrating a regression line in a direct or in an inverse way.
– Inverse models are more efficient with small data sets and high noise levels
• MULTIVARIATE CASE: the difference is relevant.
– Direct models show some advantages, but present unsolvable problems regarding the variety of analytical platforms/problems they can be applied to
– Inverse models are almost always used.
[Figure: DIRECT model (absorbance spectra = concentrations × constituent profiles) vs INVERSE model (concentrations = absorbance spectra × regression coefficients), with the corresponding spectra, concentration profiles, and regression coefficients.]
Direct Calibration
• Requirements:
– Prepare mixtures of standards of all pure chemical constituents to be used as calibration samples.
– Number of mixtures ≥ number of constituents (the more mixtures, the more precise the results)
– Calibration mixtures should be as representative as possible of the combination of concentrations
in future, unknown samples
• Applicability:
– The concentrations of ALL the constituents in all calibration samples should be known.
– The constituents should be the same as in future test samples.
– Relates signals to constituents’ concentrations, but not to global properties (octane number,
sensorial attributes, iodine value, etc.)
• Limitations:
– Sensitive to spectral correlations (constituents with severely overlapped spectra cannot be reliably
determined)
– The presence of non-modeled interferents may lead to serious errors in the determination.
Inverse Calibration
• Applicability:
– Multicomponent samples where only one or a few analytes are of interest
– Concentrations, spectra, and chemical identities of the other constituents may be unknown.
– First proposed in the 1960s (connected to the development of NIR for the non-invasive analysis of
intact material*)
• Advantages:
– Allows complex mixtures to be studied while knowing only the concentrations of a limited number of constituents.
– Quantitation of an analyte in the presence of interferences (if they are properly represented in the
calibration samples, even if their concentrations or chemical identities are not known).
[Figure: inverse calibration example: concentrations = absorbance spectra × regression coefficients, with the corresponding regression-coefficient profile vs wavelength (nm).]
Why go multivariate?
• Instead of just using one of the variables, it makes sense to use all the measured
information
• There are many significant advantages in doing so:
– Noise reduction: More (redundant) measurements of the same phenomenon
– Handling of interferents: non-selective signals can be made selective by mathematics (provided that the signal profiles of the interferents are not completely identical to that of the analyte).
– Exploratory aspects: models provide a number of informative parameters + residuals
– Outlier detection: the detection of outliers is enhanced with multivariate data.
A first multivariate approach: MLR
Statement of the problem:
• Linear relationship between a response $y_i$ and all the variables $x_{ij}$ measured on a sample:
$y_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + b_3 x_{i3} + \dots + b_p x_{ip} + e_i$
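As a minimal illustration (not from the slides; synthetic data, all names and values are assumed), this model and its least-squares fit in Python/numpy:

```python
# Minimal MLR sketch (illustrative, synthetic data): fit
# y_i = b0 + b1*x_i1 + ... + bp*x_ip + e_i by least squares.
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 4                                      # 30 samples, 4 X-variables (assumed)
X = rng.normal(size=(n, p))
b_true = np.array([2.0, -1.0, 0.5, 3.0])
y = 1.5 + X @ b_true + 0.1 * rng.normal(size=n)   # intercept b0 = 1.5

X1 = np.column_stack([np.ones(n), X])             # add a column of ones for b0
b, *_ = np.linalg.lstsq(X1, y, rcond=None)        # least-squares solution
y_hat = X1 @ b                                    # fitted responses
print(b)                                          # first element estimates b0, the rest b1..bp
```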
A first multivariate approach: MLR
• The response estimates $\hat{y}_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_p x_{ip}$ lie on a p-dimensional hyperplane
[Figure: regression plane in the (X1, X2, Y) space; the residual of each sample is $e_i = y_i - \hat{y}_i$.]
Least squares as projection
• Column space of a matrix: space spanned by its columns (samples are the axes).
• Regression is a projection of y onto the column space of X:
$\hat{\mathbf{y}} = \mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y} = \mathbf{H}\mathbf{y}$
[Figure: geometric view of least squares: y is projected onto the column space of X (spanned by var1 and var2), giving $\hat{\mathbf{y}}$.]
Model predictions and confidence intervals
• The predicted response for any of the observations used for model
building is :
$\hat{y}_i = \mathbf{x}_i^{T}\mathbf{b}$
with
$\mathbf{x}_i^{T} = [x_{i1}\ x_{i2}\ \cdots\ x_{ip}\ 1]$
• The corresponding prediction uncertainty is:
$s_{\hat{y}_i}^{2} = s_e^{2}\,\mathbf{x}_i^{T}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{x}_i = s_e^{2}\,h_{ii}$
where $h_{ii}$ is the leverage of the i-th observation, which corresponds to the
(i,i) element of the Hat matrix.
• Accordingly, the (1-α) confidence interval for $\hat{y}_i$ is:
$\hat{y}_i \pm t_{1-\alpha/2,\,n-p-1}\,s_{\hat{y}_i}$
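A small sketch of these quantities (illustrative only; synthetic data, α = 0.05 assumed):

```python
# Illustrative sketch: hat matrix, leverages h_ii and confidence intervals
# for the fitted responses of an MLR model (synthetic data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 25, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.2 * rng.normal(size=n)

X1 = np.column_stack([X, np.ones(n)])            # x_i^T = [x_i1 ... x_ip 1]
H = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T         # hat matrix, y_hat = H y
y_hat = H @ y
h = np.diag(H)                                   # leverages h_ii

dof = n - p - 1                                  # residual degrees of freedom
s2_e = np.sum((y - y_hat) ** 2) / dof            # residual variance estimate
s_yhat = np.sqrt(s2_e * h)                       # s_yhat_i = sqrt(s2_e * h_ii)

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, dof)
ci_low, ci_high = y_hat - t_crit * s_yhat, y_hat + t_crit * s_yhat
print(np.column_stack([ci_low, ci_high])[:3])    # intervals for the first samples
```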
Model building and data pretreatment
• We wrote the model as: $\hat{y}_i = \mathbf{x}_i^{T}\mathbf{b}$
with $\mathbf{x}_i^{T} = [x_{i1}\ x_{i2}\ \cdots\ x_{ip}\ 1]$
and $\mathbf{b} = [b_1\ b_2\ \cdots\ b_p\ b_0]^{T}$
• The hyperplane fitting the data is bound to contain the point $(\bar{\mathbf{x}}, \bar{y})$, which is also
the point with the smallest prediction uncertainty.
• However, in the case of mean centering of both $\mathbf{X}$ and $\mathbf{y}$:
– $\mathbf{X}_{mc} = \mathbf{X} - \mathbf{1}\bar{\mathbf{x}}^{T}$
– $\mathbf{y}_{mc} = \mathbf{y} - \bar{y}$
the hyperplane fitting the data passes through the origin $(\mathbf{0}, 0)$, so the
model doesn't contain the term $b_0$:
• Accordingly, with centered data, the model is: $\hat{y}_{i,mc} = \mathbf{x}_{i,mc}^{T}\mathbf{b}_{mc}$
with $\mathbf{x}_{i,mc}^{T} = [x_{i1}\ x_{i2}\ \cdots\ x_{i,p-1}\ x_{ip}]$
and $\mathbf{b}_{mc} = [b_1\ b_2\ \cdots\ b_{p-1}\ b_p]^{T}$
Evaluating the regression
• The quality of a multivariate regression model can be evaluated using the same
figures of merit already discussed for the univariate case (bias, R2, residuals…)
[Figure: predicted Y vs measured Y, and residuals vs measured Y.]
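For illustration, a hedged sketch of how these figures of merit could be computed from measured and predicted responses (the values below are invented):

```python
# Figures of merit for a regression model (synthetic values, illustrative only).
import numpy as np

y_meas = np.array([4.1, 5.0, 5.9, 7.2, 8.1, 8.8, 9.5])     # assumed values
y_pred = np.array([4.3, 4.9, 6.1, 7.0, 8.3, 8.7, 9.4])

residuals = y_meas - y_pred
rmse = np.sqrt(np.mean(residuals ** 2))                     # root mean squared error
bias = np.mean(residuals)                                   # systematic error
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_meas - y_meas.mean()) ** 2)
print(f"RMSE={rmse:.3f}  bias={bias:.3f}  R2={r2:.3f}")
```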
Interpreting the model
• The MLR model is defined by: $\hat{\mathbf{y}} = \mathbf{X}\mathbf{b}$
• The relation between the 𝐗 and the 𝐲 is encoded in the regression coefficients:
– Their magnitude and sign reflect the contribution of the individual X-variables in determining the
value of the predicted response.
– In the absence of interferents, they correspond to the profile of the pure constituent of interest
[Figure: regression coefficients plotted against variable index and against wavelength (nm).]
Interpreting the model: Things to be aware of
• The absolute value of the regression coefficients is influenced by the scales of the
𝐗 and the 𝐲 variables:
– They can have a high magnitude just due to the relative scales of that particular X variable and the
y.
– This problem can be solved by variable scaling (as already discussed).
[Figure: regression coefficients vs variable index, showing how their magnitude depends on the relative scales of the variables.]
Interpreting the model: Things to be aware of - 2
• The presence of interferents makes interpretation of the regression coefficients
less straightforward:
– Signal of the pure constituent orthogonalized with respect to the contributions of all the
interferents
[Figure: analyte and interferent signals in the no-overlap and overlap cases (intensity vs wavelength, nm), and the corresponding regression-coefficient profiles.]
What if multiple ys?
• Each y is separately regressed on X
$y_{i1} = b_{01} + b_{11} x_{i1} + b_{21} x_{i2} + b_{31} x_{i3} + \dots + b_{p1} x_{ip} + e_{i1}$
$y_{i2} = b_{02} + b_{12} x_{i1} + b_{22} x_{i2} + b_{32} x_{i3} + \dots + b_{p2} x_{ip} + e_{i2}$
$\vdots$
$y_{iL} = b_{0L} + b_{1L} x_{i1} + b_{2L} x_{i2} + b_{3L} x_{i3} + \dots + b_{pL} x_{ip} + e_{iL}$
What if multiple ys? - 2
• Calculation of the model parameters by least squares gives the solution:
$\mathbf{B} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{Y}$
• This corresponds to each column of the regression coefficient matrix being given
by:
$\mathbf{b}_{l} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}_{l}$
• This is not necessarily the case with other regression models (e.g., Partial Least
Squares Regression).
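A quick numerical check of this property (synthetic data, illustrative only):

```python
# With ordinary least squares, B = (X'X)^-1 X'Y equals the column-by-column solutions.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 4))
Y = rng.normal(size=(20, 3))                      # three responses

B_joint = np.linalg.inv(X.T @ X) @ X.T @ Y        # all responses at once
B_cols = np.column_stack([np.linalg.inv(X.T @ X) @ X.T @ Y[:, l]
                          for l in range(Y.shape[1])])
print(np.allclose(B_joint, B_cols))               # True: same coefficients
```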
Problems with MLR
• MLR is conceptually simple and generalizes univariate LS regression.
BUT
• The core of MLR is the calculation of the model parameters by the LS
approach, according to:
$\mathbf{B} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{Y}$
– The inversion of $\mathbf{X}^{T}\mathbf{X}$ requires more samples than variables and no strong collinearity among the X-variables: conditions rarely met by spectral and other highly multivariate data.
↓
• Use of latent variables (BILINEAR MODELING):
– Few
– Orthogonal
Partial Least Squares Regression (PLS)
• Scores are extracted from the X block so to be at the same time:
– Explanatory of a large amount of the variance in X
– Predictive for y (explanatory of a large amount of the variance in y).
• These conditions can be mathematically summarized by the following
conditions/characteristics:
– The scores $\mathbf{T} = [\mathbf{t}_1\ \mathbf{t}_2\ \cdots\ \mathbf{t}_A] = \mathbf{X}\mathbf{R}$ are extracted so as to have maximum covariance with the response y: $\underset{\mathbf{r}_a}{\arg\max}\ \mathbf{t}_a^{T}\mathbf{y} \Rightarrow \underset{\mathbf{r}_a}{\arg\max}\ \mathbf{r}_a^{T}\mathbf{X}^{T}\mathbf{y}$
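As an illustrative sketch (scikit-learn's PLSRegression and synthetic data assumed; not part of the original material):

```python
# Minimal PLS sketch: the scores T = X R are built to have maximum covariance with y.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 50))                     # e.g. 40 samples x 50 "wavelengths"
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=40)

pls = PLSRegression(n_components=3)               # 3 latent variables (arbitrary choice)
pls.fit(X, y)

T = pls.x_scores_                                 # scores, one column per LV
R = pls.x_rotations_                              # weights R such that T = Xc R (Xc: centred/scaled X)
print(T.shape, R.shape)
print(np.corrcoef(T[:, 0], y)[0, 1])              # first score strongly related to y
```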
The PLS criterion graphically explained
[Figure: graphical comparison of the first PCA score direction (t1, maximum X-variance) and the first PLS score direction (tPLS,1, maximum covariance with y) with respect to the X-variables and the response y.]
PLS: Predicting the response
• The peculiarity of PLS is to extract scores which are directly explanatory also
of the y variance:
– They can be used to approximate (predict) the response: $\hat{\mathbf{y}} = \mathbf{T}\mathbf{q}^{T}$
• Scores are extracted from X as linear combinations of the variables, through
the weight matrix:
– $\mathbf{T} = \mathbf{X}\mathbf{R}$
• It is then possible to directly relate the predicted response to the predictor
matrix X as:
– $\hat{\mathbf{y}} = \mathbf{T}\mathbf{q}^{T} = \mathbf{X}\mathbf{R}\mathbf{q}^{T} = \mathbf{X}\mathbf{b}_{PLS}$
– $\mathbf{b}_{PLS}$ is a vector of regression coefficients: $\mathbf{b}_{PLS} = \mathbf{R}\mathbf{q}^{T}$
– The possibility of expressing the PLS model in the form $\hat{\mathbf{y}} = \mathbf{X}\mathbf{b}_{PLS}$ allows all the considerations already made for MLR and PCR to be extended to this method.
– Prediction of the response for a new sample is carried out according to:
$\hat{\mathbf{y}}_{new} = \mathbf{X}_{new}\mathbf{b}_{PLS}$
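A hedged sketch of this relation (synthetic data, scikit-learn assumed): recovering $\mathbf{b}_{PLS} = \mathbf{R}\mathbf{q}^{T}$ from a fitted model, with the data pre-centred and scale=False so that predictions reduce to $\mathbf{X}_{mc}\mathbf{b}_{PLS}$:

```python
# The PLS regression vector b_PLS = R q^T recovered from a fitted model.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 50))
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=40)
Xc, yc = X - X.mean(axis=0), y - y.mean()         # mean-centred data

pls = PLSRegression(n_components=3, scale=False).fit(Xc, yc)
R = pls.x_rotations_                              # weights (p x A)
q = pls.y_loadings_                               # y-loadings (1 x A)
b_pls = R @ q.T                                   # regression vector (p x 1)

y_hat_direct = Xc @ b_pls                         # y_hat = X b_PLS
y_hat_sklearn = pls.predict(Xc)
# should print True (up to numerical precision)
print(np.allclose(np.ravel(y_hat_direct), np.ravel(y_hat_sklearn)))
```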
PLS: Model building and predictions
• Building a PLS model requires choosing the number of PLS components
(Latent variables, LVs):
– Selection is generally carried out through a cross-validation procedure
– In the example, 8 components (corresponding to the minimum of RMSECV) are retained
in the final model
[Figure: RMSECV as a function of the number of latent variables; the minimum at 8 LVs determines the complexity retained.]
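An illustrative sketch of this selection procedure (synthetic data; the fold number and LV range are arbitrary choices):

```python
# Choosing the number of latent variables from the cross-validated error (RMSECV) curve.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict, KFold

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 80))
y = X[:, :10] @ rng.normal(size=10) + 0.2 * rng.normal(size=60)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
rmsecv = []
for a in range(1, 16):                                   # 1 to 15 LVs
    y_cv = cross_val_predict(PLSRegression(n_components=a), X, y, cv=cv)
    rmsecv.append(np.sqrt(np.mean((y - np.ravel(y_cv)) ** 2)))

best_a = int(np.argmin(rmsecv)) + 1                      # LVs at the minimum of RMSECV
print(best_a, np.round(rmsecv, 3))
```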
PLS: Model building and predictions - 2
• The model is calculated on the training set and validated on the test set:
– The «usual» figures of merit for regression can be used to evaluate the model quality
[Figure: predicted vs measured protein content (%w/w) for the training and test sets.]

            Calibration   Validation
RMSE        0.343         0.297
bias        0.00          -0.09
R2          0.959         0.966
PLS: A bit more on the model
• The PLS scores are built so as to capture the maximum covariance between X and y
and to be orthogonal.
• These conditions are accomplished through the introduction of the weights R.
• Once the scores are calculated, a further set of loadings P is needed:
– These loadings produce the best approximation of the variability in X given the set of
scores T: $\hat{\mathbf{X}} = \mathbf{T}\mathbf{P}^{T}$
– In PLS, these loadings are not orthogonal; they describe only the variance in X.
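A small numerical illustration of this point (synthetic data, scikit-learn assumed): the loadings P are the least-squares regression of X on the scores T.

```python
# X-loadings P as the regression of X on the scores T, so that X is approximated as T P^T.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(6)
X = rng.normal(size=(30, 20))
y = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=30)
Xc = X - X.mean(axis=0)

pls = PLSRegression(n_components=3, scale=False).fit(Xc, y - y.mean())
T = pls.x_scores_
P = Xc.T @ T @ np.linalg.inv(T.T @ T)              # regression of X on the scores
print(np.allclose(P, pls.x_loadings_))             # matches the model's loadings (numerically)
X_hat = T @ P.T                                    # approximation of X given the scores
print(np.round(np.linalg.norm(Xc - X_hat) / np.linalg.norm(Xc), 3))   # unexplained fraction
```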
[Figure: PLS weights and PLS X-loadings of the first three LVs plotted against wavelength (nm).]
PLS for a single response: What else do we get?
[Figure: additional outputs of a single-response PLS model: scores plot (LV1 vs LV2), regression-coefficient profile vs wavelength (nm), Q residuals vs Hotelling's T², and the response y vs the scores on LV1.]
Partial Least Squares for multiple responses: The idea
X block: $\mathbf{X} = \mathbf{T}\mathbf{P}^{T} + \mathbf{E}$        Y block: $\mathbf{Y} = \mathbf{U}\mathbf{Q}^{T} + \mathbf{F}$
Inner relations: $\mathbf{u}_1 = c_1\mathbf{t}_1$, $\mathbf{u}_2 = c_2\mathbf{t}_2$, …, $\mathbf{u}_A = c_A\mathbf{t}_A$
PLS for multiple Ys: A few concepts
• The responses are assumed to possess an underlying latent structure.
• Scores are built so as to have maximum covariance: $\underset{\mathbf{r}_a,\mathbf{q}_a}{\arg\max}\ \mathbf{t}_a^{T}\mathbf{u}_a \Rightarrow \underset{\mathbf{r}_a,\mathbf{q}_a}{\arg\max}\ \mathbf{r}_a^{T}\mathbf{X}^{T}\mathbf{Y}\mathbf{q}_a$
[Figure: inner relation: Y-scores u1 plotted against X-scores t1.]
PLS for multiple Ys: Summarizing the model
• The responses are approximated by the bilinear model of the block as:
$\hat{\mathbf{Y}} = \mathbf{U}\mathbf{Q}^{T}$
• The values of the Y-scores are predicted from the X-scores (inner relation):
$\hat{\mathbf{U}} = \mathbf{T}\mathbf{C}$
• X-scores are calculated from the X-variables through the X-weights:
$\mathbf{T} = \mathbf{X}\mathbf{R}$
• Also in the case of PLS regression for multiple Y, prediction of the responses can be
expressed directly in terms of the original variables, by combining the information
above into a single equation:
$\hat{\mathbf{Y}} = \hat{\mathbf{U}}\mathbf{Q}^{T} = \mathbf{T}\mathbf{C}\mathbf{Q}^{T} = \mathbf{X}\mathbf{R}\mathbf{C}\mathbf{Q}^{T} = \mathbf{X}\mathbf{B}_{PLS}$
• where the matrix of regression coefficients is equal to:
$\mathbf{B}_{PLS} = \mathbf{R}\mathbf{C}\mathbf{Q}^{T}$
PLS for multiple Ys: Considerations
• Contrary to the case of MLR and PCR, in PLS there is a substantial difference between
building a single model to calibrate all the responses and building L individual models,
one for each response.
• A single PLS-2 (multiple-y) model is recommended when the Y variables are
correlated and share an underlying structure.
• PCA of the Y matrix can help in choosing the most appropriate approach
[Figure: PCA loadings of the Y matrix (PC1 vs PC2) showing Moisture, Starch, Protein, and Oil.]
Partial Least Squares
Discriminant Analysis (PLS-DA)
Classification: Intro
• Sometimes, data are collected in order to predict a qualitative property, i.e., a response that can take only discrete values
[Figure: examples of qualitative responses: IR spectra (absorbance vs wavenumber, cm-1) of healthy vs diseased samples, and chromatograms (intensity vs retention time, min) of samples with medium vs high activity.]
Classification
• Each value that the discrete response can take is called a class or category.
• A category or class is an (ideal) group of objects sharing similar characteristics.
[Figure: classification relates the measured data X (e.g., spectra, log(1/R) vs wavelength) to a discrete response Y coding class membership.]
CLASSIFICATION:
• “To find a criterion to assign an object (sample) to one category (class)
based on a set of measurements performed on the object itself”
• In classification, categories are defined a priori (≠ cluster analysis)
Classification approaches
[Figure: two different classification approaches illustrated on the same two-class data in the (X1, X2) plane.]
Partial Least Squares-Discriminant Analysis (PLS-DA)
• Useful when the number of variables is higher than the number of
available samples and when the predictors are correlated
• Based on the PLS algorithm:
– Classification problem should be re-formulated as regression
• The class is encoded in a dummy y vector
Partial Least Squares-Discriminant Analysis (PLS-DA) – 2
[Scheme: PLS-DA as regression of the dummy y on X, ypred = X b; figure: predicted Y values for the calibration samples.]
Partial Least Squares-Discriminant Analysis (PLS-DA) – 3
• Classification is accomplished by setting a proper threshold to the
predicted Y values.
• The “natural” threshold is 0.5:
– Ypred>0.5 →class 1
– Ypred<0.5 →class 2
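A minimal PLS-DA sketch under these conventions (synthetic two-class data, scikit-learn assumed; class 1 coded as 1 and class 2 as 0, threshold at 0.5):

```python
# PLS-DA: dummy-coded y, PLS fit, and thresholding of the predicted values.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(7)
X1 = rng.normal(loc=0.0, size=(25, 30))          # class 1 samples
X2 = rng.normal(loc=0.7, size=(25, 30))          # class 2 samples (shifted)
X = np.vstack([X1, X2])
y_dummy = np.repeat([1.0, 0.0], 25)              # dummy coding: class 1 -> 1, class 2 -> 0

pls = PLSRegression(n_components=2).fit(X, y_dummy)
y_pred = np.ravel(pls.predict(X))                # real-valued predictions

assigned = np.where(y_pred > 0.5, 1, 2)          # threshold at 0.5
true_cls = np.repeat([1, 2], 25)
print(np.mean(assigned == true_cls))             # fraction correctly classified
```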
[Figure: predicted Y values for the two classes with the 0.5 threshold.]
Partial Least Squares-Discriminant Analysis (PLS-DA) – 4
• Different methods have been proposed in the literature to find alternative
“optimal” threshold values (see later).
• In the example, setting the value to 0.33 allows the correct classification of
all samples:
[Figure: predicted Y values with the threshold set at 0.33, correctly classifying all samples.]
Model complexity
• PLS-DA is a bilinear model:
– Optimal number of components (LVs) should be selected
• Error criterion in cross-validation is (usually) based on the number
(percentage) of mis-classifications:
$\%CE_{tot} = 100 - \%NER_{tot} = 100 \times \dfrac{\sum_{g=1}^{G} n_{g,\,misclassified}}{n_{tot}}$
[Figure: % classification error in cross-validation vs the number of LVs.]
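An illustrative sketch of this criterion (synthetic data; the dummy coding and the 0.5 threshold follow the previous slides):

```python
# Percentage of cross-validated misclassifications vs the number of latent variables.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0.0, size=(30, 40)), rng.normal(0.6, size=(30, 40))])
y_dummy = np.repeat([1.0, 0.0], 30)
cls = np.repeat([1, 2], 30)

cv = KFold(n_splits=6, shuffle=True, random_state=0)
for a in range(1, 9):
    y_cv = np.ravel(cross_val_predict(PLSRegression(n_components=a), X, y_dummy, cv=cv))
    assigned = np.where(y_cv > 0.5, 1, 2)
    err = 100.0 * np.mean(assigned != cls)       # %CE in cross-validation
    print(a, round(err, 1))
```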
With more than two classes
• Instead of a binary vector, a dummy binary matrix is used to code for
class belonging
• Y spans a G-1 dimensional space (G being the number of classes)
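A small sketch of the dummy coding for a three-class problem (labels are invented):

```python
# Dummy binary Y matrix for a three-class problem: one column per class,
# each row has a 1 in the column of its class.
import numpy as np

classes = np.array([1, 1, 2, 3, 2, 3, 1])         # assumed class labels
G = 3
Y_dummy = np.zeros((classes.size, G))
Y_dummy[np.arange(classes.size), classes - 1] = 1.0
print(Y_dummy)
```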
PLS-DA for more than two classes
• Model is built using PLS-2 algorithm
• A matrix of regression coefficients is obtained
[Scheme: Ypred = X B, with B the matrix of regression coefficients.]
PLS-DA for more than two classes - 2
• Predicted y is real-valued:
“true” y        predicted y
1 0 0           1.03  0.09 -0.10
1 0 0           0.68  0.21  0.08
1 0 0           0.99 -0.10  0.01
1 0 0           0.96  0.18 -0.14
0 1 0           0.14  0.94  0.07
0 1 0          -0.01  1.12  0.12
0 1 0           0.08  0.89 -0.02
PLS-DA for more than two classes - 3
• An alternative approach to perform classification based on
discriminant PLS results is to apply LDA:
- On the predicted Y (after removing one of the columns)
- On the X scores T
$\hat{\mathbf{Y}} = \mathbf{T}\mathbf{Q}^{T} = \mathbf{X}\mathbf{B}$
[Scheme: X → PLS → scores T, or X → predicted Y (one column removed) → LDA.]
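An illustrative sketch of the second option, LDA applied to the X-scores T (synthetic three-class data, scikit-learn assumed):

```python
# PLS-DA followed by LDA on the X-scores of the PLS-2 model.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(m, size=(20, 30)) for m in (0.0, 0.5, 1.0)])
cls = np.repeat([1, 2, 3], 20)
Y_dummy = np.zeros((60, 3))
Y_dummy[np.arange(60), cls - 1] = 1.0

pls = PLSRegression(n_components=4).fit(X, Y_dummy)   # PLS-2 on the dummy matrix
T = pls.x_scores_                                      # X-scores

lda = LinearDiscriminantAnalysis().fit(T, cls)         # LDA in the score space
print(np.mean(lda.predict(T) == cls))                  # fraction correct (in fitting)
```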
Digression:
Validation
The concept of validation
• Verify if valid conclusions can be formulated from a model:
– Able to generalize parsimoniously (with the smallest number of LVs)
– Able to predict accurately
• Define a proper diagnostics for characterizing the quality of the
solution:
– Calculation of some error criterion based on residuals
• Residuals can be used for:
– Assessing which model to use;
– Defining the model complexity in component-based methods;
– Evaluating the predictive ability of a regression (or classification) model;
– Checking whether overfitting is present (by comparing the results in
validation and in fitting);
– Residual analysis (model diagnostics).
The need for “new” data
• The use of fitted residuals would lead to overoptimism:
– Magnitude and structure not similar to the ones that would be obtained if the
model were used on new data.
[Scheme: the data (X, Y) are split into a calibration set (Xcal, Ycal) and a validation set (Xval, Yval).]
Test set validation
• Carried out by applying the model to new data (test set):
– Simulates the practical use of the model on future data.
– The test set should be as independent as possible from the calibration set
(collecting new samples and analysing them on different days…)
– A representative portion of the total data set can be left aside as test set.
$\hat{\mathbf{Y}}_{cal} = \mathbf{X}_{cal}\mathbf{B}$        $\hat{\mathbf{Y}}_{val} = \mathbf{X}_{val}\mathbf{B}$

[Scheme: the model built on (Xcal, Ycal) is applied to Xval, and the predicted Yval is compared with the measured Yval.]

MODEL QUALITY: $RMSEP = \sqrt{\dfrac{\sum_{i=1}^{N_{val}} \left(y_i^{val} - \hat{y}_i^{val}\right)^{2}}{N_{val}}}$
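An illustrative sketch of test-set validation and RMSEP (synthetic data; the 60/20 split is arbitrary):

```python
# Calibration on a training set, prediction of an independent test set, RMSEP as figure of merit.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(10)
X = rng.normal(size=(80, 50))
y = X[:, :6] @ rng.normal(size=6) + 0.2 * rng.normal(size=80)
idx = rng.permutation(80)
cal, val = idx[:60], idx[60:]                      # 60 calibration, 20 validation samples

pls = PLSRegression(n_components=5).fit(X[cal], y[cal])
y_val_pred = np.ravel(pls.predict(X[val]))
rmsep = np.sqrt(np.mean((y[val] - y_val_pred) ** 2))
print(round(rmsep, 3))
```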
Cross-validation
• Internal resampling method:
– Simulates test set validation by repeating a data-splitting procedure where
different objects are in turn placed in the validation set.
– Particularly useful when a limited number of samples are available.
• Schematically, it consists of the following steps:
1. Split the data, leaving one group of objects out as an internal validation segment (Xval1, Yval1) and keeping the rest as the calibration set (Xcal1, Ycal1);
2. Build the model without these data: $\hat{\mathbf{Y}}_{cal1} = \mathbf{X}_{cal1}\mathbf{B}_{1}$
3. Apply the model to the left-out values and obtain predictions: $\hat{\mathbf{Y}}_{val1} = \mathbf{X}_{val1}\mathbf{B}_{1}$
4. Calculate the corresponding residual error: $PRESS_{1} = \sum_{i=1}^{N_{val1}} \left(y_i^{val1} - \hat{y}_i^{val1}\right)^{2}$
5. Repeat steps 1-4 until each data value has been left out once; the overall error is then:
$RMSECV = \sqrt{\dfrac{\sum_{k} PRESS_{k}}{N}} = \sqrt{\dfrac{\sum_{i=1}^{N} \left(y_i - \hat{y}_{-i}\right)^{2}}{N}}$
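A hedged sketch of the whole procedure (synthetic data; 5 cancellation groups assumed), accumulating the segment-wise PRESS and computing RMSECV:

```python
# Cross-validation written out explicitly: PRESS per left-out segment, then RMSECV.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(11)
X = rng.normal(size=(50, 30))
y = X[:, :4] @ rng.normal(size=4) + 0.2 * rng.normal(size=50)

press_total, n_total = 0.0, 0
for cal_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = PLSRegression(n_components=3).fit(X[cal_idx], y[cal_idx])   # step 2
    y_val_hat = np.ravel(model.predict(X[val_idx]))                     # step 3
    press_total += np.sum((y[val_idx] - y_val_hat) ** 2)                # step 4: PRESS_k
    n_total += val_idx.size
rmsecv = np.sqrt(press_total / n_total)                                 # step 5
print(round(rmsecv, 3))
```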
Cross-validation
• Number of objects is limited
• Understand the inherent structure of the system ↔ Estimating
model complexity
• Objects in a data table can be stratified into groups based on
background information:
– Across instrumental replicates (repeatability)
– Reproducibility (analyst, instrument, reagent...)
– Sampling site and time
– Across treatment/origin (year, raw material, batch…)