
Accepted Manuscript

DD-SIMCA — A MATLAB GUI tool for data driven SIMCA approach

Y.V. Zontov, O.Y. Rodionova, S.V. Kucheryavskiy, A.L. Pomerantsev

PII: S0169-7439(17)30146-6
DOI: 10.1016/j.chemolab.2017.05.010
Reference: CHEMOM 3441

To appear in: Chemometrics and Intelligent Laboratory Systems

Received Date: 6 March 2017


Revised Date: 7 May 2017
Accepted Date: 8 May 2017

Please cite this article as: Y.V. Zontov, O.Y. Rodionova, S.V. Kucheryavskiy, A.L. Pomerantsev,
DD-SIMCA — A MATLAB GUI tool for data driven SIMCA approach, Chemometrics and Intelligent
Laboratory Systems (2017), doi: 10.1016/j.chemolab.2017.05.010.

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to
our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and all
legal disclaimers that apply to the journal pertain.

DD-SIMCA — a MATLAB GUI tool for data driven SIMCA approach


Y.V. Zontov a,e, O.Ye. Rodionova b, S.V. Kucheryavskiy c, A.L. Pomerantsev b,d

a National Research University Higher School of Economics, Moscow, Russia
b Semenov Institute of Chemical Physics RAS, Moscow, Russia
c Aalborg University, Esbjerg, Denmark
d Branch of Institute of Natural and Technical Systems RAS, Sochi, Russia
e Federal Institute of Industrial Property (FIPS), Moscow, Russia

1 Introduction

Authentication problems, where the main goal is to determine whether an object is, in fact, what it is
declared to be, arise frequently in areas such as counterfeit drug detection and food adulteration.
Methods used for solving these problems are called class modeling methods or one-class classifiers
(OCC) [1, 2]. These methods are meant to distinguish objects of one particular class, called the
target class, from all other objects and classes. This approach is similar to non-targeted analysis of
adulterants, as the acceptance boundaries are developed purely on the authentic samples, without the
need for known adulterant training samples.

Soft independent modeling of class analogy (SIMCA) is one of the OCC methods widely used in
chemometrics [3]. The original version, proposed by S. Wold [4], has numerous modifications,
mostly related to the way the acceptance boundaries are constructed [5-7]. Recently, the authors
proposed a data-driven version of SIMCA (DD-SIMCA), which makes it possible to calculate the
misclassification errors theoretically [8, 9].

In this paper, we present DD-SIMCA, a MATLAB GUI tool that extends the MATLAB®
environment to provide an easy way to set up and apply the data-driven SIMCA technique.

2 DD-SIMCA algorithm and implementation


In this section, we briefly present the main steps of the classification algorithm and the way it is
implemented in the DD-SIMCA tool. A more detailed description and a comparison with other
modifications of the SIMCA method can be found elsewhere [8, 9].
The first step is the decomposition of a training (I×J) data matrix X by principal component
analysis (PCA):
$$X = T P^{t} + E \qquad (1)$$
where $T = \{t_{ia}\}$ is the (I×A) scores matrix; $P = \{p_{ja}\}$ is the (J×A) loadings matrix; $E = \{e_{ij}\}$ is the (I×J)
matrix of residuals; and A is the number of principal components (PC).
In the second step, the score distance (SD), $h_i$, and the orthogonal distance (OD), $v_i$, are calculated
for each training sample using the results of PCA decomposition,
$$h_i = t_i^{t} (T^{t} T)^{-1} t_i = \sum_{a=1}^{A} \frac{t_{ia}^{2}}{\lambda_a}, \qquad v_i = \sum_{j=1}^{J} e_{ij}^{2} \qquad (2)$$
where $\lambda_a$, a = 1, ..., A, are the diagonal elements of the matrix $T^{t}T = \Lambda = \mathrm{diag}(\lambda_1, ..., \lambda_A)$.
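As an illustration of Eq. (2), the following minimal Python sketch computes both distances from a scores matrix and a residual matrix that are assumed to be already available from a PCA decomposition. The function name and the toy numbers are hypothetical and are not part of the DD-SIMCA tool, which is written in MATLAB.

```python
def score_and_orthogonal_distances(T, E):
    """Return (h, v) following Eq. (2).

    T: scores, list of I rows with A values each.
    E: residuals, list of I rows with J values each.
    """
    n_components = len(T[0])
    # lambda_a = sum_i t_ia^2, the a-th diagonal element of T^t T
    lam = [sum(row[a] ** 2 for row in T) for a in range(n_components)]
    # Score distance: squared scores weighted by 1 / lambda_a
    h = [sum(row[a] ** 2 / lam[a] for a in range(n_components)) for row in T]
    # Orthogonal distance: sum of squared residuals of each sample
    v = [sum(e ** 2 for e in row) for row in E]
    return h, v


# Toy data: four samples, two PCs (orthogonal score columns), two variables
T = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
E = [[0.1, 0.0], [0.0, 0.2], [0.3, 0.0], [0.0, 0.0]]
h, v = score_and_orthogonal_distances(T, E)
print(h, v)
```

With this scaling the score distances of the training set sum to A, the number of principal components, which gives a quick sanity check on the computation.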
The total distance for the sample is calculated as
$$c = N_h \frac{h}{h_0} + N_v \frac{v}{v_0} \propto \chi^{2}(N_h + N_v) \qquad (3)$$
where the parameters $v_0$ and $h_0$ are the scaling factors, and $N_h$ and $N_v$ are the numbers of
degrees of freedom (DoF). These parameters are unknown a priori, and they are estimated using the
data-driven approach explained in refs. [8, 9].
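A short Python sketch of the total distance in Eq. (3) is given below. The scaling factors and DoF values in the example are made-up numbers chosen for illustration; in practice they would come from the data-driven estimation described in refs. [8, 9].

```python
def full_distance(h, v, h0, v0, n_h, n_v):
    """Total distance c of Eq. (3) for one sample."""
    return n_h * h / h0 + n_v * v / v0


# Hypothetical values: a sample with h = 0.5 and v = 0.04
c = full_distance(h=0.5, v=0.04, h0=0.5, v0=0.035, n_h=2, n_v=3)
print(round(c, 4))
```

Because c is linear in h and v, a fixed threshold on c corresponds to a straight-line boundary in the (h, v) plane.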
The third step defines the acceptance area, or thresholds, for the target class. Given the type I error α, the
acceptance area is determined as
$$c \le c_{crit}(\alpha), \qquad (4)$$
where
$$c_{crit} = \chi^{-2}(1 - \alpha,\, N_h + N_v) \qquad (5)$$
is the (1 − α) quantile of the chi-squared distribution with $N_h + N_v$ DoF [9].
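The critical value in Eq. (5) is an ordinary chi-squared quantile. The sketch below shows one way to compute it using only the Python standard library: the CDF is evaluated with the series expansion of the regularized lower incomplete gamma function and then inverted by bisection. This is an illustrative numerical recipe, not the statistical code used inside the DD-SIMCA tool, and all function names are hypothetical.

```python
import math


def chi2_cdf(x, dof):
    """Chi-squared CDF via the series for the regularized lower incomplete gamma."""
    if x <= 0.0:
        return 0.0
    a, z = dof / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    # Sum z^n / (a (a+1) ... (a+n)) until the terms become negligible
    while term > 1e-16 * total and n < 100000:
        term *= z / (a + n)
        total += term
        n += 1
    return min(1.0, total * math.exp(-z + a * math.log(z) - math.lgamma(a)))


def chi2_quantile(p, dof):
    """Invert the CDF by bisection; p must satisfy 0 < p < 1."""
    lo, hi = 0.0, float(dof) + 10.0
    while chi2_cdf(hi, dof) < p:  # widen the bracket until it contains the quantile
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, dof) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def critical_distance(alpha, n_h, n_v):
    """c_crit of Eq. (5): the (1 - alpha) quantile with N_h + N_v DoF."""
    return chi2_quantile(1.0 - alpha, n_h + n_v)


print(critical_distance(0.05, 1, 1))
```

For two degrees of freedom the quantile has the closed form −2 ln(α), which is a convenient check on the numerical result. A sample is then inside the acceptance area of Eq. (4) when its total distance c does not exceed this value.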
After this step, the model is ready for classification of new samples and can be represented by an
acceptance area in the orthogonal vs. score distance plot (Acceptance plot) defined for the given α-value.
The α-value specifies the type I error, i.e. the share of false negative decisions. Each sample from the
training set is characterized by its position in the Acceptance plot and has the status of either a
'regular' sample, i.e. a sample attributed to the target class, or an 'extreme' sample, which is located
outside the acceptance area and is classified as an alien (a non-member). In addition, a second cut-off
level is determined and used as the outlier border, constructed for a given γ-value. This value
specifies the probability that at least one regular object from the data set will be erroneously
considered an outlier. Unlike the acceptance area, the outlier area depends on the size I of the
training set. For a specific γ-value, the greater I, the farther the outlier border. For moderately sized
datasets, common values of γ are 0.05 and 0.01.
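The dependence of the outlier border on I can be made concrete with a small sketch. It rests on an assumption that this section does not state explicitly: the I training samples are treated as independent, so that the probability of no regular sample falling beyond the border is the per-sample probability raised to the power I. Under that assumption, the per-sample probability level that yields an overall error γ is (1 − γ)^(1/I). The function name is hypothetical; refs. [8, 9] give the exact construction used by the method.

```python
def outlier_probability_level(gamma, n_samples):
    """Per-sample probability level giving overall error gamma for n_samples
    independent regular samples (independence is an assumption of this sketch)."""
    return (1.0 - gamma) ** (1.0 / n_samples)


for n in (10, 40, 200):
    print(n, outlier_probability_level(0.05, n))
```

The level approaches 1 as I grows, so the corresponding chi-squared quantile, and with it the outlier border, moves farther from the origin, in line with the behaviour described above.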
In addition to the Acceptance plot, a special Extreme plot [9] is created. This plot shows the
observed number of extreme samples versus the theoretically expected number,
calculated as n = αI. This dependence is shown in the plot together with the tolerance limits
calculated as
$$t_\alpha = n \pm 2\sqrt{\alpha(1 - \alpha) I} = n \pm 2\sqrt{n(1 - n/I)} \qquad (6)$$
The Extreme plot helps to analyze the quality of the classification model for the chosen number of
PCs.
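The tolerance limits can be sketched in a few lines of Python. The sketch assumes that the limits in Eq. (6) are the expected count plus or minus two binomial standard deviations, i.e. that the expressions under the ± sign are square roots; the function name is hypothetical.

```python
import math


def extreme_tolerance(alpha, n_samples):
    """Expected number of extremes n = alpha * I and its tolerance limits."""
    expected = alpha * n_samples
    # Two binomial standard deviations: 2 * sqrt(n * (1 - n / I))
    half_width = 2.0 * math.sqrt(expected * (1.0 - expected / n_samples))
    return expected, expected - half_width, expected + half_width


expected, lower, upper = extreme_tolerance(0.05, 40)
print(expected, lower, upper)
```

For α = 0.05 and I = 40 the expected number of extremes is 2 and the limits are roughly −0.76 and 4.76, so any observed count from 0 to 4 falls inside the tolerance area.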

The last step is the classification of test or new objects using the established model. Classification
results are presented in the corresponding Acceptance plot. Additionally, the value of the type II
error β, which is the rate of wrong acceptances of aliens as target objects [10], is calculated. It is also
possible to perform the reverse evaluation, namely to calculate the type I error α that corresponds to
a given value of the type II error β.

The algorithm and graphical user interface (GUI) were implemented using MATLAB object-oriented
programming (OOP). The main class, DDSimca, contains the implementation of SIMCA
and auxiliary algorithms. An instance of this class contains fields that represent the actual DD-SIMCA
model. The class DDSTask is used to represent classification results for new data. The code
has been tested on MATLAB versions R2010a and R2015b.


The graphical user interface provides an easy way to build DD-SIMCA models and to apply them
for classification of test and new samples. Experienced users can also use the command-line
version, which provides numerous additional options that are documented and illustrated with
examples. The DD-SIMCA tool relies on its own implementation of statistical functions and thus does
not require the MATLAB Statistics Toolbox or any other toolbox.

3 Working with the GUI tool


To illustrate the functionality of the DD-SIMCA Tool, we use a subset (three classes: A4, A5,
A6) of the Amlodipin data set [11]. The original data contain NIR spectra of uncoated intact tablets
acquired through the PVC blisters. Spectra are preprocessed using the second order Savitzky-Golay
differentiation. Tablets in all three classes (the classes correspond to three different manufacturers)
contain the same amount of the identical active pharmaceutical ingredient (API). Each class is
represented by 50 spectra (10 spectra from 5 different batches), which are divided into two parts: the
training (40 samples) and the test (10 samples) subsets. For illustration, we use class A6 as the target
class (40 spectra for training and 10 spectra for validation) and test spectra from the other two classes,
denoted as 'Test A4_A5', to show how the model works with new samples.
The DDSGUI module provides the graphical user interface (GUI) to the tool. Before starting the
module, make sure that all necessary data are loaded into the standard MATLAB workspace. The
datasets should be loaded as matrices, while the optional text labels for the samples should be
represented as cell arrays of strings. The number of elements in a label array should equal the
number of rows in the corresponding data matrix.
The GUI window consists of three panels that can be switched using the tabs Model, Prediction and
Options.

3.1 The Model panel

Fig. 1 depicts the main panel, which contains the user controls for input of the training data set,
adjusting the model parameters, and viewing and saving the training results. The user controls are
grouped into several sections based on their purpose and importance.

Fig. 1. Model panel


The controls in the Data sets section are used to load data from the MATLAB workspace into the GUI.
The «Training Set» button opens a dialog window for selecting the data. The list of datasets
includes only arrays that are available in the workspace. The «Labels» button opens a dialog
window for selecting a cell array with labels. After the data are loaded, the indicator under the
corresponding button changes from «Not selected» to the actual dimensions of the array (e.g. «[40
x 72]»). The Labels array is optional and is used for labeling samples in the Acceptance plot. If labels
are not available, each sample is indicated by its sequential number.

The input data can be column-wise centered and/or scaled by the standard deviation using
checkboxes in the Preprocessing section.
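The two preprocessing options can be sketched as follows. This is a hypothetical Python illustration of column-wise centering and scaling, not the tool's MATLAB code. It assumes that scaling uses the sample standard deviation (divisor I − 1) and that no column is constant.

```python
import math


def preprocess(X, center=True, scale=False):
    """Column-wise centering and/or scaling of a data matrix (list of rows).

    Returns the transformed matrix together with the column means and
    standard deviations, so the same transform can be applied to new samples.
    """
    n_rows, n_cols = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n_rows for j in range(n_cols)]
    # Sample standard deviation (divisor n_rows - 1) is an assumption here
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / (n_rows - 1))
        for j in range(n_cols)
    ]
    result = []
    for row in X:
        new_row = []
        for j, x in enumerate(row):
            if center:
                x = x - means[j]
            if scale:
                x = x / stds[j]
            new_row.append(x)
        result.append(new_row)
    return result, means, stds


X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
Xp, means, stds = preprocess(X, center=True, scale=True)
print(Xp)
```

Keeping the training means and standard deviations matters for prediction: test and new samples have to be transformed with the parameters of the training set, not with their own.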

The Model parameters section lets the user choose the main parameters: the number
of PCs, the value of the type I error, the significance level for outliers, the type of the acceptance
area, and the method for estimating the chi-square distribution parameters.
The type I error is an important parameter, crucial for construction of the acceptance area. The
«Auto» checkbox provides an option to calculate α in such a way that all training samples are
attributed to the target class and thus are located inside the acceptance area. The outlier
significance level changes the outlier area (indicated by the red curve on the Acceptance
plot).
D

The program provides a possibility to switch between two types of acceptance area — «chi-square»
and «square». The first type is a triangular area [8], constructed using the sum of the normalized SD
TE

and OD given in Eq. (3), while the second type represents the classical area, where the cut-off levels
are constructed for each type of distance independently.
Two methods for estimating the parameters of the chi-square distribution for the OD and SD
are available. The «Classical» method is based on the method of moments. The «Robust» method [9]
takes into account possible outliers in the PCA model. The robust method is recommended at the
initial stage of analysis, when the data may be contaminated by outliers.
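To show what a method-of-moments estimate can look like, here is a minimal Python sketch. It assumes that a distance u follows a scaled chi-squared distribution, N·u/u0 ~ χ²(N), so that its mean is u0 and its variance is 2·u0²/N. Matching the sample mean and variance then gives u0 and N. The rounding of N to a positive integer and the use of the sample variance are choices made for this sketch; the exact classical and robust estimators are described in refs. [8, 9].

```python
import statistics


def estimate_scaled_chi2(distances):
    """Method-of-moments estimates (u0, N) for N * u / u0 ~ chi2(N)."""
    mean = statistics.mean(distances)
    var = statistics.variance(distances)  # sample variance, divisor n - 1
    # Var[u] = 2 * u0^2 / N  =>  N = 2 * mean^2 / var
    dof = max(1, round(2.0 * mean ** 2 / var))
    return mean, dof


u0, n_dof = estimate_scaled_chi2([1.0, 2.0, 3.0, 4.0, 5.0])
print(u0, n_dof)
```

Both the mean and the variance are sensitive to a few very large distances, which is why a robust alternative is preferable when the training set may contain outliers.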


A group of buttons in the bottom left part of the Model panel provides user controls for the most
important tasks: building the model based on the selected parameters, showing windows with plots,
saving and loading the model, clearing the current model, and resetting the parameters to their default
values.
The right part of the panel contains the results of modeling for the data under consideration. The
Current model section contains information regarding a newly developed or loaded model, including
the predefined α-value, the number of PCs, the method used for estimation of the chi-square
parameters, the type of preprocessing, the form of the acceptance area, and the estimated values of
the DoF for the orthogonal and score distances.
The choice of the number of PCs depends strongly on the ultimate goal and the quality of the data at
hand. A user can easily change the number of PCs and the value of the type I error and compare the
results by building Acceptance plots. A discussion of the "optimal" dimensionality for one-class
classifiers is presented in [12].
The Results and statistics section shows the total number of samples used for model
building, as well as the number of samples that appear to be extremes or outliers based on the
current set of model parameters. If extreme or outlying samples are present, a special warning
"Extreme objects in the training set!" is displayed.

For the training data set, the a posteriori sensitivity can be calculated using results from the «Results
and statistics» section as
$$\mathrm{Sensitivity} = 100\% \times \frac{\mathrm{Samples} - \mathrm{Extremes}}{\mathrm{Samples}} \qquad (7)$$
It is recommended to compare the a posteriori sensitivity with the predefined α-value. In the example,
the α-value is set to 0.05. This means that the cut-off level is calculated assuming that 5% of the forty
training objects, i.e. two objects, are expected to be extremes. As can be seen from Fig. 1, only one
object is identified as an extreme. Thus the a posteriori sensitivity equals 97.5%, which is close to the
a priori estimate of 95% (1 − α).
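The arithmetic of this example can be reproduced with a few lines of Python; the function name is hypothetical and the numbers are those quoted above (40 training samples, 1 extreme, α = 0.05).

```python
def sensitivity_percent(n_samples, n_extremes):
    """A posteriori sensitivity of Eq. (7), in percent."""
    return 100.0 * (n_samples - n_extremes) / n_samples


a_posteriori = sensitivity_percent(40, 1)
a_priori = 100.0 * (1.0 - 0.05)
print(a_posteriori, a_priori)
```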
The results of modeling can be visualized using the «Acceptance plot» button and the «Extreme
plot» button (see section 3.4).


The constructed model may be saved to the MATLAB workspace and loaded from it. The
corresponding buttons are located in the lower left-hand area of the Model panel. The default name
for the model is DD_SIMCA. Another name may be chosen (whitespace is not
allowed). The model is an instance of the DDSimca class and can be saved as a *.mat file for future
use.

3.2 The Prediction panel


Fig. 2. Prediction panel.


The Prediction panel allows the user to apply the constructed model to new objects. The layout and
most of the controls are similar to those of the Model panel (Fig. 2). The Current model section
contains information regarding the parameters of the current classification model. This information
duplicates the information shown in the corresponding section of the Model panel.

For prediction, the new or test dataset should be loaded from the MATLAB workspace using the
«New set» button and, optionally, the «Labels» button, in the same way as for the training data.
The «Decide» button applies the current model to the new data. The results of classification are
visualized using the Acceptance plot (see section 3.4), while the corresponding performance statistics
are shown in the Results and statistics section. By default, the value of the type II error, β, for the new
set is calculated. Additionally, the total number of new samples, the number of samples that
are considered extraneous with respect to the current model, and the chosen α-value are reported.
If a new dataset consists of several subsets, then the β-value is calculated for the subset that is
closest to the acceptance area, and the other subsets are not taken into account [10]. In this case the
corresponding warning «The New Set contains more than one class of external objects!» is
displayed. Optionally, it is possible to select the «Calculate type I error» check-box and to input a
desired β-value, which will be used for calculation of the corresponding α-value.

If the new set is a test set with samples from the target class, the sensitivity can be calculated in the
same way as for the training set (see Eq. (7)). In other cases the specificity is calculated as
$$\mathrm{Specificity} = 100\% \times \frac{\mathrm{External\ objects}}{\mathrm{Samples}} \qquad (8)$$
In the current example, objects from two alien classes are combined into the 'new set'. DD-SIMCA
classifies them as extraneous objects; therefore the a posteriori specificity is equal to 100%.
The results may be saved as an instance of the DDSTask class in the MATLAB workspace using the
«Save results» button. The default name is DDS_TASK. The results from the workspace may be
saved as a .mat file for further use. Note that the DDSTask instance already
contains the DDSimca model.
A previously saved DDSTask object can be loaded from the workspace into the GUI using the
«Load results» button.

3.3 Options panel


The Options panel (Fig. 3) contains various settings for the Acceptance plot, such as axes
transformation and sample names. By default, the results in the Acceptance plot are presented on
a logarithmic scale for better visualization:
$$x_{tr} = \ln(1 + x/x_0) \qquad (9)$$
This transformation may be turned off by selecting 'none' in the Axes transformation dropdown list.
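The axis transformation of Eq. (9) and its inverse can be sketched as below. The section does not define x0; the sketch simply treats it as a given positive scaling constant (presumably the scaling factor of the corresponding distance). The function names are hypothetical.

```python
import math


def log_transform(x, x0):
    """Eq. (9): x_tr = ln(1 + x / x0)."""
    return math.log1p(x / x0)


def inverse_log_transform(x_tr, x0):
    """Recover the original distance from a transformed value."""
    return x0 * math.expm1(x_tr)


print(log_transform(0.0, 2.0), log_transform(2.0 * (math.e - 1.0), 2.0))
```

The transformation keeps zero at zero and compresses large distances, so far-away alien samples and the samples near the acceptance border can be shown in one plot.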
C

The checkboxes in the Options panel allow selecting whether the labels of the samples should be
shown in the plot.
AC

Fig. 3. Options panel



3.4 Data visualization

The DD-SIMCA Tool provides two ways to represent the classification results graphically: the
Acceptance plot and the Extreme plot. The Acceptance plot for the training set (Fig. 4a) shows
the acceptance area (the area inside the green curve), the outlier cut-off (red
curve), regular samples (green dots), extremes (yellow squares), and outliers (red squares) for the
samples used for model building. If the checkbox "Show sample labels for Training Set" is inactive
(as shown in Fig. 3), markers in Fig. 4a are shown without labels.

Fig. 4. Acceptance plots for target class A6, α = 0.05, 2 PCs: a) Training set A6; b) New set, Test set A4_A5

The Acceptance plot for the new set (Fig. 4b) provides a graphical representation of the decision
area (green curve), samples that fit the model (green dots) and samples considered as non-members
(red dots).


Fig. 4 clearly shows several groups of alien objects. Returning to the Results and statistics
section (Fig. 2), we can see that the program has found inhomogeneity in the 'new set' and displays
the corresponding warning. Thus, the estimate of the β-value is given for the nearest subset only:
there is a 0.01% chance that alien object(s) from class A4 are wrongly attributed to class A6.

The Extreme plot (Fig. 5a) is built using the training set samples. It allows checking the training
data set for the presence of possible outliers. Something is wrong with the data if the number of
observed extreme objects does not fall inside the tolerance area (shown by vertical lines). In Fig. 5a
we can see that training set A6 does not contain outliers for the model with two PCs.

Fig. 5. Extreme plot: a) Training set A6, 2 PCs; b) Set with outliers, 2 PCs.


An example of the Extreme plot for data contaminated with outliers is presented in Fig. 5b. This
illustrative dataset is named "training_with_outliers.mat". It is worth mentioning that the more PCs
are used, the more 'ideal' the Extreme plot becomes (red dots are closer to the diagonal). This can be
explained by the fact that a complex model uses some PCs to describe the outliers.

4 Conclusions
The DD-SIMCA Tool implements all the features of the data-driven SIMCA method in the MATLAB
scripting language. The tool, as well as a demonstration dataset, is freely available via GitHub [13]
or via the supplementary materials.


The authors are grateful to the program testers. Their technical remarks have been addressed at the
stage of manuscript/software revision. General comments and proposals for further possible
software improvements are presented below.

5 Validation
Ph.D. Marina Cocchi 1
1 Università degli Studi di Modena e Reggio Emilia, Department of Chemical and Geological
Sciences.
I tested the software on Mac OS X 10.11.6 with MATLAB R2015b. No bugs or malfunctions were
encountered. The interface is easy to use, and the description provided in the paper and the help in
the GUI are clear for users.
Possible improvements:
1. It would be nice to load data from a folder different from the working one.
2. A nice future improvement would be to also show the scores and loadings plots for the class
model inside the interface.

Ph.D. Sergei Zhilin 1,2
1 CSort LLC, R&D Department, Barnaul, Russia.
2 Altai State University, Department of Informatics, Barnaul, Russia.
The authors present the graphical user interface for their updated MATLAB program implementing
the Data-Driven SIMCA method (DD-SIMCA).
The graphical interface to DD-SIMCA provides the following functionality: loading of training/test
sets from the MATLAB environment, one-class classifier model building, statistics output,
visualization of diagnostic plots, prediction of class membership and test object statuses (inlier,
extreme or outlier), and saving/loading of models and predictions to disk. Before model building, a
user is able to specify the number of principal components and the levels of type I error and outlier
significance. Classical and robust estimation methods and two types of acceptance area (rectangle
and chi-square) are implemented, so one can compare the results of modelling by classical SIMCA
and by its data-driven variant, which treats extreme objects and outliers more thoroughly.


A user can visually investigate object statuses using acceptance plots for training and test sets.
The two-level decision area shown on the plot can be based on the rectangular form or on the
chi-square curve. By default, the axes have a logarithmic scale, but this transformation can be turned
off in order to use the usual space of normalized score and orthogonal distances. If needed, samples
on plots may be shown with labels.
Extreme objects and their relation to the theoretical tolerance area can be shown on the extreme
objects plot, which is useful for visual detection of potential outliers in a training set and for
estimation of the percentage of outliers and extreme objects.
The new version of the code of the DD-SIMCA procedure behind the GUI has been refactored to an
object-oriented paradigm. This makes use of the procedure by third-party developers easier
and more natural. Other valuable modifications of the program are aimed at speeding up the
computations. Most of the time-consuming steps are optimized for speed, so processing of large
datasets (e.g., derived from hyperspectral images) is now feasible.
It is worth noting that comprehensive help pages describing the GUI and the inner structure of the
MATLAB classes are provided with the program.

The new version of the DD-SIMCA program is comfortable to use in both
interactive and programmatic modes.

References

1. Tax D. One-class classification: concept-learning in the absence of counter-examples. Doctoral
Dissertation, University of Delft, The Netherlands, 2001.
2. Rodionova O.Ye., Titova A.V., Pomerantsev A.L. Discriminant analysis is an inappropriate
method of authentication. Trends Anal. Chem. 2016; 78 (4): 17-22.
3. Naes T., Isaksson T., Fearn T., Davies T. Multivariate Calibration and Classification. Wiley,
Chichester, 2002.
4. Wold S. Pattern recognition by means of disjoint principal components models. Pattern
Recognition 1976; 8: 127-139.
5. Nomikos P., MacGregor J.F. Multivariate SPC charts for monitoring batch processes.
Technometrics 1995; 37: 41-59.
6. Hubert M., Rousseeuw P.J., Vanden Branden K. ROBPCA: a new approach to robust principal
component analysis. Technometrics 2005; 47: 64-79.
7. Daszykowski M., Kaczmarek K., Stanimirova I., Vander Heyden Y., Walczak B. Robust SIMCA-
bounding influence of outliers. Chemom. Intell. Lab. Syst. 2007; 87: 95-103.
8. Pomerantsev A.L. Acceptance areas for multivariate classification derived by projection methods.
J. Chemom. 2008; 22: 601-609. DOI: 10.1002/cem.1147.
9. Pomerantsev A.L., Rodionova O.Ye. Concept and role of extreme objects in PCA/SIMCA. J.
Chemom. 2014; 28: 429-438. DOI: 10.1002/cem.2506.
10. Pomerantsev A.L., Rodionova O.Ye. On the type II error in SIMCA method. J. Chemom.
2014; 28: 518-522. DOI: 10.1002/cem.2610.
11. Zontov Y.V., Balyklova K.S., Titova A.V., Rodionova O.Ye., Pomerantsev A.L.
Chemometric aided NIR portable instrument for rapid assessment of medicine quality. J. Pharm.
Biomed. Anal. 2016; 131: 87-93. DOI: 10.1016/j.jpba.2016.08.008.
12. Rodionova O.Ye., Oliveri P., Pomerantsev A.L. Rigorous and compliant approaches to one-class
classification. Chemom. Intell. Lab. Syst. 2016; 159: 89-96. DOI: 10.1016/j.chemolab.2016.10.002.
13. GitHub [Link]
