Machine Learning in Python For Process - Ankur Kumar
First Edition
Ankur Kumar
Jesus Flores-Cerrillo
Dedicated to our spouses, families, friends, motherlands, and all the data-science
enthusiasts
आचार्यात् पादमादत्ते पादं शिष्यः स्वमेधया। (A student receives one quarter of learning from the teacher, and one quarter through one's own intellect.)
www.MLforPSE.com
All rights reserved. No part of this book may be reproduced or transmitted in any form or
in any manner without the prior written permission of the authors.
Every effort has been made in the preparation of this book to ensure the accuracy
of the information presented and obtain permissions for usage of copyrighted materials.
However, the authors make no warranties, expressed or implied, regarding errors or
omissions, and assume no legal liability or responsibility for loss or damage resulting
from the use of information contained in this book.
The first book of the series, 'Machine Learning in Python for Process Systems
Engineering', covers the basic foundations of machine learning and provides an overview of a
broad spectrum of ML methods primarily suited for static systems. Step-by-step guidance on
building ML solutions for process monitoring, soft sensing, predictive maintenance, etc., is
provided using real process datasets. Aspects relevant to process systems, such as modeling
correlated variables via PCA/PLS, handling outliers in noisy multidimensional datasets, and
controlling processes using reinforcement learning, are covered. The second book of the
series, 'Machine Learning in Python for Dynamic Process Systems', focuses on dynamic
systems and provides a guided tour along the wide range of available dynamic modeling
choices. Emphasis is placed on both the classical methods (ARX, CVA, ARMAX, OE, etc.) and
modern neural network methods. Applications on time series analysis, noise modeling,
system identification, and process fault detection are illustrated with examples. This third
book of the series takes a deep dive into an important application area of ML, viz., prognostics
and health management. ML methods that are widely employed for the different aspects of
plant health management, namely, fault detection, fault isolation, fault diagnosis, and fault
prognosis, are covered in detail. Emphasis is placed on conceptual understanding and
practical implementations. Future books of the series will continue to focus on other aspects
and needs of the process industry. It is hoped that these books can help process data scientists
find innovative ML solutions to the real-world problems faced by the process industry.
With the growing use of machine learning in the process industry, there is a growing
demand for process domain experts/process engineers with data science/ML skills. These
books have been written to fill the existing gap in ML resources for such process data
scientists. Specifically, books of this series will be useful to budding process data scientists,
practicing process engineers looking to 'pick up' machine learning, and data scientists
looking to understand the needs and characteristics of process systems. With the focus on
practical guidelines and industrial-scale case studies, we hope that these books lead to wider
adoption of data science in the process industry.
Other book(s) from the series
(https://MLforPSE.com/books/)
Preface
Imagine yourself in the shoes of a process engineer/analyst who has been assigned his/her
first machine learning-based project with the objective of building a plantwide monitoring tool.
Although an exciting task, it may easily turn into a frustrating effort due to the difficulty in
finding the right methodology that works for the process system at hand. Building a
successful process monitoring tool is challenging due to the different characteristics a
process dataset may possess, which preclude the possibility of a single methodology that
works for all scenarios. Consequently, a number of powerful techniques have been devised
over the past several decades. While it is good to be spoilt for choice, it is easy for a
newcomer to get 'drowned' in the huge (and still burgeoning) literature on process monitoring
(PM) and predictive maintenance (PdM). There are a lot of scattered resources on PM and
PdM. Unfortunately, however, no textbook exists that focuses on practical implementation
aspects and provides comprehensive coverage of commonly used PM/PdM techniques that
have proven useful in the process industry. There is a gap in available machine learning
resources for PM/PdM catering to industrial practitioners, and this book attempts to fill this
gap. Specifically, we wished to create a reader-friendly and easy-to-understand book that
can help its readers become 'experts' on ML-based PM/PdM 'quickly' (disclaimer: there is no
magic potion; hard work is still required!) with the right guidance.
In this book, we cover all three main aspects of process monitoring and predictive
maintenance, namely, fault/anomaly detection, fault diagnosis/identification, and fault
prognostics/remaining useful life (RUL) estimation. Our intent is not to give a full treatise on
all the PM/PdM techniques that exist out there; rather, our focus is to help budding process
data scientists (PDSs) gain a bird's-eye view of the PM/PdM landscape, obtain working-level
knowledge of the mainstream techniques, and have the practical know-how to make the right
choice of models. In terms of the spectrum of methodologies covered, we place equal
emphasis on modern deep-learning methods and classical statistical methods. While deep
learning has provided remarkable results in recent times, the classical statistical (and
ML/data mining) methods are not yet obsolete. In fact, MSPM (multivariate statistical process
monitoring) techniques are still widely used for process monitoring. Accordingly, this book
covers the complete spectrum of methodologies, with univariate Shewhart-/CUSUM-/EWMA-
based control charts on one end and deep-learning-based RUL estimation on the other.
Guided by our own experiences from building monitoring models for varied industrial
applications over the past several years, this book covers a curated set of ML techniques
that have proven useful for PM/PdM. The broad objectives of the book can be summarized
as follows:
• reduce barrier-to-entry for those new to the field of PM/PdM
• provide working-level knowledge of PM/PdM techniques to the readers
• enable readers to make judicious selection of PM/PdM techniques appropriate for their
problems through intuitive understanding of the advantages and drawbacks of
different techniques
• provide step-by-step guidance for developing industrial-level solutions for PM/PdM
• provide practical guidance on how to choose model hyperparameters judiciously
This book adopts a tutorial-style approach. The focus is on guidelines and practical
illustrations with a delicate balance between theory and conceptual insights. Hands-on
learning is emphasized and therefore detailed code examples with industrial-scale datasets
are provided to concretize the implementation details. A deliberate attempt is made to not
weigh readers down with mathematical details, but rather use mathematics as a vehicle for
better conceptual understanding. Complete code implementations have been provided in the
GitHub repository.
We are quite confident that this text will enable its readers to build process monitoring and
prognostics models for challenging problems with confidence. We wish them the best of luck
in their careers.

This book will be useful to:
1) Data scientists new to the field of process monitoring, equipment condition monitoring,
and predictive maintenance
2) Regular users of commercial anomaly detection software (such as Aspen Mtell)
looking to obtain a deeper understanding of the underlying concepts
3) Practicing process data scientists looking for guidance for developing process
monitoring and predictive maintenance solutions
4) Process engineers or process engineering students making their entry into the world
of data science
5) Industrial practitioners looking to build fault detection and diagnosis solutions for
rotating machinery using vibration data
Pre-requisites
No prior experience with machine learning or Python is needed. Undergraduate-level
knowledge of basic linear algebra and calculus is assumed.
Book organization
The book follows a holistic and hands-on approach to learning ML where readers first gain
conceptual insight and develop an intuitive understanding of a methodology, and then
consolidate their learning by experimenting with code examples. Under the broad theme of
ML for process systems engineering, this book is an extension of the first two books of the
series (which dealt with fundamentals of ML, varied applications of ML in the process industry,
and ML methods for dynamic system modeling); however, it can also be used as a
standalone text. Industrial process data can show varied characteristics such as
multidimensionality, non-Gaussianity, multimodality, nonlinearity, dynamics, etc. Therefore,
to give due treatment to the different modeling methodologies designed for dealing with
systems with different data characteristics, this book has been divided into seven parts.
Part 1 lays down the basic foundations of ML-assisted process and equipment condition
monitoring, and predictive maintenance. Part 2 provides a detailed presentation of classical
ML techniques for univariate signal monitoring. Different types of control charts and time-
series pattern matching methodologies are discussed. Part 3 is focused on the widely
popular multivariate statistical process monitoring (MSPM) techniques. Emphasis is placed on
both the fault detection and fault isolation/diagnosis aspects. Part 4 covers the process
monitoring applications of classical machine learning techniques such as k-NN, isolation
forests, support vector machines, etc. These techniques come in handy for processes that
cannot be satisfactorily handled via MSPM techniques. Part 5 navigates the world of artificial
neural networks (ANNs) and studies the different ANN structures that are commonly employed
for fault detection and diagnosis in the process industry. Part 6 focuses on vibration-based
monitoring of rotating machinery and Part 7 deals with prognostic techniques for predictive
maintenance applications.
This book attempts to cover a lot of concepts. Therefore, to keep the book from getting bulky,
we have not included content that is not directly relevant to PM/PdM and has already
been covered in detail in the first two books of the series. For example, ML fundamentals
related to cross-validation, regularization, noise removal, etc., are illustrated in great detail in
Book 1 of the series and not in this book.
Symbol notation
The following notation has been adopted in the book for representing different types of
variables:
- lower-case letters refer to vectors (x ∈ ℝ^{m×1}) and upper-case letters denote matrices (X ∈ ℝ^{n×m})
- individual elements of a vector and a matrix are denoted as x_j and x_ij, respectively
- any ith vector in a dataset is represented as a subscripted lower-case letter (x_i ∈ ℝ^{m×1})
Table of Contents
Part 1: Introduction and Fundamentals
Chapter 15: Fault Detection & Diagnosis via Supervised Artificial Neural Networks Modeling 249
Chapter 16: Fault Detection & Diagnosis via Unsupervised Artificial Neural Networks Modeling 265
Chapter 1
Machine Learning, Process and Equipment Condition Monitoring, and Predictive Maintenance: An Introduction

Ask a plant manager what gives him/her sleepless nights and you will invariably
get plant equipment failures and process abnormalities causing downtimes among the
top answers. Such concerns about plant reliability are not unfounded. Incipient
abnormalities, if left undetected, can cause cascading damages leading to economic losses,
plant downtimes, and even fatalities. Several major disasters in the process industry
(Philadelphia refinery explosion in 2019, Bhopal gas tragedy in 1984, etc.) were the results of
failures in timely detection and correction of process faults. While such disasters are
fortunately infrequent, ‘innocuous’ process abnormalities that lead to non-optimal plant
efficiencies and degradations in product quality occur routinely. Without exaggeration, it can
be said that 24x7 monitoring of process performance and plant equipment health status, and
forecasting of impending failures, are no longer a 'nice to have' but an absolute necessity!
The process industry has responded to the above challenges by installing more sensors and
collecting more real-time data. Unfortunately, this has led to a data deluge and operators being
overwhelmed with 'too much information but little insight'. Thankfully, machine learning
comes to the rescue with its ability to parse huge amounts of data and find hidden patterns in
real time. ML enables smart process monitoring (PM) wherein the objective is not just to detect
process abnormalities but to catch issues at early stages. Furthermore, ML facilitates
predictive maintenance (PdM) through advance prediction of equipment failure times.
In this chapter we will take a bird's-eye view of the ML landscape for PM/PdM and understand
what it takes to achieve the above objectives. Specifically, the following topics are covered:
• Introduction to process/equipment abnormalities and faults
• Typical workflow for ML-based process monitoring and predictive maintenance
• ML landscape for process monitoring and predictive maintenance
• Common PM/PdM solution deployment infrastructure employed in industry
Machine learning is a great tool, but it's not magic; it still takes a lot of 'ML art' to get things
right. Let's now take the first step towards mastering this art.
'Process industry' is an umbrella term used to refer to industries like petrochemical, oil & gas,
chemical, power, paper, cement, pharmaceutical, etc. These industries use processing plants
to manufacture intermediate or final consumer products. As emphasized in Figure 1.1, the
prime concerns of the management of these plants include, amongst others, optimal and safe
operations, quality control, and high reliability through proactive process monitoring and
predictive maintenance. All these tasks fall under the ambit of process systems engineering
(PSE). While ML is slowly being incorporated into PSE tasks (for example, deep learning-
based process controllers1), ML has had the biggest influence on the tasks related to plant
health management, viz., fault detection, fault diagnosis, and predictive maintenance.
Figure 1.1: Overview of industries constituting process industry and the common PSE tasks
Figure 1.2 shows a sample process flowsheet with traditional measurements of flow,
temperature, pressure, level, composition, power, and vibration. Such complex and highly
integrated operations, tight product specifications, and the economic compulsion to push
processes to their limits are making industrial operations more prone to failures. At the same
time, there is an increasing trend towards unmanned or lean-staffed plants with fewer human
eyes to monitor the process. This is where automated plant health management comes into
play to resolve this dichotomy. Process models combined with sensor data are used for
continuous monitoring of processes to detect, isolate, and diagnose faults, and for predicting fault
1. https://www.aspentech.com/en/products/msc/aspen-dmc3
progression. The obvious gains are prevention of costly downtimes through better planned
maintenance. For developing process models, data-driven/ML models have become more
popular due to the relative ease of implementation and model maintenance compared to
first-principles models.
Figure 1.2: A typical process flowsheet2 with flow (FI), temperature (TI), pressure (PI),
composition (analyzers), level (LI), power (JI), and vibration (VI) measurements.
Let's continue learning about ML-based plant health management by first taking a closer
look at the meaning of process faults and abnormalities.
2. Adapted from the original flowsheet by Gilberto Xavier (https://github.com/gmxavier/TEP-meets-LSTM) provided under Creative-Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).
As remarked before, faults entail unwanted deviations in process variables. The figure below
shows samples of data patterns that may be observed under the influence of process faults:
in one panel, the faulty variable shows a drift but remains within the 'normal operation range';
in the other, the faulty variable shows unusual spikes but remains within the 'normal operation range'.
Figure 1.4 shows why automated fault detection is not a very straightforward activity. Under
normal plant operation, process variables do not remain at fixed values but show stochastic
fluctuations and normal variations due to changing plant load, product grade, ambient
conditions, etc. Therefore, one can't just compare each process variable against some fixed
thresholds to ascertain the healthy state of the process. Modeling the multivariable
relationships among the plant variables becomes indispensable in most scenarios.
The takeaway message is that modern process plants are prone to multiple failures, and it
takes an 'army' to ensure reliable operations. In the industry 4.0 era, ML is being employed
as that 'army'. Before we look at the different ML models at our disposal, let's try
and understand what exactly an ML model is expected to do.
In the previous section we discussed in some detail the fault detection aspect of plant health
management. However, it is only a part of the journey towards reliable plant operation. Figure
1.5 shows the different milestones of the journey. As shown, fault detection is followed by fault
isolation or fault diagnosis wherein the objective is to identify the process variables that have
been affected by the fault or determine the underlying cause of the fault, respectively. For
example, for the valve malfunction problem illustrated in Figure 1.3, fault isolation pinpoints
the flow from the separator to the stripper as the faulty variable and fault diagnosis pinpoints
the valve stiction as the root cause of faulty behavior.
FDI vs FDD
In the process monitoring literature, you will find the acronyms FDI and FDD
very often. FDI stands for fault detection and isolation, and FDD stands for fault
detection and diagnosis. As alluded to before, although fault diagnosis is
different from fault isolation, it is often used (incorrectly) to refer to the task of
finding variables showing abnormal behavior.
Other terms that you may encounter are fault identification and fault
classification. While fault identification is same as fault isolation, fault
classification refers to categorizing/classifying a fault into one of several pre-
defined fault types.
Following FDI/FDD lies the task of fault prognosis, which entails forecasting the progression
of the identified fault. Fault prognosis helps determine the amount of time left before the
equipment affected by the fault needs to be taken out of service for maintenance or the whole
plant needs to be shut down for fault repair. For equipment-level monitoring, fault prognosis
provides what is popularly known as the remaining useful life (RUL). For example, for the
compressor bearing damage problem illustrated in Figure 1.3, the vibrations will only be
slightly higher than normal during the initial stages of crack development. However, with time
the crack grows, leading to greater and greater vibrations, and ultimately the compressor fails
or becomes too dangerous to operate. A good fault prognostic model can accurately estimate
the time left until the failure point of the compressor is reached.
The advancements in fault prognosis algorithms have popularized the concept of predictive
maintenance, wherein the plant management can plan well in advance the maintenance
schedule based on actual equipment/process health condition. As you can imagine, this
approach has obvious economic benefits (compared to time-based/preventive maintenance)
and, to nobody's surprise, has caught the fascination of process industry executives!
In the upcoming chapters, we will study in detail all the major aspects of PHM3 shown in
Figure 1.5 and learn how to implement end-to-end solutions.
3. Although the acronym 'PHM' is commonly used by the prognostic research community to refer to prognostics and health management, we will use it to denote 'plant health management' in this book.
As process data scientists, we have to live with the harsh truth that there is no single
universally good model for all occasions. One reason for this is that process data can show
different characteristics (such as nonlinearity, non-Gaussianity, dynamics, multimodality,
etc.) which necessitate selection of different modeling methodologies. Additionally, the
availability of historical faulty data, the user's end goal, and the type of installed sensors can
also influence model selection, as shown in Figure 1.6. This makes the task of correct
selection of an ML model daunting (and potentially overwhelming for beginner PDSs).
Fortunately, the recourse is an open secret and is as simple as having a good understanding
of your data and system, and conceptually sound knowledge of the pros and cons of the
available methods.
Figure 1.6: Sample factors that influence ML model selection for PM/PdM
Before you embark upon modeling your process system, you would already have knowledge
of the various factors listed in the above figure, except possibly the data characteristics.
We will study the techniques used to ascertain data characteristics in Chapter 3. Now that we
understand the factors that influence model selection, we are ready to see what models are
at our disposal.
Figure 1.7 below shows the modeling methodologies for process monitoring that we will cover
in this book. Fault detection and diagnosis are precursors to fault prognosis and therefore the
same methodologies are employed for building predictive maintenance solutions as well. As
the category topics show, these methods cater to process data with different characteristics.
The methods range from 'simple' traditional control charts to modern deep learning.
The category of MSPM (multivariate statistical process monitoring) methods (PCA, PLS,
GMM, etc.) deserves special attention as it has been the bedrock of health monitoring of
complex process plants. A large section of the book will therefore cater to these methods.
However, notwithstanding their popularity, MSPM methods have shortcomings. Therefore,
machine learning and deep learning models like autoencoders, LSTMs, and LOF are covered
as well.
In Figure 1.7, the modeling methodologies have been broadly divided into four categories, viz.,
univariate statistical models, multivariate statistical models, classical machine learning
models, and artificial neural network (ANN) models. Each of these categories is dealt with
in a separate part of the book. The statistical4 PM models extract a statistical model of the
system using past data. Within this category lie simple control-chart models that are used for
single-variable monitoring. Though useful, these univariate models are understandably too
restrictive to handle plantwide monitoring of complex industrial plants. On the other end of the
model spectrum lie complex deep-learning models that can theoretically handle any type of
process system; the downsides are cumbersome model training procedures and
hyperparameter optimization. In between these two extremes lie the MSPM methods, whose
ease of implementation and interpretable results have led to wide popularity. However, MSPM
methods tend to falter for highly nonlinear processes with complex data distributions.
Therefore, classical ML and deep learning methods have been receiving considerable
attention for process monitoring solution development for complex industrial processes.
The models in Figure 1.7 cater to the different scenarios that you may encounter in practice.
If you have abundant past faulty samples, then classification models such as FDA, SVM, ANN,
etc. can be employed. However, in the process industry, most of the time you will not have the
luxury of past faulty data; therefore, many of the fault detection techniques covered
in this book cater to this scenario. The figure below illustrates the different principles employed
to detect the presence of process faults using only NOC (normal operating condition) data
during model training.
4. In legacy process monitoring terminology, statistical process monitoring is also called statistical process control (SPC). Although SPC methods do not involve any feedback to the process controllers, the word 'control' signifies the objective of keeping the process 'in-control' through continuous monitoring.
[Figure 1.8 illustrates four broad principles for fault detection using only NOC data:

• Latent space projection: high-dimensional NOC data are assumed to lie along a lower-dimensional latent space. Projection-based methods (such as PCA, ICA, KPCA, etc.) project the original test sample into the latent space. The position of the test sample in the latent space and its reconstruction error are used to detect a process fault.

• Fitted boundary: the training data are assumed to provide an adequate representation of the region in the measurement space that NOC data are expected to lie in. Methods like Hotelling's T2 and SVDD can generate a boundary around the NOC samples and provide a measure of how far a test sample lies from the NOC boundary.

• Proximity/local density: the distance of a test sample from the neighboring NOC samples is used to infer its abnormality; the KNN method can be used for this. Alternatively, the local density of the region where the test sample falls can be used to classify the test sample as faulty or normal; the LOF method can be used for this.

• Residual generation: variables are categorized into predictor and response variables. A regression model (ANN, PLS, SVR, Random Forest, etc.) is fitted to capture NOC behavior, and prediction errors (or residuals) are generated. The residuals are monitored (using control charts, PCA, etc.) to detect the presence of faults.]

Figure 1.8: Popular fault detection methodologies using only NOC data
Note that the models in Figure 1.7 are applicable to both equipment-level monitoring and
plantwide monitoring. Let us now move to an overview of how these models are actually
developed.
In Figure 1.7, we saw different types of ML models for PM applications. Fortunately, the
workflow for model development and deployment remains similar across them, and is shown
in Figure 1.9. As is typical for an ML project, the tasks can be categorized into offline
computations and online/real-time computations. In online computations, process data are
parsed through the model to provide real-time insights and results. The models are built
offline using historical process data. This offline exercise is performed once or repeated at
regular intervals for model updates. Brief descriptions of the essential steps are provided below:
Figure 1.9: Steps involved in a typical ML model development for process monitoring
➢ Sample and variable selection: One does not simply dump all the available historical
data and sensor measurements into a model training module. If a model is being built
to identify normal process behavior, then care must be taken to include only
samples from fault-free operations in the model training dataset. Furthermore, if your
model does not handle dynamics, then data from periods of process transitions should
be excluded.

➢ Model training and validation: Model training implies estimating the parameters of the
chosen ML model, for example, the neuron weights in an ANN model. Model validation
is employed for finding optimal values of model hyperparameters, for example, the
number of neurons in the ANN model. At the end of this step, the coveted process
model is obtained; a simple sketch of this parameter/hyperparameter split follows below.
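To make the distinction concrete, the snippet below is an illustrative aside (with made-up data, not one of the book's case studies): model parameters are estimated on the training set, while the validation set is used to select a hyperparameter (here, the number of neurons).

# illustrative sketch: training estimates parameters, validation picks hyperparameters
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5)) # made-up historical process data
y = X @ rng.normal(size=5) + 0.1*rng.normal(size=500)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

best_score, best_n = -np.inf, None
for n_neurons in [2, 5, 10, 20]: # candidate hyperparameter values
    model = MLPRegressor(hidden_layer_sizes=(n_neurons,), max_iter=2000, random_state=42)
    model.fit(X_train, y_train) # training: estimate the neuron weights
    score = model.score(X_val, y_val) # validation: score this hyperparameter choice
    if score > best_score:
        best_score, best_n = score, n_neurons

print('selected number of neurons:', best_n)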
Additional activities related to the computation of health indicators and subsequent RUL
estimation involved in fault prognosis are covered in Part 7 of the book, which deals
specifically with prognostic techniques for predictive maintenance applications.
After you have developed a satisfactory PHM model, the real test of your solution lies in how
well it is received by the end-users. The end-users could be reliability personnel/engineers at
the local plant sites or a central team of experts remotely supervising the plants. Figure 1.10
below shows a (simplified) common architecture for bringing your tool's results to these end-
users. As shown, the ML model could be set up to run on local PCs at every site or on a central
server machine/cloud resource5. Plant operators may access the tool's results on the local
control-room screens or via web browsers in the case of centralized deployment. The web user
interface could be either built using third-party visualization software (Tableau, Sisense,
Power BI, etc.) or completely custom-built using front-end web frameworks like Bootstrap.
That is all it takes to deploy an ML-based PHM solution in a production environment. This
concludes our quick attempt to impress upon you the importance of process monitoring and
predictive maintenance in the process industry. Hopefully, you also now have a good idea of
what resources you have to achieve your PM/PdM goals and what it takes to build a PM/PdM
solution.
5. There exists a specialized branch of machine learning engineering, called MLOps (short for machine learning operations), that deals with reliable and scalable deployment of ML models in production.
Summary
In this chapter, we looked at the importance of plant health management for increasing
process safety, reducing downtime costs, and increasing equipment life. We understood the
meaning of process faults and abnormalities. We looked at the different stages of plant health
management, familiarized ourselves with the factors that influence model selection, and
looked at the different models at our disposal to achieve the PHM goals. We also
briefly looked at the generic workflow for process monitoring model development and
understood how PM/PdM solutions are deployed in modern industrial settings. In the next
chapter, we will take the first step and learn about the environment we will use to execute our
Python scripts containing ML code for PHM.
Chapter 2
The Scripting Environment
In the previous chapter, we studied the various aspects of machine learning-based process
monitoring and predictive maintenance. In this chapter, we will quickly familiarize ourselves
with the Python language and the scripting environment that we will use to write ML code,
execute it, and see results. This chapter won't make you an expert in Python but will give
you enough understanding of the language to get you started and help you understand the
several in-chapter code implementations in the upcoming chapters. If you already know the basics of
Python, have a preferred code editor, and know the general structure of a typical ML script,
then you can skip to Chapter 3.
The ease of using and learning Python, along with the availability of a plethora of useful
open-access packages developed by the user community over the years, has led to the
immense popularity of Python. In recent years, the development of specialized libraries for
machine learning and deep learning has made Python the default language of choice in the
ML community. These advancements have greatly lowered the entry barrier into the world of
machine learning for new users.
With this chapter, you are putting your first foot into the ML world. Specifically, the following
topics are covered:
• Introduction to Python language
• Introduction to Spyder and Jupyter, two popular code editors
• Overview of Python data structures and scientific computing libraries
Installing Python
One can download the official and latest version of Python from the python.org website.
However, the most convenient way to install and use Python is to install Anaconda
(www.anaconda.com), which is an open-source distribution of Python. Along with core
Python, Anaconda installs a lot of other useful packages. Anaconda comes with a GUI called
Anaconda Navigator (Figure 2.2) from where you can launch several other tools.
6. Most of the content of this chapter is similar to that of Chapter 2 of the book 'Machine Learning in Python for Process Systems Engineering' and has been reproduced with appropriate changes to maintain the standalone nature of this book.
Jupyter Notebooks are another very popular way of writing and executing Python code. These
notebooks allow combining code, execution results, explanatory text, and multimedia resources
in a single document. As you can imagine, this makes saving and sharing complete data
analyses very easy.
In the next section, we will provide you with enough familiarity with Spyder and Jupyter so that
you can start using them.
Figure 2.3 shows the interface7 (and its different components) that comes up when you launch
Spyder. These are the 3 main components:
• Editor: You can type and save your code here. Clicking the Run button executes the code
in the active editor tab.
• Console: Script execution results are shown here. It can also be used for executing
Python commands and interacting with variables in the workspace.
• Variable explorer: All the variables generated by running editor scripts or console
commands are shown here and can be interactively browsed.
Like any IDE, Spyder offers several convenience features. You can divide your script into cells
and execute any selected cell on its own (by pressing Ctrl + Enter). IntelliSense allows
you to autocomplete your code by pressing the Tab key. Extensive debugging functionalities
make troubleshooting easier. These are only some of the features available in Spyder. You
are encouraged to explore the different options (such as pausing and canceling script
execution, clearing out the variable workspace, etc.) on the Spyder GUI.
7. If you have used MATLAB, you will find the interface very familiar.
With Spyder, you have to run your script again to see execution results if you close and reopen
your script. In contrast to this, consider the Jupyter interface in Figure 2.4. Note that the
Jupyter interface opens in a browser. We can save the shown code, the execution outputs,
and explanatory text/figures as a (.ipynb) file and have them remain intact when we reopen the
file in Jupyter Notebook.
You can designate any input cell as code or markdown (formatted explanatory text). You
can press Shift + Enter to execute any active cell. All the input cells can be executed via
the Cell menu.
This completes our quick overview of the Spyder and Jupyter interfaces. You can choose either
of them for working through the code in the rest of the book.
In the current and next sections, we will see several simple examples of manipulating data
using Python and scientific packages. While these simple operations may seem unremarkable
(and boring) in the absence of any larger context, they form the building blocks of the more
complex scripts presented later in the book. Therefore, it will be worthwhile to give these
at least a quick glance.
Note that you will find '#' used a lot in these examples; these hash marks are used to insert
explanatory comments in code. Python ignores (does not execute) anything written after # on
a line.
Tuples are another sequence construct like lists, with the difference that their items and sizes
cannot be changed. Since tuples are immutable/unchangeable, they are more memory
efficient.
# creating tuples
tuple1 = (0, 1, 'two')
tuple2 = (list1, list2) # equals ([2, 4, 6, 8], ['air', 3, 1, 5])
A couple of examples below illustrate list comprehension, which is a very useful way of creating
new lists from other sequences.
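The following are representative examples, assuming list1 = [2, 4, 6, 8] as created earlier:

# list comprehension: build a new list from an existing sequence
squares = [x**2 for x in list1] # [4, 16, 36, 64]
evens_doubled = [x*2 for x in list1 if x % 2 == 0] # an optional condition filters items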
Note that Python indexing starts from zero. Very often, we need to work with multiple items of
a list. This can be accomplished easily via slicing, as shown below.
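The snippet below is a minimal illustration (again assuming list1 = [2, 4, 6, 8]):

# accessing multiple items of a list via slicing (list[start:stop])
print(list1[0:2]) # [2, 4]; items at index 0 and 1 (the stop index is excluded)
print(list1[::2]) # [2, 6]; every 2nd item
print(list1[-1]) # 8; negative indices count from the end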
# selectively execute code based on condition
if list1[0] > 0:
    list1[0] = 'positive'
else:
    list1[0] = 'negative'
# list1 becomes ['positive', 4, 6]

# compute sum of squares of numbers in list3
sum_of_squares = 0
for i in range(len(list3)):
    sum_of_squares += list3[i]**2

print(sum_of_squares) # displays 78
Custom functions
Previously, we used Python's built-in functions (len(), append()) to carry out operations pre-
defined for these functions. Python allows defining our own custom functions as well. The
advantage of custom functions is that we can define a set of instructions once and then re-
use them multiple times in our script and project.
For illustration, let's define a function to compute the sum of squares of the items in a list.

# custom function definition
def compute_sum_of_squares(input_list):
    sum_of_squares = 0
    for item in input_list:
        sum_of_squares += item**2
    return sum_of_squares

print(compute_sum_of_squares(list3)) # displays 78
You might have noticed in our custom function code above that we used
different indentation (number of whitespaces at the beginning of code lines) to
separate the 'for loop' code from the rest of the function code. This practice is
actually enforced by Python and will result in errors or bugs if not followed.
While other popular languages like C++ and C# use braces ({}) to demarcate a
code block (body of a function, loop, if statement, etc.), Python uses
indentation. You can choose the amount of indentation, but it must be
consistent within a code block.
This concludes our extremely selective coverage of Python basics. However, this should be
sufficient to enable you to understand the code in the subsequent chapters. Let's continue
now to learn about some specialized scientific packages.
While the core Python data structures are quite handy, they are not very convenient for the
advanced data manipulations we require for machine learning tasks. Fortunately, specialized
packages like NumPy, SciPy, and Pandas exist which provide convenient multidimensional
tabular data structures suited for scientific computing. Let's quickly make ourselves familiar
with these packages.
NumPy
In NumPy, ndarrays are the basic data structures, which put data in a grid of values.
The illustrations below show how 1D and 2D arrays can be created and their items accessed.
MLforPSE.com|24
Chapter 2: The Scripting Environment
# create a 1D array
arr1D = np.array([1,4,6])
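A 2D array can be created by passing a nested list. The values below are an assumption for illustration, chosen to be consistent with the arr2D results quoted later (a whole-array sum of 25 and broadcasting with the 3-item arr1D):

# create a 2D array (assumed values)
arr2D = np.array([[1, 4, 6],
                  [2, 5, 7]])
print(arr2D[0, 1]) # 4; item at row 0, column 1
print(arr2D[1]) # [2 5 7]; entire row 1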
Note that the concept of rows and columns does not apply to a 1D array. Also, you would have
noticed that we imported the NumPy package before using it in our script ('np' is just a short
alias). Importing a package makes all its functions and sub-packages available for use in our
script.
Executing arr2D.sum() returns the scalar sum over the whole array, i.e., 25.
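Sums along a single axis can be obtained via the axis argument; the expected outputs below assume the illustrative arr2D values sketched earlier:

print(arr2D.sum()) # 25; sum over the whole array
print(arr2D.sum(axis=0)) # [3 9 13]; column-wise sums
print(arr2D.sum(axis=1)) # [11 14]; row-wise sums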
# slicing
arr8 = np.arange(10).reshape((2,5)) # rearrange the 1D array into shape (2,5)
print((arr8[0:1,1:3]))
>>> [[1 2]]
print((arr8[0,1:3])) # note that a 1D array is returned here instead of the 2D array above
>>> [1 2]
An important thing to note about NumPy array slices is that any change made on a sliced view
modifies the original array as well! See the example below. This feature becomes quite handy
when we need to work on only a small part of a large array/dataset: we can simply work on a
leaner view instead of carrying around the large dataset. However, situations may arise where
we need to actually work on a separate copy of a subarray without worrying about modifying
the original array. This can be accomplished via the copy method.
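The snippet below is a minimal sketch illustrating both behaviors:

# a slice is a view: modifying it modifies the original array
arr9 = np.arange(10)
view = arr9[0:3]
view[0] = 100
print(arr9[0]) # 100; the original array changed

# use copy() to get an independent subarray
arr10 = np.arange(10)
subcopy = arr10[0:3].copy()
subcopy[0] = 100
print(arr10[0]) # 0; the original array is unaffected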
Fancy indexing is another way of obtaining a copy instead of a view of the array being indexed.
Fancy indexing simply entails using an integer or boolean array/list to access array items.
The examples below clarify this concept.
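The following is an illustrative sketch (with made-up values):

arr11 = np.array([10, 20, 30, 40, 50])
print(arr11[[0, 2, 4]]) # [10 30 50]; integer-array indexing returns a copy
print(arr11[arr11 > 25]) # [30 40 50]; boolean indexing returns a copy
fancy = arr11[[0, 1]]
fancy[0] = -1
print(arr11[0]) # 10; the original array is unaffected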
Vectorized operations
Suppose you need to perform element-wise summation of two 1D arrays. One
approach is to access the items at each index one at a time in a loop and sum them.
Another approach is to sum up the items at multiple indexes at once. The latter
approach is called a vectorized operation and can lead to a significant reduction in
computational time for large datasets and complex operations.
# vectorized operations
vec1 = np.array([1,2,3,4])
vec2 = np.array([5,6,7,8])
vec_sum = vec1 + vec2 # returns array([6,8,10,12]); no need to loop through index 0 to 3
Broadcasting
Consider the following summation of arr2D and arr1D arrays
# item-wise addition of arr2D and arr1D
arr_sum = arr2D + arr1D
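Under the arr2D values assumed earlier, the 1D array is 'stretched' across each row of the 2D array (see footnote 8 for the full broadcasting rules):

print(arr_sum)
# [[ 2  8 12]
#  [ 3  9 13]]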
Pandas
Pandas is another very powerful scientific package. It is built on top of NumPy and offers
several data structures and functionalities which make (tabular) data analysis and pre-
processing very convenient. Some noteworthy features include label-based slicing/indexing,
(SQL-like) data grouping/aggregation, data merging/joining, and time-series functionalities.
Series and dataframe are the 1D and 2D array like structures, respectively, provided by
Pandas
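For instance, the series and dataframe used in the examples below can be created as follows; the exact values here are assumptions chosen to match the outputs shown in the data-access examples:

import pandas as pd

# create a series and a dataframe (assumed values)
s = pd.Series([10, 8, 6])
df = pd.DataFrame({'id': [1, 1, 1], 'value': [10, 8, 6]})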
Note that s.values and df.values convert the series and dataframe into corresponding NumPy
arrays.
8. numpy.org/doc/stable/user/basics.broadcasting.html
Data access
Pandas allows accessing rows and columns of a dataframe using labels as well as integer
locations. You will find this feature pretty convenient.
# column(s) selection
print(df['id']) # returns column 'id' as a series
print(df.id) # same as above
print(df[['id']]) # returns specified columns in the list as a dataframe
>>> id
0 1
1 1
2 1
# row selection
df.index = [100, 101, 102] # changing row indices from [0,1,2] to [100,101,102] for illustration
print(df)
>>> id value
100 1 10
101 1 8
102 1 6
print(df.loc[101]) # returns 2nd row as a series; can provide a list for multiple rows selection
print(df.iloc[1]) # integer location-based selection; same result as above
Data aggregation
As alluded to earlier, Pandas facilitates quick analysis of data. Check out one quick example
below for group-based mean aggregation
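The original example did not survive extraction intact; the snippet below is a hedged sketch on a small illustrative dataframe (values assumed):

# group-based mean aggregation (illustrative values)
df2 = pd.DataFrame({'id': [1, 1, 2, 2], 'value': [10, 8, 22, 26]})
print(df2.groupby('id').mean())
#     value
# id
# 1     9.0
# 2    24.0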
File I/O
Conveniently reading data from external sources and files is one of the strong fortes of Pandas.
Below are a couple of illustrative examples.
# reading from excel and csv files
dataset1 = pd.read_excel('filename.xlsx') # several parameter options are available to customize what data is read
dataset2 = pd.read_csv('filename.csv')
This completes our very brief look at Python, NumPy, and Pandas. If you are new to Python
(or coding), this may have been overwhelming. Don't worry. Now that you are at least aware
of the different data structures and ways of accessing data, you will become more and more
comfortable with Python scripting as you work through the in-chapter code examples.
Sklearn
Sklearn (scikit-learn) is currently the most popular library for machine learning in Python. It
provides several modules to conveniently handle different aspects of ML such as data
standardization, performance scoring, data splitting, etc. You do not need to implement ML
models such as PCA, random forests, GMMs, etc., from scratch; Sklearn provides ready-to-use
implementations of popular ML models. As you will see in the upcoming chapters, ML models
can be defined and fitted in a single line of code using the Sklearn library.
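As a small illustrative taste (the data here are a random placeholder), a PCA model can be defined and fitted in essentially one line:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.normal(size=(100, 5)) # placeholder dataset
scores = PCA(n_components=2).fit_transform(X) # define and fit a PCA model in one line
print(scores.shape) # (100, 2)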
Summary
In this chapter, we made ourselves familiar with the scripting environment that we will use for
writing and executing PHM scripts. We looked at two popular IDEs for Python scripting, the
basics of Python language, and learnt how to manipulate data using NumPy and Pandas. In
the next chapter, you will learn a few techniques for exploratory data analysis that you can
use to get some insights about your data which can help you select your ML model judiciously.
Chapter 3
Exploratory Data Analysis: Getting to Know Your Data Better
Getting to know your enemy is a time-tested strategy for emerging victorious in any
battle. For developing a satisfactory process monitoring model, this strategy
translates to 'knowing your process data well'. This task is formally termed as
translates to ‘knowing your process data well’. This task is formally termed as
exploratory data analysis (EDA). Most of the machine learning models make some
assumptions regarding the distribution (e.g., Gaussian vs uniform distribution) and
characteristics (e.g., dynamic vs steady state nature) of the data they operate upon.
Therefore, it serves us well to invest some time in EDA so that the consistency between
our chosen model's assumptions and the characteristics of the process data at hand can be
ascertained. Failure to do so will lead to a high rate of false alerts and/or missed/delayed fault
detections, which will most likely lead to the 'death' of your monitoring tool due to loss of user
confidence!
In this chapter, we will learn how to assess the presence of four important properties in a
dataset, viz., nonlinearity, non-Gaussianity, dynamics, and multimodality. We will motivate the
study of these properties by understanding their impact on process monitoring performance.
We will especially focus on techniques that lend themselves to convenient implementation
in an automated setting. As is obvious, the concepts learnt in this chapter will help you get
better at correct model selection. Specifically, the following topics are covered:
• assessing the presence of nonlinearity in data
• assessing data Gaussianity
• assessing process dynamics
• assessing multimodality
Ask any expert process data scientist for advice on getting better at ML model selection
and you will very likely get the suggestion to understand your data better. It's true; you cannot
overstate the importance of gathering as many insights about the data as possible
before getting your hands dirty. Most process monitoring methodologies make
assumptions about the data characteristics and therefore, it is imperative to ascertain these
characteristics in our process data to ensure selection of an appropriate monitoring model. Let's
consider the classical PCA model (inarguably the most popular model for monitoring
multivariate industrial processes): the ideal dataset is linear, Gaussian distributed, single-
clustered, and with no dynamics; Figure 3.1 uses simple datasets to illustrate what
deviations from these ideal characteristics look like.
[Figure 3.1: simple 2D datasets illustrating deviations from the ideal process data characteristics: nonlinearity, non-Gaussianity, multimodality, and dynamics (autocorrelation function plot).]
To further motivate the discussions in the rest of the chapter, let’s take a quick look at the
impacts the non-ideal data characteristics can have on PCA performance.
Effect of nonlinearity
In an ideal PCA-compatible dataset, the variables are linearly related, which allows the
standard PCA method to find the lower-dimensional manifold along which the data are
distributed. However, as can be seen below, PCA fails to transform the original 2D dataset to
a 1D feature space even though it is apparent that the original data points lie along a curved
manifold. This severely limits the ability of standard PCA to detect faulty samples.
Effect of non-Gaussianity
Not many process modelers pay attention to the Gaussianity of a process dataset. However,
the presence of non-Gaussianity can severely impact the performance of many ML models.
The illustration below shows why standard PCA-based fault detection9 may fail for non-Gaussian
data.
Effect of dynamics
While PCA excels at extracting static relationships among process variables, it fails to
capture any dynamic relationships. For example, consider the following process and the
corresponding data distribution in the 2D measurement space.
9. Do not worry if you do not understand the PCA jargon. You will become familiar with these terms in Chapter 7.
Don't worry if the PCA-specific jargon in the above illustrations was not immediately clear to
you. You may re-visit this section after we finish the study of the PCA technique in Chapter 7.
Next, we will take each of the four non-ideal characteristics and learn how to quickly assess
its presence in a multivariable dataset.
10. https://github.com/afraniomelo/KydLIB
11. Melo et al., Open benchmarks for assessment of process monitoring and fault diagnosis techniques: A review and critical analysis. Computers and Chemical Engineering, 2022.
12. https://github.com/vickysun5/SmartProcessAnalytics
13. Sun and Braatz, Smart process analytics for predictive modeling. Computers and Chemical Engineering, 2021.
If any two variables in a multivariable dataset are nonlinearly related, then the dataset is said
to possess nonlinear characteristics. Therefore, one may look at scatter plots between each
pair of variables to assess nonlinearity, but this, expectedly, becomes cumbersome for high-
dimensional datasets. One common recourse is to use the Pearson correlation coefficient (ρ),
which quantifies the linear dependency between two variables. As is commonly known, ρ ∈
[−1, 1], where values of −1 and 1 signify perfect linear correlation. A drawback, however, is
that ρ = 0 only indicates no linear relationship; the variables could still be nonlinearly
related. Figure 3.2 shows some illustrative scenarios.
[Figure 3.2 shows three illustrative scatter plots with, from left to right: $\rho_{x,y} \approx 0,\ r_{x,y} \approx 0$; $\rho_{x,y} = 0.95,\ r_{x,y} \approx 0$; $\rho_{x,y} = -0.3,\ r_{x,y} = 0.7$.]

$$\rho_{x,y} = \frac{Cov(x,y)}{\sqrt{var(x)}\sqrt{var(y)}} = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i-\bar{x})^{2}}\sqrt{\sum_{i=1}^{N}(y_i-\bar{y})^{2}}}; \qquad N \text{ is the number of samples}$$

Figure 3.2: Linear ($\rho_{x,y}$) and nonlinear ($r_{x,y}$) correlation coefficients [$r_{x,y}$ is explained below]
A popular metric to quantify generic dependency (both linear and nonlinear) is mutual
information (MI), which is defined as

$$MI(x,y) = H(x) + H(y) - H(x,y)$$

where $H(x)$ is the information entropy of variable x and $H(x,y)$ is the joint information entropy
between x and y. The ennemi14 Python package allows convenient computation of MI; the
package provides a normalized index, termed the MI correlation coefficient, which varies
between 0 (no relationship) and 1, and is defined as

$$\rho_{I(x,y)} = \sqrt{1 - e^{-2\,MI(x,y)}}$$
14. https://pypi.org/project/ennemi/
Some categories of ML methods that can handle nonlinear data include:
• Proximity-based methods: inference about an incoming test sample is based upon its distances from historical data points or local densities.
• Kernelized SVMs: the kernel trick is used to project nonlinear data into a feature space where linear segregation becomes possible.
• Decision trees, forests: the input space is broken down into homogeneous smaller regions through a series of binary decisions.
The MI correlation coefficient still does not tell us if two variables are strictly nonlinearly
related. Towards this end, Zhang et al.15 provided a useful approach combining $\rho$ and $\rho_I$ to
generate a nonlinearity coefficient16 $r_{x,y} \in [0, 1]$.
15. Zhang et al., A Novel Strategy of the Data Characteristics Test for Selecting a Process Monitoring Method Automatically. Industrial & Engineering Chemistry Research, 2016.
16. $\rho_{I(x,y)}$ (and, therefore, $r_{x,y}$) may show near-zero negative values (https://polsys.github.io/ennemi/potential-issues.html) which can be interpreted as zero.
Consider an illustrative three-variable system driven by a latent variable t (with noise terms e1, e2, e3):

$$x_1 = t + e_1, \qquad x_2 = t^{3} - 3t + e_2, \qquad x_3 = -t^{4} + 3t^{2} + e_3$$

The pairwise nonlinearity and linear correlation matrices (rows/columns ordered as x1, x2, x3) are

$$r_{x,y} = \begin{bmatrix} \mathrm{nan} & 0.61 & 0.78 \\ 0.61 & \mathrm{nan} & 0.06 \\ 0.78 & 0.06 & \mathrm{nan} \end{bmatrix} \qquad \rho_{x,y} = \begin{bmatrix} 1 & 0.37 & -0.2 \\ 0.37 & 1 & -0.94 \\ -0.2 & -0.94 & 1 \end{bmatrix}$$

As expected, the nonlinearity matrix indicates significant nonlinearity in the system (due
to the high nonlinearity coefficient values for the x1-x2 and x1-x3 pairs). The scatter plots
below clearly corroborate this and also clarify why the x2-x3 pair appears 'linearly' related.
For the entire dataset, Zhang et al. also proposed an overall nonlinear correlation coefficient
defined as

$$r = \sqrt{\frac{\sum_{i,j=1}^{m} r_{i,j}^{2}}{m^{2}-m}}, \qquad m \text{ being the number of variables}$$
Since linear methods can provide satisfactory monitoring performance for linear and weakly
nonlinear datasets, Zhang et al. suggest a threshold between 0.3 and 0.5 for making the
decision to use nonlinear models.
Industrial processes often operate around some optimal condition. Even though
the actual process may be very complex and nonlinear, the data may exhibit only
linear relationships among the variables. Therefore, do not jump the gun on
choosing a nonlinear model; perform the nonlinearity check and then make an
educated selection of an ML model for your system.
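As a hedged illustration of such a pairwise check, the sketch below computes the Pearson and MI-based correlation coefficients for one nonlinearly related variable pair. The data generation is made up, and the exact ennemi function usage (estimate_corr) is an assumption to verify against the package documentation.

# sketch of a pairwise nonlinearity check (ennemi API usage is an assumption)
import numpy as np
from ennemi import estimate_corr

rng = np.random.default_rng(0)
t = rng.uniform(-2, 2, 1000)
x1 = t + 0.1*rng.normal(size=1000)
x2 = t**3 - 3*t + 0.1*rng.normal(size=1000) # nonlinearly related to x1

rho = np.corrcoef(x1, x2)[0, 1] # linear (Pearson) correlation
rho_I = float(estimate_corr(x2, x1)) # MI-based correlation coefficient
print(f'Pearson = {rho:.2f}, MI correlation = {rho_I:.2f}')
# a small |Pearson| together with a large MI correlation flags a nonlinear relationship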
We saw in Section 1 that wrong assumptions about the Gaussianity of process data can lead to
incorrect determination of the NOC envelope. For multivariate Gaussianity assessment, an
easy-to-implement method proposed by Zhang et al. will be presented here. The method uses
the fact that if data are multivariate Gaussian distributed, then the squared Mahalanobis distance
(D) of the samples from the center follows a (scaled) F distribution17, as shown below18:

$$D(x) = (x-\bar{u})^{T} S_{cov}^{-1} (x-\bar{u}) \sim \frac{m(N^{2}-1)}{N(N-m)}\, F(m, N-m) \qquad \text{(eq. 4)}$$
[Illustration: N = 1000 samples drawn from the bivariate Gaussian $X \sim N\!\left(\begin{bmatrix}4\\4\end{bmatrix}, \begin{bmatrix}5 & 4\\4 & 6\end{bmatrix}\right)$ yield a histogram of D that matches the density of $\frac{2(1000^{2}-1)}{1000 \times 998}\, F(2, 998)$.]
17. $F(m, N-m)$ implies an F distribution with m and N−m degrees of freedom.
18. Note that if S is singular, then m is replaced by the rank of S and $S^{-1}$ by the pseudo-inverse of S in Eq. (4).
[Figure: univariate and multivariate Gaussian probability density functions, with the mean $\mu$ and the $\pm 3\sigma$ range marked on the univariate density:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}} \qquad\qquad f(x) = \frac{1}{\sqrt{(2\pi)^{m}\,|\Sigma|}}\, e^{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)}$$]
Once the Mahalanobis distances are available, the empirical cumulative distribution function
(CDF) is compared against the CDF of the expected F distribution. The algorithm below
summarizes the approach.
➢ Generate a fractile-fractile plot of statistic D's empirical distribution and the expected F distribution:
• Values of D are sorted such that $D_{(1)} \leq D_{(2)} \leq \cdots \leq D_{(N)}$.
• The empirical CDF of D can be described as $CDF(D_{(t)}) \approx \frac{t-0.5}{N} = r_t \; (t = 1, 2, \cdots, N)$. Here, $D_{(t)}$ is the fractile corresponding to probability $r_t$, and $F_t$ denotes the corresponding fractile of the expected F distribution.

➢ Plot $D_{(t)}$ vs $F_t$ for $t = 1, 2, \cdots, N$ and fit a straight line. If the data follow multivariate Gaussianity, then the fitted line should pass through the origin with slope equal to 1:
• $\frac{S}{\bar{F}} < 0.15 \Rightarrow$ the linear fit is not rejected, where $\bar{F} = \frac{1}{N}\sum_{t} F_t$ and $S = \sqrt{\frac{SSE}{N-2}}$.
• If the above condition is met, then the intercept and slope are compared against 0 and 1, respectively, via the following conditions.

➢ If all the above three conditions are met, then a multivariate Gaussian distribution is assumed.
The example below shows the implementation steps19 of the above algorithm.
𝑥3 = 𝑥1 + 𝑥2 + 𝑒
19 Code adapted from the KydLIB package (https://github.com/afraniomelo/KydLIB/)
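A minimal sketch of the algorithm (my own illustration of the steps, not the KydLIB implementation referenced above) might look like:

import numpy as np
from scipy import stats

np.random.seed(100)
N, m = 1000, 3
x1 = np.random.normal(size=N)
x2 = np.random.normal(size=N)
X = np.column_stack((x1, x2, x1 + x2 + np.random.normal(scale=0.1, size=N))) # x3 = x1 + x2 + e

# squared Mahalanobis distances of the samples from the data center
u = X.mean(axis=0)
S_inv = np.linalg.pinv(np.cov(X, rowvar=False)) # pseudo-inverse guards against singular S
D = np.array([(x - u) @ S_inv @ (x - u) for x in X])

# empirical fractiles of D vs fractiles of the scaled F distribution
D_sorted = np.sort(D)
r_t = (np.arange(1, N + 1) - 0.5)/N
F_t = m*(N**2 - 1)/(N*(N - m))*stats.f.ppf(r_t, m, N - m)

# fit a straight line and evaluate the fit-quality condition
slope, intercept = np.polyfit(F_t, D_sorted, 1)
SSE = np.sum((D_sorted - (intercept + slope*F_t))**2)
S = np.sqrt(SSE/(N - 2))
print('S/F_bar =', S/np.mean(F_t), '| intercept =', intercept, '| slope =', slope)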
In a dynamic dataset, the recorded samples are serially correlated, i.e., the current measurements are not independent of the past measurements. We saw in Section 1 that models designed to work with steady-state data can fail to capture the temporal relationships, which leads to poor fault detection performance. The unmodeled serial correlations show up in the model residuals as illustrated in Figure 3.4. The presence of serial correlations in model residuals is assessed through the autocorrelation function (ACF), which computes the linear correlation coefficients of the residual time series at different lags. If no significant dynamics are present, then the ACF values should lie close to 0 for lags > 0. This rationale is used by Sun & Braatz in the SPA package to automatically decide the need for a dynamic model.
Figure 3.4: Serial correlations in model residuals upon usage of static models
The previous approach entails creation of a static model to generate residuals. An alternative approach, followed by Melo et al. in the KydLIB package, applies the ACF to the raw measurements themselves rather than the residuals. The 'time necessary' to reach an autocorrelation coefficient value of 0.5 is plotted for all the variables to assess the dynamic behavior of the dataset. A low mean or median time required to reach 0.5 autocorrelation signifies a low level of dynamics in the data.
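A minimal sketch of this style of check (X is an assumed N×m data matrix with variables in columns; the helper name lag_to_reach is my own):

import numpy as np
from statsmodels.tsa.stattools import acf

def lag_to_reach(x, threshold=0.5, nlags=100):
    """First lag at which the ACF of series x drops below the threshold."""
    rho = acf(x, nlags=nlags, fft=True)
    below = np.where(rho < threshold)[0]
    return below[0] if below.size else nlags

lags = [lag_to_reach(X[:, j]) for j in range(X.shape[1])]
print('median lag to reach ACF = 0.5:', np.median(lags))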
System identification techniques such as ARIMAX are used to model input-output time series data.
The most common approach for determining the presence of multimodality in data is to divide
the dataset into distinct clusters and look for the optimal number of clusters. Techniques such
as k-means clustering and Gaussian mixture modeling exist for data clustering. We will defer
a detailed discussion on these techniques to Chapter 11. Nonetheless, we show in the
example below how the dataset shown in Figure 3.1 can be assessed for multimodality. In
Chapter 16, you will see a neural network-based visualization technique called Self-Organizing Maps that can provide a 2D view of a high-dimensional dataset and, therefore, help check for the presence of clusters.
Let’s see how we can automatically find the number of operation modes/clusters.
# check for clusters via Gaussian Mixture Modeling using Bayesian Information Criterion
import numpy as np
from sklearn.mixture import GaussianMixture

BICs = []
lowestBIC = np.inf
for n_cluster in range(1, 5):
    gmm = GaussianMixture(n_components = n_cluster, random_state = 100).fit(data)
    BIC = gmm.bic(data) # check online code to see how x1 and x2 vectors are generated
    BICs.append(BIC)
    if BIC < lowestBIC: # keep track of the cluster count with the lowest BIC
        lowestBIC = BIC
        optimal_n_cluster = n_cluster

print(optimal_n_cluster)

>>> 2
In this case study, we will bring together the concepts learnt to analyze a celebrated dataset, the Tennessee Eastman Process (TEP) dataset. If you work in the area of process monitoring, then you are very likely to encounter research papers using the TEP dataset to demonstrate the efficacy of their FDD techniques. Therefore, let's use this dataset to put the concepts learnt into practice. Melo et al. have analyzed the nonlinearity, non-Gaussianity, and dynamics characteristics of this dataset. We will try to reproduce those results and also assess the presence of a multimodal distribution.
The TEP dataset20 comes from the simulation of a large-scale industrial chemical plant. The plant consists of several unit operations: a reactor, a condenser, a separator, a stripper, and a recycle compressor21. There are 22 continuous process measurements, 19 composition measurements, and 11 manipulated variables. The dataset contains training and test data from a normal operation period and 21 faulty periods with distinct fault causes. The NOC training
20 Available at https://github.com/camaramm/tennessee-eastman-profBraatz
21 Detailed information about the process and the faults can be obtained from the original paper by Downs and Vogel titled 'A plant-wide industrial process control problem'.
data file (d00.dat) contains 500 samples. Since this file is often used to build models, we will analyze it in detail. We will exclude the composition measurements from our analysis. Figures 3.5 and 3.6 show the process flowsheet22 and the line plots of the NOC data, respectively.
22 Adapted from the original flowsheet by Gilberto Xavier (https://github.com/gmxavier/TEP-meets-LSTM) provided under the Creative-Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).
Figure 3.6: Line plots of continuous process measurements and manipulated variables
The plots below show the heat maps for the different correlation coefficients. The code for the computation of the coefficients is the same as that shown in Section 3.2 and is therefore not reproduced here23. The overall linearity and nonlinearity coefficients come out to be 0.21 and 0.20, respectively, indicating low levels of both linear and nonlinear correlations.
23 The complete code used for the TEP data assessment can be obtained from the GitHub repository
The fractile-fractile plot and the fitted straight line for the Gaussianity assessment are shown
below. Based on the statistics obtained, the dataset is assessed to be non-Gaussian.
Intercept: −2.219;    Slope: 1.145;    F̄: 35.482;    S/F̄: 0.011
The plot below shows the 'lag necessary' to reach an autocorrelation coefficient value of 0.5 for each of the 33 variables. The low value of the median lag required signifies an overall low level of dynamics in the data.
The BIC curve from GMM modeling of the TEP data indicates that the data is unimodally distributed.
The above exercise indicates that the analyzed TEP dataset exhibits linear, static, non-Gaussian, and unimodal characteristics. Correspondingly, you will see an application of the ICA
technique (which is well-suited for such datasets) employed for TEP fault detection in Part 3
of the book.
Summary
In this chapter, we looked at methods to assess the different characteristics of process data;
specifically, we focused on the presence of nonlinearity, non-Gaussianity, dynamics, and
multi-modality. We also investigated the characteristics of process data from the Tennessee Eastman chemical plant simulation, which mimics a real-scale complex industrial system. With the techniques learnt in this chapter, you will no longer be working in the dark
while selecting a model suitable for your dataset. In the next chapter, we will look at some
more guidelines on best practices for ML model development for plant health management.
Chapter 4
Machine Learning for Plant Health
Management: Workflow and Best Practices
Whether you are building a ML solution for fault detection, fault classification, or fault
prognosis, model development is the most critical task. Inarguably, obtaining a good ML model is not trivial. You cannot obtain a good model by just dumping all the available raw data into an off-the-shelf machine learning module. Incorrectly specify one hyperparameter and your model will return garbage results; provide an insufficiently 'rich' training dataset and even the most carefully chosen ML model will prove incapable of providing
meaningful insights. Unfortunately, an automated procedure for ML model development that
works for all types of problems does not exist. Nonetheless, there is no cause for despair. The
trick to successful model development lies in being actively involved in the several model
development stages and making use of several useful guidelines that the ML community has
come up with over the years. We already saw in the previous chapter the importance of
acquiring a good understanding of data for correct model selection. In this chapter, we will
learn several other guidelines and the best practices.
We will not cover the best practices associated with the generic machine learning workflow. Concepts like feature extraction, feature engineering, cross-validation, regularization, etc., have already been covered in detail in the first book of the series. However, we will touch upon topics that are specific to plant health management applications. In this chapter, our focus will be on aspects that you should not ignore to ensure that you are not unknowingly setting your model up for failure. Specifically, we will cover these topics:
• ML model development workflow
• Data selection and pre-processing to obtain good training dataset
• Assessment of monitoring performance
• Best practices for model selection and tuning
The prime objective of the ML modeling task for building process monitoring solutions is to obtain a model that provides high fault sensitivity (i.e., the model is able to detect process faults at incipient stages) and a low false alarm rate (i.e., the model does not report a process fault when the process is operating normally). Balancing the trade-off between these two requirements is not easy and requires careful attention to varied aspects during model development. It definitely takes more than just executing a 'model = <some ML_model>.fit()' command on the available data. In Chapter 1, we saw an overview of the typical steps involved in a ML model development exercise. In this chapter, we will look at the different components of the workflow in more detail. Figure 4.1 lists the subtasks that we will touch upon. While separate books can be written on each of these subtasks, we will focus on the aspects that may get overlooked by an inexperienced process data scientist.
(Figure 4.1: Subtasks of the PHM model development workflow; for example, data pre-processing includes data balancing, noise and outlier removal, and feature engineering and extraction)
The workflow starts with data collection. Although most often you won’t have much control
over the data (process data collected in historical database) that is available for model
building, you can take a few steps to ensure that the ‘right’ data get passed down the workflow.
For example, you can ensure that only the NOC samples that are representative of the
process conditions that will be monitored are selected for further processing; this ensures that
the model learns the correct fault-free process behavior. Thereafter, you move on to the exploratory data analysis step wherein you get a 'feel' of the data. In the previous chapter, we saw in great detail how EDA can help us select the right class of model. In the next step, data are pre-processed to increase the information content in the data. This step can involve data cleaning (e.g., removing outliers and noise), data transformation (generating new features and/or modifying existing ones), etc. After this step, we are ready to fit our chosen model.
A practical guideline for model selection is Occam's razor, which advises selection of the simplest model that meets the modeling objectives. In Chapter 1, we saw several factors that influence model selection. These factors, along with inferences from the EDA step, will help you shortlist your candidate models. After model fitting, an evaluation of the model's fault detection and fault diagnosis performance is conducted. A diverse set of metrics has been devised to quantify different aspects of FDD performance. We will look at these metrics in this chapter. The last step before you obtain your coveted model is model tuning, where you play with your model's hyperparameters. We will study how you can use cross-validation for this step. With this broad overview of the workflow, let's take a closer look into some of the aspects.
At the 'data selection' step of the ML model development workflow, as a process modeler, you need to take care that you are not accidentally providing a dataset that 'confuses' the ML model! To understand this further, let's take a look at a couple of example scenarios.
Any model will struggle to adequately describe the furnace temperatures during the last 1
year. The model-plant mismatch will add significantly to the ‘noise’ in the dataset which will
severely impact the fault sensitivity of the final model. The model performance is bound to be
disappointing. A sensible approach would be to use only the last 1 year of data as the training
data.
The takeaway message is that if you are not an ‘insider’ to the plant operations
then it would serve you well to sit down with the plant manager (and plant
engineer) to obtain as much information about the process as possible; use the
information gained to curate the training dataset.
As alluded to in Chapter 1, the overall objective of this step is to increase the 'information content' of the training dataset so that the model's ability to distinguish between normal and faulty operations is bolstered. For example, a model's performance will suffer if the training dataset is corrupted with outliers or does not have an adequate number of faulty samples. During the pre-processing step, the training dataset is treated to remove these deficiencies. Let's look at a couple of deficiencies that easily get overlooked (or not handled properly) in the pursuit of 'quick' results, leading to misleading inferences.
Handling outliers
In simple words, outliers are corrupted training samples that are not consistent with the normal
training samples. As should be obvious, outliers can ‘confuse’ models during extraction of the
true normal behavior of the process from training data. The illustration below shows how
unhandled outliers can lead to failures going undetected.
(Illustration: outliers in the training data lead to incorrect specification of the alert threshold for the PCA-based abnormality metric, allowing failures to go undetected)
In Chapter 4 of Book 1 of this series, we covered in detail several techniques for the removal of outliers; interested readers are referred to it for more details. For the dataset in the above illustration, the correct modeling approach would be to either explicitly remove the outliers prior to model fitting, or implement robust PCA (which is less susceptible to outliers) and choose control limits with an appropriate significance level (such as a 90% control limit instead of 95%).
In a high-dimensional dataset, it is not trivial to detect the presence of outliers from just a 'quick glance' and, therefore, you may be tempted to skip outlier handling. However, as a best practice, your default assumption must always be that the training dataset is potentially infested with outliers, and appropriate measures, as deemed suitable for the system at hand, must be put in place to handle them.
(Illustration: data imbalance — the number of fault-free samples in the training data far exceeds the number of faulty samples)
How can data imbalance be rectified? One trivial strategy could be to discard most of the NOC samples (also termed undersampling), but this obviously leads to loss of crucial process information. The other approach would be to add more faulty samples to the training dataset, but plant managers will not let you inject faults into their processes just so that your model can be trained. A popular mechanism to handle data imbalance is SMOTE (synthetic minority oversampling technique), wherein synthetic faulty samples are generated. The basic idea is illustrated below. Here, a random faulty sample (A) is selected and a synthetic data point is generated as illustrated. This procedure is repeated to create as many synthetic faulty samples as desired.
(Illustration: a new synthetic faulty data point is randomly selected along the line connecting sample A and its randomly chosen neighbor)
Example 4.1: We will generate an imbalanced dataset of NOC and faulty samples. Then
we will learn how to use the imbalanced-learn package to balance the dataset.
# generate data
import numpy as np

cov = np.array([[6, -3], [-3, 2]])
pts_NOC = np.random.multivariate_normal([0, 0], cov, size=500)
cov = np.array([[1, 1], [1, 2]])
pts_Faulty = np.random.multivariate_normal([5,2], cov, size=25)
X = np.vstack((pts_NOC, pts_Faulty))
y = np.vstack((np.zeros((500,1)), np.ones((25,1)))) # labels [0=>NOC; 1=>Faulty]
We will oversample faulty samples using SMOTE and specify the ratio of faulty to fault-
free samples to be 0.33.
# Oversampling
from imblearn.over_sampling import SMOTE

overSampler = SMOTE(sampling_strategy=0.33)
X_smote, y_smote = overSampler.fit_resample(X, y)
A common practice is to augment SMOTE with undersampling of the NOC samples. The RandomUnderSampler class of the imblearn package can be utilized for this. For Example 4.1's dataset, we will increase the ratio of faulty to fault-free samples to 0.5 as shown below.
# Undersampling
from imblearn.under_sampling import RandomUnderSampler
underSampler = RandomUnderSampler(sampling_strategy=0.5)
X_balanced, y_balanced = underSampler.fit_resample(X_smote, y_smote)
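As a quick check, the class counts before and after resampling can be inspected (a small usage sketch; the expected counts follow from the sampling ratios specified above):

from collections import Counter
print('before:', Counter(y.ravel()))         # 500 NOC vs 25 faulty samples
print('after:', Counter(y_balanced.ravel())) # expected ~330 NOC vs ~165 faulty samples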
Now that your training dataset has been enriched, you can expect your fault classifier to
perform much better!
Post model fitting, we need some means of assessing the model performance to quantify how well the model detects faults. The most straightforward metric is accuracy, which simply communicates the ratio of correct predictions to the total number of predictions. Accuracy, however, can be misleading for imbalanced datasets. For example, if your model always trivially predicts 'no fault' and your dataset has 95% NOC samples, then the model's accuracy will be 95%, which is deceptively good! Therefore, several other metrics have been devised that quantify different aspects of the fault detection capabilities of a model. Before we introduce these metrics, let's first understand the confusion matrix. As shown below, a confusion matrix for a fault detection (binary classification) model provides a comprehensive overview of how well the model has correctly (and incorrectly) classified samples belonging to the positive (faulty) and negative (normal/fault-free) classes.
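For reference, the layout of a binary fault-detection confusion matrix is sketched below (the abbreviations reappear in the metric formulas that follow: AP = actual positives = TP + FN; AN = actual negatives = FP + TN):

                        Actual faulty (AP)      Actual normal (AN)
Predicted faulty        TP (true positive)      FP (false positive)
Predicted normal        FN (false negative)     TN (true negative)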
Using the terms from the confusion matrix, the table below summarizes the varied
performance metrics.
False Alarm Rate (FAR) or False Positive Rate (FPR):    FAR = FP/AN = FP/(FP + TN)
Fault Detection Rate (FDR) or True Positive Rate (TPR) or Sensitivity or Recall:    FDR = TP/AP = TP/(TP + FN)
Missed Detection Rate or False Negative Rate (FNR):    FNR = FN/AP = FN/(FN + TP)
Precision or Positive Predictive Value (PPV):    Precision = TP/(TP + FP)
F1 Score:    F1 = 2/(1/Precision + 1/Recall)
Accuracy (ACC):    ACC = (TP + TN)/(TP + TN + FP + FN)
You will commonly encounter the terms precision and recall in the anomaly detection literature. The precision metric returns the ratio of the number of correct positive predictions to the total number of positive predictions made by the model, while recall returns the ratio of the number of correct positive predictions to the total number of positive samples in the data. For a process monitoring tool (where a positive label implies the presence of a process fault), recall denotes the model's ability to detect process faults, while precision denotes the accuracy of the model's predictions of process faults. A model can have high recall if it detects process faults successfully, but the occurrence of lots of false alarms will lower its precision. In another scenario, if the model can detect only a specific type of process fault and gives very few false alarms, then it will have low recall and high precision. A perfect model will have high values (close to 1) for both precision and recall. However, we just saw that it is possible to have one of these two metrics high and the other low. Therefore, both metrics need to be reported. If only a single metric is desired, then the F1 score is utilized, which returns a high value if both precision and recall are high and a low value if either precision or recall is low.
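These metrics need not be computed by hand; a minimal sketch using scikit-learn (y_true and y_pred are assumed 0/1 label arrays):

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel() # binary confusion matrix terms
print('FAR:', fp/(fp + tn)) # false alarm rate
print('FDR (recall):', recall_score(y_true, y_pred))
print('Precision:', precision_score(y_true, y_pred))
print('F1 score:', f1_score(y_true, y_pred))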
In the process industry, recall (FDR) and false alarm rate (FAR) are more commonly employed to assess a process monitoring tool's performance. A plant operator is mainly concerned about two things: does the model detect a fault when it occurs (given by FDR) and does the model trigger false alerts under NOC (given by FAR). As would be obvious, a FAR close to 0 is
desirable. A well-designed monitoring tool has high FDR and low FAR values. Low FDR and high FAR values are both undesirable: while a low FDR leads to safety issues and economic losses due to delayed or missed abnormality detection, a high FAR leads to loss of the user's confidence in the tool and, eventually, the tool's demise. If you make your model very sensitive so that it detects faults promptly at incipient stages, it will very likely generate lots of false alerts. As a process data scientist, you will very often find yourself tuning the model to balance the trade-off between FDR and FAR.
relaxed threshold ⇒
poor FDR, good FAR
good threshold ⇒
good FDR, good FAR
aggressive threshold ⇒
good FDR, poor FAR
As indicated, while a very 'relaxed' threshold will likely fail to detect faulty samples (low FDR, low FAR), a very aggressive threshold will lead to normal samples being flagged as faulty (high FAR, high FDR). This interplay between FDR and FAR for different alerting thresholds is graphically presented via the ROC curve shown below. While an ideal fault detection tool will provide perfect performance (FDR = 1, FAR = 0), in real scenarios, you will have to tune your alert threshold to provide acceptable levels of FDR and FAR.
(ROC curve: fault detection rate vs false alarm rate; ideal performance lies at the top-left corner, FDR = 1 and FAR = 0)
In general, the closer a curve is to the ideal detector or the more lift it has above the diagonal
line, the better performance it provides over a random classifier. To compare two models
using their ROC curves, a metric called AUC (area under curve) has been devised. AUC is
obtained by integrating the ROC curve from FAR=0 to FAR=1. An ideal detector has AUC=1
and higher AUC values imply better fault detection performance.
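The ROC curve and AUC are readily generated; a minimal sketch using scikit-learn (y_true holds assumed 0/1 labels and scores holds the continuous fault metric values):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

FAR, FDR, thresholds = roc_curve(y_true, scores) # FPR and TPR at each threshold
print('AUC:', roc_auc_score(y_true, scores))
plt.plot(FAR, FDR), plt.plot([0, 1], [0, 1], '--') # dashed line = random classifier
plt.xlabel('False alarm rate'), plt.ylabel('Fault detection rate')
plt.show()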
(Illustration: ARL1 — the average number of samples between the occurrence of a fault and the fault metric flagging it)
In Chapter 5, we will see the trade-off between fault sensitivity and fault
detection speed. Specifically, a technique called CUSUM will be introduced that
is able to detect faults of small magnitudes (⇒ high fault sensitivity) but shows
higher ARL1 compared to classical 3-sigma technique (which is widely used for
univariate signal monitoring). Another metric called ARL0 denotes the average
time or number of samples a FD model takes to flag a (false) alarm when no
fault exists in the process.
Every ML model comes with a set of hyperparameters (e.g., the number of neurons in ANNs, the number of retained latent components in PCA) that can be adjusted to improve the performance metrics. However, care must be taken to avoid overfitting the model. The illustration below shows SVDD modeling of NOC training samples for different values of the bandwidth hyperparameter.
(Illustration: SVDD boundaries for three bandwidth values — hyperparameter not set right, hyperparameter set right, hyperparameter not set right)
For high-dimensional complex datasets, it is not easy to judge whether we are overfitting the data while tuning the model. Therefore, the standard approach is to use separate datasets for fitting, tuning, and assessing the model, as shown in the figure below.
An overfitted model is an unnecessarily complex model that ends up fitting the noise in the training dataset and, therefore, can be expected to provide poor performance (high FAR) on the validation dataset. On the other hand, an underfitted model fails to capture even the systematic structure in the training dataset, leading to undesirably low FDR on both the fitting and validation datasets. Therefore, the hyperparameter values that lead to the least complex model providing acceptable performance on the validation dataset are often chosen for the final model. This practice of using a validation dataset to assess model performance during model tuning is termed cross-validation.
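A minimal sketch of such a three-way split (X and y are assumed data arrays; the split sizes are illustrative):

from sklearn.model_selection import train_test_split

X_fit, X_test, y_fit, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_fit, y_fit, test_size=0.25, random_state=1)
# fit candidate models on X_train, tune hyperparameters against X_val, and
# report the final performance only once on the untouched X_test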
Summary
In this chapter, we covered some basic but crucial aspects of building a fault detection model to which you should pay careful attention to ensure satisfactory model performance. We touched upon the tasks of data selection, data pre-processing, model evaluation, and model tuning.
Starting from the next chapter, we will start the deep-dive into the world of developing process
monitoring models.
Part 2
Univariate Signal Monitoring
Chapter 5
Control Charts for Statistical Process Control
Before machine learning engulfed the process industry, simple plotting of key plant
variables with statistically chosen upper and lower thresholds used to be the norm for
detecting process abnormalities. These plots, called control charts, formed the major
component of statistical process control or statistical quality control. Although control charts
have lost some of their shine due to the advent of advanced multivariate process monitoring
tools, they are still widely employed by plant management to monitor crucial KPIs, for
example, product quality, process efficiency, etc. Simple concept, easy interpretation, and
quick implementation are some of the reasons for their continued popularity.
Shewhart charts (which include the popular 3-sigma charts) are the earliest, simplest, and most commonly used control charts. These, however, show poor performance for the detection of faults that cause small deviations. Therefore, alternatives such as CUSUM charts and EWMA charts have been devised. In this chapter, we will learn these techniques and become familiar with how to implement them in practice. We will conclude with some discussion on the ways to overcome the shortcomings of univariate control charts. Specifically, the following topics are covered.
Control charts are one of the seven24 pillars of statistical process control that are traditionally
used to monitor key production or product quality metrics in order to detect unexpected
deviations. When the process is ‘in-control’, the monitored variables are expected to exhibit
only natural cause variations around some target or mean values. As shown in Figure 5.1, a
control chart is simply a display of measurements of a single process variable plotted against
time or sample number. Additionally, these charts include a centerline or the expected mean
value, and a couple of limit lines called UCL (upper control limit) and LCL (lower control limit).
Most industrial process variables show natural variations due to random disturbances
affecting the process. The control limits are statistically designed in such a way that under
natural cause variations, the monitored variable remains within the control limits with certain
desired probability. The breach of the control limits indicates potential process fault or an ‘out-
of-control’ situation. Proper specification of the limits is therefore essential to ensure minimal
false alarms and rapid detection of faults.
The traditional control charts take three different forms, viz, Shewhart charts, CUSUM charts,
and EWMA charts. While a Shewhart chart plots only the current measurement on the control chart, the other two plot some combination of current and past measurements. You will soon learn how the usage of past measurements allows detection of incipient or low-magnitude faults which may not get detected by Shewhart charts. Control charts are not limited to tracking
process measurements only; any metric that is expected to exhibit only random fluctuations
around some mean or target can be monitored via control charts. Correspondingly, control
charts are also employed for monitoring model residuals, latent variables (for example, in
PCA), etc. Let’s first get started with Shewhart charts.
24 https://asq.org/quality-resources/statistical-process-control
The Shewhart chart is the simplest and most widely used control chart, wherein the process measurement at each sampling instant is plotted against the sample number, and the control limits are often set at μ ± 3σ, where μ and σ are the in-control process variable mean and standard deviation. The centerline is obviously set at μ. Shewhart charts assume a Gaussian distribution and, therefore, the 3σ control chart amounts to a false alarm rate of 0.27%, as shown in Figure 5.2.
(Figure 5.2: under the Gaussian assumption, the probability of data falling within the μ ± 3σ control limits is 0.9973)
The statistics μ and σ are often not known exactly and are estimated from historical in-control data. Let x₁, x₂, ⋯, x_N be the observations from a historical good operation period; then the sample statistics can be estimated as25
μ̂ = x̄ = (1/N) Σᵢ₌₁ᴺ xᵢ;    σ̂ = s/c₄
where s is the sample standard deviation (√(Σᵢ(xᵢ − x̄)²/(N − 1))) and c₄ is a correction factor26 (which tends to 1 for large N) to account for finite samples. This leads to the following control limits
LCL = x̄ − 3s/c₄;    UCL = x̄ + 3s/c₄
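These limits are straightforward to compute; a minimal sketch (x_NOC is an assumed 1D array of historical in-control measurements; gammaln is used to evaluate c₄ stably):

import numpy as np
from scipy.special import gammaln

# Shewhart limits with the c4 finite-sample correction
N = len(x_NOC)
c4 = np.sqrt(2/(N - 1))*np.exp(gammaln(N/2) - gammaln((N - 1)/2))
x_bar, s = np.mean(x_NOC), np.std(x_NOC, ddof=1) # sample mean and standard deviation
LCL, UCL = x_bar - 3*s/c4, x_bar + 3*s/c4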
25 https://www.itl.nist.gov/div898/handbook/pmc/section3/pmc322.htm
26 https://www.itl.nist.gov/div898/handbook/pmc/section3/pmc32.htm
(Illustration: measurements arranged into subgroups of m measurements each)
It is the mean of the subgroup that is tracked on the standard Shewhart chart. In fact, the Shewhart chart with m = 1 is referred to as an individual Shewhart chart. Let x_ij denote the jth measurement in the ith subgroup. The control limits for the subgroup mean can be obtained as follows using the grand mean x̿ and the average subgroup standard deviation S̄ estimated from historical IC (in-control) data.
LCL = x̿ − 3S̄/(c₄√m);    UCL = x̿ + 3S̄/(c₄√m)
For continuous processes, both individual and standard Shewhart charts find utility. For
example, suppose that plant efficiency measurements are available every minute. Plant
operators, if desired, can track the minute-level efficiencies using individual control chart.
Alternatively, efficiency measurements may be averaged over an hour (⇒ m = 60) and the hourly-averaged efficiency values may be tracked on a standard Shewhart chart.
While Shewhart charts are easy to interpret and implement, their performance is poor when
used to detect small deviations (< 1.5𝜎 shifts) in process variable mean. The following
illustration clarifies this aspect.
To understand how Shewhart plots can be generated, let us work through the code used to
generate the plots in Figure 5.3.
# generate data
import numpy as np

x0 = np.random.normal(loc=10, scale=2, size=250) # NOC samples
x1 = np.random.normal(loc=11, scale=2, size=50) # faulty samples
x = np.hstack((x0,x1)) # combined data
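The chart itself can then be drawn with the 3σ limits estimated from the NOC samples; a minimal sketch (the plotting details are illustrative):

# Shewhart chart with 3-sigma limits estimated from the NOC samples
import matplotlib.pyplot as plt

mu, sigma = np.mean(x0), np.std(x0)
LCL, UCL = mu - 3*sigma, mu + 3*sigma
plt.figure()
plt.plot(x, '--', marker='o', markersize=3)
plt.plot([0,len(x)], [UCL,UCL], color='red'), plt.plot([0,len(x)], [LCL,LCL], color='red')
plt.plot([0,len(x)], [mu,mu], '--', color='maroon')
plt.xlabel('sample #'), plt.ylabel('x')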
Consider the control chart shown below for the dataset shown in Figure 5.3. It is apparent that, unlike the Shewhart chart, this control chart easily flags the out-of-control situation. This chart is called a CUSUM chart and is known to be more effective at detecting small mean shifts. At the ith sampling instant, the CUSUM statistic (Sᵢ) that is plotted on the CUSUM chart is the 'Cumulative SUM' of deviations of each of the available measurements from the mean (or target). When a process is in control, the random deviations around the mean cancel out and Sᵢ wanders around zero; however, in the presence of a mean shift of, say, Δ at the rth sampling instant, an additional Δ term gets added to Sᵢ at every instant ≥ r, which leads to a continuous upward (or downward) trend as seen in Figure 5.4.
(Figure 5.4: the CUSUM statistic, computed from the measurements x₁, ⋯, xᵢ₋₁, xᵢ, trends steadily upward after the mean shift, giving an out-of-control indication)

The CUSUM statistic (S) plotted at the ith instant is

Sᵢ = Σⱼ₌₀ⁱ (xⱼ − μ₀);    i ≥ 0

or, equivalently, Sᵢ = Sᵢ₋₁ + (xᵢ − μ₀).
Fault detection using CUSUM charts used to be done graphically by looking at the slope of
the trends using the so-called V-masks. However, in modern statistical software systems, the
following two-sided CUSUM charts are utilized.
Sᵢ⁺ = max[0, xᵢ − (μ₀ + k) + Sᵢ₋₁⁺];    Sᵢ⁻ = max[0, (μ₀ − k) − xᵢ + Sᵢ₋₁⁻];    S₀⁺ = S₀⁻ = 0

• k ≥ 0
• A deviation from the mean (or target) greater than k increases S⁺ or S⁻
• k is usually set to be ½ the size of the mean shift that we want to detect quickly
Both 𝑆 + and 𝑆 − are plotted together on the same chart along with a specified control limit (or
threshold) H as shown below.
(Violation of the control limit H by either S⁺ or S⁻ indicates the process is not in statistical control)
The implementation of a CUSUM chart requires specification of k, H, and the mean μ₀ (or target T). While μ₀ is simply the sample mean of historical IC data, determination of the control limit H is slightly more involved than for Shewhart charts. H is often selected based on the ARL(0) and ARL(1) values for different levels of mean shifts27. Usually, H is set at 4σ̂ or 5σ̂, where σ̂ is estimated the same way as done for Shewhart charts.
27 https://www.itl.nist.gov/div898/handbook/pmc/section3/pmc3231.htm
Example 5.1: We will take the data from Figure 5.3 and learn how to build CUSUM
control charts. We will also see the impact of k on chart performance.
# CUSUM chart
mu, sigma = np.mean(x0), np.std(x0)
k, H = 0.25*sigma, 5*sigma
S_positive, S_negative = np.zeros((len(x),)), np.zeros((len(x),)) # S0+ = S0- = 0
for i in range(1,len(x)):
    S_positive[i] = np.max([0, x[i]-(mu+k) + S_positive[i-1]])
    S_negative[i] = np.max([0, (mu-k)-x[i] + S_negative[i-1]])
The CUSUM chart can promptly detect the shift in mean. However, we also see some
false alerts in NOC data. This is because our chart is very sensitive (designed to capture
just 0.5𝜎 shifts). The figure below shows the impact of increasing k to 0.5𝜎 – the
incidence of false alerts has decreased, but there is considerable delay in fault detection
(~ 50 samples).
CUSUM charts can also detect large shifts in the process mean; however, the detection is slower (i.e., the ARL(1) is higher) than with Shewhart charts. EWMA charts provide a good compromise between the detection of small mean shifts and the quick detection of large shifts.
Like CUSUM, EWMA charts are control charts with memory. Here, at any ith sampling instant,
the statistic (zi) plotted on the EWMA control chart is a weighted combination of the ith process
measurement and the previous statistic zi-1; this translates to EWMA statistic being an
exponentially weighted average of all the past measurements as shown in Figure 5.5. We can
see that EWMA chart provides a balance between Shewhart (only current observation is
considered) and CUSUM (equal weightage given to all available observations) charts;
accordingly, this leads to a good balance between detection delay and fault sensitivity.
(Figure 5.5: the EWMA statistic z is an exponentially weighted average of the current and past measurements x₁, ⋯, xᵢ₋₁, xᵢ; zᵢ = λxᵢ + (1 − λ)zᵢ₋₁)
LCL = μ̂ − 3σ̂√(λ/(2 − λ));    UCL = μ̂ + 3σ̂√(λ/(2 − λ))
Example 5.1 continued: We will again take the data from Figure 5.3 and learn how to
build EWMA control charts.
# EWMA chart
mu, sigma = np.mean(x0), np.std(x0)
smoothFactor = 0.1
LCL = mu - 3*sigma*np.sqrt(smoothFactor/(2-smoothFactor))
UCL = mu + 3*sigma*np.sqrt(smoothFactor/(2-smoothFactor))
z = np.zeros((len(x),))
z[0] = mu
for i in range(1,len(x)):
z[i] = smoothFactor*x[i] + (1-smoothFactor)*z[i-1]
plt.plot(z,'--',marker='o', color='teal')
plt.plot([1,len(x)],[LCL,LCL], color='red'), plt.plot([1,len(x)],[UCL,UCL], color='red')
plt.plot([1,len(x)],[mu,mu], '--', color='maroon')
plt.xlabel('sample #')
The balance between the performances of the Shewhart and CUSUM charts is immediately apparent in the above EWMA chart. The EWMA statistic breaches (although barely) the alert threshold (⇒ better performance than the Shewhart chart), and the detection delay is lower than that of the CUSUM chart as well (⇒ better performance than the CUSUM chart).
To illustrate an industrial application of control charts, we will consider data28 from an aeration
tank shown below. An aeration tank uses small air bubbles to keep solid particles in
suspension. Excessive foaming and loss of valuable solid product occurs if too much air is
blown into the tank. If too little air is blown into the tank, the particles sink and drop out of
suspension. The simulated dataset contains 573 observations, where each observation
equals the total liters of air added to the tank in a one-minute interval.
(Illustration: aeration tank with air bubbled in to keep solid particles in suspension)
All 573 observations are plotted below. You won't be judged if you fail to notice the slight upward shift around 300 minutes at first glance.
28 The aeration rate dataset and description are publicly available at https://openmv.net/info/aeration-rate.
However, we have learnt that CUSUM charts excel at capturing small mean shifts. Let's see how nicely this works for this dataset. We will take the first 200 observations as NOC data, which are plotted below. The mean and standard deviation are computed from these samples.
To generate the CUSUM chart, k is taken to be equal to 0.25σ. Using the code shown in Example 5.1, the CUSUM control chart29 shown below is generated. It is impressive that, without many false alarms, the small mean shift has now been made very 'obvious' to see!
29 The complete code is provided in the online GitHub repository.
The obvious shortcoming of the techniques learnt in this chapter is their inability to effectively detect faults in multivariate systems, as illustrated below. The faulty sample is an obvious outlier in the 2D space; however, if the variables x1 and x2 are analyzed individually, the faulty sample appears within the NOC ranges.
Another shortcoming of the vanilla control charts that we have studied is the assumption of independence among the data samples. If you are dealing with a signal that shows autocorrelation, then a better strategy, as illustrated below, is to build a model (e.g., a time-series model30) and compute the residuals (between observed and predicted values). For NOC data and a carefully built model, the residuals will be uncorrelated and Gaussian distributed; therefore, the vanilla control charts can be built for the residuals to indirectly monitor the original signal.
30 Time series modeling techniques are covered in detail in Book 2 of the series.
Another popular strategy to deal with autocorrelated signals is to update the model using the
latest observations and monitor the model parameters.
Summary
In this chapter, we familiarized ourselves with statistical process control charts for monitoring
univariate signals. We looked at the three popular techniques, viz, Shewhart charts, CUSUM
charts, and EWMA charts. We learnt how to build these charts and understood the pros and
cons of these different methods. In the next chapter, we will continue our study of univariate
signal monitoring and look at pattern matching-based methodologies.
Chapter 6
Process Fault Detection via Time Series
Pattern Matching
Imagine you are a plant operator newly put in charge of running a plant and you observe an interesting pattern in one of the process variables: occasional spikey fluctuations without the signal violating the DCS alarm limits. A natural line of investigation would be to find out if such patterns have occurred in the past and are leading indicators of underlying process faults. However, how do you quickly sift through years of historical data to find similar
patterns? Consider another scenario where you are responsible for quality control of a batch
process. To check if the latest batch went smoothly, you may want to compare it with a known reference/golden batch. However, batches may show normal variations due to different batch durations or abnormal deviations due to process faults. How do you train an algorithm to smartly call out a faulty batch? One thing common to both these scenarios is that we are not looking at the abnormality of a single measurement; instead, the abnormality of a sequence of successive values (also called a collective anomaly) is of interest.
Time series pattern matching is a mature field in the area of time series classification, and recent algorithmic advances now allow very fast sequence comparisons to find similar or abnormal patterns in historical data. Unsurprisingly, pattern matching is being offered as a prime feature in commercial process data analytics software (such as Aspen's Process Explorer, SEEQ, etc.). In this chapter, we will work through some use-cases of pattern-matching-based process monitoring. Specifically, the following topics are covered
In the anomaly detection literature, anomalies in univariate time series or dynamic signals are categorized into three types: point anomalies, contextual anomalies, and collective anomalies. Figure 6.1 illustrates these anomalies for a valve opening (%) signal. As depicted in Figure 6.1b, if a single measurement deviates significantly from the rest of the sensor readings, then a point anomaly is said to have occurred. A contextual anomaly occurs when a measurement is not anomalous in an 'overall sense' but only in a specific context. For example, in Figure 6.1c, point 'B' is abnormal when taken in the context of operation mode 1 only; the valve opening goes close to 80% under normal operation but not when the process is in mode 1. The last category, collective anomaly, occurs when a group/sequence of successive measurements jointly shows abnormal behavior, although the individual measurements may not violate the NOC range. While control charts can be built to detect point and contextual anomalies, more specialized approaches are needed to detect collective anomalies. Therefore, this chapter is devoted to the study of approaches for collective anomaly detection.
(Figure 6.1: valve opening signal showing (a) normal conditions, (b) a point anomaly 'A', (c) a contextual anomaly 'B' across operation modes 1 and 2, and (d) a collective anomaly)
The need for (sub)sequence-based pattern matching for FDD shows up in different forms in the process industry; Figure 6.2 illustrates some of the use-case scenarios. Let's work through some of these use-cases to understand the underlying techniques and available resources.
(Figure 6.2 panel annotations: a pattern observed in the last 0.5 hour — is this pattern associated with any faulty condition?; does any adsorber bed's pressure profile differ from those of the rest of the beds?; has my plant start-up progressed normally?; in the last 1 day of operation, has any pattern occurred that is very different compared to the rest of the data?)
Figure 6.2: Sample use-case scenarios of time series pattern matching-based fault detection
We will work through the case scenarios (a) and (c). Both of these use-cases involve finding
similarity of a ‘query’ subsequence with several other subsequences taken from the same
time series or another time series. In order to accomplish this in a time-efficient manner, a
library called STUMPY31 will be utilized. Let’s learn how to utilize STUMPY for our time series
data mining tasks.
31 https://stumpy.readthedocs.io/en/latest/index.html. S.M. Law, STUMPY: A Powerful and Scalable Python Library for Time Series Data Mining. Journal of Open Source Software, 2019.
To answer the question posed in Figure 6.2a, the following task can be performed: using the real-time data from the last 30 minutes, the most similar subsequence can be found in the historical data; thereafter, one can simply check if a process upset followed the found subsequence in the past. We will use steam flow data32 obtained from a model of a Steam Generator at
Abbott Power Plant in Champaign IL. The provided dataset contains 9600 samples (sampling
time 3 seconds) of several process variables including steam flow. We created two separate
datafiles, historical_steamFlow.txt and current_steamFlow.txt, to contain the historical data
and the real time data that will be used in this case study. Let’s start with loading the data.
# fetch data
import numpy as np
import matplotlib.pyplot as plt

historical_steamFlow = np.loadtxt('historical_steamFlow.txt')
current_steamFlow = np.loadtxt('current_steamFlow.txt')
plt.figure(), plt.plot(historical_steamFlow)
plt.ylabel('Historical steam flow'), plt.xlabel('Sample #')
32 De Moor B.L.R. (ed.), DaISy: Database for the Identification of Systems, Department of Electrical Engineering, ESAT/STADIUS, KU Leuven, Belgium, URL: http://homes.esat.kuleuven.be/~smc/daisy/
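To locate the best match, the distance profile of the query against the historical series can be computed; a minimal sketch using STUMPY's mass function (this defines the historicalPattern variable used in the comparison below):

# distance profile of the query against all historical subsequences of the same length
import stumpy

distance_profile = stumpy.mass(current_steamFlow, historical_steamFlow)
best_idx = np.argmin(distance_profile) # start index of the most similar subsequence
historicalPattern = historical_steamFlow[best_idx:best_idx + len(current_steamFlow)]
print('Most similar historical subsequence starts at sample:', best_idx)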
Let’s see how the query and target subsequence compare to each other.
plt.figure()
plt.plot(current_steamFlow, color="green", label="current_steamFlow")
plt.plot(historicalPattern, color="maroon", label="Most similar subsequence")
plt.xlabel('Sample #'), plt.ylabel('Steam Flow'), plt.legend()
For visualization purposes, the current steam flow time series has been appended to the end of the historical steam flow time series.
The question posed in Figure 6.2c can be solved by finding a subsequence which is at a large
distance from any other subsequence in a given time series. For this case-study, we will use
the steam flow time series provided in the original steam generator dataset. Let’s start with
loading the data.
# fetch data
data = np.loadtxt('steamgen.dat')
steamFlow = data[:,8]
plt.figure(), plt.plot(steamFlow)
plt.ylabel('Steam flow'), plt.xlabel('Sample #')
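The matrix profile for this series can be computed via STUMPY's stump function; a minimal sketch (the window length sequenceLength is an assumed value):

# compute the matrix profile for all subsequences of length sequenceLength
import stumpy

sequenceLength = 600 # assumed window length (~30 min of data at 3 s sampling)
matrix_profile = stumpy.stump(steamFlow, m=sequenceLength)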
The 'stump' function does the following: for every subsequence (of length sequenceLength)
within the steamFlow time series, it automatically identifies its corresponding nearest-
neighbor subsequence of the same length33. The first column of matrix_profile array
contains the distance values and the nearest neighbor subsequence's indices are in the 2nd
and 3rd columns. Let’s see how the distances vary.
To find the most unusual subsequence, which is a potential anomaly, we just need to find the position of the subsequence in the steamFlow time series that is at the largest distance from its nearest-neighbor subsequence.
33 Check out the nice visualization at https://stumpy.readthedocs.io/en/latest/index.html.
# find discord
discord_position = np.argsort(matrix_profile[:, 0])[-1]
print('The discord is located at position: ', discord_position)
We won’t be surprised if you are already thinking of the different ways you could use time
series pattern matching for your own process systems. Libraries like STUMPY have made
finding patterns in huge volumes of data easy and therefore, we encourage you to let your
imaginations run wild with the potential use cases!
Summary
In this chapter, we looked at time series data mining techniques for fault detection in process
data. Specifically, we worked with a steam generator dataset, and looked at pattern matching
and discord discovery problems. We learnt how to use a powerful library called STUMPY to
solve these problems.
Part 3
Multivariate Statistical Process Monitoring
Chapter 7
Multivariate Statistical Process Monitoring for
Linear and Steady-State Processes: Part 1
It is not uncommon to have hundreds of process-relevant variables being measured at
manufacturing facilities. However, conservation laws such as mass balances, thermodynamic constraints, enforced product specifications, and other operational restrictions induce correlations among the process variables and make it appear as if the measured variables are all derived from a small number of hidden (un-measured) variables.
Several smart techniques have been derived to find these hidden latent variables. Latent
variable-based techniques allow characterization of ‘normal’ process noise affecting the
process during NOC. Process monitoring methods based on latent space monitor the values
of latent variables and process noise in real-time to infer the presence of process faults.
Sounds complicated? Don’t worry! This chapter will show you how this is accomplished while
retaining focus on conceptual understanding and practical implementation.
PCA and PLS are among the most popular latent variable-based process monitoring tools
and have been reported in several successful industrial process monitoring applications. This
chapter provides a comprehensive exposition of the PCA and PLS techniques and teaches
you how to apply them for fault detection. Furthermore, we will learn how to identify the faulty
process variable using the popular contribution analysis methodology. Specifically, the
following topics are covered
34 The popularity of latent-variable techniques for process control and monitoring arose from the pioneering work of John MacGregor at McMaster University.
Mathematical background
Consider a data matrix 𝑋 ∈ ℝ𝑁×𝑚 consisting of N samples of m process variables where each
row represents a data-point in the original measurement space. It is assumed that each
column is normalized to zero mean and unit variance. Let 𝑣 ∈ ℝ𝑚 represent the ‘loading’
vector that projects data-points along PC1; it can be found by solving the following
optimization problem
max_{v≠0} (Xv)ᵀ(Xv)/(vᵀv)        eq. 1
It is apparent that Eq. 1 is trying to maximize the variance of the projected data-points along
PC1. Loading vectors for other PCs are found by solving the same problem with the added
constraint of orthogonality to previously computed loading vectors. Alternatively, loading
vectors can also be computed from eigenvalue decomposition of covariance matrix (S) of X
(1/(N − 1)) XᵀX = S = VΛVᵀ        eq. 2
The above is the form you will commonly find in the PCA literature. The columns of the eigenvector matrix V ∈ ℝ^(m×m) are the loading vectors that we need. The diagonal eigenvalue matrix Λ equals diag{λ₁, λ₂, ..., λ_m}, where λ₁ ≥ λ₂ ≥ ⋯ ≥ λ_m are the eigenvalues. In fact, λⱼ is equal to the variance along the jth PC. If there is significant correlation among the process variables in the original data, only the first few eigenvalues will be significant. Let's assume that k PCs are retained; then the first k columns of V (which correspond to the first k λs) are taken to form the loading matrix P ∈ ℝ^(m×k). Transformed data in the PC space can now be obtained as
tⱼ = Xpⱼ   or   T = XP        eq. 3
(the 'bulky' X ∈ ℝ^(N×m) is transformed into the 'lean' T ∈ ℝ^(N×k); tⱼ contains the projected values along the jth PC)
The m dimensional ith row of X has been transformed into k (< m) dimensional ith row of T.
𝑇 ∈ ℝ𝑁×𝑘 is called score matrix and the jth column of T (tj) contains the (score) values along
the jth PC. The scores can be projected back to the original measurement space as follows
X̂ = TPᵀ        eq. 4
where the reconstruction residual is E = X − X̂.
Figure 7.2: Process data from a polymer manufacturing plant. Each colored curve
corresponds to a process variable.
For this dataset, it is reported that the process started behaving abnormally around sample
70 and eventually had to be shut down. Therefore, we use samples 1 to 69 for training the
PCA model using the code below. The rest of the data will be utilized for process monitoring
illustration later.
# normalize data
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

scaler = StandardScaler()
data_train_normal = scaler.fit_transform(data_train)

# PCA
pca = PCA()
score_train = pca.fit_transform(data_train_normal)
35 Data was originally available at https://landing.umetrics.com (unfortunately, this link no longer seems to work). The dataset is also referenced at https://www.academia.edu/38630159/Multivariate_data_analysis_wiki. The data file is made available in this book's GitHub repository.
After training the PCA model, the loading vectors/principal components can be accessed from the transpose of the components_ attribute of the pca model. Note that we have not accomplished any dimensionality reduction yet. PCA has simply provided us an uncorrelated dataset in score_train. To confirm this, we can compute the correlation coefficients among the columns of score_train. Only the diagonal values are 1 while the rest of the coefficients are 0!
# confirm no correlation
corr_coef = np.corrcoef(score_train, rowvar = False)
>>> print('Correlation matrix: \n', corr_coef[0:3,0:3]) # printing only a portion
Correlation matrix:
[[ 1. 0. -0.]
[ 0. 1.0 0.]
[-0. 0. 1.0]]
For dimensionality reduction, we will need to study the variance along each PC. Note that the sum of the variances along the m PCs equals the sum of the variances along the m original dimensions. Therefore, the variance along each PC is also called explained variance. The attribute explained_variance_ratio_ gives the fraction of variance explained by each PC, and Figure 7.3 clearly shows that not all 33 components are needed to capture all the information in the data. Most of the information is captured in the first few PCs itself.
explained_variance = 100*pca.explained_variance_ratio_ # % variance explained by each PC
cum_explained_variance = np.cumsum(explained_variance)

plt.figure()
plt.plot(cum_explained_variance, 'r+', label = 'cumulative % variance explained')
plt.plot(explained_variance, 'b+', label = 'variance explained by each PC')
plt.ylabel('Explained variance (in %)'), plt.xlabel('Principal component number'), plt.legend()
A popular approach for determining the number of PCs to retain is to select the number of PCs that cumulatively capture at least 90% (or 95%) of the variance. The captured variance threshold should be guided by the expected level of noise or non-systematic variation that you do not expect to be captured. Alternative methods include cross-validation, scree tests, the AIC criterion, etc. However, none of these methods is universally best in all situations.
Thus, we have achieved a ~60% reduction in dimensionality (from 33 to 13) by sacrificing just 10% of the information. To confirm that only about 10% of the original information has been lost, we will reconstruct the original normalized data from the scores. Figure 7.4 provides a visual confirmation as well, where it is apparent that the systematic trends in the variables have been reconstructed while the noisy fluctuations have been removed.
# number of PCs capturing 90% variance
n_comp = np.argmax(np.cumsum(pca.explained_variance_ratio_) >= 0.90) + 1   # 13 for this dataset
V_matrix = pca.components_.T     # loading vectors as columns
P_matrix = V_matrix[:,0:n_comp]  # loadings for the retained PCs
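The reconstruction used for Figure 7.4 can be generated from the retained scores and loadings; a minimal sketch:

# reconstruct the normalized training data from the retained scores
score_train_reduced = score_train[:, 0:n_comp]
data_train_normal_reconstruct = np.dot(score_train_reduced, P_matrix.T)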
Figure 7.4: Comparison of measured and reconstructed values for a few variables
The 90% threshold could also have been specified during model training itself through the n_components parameter: pca = PCA(n_components = 0.9). In this case, the insignificant PCs are not computed, and the reduced score matrix is obtained directly from the fit_transform (or transform) method.
# alternative approach
from sklearn.metrics import r2_score

pca = PCA(n_components = 0.9)
score_train_reduced = pca.fit_transform(data_train_normal)
data_train_normal_reconstruct = pca.inverse_transform(score_train_reduced)
R2_score = r2_score(data_train_normal, data_train_normal_reconstruct)
In Figure 7.2, we saw that it was not easy to infer the process abnormality after the 69th sample by simply looking at the combined time-series plot of all the available variables. Individual variable plots may provide better clues, but continuously monitoring 33 separate plots is not practical.
PCA makes the monitoring task easy by summarizing the state of any complex multivariate process into two simple indicators, or monitoring indices, as shown in Figure 7.5. During model training, statistical thresholds are determined for the indices; for a new data-point, the new indices' values are compared against these thresholds. If either of the two thresholds is violated, the presence of an abnormal process condition is flagged.
Hotelling’s T2
Let $t_i$ denote the ith row of T, which represents the transformed ith data-point in the PC space. The T2 index for this data-point is calculated as follows

$$T^2 = \sum_{j=1}^{k} \frac{t_{i,j}^2}{\lambda_j} = t_i \Lambda_k^{-1} t_i^T \qquad \text{eq. 5}$$

where $\Lambda_k = \mathrm{diag}(\lambda_1, \ldots, \lambda_k)$ holds the variances ($\lambda_j$) of the retained score columns. The control limit is

$$T_{CL}^2 = \frac{k(N^2-1)}{N(N-k)} F_{k,N-k}(\alpha) \qquad \text{eq. 6}$$

where $F_{k,N-k}(\alpha)$ is the (1−α) percentile of an F-distribution with k and N−k degrees of freedom. In essence, $T^2 \le T_{CL}^2$ represents an ellipsoidal boundary around the training data-points in the PC space.
SPE/Q
The second index, Q, represents the distance between the original and reconstructed data-
point. Let ei denote the ith row of E. Then
$$Q = \sum_{j=1}^{m} e_{i,j}^2 \qquad \text{eq. 7}$$
36 In statistical terms, α is the Type I error rate. In colloquial terms, it corresponds to the 100·(1−α)% control limit.
Again, under the normality assumption, the control limit for Q is given by the following expression

$$Q_{CL} = \theta_1 \left( \frac{z_\alpha \sqrt{2\theta_2 h_0^2}}{\theta_1} + 1 + \frac{\theta_2 h_0 (1-h_0)}{\theta_1^2} \right)^2 \qquad \text{eq. 8}$$

$$h_0 = 1 - \frac{2\theta_1 \theta_3}{3\theta_2^2}; \qquad \theta_r = \sum_{j=k+1}^{m} \lambda_j^r, \;\; r = 1, 2, 3$$

where $z_\alpha$ is the (1−α) percentile of the standard normal distribution.
We now have all the information required to generate control charts for the monitoring indices on the training data. We will continue our case study on the polymer manufacturing dataset.
# T2 statistic for training samples
N = data_train_normal.shape[0]
k = n_comp                                                # number of retained PCs
lambda_k_inv = np.linalg.inv(np.diag(pca.explained_variance_[0:k]))  # inverse of the PC variance matrix
T2_train = np.zeros((N,))
for i in range(N):
    T2_train[i] = np.dot(np.dot(score_train_reduced[i,:],lambda_k_inv),score_train_reduced[i,:].T)

# T2 control limit
import scipy.stats
alpha = 0.01                                              # 99% control limit
T2_CL = k*(N**2-1)*scipy.stats.f.ppf(1-alpha,k,N-k)/(N*(N-k))

# Q statistic for training samples
error_train = data_train_normal - data_train_normal_reconstruct
Q_train = np.sum(error_train*error_train, axis = 1)

# Q control limit
eig_vals = pca.explained_variance_   # note: the theta terms need the variances of ALL m PCs (full PCA fit)
m = data_train_normal.shape[1]
theta1 = np.sum(eig_vals[k:])
theta2 = np.sum([eig_vals[j]**2 for j in range(k,m)])
theta3 = np.sum([eig_vals[j]**3 for j in range(k,m)])
h0 = 1 - 2*theta1*theta3/(3*theta2**2)
z_alpha = scipy.stats.norm.ppf(1-alpha)
Q_CL = theta1*(z_alpha*np.sqrt(2*theta2*h0**2)/theta1 + 1 + theta2*h0*(1-h0)/theta1**2)**2
Figure 7.6 shows that quite a few data-points in the training data violate the thresholds, which is not expected with 99% control limits. This indicates that the multivariate normality assumption does not hold for this dataset. Specialized ML methods such as kernel density estimation (KDE) and support vector data description (SVDD) can be employed to determine control boundaries for non-Gaussian data; we will study these methods in later chapters. Alternatively, as alluded to before, if N is large, another popular approach is to directly take the control limits as the 99th percentiles of the T2 and Q values of the training dataset, as sketched below.
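A minimal sketch of this percentile-based alternative, reusing the training statistics computed above:

# empirical 99th-percentile control limits (non-parametric alternative)
T2_CL = np.percentile(T2_train, 99)
Q_CL = np.percentile(Q_train, 99)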
Figure 7.7: (Left) Flow measurements across a valve (Right) Mean-centered flow
readings with two abnormal instances (samples a and b)
For example, consider the scenario in Figure 7.7. The two flow measurements are expectedly correlated. Normal data-points lie along the 45° line (the PC1 direction), except instances 'a' and 'b', which exhibit different types of abnormalities. For sample 'a', the correlation between the two flow variables is broken, which may be the result of a leak in the valve. This results in an abnormally high $Q_a$ value; $T_a^2$, however, is not abnormally high because the projected score, $t_a$, is similar to those of the normal data-points. For sample 'b', the correlation remains intact, resulting in a low (zero) $Q_b$ value. The score, $t_b$, however, is abnormally far from the origin, resulting in an abnormally high $T_b^2$ value.
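Before computing the test statistics, the test data must be scaled and projected using the training model; a minimal sketch, assuming the samples from 70 onwards are held in a data_test array (assumed variable name):

# scale test data with training-data statistics and project onto the PCA model
data_test_normal = scaler.transform(data_test)              # data_test: assumed name
score_test_reduced = pca.transform(data_test_normal)[:, 0:n_comp]
data_test_normal_reconstruct = np.dot(score_test_reduced, P_matrix.T)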
# calculate T2_test
T2_test = np.zeros((data_test_normal.shape[0],))
for i in range(data_test_normal.shape[0]): # eigenvalues from training data are used
    T2_test[i] = np.dot(np.dot(score_test_reduced[i,:],lambda_k_inv),score_test_reduced[i,:].T)

# calculate Q_test
error_test = data_test_normal_reconstruct - data_test_normal
Q_test = np.sum(error_test*error_test, axis = 1)
Figure 7.8 juxtaposes the monitoring statistics for training and test data. By looking at these
plots, it is immediately evident that the test data exhibit severe process abnormality. Both T2
and Q values are significantly above the respective control limits.
Figure 7.8: Monitoring charts for training and test data. Vertical cyan-colored line separates
training and test data
After the detection of a process fault, the next crucial task is to diagnose the issue and identify which specific process variables are showing abnormal behavior. The popular mechanism to accomplish this is based on contribution plots. As the name suggests, a contribution plot is a plot of the contributions of the original process variables to the abnormality indices. The variables with the highest contributions are flagged as potentially faulty variables.
SPE contributions
For SPE (squared prediction error), let's reconsider Eq. 7 as shown below, where $SPE_j$ denotes the SPE contribution of the jth variable:

$$SPE = \sum_{j=1}^{m} e_j^2 = \sum_{j=1}^{m} SPE_j \qquad \text{eq. 9}$$

Therefore, the SPE contribution of a variable is simply the squared reconstruction error for that variable. If the SPE index has violated its control limit, then the variables with relatively large $SPE_j$ values are considered the potentially faulty variables.
T2 contributions
For T2 contributions, the calculations are not as straightforward. Several expressions have been postulated in the literature37. The commonly used expression below was proposed by Wise et al.38

$$T^2 \text{ contribution of variable } j = \left( j^{th} \text{ element of } D^{1/2} x \right)^2 \qquad \text{eq. 10}$$

$$D = P \Lambda_k^{-1} P^T$$
Note that these contributions39 are computed for each data-point. Let's find which variables need to be further investigated at the 85th sample.
# T2 contribution
import scipy.linalg

sample = 85 - 69    # sample 85 is the 16th test sample
data_point = np.transpose(data_test_normal[sample-1,])
D = np.dot(np.dot(P_matrix,lambda_k_inv),P_matrix.T)
T2_contri = np.dot(scipy.linalg.sqrtm(D),data_point)**2 # vector of contributions

# SPE contribution
error_test_sample = error_test[sample-1,]
SPE_contri = error_test_sample*error_test_sample # vector of contributions
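To flag the dominant contributors visually, the two contribution vectors can simply be plotted as bar charts; a minimal sketch:

# bar plots of the variable contributions at the chosen sample
plt.figure(), plt.bar(range(1,m+1), T2_contri), plt.xlabel('variable #'), plt.ylabel('T2 contribution')
plt.figure(), plt.bar(range(1,m+1), SPE_contri), plt.xlabel('variable #'), plt.ylabel('SPE contribution')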
37 S. Joe Qin, Statistical process monitoring: basics and beyond, Journal of Chemometrics, 2003

38 Wise et al., PLS toolbox user manual, 2006

39 A quick derivation (Alcala & Qin, Analysis and generalization of fault diagnosis methods for process monitoring, Journal of Process Control, 2011) is as follows: $T^2 = x^T P \Lambda_k^{-1} P^T x = \left\| P \Lambda_k^{-1/2} P^T x \right\|^2 \equiv \left\| D^{1/2} x \right\|^2 = \sum_{j=1}^{m} \left( \xi_j^T D^{1/2} x \right)^2 = \sum_{j=1}^{m} T_j^2$. Here, $\xi_j$ is the jth column ($\xi_j = [0 \cdots 1 \cdots 0]^T$) of the identity matrix of size m × m and $T_j^2$ is the jth variable contribution.
Variable #24 makes large contributions to both indices, and in Figure 7.10 we can see a sharp decline in its value towards the end of the sampling period. A plant operator can use his/her judgement to further troubleshoot the abnormality and isolate the root-cause.

Note that contribution plots do not explicitly identify the underlying cause of the fault; rather, they only isolate the variables that have been most affected by the fault or are most inconsistent with the NOC behavior.
Smearing effect

While contribution analysis is a widely employed method for fault isolation, it is prone to errors: variables that are not impacted by the fault can end up having significant contributions. For example, in the simple illustration below, assume that an actual process state denoted by point 'a' is measured as point 'b' due to a faulty f1 sensor. However, as shown, variable f2 makes a significant contribution to the Q metric and, therefore, f2 will be inferred to be faulty as well. The take-home message is that caution must be exercised when interpreting a contribution plot, and the actual data for the highlighted variables should be checked before reporting the faulty variables.

Other approaches for fault isolation have also been explored, such as the reconstruction-based approach by Alcala & Qin (see the 2009 article titled 'Reconstruction-based contribution for process monitoring' published in Automatica).
Partial least squares (PLS) is a supervised multivariate regression technique that estimates linear relationships between a set of input variables and a set of output variables. Like PCA, PLS transforms the raw data into latent components: the input (X) and output (Y) data matrices are transformed into score matrices T and U, respectively. Figure 7.11 provides a conceptual comparison of the PLS methodology with those of other popular linear regression techniques: principal component regression (PCR) and multivariate linear regression (MLR).
Figure 7.11: PLS, PCR, MLR methodology overview. Note that the score matrix, T, for PLS
and PCR can be different.
While MLR computes the least-squares fit between X and Y directly, PCR first performs PCA on the input data and then computes the least-squares fit between the score matrix and Y. By doing so, PCR is able to overcome the issues of collinearity, high correlation, noisy measurements, and limited training data. However, the latent variables are computed independently of the output data; therefore, the score matrix may capture variations in X that are not relevant for predicting Y. PLS overcomes this issue by estimating the score matrices, T and U, simultaneously, such that the variation in X that is relevant for predicting Y is maximally captured in the latent variable space.
Note that if the number of latent components retained in PLS or PCR model
is equal to the original number of input variables (m), then PLS and PCR
models are equivalent to MLR model.
The favorable properties of PLS, along with its low computational requirements, have led to its widespread usage in the process industry for real-time process monitoring, soft sensing, fault classification, and so on.
Mathematical background
PLS performs three simultaneous jobs: capture the dominant variations in X, capture the dominant variations in Y, and maximize the correlation between the X- and Y-scores.
To see how PLS achieves its objectives, consider again the data matrix $X \in \mathbb{R}^{N \times m}$ consisting of N observations of m input variables, where each row represents a data-point in the original measurement space. In addition, we also have an output data matrix with p (≥ 1) output variables, $Y \in \mathbb{R}^{N \times p}$. It is assumed that each column in both matrices is normalized to zero mean and unit variance. The first latent component scores are given by

$$t_1 = X w_1, \qquad u_1 = Y c_1 \qquad \text{eq. 11}$$

The vectors $w_1$ and $c_1$, termed weight vectors, are computed such that the covariance between $t_1$ and $u_1$ is maximized. Recalling that covariance equals correlation times the standard deviations, we can see that, by maximizing the covariance, PLS tries to meet all three objectives simultaneously.
In the next step, with the computed pairs $\{X, t_1\}$ and $\{Y, u_1\}$, the loading vectors ($p_1$ and $q_1$) are found via least-squares regression:

$$p_1 = \frac{X^T t_1}{t_1^T t_1}, \quad q_1 = \frac{Y^T u_1}{u_1^T u_1}; \qquad E_1 = X - t_1 p_1^T, \quad F_1 = Y - u_1 q_1^T \qquad \text{eq. 12}$$
In Eq. 12, E and F are called residual matrices and represent the part of X and Y that have
not yet been captured. To find the next component scores, the above three steps are repeated
with matrices E1 and F1 replacing X and Y. Note that the maximum number of possible
components equals m. For each component, the weight vectors are found via iterative
procedures like NIPALS (algorithm shown later) or SIMPLS. The final PLS decomposition
looks like the following
$$X = TP^T + E = \sum_{i=1}^{k} t_i p_i^T + E; \qquad Y = UQ^T + F = \sum_{i=1}^{k} u_i q_i^T + F \qquad \text{eq. 13}$$

where k is the number of latent components computed. The expressions in Eq. 13 are referred to as the outer relations for the X and Y blocks, respectively. An inner relation is also estimated for each pair $\{t_i, u_i\}$ via linear regression
$$u_i \approx b_i t_i \quad \text{or} \quad U = [u_1, u_2, \cdots, u_k] \approx [b_1 t_1, b_2 t_2, \cdots, b_k t_k] = TB$$

$$\hat{Y} = \hat{U} Q^T = TBQ^T \qquad \text{eq. 14}$$
If these algorithmic details appear intimidating, do not worry40. Sklearn provides the class PLSRegression, which is very convenient to use, as we will see in the next section where we will develop a PLS-based fault detection tool.
1) Initialize $i = 1$, $X_1 = X$, $Y_1 = Y$
2) Set score vector $u_i \, (\in \mathbb{R}^{N \times 1})$ to any column of $Y_i$
3) Calculate weight vector $w_i = X_i^T u_i$ and normalize it to unit length
4) Calculate input scores $t_i = X_i w_i$
5) Calculate output weight vector $c_i = Y_i^T t_i$ and normalize it to unit length
6) Update output scores $u_i = Y_i c_i$
7) Repeat steps 3 to 6 until $t_i$ converges
8) Calculate loading vectors $p_i = X_i^T t_i / (t_i^T t_i)$ and $q_i = Y_i^T u_i / (u_i^T u_i)$
9) Calculate the inner-relation coefficient $b_i = u_i^T t_i / (t_i^T t_i)$
10) Deflate: $X_{i+1} = X_i - t_i p_i^T$, $Y_{i+1} = Y_i - b_i t_i c_i^T$
11) Set $i = i + 1$ and return to step 2. Stop when $i > k$ [number of latents to be retained]
12) Form the matrices $T = [t_1, t_2, \cdots, t_k]$, $U = [u_1, u_2, \cdots, u_k]$, $W = [w_1, \cdots, w_k]$, $P = [p_1, \cdots, p_k]$

(Steps 4 to 10 above follow the standard NIPALS formulation.)
40 We included the NIPALS algorithm details so that you can follow the mathematical background of the KPLS algorithm easily in Chapter 10.
The scores can also be computed directly from the original data as $T = XR$, where $R = W(P^T W)^{-1}$ and the ith column of R equals the vector $r_i$. An obvious relationship is that $r_1 = w_1$.
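Once a PLSRegression model has been fit (as we do in the next section), this relationship can be verified numerically; a minimal sketch:

# check that r1 equals w1 for a fitted sklearn PLSRegression model 'pls'
W, P = pls.x_weights_, pls.x_loadings_
R = np.dot(W, np.linalg.inv(np.dot(P.T, W)))
print(np.allclose(R[:,0], W[:,0]))   # expected: True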
Although soft sensing is the predominant use of PLS in the process industry, the PLS framework renders itself useful for process monitoring as well. The overall methodology is similar to PCA-based monitoring: after PLS modeling, monitoring indices are computed, control limits are determined, and violations of the control limits are checked for fault detection. PLS-based monitoring is preferred when process data can be divided into input and output blocks. For illustration, we will use data collected from an LDPE (low-density polyethylene) production process using a simulated model that was tuned to match a typical industrial process42. As shown in the figure below, the process consists of a multi-zonal tubular reactor. The dataset consists of 54 samples of 14 process variables and 5 product quality variables. It is known that a process fault occurs from sample 51 onwards (Figure 7.13).
41 https://learnche.org/pid/latent-variable-modelling/projection-to-latent-structures/how-the-pls-model-is-calculated

42 MacGregor et al., Process monitoring and diagnosis by multiblock PLS methods, Process Systems Engineering, 1994. This dataset can be obtained from https://openmv.net.
Our objective here is to build a fault detection tool that clearly indicates the onset of the process fault. To appreciate the need for such a tool, let's look at the conventional monitoring alternative. If a plant operator were manually monitoring the 5 quality variables continuously, he/she could notice a slight drop in their values over the last 4 samples. However, given that the quality variables exhibit large variability during normal operation, it is difficult to make any decision without first examining the other process variables, because the quality variables may simply be responding to 'normal' changes elsewhere in the process. Unfortunately, it would be very inconvenient to manually interpret all the process plots simultaneously.
# scale data (X and Y blocks are scaled separately; X_train/Y_train hold
# the 14 process and 5 quality variables, respectively -- names assumed)
from sklearn.preprocessing import StandardScaler
X_scaler = StandardScaler()
X_train_normal = X_scaler.fit_transform(X_train)
Y_scaler = StandardScaler()
Y_train_normal = Y_scaler.fit_transform(Y_train)
43 Kourti & MacGregor, Process analysis, monitoring and diagnosis, using multivariate projection methods, Chemometrics and Intelligent Laboratory Systems, 1995
from sklearn.cross_decomposition import PLSRegression

pls = PLSRegression(n_components = 3)
pls.fit(X_train_normal, Y_train_normal)
Computation of the captured variances reveals that just 56% of the information in X explains almost 90% of the variation in Y; this implies that there are variations in X which have only a minor impact on the quality variables. A sketch of this computation appears after the code below.
Tscores = pls.x_scores_
X_train_normal_reconstruct = np.dot(Tscores, pls.x_loadings_.T)   # reconstruct X from scores and loadings
# can also use pls.inverse_transform(Tscores)
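One way to compute the captured variances is via the R² of the X reconstruction and the Y prediction; a minimal sketch (the ~56% and ~90% figures quoted above come from the text):

from sklearn.metrics import r2_score
print('X variance captured:', r2_score(X_train_normal, X_train_normal_reconstruct))
print('Y variance captured:', r2_score(Y_train_normal, pls.predict(X_train_normal)))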
A look at the t vs u score plots further confirms that a linear correlation was a reasonable assumption for this dataset. We can also see how the correlation becomes poorer for the higher components.
Figure 7.14: X-scores vs Y-scores. Here tj and uj refer to the jth columns of T and U
matrices, respectively.
T2 statistic
Like in PCA, the T2 statistic quantifies the systematic variations in the predictor (X) variables that are related to the systematic variations in the response (Y) variables. A large deviation in the T2 statistic implies significant changes in process operating conditions. Let $t_i$ denote the ith row of T. The T2 index for this data-point is given by

$$T^2 = \sum_{j=1}^{k} \frac{t_{i,j}^2}{\sigma_j} = t_i \Lambda_k^{-1} t_i^T$$

$\Lambda_k$, a diagonal matrix, is the covariance matrix of T, with $\sigma_j$ (the variance of the jth component scores) as its diagonal elements. $T_{CL}^2$ for a false alarm rate of α is obtained from the following expression

$$T_{CL}^2 = \frac{k(N^2 - 1)}{N(N-k)} F_{k,N-k}(\alpha)$$
The second and third indices, SPEx and SPEy, represent the residuals or the unmodeled parts of X and Y, respectively. Let $e_i$ and $f_i$ denote the ith rows of E and F, respectively. Then

$$SPE_x = \sum_{j=1}^{m} e_{i,j}^2, \qquad SPE_y = \sum_{j=1}^{p} f_{i,j}^2$$
Note that if output measurements are not available in real-time, then SPEy is not calculated. With the normality assumption for the residuals and a large number of training samples, the control limit for an SPE statistic is given by the following expression

$$SPE_{CL} = g \, \chi_\alpha^2(h), \qquad h = \frac{2\mu^2}{\sigma}, \quad g = \frac{\sigma}{2\mu}$$
$\chi_\alpha^2(h)$ is the (1−α) percentile of a chi-squared distribution44 with h degrees of freedom; μ denotes the mean value and σ the variance of the SPE statistic over the training samples. Let's now generate the control charts for the fault detection indices.
# T2 for training data
N = X_train_normal.shape[0]
k = 3                               # number of latents retained above
T_cov = np.cov(Tscores.T)
T_cov_inv = np.linalg.inv(T_cov)
T2_train = np.zeros((N,))
for i in range(N):
    T2_train[i] = np.dot(np.dot(Tscores[i,:],T_cov_inv),Tscores[i,:].T)
# SPEx
x_error_train = X_train_normal - X_train_normal_reconstruct
SPEx_train = np.sum(x_error_train*x_error_train, axis = 1)
# SPEy
y_error_train = Y_train_normal - pls.predict(X_train_normal)
SPEy_train = np.sum(y_error_train*y_error_train, axis = 1)
# control limits
#T2
import scipy.stats
alpha = 0.01 # 99% control limit
T2_CL = k*(N**2-1)*scipy.stats.f.ppf(1-alpha,k,N-k)/(N*(N-k))
# SPEx
mean_SPEx_train = np.mean(SPEx_train)
var_SPEx_train = np.var(SPEx_train)
g = var_SPEx_train/(2*mean_SPEx_train)
h = 2*mean_SPEx_train**2/var_SPEx_train
SPEx_CL = g*scipy.stats.chi2.ppf(1-alpha, h)
# SPEy
mean_SPEy_train = np.mean(SPEy_train)
var_SPEy_train = np.var(SPEy_train)
g = var_SPEy_train/(2*mean_SPEy_train)
h = 2*mean_SPEy_train**2/var_SPEy_train
SPEy_CL = g*scipy.stats.chi2.ppf(1-alpha, h)

44 Yin et al., A review of basic data-driven approaches for industrial process monitoring, IEEE Transactions on Industrial Electronics, 2014
Contribution analysis-based fault isolation for a PLS model proceeds along similar lines as for PCA. Each monitoring statistic is broken down into contributions from the individual variables, and the variables with the highest contributions are flagged as potentially faulty variables.
SPE contributions

As in PCA, the SPE contributions are simply the squared residuals: the jth variable contributes $e_j^2$ to SPEx and $f_j^2$ to SPEy.
T2 contributions
To find the T2 contributions, the following approach is adopted45. Let $x^{sample} \in \mathbb{R}^{m \times 1}$ and $t^{sample} \in \mathbb{R}^{k \times 1}$ denote the input vector and the score vector, respectively, for a candidate sample. Then,

$$T^2 = (t^{sample})^T \Lambda_k^{-1} t^{sample} = \left\| \Lambda_k^{-0.5} t^{sample} \right\|^2 = \left\| \Lambda_k^{-0.5} R^T x^{sample} \right\|^2 = \left\| \Gamma x^{sample} \right\|^2$$
45 Choi & Lee, Multiblock PLS-based localized process diagnosis, Journal of Process Control, 2005.
where $\Gamma = \Lambda_k^{-0.5} R^T$, $\Gamma(:,j)$ is the jth column of Γ, and $x^{sample}(j)$ is the value of the jth input variable in the sample. Since $\Gamma x^{sample} = \sum_{j=1}^{m} \Gamma(:,j)\, x^{sample}(j)$, the contribution of the jth input variable to $T^2$ is given as $\left\| \Gamma(:,j)\, x^{sample}(j) \right\|^2$. Let's now find which variables need to be further investigated at the 54th sample.
# SPEx contribution
sample = 54
data_point = np.transpose(data_normal[sample-1,])        # scaled input measurements at the 54th sample
x_error_test_sample = x_error_test[sample-1,]            # X residuals (computed for the test period as for training)
SPEx_contri = x_error_test_sample*x_error_test_sample    # vector of contributions

# SPEy contribution
y_error_test_sample = y_error_test[sample-1,]
SPEy_contri = y_error_test_sample*y_error_test_sample    # vector of contributions
# T2 contribution
import scipy.linalg

W = pls.x_weights_
P = pls.x_loadings_
R = np.dot(W, np.linalg.inv(np.dot(P.T, W)))
Gamma = np.dot(scipy.linalg.sqrtm(T_cov_inv), R.T)       # Γ = Λk^(-0.5) R^T
T2_contri = np.zeros((X_train_normal.shape[1],))
for i in range(X_train_normal.shape[1]):
    vect = Gamma[:,i]*data_point[i]
    T2_contri[i] = np.dot(vect, vect)
Variable #9 makes large contributions to both the SPEx and T2 indices, and in Figure 7.10 we can see that there was a sharp increase in its value towards the end of the sampling period. Variable #9 was also reported as the most contributing variable in the work of MacGregor et al.43
With this, we have come to the end of our study of PCA- and PLS-based monitoring solution
development. You probably now understand why these two techniques are so popular.
However, notwithstanding the powerful capabilities of PCA and PLS, you may encounter
problems where vanilla-PCA and vanilla-PLS fail to provide satisfactory performance. Process
data exhibiting severe nonlinearity and/or dynamics require some modifications to the vanilla
techniques. We will study these variants in the upcoming chapters.
Summary
With this chapter we have reached a significant milestone in our ML-based PM/PdM journey.
You have seen how hidden process knowledge can be conveniently extracted from process
data and converted into process insights. With PCA and PLS tools in your arsenal you are
now well-equipped to tackle many of the process monitoring related problems. However, our
journey does not end here. In the next chapter, we will study a few more latent-variable-based
techniques that are equally powerful.
Chapter 8
Multivariate Statistical Process Monitoring for Linear and Steady-State Processes: Part 2
By now you must be very impressed with the powerful capabilities of PCA and PLS
techniques. These methods allowed us to extract latent variables and monitor
systematic variations in latent space and process noise separately. However, you may
ask, “Are these the best latent variable-based techniques to use for all problems?”. We are
glad that you asked! Other powerful methods do exist which may be better suited in certain
scenarios. For example, independent component analysis (ICA) is preferable over PCA when
process data is not Gaussian distributed. It can provide latent variables with the stricter property of statistical independence rather than mere uncorrelatedness. Independent components may
in better monitoring performance.
In another scenario, if your end goal is to classify process faults into different categories for
fault diagnosis, then, maximal separation between data from different classes of faults would
be your primary concern rather than maximal capture of data variance. Fisher discriminant
analysis (FDA) is preferred for such tasks.
In this chapter, we will learn in detail the properties of ICA and FDA. We will apply these
methods for process monitoring and fault classification for a large-scale chemical plant.
Specifically, the following topics are covered
• Introduction to ICA
• Process monitoring of non-Gaussian processes
• Introduction to FDA
• Fault classification for large scale processes.
Figure 8.1: Simple illustration of ICA vs PCA. The arrows in the x1 vs x2 plot show the direction
vectors of corresponding components. Note that the signals t1 and t2 are not independent as
value of one variable influences the range of values of the other variable.
ICA uses higher-order statistics for latent variable extraction, instead of only the second-order statistics (mean, variance/covariance) used by PCA. Therefore, for non-Gaussian process data, ICA can characterize the latent structure better than PCA.

46 If you observe closely, you will find that the ICA latent signals (u1 and u2) do differ from the s1 and s2 signals in terms of sign and magnitude; we will soon learn why this happens and why this is not a cause of worry.
Independence vs Uncorrelatedness
Before jumping into the mathematics behind ICA, let us take a few seconds to ensure that we understand the concepts of independence and uncorrelatedness. Two random variables, $y_1$ and $y_2$, are said to be independent if the value of one signal does not impact the value of the other signal. Mathematically, this condition is stated as

$$p(y_1, y_2) = p(y_1)\,p(y_2)$$

where $p(y_1)$ is the probability density function of $y_1$ alone, and $p(y_1, y_2)$ is the joint probability density function. The variables $y_1$ and $y_2$ are said to be uncorrelated if their covariance is zero:

$$C(y_1, y_2) = E\{(y_1 - E(y_1))(y_2 - E(y_2))\} = E(y_1 y_2) - E(y_1)E(y_2) = 0$$

where $E(\cdot)$ denotes mathematical expectation. Using the independence condition, it can easily be shown that if the variables are independent, they are also uncorrelated, but not vice versa (for example, if $y_1$ is symmetric about zero and $y_2 = y_1^2$, the two are uncorrelated even though $y_2$ is completely determined by $y_1$). Therefore, uncorrelatedness is a weaker form of independence.
Mathematical background
Consider the data matrix $X \in \mathbb{R}^{m \times N}$ consisting of N observations of m input variables, where each column represents a data-point in the original measurement space. Note that, in contrast to PCA, the transposed form of the data matrix is employed here. In ICA, it is assumed that the measured variables are a linear combination of d (≤ m) independent components $s_1, s_2, \ldots, s_d$:

$$X = AS \qquad \text{eq. 1}$$

where $S \in \mathbb{R}^{d \times N}$ and $A \in \mathbb{R}^{m \times d}$ is called the mixing matrix. The objective of ICA is to estimate the unknown matrices A and S from the measured data X. This is accomplished by finding a demixing matrix, W, such that the ICs, i.e., the rows of the estimated matrix $\hat{S}$, become as independent as possible:

$$\hat{S} = WX \qquad \text{eq. 2}$$

Before estimating W, the initial step in ICA involves removing correlations between the variables in the data matrix X. This step, called whitening or sphering, is accomplished via PCA:

$$Z = QX \qquad \text{eq. 3}$$

where $Q \in \mathbb{R}^{d \times m}$ is called the whitening matrix. Whitening makes the rows of Z uncorrelated. To see how it helps, let B = QA and consider the following relationships:

$$Z = QX = QAS = BS \qquad \text{eq. 4}$$

Therefore, whitening converts the problem of finding matrix A into that of finding matrix B. The advantage lies in the fact that B is an orthogonal matrix (this can be shown by noting that the whitened variables are uncorrelated and the ICs are independent), and hence fewer parameters need to be estimated. Using the orthogonality property, the following relationship results:

$$\hat{S} = B^T Z = B^T Q X \;\;\Rightarrow\;\; W = B^T Q \qquad \text{eq. 5}$$
The above procedure summarizes the steps involved in ICA. Note that the sets {A/n, nS} and {A, S} result in the same measured data matrix X for any non-zero scalar n; therefore, the sign and magnitude of the original ICs cannot be uniquely estimated. This, however, does not affect the usage of ICA for process monitoring, because the estimated ICs for both training and test data get scaled by the same scalar implicitly. FastICA (a popular algorithm for ICA) computes ICs such that the L2 norm of each IC score is 1. This is the reason why the reconstructed IC signals in Figure 8.1 appear to be scaled versions of the original IC signals. We will use this fact later, so do not forget it!

47 www.sci.utah.edu/~shireen/pdfs/tutorials/Elhabian_ICA09.pdf is an excellent quick read for more details on this.
Let's quickly see the process impact of one of the fault conditions (Fault 10), which involves disturbances in the temperature of one of the feeds. The impact of this fault can be seen in the abnormal stripper temperature profile (Figure 8.4a). The plot in PC space (Figure 8.4b) shows more clearly how the faulty operation data differ from the normal operation data. We will use ICA and FDA for detecting and classifying these faults automatically.

48 Detailed information about the process and the faults can be obtained from the original paper by Downs and Vogel titled 'A plant-wide industrial process control problem'.

49 Adapted from the original flowsheet by Gilberto Xavier (https://github.com/gmxavier/TEP-meets-LSTM) provided under the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).
# quick visualization
plt.figure(), plt.plot(TEdata_noFault_train[:,17])
plt.xlabel('sample #'), plt.ylabel('Stripper Temperature'), plt.title('Normal operation')
plt.figure(), plt.plot(TEdata_Fault_train[:,17])
plt.xlabel('sample #'), plt.ylabel('Stripper Temperature'), plt.title('Faulty operation')
Figure 8.4 (a): Normal vs faulty process profile in Tennessee Eastman dataset
Figure 8.4 (b): Normal vs faulty (Fault 10) TE process data in PC space
The number of ICs returned by FastICA equals the dimension of the measured space. Previously, we saw in PCA that not all extracted latent variables are important; only a few dominant components contain the majority of the information, while the rest contain noise or trivial details. Moreover, the extracted PCs were ordered according to their variance. However, there is no standard criterion for quantifying the importance of ICs and selecting the optimal number of ICs automatically. One popular approach (described below) for quantifying component importance was suggested by Lee et al.50

Equation 2 shows that the ith row ($w_i$) of the demixing matrix W corresponds to the ith IC. Therefore, Lee et al. suggested using the Euclidean (L2) norm of $w_i$ to quantify the importance of the ith IC and subsequently ordering the ICs in decreasing order of importance. Using this rationale, Figure 8.5 shows that not all ICs are equally important, as the L2 norms of several ICs are much smaller than the rest.
# scale data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_train_normal = scaler.fit_transform(data_noFault_train)

# fit ICA and compute the L2 norms of the rows of W
# (the FastICA fit is elided here; the settings below are assumptions)
from sklearn.decomposition import FastICA
ica = FastICA(max_iter=1000).fit(data_train_normal)
W = ica.components_                                   # demixing matrix
L2_norm = np.linalg.norm(W, 2, axis=1)
L2_norm_sorted_pct = 100*np.sort(L2_norm)[::-1]/np.sum(L2_norm)

plt.figure()
plt.plot(L2_norm, 'b'), plt.xlabel('IC number (unsorted)'), plt.ylabel('L2 norm')
plt.figure()
plt.plot(L2_norm_sorted_pct, 'b+'), plt.xlabel('IC number (sorted)'), plt.ylabel('% L2 norm')
50 Lee et al., Statistical process monitoring with independent component analysis, Journal of Process Control, 2004
Figure 8.5: (Unsorted) L2 norm of each row of W and (sorted) percentage of the L2 norms
After the ordering of the ICs, two approaches could be utilized to determine the optimal number of ICs. Approach 1 uses the sorted L2 norm plot to determine a cut-off IC number beyond which the norms are relatively insignificant. For our dataset, as seen in Figure 8.5, no clear cut-off number is apparent. The other approach entails choosing the number of ICs equal to the number of PCs. This approach also ensures a fair comparison between ICA and PCA. We will use this second approach in this chapter.
# decide # of ICs to retain via PCA variance method
from sklearn.decomposition import PCA
pca = PCA().fit(data_train_normal)
cum_explained_variance = np.cumsum(100*pca.explained_variance_ratio_)
n_comp = np.argmax(cum_explained_variance >= 90) + 1   # PCs capturing >= 90% variance (threshold assumed)
Note that the FastICA.fit method expects each row of the data matrix (X) to represent a sample, while we used a transposed form in our mathematical descriptions. This may cause confusion at times. Nonetheless, the shapes of the extracted mixing, demixing, and whitening matrices are the same in both places.
The illustrative example in Figure 8.1 showed that if the latent variables are non-Gaussian distributed, then ICA can extract the latent signals better than PCA. For such systems, it can be expected that ICA-based monitoring will give better performance than PCA-based monitoring. As with the PCA/PLS-based monitoring mechanism, monitoring metrics and their corresponding thresholds are computed using NOC data. The metric values for test data are compared against the thresholds to check for the presence of faults.
I2 statistic

The first statistic, $I^2$, is defined as the sum of the squared independent scores (in the reduced-dimension space) and is a measure of the systematic process variations (like the PCA T2 statistic). Let $s_i$ denote the ith column of matrix $S_d$, which represents the ith sample/data-point in the reduced IC space. The $I^2$ index for this sample is calculated by

$$I^2 = s_i^T s_i \qquad \text{eq. 6}$$
You may wonder why we have not included the covariance matrix as we did for the T2 metric computation. This is because the variance of each IC score is the same (remember, the L2 norm of each IC score is 1), and thus the inclusion of a covariance matrix is redundant.
SPE statistic

The second index, SPE, represents the distance between the measured and reconstructed data-point in the measurement space. For its computation, let us construct a matrix $B_d$ by selecting the columns from matrix B whose indices correspond to the indices of the rows selected from W when we generated matrix $W_d$. Let $x_i$ and $\hat{x}_i$ denote the ith measured and reconstructed sample, respectively. Then,

$$\hat{x}_i = Q^{-1} B_d s_i = Q^{-1} B_d W_d x_i, \qquad e_i = x_i - \hat{x}_i, \qquad SPE = e_i^T e_i \qquad \text{eq. 7}$$
$I_e^2$ statistic

The third metric, $I_e^2$, is based on the excluded ICs. Let $W_e$ contain the excluded rows of matrix W. Then

$$s_i^e = W_e x_i, \qquad I_e^2 = (s_i^e)^T s_i^e \qquad \text{eq. 8}$$
Due to the non-Gaussian nature of the ICs, we do not have convenient closed-form expressions for the thresholds/control limits of the above indices. A standard practice is to use kernel density estimation (KDE) for threshold determination of the ICA metrics. Since we will learn KDE in a later chapter, we will employ the percentile method here.

Below we define a function that takes an ICA model and a data matrix as inputs and returns the monitoring metrics. Figure 8.6 shows the monitoring charts along with the 99% control limits.
# Define function to compute ICA monitoring metrics for training or test samples
def compute_ICA_monitoring_metrics(ica_model, number_comp, data):
    """
    Parameters
    ----------
    ica_model: fitted FastICA model
    number_comp: number of dominant ICs retained
    data: numpy array of shape = [n_samples, n_features]

    Returns
    -------
    monitoring_stats: numpy array of shape = [n_samples, 3]
    """
    N = data.shape[0]

    # model parameters; sort the rows of W by decreasing L2 norm
    W = ica_model.components_
    L2_norm = np.linalg.norm(W, 2, axis=1)
    sort_order = np.flip(np.argsort(L2_norm))
    W_sorted = W[sort_order,:]

    # compute I2
    Wd = W_sorted[0:number_comp,:]
    Sd = np.dot(Wd, data.T)
    I2 = np.array([np.dot(Sd[:,i], Sd[:,i]) for i in range(N)])

    # compute Ie2
    We = W_sorted[number_comp:,:]
    Se = np.dot(We, data.T)
    Ie2 = np.array([np.dot(Se[:,i], Se[:,i]) for i in range(N)])

    # compute SPE (reconstruction per eq. 7; this tail was elided by the page break)
    Q = ica_model.whitening_
    Q_inv = np.linalg.inv(Q)
    A = ica_model.mixing_
    B = np.dot(Q, A)
    B_sorted = B[:,sort_order]
    Bd = B_sorted[:,0:number_comp]
    data_reconstruct = np.dot(np.dot(np.dot(Q_inv, Bd), Wd), data.T).T
    error = data - data_reconstruct
    SPE = np.sum(error*error, axis=1)

    monitoring_stats = np.column_stack((I2, Ie2, SPE))
    return monitoring_stats
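With the function in place, the 99th-percentile control limits follow directly from the training metrics; a minimal sketch:

# metrics and empirical 99% control limits from NOC training data
ICA_statistics_train = compute_ICA_monitoring_metrics(ica, n_comp, data_train_normal)
CLs = [np.percentile(ICA_statistics_train[:,j], 99) for j in range(3)]   # [I2_CL, Ie2_CL, SPE_CL]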
# Function to draw the monitoring charts against their control limits
def draw_monitoring_chart(ICA_statistics, CLs, trainORtest):   # hypothetical name; original def line elided
    """
    Parameters
    ----------
    ICA_statistics: numpy array of shape = [n_samples, 3]
    CLs: List of control limits
    trainORtest: 'training' or 'test'
    """
    statistic_labels = ['I2', 'Ie2', 'SPE']
    for j in range(3):
        plt.figure()
        plt.plot(ICA_statistics[:,j], color='black')
        plt.axhline(CLs[j], color='red', linestyle='--', label='99% control limit')
        plt.xlabel('sample #'), plt.ylabel(statistic_labels[j])
        plt.title(trainORtest + ' data'), plt.legend()
Figures 8.7 and 8.8 show the ICA/PCA monitoring charts for Faults 10 and 5. For Fault 10, ICA gives a significantly higher fault detection rate (FDR). There are many samples between samples 400 and 600 where the PCA metrics are below their control limits despite the presence of process abnormalities. The performance difference is even more prominent for the Fault 5 test data. From sample 400 onwards, the PCA charts incorrectly indicate normal operation, although it is known that the faulty condition persists till the end of the sampling period. Another point to note is that both PCA and ICA have low false alarm rates (FARs), as few samples before sample 160 violate the control limits. Here again, ICA has lower/better FAR values. Similar FAR behavior is observed for the non-faulty test dataset. We have not shown the FAR values, but you should compute them (as sketched below) and confirm that ICA gives lower FAR values.
# Function to compute the overall alarm rate from the monitoring statistics
def compute_alarmRate(monitoring_stats, CLs):   # hypothetical name; original def line elided
    """
    Parameters
    ----------
    monitoring_stats: numpy array of shape = [n_samples, 3]
    CLs: List of control limits

    Returns
    -------
    alarmRate: float
    """
    violationFlag = monitoring_stats > CLs
    alarm_overall = np.any(violationFlag, axis=1) # violation of any metric => alarm
    alarmRate = 100*np.sum(alarm_overall)/monitoring_stats.shape[0]
    return alarmRate

51 Lee et al., Statistical monitoring of dynamic processes based on dynamic independent component analysis, Chemical Engineering Science, 2004

52 Yin et al., A comparison study of basic data-driven fault diagnosis and process monitoring methods on the benchmark Tennessee Eastman process, Journal of Process Control, 2012
xmeas = TEdata_Fault_test[:,0:22]    # the 22 measured process variables
xmv = TEdata_Fault_test[:,41:52]     # the 11 manipulated variables
data_Fault_test = np.hstack((xmeas, xmv))

# scale data
data_test_scaled = scaler.transform(data_Fault_test)
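The FAR/FDR computation then reduces to applying the alarm-rate function over the appropriate sample ranges; a minimal sketch (compute_alarmRate is the hypothetical name used in the reconstruction above, and the fault is introduced at sample 160):

# FAR/FDR for the Fault 10 test data
ICA_statistics_test = compute_ICA_monitoring_metrics(ica, n_comp, data_test_scaled)
FAR = compute_alarmRate(ICA_statistics_test[0:160,:], CLs)   # pre-fault samples
FDR = compute_alarmRate(ICA_statistics_test[160:,:], CLs)    # post-fault samples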
Figure 8.7: Monitoring charts for Fault 10 data with ICA (top, FDR = 90.8%) and PCA
(bottom, FDR = 75.6%) metrics
Figure 8.8: Monitoring charts for Fault 5 data with ICA (top, FDR = 100%) and PCA (bottom,
FDR = 41.5%) metrics
Fisher discriminant analysis (FDA), also called linear discriminant analysis (LDA), is a multivariate dimensionality reduction technique which maximizes the 'separation', in the lower-dimensional space, between data belonging to different classes. For conceptual understanding, consider the simple illustration in Figure 8.9, where a 2D dataset (with 2 classes of data) has been projected onto 1D spaces by FDA and PCA. The respective 1-D scores show that while the two classes of data are well segregated in the LD space, the segregation is very poor in the PC space. This observation is expected because PCA does not consider information about the different data classes while determining the projection directions. Therefore, if your intention is to reduce dimensionality for subsequent data classification and the training data are labeled into different classes, then FDA is more suitable.
Figure 8.9: Simple illustration of FDA vs PCA. The arrows in the x1 vs x2 plot show the
direction vectors of 1st components of the corresponding methods
Due to its powerful data discrimination ability, FDA is widely used in the process industry for operating-mode classification and fault diagnosis. Large-scale industrial processes often experience different kinds of process issues/faults, and it is imperative to accurately identify the specific issue quickly during online operations to minimize economic losses. During model training, FDA learns from data collected from historical process failures (the data from a specific process fault form a class) to find the optimal projection directions; abnormal process data can then be classified into specific faults in real-time. We will study one such application in this chapter.
Mathematical background

To facilitate data classification, FDA not only maximizes the separation between the classes but also minimizes the variation/scatter within each class. To see how this is achieved, let us first consider a dataset matrix $X \in \mathbb{R}^{N \times m}$ consisting of N samples, $N_1$ of which belong to class 1 ($\omega_1$) and $N_2$ to class 2 ($\omega_2$). FDA seeks a projection vector $w \in \mathbb{R}^m$ such that the projected scalars ($z = w^T x$) are maximally separated. Let $\tilde{\mu}_1$ and $\tilde{\mu}_2$ denote the means of the projected values of classes 1 and 2, respectively:

$$\tilde{\mu}_i = \frac{1}{N_i} \sum_{z \in \omega_i} z$$

Class separation could be quantified as the distance between the projected means, $|\tilde{\mu}_1 - \tilde{\mu}_2|$; this, however, is not a robust measure, as shown by the illustration below.
The scatter of the projected samples within class i is therefore also considered:

$$\tilde{s}_i^2 = \sum_{z \in \omega_i} (z - \tilde{\mu}_i)^2$$

The separation criterion is now defined as the normalized distance between the projected means. This formulation seeks a projection where the class means are far apart and samples from the same class are close to each other:

$$J(w) = \frac{|\tilde{\mu}_1 - \tilde{\mu}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$
Using the base relation $z = w^T x$ and straightforward algebraic manipulations53, one can equivalently represent the objective $J(w)$ in the following form, which also holds for any number (p) of data classes:

$$J(w) = \frac{w^T S_b w}{w^T S_w w} \qquad \text{eq. 9}$$

$$S_b = \sum_{j=1}^{p} N_j (\mu_j - \mu)(\mu_j - \mu)^T, \qquad S_w = \sum_{j=1}^{p} S_j, \qquad S_j = \sum_{x_i \in \omega_j} (x_i - \mu_j)(x_i - \mu_j)^T$$

where $\mu \in \mathbb{R}^m$ and $\mu_j \in \mathbb{R}^m$ denote the mean vectors of all the N samples and the $N_j$ samples from the jth class, respectively, in the measurement space. The first FDA vector, $w_1$, is found by maximizing J(w), and the subsequent vectors are found by solving the same problem with added constraints of orthogonality to the previously computed vectors. Note that there can be at most p−1 FDA vectors. Alternatively, like PCA, the vectors can also be computed as solutions of a generalized eigenvalue problem

$$S_w^{-1} S_b w = \lambda w \qquad \text{eq. 10}$$

53 Elhabian & Farag, A tutorial on data reduction: Linear Discriminant Analysis, September 2009
where λ = J(w). Therefore, the eigenvalues (λ) indicate the degree of separability among the data classes when projected onto the corresponding eigenvectors. The first discriminant/FDA vector (eigenvector) corresponds to the largest eigenvalue, the 2nd FDA vector is associated with the 2nd largest eigenvalue, and so on. Once the FDA vectors are determined, data-points can be projected, and classification models can be built in the reduced FDA space. Overall, the FDA transformation from the m-dimensional space to the (p−1)-dimensional FDA space can be represented as

$$Z = X W_p$$

where $W_p \in \mathbb{R}^{m \times (p-1)}$ contains the p−1 FDA vectors as columns and $Z \in \mathbb{R}^{N \times (p-1)}$ is the data matrix in the transformed space, where each row is a transformed sample. The transformed samples are optimally separated in the FDA space.
# scale data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
Faultydata_train_scaled = scaler.fit_transform(data_Faulty_train)
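The FDA model fit itself is elided here; a minimal sketch using sklearn's LinearDiscriminantAnalysis (the label array y_train_labels and the scaled test matrix Faultydata_test_scaled are assumed names):

# fit FDA/LDA on the labeled fault data and project both datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()
scores_train_lda = lda.fit_transform(Faultydata_train_scaled, y_train_labels)   # assumed label array
scores_test_lda = lda.transform(Faultydata_test_scaled)                         # assumed scaled test data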
Figure 8.10: FDA and PCA scores in 2 dimensions with 3 fault classes from TEP training
(top) and test (bottom) dataset
Figure 8.10 shows the transformed samples in 2 dimensions after FDA and PCA. FDA is able to provide a clear separation of the Fault 5 samples; however, it could not separate the Fault 10 and Fault 19 data. In fact, the 2nd discriminant (FD2) contributes little to the discrimination; therefore, only FD1 is needed to separate the samples of Fault 5. To segregate Faults 10 and 19, kernel FDA can be explored54. Linear PCA, on the other hand, fails to separate any of the classes.
While FDA is a powerful tool for fault diagnosis, it can also be used for fault
detection by including NOC data as a separate class.
After projecting the samples onto the FDA space, any classification technique can be chosen to classify or diagnose the specific fault. A popular T2 statistic-based approach entails computing a T2 control limit for each fault class using the training data. The control limit ($T_{CL,j}^2$) for the jth fault class represents a boundary around the projected samples from the jth fault in the lower-dimensional space; any given sample lying inside the boundary belongs to the jth fault class. Mathematically, this can be specified as

$$T_{sample,j}^2 < T_{CL,j}^2 \;\Rightarrow\; \text{sample belongs to the } j^{th} \text{ fault class}$$

$T_{sample,j}^2$ for a sample with respect to the jth class is given by

$$T_{sample,j}^2 = (z_{sample} - \tilde{\mu}_j)^T \tilde{S}_j^{-1} (z_{sample} - \tilde{\mu}_j)$$

where $\tilde{\mu}_j$ and $\tilde{S}_j$ denote the mean and covariance matrix, respectively, of the projected samples belonging to the jth fault class in the training dataset, and $z_{sample}$ denotes the projected sample in the FDA space. $T_{CL,j}^2$ can be obtained using the same expression we used in PCA, $k(N_j^2 - 1)/(N_j(N_j - k)) \, F_{k,N_j-k}(\alpha)$, where k denotes the number of dimensions retained in the FDA space. For illustration, let us see how many samples from the Fault 5 test data get correctly identified.
54 Hyun-Woo Cho, Nonlinear feature extraction and classification of multivariate process data in kernel feature space, Expert Systems with Applications, 2007
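A minimal sketch of this classification check for Fault 5, under the assumptions of the previous sketch:

# T2-based identification of Fault 5 test samples
import scipy.stats
z5 = scores_train_lda[y_train_labels == 5, :]                # projected Fault 5 training samples
mu5, S5_inv = np.mean(z5, axis=0), np.linalg.inv(np.cov(z5.T))
N5, kDim = z5.shape
alpha = 0.01
T2_CL_5 = kDim*(N5**2-1)*scipy.stats.f.ppf(1-alpha, kDim, N5-kDim)/(N5*(N5-kDim))
T2_test_5 = np.array([np.dot(np.dot(z-mu5, S5_inv), z-mu5) for z in scores_test_lda])
print('% identified as Fault 5:', 100*np.mean(T2_test_5 < T2_CL_5))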
About 98% of the samples have been correctly identified as belonging to Fault 5. As shown in Figure 8.11, some of the samples which fall far away from the mean violate $T_{CL,5}^2$ and therefore are not classified as Fault 5.
Summary
With this chapter, we have now covered the vanilla version of four major MSPM techniques
that are frequently utilized for analyzing process data. This chapter has also provided an
important message: blind application of a single technique all the time may not yield the best results. The techniques should be chosen according to the process system (Gaussian vs non-Gaussian) and the objective (fault detection vs fault classification) at hand. ICA and FDA are powerful techniques, and there is much more to them than what we have touched on in this chapter. While you are encouraged to explore more about these methods (now that you have a conceptual understanding of them), we will move on to study variants of PCA and PLS suitable for
dynamic data in the next chapter.
Chapter 9
Multivariate Statistical Process Monitoring for Linear and Dynamic Processes
In the previous chapters, we saw how beautifully latent variable-based MSPM techniques
can extract hidden steady-state relationships from data. However, we imposed a major
restriction of absence of dynamics in the dataset. Unfortunately, it is common to have to
deal with industrial datasets that exhibit significant dynamics and the standard MSPM
techniques fail in extracting dynamic relationships among process variables. Nonetheless, the
MSPM research community came up with a simple but ingenious modification to the standard
MSPM techniques that made working with dynamic dataset very easy. The trick entails
including the past measurements as additional process variables. That’s it! The standard
techniques can then be employed on the augmented dataset. The dynamic variants of the
standard MSPM techniques are dynamic PCA (DPCA), dynamic PLS (DPLS), dynamic ICA
(DICA), etc.
Dynamic PCA and dynamic PLS are among the most popular techniques for monitoring linear
and dynamic processes; accordingly, these are covered in detail in this chapter. Additionally,
this chapter also introduces another very popular and powerful technique that is specially
designed to extract dynamic relationships from process data – canonical variate analysis
(CVA). Using numerical and industrial-scale case studies, we will see how to use these three
techniques to build fault detection tools. Specifically, the following topics are covered
Dynamic PCA is the dynamic extension of conventional PCA designed to handle process data that exhibit significant dynamics. DPCA simply entails applying conventional PCA to an augmented data matrix which, as shown in Figure 9.1, is generated by using past measurements as additional process variables. Note that each 'variable' of the augmented matrix is normalized to zero mean and unit variance, as is done in conventional PCA. It may seem surprising, but such a simple approach has achieved great success in the process industry and has been readily adopted due to its ease of implementation.
Figure 9.1: Dynamic PCA procedure — conventional PCA applied to the augmented matrix gives the DPCA model [l denotes the number of lags used]
All the mathematical expressions for the computations of the score matrix55, residual matrix, Hotelling's T2, and SPE remain the same (Eq. 1 to Eq. 8) as shown in Chapter 7, except that now you will be using the scaled $X_{aug}$ instead of X, i.e., $T = X_{aug} P$ and $E = X_{aug} - \hat{X}_{aug}$. The procedure for determining the number of retained latent variables also remains the same. You may, amongst other approaches, look for a 'knee' in the scree plot of the explained variance or use the cumulative percent variance approach. If you choose 'l' correctly, then both the static and dynamic relationships among the process variables are captured; correspondingly, the residuals and the Q statistic will not exhibit autocorrelations56. For the test dataset, you would again simply perform augmentation with time-lagged measurements, as shown in the sketch below.
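The augmentation of Figure 9.1 can be implemented with a small helper; a minimal sketch (the augment function used in the code later follows this pattern):

# stack current and l lagged measurements column-wise
def augment(X, l):
    N, m = X.shape
    X_aug = np.zeros((N-l, m*(l+1)))
    for lag in range(l+1):
        X_aug[:, lag*m:(lag+1)*m] = X[l-lag:N-lag, :]
    return X_aug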
Before we get into the nitty-gritties, let’s see a quick motivating example on why DPCA is
superior to PCA in the presence of dynamics.
55 The number of retained principal components in DPCA could be greater than m.

56 The DPCA scores can show autocorrelations.
Example 9.1:
To illustrate how DPCA can extract dynamic relationships, let’s consider the following
noise-free dynamic system.
The number of zero singular (eigen) values extracted during PCA indicates the number of linear relationships that exist among the process variables. Let's see if we can extract the above dynamic relationship using only the data (1000 samples of x1 and x2).
As expected, only one singular value is very close to zero. All we now need to do is fetch
the singular vector corresponding to this singular value and check if it represents our
dynamic system.
Voila! DPCA has successfully extracted the underlying process dynamics. PCA, on the other hand, does not reveal any relationship between the variables.
57 Ku et al., Disturbance detection and isolation by dynamic principal component analysis, Chemometrics and Intelligent Laboratory Systems, 1995.

58 Rato and Reis, Defining the structure of DPCA models and its impact on process monitoring and prediction activities, Chemometrics and Intelligent Laboratory Systems, 2013
To illustrate each step of DPCA-based fault detection, we will use a simple synthetic 2-input, 2-output process59 as shown below:

$$z(k) = \begin{bmatrix} 0.118 & -0.191 \\ 0.847 & 0.264 \end{bmatrix} z(k-1) + \begin{bmatrix} 1 & 2 \\ 3 & -4 \end{bmatrix} u(k-1)$$

$$u(k) = \begin{bmatrix} 0.811 & -0.226 \\ 0.477 & 0.415 \end{bmatrix} u(k-1) + \begin{bmatrix} 0.193 & 0.689 \\ -0.320 & -0.749 \end{bmatrix} w(k-1)$$

$$y(k) = z(k) + v(k)$$

Here, the output (y(k)) and input (u(k)) measurements are available for analysis; w(k) and v(k) are zero-mean random noise signals with variances 1 and 0.1, respectively. Simulated data from the process are available in the files multivariate_NOC_data.txt and multivariate_test_data.txt. In the test data, a disturbance (a unit step change in w2) was introduced at sample 50 to simulate a faulty condition.
We will now go through each step of building a DPCA-based fault detection solution. Let’s
begin by reading the datasets, generating the augmented data matrices, and then training the
DPCA model.
# fetch data
X_NOC = np.loadtxt('multivariate_NOC_data.txt')
X_test = np.loadtxt('multivariate_test_data.txt')
#%%%%%%%%%%%%%%%%%%%%%%%%%
# DPCA model training
#%%%%%%%%%%%%%%%%%%%%%%%%%
# augment and scale data
X_NOC_aug = augment(X_NOC, 1)
scaler = StandardScaler()
X_NOC_aug_scaled = scaler.fit_transform(X_NOC_aug)
# find n_component
explained_variance = 100*dpca.explained_variance_ratio_ # in percentage
59
used by Ku et al. in their seminal paper.
# monitoring statistics for NOC data
N = X_NOC_aug_scaled.shape[0]
scores_NOC = dpca.transform(X_NOC_aug_scaled)[:, 0:k]
lambda_k_inv = np.linalg.inv(np.diag(dpca.explained_variance_[0:k]))
T2_NOC = np.zeros((N,))
for i in range(N):
    T2_NOC[i] = np.dot(np.dot(scores_NOC[i,:], lambda_k_inv), scores_NOC[i,:].T)

error_NOC = X_NOC_aug_scaled - np.dot(scores_NOC, dpca.components_[0:k,:])
Q_NOC = np.sum(error_NOC*error_NOC, axis=1)

# control limits (empirical 99th percentiles)
Q_CL = np.percentile(Q_NOC, 99)
T2_CL = np.percentile(T2_NOC, 99)
Let’s now deploy the DPCA model on the test data and see if the monitoring statistics violate
the control limits.
# augment, scale, and project the test data
X_aug_test_scaled = scaler.transform(augment(X_test, 1))
scores_test = dpca.transform(X_aug_test_scaled)[:, 0:k]

N_test = X_aug_test_scaled.shape[0]
T2_test = np.zeros((N_test,))
for i in range(N_test):
    T2_test[i] = np.dot(np.dot(scores_test[i,:], lambda_k_inv), scores_test[i,:].T)
You can see that the DPCA-based fault detection procedure is the same as that for PCA-based fault detection, except for the augmentation of the data. In general, DPCA has been found to perform better than PCA when temporal correlation is present in the data.
Dynamic PLS is the dynamic extension of conventional PLS. As shown in Figure 9.2, DPLS entails employing lagged measurements as additional predictor variables. There are two ways of generating a DPLS model: by including lagged values of only the input variables, or of both the input and output variables. While the first approach leads to a FIR-type model, the latter approach leads to an ARX-type model. Like DPCA, once the augmented matrix is ready, scaling is performed and conventional PLS is applied. All the mathematical expressions for the computations of the score matrix, residual matrices, monitoring metrics, etc., remain the same as those seen in Chapter 7 for conventional PLS.
Figure 9.2: Dynamic PLS procedure. Augmenting X with lagged inputs only and applying conventional PLS gives a FIR-type DPLS model; augmenting with lagged inputs and outputs gives an ARX-type DPLS model.
There are two hyperparameters that need to be specified: the time lags60 and the number of latents retained. Both hyperparameters are commonly found via cross-validation. We have already seen how to modify the conventional FDI workflow to handle augmented data matrices in DPCA; therefore, we will skip showing the implementation code for DPLS-based FDI. Instead, we will move on to study a slightly different technique (state-space modeling) for building dynamic models.
60 Time lags are often chosen to be the same for all the input (and output, if building an ARX-type model) variables.
Canonical variate analysis (CVA) is a classical MSPM technique suited for modeling dynamic process systems. While PCA and PLS lead to input–output models, CVA is designed to provide state-space models as shown in Figure 9.3. States are latent variables that act as a conduit between process inputs and outputs. For example, in the distillation column system, it is clear that QB doesn't affect xD directly. There are internal dynamics or states (e.g., arising from the liquid holdup at each column tray) that changes in QB have to pass through before impacting xD. A similar argument can be made for the QB vs D relationship. The outputs share the same internal dynamics, and this leads to better parameter efficiency of the CVA model (compared to the DPCA and DPLS models).
In full system identification, one would estimate the system order n, the system matrices (A, B, C, D), the initial state x(0), and the statistical properties (covariance matrices) of the noises ($\Sigma_{ww}$, $\Sigma_{vv}$).
For CVA-based process monitoring applications, we do not need the system matrices; computing the states for the training and test datasets suffices. Tracking the systematic variations of the states and the state noise allows us to build monitoring metrics. As it turns out, the state vector at any time can be approximated as a linear combination of past inputs and outputs. The specific combination is found as the one that gives the best prediction of future outputs. Let's now familiarize ourselves with how this is achieved.
Mathematical background
Using the state-space model and some smart re-arrangement61 of the input–output data sequence, it can be shown that the state vector x(k) at time instant k can be approximated as a linear combination of past inputs and outputs

$$x(k) \approx J\, p(k) \qquad \text{eq. 1}$$

where J is an unknown matrix and the past vector p(k) is defined as

$$p(k) = \left[\, y^T(k-1)\;\; y^T(k-2)\; \cdots\; y^T(k-l_y)\;\; u^T(k-1)\;\; u^T(k-2)\; \cdots\; u^T(k-l_u) \,\right]^T$$
In Eq. 1, ly and lu denote the number of lagged outputs and inputs used to build the p(k) vector.
Usually, ly = lu = l and it is a model hyperparameter. The CVA method utilizes CCA (canonical
covariate analysis), a multivariate statistical technique, to find J. A quick primer on the CCA
technique is provided below.
61 Chen et al., A canonical variate analysis based process monitoring scheme and benchmark study, 2014.
[Illustration of CCA: the input data matrix $U \in \mathbb{R}^{N \times m}$ is transformed via $J$ into the canonical variate matrix $C \in \mathbb{R}^{N \times m}$, and the output data matrix $Y \in \mathbb{R}^{N \times r}$ is transformed via $L$ into the canonical variate matrix $D \in \mathbb{R}^{N \times r}$; a scatter plot of the canonical variables illustrates the correlation between the variate pairs.]
The matrices $J \in \mathbb{R}^{m \times m}$ and $L \in \mathbb{R}^{r \times r}$ are found via singular value decomposition

$$\Sigma_{uu}^{-1/2}\, \Sigma_{uy}\, \Sigma_{yy}^{-1/2} = U \Sigma V^T$$

where $J = U^T \Sigma_{uu}^{-1/2}$, $L = V^T \Sigma_{yy}^{-1/2}$, and $D = \Sigma$.
One may retain only those canonical variate pairs (say, the top n pairs) that show significant correlations and discard the others, thereby achieving dimensionality reduction. Let $\hat{c}$ and $\hat{d}$ be the n-dimensional (n < m, r) variate vectors; then

$$\hat{c} = J_n u, \qquad \hat{d} = L_n y$$

where $J_n$ and $L_n$ are the first n rows of J and L, respectively.
* It is assumed that u and y are centered. Scaling, however, is not required as CCA is scale-invariant, i.e., the
estimated latent variables and the correlations remain the same with or without scaling. Also note that we have
not used the ‘k’ notation as CCA is designed for static data.
CVA therefore solves the following singular value decomposition problem and finds linear combinations of past inputs and outputs that are maximally correlated with linear combinations of the current and future outputs62

$$\Sigma_{pp}^{-1/2}\, \Sigma_{pf}\, \Sigma_{ff}^{-1/2} = U \Sigma V^T \qquad \text{eq. 2}$$
We know that Σ is a diagonal matrix; its elements are also called (Hankel) singular values. If n is taken as the model order, then the optimal state at time k that is most predictive of the future outputs is given by63

$$\hat{x}(k) = J_n\, p(k) \qquad \text{eq. 3}$$

where $J_n$ contains the first n rows of the matrix $J = U^T \Sigma_{pp}^{-1/2}$. Eq. 3 can be used to find the states for the training and test datasets.
Hyperparameter selection
The mathematical background of the CVA algorithm suggests the specification of two hyperparameters: the lag order (ℓ) and the model order (n). While the model order need not be specified prior to state estimation, the lag order needs to be known beforehand. Let's look at the common approaches for estimating these hyperparameters.
62 The number of lead values used to define f(k) is commonly taken equal to the number of lags used in p(k).
63 Alternatively, $\hat{x}(k) = U_n^T \Sigma_{pp}^{-1/2}\, p(k)$, where $U_n$ contains the first n columns of U.
An alternative approach was suggested by Odiowei and Cao64, who recommend checking the autocorrelation function of the summed squares of the process variables and choosing the lag order as the maximum lag for which the autocorrelation is significant. Another convenient approach was given by Zhang et al.65, who suggest choosing the lag order as the minimum s that satisfies the following equation
$$A_{s+1} = \frac{\sqrt{\sum_{j=1}^{M} \big(\mathrm{autocorr}(x_j,\, s+1)\big)^2}}{\sqrt{\sum_{i=1}^{s}\sum_{j=1}^{M} \big(\mathrm{autocorr}(x_j,\, i)\big)^2 \,/\, s}} \;\leq\; \vartheta$$
where $\mathrm{autocorr}(x_j, i)$ is the autocorrelation coefficient of the jth variable at time lag i. The threshold ϑ is chosen between 0 and 0.5.
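A short Python sketch of this lag-selection rule (the function names and the threshold value are illustrative assumptions):

def autocorr(x, lag):
    # sample autocorrelation coefficient of a 1D series at the given lag
    x = x - np.mean(x)
    return np.dot(x[:-lag], x[lag:])/np.dot(x, x)

def select_lag_order(X, s_max=20, threshold=0.25):
    M = X.shape[1]
    for s in range(1, s_max):
        num = np.sqrt(sum(autocorr(X[:,j], s+1)**2 for j in range(M)))
        den = np.sqrt(sum(autocorr(X[:,j], i)**2 for i in range(1, s+1) for j in range(M))/s)
        if num/den <= threshold:
            return s
    return s_max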
[Plot: normalized Hankel singular values versus model order n]
Another practice is to choose n such that the normalized values of the (n+1)th onward singular values are below some threshold. However, very often there is no clear elbow, or the singular values decrease slowly. Therefore, the more extensively used criterion for model order selection is the AIC66.
64 Odiowei & Cao, Nonlinear dynamic process monitoring using canonical variate analysis and kernel density estimations, IEEE Transactions, 2009.
65 Zhang et al., Simultaneous static and dynamic analysis for fine-scale identification of process operation statuses, 2019.
66 Let $\hat{n}$ be a trial model order in some range 0 to $n_{max}$. The outputs $\hat{y}(k)$ are predicted using the fitted state-space model and the AIC is computed. The value of $\hat{n}$ that gives the minimum AIC becomes the optimal model order.
The following monitoring metrics are employed:

$$T_s^2(k) = x^T(k)\,x(k) = p^T(k)\, J_n^T J_n\, p(k)$$

$$T_e^2(k) = p^T(k)\, J_e^T J_e\, p(k)$$

$$Q(k) = r^T(k)\,r(k), \qquad r(k) = p(k) - J_n^T J_n\, p(k)$$

◼ Je contains all except the first n rows of J, i.e., the last ℓ(m + r) − n rows of J
◼ z = ℓ(m + r) − n
◼ Fα(n, N−n) is the (1−α) percentile of an F distribution with n and N−n degrees of freedom
◼ χ²α(h) is the (1−α) percentile of a chi-squared distribution with h degrees of freedom; μ denotes the mean and σ the variance of the Q metric
◼ A significance level of α = 0.05 means that there is a 5% chance that an alert is false
◼ When only output measurements are used, $p(k) = [\, y^T(k-1)\;\; y^T(k-2)\; \cdots\; y^T(k-l) \,]^T$
The flowchart below (Figure 9.4) summarizes the aforementioned steps for the development of a CVA-based process fault detection solution.
[Figure 9.4: Flowchart of CVA-based fault detection.
Offline model building (historical I/O data): using lag ℓ, form past and future vectors p(k) and f(k) for different k → center the data → compute the covariances Σpp, Σff, Σpf → perform the SVD of Eq. 2 → using model order n, get Jn and Je → compute the metrics Ts², Te², Q for each p(k) → compute the thresholds Ts,CL², Te,CL², QCL.
Online monitoring (real-time I/O data): form the past vector p(k) → center p(k) using the training means → compute the metrics Ts², Te², Q → declare a process fault if Ts² > Ts,CL² or Te² > Te,CL² or Q > QCL.]
In this section, we will learn the details of CVA-based monitoring tool development through a step-by-step application to our now familiar TEP process. The CVA monitoring procedures provided in this book mirror the description in the work67 of Russell et al., and we will therefore attempt to replicate the CVA-based fault detection results from their journal paper titled 'Fault detection in industrial processes using canonical variate analysis and dynamic principal component analysis'. Specifically, we will use the Fault 5 dataset (file d05_te.dat) as our test data and the normal operation dataset (file d00.dat) as our training data. We will use the manipulated variables as our inputs and the process measurements as our outputs. Our objective is to build a monitoring tool that can accurately flag the presence of a process fault with a low frequency of false alerts. Let's begin with a quick exploration of the dataset.
# import packages
import matplotlib.pyplot as plt, numpy as np
from sklearn.preprocessing import StandardScaler
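The data-loading and exploratory plotting steps fall on the omitted portion of the page; a minimal loading sketch, assuming the standard TEP column layout (XMEAS(1)–(41) followed by XMV(1)–(11)) and noting that in some distributions d00.dat is stored transposed:

trainingData = np.loadtxt('d00.dat').T # transpose if samples are stored column-wise
FaultyData = np.loadtxt('d05_te.dat')

# inputs: the 11 manipulated variables; outputs: the first 22 process measurements
uData_training = trainingData[:, 41:52]
yData_training = trainingData[:, 0:22]
uData_test = FaultyData[:, 41:52]
yData_test = FaultyData[:, 0:22]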
It is obvious that continuously monitoring all 22 output (and 11 input) variables is not practical. The single combined output plots do seem to clearly indicate a process upset around sample number 200 in the faulty dataset; however, they also seem to give a wrong
67 Russell et al., Data-Driven Methods for Fault Detection and Diagnosis in Chemical Processes, Springer, 2001; Russell et al., Fault detection in industrial processes using canonical variate analysis and dynamic principal component analysis, Chemometrics and Intelligent Laboratory Systems, 2000.
impression of normal process conditions from sample number 500 onwards. It is known that Fault 5 continues until the end of the dataset. Let's compute the monitoring indices for the training dataset.
#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
## generate past (p) and future (f) vectors for training dataset and center them
#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
N = trainingData.shape[0]
l = 3 # as used by Russell et. al
m = uData_training.shape[1]
r = yData_training.shape[1]
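# assemble past (p) and future (f) matrices; a sketch mirroring the test-data
# construction shown later (assumption: f(k) stacks l lead values of the
# outputs, per footnote 62)
pMatrix_train = np.zeros((N-2*l+1, l*(m+r)))
fMatrix_train = np.zeros((N-2*l+1, l*r))
for i in range(l, N-l+1):
    pMatrix_train[i-l,:] = np.hstack((yData_training[i-l:i,:].flatten(), uData_training[i-l:i,:].flatten()))
    fMatrix_train[i-l,:] = yData_training[i:i+l,:].flatten()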
# center data
p_scaler = StandardScaler(with_std=False); pMatrix_train_centered = p_scaler.fit_transform(pMatrix_train)
f_scaler = StandardScaler(with_std=False); fMatrix_train_centered = f_scaler.fit_transform(fMatrix_train)
#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
## compute covariances and perform SVD
#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
import scipy
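# the covariance and SVD computations are omitted by the page layout; a sketch:
import scipy.linalg
Np = pMatrix_train_centered.shape[0]
sigma_pp = pMatrix_train_centered.T @ pMatrix_train_centered/(Np-1)
sigma_ff = fMatrix_train_centered.T @ fMatrix_train_centered/(Np-1)
sigma_pf = pMatrix_train_centered.T @ fMatrix_train_centered/(Np-1)

sigma_pp_inv_sqrt = np.linalg.inv(np.real(scipy.linalg.sqrtm(sigma_pp)))
sigma_ff_inv_sqrt = np.linalg.inv(np.real(scipy.linalg.sqrtm(sigma_ff)))
U, S, VT = np.linalg.svd(sigma_pp_inv_sqrt @ sigma_pf @ sigma_ff_inv_sqrt) # Eq. 2

n = 25 # model order (an illustrative value; select via AIC in practice)
J = U.T @ sigma_pp_inv_sqrt
Jn, Je = J[:n,:], J[n:,:]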
#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
## get fault detection metrics for training data
#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Xn_train = np.dot(Jn, pMatrix_train_centered.T)
Ts2_train = [np.dot(Xn_train[:,i], Xn_train[:,i]) for i in range(pMatrix_train_centered.shape[0])]
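The remaining training metrics and control limits follow analogously (a sketch using empirical 99th-percentile limits, as done earlier for DPCA; the theoretical F- and chi-squared-based limits listed above may be used instead):

Xe_train = np.dot(Je, pMatrix_train_centered.T)
Te2_train = [np.dot(Xe_train[:,i], Xe_train[:,i]) for i in range(Xe_train.shape[1])]

r_train = pMatrix_train_centered.T - np.dot(Jn.T, Xn_train) # residual vectors
Q_train = [np.dot(r_train[:,i], r_train[:,i]) for i in range(r_train.shape[1])]

Ts2_CL = np.percentile(Ts2_train, 99)
Te2_CL = np.percentile(Te2_train, 99)
Q_CL = np.percentile(Q_train, 99)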
The $T_s^2$ metric measures the systematic variations 'inside' the state space defined by the state variables. Correspondingly, $T_e^2$ quantifies the variations 'outside' the state space. When only $T_s^2$ violates the threshold, it indicates that the state variables are 'out-of-control' but the process can still be well explained by the estimated state-space model. When $T_e^2$ or $Q$ violates the threshold, it indicates that the characteristics of the noise affecting the system have changed or that the estimated model cannot explain the new faulty observations.
#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
## generate past vectors (p(k)) for test dataset and center
#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Ntest = FaultyData.shape[0]
pMatrix_test = np.zeros((Ntest-l+1, l*(m+r)))
for i in range(l, Ntest+1):
    pMatrix_test[i-l,:] = np.hstack((yData_test[i-l:i,:].flatten(), uData_test[i-l:i,:].flatten()))

pMatrix_test_centered = p_scaler.transform(pMatrix_test)
#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
## get fault detection metrics for test data
#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Xn_test = np.dot(Jn, pMatrix_test_centered.T)
Ts2_test = [np.dot(Xn_test[:,i], Xn_test[:,i]) for i in range(pMatrix_test_centered.shape[0])]
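The $T_e^2$ and Q metrics for the test data follow the same pattern (sketch):

Xe_test = np.dot(Je, pMatrix_test_centered.T)
Te2_test = [np.dot(Xe_test[:,i], Xe_test[:,i]) for i in range(Xe_test.shape[1])]
r_test = pMatrix_test_centered.T - np.dot(Jn.T, Xn_test)
Q_test = [np.dot(r_test[:,i], r_test[:,i]) for i in range(r_test.shape[1])]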
The control charts for the test data show successful detection of the process fault. The $T_e^2$ metric is known to be overly sensitive, with a high incidence of false alerts, which can explain why it seems to breach the threshold even before the onset of the actual fault.
Summary
In this chapter, we added more ML tools to our arsenal for handling process systems that show significant dynamics. We looked at two popular approaches: the first entailed a simple extension of conventional MSPM tools through the inclusion of lagged measurements, and the second entailed the use of state-space (CVA) models. While the dynamic MSPM methods are easier to understand and implement, CVA model-based fault detection has been found to provide superior results. Next, we will move on to learn how to deal with nonlinear processes.
Chapter 10
Multivariate Statistical Process Monitoring for
Nonlinear Processes
In the previous chapters, we saw how a simple trick of using time-lagged variables enabled the application of conventional MSPM techniques to dynamic processes. You may wonder if anything similar exists for nonlinear processes. Fortunately, it does! The underlying principle is to project the original variables onto a high-dimensional feature space where the features are linearly related. The challenging part is the determination of the nonlinear mapping from the original measurement space to the feature space. This is where the 'kernel' trick comes into the picture, wherein data gets projected without the need to explicitly define the nonlinear mapping. Conventional MSPM is then employed in the feature space. Sounds complicated? Once you are done with this chapter, you will realize that it's much easier than it may seem right now.
The main advantage of kernel-based MSPM techniques (KPCA, KPLS, KICA, KFDA, etc.) is
that they do not require nonlinear optimization and only linear algebra is involved.
Unsurprisingly, kernelized methods have become very attractive for dealing with nonlinear
datasets while retaining the simplicity of their linear counterparts. Among the kernel MSPM
techniques, kernel PCA and kernel PLS are the most widely adopted, have found
considerable successes in process monitoring applications, and therefore will be the focus of
our study in this chapter. Specifically, the following topics are covered:
Kernel PCA is the nonlinear extension of conventional PCA, suitable for handling processes that exhibit significant nonlinearity. To understand the motivation behind KPCA, consider the simple scenarios illustrated in Figure 10.1. In Figure 10.1a, the data lie along a line which can be obtained from the first eigenvector of linear PCA. In Figure 10.1b, the data lie along a curve; conventional PCA cannot find this nonlinear curve. Correspondingly, PCA fails to detect the obvious outlier. However, all is not lost for the latter scenario. Instead of working in the (x1, x2) measurement space, if we work in the $(z_1, z_2) = (x_1^4, x_2)$ feature space, then we end up with linearly related features and the abnormal data point can be flagged as such. Unfortunately, the task of finding such a (nonlinear) mapping from raw data to feature variables is not trivial. Thankfully, there is something called the 'kernel trick' that allows you to work in the feature space without having to define the nonlinear mapping. We will learn how this is accomplished in the next section.
Figure 10.1: Impact of nonlinearity on PCA-based fault detection, with the faulty sample shown in red: (a) linear data, (b) nonlinear data. [One principal component chosen in all simulations]
KPCA can work with arbitrary data distributions. As far as process monitoring applications are
concerned, KPCA can help you create an abnormality boundary around your NOC data as
shown below.
[Illustration: a nonlinear NOC boundary enclosing the NOC samples]
To understand how KPCA works, let’s revisit the mathematical underpinnings of PCA.
Usage of kernel trick is not limited to KPCA and KPLS. Other ML techniques such as
SVM, CVA, etc., also use kernel functions to model nonlinear processes. So, what are
these kernel functions? Let’s try to understand them.
However, the map φ(·) is unknown. Thankfully, in the mathematical formulation of many ML algorithms, the inner (or dot) product of feature vectors, $\varphi(x_i)^T \varphi(x_j)$, is frequently encountered. This inner product is denoted as

$$k(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$$

where k(·,·) is called the kernel function. Several forms of k(·,·) are available; the most common is the Gaussian or radial basis function (RBF) kernel, defined as

$$k(x_i, x_j) = \exp\left[-\frac{(x_i - x_j)^T (x_i - x_j)}{\sigma^2}\right]$$
Let’s use the polynomial kernel to illustrate how using kernel functions amounts to higher
dimensional mapping. Assume that we use the following kernel
2
𝑘ሺ𝑥, 𝑧ሻ = ሺ𝑥𝑇 𝑧 + 1ሻ
where 𝑥 = [𝑥1 , 𝑥2 ]𝑇 and 𝑧 = [𝑧1 , 𝑧2 ]𝑇 are two vectors in the original 2D space. We
claim that the above kernel is equivalent to the following mapping
Therefore, if you use the above polynomial kernel, you are implicitly projecting your data
onto a 6th dimensional feature space! If you were amazed by this illustration, you will find
it more interesting to know that Gaussian kernels map original space into an infinite
dimensional feature space! Luckily, we don’t need to know the form of this feature space.
Mathematical background
The main objective in KPCA is to obtain the score vectors in the feature space. To see how it
can be obtained, let 𝑥 ∈ ℝ𝑚 and 𝜑ሺ𝑥ሻ denote a sample in the original measurement space
and the projected feature vector in the high-dimensional feature space, respectively. We know
that the conventional PCA involves the solution of the following eigenvalue problem
$$\frac{1}{N-1} X^T X\, p = \lambda\, p$$

where $X \in \mathbb{R}^{N \times m}$ is the training data matrix, $\frac{1}{N-1}X^TX$ is the covariance matrix, λ is an eigenvalue, and p the corresponding eigenvector.
Let 𝑋𝐹 = [𝜑ሺ𝑥1 ሻ, 𝜑ሺ𝑥2 ሻ, ⋯ , 𝜑ሺ𝑥𝑁 ሻ]𝑇 denote the feature data matrix. We assume for now that
𝑋𝐹 is mean centered. PCA in the feature space, therefore, involves the following eigen-
decomposition problem
$$\frac{1}{N-1} X_F^T X_F\, \nu = \lambda_F\, \nu \qquad \text{eq. 1}$$

where ν is the eigenvector in feature space.
$X_F$ is not known and, therefore, the above equation cannot be solved directly. Let's pre-multiply both sides of Eq. 1 by $X_F$, which results in

$$\frac{1}{N-1} X_F X_F^T X_F\, \nu = \lambda_F\, X_F \nu \qquad \text{eq. 2}$$
Linear algebra shows that the eigenvector68 can be written as a linear combination of the samples, i.e.,

$$\nu = \sum_{i=1}^{N} \alpha_i\, \varphi(x_i) = X_F^T \alpha \qquad \text{eq. 3}$$

Let us also define the kernel matrix $K = X_F X_F^T$. Substituting the kernel matrix and ν from Eq. 3 into Eq. 2 leads to

$$K\alpha = (N-1)\lambda_F\, \alpha \qquad \text{eq. 4}$$
Eq. 4 is an eigen-decomposition problem which can be solved for the kernel eigenvectors $\alpha^{(1)}, \alpha^{(2)}, \cdots, \alpha^{(N)}$ and kernel eigenvalues69 $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_N$. The lth eigenvector is $\alpha^{(l)} \equiv [\alpha_1^{(l)}, \alpha_2^{(l)}, \cdots, \alpha_N^{(l)}] \in \mathbb{R}^N$. One may retain only the first a eigenvectors for dimensionality reduction in the feature space. The corresponding eigenvector in feature space becomes $\nu^{(l)} = X_F^T \alpha^{(l)}$. To ensure that $\nu^{(l)}$ is of unit length, the following condition is imposed on $\alpha^{(l)}$:

$$\|\nu^{(l)}\|^2 = \nu^{(l)T} \nu^{(l)} = \alpha^{(l)T} X_F X_F^T \alpha^{(l)} = 1 \;\Rightarrow\; (N-1)\lambda_l^F \|\alpha^{(l)}\|^2 = 1 \;\Rightarrow\; \|\alpha^{(l)}\|^2 = \frac{1}{(N-1)\lambda_l^F} \qquad \text{eq. 5}$$
68 Note that with Gaussian kernels, φ(x) and ν have infinite dimensions.
69 Kernel eigenvalue $\lambda_l$ can be divided by N−1 to obtain the eigenvalue $\lambda_l^F$.
The above expression implies that the kernel eigenvectors $\alpha^{(l)}$ obtained from Eq. 4 corresponding to non-zero eigenvalues are scaled to satisfy Eq. 5. Note that the eigenvectors $\nu^{(l)}$ are still unobtainable as $X_F$ is unknown. However, our main variable of interest is the score vector $t_j \in \mathbb{R}^a$ corresponding to sample $x_j$. Before we see how $t_j$ can be obtained, we need to come clean on a big assumption we made regarding the data in feature space, $\varphi(x_1), \varphi(x_2), \cdots, \varphi(x_N)$, being mean-centered. To ensure mean-centering, the kernel matrix K is modified as follows

$$\bar{K} = K - 1_N K - K 1_N + 1_N K 1_N \qquad \text{eq. 6}$$

where $1_N = \frac{1}{N}\begin{bmatrix} 1 & \cdots & 1 \\ \vdots & \ddots & \vdots \\ 1 & \cdots & 1 \end{bmatrix}_{N \times N}$; $\bar{K}$ can be shown to equal $\bar{X}_F \bar{X}_F^T$.

The score of sample $x_j$ along the lth direction is then

$$t_{jl} = \langle \nu^{(l)}, \bar{\varphi}(x_j) \rangle = \sum_{i=1}^{N} \alpha_i^{(l)}\, \bar{k}(x_i, x_j) = \sum_{i=1}^{N} \alpha_i^{(l)}\, \bar{K}_{ji} = \alpha^{(l)T} \bar{k}_j \qquad \text{eq. 7}$$
For a new test sample $x_t$, the score along the lth direction is computed as

$$t_{tl} = \alpha^{(l)T}\, \bar{k}_t \qquad \text{eq. 8}$$

where $k_t$ is the kernel vector with $k_{ti} = k(x_t, x_i)$, $x_i$ belonging to the training dataset, $i = 1, 2, \cdots, N$; $\bar{k}_t$ is the centered version of $k_t$, obtained analogously to Eq. 6 using $1_t = \frac{1}{N}[1 \;\cdots\; 1]^T \in \mathbb{R}^{N \times 1}$.
Hyperparameter specification
Assuming that the Gaussian kernel is chosen, you will need to specify the kernel width σ and the number of retained latents (a).
It is apparent that a very small σ yields overfitted models leading to a high rate of false alarms, while a large σ results in underfitted models because the abnormality contours are not able to accurately capture the data distribution. The kernel width may be selected empirically or via cross-validation. Empirically, σ is often set to 5m, where m is the number of process variables70. In the cross-validation approach, σ is set to the smallest value that results in an acceptable level of false alarms on a validation dataset. Tan et al. suggested the range [0, $d_{train,max}$]71 for searching the optimal σ, where $d_{train,max} = \max \|x_i - x_j\|$ for $x_i, x_j \in$ training dataset.
For the number of retained latents, a common rule is to choose the smallest a that captures a specified fraction (say 95%) of the total variance,

$$a = \min \left\{ j \in [1, N] : \sum_{i=1}^{j}\lambda_i \Big/ \sum_{i=1}^{N}\lambda_i \geq 0.95 \right\}$$

Another popular approach is the average eigenvalue method, where only eigenvectors with eigenvalues above the average of all eigenvalues are retained.
Phew! That was a lot of mathematics in the previous section. Our aim was not to scare you away with this level of detail, but to provide you with enough background so that you can carry out each step of KPCA-based fault detection confidently. In fact, Sklearn provides a KernelPCA class that does the job of computing the scores for you; all you need to do is compute the monitoring statistic. To further clarify all the steps, we will use the following synthetic process for demonstration. It is a system with 3 measurements, x1, x2, and x3:

$$x_1 = u^2 + 0.3\sin(2\pi u) + \varepsilon_1$$
$$x_2 = u + \varepsilon_2$$
$$x_3 = u^3 + u + 1 + \varepsilon_3$$
70 Cho et al., Fault identification for process monitoring using kernel principal component analysis, Chemical Engineering Science, 2005.
71 Tan et al., Monitoring statistics and tuning of kernel principal component analysis with radial basis function kernels, IEEE Access, 2020.
Here, u is a variable uniformly sampled between −1 and 1, and $\varepsilon_i$ is independent white noise with variance 0.05. This system has been used by Nounou et al. to demonstrate the superiority of kernelized MSPM models in an open-access paper titled 'Process monitoring using data-based fault detection techniques: comparative studies (2017)'. The system was simulated to generate 401 and 200 samples of NOC and test data (provided in the files KPCA_NOC_data.txt and KPCA_test_data.txt, respectively). In the test data, a fault of unit magnitude is introduced in x1 between samples 75 and 125. Let's begin by reading the data files.
# read data
import numpy as np
X_train = np.loadtxt('KPCA_NOC_data.txt')
X_test = np.loadtxt('KPCA_test_data.txt')
We will now go through each step of generating the monitoring chart for the test data. Therefore, let's first understand how the monitoring statistic is computed.
Monitoring statistic
The primary statistic to monitor is Q, which for a sample xt is defined as
$$Q = \|\bar{\varphi}(x_t) - \hat{\bar{\varphi}}_a(x_t)\|^2$$

where $\hat{\bar{\varphi}}_a(x_t)$ is the feature vector reconstructed using a latents. The above expression can be written in terms of the score vector $t_t$ as follows72

$$Q = \sum_{i=1}^{n} t_{ti}^2 - \sum_{i=1}^{a} t_{ti}^2$$

where n is the number of components with non-zero eigenvalues.
Note that Tan et al. have shown that the KPCA Q metric can help detect both fault cases where variables go out of the healthy operating range and cases where test samples do not conform to the model of the training data.
Let us now implement the KPCA model and generate monitoring charts.
# scale data
X_scaler = StandardScaler()
X_train_scaled = X_scaler.fit_transform(X_train)
72 Zhou et al., Randomized kernel principal component analysis for modeling and monitoring of nonlinear industrial processes with massive data, Industrial & Engineering Chemistry Research, 2019.
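The instantiation of the KPCA model falls on the omitted portion of the page; a sketch (the number of components and the kernel width are illustrative assumptions; Sklearn's gamma equals 1/σ² for the kernel form used above):

from sklearn.decomposition import KernelPCA
m = X_train.shape[1]
sigma = 5*m # empirical rule of thumb discussed earlier
kpca = KernelPCA(n_components=3, kernel='rbf', gamma=1/sigma**2)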
kpca.fit(X_train_scaled)
#%%%%%%%%%%%%%%%%%%%%%%%%%
## kpca model application on test data
#%%%%%%%%%%%%%%%%%%%%%%%%%
X_test_scaled = X_scaler.transform(X_test)
scores_test = kpca.transform(X_test_scaled)
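The monitoring statistic can then be computed from the scores; a sketch using the score-based Q expression above (a second KPCA model retaining all non-zero components provides the full score vector):

kpca_full = KernelPCA(kernel='rbf', gamma=1/sigma**2).fit(X_train_scaled)
scores_test_full = kpca_full.transform(X_test_scaled)
Q_test = np.sum(scores_test_full**2, axis=1) - np.sum(scores_test**2, axis=1)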
Hopefully, you now feel confident about being able to deploy a KPCA-based fault detection
system for nonlinear systems.
Kernel PLS is the nonlinear extension of conventional PLS for handling process data that exhibit significant nonlinearity. As shown in Figure 10.2 below, the input vectors are mapped to an 'unknown' feature space such that the variables become linearly related. We know from Chapter 7 that the NIPALS algorithm used to fit a PLS model involves the term $XX^T$; therefore, as in KPCA, kernel functions are used to implicitly define $\varphi\varphi^T$ in the feature space, obtain the score vectors, and thereafter generate the predicted outputs. The overall procedure relies on classic linear algebra and retains most of the favorable properties of standard PLS.
[Figure 10.2: KPLS schematic. The input matrix $X = [x_1 \cdots x_N]^T \in \mathbb{R}^{N \times m}$ is mapped via a Gaussian kernel-based mapping φ(·) to the feature matrix $[\varphi(x_1) \cdots \varphi(x_N)]^T \in \mathbb{R}^{N \times \infty}$; conventional PLS between the feature matrix and the output matrix $Y = [y_1 \cdots y_N]^T \in \mathbb{R}^{N \times p}$ yields the KPLS model.]
Sklearn, unfortunately, does not provide any off-the-shelf module for KPLS. Nonetheless, it is not tough to code the steps of KPLS yourself. The mathematical exposition in the following subsection provides all the information you need to write your own KPLS module.
As an aside, nonlinear variants of PLS can also be obtained by modeling the inner relation between the score vectors nonlinearly, $u \approx f(t)$: if f(·) is a polynomial, we get polynomial PLS; if f(·) is a neural network, we get neural-net PLS.
Mathematical background
Let $X \in \mathbb{R}^{N \times m}$ and $Y \in \mathbb{R}^{N \times p}$ be the centered input and output data matrices of the training samples. Let $\bar{\varphi}(x_i)$ denote the mapped and centered feature vector of the ith sample (or row) of X, and let $X_F \in \mathbb{R}^{N \times \infty}$ denote the input data matrix in feature space. The kernel matrix is computed and centered as done in the KPCA algorithm. The algorithm below summarizes KPLS model training via NIPALS, which provides the input and output score matrices T and U.
1) Initialize $i = 1$, $\bar{K}_1 = \bar{K}$, $Y_1 = Y$
3) Compute the score vector $t_i\, (\in \mathbb{R}^{N \times 1}) = \bar{X}_{F_i} w_i = \bar{X}_{F_i} \bar{X}_{F_i}^T u_i = \bar{K}_i u_i$
9) Deflate the matrices:
$$\bar{K}_{i+1} = \bar{K}_i - t_i t_i^T \bar{K}_i - \bar{K}_i t_i t_i^T + t_i t_i^T \bar{K}_i t_i t_i^T, \qquad Y_{i+1} = Y_i - t_i t_i^T Y_i$$
10) Set $i = i + 1$ and return to step 2. Stop when $i > a$ [the number of latents to be retained].
In matrix form, the input score matrix can be written as

$$T = \bar{X}_F R, \quad \text{where } R = \bar{X}_F^T U (T^T \bar{K} U)^{-1} \in \mathbb{R}^{\infty \times a} \;\Rightarrow\; T = \bar{K} U (T^T \bar{K} U)^{-1}$$
The R matrix will help us compute the score values for a test sample as we will see shortly.
The last missing piece of information is the regression coefficient matrix which can be used
to generate the output predictions
$$\hat{Y} = \bar{X}_F B, \quad \text{where } B = \bar{X}_F^T U (T^T \bar{K} U)^{-1} T^T Y \;\Rightarrow\; \hat{Y} = \bar{K} U (T^T \bar{K} U)^{-1} T^T Y$$
For a test sample $x_t$, the score vector is obtained as

$$t_t = R^T \bar{\varphi}(x_t) = (U^T \bar{K} T)^{-1} U^T \bar{X}_F\, \bar{\varphi}(x_t) = (U^T \bar{K} T)^{-1} U^T \bar{k}_t$$

where $\bar{k}_t \in \mathbb{R}^{N \times 1}$ is the centered kernel vector with $k_{ti} = k(x_t, x_i)$, $x_i$ belonging to the training dataset, $i = 1, 2, \cdots, N$, centered as in Eq. 8.
The corresponding output prediction is

$$\hat{y} = B^T \bar{\varphi}(x_t) = Y^T T (U^T \bar{K} T)^{-1} U^T \bar{X}_F\, \bar{\varphi}(x_t) = Y^T T (U^T \bar{K} T)^{-1} U^T \bar{k}_t$$
Hyperparameter specification
Like KPCA, KPLS involves the selection of two critical hyperparameters: the Gaussian kernel width (σ) and the number of latents retained (a). The number of latents is commonly determined via cross-validation, where we choose the a that minimizes the prediction error on a validation dataset. For σ, Jose et al. showed73 that setting σ = 2m provides robust fault detection performance. Alternatively, σ can be chosen such that we obtain an acceptable level of false alerts on a validation dataset of NOC samples.
73 Jose et al., New contributions to non-linear process monitoring through kernel partial least squares, Chemometrics and Intelligent Laboratory Systems, 2014.
Monitoring statistics
Like KPCA, we will use the Q (or SPE) statistics for fault detection. $SPE_\varphi$ and $SPE_y$ monitor the residuals in the feature space and the output space, respectively. $SPE_\varphi$ is computed as follows for a sample $x_t$ (Jose et al.)

$$SPE_\varphi = \|\bar{\varphi}(x_t) - \hat{\bar{\varphi}}_a(x_t)\|^2 = \bar{k}(x_t, x_t) - 2\,\bar{k}_t^T \bar{K} V t_t + t_t^T T^T \bar{K} T\, t_t$$

where $V = U(T^T \bar{K} U)^{-1}$ and

$$\bar{k}(x_t, x_t) = 1 - \frac{2}{N}\sum_{i=1}^{N} k(x_i, x_t) + \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} K_{ij}$$

The output-space statistic is simply

$$SPE_y = \|y - \hat{y}\|^2$$
The flowchart below summarizes all the steps of the KPLS-based fault detection solution
development.
74 Peng et al., Quality-related process monitoring based on total kernel PLS model and its industrial application, Mathematical Problems in Engineering, 2013.
[Flowchart of KPLS-based fault detection.
Offline model building (historical I/O data): form the input and output data matrices X and Y → scale the data → using kernel width σ, compute the kernel matrix K and generate the mean-centered matrix K̄ → using a latents, get the score matrices T and U via NIPALS and compute the response predictions Ŷ → compute the prediction errors → compute the metrics SPE_φ and SPE_y for each sample → compute the thresholds.
Online monitoring (real-time I/O data): form the input and output vectors x and y → scale using the training scaling parameters → compute the kernel vector k_t and generate the mean-centered vector k̄_t → compute the prediction errors → compute the metrics SPE_φ and SPE_y → declare a process fault if a threshold is violated.]
The kernel trick acts as a beautiful device that allows the KPCA/KPLS scores to relate nonlinearly to the original input data while retaining the simplicity of linear algebra-based PCA/PLS. However, before we fall in love with kernelized MSPM techniques, remember that there are a few drawbacks as well! If the number of samples, N, is very large, then the eigen-decomposition of the N × N kernel matrix can become computationally challenging. Also, judicious selection of the hyperparameter σ is critical, else a poor model is obtained.
Summary
In this chapter, we acquainted ourselves with the procedures for obtaining nonlinear versions of classical MSPM techniques using kernels. Specifically, we looked at the kernel PCA and kernel PLS algorithms. These techniques can help you build robust monitoring solutions for process systems that show significant nonlinearities. In the next chapter, we will learn how to deal with multimode processes whose data form multiple clusters.
Chapter 11
Process Monitoring of Multi-Mode Processes
In the previous chapters, we witnessed the benefits of customizing the conventional MSPM techniques for non-Gaussian, dynamic, and nonlinear processes. In this chapter, we will remove the last remaining restriction of unimodal operation. In your career, you will frequently encounter industrial datasets that exhibit multiple operating modes due to variations in production levels, feedstock compositions, ambient temperature, product grades, etc., and data-points from different modes tend to group into different clusters. The mean and covariance of process variables may differ across operating modes; therefore, when building a monitoring tool, judicious incorporation of the knowledge of these data clusters into process models leads to better performance, while failure to do so often leads to unsatisfactory monitoring performance.
In absence of specific process knowledge or when the number of variables is large, it is not
trivial to find the number of clusters or to characterize the clusters. Fortunately, several
methodologies are available which you can choose from for your specific solution. In this
chapter, we will learn different ways of working with multimodal data, some of the popular
clustering algorithms, and understand their strengths and weaknesses. We will conclude by
building a monitoring tool for a multimode semiconductor process. Specifically, the following
topics are covered:
In process systems, multimode operations occur naturally for varied reasons. For example, in a power generation plant, the production level changes according to demand, leading to significantly different values of plant variables, with potentially different inter-variable correlations, at different production levels. The multimode nature of the data distribution causes problems for traditional ML techniques. To understand this, consider the illustrations in Figure 11.1. In subfigure (a), the data indicate 2 distinct modes of operation. From a process monitoring perspective, it would make sense to draw separate monitoring boundaries around the two clusters; doing so would clearly identify the red-colored data-point as an outlier or a fault. Conventional PCA-based monitoring, on the other hand, would fail to identify the outlier. In subfigure (b), the correlation between the variables is different in the two clusters. It would make sense to build separate models for the two clusters to capture the different correlation structures; a conventional PLS model would give inaccurate results.
Figure 11.1: Illustrative scenarios for which conventional ML techniques are ill-suited
A few different methodologies have been adopted by the PSE community to monitor
multimode process operations. Let’s familiarize ourselves quickly with these.
External analysis
In this strategy, the influence of the process variables (called external variables), such as product grade, feed flow, etc., that lead to multimode operation is removed from the other 'main' process variables, and the conventional MSPM techniques are then employed on the ensuing residuals, as shown below.

[Schematic: the external variables drive a process model that predicts the main variables; the residuals between the measured and predicted main variables are then monitored.]
Proximity-based approach
In this approach, the presence of multiple NOC clusters is taken into account by making the inference for a test sample based on the sample's proximity to the NOC samples. Proximity measures such as distance from neighbors or local density are commonly employed. We will cover these techniques in Chapter 14.
In this chapter, we will focus on the multi-modal approach, which involves finding the different clusters in the dataset. Therefore, let's now acquaint ourselves with some popular clustering techniques. However, before we do that, let's take a quick look at the dataset that we will work with.
In this chapter, we will work with a dataset from a semiconductor manufacturing process75. The dataset was obtained from multiple batches of an etching process and consists of 19 process variables measured over the course of 108 normal batches and 21 faulty batches. The batch durations range from 95 to 112 seconds. In the rest of the chapter, we will investigate whether the dataset exhibits multimode operation and devise a monitoring strategy to automatically detect the faulty batches. The data is provided in a MATLAB structure array format, so we will use a library to fetch the data into the Python environment.
# fetch data
import numpy as np
import scipy.io

matlab_data = scipy.io.loadmat('MACHINE_Data.mat', struct_as_record=False)
Etch_data = matlab_data['LAMDATA']
# calibration_dataAll[i,0] holds the 2D data of the ith batch; columns correspond to different variables
calibration_dataAll = Etch_data[0,0].calibration
variable_names = Etch_data[0,0].variables
Figure 11.5: Select variable plots for all batches in metal etch dataset. Each colored curve
corresponds to a batch.
75 Data can be downloaded from http://www.eigenvector.com/data/Etch.
Figure 11.5 does indicate multimode operation with changes in mean and covariance. It is, however, difficult to estimate the number of operating modes by examining a high-dimensional dataset directly. A popular practice is to reduce the process dimensionality via PCA and then apply clustering to facilitate visualization. Performing PCA serves other purposes as well. We will see later that the expectation-maximization (EM) algorithm is employed to estimate cluster parameters in K-means and GMM models. High dimensionality implies a high number of parameters to be estimated, which increases the possibility of EM converging to a local optimum, and correlated variables cause EM convergence issues. PCA helps to overcome both problems simultaneously.
We will employ multiway PCA for this batch process dataset. We will follow the approach of He et al.76, where for each batch 85 sample points are retained to deal with batch length variability, the first 5 samples are ignored to eliminate initial fluctuations in sensor measurements, and 3 PCs are retained.
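The unfolding step itself is omitted by the page layout; a sketch that mirrors the test-data unfolding shown later in this chapter (the flattening order is an assumption):

n_samples = 85 # samples retained per batch
n_vars = 19 # process variables (the first 2 columns of each batch hold auxiliary data)
unfolded_dataMatrix = np.empty((1, n_vars*n_samples))
for expt in range(calibration_dataAll.size):
    calibration_expt = calibration_dataAll[expt,0][5:90, 2:] # ignore first 5 samples
    if calibration_expt.shape[0] < n_samples:
        continue # skip batches shorter than 90 samples
    unfolded_dataMatrix = np.vstack((unfolded_dataMatrix, calibration_expt.flatten()))
unfolded_dataMatrix = unfolded_dataMatrix[1:,:] # drop the placeholder first row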
# scale data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_train_normal = scaler.fit_transform(unfolded_dataMatrix)
# PCA
from sklearn.decomposition import PCA
pca = PCA(n_components = 3)
score_train = pca.fit_transform(data_train_normal)
76 He and Wang, Fault detection using the k-nearest neighbor rule for semiconductor manufacturing processes, IEEE Transactions on Semiconductor Manufacturing, 2007.
Figure 11.6: Score plot of PC1 and PC2 for calibration batches in metal etch dataset
Figure 11.6 confirms the existence of 3 operating modes. While visual inspection of score plots can help decide the number of clusters, we will nonetheless learn ways to estimate this in a more automated fashion.
K-means is one of the most popular clustering algorithms due to its simple concept, ease of implementation, and computational efficiency. Let K denote the number of clusters and $\{x_i\},\; i = 1, \ldots, N$ be the set of N m-dimensional points. The cluster assignment of the data points is determined such that the following sum of squared errors (SSE), also called cluster inertia, is minimized

$$SSE = \sum_{k=1}^{K} \sum_{x_i \in\, k\text{th cluster}} \|x_i - \mu_k\|_2^2 \qquad \text{eq. 1}$$

Here, $\mu_k$ is the centroid of the kth cluster and $\|x_i - \mu_k\|_2^2$ denotes the squared Euclidean distance of $x_i$ from $\mu_k$. To solve Eq. 1, k-means adopts an intuitive iterative procedure, summarized below.
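The standard (Lloyd's) iteration proceeds as follows:
1. Initialize the K cluster centroids (e.g., by randomly selecting K data-points).
2. Assign each data-point to the cluster with the nearest centroid.
3. Recompute each centroid as the mean of its assigned data-points.
4. Repeat steps 2 and 3 until the cluster assignments no longer change.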
from sklearn.cluster import KMeans

n_cluster = 3
kmeans = KMeans(n_clusters=n_cluster, random_state=100).fit(score_train)
cluster_label = kmeans.predict(score_train) # can also use kmeans.labels_
plt.figure()
plt.scatter(score_train[:, 0], score_train[:, 1], c = cluster_label, s = 20, cmap = 'viridis')
cluster_centers = kmeans.cluster_centers_
cluster_plot_labels = ['Cluster ' + str(i+1) for i in range(n_cluster)]
for i in range(n_cluster):
    plt.scatter(cluster_centers[i,0], cluster_centers[i,1], c='red', marker='*')
    plt.annotate(cluster_plot_labels[i], (cluster_centers[i,0], cluster_centers[i,1]))
As expected, Figure 11.7 shows that k-means does a good job at cluster assignment. K-means clustering results are strongly influenced by the initial selection of cluster centers; a bad selection can result in improper clustering. To overcome this, the k-means implementation provides a parameter, n_init (default value 10), which determines the number of times independent k-means clustering is performed with different initial centroid assignments; the clustering with the lowest SSE is selected as the final model. The strategy for selecting the initial centroids can also be changed via the init parameter; the default k-means++ option adopts a smarter (compared to the random option) way to speed up convergence by ensuring that the initial centroids are far away from each other.
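The loop computing the inertias for the elbow plot below is omitted by the page layout; a sketch:

SSEs = []
for n_cluster in range(1, 10):
    kmeans = KMeans(n_clusters=n_cluster, random_state=100).fit(score_train)
    SSEs.append(kmeans.inertia_) # SSE of the fitted clustering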
plt.figure()
plt.plot(range(1,10), SSEs, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('SSEs')
Figure 11.8: Cluster inertias for different number of clusters for metal-etch dataset
The silhouette coefficient (or value) of a data-point ranges from −1 to 1 and measures how far the data-point is from the data-points in the neighboring cluster compared to the data-points in its own cluster. A value close to 1 indicates that the data-point is far away from the neighboring cluster, and values close to 0 indicate that the data-point is close to the boundary between the two clusters. Negative values indicate a wrong cluster assignment.
Figure 11.9 shows the silhouette plot for the clustering shown in Figure 11.7. Each of the colored bands is formed by stacking the silhouette coefficients of all data-points in that cluster; the thickness of a band is therefore an indication of the cluster size. The overall silhouette score is simply the average of the silhouette coefficients of all the data-points. As expected, the average score is high, and cluster 2 shows the highest coefficients as it is far away from the other two clusters.
# silhouette plot
from sklearn.metrics import silhouette_samples
from matplotlib import cm

plt.figure()
silhouette_values = silhouette_samples(score_train, cluster_label)
y_lower, y_upper = 0, 0
yticks = []
for i in range(n_cluster):
    cluster_silhouette_vals = silhouette_values[cluster_label == i]
    cluster_silhouette_vals.sort()
    y_upper += len(cluster_silhouette_vals)
    color = cm.nipy_spectral(i / n_cluster)
    plt.barh(range(y_lower, y_upper), cluster_silhouette_vals, height=1.0, edgecolor='none', color=color)
    yticks.append((y_lower + y_upper) / 2)
    y_lower += len(cluster_silhouette_vals)
Figure 11.9: Silhouette plot for metal-etch data with 3 clusters determined via k-means. Red
dashed line denotes the average silhouette value.
For comparison, let’s look at a silhouette plot for a sub-optimal clustering in Figure 11.10.
Lower sample-wise coefficients and lower overall score clearly indicate worse clustering.
from sklearn.datasets import make_blobs

n_samples = 1500
X, y = make_blobs(n_samples=n_samples, random_state=100)
# apply a linear transformation to elongate the clusters (the transformation
# matrix below is an illustrative assumption; the original values are omitted here)
transformation = [[0.6, -0.64], [-0.41, 0.85]]
X_transformed = np.dot(X, transformation)

plt.figure()
plt.scatter(X_transformed[:,0], X_transformed[:,1])
With the clusters determined, we can go ahead and build separate models for each cluster
and then monitor an incoming test sample using the cluster whose centroid is closest.
However, let us move quickly to the next clustering algorithm that naturally lends itself to
creation of fused decision metrics for fault detection.
The Gaussian mixture model (GMM) is a clustering technique that works under the assumption that, even when the overall process data do not follow a (unimodal) Gaussian distribution, it may still be appropriate to characterize the data from each individual operating mode/cluster through a local Gaussian distribution. GMM finds the centroids and covariances of the local clusters automatically once the number of clusters has been specified. As shown in Figure 11.12, it works very well for multimodal data distributions.
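The model-fitting step is omitted by the page layout; a minimal sketch using Sklearn's GaussianMixture (the choice of 3 components matches the figure):

from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=3, random_state=100).fit(X_transformed)
cluster_label = gmm.predict(X_transformed)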
plt.figure()
plt.scatter(X_transformed[:, 0], X_transformed[:, 1], c = cluster_label, s=20, cmap='viridis')
Another big advantage of GMMs is that we can compute the (posterior) probability of a data-point belonging to any cluster. This cluster membership measure is provided by the predict_proba method. Hard clustering is performed by assigning the data-point to the cluster with the highest probability. Let's compute the probabilities for a data-point that lies between clusters 3 and 2 (encircled in Figure 11.12).
# membership probabilities
probs = gmm.predict_proba(X_transformed[1069, np.newaxis]) # requires 2D array
print('Posterior probabilities of clusters 1, 2, 3 for the data-point: ', probs[-1,:])
GMM thinks that the data-point belongs to cluster 3 with almost 100% probability! This may seem surprising given that the point appears to lie equidistant (in terms of Euclidean distance) from clusters 3 and 2. We will study in the next subsection how these probabilities are obtained.
Mathematical background
Let $x \in \mathbb{R}^m$ be an m-dimensional sample obtained from a multimode process with K operating modes. In GMM, the overall probability density is formulated as a combination of local Gaussian densities. Let $C_i$ denote the ith local Gaussian cluster with parameters $\theta_i = \{\mu_i, \Sigma_i\}$ (mean vector and covariance matrix) and density

$$g(x|\theta_i) = \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right] \qquad \text{eq. 2}$$

The overall density is then

$$p(x|\theta) = \sum_{i=1}^{K} \omega_i\, g(x|\theta_i) \qquad \text{eq. 3}$$
where $\omega_i$ represents the prior probability that a new sample comes from the ith Gaussian component and $\theta = \{\theta_1, \ldots, \theta_K\}$. The GMM model is constructed by estimating the parameters $\theta_i, \omega_i$ for all the clusters using the training samples $X \in \mathbb{R}^{N \times m}$. The parameters are estimated by optimizing the log-likelihood of the training dataset, given by

$$\sum_{j=1}^{N} \log\left(\sum_{i=1}^{K} \omega_i\, g(x_j|\theta_i)\right) \qquad \text{eq. 4}$$

The optimization is carried out via the EM algorithm. In the E-step at iteration s, the posterior probability that the jth sample comes from the ith Gaussian component is computed via Bayes' rule

$$P^{(s)}(C_i|x_j) = \frac{\omega_i^{(s)}\, g(x_j|\theta_i^{(s)})}{\sum_{k=1}^{K} \omega_k^{(s)}\, g(x_j|\theta_k^{(s)})} \qquad \text{eq. 5}$$
In the M-step, the centroid, covariance, and weight of each cluster are updated using the recomputed memberships from the E-step:

$$\mu_i^{(s+1)} = \frac{\sum_{j=1}^{N} P^{(s)}(C_i|x_j)\, x_j}{\sum_{j=1}^{N} P^{(s)}(C_i|x_j)}$$

$$\Sigma_i^{(s+1)} = \frac{\sum_{j=1}^{N} P^{(s)}(C_i|x_j)\, (x_j - \mu_i^{(s+1)})(x_j - \mu_i^{(s+1)})^T}{\sum_{j=1}^{N} P^{(s)}(C_i|x_j)}$$

$$\omega_i^{(s+1)} = \frac{\sum_{j=1}^{N} P^{(s)}(C_i|x_j)}{N}$$
The iterations continue until some convergence criterion on the log-likelihood objective is met. Did you notice the conceptual similarity with the k-means algorithm for finding model parameters? Previously, we computed the posterior probabilities for data-point 1069 using the predict_proba method. Let us now use Eq. 5 to see if we get the same numbers.
import scipy.stats

x = X_transformed[1069, :]
g1 = scipy.stats.multivariate_normal(gmm.means_[0,:], gmm.covariances_[0,:]).pdf(x)
g2 = scipy.stats.multivariate_normal(gmm.means_[1,:], gmm.covariances_[1,:]).pdf(x)
g3 = scipy.stats.multivariate_normal(gmm.means_[2,:], gmm.covariances_[2,:]).pdf(x)
print('Local component densities: ', g1, g2, g3)

# posterior probabilities via eq. 5
posteriors = gmm.weights_*np.array([g1, g2, g3])
print('Posterior probabilities: ', posteriors/np.sum(posteriors))
There is another method, called the F-J algorithm77, which can be used to find the optimal number of GMM components78 and the model parameters in an integrated manner; the number of components does not need to be specified beforehand. Internally, the method initializes with a large number of components and adaptively adjusts this number by eliminating Gaussian components with insignificant weights. The F-J method also utilizes the EM algorithm for parameter estimation, but with a slightly different weight-update mechanism in the M-step. The reader is encouraged to see the cited references for more details. A downside of the F-J method can be its high computational time.

77 Figueiredo & Jain, Unsupervised learning of finite mixture models, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002.
78 Yu & Qin, Multimode process monitoring with Bayesian inference-based finite Gaussian mixture models, AIChE Journal, 2008.
plt.figure()
plt.scatter(X_transformed[:, 0], X_transformed[:, 1], c = cluster_label, cmap='viridis')
Figure 11.14: GMM based clustering of ellipsoidal data distribution via F-J method
Figure 11.15 shows the clustering of the metal etch data via the GMM method. The BIC method correctly identifies the optimal number of components; note that the F-J method results in 4 local clusters for this dataset.
Figure 11.15: BIC plot and GMM clustering of metal etch data
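A sketch of BIC-based component selection (the search range is an illustrative assumption):

from sklearn.mixture import GaussianMixture
BICs = [GaussianMixture(n_components=k, random_state=100).fit(score_train).bic(score_train) for k in range(1, 10)]
n_optimal = np.argmin(BICs) + 1 # number of components with the lowest BIC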
Due to its probabilistic formulation, GMM is widely applied for monitoring process systems. In this section, we will study one such application for the metal etch process. Figure 11.16 shows the metal etch calibration and faulty batches in the PCA score space. It is apparent that the faulty batches tend to lie away from the calibration clusters. Our objective is to develop a GMM-based monitoring tool that can automatically detect these faulty batches.
Figure 11.16: Calibration (in blue) and faulty (in red) batches in PCA score space
79 Xie & Shi, Dynamic multimode process modeling and monitoring using adaptive Gaussian mixture models, Industrial & Engineering Chemistry Research, 2012.
The global metric is then computed using the posterior probabilities of the test sample with respect to each Gaussian component

$$D_{global}(x_t) = \sum_{k=1}^{K} P(C_k|x_t)\, D_{local}^{(k)}(x_t) \qquad \text{eq. 7}$$

The corresponding control limit is given by

$$D_{global,CL} = \frac{r(N^2 - 1)}{N(N - r)}\, F_{r,\,N-r}(\alpha) \qquad \text{eq. 8}$$

where $F_{r,N-r}(\alpha)$ is the (1−α) percentile of an F-distribution with r and N−r degrees of freedom, and r is the variable dimension (we performed GMM in the PCA score space with 3 latent variables, therefore r = 3). A test sample is considered abnormal if $D_{global} > D_{global,CL}$.
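Eq. 7 needs the local metric D(k)local; the sketch below takes it to be the Mahalanobis distance to the kth Gaussian component (an assumption consistent with the Xie & Shi formulation cited above) and defines a helper used in the loops that follow:

def Dlocal(x_vec, gmm):
    # Mahalanobis distance of x_vec to each fitted Gaussian component
    return [np.dot(np.dot(x_vec - gmm.means_[k,:], np.linalg.inv(gmm.covariances_[k])), x_vec - gmm.means_[k,:]) for k in range(gmm.n_components)]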
N = score_train.shape[0]
Dglobal = np.zeros((N,))
for i in range(N):
    x = score_train[i,:,np.newaxis]
    probs = gmm.predict_proba(x.T)[0]
    Dglobal[i] = np.dot(probs, Dlocal(x.ravel(), gmm))
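The control limit from Eq. 8 can be computed as follows (a sketch with α = 0.05):

import scipy.stats
r = 3 # number of latent variables
Dglobal_CL = r*(N**2-1)/(N*(N-r))*scipy.stats.f.ppf(0.95, r, N-r)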
Figure 11.17 shows the control chart for the training data.
Figure 11.17: Global monitoring chart for metal etch calibration data
test_dataAll = Etch_data[0,0].test # assumption: faulty batches stored in the 'test' field
unfolded_TestdataMatrix = np.empty((1, n_vars*n_samples))
for expt in range(test_dataAll.size):
    test_expt = test_dataAll[expt,0][5:90, 2:]
    unfolded_TestdataMatrix = np.vstack((unfolded_TestdataMatrix, test_expt.flatten()))
unfolded_TestdataMatrix = unfolded_TestdataMatrix[1:,:]
score_test = pca.transform(scaler.transform(unfolded_TestdataMatrix))
# compute Dglobal_test
Dglobal_test = np.zeros((score_test.shape[0],))
for i in range(score_test.shape[0]):
    x = score_test[i,:,np.newaxis]
    probs = gmm.predict_proba(x.T)[0]
    Dglobal_test[i] = np.dot(probs, Dlocal(x.ravel(), gmm))
print('Number of faults identified: ', np.sum(Dglobal_test > Dglobal_CL), ' out of ', len(Dglobal_test))
Summary
In this chapter, we learned how to handle datasets from multimode process operations: we explored k-means and GMM for finding the clusters in the data and built a GMM-based fault detection tool for the multimode semiconductor process.
Part 4
Classical Machine Learning Methods for Process
Monitoring
Chapter 12
Support Vector Machines for Fault Detection
In the previous chapters, we focused on multivariate statistical process monitoring methods that modelled process data through the extraction of latent variables. In this part of the book, we will cover several classical ML techniques that come in handy when building process monitoring applications. These techniques do not attempt to build a statistical model of the underlying data distribution. Rather, the measurement space itself may be segregated into favorable/unfavorable regions or high-density/low-density regions, or pair-wise distances may be computed to generate monitoring metrics. SVM (support vector machine) is one such algorithm, which excels at dealing with high-dimensional, nonlinear, and small or medium-sized datasets.
SVMs are extremely versatile and can be employed for classification and regression tasks in both supervised and unsupervised settings. SVMs, by design, minimize overfitting to provide excellent generalization performance. In fact, before ANNs became the craze in the ML community, SVMs were the toast of the town. Even today, SVM is a must-have tool in every ML practitioner's toolkit. You will find out more about SVMs as you work through this chapter. In terms of uses in the process industry, SVMs have been employed for fault classification, fault detection, outlier detection, soft sensing, etc. We will focus on process monitoring-related usage in this chapter.
The classical SVM is a supervised linear technique for solving binary classification problems. For illustration, consider Figure 12.1a. Here, in a 2D system, the training data-points belong to two distinct (positive and negative) classes. The task is to find a line/linear boundary that separates these 2 classes. Two candidate lines are shown. While these lines clearly do the stated job, something seems amiss: each of them passes very close to some of the training data-points. This can cause poor generalization; for example, the shown test observation 'A' lies closer to the positive samples but would get classified as negative by boundary L2. This clearly is undesirable.
Figure 12.1: (a) Training data with test sample A (b) Optimal separating boundary
The optimal separating line/decision boundary, line L3 in Figure 12.1b, lies as far away as possible from either class of data. L3, as shown, lies midway between the support planes (planes that pass through the training points closest to the separating boundary). During model fitting, SVM simply finds this optimal boundary that corresponds to the maximum margin (the distance between the support planes). In Figure 12.1, any other orientation or position of L3 would reduce the margin and make L3 closer to one class than to the other. Large margins make model predictions robust to small perturbations in the training samples.

Points that lie on the support planes are called support vectors80 and completely determine the optimal boundary; hence the name, support vector machines. In Figure 12.1, if the support vectors are moved, line L3 may change. However, if any non-support vectors are removed, L3 won't be affected at all. We will see later how the sole dependency on the support vectors imparts a computational advantage to SVMs.
80 Calling data-points vectors may seem weird. While this terminology is commonly used in the SVM literature, support vectors refer to the vectors originating from the origin with the data-points on the support planes as their tips.
Mathematical background
Let there be N training samples (x, y), where x is an input vector in the m-dimensional input space and y is the class label (±1). Let the optimal separating hyperplane (a line in 2D space) be represented as $w^T x + b = 0$, where the model parameters (w, b) are found by solving

$$\min_{w,b}\; \frac{1}{2}\|w\|^2 \qquad \text{s.t. } y_i(w^T x_i + b) \geq 1,\;\; i = 1, \cdots, N \qquad \text{eq. 1}$$

Once the model has been fitted, class predictions for a test sample $x_t$ are made as

$$\hat{y}_t = \mathrm{sign}(w^T x_t + b)$$

The expression inside the sign function is also called the decision function; a positive decision function results in a positive class prediction and vice-versa.
The optimization formulation in Eq. 1, and all the others that we will see in this chapter, share a very favorable property: they possess a unique global minimum. This is a huge advantage compared to other powerful ML methods like neural networks, where the issue of local minima can be an inconvenience.
# read data
import numpy as np
data = np.loadtxt('toyDataset.csv', delimiter=',')
X = data[:, [0, 1]]; y = data[:, 2]

# fit linear SVM classifier (sketch; the original fitting code is omitted here
# and the value of C is an illustrative assumption)
from sklearn.svm import SVC
clf = SVC(kernel='linear', C=100).fit(X, y)
The above code provides us the optimal separating boundary shown in Figure 12.181. As with other Sklearn estimators, the predict() method can be used to predict the class of any test observation. We will soon cover the hyperparameters (kernel and C) used in the above code.
81 Check out the online code to see how the separating boundary and support planes are plotted.
Figure 12.2: Presence of the shown bad sample makes perfect linear separation infeasible
To deal with such scenarios, we add a little flexibility into our SVM optimization program by
modifying the constraints as shown below
𝑤 𝑇 𝑥𝑖 + 𝑏 ≥ 1 − 𝜉𝑖 for 𝑦𝑖 = 1
𝑤 𝑇 𝑥𝑖 + 𝑏 ≤ −1 + 𝜉𝑖 for 𝑦𝑖 = −1
Here, the slack variables ($\xi_i$) give each sample the freedom to end up on the wrong side of its support plane and potentially be misclassified during model fitting. However, we would like to keep the number of such violations low, which we achieve by penalizing the violations. The revised SVM formulation becomes
$$\min_{w,b,\xi} \ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i \qquad \text{(eq. 2)}$$

$$\text{s.t.} \quad y_i(w^T x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \cdots, N$$
The above formulation is called soft margin classification (as opposed to the previous hard margin classification); Sklearn implements the soft margin formulation. The positive constant C is a hyperparameter (C=1 by default in Sklearn) and corresponds to the hyperparameter we saw in the previous code. For our toy dataset 2 (in Figure 12.2), the previously shown code yields the same separating boundary as shown in Figure 12.1. The class prediction expression remains the same: $\hat{y}_t = \text{sign}(w^T x_t + b)$.
C as regularization hyperparameter
The slack variables not only help find a solution in the presence of gross impurity, but they also help avoid overfitting noisy data. For example, consider the scenario in Figure 12.3: if no misclassifications are allowed, we end up with a very small margin, while allowing a single misclassification gives a much better margin with potentially better generalization. Therefore, we see that there is a trade-off between margin maximization and training accuracy.
The hyperparameter C is the knob that controls this trade-off. A large value of C implies heavy penalization of constraint violations, which prevents misclassifications, while a small value of C allows more misclassifications during model fitting in exchange for a larger margin.
While the soft margin classification formulation is quite flexible, it won't work for nonlinear classification problems where curved boundaries are warranted. Consider the dataset in Figure 12.4; it is clear that a linear boundary is inappropriate here.
However, all is not lost. One idea to circumvent this issue is to map the original input variables/features into a higher-dimensional space where they become linearly separable. For the data in Figure 12.4, the following transformation would work quite well

$$\varphi(x) = \left(x_1,\ x_2,\ x_1^2 + x_2^2\right)$$
As we can see in Figure 12.5, the data becomes easily linearly separable in the 3D space! An SVM can be trained in the new feature space to obtain the optimal separating hyperplane, and any new test data point can be transformed via the same mapping function, $\varphi$, for its class determination by the fitted SVM model. While this solution looks great, there remains a small issue: how do we find an appropriate mapping for a high-dimensional input dataset? As it turns out, we don't need to find this mapping explicitly, and this is made possible by a neat 'kernel trick'. Yes, the same kernel trick that we saw in Chapter 10! To see how kernels show up in the SVM formulation, we will revisit the mathematical background of SVMs; the primal problem in Eq. 2 can be recast into the following equivalent problem
$$\min_{\alpha} \ \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j x_i^T x_j - \sum_{i=1}^{N}\alpha_i \qquad \text{(eq. 3)}$$

$$\text{s.t.} \quad \sum_{i=1}^{N} y_i\alpha_i = 0; \quad 0 \leq \alpha_i \leq C, \quad i = 1, \cdots, N$$
Here, the optimization parameters $w$ and $b$ have been replaced by the $\alpha$s (also called Lagrange multipliers). This equivalent form is called the dual form (which you may remember from your optimization classes; it's perfectly fine if you have not encountered this term before). Once the $\alpha$s have been found, $w$ and $b$ can be computed via the following
$$w = \sum_{i=1}^{N}\alpha_i y_i x_i, \qquad b = \frac{1}{N_s}\sum_{i\in\{SV\}}\left(y_i - w^T x_i\right)$$
where $N_s$ is the number of support vectors and $\{SV\}$ is the set of support vector indices. Any test data point can be classified as

$$\hat{y}_t = \text{sign}\left(\sum_{i=1}^{N}\alpha_i y_i x_i^T x_t + b\right) \qquad \text{(eq. 4)}$$

In the dual formulation, it is found that the $\alpha$s are non-zero for only the support vectors and zero for the rest of the training samples. This implies that Eq. 4 can be reduced to

$$\hat{y}_t = \text{sign}\left(\sum_{i\in\{SV\}}\alpha_i y_i x_i^T x_t + b\right) \qquad \text{(eq. 5)}$$
Strictly speaking, support vectors need not lie on the support planes. For soft margin classification, data-points with non-zero slacks are also support vectors, and their $\alpha$s are non-zero (the defining characteristic of support vectors). The presence/absence of the support vectors impacts the solution (the objective function and/or the model parameters).
At this point, you may be wondering why we have made things more complicated; why not solve the problem in its original form (Eq. 2), which seemed more interpretable? The reason will become clear very soon. For now, imagine that you are solving the nonlinear problem where SVM finds a separating hyperplane in the higher dimension. Eq. 3 will then look like the following
$$\min_{\alpha} \ \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j \varphi(x_i)^T \varphi(x_j) - \sum_{i=1}^{N}\alpha_i \qquad \text{(eq. 6)}$$

$$\text{s.t.} \quad \sum_{i=1}^{N} y_i\alpha_i = 0; \quad 0 \leq \alpha_i \leq C, \quad i = 1, \cdots, N$$
The most crucial observation here is that the transformed variables $\varphi(x)$ appear only as inner (dot) products. This allows us to use the kernel trick. Once a kernel function $k(x_i, x_j) = \varphi(x_i)^T\varphi(x_j)$ is chosen, Eq. 6 becomes
$$\min_{\alpha} \ \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j k(x_i, x_j) - \sum_{i=1}^{N}\alpha_i \qquad \text{(eq. 8)}$$

$$\text{s.t.} \quad \sum_{i=1}^{N} y_i\alpha_i = 0; \quad 0 \leq \alpha_i \leq C, \quad i = 1, \cdots, N$$
The above is the form in which an SVM model is fitted, and predictions are made as

$$\hat{y}_t = \text{sign}\left(\sum_{i\in\{SV\}}\alpha_i y_i k(x_i, x_t) + b\right)$$

Like in KPCA and KPLS, the usage of kernel functions allows us to obtain powerful nonlinear classifiers while retaining all the benefits of the original linear SVM method!
# generate data
from sklearn.datasets import make_circles
X, y = make_circles(500, factor=.08, noise=.1, random_state=1)
# note that y = 0,1 here and need not be ±1; SVM does internal transformation accordingly
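The fitting and tuning step that produces the optimal hyperparameters quoted below can be sketched as follows. This is a minimal illustration assuming a simple grid search; the candidate grids and cross-validation settings here are assumptions, not the book's original code.

# fit RBF-kernel SVM with C and gamma tuned via grid search (illustrative sketch)
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.01, 0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 1, 10]}
grid = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5).fit(X, y)
model = grid.best_estimator_
print(grid.best_params_) # the text reports C=0.1 and gamma=1 as the optimal values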
You will notice that Sklearn uses the hyperparameter gamma, which for the Gaussian kernel equals $1/(2\sigma^2)$. The optimal C and gamma come out to be 0.1 and 1, respectively, with the classifier solution shown in Figure 12.6. The figure also shows the boundary regions for low and high values of the hyperparameters. As we saw before, a large C leads to overfitting (the boundary gets impacted by the noise). As far as gamma is concerned, a large value (or small $\sigma$) also leads to overfitting.
Figure 12.6: Nonlinear binary classification via kernel SVM and impact of hyperparameters.
[Code for plotting the boundaries is provided online]
Support vector data description (SVDD) is the unsupervised form of the SVM algorithm, used for problems where the training samples belong to only one class and the modeling objective is to determine whether a test/new sample belongs to that same class of data or not.
Consider the motivating example in Figure 12.7: how do we obtain a model of the training data that lets us call sample A an outlier/abnormal? Such problems are quite common in the process industry. Process monitoring and equipment health monitoring are example areas where most or all of the available process data may belong to only the normal plant operations class, and the modeling objective is to identify any abnormal data-point.
Figure 12.7: 2D training dataset with only one class. Boundary in red shows a potential
description model of the dataset that can be used to distinguish the test sample A from
training samples.
A better intuition behind kernels can help us understand the impact of kernel hyperparameters on the classification boundaries. Kernels provide an indirect measure of similarity between any two points in the high-dimensional space. Consider again the Gaussian kernel

$$k(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)$$

Here, if two points $(x, z)$ are close to each other in the original space, then their similarity (or kernel value) in the mapped space will be higher compared to when $x$ and $z$ are far away from each other. Now, let's look at the classifier prediction formula for a sample $z$

$$\hat{y}(z) = \text{sign}\left(\sum_{i\in\{SV\}}\alpha_i y_i \exp\left(-\frac{\|x_i - z\|^2}{2\sigma^2}\right) + b\right)$$

Therefore, the classifier is nothing but a sum of Gaussian bumps from the support vectors (plus an offset b)!
Given the bandwidth ($\sigma$), during training SVM tries to find the optimal values of the bump multipliers (the $\alpha$s) such that the training samples get correct labels while keeping maximum separation between the classification boundary and the training samples. The boundary is simply the set of points where the net summation of bumps and offset becomes zero. Small values of $\sigma$ lead to very localized bumps near the support vectors, resulting in a higher number of support vectors and too many 'wiggles' in the separating boundary, which often indicates overfitting.
The idea behind SVDD is to envelop the training data with a hypersphere (a circle in 2D, a sphere in 3D) containing the maximum number of data-points within the smallest volume. Any new observation that lies farther than the hypersphere radius from the hypersphere center can be regarded as an abnormal observation. But the data in Figure 12.7 do not look like they can be suitably enveloped by a circle. That is correct, and our recourse is to use kernel functions to implicitly project the original data onto a higher-dimensional space where the data can be adequately enveloped within a compact hypersphere. The projection of the optimal hypersphere onto the original space will show up as a tight nonlinear boundary around the dataset!
Just like in the classical 2-class SVM, only a small set of training samples gets to completely define the hypersphere. These data-points, or support vectors, lie on the circumference of or outside the hypersphere (equivalently, on or outside the nonlinear boundary in the original space).
Mathematical background
Assume again that $\varphi(x)$ represents a data-point in the higher-dimensional feature space. In this space, the optimal hypersphere is found via the following optimization problem

$$\min_{R,a,\xi} \ R^2 + C\sum_{i=1}^{N}\xi_i$$

$$\text{s.t.} \quad \|\varphi(x_i) - a\|^2 \leq R^2 + \xi_i, \quad \xi_i \geq 0, \quad i = 1, \cdots, N$$
As is evident, the above program tries to minimize the radius (R) of the hypersphere centered at $a$ such that most of the data-points lie within the hypersphere. The slack variables, $\xi_i$, allow certain samples to fall outside, and the number of such violations is tuned via the hyperparameter C. As before, the problem is solved in its dual form
$$\min_{\alpha} \ \sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j K(x_i, x_j) - \sum_{i=1}^{N}\alpha_i K(x_i, x_i)$$

$$\text{s.t.} \quad \sum_{i=1}^{N}\alpha_i = 1; \quad 0 \leq \alpha_i \leq C, \quad i = 1, \cdots, N$$
Like SVM, the $\alpha$s indicate the position of the training samples w.r.t. the optimal boundary; the following relationships hold true

$$\alpha_i = 0 \ \Rightarrow \ \|\varphi(x_i) - a\|^2 < R^2 \quad \text{(inside the hypersphere)}$$

$$0 < \alpha_i < C \ \Rightarrow \ \|\varphi(x_i) - a\|^2 = R^2 \quad \text{(on the boundary)}$$

$$\alpha_i = C \ \Rightarrow \ \|\varphi(x_i) - a\|^2 > R^2 \quad \text{(outside the hypersphere)}$$

with the center given by $a = \sum_{i\in\{SV\}}\alpha_i\varphi(x_i)$ and the radius by

$$R^2 = k(x_s, x_s) - 2\sum_{i\in\{SV\}}\alpha_i k(x_i, x_s) + \sum_{i\in\{SV\}}\sum_{j\in\{SV\}}\alpha_i\alpha_j k(x_i, x_j)$$
where $x_s$ is any support vector lying on the boundary. Any test observation $x_t$ is abnormal if its distance from the center $a$ in the mapped space is greater than R, where the distance is given as follows
$$\text{Dist}(\varphi(x_t), a)^2 = \|\varphi(x_t) - a\|^2 = k(x_t, x_t) - 2\sum_{i\in\{SV\}}\alpha_i k(x_t, x_i) + \sum_{i\in\{SV\}}\sum_{j\in\{SV\}}\alpha_i\alpha_j k(x_i, x_j)$$
As you can see, specification of the kernel function and other model hyperparameters is all that is needed; no knowledge of the mapping $\varphi$ is required.
OC-SVM vs SVDD
There is another technique closely related to SVDD, called one-class SVM (OC-SVM). In fact, OC-SVM is the unsupervised SVM algorithm currently available in Sklearn. OC-SVM finds a separating hyperplane that best separates the training data from the origin. Its kernelized dual form is given by
$$\min_{\alpha} \ \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j k(x_i, x_j)$$

$$\text{s.t.} \quad \sum_{i=1}^{N}\alpha_i = 1; \quad 0 \leq \alpha_i \leq C, \quad i = 1, \cdots, N$$
You will notice that for the Gaussian kernel, the OC-SVM formulation becomes equivalent to that of SVDD because $k(x_i, x_i) = 1$, and we end up with the same values of the multipliers; the decision boundaries are the same as well. For other kernels with $k(x_i, x_i) \neq 1$, the results would differ.
Previously, we saw that C controls the trade-off between the volume of the hypersphere and the number of misclassifications in the training dataset. C can also be written as

$$C = \frac{1}{Nf}$$

where N is the number of samples and f is the expected fraction of outliers in the training dataset. A smaller value of f (correspondingly, a larger C) will lead to fewer samples being placed outside the hypersphere. In fact, if C is set to 1 (or greater), the hypersphere will include all the samples (since $\sum \alpha_i = 1$ and $\alpha_i = C$ for samples outside the hypersphere). Therefore, C can be set with some educated presumptions about the outlier fraction. In the absence of any advance knowledge, f = 0.01 is often specified to exclude the 1% of samples lying farthest from the hypersphere center.
As far as $\sigma$ is concerned, we previously saw that at low values of $\sigma$ the data boundary becomes very wiggly, with a high number of support vectors, resulting in overfitting. Conversely, at high values of $\sigma$, the boundary tends to become spherical in the original space itself, resulting in underfitting (or non-compact bounding of the data). One approach for bandwidth selection is to use empirical methods based on obtaining a kernel matrix (whose $i,j$th element is $k(x_i, x_j)$) with favorable properties. One such method, the modified mean criterion82, gives the bandwidth as follows
$$\sigma = \sqrt{\frac{\bar{D}^2}{\ln\left(\frac{N-1}{\delta^2}\right)}}$$

$$\bar{D}^2 = \frac{\sum_{i<j}\|x_i - x_j\|^2}{N(N-1)/2}$$

$$\delta = -0.14818008\phi^4 + 0.2846623624\phi^3 - 0.252853808\phi^2 + 0.159059498\phi - 0.001381145$$

$$\phi = \frac{1}{\ln(N-1)}$$

82 Kakde & Sadek, The mean and median criterion for kernel bandwidth selection for support vector data description, IEEE 2017.
Another approach for bandwidth selection is to choose the largest value of $\sigma$ that gives the desired confidence level on the validation dataset. For example, for a confidence level of 99%, $\sigma$ is increased until 99% of the validation samples are correctly classified as inliers; any higher value of $\sigma$ would include more validation samples within the hypersphere. The modified mean criterion can be used as the initial guess, with a subsequent search made around it. Let's now find the nonlinear boundary for the dataset in Figure 12.7.
# read data
import numpy as np
X = np.loadtxt('SVDD_toyDataset.csv', delimiter=',')

# compute bandwidth via modified mean criterion
from scipy.spatial.distance import pdist
N = X.shape[0]
phi = 1/np.log(N-1)
delta = -0.14818008*np.power(phi,4) + 0.2846623624*np.power(phi,3) - 0.252853808*np.power(phi,2) + 0.159059498*phi - 0.001381145
D2 = np.sum(pdist(X, 'sqeuclidean'))/(N*(N-1)/2) # average squared pairwise distance
sigma = np.sqrt(D2/np.log((N-1)/delta**2)) # note: delta is squared inside the log
gamma = 1/(2*sigma*sigma)

# SVM fit
from sklearn.svm import OneClassSVM
model = OneClassSVM(nu=0.01, gamma=gamma).fit(X) # nu corresponds to f
Figure 12.8 shows the bounding boundary for different values of gamma, with f (or nu in Sklearn) kept at 0.01. A value of gamma (= 1) close to that given by the modified mean criterion (= 0.58) provides a satisfactory boundary.
Figure 12.8: SVDD application for data description and impact of model hyperparameter.
We hope that by now you are convinced of the powerful capabilities of SVM for discriminating between different classes of data and for compactly bounding normal operational data. A big requirement for successful application of SVM is that the training dataset should be very representative of the 'normal' data and should fully characterize all the expected variations. Next, we will look at a case study with real process data.
To illustrate a practical application of SVDD for process monitoring, we will re-use the
semiconductor manufacturing process data from Chapter 11. This batch process dataset
contains 19 process variables measured over the course of 108 normal batches and 21 faulty
batches. The batch durations range from 95 to 112 seconds. Figure 12.9 shows the training
samples and the faulty test samples in the principal component space.
Figure 12.9: Normal (in blue) and faulty (in red) batches in PCA score space
For this illustration, the raw data has been processed using multiway PCA, and the transformed 2D (score) data is provided in the Metal_etch_2DPCA_trainingData.csv file. Note that we could also implement SVDD in the original input space, but pre-processing via PCA to remove variable correlation is generally good practice. Moreover, we use the 2D PC space for our analysis just for the ease of illustrating the SVDD boundary; in an actual deployment, you would work in a higher-dimensional PC space for better accuracy. Let's see if our model can identify the faulty samples as outliers in this multi-clustered dataset.
# read data
import numpy as np
X_train = np.loadtxt('Metal_etch_2DPCA_trainingData.csv', delimiter=',')

# fit SVM
from sklearn.svm import OneClassSVM
model = OneClassSVM(nu=0.01, gamma=0.025).fit(X_train) # gamma from modified mean criterion = 0.0025

# predict on faulty test data (the test-data filename below is illustrative)
X_test = np.loadtxt('Metal_etch_2DPCA_testData.csv', delimiter=',')
y_test = model.predict(X_test) # -1 denotes outlier
print('Number of faults identified: ', np.sum(y_test == -1), ' out of ', len(y_test))
Figure 12.10 shows the boundary around the training samples and the faulty samples labeled
according to their correct or incorrect identification. Seventeen out of twenty faulty data
samples have correctly been identified as outliers. This example illustrates the power of SVDD
for compactly describing clustered datasets.
Figure 12.10: (a) SVDD/OC-SVM boundary (in red) around metal-etch training dataset in 2D
PC space (b) Position of correctly and incorrectly diagnosed faulty samples
We would get the same results if we used the distances from the hypersphere center for fault detection. The results from SVDD and OC-SVM will differ if RBFs are not used as the kernel. Unfortunately, Sklearn currently does not provide an SVDD implementation; nonetheless, an SVDD package is available on GitHub83.
This concludes our look into support vector machines. SVMs are in a league of their own and are well-suited for industrial processes with difficult-to-estimate process parameters. With an elegant mathematical background, just a few hyperparameters, excellent generalization capabilities, and a guaranteed unique global optimum, SVMs are among the best ML algorithms.
Summary
In this chapter, we studied the support vector machine algorithm and its varied forms for supervised and unsupervised machine learning. We saw its applications for binary classification, fault detection, and fault classification, and, through kernelized learning, we learned how SVM is adapted for nonlinear modeling. In summary, we have added a powerful tool to your data science toolkit. Next, we will continue building our toolkit and learn a few more powerful classical ML algorithms.
83 https://github.com/iqiukp/SVDD
Chapter 13
Decision Trees and Ensemble Learning for
Fault Detection
Imagine that you are in a situation where even after your best attempts your model could not provide satisfactory performance. What if we tell you that there exists a class of algorithms where you can combine several 'versions' of your 'weak' performing models and generate a 'strong' performer that can provide more accurate and robust predictions compared to its constituent 'weak' models? Sounds too good to be true? It's true, and these algorithms are called ensemble methods.
Ensemble methods are often a crucial component of winning entries in online ML competitions
such as those on Kaggle. Ensemble learning is based on a simple philosophy that committee
wisdom can be better than an individual’s wisdom! In this chapter, we will look into how this
works and what makes ensembles so powerful. We will study popular ensemble methods like
random forests and XGBoost.
The base constituent models in forests and XGBoost are decision trees which are simple yet
versatile ML algorithms suitable for both regression and classification tasks. Decision trees
can fit complex and nonlinear datasets, and yet enjoy the enviable quality of providing
interpretable results. We will look at all these features in detail. Specifically, we will cover the
following topics
• Introduction to decision trees and random forests
• Introduction to ensemble learning techniques (bagging, Adaboost, gradient boosting)
• Fault detection and classification for gas boilers using decision trees and XGBoost
Decision trees (DTs) are inductive learning methods that derive explicit rules from data to make predictions. They partition the feature space into several (hyper)rectangles and then fit a simple model (usually a constant) in each one. As shown in Figure 13.1 for a binary classification problem in a 2D feature space, the partitioning is achieved via a series of if-else statements. As shown, the model is represented using branches and leaves, which lead to a tree-like structure and hence the name decision tree model. The questions asked at each node make it very clear how the model predictions (class A or class B) are generated. Consequently, DTs are the model of choice for applications where ease of rationalizing model results is very important.
Figure 13.1: A decision tree with constant model used for binary classification in a 2D space
The trick in DT model fitting lies in deciding which questions to ask in the if-else statements at each node of the tree. During fitting, these questions split the feature space into smaller and smaller subregions such that the training observations falling in a subregion are similar to each other. The splitting process stops when no further gains can be made or stopping criteria have been met; improper choices of splits will generate a model that does not generalize well. In the next subsection, we will study a popular DT training algorithm called CART (classification and regression trees), which judiciously determines the splits.
Mathematical background
The CART algorithm creates a binary tree, i.e., at each node two branches are created that split the dataset in such a way that the overall data 'impurity' reduces. To understand this, consider the following node84 of a tree. Also assume that we are dealing with a binary classification problem with input vector $x \in R^m$.
The algorithm needs to decide which one of the m input variables, $x_k$, will be used to split the set of n samples at this node, and with what threshold r. CART makes this determination by minimizing the following objective

$$J(k, r) = \frac{n_{left}}{n} I_{left} + \frac{n_{right}}{n} I_{right} \qquad \text{(eq. 1)}$$
where $I_{left/right}$ denotes the data impurity of the left/right subset of data and is given by the Gini measure

$$I = 1 - \sum_{q} p_q^2$$

where $p_q$ is the ratio of samples of the corresponding data subset belonging to class q. For example, if all the samples in a subset belong to class 1, then $p_1 = 1$ and $I = 0$. Therefore, if CART could find k and r such that the n samples get perfectly divided class-wise into the left and right subsets, then the minimum value of J = 0 would be obtained. However, this is usually not possible, and CART tries to do the best it can. The reduction in impurity ($\Delta I_{node}$) achieved by CART at this node is given by

$$\Delta I_{node} = I_{node} - \frac{n_{left}}{n} I_{left} - \frac{n_{right}}{n} I_{right}$$
CART simply follows the aforementioned branching scheme recursively. It starts from the top node (the root node) and keeps splitting the subsets. If a node cannot be split any further (because the impurity cannot be reduced anymore, or hyperparameter settings such as min_samples_split and min_samples_leaf prevent any further split), the node becomes a leaf or terminal node. For prediction, the leaf node corresponding to the test sample is found, and the majority class of the leaf's training subset is assigned to the test sample. Note that a probabilistic prediction for class q can also be made by simply looking at the ratio of the leaf's training samples belonging to class q, as the sketch below shows.
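As a quick illustration of the majority-class and class-ratio predictions just described, consider the hedged sketch below; the arrays X, y, and X_test are placeholders for any classification dataset, and max_depth is an arbitrary setting.

# leaf-based predictions from a fitted tree (minimal sketch; X, y, X_test assumed)
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(model.predict(X_test)) # majority class of each test sample's leaf
print(model.predict_proba(X_test)) # class-wise ratios of the leaf's training samples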
84 There are algorithms like ID3 that create more than 2 branches at a node.
In classical DTs, as we have seen, the split decision rules (or split functions) at the nodes are based on a single variable. This results in axis-aligned hyperplanes that split the input space into several hyperrectangles. Complex split functions using multiple variables (e.g., oblique splits) may also be used, and these may be more suitable for certain datasets.
For regression problems, a tree is built in the same way with a different objective function, J(k, r), which is now given by

$$J(k, r) = \frac{n_{left}}{n}\, MSE_{left} + \frac{n_{right}}{n}\, MSE_{right}$$

$$MSE = \frac{\sum_{sample \,\in\, subset}\left(\hat{y} - y_{sample}\right)^2}{\text{number of samples}}$$
where $\hat{y}$ is the average output of the samples belonging to a subset. The prediction for a test sample is likewise taken as the average output value of all training samples assigned to the test sample's leaf node.
You are not confined to using constant predictive models at the leaves of a regression tree; linear and polynomial predictive models may be more suitable for certain problems. Such DTs are called model trees. While Sklearn allows only the constant model, there exists a package85 called 'linear-tree' that allows building model trees with linear models at the leaves.
85 https://github.com/cerlymarco/linear-tree
Impurity metric
The impurity measure used in Eq. 1 is called Gini impurity. Another commonly employed measure is entropy, given as follows for a dataset with 2 classes

$$I_H = -\sum_{q=1}^{2} p_q \log_2(p_q)$$

$I_H$ becomes 0 when $p_1 = 1$ (or $p_1 = 0$) and 1 when $p_1 = p_2 = 0.5$. Therefore, reduction of entropy leads to more data purity. In practice, both Gini impurity and entropy provide similar results.
# generate data
import numpy as np
x = np.linspace(-1, 1, 50)[:, None]
y = x*x + 0.25 + np.random.normal(0, 0.15, (50,1))
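The decision tree models behind Figure 13.2 can be fitted along the following lines; the regularization setting (max_depth) shown here is an illustrative assumption and may differ from the one used to generate the figure.

# fit unregularized and regularized decision trees (illustrative hyperparameters)
from sklearn import tree
model = tree.DecisionTreeRegressor().fit(x, y) # unregularized; grows until every leaf is pure
model_regularized = tree.DecisionTreeRegressor(max_depth=3).fit(x, y) # regularized
y_pred = model_regularized.predict(x)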
Figure 13.2: Decision tree regression predictions using unregularized and regularized
models.
You can use the plot_tree function to plot the tree itself to understand how the training
dataset has been partitioned by the DT model.
# plot tree
import matplotlib.pyplot as plt
from sklearn import tree
plt.figure(figsize=(20,8))
tree.plot_tree(model, feature_names=['x'], filled=True, rounded=True)
While DTs appear to be a very flexible and useful modeling mechanism, they are seldom used as standalone models. In Figure 13.2, we saw that DTs can easily overfit and give non-smooth or piece-wise constant approximations. Another disadvantage of DTs is instability, i.e., small variations in the training dataset can result in a very different tree. However, there is a reason why we invested time in learning DTs: a single tree may not be useful, but when you combine multiple trees, you get amazing results. We will learn how this is made possible in the next section.
Figure 13.3: A random forest prediction is combination of predictions from multiple decision
trees. [For a classification problem, Sklearn returns the class corresponding to the highest
average class probability]
In a random forest (RF), the trees are grown to full extent, and therefore hyperparameter selection for the trees is not a concern. This makes RF training and execution simple and quick. In fact, RFs have a very small number of tunable hyperparameters. RFs also lend themselves to the computation of variable importances. All these qualities have contributed to the popularity of random forests.
Mathematical background
For training random forests, different trees need to be generated that are as 'distinct from each other' as possible while, at the same time, providing good descriptions of the training dataset. This variety among the trees is achieved via the following two means:
1) Using different training datasets for each tree: If each tree is trained on the same dataset, they will end up being identical, make the same kind of errors, and therefore combining them will offer little benefit. Since we only have a single training dataset to train the RF, bootstrapping is employed. If the original dataset has N samples, bootstrapping allows the creation of multiple datasets, each with $N_b$ ($\leq N$) samples, such that each new dataset is also a good representative of the underlying process that generated the original dataset. Each bootstrap dataset is generated by randomly selecting $N_b$ samples with replacement from the original dataset. In RF, $N_b = N$, and the bootstrapping scheme is illustrated below for N=10.
Figure 13.4 Creation of separate DT models using bootstrap samples. Si denotes the
ith training sample.
2) Using random subsets of input variables to find the optimal split: A very non-intuitive, somewhat surprising, but incredibly effective aspect of RF training is that not all the input variables are considered for determining the optimal split function at a node of a tree in the forest. Instead, a random subset of variables is chosen, and the node impurity is then minimized over these chosen variables. This random selection is performed at every node. If the input vector $x \in R^m$, then the number of random split variables (M) is recommended to be the floored square root of m (see the sketch after this list).
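In Sklearn, this recommendation corresponds to the max_features argument; a brief sketch (the settings are illustrative):

# the floored square root rule via max_features (minimal sketch)
import numpy as np
from sklearn.ensemble import RandomForestClassifier
# for m = 9 input variables, M = int(np.sqrt(9)) = 3 candidate split variables per node
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt')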
The above two tricks during training result in trees that are minimally correlated with each other. The figure below summarizes the RF model fitting procedure. For illustration, it is assumed that $x \in R^9$ and M=3.
You can see that there are only two main hyperparameters to decide: the number of trees and the size of the variable subset. Another advantage of RF is that the constituent trees can be trained in parallel.
Figure 13.6: (Left) Random forest regression predictions, (middle) predictions from a couple of
constituent decision trees, (right) Impact of number of trees in RF model on validation error
Figure 13.6 also shows the validation error for different numbers of trees in the forest. What we see here is a general characteristic of RFs: as the number of trees increases, the validation error plateaus out. Therefore, if computation time is not a concern, it is preferable to use as many trees as needed until the error levels out. As far as M is concerned, a lower value leads to a greater reduction in variance but increases the model bias. Do note that RF primarily brings down the variance (compared to using single DT models) and not the bias; therefore, if the DT itself is underfitting, then RF won't be able to provide high accuracy (or low bias). This is why fully grown trees are used in RF. You will find out the reason behind this later in this chapter.
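For reference, a minimal sketch of the random forest regression fit behind Figure 13.6 is shown below, re-using the toy dataset (x, y) generated earlier in this chapter; the number of trees and the random seed are illustrative assumptions.

# fit a random forest regressor on the toy dataset (illustrative settings)
from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=1).fit(x, y.ravel())
y_pred_rf = rf_model.predict(x) # average of the constituent trees' predictions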
To illustrate the usage of random forests for fault classification, we will use the simulated dataset86 from a gas boiler system presented in the publicly available paper titled 'Machine learning algorithms for classification of boiler faults using a simulated dataset'. The representative system is shown below. Fuel combusts in the combustion chamber, and the resulting hot gas heats up the water stream. Three types of faults, viz., usage of excessive air, gas-side fouling of the heat exchanger, and water-side scaling, have been simulated. A total of 27280 simulations have been provided, including NOC samples, for a range of operational parameters (gas fuel rate from 1 kg/s to 4 kg/s, water mass flow rate from 3 kg/s to 12.5 kg/s, and combustion air temperature from 283 K to 303 K). The original paper reports random forests among the top performing models. Let's see if we can replicate the reported performance.
86 Shohet et al., Simulated boiler data for fault detection and classification. Available at https://ieee-dataport.org/open-access/simulated-boiler-data-fault-detection-and-classification, IEEE Dataport, 2019. Data shared under Creative Commons Attribution license (https://creativecommons.org/licenses/by/4.0/).
# import packages
import numpy as np, pandas as pd, matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# read data
data = pd.read_csv('Boiler_emulator_dataset.txt', delimiter=',')

# separate inputs and fault labels, then split (column layout assumed: last column holds the fault label)
X = data.iloc[:, :-1].values
le = LabelEncoder()
y = le.fit_transform(data.iloc[:, -1])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# scale data
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# fit random forest classifier with default hyperparameters
clf = RandomForestClassifier().fit(X_train_scaled, y_train)
# predict and plot confusion matrix
y_train_pred = clf.predict(X_train_scaled)
y_test_pred = clf.predict(X_test_scaled)

import seaborn as sn
from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(y_test, y_test_pred)

plt.figure(figsize=(12,8)), sn.set(font_scale=2)
sn.heatmap(conf_mat, fmt='.0f', annot=True, cmap='Blues', xticklabels=le.classes_, yticklabels=le.classes_)
plt.ylabel('True Fault Class', fontsize=30, color='maroon')
plt.xlabel('Predicted Fault Class', fontsize=30, color='green')
The confusion matrix shows that the random forest classifier does a pretty good job (that too with default values of the hyperparameters) of correctly identifying the samples' classes. This concludes our study of RFs. Hopefully, we have been able to convince you that RFs can be quite a powerful weapon in your ML arsenal. Let's proceed to ensemble learning to understand what imparts such power to RFs.
The idea of combining multiple 'not so good'/weak models to generate a strong model is not restricted to random forests. The idea is more generic and is called ensemble learning; specific methods that implement the idea (like RFs) are called ensemble methods. Figure 13.7 below shows an ensemble modeling scheme employing a diverse set of base models. The model shown is a heterogeneous ensemble model, as the individual models employ different learning methodologies; in contrast, in homogeneous ensemble models, the base models use the same learning algorithm.
Figure 13.7: Heterogeneous ensemble learning scheme87. RF, which itself is an ensemble method, is used as a base model here.
87 Another popular heterogeneous ensemble technique is stacking, where base models' predictions serve as inputs to another meta-model, which is trained separately.
There are several options at our disposal for aggregating the predictions from the base models. For classification, we can use majority voting and pick the class that is predicted by most of the base models; alternatively, if the base models return class probabilities, then soft voting can also be used. For regression, we can use simple or weighted averaging. A minimal usage sketch follows.
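Sklearn implements these aggregation options via VotingClassifier (and VotingRegressor for regression). The heterogeneous-ensemble sketch below is hedged: the choice of base models and the arrays X_train, y_train are assumptions for illustration.

# heterogeneous ensemble with soft voting (illustrative base models)
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
ensemble = VotingClassifier(
    estimators=[('lr', LogisticRegression()),
                ('rf', RandomForestClassifier()),
                ('svm', SVC(probability=True))],
    voting='soft') # averages class probabilities; use 'hard' for majority voting
ensemble.fit(X_train, y_train)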
In the ensemble world, a weak model is any model that has poor performance due to either high bias or high variance. However, together, the weak models combine to achieve better,
more accurate (lower bias), and/or more robust (lower variance) predictions. Can we then combine any set of poor-performing models and obtain a super accurate ensemble model? Both yes and no! There are certain criteria that must be met to reap the benefits of ensemble learning. First, the base models must individually perform better than random guessing. Second, the base models must be diverse (i.e., the errors they make on unseen data must be uncorrelated). Consider the following simple illustration to understand these requirements.
In the above illustration, the ensemble model has an accuracy of 82.6% even though each base model is only 60% accurate (this corresponds, for instance, to majority voting over 21 independent base models that are each 60% accurate). While it is relatively easy to fulfill the first criterion of building base models with > 50% accuracy, it is tough to obtain diversification/independence among the base models. In Figure 13.7, if all the base models make identical mistakes all the time, combining them will offer no benefit. We already saw some diversification mechanisms in RF. There are other ways as well, and the schematic below provides an overview of some popular ensemble diversification techniques.
Figure 13.8: An overview of ensemble modeling techniques with homogeneous base models
Bagging
In the bagging (bootstrap aggregating) technique, a diverse set of base models is generated by employing the same modeling algorithm with different bootstrap samples of the original training dataset. Unlike RF, the input variables are kept the same in each base model, and the size of the bootstrap dataset is usually kept less than the original number of training samples. Moreover, the base model is not restricted to being a DT model. As the figure below shows, the base models are trained independently (allowing parallel training), and the ensemble prediction is obtained by combining the base models' predictions.
Figure 13.9: Bagging ensemble technique. Note that if you have multiple independently
sampled original training datasets, then bootstrapping is not needed.
The defining characteristic of bagging is that it helps in reducing variance but not bias. It can be shown that if $\bar{\rho}$ is the average correlation among the base model predictions, then the variance of $y_{ensemble}$ equals

$$\bar{\sigma}^2\left(\bar{\rho} + \frac{1-\bar{\rho}}{K}\right)$$

where $\bar{\sigma}^2$ is the variance of an individual base model's predictions and K is the number of base models. Therefore, $\bar{\rho} = 1$ implies no reduction in variance.
Bagging can be used for both regression and classification; Sklearn provides BaggingClassifier and BaggingRegressor for them, respectively. The simple illustration below shows how bagging can help achieve smoother results (classification boundaries in this case).

# fit bagging model (Sklearn uses decision trees as base models by default)
from sklearn.ensemble import BaggingClassifier
bagging_model = BaggingClassifier(n_estimators=500, max_samples=50, random_state=100).fit(X, y) # K=500; each DT is trained on 50 training samples randomly drawn with replacement
Figure 13.10: Classification boundaries obtained using (left) fully grown decision tree and
(right) bagging with fully grown decision trees
Boosting
In the boosting ensemble technique, the base models are again obtained using the same learning algorithm but are fitted sequentially rather than independently of each other. During training, each base model tries to 'correct' the errors made by the boosted model at the previous step. At the end of the process, we obtain an ensemble model that shows lower bias88 than the base models.
88 Variance may or may not decrease; boosting has sometimes been seen to cause overfitting.
Boosting is preferred if the base models exhibit underfitting/high bias. Like bagging, boosting is not restricted to DTs as base models; but if DTs are used, shallow trees (trees only a few levels deep) that don't exhibit high variance are recommended. Moreover, shallow trees, or any other less complex base models, are computationally tractable to train sequentially during the boosting process. Boosting can be used for both regression and classification, and there are primarily two popular boosting methods, namely, Adaboost and Gradient Boosting.
Gradient Boosting
Adopting an approach different from Adaboost, gradient boosting corrects the errors of its predecessor by sequentially training each base model on the residual errors made by the previous base models. Figure 13.12 shows the algorithm's scheme.

Figure 13.12: Gradient boosting ensemble scheme. Here, $\hat{y}_i$ denotes the predictions for the training samples from the ith base model. Note that the hyperparameter $\vartheta$ need not be constant (as used here) for the different base models.
In the scheme above, the hyperparameter $\vartheta \in (0, 1]$ is called the shrinkage parameter or learning rate. It is used to prevent overfitting and is recommended to be assigned a small value like 0.1; a very small learning rate will necessitate the usage of a large number of base models. The number of iterations or base models, K, is another important hyperparameter that needs to be tuned carefully: too small a K will not achieve sufficient bias reduction, while too large a K will cause overfitting. It can be optimized using cross-validation or early stopping (keeping track of the validation error at different stages/iterations of ensemble model training and stopping when the error stops decreasing or the model residuals no longer contain any pattern that can be modeled), as the sketch below illustrates.
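Sklearn's GradientBoostingRegressor exposes both of the knobs just discussed; the hedged sketch below shows the shrinkage parameter together with built-in early stopping. All parameter values are illustrative, and X_train, y_train are placeholder arrays.

# gradient boosting with shrinkage and early stopping (minimal sketch)
from sklearn.ensemble import GradientBoostingRegressor
gb_model = GradientBoostingRegressor(
    learning_rate=0.1,        # shrinkage parameter
    n_estimators=1000,        # upper bound on the number of base models K
    validation_fraction=0.1,  # held-out fraction used to monitor validation error
    n_iter_no_change=10)      # stop when validation error stops improving
gb_model.fit(X_train, y_train)
print(gb_model.n_estimators_) # number of base models actually fitted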
If you have been active in the ML world in recent times, then you must have heard the term XGBoost. It stands for eXtreme Gradient Boosting and is a popular library that implements several tricks and heuristics to make gradient boosting-based model training very effective, especially for large datasets. XGBoost uses DTs as base models. Let's apply XGBoost to the gas-boiler fault classification problem and see if we can better the RF model's performance.
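The fitting step that precedes the confusion-matrix code below can be sketched as follows, assuming the same scaled arrays and label encoder as in the RF example; only default XGBoost settings are used here.

# fit XGBoost classifier with default settings (minimal sketch)
from xgboost import XGBClassifier
clf = XGBClassifier().fit(X_train_scaled, y_train)
y_train_pred = clf.predict(X_train_scaled)
y_test_pred = clf.predict(X_test_scaled)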
# plot confusion matrix
conf_mat = confusion_matrix(y_test, y_test_pred)
plt.figure(figsize=(12,8)), sn.set(font_scale=2)
sn.heatmap(conf_mat, fmt='.0f', annot=True, cmap='Blues', xticklabels=le.classes_, yticklabels=le.classes_)
plt.ylabel('True Fault Class', fontsize=30, color='maroon')
plt.xlabel('Predicted Fault Class', fontsize=30, color='green')
A cursory glance at the confusion matrix suggests performance similar to that of the RF model. Perhaps a systematic hyperparameter optimization would provide even better performance; our aim here was to demonstrate the ease with which reasonably good classifier models can be built with the default settings of XGBoost.

This chapter was a whirlwind tour of ensemble learning, which is an important pillar of modern machine learning. Each topic introduced in the chapter has several more advanced aspects that we could not cover here. However, you now have the requisite fundamentals and familiarization in place to explore these advanced aspects.
Summary
In this chapter, we learnt the decision tree modeling methodology and saw how random forests can help overcome the high variance issues of trees. Then, we studied the broader concept of ensemble learning, which can be used to overcome the high bias and/or high variance issues of weak models. Two popular ensemble methods, viz., bagging and boosting, were conceptually introduced. Finally, we used the gas boiler dataset to illustrate the ease with which fault classification models can be built using random forests and XGBoost (a popular implementation of gradient boosting trees). This chapter has added several powerful techniques to your ML arsenal, and you will find yourself using these quite often.
Chapter 14
Proximity-based Techniques for Fault
Detection
Most of the anomaly detection techniques that we have studied so far work by finding some structure in the training dataset, such as the low-dimensional manifold in PCA, the NOC boundary in SVDD, the optimal separating hyperplane in SVM, etc. However, another popular class of methods exists that utilizes a very straightforward and natural notion of anomalies as data points that are far away or isolated from the NOC data samples; logically, these methods are classified as proximity-based methods.
The proximity of a data point can simply be defined as its distance (as done in the k-NN method) from its neighbors. An abnormal data point lies far away from other NOC data, and therefore its nearest-neighbor distances will be large compared to those for NOC samples. Another related but slightly different notion of proximity is the density, or number of other data points, in a local region around a test sample. Local outlier factor (LOF) is a popular method in this category, wherein samples not lying in dense regions are classified as anomalies. The third technique that we will study in this chapter, isolation forest (IF), uses the similar notion that anomalies are 'few and far between'. Here, the data space is split until each data point gets 'isolated'; anomalies can be isolated easily and require very few splits compared to NOC samples that lie close to each other.
You may have realized that these techniques generate interpretable results and are easy to understand. Correspondingly, they come in pretty handy for analyzing complex systems whose characteristics may not be well-known a priori. Let's now get down to business. We will cover the following topics

• Fault detection using the k-nearest neighbors (k-NN) method
• Fault detection using the local outlier factor (LOF) method
• Fault detection using isolation forests (IF)
The k-nearest neighbors (k-NN or KNN) algorithm is a versatile technique based on the simple intuitive idea that the label/value for a new sample can be obtained from the labels/values of the closest neighboring samples (in the feature space) from the training dataset. The parameter k denotes the number of neighboring samples utilized by the algorithm. As shown in Figure 14.1, k-NN can be used for both classification and regression: for classification, k-NN assigns the test sample to the class that appears the most among the k neighbors; for regression, the predicted output is the average of the values of the k neighbors. Due to its simplicity, k-NN is widely used for pattern classification and was included in the list of top 10 algorithms in data mining.89
Figure 14.1: k-NN illustration for classification (left) and regression (right). Yellow data point
denotes unknown test sample. The grey-shaded region represents the neighborhood with 3
nearest samples.
k-NN belongs to the class of lazy learners, where models are not built explicitly until test samples are received. At the other end of the spectrum, eager learners (like SVM, decision trees, and ANN) 'learn' explicit models from training samples. Unsurprisingly, training is slower and testing is faster for eager learners. k-NN requires computing the distance of the test sample from all the training samples; therefore, k-NN also falls under the classification of instance-based learning. Instance-based learners make predictions by comparing the test sample with training instances stored in memory. On the other hand, model-based learners do not need to store the training instances for making predictions.
89 Wu et al., Top 10 algorithms in data mining, Knowledge and Information Systems, 2008.
Conceptual background
Apart from an integer k and input-output training pairs, the k-NN algorithm needs a distance metric to quantify the closeness of a test sample to the training samples; the standard Euclidean metric is commonly employed. Once the nearest neighbors have been determined, two approaches, namely uniform and distance-based weighting, can be employed to decide the weight assigned to each neighbor, which determines the neighbor's contribution to the prediction. In uniform weighting, all k neighbors are treated equally, while in distance-based weighting each of the k neighbors is weighted by the inverse of its distance from the test sample, so that closer neighbors have greater contributions. The figure below illustrates the difference between the two weighting schemes for a classification problem
Figure 14.2: Illustration on impact of weight-scheme on k-NN output. Dashed circles are for
distance references.
In Figure 14.2, with uniform weighting (also called majority voting for classification problems), the test sample is assigned to class 1 for k = 1 or 3 and to class 2 for k = 6; for k = 8, no decision can be made. With distance-based weighting, the test sample is always classified as class 1 for k = 1, 3, 6, or 8. This illustration shows that distance weighting can help reduce the prediction's dependence on the choice of k.
For predictions, k-NN needs to compute the distance of a test sample from all the training samples. For large training sets, this computation can become expensive. However, specialized techniques, such as KDTree and BallTree, have been developed to speed up the extraction of neighboring points without impacting prediction accuracy; these techniques utilize the structure in the data to avoid computing distances from all training samples. The NearestNeighbors implementation in scikit-learn automatically selects the algorithm best suited to the problem at hand. The KNeighborsRegressor and KNeighborsClassifier classes are provided by scikit-learn for regression and classification, respectively; a brief usage sketch follows.
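The minimal sketch below illustrates both classes; the arrays (X_train, y_train, z_train, X_test) and the choices of k and weighting scheme are illustrative assumptions.

# k-NN classification and regression (minimal sketch; training/test arrays assumed)
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
knn_clf = KNeighborsClassifier(n_neighbors=5, weights='distance').fit(X_train, y_train)
print(knn_clf.predict(X_test)) # distance-weighted vote among the 5 nearest neighbors
knn_reg = KNeighborsRegressor(n_neighbors=5, weights='uniform').fit(X_train, z_train)
print(knn_reg.predict(X_test)) # average output of the 5 nearest neighbors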
A couple of things to pay careful attention to in k-NN are variable selection and variable scaling. Variables that are not important for output predictions should be removed; otherwise, unimportant variables will undesirably impact the determination of the nearest neighbors. Further, the selected variables should be properly scaled to ensure that variables with large magnitudes do not dwarf the contributions of other variables during distance computations.
A few other notable applications of k-NN for process systems include the work of Facco et al.
on automatic maintenance of soft sensors92, Borghesan et al. on forecasting of process
disturbances93, Cecilio et al. on detecting transient process disturbances94, and Zhou et al. on
fault identification in industrial processes95. These applications may not utilize the k-NN
method directly for classification or regression but use the underlying concept of similarity of
nearest neighbors.
90 Dong Wang, K-nearest neighbors-based methods for identification of different gear crack levels under different motor speeds and loads: Revisited, Mechanical Systems and Signal Processing, 2016.
91 He and Wang, Fault detection using the k-nearest neighbor rule for semiconductor manufacturing processes, IEEE Transactions on Semiconductor Manufacturing, 2007.
92 Facco et al., Nearest-neighbor method for the automatic maintenance of multivariate statistical soft sensors in batch processing, Industrial & Engineering Chemistry Research, 2010.
93 Borghesan et al., Forecasting of process disturbances using k-nearest neighbours, with an application in process control, Computers and Chemical Engineering, 2019.
94 Cecilio et al., Nearest neighbors methods for detecting transient disturbances in process and electromechanical systems, Journal of Process Control, 2014.
95 Zhou et al., Fault identification using fast k-nearest neighbor reconstruction, Processes, 2019.
Fault detection by k-NN28 (FD-KNN) is based on the simple idea that the distance of a faulty test sample from its nearest training samples (obtained from normal operating plant conditions) must be greater than a normal sample's distance from its neighboring training samples. Incorporating this idea into the process monitoring framework, a monitoring metric (termed the k-NN squared distance) is defined for each training sample as follows

$$D_i^2 = \sum_{j=1}^{k} d_{ij}^2$$

where $d_{ij}$ is the distance of the ith sample from its jth nearest neighbor. After computing the k-NN squared distances for all the training samples, a threshold corresponding to the desired confidence limit can be computed. A test sample is considered faulty if its k-NN squared distance is greater than the threshold.
96 Note that we are using PC scores as model inputs rather than the original variables. This is primarily for visualization convenience.
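The control-limit computation below presumes that the (log of the) k-NN squared distances of the training samples, D2_log, is already available. A minimal sketch of that step is given here, assuming the training PC scores are stored in a variable score_train (name assumed):

# compute k-NN squared distances for training samples (minimal sketch)
from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors(n_neighbors=6).fit(score_train) # 5 neighbors + the sample itself
d_nbrs, indices = nbrs.kneighbors(score_train)
D2 = np.sum(d_nbrs[:, 1:]**2, axis=1) # first column is the self-distance; exclude it
D2_log = np.log(D2)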
D2_log_CL = np.percentile(D2_log,95)
Figure 14.3 shows the resulting monitoring chart for the training samples. Figure 14.4 shows the monitoring chart for the faulty test samples: 15 out of 20 samples are correctly flagged as faulty, while 5 samples are misdiagnosed as normal.
# D2_log_test
d_nbrs_test, indices = nbrs.kneighbors(score_test)
d_nbrs_test = d_nbrs_test[:, 0:5] # we want only the 5 nearest neighbors
D2_test = np.sum(d_nbrs_test**2, axis=1)
D2_log_test = np.log(D2_test)
The simple nature and powerful capabilities of k-NN have put it among the top data-mining algorithms. However, it is not rare to encounter scenarios where k-NN-based abnormality detection gives poor performance. One such scenario involves NOC training samples distributed among different clusters with varying densities, as shown in Figure 14.5; a few test samples are also shown in the figure. k-NN-based FD will correctly identify X2, X3, and X4 as anomalies but will fail to detect X1 as an abnormal sample. This happens because the NOC samples in cluster A are sparsely distributed, which makes the fault detection threshold large enough that the average k-NN distance of X1 lies below it. To the naked eye, X1 is an obvious outlier because it is located 'far' away from its 'local' neighbors, which are densely distributed. But how do we modify the k-NN algorithm to embed this logic? Well, one thing that can be done is to compare the local density of a test sample with the local densities of its neighbors: if the test sample has a substantially lower density than its neighbors, then it is potentially an anomaly. This is the guiding principle behind the local outlier factor (LOF) algorithm.
[Figure 14.5: NOC training samples distributed among clusters of varying densities, shown with test points X1–X4]
➢ k-distance of $x_t$
For a specified hyperparameter k, the k-distance of $x_t$ is simply the distance between $x_t$ and the kth nearest neighbor of $x_t$, i.e.,

$$k\text{-distance}(x_t) = d_k(x_t, x^*) = \sqrt{\sum_{i=1}^{m}\left(x_{ti} - x_i^*\right)^2}$$

where $x^*$ is the kth nearest neighbor of $x_t$.
[Illustration: k-distance of $x_t$ for k = 3]

➢ k-distance neighborhood of $x_t$
The k-distance neighborhood, $N_k(x_t)$, is the set of all samples whose distance from $x_t$ does not exceed the k-distance of $x_t$.

➢ reachability distance of $x_t$ from a neighbor $x^o$

$$\text{reach-dist}_k(x_t, x^o) = \max\{k\text{-distance}(x^o),\ d(x_t, x^o)\}$$

➢ local reachability density of $x_t$

$$lrd_k(x_t) = \left(\frac{\sum_{x^o \in N_k(x_t)} \text{reach-dist}_k(x_t, x^o)}{|N_k(x_t)|}\right)^{-1}$$
➢ LOF of $x_t$
The local reachability density of $x_t$ is compared with those of its neighbors to compute the LOF. Specifically, the local outlier factor is the average of the ratio of the local reachability density of the neighbors to that of $x_t$.

$$LOF_k(x_t) = \frac{\sum_{x^o \in N_k(x_t)} lrd_k(x^o)}{|N_k(x_t)|\ lrd_k(x_t)}$$
A threshold for LOF is computed during training based on a 'contamination' parameter that specifies the proportion of anomalies in the training data (or the acceptable false alarm rate if the training dataset contains only NOC samples).
MLforPSE.com|242
Chapter 14: Proximity-based Techniques for Fault Detection
Figure 14.6 shows the monitoring chart for the training samples. Note that specifying a contamination of 5% results in 5 training samples being flagged as abnormal. Figure 14.7 shows the monitoring chart for the faulty test samples: 16 out of 20 samples are correctly flagged as faulty.
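The training step assumed by the test-data code below can be sketched as follows; the variable score_train and the hyperparameter values are assumptions chosen to be consistent with the surrounding text (5% contamination, 95% control limit).

# fit LOF model on NOC training scores (minimal sketch)
from sklearn.neighbors import LocalOutlierFactor
lof_model = LocalOutlierFactor(n_neighbors=20, contamination=0.05, novelty=True).fit(score_train)
lof_train = -lof_model.negative_outlier_factor_ # LOF values of the training samples
lof_CL = np.percentile(lof_train, 95) # 95% control limit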
# scale and perform PCA on faulty test data; then find LOF values
score_test = pipe.transform(unfolded_TestdataMatrix)
lof_test = -lof_model.score_samples(score_test)
print('Number of flagged faults (using control chart): ', np.sum(lof_test > lof_CL))
# can also use predict() function of LOF class to flag faulty samples
print('Number of flagged faults (using predict): ', np.sum(lof_model.predict(score_test) == -1))
Isolation forest (IF) is an ensemble of binary decision trees, also called isolation trees. IF is an unsupervised variant of random forests and is popularly employed for outlier detection and novelty/abnormality detection. Model training involves the construction of several isolation trees, and each tree is generated in such a way that every training sample ends up in its own separate leaf (i.e., gets 'isolated'). Figure 14.8 provides a simple illustration: a cluster of four NOC data points and two sample isolation trees are shown. It is apparent that each tree splits the measurement space through several binary decision rules to 'isolate' every training sample. Post model-training, a test sample is run through all the trained trees, and its abnormality is judged based on the average depth the sample reaches in the trees. In general, an anomaly tends to lie close to the root node and therefore has a shorter path length compared to the NOC points, which travel deep into the trees. This is the guiding principle of isolation forests.
[Figure 14.8: Isolation trees illustrated on a cluster of four NOC points (a, b, c, d) in the (x1, x2) space. Two sample trees are shown; each tree isolates every training sample through a sequence of binary splits, e.g., point a gets 'isolated' after just a few splits.]
In Chapter 13, we saw that random forests acquire strong predictive capabilities through diversification of the constituent decision trees. In IFs, diversification is achieved by using a random subset of training samples (sampled without replacement) for training each tree. Furthermore, at every node of a tree, a feature/variable and the split value97 are randomly chosen to form a decision rule.
97
The split value is chosen between the minimum and maximum values of the chosen variable.
Figure 14.9 shows the monitoring chart for the training samples. Note that the 95th percentile of the training scores is used as the control limit, which results in 6 training samples being flagged as abnormal. Figure 14.10 shows the monitoring chart for the faulty test samples. Only 7 out of 20 samples are correctly flagged as faulty, suggesting that the IF model is inappropriate for this dataset.
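For reference, the IF model and its control limit may be obtained along the following lines; score_train and the hyperparameter values are illustrative assumptions.

# a minimal sketch of IF model fitting (score_train assumed to hold the PCA scores of NOC training data)
import numpy as np
from sklearn.ensemble import IsolationForest

IF_model = IsolationForest(n_estimators=100, random_state=1)
IF_model.fit(score_train)

# IF scores for the training samples; higher values imply greater abnormality
IFscore_train = -IF_model.score_samples(score_train)
IF_CL = np.percentile(IFscore_train, 95)  # 95th percentile as control limit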
# scale and perform PCA on faulty test data; then find IF score values
score_test = pipe.transform(unfolded_TestdataMatrix)
IFscore_test = -IF_model.score_samples(score_test)
print('Number of flagged faults (using control chart): ', np.sum(IFscore_test > IF_CL))
# can also use predict() function of IF class to flag faulty samples
print('Number of flagged faults (using predict): ', np.sum(IF_model.predict(score_test) == -1))
The IF failed in the above illustration due to the restriction of using only horizontal and vertical splits. An alternative algorithm, called Extended Isolation Forest (https://github.com/sahandha/eif), has been proposed by Hariri et al.; it uses random slopes for the splits and therefore achieves better performance (albeit at a higher computational cost).
Summary
In this chapter, we studied anomaly detection methods that rely on different notions of proximity. Specifically, we covered k-NN, local outlier factor, and isolation forests. Although isolation forests are technically not included in the class of proximity-based techniques, we clubbed them with the other proximity-based methods in this chapter due to their treatment of anomalies as samples that are 'few and far between'. With this, we have covered the major classical ML techniques that are widely employed for fault detection in process systems. We will next move on to artificial neural network-based techniques for process monitoring.
Part 5
Artificial Neural Networks for Process Monitoring
Chapter 15
Fault Detection & Diagnosis via Supervised
Artificial Neural Networks Modeling
It won't be an exaggeration to say that artificial neural networks (ANNs) are currently the most powerful modeling construct for describing generic nonlinear processes. ANNs can capture any kind of complex nonlinearity, don't impose any specific assumptions about process characteristics, and don't demand process insights prior to model fitting. Furthermore, several recent technical breakthroughs and computational advancements have enabled (deep) ANNs to provide remarkable results for a wide range of problems. Correspondingly, ANNs have (re)caught the fascination of data scientists, and the process industry is witnessing a surge in successful applications of ML-based process control, predictive maintenance, inferential modeling, and process monitoring.
ANNs can be used in both supervised and unsupervised learning settings. While we will cover the supervised learning-based FDD applications of ANNs in this chapter, unsupervised learning is covered in the next chapter. Supervised fitting of ANN models is applicable if you have an adequate number of historical faulty samples (so that you can fit a fault classification model) or your signals are categorizable into predictor and response variables (so that you can fit a regression model and monitor residuals). Different forms of ANN architectures have been devised (such as FFNNs, RNNs, CNNs) to deal with datasets with different characteristics. CNNs are mostly used with image data and therefore, we will study FFNNs and RNNs in this chapter.
There is no doubt that ANNs have proven to be monstrously powerful. However, it is not easy
to tame this monster. If the model hyperparameters are not set judiciously, it is very easy to
end up with disappointing results. The reader is referred to Part 3 of Book 1 of this series for
a detailed exposition on ANN training strategies and different facets of ANN models. In this
chapter, the focus is on exposing the user to how ANNs can be used to build process
monitoring applications. Specifically, the following topics are covered
• Introduction to ANNs
• Introduction to RNNs
• Process monitoring using ANNs via external analysis
Artificial neural networks (ANNs) are nonlinear empirical models which can capture complex relationships between input-output variables via supervised learning or recognize data patterns via unsupervised learning. Architecturally, ANNs were inspired by the human brain and are a complex network of interconnected neurons as shown in Figure 15.1. An ANN consists of an input layer, a series of hidden layers, and an output layer. The basic unit of the network, the neuron, accepts a vector of inputs from the source input layer or the previous layer of the network, takes a weighted sum of the inputs, and then performs a nonlinear transformation to produce a single real-valued output. Each hidden layer can contain any number of neurons.
Figure 15.1: Architecture of a single neuron (with nonlinear mapping f(.)) and a feedforward neural network with 2 hidden layers
We will use data from a CCPP to illustrate the ease with which neural network models can be built in Python. The dataset98 comes from a combined cycle power plant composed of a gas turbine (GT), a steam turbine (ST), and a heat recovery steam generator as shown in Figure 15.2. Here, energy from fuel combustion generates electricity in the gas turbine, and residual energy in the hot exhaust/flue gas from the GT is recovered to produce steam. This steam is used to generate further electricity in the steam turbine. The combined electric power generated by both GT and ST over a period of 6 years (as hourly averages) is provided in the dataset. Hourly average values of ambient temperature (AT), ambient pressure (AP), relative humidity (RH), and exhaust vacuum (V) are also provided. These variables influence the net hourly electrical energy output (also provided in the dataset) of the plant operating at full load, which will be the target variable in our ANN model.
98
UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/combined+cycle+power+plant
Figure 15.3 clearly indicates the impact of input variables on the electrical power (EP).
Figure 15.3: Plots of influencing variables (on x-axis) vs Electrical Power (on y-axis)
There is also a hint of a nonlinear relationship between exhaust vacuum and power. While it may seem that AP and RH do not influence power strongly, it is a known fact that power increases with increasing AP and RH individually99. Let us now build a FFNN model with 2 hidden layers to predict power. We first split the dataset into training and test data, and then scale the variables.
99
Pinar Tufekci, Prediction of full load electrical power output of a base load operated combined cycle power plant using
machine learning methods, Electrical Power and Energy Systems, 2014
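For completeness, a minimal sketch of the split step is given below; the arrays X (holding AT, V, AP, RH) and y (holding power), as well as the split settings, are assumptions for illustration.

# a minimal sketch of the train/test split (X and y assumed to hold the inputs and power, respectively)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)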
# scale data
from sklearn.preprocessing import StandardScaler
X_scaler = StandardScaler()
X_train_scaled = X_scaler.fit_transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
y_scaler = StandardScaler()
y_train_scaled = y_scaler.fit_transform(y_train)
y_test_scaled = y_scaler.transform(y_test)
To build the FFNN model, we will import the relevant Keras classes and add the different layers of the network sequentially. The Dense class is used to define a layer that is fully connected to the previous layer.
# define model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(8, activation='relu', kernel_initializer='he_normal', input_shape=(4,))) # 8 neurons in 1st hidden layer
model.add(Dense(5, activation='relu', kernel_initializer='he_normal')) # 5 neurons in 2nd hidden layer
model.add(Dense(1)) # 1 neuron in output layer
The above few lines completely define the structure of the FFNN. Next, we will compile and fit the model.
# compile model
model.compile(loss='mse', optimizer='Adam') # mean-squared error is to be minimized
# fit model
model.fit(X_train_scaled, y_train_scaled, epochs=25, batch_size=50)
# predict y_test
y_test_scaled_pred = model.predict(X_test_scaled)
y_test_pred = y_scaler.inverse_transform(y_test_scaled_pred)
The above lines are all it takes to build a FFNN and make predictions. Quite convenient, isn't it? Figure 15.4 compares actual vs predicted power for the test data.
Figure 15.4: Actual vs predicted target for CCPP dataset (obtained R2 = 0.93)
For the relatively simple CCPP dataset, we can obtain a reasonable model with just 1 hidden
layer with 2 neurons. Nonetheless, this example has now familiarized us with the process of
creating a DNN.
We know that each input sample is 4-dimensional ($x \in \mathbb{R}^4$). In the forward pass (also called forward propagation), the input is first processed by the neurons of the first hidden layer. In the jth neuron of this layer, the weighted sum of the inputs is nonlinearly transformed via an activation function, g

$$a_j = g(w_j^T x + b_j)$$

$w_j \in \mathbb{R}^4$ are the weights applied to the inputs and $b_j$ is the bias added to the sum. Thus, each neuron has 5 parameters (4 weights and a bias), leading to 40 parameters that need to be estimated across the 8 neurons of the 1st layer. The outputs of all the 8 neurons form the vector $a^{(1)} \in \mathbb{R}^8$

$$a^{(1)} = g^{(1)}(W^{(1)} x + b^{(1)})$$

where each row of $W^{(1)} \in \mathbb{R}^{8 \times 4}$ contains the weights of a neuron. The same activation function is used by all the neurons of a layer. $a^{(1)}$ becomes the input to the 2nd hidden layer

$$a^{(2)} = g^{(2)}(W^{(2)} a^{(1)} + b^{(2)}) \in \mathbb{R}^5$$

where $W^{(2)} \in \mathbb{R}^{5 \times 8}$. Each neuron in the 2nd layer has 8 weights and a bias parameter, leading to 45 parameters in the layer. The final output layer has a single neuron with 6 parameters (5 weights and a bias) and no activation function, giving the network output as follows

$$\hat{y} = w^{(3)T} a^{(2)} + b^{(3)}$$

where $w^{(3)} \in \mathbb{R}^5$.
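To make the bookkeeping concrete, here is a small NumPy sketch of the forward pass; the random weights below are stand-ins for the fitted Keras parameters.

# a sketch of the forward pass with random stand-in parameters
import numpy as np

relu = lambda z: np.maximum(z, 0)
x = np.random.randn(4)                                # one 4-dimensional input sample
W1, b1 = np.random.randn(8, 4), np.random.randn(8)    # 8*4 + 8 = 40 parameters
W2, b2 = np.random.randn(5, 8), np.random.randn(5)    # 5*8 + 5 = 45 parameters
w3, b3 = np.random.randn(5), np.random.randn()        # 5 + 1 = 6 parameters

a1 = relu(W1 @ x + b1)    # 1st hidden layer output
a2 = relu(W2 @ a1 + b2)   # 2nd hidden layer output
y_hat = w3 @ a2 + b3      # linear output neuron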
Recurrent neural networks (RNNs) are ANNs for dealing with sequential data, where the order of occurrence of data holds significance. In the FFNN-based NARX model that we studied in the previous section, there is no provision for implicit or explicit specification of the temporal/sequential nature of data, i.e., that x(k-2) comes before x(k-1), for example. There is no efficient mechanism to specify this temporal order of data in a FFNN. RNNs accomplish this by processing the elements of a sequence recurrently and storing a hidden state that summarizes the past information during the processing. The basic unit in an RNN is called an RNN cell, which simply contains a layer of neurons. Figure 15.5 shows the architecture of an RNN consisting of a single cell and how it processes a data sequence with ten samples.
Figure 15.5: Representation of an RNN cell in rolled and unrolled format. The feedback signal in the rolled format denotes the recurrent nature of the cell. Here, the hidden state (h) is assumed to be the same as the intermediate output (y). h(0) is usually taken as a zero vector.
We can see that the ten samples are processed in the order of their occurrence and not all at once. An output, y(i), is generated at the ith step and then fed to the next step for processing along with x(i+1). By way of this arrangement, y(i+1) is a function of x(i+1) and y(i). Since y(i) itself is a function of x(i) and y(i-1), y(i+1) effectively becomes a function of x(i+1), x(i), and y(i-1). Continuing the logic further implies that the final sequence output, y(10), is a function of all ten inputs of the sequence, that is, x(10), x(9), …, x(1). This 'recurrent' mechanism leads to efficient capturing of temporal patterns in data.
RNN outputs
If the neural layer in the RNN cell in Figure 15.5 contains n neurons (n equals 4 in the shown figure), then each y(i) or h(i) is an n-dimensional vector. For simple RNN cells, y(i) equals h(i). Let x be an m-dimensional vector. At any ith step, we can write the following relationship

$$y^{(i)} = g(W_x x^{(i)} + W_y y^{(i-1)} + b)$$

where $W_x \in \mathbb{R}^{n \times m}$ with each row containing the weight parameters of a neuron as applied to the x vector, $W_y \in \mathbb{R}^{n \times n}$ with each row containing the weight parameters of a neuron as applied to the y vector, $b \in \mathbb{R}^n$ contains the bias parameters, and $g$ denotes the activation function. The same neural parameters are used at each step.
If all the outputs of the sequence are of interest, then the network is called a sequence-to-sequence or many-to-many network. However, very often only the last step output is needed, leading to a sequence-to-vector or many-to-one network. Moreover, the last step output may need to be further processed, and so a FC (fully connected) layer is often added. Figure 15.6 shows one such topology; a code sketch of this topology is given below.
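A minimal Keras sketch of the many-to-one topology of Figure 15.6 follows; the sequence length (10 steps) and input dimension (3 variables) are illustrative assumptions.

# a sketch of a many-to-one RNN followed by a FC output layer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense

model = Sequential()
model.add(SimpleRNN(4, input_shape=(10, 3)))  # 4 neurons; only the last step output is returned
model.add(Dense(1))                           # FC layer processes the last step output
model.compile(loss='mse', optimizer='Adam')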
LSTM networks
RNNs are powerful dynamic models; however, the vanilla RNN (with a single neural layer in a cell) introduced before faces difficulty learning long-term dependencies, i.e., when the number of steps in a sequence is large (~ ≥10). This happens due to the vanishing gradient problem during gradient backpropagation. To overcome this issue, LSTM (long short-term memory) cells have been devised, which are able to learn very long-term dependencies (even more than 1000 steps) with ease. Unlike vanilla RNN cells, LSTM cells have 4 separate neural layers as shown in Figure 15.7. Moreover, in a LSTM cell, the internal state is stored in two separate
vectors, h(t) or hidden state and c(t) or cell state. Both these states are passed from one LSTM
cell to the next during sequence processing. h(t) can be thought of as the short-term
state/memory and c(t) as the long-term state/memory and hence the name LSTM.
The vector outputs of the FC layers interact with each other and with the long-term state via three 'gates' where element-wise multiplications occur. These gates control what information goes into the long-term and short-term states at any sequence processing step. A quick description of each gate's purpose follows:
• Forget gate: determines which parts of the long-term state, c(t), are retained and which are erased
• Input gate: determines what parts of new information (obtained from processing of x(t)
and h(t-1)) are stored in long-term state
• Output gate: determines what parts of long-term state are passed on as short-term state
Figure 15.7: Architecture of a LSTM cell. Three FC layers use sigmoid activation functions and one FC layer uses the tanh activation function. Each of these neural layers has its own parameters $W_x$, $W_h$, and $b$
This flexibility in being able to manipulate what information is passed down the chain during sequence processing is what makes LSTM networks so successful at capturing long-term patterns in sequential datasets. Consequently, the LSTM network is the default RNN architecture employed nowadays. In Chapter 20, we will employ an LSTM model to predict the remaining useful life of gas turbines. Let's next learn how to use supervised ANN models for process fault detection.
There is another popular variant of the RNN cell, called the GRU cell. As shown in the illustration below, the GRU cell is simpler than the LSTM cell; it has 3 neural layers and its internal state is represented using a single vector, h(t). For several common tasks, GRU cell-based RNNs seem to provide performance similar to LSTM cell-based RNNs and therefore, they are slowly gaining more popularity.
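In Keras, swapping cell types amounts to swapping one layer; the sketch below assumes 10-step sequences of 3 variables (illustrative sizes).

# a sketch of an LSTM-based network; replace LSTM with GRU for a GRU-based RNN
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(16, input_shape=(10, 3)))  # 16 LSTM units; returns only the last step output
model.add(Dense(1))
model.compile(loss='mse', optimizer='Adam')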
Consider Figure 15.8 below, which illustrates the ANN-based external analysis100 strategy for process monitoring. The idea is simple: predict the response variables using the predictor variables via an ANN, compute the residuals, and then monitor the residuals using classical statistical techniques.
[Figure 15.8: External analysis strategy — predictor variables are fed to an ANN that generates predicted response variables; the residuals between measured and predicted responses are then monitored via classical statistical techniques to declare fault / no fault]
100
Yamamoto et al., Applications of statistical process monitoring with external analysis to an industrial monomer plant. IFAC Advanced Control of Chemical Processes, 2003.
Let's apply this strategy to detect faults in a debutanizer column from a petroleum refinery. Debutanizer columns are standard units in petroleum refineries and are used to convert raw naphtha feed into LPG (as the top product) and gasoline (as the bottom product). The butane (C4) content in the gasoline product is desired to be kept low and is monitored regularly via gas chromatography. To add another layer of supervision, a model is desired that can provide 'soft' measurements of the C4 content so that any deviation of the actual C4 content from the expected value can be used to raise an alert to the plant operators. For this purpose, we will build a fault detection model using FFNN-based external analysis. The dataset101 is provided as supplementary material at https://link.springer.com/book/10.1007/978-1-84628-480-9.
The base dataset contains 2394 samples of (normalized) input-output process values. An artificial sensor drift is introduced in the last 200 samples to simulate faulty sensor conditions. Let's see if we can detect this fault as early as possible. We start with loading the data.
101
Fortuna et. al., Soft sensors for monitoring and control of industrial processes, Springer, 2007
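The model definition itself is not shown here; a minimal sketch of what it may look like follows, with the layer sizes as illustrative assumptions.

# a minimal sketch of the FFNN definition (layer sizes are illustrative assumptions)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

n_inputs = X_est.shape[1]  # number of predictor variables
model = Sequential()
model.add(Dense(32, activation='relu', input_shape=(n_inputs,)))
model.add(Dense(16, activation='relu'))
model.add(Dense(1))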
model.compile(loss='mse', optimizer=Adam(learning_rate=0.005))
es = EarlyStopping(monitor='val_loss', patience=50)
history = model.fit(X_est, y_est, epochs=2000, batch_size=64, validation_data=(X_val, y_val), callbacks=[es])
The plot102 below shows the evolution of the mean squared prediction error over the fitting and validation datasets. As can be expected, the MSE is higher for the validation data, and the flattening of the curves indicates model fitting convergence.
102
The online code shows how to generate such validation curves
# predict C4 content
y_test_pred = model.predict(X_test)
y_val_pred = model.predict(X_val)
y_est_pred = model.predict(X_est)
y_train_pred = model.predict(X_train)
The prediction accuracy is reasonably good and therefore, we can proceed with fault detection model development using the obtained ANN model. Let's generate the monitoring chart for the training data.
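The chart construction may proceed along the following lines; the use of squared residuals with a percentile-based control limit is a sketch of the idea (the online code may differ in details).

# a sketch of residual-based monitoring (squared residuals with a percentile control limit)
import numpy as np

residuals_train = y_train.flatten() - y_train_pred.flatten()
stat_train = residuals_train**2                 # monitoring statistic
CL = np.percentile(stat_train, 99)              # control limit from training residuals

residuals_test = y_test.flatten() - y_test_pred.flatten()
stat_test = residuals_test**2
print('Number of flagged samples: ', np.sum(stat_test > CL))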
The control chart for the test data shows prompt detection of the fault, with the monitoring statistic mostly lying below the alert threshold during the first (fault-free) hundred samples.
If historical faulty samples are available, then one can use FFNN or RNN models to directly
predict the fault classes of test samples. The figure below shows a sample architecture that
can help achieve this.
Softmax is an exponential function that generates normalized activations so that they sum up
to 1. In Figure 15.10, activation aj (𝑗 ∈ [1,2,3]) is generated as follows
$$a_j = g_{softmax}(z_j) = \frac{e^{z_j}}{\sum_{k=1}^{3} e^{z_k}}$$
In the ANN world, the pre-activations ($z_j$) that are fed as inputs to the softmax function are also called logits. The softmax activations lie between 0 and 1, and hence, they are interpreted as class-membership probabilities. The predicted class label is taken as the class with the maximum probability (or activation).

The predicted class probabilities (and not the predicted class label) are directly used during model training.
Loss functions
For binary classification problems, binary cross-entropy is the default loss function. Let y (taking value 0 or 1) be the true label for a data sample and p (or $\hat{y}$) be the predicted probability (of y = 1) obtained from the sigmoid output layer. The cross-entropy loss is given by

$$\text{Cross-entropy loss} = -y\log(p) - (1-y)\log(1-p) = \begin{cases} -\log(1-p), & y = 0 \\ -\log(p), & y = 1 \end{cases}$$
The above expression is generalized as follows for multiclass classification, where the overall loss is the sum of separate losses for each class label

$$\text{Loss} = -\sum_{c=1}^{\#\,classes} y_c \log(p_c)$$

where $y_c$ and $p_c$ are the binary indicator and predicted probability of a sample belonging to class c, respectively. Note that for multiclass classification, the target variable will be in one-hot encoded form.
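Putting the pieces together, a 3-class fault classification network like that of Figure 15.10 may be set up as sketched below; the input dimension and layer sizes are illustrative assumptions.

# a sketch of a 3-class fault classification network with a softmax output layer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(16, activation='relu', input_shape=(4,)))  # 4 input variables (illustrative)
model.add(Dense(3, activation='softmax'))                  # 3 class-membership probabilities
model.compile(loss='categorical_crossentropy', optimizer='Adam')  # targets in one-hot form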
Summary
In this chapter, we familiarized ourselves with the basic ANN architectures that are employed to handle static and dynamic processes. We looked at ways to deploy ANN models for process fault detection; specifically, we worked through an implementation of a FFNN-based process monitoring solution wherein the residuals between actual measurements and model predictions are used to ascertain the presence of process faults. In the following chapter, we will see how process monitoring solutions can be built using unsupervised neural network models.
Chapter 16
Fault Detection & Diagnosis via Unsupervised
Artificial Neural Networks Modeling
In the previous chapter, we looked at supervised fitting of artificial neural networks where either fault labels were available for historical samples or the process variables were divided into predictor and response variable sets. However, you are very likely to encounter situations where you only have NOC samples in your training dataset without any predictor/response division. In Part 3 of this book, we studied a powerful technique suitable for such datasets, called PCA; PCA, however, is limited to linear processes. Nonetheless, the underlying mechanism of extracting the most representative features of the training dataset and compressing it into a feature space with reduced dimensionality need not be limited to linear systems. ANNs excel at handling nonlinear systems and extracting hidden patterns in high-dimensional datasets. Unsurprisingly, clever neural network-based architectures have been devised to enable unsupervised fitting of nonlinear datasets. Two popular models in this category are autoencoders (AEs) and self-organizing maps (SOMs).
Autoencoders are ANN-based counterparts of PCA for nonlinear processes. Here, a low-dimensional latent feature space is derived via nonlinear transformation and, just like we did for PCA, the systematic variations in the feature space and the reconstruction errors are handled separately to provide the monitoring statistics. Autoencoders are very popular for building FDD solutions for nonlinear processes. They are also commonly used to provide intermediate low-dimensional features which are then used for subsequent modeling (clustering, fault classification, etc.). SOM is another variant of neural network-based architecture that projects a high-dimensional dataset onto a 2D grid (yes, you read that right!). Here, latent variables are not derived; rather, the focus is on ensuring that the topology of the projected data is similar to that in the original measurement space. This feature renders SOMs very useful for data visualization, clustering, and fault detection applications.
We will undertake an in-depth study of both these powerful techniques in this chapter. Specifically, the following topics are covered
An autoencoder (AE) in its basic form is a 3-layered ANN consisting of an input layer, a hidden
layer, and an output layer as shown in Figure 16.1. An AE takes an input 𝑥 ∈ ℝ𝑛 and predicts
a reconstructed 𝑥ො ∈ ℝ𝑛 as an output. To prevent the network from trivially copying 𝑥 to 𝑥ො, the
hidden layer is constrained to be much smaller than n (the number of neurons in the hidden
layer, say m, gives the dimension of the latent/feature space). This forces the network to
capture only the systematic variations in input data and learn only the most representative
features as the latent variables. The nonlinear activation function of the neurons in the hidden
layer enables the latent variables to be nonlinearly related to the input variables. During model
fitting, the gap between 𝑥 and 𝑥ො (termed reconstruction error) is minimized to find network
parameters. The basic AE network can be made deeper by adding more hidden layers
resulting in deep (or stacked) autoencoders.
[Figure 16.1: A basic autoencoder with encoder parameters (W1, b1) and decoder parameters (W2, b2), latent variables in the bottleneck layer, and a stacked AE with multiple hidden layers]
The symmetrical, sandwich-like nature of the deep AE architecture should be apparent, wherein the sizes of the layers first decrease and then increase. Care must be taken, though, to not use too many hidden layers; otherwise, the network will overfit and may simply learn the identity mapping from $x$ to $\hat{x}$! Moreover, in the previous figure, you will notice that the AE architecture is divided into an encoder part and a decoder part. An encoder projects (or codifies) an input sample x to a lower-dimensional feature h. The decoder maps the feature vector back to the input space. The encoder-decoder form makes the AE architecture very flexible. Once an AE has been trained, one can use the encoder as a standalone network to obtain the latent variables. Moreover, you are not limited to using only FFNNs in the encoders and decoders. RNNs and CNNs are also frequently employed. An RNN-based AE is used as a nonlinear counterpart of dynamic PCA.
Vanilla AE vs Denoising AE
The form of autoencoder we saw in Figure 16.1 is the conventional or vanilla form, wherein the network is forced to find patterns in data by constraining the size of the coding/latent variable (m) to be less than the size of the input variable (n). This is also called an undercomplete autoencoder. An alternative way of forcing an autoencoder to learn only the systematic variation in data is by corrupting the input data with synthetic noise and then training the network to reconstruct the uncorrupted input. Such autoencoders are called denoising autoencoders, and a representative architecture is shown below. Note that the hidden layer is no longer explicitly constrained to have fewer neurons than the number of input variables; denoising AEs allow $m \geq n$.
To see autoencoders in action, let's apply one for the dimensionality reduction of a simulated dataset from a fluid catalytic cracking unit (FCCU103) shown in Figure 16.2. FCCUs are critical units in modern oil refineries and convert heavy hydrocarbons into lighter and valuable products such as LPG, gasoline, etc. As shown, the FCCU operation involves catalytic reaction, catalyst regeneration, and distillation. A total of 46 signals are made available as outputs (recorded every minute). Data has been provided in 7 CSV files; each file contains data from one simulation. One of the CSV files contains NOC data over a period of 7 days with varying feed flow. Five faults have been simulated, one at a time, in 5 separate simulations. We will work with the 7 days of NOC data.

We know that most of the variability in the data is driven by the variations in the feed flow and therefore, we will attempt to generate a 1D latent space (m=1).
103
Details on the system and datasets available are provided in detail at https://mlforpse.com/fccu-dataset/.
# read data, split into fitting and validation datasets, and scale
X_train = pd.read_csv('NOC_varyingFeedFlow_outputs.csv', header=None).values
X_train = X_train[:,1:] # first column contains timestamps
X_fit, X_val, _, _ = train_test_split(X_train, X_train, test_size=0.2, random_state=10)
scaler = StandardScaler()
X_fit_scaled, X_val_scaled = scaler.fit_transform(X_fit), scaler.transform(X_val)
X_train_scaled = scaler.transform(X_train)
You may notice that the way we define our ANN model here is slightly different from that in the previous chapters. This is done to enable us to use the fitted encoder separately later on, as sketched below.
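A minimal sketch of such a definition using the Keras functional API is given below; the single tanh-activated latent neuron follows the m=1 choice above, while the other details are illustrative assumptions.

# a minimal sketch of an autoencoder defined so that the encoder can be reused separately
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense

n_inputs = X_fit_scaled.shape[1]
input_layer = Input(shape=(n_inputs,))
h = Dense(1, activation='tanh')(input_layer)  # 1D latent space (m=1)
output_layer = Dense(n_inputs)(h)             # linear reconstruction layer

autoencoder = Model(inputs=input_layer, outputs=output_layer)
encoder = Model(inputs=input_layer, outputs=h)  # standalone encoder
autoencoder.compile(loss='mse', optimizer='adam')

With the encoder and autoencoder defined, let's now fit our autoencoder.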
# fit model
from tensorflow.keras.callbacks import EarlyStopping
es = EarlyStopping(monitor='val_loss', patience=10)
history = autoencoder.fit(X_fit_scaled, X_fit_scaled, epochs=300, batch_size=256, validation_data=(X_val_scaled, X_val_scaled), callbacks=[es])
The validation curves suggest that the model has converged. Let’s check if a 1D latent space
is good enough to be able to reconstruct the original data.
# reconstruct the data and compare one variable
X_train_pred = scaler.inverse_transform(autoencoder.predict(X_train_scaled))
var = 7
plt.figure(), plt.plot(X_train[:,var], 'seagreen', linewidth=1), plt.plot(X_train_pred[:,var], 'red')
plt.xlabel('time (mins)'), plt.ylabel('Furnace firebox temperature (T3)')
The reconstructed data follows the systematic variations in the original data pretty well. A look into the latent signal can help explain these results. The plot below shows that the latent variable is primarily an encoding of the feed flow. The encoded feed flow is then used by the decoder to reconstruct the rest of the variables.
# predict latents
h_train = encoder.predict(X_train_scaled)
In the above example, the dimension of the feature space was known beforehand. However, in general, it becomes a hyperparameter that needs to be optimized along with other hyperparameters such as the number of hidden layers, the learning rate, the batch size, etc. A generic guideline is that the compression of neurons between successive layers (in the encoder) should not be very steep. A compression ratio of 0.5 is often a good starting value.
Fault detection and diagnosis via AE proceeds in the same way as for PCA. The reconstruction error and latent variables are computed for each training sample, and monitoring metrics are generated. Let's see how the computations are performed. We will re-utilize the FCCU dataset. We will build an autoencoder model using the 7 days of NOC data
and the heat exchanger fouling simulation data (UAf_decrease_outputs.csv) is used as a test
dataset. Let’s take a quick look at the heat exchanger fouling fault. Reduced heat transfer to
the feed leads the controller TC1 to open valve V1 more to increase the fuel flow so as to
maintain T2 at the setpoint. Moreover, less heat going into preheating the feed implies that
the flue gas temperature (T3) in the furnace increases. The diagram below summarizes the
scenario. Variables F5 and T3 have been shown here and drifts in their values are evident;
however, the ‘faulty values’ are not very far away from the normal values obtained under fault-
free operations with varying feed flow.
Our objective is to use the AE model to flag the faulty samples while using only NOC samples for model training.
$$Q = e^T e = (x - \hat{x})^T (x - \hat{x})$$

$$T^2 = h^T \Lambda^{-1} h, \qquad \Lambda: \text{covariance matrix of the latent variables}$$
Let’s generate the monitoring charts for the training samples. A stacked AE model is used for
building the fault detection solution.
# fit model
from tensorflow.keras.callbacks import EarlyStopping
es = EarlyStopping(monitor='val_loss', patience=10)
history = autoencoder.fit(X_fit_scaled, X_fit_scaled, epochs=300, batch_size=256, validation_data=(X_val_scaled, X_val_scaled), callbacks=[es])
##############################################
# Monitoring statistics for training samples
##############################################
h_train = encoder.predict(X_train_scaled)
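The statistic computations may proceed along the following lines (a sketch following the PCA-style recipe above; the 99th-percentile control limits are an assumption for illustration).

# a sketch of the Q and T2 computations for the training samples
import numpy as np

X_train_pred = autoencoder.predict(X_train_scaled)
error_train = X_train_scaled - X_train_pred
Q_train = np.sum(error_train**2, axis=1)               # reconstruction error statistic

Lambda = np.atleast_2d(np.cov(h_train, rowvar=False))  # covariance of latent variables
Lambda_inv = np.linalg.inv(Lambda)
T2_train = np.array([h @ Lambda_inv @ h for h in h_train])  # latent-space statistic

Q_CL, T2_CL = np.percentile(Q_train, 99), np.percentile(T2_train, 99)  # control limits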
The above monitoring charts indicate good performance of the model: no false alerts before the onset of the fault, and the fault is flagged within 2 hours after its onset. You are encouraged to build monitoring charts using the AE model from Section 16.1 that had a 1D latent variable; you will notice unsatisfactory model performance due to a large delay in fault detection.
# Q contribution
sample = 250
error_test_sample = error_test[sample-1,]
Q_contri = error_test_sample*error_test_sample # vector of contributions
The contribution plot has correctly flagged the furnace firebox temperature as the faulty variable most responsible for the deviation of the test sample from the NOC behavior.

104
Due to the nonlinear transformation involved in generating AE-based latent variables, contribution analysis for the T2 statistic is not as convenient as it was for PCA.
Self-organizing maps (SOMs)105 are unsupervised neural networks that aim to map a high-dimensional dataset onto a 2D grid of neurons as shown in Figure 16.3. The mapping is done in such a way that the topology of the data is preserved, i.e., data samples close to each other in the original input space are mapped to nearby neurons on the SOM grid. Accordingly, SOMs render themselves very useful for visualization and clustering of complex high-dimensional datasets. SOMs are so named because no supervision/guidance is needed to determine the mapping; this makes the SOM an excellent EDA tool. Additionally, like autoencoders, no restriction is imposed on the input dataset regarding data distribution, linearity, independence among variables, etc.
The neuron that a data sample gets mapped to is called the sample’s BMU (best matching
unit). During training, multiple data samples may get assigned the same BMU. In a way, the
BMUs discretize the input space during model training into several local sub-regions and the
samples lying in a sub-region are mapped to the same BMU. For a test data sample, if there
105
SOM is also called the Kohonen map in honor of Prof. Teuvo Kohonen, who introduced SOM.
is an unusually large distance between the sample and the sub-region represented by the
BMU, then an anomaly alert may be raised. Seems interesting, right? To understand how
exactly this is achieved, let’s delve into the mathematical details of SOM training.
Mathematical background
A SOM is not fitted in the same manner as a traditional ANN, i.e., the back-propagation algorithm is not employed. To understand the model fitting procedure, first consider the end result of fitting. Let the fitting dataset consist of N samples, where each sample is an n-dimensional vector. After model fitting, each SOM node (or neuron) is assigned an n-dimensional reference vector (or weight vector), say $m_j \in \mathbb{R}^n$ for the jth node, as shown in the illustration below. In a way, the jth node represents the local region around the vector $m_j$ in the input space. Once the reference vectors have been generated, a sample's BMU can be found readily. To see how this is done, let's go through each step of SOM model fitting.
[Illustration: reference vectors $m_1, m_2, m_3 \in \mathbb{R}^n$ link neurons on the 2D SOM grid to local regions of the high-dimensional measurement space]
A sample $x_i$ is randomly selected from the fitting dataset. The neuron whose reference vector has the smallest (Euclidean106) distance from $x_i$ is defined as $x_i$'s BMU, i.e.,

$$b_i = \underset{j}{\arg\min}\, \|x_i - m_j\|$$
106
Other metrics of distance can also be used
➢ The reference vectors are then updated so that the BMU and its grid neighbors move closer to $x_i$

$$m_j(t+1) = m_j(t) + \alpha(t)\, h_{b_i j}(t)\,[x_i - m_j(t)]$$

The term $h_{b_i j}$ is the neighborhood function centered on the BMU $b_i$ and ensures that only the BMU $b_i$ and its neighboring neurons on the SOM grid are moved closer to the sample $x_i$. A commonly used Gaussian form is

$$h_{b_i j}(t) = \exp\!\left(-\frac{\|r_{b_i} - r_j\|^2}{2\sigma^2(t)}\right)$$

where $r_j$ is the position of the jth neuron on the grid and $\sigma(t)$ is the neighborhood width. $\alpha(t)$ is the learning rate, which controls the rate at which the weight vectors get updated. $\alpha(t)$ and $\sigma(t)$ are assigned large values initially and are decreased monotonically with iterations.
➢ Increment iteration (𝑡 → 𝑡 + 1) and go to step 2 if training has not converged and maximum
value of t has not been reached.
The average quantization error over all the fitting samples can provide a measure of goodness of fit, but it can be misleading, as illustrated below.
The quantization error is 0 here, but it's obvious that the topology has not been preserved. Therefore, different metrics are used to judge the quality of the fitted map. A couple of commonly used metrics are described below.
Quantization error
As described before, the overall quantization error ($E_Q$) for the fitting samples is given as

$$E_Q = \frac{\sum_{i=1}^{N} \|x_i - m_{b_i}\|^2}{N} \qquad \text{(Eq. 1)}$$
Although $E_Q$ does not give a good measure of topological preservation, it can indicate how well the map fits the data. $E_Q$ can also be used to infer overfitting and underfitting. Adding more neurons leads to lower $E_Q$, and therefore a very low $E_Q$ may imply an overfitted model. High values of $E_Q$ can indicate an insufficient number of neurons or unconverged network learning.
Topographic error
This metric measures how well the original shape of the data has been preserved on the SOM grid. It is computed by simply counting the fraction of samples for which the first and second BMUs are not adjacent neurons on the SOM grid

$$E_T = \frac{1}{N}\sum_{i=1}^{N} u(x_i)$$

where $u(x_i) = 1$ if the first and second BMUs of $x_i$ are not adjacent on the grid, and 0 otherwise.
To see SOM in action, let's apply it to our semiconductor metal-etch dataset. We will again work with the first three PC scores of the original unfolded data, as done in Chapters 11 and 14, for better visualization of SOM's input dataset. Let's see if SOM can indicate the presence of 3 clusters. We will employ the MiniSom107 package for building our SOM model.
N = score_train.shape[0]
N_neurons = 5*np.sqrt(N)
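A minimal fitting sketch with MiniSom follows; the square-grid choice and hyperparameter values are illustrative assumptions.

# a minimal sketch of SOM fitting with MiniSom (grid size from the 5*sqrt(N) heuristic above)
from minisom import MiniSom

grid_size = int(np.ceil(np.sqrt(N_neurons)))  # approximately square grid
som = MiniSom(grid_size, grid_size, input_len=score_train.shape[1], sigma=1.5, learning_rate=0.5)
som.random_weights_init(score_train)
som.train(score_train, num_iteration=10000)

print('Quantization error: ', som.quantization_error(score_train))
print('Topographic error: ', som.topographic_error(score_train))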
The plots of the evolution of the quantization and topographic errors during training are shown below. It is apparent that both errors show a sharp decrease in the first few iterations. Overall, 10000 seems to be a good value for the number of iterations.
107
https://github.com/JustGlowing/minisom
The most common approach to visualize a SOM and look for clusters is to generate a U-matrix108 (unified distance matrix) as shown below for our fitted SOM model. Here, each neuron is colored based on the average distance between the weight vectors of the neuron and its neighbors. A lighter color implies that the neuron and its neighbors are mapped to the same region of the input space. From the shown U-matrix, three clusters are clearly evident.
# plot U-matrix
plt.figure(figsize=(9, 9))
plt.pcolor(som.distance_map().T, cmap='bone_r') # plotting the distance map as background
plt.colorbar()
108
Other common visualization tools include the component planes, frequency maps, class representation maps, etc.
In SOM-based process monitoring, the statistic that is monitored is the quantization error defined in Eq. 1. A large value of this metric indicates that the test sample lies away from the normal operation data and is faulty. Let's generate the control chart for the training samples of the metal-etch dataset.
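The per-sample quantization errors and the control limit may be computed as sketched below; the 95th-percentile limit is an assumption for illustration.

# a sketch of per-sample quantization errors using the fitted SOM
weights = som.get_weights()  # array of shape (grid_size, grid_size, input_len)
QEs_train = np.array([np.sum((x - weights[som.winner(x)])**2) for x in score_train])
QE_CL = np.percentile(QEs_train, 95)  # control limit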
plt.plot(QEs_train)
plt.plot([1,len(QEs_train)],[QE_CL,QE_CL], color='red')
plt.xlabel('sample #'), plt.ylabel('QEs for training samples')

# QEs for the test samples (QEs_test) are computed analogously
print('Number of flagged faults (using control chart): ', np.sum(QEs_test > QE_CL))
>>> Number of flagged faults (using control chart): 19
The quantization error for the ith test sample is given by

$$E_{Q,i} = \|x_i - m_{b_i}\|^2 = \sum_{j=1}^{m} (x_{ij} - m_{b_i j})^2$$
where 𝑥𝑖𝑗 is the value of the jth variable for the ith test sample and 𝑚𝑏𝑖 𝑗 is the value of the jth
variable in the weight vector of the corresponding BMU.
Summary
In this chapter, we looked at two popular unsupervised neural network models, viz., autoencoders and self-organizing maps. We studied how to employ these models for building process monitoring solutions. Case studies using industrial-scale systems (a fluid catalytic cracking unit from oil refineries and a metal-etch semiconductor process) were shown to illustrate the step-by-step procedures.
Part 6
Vibration-based Condition Monitoring
Chapter 17
Vibration-based Condition Monitoring: Signal
Processing and Feature Extraction
Rotating machinery, which includes motors, compressors, pumps, turbines, fans, etc., forms the backbone of industrial operations. Unsurprisingly, a large fraction of operation downtime can be attributed to the failures of these machines. Over the last decade, the process industry has adopted predictive maintenance as the means to proactively handle these failures, and the technique that has largely become synonymous with predictive maintenance is vibration-based condition monitoring (VCM). All rotating machines exhibit vibratory motions, and different kinds of faults produce characteristic vibratory signatures. This makes VCM a reliable and effective tool for health management of rotating equipment. Considering the importance of VCM in the process industry, its different aspects are covered in this part of the book.
Vibrations are usually measured at very high frequency and the large volume of data makes
analysis of raw data difficult. Correspondingly, processing vibration data and extracting
meaningful features that can provide early signs of failures become very crucial. Traditionally,
these features have been analyzed by vibration experts. However, in recent times, several
successful applications of ML-based VCM have been reported. All the techniques that we
have studied in the previous parts of the book can be used for VCM. While we will look at ML-
based VCM in the next chapter, this chapter sets the foundations for VCM and covers vibration
data processing and feature extraction.
Over the years, VCM practitioners and researchers have fine-tuned the art of vibration
monitoring and have come up with several specialized and advanced techniques. Arguably,
it is easy for a beginner to feel ‘lost’ in the world of VCM. The current and the following
chapters will help provide some order to this seemingly chaotic world. Specifically, the
following topics are covered
• Basics of vibrations
• VCM workflow
• Spectral analysis of vibration signal
• Time domain, frequency domain, and time-frequency domain feature extraction
Vibrations are simply back-and-forth motions of machines around their position of rest. All rotating machines (motors, blowers, chillers, compressors, turbines, etc.) exhibit vibratory motion under normal and faulty conditions. Figure 17.1 shows a representative setup for vibration sensing of an industrial machine. The sensors (transducers) convert vibratory motion (of displacement, velocity, or acceleration) into analogue electrical signals which are digitized and stored. The figure below shows what the recorded signal looks like on a time axis for a machine with gradually degrading condition. The increasing vibration levels indicate underlying machine issues.
The components of rotating machines (rotors, bearings, gears) undergo different types of failures due to well-studied causes such as mechanical looseness, misalignment, cracks, etc. These faults produce characteristic vibration patterns. However, it is difficult to 'see' or extract
109
Romanssini et al., A Review on Vibration Monitoring Techniques for Predictive Maintenance of Rotating Machinery.
Eng, 2023. This article is an open access article distributed under the terms and conditions of the Creative Commons
Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
these specific signatures in the raw time domain vibration signal. Therefore, the standard approach is to convert the time domain signal into the frequency domain to extract features that provide early signs of failures and aid diagnosis of the underlying faults.
In general, while vibration levels indicate the severity of faults, the dominant
frequency components of the vibration signal can indicate the source of faults.
The sequence of steps that are undertaken to convert raw vibration signal into useful features
and to deploy ML models is presented in the next section. The focus of this chapter is to
understand how to extract these features that can eventually be used as inputs to ML models.
[Illustration: a vibration waveform plotted as amplitude vs time (s), with the top/crest of the waveform marked]
[Illustration: a 1 Hz sine wave sampled at a rate of 1.333 Hz]
In the above illustration, the sampled signal does not resemble the original signal! The Nyquist sampling theorem states that for a sampling frequency $f_s$, signal components will be captured correctly only up to the frequency $f_s/2$. Correspondingly, the frequency $f_s/2$ is known as the Nyquist frequency ($f_q$). For the above illustration, to be able to correctly capture the 1 Hz vibration, $f_s$ must be at least 2 Hz (2 samples per second), which, as shown below, works nicely.
[Illustration: the same 1 Hz sine wave sampled at a rate of 2 Hz]
The phenomenon of an original high frequency signal appearing as a low frequency signal
due to undersampling is called aliasing. This is undesirable and therefore, during vibration
signal acquisition, anti-aliasing filters are frequently employed to filter out components
with frequencies higher than fq.
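The aliasing effect above can be reproduced with a few lines of code; the signal and sampling rates follow the illustration (a sketch).

# a sketch reproducing the aliasing illustration: a 1 Hz sine sampled at 1.333 Hz and at 2 Hz
import numpy as np, matplotlib.pyplot as plt

t_fine = np.arange(0, 3, 0.001)  # dense grid approximating the continuous signal
for fs in [1.333, 2.0]:
    t_s = np.arange(0, 3, 1/fs)  # sampling instants
    plt.figure()
    plt.plot(t_fine, np.sin(2*np.pi*1*t_fine), alpha=0.5)  # original 1 Hz signal
    plt.plot(t_s, np.sin(2*np.pi*1*t_s), 'o-')             # sampled signal
    plt.xlabel('time (s)'), plt.title('sampling rate = {} Hz'.format(fs))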
We have so far talked about the different aspects of VCM in only bits and pieces. Figure 17.2
summarizes the commonly employed steps involved in VCM which are briefly described
below.
[Figure 17.2: Commonly employed steps in VCM — Data acquisition (time waveform) → Signal preprocessing (mean removal, trend removal, noise removal, filtering) → Signal processing (frequency domain representation, e.g., FFT, giving the spectrum; time-frequency domain representation, e.g., STFT, giving the spectrogram) → Feature extraction (and selection) (time domain, frequency domain, and time-frequency domain features) → Insights]
Data acquisition:
As a common practice, the vibration data is analyzed in batches of equal length acquired at
regular time intervals as shown in the illustration below. Here, at every minute, 1 second of
vibration data is acquired and sent for pre-processing and vibration monitoring. If the sampling
frequency is 1024 Hz, then each collected waveform contains 1024 data points.
Signal pre-processing:
Tasks involved in this step attempt to improve the signal-to-noise ratio (SNR) of the acquired
vibration signal to prepare it for the subsequent tasks. Unwanted frequency components of
the signal due to noise and measurement system errors are removed. During the early stages
of failure, the impact on the vibration signal is weak and can be difficult to detect in the
absence of adequate signal pre-processing. As an example, the illustration below shows the
benefits of trend removal.
[Illustration: after detrending the waveform, the characteristic frequencies become more prominent in the spectrum]
Signal processing:
A quick glimpse of the waveform can tell you if a machine is vibrating at abnormally high levels. However, waveforms are notoriously difficult to use for detecting failures at incipient stages of faults, when fault characteristics are weak. Moreover, increased amplitude levels in a vibration waveform do not tell much about the underlying source of failure. To derive more actionable information, spectrum analysis via fast Fourier transform (FFT) is commonly performed to find the dominant frequency components of the signal.
Feature extraction:
One can directly compare the current waveform, spectrum, and spectrogram with those from NOC periods to detect abnormalities. Alternatively, one can pass the entire waveform and/or spectrum as input to a ML model. However, a more convenient approach is to extract meaningful features that summarize the vibration signals adequately and provide crucial clues regarding the operational state of a machine. Usage of features increases the chances of obtaining effective ML models. The commonly used features include, amongst others, the RMS and kurtosis of the waveform, the frequency center and peak frequency of the spectrum, and the mean frequency from the spectrogram. Once the features have been extracted, they can be used for fault detection and classification. While we will deal with this last step of the VCM workflow in the next chapter, let's get better acquainted with the vibration signal processing and feature extraction tasks.
As alluded to in the previous section, analyzing the time-domain waveform manually can be very inconvenient and may not provide sufficient leading indicators of incipient faults. More useful representations of vibration data can be obtained in the frequency domain and the time-frequency domain.
Consider the following time domain and frequency domain vibration data of a machine.
To an untrained eye, the waveform does not provide much useful information about 'how' the machine is vibrating. However, in the frequency domain, one can clearly see two distinct frequency components, potentially arising from different rotating components of the machine. Comparison of peak amplitudes and peak frequencies with NOC values can indicate the severity and type of fault, if any exists. This extremely useful plot of frequency versus amplitude is called a spectrum and is commonly obtained via FFT.
Fourier transformation
Frequency domain analysis entails decomposition of a waveform into a sum of several sinusoids of different frequencies. This decomposition is called Fourier transformation. To see how it is calculated, let us define a few terms first. Let the sampling frequency be $f_s$, the waveform contain N data points, and the frequency resolution, $df$, be defined as $f_s/N$. The samples are collected at the instants $0, dt, 2dt, \ldots, (N-1)dt$, where

$$dt = \frac{1}{f_s}$$

The discrete Fourier transform of the waveform is then given by
$$Y(k\,df) = \frac{2}{N}\sum_{i=0}^{N-1} y(t_i)\, e^{-j\frac{2\pi i k}{N}}, \qquad k = 0, 1, 2, \ldots, \frac{N}{2}-1$$
$Y(f_k)$ is a complex number, and the amplitude corresponding to $f_k$ that is shown on a spectrum plot is the magnitude of $Y(f_k)$, i.e., $|Y(f_k)|$. A spectrum can be generated very easily in Python. Let us do it for our above waveform.
# simulate signal (y below is an illustrative stand-in with two frequency components)
import numpy as np, matplotlib.pyplot as plt
fs = 1000 # 1000 Hz
dt = 1.0/fs
t = np.arange(0,0.5,dt) # sampling instants
y = np.sin(2*np.pi*50*t) + 0.8*np.sin(2*np.pi*120*t) # 50 Hz and 120 Hz components (assumed values)

# generate spectrum
from scipy.fft import rfft, rfftfreq
N = len(t)
Y_spectrum = rfft(y)
freq_spectrum = rfftfreq(N, dt)
plt.figure(), plt.plot(freq_spectrum, 2*np.abs(Y_spectrum)/N) # amplitude spectrum
plt.xlabel('frequency (Hz)'), plt.ylabel('amplitude')
[Illustration: an incomplete cycle at the end of the acquired waveform causes an undesired spread in frequencies in the spectrum; the accompanying panel shows a spectrum with reduced leakage]
It is obvious that the two waveforms are different: two frequency components are present throughout in scenario 1, while in scenario 2, the two components are present at different points in time. However, the spectra for both scenarios are similar (ignoring the little ripples in the 2nd spectrum and the amplitude differences of the peak frequencies). Understandably, the spectrum for scenario 2 can lead to wrong inferences. The solution for handling scenario 2 is to analyze the waveform in the time and frequency domains together, and the most popular tool for doing this is the short-time Fourier transform (STFT), whose procedure is illustrated in Figure 17.3110.
Here, you can see that the waveform is divided into overlapping segments, and Fourier transformation is performed on each segment, resulting in a 2D matrix, $S_y(f, \tau)$. The STFT is visualized using spectrogram and waterfall plots. A waterfall plot simply plots each FT spectrum one after the other, giving the impression of a waterfall in a 3D plot. A spectrogram is a 2D plot on time-frequency axes with amplitude variations shown using colors. These diagrams help to see how the constituent frequencies change within a waveform over time. The code below shows how to obtain the time-frequency decomposition of our waveform in scenario 2.
110
The figure is adapted from the open-access article by Kim et al. titled 'Diagnostics 101: A Tutorial for Fault Diagnostics of Rolling Element Bearing Using Envelope Analysis in MATLAB', published in Applied Sciences (2020). The article is distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
[Figure 17.3: STFT procedure — the data sequence y(t) of length N is divided into overlapping segments centered at times τ1, τ2, …, and an FFT is computed for each segment]
# simulate signal (scenario 2: two components present at different points in time; illustrative values)
import numpy as np, matplotlib.pyplot as plt
fs = 1000 # 1000 Hz
dt = 1.0/fs
t = np.arange(0,1,dt)
y = np.where(t < 0.5, np.sin(2*np.pi*50*t), 0.8*np.sin(2*np.pi*120*t)) # 50 Hz first half, 120 Hz second half

# generate spectrogram
from scipy import signal
f, t_seg, Sxx = signal.stft(y, fs)
plt.figure(), plt.pcolormesh(t_seg, f, np.abs(Sxx)), plt.xlabel('time (s)'), plt.ylabel('frequency (Hz)')
By the time you reach this step of your VCM workflow, you will have a waveform, and its spectrum and spectrogram, at hand from which you can extract meaningful features. To illustrate feature extraction, we will use data from a real wind turbine. The wind turbine dataset111 consists of vibration acceleration recorded from a wind turbine over a period of 50 days. Six seconds of vibration data was recorded each day. A total of 50 files have been provided, with each file containing data for one day. A bearing fault leads to increasing vibration levels, with failure occurring on the 50th day. Figure 17.4 shows the combined vibration signal for the 50 days. An increasing level of vibration is clearly evident. Let's understand how features can be extracted from one day's waveform. We will reuse this dataset for building a metric called the health indicator using only time domain features in Part 7 of the book and therefore, we will look at the code for time domain feature generation.
When a fault occurs in a rotating machine, the vibration levels and the distribution of the waveform may change. Waveform features can reflect these changes. Table 1 below shows the commonly used time domain features. We assume that the signal is y(n) for $n = 1, 2, \cdots, N$, where N is the number of data points.
[111] Available at https://github.com/mathworks/WindTurbineHighSpeedBearingPrognosis-Data. Data has been shared by MathWorks under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license (https://creativecommons.org/licenses/by-nc-sa/4.0/). Permission was granted by the original author of the dataset, Eric Bechhoefer, to use the data in this book.
Features like RMS (root mean square), STD (standard deviation), peak, and P2P (peak to peak) show an increasing trend as a fault becomes more severe. RMS quantifies the energy content in the vibration signal and is a popular metric for vibration monitoring. RMS is also more stable and robust to noise compared to other metrics such as peak and P2P. The code below shows how to compute these features.
# import packages
import numpy as np, matplotlib.pyplot as plt, pandas as pd
import scipy.io

# read one day of vibration data (the variable key 'vibration' follows the dataset's
# .mat files; the file name below is a hypothetical placeholder)
matlab_data = scipy.io.loadmat('day1.mat', struct_as_record=False)
vib_data = matlab_data['vibration'][:,0]

plt.figure(figsize=(15,6))
plt.plot(vib_data, linewidth=0.2)
plt.xlabel('sample #', fontsize=25), plt.ylabel('Acceleration (g)', fontsize=25)
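
The features in Table 1 can be computed with basic NumPy/SciPy operations. Below is a minimal sketch; the feature definitions follow standard conventions (e.g., Pearson kurtosis, margin factor as peak over the squared mean of the root absolute amplitude), so the exact printed values may differ slightly from any particular reference implementation.

# compute time domain features from the waveform
from scipy.stats import kurtosis, skew

y = vib_data
absY = np.abs(y)
RMS = np.sqrt(np.mean(y**2))
STD = np.std(y)
peak = np.max(absY)
P2P = np.max(y) - np.min(y)
kurt = kurtosis(y, fisher=False) # Pearson definition (~3 for a healthy machine)
skewness = skew(y)
shapeFactor = RMS/np.mean(absY)
crestFactor = peak/RMS
impulseFactor = peak/np.mean(absY)
marginFactor = peak/np.mean(np.sqrt(absY))**2

print('Mean, Shape Factor, Crest Factor, Impulse Factor, Margin Factor: \
{:.3f}, {:.3f}, {:.3f}, {:.3f}, {:.3f}'.format(np.mean(absY), shapeFactor, crestFactor, impulseFactor, marginFactor))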
RMS (and peak, P2P) are not capable of detecting faults during the early stages. Kurtosis and skewness are preferred for incipient faults. Kurtosis is a good indicator of an impulsive/spiky signal. Skewness quantifies the asymmetry of the distribution of the vibration signal amplitude. It is known that for a normally functioning machine, kurtosis and skewness are around 3 and 0, respectively. For faulty machines, kurtosis becomes higher than 3 and skewness deviates significantly from 0. Note that kurtosis is not suitable for very severe faults, as its value decreases when the waveform becomes less impulsive.
>>> Mean, Shape Factor, Crest Factor, Impulse Factor, Margin Factor: 0.160, 1.296, 6.998, 9.073,
10.854
Although time domain features are easy to compute and interpret, a major shortcoming is that they do not provide clues regarding the underlying source of abnormal vibrations. Frequency domain features enjoy several advantages over time domain features. Abnormal frequencies show up with significant amplitudes in the vibration spectrum in the presence of faults. Frequency domain features are good indicators of both incipient and severe faults. Moreover, the presence of specific frequency components provides direct clues regarding which component of a rotating machine has failed.
When provided a vibration spectrum, one of the first things an expert may look at is the presence of harmonic frequencies corresponding to the rotational speed of the machine. For example, if a machine is rotating at 3600 RPM (rotations per minute), then the amplitudes at 60 Hz (called 1X), 120 Hz (called 2X or the second harmonic) [112], and higher harmonics are extracted as features. You will see in the next chapter how these frequency components are useful for FDD. Analogous to Table 1, Table 2 shows the common statistical features derived from the vibration spectrum. Here, Y(k) is the amplitude at the kth spectrum line and $f_k$ is the corresponding frequency for $k = 1, 2, \cdots, K$.
[112] Sub-harmonics (0.5X, 0.25X, …) are also used as features.
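
As an illustration, the harmonic amplitudes can be picked off an FFT spectrum as sketched below; the waveform y, sampling rate fs, and the machine speed used here are assumptions for this example.

# extract amplitudes at harmonics of the rotating speed (illustrative sketch)
rpm = 3600
f1X = rpm/60.0 # 1X frequency (60 Hz here)

amps = np.abs(np.fft.rfft(y))/len(y) # one-sided FFT amplitude spectrum
freqs = np.fft.rfftfreq(len(y), d=1.0/fs)

harmonic_features = []
for m in [1, 2, 3]: # 1X, 2X, 3X
    k = np.argmin(np.abs(freqs - m*f1X)) # spectrum line closest to the m-th harmonic
    harmonic_features.append(amps[k])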
Table 2: Commonly used frequency domain features [113] (reconstructed)

$F_1 = \frac{1}{K}\sum_{k=1}^{K} Y(k)$ (spectral mean)

$F_2 = \frac{\sum_{k=1}^{K}\left(Y(k) - F_1\right)^2}{K-1}$ (spectral variance)

$F_3 = \frac{\sum_{k=1}^{K}\left(Y(k) - F_1\right)^3}{K\left(\sqrt{F_2}\right)^3}$ (spectral skewness)

$F_4 = \frac{\sum_{k=1}^{K}\left(Y(k) - F_1\right)^4}{K F_2^{\,2}}$ (spectral kurtosis)

$F_5 = F_C = \frac{\sum_{k=1}^{K} f_k\, Y(k)}{\sum_{k=1}^{K} Y(k)}$ (frequency center)

$F_6 = \sqrt{\frac{\sum_{k=1}^{K}\left(f_k - F_5\right)^2 Y(k)}{K}}$ (standard deviation of frequencies)

$F_7 = RMSF = \sqrt{\frac{\sum_{k=1}^{K} f_k^{\,2}\, Y(k)}{\sum_{k=1}^{K} Y(k)}}$ (root mean square frequency)
[113] See https://github.com/Oybek90/Machine_Learning_from_scratch/blob/main/frequency-domain-feature-extraction-methods.ipynb and Yaguo Lei, Intelligent Fault Diagnosis and Remaining Useful Life Prediction of Rotating Machinery.
The first feature, the mean of the spectrum, is the average of the amplitudes of all the frequencies in the spectrum. As fault severity increases, the overall vibration energy goes up and so does F1. The frequency center (F5 or Fc) and the root mean square frequency (F7 or RMSF) are used to track the dominant frequencies in the spectrum. The peak frequencies (say, the top 10) and their respective amplitudes are also commonly used for fault detection.
A commonly extracted time-frequency domain feature is the mean peak frequency, which is simply the average of the peak frequencies at the different time instances in the spectrogram. It is computed as shown below, where $N_\tau$ is the number of segments used in the STFT.
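
A hedged rendering of this computation (the notation $f_{peak}(\tau_i)$, for the peak frequency of the FFT spectrum of the $i$th segment, is ours):

$$MPF = \frac{1}{N_\tau}\sum_{i=1}^{N_\tau} f_{peak}(\tau_i)$$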
Another common strategy is to extract frequency domain features from each FFT spectrum
of the waterfall.
With this, we have completed our study of the first phase of VCM, which is summarizing complex waveforms from rotating machines into a set of features. These features are used for fault detection, fault classification, and fault prognosis, as we will see in the upcoming chapters. Many more advanced techniques than those covered in this chapter have been devised for the analysis of vibration signals. These techniques include, amongst others, envelope analysis, wavelet transforms, empirical mode decomposition, and the Hilbert-Huang transform. Although we have only scratched the surface of the broad field of vibration signal processing, you now have the fundamentals in place to navigate this world confidently!
Summary
In this chapter, we acquainted ourselves with the basics of processing vibration signals as a precursor to building ML-based fault detection solutions. We studied the concept of the vibration spectrum and understood how to generate the spectrum. We looked in some detail at how to extract time domain, frequency domain, and time-frequency domain features. We will continue building upon this foundation and see how to utilize these features to detect faults in rotating machines.
Chapter 18
Vibration-based Condition Monitoring: Fault
Detection & Diagnosis
Vibration-based condition monitoring was already a widely adopted technique in the process industry long before the ML craze took over the manufacturing world. The International Organization for Standardization (ISO) has come up with alarm limits on vibration RMS for different classes of rotating machines. Additionally, VCM researchers have worked diligently to discover the characteristic signatures of failures in different components of a rotating machine. Correspondingly, several rules of thumb and heuristics have been devised to pinpoint root causes of faults using vibration features. However, these heuristics do not cover all possible fault scenarios, and a vibration expert is still required to conduct analysis and interpretation of vibration signal features. Fortunately, the advent of machine learning has made VCM more accessible to generic process data scientists.
Several different types of ML models have been reported in the VCM literature for fault classification, fault detection, and fault diagnosis. For example, fault detection applications have been built by using the whole spectrum (or waveform) as input to an autoencoder, or the spectrogram image as input to a CNN (convolutional neural network) model. ML models don the cap of a vibration expert to find the patterns in vibration signals, distinguish between NOC and abnormal vibrations, and discriminate between different fault conditions. In this chapter, we will look at one such implementation of ML-based VCM. Specifically, the following topics are covered:
• VCM workflow
• Classical approaches for VCM
• SVM-based fault classification of motors
Vibration signals contain indicators of machine faults. Previously, we saw the steps commonly taken to 'amplify' these indicators through judicious extraction of features. In this chapter, we will focus on how these features are used to make inferences regarding the health of rotating machinery. Figure 18.1 shows some of the approaches commonly employed. The classical approaches include, amongst others, simply looking for the presence of harmonics in the spectrum and comparing individual features against ISO-recommended thresholds. In recent times, ML-based VCM has gradually become more popular. Any of the ML techniques that we have seen in the previous parts of the book can be employed.
[Figure 18.1: VCM approaches and workflow — data acquisition yields a time waveform; signal preprocessing (mean removal, trend removal, noise removal, filtering) is followed by signal processing into a frequency domain representation (e.g., FFT spectrum) and a time-frequency domain representation (e.g., STFT spectrogram).]
The phenomenal success of deep learning in the areas of computer vision and natural language processing has encouraged VCM practitioners to bypass explicit feature extraction and directly use the vibration waveform/spectrum/spectrogram as input to deep learning models; for example, passing 2D spectrogram images as inputs to convolutional neural network models. Deep learning automatically learns the fault signatures in vibration data. The illustration below compares shallow learning- and deep learning-based VCM using time domain waveforms.
[Illustration: In deep learning-based VCM, the raw time waveform is fed directly to a deep model that outputs fault/no fault; in shallow learning-based VCM, features (Std, RMS, kurtosis, skewness, peak, peak2peak, …) are first extracted from the waveform and then fed to a classifier that outputs fault/no fault.]
Below we discuss a few (easy to implement and understand) classical approaches for VCM.

Control chart-based FDD
In this approach, some key features (say, RMS, kurtosis, etc.) are selected as the fault indicators to monitor. Each selected indicator is monitored using Shewhart control charts, i.e., a machine is considered healthy if each feature remains within the range $\mu \pm 3\sigma$, where the mean $\mu$ and standard deviation $\sigma$ are computed from NOC waveforms. A sketch of this check is shown below.
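
A minimal sketch, assuming NOC RMS values are available (simulated here for illustration):

# Shewhart control chart check on a vibration feature
import numpy as np

rng = np.random.default_rng(0)
RMS_NOC = rng.normal(loc=0.5, scale=0.05, size=100) # RMS of NOC waveforms (simulated)

mu, sigma = RMS_NOC.mean(), RMS_NOC.std()
UCL, LCL = mu + 3*sigma, mu - 3*sigma # upper/lower control limits

RMS_new = 0.75 # RMS of a newly acquired waveform
print('Machine healthy:', LCL <= RMS_new <= UCL)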
ISO Code-based FDD
ISO standards (e.g., ISO 10816) tabulate vibration RMS alarm zones for different classes of rotating machines; a machine's measured RMS is simply compared against the published limits for its machine class [114].
Harmonics-based FDD
As alluded to in Chapter 17, harmonics of a machine's rotating speed provide crucial clues regarding the specific causes of abnormal vibrations. A few examples of abnormal spectra of a machine rotating at 2040 RPM are shown along with fault inferences. More details on these heuristics-based inferences can be found in any classical book on VCM [115].
[114] https://www.vibsens.com/index.php/knowledge-base/iso10816-iso7919-charts/iso10816-charts
[115] Jyoti K. Sinha, Industrial Approaches in Vibration-based Condition Monitoring. CRC Press, 2020.
To demonstrate ML-based VCM, we will use a popular dataset called the CWRU bearing dataset [116]. The dataset includes vibrations collected from an electric motor under NOC and different faulty conditions, under varying loads, and at 48 kHz and 12 kHz. For the purpose of our implementation, we will consider data collected at 48 kHz from the drive end of the motor with 1 hp load. In total, vibrations are recorded under the following 10 different conditions:

The data from each condition is provided in a separate file (such as 110.mat) on the CWRU website [117]. Data from each file is divided into smaller segments of 2048 data points, and 230 segments are obtained from each of the 10 files corresponding to C1 to C10. In total, 2300 waveforms are available to us. For each segment, nine time domain features (peak, peak-to-peak, mean, standard deviation, RMS, skewness, kurtosis, crest factor, and shape factor) are extracted. This results in a feature matrix of size 2300 × 9. A column indicating the condition class is appended, and the final data matrix is provided in the file feature_time_48k_2048_load_1.csv [118]. Our objective is to develop an ML model that can predict the operating condition of the motor using the time domain features. For this, we will build a kernel SVM-based classifier as shown below.
# import packages
import numpy as np, pandas as pd
import matplotlib.pyplot as plt, seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
[116] https://engineering.case.edu/bearingdatacenter
[117] https://engineering.case.edu/bearingdatacenter/48k-drive-end-bearing-fault-data
[118] The file and the shown SVM implementation have been adapted from the work (Copyright (c) 2022 Biswajit Sahoo) of Biswajit Sahoo (https://github.com/biswajitsahoo1111/cbm_codes_open), which is shared under an MIT License (https://github.com/biswajitsahoo1111/cbm_codes_open/blob/master/LICENSE.md).
# read features and split into train/test sets (the split ratio is our assumption)
data = pd.read_csv('feature_time_48k_2048_load_1.csv')
train_data, test_data = train_test_split(data, test_size=0.3, random_state=1, stratify=data.iloc[:,-1])

# scale data
scaler = StandardScaler()
train_data_scaled = scaler.fit_transform(train_data.iloc[:,:-1])
test_data_scaled = scaler.transform(test_data.iloc[:,:-1])
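
A minimal sketch of the classifier training, assuming an RBF kernel (the hyperparameter values are our assumptions):

# fit kernel SVM classifier and generate confusion matrices
from sklearn.metrics import confusion_matrix
model = SVC(kernel='rbf', C=10, gamma='scale') # hyperparameter values assumed
model.fit(train_data_scaled, train_data.iloc[:,-1])

CM_train = confusion_matrix(train_data.iloc[:,-1], model.predict(train_data_scaled))
CM_test = confusion_matrix(test_data.iloc[:,-1], model.predict(test_data_scaled))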
fault_type = ['C1','C2','C3','C4','C5','C6','C7','C8','C9','C10']

# training confusion matrix
plt.figure()
sns.heatmap(CM_train, annot=True, fmt="d", xticklabels=fault_type, yticklabels=fault_type, cmap="Blues", cbar=False)

# test confusion matrix
plt.figure()
sns.heatmap(CM_test, annot=True, fmt="d", xticklabels=fault_type, yticklabels=fault_type, cmap="Blues", cbar=False)
The confusion matrices indicate good performance by the SVM classifier. With this, we end our quick look at vibration-based monitoring of rotating equipment. VCM is a huge field of study in itself, and it is impossible to fit this vast area into a single part of this book. Nonetheless, we hope that the current and the previous chapters have imparted a working-level knowledge of deploying ML for VCM.
Summary
In this chapter, we focused on the fault detection and diagnosis phase of the VCM workflow. We looked at classical and ML-based approaches for making inferences from processed vibration signals. Finally, we implemented an SVM-based classifier for fault classification of a motor using the popular CWRU dataset.
Part 7
Predictive Maintenance
Chapter 19
Fault Prognosis: Concepts & Methodologies
All machines eventually break, and plant operators have traditionally relied upon regular time-based (preventive) maintenance to avoid costly downtimes due to machinery failures. Although economically inefficient, preventive maintenance remained the default approach in the process industry for a long time. Only in recent times has the condition-based maintenance approach gained widespread acceptance, wherein a machine's real-time data is used to assess the machine's health, detect failures, and trigger (on-demand) maintenance. However, recent advancements in data mining have brought another step change in the mindset of plant reliability personnel: mere detection of machine faults is no longer good enough; accurate forecasting of a fault's progression, leading to predictive maintenance (PdM), is the new vogue. The lure of PdM is obvious – it facilitates advance planning of maintenance, better management of spare parts inventory, etc. Correspondingly, PdM is the holy grail that industrial executives are striving for to remain competitive.
PdM, in essence, involves fault prognosis, or the prediction of a machine's health degradation over time after detection of incipient faults. Different PdM methodologies are employed depending on the availability of fundamental knowledge of the fault's mechanism, historical run-to-failure data, etc. The dominant PdM approach involves computation of a health indicator (HI) that summarizes the state of a machine's health and shows a clear degradation trend as a fault progresses from incipience to high severity. The HI allows computation of the RUL (remaining useful life), which is the remaining time until fault severity crosses the failure threshold, necessitating the machine being taken out of service.
Several different strategies have been devised for the computation of HIs and the subsequent RUL estimation. While the RUL estimation strategies are covered in detail in the next chapter, this chapter focuses on the data-driven methods for HI computation. Specifically, the following topics are covered:
Fault prognosis simply refers to the task of estimating the progression of health degradation of a machine [119]. Fault prognosis kicks in after a fault has been detected. The end objective of fault prognosis is to estimate the time remaining until fault severity hits the failure threshold. A machine or an operating unit may be kept in operation (even with faults) until it reaches failure conditions. Therefore, estimation of the time remaining, or RUL, can help plant operators maximize equipment lifetime and plan maintenance judiciously. Figure 19.1 presents the different prognostic methodologies that can be employed depending on the level of available information about the fault mechanism and past fault data.
Among the shown approaches, the HI-based approach is very popular. As shown in Figure 19.2, a curve showing the current trend of fault severity or health condition is computed. Thereafter, the future progression of the curve is predicted to estimate the RUL. In this chapter, we will look at how such curves can be generated in a data-driven way. The strategy for HI forecasting is covered in detail in the next chapter.
[119] Fault prognosis is not limited to health prediction of machines only. It is applicable to a subprocess of a plant, and to the whole plant as well.
[Figure 19.2: Fault severity and health condition progression with time — after the healthy stage and fault onset, fault severity rises (health condition falls) toward the failure threshold; the RUL runs from the point where the fault is detected to the predicted end of useful life, with a distribution around the estimated RUL.]
Figure 19.3 shows the typical workflow for data-driven fault prognosis. Although fault prognosis is preceded by fault detection, technically the models used for the two tasks can be different; therefore, fault prognosis has been presented as a standalone modeling exercise. It is not uncommon for the same HI to be used first for fault detection and subsequently for fault prognosis. Figure 19.3 also shows two non-HI-based approaches where RULs are estimated directly, i.e., pre-processed raw data or features are used as model inputs and RUL is the predicted output. The requirement of abundant historical run-to-failure data (which are rarely available) for training the DL and ML models makes these regression-based approaches less common.
[Figure 19.3: Data-driven fault prognosis workflow — after feature selection, the RUL is obtained either via HI construction followed by RUL estimation, or directly via an ML model or a deep learning model that outputs the RUL.]
Let’s now move on and learn how HIs are actually computed.
Maintenance Strategies
Maintenance strategies in the process industry have been strongly influenced by advances in ML and sensor technologies. They have evolved from time-based preventive maintenance, to proactive monitoring-based condition-based maintenance (CBM), and on to advanced prediction-based predictive maintenance (PdM).
By now you understand that a health indicator is simply a metric that (directly or indirectly) quantifies the fault status of an equipment or process. For example, consider a catalyst-filled reformer tube in a steam-methane reformer that develops a crack on its outer surface. Specialized tools exist that can measure the depth of the crack and thus directly tell us the fault severity. Alternatively, one can measure the surface and furnace gas temperatures around the crack, which will show abnormal values due to the crack. The magnitude of the abnormal deviation (the difference between observed and expected values) would indirectly give us the level of fault severity.
[Illustration: reformer tube with a crack on its outer surface]
An HI can simply be one of the measured signals or a combination of them. It may also be derived from a model. Many ML techniques that we worked with in the previous chapters already provide metrics that can be used as a health indicator. For example, in SVDD, the distance from the center in the feature space can be used as an HI, as it indicates how far a test system has moved away from NOC behavior. Figure 19.4 below summarizes the different approaches for HI construction. Let's take a brief look at them.
MLforPSE.com|317
Chapter 19: Fault Prognosis: Concepts & Methodologies
parameters
Statistical
factor, kurtosis, etc., are used as
HIs.
used as HI.
Feature selection
Another aspect is the selection of features for creating a fusion-based HI or intermediate model. A common practice is to include only those features that show a clear monotonic trend as the health condition deteriorates.
Monotonicity is usually computed as the Spearman correlation between the feature values and time, which indicates whether the feature continuously increases (or decreases) with time or not. Monotonicity ranges from -1 to 1, where a large absolute value is favorable for prognosis.
Feature smoothing
Extracted features are almost always noisy, as shown in the illustration below. Noise can impact the accuracy of RUL estimation and the monotonicity value (and therefore the feature selection result). Therefore, a common practice is to smooth out the noise in the features. Smoothing is commonly conducted via (causal) moving averaging, wherein the smoothed value at any time is given by the average of the recent past data.
Spearman Correlation
The Spearman correlation between two variables x and y is simply the Pearson correlation between the ranks of the variables and is often used to quantify nonlinear relationships between a pair of variables.

$$\rho_{spearman} = \frac{cov(r_x, r_y)}{\sigma_{r_x}\,\sigma_{r_y}}$$

Here, $r_x$ and $r_y$ are the ranks of x and y. $\rho_{spearman}$ can capture a monotonic relationship between two variables very well.
[120] Complete sequence of a machine's measured signals from healthy state to failure.
[Illustration: labeled run-to-failure sequences [120] and ANN training — in each run-to-failure sequence, the HI is assigned a value of 1 when the incipient fault is detected (motor healthy) and decays to 0 at the failure time $t_f$; an ANN (the HI model) is then trained to map process data to the estimated HI.]
Let us bring all the covered concepts together and see an application of approach three from
Figure 19.4 for HI construction for a (wind) turbine using vibration signals.
As alluded to in Chapter 17, the wind turbine dataset [121] consists of vibration acceleration recorded from a wind turbine over a period of 50 days. Six seconds of vibration data were recorded each day; a total of 50 files have been provided, with each file containing data for one day. A fault in a bearing leads to increasing vibration levels, with failure occurring on the 50th day. Figure 19.6 shows the combined vibration signal for the 50 days. An increasing level of vibration is clearly evident. We will use this dataset to demonstrate how an HI that exhibits a clear health degradation trend over time can be constructed from extracted time domain features [122].
Let’s begin by importing required packages and defining a utility function that extracts time
domain features from a vibration waveform.
# import packages
import numpy as np, matplotlib.pyplot as plt, pandas as pd
import scipy.io
import glob
from scipy.stats import kurtosis, skew, spearmanr
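
The utility function itself is sketched below; its exact form is our assumption, with feature definitions following the standard conventions from Chapter 17 and the order matching the featureNames list used later.

def timeDomainFeatures(y):
    # compute 11 time domain features from waveform y
    absY = np.abs(y)
    RMS = np.sqrt(np.mean(y**2))
    peak = np.max(absY)
    features = [np.mean(y), np.std(y), RMS, peak,
                np.max(y) - np.min(y), # peak2peak
                skew(y), kurtosis(y, fisher=False),
                RMS/np.mean(absY), # shape factor
                peak/RMS, # crest factor
                peak/np.mean(absY), # impulse factor
                peak/np.mean(np.sqrt(absY))**2] # margin factor
    return features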
[121] Available at https://github.com/mathworks/WindTurbineHighSpeedBearingPrognosis-Data. Data has been shared by MathWorks under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license (https://creativecommons.org/licenses/by-nc-sa/4.0/). Permission was granted by the original author of the dataset, Eric Bechhoefer, to use the data in this book.
[122] MathWorks has provided a case study (https://www.mathworks.com/help/predmaint/ug/wind-turbine-high-speed-bearing-prognosis.html) using this dataset wherein they demonstrate how to predict RUL using a fusion-based HI. We adopt a similar approach in our case study.
Let’s now read data from all the 50 files and extract features.
Nfeatures = 11
features50days = np.zeros((50, Nfeatures))
for i in range(len(FilenamesList)):
matlab_data = scipy.io.loadmat(Filenames[i], struct_as_record = False) # reads data from file123
vib_data = matlab_data['vibration'][:,0]
features = timeDomainFeatures(vib_data)
features50days[i,:] = features
[123] Files are in '.mat' format.
As alluded to earlier, the extracted features are noisy. We will smooth them using a moving average, as sketched below.
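
A minimal sketch of causal moving-average smoothing (the 5-day window is our assumption):

# smooth each feature via a causal moving average
features50days_smooth = pd.DataFrame(features50days).rolling(window=5, min_periods=1).mean().values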
The plots above show that not all features exhibit a clear degradation trend. We will use monotonicity to select HI-relevant features for further processing; specifically, we will use only features with monotonicity greater than 0.7. Note that our purpose in computing an HI is to eventually predict the RUL. For this, we will assume that 32 days of data are available to guide the construction of the HI and make a prediction of the RUL. The specific strategy used to predict the RUL is shown in the next chapter.
# monotonicity of features (computed over the 32-day training period)
Ndays_train = 32
features_train = features50days_smooth[:Ndays_train, :]
feature_monotonicity = np.zeros((Nfeatures,))
for feature in range(Nfeatures):
    result = spearmanr(range(Ndays_train), features_train[:,feature])
    feature_monotonicity[feature] = result.statistic
# bar plot
featureNames = ['Mean', 'Std', 'RMS', 'Peak', 'Peak2Peak', 'Skewness', 'Kurtosis', 'ShapeFactor',
'CrestFactor', 'ImpulseFactor', 'MarginFactor']
plt.figure(figsize=(15,5))
plt.bar(range(Nfeatures), feature_monotonicity, tick_label=featureNames)
plt.xticks(rotation=45), plt.ylabel('Monotonicity'), plt.grid(axis='y')
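
The features clearing the monotonicity threshold can then be selected; a sketch (the variable names are ours):

# select features with monotonicity greater than 0.7
selectedFeatures = np.where(feature_monotonicity > 0.7)[0]
selectFeatures_train = features_train[:, selectedFeatures]
selectFeatures_all = features50days_smooth[:, selectedFeatures]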
Let’s now fuse the selected features into a single metric. For this purpose, we will utilize the
PCA technique as shown below. Notice that we again use only the training data to generate
the PCA model.
# perform PCA and extract scores along the first principal component
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(selectFeatures_train)
selectFeatures_train_normal = scaler.transform(selectFeatures_train)
selectFeatures_all_normal = scaler.transform(selectFeatures_all)
pca = PCA().fit(selectFeatures_train_normal)
PCA_all_scores = pca.transform(selectFeatures_all_normal)
The first component explains almost 95% of the data variance, and the plot below shows that the first principal component exhibits a nice monotonic trend as the bearing fault progresses; therefore, it is a good metric to use as a health indicator.
P-F Curve
In the machinery reliability world, the P-F (potential failure) curve is a frequently used concept. The curve represents the progression of an equipment's health condition over time, from a healthy state to the initiation of a defect to failure. The changes in the machine's characteristics or failure symptoms, such as high temperature, high vibration, high acoustic noise, etc., are marked on the curve, indicating the technique that may be effective for fault detection at any given stage of the fault.
[Illustration: P-F curve — asset condition vs. time, from the point a defect appears (potential failure), through the stage where the fault is detectable via vibration monitoring, down to functional failure.]
Hopefully, now you have a better understanding of what fault prognosis entails and the
different methodologies available to implement predictive maintenance solutions.
Summary
In this chapter, we acquainted ourselves with the methodologies employed for predictive
maintenance and undertook a detailed conceptual study of health indicators (HIs). Generation
of HIs is one of the primary means for the estimation of RUL and therefore, we looked at the
methods commonly used to compute health indicators. We consolidated the concepts
covered by working through a real case study of HI generation for a wind turbine.
Chapter 20
Fault Prognosis: RUL Estimation
In the previous chapter, we introduced the concept of remaining useful life, which is simply the time remaining until failure of an equipment. Three broad data-based techniques were mentioned: 1) the reliability data-based approach, wherein the lifespan distribution of similar equipment is utilized to find the expected RUL; 2) direct computation of RUL via regression-based ML modeling; 3) computation of a health indicator as an intermediate step. The first two approaches require information about the past lifespans of equipment and complete run-to-failure histories. However, it is difficult to obtain these data in the process industry, as machines very often get repaired before they reach failure stages (remember preventive maintenance!). This makes the HI-based approach more suitable and, unsurprisingly, more popular. In the previous chapter, we saw how to compute an HI for a wind turbine. We will take that case study to completion and show how to estimate the RUL.
Within the HI-based approach, two strategies are widely adopted. If a decent amount of past run-to-failure data is available, then one can simply pick the historical HI trend that most closely matches the current equipment's HI trajectory and use the historical lifespan to compute the required RUL. This is called the similarity-based approach. A popular alternative is to simply fit a curve to the existing HI values of the current equipment, extrapolate it into the future, and find when the failure threshold is breached. This is called the degradation-based approach. We will go into more detail on these two strategies in this chapter. Overall, the following topics are covered:
• Introduction to RUL
• Health indicator-based RUL estimation strategies
• Health indicator degradation modeling for RUL estimation of a wind turbine
• Deep learning-based direct RUL estimation for a gas turbine
In the previous chapter, we looked at some broad classes of strategies for RUL estimation. We also looked at how a health indicator can be calculated. Figure 20.1 reproduces Figure 19.1 and adds more details regarding HI-based approaches for RUL computation. The figure also highlights the four commonly employed strategies. As alluded to earlier, the choice of model depends on the type and amount of information available on past failures. If a large amount of past run-to-failure data is available, then one can build a deep learning model to directly predict the RUL. We will see one such application in this chapter.
[Illustration: RUL prediction error — after a fault is detected, the error at the current state is defined as $\varepsilon = RUL_{pred} - RUL_{actual}$.]
A preference for negative ε over positive ε is often specified by putting a greater penalty on ε > 0 when computing the overall performance (for example, during model fitting). An example of such an asymmetric scoring function is sketched below.
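
For instance, the scoring function popularized by NASA's PHM08 prognostics challenge penalizes late predictions (ε > 0) more heavily than early ones; a minimal sketch:

# asymmetric RUL scoring (PHM08-style); eps = RUL_pred - RUL_actual
import numpy as np

def rul_score(eps):
    # late predictions (eps > 0) are penalized more than early ones (eps < 0)
    return np.where(eps < 0, np.exp(-eps/13) - 1, np.exp(eps/10) - 1)

print(rul_score(np.array([-10, 10]))) # a late-by-10 prediction scores worse than an early-by-10 one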
In Figure 20.2, we saw that by the time instance $t_k$, the HI already shows a degradation trend. For RUL estimation, the trick lies in figuring out how the HI evolves further until it reaches the failure threshold. Let's look at two popular ways of approaching this problem.

In the degradation model approach, an HI model is fitted, as shown in the figure below, using the available HI data for the equipment under study. As shown, the model can take different forms. At time $t_k$, the model is used to predict $t_F$, when the fitted HI curve breaches the failure threshold; this gives the RUL prediction at time $t_k$ as $t_F - t_k$. The model training and RUL estimation are redone at time $t_{k+1}$ when $h_{k+1}$ becomes available. We will apply this strategy to the wind turbine case study introduced in the previous chapter.
[Figure 20.3: Degradation model approach — a curve is fitted to the fault severity/HI data available up to $t_k$ and extrapolated to find the time $t_F$ at which it breaches the failure threshold.]

Exponential model: $h_t = \alpha e^{\beta t} + \gamma$
Polynomial model: $h_t = \alpha_0 + \alpha_1 t + \alpha_2 t^2 + \cdots$
Time series model: $h(k) = f(h_{k-1}, h_{k-2}, \cdots)$
The selection of the failure threshold is non-trivial. The threshold may be easy to specify if the HI is based on a single measurement or feature, such as the RMS of a vibration signal or the temperature of the equipment. However, for the generic case, domain-specific knowledge or historical failure records would be needed.
In the similarity-based approach, known HI trends from past equipment failures are gathered as shown in Figure 20.4 below. At time $t_k$, the trajectories that are similar [124] to the current equipment's HI trajectory until time $t_k$ are selected. The lifespans of these selected records are combined (via a simple average or a weighted combination) to estimate the RUL of the current equipment. A sketch of this computation follows Figure 20.4.
[Figure 20.4: Similarity-based approach — historical HI trajectories with known failure times are compared against the current trajectory up to $t_k$; the lifespans of the most similar records are combined to estimate the RUL.]
[124] Similarity between two trajectories is often quantified via the Euclidean distance. Alternatively, dynamic time warping (DTW) has also been employed.
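
A minimal sketch of the similarity-based estimate, using Euclidean distance and a simple average (the function and variable names are ours):

# similarity-based RUL estimation
import numpy as np

def similarity_rul(current_HI, historical_HIs, lifespans, n_similar=5):
    # current_HI: HI values of the current equipment up to time tk
    # historical_HIs: full run-to-failure HI trajectories; lifespans: their failure times
    tk = len(current_HI)
    distances = np.array([np.linalg.norm(np.asarray(h[:tk]) - current_HI) for h in historical_HIs])
    similar = np.argsort(distances)[:n_similar] # indices of the most similar records
    return np.mean([lifespans[i] - tk for i in similar]) # average remaining life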
The actual HI trajectory has a significant stochastic drift characteristic in the training period and, therefore, it is not surprising that our deterministic exponential fit cannot closely follow the actual trajectory. Nonetheless, let's check the error in the RUL prediction. We will use the last value of the computed HI (day 50) as the failure threshold and estimate when the fitted trajectory crosses this threshold.
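
A sketch of the fit and the threshold-crossing computation using scipy (the initial guesses, and the assumption that the HI increases with degradation, are ours):

# fit exponential degradation model h(t) = alpha*exp(beta*t) + gamma on training-period HI
from scipy.optimize import curve_fit

HI = PCA_all_scores[:, 0] # first principal component used as the HI
days = np.arange(1, 51)

def expModel(t, alpha, beta, gamma):
    return alpha*np.exp(beta*t) + gamma

params, _ = curve_fit(expModel, days[:Ndays_train], HI[:Ndays_train], p0=[0.1, 0.1, 0], maxfev=10000)

# failure threshold = HI value on day 50; find when the fitted curve first crosses it
threshold = HI[-1]
HI_fitted = expModel(days, *params)
t_F = days[np.argmax(HI_fitted >= threshold)] # predicted end of useful life
print('Predicted failure day:', t_F)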
The above plot indicates that the simple exponential degradation model predicts the end of
useful life perfectly!
To illustrate regression-based direct prediction of RUL using past run-to-failure histories, we will use a simulated aircraft gas turbine engine dataset [125], which consists of operational and dynamic data (such as temperatures and pressures) from multiple sensors over several engine operation simulations. Each simulation starts with an engine (with a different degree of initial wear) operating within normal limits. Engine degradation starts at some point during the simulation and continues until engine failure (the engine health margin, a function of efficiency and flow, falling below a threshold). The training dataset contains complete data until engine failure, while the test dataset contains data until some point prior to failure. Actual RULs for the engines have been provided for the test dataset. Our objective is to develop a PdM model to predict engine failure using the simulation data in the test dataset.
Figure 20.5: Sensor readings from the training dataset (left) and test dataset (right). The actual RUL for the shown engine ID 90 is 28, as provided in RUL_FD001.txt.
[125] This dataset (NASA Turbofan Jet Engine Data Set) is in the public domain and is provided by NASA (https://data.nasa.gov/Aerospace/CMAPSS-Jet-Engine-Simulated-Data/ff5v-kuh6/about_data).
# training data
train_df = pd.read_csv('PM_train.txt', sep=" ", header=None)
train_df.drop(train_df.columns[[26, 27]], axis=1, inplace=True) # last two columns are blank
train_df.columns = ['EngineID', 'cycle', 'OPsetting1', 'OPsetting2', 'OPsetting3', 's1', 's2', 's3',
's4', 's5', 's6', 's7', 's8', 's9', 's10', 's11', 's12', 's13', 's14', 's15', 's16', 's17', 's18', 's19', 's20', 's21']
# test data
test_df = pd.read_csv('PM_test.txt', sep=" ", header=None)
test_df.drop(test_df.columns[[26, 27]], axis=1, inplace=True)
test_df.columns = ['EngineID', 'cycle', 'OPsetting1', 'OPsetting2', 'OPsetting3', 's1', 's2', 's3',
's4', 's5', 's6', 's7', 's8', 's9', 's10', 's11', 's12', 's13', 's14', 's15', 's16', 's17', 's18', 's19', 's20', 's21']
While most of the code above is self-explanatory, the last part reads the true RUL data, removes the redundant column with 'nan' data, and adds an additional column 'EngineID', as shown below.
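
A sketch of that last part (the truth file name follows the PM_train/PM_test naming pattern and is our assumption):

# true RUL data for the test engines
truth_df = pd.read_csv('PM_truth.txt', sep=" ", header=None)
truth_df.drop(truth_df.columns[[1]], axis=1, inplace=True) # second column is blank
truth_df.columns = ['finalRUL']
truth_df['EngineID'] = truth_df.index + 1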
Next, we will do some dataframe manipulations to compute the RUL at any given cycle for an engine [126]. The code below adds an 'engineRUL' column to the training and test dataframes. We will also clip the RUL values in the training dataset at a threshold of 150.

[126] https://github.com/umbertogriffo/Predictive-Maintenance-using-LSTM
# training dataset
maxCycle_df = pd.DataFrame(train_df.groupby('EngineID')['cycle'].max()).reset_index()
maxCycle_df.columns = ['EngineID', 'maxEngineCycle']
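
The merge and per-cycle RUL computation can then be sketched as below (clipping at 150 per the text above):

train_df = train_df.merge(maxCycle_df, on='EngineID', how='left')
train_df['engineRUL'] = train_df['maxEngineCycle'] - train_df['cycle']
train_df['engineRUL'] = train_df['engineRUL'].clip(upper=150)
train_df.drop('maxEngineCycle', axis=1, inplace=True)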
# compute maxEngineCycle for test data using data from test_df and truth_df
maxCycle_df = pd.DataFrame(test_df.groupby('EngineID')['cycle'].max()).reset_index()
maxCycle_df.columns = ['EngineID', 'maxEngineCycle']
truth_df['maxEngineCycle'] = maxCycle_df ['maxEngineCycle'] + truth_df['finalRUL']
truth_df.drop('finalRUL', axis=1, inplace=True)
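
The test dataframe gets its 'engineRUL' column analogously, now using the truth-adjusted maxEngineCycle held in truth_df:

test_df = test_df.merge(truth_df, on='EngineID', how='left')
test_df['engineRUL'] = test_df['maxEngineCycle'] - test_df['cycle']
test_df.drop('maxEngineCycle', axis=1, inplace=True)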
Next, we will create the sequence samples. For each engine, any continuous block of 50 cycles forms a sequence. To accomplish this, we will define a utility function as shown below.
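
A sketch of this utility (its exact form is our assumption; it returns lists of 50-step input sequences and the RUL at each sequence's last cycle):

nSequenceSteps = 50 # sequence length

def generate_LSTM_samples(df, nSteps):
    # df columns: 24 input features followed by 'engineRUL'
    data = df.values
    X_sequences, y_sequences = [], []
    for start in range(len(data) - nSteps + 1):
        X_sequences.append(data[start:start+nSteps, :-1]) # all columns except engineRUL
        y_sequences.append(data[start+nSteps-1, -1]) # RUL at the sequence's last cycle
    return X_sequences, y_sequences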
X_train_sequence, y_train_sequence = [], []
for engineID in train_df['EngineID'].unique():
    engine_df = train_df[train_df['EngineID'] == engineID]
    engine_df = engine_df[['OPsetting1', 'OPsetting2', 'OPsetting3', 's1', 's2', 's3', 's4', 's5', 's6', 's7',
                           's8', 's9', 's10', 's11', 's12', 's13', 's14', 's15', 's16', 's17', 's18', 's19', 's20',
                           's21', 'engineRUL']]
    engine_X_train_sequence, engine_y_train_sequence = generate_LSTM_samples(engine_df, nSequenceSteps)
    X_train_sequence = X_train_sequence + engine_X_train_sequence # add samples to the common list
    y_train_sequence = y_train_sequence + engine_y_train_sequence
We are now ready to build and compile our RNN. The topology includes a single-neuron output layer with a linear activation function. For regularization, we utilize the dropout technique.
# custom metric
import tensorflow.keras.backend as K
def r2_custom(y_true, y_pred):
    """Coefficient of determination"""
    SS_res = K.sum(K.square(y_true - y_pred))
    SS_tot = K.sum(K.square(y_true - K.mean(y_true)))
    return (1 - SS_res/(SS_tot + K.epsilon()))
# define model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense
model = Sequential()
model.add(LSTM(units=100, return_sequences=True, input_shape=(nSequenceSteps, 24)))
model.add(Dropout(0.2))
model.add(LSTM(units=50))
model.add(Dropout(0.2))
model.add(Dense(1))

# compile model
model.compile(loss='mse', optimizer='Adam', metrics=[r2_custom])
We can now fit the model. The validation curves indicate that we obtain pretty good predictions on the training dataset.
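
A sketch of the fitting step (the epochs, batch size, and validation split are our assumptions):

# convert the sequence lists to arrays and fit
X_train_sequence = np.array(X_train_sequence)
y_train_sequence = np.array(y_train_sequence)
history = model.fit(X_train_sequence, y_train_sequence, epochs=30, batch_size=200, validation_split=0.05)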
For the test data, we will create one sequence per engine using its last 50 cycles. Figure 20.6 compares the actual versus predicted RUL values for the test engine dataset. It is apparent that the model performs satisfactorily. Such accurate estimation of the remaining useful life of process equipment can be of great assistance in maintenance planning and the avoidance of costs due to unexpected equipment failures.
# input/output test sequences (only the last sequence is used to predict failure)
X_test_sequence = []
y_test_sequence = []
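# the loop below fills these lists (a sketch; skipping engines with fewer than
# 50 cycles is our simplification)
featureCols = ['OPsetting1', 'OPsetting2', 'OPsetting3'] + ['s'+str(i) for i in range(1,22)]
for engineID in test_df['EngineID'].unique():
    engine_df = test_df[test_df['EngineID'] == engineID]
    if len(engine_df) >= nSequenceSteps:
        X_test_sequence.append(engine_df[featureCols].values[-nSequenceSteps:, :])
        y_test_sequence.append(engine_df['engineRUL'].values[-1])
X_test_sequence = np.array(X_test_sequence)
y_test_sequence = np.array(y_test_sequence)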
# predict RULs
y_test_sequence_pred = model.predict(X_test_sequence)
test_performance = model.evaluate(X_test_sequence, y_test_sequence)
print('R2_test: {}'.format(test_performance[1]))
Figure 20.6: Predicted vs observed engine RULs for test aircraft engine dataset
This completes our brief coverage of predictive maintenance methodologies. You now have all the modern tools in your arsenal to tackle plant health management problems in your production plants. We bid adieu to you and wish you all the best in your process data science career.
Summary
In this chapter, we built upon the fault prognosis foundations laid in the previous chapter. We looked at techniques for RUL prediction using the health indicator trajectory and direct regression. We consolidated our conceptual understanding through a couple of case studies on RUL prediction for a wind turbine and a gas turbine.
Of process data science, By process data scientists, For process data scientists
www.MLforPSE.com