Model development methodology
Data gathering and examination
Data-driven models are constructed using a large amount of historical data.
The quality of the resulting model is determined by the quality of the input data.
Pre-processing and conditioning of data
The data may contain spikes or extremely high values.
Noise is present in the data as a result of the process itself or of the measurement transmitters.
The data may have missing values or frozen (stuck) values.
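As a rough illustration of the last point, missing and frozen values can be flagged before any modelling. The sketch below is a minimal example under stated assumptions: the function names, the sample signal, and the `min_run` setting (how many identical consecutive samples count as "frozen") are hypothetical choices, not values from the slides.

```python
import math

def flag_missing(values):
    """Return True where a sample is missing (None or NaN)."""
    return [v is None or (isinstance(v, float) and math.isnan(v)) for v in values]

def flag_frozen(values, min_run=5):
    """Return True where a value belongs to a run of at least `min_run`
    identical consecutive samples (a possible frozen transmitter)."""
    flags = [False] * len(values)
    start = 0
    for i in range(1, len(values) + 1):
        # A run ends at the end of the data or when the value changes.
        if i == len(values) or values[i] != values[start]:
            if i - start >= min_run and values[start] is not None:
                for j in range(start, i):
                    flags[j] = True
            start = i
    return flags

# Hypothetical temperature signal: one missing sample and a stuck reading.
signal = [20.1, 20.3, None, 20.2, 21.0, 21.0, 21.0, 21.0, 21.0, 20.8]
missing = flag_missing(signal)
frozen = flag_frozen(signal, min_run=5)
print(missing)
print(frozen)
```

A suitable `min_run` depends on the sampling rate and on how often the variable legitimately stays constant, so it would need to be chosen per signal.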
Detecting and replacing outliers
Outliers are observations that do not match the majority of the data, such as missing data points and observations that differ greatly from normal values.
Outliers in process data can be caused by a variety of factors, including the failure of process equipment or measurement transmitters, the failure of a data collection system, and so on.
Outlier detection and removal from data sets is crucial for soft sensor development, as undetected outliers have a negative impact on the final soft sensor model's performance.
Outlier detection using a univariate technique
The Hampel identifier and the 3σ edit rule are two popular univariate approaches for detecting outliers.
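The two rules can be sketched in a few lines. This is a minimal global (non-windowed) version under common textbook assumptions: a cutoff of k = 3 for both rules and the factor 1.4826 that scales the median absolute deviation (MAD) to a standard deviation for normally distributed data. The function names and the sample data are hypothetical.

```python
import statistics

def three_sigma_outliers(x, k=3.0):
    """3-sigma edit rule: flag values more than k standard deviations from the mean."""
    mean = statistics.fmean(x)
    std = statistics.stdev(x)
    return [abs(v - mean) > k * std for v in x]

def hampel_outliers(x, k=3.0):
    """Hampel identifier: flag values more than k scaled MADs from the median."""
    med = statistics.median(x)
    mad = statistics.median([abs(v - med) for v in x])
    scale = 1.4826 * mad  # if the MAD is 0, every value differing from the median is flagged
    return [abs(v - med) > k * scale for v in x]

# Hypothetical measurements with one spike at index 8.
data = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 25.0, 10.0]
sigma_flags = three_sigma_outliers(data)
hampel_flags = hampel_outliers(data)
print(sigma_flags)
print(hampel_flags)
```

On this small sample the spike inflates the mean and standard deviation enough that the 3σ rule does not flag it, while the Hampel identifier, which relies on the median and the MAD, does. This masking effect is the usual argument for preferring the Hampel identifier.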
Outlier detection using a multivariate approach
PCA is a multivariate statistical method for reducing data dimensionality by projecting the data matrix onto a lower dimensional space using loading vectors.
The loading vectors corresponding to the k greatest eigenvalues capture most of the variation in the data and hence contain the majority of the information in the data.
The residual matrix and the Q statistic, which indicates the distance of a sample from the PCA model's space, can be used to assess the fit between data and model.
Hotelling's T² statistic indicates the distance between a given data point and the multivariate mean of the data, which provides a measure of variability within the normal subspace.
Outliers are detected using a combination of Q and T² tests.
Outlier detection using a multivariate approach (continued)
Measurements with Q or T² values over the threshold are classed as outliers, where the threshold is set by the significance level chosen for the Q and T² statistics.
Points outside the 99 percent confidence ellipse are outliers.
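The Q and T² statistics can be computed directly from an eigen-decomposition. The sketch below uses NumPy on synthetic data and makes two loudly stated assumptions: the data are autoscaled before PCA, and the thresholds are taken as the empirical 99th percentile of each statistic. The percentile is a stand-in for the distribution-based limits normally derived from the chosen significance level, which the slides do not spell out. The function name `pca_q_t2` and the data are hypothetical.

```python
import numpy as np

def pca_q_t2(X, k):
    """Return the Q statistic (squared residual) and Hotelling's T2 of every
    row of X for a PCA model with k retained components."""
    # Autoscale each variable to zero mean and unit variance (assumption).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Eigen-decomposition of the covariance matrix, largest eigenvalues first.
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    lam = eigvals[order][:k]          # k greatest eigenvalues
    P = eigvecs[:, order][:, :k]      # corresponding loading vectors
    T = Z @ P                         # scores in the model subspace
    E = Z - T @ P.T                   # residual matrix
    Q = np.sum(E ** 2, axis=1)        # distance from the PCA model's space
    T2 = np.sum(T ** 2 / lam, axis=1) # distance from the mean within the subspace
    return Q, T2

# Synthetic data: three strongly correlated variables driven by one factor.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([
    t,
    2.0 * t + 0.05 * rng.normal(size=200),
    -t + 0.05 * rng.normal(size=200),
])
X[0] = [0.0, 3.0, 3.0]     # breaks the correlation structure: large Q expected
X[1] = [6.0, 12.0, -6.0]   # follows the structure but is extreme: large T2 expected

Q, T2 = pca_q_t2(X, k=1)
# Empirical 99th-percentile limits (assumption, see the note above).
flags = (Q > np.percentile(Q, 99)) | (T2 > np.percentile(T2, 99))
print(np.flatnonzero(flags))
```

The two planted samples show why both tests are combined: one violates the correlation structure and is caught by Q, the other respects the structure but lies far from the mean and is caught by T². Note that an empirical percentile always flags a fixed share of the data, so it is only suitable as a first screening.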
Selection of relevant input and output variables
• Choosing relevant inputs is an important step in modelling the input–output relationship in a process model
Assemble data
ANN modelling is frequently used in multivariate systems whose variables are sampled at different rates.
In many industrial chemical processes, product quality parameters are normally measured offline in a laboratory or by an online analyzer with a long dead time.
Input variables such as temperature and pressure are measured and recorded every second or minute.
As a result, the data must be aligned on a common time scale.
It is critical that laboratory data be properly time stamped and aligned with the other continuous data on a consistent time scale.
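One simple way to do this alignment is to pair each laboratory result with the most recent process record at or before the sampling time, optionally shifted by a known delay. The sketch below uses only the Python standard library. The function name, the record layout, the sample values, and the 2-minute delay are all hypothetical.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

def align_lab_to_process(process, lab, delay=timedelta(0)):
    """Pair each lab result with the latest process record at or before
    (lab timestamp - delay).

    process: list of (timestamp, inputs) sorted by timestamp
    lab:     list of (timestamp, quality_value)
    returns: list of (inputs, quality_value)
    """
    times = [ts for ts, _ in process]
    pairs = []
    for t_lab, y in lab:
        i = bisect_right(times, t_lab - delay) - 1
        if i >= 0:  # skip lab samples taken before the first process record
            pairs.append((process[i][1], y))
    return pairs

# Hypothetical data: (temperature, pressure) logged every minute from 08:00.
start = datetime(2024, 1, 1, 8, 0, 0)
process = [(start + timedelta(minutes=i), (100.0 + i, 5.0)) for i in range(10)]
# One lab result time stamped 08:05:30, assumed to lag the process by 2 minutes.
lab = [(start + timedelta(minutes=5, seconds=30), 0.92)]

pairs = align_lab_to_process(process, lab, delay=timedelta(minutes=2))
print(pairs)
```

Here the lab result is matched with the 08:03 record. In practice the delay would have to come from process knowledge, and averaging the inputs over a short window may be preferable to taking a single record.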
Selection, training, and validation of model parameters
This is the most important step in creating an ANN model.
Because the model is at the heart of the ANN, selecting the right model is crucial to its success.
The model developer needs to provide the following parameters as user input during the ANN model building phase:
• Number of nodes in the hidden layer
• Type of activation function in the input layer and output layer
• Algorithm for updating the weights, etc.
Model selection
There are many options available for model selection, but no clear-cut guidelines exist on which model to select under which conditions.
In most cases, the type of model is selected by the developer based on personal preference and expertise.
This can be very detrimental to developing a good model.
The best approach is to remain open-minded about all model types.
Best practices for model selection
Good practice is to start with a simple model type with a small number of nodes in the hidden layer and gradually increase model complexity as long as a significant improvement in the model's performance can be observed.
During the model building phase, the performance of an individual model can be judged on unseen validation data.
The same approach can also be applied to selecting the parameters of the pre-processing methods, for instance variable selection.
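The "start simple and grow while it helps" rule can be written as a small loop. The sketch below is only an outline: `validation_error` stands for whatever routine trains a model of a given size and returns its error on unseen validation data, the 5% relative improvement threshold is an arbitrary assumption, and the error values in the example are made-up numbers used purely to exercise the loop.

```python
def select_complexity(candidates, validation_error, min_improvement=0.05):
    """Walk through candidate sizes (simplest first) and stop as soon as the
    relative improvement in validation error drops below `min_improvement`."""
    best = candidates[0]
    best_err = validation_error(best)
    for size in candidates[1:]:
        err = validation_error(size)
        if best_err - err < min_improvement * best_err:
            break  # no significant improvement: keep the simpler model
        best, best_err = size, err
    return best, best_err

# Made-up validation errors per number of hidden nodes (illustration only).
toy_errors = {1: 0.40, 2: 0.20, 4: 0.12, 8: 0.118, 16: 0.119}

chosen, chosen_err = select_complexity([1, 2, 4, 8, 16], toy_errors.get)
print(chosen, chosen_err)
```

With these illustrative numbers the loop stops at 4 hidden nodes, because going to 8 nodes improves the validation error by less than 5%. The same loop could be reused for a pre-processing parameter by passing a different list of candidates.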
Cross validation
Data-driven models normally need a large amount of data, which is usually available in modern industry.
However, in some instances where lab data is used, only a very small amount of data may be available.
For industrial processes where little reliable lab data is available, statistical error-estimation techniques such as K-fold cross-validation can be applied.
This method makes optimal use of the available data by partitioning it in such a way that all of the samples are used for validating model performance.
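A minimal sketch of K-fold cross-validation follows. To keep it self-contained, a straight-line least-squares fit stands in for the ANN, which is an assumption made only for brevity. The folds are contiguous and unshuffled, and the function names and data are hypothetical.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, validation_indices) for k contiguous folds.
    Every sample index appears in exactly one validation fold."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

def fit_line(x, y):
    """Least-squares straight line (stand-in for training the real model)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def cross_validated_mse(x, y, k):
    """Average squared validation error over all k folds."""
    errors = []
    for train, val in k_fold_indices(len(x), k):
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        errors.extend((y[i] - (slope * x[i] + intercept)) ** 2 for i in val)
    return sum(errors) / len(errors)

# Hypothetical small data set: 10 samples on an exact line y = 2x + 1.
x = [float(i) for i in range(10)]
y = [2.0 * v + 1.0 for v in x]
cv_mse = cross_validated_mse(x, y, k=3)
print(cv_mse)
```

Each sample is used for validation exactly once and for training in the remaining folds, which is what lets a small data set be used in full. Whether the samples should be shuffled before splitting depends on the data. For time-ordered process data, shuffling can give an overly optimistic error estimate.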
Model performance
After finding the optimal model structure and training the model, the performance of the trained ANN model has to be judged once again on a new validation data set.
Mean squared error (MSE), which measures the average squared distance between the predicted and the correct values, is the most popular performance evaluation measure for a model.
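For reference, the mean squared error over N validation samples is the sum of the squared differences between the correct and the predicted values, divided by N. A minimal sketch, with hypothetical names and sample values:

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between the correct and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

measured = [1.0, 2.0, 3.0]
predicted = [1.0, 2.5, 2.0]
mse = mean_squared_error(measured, predicted)  # (0 + 0.25 + 1.0) / 3
print(mse)
```

Because the errors are squared, a few large misses dominate the result, so the MSE is sensitive to any outliers left in the validation data.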
Model performance (continued)
Another way of judging performance is to use a visual representation of the predictions.
Among these, four-plot analysis is a useful tool, since it provides information about the relation between the predictions and the correct values together with an analysis of the prediction residuals.
A disadvantage of the visual methods is that they require the involvement of the model developer, and the final decision on whether the model performs adequately is left to the developer's subjective judgement.
Important criteria
It should be evaluated whether the developed model bears some resemblance to the underlying physics of the process.
Many modelling experts stress the need to apply process knowledge during the ANN model development phase.
Model acceptance and model tuning
After the ANN model is developed, it is tested in offline mode to see how well its predictions match fresh data currently generated in the DCS (distributed control system).
If the model prediction closely matches the actual output, the model is accepted for industrial use.
The usual criterion is that the average prediction error should be less than 1% with an R² value greater than 0.95.
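The acceptance criterion can be checked with a few lines of code. The sketch below assumes that "average prediction error" means the mean absolute percentage error, which the slides do not state explicitly. The function names and the sample values are hypothetical.

```python
def mean_abs_percent_error(y_true, y_pred):
    """Mean absolute error relative to the actual value, in percent.
    Assumes no actual value is zero."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def accept_model(y_true, y_pred, max_error_pct=1.0, min_r2=0.95):
    """Apply the acceptance rule: average error below 1% and R2 above 0.95."""
    return (mean_abs_percent_error(y_true, y_pred) < max_error_pct
            and r_squared(y_true, y_pred) > min_r2)

# Hypothetical fresh plant data and two sets of model predictions.
actual = [90.0, 92.0, 94.0, 96.0, 98.0]
good_pred = [90.3, 91.8, 94.2, 95.7, 98.1]
flat_pred = [94.0, 94.0, 94.0, 94.0, 94.0]

print(accept_model(actual, good_pred))  # True
print(accept_model(actual, flat_pred))  # False
```

The second prediction set simply repeats the mean of the actual values, so it fails both conditions: its average error is above 1% and its R² is 0.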
Model performance deterioration
It is very common in industry that the performance of an ANN model deteriorates over time.
There are many reasons:
• The underlying process may change
• Measurement transmitter data may drift
• Analyzer readings may change due to recalibration, etc.
All of these can cause the performance of the ANN model to deteriorate and have to be compensated for by adapting or re-developing the model.
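One simple way to notice such deterioration is to track a rolling prediction error against fresh measurements and raise a flag when it exceeds a limit. The sketch below illustrates that idea only. The window length, the error limit, the function name, and the drifting data are all hypothetical.

```python
from collections import deque

def first_degraded_index(y_true, y_pred, window=5, limit=1.0):
    """Return the index at which the mean absolute error over the last
    `window` samples first exceeds `limit`, or None if it never does."""
    recent = deque(maxlen=window)
    for i, (t, p) in enumerate(zip(y_true, y_pred)):
        recent.append(abs(t - p))
        if len(recent) == window and sum(recent) / window > limit:
            return i
    return None

# Hypothetical case: the process stays at 50.0 while the predictions drift away.
actual = [50.0] * 12
drifting = [50.2] * 6 + [50.5, 51.0, 51.5, 52.0, 52.5, 53.0]
stable = [50.2] * 12

print(first_degraded_index(actual, drifting))  # 9
print(first_degraded_index(actual, stable))    # None
```

A flag of this kind only signals that the model needs attention. Whether to adapt or re-develop the model still depends on which of the causes above is responsible.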
Model tuning and model update
The ANN model is to be maintained and tuned on a regular basis.
In the literature, researchers have tried various adaptive approaches to update the model based on its performance.
The neural model is updated every six months with fresh current data when it is found that the present model's prediction capability deteriorates over time.
Most of these automatic model update methods are still limited to research publications, and very few are actually applied in industry.