Text Mining and Unsupervised Deep Learning For Intrusion Detection in Smart-Grid Communication Networks
1 National Council for Scientific Research (CNRS), Institut FEMTO-ST, Université Marie et Louis Pasteur,
90000 Belfort, France; [email protected] (R.C.); [email protected] (H.N.)
2 Lebanese Atomic Energy Commission (LAEC), National Council for Scientific Research (CNRS),
Beirut 1107 2260, Lebanon
3 Faculty of Science, Beirut Arab University, Beirut 1107 2809, Lebanon
* Correspondence: [email protected] (J.A.); [email protected] (M.A.S.)
† These authors contributed equally to this work.
In 2020, the European Network of Transmission System Operators for Electricity (ENTSO-E),
a consortium of 42 European transmission system operators, found evidence of a successful
cyber intrusion into its office network. Due to the limited information provided, it was
unclear whether the attack affected the customers, stakeholders, or IT systems [3]. Other
significant cyber-attacks occurred in 2019 against Russia’s power infrastructure [4] and
in 2017 against Saudi Aramco’s petrochemical plants [5]. Ukraine’s grid was targeted in
2015, knocking out power to thousands of people [6]. The techniques used in that attack
included the exploitation of remote access tools already present in the environment
and telephone denial-of-service attacks. In 2014, cyber attackers infiltrated Korea Hydro
and Nuclear Power, South Korea’s nuclear and hydroelectric corporation, posting designs
and manuals for two nuclear reactors online and exposing the personal information of
thousands of employees [7]. The intruders attacked in three ways: (1) they used several
malware strains, (2) they exploited a vulnerability in word-processing software for the
Korean writing system, and (3) they sent phishing emails.
A variety of security methods have been proposed for the smart grid, including
encryption, authentication, virus protection, network security, and Intrusion Detection
Systems (IDSs). Intrusion detection is a network security mechanism that emphasizes
identifying and preventing recognized threats. The fundamental function of an IDS is to
monitor the network and alert system administrators upon detecting a potential threat.
Intrusion Detection Systems can be broadly categorized into Signature-Based
Intrusion Detection Systems (SIDSs) and Anomaly-Based Intrusion Detection Systems
(AIDSs). SIDSs use pattern-matching techniques to identify known attacks; this
technique is also known as knowledge-based detection. In AIDSs, a model of normal
system behavior is constructed via machine learning, deep learning, statistical
techniques, or knowledge-based approaches. A notable discrepancy between observed and
expected behavior is regarded as an anomaly, which may be interpreted as an attack [8].
Signature-based cybersecurity solutions are being progressively replaced by intelligent
cybersecurity agents. Network anomalies differ from standard network virus infections:
they are identified by detecting nonconforming patterns within the network data.
Classifying network traffic with machine learning and deep learning algorithms has
proven to be highly effective [9]. Given the difficulty of labeling large volumes of
network traffic, unsupervised learning methods are often more practical.
The design of the electrical substation has changed several times in recent years. These
developments aim to enhance communications through the use of more efficient
Ethernet and TCP/IP technology. Different protocols and abstract data models have emerged
that enable the interoperability of devices from many vendors. The Manufacturing Message
Specification (MMS) protocol is frequently used to increase process automation in IEC
61850-based power stations. However, because this protocol was not developed with
security in mind, it is susceptible to a variety of cyber-attacks [10]. This paper proposes
an unsupervised deep learning approach combined with text mining for detecting attack
sequences in MMS traffic samples. In contrast to most prior work, this paper proposes a
solution for raw and unstructured MMS data. We used text mining techniques to pre-process
the XML data generated from the MMS PCAP files. To detect attack sequences, an LSTM
autoencoder is trained on benign MMS traffic. At inference time, it reconstructs the input
sequences, and sequences that are poorly reconstructed are classified as intrusions.
IoT 2025, 6, 22 3 of 22
1.1. Motivation
The motivation of this work is to address the vulnerability of the MMS protocol to
cyber-attacks in smart grid systems by developing an unsupervised deep learning approach
for anomaly detection that does not rely on labeled data, thus overcoming the limitations
of conventional methods. The work also aims to handle high-dimensional time series data
effectively while maintaining a high True Positive Rate.
1.2. Objectives
The objective of this paper is to develop an anomaly detection method using a bidirectional
LSTM autoencoder. It also aims to implement a text-mining strategy with a TF-IDF
vectorizer and truncated SVD for data preparation and feature extraction. Furthermore,
the proposed approach focuses on creating an unsupervised model that learns from normal
samples without relying on labeled data. The model is designed to detect attack sequences
in Manufacturing Message Specification (MMS) traffic samples. Additionally, this work
addresses challenges in pre-processing raw data sources and developing a deep learning
model for unsupervised intrusion detection.
2. Related Work
This article addresses two primary challenges: (1) pre-processing raw data sources
such as Wireshark PCAP capture files, and (2) developing an effective deep learning (DL)
model for unsupervised intrusion detection. The closest work to the one proposed in this
paper is presented by Lotfollahi et al. in [11]. Their task was to classify network traffic
using the ISCXVPN2016 dataset (https://www.unb.ca/cic/datasets/vpn.html, accessed
on 12 March 2025) [12] in order to identify the applications in use by clients and allocate
suitable resources. During pre-processing, the authors removed Ethernet headers and
zero-padded UDP headers to a length of 20 bytes. They masked the IP addresses in the IP
header to limit overfitting, as the dataset was collected using a small number of hosts
and servers. The authors then transformed the raw packets into byte vectors of size 1500
and normalized them by dividing each byte by 255. The resulting byte vectors were fed into
a supervised deep learning model based on a 1D convolutional neural network (CNN)
architecture (https://github.com/munhouiani/Deep-Packet, accessed on 12 March 2025).
Feature selection is a critical step in such applications since the suitability of the
selected features can directly affect the model’s performance. In the study mentioned
above, the authors proposed a strategy based on automatic feature selection from the
PCAP files. The approach taken in this paper is different. Rather than treating each
input packet as a byte vector, we propose employing vector models and text pre-processing
to mine the generated XML from the PCAP files. The goal is to build a more data-agnostic
approach to feature extraction/selection that can be easily extended and applied to various
security applications.
Recent research on anomaly-based intrusion detection using raw PCAP files has
focused on leveraging machine learning techniques to improve detection accuracy and
efficiency. Studies have explored various algorithms, including Random Forest, Gaussian
Naive Bayes, and multilayer perceptron [13]. Autoencoders have been utilized for
dimensionality reduction and feature extraction [14,15]. The CSE-CIC-IDS2018 dataset has
been widely used for training and evaluating intrusion detection models [16]. Researchers
have also investigated real-time intrusion detection using network monitoring tools like
Wireshark [17]. Adversarial training strategies have been proposed to enhance anomaly
detection capabilities [18]. The field continues to evolve, addressing challenges such
as class imbalance and the need for improved data cleaning and reproducibility [16,19].
Zhang et al. [20] also examined traffic classification, utilizing raw network traffic data
retrieved from PCAP files. The input was represented as a fixed-size byte vector of
length m, and the authors employed an embedding layer to enrich the information in each
byte, in which each byte is represented as a one-hot vector with a dimension of
255. The authors trained a supervised one-dimensional convolutional neural network
(1D-CNN) model. Other researchers have investigated the ISCXVPN2016 dataset,
extracting features from the data through various methods, including the
conversion of PCAP files to JSON and the application of static and statistical features
(https://github.com/qa276390/Encrypted_Traffic_Classification, accessed on 12 March
2025) [21]. To the best of our knowledge, our work uniquely addresses the intrusion
detection problem by directly modeling the raw XML data extracted from PCAP files as textual
data, bypassing the traditional manual feature selection step. This approach allows us to
treat network data in their original form, leveraging their textual representation to develop
an anomaly detection framework that avoids the hand-crafted feature engineering
stages typically employed in other methods.
The literature contains several works that address the development of Intrusion
Detection Systems for MMS attacks in smart-grid and industrial control systems [22,23].
After surveying the related literature, several challenges can be identified, such as
the complex structure of MMS packets, the inability of supervised approaches
to detect unknown attacks, and the length of time required to parse data packets. This
paper is not concerned with supervised intrusion detection for several reasons. First, it
is not always possible to have labeled data when collecting network traffic, and if it is
possible, the labeling process is very time-consuming. Also, one of the weaknesses of
supervised learning in such problems is that a model learns to classify what it has seen in
the past, making it less effective against more recent and advanced threats. The following
sections present a traffic mining-based intrusion detection approach in which traffic data
are collected, and unsupervised deep learning is used to identify abnormal behaviors in
industrial networks.
computationally more expensive to compute distance and discover the nearest neighbors
in high-dimensional space.
Recent research in unsupervised anomaly detection for multivariate time series data
has focused on utilizing variational autoencoders [27,28] and generative adversarial
networks [29,30] to capture complex temporal dependencies and inter-correlations between
time series. Some models incorporate self-supervised learning [31] or self-training [29] to
improve performance on noisy or contaminated data. Other approaches employ low-rank
and sparse decomposition [32] or state-space models [33] for robust anomaly detection.
Researchers have also explored lightweight models for edge computing applications [28]
and interpretable methods for safety-critical systems [32]. These techniques have been
applied to various domains, including semiconductor manufacturing [34], water treatment
systems [32], and IoT systems [31], demonstrating improved performance over existing
methods. Unsupervised learning models can cope with the unlabeled data found in
real-world deployments, which makes real-time anomaly detection more practical.
However, typical unsupervised learning methods do not perform well with
high-dimensional time series data, and the majority of data generated in actual
industrial production processes are high-dimensional and highly dynamic.
Conventional stacked autoencoders can process high-dimensional data, but their
performance on time series data is poor, whereas LSTM networks can efficiently extract the
temporal characteristics of the data. In this paper, we combine the autoencoder
with the LSTM network and adopt the resulting LSTM-based technique for
unsupervised detection of abnormal events in smart-grid networks. Thanks to their
architecture, autoencoders can handle multidimensional nonlinear data and learn the normal
behavior of unlabeled datasets. Combining LSTM networks with autoencoders and segmenting
the dataset samples with sliding windows enables the LSTM units to capture temporal
relationships in multivariate time series. In addition, the model
combines BiLSTM (bidirectional LSTM) layers with LSTM layers, which
better exploits the long-term dependencies in the data and, in comparison to a basic LSTM
network, can also capture the influence of the data before and after the anomalous moment.
Experimental validation shows that the approach described in this study is effective at
processing high-volume, high-dimensional, unlabeled, and imbalanced time-dependent data,
and that it is well suited to real industrial environments. Our work stands out by
addressing the anomaly detection problem directly on raw multivariate time series data
treated as textual data.
4. Data Generation
4.1. Manufacturing Message Specification
This paper focuses on intrusion detection in Manufacturing Message
Specification (MMS) traffic in a power grid environment. The IEC 61850 standard applies
to the control of power grids [35], defining the communication between Intelligent Electronic
Devices (IEDs). Its objective is to replace the manufacturers’ proprietary protocols and
thus allow equipment interoperability. It describes a data model, a set of services to access
data, and mappings to protocols for using these services. The standard does not propose
a new communication protocol. Instead, it is based on existing protocols such as MMS
(ISO 9506) [36], GOOSE (Generic Object Oriented Substation Event), and a mechanism for
transmitting sampled values (Sampled Values).
5. Data Pre-Processing
MMS raw network traffic data records are stored in PCAP format files that involve a
mixture of PDU types. To apply the raw data to the anomaly detection model, it is necessary
to pre-process the original traffic data into a suitable data format. Figure 1 illustrates the
pre-processing of raw data. The significant steps taken in the pre-processing phase are
discussed in the following.
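As an illustration of the kind of conversion this phase performs, the sketch below turns the XML of one packet into a whitespace-separated "document" of keyword tokens. It is a minimal sketch and not the authors' code: the PDML-like layout of `SAMPLE`, the field names, and the helper `packet_to_document` are assumptions made for the example. Only the lowercase keyword style (for instance `invokeid` and `confirmedresponsepdu`) follows the tokens quoted in this paper.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical PDML-like export of a single MMS packet (layout assumed for illustration).
SAMPLE = """<packet>
  <proto name="mms">
    <field name="mms.confirmed_ResponsePDU" showname="confirmed-ResponsePDU">
      <field name="mms.invokeID" showname="invokeID: 42" show="42"/>
    </field>
  </proto>
</packet>"""


def packet_to_document(xml_text):
    """Flatten one packet's XML into a space-separated string of keyword tokens."""
    root = ET.fromstring(xml_text)
    tokens = []
    for field in root.iter("field"):  # document order, nested fields included
        leaf = field.get("name", "").split(".")[-1]      # drop the protocol prefix
        token = re.sub(r"[^a-z0-9]", "", leaf.lower())   # keep letters and digits only
        if token:
            tokens.append(token)
    return " ".join(tokens)


document = packet_to_document(SAMPLE)
print(document)  # confirmedresponsepdu invokeid
```

Treating field names as tokens and discarding values is one possible design choice; which parts of the XML are kept would depend on the capture tool and on the attacks of interest.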
Vocabulary design
One disadvantage of the TF-IDF weighting method is that the vocabulary can
grow very large. Given the volume of collected data, the curse of dimensionality is
unavoidable. Documents must then be encoded as very large vectors,
which places heavy demands on memory and slows down training. To address
the dimensionality issue of the BoW model, truncated Singular Value Decomposition (SVD)
was used [39]. Truncated SVD was selected over standard SVD and
Principal Component Analysis (PCA) for dimensionality reduction because it
is more computationally efficient. The transformed feature
vectors derived from the MMS packets are sparse, and truncated SVD handles such sparse
data better than PCA or standard SVD. PCA requires computing the covariance matrix, which
necessitates operating on the entire matrix and increases the processing overhead. Similarly,
given an M × N matrix, standard SVD always yields a matrix with N columns, whereas
truncated SVD can yield a matrix with any chosen number of columns. Figure 3 illustrates the
procedure for generating dimensionality-reduced feature vectors from an example of three
MMS documents.
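The reduction described above can be sketched on three toy documents. This is an illustrative sketch under stated assumptions, not the pipeline used in the paper: the token strings other than `invokeid` and `confirmedresponsepdu` are made up, the smoothed IDF and L2 normalization are one common TF-IDF variant, and the truncated SVD is computed by plain power iteration with deflation so that the example needs only the standard library. In practice a library implementation operating on sparse matrices would normally be used.

```python
import math
from collections import Counter


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def tfidf(docs):
    """Return (vocabulary, L2-normalized TF-IDF rows) for whitespace-tokenized docs."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(tokenized)
    df = Counter(t for doc in tokenized for t in set(doc))  # document frequency
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        # Smoothed IDF (an assumed variant): log((1 + n) / (1 + df)) + 1
        row = [tf[t] * (math.log((1 + n) / (1 + df[t])) + 1.0) for t in vocab]
        norm = math.sqrt(dot(row, row)) or 1.0
        rows.append([x / norm for x in row])
    return vocab, rows


def truncated_svd(X, k, iters=100):
    """Project X onto its top-k right singular vectors (power iteration + deflation).

    The matrix is used as-is: no centering and no covariance matrix are needed.
    """
    n_cols = len(X[0])
    components = []
    for _ in range(k):
        v = [1.0 / (j + 1) for j in range(n_cols)]  # deterministic starting vector
        for _ in range(iters):
            for c in components:  # stay orthogonal to components already found
                proj = dot(v, c)
                v = [a - proj * b for a, b in zip(v, c)]
            Xv = [dot(row, v) for row in X]
            w = [sum(X[i][j] * Xv[i] for i in range(len(X))) for j in range(n_cols)]
            norm = math.sqrt(dot(w, w))
            if norm == 0:
                raise ValueError("rank of X is smaller than k")
            v = [a / norm for a in w]
        components.append(v)
    reduced = [[dot(row, c) for c in components] for row in X]
    return reduced, components


# Three toy "MMS documents"; most tokens are hypothetical placeholders.
docs = [
    "confirmedrequestpdu invokeid read variableaccessspecification",
    "confirmedresponsepdu invokeid read listofaccessresult success",
    "confirmederrorpdu invokeid serviceerror",
]
vocab, X = tfidf(docs)
reduced, components = truncated_svd(X, k=2)
print(len(vocab), "TF-IDF columns reduced to", len(reduced[0]))
```

Each document keeps one row, but its width drops from the vocabulary size to the chosen number of components, which is the property the paragraph above relies on.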
deciding which states have the most significant impact on the present state rather than
merely selecting recent states.
An autoencoder is an unsupervised neural network model composed of two stages:
encoding and decoding. By mapping the raw data to a low-dimensional space, the encoder
can learn the significant features and patterns in the input data. From the low-dimensional
space, the decoder can reconstruct the original input data.
In this paper, we combine an LSTM network and an autoencoder: both the encoder and
the decoder are built from two LSTM layers. The encoder extracts features from
time series data, and the decoder reconstructs samples from the extracted features. Our
problem involves multivariate time series data, in which multiple variables are monitored
over time. Sequences of MMS packets represented as feature vectors are used to train
the LSTM autoencoder for rare-event detection. During the training phase, the model
learns to reconstruct regular MMS traffic; during testing, if the reconstruction error is
large, the input is classified as a potential attack.
(packet). In this problem, the number of features after applying dimensionality reduction
with SVD was 1017.
The encoder begins with a bidirectional LSTM layer, followed by a dropout layer and
another LSTM layer. Stacking LSTM layers requires that each timestep cell of one layer be
connected to the corresponding cell of the next. The first bidirectional LSTM layer
therefore emits an output at every timestep (return_sequences=True). In the second LSTM
layer, only the last timestep cell emits an output (return_sequences=False). The output
of the encoder is, therefore, a vector.
To use the encoded features as input for the decoder, which starts with an LSTM layer,
the feature vector must be duplicated (RepeatVector) to create a lookback × features
array. The decoder is composed of an LSTM layer followed by a dropout layer and a
bidirectional LSTM layer. Gaussian noise is added after the bidirectional LSTM layer to
improve robustness and reduce overfitting. A TimeDistributed layer at the end of
the decoder produces an output with the same shape as the input.
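The shape bookkeeping in this description can be traced layer by layer. The trace below is a sketch under assumptions: the lookback of 4, the 1017 input features, the 256-dimensional code, and the 624-dimensional bidirectional output are the values shown in the architecture figure, while the hidden sizes $h_1$ and $h_2$ are placeholders that the text does not specify. Any bias term in the final layer is omitted.

```latex
\begin{aligned}
X &\in \mathbb{R}^{4 \times 1017}
  && \text{input window (lookback} \times \text{features)} \\
H^{(1)} = \mathrm{BiLSTM}(X) &\in \mathbb{R}^{4 \times 2h_1}
  && \texttt{return\_sequences=True} \\
z = \mathrm{LSTM}\big(\mathrm{Dropout}(H^{(1)})\big) &\in \mathbb{R}^{256}
  && \texttt{return\_sequences=False} \\
R = \mathrm{RepeatVector}_{4}(z) &\in \mathbb{R}^{4 \times 256}
  && \text{code repeated once per timestep} \\
H^{(2)} = \mathrm{LSTM}(R) &\in \mathbb{R}^{4 \times h_2} \\
U = \mathrm{BiLSTM}\big(\mathrm{Dropout}(H^{(2)})\big) + \varepsilon &\in \mathbb{R}^{4 \times 624}
  && \varepsilon \text{: Gaussian noise, training only} \\
\hat{X} = U V,\quad V \in \mathbb{R}^{624 \times 1017} &\in \mathbb{R}^{4 \times 1017}
  && \text{TimeDistributed dense layer}
\end{aligned}
```

The last line shows why the TimeDistributed layer restores the input shape: the same weight matrix $V$ is applied independently at each of the four timesteps.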
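[Figure: architecture of the bidirectional LSTM autoencoder. ENCODER: sliding windows
(window 1, window 2, ...) of MMS packets, each packet described by features 1 to n over
time, pass through the recurrent layers and dropout to give the encoded features.
Layer 3: RepeatVector(4), input 1 × 256, output 4 × 256. DECODER: dropout and Gaussian
noise, then a matrix multiplication U (4 × 624) × V (624 × 1017) giving an output of
4 × 1017.]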
The autoencoder introduces errors when decoding the encoded features and
reconstructing the samples. It is trained with back-propagation to minimize the
reconstruction error. During the training phase, the autoencoder is fed with normal data.
By minimizing the mean squared error between the reconstructed and original samples,
the autoencoder learns the implicit features and patterns in the normal data. As a result,
the reconstruction error of normal samples is small during the testing phase, whereas
the error in reconstructing abnormal samples is relatively large (because the model
has not learned the implicit features and patterns of abnormal samples). This
paper therefore uses the reconstruction error as the sample’s anomaly score.
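The scoring rule can be sketched in a few lines. This is a minimal sketch under assumptions: windows are plain nested lists, the "reconstructions" are supplied by hand rather than by a trained model, and the percentile-based threshold in `percentile_threshold` is a hypothetical choice, since the text does not state how the decision threshold is selected.

```python
import math


def window_mse(original, reconstructed):
    """Mean squared error over every packet and feature of one window."""
    total, count = 0.0, 0
    for row_o, row_r in zip(original, reconstructed):
        for o, r in zip(row_o, row_r):
            total += (o - r) ** 2
            count += 1
    return total / count


def percentile_threshold(benign_errors, q=99.0):
    """Nearest-rank q-th percentile of errors measured on benign windows (assumed rule)."""
    ordered = sorted(benign_errors)
    rank = max(0, math.ceil(q * len(ordered) / 100) - 1)
    return ordered[rank]


def flag_anomalies(errors, threshold):
    """A window is flagged when its anomaly score exceeds the threshold."""
    return [e > threshold for e in errors]


# One 2-packet, 2-feature window and a reconstruction that misses one value by 2.0.
score = window_mse([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0], [3.0, 2.0]])

# Made-up benign scores used to calibrate the threshold, then two unseen windows.
benign = [0.010, 0.012, 0.011, 0.013, 0.009, 0.014, 0.010, 0.012, 0.011, 0.015]
threshold = percentile_threshold(benign, q=90.0)
flags = flag_anomalies([0.011, 0.250], threshold)
print(score, threshold, flags)
```

Calibrating on benign data only keeps the procedure unsupervised; a higher percentile trades missed attacks for fewer false alarms.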
7. Experimental Results
7.1. Dataset
The testing set is unbalanced; it has 92,014 normal windows and 119 anomalous
moments. Notably, each window is M × N in size, with M denoting the window width
(number of packets) and N being the number of features. Our experiments included a
variety of different input size combinations. In the following, we will continue using a
window of four packets and 1017 features, as this combination produced the best outcome.
The window’s width should ensure that the window covers the duration of anomalous
events. In this context, four packets were enough to ensure that the entire abnormal event
could be contained within the window width of the abnormal sample.
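The windowing step can be sketched as follows. The helper name `sliding_windows` and the stride of one packet are assumptions for the example, because the text fixes only the window width; the toy packets have two features instead of 1017.

```python
def sliding_windows(packets, width=4, stride=1):
    """Cut a sequence of per-packet feature vectors into windows of `width` packets."""
    if width <= 0 or stride <= 0:
        raise ValueError("width and stride must be positive")
    return [packets[i:i + width]
            for i in range(0, len(packets) - width + 1, stride)]


# Six toy packets with two features each.
packets = [[float(t), float(t) * 0.5] for t in range(6)]
windows = sliding_windows(packets, width=4)

# Each window is an M x N block: M = 4 packets, N = 2 features here.
print(len(windows), len(windows[0]), len(windows[0][0]))  # 3 4 2
```

With a stride of one, an anomalous event that spans up to four consecutive packets is fully contained in at least one window.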
$$\text{True Positive Rate (TPR)/Recall} = \frac{TP}{TP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{False Positive Rate (FPR)} = \frac{FP}{TN + FP}$$
True positive (TP) refers to the accurate identification of an attack sequence. False
positive (FP) refers to the classification of a normal sequence as an attack sequence. True
negative (TN) indicates that a normal sequence has been correctly classified, whereas false
negative (FN) indicates that an attack sequence has been incorrectly classified as a normal
sequence. The receiver operating characteristic (ROC) curve and the area under the curve
(AUC) are frequently used to evaluate the quality of a binary classifier. The ROC curve
takes the False Positive Rate as the horizontal axis and the True Positive Rate as the
vertical axis, and is traced out as the decision threshold varies. The AUC is the area
under the ROC curve and takes values between 0 and 1. It gives an intuitive summary of
the model’s quality, with a larger value indicating a better model.
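The construction of the ROC curve and its area can be sketched directly from anomaly scores. The scores and labels below are made up for illustration, and the helper names `roc_points` and `auc` are hypothetical; label 1 marks an attack sequence and a higher score means "more anomalous".

```python
def roc_points(scores, labels):
    """Sweep the threshold from high to low and collect (FPR, TPR) points."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(pairs):
        current = pairs[i][0]
        # Samples sharing a score cross the threshold together.
        while i < len(pairs) and pairs[i][0] == current:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Trapezoidal area under a curve given as (x, y) points with increasing x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.3, 0.2, 0.1]   # made-up reconstruction errors
labels = [1, 0, 1, 0, 0]             # 1 = attack, 0 = normal
curve = roc_points(scores, labels)
area = auc(curve)
print(round(area, 4))  # 0.8333
```

In this toy case five of the six attack/normal pairs are ranked correctly, which is exactly the 5/6 returned as the area.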
Figure 5. Reconstruction error per percentile for benign and attack sequences.
$$\text{True Negative Rate (TNR)} = \frac{TN}{TN + FP} = \frac{91{,}051}{91{,}051 + 63} = 0.9989$$
In this highly imbalanced setting, where normal sequences vastly outnumber attack
sequences, precision is not the most reliable metric to evaluate the model. Precision is highly
affected by the number of false positives (FPs), which are inevitable when the majority class
(normal sequences) is significantly larger. Even a small fraction of false positives can make
precision appear low, despite the model being effective in distinguishing between classes.
Instead, the True Positive Rate (TPR, or recall) and True Negative Rate (TNR) provide
a fairer assessment, as they measure how well the model captures actual attacks and avoids
misclassifying normal sequences. The Balanced Accuracy metric accounts for both classes
equally, making it a more suitable measure in cases of extreme class imbalance.
Additionally, the F2-score, which prioritizes recall over precision, is more relevant for
intrusion detection. In security-critical applications, it is preferable to flag more potential
threats (even at the cost of some false positives) than to miss actual attacks. The model’s
high recall (96.63%) and strong TNR (99.89%) indicate that it effectively captures anomalies
while keeping false alarms to a minimum.
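The metrics discussed here follow directly from the four confusion-matrix counts. The sketch below uses hypothetical counts chosen only to show how precision can look modest under class imbalance while recall, TNR, Balanced Accuracy, and the F2-score remain high; they are not the results of this paper. Balanced Accuracy is taken as the mean of TPR and TNR, and the F-beta score uses the standard definition with beta = 2.

```python
def detection_metrics(tp, fp, tn, fn, beta=2.0):
    """Compute imbalance-aware detection metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)            # True Positive Rate
    precision = tp / (tp + fp)
    fpr = fp / (fp + tn)               # False Positive Rate
    tnr = tn / (tn + fp)               # True Negative Rate
    balanced_accuracy = (recall + tnr) / 2
    b2 = beta * beta
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return {
        "recall": recall,
        "precision": precision,
        "fpr": fpr,
        "tnr": tnr,
        "balanced_accuracy": balanced_accuracy,
        "f_beta": f_beta,
    }


# Hypothetical counts: 100 attack windows against 10,000 normal windows.
metrics = detection_metrics(tp=90, fp=50, tn=9950, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Here only 0.5% of normal windows are misclassified, yet precision is about 0.64 because the normal class is a hundred times larger, which is the effect described above.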
Table 1. Comparison of classification performance between the proposed approach and the Doc2Vec-
based approach.
Doc2Vec and other word embedding methods are designed to capture semantic meaning
and contextual similarity, making them well suited for natural language processing
tasks. However, in industrial network traffic analysis, the following challenges make
embeddings less effective:
1. Lack of Semantic Relationships: The keywords in MMS packets do not carry contextual
meaning. Words like “invokeid” or “confirmedresponsepdu” do not form meaningful
phrases; they act as standalone identifiers. Embeddings assume that words have
meaningful relationships, which does not apply in this case.
2. Keyword Presence Matters More Than Context: In MMS packets, the order of keywords
is far less important than their presence. A packet containing “errorpdu” is critical
to flag, regardless of its position in the sequence. TF-IDF naturally emphasizes such
occurrences, whereas embeddings attempt to infer meaning from order and context,
leading to potential loss of important indicators.
3. Better Feature Interpretability: With TF-IDF and SVD, we can directly interpret the
most influential terms contributing to classification, making debugging and refining
the model easier. Doc2Vec produces dense vectors that obscure which words are
influencing the decision.
4. Sparse Representation is Beneficial in This Case: Since the MMS vocabulary is technical
and limited, TF-IDF efficiently transforms the data into a high-dimensional sparse
representation where rare but critical terms (such as attack indicators) retain their
importance. In contrast, embeddings distribute word meanings over a continuous
space, which can dilute the impact of rare but crucial terms.
Overall, these characteristics make TF-IDF a more suitable choice for analyzing MMS
network packets, where detecting anomalies relies on the presence of specific words rather
than understanding linguistic structure.
Figure 9. Performance metrics across different models for varying window sizes.
The results presented in Table 2 and Figure 9 highlight several key observations:
• The BiLSTM Autoencoder achieves the highest recall (0.979) among unsupervised
approaches, meaning it correctly detects almost all attack packets.
• Precision is slightly lower for BiLSTM than for some classical ML models, indicating
a small number of false positives. However, in security applications, high recall is
prioritized over precision to minimize undetected threats.
• The Local Outlier Factor (LOF) also achieves a high recall (0.941), but with lower
overall Balanced Accuracy.
• The Gaussian Mixture Model (GMM) and KNN perform poorly, suggesting that density-
and distance-based approaches struggle with the high dimensionality of network traffic data.
• For reference, we included a supervised baseline model using BERT, trained on labeled
data. As expected, the BERT-based model achieved the highest performance, with an
F1-score of 0.95. However, in real-world scenarios, labeled attack data are scarce,
making fully supervised methods impractical for intrusion detection.
8. Discussion
The approach presented in this paper utilizes a combination of text mining techniques
and deep learning for unsupervised anomaly detection in smart-grid communication
networks. The core of the AI model is a bidirectional LSTM autoencoder. This architecture
was chosen for its ability to handle high-dimensional time series data and capture
temporal relationships in multivariate time series. The LSTM-based approach offers several
advantages, including the ability to learn long-term dependencies in the data. By
combining LSTM networks with autoencoders, the model can effectively process high-volume,
high-dimensional, unlabeled, and imbalanced time-dependent data.
Most studies in the state of the art model network traffic as structured,
tabular data (often in CSV format), whereas this approach treats MMS packets as unstructured
text data. This fundamental difference in data representation requires a different
methodology and makes direct comparisons with existing methods challenging. In contrast to
the majority of work on anomaly-based intrusion detection, we formulated the problem as
a natural language processing (NLP) task. By treating each MMS packet as a “document”
and applying text mining techniques such as TF-IDF vectorization and truncated SVD,
our approach can effectively handle the variable nature of MMS packet content. This
is particularly advantageous because MMS packets can contain different information
depending on their status, making traditional structured data approaches less effective.
However, it is important to acknowledge that the field of AI is rapidly evolving,
with more advanced techniques emerging since the development of this approach. We
recognize this and outline our plans for future work, which include leveraging recent
advancements in transformers and generative AI for intrusion detection. For instance,
transformers have demonstrated exceptional performance in various sequence-based tasks.
Adapting transformer architectures for anomaly detection in network traffic could enhance
the model’s ability to capture complex patterns and dependencies. Additionally, generative
AI presents an opportunity to generate synthetic normal and anomalous traffic patterns,
addressing the challenge of imbalanced datasets and improving the model’s ability to
detect rare attack sequences.
Among the areas that can be tackled in future work, we highlight the following:
• Exploring synthetic data generation techniques to increase the number of anomalous
samples while preserving their key characteristics.
• Implementing cost-sensitive learning approaches that assign higher weights to the
minority class during training and evaluation, ensuring better detection of rare attacks.
• Investigating alternative dimensionality reduction techniques, such as t-SNE and
UMAP, to explore whether they can better capture the structure of MMS packet
data. While t-SNE is known to be computationally expensive for large datasets, its
application may be feasible given the limited vocabulary in MMS traffic. Similarly,
UMAP could offer improved feature separation with lower computational costs.
• Evaluating SVD-BERT as a potential enhancement to the existing text-mining pipeline,
as it combines the strengths of dimensionality reduction with transformer-based
embeddings, potentially improving feature representation for anomaly detection.
These enhancements aim to improve the robustness and generalizability of intrusion
detection models, making them more effective in real-world applications.
9. Conclusions
This paper proposed an unsupervised anomaly detection approach based on text
mining and deep learning to address the need for autonomous intrusion detection in
smart-grid communication networks. Numerous attack scenarios against industrial control
systems are possible; this paper focused on Manufacturing Message Specification (MMS)
traffic. It first presented a technique for preparing and extracting
features from raw MMS packets using a TF-IDF vectorizer and a truncated SVD. Rather
than manually picking features from each MMS packet, this approach treats each packet as a
document and represents it using the Bag of Words (BoW) text pre-processing approach.
Unlike embeddings, which aim to capture semantic relationships, this representation was more
effective for MMS traffic since anomaly detection relies on the presence of specific
protocol-related keywords rather than contextual meaning. The paper then proposed a
bidirectional LSTM autoencoder for unsupervised “sequence-aware” detection. The results
suggest that unsupervised deep learning approaches could be used in place of supervised
methods for intrusion detection when labeling is either impractical or time-consuming.
After training on normal traffic, the proposed model produced acceptable false positive
and false negative rates when applied to unseen traffic with injected attack sequences.
The weakness of this approach, and of unsupervised methods in general, is the medium to
high rate of false alarms (false positives) when the training data are incomplete. This
indicates that techniques based on artificial intelligence can be beneficial and supportive
in industrial control systems but are not yet ready to be the primary intrusion detection
actor.
Author Contributions: Conceptualization, J.A.; Methodology, J.A. and M.A.S.; Software, J.A.; Valida-
tion, R.C. and H.N.; Formal analysis, M.A.S.; Investigation, R.C.; Writing—original draft, J.A. and
M.A.S.; Writing—review & editing, H.N.; Visualization, M.A.S.; Supervision, R.C. All authors have
read and agreed to the published version of the manuscript.
Data Availability Statement: The data presented in this study are available on request from the
corresponding author due to privacy restrictions. A sample of the code used is available at
https://github.com/josephazar/Text-Mining-IDS (accessed on 12 March 2025).
Acknowledgments: This work has been supported by the EIPHI Graduate School (contract
“ANR-17-EURE-0002”). During the preparation of this manuscript, the authors used ChatGPT for the purposes
of reformulating and correcting grammatical errors. The authors have reviewed and edited the
output and take full responsibility for the content of this publication.
References
1. Otuoze, A.O.; Mustafa, M.W.; Larik, R.M. Smart grids security challenges: Classification by sources of threats. J. Electr. Syst. Inf.
Technol. 2018, 5, 468–483. [CrossRef]
2. Desarnaud, G. Cyber Attacks and Energy Infrastructures: Anticipating Risks. French Institute of International Relations (IFRI).
2017. Available online:
https://www.ifri.org/sites/default/files/migrated_files/documents/atoms/files/desarnaud_cyber_attacks_energy_infrastructures_2017_2.pdf
(accessed on 12 March 2025).
28. Fan, J.; Liu, Z.; Wu, H.; Wu, J.; Si, Z.; Hao, P.; Luan, T.H. LUAD: A lightweight unsupervised anomaly detection scheme for
multivariate time series data. Neurocomputing 2023, 557, 126644. [CrossRef]
29. Zhang, Z.; Li, W.; Ding, W.; Zhang, L.; Lu, Q.; Hu, P.; Gui, T.; Lu, S. STAD-GAN: Unsupervised Anomaly Detection on Multivariate
Time Series with Self-training Generative Adversarial Networks. ACM Trans. Knowl. Discov. Data 2023, 17, 1–18. [CrossRef]
30. Kong, L.; Yu, J.; Tang, D.; Song, Y.; Han, D. Multivariate Time Series Anomaly Detection with Generative Adversarial Networks
Based on Active Distortion Transformer. IEEE Sens. J. 2023, 23, 9658–9668. [CrossRef]
31. Jiao, Y.; Yang, K.; Song, D.; Tao, D. TimeAutoAD: Autonomous Anomaly Detection with Self-Supervised Contrastive Loss for
Multivariate Time Series. IEEE Trans. Netw. Sci. Eng. 2022, 9, 1604–1619. [CrossRef]
32. Belay, M.A.; Rasheed, A.; Rossi, P.S. Multivariate Time Series Anomaly Detection via Low-Rank and Sparse Decomposition. IEEE
Sens. J. 2024, 24, 34942–34952. [CrossRef]
33. Li, L.; Yan, J.; Wen, Q.; Jin, Y.; Yang, X. Learning Robust Deep State Space for Unsupervised Anomaly Detection in Contaminated
Time-Series. IEEE Trans. Knowl. Data Eng. 2022, 35, 6058–6072. [CrossRef]
34. Hwang, R.; Park, S.; Bin, Y.; Hwang, H.J. Anomaly Detection in Time Series Data and its Application to Semiconductor
Manufacturing. IEEE Access 2023, 11, 130483–130490. [CrossRef]
35. Tan, H.C.; Mohanraj, V.; Chen, B.; Mashima, D.; Nan, S.K.S.; Yang, A. An IEC 61850 MMS Traffic Parser for Customizable
and Efficient Intrusion Detection. In Proceedings of the 2021 IEEE International Conference on Communications, Control, and
Computing Technologies for Smart Grids (SmartGridComm), Aachen, Germany, 25–28 October 2021; IEEE: Piscataway, NJ, USA,
2021; pp. 194–200.
36. International Organization for Standardization. ISO 9506: Industrial Automation Systems—Manufacturing Message Specification
(MMS)—Part 1: Service Definition and Part 2: Protocol Specification. Originally Published in 1990, Revised in 2003. Available
online: https://www.iso.org/standard/37079.html (accessed on 12 March 2025).
37. Qaiser, S.; Ali, R. Text mining: Use of TF-IDF to examine the relevance of words to documents. Int. J. Comput. Appl. 2018,
181, 25–29.
38. Dzisevič, R.; Šešok, D. Text classification using different feature extraction approaches. In Proceedings of the 2019 Open
Conference of Electrical, Electronic and Information Sciences (eStream), Vilnius, Lithuania, 25 April 2019; IEEE: Piscataway, NJ,
USA, 2019; pp. 1–4.
39. Du, S.S.; Wang, Y.; Singh, A. On the power of truncated SVD for general high-rank matrix estimation problems. In Proceedings of
the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30,
pp. 2017–2026. Available online:
https://papers.nips.cc/paper_files/paper/2017/hash/89f0fd5c927d466d6ec9a21b9ac34ffa-Abstract.html
(accessed on 12 March 2025).
40. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef]
41. Klambauer, G.; Unterthiner, T.; Mayr, A.; Hochreiter, S. Self-normalizing neural networks. In Proceedings of the Advances in
Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.