Computational Intelligence and Machine Learning: Models and Applications
Edited by
Grzegorz Dudek
Guest Editor
Grzegorz Dudek
Editorial Office
MDPI AG
Grosspeteranlage 5
4052 Basel, Switzerland
This is a reprint of the Special Issue, published open access by the journal Electronics (ISSN 2079-9292),
freely accessible at: [Link] issues/0786LJ4A25.
For citation purposes, cite each article independently as indicated on the article page online and as
indicated below:
Lastname, A.A.; Lastname, B.B. Article Title. Journal Name Year, Volume Number, Page Range.
© 2025 by the authors. Articles in this book are Open Access and distributed under the Creative
Commons Attribution (CC BY) license. The book as a whole is distributed by MDPI under the terms
and conditions of the Creative Commons Attribution-NonCommercial-NoDerivs (CC BY-NC-ND)
license ([Link]
Contents
Preface
Grzegorz Dudek
Computational Intelligence and Machine Learning: Advances in Models and Applications
Reprinted from: Electronics 2025, 14, 1530, [Link]
Haoyu Xu, Guodong Wu, Enting Zhai, Xiu Jin and Lijing Tu
Preference-Aware Light Graph Convolution Network for Social Recommendation
Reprinted from: Electronics 2023, 12, 2397, [Link]
Hüseyin Polat, Alp Kaan Turan, Cemal Koçak and Hasan Basri Ulaş
Implementation of a Whisper Architecture-Based Turkish Automatic Speech Recognition (ASR)
System and Evaluation of the Effect of Fine-Tuning with a Low-Rank Adaptation (LoRA)
Adapter on Its Performance
Reprinted from: Electronics 2024, 13, 4227, [Link]
Meshrif Alruily
ArRASA: Channel Optimization for Deep Learning-Based Arabic NLU Chatbot Framework
Reprinted from: Electronics 2022, 11, 3745, [Link] electronics11223745
Caiping Hu, Xuekui Sun, Hua Dai, Hangchuan Zhang and Haiqiang Liu
Research on Log Anomaly Detection Based on Sentence-BERT
Reprinted from: Electronics 2023, 12, 3580, [Link]
Samet Memiş
Picture Fuzzy Soft Matrices and Application of Their Distance Measures to Supervised
Learning: Picture Fuzzy Soft k-Nearest Neighbor (PFS-kNN)
Reprinted from: Electronics 2023, 12, 4129, [Link]
Jenny Aracely Segovia, Jonathan Fernando Toaquiza, Jacqueline Rosario Llanos and David
Raimundo Rivas
Meteorological Variables Forecasting System Using Machine Learning and Open-Source
Software
Reprinted from: Electronics 2023, 12, 1007, [Link]
Mohamed Ashik Shahul Hameed, Asifa Mehmood Qureshi and Abhishek Kaushik
Bias Mitigation via Synthetic Data Generation: A Review
Reprinted from: Electronics 2024, 13, 3909, [Link]
About the Editor
Grzegorz Dudek
Grzegorz Dudek is a professor of Information and Communication Technology. He received
his PhD in Electrical Engineering from the Czestochowa University of Technology (CUT), Poland, in
2003, and he completed his habilitation in Computer Science at the Lodz University of Technology,
Poland, in 2013. In 2023, he was appointed as a full professor. Currently, he is a professor at
the Department of Electrical Engineering, CUT, and Department of Mathematics and Computer
Science, University of Lodz. His research primarily focuses on machine learning and artificial
intelligence, with a strong emphasis on their applications in classification, regression, forecasting,
and optimization. He has authored and co-authored four books and over 140 scientific papers in
these areas. Notably, he has been recognized in the global ranking of the world’s most influential
scientists (top 2% list) compiled by Stanford University and Elsevier.
Preface
This Special Issue brings together a diverse and timely selection of research contributions
that explore the theoretical foundations, methodological innovations, and practical applications of
intelligent computational systems. The aim of this collection is to highlight emerging trends and
pressing challenges in machine learning and computational intelligence, offering new perspectives
on how these technologies can be more effectively developed, adapted, and deployed in real-world
environments.
The scope of the Special Issue spans a wide range of topics, including personalized
recommendation systems, social network analysis, predictive modeling, speech recognition in
low-resource languages, log anomaly detection, natural language understanding in Arabic, bias
mitigation through synthetic data generation, and meteorological forecasting. The studies presented
here employ a rich set of methodologies, such as transformer architectures, graph neural networks,
retrieval-augmented generation, self-supervised learning, and fuzzy logic frameworks, to address the
complexity, uncertainty, and dynamism of real-world data.
The motivation behind compiling this Special Issue stems from the growing importance of
machine learning in shaping decision-making systems, user experiences, and data-driven services
across various domains. As AI continues to evolve from experimental prototypes into large-scale
systems with social and economic impact, there is an urgent need to ensure these technologies are
not only powerful and efficient but also interpretable, fair, and inclusive. This collection reflects that
shift, with several papers emphasizing the importance of model transparency, domain adaptation,
and ethical considerations.
The Special Issue is intended for a broad audience that includes researchers, practitioners, and
graduate students in computer science, artificial intelligence, data science, and engineering. It may
also serve as a resource for professionals working in application areas such as healthcare, finance,
transportation, energy, and software systems who seek to understand and apply the latest advances
in machine learning and computational intelligence to their specific challenges.
By bringing together these diverse yet thematically connected contributions, we hope
this volume will inform, inspire, and guide future research and development in the field,
encouraging new collaborations, methodological refinements, and impactful applications of
intelligent technologies.
Grzegorz Dudek
Guest Editor
Editorial
1. Introduction
Computational intelligence (CI) and machine learning (ML) have evolved into founda-
tional pillars of modern data-driven research, with growing impacts across domains such
as engineering, medicine, finance, and environmental science [1]. Their capacity to learn
patterns from data and adapt to dynamic environments makes them indispensable tools
for both academic and industrial innovation. The past decade has seen a surge in interest
and the practical deployment of CI and ML models, ranging from classical techniques like
decision trees and support vector machines to recent breakthroughs in deep learning and
large language models [2].
Despite their progress, the development and application of CI and ML algorithms re-
main complex and challenging. A persistent issue lies in the appropriate selection of model
architectures and training strategies to ensure both learning efficacy and generalization [3].
This challenge becomes even more pronounced in practical contexts where data may be
noisy, sparse, high-dimensional, or subject to dynamic shifts. Moreover, the increasing
societal reliance on AI systems has heightened the demand for models that are not only
accurate but also interpretable, fair, and robust [4].
In response to these demands, recent research has explored both foundational improvements
to learning mechanisms and application-specific enhancements. For instance, one of the
highlighted contributions in this Special Issue addresses the limitations of traditional
recommendation systems by incorporating generative AI with psychological modeling to
personalize travel recommendations. Another study introduces self-supervised learning into
graph-based collaborative filtering, achieving better representation learning and reducing
the reliance on labeled data. Further advancements are seen in the use of preference-aware
graph neural networks to filter social signals and in the adaptation of large transformer-based
models to improve automatic speech recognition in low-resource languages like Turkish.
This Special Issue brings together ten papers selected from 40 submissions that exemplify
the diversity and maturity of current research in CI and ML. These works not only introduce
novel algorithms and architectures but also demonstrate how to rigorously evaluate them in
real-world settings, ranging from environmental prediction using open-source ML toolkits to
bias mitigation in healthcare through synthetic data generation. A recurring theme across
these papers is the emphasis on data-centric methodologies, from feature engineering and
data augmentation to the design of metrics for fairness, utility, and interpretability.
Together, these studies illustrate the field’s transition toward more specialized,
context-aware, and socially responsible AI systems. They reflect the community’s ongoing
effort to balance model complexity with usability, accuracy with equity, and innovation with
reproducibility. In doing so, they offer valuable insights not only into technical progress but
also into the broader implications of deploying intelligent systems across various domains
of human activity.
Received: 25 March 2025
Accepted: 9 April 2025
Published: 10 April 2025
Citation: Dudek, G. Computational Intelligence and Machine Learning: Advances in Models and
Applications. Electronics 2025, 14, 1530. https://[Link]/10.3390/electronics14081530
Copyright: © 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open
access article distributed under the terms and conditions of the Creative Commons
Attribution (CC BY) license ([Link] licenses/by/4.0/).
Electronics 2025, 14, 1530
The core problem being solved is the inefficiency of existing GNN-based recommenda-
tion systems in handling sparse interactions and noisy data. Traditional models struggle to
accurately represent user preferences, especially for long-tail items that lack sufficient inter-
actions. This issue is crucial because it affects the quality and diversity of recommendations,
limiting the personalization potential of recommender systems. Additionally, most models
operate in a fully supervised paradigm, which heavily relies on explicit user feedback that
is often scarce and biased.
To address these issues, the proposed method incorporates self-supervised learning
(SSL) into a graph attention network (GAT)-based collaborative filtering framework. The
model consists of two primary components: a supervised learning task using a multi-head
GAT and an auxiliary self-supervised learning task that enhances
representation learning. The GAT component refines node representations by assigning
different importance weights to neighboring nodes, mitigating the impact of noisy data.
The self-supervised learning task employs contrastive learning, generating multiple views
of each node through graph data augmentation techniques such as node masking, edge
masking, and layer masking. The model maximizes the consistency between different views
of the same node while minimizing the consistency between the views of different nodes.
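The contrastive objective described above can be sketched in a few lines. This is a minimal illustration, assuming an InfoNCE-style loss with cosine similarity; the summary does not state the exact loss used by the authors, and the function names and the temperature value of 0.2 are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def info_nce(view_a, view_b, temperature=0.2):
    """Mean InfoNCE loss over nodes.

    view_a[i] and view_b[i] are embeddings of the same node under two
    augmentations (the positive pair); view_b[j] for j != i act as negatives.
    The temperature is an assumed hyperparameter, not a value from the paper.
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / n
```

Minimizing this quantity pulls the two views of each node together and pushes views of different nodes apart, which is the behaviour the paragraph describes.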
This study conducts extensive experiments on three benchmark datasets—Yelp2018,
Gowalla, and Amazon—to evaluate the effectiveness of SGACF. The results demonstrate
significant improvements in accuracy and robustness compared to existing methods. The
model outperforms state-of-the-art recommendation models, including Neural Matrix
Factorization (NeuMF), Spectral Collaborative Filtering, and Neural Graph Collaborative
Filtering (NGCF). Notably, SGACF achieves better recall and normalized discounted cumu-
lative gain (NDCG) scores, particularly in mitigating the long-tail problem by enhancing
the representation of low-degree nodes.
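For readers unfamiliar with the two ranking metrics named above, the following sketch computes Recall@K and NDCG@K for a single user. It assumes binary relevance (an item is either in the held-out set or not), which is a common convention but is not confirmed by this summary; the function names are illustrative.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Normalized discounted cumulative gain with binary relevance."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Recall only counts whether relevant items were retrieved, while NDCG also rewards placing them near the top of the list.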
The key innovations of this work include the integration of self-supervised contrastive
learning into graph-based recommendation models, the use of multi-head graph attention
to improve representation learning, and the introduction of novel data augmentation
strategies for graph-based learning. Unlike previous approaches, this method effectively
reduces the reliance on explicit user feedback, improves model generalization, and enhances
recommendation diversity.
The contributions of this research are substantial. It establishes a new paradigm
for self-supervised learning in recommendation systems, demonstrating that auxiliary
self-supervised tasks can significantly enhance supervised learning. The introduction
of graph attention networks in combination with self-supervised contrastive learning
provides a novel approach to tackling the challenges of data sparsity, interaction noise,
and long-tail recommendations. Future research directions include further exploration of
data augmentation techniques for graph-based learning, improving contrastive learning
frameworks, and extending self-supervised learning to broader recommendation scenarios.
This work marks a significant advancement in AI-driven recommendation models, offering
a more efficient, scalable, and accurate approach to personalized recommendations.
The paper by Xu et al. explores the challenge of enhancing social recommendation
systems by introducing a preference-aware graph neural network approach. Traditional
recommendation systems, especially those based on collaborative filtering, often suffer
from data sparsity, which limits their ability to provide personalized recommendations.
Many existing social recommendation models incorporate user relationships to mitigate
this issue, but they frequently fail to properly filter out irrelevant or negative informa-
tion from high-order neighbors. This results in a decline in recommendation accuracy
and effectiveness.
This problem is crucial because social recommendation systems are increasingly used
in e-commerce, social media, and content recommendation platforms. The challenge of
information overload makes it difficult for users to find relevant content. Introducing social
connections can enhance recommendation accuracy, but only if those connections are mean-
ingfully filtered to ensure that only relevant social signals contribute to recommendations.
To address these challenges, the authors propose the Preference-Aware Light Graph
Convolutional Network (PLGCN). This model consists of several key components. First,
it includes an unsupervised subgraph construction module, which clusters users into
subgraphs based on their preferences. By grouping users with similar preferences, the
PLGCN effectively filters out negative or irrelevant messages from users with different
interests. Second, a feature aggregation module is designed to combine user embeddings
with social and interaction information more effectively. Finally, the model employs a
lightweight GNN framework, removing nonlinear activation and feature transformation
operations to prevent overfitting and improve computational efficiency.
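To make the idea of a lightweight GNN concrete, the sketch below shows LightGCN-style propagation, in which each layer is a symmetrically normalized neighborhood average with no weight matrix and no nonlinearity. It is an assumption that PLGCN propagates in this way within each preference subgraph; the summary only states that activation and feature transformation are removed. All names are hypothetical.

```python
import math

def light_propagate(adjacency, embeddings, num_layers=2):
    """Linear graph propagation without feature transformation or activation.

    adjacency: dense 0/1 matrix as a list of lists (symmetric, no self-loops).
    embeddings: one vector per node.
    Returns the average of the embeddings across layers 0..num_layers.
    """
    n = len(adjacency)
    dim = len(embeddings[0])
    degree = [sum(row) for row in adjacency]
    layers = [embeddings]
    current = embeddings
    for _ in range(num_layers):
        propagated = []
        for i in range(n):
            aggregated = [0.0] * dim
            for j in range(n):
                if adjacency[i][j] and degree[i] > 0 and degree[j] > 0:
                    # Symmetric normalization: 1 / sqrt(deg(i) * deg(j))
                    weight = adjacency[i][j] / math.sqrt(degree[i] * degree[j])
                    for d in range(dim):
                        aggregated[d] += weight * current[j][d]
            propagated.append(aggregated)
        layers.append(propagated)
        current = propagated
    return [
        [sum(layer[i][d] for layer in layers) / len(layers) for d in range(dim)]
        for i in range(n)
    ]
```

Because the only learnable quantities are the initial embeddings, the parameter count does not grow with the number of layers, which is one reason such designs are less prone to overfitting on sparse interaction data.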
The authors conducted comprehensive experiments on two real-world datasets,
LastFM and Ciao, to evaluate the performance of the PLGCN. The results indicate that
the PLGCN outperforms state-of-the-art methods, particularly in addressing the cold-start
problem, where new users or items have limited interaction data. Compared to baseline
models such as NGCF, LightGCN, and SocialLGN, the PLGCN achieved superior precision,
recall, and normalized discounted cumulative gain (NDCG) scores, demonstrating its
effectiveness in providing more accurate and relevant recommendations.
The study introduces several key innovations. The preference-aware subgraph con-
struction module represents a novel approach to filtering negative information in social
recommendation systems, significantly improving recommendation performance. The
lightweight GNN framework reduces model complexity while maintaining high accuracy,
making it more suitable for large-scale applications. The feature aggregation module en-
hances user representations by integrating interaction and social information in a more
structured way.
In terms of contributions, this work advances the field of social recommendation by
introducing an efficient and scalable model that outperforms existing GNN-based recom-
mendation approaches. The proposed methodology demonstrates improved performance
in cold-start scenarios, which remains a major challenge in recommendation systems. The
study also highlights the potential for further enhancements, including incorporating ad-
ditional social features such as trust levels and exploring dynamic social networks where
user preferences evolve over time. This research provides a strong foundation for future
developments in AI-driven personalized recommendations, with practical applications
extending beyond social recommendations to e-commerce, online streaming platforms, and
digital marketing strategies.
The study by Polat et al. investigates the development and optimization of an auto-
matic speech recognition (ASR) system for Turkish using the Whisper architecture and
evaluates the performance gains achieved through fine-tuning with Low-Rank Adaptation
(LoRA). The main problem being tackled is the limited performance of ASR systems in low-
resource languages such as Turkish. Despite the capabilities of modern transformer-based
models like Whisper, their accuracy in Turkish remains suboptimal due to the language’s
morphological complexity, dialectal variation, and the scarcity of high-quality labeled
datasets. These limitations make it difficult to achieve reliable, scalable ASR performance
in real-world Turkish applications.
To overcome this, the authors implement an end-to-end ASR system using Whisper
and fine-tune it with the LoRA technique. Whisper is based on a transformer architecture
known for its ability to handle multilingual, noisy, and long-context inputs effectively.
However, Whisper’s training is biased toward high-resource languages like English. LoRA
addresses the challenge of fine-tuning large-scale models by introducing low-rank trainable
matrices, drastically reducing the number of parameters to be updated during training. This
makes the fine-tuning process more computationally efficient and accessible, particularly
for low-resource languages.
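The low-rank update can be illustrated with plain Python lists. This sketch follows the standard LoRA formulation, y = Wx + (alpha / r) · B(Ax), where W stays frozen and only A and B are trained; the dimensions and values used in the usage note are illustrative and are not taken from Whisper or from the paper.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def lora_forward(x, W, A, B, alpha):
    """Frozen layer plus a scaled low-rank correction.

    W: d_out x d_in frozen weights; A: r x d_in; B: d_out x r.
    alpha is the LoRA scaling hyperparameter.
    """
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

def lora_trainable_params(d_out, d_in, rank):
    """Number of trainable parameters in A and B for one adapted matrix."""
    return rank * (d_out + d_in)
```

For a hypothetical 1024 × 1024 projection, full fine-tuning would update 1,048,576 weights, whereas `lora_trainable_params(1024, 1024, 8)` gives 16,384. With B initialized to zeros, as in the original LoRA formulation, the adapted layer starts out identical to the pretrained one.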
The study uses five Turkish speech datasets—METU MS, TNST, Mozilla Common
Voice, FLEURS, and TASRT—to evaluate the system’s performance before and after fine-
tuning. The results show significant improvements in word error rates (WERs) and charac-
ter error rates (CERs), especially in the Whisper medium and large models after applying
LoRA. For example, the WER was reduced by up to 52%, with corresponding decreases
in the CER, demonstrating the effectiveness of LoRA-enhanced fine-tuning. The paper
also includes a comparative analysis with the Google USM ASR model, showing that the
Whisper-large-v3 model outperforms Google’s system on most datasets in both the WER
and CER.
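Both error rates are edit distances normalized by the reference length, computed over words for WER and over characters for CER. The sketch below uses a standard Levenshtein distance and assumes a non-empty reference; the text normalization applied in the paper (casing, punctuation) is not described here and is therefore omitted.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (substitution, insertion, deletion)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            cost = 0 if ref_token == hyp_token else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, dropping one word from the four-word reference "bugün hava çok güzel" yields a WER of 0.25.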
A key advancement of this study is the application of LoRA to the Whisper model
for Turkish ASR, combined with a thorough evaluation across multiple speech datasets
and targeted improvements to dataset quality. Through the use of a transformer-based
architecture optimized via a parameter-efficient fine-tuning approach, the research enhances
the adaptability of large-scale ASR systems for languages with limited resources.
The study makes two primary contributions: it demonstrates that Whisper can be
effectively adapted to Turkish using LoRA, and it provides a framework for improving
ASR performance in other low-resource languages with similar challenges. The study
underscores the value of transformer-based models combined with efficient fine-tuning
techniques and sets a precedent for further research in multilingual, resource-constrained
ASR development.
The paper by Huang and Li proposes a novel framework called GGTr to address the
problem of human motion prediction, which involves forecasting future body movements
based on past pose sequences. This task is particularly challenging due to the high complex-
ity, variability, and uncertainty of human motion, which involves intricate spatial–temporal
dependencies among body joints. Existing models often fail to simultaneously capture both
local and global temporal dynamics or accurately represent spatial interactions between
joints, limiting their performance in real-world applications such as robotics, surveillance,
and human–computer interaction.
The authors address these limitations by proposing a new architecture that integrates
Graph Convolutional Networks (GCNs), Gated Recurrent Units (GRUs), and transformer
layers. The GCN module incorporates a learned positional representation, allowing the
model to capture complex spatial relationships between joints beyond fixed adjacency
matrices. GRUs are used to model local temporal dependencies in joint motion, while
the transformer layers extract long-range temporal patterns, enabling the network to
effectively handle both short-term transitions and long-term dynamics within human
motion sequences.
The GGTr model is trained end-to-end using the mean per joint position error (MPJPE)
as a loss function and is optimized with the AdamW optimizer. Evaluations are con-
ducted on two benchmark datasets, Human3.6M and CMU-MoCap, where the proposed
framework consistently outperforms state-of-the-art methods across both short-term and
long-term motion prediction tasks. The results show especially strong performance im-
provements for complex, irregular, and non-periodic movements, where traditional models
often struggle.
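MPJPE is the average Euclidean distance between predicted and ground-truth joint positions. A minimal sketch, assuming poses are stored as `pose[frame][joint] = (x, y, z)` (the data layout is an assumption made for illustration):

```python
import math

def mpjpe(predicted, target):
    """Mean per joint position error over all frames and joints."""
    total = 0.0
    count = 0
    for predicted_frame, target_frame in zip(predicted, target):
        for predicted_joint, target_joint in zip(predicted_frame, target_frame):
            total += math.dist(predicted_joint, target_joint)
            count += 1
    return total / count
```

Because the metric is expressed in the same units as the joint coordinates, it can serve both as a training loss and as a directly interpretable evaluation score.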
Among the novel contributions of this work is the integration of position-aware GCNs
with temporal modeling using GRUs and transformer layers, creating a unified framework
[...]
of two primary stages: model training and anomaly detection. In the training phase, a log
parser converts raw logs into structured events and triples, which are then used to train
the Sentence-BERT-based semantic vector model (T-SBERT). These vectors are arranged
into sequences and passed through a Bi-LSTM model trained to learn normal log behavior.
During detection, new logs are parsed, transformed into semantic vectors, and analyzed
using the trained Bi-LSTM model to identify anomalies based on a similarity threshold.
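The detection stage can be sketched as a sliding-window comparison between the vector the sequence model expects next and the vector actually observed. This is one plausible reading of "similarity threshold", not the authors' confirmed procedure: the use of cosine similarity, the window size, the threshold value, and the `mean_of_window` stand-in for the trained Bi-LSTM are all assumptions made for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def mean_of_window(window):
    """Hypothetical stand-in for the trained sequence model: averages the window."""
    dim = len(window[0])
    return [sum(vec[d] for vec in window) / len(window) for d in range(dim)]

def detect_anomalies(sequence, predict_next, window=2, threshold=0.7):
    """Return indices whose observed vector is too dissimilar from the prediction."""
    flagged = []
    for t in range(window, len(sequence)):
        expected = predict_next(sequence[t - window:t])
        if cosine_similarity(expected, sequence[t]) < threshold:
            flagged.append(t)
    return flagged
```

In a real pipeline, `predict_next` would wrap the trained model and `sequence` would hold the semantic vectors of parsed log events.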
The approach advances current practices by combining semantic feature extraction
from Sentence-BERT with sequence modeling through a Bi-LSTM equipped with an at-
tention mechanism. This design enables the model to better understand contextual re-
lationships between log events, increasing both detection accuracy and robustness. The
framework also includes a semantic matching algorithm that supports generalization to
new log events, addressing a key limitation of prior models, which perform poorly when
log formats change or new events appear.
Evaluation on two real-world datasets, HDFS and OpenStack, shows that LogADSBERT
outperforms existing deep learning-based methods such as DeepLog and
LogAnomaly in terms of precision, recall, and F1-score. The model demonstrates particular
strength in handling newly injected log events, maintaining high detection performance
even when encountering previously unseen patterns. Experimental results also confirm the
method’s resilience across different hyperparameter settings, indicating its adaptability to
diverse application environments.
Overall, the study presents a semantically enriched and sequence-aware approach to
log anomaly detection that significantly improves accuracy, robustness, and generalization
compared to traditional and existing deep learning methods. This work highlights the
importance of integrating natural language processing techniques with temporal modeling
in system monitoring applications.
The paper by Memiş deals with the formalization and application of picture fuzzy
soft matrices (pfs-matrices) in supervised learning, particularly by introducing a new
classification algorithm called Picture Fuzzy Soft k-Nearest Neighbor (PFS-kNN). The
problem being addressed arises from inconsistencies in earlier definitions of picture fuzzy
sets and picture fuzzy soft sets. These inconsistencies limit the reliability and applicability of
pfs-sets and their matrix forms in computational tasks, especially those involving uncertain
or imprecise information, such as real-world decision-making or classification problems.
The issue is significant because many complex problems, especially in areas like
medical diagnosis or preference-based decision-making, involve uncertainty that cannot
be effectively captured by classical mathematical tools. Traditional fuzzy set models and
even their extensions, like intuitionistic or Pythagorean fuzzy sets, struggle to fully express
cases that include partial agreement, disagreement, and abstention (such as in voting
scenarios). Picture fuzzy sets address this by introducing three degrees: membership, non-
membership, and neutrality. However, without a consistent matrix-based representation
and valid mathematical operations, their use in machine learning remains limited.
To resolve this, the author redefines the structure of pfs-matrices to eliminate logical
and algebraic contradictions present in earlier models. These matrices allow for the repre-
sentation of data points with complex uncertainty structures in a way that is suitable for
computation. The paper then defines a set of new distance measures—such as Minkowski,
Euclidean, and Hamming distances—for comparing pfs-matrices. These distance measures
are used in the construction of the PFS-kNN classifier, which adapts the classical k-Nearest
Neighbor algorithm to the picture fuzzy soft set context by evaluating similarity between
pfs-matrices rather than standard numerical vectors.
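The idea of running kNN over picture fuzzy data can be illustrated as follows. Each sample is represented as a list of (membership, neutrality, non-membership) triples, one per feature, and compared with a normalized Minkowski-type distance. The normalization by 3n and the exact form of the distance are assumptions made for this sketch; the paper's own definitions over pfs-matrices may differ.

```python
from collections import Counter

def pf_minkowski(a, b, p=2):
    """Normalized Minkowski-type distance between two picture fuzzy samples.

    a, b: lists of (membership, neutrality, non_membership) triples.
    p=1 gives a Hamming-type distance, p=2 a Euclidean-type distance.
    """
    total = sum(
        abs(x - y) ** p
        for triple_a, triple_b in zip(a, b)
        for x, y in zip(triple_a, triple_b)
    )
    return (total / (3 * len(a))) ** (1 / p)

def pf_knn_predict(train, labels, query, k=3, p=2):
    """Majority vote among the k training samples closest to the query."""
    ranked = sorted(range(len(train)), key=lambda i: pf_minkowski(train[i], query, p))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]
```

The classifier itself is unchanged from classical kNN; only the distance function is swapped for one that takes all three degrees into account.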
The classifier is tested using four medical datasets from the UCI Machine Learning
Repository. The proposed method demonstrates superior performance compared to existing
kNN-based classifiers across multiple metrics, including accuracy, precision, recall, and
F1-score. In 72 out of 128 evaluation cases, PFS-kNN outperforms the baselines.
What distinguishes this approach is the fusion of a restructured mathematical frame-
work with a practical supervised learning application. The paper not only resolves the-
oretical flaws in the structure and operations of pfs-matrices but also shows that these
improvements lead to better modeling of uncertainty in real-world datasets. As a result,
this work establishes a more robust foundation for integrating the picture fuzzy soft set
theory into machine learning, with potential applications in any domain requiring nuanced
handling of vague, partial, or conflicting information.
The study by Segovia et al. addresses the problem of accurately forecasting meteorological
variables (specifically temperature, relative humidity, solar radiation, and wind speed)
using machine learning techniques implemented in open-source software. The
significance of this issue lies in the increasing demand for reliable weather prediction to
support applications in renewable energy management, agriculture, environmental moni-
toring, and public health. Climate variability and the nonlinear behavior of atmospheric
variables make traditional statistical approaches insufficient, especially in regions with
complex weather dynamics like the study area in Ecuador.
To meet this challenge, the authors propose a forecasting system based on Python and
compare the performance of six supervised learning models: multiple linear regression,
polynomial regression, decision tree, random forest, XGBoost, and the multilayer percep-
tron neural network. The models were trained and tested using a one-year dataset collected
every five minutes from a meteorological station in the Tungurahua province of Ecuador.
Each model’s performance was assessed using four evaluation metrics. The findings show
that the random forest model consistently delivers the most accurate predictions across
most variables. However, wind speed posed the greatest forecasting challenge due to its
high variability, with the best results for this variable obtained using XGBoost.
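This summary does not name the four evaluation metrics used. As an illustration only, the sketch below implements three metrics that are commonly reported for regression-style forecasts (MAE, RMSE, and the coefficient of determination R²); whether these match the paper's choices is an assumption.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large deviations more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual variance / total variance."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

Applying the same functions to the predictions of each candidate model on a shared held-out period gives a like-for-like comparison across models and variables.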
What distinguishes this work is the development of a low-cost, replicable forecast-
ing system based entirely on open-source tools. The methodology is adaptable to other
meteorological contexts and supports broader implementations in intelligent agriculture
and microgrid control. The approach demonstrates the capacity of ensemble methods
and neural networks to model complex atmospheric behaviors and suggests that machine
learning can offer reliable and scalable solutions for real-time environmental prediction.
The paper by Shahul Hameed et al. examines the use of synthetic data generation as a
strategy for mitigating bias in artificial intelligence systems, with a particular focus on
medical datasets. The problem it addresses stems from the growing concern that AI models,
especially in healthcare, can replicate and even amplify societal biases present in training
data. These biases can result in unfair treatment recommendations, misdiagnoses, or
unequal access to healthcare services for certain demographic groups, particularly those
underrepresented in existing datasets.
This issue is especially critical in clinical settings, where algorithmic decisions can
have direct implications on patient outcomes. For example, biased models may lead to
disparities in diagnoses or therapeutic suggestions across racial, gender, or socioeconomic
groups. Traditional approaches to mitigate bias—such as algorithmic adjustments, pre-
or post-processing of data, or attempts to diversify existing datasets—are often limited in
effectiveness or introduce trade-offs, such as a loss of data fidelity. Synthetic data generation
offers an alternative that maintains the statistical structure of the original dataset while
improving representativeness and privacy.
The authors conduct a comprehensive review of seventeen peer-reviewed studies
published between 2020 and 2024, selected through a structured search process involving
major scientific databases including Google Scholar, PubMed, IEEE Xplore, ScienceDirect,
and the ACM Digital Library. The selected studies apply a range of synthetic data gen-
eration techniques to address bias, including Generative Adversarial Networks (GANs),
Bayesian networks, Structural Causal Models (SCMs), SMOTE, Gaussian copulas, and
deep reinforcement learning. These methods are used to augment or replace biased data in
applications such as diagnosis prediction, treatment recommendation, and health record
de-identification. Several approaches emphasize the dual benefit of fairness improvement
and data privacy preservation.
The paper details how GANs are widely used to generate synthetic medical images,
signals, and tabular data, while Bayesian and causal models offer structured frameworks
for encoding probabilistic or causal relationships among variables. SMOTE and CTGAN
are frequently employed for balancing class distributions in imbalanced datasets. Across
studies, the effectiveness of these methods is assessed using various fairness and perfor-
mance metrics such as demographic parity, equal opportunity, ROC-AUC, F1-score, and
domain-specific utility scores.
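Two of the fairness metrics named above can be computed directly from predictions and group labels. The sketch below uses their usual definitions: demographic parity compares positive-prediction rates across groups, and equal opportunity compares true-positive rates. It assumes binary labels and predictions and that every group contains at least one positive example; the function names are illustrative.

```python
def demographic_parity_difference(y_pred, groups):
    """Largest gap in positive-prediction rate between any two groups."""
    rates = []
    for group in set(groups):
        members = [p for p, g in zip(y_pred, groups) if g == group]
        rates.append(sum(members) / len(members))
    return max(rates) - min(rates)

def equal_opportunity_difference(y_true, y_pred, groups):
    """Largest gap in true-positive rate between any two groups."""
    true_positive_rates = []
    for group in set(groups):
        # Predictions for members of this group whose true label is positive
        positives = [
            p for t, p, g in zip(y_true, y_pred, groups) if g == group and t == 1
        ]
        true_positive_rates.append(sum(positives) / len(positives))
    return max(true_positive_rates) - min(true_positive_rates)
```

A value of 0 indicates parity on the respective criterion; comparing these values for models trained with and without synthetic augmentation is one way to quantify a bias-mitigation effect.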
In the discussion, the authors highlight that synthetic data generation has proven effec-
tive in enhancing model fairness and performance when applied appropriately. However,
the success of these methods depends heavily on the quality of the initial dataset, the suit-
ability of the chosen technique, and the availability of domain knowledge for model tuning.
Limitations include computational cost, the complexity of implementation (especially for
causal models), challenges in preserving high-dimensional dependencies, and the risks of
introducing new types of bias during data generation.
While the reviewed methods show promise, the authors stress that the current syn-
thetic data techniques still face barriers to broader adoption in real-world healthcare
systems. These include methodological complexity, lack of standardization, and limited
validation across diverse populations. Nonetheless, the review offers a solid foundation for
further research into artificial data generation as a practical and ethical solution to bias in
AI, especially in domains where data privacy and fairness are both paramount.
3. Conclusions
This Special Issue presents a comprehensive snapshot of current advancements in
computational intelligence and machine learning, highlighting their increasing sophis-
tication, diversity of application, and relevance to real-world challenges. The ten fea-
tured papers collectively demonstrate how novel learning paradigms, model architectures,
and data representations can address critical problems such as personalization, recom-
mendation diversity, fairness, interpretability, and performance in low-resource or noisy
data environments.
A prominent trend throughout the contributions is the growing integration of deep
learning models, particularly transformers, graph neural networks, and hybrid architec-
tures, with domain-specific knowledge and auxiliary learning objectives. From enhancing
travel recommendation systems with personality profiling to deploying self-supervised
learning in collaborative filtering, these studies showcase the importance of designing
models that are not only accurate but also adaptive and explainable. Additionally, multiple
contributions emphasize the practical viability of proposed solutions, as evidenced by
experimental validation on diverse benchmark datasets and real-world scenarios.
Another key theme is the increasing emphasis on ethical and inclusive AI, particularly
in works focused on bias mitigation, fairness in healthcare applications, and accessibility
for underrepresented languages. The use of synthetic data generation, lightweight model
adaptation, and open-source deployment reflects a broader movement toward responsible,
transparent, and reproducible research practices.
Overall, the research collected in this Special Issue contributes to a deeper understand-
ing of both the capabilities and limitations of contemporary machine learning systems. It
provides a valuable resource for researchers and practitioners seeking to harness the power
of computational intelligence in increasingly complex, uncertain, and socially sensitive
environments. The breadth of methodological approaches and problem domains also
points to promising directions for future work, including multimodal learning, continual
adaptation, and trustworthy AI frameworks.
List of Contributions
1. Aribas, E.; Daglarli, E. Transforming Personalized Travel Recommendations: Integrating Genera-
tive AI with Personality Models. Electronics 2024, 13, 4751. [Link]
3234751.
2. Zhu, J.; Li, K.; Peng, J.; Qi, J. Self-Supervised Graph Attention Collaborative Filtering for
Recommendation. Electronics 2023, 12, 793. [Link]
3. Xu, H.; Wu, G.; Zhai, E.; Jin, X.; Tu, L. Preference-Aware Light Graph Convolution Network for
Social Recommendation. Electronics 2023, 12, 2397. [Link]
7.
4. Polat, H.; Turan, A.K.; Koçak, C.; Ulaş, H.B. Implementation of a Whisper Architecture-Based
Turkish Automatic Speech Recognition (ASR) System and Evaluation of the Effect of Fine-
Tuning with a Low-Rank Adaptation (LoRA) Adapter on Its Performance. Electronics 2024, 13,
4227. [Link]
5. Huang, B.; Li, X. GGTr: An Innovative Framework for Accurate and Realistic Human Motion
Prediction. Electronics 2023, 12, 3305. [Link]
6. Alruily, M. ArRASA: Channel Optimization for Deep Learning-Based Arabic NLU Chatbot
Framework. Electronics 2022, 11, 3745. [Link]
7. Hu, C.; Sun, X.; Dai, H.; Zhang, H.; Liu, H. Research on Log Anomaly Detection Based on
Sentence-BERT. Electronics 2023, 12, 3580. [Link]
8. Memiş, S. Picture Fuzzy Soft Matrices and Application of Their Distance Measures to Supervised
Learning: Picture Fuzzy Soft k-Nearest Neighbor (PFS-kNN). Electronics 2023, 12, 4129. https:
//[Link]/10.3390/electronics12194129.
9. Segovia, J.A.; Toaquiza, J.F.; Llanos, J.R.; Rivas, D.R. Meteorological Variables Forecasting
System Using Machine Learning and Open-Source Software. Electronics 2023, 12, 1007. https:
//[Link]/10.3390/electronics12041007.
10. Shahul Hameed, M.A.; Qureshi, A.M.; Kaushik, A. Bias Mitigation via Synthetic Data Genera-
tion: A Review. Electronics 2024, 13, 3909. [Link]
References
1. Sarker, I.H. Machine Learning: Algorithms, Real-World Applications and Research Directions. SN Comput. Sci. 2021, 2, 160.
2. LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning for AI. Commun. ACM 2021, 64, 58–65.
3. Elsken, T.; Metzen, J.H.; Hutter, F. Neural Architecture Search: A Survey. J. Mach. Learn. Res. 2019, 20, 1–21.
4. Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; Vinyals, O. Understanding Deep Learning Requires Rethinking Generalization. arXiv 2017, arXiv:1611.03530.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
Article
Transforming Personalized Travel Recommendations:
Integrating Generative AI with Personality Models
Erke Aribas * and Evren Daglarli *
Department of Computer Engineering, Faculty of Computer and Informatics Engineering, Istanbul Technical
University, 34485 Istanbul, Turkey
* Correspondence: aribas@[Link] (E.A.); daglarli@[Link] (E.D.)
Abstract: Over the past few years, the incorporation of generative Artificial Intelligence (AI) techniques,
particularly the Retrieval-Augmented Generation (RAG) framework, has opened up revolutionary
opportunities for improving personalized travel recommendation systems. The RAG
framework seamlessly combines the capabilities of large-scale language models with retriever mod-
els, facilitating the generation of diverse and contextually relevant recommendations tailored to
individual preferences and interests, all of which are based on natural language queries. These
systems iteratively learn and adapt to user feedback, thereby continuously refining and improving
recommendation quality over time. This dynamic learning process enables the system to
adjust to changes in user preferences, emerging travel trends, and contextual factors, ensuring that
the recommendations remain pertinent and personalized. Furthermore, we explore the incorporation
of personality models like the Myers–Briggs Type Indicator (MBTI) and the Big Five (BF) personality
traits into personalized travel recommendation systems. By incorporating these personality models,
our research aims to enrich the understanding of user preferences and behavior, allowing for even
more precise and tailored recommendations. We explore the potential synergies between personality
psychology and advanced AI techniques, specifically the RAG framework with a personality model,
in revolutionizing personalized travel recommendations. Additionally, we conduct an in-depth
examination of the underlying principles, methodologies, and technical intricacies of these advanced
AI techniques, emphasizing their ability to understand natural language queries, retrieve relevant
information from vast knowledge bases, and generate contextually rich recommendations tailored
to individual personalities. In our personalized travel recommendation system model, results are
achieved such as user satisfaction (78%), system accuracy (82%), and the performance rate based on
user personality traits (85% for extraversion and 75% for introversion).
Keywords: personality model; retrieval augmentation; large language models (LLM); travel recommendation
Citation: Aribas, E.; Daglarli, E. Transforming Personalized Travel Recommendations: Integrating Generative AI with Personality Models. Electronics 2024, 13, 4751. [Link] electronics13234751
Electronics 2024, 13, 4751
3. Background
Recommendation systems have evolved from early collaborative filtering methods,
such as user–item matrix factorization, to more sophisticated techniques incorporating
machine learning algorithms and deep learning models. Collaborative filtering approaches
from Resnick and Varian rely on user feedback and preferences to generate recommen-
dations, but they face challenges including cold-start issues and data sparsity [23]. As
computational power increased and data availability improved, content-based filtering
methods from Pazzani and Billsus emerged, utilizing item attributes and user profiles
to make personalized recommendations [24]. These methods are effective but limited at
capturing complex user preferences and contextual information.
equip language models with a deeper understanding of human personality. This, in turn,
could allow these models to tailor their responses to better reflect the user’s individual
characteristics.
First, we create a mapping from the MBTI to the BF and implement protectionist
interfaces in the language model. We then extend this methodology to a time-sensitive
design in which a fixed-effect analysis is straightforward to conduct [40]. When several
individuals are observed across multiple time points, these time points are aggregated
into ten-year intervals to obtain discriminative observations for use with popular
LLMs such as ChatGPT [6,7].
An introduction to language models such as ChatGPT requires an
understanding of natural language processing (NLP) techniques and of the pivotal role these
models play in comprehending and responding to natural language queries.
ChatGPT is an advanced language model developed by OpenAI, built on the gen-
erative pretrained transformer (GPT) architecture. It utilizes deep learning techniques
and large-scale transformer models to capture complex patterns and structures in natural
language data, as demonstrated by Radford et al. [44]. ChatGPT’s architecture enables it to
create coherent and contextually appropriate responses to diverse natural language inputs,
making it highly versatile for various NLP tasks.
Language models such as ChatGPT play a crucial role in understanding natural lan-
guage queries by leveraging advanced NLP capabilities that utilize techniques such as
attention mechanisms, transformer architectures, and pre-training on large datasets to cap-
ture semantic relationships, syntactic structures, and contextual nuances in language [45].
As a result, they can interpret user queries, extract relevant information, and generate
meaningful responses that align with the query’s intent.
The ability of language models such as ChatGPT to understand natural language
queries is further enhanced by fine-tuning for particular tasks or domains. Fine-tuning involves
training the model on task-specific data or providing domain-specific prompts [46], allowing
the model to specialize in understanding and generating content relevant to a particular
context, as shown by Raffel et al. This fine-tuning process improves the model’s accuracy
and relevance in handling natural language queries within specific domains or applications.
In addition to understanding individual queries, language models such as ChatGPT
contribute to building conversational agents and AI-driven interfaces capable of
engaging in human-like interactions. Brown et al. investigated how these models facilitate
dialog generation, question answering, sentiment analysis, and content recommendation
tasks, enhancing the user experience and enabling more natural and effective communica-
tion with AI systems [43].
In conclusion, language models such as ChatGPT play a critical role in
understanding natural language queries by leveraging advanced NLP techniques, being
fine-tuned for specific tasks or domains, and contributing to the development of conversational
AI systems.
4. Methods
To evaluate the proposed system, we consider a framework model that
includes components such as a travel data retrieval mechanism, personality-driven prompt
augmentation, and travel recommendation generation with a knowledge base, as well as
data collection and preprocessing tasks, as shown in Figure 1.
Figure 1. The RAG framework with personality model for travel recommendation.
recommendation system across various personality types. The experiment was structured
with a control group using a traditional system and an experimental group utilizing our
personality-driven model, allowing us to measure comparative effectiveness.
The data collection process was expanded to provide specific information on the
sources, methods, and criteria used. Data were gathered from multiple sources, including
travel booking sites, social media, and user-generated reviews. We applied rigorous inclu-
sion criteria to ensure data relevance and used APIs and web scraping tools to automate
collection. A detailed data filtering protocol was implemented to exclude outdated or
irrelevant data, ensuring high-quality inputs for the recommendation model.
Data validation steps were added to strengthen the robustness of the methodology.
We incorporated data accuracy checks through cross-referencing with verified sources
and utilized preprocessing steps such as deduplication and normalization. Furthermore,
external datasets were used to benchmark and validate the recommendation data, ensuring
the reliability and accuracy of the information processed by our system.
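The deduplication and normalization steps mentioned above can be illustrated with a minimal sketch. The record fields (`name`, `city`, `rating`), the choice of min-max scaling, and the function names are assumptions made for illustration only; the paper does not specify how its preprocessing was implemented.

```python
def deduplicate(records, key_fields=("name", "city")):
    """Keep the first record for each case-insensitive key (hypothetical key fields)."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(str(rec[f]).strip().lower() for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def min_max_normalize(values):
    """Rescale numeric values to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy records standing in for collected travel listings (illustrative only).
raw = [
    {"name": "Blue Hotel", "city": "Izmir", "rating": 4.0},
    {"name": "blue hotel ", "city": "izmir", "rating": 4.0},  # duplicate after cleanup
    {"name": "Old Town Hostel", "city": "Antalya", "rating": 3.0},
    {"name": "Sea View Inn", "city": "Bodrum", "rating": 5.0},
]
clean = deduplicate(raw)
scaled = min_max_normalize([r["rating"] for r in clean])
```

In this toy input the second record differs from the first only in letter case and trailing whitespace, so it is dropped before the ratings are rescaled.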
We provided a clearer description of the specific algorithms and models used in the
study, including the RAG framework and its configuration. This section now explains
the integration of personality models based on MBTI and BF traits, detailing how the
system tailors recommendations based on individual characteristics. Additionally, the
benchmarking process for these models was included, with key performance metrics like
accuracy, user satisfaction, and recommendation relevance highlighted to validate the
effectiveness of our model.
personality traits directly influence both the retrieval and generation stages, enhancing
recommendation relevance.
To start the personality transition process, popular methods such as Pearson’s method
are used. In statistics, the Pearson correlation coefficient (PCC, Pearson r, or Pearson
product–moment correlation coefficient, PPMCC), often referred to simply as the correlation
coefficient, measures the linear relationship between two datasets [40]. It is calculated as
the ratio of the covariance of two variables to the product of their standard deviations,
providing a normalized measure of covariance that always results in a value between −1
and 1 [40]. Like covariance, this measure only captures the linear correlation between
variables, disregarding other types of relationships or correlations [40]. For instance,
the Pearson correlation coefficient between age and height in a sample of high school
teenagers is expected to be significantly greater than 0 but less than 1 (since 1 would
indicate an unrealistic perfect correlation). The Pearson correlation coefficient (r) is one of
several correlation coefficients available for measuring correlations. Pearson’s correlation
coefficient is appropriate when certain conditions are met [40]. Both variables must be
quantitative; if one is qualitative, another method should be used. The variables should
be normally distributed, which can be checked by creating a histogram for each variable
to determine if the distributions are approximately normal [40]. Slight deviations from
normality are usually not problematic. There should be no outliers in the data, as outliers
are observations that deviate from the overall pattern of the data [40]. A scatterplot can
help identify outliers by showing points that are isolated from the rest. The relationship
between the variables should be linear, meaning it can be reasonably represented by a
straight line. A scatter plot can also be used to assess whether the relationship between the
variables is linear.
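$$ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]\left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}} \tag{1} $$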
One advantage of using Pearson’s correlation coefficient (Equation (1)) to explore the
relationship between MBTI and the BF personality traits is that it offers a straightforward
and intuitive method for measuring the strength and direction of the relationship between
two variables. It is a widely recognized and commonly used statistical tool that is well
understood and easily interpreted by researchers. A value of r = 1 indicates a perfect
positive linear relationship: as x increases, y increases in a perfectly linear way. A value of r = −1
indicates a perfect negative linear relationship: as x increases, y decreases in a perfectly
linear way. A value of r = 0 suggests no linear correlation between x and y. However, this method also
has several potential drawbacks. One limitation is that the Pearson correlation coefficient
assumes a linear relationship between the two correlated variables, which may not always
reflect reality. Additionally, the Pearson correlation coefficient measures only the strength and sign
of the linear association between two variables and does not provide information about causality
or about which variable influences the other [40]. Other factors, such as situational or cultural influences, may also
affect the relationship between MBTI and BF personality traits [37]. Another potential
drawback of using Pearson’s correlation coefficient is its reliance on accurate and reliable
measurements of the correlated variables. As with any measurement tool, there may be
sources of error or bias in assessing the MBTI and BF personality traits, which could impact
the accuracy of the correlation coefficient [37,40].
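As a minimal sketch of how Equation (1) can be evaluated, the following function computes r directly from the sums in the formula. The example scores are hypothetical values invented for illustration (an MBTI extraversion preference score and a BF extraversion score for five people); they are not data from the study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation via the computational form of Equation (1)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return numerator / denominator


# Hypothetical scores for five individuals (illustrative only).
mbti_e = [10, 20, 30, 40, 50]
bf_e = [2.1, 2.9, 3.2, 4.1, 4.8]
r = pearson_r(mbti_e, bf_e)

# A purely quadratic relationship yields r = 0, although y depends fully on x.
r_quadratic = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
```

The second call illustrates the limitation discussed above: a strong but non-linear dependence can produce a coefficient of zero.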
While the Pearson correlation coefficient can provide insights into the relationship
between MBTI scores and BF personality traits, researchers should remain aware of its
limitations in capturing complex, non-linear associations. To achieve a more comprehensive
analysis, additional methods and techniques should be considered alongside the Pearson
correlation.
The Pearson correlation coefficient presents limitations, particularly its assumption of
a linear relationship between correlated variables, which may not accurately capture the
complexities of non-linear associations. A second possible drawback is a potential lack
of sensitivity. These drawbacks suggest that a secondary method may be needed for the
personality transition process.
Jaccard similarity serves as a widely used metric for assessing the similarity between
two entities, such as two textual documents. This metric proves useful for measuring
the likeness between two asymmetric binary vectors or sets. In the academic literature,
Jaccard similarity, denoted by J(A, B), is also known as the Jaccard index or
Jaccard coefficient; its complement, 1 − J(A, B), is known as the Jaccard dissimilarity or Jaccard distance.
Jaccard similarity has extensive utility in data science. Examples of its
application include textual analysis, which involves gauging
the resemblance between two text documents based on the terms shared between them;
e-commerce, where vast customer and product databases are searched to identify similar
customers via their buying histories; and recommendation systems, which employ the
Jaccard coefficient in movie recommendation algorithms to pinpoint similar customers who
have either rented or highly rated similar movies.
The application of the Jaccard similarity coefficient extends to evaluating the similarity
of two asymmetric binary variables. A binary variable has two possible
states, 0 and 1, where 0 signifies the absence of the attribute and 1 indicates its presence.
Unlike symmetric binary attributes, where both states carry equal weight, the importance
of the two states differs significantly for asymmetric binary variables. For example, when
evaluating the similarity among customers of a store, we may use a binary attribute that
signifies a purchase made at the store. Here, 1 represents the purchase of a specific item,
while 0 indicates no purchase of that item.
Given the potential existence of thousands of products in the store, the quantity
of items left unpurchased by any customer far exceeds the number of purchased items.
Consequently, when assessing customer similarity, we focus solely on item purchases. This
results in an asymmetric binary variable, where a value of 1 carries more weight than a value of 0.
In the initial step of computing the Jaccard similarity in Equation (2) between two
customers, each characterized by binary attributes, the following four quantities (i.e.,
frequencies) are determined based on the provided binary data:
a = the total number of attributes that are equal to 1 for both objects i and j
b = the total number of attributes that are equal to 0 for object i but equal to 1 for
object j
c = the total number of attributes that are equal to 1 for object i but equal to 0 for
object j
d = the total number of attributes that equal 0 for both objects i and j.
Subsequently, the Jaccard similarity for these attributes is calculated using the follow-
ing equation:
$$ J(i, j) = \mathrm{sim}(i, j) = \frac{a}{a + b + c} \tag{2} $$
The count of negative matches d is disregarded in this calculation because it holds no
significance for asymmetric binary attributes. A value of J(i, j) close to 1 indicates that two
customers exhibit similar shopping patterns, while a value of 0 indicates dissimilar behavior, as the
customers have purchased entirely different items.
Jaccard similarity can also be explained as follows:
a represents the number of elements that are common between sets i and j,
b represents the number of elements that are unique to set j,
c represents the number of elements that are unique to set i.
If we expand this formula, the numerator a represents the count of elements common
to both sets i and j. This is the intersection of the two sets, denoted by |i ∩ j|. In set notation,
the numerator can be rewritten as: a = |i ∩ j|. The denominator a + b + c represents the
union of the two sets i and j. It includes
a: Elements common to both sets (intersection).
b: Elements only in set j.
c: Elements only in set i.
This total represents the total unique elements in both sets combined, written as |i ∪ j|
in set notation. By substituting these terms into the formula, the Jaccard Similarity Index
can also be expressed in terms of set operations:
$$ J(i, j) = \frac{|i \cap j|}{|i \cup j|} \tag{3} $$
This expanded version highlights that the Jaccard Index measures the ratio of the
number of elements in the intersection of the two sets to the number of elements in
their union. Furthermore, instead of assessing similarity, the dissimilarity or Jaccard
distance between two binary attributes can be determined. The dissimilarity based on these
attributes via the Jaccard coefficient is derived as follows:
$$ d(i, j) = \frac{b + c}{a + b + c} = 1 - \mathrm{sim}(i, j) \tag{4} $$
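The counts a, b, c, and d and the resulting similarity and distance in Equations (2) and (4) can be sketched as follows. The two purchase vectors are invented for illustration, and returning 0.0 for two all-zero vectors is a convention assumed here, since the ratio is otherwise undefined.

```python
def jaccard_counts(i, j):
    """Return (a, b, c, d) for two equal-length binary vectors, as defined in the text."""
    if len(i) != len(j):
        raise ValueError("vectors must have the same length")
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)  # 1 for both objects
    b = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)  # 0 for i, 1 for j
    c = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)  # 1 for i, 0 for j
    d = sum(1 for x, y in zip(i, j) if x == 0 and y == 0)  # 0 for both objects
    return a, b, c, d


def jaccard_similarity(i, j):
    """Equation (2): a / (a + b + c); shared absences (d) are ignored."""
    a, b, c, _ = jaccard_counts(i, j)
    if a + b + c == 0:
        return 0.0  # assumed convention for two all-zero vectors
    return a / (a + b + c)


def jaccard_distance(i, j):
    """Equation (4): 1 - sim(i, j)."""
    return 1.0 - jaccard_similarity(i, j)


# Hypothetical purchase vectors over six products (1 = purchased).
customer_i = [1, 1, 0, 0, 1, 0]
customer_j = [1, 0, 0, 1, 1, 0]
counts = jaccard_counts(customer_i, customer_j)
similarity = jaccard_similarity(customer_i, customer_j)
distance = jaccard_distance(customer_i, customer_j)
```

Here the two customers share two purchased items and differ on two, so the similarity and the distance are both 0.5; the two products that neither customer bought do not affect the result.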
However, these methods also fail to represent our personality model. Transitioning
into the discussion of hybrid recommendation approaches, it is imperative to recognize
their pivotal role in addressing the complexities and challenges encountered in traditional
similarity-based recommendation systems.
Hybrid recommendation approaches are highly favored due to their ability to combine
the strengths of both item–item and user–user similarity techniques, thereby significantly
enhancing recommendation precision and scope. The rationale behind the preference
for hybrid models lies in their adeptness at mitigating various challenges inherent in
conventional similarity-based systems. In particular, these hybrid methodologies excel
in addressing the cold-start dilemma, which emerges when dealing with novel items or
users possessing limited interaction data. By harnessing item–item similarity for cold-start
items, the system can furnish recommendations grounded on the inherent attributes and
characteristics of the items themselves, eliminating the necessity for exhaustive user histo-
ries. Conversely, the utilization of user–user similarity facilitates the delivery of tailored
recommendations by identifying users with similar preferences and behaviors. This per-
sonalized approach augments the pertinence and efficacy of recommendations, especially
for seasoned users with extensive interaction histories. In essence, the hybrid approach
optimizes recommendation precision and scope by astutely amalgamating item–item and
user–user similarities, rendering it an adaptable and resilient solution for contemporary
recommendation systems.
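One simple way to picture such a hybrid is a weighted blend whose weight shifts from item–item similarity toward user–user similarity as a user accumulates interaction data. The sketch below is only an assumed illustration of that idea: the linear weighting rule, the `saturation` parameter, and the function name are hypothetical and are not taken from the paper.

```python
def hybrid_score(item_sim, user_sim, n_interactions, saturation=20):
    """Blend item-item and user-user similarity scores (illustrative rule).

    The weight on the user-user score grows linearly with the number of
    recorded interactions and is capped at 1 once `saturation` is reached,
    so cold-start cases rely on item-item similarity alone.
    """
    if saturation <= 0:
        raise ValueError("saturation must be positive")
    if n_interactions < 0:
        raise ValueError("n_interactions must be non-negative")
    w_user = min(n_interactions / saturation, 1.0)
    return w_user * user_sim + (1.0 - w_user) * item_sim


cold_start = hybrid_score(item_sim=0.8, user_sim=0.2, n_interactions=0)
halfway = hybrid_score(item_sim=0.8, user_sim=0.2, n_interactions=10)
seasoned = hybrid_score(item_sim=0.8, user_sim=0.2, n_interactions=20)
```

With no interaction history the score equals the item–item similarity, and for a user at or beyond the saturation point it equals the user–user similarity.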
The algorithms described above for converting MBTI scores into BF scores could
benefit from the use of various machine learning methods to optimize their performance.
Five machine learning methods that may be suitable for this study were compared in terms of their strengths
and weaknesses for this particular algorithm. Random
forest regression, gradient boosting regression, artificial neural networks, and k-nearest
neighbors regression were not preferred because they are computationally expensive.
The personality transition step continues with creating and implementing the
time-bound structure shown in Figure 2. To simulate the effect of age on personality, we adjust the scores of each
trait according to the age of the individual. Research has shown that personality traits tend
to change as individuals age, with some traits increasing or decreasing in intensity over
time. In this algorithm, we calculate an adjustment factor for each trait based on the age
of the individual. We then add this adjustment factor to the individual’s score for each
trait to obtain adjusted scores. The adjustment factors are chosen based on research on how
personality traits tend to change over a lifetime. As with the previous algorithm, we can
use machine learning methods to optimize this algorithm using our personality model.
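The age adjustment described above can be sketched as an additive, per-trait correction. All numbers below are placeholders: the per-decade slopes, the reference age of 30, and the 0 to 100 score range are assumptions chosen only to make the example concrete, and they are not the adjustment factors used in the study.

```python
# Hypothetical per-decade adjustment slopes on a 0-100 scale (illustrative
# values, not taken from the paper or from any published study).
ASSUMED_SLOPE_PER_DECADE = {
    "openness": -0.5,
    "conscientiousness": 1.5,
    "extraversion": -1.0,
    "agreeableness": 1.0,
    "neuroticism": -1.0,
}
REFERENCE_AGE = 30  # assumed age at which no adjustment is applied


def adjustment_factor(trait, age):
    """Adjustment for one trait: slope times decades away from the reference age."""
    decades = (age - REFERENCE_AGE) / 10.0
    return ASSUMED_SLOPE_PER_DECADE[trait] * decades


def adjust_scores(scores, age, low=0.0, high=100.0):
    """Add the age-based factor to each trait score and clip to [low, high]."""
    adjusted = {}
    for trait, score in scores.items():
        value = score + adjustment_factor(trait, age)
        adjusted[trait] = max(low, min(high, value))
    return adjusted


base = {
    "openness": 60.0,
    "conscientiousness": 50.0,
    "extraversion": 70.0,
    "agreeableness": 55.0,
    "neuroticism": 40.0,
}
at_30 = adjust_scores(base, age=30)
at_50 = adjust_scores(base, age=50)
```

Keeping the slopes in a single table makes it straightforward to replace the placeholder values with factors estimated from data.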
vectors, which are combined with user queries to refine the retrieval process and tailor
generative outputs.
Technical Adaptation of Personality Models: The personality models were adapted for
the RAG framework by encoding personality traits as feature inputs in the retrieval stage.
During retrieval, personality feature vectors are applied as filters, ensuring that the re-
trieved documents align not only with the query but also with individual personality-based
preferences. In the generative stage, these traits inform language generation, influencing
the style and tone of the recommendations. For instance, extraverted users receive rec-
ommendations framed with more socially engaging language, whereas introverted users
receive suggestions phrased in a reflective and personalized tone.
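A minimal sketch of personality-aware retrieval is given below: documents are first filtered by how well an assumed personality profile matches the user's personality vector, and the survivors are ranked by similarity to the query. The two-dimensional vectors, the document fields (`embedding`, `personality_profile`), the cosine measure, and the `min_fit` threshold are all hypothetical choices for illustration and do not reproduce the system's actual retriever.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query_vec, personality_vec, documents, min_fit=0.5, top_k=2):
    """Filter documents by personality fit, then rank the rest by query similarity."""
    candidates = []
    for doc in documents:
        fit = cosine(personality_vec, doc["personality_profile"])
        if fit >= min_fit:
            candidates.append((cosine(query_vec, doc["embedding"]), doc["title"]))
    candidates.sort(reverse=True)
    return [title for _, title in candidates[:top_k]]


# Toy corpus; personality profiles are [extraversion, introversion] weights.
docs = [
    {"title": "Beach festival", "embedding": [1.0, 0.0], "personality_profile": [1.0, 0.0]},
    {"title": "Group city tour", "embedding": [0.8, 0.6], "personality_profile": [0.8, 0.3]},
    {"title": "Silent mountain retreat", "embedding": [0.9, 0.1], "personality_profile": [0.0, 1.0]},
]
query = [1.0, 0.2]
for_extravert = retrieve(query, [0.9, 0.1], docs)
for_introvert = retrieve(query, [0.1, 0.9], docs)
```

The same query returns different documents for the two users because the personality filter is applied before the query ranking.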
5.1. Results
This research integrating the RAG framework with personality models for personal-
ized travel recommendations yielded promising results. Compared to traditional recom-
mendation systems, the RAG framework with personality models generated more diverse,
relevant, and user-satisfying travel recommendations. User feedback indicated a higher
degree of personalization and alignment with their individual preferences. This section
details the findings of the study conducted with the personality model-based travel
recommendation system, including system performance, user feedback, and statistical
analysis results. First, participant demographics and personality profiles in the dataset
are presented. The sample comprises 250 participants. The demographic
distribution of the participants is given in Table 2.
When a user submits a query, the travel data retrieval
mechanism, which includes the search component and the travel data document base, converts
the user’s query into a vectorized request for retrieving travel data from travel-based
web document repositories using a web crawler. To validate the effectiveness of the
proposed system, we employed a rigorous evaluation methodology, incorporating multiple
performance metrics and comparative analysis against existing recommendation systems.
We used the following key performance metrics:
User Satisfaction Rate: This was measured through post-interaction surveys using
a 5-point Likert scale, with questions designed to assess the relevance, usefulness, and
overall satisfaction with the recommendations. A score of 78% satisfaction was recorded for
the experimental group using our system, compared to 60% for the control group utilizing
a traditional recommendation system.
Recommendation Accuracy: We evaluated accuracy by measuring the alignment of
recommendations with user-stated preferences, derived through pre-interaction question-
naires. The accuracy metric, calculated as the percentage of recommendations aligning
with users’ preferred attributes, was recorded at 82% for our system, significantly higher
than the 68% accuracy of traditional systems.
Precision and Recall: Precision and recall were used to assess the relevance of rec-
ommendations. Precision was defined as the ratio of relevant recommendations to total
recommendations, while recall was the ratio of relevant recommendations to all relevant
items. Our system achieved precision and recall values of 0.84 and 0.78, respectively,
outperforming baseline systems, which averaged 0.70 precision and 0.65 recall.
Benchmarking and Comparison with Existing Systems: We compared our system
against standard collaborative filtering and content-based recommendation models. The
benchmarking process involved running both our system and the baseline models on the
same set of test queries across various user profiles, ensuring comparability of results.
The results demonstrate an 18-percentage-point improvement in user satisfaction (78% vs. 60%) and a
14-percentage-point increase in recommendation accuracy (82% vs. 68%) over traditional models. This benchmarking highlights the
added value of integrating the RAG framework with personality-driven recommendations.
Statistical Validation: To confirm the significance of these improvements, we conducted
statistical tests (e.g., paired t tests) between the experimental and control groups. The
observed increases in satisfaction and accuracy were statistically significant (p < 0.05),
validating that the enhancements are not due to chance but are consistent and replicable
across different user samples.
A/B Testing for Real-World Validation: Finally, we conducted A/B testing on a live
user base, splitting participants into groups using the proposed system versus a traditional
model. Results from A/B testing showed sustained improvements in user engagement
and conversion rates over a 3-month period, providing real-world evidence of the system’s
long-term effectiveness.
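The precision and recall definitions given above can be made concrete with a short sketch. The destination lists are invented for illustration and are unrelated to the study's data.

```python
def precision_recall(recommended, relevant):
    """Precision = hits / recommended items; recall = hits / relevant items."""
    recommended_set = set(recommended)
    relevant_set = set(relevant)
    hits = len(recommended_set & relevant_set)
    precision = hits / len(recommended_set) if recommended_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical recommendations and ground-truth relevant items for one user.
recommended = ["Cappadocia", "Ephesus", "Pamukkale", "Bodrum", "Ankara"]
relevant = ["Cappadocia", "Ephesus", "Bodrum", "Antalya", "Izmir", "Troy"]
precision, recall = precision_recall(recommended, relevant)
```

Three of the five recommended items are relevant and three of the six relevant items were recommended, so precision is 0.6 and recall is 0.5 for this toy case.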
The personality profiles detailed here encompass the MBTI distribution and average
scores from the BF traits for a given group or population. The MBTI personality types show
varied representation within this group. ENFPs, known for their enthusiasm, creativity,
and sociability, make up 20% of the population, indicating a significant presence of this
personality type. ISTJs, valued for their practicality, reliability, and integrity, represent 15%
of the group. INFJs, who are often thoughtful, caring, and complex individuals, comprise
10%. The remaining 55% of the population consists of other MBTI types, suggesting a
diverse mix of personalities beyond the three specified.
Electronics 2024, 13, 4751
The system demonstrated the ability to adapt to changing user preferences and travel
trends over time. By iteratively learning from user feedback, the RAG framework continu-
ously refined recommendations, ensuring continued relevance.
In the user satisfaction survey, 71% of the feedback was positive and 29% was negative.
In terms of user interaction, 85% of users reviewed the suggestions and 62% accepted
at least one of them. The wrong-suggestion rate was 4%; these errors were often
caused by an incorrect personality profile or an outdated data set. The average
suggestion processing time (suggestion generation time) was 1.8 s. Together, these
findings characterize the effectiveness and user experience of the personality
model-based travel recommendation system.
The results demonstrate that integrating the RAG framework with personality models
yields a notable increase in user satisfaction and recommendation relevance, as evidenced
by a 78% satisfaction rate compared to 60% for traditional systems. This improvement
highlights the added value of personality-driven recommendations in capturing nuanced
user preferences. Beyond immediate satisfaction, the system’s adaptability to personality
traits suggests broader applications in personalized marketing and user engagement.
Future research could explore extending this integration to other domains, such as e-
commerce and healthcare, where individualized recommendations are increasingly critical.
Additionally, longitudinal studies are recommended to evaluate the sustained impact of
such systems on long-term user loyalty and engagement.
Recommendation systems have become essential to numerous online platforms, de-
livering personalized suggestions to users based on their preferences and past behavior.
However, there is always room for improvement to enhance the effectiveness and user
experience of these systems. Utilizing a variety of recommendation algorithms, such as
collaborative filtering, content-based filtering, and hybrid approaches, can lead to more
diverse and personalized recommendations. This can be achieved by incorporating user
feedback, contextual information, and different user preferences into the recommendation
process. Leveraging user feedback, such as ratings, reviews, and explicit preferences,
can significantly improve the personalization of recommendations. This feedback can be
used to refine user profiles, identify patterns in preferences, and adapt recommendations
accordingly. Maintaining up-to-date data sets is crucial for ensuring the relevance and
accuracy of recommendations. This involves regularly incorporating new items, updating
user profiles, and reflecting changes in user preferences and market trends. Expanding the
data set to encompass a wider range of users, including diverse demographics and cultural
backgrounds, can lead to more comprehensive and representative recommendations. This
can also help address potential biases in the system. Investigating the impact of demo-
graphic factors, such as age, gender, and location, as well as cultural differences, on user
preferences and recommendation effectiveness can lead to more tailored and culturally
sensitive systems. Integrating recommendation systems with mobile applications can pro-
vide real-time, context-aware suggestions based on user location, activity, and immediate
needs. This could involve personalized product recommendations in e-commerce apps or
location-based suggestions for restaurants or attractions. Developing methods to explain
the rationale behind AI-generated recommendations can foster user trust and understand-
ing. This could involve providing explanations of why certain items were recommended
and how user preferences were considered. By addressing these areas for improvement
and pursuing future research directions, recommendation systems can become even more
effective in providing personalized, relevant, and valuable suggestions to users, enhancing
their overall online experience.
5.2. Discussion
Using appropriate evaluation metrics, we examine how generative AI combined
with personality models enhances personalized travel recommendations, and compare it
with previous models in the literature that employ fine-tuning methods. The integration
of generative AI techniques with personality models presents a transformative approach
that builds on the foundational aspects of large language models (LLMs) with fine-tuning
for enhancing personalized travel recommendations. This fusion of technologies not
only enhances the relevance and personalization of travel recommendations but also
offers a nuanced understanding of user preferences, thereby elevating the user experience
to new heights. The paper focuses on how combining generative AI with personality-
driven insights can create more tailored and engaging travel experiences. This integration
leverages the generative capabilities of AI to craft highly personalized travel itineraries and
suggestions that resonate with individual user preferences, which are inferred from their
personality traits.
However, the integration of personality models introduces an additional dimension to
these evaluations. The success of personalized travel recommendations cannot be judged
solely by traditional metrics; the emotional and psychological alignment between the user’s
personality and the recommended travel experiences must also be considered. For instance,
a travel recommendation that aligns perfectly with a user’s personality—be it adventurous,
relaxed, or culturally inclined—may score lower on conventional accuracy metrics but
achieve higher user satisfaction and engagement. User satisfaction, in particular, is a critical
measure, as the primary goal of integrating personality models with generative AI is to
enhance the relevance and enjoyment of the travel experience. Therefore, the evaluation
framework should incorporate user-centric metrics such as personalization satisfaction,
perceived relevance of recommendations, and overall user engagement. Integrating
personality models thus necessitates the development of new evaluation frameworks
that capture this subjective aspect of user experience, a concept that extends beyond the
scope of the traditional fine-tuning methods compared in Table 4.
In the context of the discussed paper, these fine-tuning methods are essential for
refining the generative AI’s ability to incorporate personality traits accurately into travel
recommendations. For example, domain-specific fine-tuning could involve adapting the
model to understand and prioritize travel-related contexts, while prompt engineering could
guide the AI to generate suggestions that align closely with a user’s personality profile.
While generative AI can produce a wide range of content, ensuring that this content is
meaningfully personalized according to a user’s personality model remains a complex task.
Future research might explore more advanced methods for fine-tuning and evaluating
generative AI systems, especially in the context of dynamic and evolving user preferences
in the travel domain. Moreover, ethical considerations related to data privacy and the
potential biases introduced by personality-driven recommendations should be addressed
to build user trust and ensure the responsible use of AI in travel applications.
Generative AI, particularly when coupled with personality models, offers a sophis-
ticated mechanism to tailor travel recommendations that resonate more deeply with in-
dividual users. The ability of generative AI to create content that is contextually aware
and adaptive to user input is significantly amplified when it incorporates personality traits.
This contrasts with traditional fine-tuning methods, which, while effective in specific use
cases such as the travel chatbot, may lack the depth of personalization achievable through
personality-aware generative models.
Table 4. Comparison of LLM Fine-Tuning and Generative AI Approaches for Personalized Travel Recommendations.

| Criteria | LLM Fine-Tuning Methods for Travel Chatbot Use Case | Generative AI with Personality Models for Travel Recommendations |
|---|---|---|
| Objective | To explore and assess the effectiveness of various fine-tuning methods for Large Language Models (LLMs) within the context of a travel chatbot, with the goal of enhancing user interaction and satisfaction. | To improve personalized travel recommendation systems by integrating generative AI (specifically the RAG framework) with personality models, aiming to create more tailored travel suggestions. |
| Approach (Solution) | Investigated fine-tuning techniques including transfer learning, domain-specific adaptation, and prompt engineering. These methods were tested using a travel chatbot as the primary application scenario. | Combined the RAG framework with personality models such as MBTI and BF personality traits to generate travel recommendations aligned with user personalities and preferences. |
| Results (Findings) | The study found that domain-specific adaptation of LLMs significantly improved the chatbot’s performance, particularly in understanding and responding to user queries, which led to higher user satisfaction. | The integration of personality models with generative AI resulted in more personalized and relevant travel recommendations, showing an increase in user satisfaction and engagement over time. |
| Targeted Problem | Aimed to address the challenge of effectively fine-tuning LLMs to operate within specific domains, particularly focusing on enhancing the accuracy and relevance of responses in a travel chatbot context. | Sought to solve the problem of generating highly personalized travel itineraries by considering users’ unique personality traits, improving the alignment of travel recommendations with individual preferences. |
| Conclusion | Concluded that tailored fine-tuning methods, especially domain-specific adaptations, are essential for optimizing LLM performance in specialized applications like travel chatbots, leading to better user experiences. | Concluded that the fusion of generative AI with personality models offers a significant advancement in personalized travel recommendations, providing deeper insights into user preferences and more relevant suggestions. |
6. Conclusions
In conclusion, the incorporation of generative artificial intelligence (AI) techniques,
especially through the RAG framework, has initiated a new era of personalized travel
recommendation systems. By harnessing the power of large-scale language models and
retriever models, these systems can produce diverse and contextually relevant recom-
mendations that cater to individual preferences and interests, all derived from natural
language queries. Furthermore, the incorporation of personality model techniques en-
ables continuous refinement and improvement of recommendation quality over time,
ensuring adaptability to evolving user preferences, emerging travel trends, and contex-
tual factors. Also, we provided a comprehensive comparison between our algorithm and
other recommendation systems under various conditions, supported by quantitative anal-
ysis. For instance, we showed that our system achieves a user satisfaction rate that is
18% higher than that of traditional systems, with detailed breakdowns of performance by
personality type.
Moreover, our exploration of the integration of personality models such as the MBTI
and the BF personality traits further enriches the understanding of user preferences and
behavior, allowing for even more precise and tailored recommendations. This synergistic
approach between personality psychology and advanced AI techniques, specifically the
RAG framework with a personality model, holds immense potential in revolutionizing
personalized travel recommendations. Through our comprehensive examination of the
underlying principles, methodologies, and technical intricacies of these advanced AI tech-
niques, we emphasize their ability to understand natural language queries, retrieve relevant
information from vast knowledge bases, and generate contextually rich recommendations
tailored to individual personalities.
Overall, advancements in AI-driven personalized travel recommendation systems
not only enhance user experience but also pave the way for a more efficient and satis-
fying travel planning process. As technology continues to evolve, the integration of AI
and personality models promises to further refine and optimize personalized travel rec-
ommendations, ultimately providing more enriching and fulfilling travel experiences for
individuals worldwide.
Our research lays the groundwork for further exploration in this area. Future work
could evaluate the effectiveness of integrating additional personality models beyond
the MBTI and BF. In addition, contextual personalization could further tailor
recommendations based on contextual factors like travel companions, budget constraints,
and seasonal variations. Explainable AI methods could also help explain the rationale
behind AI-generated recommendations, fostering user trust and understanding. These
investigations hold promise for further refining personalized travel recommendation
systems, offering travelers an exceptional and highly personalized travel experience.
Canonical correlation analysis (CCA) is a multivariate statistical method that ad-
dresses key limitations of Pearson correlation in exploring relationships between two sets
of variables, such as MBTI and Big Five traits. Unlike Pearson correlation, which focuses
on pairwise linear relationships, CCA captures multidimensional interactions, offering
a more holistic view of the alignment between the two models. Additionally, CCA
identifies canonical variates, linear combinations of each variable set that maximize
shared variance; capturing non-linear associations would require extensions such as
kernel CCA. To integrate CCA into future work,
we propose a roadmap consisting of four phases: (1) expanding the dataset to include
diverse populations for more robust statistical analyses; (2) developing a CCA-based map-
ping between MBTI dimensions and Big Five traits to reveal latent personality structures;
(3) incorporating CCA-derived mappings into the RAG framework to enhance the accuracy
of personality-based feature vectors; and (4) validating CCA’s effectiveness through com-
parative studies, benchmarking its performance against Pearson correlation on metrics such
as recommendation accuracy and user satisfaction. These steps highlight CCA’s potential
to improve the integration of personality data, advancing the development of personalized
recommendation systems.
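For orientation, the objective of standard CCA can be written compactly. The notation below is an assumption introduced here for illustration: X denotes an n × 4 matrix of MBTI dimension scores, Y an n × 5 matrix of Big Five trait scores, and the Σ terms their sample covariance matrices.

```latex
% First canonical pair: weight vectors a (for X) and b (for Y) that
% maximize the correlation between the projected scores Xa and Yb.
\[
  (a_1, b_1) \;=\; \arg\max_{a,\,b}\;
  \frac{a^{\top} \Sigma_{XY}\, b}
       {\sqrt{a^{\top} \Sigma_{XX}\, a}\;\sqrt{b^{\top} \Sigma_{YY}\, b}}
\]
% Later pairs (a_k, b_k) maximize the same ratio subject to being
% uncorrelated with all earlier canonical variates; at most
% min(4, 5) = 4 pairs exist for these two variable sets.
```

In its standard form the method searches over linear combinations only, so each canonical correlation is an ordinary Pearson correlation between two derived scores.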
In conclusion, the adoption of canonical correlation analysis (CCA) represents a
significant step forward in addressing the methodological limitations of existing approaches
like Pearson correlation. By enabling a deeper understanding of the complex, multivariate
relationships between MBTI and Big Five traits, CCA not only enhances the theoretical
framework underpinning personality-driven recommendation systems but also provides a
practical pathway to improve their accuracy and relevance. Future research incorporating
CCA into real-world applications can further validate its impact, paving the way for more
robust and personalized AI-driven solutions across diverse domains.
Author Contributions: Conceptualization, E.A. and E.D.; Software, E.A.; Validation, E.A.; Formal
analysis, E.A.; Writing—original draft, E.A. and E.D.; Writing—review & editing, E.A. and E.D. All
authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Data Availability Statement: The data that support the findings of this study are available from the
corresponding author upon reasonable request.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Hardt, D.; Glückstad, F.K. A social media analysis of travel preferences and attitudes, before and during COVID-19. Tour. Manag.
2024, 100, 104821. [CrossRef]
2. Albayrak, T.; Rosario González-Rodríguez, M.; Caber, M.; Karasakal, S. The use of mobile applications for travel booking: Impacts
of application quality and brand trust. J. Vacat. Mark. 2023, 29, 3–21. [CrossRef]
3. Paulavičius, R.; Stripinis, L.; Sutavičiūtė, S.; Kočegarov, D.; Filatovas, E. A novel greedy genetic algorithm-based personalized
travel recommendation system. Expert Syst. Appl. 2023, 230, 120580. [CrossRef]
4. Shrestha, D.; Wenan, T.; Shrestha, D.; Rajkarnikar, N.; Jeong, S.R. Personalized Tourist Recommender System: A Data-Driven and
Machine-Learning Approach. Computation 2024, 12, 59. [CrossRef]
5. Sarkar, J.L.; Majumder, A.; Panigrahi, C.R.; Roy, S.; Pati, B. Tourism recommendation system: A survey and future research
directions. Multimed. Tools Appl. 2023, 82, 8983–9027. [CrossRef]
6. Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, H. Retrieval-augmented generation for large language
models: A survey. arXiv 2023, arXiv:2312.10997.
7. Chen, J.; Lin, H.; Han, X.; Sun, L. Benchmarking large language models in retrieval-augmented generation. In Proceedings of the
38th Annual AAAI Conference on Artificial Intelligence, Vancouver, Canada, 20–27 February 2024; Volume 38, pp. 17754–17762.
8. Huang, Z.; Liu, P.; de Melo, G.; He, L.; Wang, L. Generating Persona-Aware Empathetic Responses with Retrieval-Augmented
Prompt Learning. In Proceedings of the ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 12441–12445.
9. Zhang, Q.; Zhou, Z.; Han, X.; Li, Y.; Jia, Z. A Recommender for Personalized Travel Planning Using Stacked Autoencoder in a
Multimodal Transportation Network. J. Transp. Eng. Part A Syst. 2024, 150, 04023135. [CrossRef]
10. Yuan, Y.; Zheng, W. Your trip, your way: An adaptive tourism recommendation system. Appl. Soft Comput. 2024, 154, 111330.
[CrossRef]
11. Nitu, P.; Coelho, J.; Madiraju, P. Improvising personalized travel recommendation system with recency effects. Big Data Min.
Anal. 2021, 4, 139–154. [CrossRef]
12. Memon, I.; Chen, L.; Majid, A.; Lv, M.; Hussain, I.; Chen, G. Travel recommendation using geo-tagged photos in social media for
tourist. Wirel. Pers. Commun. 2015, 80, 1347–1362. [CrossRef]
13. Liu, Q.; Ge, Y.; Li, Z.; Chen, E.; Xiong, H. Personalized travel package recommendation. In Proceedings of the 2011 IEEE 11th
International Conference on Data Mining, Vancouver, BC, Canada, 11–14 December 2011; pp. 407–416.
14. Chen, Y.Y.; Cheng, A.J.; Hsu, W.H. Travel recommendation by mining people attributes and travel group types from community-
contributed photos. IEEE Trans. Multimed. 2013, 15, 1283–1295. [CrossRef]
15. Sun, Y.; Fan, H.; Bakillah, M.; Zipf, A. Road-based travel recommendation using geo-tagged images. Comput. Environ. Urban Syst.
2015, 53, 110–122. [CrossRef]
16. Zheng, Y.; Xie, X. Learning travel recommendations from user-generated GPS traces. ACM Trans. Intell. Syst. Technol. (TIST) 2011,
2, 1–29. [CrossRef]
17. Ricci, F.; Cavada, D.; Mirzadeh, N.; Venturini, A. Case-based travel recommendations. In Destination Recommendation Systems:
Behavioural Foundations and Applications; CABI: Wallingford, UK, 2006; pp. 67–93.
18. Majid, A.; Chen, L.; Chen, G.; Mirza, H.T.; Hussain, I.; Woodward, J. A context-aware personalized travel recommendation
system based on geotagged social media data mining. Int. J. Geogr. Inf. Sci. 2013, 27, 662–684. [CrossRef]
19. Cheng, A.J.; Chen, Y.Y.; Huang, Y.T.; Hsu, W.H.; Liao, H.Y.M. Personalized travel recommendation by mining people attributes
from community-contributed photos. In Proceedings of the 19th ACM International Conference on Multimedia, Scottsdale, AZ,
USA, 28 November–1 December 2011; pp. 83–92.
31
Electronics 2024, 13, 4751
20. Zheng, Y.; Burke, R.; Mobasher, B. Differential context relaxation for context-aware travel recommendation. In Proceedings of
the E-Commerce and Web Technologies: 13th International Conference, EC-Web 2012, Vienna, Austria, 4–5 September 2012;
Proceedings 13. Springer: Berlin/Heidelberg, Germany, 2012; pp. 88–99.
21. Wong, I.A.; Lian, Q.L.; Sun, D. Autonomous travel decision-making: An early glimpse into ChatGPT and generative AI. J. Hosp.
Tour. Manag. 2023, 56, 253–263. [CrossRef]
22. Chen, L.; Cao, J.; Tao, H.; Wu, J. Trip reinforcement recommendation with graph-based representation learning. ACM Trans.
Knowl. Discov. Data 2023, 17, 1–20. [CrossRef]
23. Kowald, D.; Yang, D.; Lacic, E. Reviews in recommender systems: 2022. Front. Big Data 2024, 7, 1384460. [CrossRef]
24. Pazzani, M.J.; Billsus, D. Content-based recommendation systems. In The Adaptive Web: Methods and Strategies of Web Personalization;
Springer: Berlin/Heidelberg, Germany, 2007; pp. 325–341.
25. Guerard, G.; Gabot, Q.; Djebali, S. Tourism profile measure for data-driven tourism segmentation. Int. J. Mach. Learn. Cybern.
2024, 15, 1–26. [CrossRef]
26. Deseure Charron, F.; Djebali, S.; Guérard, G. Clustering method for touristic photographic spots recommendation. In Proceedings
of the Advanced Data Mining and Applications: 18th International Conference, ADMA 2022, Brisbane, Australia, 28–30 November
2022; Proceedings, Part II. Springer: New York, NY, USA, 2022; pp. 223–237.
27. Adomavicius, G.; Bauman, K.; Mobasher, B.; Tuzhilin, A.; Unger, M. Workshop on Context-Aware Recommender Systems 2023.
In Proceedings of the 17th ACM Conference on Recommender Systems, Singapore, 18–22 September 2023; pp. 1234–1236.
28. Abbasi-Moud, Z.; Vahdat-Nejad, H.; Sadri, J. Tourism recommendation system based on semantic clustering and sentiment
analysis. Exp. Syst. Appl. 2021, 167, 114324. [CrossRef]
29. Blanco-Moreno, S.; González-Fernández, A.M.; Muñoz-Gallego, P.A. Big data in tourism marketing: Past research and future
opportunities. Span. J. Market.-ESIC 2023, ahead-of-print. [CrossRef]
30. D’Urso, P.; De Giovanni, L.; Disegna, M.; Massari, R.; Vitale, V. A tourist segmentation based on motivation, satisfaction and prior
knowledge with a socio-economic profiling: A clustering approach with mixed information. Soc. Indic. Res. 2021, 154, 335–360.
[CrossRef]
31. He, A.Z.; Zhang, Y. AI-powered touch points in the customer journey: A systematic literature review and research agenda. J. Res.
Interact. Mark. 2023, 17, 620–639. [CrossRef]
32. Strohmann, T.; Siemon, D.; Khosrawi-Rad, B.; Robra-Bissantz, S. Toward a design theory for virtual companionship. Hum.–Comput.
Interact. 2023, 38, 194–234. [CrossRef]
33. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Amodei, D. Language models are few-shot learners. Adv.
Neural Inf. Process. Syst. 2020, 33, 1877–1901.
34. Singh, P.K.; Othman, E.; Ahmed, R.; Mahmood, A.; Dhahri, H.; Choudhury, P. Optimized recommendations by user profiling
using a priori algorithm. Appl. Soft Comput. 2021, 106, 107272. [CrossRef]
35. Jung, C.G. Psychology of the Transference: (From Vol. 16 Collected Works); Princeton University Press: Princeton, NJ, USA, 2020;
Volume 8.
36. Marston, W.M.; King, C.D.; Marston, E.H. Integrative Psychology: A Study of Unit Response; Routledge: London, UK, 2020.
37. Dağlarlı, E.; Arıbaş, E. Personality identification by deep learning. In Proceedings of the 2017 25th Signal Processing and
Communications Applications Conference (SIU), Antalya, Turkey, 15–18 May 2017; pp. 1–4.
38. Celli, F.; Lepri, B. Is big five better than MBTI? A personality computing challenge using Twitter data. Comput. Linguist. CLiC-It
2018, 2018, 93.
39. Stein, R.; Swan, A.B. Evaluating the validity of Myers-Briggs Type Indicator theory: A teaching tool and window into intuitive
psychology. Soc. Personal. Psychol. Compass 2019, 13, e12434. [CrossRef]
40. Malik, M.A.; Zamir, S. The relationship between Myers Briggs Type Indicator (MBTI) and emotional intelligence among university
students. J. Educ. Pract. 2014, 5, 35–42.
41. Kokko, K.; Tolvanen, A.; Pulkkinen, L. Associations between personality traits and psychological well-being across time in middle
adulthood. J. Res. Personal. 2013, 47, 748–756. [CrossRef]
42. Luhmann, M.; Orth, U.; Specht, J.; Kandler, C.; Lucas, R.E. Studying changes in life circumstances and personality: It’s about time.
Eur. J. Personal. 2014, 28, 256–266. [CrossRef]
43. Ping, Y.; Gao, C.; Liu, T.; Du, X.; Luo, H.; Jin, D.; Li, Y. User consumption intention prediction in Meituan. In Proceedings of the
27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; pp. p 3472–3482.
44. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI
Blog 2019, 1, 9.
45. Ramachandran, P.; Parmar, N.; Vaswani, A.; Bello, I.; Levskaya, A.; Shlens, J. Stand-alone self-attention in vision models. In
Advances in Neural Information Processing Systems 32 (NeurIPS 2019); NeurIPS Proceedings: San Diego, CA, USA, 2019; Volume 32.
46. Raffel, C.; Luong, M.T.; Liu, P.J.; Weiss, R.J.; Eck, D. Online and linear-time attention by enforcing monotonic alignments. In
Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 2837–2846.
electronics
Article
Self-Supervised Graph Attention Collaborative Filtering
for Recommendation
Jiangqiang Zhu 1 , Kai Li 1,2, *, Jinjia Peng 1,2 and Jing Qi 1,2
1 School of Cyber Security and Computer, Hebei University, Baoding 071000, China
2 Hebei Machine Vision Engineering Research Center, Hebei University, Baoding 071000, China
* Correspondence: likai@[Link]
Abstract: Due to the complementary nature of graph neural networks and structured data in rec-
ommendations, recommendation systems using graph neural network techniques have become
mainstream. However, there are still problems, such as sparse supervised signals and interaction
noise, in the recommendation task. Therefore, this paper proposes a self-supervised graph attention
collaborative filtering for recommendation (SGACF). The correlation between adjacent nodes is
deeply mined using a multi-head graph attention network to obtain accurate node representations. It
is worth noting that self-supervised learning is brought in as an auxiliary task in the recommendation,
where the supervised task is the main task, and it assists model training for the supervised
task. Multiple views of a node are generated by graph data augmentation. We maximize the
consistency between different views of the same node and minimize the consistency
between views of different nodes. In this paper, the effectiveness
of the method is illustrated by abundant experiments on three public datasets. The results show its
significant improvement in the accuracy of the long-tail item recommendation and the robustness of
the model.
learning models are more expressive and can mine more latent patterns in data. The model
structure based on deep learning is very flexible. It may be dynamically altered based on
the business scenario and data characteristics to ensure that the model fits precisely with
the application scenario. Recommendation based on deep learning has become mainstream,
from the simple single-layer neural network model AutoRec [7] to the classical deep neural
network structure Deep Crossing [8], which mainly increases deep neural network layer
count and structural complexity. NeuralCF [9] alters the interaction of user and item vectors
and enriches the way features are intersected in deep learning. Wide&Deep [10] enhances
the model’s integrative capabilities by integrating two components with distinct
traits and complementary strengths. NFM [11] utilizes neural networks to improve the
feature crossover capability of the second-order part.
In recent years, graph neural networks [12–15] have received a lot of interest from
academia and business due to the powerful representational power of graph structures.
Graph neural networks are a class of deep learning-based methods for processing graph
domain information, and recommendations combined with graph neural networks have
been widely studied due to their better representational power and interpretability. Col-
laborative filtering recommendation based on graph neural networks [16,17] builds the
user–item interaction as a user–item bipartite graph. It utilizes higher-order connectivity
on the bipartite graph to enrich the user and item vector representation. Recommender
with graph convolutional networks [16] provides a complete solution for including higher-
order neighbors in node representation learning. Although effective, there are still some
limitations: sparse supervised signals, long-tail problem and interaction noise. Most mod-
els are performed in a supervised learning paradigm for recommendation tasks [9,18],
where supervised signals are derived from observed user–item interactions. However, the
observed interaction information is extremely sparse compared to the entire interaction
space [19,20] and is not sufficient to learn feature-rich node representations. The long-tail
problem means that high-degree nodes (those with many connected edges)
dominate representation learning [21], while low-degree nodes (those with few connected
edges, i.e., long-tail items) are difficult to learn. Most of the feedback provided by users is
implicit (e.g., click and browse) as opposed to explicit (e.g., rate, purchase, and like). Thus,
observed interactions usually contain noisy data [22]. For example, users unintentionally
click or browse content that does not interest them, and aggregation methods in graph
convolutional networks are unable to distinguish these noisy data, making the learning of
node representations more susceptible to noisy data.
This paper addresses the above limitations by combining graph attention networks [23]
and self-supervised learning [24]. The graph attention network serves as the backbone
network for the supervised learning task. It assigns different
learning weights to different neighboring nodes, which significantly reduces the amount
of noisy data brought into the aggregation process. Self-supervised learning is widely
utilized in the domains of computer vision and natural language processing [25,26], but is
currently relatively rare in the field of recommendation. At its core are proxy tasks,
which allow unlabeled data itself to be used to generate labels without
the need for manual data annotation. For example, BERT [27] masks some of the words in
the text at random and sets a proxy task to predict them; RotNet [28] utilizes
rotated images as the input for training, giving the model better representation
capabilities. In contrast to supervised learning, self-supervised learning permits changes in
the input data to leverage the unlabeled data space to achieve significant improvements in
downstream tasks. In this study, the benefits of self-supervised learning are included in
the recommendation representation learning to solve the constraints of the graph neural
network-based recommendation models mentioned above.
The self-supervised learning task contains two main parts: (1) data augmentation, which
generates multiple views per node, and (2) contrastive learning, which maximizes the consistency
between multiple views of the same node while minimizing the consistency across views of
distinct nodes. In graph representation learning, the properties of the data have a strong
Electronics 2023, 12, 793
impact on the representation results of the nodes, especially their structural properties.
Therefore, self-supervised signals can be built from unlabeled data by altering the structure
of the graph. To this end, this paper uses three graph data-augmentation methods (node mask,
edge mask, and layer mask) to change the graph structure and performs contrastive learning
based on graph attention networks. Self-supervised learning enhances node representation
learning by exploring the relationships among nodes, and thereby complements graph neural
network-based recommendation models. Node self-identification provides auxiliary supervised
signals that complement classical supervised learning, which relies on observed interactions
only. Graph data augmentation reduces the influence of high-degree nodes on model training
by removing some of their edges.
In summary, this work proposes self-supervised graph attention collaborative filtering
for recommendation. It can effectively address the problems of sparse supervised signals
and long tails in graph neural network-based recommendation and reduce the impact of the
noisy interaction data that is brought in. Section 3 describes the proposed method in detail,
which contains two main tasks: a supervised task and a self-supervised task. Sections 4 and 5
demonstrate the effectiveness of the proposed method through extensive experiments on
three public datasets.
2. Related Work
This section introduces three aspects related to this work: collaborative filtering-based
recommendation, graph neural network-based recommendation, and self-supervised
learning.
tation and improves the interpretability of the model; BGCF [37] also treats the interaction
graph as a random variable to mitigate the impact of uncertainty in the user–item
interaction graph; and LR-GCCF [38] and LightGCN [39] analyze operations in graph neural
networks, such as feature transformation and nonlinear activation, and simplify them
to improve the performance of the model.
Although the above methods achieve relatively good results, noisy data introduced while
aggregating neighbors prevents them from obtaining high-quality node representations.
Therefore, this work employs a multi-head graph attention network to mine the correlation
between neighboring nodes and obtain more accurate node representations. This work uses
a supervised learning paradigm for model training, but sparse supervised signals lead to a
loss of performance. Therefore, self-supervised learning is incorporated as an auxiliary task
to enhance the supervised learning task.
3. Proposed Method
This paper proposes self-supervised learning graph attention-based collaborative
filtering recommendation (SGACF), whose architecture is shown in Figure 1. The framework
is divided into two components: a supervised task and a self-supervised task. The
supervised task serves as the main part of the framework, with the graph attention network
as the backbone network. The self-supervised task constructs supervised signals from the
correlations within the input data and is learned jointly with the supervised task as an
auxiliary task. This section first introduces the supervised task framework and then the
self-supervised task. It describes how data augmentation generates multiple view
representations and how contrastive learning constructs the pretext task from the generated
representations. Finally, a theoretical analysis of how the self-supervised task enhances
supervised learning is presented.
Figure 1. Overall framework. The upper part is the supervised task, in which an H-graph attention
network performs feature extraction; the multiple lines in the network represent the multi-head
attention mechanism. The lower part is self-supervised learning as an auxiliary task, which shares
parameters with the network layers in the supervised task.
Figure 2. Supervised learning framework. It consists of three main components: the embedding layer,
the neighbor aggregation and embedding propagation layer, and the prediction layer. y(user,item) is
the user’s preference score for the item.
denotes the new vector representation generated, and AGG(·) is the aggregation function.
The aggregation function is implemented as follows. First, to obtain sufficient representational
power, a feature transformation is applied to the input node embedding vectors to obtain a
new set of node embedding vectors. A shared parameterized weight matrix $W \in \mathbb{R}^{d \times d}$
is applied to each node, where d is the size of the transformed vector representation. Then, the
attention coefficient between neighboring nodes is calculated. The formula is as follows:
$$e_u^{final} = \sigma\left( \frac{1}{K} \sum_{k=1}^{K} \sum_{j \in N_u} \alpha_{uj}^{k} w^{k} e_j \right) \quad (5)$$
where || denotes the concatenation operation, $\alpha_{uj}^{k}$ denotes the coefficient output by the
kth attention mechanism between user u and item j, and $w^{k}$ is the weight matrix of the
corresponding linear transformation.
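The head averaging in Equation (5) can be illustrated with a minimal NumPy sketch. The formula for the attention coefficients is not reproduced here, so the sketch assumes GAT-style scoring (a per-head attention vector `a`, LeakyReLU, and a softmax over the neighbors); `multi_head_aggregate`, `a`, and the use of tanh for σ are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)


def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()


def multi_head_aggregate(e_u, neighbors, W, a, activation=np.tanh):
    """Average K attention heads over the neighbors of node u (Equation (5)).

    e_u:        (d,) embedding of the target node u
    neighbors:  (n, d) embeddings e_j of the neighbors j in N_u
    W:          (K, d_out, d) one linear transformation w^k per head
    a:          (K, 2 * d_out) one attention vector per head (assumed, GAT-style)
    activation: stands in for sigma; tanh is an arbitrary choice
    """
    num_heads, d_out, _ = W.shape
    total = np.zeros(d_out)
    for k in range(num_heads):
        h_u = W[k] @ e_u                      # transformed target node, (d_out,)
        h_n = neighbors @ W[k].T              # transformed neighbors, (n, d_out)
        # Assumed scoring: LeakyReLU(a^T [w^k e_u || w^k e_j]) for every neighbor j
        pairs = np.concatenate([np.tile(h_u, (len(h_n), 1)), h_n], axis=1)
        alpha = softmax(leaky_relu(pairs @ a[k]))  # coefficients alpha_uj^k, sum to 1
        total += alpha @ h_n                  # sum_j alpha_uj^k w^k e_j
    return activation(total / num_heads)      # average over heads, then sigma


rng = np.random.default_rng(0)
e_u = rng.normal(size=4)
neighbors = rng.normal(size=(3, 4))
W = rng.normal(size=(2, 4, 4))                # K = 2 heads
a = rng.normal(size=(2, 8))
print(multi_head_aggregate(e_u, neighbors, W, a).shape)  # (4,)
```

Averaging the heads, rather than concatenating them, keeps the output dimension independent of the number of heads K.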
The node representation is enhanced by a first-order neighbor propagation layer, and
multiple graph attention network layers are then stacked to obtain higher-order neighbor
features. Such higher-order neighbor features can uncover the potential interests of users
and can improve the generalization of the model. Each node in the interaction graph
performs first-order propagation to update its representation, and second-order neighbor
features are obtained by performing first-order propagation again. Thus, the collaborative
signals of higher-order neighbors can be obtained through multiple iterations. The specific
formula is expressed as follows:
$$\begin{cases} e_u^{(1)} = \mathrm{AGG}(e_u^{(0)}, G) \\ e_u^{(h)} = \mathrm{AGG}(e_u^{(h-1)}, G) \end{cases} \quad (6)$$
where $e_u^{(0)}$ is the node vector representation after initialization, $e_u^{(1)}$ is the node
representation after aggregating first-order neighbors, and $e_u^{(h)}$ is the vector representation
after aggregating h-order neighbors. The output vectors of the multiple network layers contain
node vector representations with neighbor features of different orders, which directly affect
the final vector representation of the nodes. Nodes with rich low-order neighbors are less
dependent on higher-order neighbor collaboration signals; conversely, nodes with few
low-order neighbors require more high-order neighbor signals to enrich their vector
representation. Therefore, average pooling is adopted to merge the vector representations
of different orders. The formula is as follows:
$$e_u^{final} = \frac{1}{H+1} \sum_{h=0}^{H} e_u^{(h)} \quad (7)$$
where $e_u^{final}$ and $e_i^{final}$ are the final vector representations of users and items, and
$\hat{y}_{ui}$ is the preference score of user u for item i.
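The iterative propagation in Equation (6) and the average pooling in Equation (7) can be sketched as follows. As a simplifying assumption, a fixed row-normalized adjacency matrix `A_hat` stands in for AGG(·), whereas the model described above weights neighbors with learned attention coefficients; the toy graph and the function names are hypothetical.

```python
import numpy as np


def propagate(E0, A_hat, num_layers):
    """Return [E^(0), E^(1), ..., E^(H)] by repeating first-order aggregation.

    E0:     (n, d) initial embeddings of all user and item nodes
    A_hat:  (n, n) fixed row-normalized adjacency; a simple stand-in for AGG,
            which in the text uses learned attention coefficients instead
    """
    layers = [E0]
    for _ in range(num_layers):
        layers.append(A_hat @ layers[-1])   # e^(h) = AGG(e^(h-1), G)
    return layers


def average_pool(layers):
    """Merge the H + 1 layer outputs by average pooling (Equation (7))."""
    return np.mean(np.stack(layers, axis=0), axis=0)


# Toy bipartite graph: nodes 0-1 are users, nodes 2-3 are items.
A = np.array([[0, 0, 1, 1],
              [0, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)
A_hat = A / A.sum(axis=1, keepdims=True)
E0 = np.random.default_rng(0).normal(size=(4, 8))
E_final = average_pool(propagate(E0, A_hat, num_layers=3))
print(E_final.shape)  # (4, 8)
```

Because the pooled vector includes the layer-0 embedding, a node keeps part of its own initial representation however many layers are stacked.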
unobserved user–items, and it assumes that observed interactions are more indicative of
user preferences and therefore are granted a higher ranking than unobserved interactions:
where the operations $s_1$ and $s_2$ are executed on G to change the graph structure and create
two related views of the node, $e_1^{(h)}$ and $e_2^{(h)}$. Setting the probability of a node being
dropped as ρ, $s_1$ and $s_2$ can be modeled as follows.
where $C_1, C_2 \in \{0, 1\}^{|E|}$ are two mask vectors that randomly mask the edge set E to generate
two subgraphs. Local neighbors of nodes affect representation learning, further mitigating
the sensitivity of representation learning to structure.
Figure 3. Graph augmentation method. Top left is node mask, top right is edge mask, and bottom is
layer mask.
$$s_1(G) = (V, C_1^{(h)} \odot E), \quad s_2(G) = (V, C_2^{(h)} \odot E) \quad (14)$$
where $C_1^{(h)}, C_2^{(h)} \in \{0, 1\}^{|E|}$ are two mask vectors that randomly mask the edges of the
h-th network layer.
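The three augmentation operators can be sketched on an edge list as follows. This is a minimal illustration under stated assumptions: edges are stored as an array of (user node id, item node id) pairs, the same drop probability `rho` is used for nodes and edges, and the layer mask simply draws an independent edge mask for every layer. The function names are hypothetical.

```python
import numpy as np


def edge_mask(edges, rho, rng):
    """Drop each edge independently with probability rho."""
    keep = rng.random(len(edges)) >= rho
    return edges[keep]


def node_mask(edges, num_nodes, rho, rng):
    """Drop each node with probability rho, together with all of its edges."""
    dropped = rng.random(num_nodes) < rho
    keep = ~(dropped[edges[:, 0]] | dropped[edges[:, 1]])
    return edges[keep]


def layer_mask(edges, rho, num_layers, rng):
    """Draw an independent edge mask C^(h) for every layer h."""
    return [edge_mask(edges, rho, rng) for _ in range(num_layers)]


# Toy interaction graph: each row is (user node id, item node id).
edges = np.array([[0, 3], [0, 4], [1, 3], [2, 4], [2, 5]])
rng = np.random.default_rng(42)
view_1 = edge_mask(edges, rho=0.2, rng=rng)   # s_1(G)
view_2 = edge_mask(edges, rho=0.2, rng=rng)   # s_2(G), an independent draw
```

Calling the same operator twice with independent random draws yields the two views of the graph that the contrastive task compares.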
are distinct views of distinct nodes after augmentation. This paper follows SimCLR and
utilizes the contrastive loss InfoNCE:
$$L_{ssl}^{user} = \sum_{u \in U} -\log \frac{\exp(s(e_u', e_u'')/\tau)}{\sum_{o \in U} \exp(s(e_u', e_o'')/\tau)} \quad (15)$$
where s(·) is the cosine similarity function and τ is the temperature coefficient. Similarly, the
item-side contrastive loss $L_{ssl}^{item}$ can be obtained. Combining these two losses results in the
self-supervised loss $L_{ssl} = L_{ssl}^{user} + L_{ssl}^{item}$.
where θ denotes the parameters of supervised learning, and λ1 and λ2 are hyperparameters that
control the contributions of the self-supervised loss and the L2 regularization, respectively.
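The contrastive loss in Equation (15) can be sketched in NumPy as follows. The sketch assumes that the two augmented views of the same n nodes are given as row-aligned matrices, so that row u of each matrix is a view of node u; the function name `info_nce` and the default temperature of 0.2 are illustrative choices, not values taken from the paper.

```python
import numpy as np


def info_nce(view_1, view_2, tau=0.2):
    """InfoNCE loss summed over nodes, with cosine similarity as s(.).

    view_1, view_2: (n, d) embeddings of the same n nodes under two augmentations.
    Row u of view_1 and row u of view_2 form the positive pair; every row of
    view_2 (including row u) appears in the denominator, as in Equation (15).
    """
    z1 = view_1 / np.linalg.norm(view_1, axis=1, keepdims=True)
    z2 = view_2 / np.linalg.norm(view_2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                       # (n, n) cosine similarities / tau
    # Numerically stable log-sum-exp over each row (the denominator)
    row_max = logits.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    positives = np.diag(logits)                    # s(e_u', e_u'') / tau
    return float(np.sum(log_denominator - positives))


rng = np.random.default_rng(0)
users_view_1 = rng.normal(size=(5, 16))
users_view_2 = users_view_1 + 0.1 * rng.normal(size=(5, 16))  # a slightly perturbed view
loss_user = info_nce(users_view_1, users_view_2)
```

The item-side loss is computed in the same way on the item embeddings, and the two values are added to form the self-supervised loss.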
4. Experiment
This work conducts experiments on three publicly available datasets to evaluate the
effectiveness of the proposed method. The settings of the parameters in the model are
presented, and the performance is analyzed in comparison with other models.
$$\mathrm{Recall} = \frac{1}{|U|} \sum_{u \in U} \frac{|M_u \cap M_u^{test}|}{|M_u^{test}|} \quad (17)$$
where $M_u$ is the recommendation list and $M_u^{test}$ is the set of positive samples of user u in the
test set. NDCG takes the position of each item in the recommendation list into account; the
higher the value, the better the recommendation. The formula is as follows:
$$\mathrm{NDCG} = \frac{1}{|U|} \sum_{u \in U} \frac{\sum_{p=1}^{K} \frac{rel(p)}{\log(p+1)}}{\sum_{p=1}^{TP} \frac{1}{\log(p+1)}} \quad (18)$$
where rel(·) denotes the relevance of the item at a given position, and TP denotes the items in the
recommendation list ordered by relevance from largest to smallest.
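Equations (17) and (18) can be illustrated with a short sketch. It assumes binary relevance (rel(p) = 1 if the item at position p is in the user's test set, otherwise 0) and takes the ideal ranking to place min(K, |M_u^test|) relevant items first; both are common conventions assumed here, not details stated in the text. The logarithm base cancels in the ratio, so base 2 is used. All function names are hypothetical, and every user is assumed to have a non-empty test set.

```python
import numpy as np


def recall_at_k(recommended, test_items):
    """Fraction of a user's test items that appear in the top-K list."""
    hits = len(set(recommended) & set(test_items))
    return hits / len(test_items)


def ndcg_at_k(recommended, test_items):
    """DCG of the top-K list divided by the DCG of an ideal ranking."""
    test = set(test_items)
    # Binary relevance (assumed): position p contributes 1 / log2(p + 1) on a hit
    dcg = sum(1.0 / np.log2(p + 1)
              for p, item in enumerate(recommended, start=1) if item in test)
    ideal_hits = min(len(recommended), len(test))
    idcg = sum(1.0 / np.log2(p + 1) for p in range(1, ideal_hits + 1))
    return dcg / idcg


def evaluate(rec_lists, test_sets):
    """Average both metrics over all users."""
    users = list(test_sets)
    recall = np.mean([recall_at_k(rec_lists[u], test_sets[u]) for u in users])
    ndcg = np.mean([ndcg_at_k(rec_lists[u], test_sets[u]) for u in users])
    return float(recall), float(ndcg)


rec_lists = {"u1": [1, 2, 3], "u2": [4, 5, 6]}
test_sets = {"u1": {1, 3, 9}, "u2": {7}}
recall, ndcg = evaluate(rec_lists, test_sets)
```

In this toy example, user u1 has two of three test items in the list and user u2 has none, so the averaged recall is 1/3.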
Analysis of the results shows that the edge mask improves the model most significantly.
SGACF-EM outperforms SGACF-LM, and SGACF-LM outperforms SGACF-NM, which
may be due to the inherent ability of the edge mask to capture the graph structure. NM may be
thought of as a special case of EM in which the edges of certain nodes are masked. The
performance of SGACF-NM in the experiment is relatively unstable, which suggests that
masking high-degree nodes leads to training instability. In terms of order, a too-low order
leaves the vector representations learned by the model lacking information, while a too-high
order causes node representations to converge, so that the vector representations of different
nodes cannot be distinguished. Table 1 shows that using the 3-order setting yields better
performance. In addition, self-supervised learning can enhance the generalization ability of
the model; that is, contrastive learning between different nodes can alleviate the
over-smoothing of node representations. To give a clearer picture of the effect of the order
and the graph augmentation method on performance, the results are visualized in Figure 4.
• NeuMF learns only low-dimensional vector representations of users and items using
the embedding layer. The results are poor on all three datasets, indicating that
the embedding vectors of users and items lack valuable information.
• CMN outperforms the other comparison models in terms of recall on the Yelp and Amazon
datasets. This is because the attention mechanism in the model enhances the
representational ability of nodes in heterogeneous networks.
• SpectralCF outperforms NeuMF on all three datasets, which indicates that the graph
neural network-based recommendation model outperforms the general deep learning-based
model in terms of structure. Modeling the interactions between users and items
as a bipartite graph can better explore the potential interests of users and enrich the
embedding vectors of users and items.
• NGCF performs well among the comparison models. It defines the convolution
operation directly on the spatial domain, without depending on graph convolution
theory. While improving flexibility, the spatial domain-based graph convolution
recommendation model also outperforms the other methods.
• In comparison to NGCF, SGACF-EM improves the NDCG on the Yelp and Amazon
datasets by 5% and 6%, respectively. SGACF uses a self-attention mechanism when
aggregating neighboring features to weight the neighboring features of nodes
according to their importance, thus improving the embedding representations of
users and items. To make the calculation of attention coefficients more stable,
a multi-head self-attention mechanism is employed. The experimental results
verify the rationality and effectiveness of the method proposed in this paper.
CMN only uses the features of first-order neighbors, whereas SGACF mines higher-order
collaborative signals based on higher-order connectivity, which illustrates the
importance of higher-order connectivity in graph representation learning.
SpectralCF and NGCF use spectral-based and spatial domain-based graph convolution
networks, respectively; both ignore the reliability of feature propagation between
neighboring nodes and are prone to bringing in noisy data. Moreover, when training
graph convolutional networks, high-degree nodes tend to dominate the representation
learning of the model. Therefore, SGACF is trained jointly with self-supervised
learning, which makes low-degree nodes easier to learn.
5. Ablation Analysis
To demonstrate the effectiveness of the adopted techniques, this paper conducts ablation
experiments on the multi-head graph attention mechanism and the self-supervised task.
Table 3. Effect of graph attention networks with different numbers of heads on performance.
Author Contributions: Methodology and writing—original draft preparation, J.Z.; review and vali-
dation, K.L., J.P. and J.Q. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the Natural Science Foundation of Hebei Province under grant
No. F2022201009; the Hebei University High-level Scientific Research Foundation for the Introduction
of Talent under grant No. 521100221029; the Scientific Research Foundation of Hebei University for
Distinguished Young Scholars under grant No. 521100221081 and the Scientific Research Foundation
of Colleges and Universities in Hebei Province under grant No. QN2022107.
Data Availability Statement: All datasets are publicly available.
Acknowledgments: Thanks to the mentor for his careful guidance. Thanks to the anonymous
reviewers for their insightful comments that improved the quality of this paper. Thanks to my
girlfriend for encouraging me so much.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Linden, G.; Smith, B.; York, J. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Comput. 2003, 7,
76–80. [CrossRef]
2. Sarwar, B.; Karypis, G.; Konstan, J.; Riedl, J. Item-based collaborative filtering recommendation algorithms. In Proceedings of the
10th International Conference on World Wide Web, Hong Kong, China, 1–5 May 2001; pp. 285–295.
3. Rendle, S. Factorization machines. In Proceedings of the IEEE International Conference on Data Mining, Sydney, NSW, Australia,
13–17 December 2010; pp. 995–1000.
4. Juan, Y.; Zhuang, Y.; Chin, W.S.; Lin, C.J. Field-aware factorization machines for CTR prediction. In Proceedings of the 10th ACM
Conference on Recommender Systems, Boston, MA, USA, 15–19 September 2016; pp. 43–50.
5. He, X.; Pan, J.; Jin, O.; Xu, T.B.; Liu, B.; Xu, T.; Shi, Y.X.; Atallah, A.; Herbrich, R.; Bowers, S.; et al. Practical lessons from predicting
clicks on ads at Facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, New
York, NY, USA, 24–27 August 2014; pp. 1–9.
6. Koren, Y.; Bell, R.; Volinsky, C. Matrix factorization techniques for recommender systems. Computer 2009, 42, 30–37. [CrossRef]
7. Sedhain, S.; Menon, A.K.; Sanner, S.; Xie, L. Autorec: Autoencoders meet collaborative filtering. In Proceedings of the 24th
International Conference on World Wide Web, Florence, Italy, 18–22 May 2015; pp. 111–112.
8. Shan, Y.; Hoens, T.R.; Jiao, J.; Wang, H.; Yu, D.; Mao, J. Deep crossing: Web-scale modeling without manually crafted combinatorial
features. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San
Francisco, CA, USA, 13–17 August 2016; pp. 255–262.
9. He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; Chua, T.S. Neural collaborative filtering. In Proceedings of the 26th International
Conference on World Wide Web, Perth, Australia, 3–7 April 2017; pp. 173–182.
10. Cheng, H.T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al. Wide
& deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems,
Boston, MA, USA, 15 September 2016; pp. 7–10.
11. He, X.; Chua, T.S. Neural factorization machines for sparse predictive analytics. In Proceedings of the 40th International ACM
SIGIR conference on Research and Development in Information Retrieval, Tokyo, Japan, 7–11 August 2017; pp. 355–364.
12. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907.
13. Levie, R.; Monti, F.; Bresson, X.; Bronstein, M.M. Cayleynets: Graph convolutional neural networks with complex rational spectral
filters. IEEE Trans. Signal Process. 2018, 67, 97–109. [CrossRef]
14. Niepert, M.; Ahmed, M.; Kutzkov, K. Learning convolutional neural networks for graphs. In Proceedings of the International
Conference on Machine Learning, PMLR, New York, NY, USA, 20–22 June 2016; pp. 2014–2023.
15. Gilmer, J.; Schoenholz, S.S.; Riley, P.F.; Vinyals, O.; Dahl, G.E. Neural message passing for quantum chemistry. In Proceedings of
the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 1263–1272.
16. Wang, X.; He, X.; Wang, M.; Feng, F.; Chua, T.S. Neural graph collaborative filtering. In Proceedings of the 42nd international
ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21–25 July 2019; pp. 165–174.
17. Wang, X.; Jin, H.; Zhang, A.; He, X.; Xu, T.; Chua, T.S. Disentangled graph collaborative filtering. In Proceedings of the 43rd
International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, China, 25–30 July
2020; pp. 1001–1010.
18. Rendle, S.; Freudenthaler, C.; Gantner, Z.; Schmidt-Thieme, L. BPR: Bayesian personalized ranking from implicit feedback. arXiv
2012, arXiv:1205.2618.
19. Bayer, I.; He, X.; Kanagal, B.; Rendle, S. A generic coordinate descent framework for learning from implicit feedback. In
Proceedings of the 26th International Conference on World Wide Web, Perth, Australia, 3–7 April 2017; pp. 1341–1350.
20. He, R.; McAuley, J. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In
Proceedings of the 25th International Conference on World Wide Web, Montreal, QC, Canada, 11–15 April 2016; pp. 507–517.
21. Tang, X.; Yao, H.; Sun, Y.; Wang, Y.; Tang, J.; Aggarwal, C.; Mitra, P.; Wang, S. Investigating and mitigating degree-related biases
in graph convolutional networks. In Proceedings of the 29th ACM International Conference on Information & Knowledge
Management, Virtual Event, Ireland, 19–23 October 2020; pp. 1435–1444.
22. Wang, W.; Feng, F.; He, X.; Nie, L.; Chua, T.S. Denoising implicit feedback for recommendation. In Proceedings of the 14th ACM
International Conference on Web Search and Data Mining, Virtual Event, Israel, 8–12 March 2021; pp. 373–381.
23. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903.
24. Yao, T.; Yi, X.; Cheng, D.Z.; Xu, F.; Chen, T.; Menon, A.; Hong, L.; Chi, E.H.; Tjoa, S.; Kang, J.; et al. Self-supervised learning for
large-scale item recommendations. In Proceedings of the 30th ACM International Conference on Information and Knowledge
Management, Virtual Event, Queensland, Australia, 1–5 November 2021; pp. 4321–4330.
25. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 9729–9738.
26. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. Albert: A lite bert for self-supervised learning of language
representations. arXiv 2019, arXiv:1909.11942.
27. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding.
arXiv 2018, arXiv:1810.04805.
28. Gidaris, S.; Singh, P.; Komodakis, N. Unsupervised representation learning by predicting image rotations. arXiv 2018,
arXiv:1803.07728.
29. Hofmann, T. Latent semantic models for collaborative filtering. ACM Trans. Inf. Syst. (TOIS) 2004, 22, 89–115. [CrossRef]
30. Miyahara, K.; Pazzani, M.J. Collaborative Filtering with the Simple Bayesian Classifier. In Proceedings of the 6th Pacific Rim
international conference on Artificial intelligence, Melbourne, Australia, 28 August–1 September 2002; pp. 679–689.
31. Ungar, L.H.; Foster, D.P. Clustering methods for collaborative filtering. In Proceedings of the AAAI Workshop on Recommendation
Systems, Madison, WI, USA, 26–27 July 1998; Volume 1, pp. 114–129.
32. Resnick, P.; Iacovou, N.; Suchak, M.; Bergstrom, P.; Riedl, J. Grouplens: An open architecture for collaborative filtering of netnews.
In Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work, Chapel Hill, NC, USA, 22–26 October
1994; pp. 175–186.
33. Chowdhury, G.G. Introduction to Modern Information Retrieval; Facet Publishing: London, UK, 2010.
34. Zheng, L.; Lu, C.T.; Jiang, F.; Zhang, J.; Yu, P.S. Spectral collaborative filtering. In Proceedings of the 12th ACM Conference on
Recommender Systems, Vancouver, BC, Canada, 2–7 October 2018; pp. 311–319.
35. Berg, R.; Kipf, T.N.; Welling, M. Graph convolutional matrix completion. arXiv 2017, arXiv:1706.02263.
36. Sun, J.; Zhang, Y.; Guo, W.; Guo, H.; Tang, R.; He, X.; Ma, C.; Coates, M. Neighbor interaction aware graph convolution
networks for recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development
in Information Retrieval, Virtual Event, China, 25–30 July 2020; pp. 1289–1298.
37. Sun, J.; Guo, W.; Zhang, D.; Zhang, Y.; Regol, F.; Hu, Y.; Guo, H.; Tang, R.; Yuan, H.; He, X.; et al. A framework for recommending
accurate and diverse items using bayesian graph convolutional neural networks. In Proceedings of the 26th ACM SIGKDD
International Conference on Knowledge Discovery & Data Mining, Virtual Event, CA, USA, 6–10 July 2020; pp. 2030–2039.
38. Chen, L.; Wu, L.; Hong, R.; Zhang, K.; Wang, M. Revisiting graph based collaborative filtering: A linear residual graph
convolutional network approach. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12
February 2020; pp. 27–34.
39. He, X.; Deng, K.; Wang, X.; Li, Y.; Zhang, Y.; Wang, M. Lightgcn: Simplifying and powering graph convolution network for
recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information
Retrieval, Virtual Event, China, 25–30 July 2020; pp. 639–648.
40. Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.; Buckatskaya, E.; Doersch, C.; Pires, B.A.; Guo, Z.; Azar, M.G.; et al.
Bootstrap your own latent-a new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 2020, 33, 21271–21284.
41. Hjelm, R.D.; Fedorov, A.; Lavoie-Marchildon, S.; Grewal, K.; Bachman, P.; Trischler, A.; Bengio, Y. Learning deep representations
by mutual information estimation and maximization. arXiv 2018, arXiv:1808.06670.
42. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In
Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 1597–1607.
43. Sun, F.Y.; Hoffmann, J.; Verma, V.; Tang, J. Infograph: Unsupervised and semi-supervised graph-level representation learning via
mutual information maximization. arXiv 2019, arXiv:1908.01000.
44. Velickovic, P.; Fedus, W.; Hamilton, W.L.; Lio, P.; Bengio, Y.; Hjelm, R.D. Deep Graph Infomax. ICLR 2019, 2, 4.
45. Hu, W.; Liu, B.; Gomes, J.; Zitnik, M.; Liang, P.; Pande, V.; Leskovec, J. Strategies for pre-training graph neural networks. arXiv
2019, arXiv:1905.12265.
46. Hassani, K.; Khasahmadi, A.H. Contrastive multi-view representation learning on graphs. In Proceedings of the International
Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 4116–4126.
47. Qiu, J.; Chen, Q.; Dong, Y.; Zhang, J.; Yang, H.; Ding, M.; Wang, K.; Tang, J. Gcc: Graph contrastive coding for graph neural
network pre-training. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data
Mining, Virtual Event, CA, USA, 6–10 July 2020; pp. 1150–1160.
48. Ma, J.; Zhou, C.; Yang, H.; Cui, P.; Wang, X.; Zhu, W. Disentangled self-supervision in sequential recommenders. In Proceedings
of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, CA, USA, 6–10 July
2020; pp. 483–491.
49. Xia, L.; Huang, C.; Xu, Y.; Zhao, J.; Yin, D.; Huang, J. Hypergraph contrastive collaborative filtering. In Proceedings of the 45th
International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022;
pp. 70–79.
50. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
51. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth
International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, Sardegna, Italy,
13–15 May 2010; pp. 249–256.
52. Ebesu, T.; Shen, B.; Fang, Y. Collaborative memory network for recommendation systems. In Proceedings of the 41st International
ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, MI, USA, 8–12 July 2018; pp. 515–524.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
electronics
Article
Preference-Aware Light Graph Convolution Network for Social
Recommendation
Haoyu Xu 1 , Guodong Wu 1 , Enting Zhai 1 , Xiu Jin 1 and Lijing Tu 2, *
1 College of Information and Computer Science, Anhui Agricultural University, Hefei 230001, China;
xhy817@[Link] (H.X.)
2 Anhui Provincial Key Laboratory of Smart Agricultural Technology and Equipment, Hefei 230036, China
* Correspondence: tlj@[Link]
Abstract: Social recommendation systems leverage the abundant social information of users existing
in the current Internet to mitigate the problem of data sparsity, ultimately enhancing recommendation
performance. However, most existing recommendation systems that introduce social information
ignore the negative messages passed by high-order neighbor nodes and aggregate messages without
filtering, which results in a decline in the performance of the recommendation system. Considering
this problem, we propose a novel social recommendation model based on graph neural networks
(GNNs) called the preference-aware light graph convolutional network (PLGCN). It contains
a subgraph construction module that uses unsupervised learning to classify users according to
their embeddings and then assigns users with similar preferences to the same subgraph,
filtering out useless or even negative messages from users with different preferences and
thereby improving recommendation performance. We also designed a feature aggregation
module to better combine user embeddings
with social and interaction information. In addition, we employ a lightweight GNN framework
to aggregate messages from neighbors, removing nonlinear activation and feature transformation
operations to alleviate the overfitting problem. Finally, we carried out comprehensive experiments
using two publicly available datasets, and the results indicate that PLGCN outperforms the current
state-of-the-art (SOTA) method, especially in dealing with the problem of cold start. The proposed
model has the potential for practical applications in online recommendation systems, such as e-
commerce, social media, and content recommendation.
Keywords: graph convolution network; recommendation system; social recommendation
Citation: Xu, H.; Wu, G.; Zhai, E.; Jin, X.; Tu, L. Preference-Aware Light Graph Convolution Network for
Social Recommendation. Electronics 2023, 12, 2397. [Link]10.3390/electronics12112397
Academic Editor: Grzegorz Dudek
Received: 18 April 2023
Revised: 15 May 2023
Accepted: 19 May 2023
Published: 25 May 2023
Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article
distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license
(https://[Link]/licenses/by/4.0/).
1. Introduction
With the emergence and prosperity of online service platforms, the dissemination and
exchange of information have been greatly promoted, and the amount of information in the
network has increased exponentially. However, when confronted with such an enormous
amount of information, users find it hard to obtain the information that is relevant and
helpful to them; this phenomenon is referred to as “information overload”. To address this
issue, a recommendation system was developed that analyzes the historical behavior data
of users and explores their potential interests to provide them with personalized services.
At present, recommendation systems are widely used in industry.
Collaborative filtering (CF) has been a widely used technique in the last few decades.
In simple terms, collaborative filtering recommends information of interest to users according
to the preferences of a group of people who share similar interests and experiences, thus
filtering out a large amount of irrelevant information. However, CF is severely limited by
the problem of sparse data, and the effectiveness of the model is significantly reduced when
there are insufficient data on user–item interactions. As online social platforms such as
Facebook, WeChat, and Twitter have grown in popularity, an increasing number of people
are posting product reviews on these sites. References [1–3] and personal experience also
show that people are affected by their friends’ views and actions and gravitate toward
those who share their interests. In summary, the application of social relationships in
recommender systems has also attracted increasing attention [4,5]. Based on this understanding,
recommendation systems can introduce social information to reduce data sparsity and
improve recommendation accuracy, and these recommendation systems are called social
recommendation systems [6–8].
Early GNNs mainly solved problems strictly related to graph theory [9–11], such
as molecular structure classification, in which GNNs showed a superb ability to handle
non-Euclidean data. Since data in recommender systems can naturally be represented as
graphs (e.g., interaction data between users and items form a bipartite graph), much recent
work has applied graph neural networks to recommender systems. Within social
recommendation systems, data are typically presented in two forms: the user–item
interaction graph, which contains information pertaining to the interactions between users
and items, and the social graph, which reflects the social relations of users. There are
generally two strategies for recommender systems to use social information [12]. One is to
learn user representations from the two graphs separately [8,13,14] and then combine them
into a vector; this is more flexible because each graph can be treated differently. The other
is to merge the two graphs into a unified heterogeneous graph [7,15] and apply GNNs to
propagate the information; this unifies the information of both graphs in one representation
and can capture more complex interactions.
Although GNNs have shown good performance in the field of social recommendation,
most existing models simply combine social information with interaction information as
auxiliary input, without fully utilizing the information in the social graph. During
information propagation, they aggregate information from high-order neighbor nodes
without accounting for the fact that much of it is useless or even negative. Inspired by
IMP-GCN [16], we introduce an unsupervised subgraph construction module in the social
recommendation system, which divides the interaction graph into multiple subgraphs based
on user preferences, so that users who share similar interests are placed into the same
subgraph. We then perform graph convolution operations in the subgraphs using a
lightweight GNN to filter out negative information brought by users with different
preferences. We also design a feature aggregation module to better integrate user
representations in the two graphs.
In summary, the main contributions of this study are as follows:
• A novel social recommendation model PLGCN is proposed, which splits the user–item
interaction graph into multiple subgraphs based on the user’s preferences and passes
information in the subgraphs, filtering out negative information brought by users with
different preferences.
• A new feature aggregation module was designed that can aggregate the user rep-
resentations in the two graphs more effectively and has regularization to prevent
overfitting.
• We performed comprehensive experiments on two publicly available datasets to
evaluate the recommendation performance of PLGCN. The results show that PLGCN
outperforms the baseline models.
The rest of this article is organized as follows: We begin with a brief overview
of typical relevant work in Section 2. The social recommender system problem and its
definition are introduced in Section 3. The design details of the PLGCN model are described
in Section 4. Section 5 presents comprehensive experiments conducted to assess the
performance of PLGCN. Finally, Sections 6 and 7 discuss the results, conclude our work,
and identify potential research directions.
2. Related Work
2.1. Social Recommendation
As online social platforms (e.g., WeChat, Twitter, Facebook) and the richness of users’
social information grow rapidly, an increasing number of recommendation systems are in-
troducing users’ social information. Leveraging social influence [1] and social homogeneity,
as outlined in [2], facilitates a better comprehension of user preferences, and data sparsity
is effectively mitigated.
We categorize the prior social recommendation systems relevant to our study into
three groups based on how they utilize social information. The first category of methods
uses social networks as a kind of regularization [4,17–19]. SocialMF [19] integrates trust
propagation into the matrix factorization technique, making the user’s preferences as close
as possible to those of his/her social neighbors. CSR [18] designed a generic regularization
term to model the diverse similarities among users and their various friends. The second
category consists of ensemble methods that split all items into different groups and rank
them manually [20,21]. SBPR [20] suggests that users tend to give higher ratings to products
that are favored by their friends; for each user, the collection of items is sorted into three
categories, ranked as follows: negative items < social items < positive items. The third
category fuses the embeddings of the user and his/her neighbors [22–24]. TrustSVD [22]
introduces social information based on SVD++ [25] and uses implicit feedback from social
neighbors as auxiliary implicit feedback for users. RSTE [24] posits that the user’s final
choice is a trade-off between his/her own likes and the opinions of his/her trusted friends,
and it linearly fuses the user’s embedding with those of the user’s neighbor nodes in the
social graph.
Nevertheless, none of these models can adequately model the intricate social relationships
between users and the interactions between users and items. Therefore, numerous recent
studies have focused on employing deep learning for social recommendation systems, with
GNN-based social recommendation systems attracting attention because both the social
relationship between users and the data that reflect user–item interactions can be modeled
naturally in graph form.
aggregates all messages from neighboring nodes without considering their relevance. In
contrast, our PLGCN includes a subgraph construction module, and we perform message
passing in the subgraph to effectively filter out irrelevant information. (2) Our feature
aggregation module uses MLP to fully explore the potential relationship between user inter-
action information and social information, whereas SocialLGN uses a linear transformation
in the graph fusion module.
In summary, several approaches have been proposed to address the challenges of social
recommendation. However, these approaches often have limitations, such as the difficulty
of modeling complex user–item interactions and social relationships. Some approaches use
attention mechanisms to distinguish the importance of neighboring nodes and filter out
irrelevant information, but they often oversimplify the use of social information and do
not fully exploit its value. We propose the PLGCN approach to overcome these challenges,
which uses subgraph building blocks to filter out irrelevant information and fully leverages
the relationship between social and interaction information through neural networks. Our
approach provides a more nuanced modeling of the complex interaction between users
and items and the social relationship, leading to improved recommendation accuracy
and relevance.
3. Problem Definition
Essentially, the recommendation problem is to analyze the user’s behavior data to
predict the user’s preferences and then combine these with the item data in the system to
identify the items that could appeal to the user and generate a recommendation list from
them. However, users explicitly interact with only a small fraction of items, which results
in very sparse interaction data. The social recommendation system introduces users’ social
information, which supplements the sparse interaction data and mitigates data sparsity.
The homogeneity and the influence of social relationships can thus be used to better
understand users’ preferences.
In the paragraphs that follow, we define a GNN-based social recommendation system.
The notations $\mathcal{I}$ and $\mathcal{U}$ are used to represent the sets of items and users, and they have M
and N elements, respectively (i.e., $|\mathcal{I}| = M$, $|\mathcal{U}| = N$). In general, recommendation systems
make use of two distinct types of data: social graphs and user–item interaction graphs. A
description of these two graphs is given below.
The interaction behavior of users with items (e.g., views, rates, and clicks) is represented
by the user–item interaction graph. The graph can be defined by triples $\{(u, y_{ui}, i) \mid u \in \mathcal{U}, i \in \mathcal{I}\}$
and is denoted by $G_I$, where $y_{ui}$ represents the edge that connects user u to item i, and $y_{ui} > 0$
means that user u interacts with item i. On the other hand, there will not be any interaction
between them if $y_{ui} = 0$. The notation $\mathcal{N}_i^I$ means the collection of users who have explicit
interaction with item i, and the notation $\mathcal{N}_u^I$ indicates the collection of items with which user
u has explicit interaction.
Users’ social connections are represented in the social graph, which provides auxiliary
information about the user (e.g., directed follower relations or undirected friendships). We
represent the social graph as $G_S$, which has the triple form $\{(u, s_{uv}, v) \mid u, v \in \mathcal{U}\}$, where $s_{uv}$
represents the relationship between users u and v, and $s_{uv} = 1$ means there is an observable
social connection between users u and v, while $s_{uv} = 0$ indicates there is no connection
between them. The symbol $\mathcal{N}_u^S$ is used to denote the collection of users who have a social
connection with user u.
Based on the aforementioned conditions, the social recommendation task is described
as follows: given the social graph $G_S$ and the user–item interaction graph $G_I$, the
recommendation system should be able to predict the probability of interaction between
user u and all items, sort the items in descending order of this probability, and choose the
top N items to generate a recommendation list for user u.
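The ranking step of this task can be illustrated with a minimal Python sketch. The function name `recommend_top_n`, the example scores, and the choice to exclude items the user has already interacted with are assumptions made for illustration; they are not taken from the PLGCN implementation.

```python
def recommend_top_n(scores, interacted, n):
    """Return the ids of the n highest-scoring items not yet interacted with.

    scores: dict mapping item id -> predicted interaction score for one user.
    interacted: set of item ids the user has already interacted with
                (excluding them is an assumption of this sketch).
    """
    candidates = [(item, s) for item, s in scores.items() if item not in interacted]
    # Sort by score in descending order; ties are broken by item id for determinism.
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in candidates[:n]]


# Hypothetical predicted scores for a single user.
scores_u = {"i1": 0.9, "i2": 0.4, "i3": 0.7, "i4": 0.1}
top2 = recommend_top_n(scores_u, interacted={"i1"}, n=2)
print(top2)
```

Here the scores are given directly; in a full model they would come from the learned user and item embeddings.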
Figure 1. Architecture design of our PLGCN model with 2 subgraphs as illustration. First-order prop-
agation operations are performed on the entire interaction graph and social graph, and higher-order
propagation operations are performed on the subgraphs of the interaction graph and social graph.
where $F_u$ denotes the user embedding obtained by embedding aggregation, $e_u^{(0)}$ represents
the output generated by the embedding layer, and $e_u^{(1)}$ is the feature vector that aggregates
first-order neighbor information in the graph, which is generated as an output from the first
propagation layer. $\sigma$ denotes the LeakyReLU activation function, which is capable of
encoding both negative and positive signals. The learnable parameters $b_1 \in \mathbb{R}^{1 \times d}$ and
$W_1 \in \mathbb{R}^{d \times d}$ are the bias vector and the weight matrix, respectively. To split the user–item
interaction graph into multiple subgraphs based on user preferences, we input the user
embeddings into a 2-layer neural network to obtain a prediction vector:
$$U_h = \sigma(b_2 + W_2 F_u) \tag{2}$$
$$U_o = \sigma(b_3 + W_3 U_h) \tag{3}$$
where $U_o$ is the output vector, and the position of its maximum value identifies the
subgraph to which the user belongs, so the dimension of the output vector equals the
number of subgraphs $N_c$. The learnable parameters $W_2 \in \mathbb{R}^{d \times d}$ and $W_3 \in \mathbb{R}^{d \times N_c}$ are the
weight matrices, and the learnable parameters $b_2 \in \mathbb{R}^{1 \times d}$ and $b_3 \in \mathbb{R}^{1 \times N_c}$ are the bias
vectors. For users with similar embeddings, the neural network will group them into the
same subgraph. This is an unsupervised node classification method because we do not need
the real labels of the users.
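The assignment rule in Equations (2) and (3) can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' code: the helper names `dense` and `assign_subgraph`, the random weights, the LeakyReLU slope of 0.01, and the row-vector convention (weights of shape d × d and d × N_c) are all assumptions.

```python
import random


def leaky_relu(x, slope=0.01):
    # The slope value is an assumption; the text only names the activation.
    return x if x > 0 else slope * x


def dense(vec, weights, bias):
    """Compute LeakyReLU(bias + vec @ weights) for a row vector `vec`."""
    out = []
    for j in range(len(bias)):
        z = bias[j] + sum(vec[k] * weights[k][j] for k in range(len(vec)))
        out.append(leaky_relu(z))
    return out


def assign_subgraph(f_u, w2, b2, w3, b3):
    """Return the index of the largest entry of U_o, i.e., the user's subgraph."""
    u_h = dense(f_u, w2, b2)  # Eq. (2)
    u_o = dense(u_h, w3, b3)  # Eq. (3)
    return max(range(len(u_o)), key=lambda j: u_o[j])


random.seed(0)
d, n_c = 4, 2  # embedding size and number of subgraphs (toy values)
w2 = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(d)]
b2 = [0.0] * d
w3 = [[random.uniform(-1, 1) for _ in range(n_c)] for _ in range(d)]
b3 = [0.0] * n_c

user_a = [0.5, -0.2, 0.1, 0.9]
user_b = list(user_a)  # identical embedding, so it must land in the same subgraph
group_a = assign_subgraph(user_a, w2, b2, w3, b3)
group_b = assign_subgraph(user_b, w2, b2, w3, b3)
print(group_a, group_b)
```

In the model the weights are learned jointly with the rest of the network rather than drawn at random as in this toy example.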
In summary, we feed the user ID information and first-order user embedding, which
best reflect user preferences, into the subgraph construction module. Then, we utilize
the powerful modeling ability of neural networks to handle nonlinear relationships and
classify user preferences. It is worth noting that we refrain from using traditional clustering
algorithms such as K-means [30] because user feature vectors in current recommendation
systems are high-dimensional. Traditional clustering algorithms are susceptible to the curse
of dimensionality on such data, and reducing the dimensionality first, for example with
PCA [31], can lose information. Additionally, traditional clustering algorithms cannot
effectively model complex nonlinear relationships.
The subgraph construction module groups users with similar preferences and their
directly related items into the same subgraph, with each subgraph being independent. By
passing messages only within each subgraph, our approach effectively filters out irrelevant
or negative information.
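Once every user has a subgraph index, splitting the interaction graph can be sketched as below. The function `build_subgraphs` and the toy data are hypothetical. In this sketch each user belongs to exactly one subgraph, while an item may appear in several subgraphs if users from different groups interacted with it.

```python
from collections import defaultdict


def build_subgraphs(interactions, user_group):
    """Split (user, item) edges into one edge list per user group."""
    subgraphs = defaultdict(list)
    for user, item in interactions:
        # An edge follows its user, so each subgraph holds a group of users
        # together with the items they directly interacted with.
        subgraphs[user_group[user]].append((user, item))
    return dict(subgraphs)


interactions = [("u1", "i1"), ("u1", "i2"), ("u2", "i2"), ("u3", "i3")]
user_group = {"u1": 0, "u2": 1, "u3": 0}  # e.g., output of the assignment network
subgraphs = build_subgraphs(interactions, user_group)
print(subgraphs)
```

Message passing restricted to one of these edge lists never mixes information from users assigned to a different group.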
generate the final user embeddings through the feature aggregation module. The process
of user u’s l-th (l ≤ L) iteration propagation can be abstracted as follows:
$$e_u^{(l)} = \mathrm{Agg}\left(e_i^{(l-1)}, \forall i \in \mathcal{N}_u^I,\; e_v^{(l-1)}, \forall v \in \mathcal{N}_u^S\right) \tag{4}$$
where $e_i^{(l-1)}$ and $e_v^{(l-1)}$ are the embeddings of item i and user v, respectively, after the
(l−1)-th iteration propagation, and Agg(∗) is the aggregation function that aggregates the
embeddings of the items i with which u has interacted and the embeddings of the users v
with whom u is socially connected. We designed a feature aggregation module to act as the
user aggregation function Agg(∗) to better learn user embeddings.
Because direct interactions between users and items more accurately reflect user
preferences, this is crucial and reliable information. To construct subgraphs based on
user preferences, we perform first-order graph convolution operations on the social graph
and entire interaction graph alone, while second-order or higher-order graph convolution
operations are performed on the social graph and subgraphs of the interaction graph to
filter out useless or even negative information from users with different preferences. To
achieve this, two separate embeddings are created in the interaction graph and the social
graph to represent user u after the l-th iteration propagation, with $t_u^{(l)}$ and $s_u^{(l)}$ being their
respective representations and $e_u^{(l)}$ being the user’s final embedding after the l-th iteration
propagation. Thus, for user u, the first-order propagation can be expressed as follows:
$$t_u^{(1)} = \sum_{i \in \mathcal{N}_u^I} \frac{1}{c_{ui}} e_i^{(0)} \tag{5}$$
$$s_u^{(1)} = \sum_{v \in \mathcal{N}_u^S} \frac{1}{c_{uv}} e_v^{(0)} \tag{6}$$
where $c_{ui}$ is $\sqrt{|\mathcal{N}_u^I|}\sqrt{|\mathcal{N}_i^I|}$, which is the product of the square roots of the degrees of user u
and item i in the interaction graph, and its inverse is the normalization term that prevents
the user or item embedding scale from increasing due to graph convolution operations. $c_{uv}$
is $\sqrt{|\mathcal{N}_u^S|}\sqrt{|\mathcal{N}_v^S|}$, which is the product of the square roots of the degrees of user u and user v
in the social graph and serves the same purpose as $c_{ui}$.
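The normalized sum in Equation (5) can be sketched in plain Python as follows. The function name `propagate_users` and the toy graph are hypothetical, and the sketch assumes the edge list contains no duplicate edges.

```python
import math
from collections import defaultdict


def propagate_users(edges, item_emb):
    """One light propagation step: t_u = sum_i e_i / (sqrt(deg_u) * sqrt(deg_i))."""
    user_items = defaultdict(list)
    item_deg = defaultdict(int)
    for u, i in edges:
        user_items[u].append(i)
        item_deg[i] += 1
    out = {}
    for u, items in user_items.items():
        dim = len(item_emb[items[0]])
        acc = [0.0] * dim
        for i in items:
            # Normalization constant c_ui built from the two node degrees.
            c_ui = math.sqrt(len(items)) * math.sqrt(item_deg[i])
            for k in range(dim):
                acc[k] += item_emb[i][k] / c_ui
        out[u] = acc
    return out


edges = [("u1", "i1"), ("u1", "i2"), ("u2", "i2")]
item_emb = {"i1": [1.0, 0.0], "i2": [0.0, 2.0]}
t = propagate_users(edges, item_emb)
print(t)
```

There are no weight matrices or activations in this step. Passing only the edges of one subgraph to the same function makes the degrees be counted inside that subgraph, which is the idea behind the higher-order propagation on subgraphs.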
The process of updating the embedding in second-order or higher-order (i.e., l ≥ 2)
graph convolution is analogous to the first-order graph convolution process, with the
difference that high-order graph convolution is performed in the social graph and the
subgraph of the interaction graph to which the user belongs. The procedure is explained in
full in the steps that follow:
$$t_u^{(l)} = \sum_{i \in \mathcal{N}_u^{I_c}} \frac{1}{c_{ui_c}} e_i^{(l-1)} \tag{7}$$
$$s_u^{(l)} = \sum_{v \in \mathcal{N}_u^S} \frac{1}{c_{uv}} e_v^{(l-1)} \tag{8}$$
where $c_{ui_c}$ is $\sqrt{|\mathcal{N}_u^{I_c}|}\sqrt{|\mathcal{N}_i^{I_c}|}$, which is the product of the square roots of the degrees of user u
and item i in the subgraph of the interaction graph to which the user belongs. As Equations
(5)–(8) show, we have adopted a lightweight form of propagation, discarding complex
operations such as linear transformations, and this lightweight form of propagation is
inspired by SGC [32] and LightGCN [28].
The feature aggregation module is then used to aggregate $t_u^{(l)}$ and $s_u^{(l)}$ to generate
the updated embedding $e_u^{(l)}$ for layer l. As shown in Equations (9) and (10), the feature
aggregation module can be seen as a function Agg(∗) with two embeddings as parameters,
and the specific aggregation steps are as follows:
$$h_u^{(l)} = \mathrm{MLP}\left(\sigma\left(W_4 t_u^{(l)}\right) \,\|\, \sigma\left(W_5 s_u^{(l)}\right)\right) \tag{9}$$
$$e_u^{(l)} = \frac{h_u^{(l)}}{\left\| h_u^{(l)} \right\|_2} \tag{10}$$
where $W_4$ and $W_5 \in \mathbb{R}^{d \times d}$ are trainable weight matrices and $\|$ is a vector splicing operation
that splices two vectors of dimension d into a vector of length 2d. $\sigma$ is the tanh activation
function, and MLP(∗) is a multilayer perceptron that can capture the complex relationship
between the user’s two embeddings in each dimension. Equation (10) is a regularization
operation (L2 normalization) that prevents embedding $e_u^{(l)}$ from becoming particularly
large as the number of layers l grows.
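The aggregation steps can be sketched in plain Python as follows. The MLP is reduced to a single linear layer mapping 2d to d, which is an assumption of this sketch, as are the function names and the toy weights; the depth and width of the MLP used in PLGCN are not stated here.

```python
import math


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(row[k] * x[k] for k in range(len(x))) for row in w]


def aggregate(t_u, s_u, w4, w5, mlp_w, mlp_b):
    """Transform, activate, concatenate, mix, then L2-normalize."""
    a = [math.tanh(v) for v in matvec(w4, t_u)]
    b = [math.tanh(v) for v in matvec(w5, s_u)]
    joined = a + b  # concatenation, length 2d
    # Stand-in for the MLP: one linear layer (assumption of this sketch).
    h = [z + bias for z, bias in zip(matvec(mlp_w, joined), mlp_b)]
    norm = math.sqrt(sum(v * v for v in h))
    # Assumes h is not the zero vector.
    return [v / norm for v in h]


identity = [[1.0, 0.0], [0.0, 1.0]]
mlp_w = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]  # maps 2d -> d with d = 2
mlp_b = [0.0, 0.0]
e_u = aggregate([0.3, -0.1], [0.2, 0.4], identity, identity, mlp_w, mlp_b)
print(e_u)
```

Whatever the scale of the two inputs, the output has unit L2 norm, which is what keeps the embedding from growing as layers are stacked.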
$$e_i^{(l)} = \sum_{u \in \mathcal{N}_i^I} \frac{1}{c_{iu}} e_u^{(l-1)} \tag{11}$$
where $c_{iu}$ is $\sqrt{|\mathcal{N}_i^I|}\sqrt{|\mathcal{N}_u^I|}$, which is the product of the square roots of the degrees of user u
and item i in the interaction graph, and it also serves for normalization.
$$e_u^* = \sum_{l=0}^{L} \alpha_l e_u^{(l)} \tag{12}$$
$$e_i^* = \sum_{l=0}^{L} \alpha_l e_i^{(l)} \tag{13}$$
where $\alpha_l$ denotes the l-th layer’s embedding weight factor and $e_u^*$ and $e_i^*$ are the final
embeddings of user u and item i, respectively.
To obtain the preference of user u for item i, the inner product of their embeddings
is computed:
$$\hat{y}_{ui} = {e_u^*}^{T} e_i^* \tag{14}$$
where $\hat{y}_{ui}$ denotes our predicted preference of user u for item i.
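The layer combination and the inner-product prediction can be sketched as follows. The uniform weights (each equal to 1/(L+1), as commonly used in LightGCN-style models) are an assumption, as are the function names and the toy embeddings.

```python
def combine_layers(layer_embs, alphas):
    """Weighted sum of the per-layer embeddings of one node."""
    dim = len(layer_embs[0])
    return [sum(a * e[k] for a, e in zip(alphas, layer_embs)) for k in range(dim)]


def score(e_u, e_i):
    """Inner product of the final user and item embeddings."""
    return sum(a * b for a, b in zip(e_u, e_i))


# Toy per-layer embeddings for layers 0, 1, and 2 (L = 2).
user_layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
item_layers = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
alphas = [1.0 / 3] * 3  # uniform layer weights (assumption)

e_u_star = combine_layers(user_layers, alphas)
e_i_star = combine_layers(item_layers, alphas)
y_hat = score(e_u_star, e_i_star)
print(y_hat)
```

Scores computed this way for all items of a user are what the top-N ranking operates on.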
To achieve this, we minimize the Bayesian personalized ranking (BPR) loss, which is
designed to widen the gap between the scores of positive and negative samples. Positive
samples are user–item interactions that exist in the dataset, and negative samples are
interactions that are not observed in the dataset. Therefore, we define a triple $\{u, i^+, i^-\}$,
where u has interacted with $i^+$ but not with $i^-$.
The objective function has the following form:
where λ and Θ denote the weight decay rate and the parameters of PLGCN, respectively.
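As an illustration, the standard BPR objective with an L2 penalty can be sketched in Python as follows. This is the textbook form, assumed here; it may differ in detail (for example, in how the loss is averaged or how the penalty is scaled) from the formulation used for PLGCN. The function name and inputs are hypothetical.

```python
import math


def bpr_loss(triples, scores, params, weight_decay):
    """Standard BPR loss: -sum ln sigmoid(y_pos - y_neg) + lambda * ||params||^2.

    triples: list of (user, positive_item, negative_item).
    scores: dict mapping (user, item) -> predicted preference.
    params: flat list of model parameters used for the L2 penalty.
    """
    loss = 0.0
    for u, pos, neg in triples:
        diff = scores[(u, pos)] - scores[(u, neg)]
        # -ln(sigmoid(diff)) equals ln(1 + exp(-diff)).
        loss += math.log1p(math.exp(-diff))
    reg = weight_decay * sum(p * p for p in params)
    return loss + reg


triples = [("u1", "i_pos", "i_neg")]
well_ranked = {("u1", "i_pos"): 2.0, ("u1", "i_neg"): -1.0}
badly_ranked = {("u1", "i_pos"): -1.0, ("u1", "i_neg"): 2.0}
loss_good = bpr_loss(triples, well_ranked, params=[0.1, -0.2], weight_decay=1e-4)
loss_bad = bpr_loss(triples, badly_ranked, params=[0.1, -0.2], weight_decay=1e-4)
print(loss_good, loss_bad)
```

The loss shrinks as the positive item is scored further above the negative one, which is the gap-widening behavior described above.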
$$E_U^{(1)} = \frac{H_U^{(1)}}{\left\| H_U^{(1)} \right\|_2} \tag{17}$$
$$E_I^{(1)} = L_R^{T} E_U^{(0)} \tag{18}$$
The following formula shows the l-th layer’s propagation in matrix form in PLGCN:
$$E_{U_c}^{(l)} = L_c E_{U_c}^{(l-1)} \tag{19}$$
where $L_c$ denotes the Laplacian matrix belonging to a subgraph of the interaction graph.
The information of all subgraphs is then aggregated:
$$E_U^{(l)} = \sum_{U_c \in G_c} E_{U_c}^{(l)} \tag{20}$$
where $E_U^{(l)}$ is the final embedding of the l-th layer, and $G_c$ denotes the set of subgraphs of
the user–item interaction graph. Then, we perform the same operation as the first layer:
$$H_U^{(l)} = \mathrm{MLP}\left(\sigma\left(W_1 L_R E_I^{(l-1)}\right) \,\|\, \sigma\left(W_2 L_S E_U^{(l-1)}\right)\right) \tag{21}$$
$$E_U^{(l)} = \frac{H_U^{(l)}}{\left\| H_U^{(l)} \right\|_2} \tag{22}$$
$$E_I^{(l)} = L_R^{T} E_U^{(l-1)} \tag{23}$$
5. Empirical Analysis
To compare our PLGCN’s performance with other recommendation methods, this
section describes the evaluation metrics, the datasets, the parameter settings, and the
experiments we carried out on various datasets. We ran all programs on a Windows 10 PC
with an Intel i5-12600K processor and an RTX 3070 Ti graphics card with 8 GB of RAM. We
used PyTorch to build PLGCN.
Table 1. Statistics for the two datasets. # represents the number of elements in the set.
5.1.3. Metrics
To assess the recommendation performance of our PLGCN and five other SOTA
methods on the top-N task, we use three commonly applied metrics. Two of them are
precision and recall, which are defined as follows:
$$\mathrm{Precision} = \frac{\#TP}{\#TP + \#FP} \tag{24}$$
$$\mathrm{Recall} = \frac{\#TP}{\#TP + \#FN} \tag{25}$$
where TP denotes the number of positive samples correctly predicted as positive, FP is the
number of negative samples incorrectly predicted as positive, and FN denotes the number
of positive samples incorrectly predicted as negative. The third is NDCG, i.e., normalized
discounted cumulative gain, which is used to measure the quality of the ranking, and it is
expressed as follows:
$$\mathrm{NDCG@}N = \frac{r(1) + \sum_{i=2}^{N} \frac{r(i)}{\log_2 i}}{\sum_{i=1}^{|REL|} \frac{r(i)}{\log_2 (i+1)}} \tag{26}$$
where $|REL|$ is the sum of the relevance scores $r(i)$ of the top N items recommended.
$r(i) = 1$ indicates that the user interacts with the recommended item; $r(i) = 0$ means that
the user does not interact with the recommended item.
In summary, the indicators used in this experiment and their significance are listed
below, and it is worth noting that these metrics are all dimensionless:
• Precision@k: the proportion of relevant items among the top k items recommended to
the user. Precision@10 and Precision@20 indicate the precision at 10 and 20 recommen-
dations, respectively.
• Recall@k: the proportion of relevant items among all the relevant items in the test set
that are recommended to the user. Recall@10 and Recall@20 indicate the recall at 10
and 20 recommendations, respectively.
• NDCG@k: normalized discounted cumulative gain at k. NDCG is a measure of ranking
quality that takes into account both the relevance of the recommended items and their
position in the list. NDCG@10 and NDCG@20 indicate the NDCG score at 10 and
20 recommendations, respectively.
The greater the value for these three evaluation metrics, the better the performance.
Given the sparsity of the interaction data, we repeatedly selected, at random, an item that
the user did not interact with as a negative sample and then combined the negative samples
with the items that the user did interact with. To reduce the instability introduced by
random selection, for each model and dataset, we repeated the experiment five times and
averaged the results as the final ranking results.
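The three metrics can be sketched for a single user as follows. The functions use the common binary-relevance definitions, with the $\log_2(\text{position}+1)$ discount applied at every position for NDCG, which is an assumption and differs slightly in form from the numerator of Equation (26). The function names and example data are hypothetical.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top-k list."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the list divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["a", "b", "c", "d"]   # recommended items, best first
relevant = {"a", "c", "e"}      # test-set items the user interacted with
print(precision_at_k(ranked, relevant, 2))
print(recall_at_k(ranked, relevant, 2))
print(ndcg_at_k(ranked, relevant, 4))
```

In an evaluation run these per-user values would be averaged over all test users and, as described above, over the repeated experiments.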
Parameter                        Value
Learning rate (η)                1 × 10⁻³
Dimension of embedding (d)       200
Training batch size              2048
Weight decay (λ)                 1 × 10⁻⁴
Number of layers (L)             3
Number of subgraphs (Nc)         2
Table 3. Recommendation performance of all models on both datasets. The underlined value is the
second-best performance, and the bolded value is the best. Improvement is the comparison between
the best performance and the second-best performance.
The outcomes demonstrate that models based on MF do not perform as well in all cases
and exhibit a performance much inferior to that of GNN-based models because MF-based
models are more susceptible to data sparsity and cannot capture complex interactions.
LightGCN performs better in the vast majority of cases than BPR, SBPR, DiffNet, and
NGCF. As pointed out in [28], LightGCN removes two fundamental operations in GCN
that can negatively affect recommendation performance, namely linear transformation and
nonlinear activation. SocialLGN performs better than LightGCN because it introduces
social information on top of LightGCN and considers the effect of higher-order graph
structure on user embedding.
Table 4. Recommendation performance of all models on two cold-start datasets. The underlined value
is the second-best performance, and the bolded value is the best. Improvement is the comparison
between the best performance and the second-best performance.
The results show that PLGCN consistently achieves the best performance. For
instance, compared with SocialLGN, PLGCN improves Recall@10 on the original LastFM
dataset by 3.22% and Precision@10 on the original Ciao dataset by 13.59%. Since SocialLGN
propagates messages on the social graph and the whole user–item interaction graph without
constructing subgraphs, comparing the performance of PLGCN with that of SocialLGN
shows that propagating information in subgraphs can significantly raise the effectiveness
of recommendations. In particular, on the LastFM dataset containing only cold-start users,
PLGCN improves Precision@10 by 45.63% and Recall@10 and NDCG@10 by 32.93% and
20.93%, respectively. The data in Table 4 show the superior ability of PLGCN in
alleviating the cold-start problem. Additionally, we find that in the cold-start scenario,
the denser the interaction and social graphs of the dataset are, the more significant the
performance improvement is, while the opposite is true in the original dataset.
• PLGCN$_{GraphSage}$: This variant uses the feature aggregation operation in GraphSage [33]
to aggregate the user’s embedding in both graphs with the following equation:
$$f_{GraphSage} = \sigma\left(W \left(t_u^{(l)} \,\|\, s_u^{(l)}\right)\right) \tag{28}$$
In Equations (27) and (28), $t_u^{(l)}$ and $s_u^{(l)}$ denote the embedding of user u propagated
through the l-th iteration on the interaction graph and social graph, respectively. W is the
trainable transformation matrix, || means the concatenation operation, and σ is the tanh
activation function.
As shown in Figure 2, when compared to other models, our proposed feature aggregation
module performs the best in all cases. The explanation for the superior performance of
PLGCN is that our proposed feature aggregation module first performs a feature
transformation on $t_u^{(l+1)}$ and $s_u^{(l+1)}$ and then uses an activation function to activate them
nonlinearly, such that a joint space may be created between the user embedding in the two
graphs. The multilayer perceptron can be used to explore higher-order feature interactions.
In addition, the recommendation performance benefits from the normalization operation,
which prevents $e_u^{(l)}$ from increasing with l.
Table 5. Performance comparison of PLGCN and its variant on two datasets. The underlined value is
the second-best performance, and the bolded value is the best.
Figure 3. The impact of the number of propagation layers L. (a) LastFM dataset and (b) Ciao dataset.
connected in a short distance and acquires more information than the nodes in PLGCN3
and PLGCN4 , so it performs better.
6. Discussion
In our experiments, we showed that our graph neural network-based social recommen-
dation model outperforms some previous recommendation models ([8,27,28]). Compared
with [28], we find that adding social information to a recommender system does improve the
recommendation performance, while compared with [8,27], we find that the quality of social
information and the method of using social information also have a crucial impact. In some
cases, the performance improvement of our proposed method, PLGCN, is more evident in
the cold-start scenario. We believe this is because cold-start users have fewer interaction
data, so negative information has a greater impact on recommendation performance. By
filtering out negative information, we substantially improve the recommendation performance.
However, our proposed model also has some limitations, which highlight opportuni-
ties for future research. For example, we only consider user preferences in constructing
subgraphs, while other social features such as friendship networks or trust levels can be
incorporated to enhance the social filtering process. Additionally, our feature aggregation
module only includes user embeddings with social and interaction information. It could be
extended to include more diverse information sources, such as temporal information or
user-generated content.
7. Conclusions
Most of the existing social recommendation models only take higher-order
collaborative signals into account, without paying attention to the negative signals among
them, which degrades the models’ recommendation performance. We propose
the PLGCN, a novel social recommendation model based on GCN, as a solution to this
issue, which incorporates unsupervised learning to classify users based on their prefer-
ences, allowing for more effective filtering of irrelevant and negative information from
high-order neighbor nodes. This enables PLGCN to provide more personalized and ac-
curate recommendations. Moreover, we designed a novel feature aggregation module to
better aggregate user representations in both graphs. We evaluated PLGCN against other
SOTA models on two datasets, and the outcomes demonstrated that PLGCN outperforms
them. Furthermore, PLGCN adopts a lightweight GNN framework that removes nonlinear
activation and feature transformation operations, which mitigates the overfitting issue
and enables faster and more efficient training and inference. Our proposed model can be
applied to diverse social recommendation scenarios, such as e-commerce, social media,
and content recommendation.
However, our model still has limitations. First, it relies on the assumption that social
connections effectively capture users’ preferences. In reality, users may connect for various
reasons, and their social networks may not fully reflect their preferences, which could affect
the accuracy of our approach. Second, our experiments were conducted on specific datasets,
and the performance of our method may vary on other datasets or domains. Further
evaluation is necessary to validate the effectiveness and generalizability of our method.
Third, our model assumes a static social network structure and does not consider dynamic
changes over time. Future work can explore incorporating dynamic social information to
improve the performance of social recommendation methods.
In terms of future work, we plan to investigate several areas for further improvement.
First, we would like to explore the use of more complex graph neural network architectures
to capture even more nuanced social relationships and better incorporate users’ social
behavior. Second, we plan to investigate the use of additional data sources, such as user-
generated content and location data, to enhance our model’s performance and provide
more personalized recommendations. Finally, we will explore the use of different datasets
and evaluation metrics to better capture the effectiveness of our model and ensure that our
recommendations are not only accurate but also diverse and novel.
Author Contributions: Formal analysis, H.X. and L.T.; investigation, G.W.; methodology, H.X. and
L.T.; project administration, G.W.; resources, X.J. and E.Z.; data curation, H.X. and L.T.; supervision,
L.T.; validation, H.X. and X.J.; writing—original draft preparation, H.X.; writing—review and editing,
X.J. and L.T. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the Anhui Province Science and Technology Major Special
Projects (Project No. 202103b06020013), Anhui Provincial Natural Science Foundation Project (Project
No. 2108085MF209), and the Open Fund Project of Anhui Provincial Key Laboratory of Intelligent
Agricultural Technology and Equipment (Project No. APKLSATE2021X008).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: This research employed publicly available datasets for its experimen-
tal studies.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Cialdini, R.B.; Goldstein, N.J. Social influence: Compliance and conformity. Annu. Rev. Psychol. 2004, 55, 591–621.
2. McPherson, M.; Smith-Lovin, L.; Cook, J.M. Birds of a feather: Homophily in social networks. Annu. Rev. Sociol. 2001, 27, 415–444.
3. Knoke, D.; Yang, S. Social Network Analysis; SAGE publications: London, UK, 2019.
4. Ma, H.; Zhou, D.; Liu, C.; Lyu, M.R.; King, I. Recommender systems with social regularization. In Proceedings of the Fourth
ACM International Conference on Web Search and Data Mining, Hong Kong, China, 9–12 February 2011; pp. 287–296.
5. Tang, J.; Wang, S.; Hu, X.; Yin, D.; Bi, Y.; Chang, Y.; Liu, H. Recommendation with social dimensions. In Proceedings of the
Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016.
6. Fan, W.; Ma, Y.; Li, Q.; He, Y.; Zhao, E.; Tang, J.; Yin, D. Graph neural networks for social recommendation. In Proceedings of the
World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 417–426.
7. Wu, L.; Li, J.; Sun, P.; Hong, R.; Ge, Y.; Wang, M. Diffnet++: A neural influence and interest diffusion network for social
recommendation. IEEE Trans. Knowl. Data Eng. 2020, 34, 4753–4766.
8. Wu, L.; Sun, P.; Fu, Y.; Hong, R.; Wang, X.; Wang, M. A neural influence diffusion model for social recommendation. In
Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris,
France, 21–25 July 2019; pp. 235–244.
9. Fout, A.; Byrd, J.; Shariat, B.; Ben-Hur, A. Protein interface prediction using graph convolutional networks. Adv. Neural Inf.
Process. Syst. 2017, 30, 6533–6542.
10. Duvenaud, D.K.; Maclaurin, D.; Iparraguirre, J.; Bombarell, R.; Hirzel, T.; Aspuru-Guzik, A.; Adams, R.P. Convolutional networks
on graphs for learning molecular fingerprints. Adv. Neural Inf. Process. Syst. 2015, 28, 2224–2232.
11. Kearnes, S.; McCloskey, K.; Berndl, M.; Pande, V.; Riley, P. Molecular graph convolutions: Moving beyond fingerprints.
J. Comput.-Aided Mol. Des. 2016, 30, 595–608. [CrossRef] [PubMed]
12. Wu, S.; Sun, F.; Zhang, W.; Xie, X.; Cui, B. Graph neural networks in recommender systems: A survey. ACM Comput. Surv. 2022,
55, 1–37. [CrossRef]
13. Eksombatchai, C.; Jindal, P.; Liu, J.Z.; Liu, Y.; Sharma, R.; Sugnet, C.; Ulrich, M.; Leskovec, J. Pixie: A system for recommending 3+
billion items to 200+ million users in real-time. In Proceedings of the 2018 World Wide Web Conference, Lyon, France, 23–27
April 2018; pp. 1775–1784.
14. Wu, Q.; Zhang, H.; Gao, X.; He, P.; Weng, P.; Gao, H.; Chen, G. Dual graph attention networks for deep latent representation of
multifaceted social effects in recommender systems. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA,
13–17 May 2019; pp. 2091–2102.
15. Chen, T.; Wong, R.C.W. An efficient and effective framework for session-based social recommendation. In Proceedings of the 14th
ACM International Conference on Web Search and Data Mining, Online, 8–12 March 2021; pp. 400–408.
16. Liu, F.; Cheng, Z.; Zhu, L.; Gao, Z.; Nie, L. Interest-aware message-passing gcn for recommendation. In Proceedings of the Web
Conference 2021, Ljubljana, Slovenia, 19–23 April 2021; pp. 1296–1305.
17. Wang, X.; He, X.; Nie, L.; Chua, T.S. Item silk road: Recommending items from information domains to social users. In Proceedings
of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Tokyo, Japan, 7–11
August 2017; pp. 185–194.
18. Lin, T.H.; Gao, C.; Li, Y. Recommender systems with characterized social regularization. In Proceedings of the 27th ACM
International Conference on Information and Knowledge Management, Torino, Italy, 22–26 October 2018; pp. 1767–1770.
19. Jamali, M.; Ester, M. A matrix factorization technique with trust propagation for recommendation in social networks. In
Proceedings of the Fourth ACM Conference on Recommender Systems, Barcelona, Spain, 26–30 September 2010; pp. 135–142.
20. Zhao, T.; McAuley, J.; King, I. Leveraging social connections to improve personalized ranking for collaborative filtering. In
Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, Shanghai,
China, 3–7 November 2014; pp. 261–270.
21. Yu, J.; Gao, M.; Li, J.; Yin, H.; Liu, H. Adaptive implicit friends identification over heterogeneous network for social recommenda-
tion. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, Turin, Italy, 22–26
October 2018; pp. 357–366.
22. Guo, G.; Zhang, J.; Yorke-Smith, N. Trustsvd: Collaborative filtering with both the explicit and implicit influence of user trust
and of item ratings. In Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015;
Volume 29.
23. Chaney, A.J.B.; Blei, D.M.; Eliassi-Rad, T. A probabilistic model for using social networks in personalized item recommendation.
In Proceedings of the 9th ACM Conference on Recommender Systems, Vienna, Austria, 16–20 September 2015; pp. 43–50.
24. Ma, H.; King, I.; Lyu, M.R. Learning to recommend with social trust ensemble. In Proceedings of the 32nd International ACM
SIGIR Conference on Research and Development in Information Retrieval, Boston, MA, USA, 19–23 July 2009; pp. 203–210.
25. Koren, Y. Factorization meets the neighborhood: A multifaceted collaborative filtering model. In Proceedings of the 14th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, NV, USA, 24–27 August 2008; pp. 426–434.
26. Liu, Y.; Chen, L.; He, X.; Peng, J.; Zheng, Z.; Tang, J. Modelling high-order social relations for item recommendation. IEEE Trans.
Knowl. Data Eng. 2020, 34, 4385–4397. [CrossRef]
27. Liao, J.; Zhou, W.; Luo, F.; Wen, J.; Gao, M.; Li, X.; Zeng, J. SocialLGN: Light graph convolution network for social recommendation.
Inf. Sci. 2022, 589, 595–607. [CrossRef]
28. He, X.; Deng, K.; Wang, X.; Li, Y.; Zhang, Y.; Wang, M. Lightgcn: Simplifying and powering graph convolution network for
recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information
Retrieval, Online, 25–30 July 2020; pp. 639–648.
29. Hu, Y.; Zhan, P.; Xu, Y.; Zhao, J.; Li, Y.; Li, X. Temporal representation learning for time series classification. Neural Comput. Appl.
2021, 33, 3169–3182. [CrossRef]
30. Hartigan, J.A.; Wong, M.A. Algorithm AS 136: A k-means clustering algorithm. J. R. Stat. Soc. Ser. C (Appl. Stat.) 1979, 28,
100–108. [CrossRef]
31. Yang, J.; Zhang, D.; Frangi, A.F.; Yang, J.Y. Two-dimensional PCA: A new approach to appearance-based face representation and
recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2004, 26, 131–137. [CrossRef] [PubMed]
32. Wu, F.; Souza, A.; Zhang, T.; Fifty, C.; Yu, T. Simplifying graph convolutional networks. In Proceedings of the International
Conference on Machine Learning, PMLR, Long Beach, CA, USA, 10–15 June 2019; pp. 6861–6871.
33. Hamilton, W.; Ying, Z.; Leskovec, J. Inductive representation learning on large graphs. Adv. Neural Inf. Process. Syst. 2017, 30,
1025–1035.
34. Cantador, I.; Brusilovsky, P.; Kuflik, T. Second workshop on information heterogeneity and fusion in recommender systems
(HetRec2011). In Proceedings of the Fifth ACM Conference on Recommender Systems, Chicago, IL, USA, 14 October 2011;
pp. 387–388.
35. Tang, J.; Gao, H.; Liu, H. mTrust: Discerning multi-faceted trust in a connected world. In Proceedings of the Fifth ACM
International Conference on Web Search and Data Mining, Washington, DC, USA, 8–12 February 2012; pp. 93–102.
36. Xu, H.; Huang, C.; Xu, Y.; Xia, L.; Xing, H.; Yin, D. Global context enhanced social recommendation with hierarchical graph neural
networks. In Proceedings of the 2020 IEEE International Conference on Data Mining (ICDM), Sorrento, Italy, 17–20 November
2020; IEEE: Piscataway, NJ, USA; pp. 701–710.
37. Lin, J.; Chen, S.; Wang, J. Graph neural networks with dynamic and static representations for social recommendation. In
Proceedings of the Database Systems for Advanced Applications: 27th International Conference, DASFAA 2022, Virtual Event,
11–14 April 2022; Springer International Publishing: Cham, Switzerland, 2022; pp. 264–271.
38. Rendle, S.; Freudenthaler, C.; Gantner, Z.; Schmidt-Thieme, L. BPR: Bayesian personalized ranking from implicit feedback. arXiv
2012, arXiv:1205.2618.
39. Wang, X.; He, X.; Wang, M.; Feng, F.; Chua, T.S. Neural graph collaborative filtering. In Proceedings of the 42nd International
ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21–25 July 2019; pp. 165–174.
40. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
electronics
Article
Implementation of a Whisper Architecture-Based Turkish
Automatic Speech Recognition (ASR) System and Evaluation of
the Effect of Fine-Tuning with a Low-Rank Adaptation (LoRA)
Adapter on Its Performance
Hüseyin Polat 1, *, Alp Kaan Turan 1 , Cemal Koçak 1 and Hasan Basri Ulaş 2
1 Department of Computer Engineering, Faculty of Technology, Gazi University, Ankara 06560, Turkey;
[Link]@[Link] (A.K.T.); ccckocak@[Link] (C.K.)
2 Department of Manufacturing Engineering, Faculty of Technology, Gazi University, Ankara 06560, Turkey;
bulas@[Link]
* Correspondence: polath@[Link]
Abstract: This paper focuses on the implementation of the Whisper architecture to create an automatic
speech recognition (ASR) system optimized for the Turkish language, which is considered a low-
resource language in terms of speech recognition technologies. Whisper is a transformer-based
model known for its high performance across numerous languages. However, its performance in
Turkish, a language with unique linguistic features and limited labeled data, has yet to be fully
explored. To address this, we conducted a series of experiments using five different Turkish speech
datasets to assess the model’s baseline performance. Initial evaluations revealed a range of word
error rates (WERs) between 4.3% and 14.2%, reflecting the challenges posed by Turkish. To improve
these results, we applied the low-rank adaptation (LoRA) technique, which is designed to fine-tune
large-scale models efficiently by introducing a reduced set of trainable parameters. After fine-tuning,
significant performance improvements were observed, with WER reductions of up to 52.38%. This
study demonstrates that fine-tuned Whisper models can be successfully adapted for Turkish, resulting
in a robust and accurate end-to-end ASR system. This research highlights the applicability of Whisper
in low-resource languages and provides insights into the challenges of and strategies for improving
speech recognition performance in Turkish.
Keywords: automatic speech recognition; artificial intelligence; deep learning; representation
learning; self-supervised learning; Whisper model
Citation: Polat, H.; Turan, A.K.; Koçak, C.; Ulaş, H.B. Implementation of a Whisper Architecture-Based
Turkish Automatic Speech Recognition (ASR) System and Evaluation of the Effect of Fine-Tuning with a
Low-Rank Adaptation (LoRA) Adapter on Its Performance. Electronics 2024, 13, 4227. [Link]
electronics13214227
Academic Editor: Grzegorz Dudek
Received: 2 October 2024
Revised: 24 October 2024
Accepted: 25 October 2024
Published: 28 October 2024
Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution (CC BY)
license (https://[Link]/licenses/by/4.0/).
1. Introduction
Automatic speech recognition (ASR) systems have become an integral part of many
modern technologies, enabling voice-activated assistants, transcription services, and real-time
communication across various platforms. The development of ASR systems has historically
been focused on high-resource languages such as English, which benefit from large,
labeled datasets and sophisticated linguistic models. However, low-resource languages,
including Turkish, continue to face challenges in terms of speech recognition accuracy
due to the scarcity of labeled data and unique linguistic features such as agglutinative
morphology and vowel harmony [1].
The Whisper architecture, developed by OpenAI, represents a significant advancement
in ASR technology. As a transformer-based model, Whisper is designed to handle multiple
languages, accents, and noisy environments with high accuracy. However, despite its broad
language support, its performance in low-resource languages like Turkish has not been
fully optimized, as most of the model’s training data are skewed toward high-resource
languages such as English [2]. The need for the fine-tuning of such models in low-resource
languages is evident to improve their performance in real-world applications [3].
Turkish presents specific challenges for ASR systems due to its agglutinative structure,
where suffixes are attached to root words to form complex words. This results in a vast
number of possible word forms, making it difficult for ASR systems to accurately segment
and recognize words. Additionally, Turkish exhibits notable dialectal diversity and phono-
logical complexity, further complicating the development of a robust ASR system [4]. The
scarcity of high-quality labeled datasets exacerbates these issues, leading to higher word
error rates (WERs) compared to other languages.
In this study, we aim to address the gap in the literature by adapting the Whisper ASR
system to Turkish using the low-rank adaptation (LoRA) method, a parameter-efficient
fine-tuning technique that allows large-scale models to be adapted to specific tasks without
the need for retraining all the model parameters [5]. LoRA significantly reduces the
computational cost and memory requirements of fine-tuning large models, making it ideal
for adapting ASR systems to low-resource languages like Turkish. By fine-tuning Whisper
with LoRA, we aim to enhance its performance on Turkish speech datasets and provide a
more accurate and robust ASR solution for Turkish.
Previous research on ASR systems for Turkish has primarily focused on traditional
machine learning approaches and language models tailored to Turkish [6]. However, these
approaches often lack the scalability and adaptability offered by modern transformer-
based models like Whisper. Our work bridges this gap by leveraging the strengths of
Whisper while addressing the specific challenges posed by the Turkish language. The key
contribution of this study is the evaluation of Whisper’s performance across five distinct
Turkish speech datasets, both before and after fine-tuning with LoRA. This comprehensive
evaluation provides insights into how well transformer-based models can be adapted to
low-resource languages and highlights the potential for further improvements in ASR
systems for Turkish.
Literature Review
The system model for speech analysis and synthesis was proposed by Dudley et al.
at Bell Laboratories [7,8] in 1939, which is considered the beginning of ASR systems [9].
The first experimental work was the system for isolated digit recognition for a single
speaker developed by Davis et al. of Bell Laboratories in 1952 [10]. Between 1950 and
1960, studies aimed to create pattern recognition systems for phoneme, single letter, or
syllable discrimination [11,12]. In the period between 1960 and 1970, three hardware-based
systems were developed in Japan [13–15]. Additional prominent works include IBM’s
Shoebox software [16], Martin’s work at RCA Laboratories [17], and Vintsyuk’s study
using dynamic programming methods [18]. Between 1970 and 1980, the Viterbi algorithm
was used in ASR systems [19], along with statistics-based approaches such as Itakura’s
LPC-based study [20]. Noteworthy works include the VIP-100 software, the Hearsay and
HWIM software developed within the scope of the DARPA SUR program, as well as the
Harpy software developed by Carnegie-Mellon University [21].
Between the 1980s and 1990s, statistical models and artificial neural network (ANN)
studies emerged. The period between 1990 and 2000 saw developments such as the AT&T
Voice Recognition Call Processing (VRCP) solution and the Hidden Markov Model Tool Kit
(HMM Tool Kit). After 2000, ANN-based systems continued to develop, and significant
advancements like the “Voice Search” feature in Android and Siri integration with iOS
were introduced. Additionally, the Effective Affordable Reusable Speech-to-Text (EARS)
and Global Autonomous Language Exploitation (GALE) datasets were created within the
DARPA program for ASR systems [22–28].
Since 2010, the use of artificial intelligence (AI) in automatic speech recognition (ASR)
systems has increased, with significant developments such as connectionist temporal
classification (CTC), recurrent neural networks (RNNs), long short-term memory (LSTM),
bidirectional LSTM with the listen, attend and spell (LAS) mechanism, convolution-based and
residual network-based studies, the transducer mechanism, the Wav2vec and Wav2vec 2.0
models, the ASR system using a convolutional neural network (CNN) and the Conformer
model, and the Whisper ASR system developed by OpenAI. Major companies such as IBM,
Microsoft, Google, and Amazon also released their advanced ASR systems during this
period [29–36].
The existing literature on Turkish speech recognition primarily focuses on the distinc-
tive characteristics of the language and their implications for automatic speech recognition
(ASR) systems. Turkish poses several challenges in terms of speech recognition due to
its agglutinative structure, rich morphology, complex phonology, and significant dialectal
diversity. To address issues related to this diversity, ASR systems often rely on robust
language models. In addition, the use of a lexicon helps improve recognition accuracy by
providing a structured mapping of words, which is particularly useful in managing the
vast variety of word forms in Turkish. Despite these techniques, the scarcity of high-quality
speech and text data remains a major obstacle to further advancement in the field [37–39].
The early works on Turkish ASR focused on developing speaker-dependent systems
and acoustic models. In recent years, there has been significant progress in Turkish ASR
systems, with the introduction of deep learning models such as LSTM, gated recurrent unit
(GRU), and Transformer, and the augmentation of Turkish speech and text data [40–51].
This research explores the implementation of a Whisper-driven automatic speech
recognition system for the Turkish language and evaluates the effects of refining the model
using the LoRA approach. Whisper is a relatively new technology, so there are a limited
number of studies on it. Various studies have compared Whisper with other architectures,
such as Wav2Vec 2.0, and have utilized different datasets for testing. Additionally, some
researchers have employed innovative methods, such as image-converted EEG data and
direct fine-tuning, to improve Whisper’s ASR performance [52–56].
This study expands upon the existing body of research on Turkish ASR by addressing
some of the critical limitations that have hindered previous efforts in the field. Earlier works
on Turkish ASR typically focused on traditional machine learning models, such as hidden
Markov models (HMMs) or Gaussian mixture models (GMMs), and relied on relatively
small or less diverse datasets. These models, while effective in some contexts, struggled
with the unique linguistic challenges of Turkish, such as its agglutinative morphology, rich
inflectional structure, and the presence of numerous dialects.
In contrast, this study leverages the Whisper ASR model, which is based on a trans-
former architecture trained on a large, multilingual corpus. This allows the model to
capture long-range dependencies in speech and handle the complexities of Turkish more ef-
fectively than traditional models. The key differentiating factor of our work is the use of the
LoRA method for fine-tuning Whisper. LoRA enables the model to be fine-tuned efficiently
with fewer trainable parameters, addressing the computational challenges associated with
fine-tuning large-scale models for low-resource languages like Turkish.
Moreover, unlike many previous studies that focused on a single dataset or task, this
study evaluates the Whisper model on multiple Turkish speech datasets, representing a
more comprehensive assessment of its performance across different contexts. This study
also introduces corrections to the datasets, improving the quality of the training data and
reducing the impact of noise and other errors.
performance and accuracy of ASR systems, leading to more reliable and efficient speech
recognition technology [60].
2.1. Transformers
RNNs and their variations, such as LSTM and GRU, have achieved significant success
in tasks like machine translation, language modeling, sequence transduction, and sequence
modeling. The success of RNNs in these tasks is due to their ability to process sequential
data. However, their sequential nature limits parallel processing, as each step depends on
the result of the previous step, making it challenging to handle long sequences effectively.
Over time, RNNs tend to forget important information, especially in long sequences,
which leads to difficulties in capturing long-term dependencies. The sequence-to-sequence
(Seq2Seq) [61] architecture was proposed to address these challenges. This architecture
consists of an encoder and a decoder, typically composed of RNN units. The encoder takes
the input sequence and compresses it into a fixed-length context vector, which is passed
to the decoder. The decoder then generates the output sequence from this context vector.
Seq2Seq improves the learning process by efficiently handling sequence data. However,
when dealing with long sequences, it experiences performance degradation because it
compresses the entire input into a fixed-length vector, which increases the risk of losing
crucial information.
To overcome this issue, the attention mechanism was introduced. The attention
mechanism allows the model to focus on specific parts of the input sequence dynamically,
rather than compressing all the input information into a single fixed-length vector. This
enables the model to consider different input parts at each step of the output generation,
improving the performance, particularly for long sequences. By highlighting important
points in the data, attention mechanisms make it easier for the model to retain critical
information. However, even with attention, challenges like poor long-distance dependency
retention, high computational costs, and limited parallel processing still persist [62].
To address these limitations and overcome the challenges of RNN-based models,
the transformer architecture was proposed by Vaswani et al. in 2017 [32]. Unlike RNNs,
transformers remove the need for sequential processing and rely entirely on the attention
mechanism. This architecture allows for parallel processing and significantly reduces the
risk of losing information in long sequences. One of the most significant advantages of
the transformer is its ability to support parallel processing, as it removes the need for
step-by-step sequential computations. This speeds up the training process, especially when
working with large datasets. Additionally, the self-attention mechanism excels in tasks
requiring long-term dependencies by capturing relationships between distant elements in
a sequence. This feature enables it to overcome one of the key weaknesses of RNNs and
Seq2Seq models, which often struggle with such dependencies. The transformer is also
highly scalable, making it a foundational architecture for large language models such as
bidirectional encoder representations from transformers (BERT) and generative pre-trained
transformer (GPT).
The basic structure of the transformer architecture is illustrated in Figure 1. The
transformer consists of an encoder–decoder structure designed to process sequences of
data (like sentences).
The encoder processes the input data. It consists of N layers, each with two
sub-layers: a multi-head self-attention mechanism and a position-wise fully connected
feed-forward network. Each of these sub-layers is followed by an add and layer normalization
step, which stabilizes and normalizes the values passed on to the next sub-layer. Positional
encoding is added to the input embeddings to provide the model with information about the
position of tokens in a sequence. This is necessary because, unlike RNNs or LSTMs, the
transformer model does not inherently understand the order of tokens.
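The add and layer normalization step described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the study: the function names `layer_norm` and `add_and_norm`, the `eps` constant, the toy vectors, and the omission of the learnable gain and bias are all assumptions made for brevity.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to (approximately) zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_out):
    """Residual connection (add) followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

x = [1.0, 2.0, 3.0, 4.0]                # toy token embedding
sublayer_out = [0.5, -0.5, 0.5, -0.5]   # hypothetical attention or feed-forward output
y = add_and_norm(x, sublayer_out)
print([round(v, 3) for v in y])
```

The residual sum keeps the original token information available to later layers, while the normalization keeps the scale of the values comparable from layer to layer.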
[Figure 1. Transformer architecture. Diagram labels: Sources and Targets; Embedding; Positional encoding; Encoder and Decoder stacks (N×); Masked multi-head attention; Position-wise Feed-Forward Neural Network; Add and Normalization; Linear; Softmax; Output Probabilities.]
The decoder generates the output sequence, conditioned on the input sequence en-
coded by the encoder. The decoder also has N layers, similar to the encoder but with a
few differences.
In addition to the multi-head self-attention mechanism and feed-forward network,
the decoder includes a masked multi-head attention layer. This ensures that predictions
for a given position only depend on the outputs before that position, preserving the
autoregressive nature of the language generation. Similar to the encoder, each layer in the
decoder is followed by add and layer normalization steps.
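The masking rule for the decoder can be made concrete with a small Python sketch. It is only an illustration of the idea that position i may look at positions up to and including i; the helper name `causal_mask` and the sequence length of 4 are assumptions.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

mask = causal_mask(4)
for row in mask:
    # "x" marks an allowed position, "." a masked (future) position
    print("".join("x" if allowed else "." for allowed in row))
```

In an attention layer, the masked entries are typically set to a very large negative score before the softmax so that they receive a weight of effectively zero.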
input X by separate weight matrices, the query, key and value are obtained, which are fed
to each attention unit and given in Equations (1)–(3).
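$$Q = X W_Q^{T} \tag{1}$$
$$K = X W_K^{T} \tag{2}$$
$$V = X W_V^{T} \tag{3}$$
[Figure: multi-head attention and scaled dot-product attention. Diagram labels: Linear; Scaled Dot-Product Attention (h parallel units); Concatenate; Linear; Matmul; Scale; Mask (optional); Softmax; Matmul.]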
In Equation (4), the query and key matrix product is scaled by the square root of
the dimension and the weight obtained by the softmax function is multiplied by the
value matrix.
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V \tag{4}$$
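Equations (1)–(4) can be traced with a small standard-library Python sketch. The input matrix and the three projection matrices below are made-up toy values, and the helper names (`matmul`, `transpose`, `softmax`, `attention`) are illustrative assumptions rather than anything defined in the article.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    peak = max(row)                          # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V, following Equation (4)."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy input: 3 tokens, width 2. All matrices are hypothetical.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[1.0, 0.0], [0.0, 1.0]]
w_v = [[0.5, 0.0], [0.0, 2.0]]

q = matmul(x, transpose(w_q))   # Equation (1)
k = matmul(x, transpose(w_k))   # Equation (2)
v = matmul(x, transpose(w_v))   # Equation (3)
out, weights = attention(q, k, v)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, so every output row is a weighted average of the value vectors.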
The result from each attention unit (Equation (5)) is combined, as shown in Equation (6)
and then transmitted to the next layer.
$$\mathrm{Head}_i = \mathrm{Attention}\left(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\right) \tag{5}$$
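For reference, the combination of the attention heads in the original transformer formulation of Vaswani et al. [32] is commonly written as below. The symbols h (number of heads) and W^O (output projection matrix) come from that formulation and are assumptions here, not necessarily the article's own notation.

```latex
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\left(\mathrm{Head}_1, \ldots, \mathrm{Head}_h\right) W^{O},
\qquad d_k = d_{model} / h
```

Each head works on a slice of width d_k, so concatenating h heads restores the full model width before the output projection.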
the model, denoted as $d_{model}$, with $0 \le i < d_{model}/2$ and user-defined constant $n$, is as outlined
in Equations (7) and (8).
$$PE_{(pos,\,2i)} = \sin\left(pos / n^{2i/d_{model}}\right) \tag{7}$$
$$PE_{(pos,\,2i+1)} = \cos\left(pos / n^{2i/d_{model}}\right) \tag{8}$$
To illustrate, the positional encoding matrix of the initial six words in a text input with
d = 4 and n = 10000 is presented in Table 1.
pos    i = 0: sin(pos/n^(2i/d))   i = 0: cos(pos/n^(2i/d))   i = 1: sin(pos/n^(2i/d))   i = 1: cos(pos/n^(2i/d))
x0     0                          1                          0                          1
x1     0.841471                   0.540302                   0.009999                   0.999950
x2     0.909297                   −0.416147                  0.019999                   0.999800
x3     0.141120                   −0.989992                  0.029996                   0.999550
x4     −0.756802                  −0.653644                  0.039989                   0.999200
x5     −0.958924                  0.283662                   0.049979                   0.998750
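The entries of Table 1 can be regenerated from Equations (7) and (8) with a short Python sketch. The function name `positional_encoding` is an illustrative assumption, and the sketch assumes an even model dimension. The printed values should agree with the table to about five decimal places.

```python
import math

def positional_encoding(num_positions, d_model, n=10000):
    """Return a num_positions x d_model matrix following Equations (7) and (8)."""
    pe = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model // 2):
            angle = pos / n ** (2 * i / d_model)
            row += [math.sin(angle), math.cos(angle)]   # columns 2i and 2i + 1
        pe.append(row)
    return pe

table = positional_encoding(num_positions=6, d_model=4)
for pos, row in enumerate(table):
    print(f"x{pos}", " ".join(f"{v: .6f}" for v in row))
```

With d = 4 the first column pair (i = 0) oscillates quickly with position, while the second pair (i = 1) changes slowly, which gives each position a distinct pattern.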
The same mechanisms involved in the encoder are also involved in the decoder. The
difference lies in the input to the positional encoding step: the encoder receives the input
x1, x2, x3, ..., xn, whereas the decoder receives y0 (start), y1, y2, y3, ..., yn−1. The first
multi-head attention (MHA) layer in the decoder is masked so that a position cannot use
information from later positions. The final stage of the decoder is the softmax function,
which assigns a probability value between 0 and 1 to each element in the result array; the
output with the highest probability is then selected.
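The final softmax and selection step can be illustrated as follows. The three-token vocabulary and the score values are made up for the example, and the helper name `softmax` is an assumption.

```python
import math

def softmax(logits):
    """Map raw scores to probabilities in (0, 1) that sum to 1."""
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["merhaba", "dünya", "<eos>"]        # hypothetical vocabulary
logits = [2.0, 0.5, -1.0]                    # made-up decoder scores for one step
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)   # index of the largest probability
print(vocab[best], [round(p, 3) for p in probs])
```

Picking the single most probable token at every step is the simplest selection rule; other strategies, such as beam search, keep several candidates alive instead.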
$$W' = W + \Delta W \tag{10}$$
In this context, $W'$ represents the updated weight matrix, $W$ denotes the original
pre-trained weight matrix, and $\Delta W$ is the full-rank matrix of learned weight updates, with
a dimensionality of $d \times k$. The number of parameters in $\Delta W$ is $d \times k$.
The objective of LoRA is to reduce the number of trainable parameters by constraining
$\Delta W$ to be a low-rank matrix. Instead of learning the complete matrix $\Delta W$, LoRA
postulates that $\Delta W$ can be decomposed into the product of two lower-dimensional matrices,
$\Delta W = AB$, where:
$A \in \mathbb{R}^{d \times r}$ is a matrix with a rank $r$ much smaller than $\min(d, k)$.
$B \in \mathbb{R}^{r \times k}$ is another matrix with the same rank $r$.
Thus, the update rule becomes:
$$W' = W + AB \tag{11}$$
The ratio of the number of low-rank parameters to the number of full-rank parameters
is as follows:
$$\text{Parameter reduction ratio} = \frac{r(d + k)}{d \times k} \tag{14}$$
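A quick numerical check makes the size of this ratio tangible. In the sketch below, d = k = 1280 is chosen to match the width listed for the large Whisper model in Table 2, while r = 8 is an arbitrary illustrative rank, not a value reported in the article. The function name is likewise an assumption.

```python
def lora_parameter_ratio(d, k, r):
    """Return (full, low_rank, ratio) parameter counts for a d x k weight matrix."""
    full = d * k               # parameters in the full-rank update
    low_rank = r * (d + k)     # parameters in A (d x r) plus B (r x k)
    return full, low_rank, low_rank / full

full, low_rank, ratio = lora_parameter_ratio(d=1280, k=1280, r=8)
print(full, low_rank, f"{ratio:.4f}")   # 1638400 20480 0.0125
```

Under these assumed values, the low-rank factors hold 1.25% of the parameters of the full update for that one matrix.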
At the beginning of training, A is initialized with random Gaussian values and B is
initialized to zero. Therefore, ΔW = AB is zero at the start of training. ΔW is then scaled
by α/r, where α is a hyperparameter.
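The initialization and scaling can be sketched in plain Python. This is a toy illustration under stated assumptions: the scaling factor is taken to be α/r as in the original LoRA formulation, the standard deviation of 0.02 for the Gaussian values is arbitrary, and the matrix sizes, the value of α, and all function names are hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_init(d, k, r, seed=0):
    """A (d x r) gets Gaussian values; B (r x k) starts at zero."""
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 0.02) for _ in range(r)] for _ in range(d)]
    b = [[0.0] * k for _ in range(r)]
    return a, b

def merged_weight(w, a, b, alpha):
    """Return W' = W + (alpha / r) * A B."""
    r = len(b)
    delta = matmul(a, b)
    return [[w_ij + (alpha / r) * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

d, k, r, alpha = 4, 3, 2, 16   # toy sizes; alpha is a hypothetical choice
w = [[float(i + j) for j in range(k)] for i in range(d)]   # stand-in for a pre-trained W
a, b = lora_init(d, k, r)
print(merged_weight(w, a, b, alpha) == w)   # the update is zero before any training
```

Because B starts at zero, the adapted model initially behaves exactly like the pre-trained one, and only A and B would be updated during fine-tuning while W stays frozen.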
unified modeling approach that discards the traditional distinction between acoustic and
language models.
Whisper officially supports around 100 languages. It was trained on a large dataset of
approximately 680,000 h of audio, roughly an order of magnitude more than the 60,000 h of
unlabeled audio used to train Wav2Vec 2.0. Of the data utilized for Whisper’s training, 65%
(or 438,000 h) was dedicated to English speech recognition, 17% (or 117,000 h) to multilingual
speech recognition, and the remaining 18% (or 126,000 h) to translation into English. This
diverse dataset enhances Whisper’s robustness and generalizability across different languages,
accents, and speech types. Whisper can transcribe 99 different languages and, beyond
transcription, can also translate speech and add timestamps. The model can be optimized for
various tasks and is available in five different sizes, ranging from 39 million to 1.55 billion
parameters, allowing developers to strike a balance between accuracy and processing speed [4,5].
Whisper’s architecture is composed of two main components: the encoder and the
decoder (Figure 4). The raw audio is divided into 30 s segments and transformed into a
log-Mel spectrogram, which provides a perceptually relevant frequency representation.
Whisper is designed to work on audio samples of up to 30 s in duration. However, a
chunking algorithm allows it to transcribe audio samples of any length. This is made
possible through the pipeline method of the Transformers library.
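The idea of cutting long audio into 30 s segments can be sketched without any audio library. The version below is deliberately simplified: it uses non-overlapping windows and zero-pads the last one, whereas production chunking typically overlaps neighbouring windows and merges the transcripts. The 16 kHz sampling rate, the constant names, and the function name `chunk_audio` are assumptions.

```python
SAMPLE_RATE = 16_000   # assumed sampling rate in Hz
CHUNK_SECONDS = 30     # segment length stated in the text

def chunk_audio(samples, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Split a 1-D list of samples into fixed-length chunks, zero-padding the last one."""
    size = sample_rate * chunk_seconds
    chunks = []
    for start in range(0, len(samples), size):
        chunk = samples[start:start + size]
        chunk = chunk + [0.0] * (size - len(chunk))   # pad the final, shorter chunk
        chunks.append(chunk)
    return chunks

audio = [0.1] * (SAMPLE_RATE * 70)   # 70 s of dummy audio
chunks = chunk_audio(audio)
print(len(chunks), [len(c) for c in chunks])
```

Each chunk would then be converted to a log-Mel spectrogram and transcribed separately, with the partial transcripts joined afterwards.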
[Figure 4. Whisper architecture. Diagram labels: Log-Mel Spectrogram; 1D Convolution (×2) with GELU; Sinusoidal Positional Encoding; Encoder Blocks (Self-Attention, Multi-Layer Perceptron); Decoder Blocks (Self-Attention, Cross-Attention, Multi-Layer Perceptron); Learned Positional Encoding; Tokens in Multitask Training Format (SOT, LANG (TR), transcribe, 0.0, "En Hakiki Mürşid"); Next-Token Prediction.]
The spectrogram is processed through two 1D convolutional layers with GELU (Gaussian
error linear unit) activations to match the transformer’s width [63]. Positional encoding
assigns temporal locations to the outputs of these convolutional layers, helping the model
track sequential information from the audio data. This processed input is then fed into the
encoder stack in the transformer [4,5].
The encoder consists of multiple blocks, each containing self-attention and multi-layer
perceptron (MLP) layers. The self-attention mechanism analyzes each time segment of the
audio in relation to all the others, capturing short- and long-term speech dependencies. For
instance, a word at the start of a sentence may be semantically related to one at the end;
self-attention allows for such long-term correlations. After establishing these relationships
through self-attention, the multi-layer perceptron layer enables deeper and more advanced
processing of these connections, allowing the model to extract nuanced features and discern
a broader range of linguistic patterns. Multiple encoder blocks are required to grasp the
complexities of language and expedite understanding. The encoder operates only once per
30 s segment to produce a latent representation from the spectrogram [4,5].
This latent representation is passed to the decoder, where each block contains cross-
attention, self-attention, and multi-layer perceptron layers. Cross-attention facilitates the
transfer of the latent representation from the encoder to the decoder, linking the audio
data to the text generation process. This allows the decoder to make accurate predictions
based on the audio signal. Self-attention within the decoder considers previously generated
words to maintain grammatical consistency and meaning in the text output. The multi-layer
perceptron in the decoder enriches the information provided by self-attention and cross-
attention, ensuring that the text generated by the model is both grammatically coherent
and accurate in capturing the subtleties of the audio signal. The decoder predicts the text
step by step using the latent representation: it first predicts the most likely word, then
generates the next word based on the previous predictions, and continues until the entire
speech sequence is transcribed [4,5].
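The step-by-step prediction described above corresponds to greedy autoregressive decoding. The sketch below is a simplified illustration, assuming a hypothetical scoring function that returns a score for each candidate next token; a real decoder would also condition on the encoder's latent representation through cross-attention, which this toy model omits. The token names and the lookup table are invented for the example.

```python
def greedy_decode(next_token_scores, start="<sot>", end="<eot>", max_len=10):
    """Pick the highest-scoring token at every step until the end token appears."""
    tokens = [start]
    while len(tokens) < max_len:
        scores = next_token_scores(tokens)   # depends on everything generated so far
        best = max(scores, key=scores.get)
        if best == end:
            break
        tokens.append(best)
    return tokens[1:]                         # drop the start token


# Hypothetical toy "model": scores depend only on the last generated token.
TOY_TABLE = {
    "<sot>": {"merhaba": 0.9, "dünya": 0.1},
    "merhaba": {"dünya": 0.8, "<eot>": 0.2},
    "dünya": {"<eot>": 0.95, "merhaba": 0.05},
}


def toy_scores(tokens):
    return TOY_TABLE[tokens[-1]]


transcript = greedy_decode(toy_scores)
```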
Based on the number of layers, the width of the feature representation, and the number
of attention heads, the Whisper model is categorized into five versions: tiny, base, small,
medium, and large. The specifics of each version are summarized in Table 2. Furthermore,
large-v2 had 2.5 times more training iterations than the large version, while large-v3
utilized data collection, processing, and pseudo-labeling with large-v2 to augment the
training data to 5 million hours. Both the large-v2 and large-v3 models surpassed the large
model, with the large-v3 model exhibiting even stronger capabilities than the large-v2,
particularly in terms of model training [4,5].
Model     Layers   Width   Attention Heads   Parameters (Million)
tiny      4        384     6                 39
base      6        512     8                 74
small     12       768     12                244
medium    24       1024    16                769
large     32       1280    20                1550
This study identified the following factors as influencing the selection of Whisper over
other potential ASR models.
The Whisper model is a transformer-based ASR system that has demonstrated robust
performance across a range of languages, including low-resource languages such as Turkish.
The robust performance of Whisper in noisy environments and its capacity to handle a
diverse range of speech conditions, including accents and spontaneous speech, render it
particularly well-suited for real-world applications in Turkish speech recognition.
Whisper operates in an end-to-end manner, whereby the model is trained to map
audio inputs directly to text outputs, obviating the need for separate acoustic and language
models. This unified approach markedly enhances the system’s capacity to generalize
across disparate languages and accents. Additionally, Whisper’s transformer architecture
enables parallel processing, rendering it more efficient than RNNs or LSTM units, which
process input data sequentially and are slower when handling long speech sequences.
Furthermore, Whisper is capable of fine-tuning through techniques such as LoRA,
which enables the model to be efficiently adapted to specific tasks or languages without
the necessity of retraining all the parameters. This represents a significant advantage
over other models, which may require more extensive resources to achieve the same level
of fine-tuning. In light of the challenges posed by Turkish, including its agglutinative
morphology and dialectal variations, the ability of Whisper to be fine-tuned with fewer
trainable parameters represents a significant advantage and was a key factor in its selection
for this study.
In conclusion, Whisper was selected for this study due to its advanced transformer-
based architecture, multilingual training, adaptability to low-resource languages, and
efficient fine-tuning capabilities. These features render it a more suitable model for de-
veloping a robust ASR system for Turkish compared to other traditional or even deep
learning-based ASR models.
of 16 kHz. The speech text is stored in TSV format files, with attributes such as the text,
speaker gender, etc. The recordings are divided into three parts: dev, train and test. The
Turkish dataset consists of a total of 3607 records, with 743, 2526, and 338 records in the
train, test, and dev sections, respectively.
TASRT dataset’s [68] data availability status: data available on request due to restric-
tions (commercial use). The data presented in this study are available from the correspond-
ing author upon request.
The TASRT dataset was compiled by Oyucu [68]. It contains approximately 186 min of
speech data from 286 speakers (143 women and 143 men) in 20 different categories. The
audio data are stored in WAV files with a sampling frequency of 16 kHz. Each audio file
has a corresponding transcript file with a TXT extension, named by category, speaker
gender, and order.
\[
\mathrm{CER} = \frac{E_C + S_C + D_C}{N_C} \tag{17}
\]
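As an illustration of Equation (17), the sketch below counts character-level edits with the Levenshtein distance and divides by the number of reference characters. It assumes that the three terms in the numerator are the inserted, substituted, and deleted characters and that the denominator is the length of the reference text; the function names are hypothetical. Applying the same routine to lists of words gives a WER-style score.

```python
def edit_distance(ref, hyp):
    """Minimum number of insertions, deletions, and substitutions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: the same computation on whitespace-separated words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

For example, `cer("kitap", "kitab")` counts one substitution over five reference characters, and `wer("bu bir test", "bu test")` counts one deleted word over three reference words.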
such as these can also be used to measure ASR performance. Low error rates indicate high
accuracy and thus high model success.
Similarly, there are many problematic records in the TNST dataset. Adjacent words
were written together. Additionally, many foreign words are included with their Turkish
pronunciations. These issues caused an increase in the WER and CER values.
The texts in the datasets were reorganized by listening to the audio files in cases of
ambiguity. Foreign names in the TNST dataset, which were written side by side with their
Turkish pronunciations, were reduced to one in the text so that only their commonly used
equivalents were included. Apart from these, other problems encountered with both the
text files and the functioning of the existing model are listed below:
1. In some files, errors occurred in the output produced because the recording was
terminated before the end of the speech. Especially in recordings with more than one
speaker’s speech, it was observed that if there were gaps, the model did not process
the speech of subsequent speakers. Therefore, voice activity detection (VAD) was
used to reduce the gaps that cause errors.
2. Confusion about whether conjunctions such as “ile”, “ise”, “de”, and “da” are written
adjacent to the word or separately is one of the most common errors. Suffixes or
conjunctions that are difficult to identify even for a listener are sometimes
misspelled in the text or incorrectly transcribed by the model. The text was corrected
wherever these should have been written separately or adjacent.
3. When the end of one word and the start of the next word have similar sounds, the
model sometimes merges them. This causes it to skip the start of the following word.
4. The model abbreviates expressions such as “doctor”, “professor”, “et cetera”, and
“kilometers”, transcribing them as “Dr.”, “Prof.”, “etc.”, and “km”.
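The use of VAD to reduce gaps, mentioned in item 1 above, can be illustrated with a very simple energy-based detector. The study does not state which VAD method was used, so the sketch below is only an assumption-laden illustration: the frame length, the threshold, and the function name are hypothetical, and production VAD systems are considerably more elaborate.

```python
def trim_silence(samples, frame_len=160, threshold=0.01):
    """Keep only the frames whose mean energy exceeds the threshold.

    samples: audio amplitudes in the range [-1, 1].
    frame_len: samples per frame (160 samples is 10 ms at 16 kHz).
    """
    kept = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        if energy > threshold:
            kept.extend(frame)
    return kept


# Hypothetical signal: speech, a silent gap, then a second speaker.
speech = [0.5] * 320
gap = [0.0] * 480
signal = speech + gap + speech
trimmed = trim_silence(signal)
```

Removing the silent gap leaves the two speech segments adjacent, so a model that stops at long pauses would see both speakers.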
It was important to determine how the corrections made to the METU MS, TNST,
and TASRT datasets affected the performance of the pre-trained Whisper models. To this
end, the original and the corrected versions of the datasets were subjected to testing with
pre-trained Whisper models, and the outcomes were evaluated. Afterward, LoRA was used
to fine-tune the pre-trained Whisper models with the corrected datasets. The performances
of these fine-tuned models were then evaluated.
3.2. Testing the Performance of Whisper Models on Five Turkish Speech Datasets
Table 3 presents the WER, CER and the duration values in hours and minutes for
the pre-tests conducted on the Whisper models at five different scales (tiny, base, small,
medium, large) before any modifications were made to the Turkish speech datasets. In the
test, all of the samples in the METU MS, TASRT and FLEURS datasets with a low number of
records and 30% of the samples in the TNST and Mozilla CV datasets with a high number
of records were used.
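Drawing a fixed fraction of a large dataset for testing can be sketched as below. The text does not say how the 30% subsets were chosen, so random selection with a fixed seed is an assumption, and the function name and file names are hypothetical.

```python
import random


def sample_fraction(items, fraction, seed=0):
    """Return a reproducible random subset containing the given fraction of items."""
    rng = random.Random(seed)            # fixed seed so the subset can be regenerated
    k = round(len(items) * fraction)
    return rng.sample(items, k)


# Hypothetical list of recording identifiers.
recordings = [f"utt_{i:04d}.wav" for i in range(1000)]
subset = sample_fraction(recordings, 0.30)
```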
Table 3. Pre-testing the performance of the Whisper models on five Turkish speech datasets.
In the METU MS dataset, while the WER is 0.36 in the tiny model, it drops to 0.10 in
the large-v2 model. In other words, the errors decreased as the model size increased. While
the CER is 0.10 in the tiny model, it decreases to 0.03 in the large-v2 model. The TNST
dataset has higher error rates compared to the METU MS. Especially, the tiny model has
the highest error rate, with a WER of 0.53. However, again with the large-v2 model, the
WER drops to 0.13. The FLEURS dataset performs better than the TNST. With the large-v2
model, the WER drops to 0.08 and the CER to 0.02, indicating very low error rates. The
Mozilla CV dataset also exhibits high error rates, similar to the TNST. Even with the large-v2
model, the WER drops only to 0.16, which is higher than for the METU MS and FLEURS datasets. The
TASRT dataset shows a balanced performance across all the models. In the large-v2 model,
the WER is 0.13 and the CER is 0.04. Compared to the other datasets, the average error
rates are lower.
Table 4. Test performance evaluation of the Whisper models after dataset correction.
In the experiments performed with the Whisper models after the datasets were corrected,
the WER and CER on the METU MS dataset decline notably as the model size
increases. The large-v2 model exhibits the lowest WER of 0.06 and
the lowest CER of 0.02. This suggests that the dataset was transcribed with a high degree
of accuracy. In comparison to the other datasets, the TNST dataset exhibits higher WER
and CER ratios. This may be indicative of the dataset being more challenging to transcribe
or comprising elements such as different languages and accents. The lowest WER was 0.10
in the “large-v2” model, while the lowest CER was 0.04. In the TASRT dataset, the WER
and CER rates decrease with the model size. Notably, the values of the WER (0.08) and
CER (0.03) in the “large-v2” model demonstrate an improvement trend comparable to that
observed in the other datasets.
interfaces (APIs). The optimization points of Google USM, which runs behind APIs, and
the hardware on which it runs are not clear. Despite these uncertainties about the model,
a performance comparison between the two systems can be performed indirectly. A
performance comparison between the two ASR systems was performed on the WER and
CER using the METU MS, TNST, FLEURS, Mozilla CV and TASRT datasets. A certain
number of random speech files were selected from each dataset. The selected files were
given as input to the Whisper large-v3 model and the API (called Chirp) running Google
USM behind it. The resulting output texts were compared with real speech texts to find
the WER and CER values. The number of samples for the datasets used in the comparison,
the criterion values obtained and the percentage differences between the values of the two
models are given in Table 6.
Whisper-large-v3 provides lower WER and CER than Google USM on most datasets
and is particularly superior on the METU MS, FLEURS and TASRT datasets. Significant
performance gains are observed over Google USM on the FLEURS and TASRT datasets.
However, on the Mozilla CV dataset, Google USM provides better WER performance than
Whisper-large-v3, but Whisper-large-v3 is superior concerning the CER. In terms of the
CER, Whisper-large-v3 is more successful in most cases, but there is one case where Google
USM performs better on the TNST dataset.
One of the methods used to overcome these difficulties is LoRA. In this study, the
fine-tuning process is performed using LoRA. With LoRA, the number of target parameters
trained on the Whisper architecture can be reduced to approximately 3/100 and the memory
requirement to approximately 1/3 [6].
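Why a low-rank update needs far fewer trainable parameters than the full weight matrix can be seen from a small sketch of the LoRA idea in [6]: the frozen weight W is left untouched and only two thin matrices A and B are trained, with the adapted output W x + (alpha / r) B (A x). The rank, scaling factor, and matrix sizes below are illustrative assumptions rather than the settings used in this study, and the helper names are hypothetical.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: full matrix versus the low-rank factors B (d_out x r) and A (r x d_in)."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha, rank):
    """Frozen projection W x plus the scaled low-rank update (alpha / rank) * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# One 1280 x 1280 projection (the width of the large model) with an assumed rank of 8.
full, lora = lora_param_counts(1280, 1280, 8)
ratio = lora / full

# Tiny numeric example with rank 1.
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pre-trained weight
A = [[1.0, 1.0]]               # trainable, shape r x d_in
B = [[1.0], [2.0]]             # trainable, shape d_out x r
y = lora_forward([1.0, 2.0], W, A, B, alpha=2.0, rank=1)
```

With these assumed sizes the adapter trains 20,480 values per projection instead of 1,638,400; the overall saving for a whole model depends on which layers receive adapters.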
Table 7. Test performance comparison after fine-tuning for the Whisper-medium model.
Dataset            Metric   Whisper-Medium   Whisper-Medium (Fine-Tuning)   Difference (%)
METU MS (100%)     WER      0.109            0.065                          −40.37
                   CER      0.043            0.015                          −65.12
TNST (10%)         WER      0.139            0.136                          −2.16
                   CER      0.047            0.036                          −23.40
FLEURS (100%)      WER      0.118            0.117                          −0.85
                   CER      0.037            0.034                          −8.11
Mozilla CV (10%)   WER      0.210            0.172                          −18.10
                   CER      0.117            0.099                          −15.38
TASRT (100%)       WER      0.091            0.106                          16.48
                   CER      0.035            0.042                          20.00
After fine-tuning the Whisper-medium model, the most significant decrease in terms
of the WER value was observed in the METU MS dataset. No significant improvement was
observed in the TNST and FLEURS datasets, which contain many foreign words in various
languages. On the other hand, while an 18.1% improvement was achieved in the Mozilla
CV dataset, a 16.48% performance decrease was observed in the TASRT dataset. Similarly,
while a decrease in the CER value occurred in the other four datasets, a 20% increase was
observed in the TASRT dataset. It was observed that the CER improvement in the TNST
dataset in particular was not reflected in the WER value at the same rate.
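The Difference (%) column in Table 7 is consistent with a relative change computed against the value before fine-tuning, so negative numbers mean fewer errors. A minimal sketch, assuming that definition (the function name is hypothetical):

```python
def relative_change_percent(before, after):
    """Relative change of an error rate in percent; negative values are improvements."""
    return (after - before) / before * 100.0


# Values taken from Table 7 (Whisper-medium).
metu_wer_change = round(relative_change_percent(0.109, 0.065), 2)   # METU MS, WER
tasrt_cer_change = round(relative_change_percent(0.035, 0.042), 2)  # TASRT, CER
```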
Dataset            Metric   Whisper Large-v2   Whisper Large-v2 (Fine-Tuning)   Difference (%)
METU MS (100%)     WER      0.059              0.050                            −15.25
                   CER      0.020              0.010                            −50.00
TNST (30%)         WER      0.132              0.109                            −17.42
                   CER      0.051              0.032                            −37.25
FLEURS (100%)      WER      0.083              0.047                            −43.37
                   CER      0.021              0.014                            −33.33
Mozilla CV (30%)   WER      0.100              0.051                            −49.00
                   CER      0.021              0.010                            −52.38
TASRT (100%)       WER      0.081              0.079                            −2.47
                   CER      0.032              0.030                            −6.25
Following the application of the fine-tuning techniques, a reduction in the WER and
CER values was observed across all the datasets. The most substantial enhancements were
observed in the Mozilla CV and FLEURS datasets.
As anticipated, the 43.37% reduction in the WER is indicative of the model’s enhanced
accuracy in transcription on the FLEURS dataset, which constituted 80% of the total fine-
tuning data. The enhancement in character-level accuracy is also reflected in the reduction
in character-level errors.
The test on 30% of the Mozilla CV dataset demonstrated the most substantial
improvements in both the WER (49.00%) and CER (52.38%). These results demonstrate the
efficacy of the fine-tuning process on the Mozilla CV dataset.
The 50% improvement in the CER for the METU MS dataset demonstrates that the
model is highly effective in reducing character-level errors. The 15.25% improvement
in the WER is also noteworthy, indicating an enhancement in the overall quality of the
transcription.
Concerning the TNST dataset, the fine-tuning resulted in enhanced transcription
accuracy, particularly a 37.25% improvement in the CER. The 17.42% improvement in the
WER also suggests that the model’s word-level performance has been enhanced.
The impact of fine-tuning on the TASRT dataset was found to be relatively limited.
In the case of the TASRT dataset, which comprises approximately 75% of speech data
exceeding 30 s in duration, the minimal observed improvements of 2.47% in the WER and
6.25% in the CER may be indicative of the model’s inherent suitability for this particular
dataset, or that the fine-tuning was insufficiently effective.
runtime. Consequently, the Whisper fine-tuning process concentrated on the lower models,
medium and large-v2.
4. Discussion
The end-to-end, multi-tasking design of modern ASR applications is a testament to the
advancements in artificial intelligence and machine learning. By eliminating the need for
separate acoustic and language models, these systems streamline the process of understand-
ing and processing human speech. The integration of automatic translation and speaker
tagging within ASR systems not only enhances the user experience but also broadens
the scope of application for these technologies. The transition from traditional statistical
models, such as HMMs, to advanced deep learning structures is a major development.
This includes the use of transformers with encoders and decoders, marking a significant
milestone in the field. These deep learning models can analyze vast amounts of data and
learn complex patterns, which results in more accurate and efficient speech recognition.
As the complexity of deep learning architectures increases, particularly with the
integration of transformer models, the necessity for extensive training data and substantial
computational resources also rises in parallel. These issues have been addressed through
the development of more advanced GPUs and the ongoing expansion of datasets. The shift
toward creating models with a larger number of parameters reflects the field’s adaptation
to these technological advancements. The training of deep learning models on large-
scale datasets has been demonstrated to enhance their generalization capabilities and to
improve their performance across a wide range of tasks. Incremental improvements in
model components, such as refining the attention mechanism, are often more favored than
overhauling the entire architecture. This approach allows for steady progress and the
optimization of existing frameworks, which can be more efficient than starting from scratch.
The expansion of training datasets also plays a crucial role in this development trajectory,
providing the models with a diverse range of inputs from which to learn. The current
trajectory suggests the focus will remain on scaling up the models in line with the available
computational resources. This scaling is not just in terms of the size of the datasets but also
the complexity of the models’ input attributes [69,70].
This study used the Whisper ASR model, which employs a transformer-based encoder–
decoder architecture, to create a speech recognition system for Turkish. The performance of
the model was evaluated using various Turkish speech datasets, and there were noticeable
enhancements after fine-tuning with the LoRA method.
Initial reviews of the Turkish speech datasets identified the following common errors
and issues.
1. Transcription problems, such as the text not being read completely and correctly,
the speech being transcribed differently from what was said, and expressions that
do not occur in the speech being added to the text.
2. Spelling errors, such as missing or incorrect punctuation and Turkish characters,
and words that must be written adjacent or separately according to the spelling
rules but were transcribed differently.
3. Sound quality problems, such as the recording ending before the expression in the
text is completed, noise, etc.
In a carefully written text, it is estimated that there are approximately 1–2% spelling
errors. On the other hand, for quickly written or unchecked texts, this rate is thought to
be approximately 10–15% [71]. Within the scope of this study, the recordings at many
points in the METU MS dataset were listened to again and the transcripts were corrected.
When the initial and final versions of the texts were compared, differences of 1.7% at the
character level and 5.8% at the word level were found. Allowing for deficiencies that
may arise from the correction process itself (for example, the speaker not following the
text, or words or letters being pronounced differently), the dataset was observed to
contain approximately 1–2% letter errors and 5% word errors. This agrees with the
estimates given above.
Considering all these issues, the text produced by the current Whisper architecture
is more understandable than the measured error values suggest.
Fine-tuning can increase the performance of the models for a given task and
language. However, although datasets such as TNST and METU MS make it possible to
compare different models and architectures, they were observed to be insufficient for
fine-tuning. Although the TNST dataset contains a large number of examples, it is not
suitable for fine-tuning in its current form because of the confusion caused by the
foreign words it contains. The METU MS dataset, in turn, is considered unsuitable for
fine-tuning on its own because of its small size, even if the errors it contains are
eliminated.
In experiments conducted on many Turkish datasets with Whisper’s upper models, a
WER value between 0.05 and 0.15 was obtained. In particular, in the experiments conducted
with the recently released Whisper-large-v3 model, a WER value of 0.04 to 0.10 was reached.
These results show that Whisper is a successful architecture.
We also performed a more comprehensive statistical evaluation of the model’s performance
on the Turkish datasets. Specifically, we provide the confidence intervals for the WER
and CER metrics, which allow for a more robust comparison between the baseline and
fine-tuned models.
The confidence intervals (CIs) for the WER before and after fine-tuning of the Whisper
models are calculated at the 95% confidence level, meaning we are 95% confident
that the true WER lies within these ranges.
Confidence Intervals for Pre-Fine-Tuning WER:
METU MS: Mean = 0.202, CI = (0.152, 0.252)
TNST: Mean = 0.278, CI = (0.208, 0.347)
FLEURS: Mean = 0.228, CI = (0.165, 0.291)
Mozilla CV: Mean = 0.290, CI = (0.227, 0.354)
TASRT: Mean = 0.232, CI = (0.178, 0.286)
Confidence Intervals for Post-Fine-Tuning WER:
METU MS: Mean = 0.166, CI = (0.113, 0.218)
TNST: Mean = 0.262, CI = (0.202, 0.322)
TASRT: Mean = 0.190, CI = (0.137, 0.243)
These confidence intervals show the variability and reliability of the WER values
before and after fine-tuning. The mean WER is lower after fine-tuning for each of the
three datasets; because the intervals before and after fine-tuning overlap, the
significance of the reduction is assessed with the paired t-tests reported next.
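A 95% confidence interval of this kind can be computed from a small sample of WER values as mean ± t × s / √n. The sketch below is a minimal illustration under stated assumptions: the five WER values are hypothetical (the text does not list the underlying samples or their number), and 2.776 is the two-sided 95% critical value of Student's t distribution for 4 degrees of freedom, which applies only to a sample of five.

```python
import math
import statistics

T_CRIT_95_DF4 = 2.776  # two-sided 95% critical value, Student's t, 4 degrees of freedom


def mean_confidence_interval(values, t_crit):
    """Return (mean, (low, high)) using the sample standard deviation."""
    n = len(values)
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(n)   # standard error of the mean
    half_width = t_crit * sem
    return mean, (mean - half_width, mean + half_width)


# Hypothetical WER values, e.g. one per model size on a single dataset.
wer_values = [0.30, 0.24, 0.20, 0.16, 0.10]
mean_wer, (ci_low, ci_high) = mean_confidence_interval(wer_values, T_CRIT_95_DF4)
```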
Additionally, we used statistical significance tests (e.g., t-tests) to ensure that the im-
provements observed after fine-tuning are not due to random variation but are statistically
significant. This helps solidify the reliability of our results. The results of the paired t-tests,
which compare the WER values before and after fine-tuning for each dataset, are as follows:
METU MS: t-statistic: 5.77, p-value: 0.004
TNST: t-statistic: 4.65, p-value: 0.009
TASRT: t-statistic: 6.24, p-value: 0.003
For all the datasets (METU MS, TNST, and TASRT), the p-values are well below the
common significance threshold of 0.05. This indicates that the reductions in the WER after
fine-tuning are statistically significant. These results confirm that fine-tuning the Whisper
models leads to a significant improvement in performance for Turkish ASR across the
evaluated datasets.
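The paired t-test used for this comparison works on the per-pair differences between the WER before and after fine-tuning. The following sketch computes the t-statistic with the standard library only and compares it with the tabulated critical value instead of computing a p-value. The WER pairs are hypothetical, the sample size of five is an assumption, and the function name is invented for the example.

```python
import math
import statistics

T_CRIT_95_DF4 = 2.776  # two-sided 95% critical value, Student's t, 4 degrees of freedom


def paired_t_statistic(before, after):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sem_d = statistics.stdev(diffs) / math.sqrt(n)   # standard error of the mean difference
    return mean_d / sem_d, n - 1


# Hypothetical paired WER values (same five conditions before and after fine-tuning).
wer_before = [0.30, 0.24, 0.20, 0.16, 0.10]
wer_after = [0.26, 0.21, 0.16, 0.14, 0.07]
t_stat, dof = paired_t_statistic(wer_before, wer_after)
significant = abs(t_stat) > T_CRIT_95_DF4
```

Because the test uses the differences within each pair, a consistent reduction can be significant even when the before and after confidence intervals overlap.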
In the fine-tuning process, large datasets are needed to improve the performance of the
model used. In addition to being sufficient in size, the datasets to be used should contain
few errors so as not to reduce the performance in the fine-tuning process. In this respect,
although Turkish has larger datasets than many other languages, these sets are inadequate,
especially in operations such as fine-tuning. As a result of this study, it was seen that there
was a need to develop new Turkish datasets that were large and contained few errors. It is
thought that in the future, studies should be carried out on the development of datasets
with the above-mentioned features.
5. Conclusions
The integration of machine learning, particularly advanced deep learning techniques,
has had a significant impact on the field of ASR. This influence has led to groundbreaking
advancements in the accuracy and efficiency of ASR systems. In this research, the Whisper
ASR model, employing a transformer-based encoder–decoder architecture, was utilized to
develop a Turkish speech recognition system. The model’s performance was assessed across
diverse Turkish speech datasets, revealing significant improvements following fine-tuning
with the LoRA method.
Challenges and Solutions:
Data scarcity: one of the primary challenges in developing ASR systems for low-
resource languages like Turkish is the limited availability of labeled speech data. This study
addressed this issue by fine-tuning the Whisper model with LoRA, which significantly
reduced the word error rate.
Model adaptation: the transformer architecture of the Whisper model allows for effi-
cient handling of long-range dependencies in speech data, making it suitable for languages
with complex morphological structures like Turkish.
Future Directions:
Enhanced data collection: future research should focus on collecting more diverse and
extensive Turkish speech datasets to further improve the model’s performance.
Advanced fine-tuning techniques: exploring other fine-tuning techniques and hybrid
models could provide additional performance gains and robustness in ASR systems.
By addressing these challenges and leveraging advanced machine learning techniques,
the development of robust and accurate ASR systems for low-resource languages such as
Turkish can be significantly accelerated.
Author Contributions: Conceptualization, H.P., A.K.T. and C.K.; methodology, H.P. and A.K.T.; soft-
ware, A.K.T.; validation, H.P., A.K.T., C.K. and H.B.U.; formal analysis, H.P. and A.K.T.; investigation,
C.K. and A.K.T.; resources, C.K. and H.B.U.; data curation, A.K.T. and H.B.U.; writing—original
draft preparation, H.P. and A.K.T.; writing—review and editing, H.P., C.K. and H.B.U.; visualization,
H.B.U.; supervision, H.P. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Data Availability Statement: Middle East Technical University (METU) Turkish Microphone Speech,
data availability status: data available in a publicly accessible repository. The original data presented
in the study are openly available in [Link] (accessed on 15
September 2024). Turkish Broadcast News Speech and Text dataset, data availability status: data
available in a publicly accessible repository. The original data presented in the study are openly
available in [Link] (accessed on 15 September 2024). Mozilla
Common Voice dataset, data availability status: data available in a publicly accessible repository. The
original data presented in the study are openly available in [Link]
datasets (accessed on 15 September 2024). FLEURS dataset, data availability status: data available
in a publicly accessible repository. The original data presented in the study are openly available in
[Link] (accessed on 15 September 2024). Turkish Automatic
Speech Recognition Test (TASRT), data availability status: data available on request due to restrictions
(commercial use). The data presented in this study are available from the corresponding author
upon request.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Lv, Z.; Poiesi, F.; Dong, Q.; Lloret, J.; Song, H. Deep Learning for Intelligent Human–Computer Interaction. Appl. Sci. 2022, 12,
11457. [CrossRef]
2. Mihelic, F.; Zibert, J. Speech Recognition; InTech: Online, 2008; pp. 477–550. Available online: [Link]
books/3785 (accessed on 27 August 2024).
3. Oyucu, S. Development of Deep Learning Based Models for Turkish Speech Recognition. Ph.D. Thesis, Gazi University, Ankara,
Turkey, 2020.
4. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak
Supervision. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023;
pp. 28492–28518.
5. Song, Z.; Zhuo, J.; Yang, Y.; Ma, Z.; Zhang, S.; Chen, X. LoRA-Whisper: Parameter-Efficient and Extensible Multilingual ASR.
arXiv 2024. [CrossRef]
6. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language
Models. arXiv 2021. [CrossRef]
7. Dudley, H.; Riesz, R.R.; Watkins, S.S.A. A Synthetic Speaker. J. Frankl. Inst. 1939, 227, 739–764. [CrossRef]
8. Dudley, H. The Vocoder—Electrical Re-creation of Speech. J. Soc. Motion Pict. Eng. 1940, 34, 272–278. [CrossRef]
9. Juang, B.-H.; Rabiner, L.R. Automatic Speech Recognition—A Brief History of the Technology Development, 2nd ed.; Elsevier: Amster-
dam, The Netherlands, 2005.
10. Davis, K.H.; Biddulph, R.; Balashek, S. Automatic Recognition of Spoken Digits. J. Acoust. Soc. Am. 1952, 24, 637–642. [CrossRef]
11. Fry, D.B. Theoretical Aspects of Mechanical Speech Recognition. J. Br. Inst. Radio Eng. 1959, 19, 211–218. [CrossRef]
12. Forgie, J.W.; Forgie, C.D. Results Obtained from a Vowel Recognition Computer Program. J. Acoust. Soc. Am. 1959, 31, 1480–1489.
[CrossRef]
13. Suzuki, J.; Nakata, K. Recognition of Japanese Vowels-Preliminary to the Recognition of Speech. J. Radio Res. Lab. 1961, 37, 193–212.
14. Nagata, K.; Kato, Y.; Chiba, S. Spoken Digit Recognizer for the Japanese Language. J. Audio Eng. Soc. 1964, 12, 336–342.
15. Sakai, T. The Phonetic Typewriter: Its Fundamentals and Mechanism. Stud. Phonol. 1961, 1, 140–152.
16. Kamath, U.; Liu, J.; Whitaker, J. Deep Learning for NLP and Speech Recognition, 1st ed.; Springer: Cham, Switzerland, 2019;
pp. 369–404. [CrossRef]
17. Martin, T.B.; Nelson, A.L.; Zadell, H.J. Speech Recognition by Feature-Abstraction Techniques; Tech. Doc. Report No. AL TDR 64-176;
Air Force Avionics Lab.: Wright-Patterson AF Base, OH, USA, 1964.
18. Vintsyuk, T.K. Speech Discrimination by Dynamic Programming. Cybernetics 1968, 4, 52–57. [CrossRef]
19. Viterbi, A. Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm. IEEE Trans. Inf. Theory
1967, 13, 260–269. [CrossRef]
20. Itakura, F. Minimum Prediction Residual Principle Applied to Speech Recognition. IEEE Trans. Acoust. Speech Signal Process. 1975,
23, 67–72. [CrossRef]
21. Klatt, D.H. Review of the ARPA Speech Understanding Project. J. Acoust. Soc. Am. 1977, 62, 1345–1366. [CrossRef]
22. Lippmann, R.P. Review of neural networks for speech recognition. Neural Comput. 1989, 1, 1–38. [CrossRef]
23. Lowerre, B.; Reddy, R. The Harpy Speech Recognition System: Performance with Large Vocabularies. J. Acoust. Soc. Am. 1976, 60,
S10–S11. [CrossRef]
24. Ferguson, J.D. Symposium on the Application of Hidden Markov Models to Text and Speech; Institute for Defense Analyses, Communi-
cations Research Division: Princeton, NJ, USA, 1980.
25. Wilpon, J.G.; Rabiner, L.R.; Lee, C.H.; Goldman, E.R. Automatic Recognition of Keywords in Unconstrained Speech Using Hidden
Markov Models. IEEE Trans. Acoust. Speech Signal Process. 1990, 38, 1870–1878. [CrossRef]
26. Levinson, S.E.; Rabiner, L.R.; Sondhi, M.M. An Introduction to the Application of the Theory of Probabilistic Functions of a
Markov Process to Automatic Speech Recognition. Bell Syst. Tech. J. 1983, 62, 1035–1074. [CrossRef]
27. Liberman, M.; Wayne, C. Human Language Technology. AI Mag. 2020, 41, 22–35. [CrossRef]
28. Young, S.J.; Evermann, G.; Gales, M.J.F.; Kershaw, D.; Moore, G.; Odell, J.J.; Ollason, D.G.; Povey, D.; Valtchev, D.; Woodland, P.C.
The HTK Book. 2006. Available online: [Link] (accessed on 30 August 2024).
29. Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist Temporal Classification: Labelling Unsegmented Sequence
Data with Recurrent Neural Networks. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA,
USA, 25–29 June 2006; pp. 369–376.
30. Chan, W.; Jaitly, N.; Le, Q.V.; Vinyals, O. Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational
Speech Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
Shanghai, China, 20–25 March 2016; pp. 4960–4964. [CrossRef]
31. Zhang, Y.; Chan, W.; Jaitly, N. Very deep convolutional networks for end-to-end speech recognition. In Proceedings of the
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017;
pp. 4845–4849.
32. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You
Need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9
December 2017.
33. Schneider, S.; Baevski, A.; Collobert, R.; Auli, M. wav2vec: Unsupervised Pre-training for Speech Recognition. arXiv 2019.
[CrossRef]
34. Oord, A.v.d.; Li, Y.; Vinyals, O. Representation Learning with Contrastive Predictive Coding. arXiv 2019. [CrossRef]
35. Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations.
In Proceedings of the Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Online Conference, 6–12 December
2020; pp. 12449–12460.
36. Gulati, A.; Qin, J.; Chiu, C.-C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-
augmented Transformer for Speech Recognition. In Proceedings of the INTERSPEECH 2020, Shanghai, China, 25–29 October
2020; pp. 5036–5040. [CrossRef]
37. Arslan, R.S.; Barışçı, N. A Detailed Survey of Turkish Automatic Speech Recognition. Turk. J. Electr. Eng. Comput. Sci. 2020, 28,
3253–3269. [CrossRef]
38. Salor, Ö.; Pellom, B.L.; Ciloglu, T.; Hacioglu, K.; Demirekler, M. On Developing New Text and Audio Corpora and Speech
Recognition Tools for the Turkish Language. In Proceedings of the 7th International Conference on Spoken Language Processing
(ICSLP 2002), Denver, CO, USA, 16–20 September 2002; pp. 349–352. [CrossRef]
39. Polat, H.; Oyucu, S. Building A Speech and Text Corpus of Turkish: Large Corpus Collection with Initial Speech Recognition
Results. Symmetry 2020, 12, 290. [CrossRef]
40. Artuner, H. The Design and Implementation of a Turkish Speech Phoneme Clustering System. Ph.D. Thesis, Hacettepe University,
Ankara, Turkey, 1994.
41. Erzin, E. New Methods for Robust Speech Recognition. Master’s Thesis, Bilkent University, Ankara, Turkey, 1995.
42. Özkan, Ö. Implementation of a Speech Recognition System for Connected Numerals. Master’s Thesis, Middle East Technical
University, Ankara, Turkey, 1997.
43. Uslu, E. Large Vocabulary Continuous Speech Recognition Using Hidden Markov Model. Master’s Thesis, Yıldız Technical
University, Istanbul, Turkey, 2007.
44. Yılmaz, C. A Large Vocabulary Speech Recognition System for Turkish. Master’s Thesis, Bilkent University, Ankara, Turkey, 1999.
45. Edizkan, R. Computer Speech Recognition: Design of a Classifier in Feature Space and Subspaces. Ph.D. Thesis, Osmangazi
University, Eskişehir, Turkey, 1999.
46. Şahin, S. Language Modeling for Turkish Continuous Speech Recognition. Master’s Thesis, Middle East Technical University,
Ankara, Turkey, 2003.
47. Dede, G. Speech Recognition with Artificial Neural Networks. Master’s Thesis, Ankara University, Ankara, Turkey, 2008.
48. Susman, D. Turkish Large Vocabulary Continuous Speech Recognition by Using Limited Audio Corpus. Master’s Thesis, Middle
East Technical University, Ankara, Turkey, 2012.
49. Tombaloğlu, B.; Erdem, H. Turkish Speech Recognition Techniques and Applications of Recurrent Units (LSTM and GRU). Gazi
Univ. J. Sci. 2021, 34, 1035–1049. [CrossRef]
50. Safaya, A.; Erzin, E. Experiments on Turkish ASR with Self-Supervised Speech Representation Learning. arXiv 2022. [CrossRef]
51. Mussakhojayeva, S.; Dauletbek, K.; Yeshpanov, R.; Varol, H.A. Multilingual Speech Recognition for Turkic Languages. Information
2023, 14, 74. [CrossRef]
52. Mercan, Ö.B.; Çepni, S.; Taşar, D.E.; Ozan, Ş. Performance Comparison of Pre-trained Models for Speech-to-Text in Turkish:
Whisper-Small and Wav2Vec2-XLS-R-300M. Turk. Inform. Found. Comput. Sci. Eng. J. 2023, 16, 109–116. [CrossRef]
53. Chun, S.-J.; Park, J.B.; Ryu, H.; Jang, B.-S. Development and Benchmarking of a Korean Audio Speech Recognition Model for
Clinician-Patient Conversations in Radiation Oncology Clinics. Int. J. Med. Inform. 2023, 176, 105112. [CrossRef]
54. Yang, H.; Zhang, M.; Tao, S.; Ma, M.; Qin, Y. Chinese ASR and NER Improvement Based on Whisper Fine-Tuning. In Proceedings
of the 25th International Conference on Advanced Communication Technology (ICACT), Pyeongchang, Republic of Korea, 19–22
February 2023; pp. 213–217. [CrossRef]
55. Graham, C.; Roll, N. Evaluating OpenAI’s Whisper ASR: Performance analysis across diverse accents and speaker traits. JASA
Express Lett. 2024, 4, 025206. [CrossRef]
56. Wang, X.; Aitchison, L.; Rudolph, M. LoRA ensembles for large language model fine-tuning. arXiv 2023. [CrossRef]
57. Arora, S.J.; Singh, R.P. Automatic Speech Recognition: A Review. Int. J. Comput. Appl. 2012, 60, 34–44. [CrossRef]
58. Dhanjal, A.S.; Singh, W. A Comprehensive Survey on Automatic Speech Recognition Using Neural Networks. Multimed. Tools
Appl. 2024, 83, 23367–23412. [CrossRef]
59. Malik, M.; Malik, M.K.; Mehmood, K.; Makhdoom, I. Automatic Speech Recognition: A Survey. Multimed. Tools Appl. 2021, 80,
9411–9457. [CrossRef]
60. Wongso, W.; Lucky, H.; Suhartono, D. Pre-Trained Transformer-Based Language Models for Sundanese. J. Big Data 2022, 9, 39.
[CrossRef]
61. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. In Proceedings of the 27th International
Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 3104–3112.
62. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv 2014. [CrossRef]
63. Hendrycks, D.; Gimpel, K. Gaussian Error Linear Units (GELUs). arXiv 2023. [CrossRef]
64. Salor, O.; Ciloglu, T.; Demirekler, M. METU Turkish Microphone Speech Corpus. In Proceedings of the IEEE 14th Signal Processing
and Communications Applications, Antalya, Turkey, 17–19 April 2006; pp. 1–4. [CrossRef]
65. Arisoy, E.; Can, D.; Parlak, S.; Sak, H.; Saraclar, M. Turkish Broadcast News Transcription and Retrieval. IEEE Trans. Audio Speech
Lang. Process. 2009, 17, 874–883. [CrossRef]
66. Ardila, R.; Branson, M.; Davis, K.; Henretty, M.; Kohler, M.; Meyer, J.; Morais, R.; Saunders, L.; Tyers, F.M.; Weber, G. Common
Voice: A Massively-Multilingual Speech Corpus. arXiv 2020. [CrossRef]
67. Conneau, A.; Ma, M.; Khanuja, S.; Zhang, Y.; Axelrod, V.; Dalmia, S.; Riesa, J.; Rivera, C.; Bapna, A. FLEURS: Few-shot Learning
Evaluation of Universal Representations of Speech. arXiv 2022. [CrossRef]
68. Oyucu, S. Development of Test Corpus with Large Vocabulary for Turkish Speech Recognition System and A New Test Procedure.
Adıyaman Univ. J. Sci. 2022, 9, 156–164. [CrossRef]
69. Khan, A.; Khan, M.; Gueaieb, W.; Saddik, A.E.; De Masi, G.; Karray, F. CamoFocus: Enhancing Camouflage Object Detection with
Split-Feature Focal Modulation and Context Refinement. In Proceedings of the IEEE/CVF Winter Conference on Applications of
Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 1423–1432. [CrossRef]
70. Khan, U.; Khan, M.; Elsaddik, A.; Gueaieb, W. DDNet: Diabetic Retinopathy Detection System Using Skip Connection-based
Upgraded Feature Block. In Proceedings of the IEEE International Symposium on Medical Measurements and Applications
(MeMeA), Jeju, Republic of Korea, 14–16 June 2023; pp. 1–6. [CrossRef]
71. Jurafsky, D.; Martin, J.H. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics,
and Speech Recognition with Language Models, 3rd ed.; 2024. Available online: [Link] (accessed
on 30 August 2024).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
Article
GGTr: An Innovative Framework for Accurate and Realistic
Human Motion Prediction
Biaozhang Huang 1,2 and Xinde Li 1,2, *
1 Key Laboratory Measurement and Control of CSE, Ministry of Education, School of Automation,
Southeast University, Nanjing 210096, China; bzhuang@[Link]
2 Nanjing Center for Applied Mathematics, Nanjing 211135, China
* Correspondence: xindeli@[Link]
Abstract: Human motion prediction involves forecasting future movements based on past observa-
tions, which is a complex task due to the inherent spatial-temporal dynamics of human motion. In
this paper, we introduced a novel framework, GGTr, which adeptly encapsulates these patterns by
integrating positional graph convolutional network (GCN) layers, gated recurrent unit (GRU) net-
work layers, and transformer layers. The proposed model utilizes an enhanced GCN layer equipped
with a positional representation to aggregate information from body joints more effectively. To
address temporal dependencies, we strategically combined GRU and transformer layers, enabling the
model to capture both local and global temporal dependencies across body joints. Through extensive
experiments conducted on Human3.6M and CMU-MoCap datasets, we demonstrated the superior
performance of our proposed model. Notably, our framework shows significant improvements in
predicting long-term movements, outperforming state-of-the-art methods substantially.
Keywords: human motion prediction; graph convolutional network; gated recurrent unit; transformers
Citation: Huang, B.; Li, X. GGTr: An Innovative Framework for Accurate and Realistic Human Motion
Prediction. Electronics 2023, 12, 3305. [Link] electronics12153305
Academic Editor: Grzegorz Dudek
Received: 6 July 2023
Revised: 29 July 2023
Accepted: 30 July 2023
Published: 1 August 2023
Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution (CC BY)
license (https://[Link]/licenses/by/4.0/).
1. Introduction
Human motion prediction, the task of forecasting future human movements based on
past observations, plays a critical role in various domains such as robotics, computer vision,
healthcare, and sports analysis. Accurately predicting human motion is instrumental for
facilitating effective human-robot collaboration [1,2], ensuring system security [3,4], analyzing
human behavior and emotions [5,6], and supporting sports performance analysis [7].
However, predicting human motion presents significant challenges due to the complexity
and diversity of human behaviors. First, human motion exhibits high variability and
uncertainty. This is evident at the 3D skeletal level, due to the diversity in human body
sizes, and at the movement level, due to individual idiosyncrasies. In scenarios with rapid
changes, such as sudden reactions or motions, it is exceedingly difficult for predictive
models to adapt quickly. Second, the interplay between different body parts and their
coordinated movements further complicates the task. For instance, a motion initiated by
one body part can propagate to other body parts in complex and often non-intuitive ways.
It is therefore essential to capture both spatial and temporal features, as illustrated
in Figure 1. Numerous research efforts have been made to address the challenges of
modeling human motion. Traditional and machine learning methods, such as
hidden Markov models (HMMs) [8], Gaussian processes (GPs) [9], and restricted Boltzmann
machines [10], have been applied to this task. However, these methods may not fully capture
the complex interdependencies and non-linear dynamics present in human motion. More recently,
deep learning approaches, such as convolutional neural networks (CNNs) [11], graph convolutional
networks (GCNs) [12–15], temporal modules such as recurrent neural networks (RNNs) [16–21],
and transformers [22–24], have been used. While existing RNN and deep learning-based
models have significantly improved the prediction performance, they still have limitations.
These methods often struggle to capture the dynamic and complex interactions
between different body parts. The relations among body joints cannot be simply modeled
by static spatial proximity; instead, they are influenced by various factors such as the
individual’s physical attributes, the current body state, and the motion context. Moreover,
these models frequently encounter challenges in effectively capturing both local and global
temporal dependencies.
[Figure 1 panel labels: Historical poses; Spatiotemporal Features Extraction; Prediction poses]
Figure 1. The process of human motion prediction. The historical sequences of human motion are
subjected to feature extraction in terms of both space and time. This involves the analysis of the
spatial configuration of the body, as well as the timing and sequence of these movements. The derived
spatial and temporal features constitute the foundation for building a comprehensive understanding
of the motion patterns.
To address these challenges, we proposed a novel approach for human motion predic-
tion that combines positional graph convolutional network (GCN) layers, gated recurrent
unit (GRU) network layers, and transformer layers, termed as GGTr. This combination
allows us to better capture the complex spatial-temporal patterns in human motion data.
In summary, the primary contributions of this paper include:
(1) The introduction of a novel GCN layer with positional representation, enabling better
aggregation of information from adjacent body joints;
(2) The strategic combination of GRU and transformer layers to capture both local and
global temporal dependencies across body joints;
(3) Extensive experiments on the Human3.6M and CMU-MoCap datasets, which demonstrate
the effectiveness and advantages of our proposed framework. The experiments reveal that
our model significantly outperforms state-of-the-art methods, with the most significant
improvements in predicting long-term movements.
The remainder of this paper is organized as follows: Section 2 reviews related work.
In Section 3, we first introduce some preliminary concepts and formalize the human motion
prediction problem and the overall framework of the spatial temporal network model. Then,
we discuss in detail the three model components to deal with the spatial and temporal
dependencies, separately. In Section 4, experiments were conducted on two large-scale
datasets, comparing the performance of the proposed method with baselines. Section 5
provides a summary and conclusion, as well as a discussion of future work.
2. Related Work
Human motion prediction is a challenging task due to the complexity and variability
of human movements. This complexity arises from the intricate interplay of various factors
such as the individual’s physical characteristics, the environment, and the task at hand.
Over the years, several approaches have been proposed to tackle this problem, each with
its own strengths and limitations.
Traditional methods primarily rely on statistical approaches or prior knowledge,
such as Markov prediction models [8], Gaussian process dynamical models [9], and
restricted Boltzmann machines [10], which can only tackle simple human motion patterns.
Although these methods have achieved some success in certain scenarios, they still face
challenges in capturing complex spatial and temporal dependencies.
Chiu et al. [25] used LSTM units to model the underlying structure of human motion
hierarchically, but did not adequately utilize spatial information of human motion data.
Martinez et al. [18] introduced a residual structure using GRU to model the velocity of
human motion sequences. Their model focused on short-term temporal modeling, ignoring
long-term dependencies and spatial structure. Jain et al. [17] combined LSTM and fully
connected (FC) layers in a structural RNN model to encode high-level spatio-temporal
structures in human motion sequences. Guo et al. [26] employed FC layers and GRU to
model local structures and capture long-term temporal dependencies, but they did not
take into account the interactions between different human body limbs. The transformer
model with a self-attention mechanism has been proved to be more effective than recurrent
networks in various domains [23,27,28]. It applies a multi-head attention mechanism
to directly learn dependencies between each pair of input and output positions without
any latency.
Graph convolutional networks (GCN) have become widely used for modeling the
underlying relationships of non-Euclidean data. Kipf et al. [29] proposed a layer-wise prop-
agation rule for nodes inspired by first-order approximation of spectral convolutions on
graphs. However, this approach is limited by the characteristics of the graph. Yuan et al. [30]
proposed a more flexible approach by learning node connectivity based on node neigh-
borhood. Velickovic et al. [31] introduced self-attention to determine the neighborhood
structure to be considered, providing more flexibility to the network. This approach has
been applied to action recognition [32] by using a GCN to capture the temporal and spatial
dependencies of human body joints via a graph defined on temporally connected kinematic
trees. These techniques have been applied to human motion prediction by building the
human pose as a graph and using GCN to encode the spatial connectivity of human joints.
Ma et al. [33] proposed two variants of GCN to extract spatial and temporal features. They
built a multi-stage structure where each stage contains an encoder and a decoder and, dur-
ing training, the model is trained with intermediate supervision to learn to progressively
refine the prediction. References [12,13,34] extended the graph of the human pose to a
multi-scale version across the abstraction levels of the human pose. However, as these models
aggregate body joint features based on an input adjacency matrix, the relation between
body parts is fixed and may limit the model’s ability to adapt to complex human motions.
The transformer [35] has become the dominant approach in natural language process-
ing (NLP). The key component of the transformer is a multi-head self-attention mechanism
that captures long-range dependencies. Building upon the success of the transformer
in various tasks [36–38], researchers have increasingly focused on exploring its potential
applications in 3D human motion prediction [22–24]. Cai et al. [22] employed a transformer-
based architecture with discrete cosine transform (DCT) to capture the long-range spatial
correlations and temporal dependencies in human motion dynamics. Another notable
advancement is the spatio-temporal transformer (ST-Trans) mechanism proposed by Ak-
san et al. [23], which effectively captures the spatio-temporal dependencies of decomposed
3D human motions. However, the ST-Trans method overlooks the importance of ensuring
consistency between spatial and temporal information, which is a crucial factor when deal-
ing with time-varying data. To address this limitation, a cross-transformer approach [24]
has been developed to explore effective interaction between spatial and temporal branches.
This approach is designed to learn the coherence of spatial and temporal information and
simultaneously enhance the model’s predictive capacity. Despite these advancements,
transformer-based methods may overlook local information when dealing with human
motion data, warranting further investigation into this aspect.
In summary, with the development of human motion prediction in recent years,
GCN/GRU/transformer-based architectures have been well explored and results have
significantly improved. In this paper, we proposed a new spatial-temporal graph convo-
lutional network framework to address the human motion prediction problem. We used
graph convolutional networks with a position-wise attention mechanism to capture the
spatial dependencies of the human body joints. A gated recurrent neural network with
transformer layers was used to capture both local and global information of human motion
sequences in the temporal dimension.
$X_{pred} = f(G; X \mid Z) = (x_{T+1}, \ldots, x_{T+T})$. (1)
In the human motion prediction problem, if two body joints are connected or in close
proximity, their movements are likely to mutually influence each other. Thus, to capture
these spatial relationships, we employed the graph convolutional networks model proposed
in [29,31]. This model is used to transform and propagate motion information through the
graph structure. Specifically, given input motion information $X^{l} \in \mathbb{R}^{M \times d^{l}}$ on the structure,
where $d^{l}$ represents the input dimension, the output $X^{l+1} \in \mathbb{R}^{M \times d^{l+1}}$, with $d^{l+1}$
denoting the output dimension, can be computed as:
$X^{l+1} = \sigma(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X^{l} W^{l})$, (2)
where $\sigma(\cdot)$ is a non-linear activation function. In our work, we adopted $\mathrm{ReLU}(\cdot)$, the
rectified linear unit, which is advantageous for its ability to speed up the convergence
of stochastic gradient descent compared to sigmoid and tanh functions. $\tilde{A} = A + I_M$
is the adjusted adjacency matrix with $I_M$ as the M-dimensional identity matrix, and $\tilde{D}$
is the modified degree matrix, with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. $W^{l}$ are the parameters to be learned.
For convenience, we represent the operation in Equation (2) as follows:
[Figure 2 panel labels: Historical poses $x_{1:T}$; Positional Encoding; Pointwise addition; Transformer Layers; Linear Prediction]
Figure 2. The whole architecture of our suggested solution for motion prediction, which employs an
end-to-end framework. We encode human poses x1:T and feed them to GGTr. The GCN layer aids
in understanding spatial relationships among body joints within the human motion network. Fol-
lowing the GRU is a transformer layer, designed to grasp global temporal dependencies. Ultimately,
the transformer’s output is harnessed to predict future motion states.
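The normalized propagation rule of Equation (2) can be illustrated with a minimal sketch in plain Python. The three-joint chain, the feature values, and the identity weight matrix are illustrative assumptions, and the helper names `matmul` and `gcn_layer` are hypothetical rather than part of the original implementation.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(adj, x, w):
    """One propagation step: ReLU(D~^-1/2 (A + I) D~^-1/2 X W)."""
    m = len(adj)
    # Add self-loops: A~ = A + I_M
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(m)] for i in range(m)]
    # Degree of every joint in A~
    deg = [sum(row) for row in a_tilde]
    # Symmetric normalization of the adjacency matrix
    a_hat = [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(m)] for i in range(m)]
    out = matmul(matmul(a_hat, x), w)
    return [[max(0.0, v) for v in row] for row in out]

# Toy skeleton (assumption): three joints in a chain 0 - 1 - 2, two features per joint.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, for illustration only
x_next = gcn_layer(adj, x, w)
```

Each output row mixes a joint's own features with those of its graph neighbors, weighted by the normalized adjacency.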
In the aforementioned formulation, the operation relies solely on the human body
joints structure information, which is premised on the physical proximity between body
parts. However, the interplay between body joints can be significantly more intricate.
Various factors such as muscle constraints, physical capabilities, motion style, and ongoing
actions can impact the motion of body parts. Therefore, the information from neighboring
joints should not be aggregated equally to a given central joint when performing the aggre-
gation in Equation (2). Recently, GaAN [39] attempted to employ the attention mechanism
to model the complex relationships between graph nodes. Ideally, the aforementioned fac-
tors could be used to calculate the attention score, but these factors are not always available.
Furthermore, there may be other factors influencing the relations between joints that we are
not aware of. Therefore, in this paper, we proposed learning a positional representation to
capture these factors for each joint. Specifically, for joints $v_i$, $v_j$, we aimed to learn two latent
positional representations $h_i, h_j \in \mathbb{R}^{M}$. We then modeled the pairwise relations between
any body joints as:
$e_{ij} = a([W h_i \,\|\, W h_j])$
$\alpha_{ij} = \frac{\exp(\mathrm{LeakyReLU}(a(h_i, h_j)))}{\sum_{k=1}^{M} \exp(\mathrm{LeakyReLU}(a(h_i, h_k)))}$, (4)
where $a(\cdot)$ is a relation score function modeled with dot product as follows:
Subsequently, the GCN operation can be applied to the newly learned relation matrix
$\mathrm{mask}(\alpha)$:
$X^{l+1} = \sigma(\tilde{D}_{\alpha}^{-1/2} \tilde{\alpha} \tilde{D}_{\alpha}^{-1/2} X^{l} W^{l})$, (7)
where $\tilde{\alpha} = \mathrm{mask}(\alpha) + I_M$ and $\tilde{D}_{\alpha}$ is the degree matrix for $\tilde{\alpha}$. For simplicity, we represent
the operation in Equation (7) as:
$X^{l+1} = \mathrm{GCNN}(A, X^{l})$. (8)
Instead of the operation in (2), the operation in (8) was used to capture the spatial
relations. This approach allowed us to learn a positional representation for each joint,
capturing various factors that influence their relationships. By using these learned repre-
sentations in combination with an attention mechanism and a sparsified relation matrix,
we can effectively model complex spatial relationships between different body joints and
improve the accuracy of human motion prediction.
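As a rough sketch of how a learned relation matrix could be obtained from positional representations, the following plain-Python example scores joint pairs with a dot product, applies LeakyReLU, and normalizes each row with a softmax. Several parts are assumptions made for illustration: the projection W is omitted, the LeakyReLU slope of 0.2 is a common default, and `mask_top_k` is a hypothetical sparsifier, since the exact form of mask(α) is not given in this excerpt.

```python
import math

def leaky_relu(v, slope=0.2):
    """LeakyReLU with an assumed negative slope of 0.2."""
    return v if v > 0 else slope * v

def relation_matrix(h):
    """Row-wise softmax over LeakyReLU(dot(h_i, h_k)) scores."""
    m = len(h)
    alpha = []
    for i in range(m):
        scores = [leaky_relu(sum(a * b for a, b in zip(h[i], h[k]))) for k in range(m)]
        peak = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        alpha.append([e / total for e in exps])
    return alpha

def mask_top_k(alpha, k):
    """Hypothetical sparsifier: keep the k largest weights in every row."""
    masked = []
    for row in alpha:
        keep = set(sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k])
        masked.append([v if j in keep else 0.0 for j, v in enumerate(row)])
    return masked

# Toy positional representations for three joints (illustrative values).
h = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.5]]
alpha = relation_matrix(h)
sparse_alpha = mask_top_k(alpha, k=2)
```

The sparsified matrix, with self-loops added, would then take the place of the fixed adjacency matrix in the normalized propagation of Equation (7).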
$\tilde{x}_t = \mathrm{GCNN}(A, x_t)$
$\tilde{h}_{t-1} = \mathrm{GCNN}(A, h_{t-1})$. (9)
For a body joint $v_i$ at time step t, the history observation $x_t[i]$ contains the dynamic
features for joint $v_i$ at the t-th time step. The update gate and reset gate are denoted as follows:
[Figure 3 panel labels: inputs $x_t$ and $h_{t-1}$; reset gate $r_t$; update gate $u_t$; $c_t$; $\sigma$ and tanh activations; output $h_t$]
Figure 3. Architectural elements in a GRU layer.
The above are the fundamental equations and steps of GRU. By controlling the update
and reset gates, GRU can dynamically update and adjust the hidden state based on different
patterns in the sequence, enabling it to better capture the local temporal information.
However, in the human body motion prediction problem, temporal dependencies are not
purely sequential, so it is also important to capture the global temporal
information for accurate human motion prediction. Thus, after processing with the GRU
layer, we adopted a transformer layer [35] to capture the global dependencies.
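To make the gating mechanism concrete, the following is a minimal sketch of one standard GRU update in plain Python. It uses the textbook formulation and is not necessarily identical to the graph-convolved variant used in the model; the parameter names (`Wu`, `Uu`, `bu`, and so on) and the zero-valued placeholder weights are assumptions made for illustration.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(w, u, x, h, b):
    """Compute W x + U h + b, with matrices stored as lists of rows."""
    return [
        sum(wi * xi for wi, xi in zip(w_row, x))
        + sum(ui * hi for ui, hi in zip(u_row, h))
        + bi
        for w_row, u_row, bi in zip(w, u, b)
    ]

def gru_step(x, h_prev, p):
    """One standard GRU update for the feature vector of a single joint."""
    u = [sigmoid(v) for v in affine(p["Wu"], p["Uu"], x, h_prev, p["bu"])]  # update gate
    r = [sigmoid(v) for v in affine(p["Wr"], p["Ur"], x, h_prev, p["br"])]  # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    c = [math.tanh(v) for v in affine(p["Wc"], p["Uc"], x, gated, p["bc"])]  # candidate state
    # The update gate blends the previous state with the candidate state.
    return [ui * hi + (1.0 - ui) * ci for ui, hi, ci in zip(u, h_prev, c)]

def zero_params(d_in, d_hidden):
    """Placeholder parameters; a trained model would learn these values."""
    p = {}
    for gate in "urc":
        p["W" + gate] = [[0.0] * d_in for _ in range(d_hidden)]
        p["U" + gate] = [[0.0] * d_hidden for _ in range(d_hidden)]
        p["b" + gate] = [0.0] * d_hidden
    return p

params = zero_params(d_in=3, d_hidden=2)
h_next = gru_step([0.1, 0.2, 0.3], [1.0, -2.0], params)
```

With all-zero parameters both gates evaluate to 0.5 and the candidate state is zero, so the hidden state is simply halved; learned weights would instead let the gates decide how much of the past to keep at each step.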
The transformer layer, similar to the GRU layer, was applied to each joint individually.
For joint vi , the output sequence (h1 [i ], . . . , h T [i ]) from GRU was taken as the input for the
transformer. Figure 4 illustrates that a transformer layer consists of a multi-head attention
layer, a shared feed-forward neural network layer, and normalization layers between them.
Inputs of model. The input of the model is the spatial dependency embedding se-
quence of each joint. Let us consider the i-th joint’s spatial dependency embedding sequence
(h1 [i ], h2 [i ], . . . , h T [i ]). Although the self-attention mechanism is effective in capturing hid-
den dependency relationships in the sequence, it fails to maintain location information.
Hence, we incorporated position encoding et as proposed by [35], between GRU and the
transformer layer. This choice was motivated by the ability of this approach to facilitate the
model’s learning to attend to relative positions in a sequence. The new representations $h_t[i]$
were computed as follows:
$h_t[i] = h_t[i] + e_t$, (12)
where $e_t$ is defined as
$e_t = \begin{cases} \sin(t / 10000^{2i/d_{model}}), & \text{if } i = 0, 2, 4, \ldots \\ \cos(t / 10000^{2i/d_{model}}), & \text{otherwise.} \end{cases}$ (13)
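The position encoding can be sketched in plain Python as follows. This sketch follows the common sinusoidal formulation of [35], in which each pair of dimensions (2i, 2i+1) shares one frequency, so the indexing may differ slightly from Equation (13); the function names are hypothetical.

```python
import math

def positional_encoding(num_steps, d_model):
    """Return a num_steps x d_model table of sinusoidal encodings."""
    table = []
    for t in range(num_steps):
        row = []
        for k in range(d_model):
            # Dimensions 2i and 2i+1 share the exponent 2i / d_model.
            angle = t / (10000 ** ((k - k % 2) / d_model))
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(h_seq):
    """Add e_t to every hidden state h_t of one joint, as in Equation (12)."""
    pe = positional_encoding(len(h_seq), len(h_seq[0]))
    return [[h + e for h, e in zip(h_t, e_t)] for h_t, e_t in zip(h_seq, pe)]

pe = positional_encoding(num_steps=2, d_model=4)
h_with_pos = add_positions([[1.0, 1.0], [1.0, 1.0]])
```

Because the encoding depends only on the time index, the same table can be reused for every joint.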
[Figure 4 panel labels: input $h_t$; Q, K, V; Global Temporal Multi-Head Self-Attention; Add & Layer Normalize; Feed Forward; Add & Layer Normalize; output]
Figure 4. Architectural elements in a transformer layer.
This signifies that the output of multi-head attention is the concatenation of s single
head attention blocks, each one being projected with the matrix $W^{O}$. Each single head
attention block, $\mathrm{head}_s$, can be given by:
$\mathrm{head}_s = \mathrm{Attention}(h_{v_i}) = \mathrm{softmax}\left(\frac{Q_s K_s^{T}}{\sqrt{d_k}}\right) V_s$, (15)
where $Q_s$, $K_s$, and $V_s$ denote the queries, keys, and values of the s-th single head
attention for joint $v_i$, respectively. We obtained $Q_s$, $K_s$, and $V_s$ by a linear projection
with $W_s^{Q}$, $W_s^{K}$, and $W_s^{V}$:
where $W_s^{Q} \in \mathbb{R}^{d \times d_k}$, $W_s^{K} \in \mathbb{R}^{d \times d_k}$, and $W_s^{V} \in \mathbb{R}^{d \times d_v}$ are the trainable projection matrices for
the s-th attention head and are shared by all the joints.
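A single attention head, as in Equation (15), can be sketched in plain Python as follows. The queries, keys, and values are assumed to be already projected, and the toy matrices are illustrative only; the function names are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Compute softmax(Q K^T / sqrt(d_k)) V for matrices stored as lists of rows."""
    d_k = len(k[0])
    out = []
    for q_row in q:
        # Similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in k]
        weights = softmax(scores)
        # Weighted sum of the value rows
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v)) for j in range(len(v[0]))])
    return out

# Two time steps with already projected queries, keys, and values (toy values).
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
head = attention(q, k, v)
```

Each output row is a convex combination of the value rows, so every time step can draw on information from all other time steps regardless of their distance in the sequence.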
Feed-forward Layer. After the multi-head attention layer, the output states are
processed by a feed-forward neural network layer. This layer is identical for all joints and
includes two linear transformations with a ReLU activation function in between:
where $W_1$, $W_2$ are the trainable parameters of the feed-forward neural network and $b_1$ and
$b_2$ are bias terms.
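The position-wise feed-forward layer described above (two linear maps with a ReLU in between) can be sketched in plain Python as follows. The weight values and dimensions are illustrative assumptions, and the helper names are hypothetical.

```python
def linear(x, w, b):
    """Compute x W + b for a row vector x and a matrix W stored as a list of rows."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]

def feed_forward(x, w1, b1, w2, b2):
    """Two linear transformations with a ReLU activation in between."""
    hidden = [max(0.0, v) for v in linear(x, w1, b1)]
    return linear(hidden, w2, b2)

# Toy parameters (assumption): model width 2, hidden width 3.
w1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -0.5]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.1, -0.1]
y = feed_forward([1.0, -1.0], w1, b1, w2, b2)
```

The same parameters are applied to every joint and every time step, which keeps the layer independent of sequence length.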
Residual connection and normalization. As shown in Figure 4, the residual connection
and normalization operators appear in each layer. The residual connection was introduced
to ease the training of deep networks, and normalization
was applied to prevent overfitting. Given the embedded feature $Z_{in}$, the process of the
residual connection and normalization layer is denoted as follows:
$Z_{in} = \mathrm{LayerNorm}(Z_{in} + \mathrm{MultiHead}(Z_{in}))$
$h_{out} = \mathrm{LayerNorm}(Z_{in} + \mathrm{FFN}(Z_{in}))$, (18)
where M is the number of joints in the human pose, T is the number of time steps in the future
series, $\hat{x}_t[i]$ is the predicted joint position at the t-th time step of the i-th joint, and $x_t[i]$ is
the corresponding ground truth.
We optimized the loss function using the improved Adam method (AdamW [42]),
which mitigates the overfitting problem by adding a weight decay term and can significantly
improve the robustness of the model.
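The loss equation itself is not reproduced in this excerpt. As an assumption, the following sketch computes the mean per-joint position error (MPJPE), the metric reported in the experiments, averaged over T future time steps and M joints; the function name and the nested-list layout are hypothetical.

```python
import math

def mpjpe(pred, target):
    """Mean Euclidean distance between predicted and true joints.

    pred and target are nested lists of shape [T][M][3].
    """
    total, count = 0.0, 0
    for pred_t, target_t in zip(pred, target):
        for p, g in zip(pred_t, target_t):
            total += math.dist(p, g)  # L2 distance for one joint at one time step
            count += 1
    return total / count

# One future frame with two joints (toy values).
pred = [[[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]]
target = [[[0.0, 0.0, 0.0], [1.0, 3.0, 4.0]]]
error = mpjpe(pred, target)
```

In this toy case the first joint is predicted exactly and the second is off by 5 units, so the average error is 2.5.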
4. Experiments
In this section, we present experiments on two large-scale human motion capture
benchmark datasets (Human3.6M and CMU-MoCap) to demonstrate the effectiveness
of the GGTr network for human motion prediction. We first introduced the experimen-
tal settings, including datasets, baselines, and parameter settings. Then, we conducted
experiments to compare the performance of the GGTr with other baselines. Finally, we de-
signed comprehensive ablation studies to evaluate the impact of the essential architectural
components.
4.1. Datasets
Human3.6M [41] is the largest existing human motion analysis database, consisting
of seven actors (S1, S5, S6, S7, S8, S9, and S11) performing 15 actions such as walking,
eating, smoking, discussing, and directions. Each pose includes 32 joints, represented
in the form of an exponential map. Following the data processing of [18], we converted
these into 3D coordinates and eliminated redundant joints, global rotation, and translation;
the resulting skeleton retains 17 joints that provide sufficient human motion details. These
joints include key ones that locate major body parts (e.g., shoulders, knees, and elbows).
We downsampled the frame rate to 25 fps and used S5 and S11 for testing and validation,
while the remaining five actors were used for training.
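The actor split and frame-rate reduction described above could be sketched as follows. Two details are assumptions: the text does not state which of S5 and S11 serves for testing and which for validation (S5 for testing is assumed here), and the 50 fps source rate in the example is hypothetical. The function names are illustrative only.

```python
ACTORS = ["S1", "S5", "S6", "S7", "S8", "S9", "S11"]

def split_actors(actors, test=("S5",), val=("S11",)):
    """Assumed roles: S5 for testing, S11 for validation, the rest for training."""
    train = [a for a in actors if a not in test and a not in val]
    return {"train": train, "val": list(val), "test": list(test)}

def downsample(frames, source_fps, target_fps=25):
    """Keep every k-th frame, where k = source_fps / target_fps."""
    if source_fps % target_fps != 0:
        raise ValueError("source_fps must be a multiple of target_fps")
    return frames[:: source_fps // target_fps]

splits = split_actors(ACTORS)
# 50 fps is a hypothetical source rate used only for this example.
clip_25fps = downsample(list(range(10)), source_fps=50)
```

Splitting by actor rather than by clip keeps all motion of the held-out subjects unseen during training.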
CMU-MoCap (Available at [Link] accessed on 5 July 2023) is a
3D human motion dataset, released by Carnegie Mellon University, that used 12 Vicon
infrared MX-40 cameras to record the positions of 41 sensors attached to the human body,
describing human motion. The dataset can be divided into six motion themes, including
human interaction, interaction with environment, locomotion, physical activities and sports,
situations and scenarios, and test motions. We adopted the same data preprocessing method
as described in the literature [43], simplifying each human body and reducing the motion
rate to 25 frames per second. Furthermore, seven actions were selected from the dataset to
evaluate the model’s performance. No hyperparameters were adjusted on this dataset. We
used only the training and testing sets, with the splitting method consistent with common
practice in the literature [43].
Hyperparameter/Config    Value
Optimizer                AdamW
Base learning rate       $5 \times 10^{-3}$
Weight decay             $10^{-2}$
Batch size               16
Warmup epochs            5
Epochs                   800
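The settings above can be collected in a small configuration, sketched here in plain Python. The values come from the table; the linear shape of the warmup and the constant rate afterwards are assumptions, since the schedule is not specified in this excerpt, and the names `CONFIG` and `learning_rate` are hypothetical.

```python
CONFIG = {
    "optimizer": "AdamW",
    "base_lr": 5e-3,
    "weight_decay": 1e-2,
    "batch_size": 16,
    "warmup_epochs": 5,
    "epochs": 800,
}

def learning_rate(epoch, cfg=CONFIG):
    """Linear warmup to base_lr, then constant (the schedule shape is assumed)."""
    if epoch < cfg["warmup_epochs"]:
        return cfg["base_lr"] * (epoch + 1) / cfg["warmup_epochs"]
    return cfg["base_lr"]

schedule = [learning_rate(e) for e in range(CONFIG["epochs"])]
```

A warmup of this kind starts training with small steps and reaches the base learning rate after the fifth epoch.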
aimed to demonstrate its effectiveness and highlight its advantages in terms of accuracy
and performance.
Table 2. Prediction of 3D joint positions on Human3.6M for all actions. The best results are marked
in bold.
Time(ms) 80 160 320 400 560 1000 80 160 320 400 560 1000
Action Walking Eating
Res. sup. [18] 29.4 50.8 76.0 81.5 81.7 100.7 16.8 30.6 56.9 68.7 79.9 100.2
LTD [43] 12.3 23.0 39.8 46.1 54.1 59.8 8.4 16.9 33.2 40.7 53.4 77.8
ST-Trans [23] 8.9 15.5 32.1 38.5 - - 9.4 21.1 36.4 42.3 - -
MSR [34] 12.2 22.7 38.6 45.2 52.7 63.0 8.4 17.1 33.0 40.4 52.5 77.1
ST-DGCN [33] 10.2 19.8 34.5 40.3 48.1 56.4 7.0 15.1 30.6 38.1 51.1 76.0
LCDC [15] 11.1 22.4 38.8 45.2 52.7 59.8 7.0 15.5 31.7 39.2 51.9 76.2
Ours 10.3 19.5 34.9 41.8 51.1 54.9 6.9 15.0 31.6 36.3 50.2 71.7
Action Smoking Discussion
Res. sup. [18] 23.0 42.6 70.1 82.7 94.8 137.4 32.9 61.2 90.9 96.2 121.3 161.7
LTD [43] 7.9 16.2 31.9 38.9 50.7 72.6 12.5 27.4 58.5 71.7 91.6 121.5
ST-Trans [23] 8.8 15.2 25.1 24.5 - - 7.9 25.7 39.9 47.5 - -
MSR [34] 8.0 16.3 31.3 38.2 49.5 71.6 12.0 26.8 57.1 69.7 88.6 117.6
ST-DGCN [33] 6.6 14.1 28.2 34.7 46.5 69.5 10.0 23.8 53.6 66.7 87.1 118.2
LCDC [15] 6.6 14.8 29.8 36.7 48.1 71.2 10.0 24.4 54.5 67.4 87.0 116.3
Ours 6.4 14.1 28.4 33.1 45.3 67.3 9.8 23.2 49.5 58.4 83.9 106.5
Action Directions Greeting
Res. sup. [18] 35.4 57.3 76.3 87.7 110.1 152.5 34.5 63.4 124.6 142.5 156.1 166.5
LTD [43] 9.0 19.9 43.4 53.7 71.0 101.8 18.7 38.7 77.7 93.4 115.4 148.8
ST-Trans [23] 10.2 17.8 42.5 48.6 - - 13.5 26.1 54.0 73.2 - -
MSR [34] 8.6 19.7 43.3 53.8 71.2 100.6 16.5 37.0 77.3 93.4 116.3 147.2
ST-DGCN [33] 7.2 17.6 40.9 51.5 69.3 100.4 15.2 34.1 71.6 87.1 110.2 143.5
LCDC [15] 6.9 17.4 41.0 51.7 69.1 99.1 14.3 33.5 72.2 87.3 108.7 142.3
Ours 6.9 17.0 39.2 48.4 66.4 94.6 13.4 39.3 69.4 81.6 103.3 137.6
Action Phoning Posing
Res. sup. [18] 38.0 69.3 115.0 126.7 141.2 131.5 36.1 69.1 130.5 157.1 194.7 240.2
LTD [43] 10.2 21.0 42.5 52.3 69.2 103.1 13.7 29.9 66.6 84.1 114.5 173.0
ST-Trans [23] 15.3 20.4 31.4 38.8 - - 10.6 22.8 57.6 73.7 - -
MSR [34] 10.1 20.7 41.5 51.3 68.3 104.4 12.8 29.4 67.0 85.0 116.3 174.3
ST-DGCN [33] 8.3 18.3 38.7 48.4 65.9 102.7 10.7 25.7 60.0 76.6 106.1 164.8
LCDC [15] 8.5 19.2 40.3 49.9 66.7 102.2 10.1 25.4 60.6 77.3 106.5 163.3
Ours 8.6 18.4 39.9 46.5 63.8 99.1 9.9 22.6 57.6 73.5 103.7 158.4
Action Purchases Sitting
Res. sup. [18] 36.3 60.3 86.5 95.9 122.7 160.3 42.6 81.4 134.7 151.8 167.4 201.5
LTD [43] 15.6 32.8 65.7 79.3 102.0 143.5 10.6 21.9 46.3 57.9 78.3 119.7
ST-Trans [23] 17.3 32.5 60.0 68.3 - - 8.5 22.9 47.8 66.8 - -
MSR [34] 14.8 32.4 66.1 79.6 101.6 139.2 10.5 22.0 46.3 57.8 78.2 120.0
ST-DGCN [33] 12.5 28.7 60.1 73.3 95.3 133.3 8.8 19.2 42.4 53.8 74.4 116.1
LCDC [15] 12.7 29.7 62.3 75.8 97.5 137.8 8.8 19.3 42.9 54.3 74.9 117.8
Ours 12.3 28.5 61.7 67.9 91.1 126.1 8.5 19.1 40.1 49.8 69.6 115.2
Electronics 2023, 12, 3305
Table 2. Cont.
Among the comparison methods, RNN-based techniques exhibited the poorest perfor-
mance, while transformer-based approaches outperformed GCN-based methods. Above all,
our method emerged as the top performer. The most significant improvements in the MPJPE
metric were observed with transformer-based techniques, highlighting their aptitude for
modeling 3D human motion dynamics and capturing global dependencies.
Specifically, existing methods generally performed well when predicting periodic and
simpler movements, such as “walking” and “eating”. However, their performance dropped
significantly when tasked with predicting more unpredictable and irregular movements like
“directions”, “posing”, and “purchases”. This indicates that these methods have difficulty
managing the dynamic changes and local–global dependencies inherent in human motion.
On the other hand, the algorithm proposed in our study demonstrated a high pre-
diction accuracy, even with highly complex, non-periodic, and irregular movements. Our
experimental results revealed that our proposed method surpasses most baseline methods
in short-term motion prediction, with even greater improvements noted in long-term pre-
diction. Our method delivered a superior performance in the 560 ms and 1000 ms MPJPE
metrics, an accomplishment attributable to GGTr’s ability to fully capture spatial correlation
and local–global temporal features, thereby bolstering the model’s prediction accuracy.
While our predictions for movements such as “walking”, “smoking”, “greeting” and
“waiting together” fell short compared to those of MSR [34], this nonetheless underscores the
sophisticated nature of transformer-based approaches. Looking ahead, we are committed
to further refining our model to enhance its performance.
Overall, our proposed method outperformed all the baseline models on average,
proving its superior performance in motion prediction. The outstanding performance
across both short-term and long-term human motion prediction highlights our model’s
effective capacity to capture both local and global temporal dependencies.
CMU-MoCap. To further validate the generalization of the proposed method, we com-
pared its performance with five existing algorithms [15,18,22,34,43] on the CMU-MoCap
dataset. The mean per joint position error was calculated for short-term and long-term
predictions. The experimental results are shown in Table 3, which includes the actions “bas-
ketball”, “basketball signal”, “directing traffic”, “jumping”, “soccer”, “walking”, and “wash
window”, as well as the average prediction error across all actions.
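As a rough illustration of the metric used above, the following minimal sketch computes the mean per joint position error (MPJPE) as the average Euclidean distance between predicted and ground-truth 3D joints. The function name `mpjpe` and the toy poses are assumptions made for illustration and are not taken from the authors' code; the per-action values are those reported for the proposed method at 80 ms in Table 3, and their plain mean appears to reproduce the reported average.

```python
import math

def mpjpe(pred, target):
    """Mean per joint position error over all frames and joints.

    pred, target: lists of frames; each frame is a list of (x, y, z) joints
    in the same unit (millimetres in Tables 2 and 3).
    """
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            total += math.dist(p, t)  # Euclidean distance for one joint
            count += 1
    return total / count

# Toy data (hypothetical): 2 frames with 2 joints each.
target = [[(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)],
          [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)]]
pred = [[(3.0, 4.0, 0.0), (100.0, 0.0, 0.0)],
        [(0.0, 0.0, 0.0), (100.0, 6.0, 8.0)]]
toy_error = mpjpe(pred, target)  # joint errors are 5, 0, 0 and 10 mm

# Per-action MPJPE of the proposed method at 80 ms, copied from Table 3.
ours_80ms = {
    "basketball": 9.5, "basketball signal": 2.5, "directing traffic": 5.0,
    "jumping": 13.1, "soccer": 9.8, "walking": 5.8, "wash window": 4.8,
}
average_80ms = sum(ours_80ms.values()) / len(ours_80ms)
print(round(toy_error, 2), round(average_80ms, 1))
```

Whether the reported values are computed on a single frame at each horizon or averaged over frames is not stated in this section, so the frame handling above is only one possible reading.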
Table 3. Prediction of 3D joint positions on CMU-MoCap for all actions. The best results are marked
in bold.
Time(ms) 80 160 320 400 560 1000 80 160 320 400 560 1000
Action Basketball Basketball Signal
Res. sup. [18] 29.5 53.1 91.2 106.0 128.7 157.4 14.6 22.1 39.1 46.6 60.0 89.9
LTD [43] 11.7 21.1 40.7 50.6 68.0 95.7 3.4 6.2 13.5 17.9 27.3 51.9
LPJP [22] 11.6 21.7 44.4 57.3 - 90.9 2.6 4.9 12.7 18.7 - 75.8
MSR [34] 10.3 18.9 37.7 47.0 62.0 86.3 3.0 5.6 12.5 16.6 25.5 50.0
LCDC [15] 9.6 17.6 35.4 44.4 60.0 88.4 2.6 4.7 10.4 13.9 21.9 46.2
Ours 9.5 17.5 32.5 41.5 56.8 88.4 2.5 4.6 11.5 13.1 20.6 45.8
Action Directing Traffic Jumping
Res. sup. [18] 21.8 38.8 70.5 85.3 110.3 165.1 30.2 53.0 89.4 103.9 125.6 160.5
LTD [43] 6.8 13.4 29.6 39.1 59.6 112.8 17.1 32.1 59.8 72.5 94.3 127.2
LPJP [22] 6.2 12.7 29.1 39.6 - 149.1 12.9 27.6 73.5 92.2 - 176.6
MSR [34] 6.1 12.6 29.4 39.2 50.5 114.6 15.2 28.9 56.0 69.1 92.4 126.2
LCDC [15] 5.0 10.0 23.4 31.4 49.3 99.6 12.8 26.1 54.6 68.5 91.8 126.1
Ours 5.0 9.9 22.4 30.9 48.6 93.9 13.1 27.2 53.6 67.8 89.8 116.9
Action Soccer Walking
Res. sup. [18] 26.5 47.0 81.5 96.2 117.9 139.1 14.6 22.9 36.1 40.9 51.1 69.5
LTD [43] 13.6 24.3 44.4 54.3 73.1 111.6 6.7 11.1 18.1 21.0 25.2 32.4
LPJP [22] 9.2 18.4 39.2 49.5 - 93.9 6.7 10.7 21.7 27.5 - 37.4
MSR [34] 10.9 19.4 37.4 47.0 65.3 101.9 6.4 10.3 16.9 20.1 25.5 36.8
LCDC [15] 10.3 19.0 36.8 45.7 62.3 96.9 6.3 10.4 16.1 18.6 23.3 33.6
Ours 9.8 19.0 35.3 44.6 59.7 92.7 5.8 10.2 15.4 17.3 22.9 33.2
Action Wash Window Average
Res. sup. [18] 19.3 31.8 56.1 66.0 83.6 125.9 22.4 38.4 66.2 77.8 96.7 129.6
LTD [43] 5.9 11.3 24.1 31.0 43.4 66.9 9.3 17.1 32.9 40.9 55.9 85.5
LPJP [22] 5.4 11.3 29.2 39.6 - 79.1 7.8 15.3 35.7 46.3 - 100.4
MSR [34] 5.4 10.9 24.5 31.8 45.1 70.2 8.2 15.2 30.5 38.7 52.3 83.7
LCDC [15] 4.8 9.5 22.0 29.0 42.5 68.9 7.3 13.9 28.4 35.9 50.1 80.0
Ours 4.8 9.5 20.3 28.0 41.3 67.4 7.2 14.0 27.3 34.7 48.5 76.9
Through quantitative evaluation, we observe that our method handles various types of
actions effectively and achieves the lowest error for most actions and prediction horizons.
These empirical findings support the effectiveness of our approach in human motion
prediction, both for short-term and long-term predictions. The consistent improvement
over state-of-the-art methods on this dataset demonstrates the robustness of our method.
Notably, as can be seen from Tables 2 and 3, our model performed well on the Human3.6M
and CMU-MoCap datasets for prediction horizons up to 320 ms, and even better for
horizons above 320 ms. This performance pivot at 320 ms may represent a transition
point where the prediction difficulty shifts from short-term to long-term. We believe this is
due to our model’s capability to capture local and global temporal dependencies across body
joints through a strategic combination of GRU and transformer layers, which handles this
transition effectively. Furthermore, this phenomenon could be tied to the inherent complexity
of human movement, especially when the prediction horizon exceeds 320 ms. This complexity
poses a challenge for models that rely on short-term prediction. However, our model, thanks
to its specific structure and training procedure, seems to cope with this challenge more
effectively. In future work, we plan to delve deeper into this “prognostic barrier” at 320 ms,
aiming to better understand its underlying causes and how we can further improve our
model’s performance at this critical transition point. This understanding will potentially
enable us to optimize the dynamics of 3D joint position prediction.
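To make the horizon-dependent behaviour easier to inspect, the short sketch below computes the relative reduction in average MPJPE of the proposed method with respect to LCDC [15] at each horizon, using the "Average" values reported in Table 3. The helper name `relative_gain` is an assumption introduced here for illustration, and the comparison against LCDC alone is a choice made for brevity rather than part of the original analysis.

```python
horizons_ms = [80, 160, 320, 400, 560, 1000]
lcdc_avg = [7.3, 13.9, 28.4, 35.9, 50.1, 80.0]  # Table 3, "Average", LCDC [15]
ours_avg = [7.2, 14.0, 27.3, 34.7, 48.5, 76.9]  # Table 3, "Average", Ours

def relative_gain(baseline, ours):
    """Percentage reduction in MPJPE relative to the baseline (positive = better)."""
    return 100.0 * (baseline - ours) / baseline

gains = {h: relative_gain(b, o)
         for h, b, o in zip(horizons_ms, lcdc_avg, ours_avg)}

for horizon, gain in gains.items():
    print(f"{horizon:>4} ms: {gain:+.1f}%")
```

With these numbers, the relative gain is small or slightly negative at 80 ms and 160 ms and exceeds 3% from 320 ms onward.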
ground truth pose sequences (in black), followed by the predicted poses (in blue). In other
words, each row presents the prediction results from one model.
[Figure 5 panels: (a) Phoning; (b) Discussion; (c) Purchases. Rows in each panel: Ground Truth, Res. sup., MSR, LCDC, Ours; horizontal axis in ms.]
Figure 5. Qualitative analysis of the Human3.6M dataset: comparing phoning, discussion, and pur-
chases scenarios. (a) Phoning; (b) Discussion; (c) Purchases.
The visualization results indicate that our model is capable of adequately capturing
spatial dependencies and local–global temporal dependencies. It is noticeable that the
predictions generated by our method show higher similarity to the actual sequences and
better continuity between frames. For instance, in the case of subtler movements such
as “phoning”, our model successfully captures the long-distance temporal dependencies
concealed within the movement sequence, yielding superior long-term predictions. More-
over, in the “purchases” motion visualization, the movements between hands are more
coordinated. This illustrates our model’s proficiency in forecasting highly complicated
irregular movements and complex periodic motions.
are presented in Table 4. The optimal performance was achieved by integrating these
three components.
Table 4. The influence of the GCNN, GRU, and transformer layers (Tr) on the Human3.6M dataset,
on average, is notable. These three components of our model significantly contribute to its overall
accuracy. The best results are marked in bold.
Human3.6M, MPJPE (mm)
GCNN GRU Tr 80 160 320 400 560 1000
√ √ 10.3 23.2 48.1 57.0 75.7 107.4
√ √ 10.2 22.7 47.5 56.2 75.3 107.0
√ √ 10.2 23.5 47.7 57.8 77.2 111.0
√ √ √ 10.0 21.8 46.8 55.1 74.4 105.4
GCNN: We used the original graph convolution module in place of the proposed
module. As shown in the first and third rows of Table 4, when this module is replaced,
prediction performance decreases. This result clearly indicates that there is critical spatially
relevant information hidden in the adjacent pose. When this information is ignored, it is
difficult for the model to capture time evolution trends, resulting in degraded performance.
GRU and transformer layers (Tr): The GRU unit, designed to capture local temporal
dependence, shows a remarkable performance, enhancing the accuracy of short-term pre-
dictions. The transformer layer exhibits exceptional ability in handling global temporal
dependence, which improves the accuracy of long-term predictions. We removed the
transformer layer directly from the proposed method. It can be seen that the long-term pre-
diction performance of the model was significantly reduced, and the short-term prediction
performance was also slightly reduced.
5. Conclusions
In this paper, we have proposed a novel framework for human motion prediction
that leverages the power of position-wise enhanced graph convolutional networks, gated
recurrent unit networks, and transformer layers. By strategically combining these networks,
our model effectively captures spatial information across body joints and temporally aligns
these dependencies, both locally and globally. The proposed framework has shown signifi-
cant improvements in predicting long-term movements, surpassing existing state-of-the-art
methods by a substantial margin. Experiments on the Human3.6M and CMU-MoCap
datasets provide evidence supporting the effectiveness of our proposed model. The efficacy
of our approach accentuates the potential of integrating advanced neural network archi-
tectures for improved understanding and prediction of complex human motion dynamics.
Future work will explore the integration of more sophisticated attention mechanisms and
deep learning architectures to further enhance prediction accuracy and efficiency.
Author Contributions: Conceptualization, B.H.; methodology, B.H.; software, B.H.; validation, B.H.;
formal analysis, B.H.; investigation, B.H.; resources, B.H.; data curation, B.H.; writing—original draft
preparation, B.H.; writing—review and editing, B.H.; visualization, B.H.; supervision, X.L.; project
administration, X.L.; funding acquisition, X.L. All authors have read and agreed to the published
version of the manuscript.
Funding: This work was supported in part by the National Natural Science Foundation of China
under grants 62233003 and 62073072, the Key Projects of the Key R&D Program of Jiangsu Province
under grants BE2020006 and BE2020006-1, and Shenzhen Natural Science Foundation under grants
JCYJ20210324132202005 and JCYJ20220818101206014.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The datasets generated and/or analyzed during the current study are
publicly available. The Human3.6M dataset can be accessed through the reference in [41]. The CMU-
MoCap dataset is publicly available and can be accessed online at [Link] accessed
on 5 July 2023. The use of these datasets is governed by their respective usage policies.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
References
1. Koppula, H.S.; Saxena, A. Anticipating human activities using object affordances for reactive robotic response. IEEE Trans.
Pattern Anal. Mach. Intell. 2015, 38, 14–29.
2. Gui, L.Y.; Zhang, K.; Wang, Y.X.; Liang, X.; Moura, J.M.; Veloso, M. Teaching robots to predict human motion. In Proceedings
of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018;
pp. 562–567.
3. Li, H.; Li, X.; Zhang, Z.; Hu, C.; Dunkin, F.; Ge, S.S. ESUAV-NI: Endogenous Security Framework for UAV Perception System
Based on Neural Immunity. IEEE Trans. Ind. Inform. 2023. [CrossRef]
4. Choi, S.H.; Park, K.B.; Roh, D.H.; Lee, J.Y.; Mohammed, M.; Ghasemi, Y.; Jeong, H. An integrated mixed reality system for
safety-aware human-robot collaboration using deep learning and digital twin generation. Robot. Comput.-Integr. Manuf. 2022,
73, 102258. [CrossRef]
5. Dong, Y.; Li, X.; Dezert, J.; Zhou, R.; Zhu, C.; Wei, L.; Ge, S.S. Evidential reasoning with hesitant fuzzy belief structures for human
activity recognition. IEEE Trans. Fuzzy Syst. 2021, 29, 3607–3619.
6. Sheng, W.; Li, X. Multi-task learning for gait-based identity recognition and emotion recognition using attention enhanced
temporal graph convolutional network. Pattern Recognit. 2021, 114, 107868.
7. Kong, Y.; Wei, Z.; Huang, S. Automatic analysis of complex athlete techniques in broadcast taekwondo video. Multimed. Tools
Appl. 2018, 77, 13643–13660. [CrossRef]
8. Lehrmann, A.M.; Gehler, P.V.; Nowozin, S. Efficient nonlinear Markov models for human motion. In Proceedings of the
Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014.
9. Wang, J.M.; Fleet, D.J.; Hertzmann, A. Gaussian Process Dynamical Models for Human Motion. IEEE Trans. Pattern Anal. Mach.
Intell. 2007, 30, 283–298.
10. Taylor, G.W.; Hinton, G.E.; Roweis, S. Modeling human motion using binary latent variables. In Proceedings of the Advances in
Neural Information Processing Systems 19 (NIPS 2006), Cambridge, MA, USA, 4–7 December 2006; Volume 19.
11. Li, C.; Zhang, Z.; Lee, W.S.; Lee, G.H. Convolutional sequence to sequence model for human dynamics. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2275–2284.
12. Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Symbiotic Graph Neural Networks for 3D Skeleton-Based Human Action
Recognition and Motion Prediction. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3316–3333. [CrossRef]
13. Li, M.; Chen, S.; Zhao, Y.; Zhang, Y.; Wang, Y.; Tian, Q. Dynamic multiscale graph neural networks for 3d skeleton based human
motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA,
13–19 June 2020; pp. 214–223.
14. Zhong, C.; Hu, L.; Zhang, Z.; Ye, Y.; Xia, S. Spatio-Temporal Gating-Adjacency GCN For Human Motion Prediction. In
Proceedings of the Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022.
15. Fu, J.; Yang, F.; Dang, Y.; Liu, X.; Yin, J. Learning Constrained Dynamic Correlations in Spatiotemporal Graphs for Motion
Prediction. IEEE Trans. Neural Netw. Learn. Syst. 2023. [CrossRef]
16. Fragkiadaki, K.; Levine, S.; Felsen, P.; Malik, J. Recurrent network models for human dynamics. In Proceedings of the 2015 IEEE
International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 4346–4354.
17. Jain, A.; Zamir, A.R.; Savarese, S.; Saxena, A. Structural-rnn: Deep learning on spatio-temporal graphs. In Proceedings of the 2016
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 5308–5317.
18. Martinez, J.; Black, M.J.; Romero, J. On human motion prediction using recurrent neural networks. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2891–2900.
19. Liu, Z.; Wu, S.; Jin, S.; Liu, Q.; Lu, S.; Zimmermann, R.; Cheng, L. Towards natural and accurate future motion prediction of
humans and animals. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA,
15–20 June 2019.
20. Shu, X.; Zhang, L.; Qi, G.J.; Liu, W.; Tang, J. Spatiotemporal co-attention recurrent neural networks for human-skeleton motion
prediction. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3300–3315.
21. Liu, Z.; Wu, S.; Jin, S.; Ji, S.; Liu, Q.; Lu, S.; Cheng, L. Investigating pose representations and motion contexts modeling for 3D
motion prediction. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 681–697. [CrossRef]
22. Cai, Y.; Huang, L.; Wang, Y.; Cham, T.J.; Cai, J.; Yuan, J.; Liu, J.; Yang, X.; Zhu, Y.; Shen, X.; et al. Learning progressive joint
propagation for human motion prediction. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28
August 2020; pp. 226–242.
23. Aksan, E.; Kaufmann, M.; Cao, P.; Hilliges, O. A spatio-temporal transformer for 3d human motion prediction. In Proceedings of
the 2021 International Conference on 3D Vision (3DV), London, UK, 1–3 December 2021; pp. 565–574.
24. Yu, H.; Fan, X.; Hou, Y.; Pei, W.; Ge, H.; Yang, X.; Zhou, D.; Zhang, Q.; Zhang, M. Towards Realistic 3D Human Motion Prediction
with A Spatio-temporal Cross-transformer Approach. IEEE Trans. Circuits Syst. Video Technol. 2023. [CrossRef]
25. Chiu, H.K.; Adeli, E.; Wang, B.; Huang, D.A.; Niebles, J.C. Action-agnostic human pose forecasting. In Proceedings of the
2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019;
pp. 1423–1432.
26. Guo, X.; Choi, J. Human motion prediction via learning local structure representations and temporal dependencies. In Proceedings
of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 2580–2587.
27. Zhou, L.; Zhou, Y.; Corso, J.J.; Socher, R.; Xiong, C. End-to-end dense video captioning with masked transformer. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8739–8748.
28. Liu, P.J.; Saleh, M.; Pot, E.; Goodrich, B.; Sepassi, R.; Kaiser, L.; Shazeer, N. Generating wikipedia by summarizing long sequences.
arXiv 2018, arXiv:1801.10198.
29. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907.
30. Yuan, J.; Cao, M.; Cheng, H.; Yu, H.; Xie, J.; Wang, C. A unified structure learning framework for graph attention networks.
Neurocomputing 2022, 495, 194–204. [CrossRef]
31. Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903.
32. Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of
the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
33. Ma, T.; Nie, Y.; Long, C.; Zhang, Q.; Li, G. Progressively Generating Better Initial Guesses Towards Next Stages for High-Quality
Human Motion Prediction. In Proceedings of the Conference on Computer Vision and Pattern Recognition, New Orleans, LA,
USA, 18–24 June 2022.
34. Dang, L.; Nie, Y.; Long, C.; Zhang, Q.; Li, G. MSR-GCN: Multi-Scale Residual Graph Convolution Networks for Human Motion
Prediction. In Proceedings of the International Conference on Computer Vision, New Orleans, LA, USA, 18–24 June 2021.
35. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In
Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December
2017; Volume 30.
36. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In
Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 213–229.
37. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.;
Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
38. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv
2020, arXiv:2010.04159.
39. Zhang, J.; Shi, X.; Xie, J.; Ma, H.; King, I.; Yeung, D.Y. GaAN: Gated Attention Networks for Learning on Large and Spatiotemporal
Graphs. In Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018, Monterey, CA, USA, 6–10
August 2018.
40. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations
using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078.
41. Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large scale datasets and predictive methods for 3d human
sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 1325–1339. [CrossRef]
42. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101.
43. Mao, W.; Liu, M.; Salzmann, M.; Li, H. Learning trajectory dependencies for human motion prediction. In Proceedings of the
IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4317–4326.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
Article
ArRASA: Channel Optimization for Deep Learning-Based
Arabic NLU Chatbot Framework
Meshrif Alruily
Faculty of Computer and Information Sciences, Jouf University, Sakaka 72388, Saudi Arabia; mfalruily@[Link]
Abstract: Since the introduction of deep learning-based chatbots for knowledge services, many
research and development efforts have been undertaken in a variety of fields. The global market for
chatbots has grown dramatically as a result of strong demand. Nevertheless, open-domain chatbots’
limited functional scalability poses a challenge to their implementation in industries. Much work
has been performed on creating chatbots for languages such as English and Chinese. Still, there is a
need to develop chatbots for other languages such as Arabic and Persian, as they are widely used on
the Internet today. In this paper, we introduce ArRASA, a channel optimization strategy based on
a deep-learning platform to create a chatbot that understands Arabic. ArRASA is a closed-domain
chatbot that can be used in any Arabic industry. The proposed system consists of four major parts:
tokenization of text, featurization, intent categorization and entity extraction.
The performance of ArRASA is evaluated using traditional assessment metrics, i.e., accuracy and F1
score, for the intent classification and entity extraction tasks in the Arabic language. The proposed
framework achieves promising results, securing 96%, 94% and 94%, 95% accuracy and F1 score
for intent classification and entity extraction, respectively.
Unsubscribe, Wrong Person, Email Bounce, Autoreply and others may be useful when
going through customer emails.
NLP is a subfield of AI that uses natural language to allow human–computer interac-
tion and communication. NLP has spawned a slew of new applications. A chatbot is one
of the most intriguing natural language artificial intelligence applications [4]. A chatbot
is software that uses natural language to conduct a human–computer interaction through
auditory or textual means. As a result, it functions as a virtual assistant that uses artificial
intelligence to mimic conversational abilities and human-like behavior [5]. It also contains
embedded information that aids in identifying and comprehending the phrase and the
generation of the right answer. Many research articles have been published in this area
due to the importance of sentiment analysis. However, this research has concentrated on
English and other Indo-European languages. In morphologically rich languages such as
Arabic, there has been very little research on sentiment analysis [6]. Despite this, many
academics have focused on sentiment analysis in Arabic due to the growing number of
Arabic internet users and the exponential development of Arabic online content in the
past decade. Pipelines can be used to streamline a machine learning workflow. Pre-processing,
feature extraction, categorization and post-processing are all possible steps
in the pipeline. Other necessary phases may be added to this pipeline according to the
complexity of the application. By optimization, we mean tuning the model for optimum
performance. Any learning model’s effectiveness depends on choosing the parameters that
produce the best outcomes. Optimization can be compared to a search
algorithm that explores a range of parameters and picks out the best among them.
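The idea of optimization as a search over pipeline parameters can be sketched as follows. This is a minimal, self-contained illustration rather than the ArRASA implementation: the character n-gram features, the overlap-based classifier, the parameter grid and the tiny English data sets are all assumptions chosen to keep the example short.

```python
from collections import Counter
from itertools import product

def preprocess(text, lowercase):
    """Pre-processing step: optional lowercasing, punctuation removal."""
    text = text.lower() if lowercase else text
    return "".join(ch for ch in text if ch.isalnum() or ch.isspace())

def extract_features(text, n):
    """Feature extraction step: character n-gram counts."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def train(samples, lowercase, n):
    """Sum the n-gram counts of each class into one profile per label."""
    profiles = {}
    for text, label in samples:
        feats = extract_features(preprocess(text, lowercase), n)
        profiles.setdefault(label, Counter()).update(feats)
    return profiles

def predict(profiles, text, lowercase, n):
    """Categorization step: pick the label whose profile shares most n-grams."""
    feats = extract_features(preprocess(text, lowercase), n)

    def overlap(profile):
        return sum(min(count, profile[gram]) for gram, count in feats.items())

    return max(sorted(profiles), key=lambda label: overlap(profiles[label]))

def accuracy(profiles, samples, lowercase, n):
    hits = sum(predict(profiles, text, lowercase, n) == label
               for text, label in samples)
    return hits / len(samples)

# Hypothetical toy data.
train_set = [("Place an order", "order"), ("I want to order food", "order"),
             ("Cancel my order", "cancel"), ("Please cancel it", "cancel")]
valid_set = [("ORDER a pizza", "order"), ("cancel please", "cancel")]

# Exhaustive search over a small, assumed parameter grid.
grid = {"lowercase": [False, True], "n": [1, 2, 3]}
results = []
for lowercase, n in product(grid["lowercase"], grid["n"]):
    model = train(train_set, lowercase, n)
    score = accuracy(model, valid_set, lowercase, n)
    results.append((score, {"lowercase": lowercase, "n": n}))

best_score, best_params = max(results, key=lambda item: item[0])
print(best_score, best_params)
```

An exhaustive grid is used here only because the search space is tiny; larger spaces are usually explored with random or guided search.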
Because Arabic is such a complicated language, developing Arabic chatbots has
posed a significant challenge to the academic community. Only a few works have tried
to create Arabic chatbots so far. ArabChat [7] is one such project: a rule-based
chatbot capable of pattern matching and delivering appropriate responses to user inquiries.
BOTTA [8], another project, is a retrieval-based model that supports the Egyptian dialect.
Ollobot is a rule-based chatbot that provides health monitoring and assistance in the
medical sector [9]. However, because of its restricted functional scalability, an open-domain
chatbot cannot be successfully used in every business. Furthermore, since most chatbot
frameworks are designed for English, building an efficient and multi-objective chatbot for
Arabic is necessary. To solve this issue, this research proposes ArRASA, a pipeline
optimization approach based on a deep learning-based open-source chatbot system that
understands Arabic.
The proposed approach is a closed-domain chatbot with four phases: tokenization, feature
extraction, intent classification and entity extraction. A closed-domain chatbot, also
known as a domain-specific chatbot,
focuses on a certain range of issues and provides limited replies based on the business
issue. For instance, a food delivery chatbot can only let users place, monitor, or cancel
an order. Such a focused conversation is like bumping into an acquaintance:
you expect them to inquire about your work and maybe comment
on the environment. You have prepared answers to every topic, and the idea is just to
satisfy the enquiries. An open-domain chatbot, by contrast, is required to grasp any matter
and provide appropriate answers. The proposed model can be scaled by adding more intents
and entities. Open-domain chatbots are less effective in the industry, so the proposed study
focuses on developing closed-domain chatbots. Moreover, a considerable amount of work has
been performed using traditional machine and deep learning approaches, while the proposed
study uses transformer-based techniques for the development of a more effective and
reliable chatbot.
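The four phases of a closed-domain chatbot can be illustrated with a deliberately simple sketch for the food delivery example. The intent names, keyword sets and the `#1234` order-number pattern are hypothetical, and the keyword scoring stands in for the transformer-based classifier used in the proposed study; none of this reflects the RASA API.

```python
import re
from collections import Counter

# Hypothetical closed-domain intents and their trigger words.
INTENT_KEYWORDS = {
    "place_order": {"place", "buy", "want"},
    "track_order": {"where", "track", "status"},
    "cancel_order": {"cancel", "stop"},
}
ORDER_ID = re.compile(r"#(\d+)")  # hypothetical entity pattern, e.g. "#1234"

def tokenize(text):
    """Phase 1: split the message into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())

def featurize(tokens):
    """Phase 2: bag-of-words counts."""
    return Counter(tokens)

def classify_intent(features):
    """Phase 3: score each intent by keyword hits; reject if nothing matches."""
    scores = {intent: sum(features[word] for word in words)
              for intent, words in INTENT_KEYWORDS.items()}
    best = max(sorted(scores), key=scores.get)
    return best if scores[best] > 0 else "out_of_scope"

def extract_entities(text):
    """Phase 4: pull order numbers out of the raw text."""
    return [{"entity": "order_id", "value": value}
            for value in ORDER_ID.findall(text)]

def parse(text):
    tokens = tokenize(text)
    return {"intent": classify_intent(featurize(tokens)),
            "entities": extract_entities(text)}

result = parse("Please cancel order #1234")
print(result)
print(parse("What is the weather?"))
```

Scaling such a system by adding intents and entities amounts to extending the intent table and the entity patterns, which is what keeps a closed-domain design manageable.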
This paper discusses ArRASA, a pipeline optimization approach based on a deep
learning-based open-source chatbot system that understands Arabic. To address this
topic, we discuss the Arabic language and its different perspectives and challenges,
as it has special characteristics and rules that must be handled. We also discuss the
related approaches in the literature review section. The proposed methodology discusses
Electronics 2022, 11, 3745
the complete operations of the ArRASA. The proposed model can be scaled by adding
more intents and entities. In terms of Arabic language understanding, an optimization
experiment is carried out at each step. The prime contributions of the proposed solution
can be summarized as follows:
• ArRASA is a channel optimization strategy proposed based on a deep-learning plat-
form to create a chatbot that understands Arabic;
• ArRASA is a novel approach for a closed-domain chatbot using RASA (an open-source
conversational AI platform) that can be used in any Arabic industry;
• Tokenization, feature extraction, specific intent classification and suitable entity extrac-
tion are the four phases of the proposed approach;
• The performance of ArRASA is evaluated using traditional assessment metrics, i.e.,
accuracy and F1 score for the intent classification and entity extraction tasks in the
Arabic language;
• The performance is also compared with the existing approaches regarding accuracy
and F1 score.
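The two assessment metrics named above can be computed as in the following sketch. The labels are invented for illustration, and macro-averaging of the per-class F1 scores is an assumption, since the averaging scheme is not stated in this section.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_label(y_true, y_pred, label):
    """F1 score of one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)

# Hypothetical intent labels for five test utterances.
y_true = ["greet", "greet", "order", "order", "cancel"]
y_pred = ["greet", "order", "order", "order", "cancel"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f}, macro F1={f1:.3f}")
```

Accuracy alone can look favourable when one intent dominates the test set, which is one reason for reporting F1 alongside it.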
The remainder of the paper is organized as follows: Section 2 discusses the related
work, and the Arabic language and its challenges are elaborated in Section 3. The proposed
solution is developed in Section 4, while the system structure is discussed in Section 5. The
performance evaluation is presented in Section 6, and Section 7 concludes the paper.
2. Related Work
BOTTA [8] is the first Arabic dialect chatbot; it was developed for Egyptian Arabic to work as
a conversational agent that simulates friendly conversations with users. The various components
of the BOTTA chatbot are defined, and various solutions are presented. Researchers working on
Arabic chatbot technology can freely and publicly access the BOTTA database files.
Shawar et al. [10] show how machine-learning techniques were used to create an
Arabic chatbot that accepts user feedback in Arabic and responds with Qur’anic quotes.
A device that learned conversational patterns from a corpus of transcribed conversations
was used to create various chatbots that spoke English, French and Afrikaans. Because the
Qur’an is not a copy of a dialogue, the learning method has been altered to accommodate
the Qur’an’s format in terms of sooras and ayyas.
Bashir et al. [11] propose a method that uses deep learning for named entity recognition
and text categorization in the Arabic home automation domain. To do
this, they provide an NLU module that can be further combined with an ASR, a conversation
manager and a natural language generator module to create a fully functional dialogue
system. The study covers the process of gathering and annotating the data, constructing
the intent classifier and entity extractor models, and ultimately assessing these techniques
against various benchmarks.
AlHumoud et al. [12] summarize published Arabic chatbot studies to recognize infor-
mation gaps and illustrate areas that need more investigation and research. This research
found a scarcity of Arabic chatbots and that all available works are retrieval-based. The
experiments are divided into the following two classes, depending on the method of chat-
bot communication interaction: text and voice conversational chatbots. The study was
presented and assessed according to the deployment method, the duration and breadth of
the presentations, and the model used for the chatbot dataset. According to the study, all
the assessed chatbots used a retrieval-based dataset model.
Nabiha [13] is a chatbot that uses the Saudi Arabic dialect to converse with King Saud
University Information Technology (IT) students. Nabiha is thus the
first Saudi chatbot to communicate in the Saudi dialect. Nabiha is accessible on several
platforms, including Android, Twitter and the Web, to make it simpler to use. Students may
contact Nabiha by downloading an app, tweeting her, or visiting her website. According to
the students in the IT department who tested Nabiha, the results were acceptable, given
the general challenges of the Arabic language and the Saudi dialect.
Electronics 2022, 11, 3745
The study in [14] presents AraConv, the first Arabic end-to-end generative model for
task-oriented dialogue systems (DS). It uses the multilingual transformer model mT5 with
various parameter settings. The authors also provide the Arabic-TOD dialogue dataset, which
was used to train and evaluate the AraConv model. Compared with research conducted under
identical monolingual conditions, the reported findings are fair. To mitigate the small
training dataset and improve the AraConv results, the authors propose joint training, in
which the model is trained on Arabic conversation data together with data from one or two
high-resource languages such as English and Chinese.
Many authors have worked on Arabic chatbots, but most of them developed open-domain
chatbots. Open-domain chatbots are less effective in industry, so the proposed study focuses
on closed-domain chatbots. Moreover, a considerable amount of existing work relies on
traditional machine learning and deep learning approaches, whereas the proposed study uses
transformer-based techniques to build a more effective and reliable chatbot. Previous
studies focused on developing chatbots, and few of them optimized their techniques. The
proposed study also optimizes its architecture to enhance the accuracy and efficiency of
the Arabic chatbot. Table 1 presents a comparison of previous techniques.
Rasa is a platform for building AI-powered, industrial-quality chatbots. It is used by
developers all around the world to build chatbots and contextual assistants. ArRASA is a
channel optimization strategy based on a deep-learning platform to create a chatbot that
understands Arabic. ArRASA is a closed-domain chatbot that can be used in any Arabic
industry. Because the proposed chatbot is an optimized Arabic-language chatbot built with
RASA, we named it ArRASA.
3. Arabic Language
With about 300 million people speaking Arabic, it is one of the world’s most commonly
spoken languages. It is also widely spoken as a primary language in Chad, a non-Arab nation
in Central Africa, and as a minority language in countries such as Afghanistan, Israel
(where Arabic and Hebrew are both official languages), Iran and Nigeria [15]. Arabic,
along with Chinese, English, French, Russian and Spanish, became
one of the six official languages of the United Nations in 1974. In India, Indonesia, Pakistan
and Tanzania, about one billion Muslims study Arabic as a foreign or second language
for liturgical and scholastic reasons. Several Muslim and Arab populations in the United
States use Arabic in their everyday contacts and for religious reasons.
differ from MSA and each other. Dialects are not recognized as languages in the Arab
world and are not taught in schools. Dialectal Arabic, on the other hand, is widely used in
online chats. This is why we believe it is more fitting to concentrate on dialectal Arabic in
the sense of a chatbot.
Orthographic Ambiguity and Inconsistency: Arabic orthography uses optional diacritical
marks to represent short vowels and consonantal doubling, and these marks are seldom written
in practice. As a consequence, the same written form can often be read in several ways.
Furthermore, Arabic writers often misspell various difficult letters, including the
Alif-Hamza forms and Ta-Marbuta [16]. For Arabic dialects, the problem is compounded by
the lack of standard orthographies [17].
Morphological Richness of Arabic: Arabic words inflect for gender, number, person, voice,
aspect and other features, and they can also take a variety of attached clitics [8]. This is
especially challenging in the context of a chatbot system. Because verbs, adjectives and
pronouns are gender-specific, the chatbot must be able to answer in two ways: one for male
users and one for female users. ArRASA addresses the morphological richness of Arabic by
including data about gender, number, person, etc., in the dataset. Moreover, the proposed
model is easy to integrate because it is developed using the open-source RASA platform, and
it supports single and multiple intents.
4. Proposed Solution
As we know, the number of services and products is growing rapidly around the
globe. Due to this, the number of queries to the producers is also increasing. To solve this
problem, companies hire individuals to serve as customer support for their products and
services. However, this procedure of responding to consumers’ questions is expensive
for the company and quite slow for the users. There must be an effective and accessible
approach to addressing this issue. Various researchers presented automated chatbots that
give responses to customer queries instead of humans in various languages. Researchers
presented different Arabic chatbots to work on the Arabic language. These chatbots have
their own limitations. We propose a framework for optimizing Arabic chatbots by using the
RASA framework, which is one of the current leading open-source platforms for chatbot
development. The reason behind using the RASA is that it has not been used for Arabic
chatbots in the past. ArRASA is a channel optimization method that uses a deep-learning
model to develop an Arabic chatbot. ArRASA is a closed-domain chatbot that may be
utilized in any Arabic industry.
[Figure: block labels Tokenization, Arabic Intent Classification, Feature Extraction,
Entity Extraction and Pre-processing.]
are well suited to generalize across tasks since they were trained on large-scale natural
language text corpora.
Input phrases are interpreted as a series of tokens, either words or sub-words, depending
on the featurization pipeline. We add a special classification token for the Arabic language
at the end of each phrase. Each input token is featurized with sparse and/or dense
features. At the token level, the sparse features are one-hot encodings of the tokens and
multi-hot encodings of character n-grams (n ≤ 5). Because character n-grams contain much
redundant information, we apply dropout to these sparse features to prevent overfitting.
Figure 2 presents the proposed optimized pipeline architecture for the Arabic chatbot.
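The sparse character n-gram features described above can be sketched as follows. This is a
minimal illustration and not the authors' implementation: the function names
(`char_ngrams`, `build_vocab`, `multi_hot`, `dropout`) and the example tokens are
assumptions, and the dropout shown here only zeroes entries without the rescaling that deep
learning libraries usually apply.

```python
import random


def char_ngrams(token, max_n=5):
    """Return the set of character n-grams of a token for n = 1..max_n."""
    grams = set()
    for n in range(1, max_n + 1):
        for i in range(len(token) - n + 1):
            grams.add(token[i:i + n])
    return grams


def build_vocab(tokens, max_n=5):
    """Map every n-gram seen in the given tokens to a column index."""
    all_grams = set()
    for token in tokens:
        all_grams |= char_ngrams(token, max_n)
    return {gram: index for index, gram in enumerate(sorted(all_grams))}


def multi_hot(token, vocab, max_n=5):
    """Multi-hot vector: 1 for every known n-gram that occurs in the token."""
    vector = [0] * len(vocab)
    for gram in char_ngrams(token, max_n):
        if gram in vocab:
            vector[vocab[gram]] = 1
    return vector


def dropout(vector, rate, rng):
    """Zero each entry with probability `rate` (simplified, training-time only)."""
    return [0 if rng.random() < rate else value for value in vector]


# Hypothetical example tokens; any Unicode strings work the same way.
tokens = ["مرحبا", "مرحبتين"]
vocab = build_vocab(tokens)
features = multi_hot("مرحبا", vocab)
noisy_features = dropout(features, rate=0.2, rng=random.Random(0))
```

Because the two example tokens share a prefix, many of their n-grams coincide, which is the
redundancy that motivates applying dropout to these features.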
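[Figure 2. Proposed optimized pipeline architecture for the Arabic chatbot. Block labels:
Arabic Language Tokenizer; Count Vectorization; Transformer Applied (Dual Intent Entity
Transformer); Optimization Procedure; Token 1, Token 2, ..., Token n; Arabic Entity Synonym
Mapper; Transformer Embedding Dialogue Policy.]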
A two-layer transformer with relative position attention is used to encode context across
the whole phrase. The input dimension of the transformer must match the dimension of the
transformer layers. The concatenated features are therefore passed through another fully
connected layer, with weights shared across all sequence steps, to match the dimension of
the transformer layers, which is set to 256 for the proposed model.
5. System Architecture
Although the overall architecture of the proposed framework is identical to that of the
DIET baseline model (DIET-Base), numerous tests were conducted to enhance intent
classification and entity extraction performance. The ideal number of epochs, determined
through a performance experiment, was 100. The structure of the proposed framework is
illustrated in Figure 3.
The tokenizer extracts tokens from the data, and these tokens are sent to the proposed
transformer layer through feed-forward layers of the LSTM model. The transformer layer
outputs a sequence $a$, which is the input to the conditional random field (CRF) layer that
produces the sequence $X_{\mathrm{entity}}$ of entity labels. The entity loss is the
negative log-likelihood of the CRF, as shown in Equation (1), where $L_{\mathrm{CRF}}$
denotes that negative log-likelihood.
$L_{\mathrm{entity}} = L_{\mathrm{CRF}}(a, X_{\mathrm{entity}})$ (1)
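To make the loss in Equation (1) concrete, the following sketch computes the negative
log-likelihood of a linear-chain CRF with the forward algorithm. It is a generic textbook
formulation under stated assumptions, not the Rasa or ArRASA implementation: `emissions`
plays the role of the sequence a (one score per position and tag), `transitions` is a
hypothetical tag-to-tag score matrix, and `tags` plays the role of the entity label
sequence.

```python
import math
from itertools import product


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    highest = max(values)
    return highest + math.log(sum(math.exp(v - highest) for v in values))


def sequence_score(emissions, transitions, tags):
    """Unnormalised score of one tag sequence."""
    score = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score


def log_partition(emissions, transitions):
    """Forward algorithm: log of the summed exp-scores over all tag sequences."""
    num_tags = len(emissions[0])
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            log_sum_exp([alpha[i] + transitions[i][j] for i in range(num_tags)])
            + emissions[t][j]
            for j in range(num_tags)
        ]
    return log_sum_exp(alpha)


def crf_nll(emissions, transitions, tags):
    """Negative log-likelihood of `tags` under the linear-chain CRF."""
    return log_partition(emissions, transitions) - sequence_score(
        emissions, transitions, tags
    )


def brute_force_log_partition(emissions, transitions):
    """Reference value obtained by enumerating every tag sequence (tiny inputs only)."""
    num_tags = len(emissions[0])
    scores = [
        sequence_score(emissions, transitions, list(path))
        for path in product(range(num_tags), repeat=len(emissions))
    ]
    return log_sum_exp(scores)


# Hypothetical scores for a 3-token phrase and 2 tags (0 = no entity, 1 = entity).
emissions = [[0.5, 1.0], [0.2, -0.3], [1.5, 0.1]]
transitions = [[0.1, -0.2], [0.4, 0.3]]
loss = crf_nll(emissions, transitions, [1, 0, 0])
```

The brute-force helper exists only to check the forward recursion on small inputs; its cost
grows exponentially with the sequence length.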
The optimization reduces the weight sparsity from 0.85 to 0.75 and raises the number of
transformer layers from two to four. For parameter optimization, the embedding dimension of
the model is increased from 25 to 35, and the hidden layer size is increased from 256 to
512. The characteristics of the proposed architecture are listed in Table 2.
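The parameter changes described above can be summarized side by side. The sketch below only
restates the values given in the text; the dictionary key names are illustrative
assumptions and are not claimed to match the option names of any particular Rasa release.

```python
# Baseline values as stated in the text (key names are hypothetical).
DIET_BASE_SETTINGS = {
    "weight_sparsity": 0.85,
    "transformer_layers": 2,
    "embedding_dimension": 25,
    "hidden_layer_size": 256,
}

# Values reported for the proposed model, including the chosen number of epochs.
ARRASA_SETTINGS = {
    "weight_sparsity": 0.75,
    "transformer_layers": 4,
    "embedding_dimension": 35,
    "hidden_layer_size": 512,
    "epochs": 100,
}


def changed_settings(old, new):
    """Return {name: (old_value, new_value)} for every setting that differs."""
    return {
        name: (old.get(name), value)
        for name, value in new.items()
        if old.get(name) != value
    }


for name, (before, after) in changed_settings(DIET_BASE_SETTINGS, ARRASA_SETTINGS).items():
    print(f"{name}: {before} -> {after}")
```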
a broad range of industrial sectors. The data collection includes seven categories for
intent classification: greeting (greet), closure (goodbye), emotions (happy, sad), food
(food-related menu), departmental contact details (dept. contacts), division of labor works
(Pers work) and calculator (Calc). The dataset’s distribution is shown in Figure 4.
[Figure 4. Intent data set: data size (axis from 0 to 300) per intent category: Food,
Department contact, Sentiment, Division of works, Calculator, Closing, Greetings.]
A total of 2540 samples were created for the entity extraction experiment. They are
annotated with the following entities: department (dept), date (date), work (work),
place (place), company (company), name (name), number (num), time (time) and no entity.
Their distribution is given in Table 3.
Entity Count
Date 120
Time 32
Name 73
Company 12
Work 720
Department 223
Number 37
No_Entity 1824
from the gathered data [23]. Preprocessing refines raw data and turns it into the form
used by ML/DL models: the data are cleaned, the variables to use are selected, and the data
are converted into a format suitable for analysis. It is among the most crucial steps in the
entire procedure. We preprocess the data so that it can be used by the deep learning models.
6.2.1. Tokenization
Tokenization is the most critical phase in data preprocessing: the text is split into
words, and the number of times each word appears is counted. Five tokenizers were used for
this purpose: tkseem, Tf-idf, WhiteSpace, Arcab and ConveRT. With the help of these
tokenizers, we determine how many times each word appears in the text. For every distinct
word that occurs in the dataset, a token is created and assigned a unique number. Each
token carries distinct feature values that are used to create feature vectors. The tkseem
tokenization library offers many methods for tokenizing and preprocessing Arabic text, and
it is widely used for this purpose. Arcab is a data-driven, unsupervised method for
tokenizing subwords in Arabic phrases. It has no pre-tokenization phase in which terms are
extracted based on whitespace; instead, it treats the training corpus as a sequence of raw
Unicode characters. This allows the technique to be applied to any string of characters,
making it language-neutral. ConveRT (conversational representations from transformers) is a
compact neural response selection model for dialogue built on transformer-backed
dual-encoder networks. It has been shown to perform at the state of the art on various
response selection tasks as well as in transfer learning for intent classification. The
performance of the various tokenizers is listed in Table 4.
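The counting and numbering steps described above can be illustrated with a plain whitespace
tokenizer. This is a minimal sketch using only the Python standard library; it does not use
tkseem, Arcab or ConveRT, and the function names and example sentences are assumptions made
for illustration.

```python
from collections import Counter


def whitespace_tokenize(text):
    """Split a sentence on runs of whitespace."""
    return text.split()


def build_token_ids(sentences):
    """Assign a unique integer to each distinct token and count occurrences."""
    vocab = {}
    counts = Counter()
    for sentence in sentences:
        for token in whitespace_tokenize(sentence):
            counts[token] += 1
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab, counts


def encode(sentence, vocab, unknown_id=-1):
    """Replace every token by its id; unseen tokens map to `unknown_id`."""
    return [vocab.get(token, unknown_id) for token in whitespace_tokenize(sentence)]


# Hypothetical example sentences, built from token lists to keep the order explicit.
hello, how, you, today, thanks = "مرحبا", "كيف", "حالك", "اليوم", "شكرا"
sentences = [" ".join([hello, how, you]), " ".join([how, you, today])]
vocab, counts = build_token_ids(sentences)
encoded = encode(" ".join([how, today, thanks]), vocab)
```

Whitespace splitting ignores Arabic clitics and affixes, which is one reason the text also
evaluates subword tokenizers.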
Category       Arabic Word    English Meaning
Prepositions   في             in
               على            on
               [illegible]    to
Pronouns       أنا            I
               نحن            we
Adverbs        تحت            below
               [illegible]    above
               [illegible]    now
               منذ            since
Question       [illegible]    what
               متى            when
Articles       [illegible]    if
               ثم             then
               [illegible]    except
6.2.3. Featurization
Feature selection methods are used in machine learning techniques. A feature represents
an attribute of a system or process that has been constructed from the initial input
variables. Because of the enormous volume of data, it is challenging to train effective
classifiers without first removing undesirable features. A real-world classification problem
may be better understood by reducing the number of redundant or unnecessary features.
Feature selection aids data comprehension, lessens the impact of dimensionality, reduces
processing needs and enhances prediction performance by selecting a subset of the features.
Featurization is the conversion of words into meaningful numbers (or vectors) that the deep
learning algorithm can use for training. For this purpose, we use the count vector
featurizer, the lexical syntactic featurizer and the Tf-idf featurizer. For feature
extraction, the TF-IDF technique relies on word statistics. It only matches terms that are
written identically across the texts (for example, as the same ASCII string) and does not
account for synonyms that may replace them.
The performance of the featurizers is compared for analysis purposes. The count
vectorizer outperforms the other two featurizers on both tasks, i.e., intent classification
and entity extraction. The performance of these featurizers is listed in Table 6.
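The difference between count features and TF-IDF features can be seen in a small worked
example. The sketch below is a standard-library illustration, not the featurizers used in
the study; it assumes the plain weighting tf × log(N / df), whereas library implementations
often add smoothing, and the toy documents are hypothetical.

```python
import math
from collections import Counter


def count_vectors(docs):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in docs[i]."""
    vocabulary = sorted({token for doc in docs for token in doc.split()})
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts[token] for token in vocabulary])
    return vocabulary, rows


def tfidf_vectors(docs):
    """Weight each count by log(N / df); terms present in every document get 0."""
    vocabulary, counts = count_vectors(docs)
    num_docs = len(docs)
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocabulary))]
    idf = [math.log(num_docs / df) for df in doc_freq]
    rows = [[row[j] * idf[j] for j in range(len(vocabulary))] for row in counts]
    return vocabulary, rows


docs = ["open the door", "close the door", "open the window"]
vocabulary, counts = count_vectors(docs)
_, tfidf = tfidf_vectors(docs)
```

In this toy corpus the word "the" occurs in every document, so its TF-IDF weight is zero
while its count feature stays at one. Neither representation links a word to its synonyms.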
In the first phase, the proposed model was trained for 100 epochs to evaluate intent
classification and entity extraction performance. The accuracy of the trained
transformer-based model over 100 epochs is illustrated in Figure 5.
As shown in Figure 5, the proposed model achieves 97% accuracy for intent classification
and 95% for entity extraction. The performance of the model was also analyzed with the F1
score metric; the F1 scores of the proposed model on both tasks are shown in Figure 6. The
proposed model attains an F1 score of 95% for entity extraction and 96% for intent
classification. This research also compared the proposed classifier with existing
classifiers in terms of intent classification and entity extraction. A performance
assessment experiment was conducted against the DIET-Base, keyword and fallback classifiers.
According to the performance evaluation results, the proposed model’s F1 score was 17.8%,
0.2% and 0.3% higher in intent classification for the given sentences. For entity
extraction, the F1 score was 3.1%, 2.3% and 2.9% higher than those of the traditional
DIET-Base, keyword and fallback classifiers. The comparison of the proposed model with
other models for intent classification is illustrated in Figure 7.
[Figure 7. Performance comparison for intent classification (axis from 84 to 102): Keyword,
Fallback, DIET-Base and Proposed Model.]
Confusion matrices are widely used when evaluating classification models, and they
apply to both binary and multiclass classification tasks. The confusion matrix for the
entity extraction task is given in Table 7.
The confusion matrix for the intent classification task was also calculated in the
evaluation phase of the study. Table 8 presents the intent classification confusion matrix.
Food 18 1 0 0 0 0 3
Department 0 34 0 3 0 0 0
Sentiment 0 0 81 5 0 0 3
Div. of Works 1 0 0 127 3 0 1
Calculator 0 0 0 0 7 0
Closing 0 0 1 0 0 9 0
Greetings 0 0 0 0 1 0 12
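Accuracy, precision, recall and F1 score can all be derived from a confusion matrix such as
the one above. The sketch below shows the arithmetic on a small hypothetical three-class
matrix rather than on Table 8, because one cell of Table 8 is missing in this copy. It
assumes that rows are true classes and columns are predicted classes, which the source does
not state.

```python
def per_class_metrics(matrix):
    """Return a (precision, recall, f1) tuple for each class.

    Assumes matrix[i][j] is the number of samples of true class i
    that were predicted as class j.
    """
    size = len(matrix)
    results = []
    for k in range(size):
        true_pos = matrix[k][k]
        false_neg = sum(matrix[k]) - true_pos
        false_pos = sum(matrix[i][k] for i in range(size)) - true_pos
        precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
        recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        results.append((precision, recall, f1))
    return results


def accuracy(matrix):
    """Fraction of samples on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


# Hypothetical 3-class confusion matrix (not taken from the paper).
toy_matrix = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 1, 9],
]
metrics = per_class_metrics(toy_matrix)
overall_accuracy = accuracy(toy_matrix)
```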
Furthermore, the proposed model’s entity extractor was compared with existing entity
extraction models, namely a conditional random field (CRF) and DIET-Base. According to the
performance evaluation findings, the F1 scores of the proposed model were 1.4, 0.9 and 0.3%
higher in intent classification and 4.2, 3.1 and 2.6% higher in entity extraction. The
performance of the entity extractor of the proposed model is illustrated in Figure 8. The
main reason for the high accuracy and performance of the proposed model is the use of the
Dual Intent and Entity Transformer (DIET) for the Arabic language. Previous studies used
different techniques for developing Arabic NLUs, whereas DIET, provided by Rasa, is a
multi-task transformer framework that performs intent classification and entity recognition
simultaneously. It allows a variety of pre-trained embeddings to be used in a plug-and-play
fashion. Moreover, in terms of accuracy and stability, it is comparable to large-scale,
pre-trained language models.
[Figure 8. Performance comparison for entity extraction (axis from 91 to 101): CRF,
DIET-Base and Proposed Model.]
We also compared results on the proposed dataset with results on the dataset proposed by
Fuad [17]. The proposed model achieved a good level of accuracy on the dataset provided by
Fuad [17]. Figure 9 presents the comparison of accuracy and F1 score on both datasets.
[Figure 9. Accuracy and F1 score (axis from 84 to 98) for intent classification and entity
extraction: Fuad dataset versus proposed dataset.]
7. Conclusions
This article presents ArRASA, a deep learning-based channel optimization framework for an
open-source Arabic chatbot platform that interacts with users remotely. ArRASA is a
closed-domain chatbot that can be used in nearly any Arabic industry. The proposed model has
four phases, i.e., tokenization, feature extraction, intent classification and entity
extraction, and each is tuned to interpret the Arabic language. The tkseem tokenizer, which
is specific to the Arabic language, and a few other tokenizers were used for tokenization,
while the count vector featurizer was used for featurization. This study created the
DIET-based Arabic model (ArRASA) for the intent classification and entity extraction phases
by tweaking and optimizing the DIET-Base model’s parameters. The accuracy of the proposed
system is 96% for intent classification and 94% for entity extraction. The proposed system
attains an F1 score of 95% for intent classification and 94% for entity extraction.
References
1. Rickli, J.M. The Economic, Security and Military Implications of Artificial Intelligence for the Arab Gulf Countries; Emirates Diplomatic
Academy: Abu Dhabi, United Arab Emirates, 2018; pp. 1–13.
2. Hahm, Y.; Kim, J.; An, S.; Lee, M.; Choi, K.S. Chatbot Who Wants to Learn the Knowledge: KB-Agent. In Proceedings of the 17th
International Semantic Web Conference (ISWC 2018), NLIWod4, Monterey, CA, USA, 8–12 October 2018. 4p.
3. Aleem, S.; Huda, N.u.; Amin, R.; Khalid, S.; Alshamrani, S.S.; Alshehri, A. Machine Learning Algorithms for Depression:
Diagnosis, Insights, and Research Directions. Electronics 2022, 11, 1111. [CrossRef]
4. Sarddar, D.; Dey, R.K.; Bose, R.; Roy, S. Topic modeling as a tool to gauge political sentiments from twitter feeds. Int. J. Nat.
Comput. Res. 2020, 9, 14–35. [CrossRef]
5. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers:
State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language
Processing: System Demonstrations, Online, 3 June 2020; pp. 38–45.
6. Iqbal, A.; Amin, R.; Iqbal, J.; Alroobaea, R.; Binmahfoudh, A.; Hussain, M. Sentiment Analysis of Consumer Reviews Using Deep
Learning. Sustainability 2022, 14, 10844. [CrossRef]
7. Hijjawi, M.; Bandar, Z.; Crockett, K.; Mclean, D. ArabChat: An arabic conversational agent. In Proceedings of the 2014 6th
International Conference on Computer Science and Information Technology (CSIT), Piscataway, NJ, USA, 26–27 March 2014;
IEEE: Piscataway, NJ, USA; pp. 227–237.
8. Ali, D.A.; Habash, N. Botta: An arabic dialect chatbot. In Proceedings of the COLING 2016, the 26th International Conference on
Computational Linguistics: System Demonstrations, Osaka, Japan, 11–16 December 2016; pp. 208–212.
9. Fadhil, A. OlloBot-towards a text-based arabic health conversational agent: Evaluation and results. In Proceedings of the
International Conference on Recent Advances in Natural Language Processing (RANLP 2019), Varna, Bulgaria, 2–4 September
2019; pp. 295–303.
10. Shawar, A.; Atwell, E.S. An Arabic chatbot giving answers from the Qur’an. In Proceedings of the TALN04: XI Conference sur le
Traitement Automatique des Langues Naturelles, Fez, Morocco, 19–22 April 2004; ATALA: Monza, Italy, 2004; Volume 2, pp.
197–202.
11. Bashir, A.M.; Hassan, A.; Rosman, B.; Duma, D.; Ahmed, M. Implementation of a neural natural language understanding
component for Arabic dialogue systems. Procedia Comput. Sci. 2018, 142, 222–229. [CrossRef]
12. AlHumoud, S.; al Wazrah, A.; Aldamegh, W. Arabic chatbots: A survey. Int. J. Adv. Comput. Sci. Appl. 2018, 9, 535–541. [CrossRef]
13. Al-Ghadhban, D.; Al-Twairesh, N. Nabiha: An Arabic dialect chatbot. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 1–8. [CrossRef]
14. Fuad, A.; Al-Yahya, M. AraConv: Developing an Arabic Task-Oriented Dialogue System Using Multi-Lingual Transformer Model
mT5. Appl. Sci. 2022, 12, 1881. [CrossRef]
15. Wilie, B.; Vincentio, K.; Winata, G.I.; Cahyawijaya, S.; Li, X.; Lim, Z.Y.; Soleman, S.; Mahendra, R.; Fung, P.; Bahar, S. Indonlu:
Benchmark and resources for evaluating indonesian natural language understanding. arXiv 2020, arXiv:2009.05387.
16. Bunk, T.; Varshneya, D.; Vlasov, V.; Nichol, A. Diet: Lightweight language understanding for dialogue systems. arXiv 2020,
arXiv:2004.09936.
17. Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S. GLUE: A multi-task benchmark and analysis platform for natural
language understanding. arXiv 2020, arXiv:1804.07461.
18. Habash, N.Y. Introduction to Arabic Natural Language Processing. In Synthesis Lectures on Human Language Technologies; Springer:
Cham, Switzerland, 2010; Volume 3, pp. 1–187.
19. Al-Ayyoub, M.; Khamaiseh, A.A.; Jararweh, Y.; Al-Kabi, M.N. A comprehensive survey of arabic sentiment analysis. Inf. Process.
Manag. 2019, 56, 320–342. [CrossRef]
20. Habash, N.; Eryani, F.; Khalifa, S.; Rambow, O.; Abdulrahim, D.; Erdmann, A.; Faraj, R.; Zaghouani, W.; Bouamor, H.;
Zalmout, N.; et al. Unified guidelines and resources for Arabic dialect orthography. In Proceedings of the Eleventh Inter-
national Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018.
21. Wang, A.; Pruksachatkun, Y.; Nangia, N.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S. SuperGLUE: A multi-task
benchmark and analysis platform for natural language understanding. Adv. Neural Inf. Process. Syst. 2019, 32, 3261–3275.
22. Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference
on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
23. Zelaya, C.V.G. Towards explaining the effects of data preprocessing on machine learning. In Proceedings of the 2019 IEEE
35th international conference on data engineering (ICDE), Macao, China, 8–12 April 2019; IEEE: Piscataway, NJ, USA, 2019;
pp. 2086–2090.
electronics
Article
Research on Log Anomaly Detection Based on Sentence-BERT
Caiping Hu 1, *, Xuekui Sun 2 , Hua Dai 2 , Hangchuan Zhang 1 and Haiqiang Liu 1
Abstract: Log anomaly detection is crucial for computer systems. By analyzing and processing the
logs generated by a system, abnormal events or potential problems in the system can be identified,
which is helpful for its stability and reliability. At present, due to the expansion of the scale and
complexity of software systems, the amount of log data grows enormously, and traditional detection
methods have been unable to detect system anomalies in time. Therefore, it is important to design
log anomaly detection methods with high accuracy and strong generalization. In this paper, we
propose the log anomaly detection method LogADSBERT, which is based on Sentence-BERT. This
method adopts the Sentence-BERT model to extract the semantic behavior characteristics of log events
and implements anomaly detection through the bidirectional recurrent neural network, Bi-LSTM.
Experiments on the open log data set show that the accuracy of LogADSBERT is better than that
of the existing log anomaly detection methods. Moreover, LogADSBERT is robust even under the
scenario of new log event injections.
Citation: Hu, C.; Sun, X.; Dai, H.; Zhang, H.; Liu, H. Research on Log Anomaly Detection
Based on Sentence-BERT. Electronics 2023, 12, 3580. [Link] electronics12173580
Academic Editor: Grzegorz Dudek
Received: 23 July 2023
Revised: 17 August 2023
Accepted: 22 August 2023
Published: 24 August 2023
Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an
open access article distributed under the terms and conditions of the Creative Commons
Attribution (CC BY) license (https://[Link]/licenses/by/4.0/).
1. Introduction
Logs usually contain information about the operational status of a system, including
operation records, fault information, security time, etc., which can provide a comprehensive
view of the system’s operational status [1]. Logs are time-series in nature; the information
in the logs is recorded by time, which allows us to analyze it in order to gain insight
into the operation of the system. Logs can provide a historical view: they collect all
information about the application, and many helpful insights can be gleaned from an
application’s history record, including information about potential problems and
benchmarks for determining when a process becomes an exception. Logs can monitor
the behavior of a system, and in contrast to other data sources, they can go deeper into
the system and track the actual behavior of the system as it runs. Log records contain
information and trends during system operation. Analyzing and mining the log data can
help detect and diagnose system anomalies.
With the expansion of software systems’ scale, complexity, and application scope, the
number of logs generated grows exponentially, which makes anomaly detection difficult for
the traditional methods based on rules and statistics. In order to adapt to the
development of software systems, researchers have shifted their research focus to deep
learning-based solutions, and, currently, log anomaly detection based on deep learning has
become a hot spot in the field of anomaly detection [2]. Compared to the traditional methods
based on rules and statistics, an anomaly detection method based on deep learning requires
no human intervention and can quickly and accurately identify abnormal behaviors in logs.
Moreover, traditional log anomaly detection methods are constrained by the limitations of
algorithms and capacity, whereas a log anomaly detection method based on deep learning
can process a large amount of data in parallel. It can efficiently solve the problems of
repeated sampling and information extraction. In addition, deep learning models can
extract useful information from dozens of data metrics, which can better capture the details
of the log data that reflect the anomalies of the system. Log data are usually in plain
text format, and natural language processing (NLP) is a specialized field for processing
and analyzing text data; for example, chatbots [3] based on machine translation have
become popular in recent years. Consequently, most of the log anomaly detection methods
based on deep learning that have emerged so far are associated with NLP. NLP is used to
extract the semantic features in log files, such as vocabulary, phrases, sentences, and
grammatical structures. These features are useful for pattern recognition and
classification in log anomaly detection, and the models built in this way can better
process and analyze the log data, as well as predict abnormal behaviors and events.
In this paper, we propose the log anomaly detection method LogADSBERT. It uses the
Sentence-BERT model [4] to extract the semantic features of log events and realizes the final
anomaly detection using the recurrent neural network model Bi-LSTM [5]. LogADSBERT
consists of two stages: the model training and the anomaly detection. In the model training
stage, the log parser parses the original logs into log events and log triples. The log events
are used as the corpus to train the Sentence-BERT model, and the log triples are used to
construct a sliding window sequence of log event semantic vectors to train the Bi-LSTM
neural network classification model, Bi-LSTM-ADM. In the anomaly detection stage, Bi-
LSTM-ADM is used to detect anomalies in the log data. LogADSBERT can achieve anomaly
detection with high accuracy and robustness.
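The sliding window construction mentioned above can be sketched as follows. This is a
minimal illustration under stated assumptions rather than the LogADSBERT code: the window
size, the use of the next event as the training target, and the tiny hand-written
`embeddings` table standing in for Sentence-BERT vectors are all hypothetical.

```python
def sliding_windows(event_ids, window_size):
    """Return (window, next_event) pairs from one log event sequence."""
    pairs = []
    for start in range(len(event_ids) - window_size):
        window = event_ids[start:start + window_size]
        pairs.append((window, event_ids[start + window_size]))
    return pairs


def to_vector_windows(event_ids, embeddings, window_size):
    """Replace each event id in every window by its semantic vector."""
    return [
        ([embeddings[event] for event in window], target)
        for window, target in sliding_windows(event_ids, window_size)
    ]


# Hypothetical 2-dimensional vectors; a real system would use sentence embeddings.
embeddings = {"E1": [0.1, 0.9], "E2": [0.8, 0.2], "E3": [0.5, 0.5]}
sequence = ["E1", "E2", "E3", "E1", "E2"]
windows = sliding_windows(sequence, window_size=3)
vector_windows = to_vector_windows(sequence, embeddings, window_size=3)
```

Each pair can then serve as one training sample for a sequence model; sequences shorter
than or equal to the window size yield no samples.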
The contributions of this paper can be summarized as follows:
1. We construct a log event semantic feature extraction model, T-SBERT, based on the
Sentence-BERT model, which can convert log events into log event semantic feature
representations. The Bidirectional Long Short-Term Memory Recurrent Neural Network
model (Bi-LSTM) with an attention mechanism is adopted to generate an anomaly
detection model.
2. We propose a log event semantic feature matching algorithm and an anomaly detection
algorithm. The log event semantic matching dictionary is established, and the log
anomaly detection method LogADSBERT, based on Sentence-BERT, is constructed. It
is, to the best of our knowledge, the first to extract log event semantic features using
the Sentence-BERT model.
3. In the scenario of new log event injection, LogADSBERT can ensure high accuracy
and strong robustness of anomaly detection. Experiment results demonstrate the
effectiveness of the proposed method.
This paper is structured as follows: Section 2 discusses the related work; Section 3
presents the preliminary knowledge of this paper; Section 4 presents the definitions related
to the proposed method; Section 5 presents the framework of our anomaly detection
method; Section 6 describes the experiments used to evaluate the effectiveness of the
proposed method; and finally, the conclusion is provided in Section 7.
2. Related Work
The traditional log anomaly detection methods are based on rules and statistics [6–8]
and generally need to analyze normal and abnormal behavior patterns using mathematical
counting methods. They usually define a set of features, design response rules for each
feature, and combine these rules into a complete system. In the testing stage, newly
generated logs are compared with the existing rules to determine whether anomalies exist.
For example, Prewett et al. [7] proposed the log file analysis tool Logsurfer, which achieves
anomaly detection by defining rules for the expected behavior of the system and then
matching them using regular expressions. Logsurfer can also update its rule set at runtime.
Rouilard et al. [8] proposed the Simple Event Correlator (SEC) to create feature rule sets
by analyzing log sequences, which reduces the false alarm rate but is less automated and
incurs higher labor costs. As log data grow in scale and are continually updated, the
traditional log anomaly detection methods based on rules and statistics are usually not
effective in detecting complex or unknown anomalies. Thus, researchers in the field have
shifted their research direction to machine learning and deep learning.
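The rule-matching idea behind such tools can be illustrated with a few regular expressions.
The sketch below is a generic Python example and does not reproduce the rule syntax of
Logsurfer or SEC; the rule names, patterns and sample log lines are hypothetical.

```python
import re

# Each rule pairs a label with a compiled pattern (both invented for illustration).
RULES = [
    ("disk_failure", re.compile(r"\bdisk\b.*\b(failed|failure)\b", re.IGNORECASE)),
    ("auth_failure", re.compile(r"authentication failure", re.IGNORECASE)),
]


def match_rules(line, rules=RULES):
    """Return the labels of all rules whose pattern occurs in the line."""
    return [label for label, pattern in rules if pattern.search(line)]


def detect(lines, rules=RULES):
    """Return (line_index, labels) for every line that triggers at least one rule."""
    flagged = []
    for index, line in enumerate(lines):
        labels = match_rules(line, rules)
        if labels:
            flagged.append((index, labels))
    return flagged


sample_lines = [
    "Disk sda1 FAILED to mount",
    "session opened for user root",
    "sshd: authentication failure; user=bob",
]
alerts = detect(sample_lines)
```

Every new anomaly type needs a new hand-written pattern, which is the labor cost the text
attributes to rule-based approaches.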
Traditional machine learning-based log anomaly detection includes supervised and
unsupervised methods. Supervised methods include Support
Vector Machine (SVM) [9,10], Linear Regression (LR) [11,12], Decision Tree (DT) [13],
K-Nearest Neighbor (KNN) [14], etc. These methods build a log frequency statistics
vector that records how often each log event occurs within a log sequence, use this
vector as input, and output a binary label (normal or abnormal) as the classification
result. Unsupervised methods include Principal Component Analysis
(PCA) [15], Isolation Forest (IF) [16], Invariant Mining (IM) [17], and Log Clustering
(LC) [18]. These methods are trained on unlabeled data, so log anomaly detection can be
achieved without manual labeling.
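The log frequency statistics vector described above can be illustrated with a minimal sketch. The event identifiers (`E1` to `E4`) and the function name `event_count_vector` are hypothetical and chosen only for illustration; they are not taken from any of the cited methods.

```python
from collections import Counter

def event_count_vector(sequence, vocabulary):
    """Count how often each known log event occurs in one log sequence."""
    counts = Counter(sequence)
    # One entry per event in the vocabulary; unseen events get 0.
    return [counts[event] for event in vocabulary]

# Hypothetical event IDs and a hypothetical log sequence.
vocabulary = ["E1", "E2", "E3", "E4"]
sequence = ["E1", "E2", "E2", "E4", "E2"]

vector = event_count_vector(sequence, vocabulary)
print(vector)  # [1, 3, 0, 1]
```

A vector of this kind, paired with a normal/abnormal label, is the sort of input a supervised classifier would consume.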
The deep learning-based log anomaly detection methods [19–21] usually have three
steps: First, a log parser is used to split the system log data into two parts, the log event and
the parameter. The log event describes the system or process behavior, and the parameter
element records state information such as the timestamp and the process identifier. Second,
the behavior sequence of the system or process is constructed using the timestamp and log
event of the log record. Third, anomaly detection is performed based on the behavioral
sequences. Researchers have been developing log anomaly detection methods based on
recurrent neural networks. For example, Du et al. [19] trained LSTM models on log keys and
parameters to obtain a log key anomaly detection model and a parameter value anomaly
detection model, and combined the two models to achieve anomaly detection. However, a
log key is only the index of a log event and does not carry semantic features in
any real sense. Log key-based detection also requires the size of the set
of log events to be known before detection, so it may fail when log events are updated or
added. Meng et al. [20] proposed a template2vec-based method, LogAnomaly, that used
the Bi-LSTM model with an attention mechanism to combine log event features and word
features within the event to obtain the log event semantic feature space vector. When a log
event is updated, its semantic feature vector is computed first, and the event is then
replaced by the existing log event that is closest in Euclidean distance.
However, the performance drops sharply when more log events are added. Brown et al. [21]
also proposed an LSTM-based approach for routine detection that incorporates multiple
implementations of attention mechanisms into the LSTM model to extract log features
and ultimately perform anomaly detection. Although the experiments show high accuracy
for this method on the LANL cyber security datasets, the experimental datasets are
relatively limited, and high accuracy cannot be achieved on several publicly available and
commonly used datasets. This method focuses only on discovering relationships hidden
in system logs and on the effectiveness of multiple attention mechanisms in log anomaly
detection, which limits its use in practical application scenarios. In addition, the
BERT model and its derivative models, which have recently become popular in
natural language processing, have been applied to log anomaly detection. For
example, Chen and Liao [22] produced semantic log vectors with a pre-trained
BERT language model and used a linear classifier to detect anomalies. This method uses a
single BERT implementation, which may lose semantic information during sequence feature
extraction. Zhang et al. [23] adopted the SBERT model to extract the semantic
representation of log events, which considers the semantics and word order of
each word in a log event. They designed a GRU model for anomaly detection; however,
abnormal logs are diverse in content, including sequence pattern, frequency, and correlation,
whereas a GRU can only capture one-way sequence information. Guo et al. [24] learned the
patterns of normal log sequences using two novel self-supervised training tasks:
masked log message prediction and volume of hypersphere minimization. Nevertheless,
this work does not identify or learn from the semantic information of abnormal logs.
Currently, log anomaly detection methods based on rules and statistics can no
longer keep pace with the rapid development of software systems, and machine learning-based log
anomaly detection suffers from weak feature extraction ability, poor adaptability, high labor
cost, and low accuracy compared to deep learning. Therefore, current log anomaly
detection research focuses on deep learning-based methods. However, existing
deep learning-based methods still do not fully utilize the semantic
information in the log data or other feature information such as
frequency statistics and location embedding. As a result, their accuracy
does not reach the required standard, and their robustness to the addition
of new logs needs to be further improved.
3. Preliminary Knowledge
3.1. Log Parser
System logs are semi-structured data and are difficult to feed directly into model
training and detection, so converting semi-structured log data into structured log data is
the first step of data processing and is crucial for subsequent anomaly detection. A system
log message consists of a constant part and a variable part, and generating a log is
essentially a process of combining the two. The variable part is the log parameters, which
change dynamically with each log that is generated. The constant part is the log event,
which is usually fixed; the standard log event is obtained by replacing the parameter part
of the system log with wildcards. The log parser (LogParser) performs the reverse of the log
generation process: it parses the generated logs back into log events and
parameters so that anomaly detection can be carried out more effectively. Many open-source
log parsers are available. Currently, log parsers [25–29] can be divided into two main
groups: log parsers based on clustering and log parsers based on heuristic structures.
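To make the split into log event and parameters concrete, the following is a minimal sketch that masks variable fields with the wildcard `<*>`. It uses hand-written regular expressions, which is a deliberate simplification and not one of the clustering-based or heuristic parsers cited above; the masking rules, the function name `parse_log`, and the sample message are assumptions made for illustration.

```python
import re

# Hypothetical masking rules for variable fields (assumptions, not a real parser's rules).
MASKS = [
    re.compile(r"blk_-?\d+"),                          # HDFS-style block identifiers
    re.compile(r"\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?"),   # IPv4 address with optional port
    re.compile(r"\b\d+\b"),                            # remaining bare integers
]

def parse_log(message):
    """Split a raw log message into (log event, parameters).

    Parameters are collected rule by rule, so their order follows the
    order of MASKS rather than their position in the message.
    """
    params = []

    def mask(match):
        params.append(match.group(0))
        return "<*>"

    event = message
    for pattern in MASKS:
        event = pattern.sub(mask, event)
    return event, params

message = "Received block blk_-1608999687919862906 of size 91178 from 10.250.14.224"
event, params = parse_log(message)
print(event)   # Received block <*> of size <*> from <*>
print(params)
```

The constant text with wildcards plays the role of the log event, and the captured values play the role of the parameters.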
where the hidden output of the current layer is obtained by concatenating the processed results
of the forward inputs with the processed outputs of the reverse inputs. Bi-LSTM captures
both backward and forward temporal correlations and can make full use of historical and
future information through bi-directional propagation to achieve better performance.
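As a hedged restatement in symbols, the concatenation described above is commonly written as follows. The arrow notation, the names $\mathrm{LSTM}_{\mathrm{fwd}}$ and $\mathrm{LSTM}_{\mathrm{bwd}}$, and the use of $\oplus$ for concatenation are assumptions introduced here for illustration and may differ from the notation used elsewhere in this paper.

```latex
\overrightarrow{h}_t = \mathrm{LSTM}_{\mathrm{fwd}}\left(x_t, \overrightarrow{h}_{t-1}\right), \qquad
\overleftarrow{h}_t = \mathrm{LSTM}_{\mathrm{bwd}}\left(x_t, \overleftarrow{h}_{t+1}\right), \qquad
h_t = \overrightarrow{h}_t \oplus \overleftarrow{h}_t
```

Here $x_t$ is the input at step $t$, the forward state depends on the past, the backward state depends on the future, and $h_t$ is the hidden output of the layer.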
4. Definitions of LogADSBERT
Assume that the system log set is $L = \{l_1, l_2, \ldots, l_n\}$. After parsing the log set $L$ using
LogParser, we obtain a set of log events $T = \{t_1, t_2, \ldots, t_m\}$ and a set of log triples
$P = \{p_1, p_2, \ldots, p_n\}$.
Definition 1 (Log Event (LE)). A log event is the structured text obtained by removing
the variable parameters from a system log $l_i$ using the log parser; it is denoted as $t_i \in T$.
Definition 2 (Log Triple (LT)). A log triple is the structured log information obtained by parsing
a system log with the log parser; it is denoted as $p_i = (id, t, ts)$, where $id$ is the process ID,
$t$ is the log event, and $ts$ is the timestamp of the log generation.
Definition 3 (Log Event Semantic Vector (LE-SV)). Taking the log events of $T$ as the input of
the T-SBERT model, the output is the log event semantic vector set $V = \{v_1, v_2, \ldots, v_m\}$.
Definition 4 (Log Event Semantic Dictionary (LE-SD)). The log event semantic dictionary is
denoted as $D$ and is initialized as the set of mappings $t_i \to v_j$, that is, $D = \{t_i \to v_j \mid t_i \in T, v_j \in V\}$.
When a new type of log appears, the log event semantic vector of the new log is obtained by the log
event semantic matching algorithm based on the T-SBERT model, and the new mapping $t_i \to v_j$ is
added to the log event semantic dictionary.
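The dictionary update in Definition 4 can be sketched as below. This is only an illustration under strong assumptions: `embed` is a bag-of-words stand-in for the T-SBERT model (it is not a sentence embedding), cosine similarity is assumed as the matching criterion, and the update rule that maps a new event to the vector of its closest known event is one possible reading of the definition. All names and sample events are hypothetical.

```python
import math
from collections import Counter

def embed(event):
    """Stand-in for T-SBERT: a bag-of-words count vector (NOT a real sentence embedding)."""
    return Counter(event.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def match_event(new_event, dictionary):
    """Return the known log event whose vector is most similar to the new event."""
    query = embed(new_event)
    return max(dictionary, key=lambda known: cosine(query, dictionary[known]))

# Hypothetical known log events and the initial dictionary D (event -> vector).
known_events = ["receiving block <*> src <*> dest <*>", "deleting block <*> file <*>"]
D = {t: embed(t) for t in known_events}

# A new type of log event appears; map it to the vector of its closest known event.
new_event = "received block <*> from src <*>"
closest = match_event(new_event, D)
D[new_event] = D[closest]  # assumed update rule, see lead-in
print(closest)
```

With a real sentence encoder in place of `embed`, the same lookup would operate on dense vectors instead of word counts.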
Definition 5 (Log Event Semantic Vector Sliding Window (LE-SV-SW)). Assume that $h$
is the size of the sliding window and that $T_i = \{e_1, e_2, \ldots, e_q\}$, $T_i \subseteq T$,
is a log event sequence. The log event semantic matching algorithm based on the
T-SBERT model converts the log event sequence $T_i$ into the log event semantic vector sequence
$S_i = \langle v_{e_1}, v_{e_2}, \ldots, v_{e_q} \rangle$. Given $v_{e_{j+1}} \in S_i$, the corresponding sliding window is denoted as $W(S_i, v_{e_j})$
and is generated according to the following rules.
1. If $h \le j < q$, then $W(S_i, v_{e_j}) = \langle v_{e_{j-h+1}}, v_{e_{j-h+2}}, \ldots, v_{e_j} \rangle$;
2. Otherwise, $W(S_i, v_{e_j}) = \emptyset$.
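The two rules above can be sketched in a few lines. The function names are hypothetical, the index `j` is 1-based as in the definition, and short strings stand in for the semantic vectors so that the output is easy to read.

```python
def sliding_window(S, j, h):
    """Return W(S, v_{e_j}) for a 1-based index j, following the two rules of Definition 5."""
    q = len(S)
    if h <= j < q:
        # 1-based elements j-h+1 .. j correspond to the 0-based slice [j-h, j).
        return S[j - h:j]
    return []  # the empty window

def training_pairs(S, h):
    """Pair every non-empty window with the vector that follows it, v_{e_{j+1}}."""
    return [(sliding_window(S, j, h), S[j]) for j in range(h, len(S))]

# Strings stand in for the semantic vectors v_{e_1} .. v_{e_5} (illustration only).
S = ["v1", "v2", "v3", "v4", "v5"]
for window, target in training_pairs(S, h=3):
    print(window, "->", target)
# ['v1', 'v2', 'v3'] -> v4
# ['v2', 'v3', 'v4'] -> v5
```

Each pair consists of a window of h consecutive vectors and the vector that immediately follows it, which is the value a prediction model would be asked to produce.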
Definition 6 (Log Sequence Anomaly Detection (LSAD)). Assume that the log event
sequence is $T_i = \{e_1, e_2, \ldots, e_q\}$, the log event semantic vector window set is
$W_{S_i} = \{W(S_i, v_{e_j}) \mid v_{e_j} \in S_i \wedge j \in [h, q)\}$, the corresponding set of log event semantic vectors $v_{e_{j+1}}$ is
$V_{e_{j+1}} = \{v_{e_{j+1}} \mid v_{e_{j+1}} \in S_i \wedge j \in [h, q)\}$, and the set of result vectors predicted by the Bi-LSTM-ADM with input $W_{S_i}$ is
$R_{e_{j+1}} = \{r_{e_{j+1}} \mid j \in [h, q)\}$. Given the threshold $\xi$, log sequence anomaly detection is performed as follows.
1. If, for each $v_{e_{j+1}} \in V_{e_{j+1}}$ and the corresponding $r_{e_{j+1}} \in R_{e_{j+1}}$, the similarity between $v_{e_{j+1}}$ and $r_{e_{j+1}}$ is greater
than the threshold $\xi$, the log event sequence $T_i$ is determined to be normal;
2. Otherwise, the log event sequence $T_i$ is abnormal.
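The decision rule in Definition 6 can be sketched as follows. The definition does not fix a particular similarity function in this passage, so cosine similarity is an assumption here; the function names, vectors, and threshold value are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two dense vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def is_normal_sequence(actual, predicted, xi):
    """A sequence is normal only if every prediction is similar enough (> xi) to the observed vector."""
    return all(cosine_similarity(v, r) > xi for v, r in zip(actual, predicted))

# Hypothetical observed vectors v_{e_{j+1}} and predicted vectors r_{e_{j+1}}.
actual = [[1.0, 0.0], [0.0, 1.0]]
good_predictions = [[0.9, 0.1], [0.1, 0.9]]
bad_predictions = [[0.9, 0.1], [1.0, 0.0]]

print(is_normal_sequence(actual, good_predictions, xi=0.8))  # True
print(is_normal_sequence(actual, bad_predictions, xi=0.8))   # False
```

A single window whose prediction falls at or below the threshold is enough to mark the whole sequence as abnormal under this reading.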
5. Algorithms of LogADSBERT
The proposed LogADSBERT consists of two stages: model training and
anomaly detection. These two stages are implemented as
follows.
Model training stage: The log parser parses the logs into a set of log events and a set of
log triples. The set of log events is used as training data for Sentence-BERT, which is trained
with the TSBERTTrain algorithm (Algorithm 1) to generate T-SBERT, the log event semantic
vector generation model. The log triples are ordered by the timestamp
ts and transformed into sequences of log event semantic vectors using the log event
semantic matching algorithm based on the T-SBERT model (Algorithm 2). The sliding window
mechanism is then applied to these sequences to construct the sliding window training data.
Finally, the Bi-LSTM model is trained on these data with the
BILSTMADMTrain algorithm (Algorithm 3) to generate the Bi-LSTM-ADM model.
Anomaly detection stage: The logs to be detected are first transformed into a set of log
triples using the log parser, and the log event semantic matching algorithm is then used to obtain
a log event semantic vector sequence. Finally, the LogADSBERTDetect algorithm (Algorithm 4)
uses this sequence to complete the log anomaly detection.
The framework of the proposed log anomaly detection method LogADSBERT is shown
in Figure 1.
[Figure 1: framework of LogADSBERT. Recoverable labels: Model training; Log event set; Training SBERT; T-SBERT; Generating log event semantic dictionary; Training Bi-LSTM; Anomaly detection; Incremental update; Consult; Predictive anomaly detection; Update; Feedback.]
Algorithm 1: TSBERTTrain(T)
Input: Log event set T
Output: Log event semantic vector generation model T-SBERT
(1) Initialize the text corpus TC = ∅;
(2) Initialize the log event semantic dictionary D = ∅;
(3) Initialize the Sentence-BERT model instance;
(4) FOR EACH ti ∈ T DO
(5)     Split ti into a word list WL;
(6)     FOR EACH word IN WL DO
(7)         word = lowerCase(word);
(8)         IF word is a stop word or an identifier with no semantic meaning THEN
(9)             Remove word from WL;
(10)        END IF
(11)    END FOR
(12)    Form the corpus entry for ti from the processed WL;
(13)    Add the corpus entry to TC;
(14) END FOR
(15) Train the Sentence-BERT model on the text corpus TC to obtain T-SBERT;
(16) RETURN T-SBERT;
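Steps (4) to (14) of Algorithm 1 can be sketched as below. The stop-word list is a small hypothetical sample, and treating any token without a letter (such as the wildcard `<*>`) as an identifier with no semantic meaning is an assumption about what that step intends. Step (15), the actual Sentence-BERT training, is not shown.

```python
import re

# Hypothetical stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "of", "to", "is", "for", "from"}

def preprocess_event(event):
    """Lowercase a log event and drop stop words and tokens that contain no letters."""
    words = []
    for word in event.split():
        word = word.lower()
        if word in STOP_WORDS or not re.search(r"[a-z]", word):
            continue  # e.g. "the", "<*>", "42"
        words.append(word)
    return words

def build_corpus(events):
    """Build the text corpus TC: one processed word list per log event."""
    return [preprocess_event(t) for t in events]

# Hypothetical log events.
events = ["Receiving block <*> from the source <*>", "Deleting block <*> file <*>"]
TC = build_corpus(events)
print(TC)  # [['receiving', 'block', 'source'], ['deleting', 'block', 'file']]
```

The resulting word lists correspond to the corpus entries that the algorithm collects before training.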
BILSTMADMTrain(S, h): The log event prediction model training algorithm uses the
sliding window training pairs generated from the sequence of log event semantic vectors
(Definition 5) to train the Bi-LSTM model and obtain the log event prediction model
Bi-LSTM-ADM. The initial sliding window length is $h$. The log event sequence
$T_i = \{e_1, e_2, \ldots, e_q\}$ is first converted into the log event semantic vector sequence
$S_i = \langle v_{e_1}, v_{e_2}, \ldots, v_{e_q} \rangle$ by Algorithm 2. A window of size $h$ is then slid over $S_i$ to construct the training
data pairs (TDPs); the sliding window is denoted as $W(S_i, v_{e_j})$, and the TDP
constructed for $v_{e_{j+1}}$ is denoted as $(w_i, v_{e_{j+1}})$, where $w_i$ is the sliding window. Each TDP is stored in
a list to form the training data pair list (TDPL). The Bi-LSTM model is trained with
TDPL to obtain the log event prediction model Bi-LSTM-ADM, which is then used for log
event prediction and subsequent anomaly detection. The specific process of training Bi-LSTM to
generate Bi-LSTM-ADM is shown in Algorithm 3.
6. Evaluation
In this section, we evaluate the proposed LogADSBERT by conducting experiments on
real log datasets. We implement LogADSBERT together with existing log anomaly
detection methods based on deep learning, such as DeepLog [19] and LogAnomaly [20].
$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (1)$$
4. Recall: the proportion of log sequences with real anomalies that are successfully
marked; the computation of recall is shown in Equation (2).
$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (2)$$
5. F1-Score: the harmonic mean of precision (the accuracy of the detection results) and
recall (the completeness of the detection results); the calculation of F1-Score is shown
in Equation (3).
$$\text{F1-Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (3)$$
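Equations (1) to (3) can be checked with a short sketch. The counts used below are hypothetical and are not results from this paper; returning 0.0 when a denominator is zero is a convention chosen here, not something the equations specify.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1-Score from TP, FP, and FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for illustration only.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.900 recall=0.750 f1=0.818
```

Because the F1-Score is a harmonic mean, it sits closer to the lower of the two values than an arithmetic mean would.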
The experimental parameters were set according to the characteristics of the log data, the
structure of the model, and the final experimental results. We tried a variety of different
parameter combinations and found that the following parameters can achieve the best
detection results. Table 1 shows the specific hyperparameter settings.
Hyperparameter                         Value
Learning rate                          0.001
Batch size                             2048
Epochs                                 300
l (number of neural network layers)    2
α (hidden layer unit size)             64
h (sliding window size)                10
6.2. Result
1. Precision, Recall, and F1-Score
Figure 2 shows the precision, recall, and F1-Score of LogADSBERT on the HDFS
dataset. It indicates that LogADSBERT is better than DeepLog and LogAnomaly in all
performance metrics. In F1-Score, LogADSBERT improves by 7.0% and 4.3% compared
to DeepLog and LogAnomaly, respectively. Both precision and
recall improve as well. Specifically, LogADSBERT exceeds
DeepLog by 8.8% in precision and 5.1% in recall, and it exceeds LogAnomaly by 5.5% in
precision and 3.0% in recall.
Figure 3 illustrates the precision, recall, and F1-Score of the three methods on the
OpenStack dataset. The advantage of LogADSBERT over DeepLog and LogAnomaly
is more pronounced on the OpenStack dataset than on the HDFS dataset. Even compared
with the better-performing baseline, LogAnomaly, LogADSBERT leads by 7.1% in precision
and 7.0% in F1-Score. In addition, LogADSBERT achieves 100% recall, whereas
the other methods achieve more than 90%.
Table 3 shows the number of FP and FN of the three methods on the HDFS dataset.
The FP and FN of DeepLog and LogAnomaly are both significantly higher than those of
LogADSBERT. Compared to the worst-performing method DeepLog, the FP and FN of
LogADSBERT are reduced by 244 and 127, respectively, which corresponds to
reductions of 80.5% and 80.0% in FP and FN, respectively.
Table 4 shows the number of FP and FN of the three methods on the OpenStack
dataset. The result is similar to that shown in Table 3: the numbers of FP and FN for
LogADSBERT are clearly lower than those of DeepLog and LogAnomaly. This indicates that
LogADSBERT outperforms DeepLog and LogAnomaly in the FP and FN metrics on the
OpenStack dataset.
3. Effects of different parameters on LogADSBERT
The experiments on the effect of different parameters on the precision, recall, and
F1-Score of LogADSBERT were carried out with the control variable method, that is, by
varying one parameter at a time while keeping the others fixed. For simplicity,
the more commonly used HDFS dataset was adopted in the experiment. The results of the
experiment are shown in Figures 4–7.
Figure 4 shows the effect of the number of log events t on the three performance metrics
of LogADSBERT. When t = 40, the performance of LogADSBERT is optimal, and when t = 45,
the performance decreases, but overall, the effect is not significant. Figure 5 shows the
effect of the sliding window size h on the three performance metrics of LogADSBERT;
the accuracy of LogADSBERT gradually improves as h increases. As shown in
Figures 6 and 7, LogADSBERT’s precision, recall, and F1-Score all reach their highest
values at l = 2 and α = 64, where l is the number of neural network layers and α is the
hidden layer unit size. In summary, across different settings of the
number of log events t, the sliding window size h, the number of neural network layers l, and
the hidden layer unit size α, LogADSBERT maintains stable overall
performance and high accuracy, which means that LogADSBERT is robust. It
can therefore cope with the various uncertainties and complex factors encountered
in actual network system application scenarios and achieve accurate and stable anomaly
detection.
4. Performance comparison of new log event injection
In order to further validate the robustness and effectiveness of LogADSBERT, we
conducted experiments in which new log events were added to the HDFS dataset.
We again used precision, recall, and F1-Score as the performance metrics, and
DeepLog and LogAnomaly as the comparison methods. The set of log events in the
training stage covers the system log datasets and contains 13 log events, and the number of
newly added log events was 33. DeepLog does not provide a solution for newly added log
events, so here it was set to mark a log sequence as abnormal whenever a newly added
log event was detected. The results of the experiments are shown in Table 5.
Table 5 shows that two of the evaluation metrics for LogADSBERT, precision and
F1-Score, were significantly better than those of the other two methods. In particular, the F1-Score
reached 93.2%, which is 23.8% higher than that of LogAnomaly. Because LogADSBERT
detects anomalies based on the semantic features of log events, new log events are
matched by the T-SBERT-based log event semantic matching algorithm to obtain the most
similar log event semantic representations, which greatly reduces the impact of new log
events on the anomaly detection results. Additionally, in the experiments, DeepLog was
set to flag every log sequence containing a new log event as abnormal, which
naturally gives DeepLog a significantly higher detection rate than the
other methods, but this setting made the number of FP too high and, consequently, both
the precision and F1-Score were much lower than for the other methods. LogAnomaly
handles a new log event by replacing it with the already known log event that is closest
in Euclidean distance; however, this method does not
represent the new log events well, and when the number of new log events is too large, the
overall performance decreases rapidly. In summary, LogADSBERT, a log anomaly detection
method based on Sentence-BERT, maintains strong robustness in the scenario of adding
new log events.
7. Conclusions
In this paper, to address the limitations of existing log anomaly detection methods based
on deep learning, we proposed a Sentence-BERT-based log anomaly detection method,
LogADSBERT. The proposed anomaly detection model, trained on the log event
corpus, not only extracts log event information containing semantic features but also
obtains the most relevant log event semantic information for newly added log events
through the log event semantic matching algorithm. The proposed method shows improved
accuracy compared to the existing anomaly detection methods, and it also shows robustness
when new log events are added.
With the rapid development of software systems, log anomaly detection needs to be
updated and iterated to meet new requirements. Future work should focus on the following
aspects: (1) optimizing the preprocessing of log data to improve the efficiency of
anomaly detection; and (2) realizing multimodal log anomaly detection, in which
multiple types of log data are integrated for joint analysis and processing to
improve the accuracy and robustness of anomaly detection.
Author Contributions: Conceptualization, C.H. and H.D.; methodology, C.H. and X.S.; software,
C.H. and X.S.; validation, H.D., H.Z. and C.H.; formal analysis, X.S. and H.D.; investigation, H.Z.;
resources, X.S. and C.H.; data curation, H.L.; writing—original draft preparation, C.H., H.Z. and X.S.;
writing—review and editing, C.H. and H.D.; visualization, X.S.; supervision, H.D. and C.H.; project
administration, C.H. and H.D.; funding acquisition, C.H. All authors have read and agreed to the
published version of the manuscript.
Funding: This research was funded by Jinling Institute of Technology High-level Talent Research
Start-up Project (JIT-RCYJ-202102), Key R&D Plan Project of Jiangsu Province (BE2022077), Jinling
Institute of Technology Science and Education Integration Project (2022KJRH18), and Jiangsu Province
College Student Innovation Training Program Project (202313573080Y, 202313573081Y).
Data Availability Statement: This research employed publicly available datasets for its experimental
studies.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Lam, H.; Russell, D.; Tang, D.; Munzner, T. Session viewer: Visual exploratory analysis of web session logs. In Proceedings of
the 2007 IEEE Symposium on Visual Analytics Science and Technology, Sacramento, CA, USA, 30 October–1 November 2007;
IEEE: Piscataway, NJ, USA, 2007; pp. 147–154.
2. Yadav, R.B.; Kumar, P.S.; Dhavale, S.V. A survey on log anomaly detection using deep learning. In Proceedings of the 2020 8th
International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), Noida,
India, 4–5 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1215–1220.
3. Anastasiou, D.; Ruge, A.; Ion, R.; Segărceanu, S.; Suciu, G.; Pedretti, O.; Gratz, P.; Afkari, H. A machine translation-powered
chatbot for public administration. In Proceedings of the 23rd Annual Conference of the European Association for Machine
Translation, Ghent, Belgium, 1–3 June 2022; pp. 327–328.
4. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019
Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 3982–3992.
5. Jang, B.; Kim, M.; Harerimana, G.; Kang, S.U.; Kim, J.W. Bi-LSTM model to increase accuracy in text classification: Combining
Word2vec CNN and attention mechanism. Appl. Sci. 2020, 10, 5841. [CrossRef]
6. Roy, S.; König, A.C.; Dvorkin, I.; Kumar, M. Perfaugur: Robust diagnostics for performance anomalies in cloud services. In
Proceedings of the 2015 IEEE 31st International Conference on Data Engineering, Seoul, Republic of Korea, 13–17 April 2015;
IEEE: Piscataway, NJ, USA, 2015; pp. 1167–1178.
7. Prewett, J.E. Analyzing cluster log files using logsurfer. In Proceedings of the 4th Annual Conference on Linux Clusters, St.
Petersburg, Russia, 2–4 June 2003; Citeseer: State College, PA, USA, 2003; pp. 1–12.
8. Rouillard, J.P. Real-time Log File Analysis Using the Simple Event Correlator (SEC). LISA 2004, 4, 133–150.
9. Liang, Y.; Zhang, Y.; Xiong, H.; Sahoo, R. Failure prediction in ibm bluegene/l event logs. In Proceedings of the Seventh IEEE
International Conference on Data Mining (ICDM 2007), Omaha, NE, USA, 28–31 October 2007; IEEE: Piscataway, NJ, USA, 2007;
pp. 583–588.
10. Wang, Y.; Wong, J.; Miner, A. Anomaly intrusion detection using one class SVM. In Proceedings of the Fifth Annual IEEE SMC
Information Assurance Workshop, West Point, NY, USA, 10–11 June 2004; IEEE: Piscataway, NJ, USA, 2004; pp. 358–364.
11. Breier, J.; Branišová, J. Anomaly detection from log files using data mining techniques. In Information Science and Applications;
Springer: Berlin/Heidelberg, Germany, 2015; pp. 449–457.
12. He, P.; Zhu, J.; He, S.; Li, J.; Lyu, M.R. Towards automated log parsing for large-scale log data analysis. IEEE Trans. Dependable
Secur. Comput. 2017, 15, 931–944. [CrossRef]
13. Chen, M.; Zheng, A.X.; Lloyd, J.; Jordan, M.I.; Brewer, E. Failure diagnosis using decision trees. In Proceedings of the International
Conference on Autonomic Computing, New York, NY, USA, 17–19 May 2004; IEEE: Piscataway, NJ, USA, 2004; pp. 36–43.
14. Ying, S.; Wang, B.; Wang, L.; Li, Q.; Zhao, Y.; Shang, J.; Huang, H.; Cheng, G.; Yang, Z.; Geng, J. An improved KNN-based efficient
log anomaly detection method with automatically labeled samples. ACM Trans. Knowl. Discov. Data (TKDD) 2021, 15, 1–22.
[CrossRef]
15. Xu, W.; Huang, L.; Fox, A.; Patterson, D.; Jordan, M.I. Detecting large-scale system problems by mining console logs. In
Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, Big Sky, MT, USA, 11–14 October 2009; pp.
117–132.
16. Xu, D.; Wang, Y.; Meng, Y.; Zhang, Z. An improved data anomaly detection method based on isolation forest. In Proceedings of
the 2017 10th International Symposium on Computational Intelligence and Design (ISCID), Hangzhou, China, 9–10 December
2017; IEEE: Piscataway, NJ, USA, 2017; Volume 2, pp. 287–291.
17. Lou, J.G.; Fu, Q.; Yang, S.; Xu, Y.; Li, J. Mining Invariants from Console Logs for System Problem Detection. In Proceedings of the
USENIX Annual Technical Conference, Virtual, 14–16 July 2010; pp. 1–14.
18. Vaarandi, R.; Pihelgas, M. Logcluster-a data clustering and pattern mining algorithm for event logs. In Proceedings of the 2015
11th International Conference on Network and Service Management (CNSM), Barcelona, Spain, 9–13 November 2015; IEEE:
Piscataway, NJ, USA, 2015; pp. 1–7.
19. Du, M.; Li, F.; Zheng, G.; Srikumar, V. Deeplog: Anomaly detection and diagnosis from system logs through deep learning. In
Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, Dallas, TX, USA, 30 October–3
November 2017; pp. 1285–1298.
20. Meng, W.; Liu, Y.; Zhu, Y.; Zhang, S.; Pei, D.; Liu, Y.; Chen, Y.; Zhang, R.; Tao, S.; Sun, P.; et al. LogAnomaly: Unsupervised
detection of sequential and quantitative anomalies in unstructured logs. IJCAI 2019, 19, 4739–4745.
21. Brown, A.; Tuor, A.; Hutchinson, B.; Nichols, N. Recurrent neural network attention mechanisms for interpretable system log
anomaly detection. In Proceedings of the First Workshop on Machine Learning for Computing Systems, Tempe, AZ, USA, 12
June 2018; pp. 1–8.
22. Chen, S.; Liao, H. Bert-log: Anomaly detection for system logs based on pre-trained language model. Appl. Artif. Intell. 2022, 36,
2145642. [CrossRef]
23. Zhang, M.; Chen, J.; Liu, J.; Wang, J.; Shi, R.; Sheng, H. LogST: Log semi-supervised anomaly detection based on sentence-BERT.
In Proceedings of the 2022 7th International Conference on Signal and Image Processing (ICSIP), Suzhou, China, 20–22 July 2022;
IEEE: Piscataway, NJ, USA, 2022; pp. 356–361.
24. Guo, H.; Yuan, S.; Wu, X. Logbert: Log anomaly detection via bert. In Proceedings of the 2021 International Joint Conference on
Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–8.
25. Mizutani, M. Incremental mining of system log format. In Proceedings of the 2013 IEEE International Conference on Services
Computing, Santa Clara, CA, USA, 28 June–3 July 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 595–602.
26. Shima, K. Length matters: Clustering system log messages using length of words. arXiv 2016, arXiv:1611.03213.
27. Hamooni, H.; Debnath, B.; Xu, J.; Zhang, H.; Jiang, G.; Mueen, A. Logmine: Fast pattern recognition for log analytics. In
Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, Indianapolis, IN, USA,
24–28 October 2016; pp. 1573–1582.
28. He, P.; Zhu, J.; Zheng, Z.; Lyu, M.R. Drain: An online log parsing approach with fixed depth tree. In Proceedings of the 2017 IEEE
International Conference on Web Services (ICWS), Honolulu, HI, USA, 25–30 June 2017; IEEE: Piscataway, NJ, USA, 2017; pp.
33–40.
29. Makanju, A.; Zincir-Heywood, A.N.; Milios, E.E. A lightweight algorithm for message type extraction in system application logs.
IEEE Trans. Knowl. Data Eng. 2011, 24, 1921–1936. [CrossRef]
30. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need.
In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9
December 2017; Curran Associates Inc.: New York, NY, USA; pp. 6000–6010.
31. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473.
32. Hoffer, E.; Ailon, N. Deep metric learning using triplet network. In Proceedings of the Similarity-Based Pattern Recognition:
Third International Workshop, SIMBAD 2015, Copenhagen, Denmark, 12–14 October 2015; Springer International Publishing:
Berlin/Heidelberg, Germany, 2015; pp. 84–92.
33. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef] [PubMed]
34. Shvachko, K.; Kuang, H.; Radia, S.; Chansler, R. The hadoop distributed file system. In Proceedings of the 2010 IEEE 26th
Symposium on Mass Storage Systems and Technologies (MSST), Lake Tahoe, NV, USA, 3–7 May 2010; IEEE: Piscataway, NJ, USA,
2010; pp. 1–10.
35. Sefraoui, O.; Aissaoui, M.; Eleuldj, M. OpenStack: Toward an open-source solution for cloud computing. Int. J. Comput. Appl.
2012, 55, 38–42. [CrossRef]
Picture Fuzzy Soft Matrices and Application of Their Distance
Measures to Supervised Learning: Picture Fuzzy Soft k-Nearest
Neighbor (PFS-kNN)
Samet Memiş
Department of Marine Engineering, Faculty of Maritime Studies, Bandırma Onyedi Eylül University,
Balıkesir 10200, Türkiye; samettmemis@[Link]
Abstract: This paper redefines picture fuzzy soft matrices (pfs-matrices) because of some
inconsistencies resulting from Cuong’s definition of picture fuzzy sets. Then, it introduces several
distance measures of pfs-matrices. Afterward, this paper proposes a new kNN-based classifier,
namely the Picture Fuzzy Soft k-Nearest Neighbor (PFS-kNN) classifier. The proposed classifier
utilizes the Minkowski metric of pfs-matrices to find the k nearest neighbors. Thereafter, it performs
an experimental study utilizing four UCI medical datasets and compares the suggested approach
with state-of-the-art kNN-based classifiers. To evaluate the classification performance, it
conducts ten iterations of five-fold cross-validation on all the classifiers. The findings indicate that
PFS-kNN surpasses the state-of-the-art kNN-based algorithms in 72 out of 128 performance results
based on accuracy, precision, recall, and F1-score. More specifically, the proposed method achieves
higher accuracy and F1-score results compared to the other classifiers. Simulation results show that
pfs-matrices and PFS-kNN are capable of modeling uncertainty and real-world problems. Finally, the
applications of pfs-matrices to supervised learning are discussed for further research.
Keywords: soft sets; picture fuzzy sets; picture fuzzy soft matrices; distance measures; machine
learning; k-nearest neighbor (kNN)
Figure 1. Some hybrid versions and extensions of fuzzy and soft sets [1,7,9–22].
Although f-sets and if-sets may overcome many difficulties and uncertainties [23], far
more are encountered in practice. Consider the voting process for a presidential election.
During this procedure, the electorate’s decisions can be divided into three categories:
yes, no, and abstention. To represent such a process, Cuong proposed the notion of
picture fuzzy sets (pf-sets) [16]. A pf-set has elements with the degrees of membership,
non-membership, and neutral membership denoted by μ(x), ν(x), and η(x), respectively.
Refusing to vote or not participating in the vote leads to a further quantity, the degree
of indeterminacy. In pf-sets, this degree is 1 − (μ(x) + η(x) + ν(x)), which is nonnegative
because μ(x) + η(x) + ν(x) ≤ 1 in Cuong’s definition. Even though pf-sets model the
aforementioned difficulties, the definitions and operations put forward by Cuong have
conceptual errors. Memiş [21] revised the idea of pf-sets and associated operations to
maintain consistency, where μ(x) + η(x) + ν(x) ≤ 2.
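The two constraints quoted above can be checked mechanically. The following is a minimal sketch, not part of the original study; the function names are hypothetical, and the revised check only encodes the bound μ(x) + η(x) + ν(x) ≤ 2 stated in the text (any further condition in [21] is not reproduced here).

```python
def cuong_refusal_degree(mu, eta, nu):
    """Return 1 - (mu + eta + nu), the degree of indeterminacy under
    Cuong's constraint mu + eta + nu <= 1 (hypothetical helper)."""
    if not all(0.0 <= v <= 1.0 for v in (mu, eta, nu)):
        raise ValueError("each degree must lie in [0, 1]")
    total = mu + eta + nu
    if total > 1.0 + 1e-12:
        raise ValueError("Cuong's definition requires mu + eta + nu <= 1")
    return 1.0 - total


def satisfies_revised_bound(mu, eta, nu):
    """Check only the bound mu + eta + nu <= 2 quoted from the revision."""
    in_unit_interval = all(0.0 <= v <= 1.0 for v in (mu, eta, nu))
    return in_unit_interval and (mu + eta + nu) <= 2.0 + 1e-12
```

For instance, the triple (1, 1, 0) passes the revised bound but is rejected by the check based on Cuong’s constraint.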
However, pf-sets are unable to model problems comprising parameters and alternatives
(objects) with a picture fuzzy membership (pf-membership) degree. In contrast,
pfs-sets [16,18,24] can represent problems with alternatives (objects) using pf-membership
(Figure 3), with the expert voting on whether to accept, reject, or abstain from the alternatives.
Recently, various studies have been conducted on pf-sets and pfs-sets. The idea of
a rough picture fuzzy set has been introduced, and several of its topological features,
including the lower and upper rough picture fuzzy approximation operators, have also been
investigated [25]. The creation of clustering algorithms that can explore latent knowledge
from a large number of datasets is an emerging research field in pf-sets. Distance and
similarity measures are among the most crucial tools in clustering because they establish
the level of association between two objects. Therefore, a generalized picture distance
measure has been defined and applied to picture fuzzy clustering [26]. In addition to distance
measures, picture fuzzy similarity has also been studied [27,28]. A technique for solving
decision-making problems utilizing the generalized pfs-sets and an adjustable weighted soft
discernibility matrix has been presented, and threshold functions have been defined [29].
Electronics 2023, 12, 4129
A weighted soft discernibility matrix in the generalized pfs-sets has been employed therein
in an illustrative example to demonstrate the superiority of the suggested approach.
Matrix representations of mathematical concepts, such as pfs-sets, are crucial in the
context of computerizing [30,31]. Thus, Arikrishnan and Sriram [20] define picture fuzzy
soft matrices and investigate their algebraic structures. Because the related study is based
on Cuong’s [16] study, there are some theoretical inconsistencies. Moreover, Arikrishnan
and Sriram have only focused on the algebraic structures. The study of Sahu et al. [32]
aims to analyze students’ characteristics, such as career, memory, interest, knowledge,
environment, and attitude, in order to predict the most suitable career path, enabling
students to explore and excel in their chosen field comfortably. To this end, a hybridized
distance measure using picture fuzzy numbers has been proposed to evaluate students, subjects,
and students’ characteristics for career selection. However, related studies rely only on
fictitious problem data. A study that integrates pfs-sets with Quality Function
Deployment (QFD) to propose a Multiple Criteria Group Decision-Making (MCGDM)
method has been presented [33]. In this approach, the preferences of the decision-makers
are collected in linguistic terms and transformed into Picture Fuzzy Numbers (PFNs). The
study applies the proposed MCGDM method to rank social networking sites, specifically
Facebook, WhatsApp, Instagram, and Twitter, and compares their performance.
The study of Lu et al. [34] has introduced the concept of
generalized pfs-sets by combining a picture fuzzy soft set with a fuzzy parameter set. They
discuss five main operations for generalized pfs-sets: subset, equality, union, intersection,
and complement.
Suppose the problem has picture fuzzy uncertainty and a large amount of data. In that
case, pfs-sets cannot handle the data efficiently. Therefore, the data must be processed
by computer, and matrix versions of pfs-sets are needed.
The concept of picture fuzzy soft matrices (pfs-matrices) was propounded in 2020 [20];
however, that study investigated only the algebraic structures of the concept.
To this end, this paper redefines the concept of pfs-matrices, defines
distance measures of pfs-matrices, and applies them to supervised learning to demonstrate
their modeling ability. The major contributions of this paper are as follows:
• pfs-matrices are redefined, and some of their basic properties are investigated.
• Distance measures of pfs-matrices are introduced.
• Picture fuzzy soft k-nearest neighbor (PFS-kNN) based on distance measure of pfs-matrices is proposed.
• An application of PFS-kNN to medical diagnosis is provided.
In Section 2 of the paper, definitions of pf-sets and pfs-sets are provided. In Section 3,
the motivation for redefining pfs-matrices is detailed. In Section 4, the idea of
pfs-matrices is redefined, and their properties are further examined. In Section 5, distance
measures of pfs-matrices are introduced, and their basic properties are investigated. In
Section 6, a PFS-kNN classifier is proposed. In Section 7, the proposed classifier is applied
to medical diagnosis and compared with the well-known kNN-based classifiers. Finally,
we discuss pfs-matrices and PFS-kNN and provide concluding remarks for further research.
2. Preliminaries
In this section, we present the concepts of pf-sets and pfs-sets by considering the
notations used across this study. Throughout this paper, let E and U denote the parameter and
alternative sets, respectively.
Definition 1 ([16,21]). Let f be a function such that f : E → [0, 1] × [0, 1] × [0, 1]. Then,
the graph
\[ \{(x, f(x)) : x \in E\} = \{(x, \mu(x), \eta(x), \nu(x)) : x \in E\} \]
is called a picture fuzzy set (pf-set) over E. Here, a pf-set is denoted by
\[ \left\{ {}^{\left\langle \begin{smallmatrix} \mu(x) \\ \eta(x) \\ \nu(x) \end{smallmatrix} \right\rangle} x : x \in E \right\} \]
instead of $\{(x, \mu(x), \eta(x), \nu(x)) : x \in E\}$.
Remark 1. In PF ( E), the notations graph( f ) and f are interchangeable since they have generated
each other uniquely. Thus, we prefer the notation f to graph( f ) for brevity, provided that it results
in no confusion.
{( x, α(( x, μ( x ), η ( x ), ν( x )))) : x ∈ E}
is called a picture fuzzy soft set (pfs-set) parameterized via E over U (or briefly over U).
Throughout this paper, the set of all the pfs-sets over U is symbolized by PFSE (U ).
Remark 2. In PFSE (U ), the notations graph(α) and α are interchangeable since they have gener-
ated each other uniquely. Thus, we prefer the notation α to graph(α) for brevity, provided that it
results in no confusion.
is a pfs-set over U.
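Definition 3 ([16]). Let κ : E → [0, 1] × [0, 1] × [0, 1]. Then, the graph
\[ \{(x, \kappa(x)) : x \in E\} = \left\{ {}^{\left\langle \begin{smallmatrix} \mu(x) \\ \eta(x) \\ \nu(x) \end{smallmatrix} \right\rangle} x : x \in E \right\} \]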
In this section, the set of all the pf -sets over E according to Cuong’s definition is
denoted by PFC ( E) and κ ∈ PFC ( E).
For the conditions “The empty pf-set over E is a subset of every pf-set over E” and
“Every pf-set over E is a subset of the universal pf-set over E” to hold, the definition
and operations of pf-sets in [16] must be as follows [21]:
Therefore, Memiş [21] has revised the definition and operations of pf-sets in [16] to
overcome the aforementioned inconsistencies.
In this section, the set of all the pfs-sets over U according to Cuong’s definition is
denoted by PFS_C(U) and F_A ∈ PFS_C(U).
Cuong [16] defined pfs-sets based on his own definition and operations of pf-sets. As a
result, the inconsistencies mentioned earlier also apply to his concept of pfs-sets.
Additionally, Yang et al. [18] claimed to have introduced the concept of pf-sets, even though Cuong
had already defined it in [16]. Thus, the concept of pfs-sets therein also has similar
inconsistencies. Hence, pfs-sets were redefined to deal with the inconsistencies mentioned above [22].
Furthermore, the concept of pfs-matrices has similar inconsistencies, since
Arikrishnan and Sriram [20] introduced pfs-matrices according to Cuong’s
definition [16] and defined their union, intersection, and complement.
Definition 12 ([20]). Let $F_A \in PFS_C(U)$. Then, $[a_{ij}]$ is called pfs-matrix of $F_A$ and defined by
\[
[a_{ij}] := \begin{bmatrix}
a_{11} & a_{12} & a_{13} & \dots & a_{1n} & \dots \\
a_{21} & a_{22} & a_{23} & \dots & a_{2n} & \dots \\
\vdots & \vdots & \vdots & \ddots & \vdots & \ddots \\
a_{m1} & a_{m2} & a_{m3} & \dots & a_{mn} & \dots \\
\vdots & \vdots & \vdots & \ddots & \vdots & \ddots
\end{bmatrix}
\]
Definition 13. Let $[a_{ij}], [b_{ij}] \in PFS_{AS}[U]$. For all i and j, if $\mu^a_{ij} \le \mu^b_{ij}$, $\eta^a_{ij} \le \eta^b_{ij}$, and $\nu^a_{ij} \ge \nu^b_{ij}$,
then $[a_{ij}]$ is called a submatrix of $[b_{ij}]$ and is denoted by $[a_{ij}] \,\tilde{\subseteq}\, [b_{ij}]$.
Definition 14. Let $[a_{ij}], [b_{ij}] \in PFS_{AS}[U]$. For all i and j, if $\mu^a_{ij} = \mu^b_{ij}$, $\eta^a_{ij} = \eta^b_{ij}$, and $\nu^a_{ij} = \nu^b_{ij}$,
then $[a_{ij}]$ and $[b_{ij}]$ are called equal pfs-matrices and denoted by $[a_{ij}] = [b_{ij}]$.
Definition 15 ([20]). Let $[a_{ij}], [b_{ij}], [c_{ij}] \in PFS_{AS}[U]$. For all i and j, if $\mu^c_{ij} = \max\{\mu^a_{ij}, \mu^b_{ij}\}$,
$\eta^c_{ij} = \min\{\eta^a_{ij}, \eta^b_{ij}\}$, and $\nu^c_{ij} = \min\{\nu^a_{ij}, \nu^b_{ij}\}$, then $[c_{ij}]$ is called union of $[a_{ij}]$ and $[b_{ij}]$ and denoted
by $[a_{ij}] \,\tilde{\cup}\, [b_{ij}]$.
Definition 16 ([20]). Let $[a_{ij}], [b_{ij}], [c_{ij}] \in PFS_{AS}[U]$. For all i and j, if $\mu^c_{ij} = \min\{\mu^a_{ij}, \mu^b_{ij}\}$,
$\eta^c_{ij} = \min\{\eta^a_{ij}, \eta^b_{ij}\}$, and $\nu^c_{ij} = \max\{\nu^a_{ij}, \nu^b_{ij}\}$, then $[c_{ij}]$ is called intersection of $[a_{ij}]$ and $[b_{ij}]$ and
denoted by $[a_{ij}] \,\tilde{\cap}\, [b_{ij}]$.
Definition 17 ([20]). Let $[a_{ij}], [b_{ij}] \in PFS_{AS}[U]$. For all i and j, if $\mu^b_{ij} = \nu^a_{ij}$, $\eta^b_{ij} = \eta^a_{ij}$, and
$\nu^b_{ij} = \mu^a_{ij}$, then $[b_{ij}]$ is complement of $[a_{ij}]$ and denoted by $[a_{ij}]^{\tilde{c}}$.
Definition 18. Let $[a_{ij}] \in PFS_{AS}[U]$. For all i and j, if $\mu_{ij} = 0$, $\eta_{ij} = 0$, and $\nu_{ij} = 1$, then $[a_{ij}]$ is
empty pfs-matrix and is denoted by $\left\langle \begin{smallmatrix} 0 \\ 0 \\ 1 \end{smallmatrix} \right\rangle$.
Definition 19. Let $[a_{ij}] \in PFS_{AS}[U]$. For all i and j, if $\mu_{ij} = 1$, $\eta_{ij} = 1$, and $\nu_{ij} = 0$, then $[a_{ij}]$ is
universal pfs-matrix and is denoted by $\left\langle \begin{smallmatrix} 1 \\ 1 \\ 0 \end{smallmatrix} \right\rangle$.
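Example 5. Let $[a_{ij}] \in PFS_{AS}[U]$ such that
\[
[a_{ij}] = \begin{bmatrix}
\left\langle \begin{smallmatrix} 0.4 \\ 0.3 \\ 0.1 \end{smallmatrix} \right\rangle & \left\langle \begin{smallmatrix} 0.2 \\ 0.4 \\ 0.3 \end{smallmatrix} \right\rangle \\
\left\langle \begin{smallmatrix} 0.7 \\ 0.1 \\ 0.1 \end{smallmatrix} \right\rangle & \left\langle \begin{smallmatrix} 0.1 \\ 0.5 \\ 0.2 \end{smallmatrix} \right\rangle
\end{bmatrix}.
\]
Then,
\[
[a_{ij}] \,\tilde{\cup}\, \left\langle \begin{smallmatrix} 0 \\ 0 \\ 1 \end{smallmatrix} \right\rangle
= \begin{bmatrix}
\left\langle \begin{smallmatrix} 0.4 \\ 0 \\ 0.1 \end{smallmatrix} \right\rangle & \left\langle \begin{smallmatrix} 0.2 \\ 0 \\ 0.3 \end{smallmatrix} \right\rangle \\
\left\langle \begin{smallmatrix} 0.7 \\ 0 \\ 0.1 \end{smallmatrix} \right\rangle & \left\langle \begin{smallmatrix} 0.1 \\ 0 \\ 0.2 \end{smallmatrix} \right\rangle
\end{bmatrix}
\neq \begin{bmatrix}
\left\langle \begin{smallmatrix} 0.4 \\ 0.3 \\ 0.1 \end{smallmatrix} \right\rangle & \left\langle \begin{smallmatrix} 0.2 \\ 0.4 \\ 0.3 \end{smallmatrix} \right\rangle \\
\left\langle \begin{smallmatrix} 0.7 \\ 0.1 \\ 0.1 \end{smallmatrix} \right\rangle & \left\langle \begin{smallmatrix} 0.1 \\ 0.5 \\ 0.2 \end{smallmatrix} \right\rangle
\end{bmatrix}
= [a_{ij}]
\]
and
\[
[a_{ij}] \,\tilde{\cup}\, \left\langle \begin{smallmatrix} 1 \\ 1 \\ 0 \end{smallmatrix} \right\rangle
= \begin{bmatrix}
\left\langle \begin{smallmatrix} 1 \\ 0.3 \\ 0 \end{smallmatrix} \right\rangle & \left\langle \begin{smallmatrix} 1 \\ 0.4 \\ 0 \end{smallmatrix} \right\rangle \\
\left\langle \begin{smallmatrix} 1 \\ 0.1 \\ 0 \end{smallmatrix} \right\rangle & \left\langle \begin{smallmatrix} 1 \\ 0.5 \\ 0 \end{smallmatrix} \right\rangle
\end{bmatrix}
\neq \left\langle \begin{smallmatrix} 1 \\ 1 \\ 0 \end{smallmatrix} \right\rangle.
\]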
Consequently, since the aforesaid definitions and operations of pfs-matrices and how
they operate are inconsistent, this concept and its operations must be redefined.
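The inconsistency can be reproduced numerically. The sketch below is an illustration only: it encodes the union of Definition 15 and the empty and universal pfs-matrices of Definitions 18 and 19 as plain tuples (μ, η, ν), with the 2 × 2 entries taken from Example 5; all function and variable names are hypothetical.

```python
def union_def15(a, b):
    """Entrywise union of Definition 15: (max mu, min eta, min nu)."""
    return (max(a[0], b[0]), min(a[1], b[1]), min(a[2], b[2]))


def union_with_constant(matrix, constant):
    """Union of a pfs-matrix with a constant pfs-matrix of the same size."""
    return [[union_def15(entry, constant) for entry in row] for row in matrix]


EMPTY_AS = (0.0, 0.0, 1.0)      # empty pfs-matrix entry of Definition 18
UNIVERSAL_AS = (1.0, 1.0, 0.0)  # universal pfs-matrix entry of Definition 19

# Entries (mu, eta, nu) of the matrix used in Example 5.
A = [[(0.4, 0.3, 0.1), (0.2, 0.4, 0.3)],
     [(0.7, 0.1, 0.1), (0.1, 0.5, 0.2)]]

with_empty = union_with_constant(A, EMPTY_AS)
with_universal = union_with_constant(A, UNIVERSAL_AS)
```

Under these definitions, `with_empty` differs from `A` (every neutral degree collapses to 0), and `with_universal` is not the universal pfs-matrix, which is the behavior the redefinition is meant to remove.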
pfs-sets in addressing various issues with the uncertainties modeled by membership,
non-membership, and neutral degrees, namely picture fuzzy uncertainties. These results
suggest that researching the idea of pfs-sets is worthwhile. However, it is important to
note that these ideas have drawbacks, such as complexity and lengthy computation times.
Therefore, it is crucial to understand their matrix representations, i.e., pfs-matrices, and
ensure their theoretical consistency in the context of computerizing the aforementioned
problems. For instance, utilizing pfs-sets in machine learning requires pfs-matrices, which are
matrix representations of pfs-sets, together with consistent theoretical definitions and
operations.
Thus, in the present section, we redefine pfs-matrices consistently and present
some of their fundamental properties. Since some of the propositions in this section have
elementary proofs, only the propositions with more involved proofs are proved.
Definition 20. Let α ∈ PFSE (U ) (See Definition 2). Then, [ aij ] is called pfs-matrix of α and
defined by
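\[
[a_{ij}] := \begin{bmatrix}
a_{11} & a_{12} & a_{13} & \dots & a_{1n} & \dots \\
a_{21} & a_{22} & a_{23} & \dots & a_{2n} & \dots \\
\vdots & \vdots & \vdots & \ddots & \vdots & \ddots \\
a_{m1} & a_{m2} & a_{m3} & \dots & a_{mn} & \dots \\
\vdots & \vdots & \vdots & \ddots & \vdots & \ddots
\end{bmatrix}
\]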
such that for i ∈ {1, 2, · · · } and j ∈ {1, 2, · · · },
μ(x )
j
aij := α η ( x j ) x j ( ui )
ν( x j )
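Definition 21. Let $[a_{ij}] \in PFS_E[U]$. For all i and j, if $\mu_{ij} = \lambda$, $\eta_{ij} = \varepsilon$, and $\nu_{ij} = \omega$, then $[a_{ij}]$ is
$(\lambda, \varepsilon, \omega)$-pfs-matrix and denoted by $\left\langle \begin{smallmatrix} \lambda \\ \varepsilon \\ \omega \end{smallmatrix} \right\rangle$. Moreover, $\left\langle \begin{smallmatrix} 0 \\ 1 \\ 1 \end{smallmatrix} \right\rangle$ is empty pfs-matrix and $\left\langle \begin{smallmatrix} 1 \\ 0 \\ 0 \end{smallmatrix} \right\rangle$ is
universal pfs-matrix.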
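Definition 22. Let $[a_{ij}], [b_{ij}], [c_{ij}] \in PFS_E[U]$, $I_E := \{ j : x_j \in E \}$, and $R \subseteq I_E$. For all i and j, if
\[
\mu^c_{ij} = \begin{cases} \mu^a_{ij}, & j \in R \\ \mu^b_{ij}, & j \in I_E \setminus R \end{cases}, \quad
\eta^c_{ij} = \begin{cases} \eta^a_{ij}, & j \in R \\ \eta^b_{ij}, & j \in I_E \setminus R \end{cases}, \quad \text{and} \quad
\nu^c_{ij} = \begin{cases} \nu^a_{ij}, & j \in R \\ \nu^b_{ij}, & j \in I_E \setminus R \end{cases}
\]
then $[c_{ij}]$ is called Rb-restriction of $[a_{ij}]$ and is denoted by $[(a_{Rb})_{ij}]$. Briefly, if $[b_{ij}] = \left\langle \begin{smallmatrix} 0 \\ 1 \\ 1 \end{smallmatrix} \right\rangle$, then
$[(a_R)_{ij}]$ can be used instead of $[(a_{Rb})_{ij}]$. It is clear that
\[
(a_R)_{ij} = \begin{cases}
\left\langle \begin{smallmatrix} \mu^a_{ij} \\ \eta^a_{ij} \\ \nu^a_{ij} \end{smallmatrix} \right\rangle, & j \in R \\[6pt]
\left\langle \begin{smallmatrix} 0 \\ 1 \\ 1 \end{smallmatrix} \right\rangle, & j \in I_E \setminus R
\end{cases}
\]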
Definition 23. Let $[a_{ij}], [b_{ij}] \in PFS_E[U]$. For all i and j, if $\mu^a_{ij} \le \mu^b_{ij}$, $\eta^a_{ij} \ge \eta^b_{ij}$, and $\nu^a_{ij} \ge \nu^b_{ij}$,
then $[a_{ij}]$ is called a submatrix of $[b_{ij}]$ and denoted by $[a_{ij}] \,\tilde{\subseteq}\, [b_{ij}]$.
Definition 24. Let $[a_{ij}], [b_{ij}] \in PFS_E[U]$. For all i and j, if $\mu^a_{ij} = \mu^b_{ij}$, $\eta^a_{ij} = \eta^b_{ij}$, and $\nu^a_{ij} = \nu^b_{ij}$,
then $[a_{ij}]$ and $[b_{ij}]$ are called equal pfs-matrices and denoted by $[a_{ij}] = [b_{ij}]$.
Remark 3. From Proposition 1, it is straightforward that the inclusion relation herein is a partial
ordering relation in $PFS_E[U]$.
Definition 26. Let $[a_{ij}], [b_{ij}], [c_{ij}] \in PFS_E[U]$. For all i and j, if $\mu^c_{ij} = \max\{\mu^a_{ij}, \mu^b_{ij}\}$, $\eta^c_{ij} = \min\{\eta^a_{ij}, \eta^b_{ij}\}$,
and $\nu^c_{ij} = \min\{\nu^a_{ij}, \nu^b_{ij}\}$, then $[c_{ij}]$ is called union of $[a_{ij}]$ and $[b_{ij}]$ and denoted by
$[a_{ij}] \,\tilde{\cup}\, [b_{ij}]$.
Definition 27. Let $[a_{ij}], [b_{ij}], [c_{ij}] \in PFS_E[U]$. For all i and j, if $\mu^c_{ij} = \min\{\mu^a_{ij}, \mu^b_{ij}\}$, $\eta^c_{ij} = \max\{\eta^a_{ij}, \eta^b_{ij}\}$,
and $\nu^c_{ij} = \max\{\nu^a_{ij}, \nu^b_{ij}\}$, then $[c_{ij}]$ is called intersection of $[a_{ij}]$ and $[b_{ij}]$ and denoted
by $[a_{ij}] \,\tilde{\cap}\, [b_{ij}]$.
Example 7. Assume that two pfs-matrices [ aij ] and [bij ] are as follows:
⎡ 0.2 0 1 ⎤ ⎡ 0.6 0.7 0.5 ⎤
0.3 1 0 0.2 0.2 0.4
⎢ ⎥ ⎢ ⎥
⎢ 0.5 1
0.6 0.9 ⎥
0 ⎢ 0.1 0
1 0.1 ⎥
0.1
⎢ 0.1 ⎥ ⎢ ⎥
⎢ ⎥ ⎢ 0
⎥
[ aij ] = ⎢ 0.5 0.4 0 ⎥ and [bij ] = ⎢ 1 0 0.8 ⎥
⎢ 0.2 0 0.1 ⎥ ⎢ ⎥
⎢ 0.8 0.5 ⎥ ⎢ 1 0.1
0 0.1
1 ⎥
⎣ 1 ⎦ ⎣ 0.4 ⎦
0 0 0.1 0.3 0.3 0
0 0 0.2 0.3 0.2 0
Then,
⎡ 0.6 0.7 1 ⎤ ⎡ 0.2 0 0.5 ⎤
0.2 0.2 0 0.3 1 0.4
⎢ ⎥ ⎢ ⎥
⎢ 0.1 0
1 0.9 ⎥
0 ⎢ 0.5 1
0.6 0.1 ⎥
0.1
⎢ 0.1 ⎥ ⎢ ⎥
˜ bij ] = ⎢ ⎥ ˜ bij ] = ⎢ ⎥
0
[ aij ]∪[ ⎢ 0.5 0 0 ⎥ and [ aij ]∩[ ⎢ 1 0.4 0.8 ⎥
⎢ 0.2 0 0.1 ⎥ ⎢ ⎥
⎢ 0.6 1 ⎥ ⎢ 1 0
0.1
0.1
0.5 ⎥
⎣ 1 ⎦ ⎣ 0.4 ⎦
0 0 0 0.3 0.3 0.1
0 0 0 0.3 0.2 0.2
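Definition 28. Let $[a_{ij}], [b_{ij}], [c_{ij}] \in PFS_E[U]$. For all i and j, if $\mu^c_{ij} = \min\{\mu^a_{ij}, \nu^b_{ij}\}$, $\eta^c_{ij} = \max\{\eta^a_{ij}, 1 - \eta^b_{ij}\}$,
and $\nu^c_{ij} = \max\{\nu^a_{ij}, \mu^b_{ij}\}$, then $[c_{ij}]$ is called difference between $[a_{ij}]$ and $[b_{ij}]$ and
denoted by $[a_{ij}] \,\tilde{\setminus}\, [b_{ij}]$.
Remark 4. It must be emphasized that the difference operation herein is non-commutative and
non-associative.
Definition 29. Let $[a_{ij}], [b_{ij}] \in PFS_E[U]$. For all i and j, if $\mu^b_{ij} = \nu^a_{ij}$, $\eta^b_{ij} = 1 - \eta^a_{ij}$, and $\nu^b_{ij} = \mu^a_{ij}$,
then $[b_{ij}]$ is complement of $[a_{ij}]$ and denoted by $[a_{ij}]^{\tilde{c}}$ or $[a^{\tilde{c}}_{ij}]$. It is clear that
$[a_{ij}]^{\tilde{c}} = \left\langle \begin{smallmatrix} 1 \\ 0 \\ 0 \end{smallmatrix} \right\rangle \,\tilde{\setminus}\, [a_{ij}]$.
iii. $[a_{ij}] \,\tilde{\setminus}\, [b_{ij}] = [a_{ij}] \,\tilde{\cap}\, [b_{ij}]^{\tilde{c}}$
Proposition 5. Let $[a_{ij}], [b_{ij}] \in PFS_E[U]$. Then, the following De Morgan’s laws are valid.
i. $\left( [a_{ij}] \,\tilde{\cup}\, [b_{ij}] \right)^{\tilde{c}} = [a_{ij}]^{\tilde{c}} \,\tilde{\cap}\, [b_{ij}]^{\tilde{c}}$
ii. $\left( [a_{ij}] \,\tilde{\cap}\, [b_{ij}] \right)^{\tilde{c}} = [a_{ij}]^{\tilde{c}} \,\tilde{\cup}\, [b_{ij}]^{\tilde{c}}$
\[ {} = [a_{ij}]^{\tilde{c}} \,\tilde{\cap}\, [b_{ij}]^{\tilde{c}} \]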
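The redefined operations can be exercised entrywise on triples (μ, η, ν). The following is a minimal sketch under that encoding, not the author's implementation; the function names are hypothetical, and the formulas follow Definitions 26 to 29 as stated above.

```python
def pfs_union(a, b):
    """Definition 26: (max mu, min eta, min nu)."""
    return (max(a[0], b[0]), min(a[1], b[1]), min(a[2], b[2]))


def pfs_intersection(a, b):
    """Definition 27: (min mu, max eta, max nu)."""
    return (min(a[0], b[0]), max(a[1], b[1]), max(a[2], b[2]))


def pfs_complement(a):
    """Definition 29: (nu, 1 - eta, mu)."""
    return (a[2], 1.0 - a[1], a[0])


def pfs_difference(a, b):
    """Definition 28: (min{mu_a, nu_b}, max{eta_a, 1 - eta_b}, max{nu_a, mu_b})."""
    return (min(a[0], b[2]), max(a[1], 1.0 - b[1]), max(a[2], b[0]))


EMPTY = (0.0, 1.0, 1.0)      # empty pfs-matrix entry (Definition 21)
UNIVERSAL = (1.0, 0.0, 0.0)  # universal pfs-matrix entry (Definition 21)

a = (0.4, 0.3, 0.1)
b = (0.2, 0.4, 0.3)

# De Morgan's laws of Proposition 5, checked on one pair of entries.
law_i = pfs_complement(pfs_union(a, b)) == pfs_intersection(pfs_complement(a), pfs_complement(b))
law_ii = pfs_complement(pfs_intersection(a, b)) == pfs_union(pfs_complement(a), pfs_complement(b))
```

With these definitions, the union with the empty entry returns the entry itself and the union with the universal entry returns the universal entry, in contrast to the behavior under the earlier definitions.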
Definition 30. Let $[a_{ij}], [b_{ij}], [c_{ij}] \in PFS_E[U]$. For all i and j, if
and
\[ \nu^c_{ij} = \min\left\{ \max\{\nu^a_{ij}, \mu^b_{ij}\}, \max\{\nu^b_{ij}, \mu^a_{ij}\} \right\} \]
then $[c_{ij}]$ is called symmetric difference between $[a_{ij}]$ and $[b_{ij}]$ and denoted by $[a_{ij}] \,\tilde{\triangle}\, [b_{ij}]$.
Remark 5. It must be emphasized that the symmetric difference operation herein is non-associative.
Secondly, Minkowski, Euclidean, and Hamming metrics over PFSE [U ] are propounded.
Thereafter, their three properties are investigated.
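Proposition 7. The function $d_M^p : PFS_E[U] \times PFS_E[U] \to \mathbb{R}$ defined by
\[
d_M^p([a_{ij}], [b_{ij}]) := \left( \frac{1}{3} \sum_{i=1}^{m} \sum_{j=1}^{n} \left( \left|\mu^a_{ij} - \mu^b_{ij}\right|^p + \left|\eta^a_{ij} - \eta^b_{ij}\right|^p + \left|\nu^a_{ij} - \nu^b_{ij}\right|^p + \left|\pi^a_{ij} - \pi^b_{ij}\right|^p \right) \right)^{\frac{1}{p}}
\]
such that $p \in \mathbb{N}^+$ is Minkowski metric over $PFS_E[U]$. Its normalized version, namely normalized
Minkowski metric, is defined as follows:
\[
\hat{d}_M^p([a_{ij}], [b_{ij}]) := \left( \frac{1}{3mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \left( \left|\mu^a_{ij} - \mu^b_{ij}\right|^p + \left|\eta^a_{ij} - \eta^b_{ij}\right|^p + \left|\nu^a_{ij} - \nu^b_{ij}\right|^p + \left|\pi^a_{ij} - \pi^b_{ij}\right|^p \right) \right)^{\frac{1}{p}}
\]
such that $p \in \mathbb{N}^+$.
Specifically, $d_M^1$ and $d_M^2$ are Hamming and Euclidean metrics and represented by
$d_H$ and $d_E$, respectively. Moreover, $\hat{d}_M^1$ and $\hat{d}_M^2$ are normalized Hamming and Euclidean
metrics and are represented by $\hat{d}_H$ and $\hat{d}_E$, respectively.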
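As an illustration of the two formulas, the sketch below computes the Minkowski metric and its normalized version for pfs-matrices stored as nested lists of triples (μ, η, ν). It is a sketch under stated assumptions, not the author's code: the function names are hypothetical, and the indeterminacy degree is taken as π = 1 − μ − ν, which is how it appears in the proof of the proposition.

```python
def indeterminacy(entry):
    """Assumed indeterminacy degree pi = 1 - mu - nu of a triple (mu, eta, nu)."""
    mu, _eta, nu = entry
    return 1.0 - mu - nu


def minkowski_distance(A, B, p=2, normalized=False):
    """Minkowski metric between two m x n pfs-matrices given as lists of rows."""
    m, n = len(A), len(A[0])
    total = 0.0
    for row_a, row_b in zip(A, B):
        for a, b in zip(row_a, row_b):
            total += abs(a[0] - b[0]) ** p
            total += abs(a[1] - b[1]) ** p
            total += abs(a[2] - b[2]) ** p
            total += abs(indeterminacy(a) - indeterminacy(b)) ** p
    scale = 3.0 * m * n if normalized else 3.0
    return (total / scale) ** (1.0 / p)


# Empty and universal 2 x 3 pfs-matrices of the redefined theory.
empty = [[(0.0, 1.0, 1.0)] * 3 for _ in range(2)]
universal = [[(1.0, 0.0, 0.0)] * 3 for _ in range(2)]
d_plain = minkowski_distance(empty, universal, p=2)
d_norm = minkowski_distance(empty, universal, p=2, normalized=True)
```

Setting `p=1` or `p=2` gives the Hamming and Euclidean cases. For the empty and universal matrices, the plain distance equals the p-th root of mn and the normalized distance equals 1.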
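Proof. Let $[a_{ij}], [b_{ij}], [c_{ij}] \in PFS_E[U]$ and $p \in \mathbb{N}^+$. Satisfying of $d_M^p$ the conditions i and ii is
\[
\begin{aligned}
{} &\le \left( \frac{1}{3} \sum_{i=1}^{m} \sum_{j=1}^{n} \left( \left|\mu^a_{ij} - \mu^c_{ij}\right|^p + \left|\eta^a_{ij} - \eta^c_{ij}\right|^p + \left|\nu^a_{ij} - \nu^c_{ij}\right|^p + \left|\pi^a_{ij} - \pi^c_{ij}\right|^p \right) \right)^{\frac{1}{p}} \\
&\quad + \left( \frac{1}{3} \sum_{i=1}^{m} \sum_{j=1}^{n} \left( \left|\mu^c_{ij} - \mu^b_{ij}\right|^p + \left|\eta^c_{ij} - \eta^b_{ij}\right|^p + \left|\nu^c_{ij} - \nu^b_{ij}\right|^p + \left|\pi^c_{ij} - \pi^b_{ij}\right|^p \right) \right)^{\frac{1}{p}} \\
&= d_M^p([a_{ij}], [c_{ij}]) + d_M^p([c_{ij}], [b_{ij}])
\end{aligned}
\]
Moreover, $0 \le |\mu^a_{ij} - \mu^b_{ij}| \le 1$, $0 \le |\eta^a_{ij} - \eta^b_{ij}| \le 1$, $0 \le |\nu^a_{ij} - \nu^b_{ij}| \le 1$, and $0 \le |\pi^a_{ij} - \pi^b_{ij}| \le 1$
because $0 \le \mu^a_{ij}, \mu^b_{ij}, \eta^a_{ij}, \eta^b_{ij}, \nu^a_{ij}, \nu^b_{ij}, \pi^a_{ij}, \pi^b_{ij} \le 1$, for all $i \in I_m$ and $j \in I_n$. Hence,
\[
\begin{aligned}
{} &= \mu^a_{ij} + \mu^b_{ij} + |\eta^a_{ij} - \eta^b_{ij}| + \nu^a_{ij} + \nu^b_{ij} + (1 - \mu^a_{ij} - \nu^a_{ij}) + (1 - \mu^b_{ij} - \nu^b_{ij}) \\
&= 2 + |\eta^a_{ij} - \eta^b_{ij}| \\
&\le 3
\end{aligned}
\]
Then,
\[
\begin{aligned}
\left( \frac{1}{3mn} \sum_{i=1}^{m} \sum_{j=1}^{n} 0 \right)^{\frac{1}{p}} \le \hat{d}_M^p([a_{ij}], [b_{ij}]) &\le \left( \frac{1}{3mn} \sum_{i=1}^{m} \sum_{j=1}^{n} 3 \right)^{\frac{1}{p}} \\
0 \le \hat{d}_M^p([a_{ij}], [b_{ij}]) &\le \left( \frac{1}{3mn} \, 3mn \right)^{\frac{1}{p}} \\
0 \le \hat{d}_M^p([a_{ij}], [b_{ij}]) &\le 1
\end{aligned}
\]
Proposition 8. Let $\left\langle \begin{smallmatrix} 0 \\ 1 \\ 1 \end{smallmatrix} \right\rangle_{m \times n}, \left\langle \begin{smallmatrix} 1 \\ 0 \\ 0 \end{smallmatrix} \right\rangle_{m \times n} \in PFS_E[U]$ and $p \in \mathbb{N}^+$. Then,
\[
d_M^p\left( \left\langle \begin{smallmatrix} 0 \\ 1 \\ 1 \end{smallmatrix} \right\rangle, \left\langle \begin{smallmatrix} 1 \\ 0 \\ 0 \end{smallmatrix} \right\rangle \right) = \sqrt[p]{mn}
\quad \text{and} \quad
\hat{d}_M^p\left( \left\langle \begin{smallmatrix} 0 \\ 1 \\ 1 \end{smallmatrix} \right\rangle, \left\langle \begin{smallmatrix} 1 \\ 0 \\ 0 \end{smallmatrix} \right\rangle \right) = 1
\]
Proposition 10. Let $[a_{ij}]_{m \times n}, [b_{ij}]_{m \times n}, [c_{ij}]_{m \times n} \in PFS_E[U]$ and $p \in \mathbb{N}^+$. Then,
i. $[a_{ij}] \,\tilde{\subseteq}\, [b_{ij}] \,\tilde{\subseteq}\, [c_{ij}] \Rightarrow d_M^p([a_{ij}], [b_{ij}]) \le d_M^p([a_{ij}], [c_{ij}]) \wedge d_M^p([b_{ij}], [c_{ij}]) \le d_M^p([a_{ij}], [c_{ij}])$
ii. $[a_{ij}] \,\tilde{\subseteq}\, [b_{ij}] \,\tilde{\subseteq}\, [c_{ij}] \Rightarrow \hat{d}_M^p([a_{ij}], [b_{ij}]) \le \hat{d}_M^p([a_{ij}], [c_{ij}]) \wedge \hat{d}_M^p([b_{ij}], [c_{ij}]) \le \hat{d}_M^p([a_{ij}], [c_{ij}])$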
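Definition 33. Consider the training matrix $(D_{train})_{m_1 \times n}$ attained from $D = [d_{ij}]_{m \times (n+1)}$,
$i \in I_{m_1}$, and $j \in I_n$. Then, the matrix defined by
\[
\tilde{d}_{ij\text{-}train} := \begin{cases}
\dfrac{d_{ij\text{-}train} - \min\limits_{k \in I_m}\{d_{kj}\}}{\max\limits_{k \in I_m}\{d_{kj}\} - \min\limits_{k \in I_m}\{d_{kj}\}}, & \max\limits_{k \in I_m}\{d_{kj}\} \neq \min\limits_{k \in I_m}\{d_{kj}\} \\[12pt]
1, & \max\limits_{k \in I_m}\{d_{kj}\} = \min\limits_{k \in I_m}\{d_{kj}\}
\end{cases}
\]
is called feature-fuzzification matrix of $D_{train}$, namely column normalized matrix of $D_{train}$, and it is
denoted by $\tilde{D}_{train} = [\tilde{d}_{ij\text{-}train}]_{m_1 \times n}$.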
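Definition 34. Consider the testing matrix $(D_{test})_{m_2 \times n}$ attained from $D = [d_{ij}]_{m \times (n+1)}$, $i \in I_{m_2}$,
and $j \in I_n$. Then, the matrix defined by
\[
\tilde{d}_{ij\text{-}test} := \begin{cases}
\dfrac{d_{ij\text{-}test} - \min\limits_{k \in I_m}\{d_{kj}\}}{\max\limits_{k \in I_m}\{d_{kj}\} - \min\limits_{k \in I_m}\{d_{kj}\}}, & \max\limits_{k \in I_m}\{d_{kj}\} \neq \min\limits_{k \in I_m}\{d_{kj}\} \\[12pt]
1, & \max\limits_{k \in I_m}\{d_{kj}\} = \min\limits_{k \in I_m}\{d_{kj}\}
\end{cases}
\]
is called feature-fuzzification matrix of $D_{test}$, namely column normalized matrix of $D_{test}$, and it is
denoted by $\tilde{D}_{test} = [\tilde{d}_{ij\text{-}test}]_{m_2 \times n}$.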
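The feature fuzzification of the two definitions above is a column-wise min-max normalization. The sketch below is illustrative only: the function name and arguments are hypothetical, and it assumes, as the index set $I_m$ in the definitions suggests, that the column minima and maxima are taken over all m rows of the data matrix D (passed as `reference_rows`, without the class column).

```python
def feature_fuzzify(rows, reference_rows):
    """Column-normalize `rows` using the min and max of each column of `reference_rows`.

    A constant column (max equal to min) is mapped to 1, as in the definitions.
    """
    n_features = len(reference_rows[0])
    col_min = [min(r[j] for r in reference_rows) for j in range(n_features)]
    col_max = [max(r[j] for r in reference_rows) for j in range(n_features)]
    fuzzified = []
    for row in rows:
        new_row = []
        for j, value in enumerate(row):
            if col_max[j] != col_min[j]:
                new_row.append((value - col_min[j]) / (col_max[j] - col_min[j]))
            else:
                new_row.append(1.0)
        fuzzified.append(new_row)
    return fuzzified


# Hypothetical data: three samples, two features; the second feature is constant.
D = [[1.0, 5.0], [3.0, 5.0], [2.0, 5.0]]
train_fuzzified = feature_fuzzify(D[:2], D)
test_fuzzified = feature_fuzzify(D[2:], D)
```

Because the same column statistics are used for both calls, every fuzzified training and testing value lies in [0, 1].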
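Definition 35. Let $\tilde{D}_{train} = [\tilde{d}_{ij\text{-}train}]_{m_1 \times n}$ be a feature-fuzzification matrix of $(D_{train})_{m_1 \times n}$.
Then, the matrix
\[
\tilde{\tilde{D}}^{\lambda}_{train} = [\tilde{\tilde{d}}^{\lambda}_{train\text{-}ij}] = \left[ \left\langle \begin{smallmatrix} \mu_{ij}^{\tilde{\tilde{D}}^{\lambda}_{train}} \\ \eta_{ij}^{\tilde{\tilde{D}}^{\lambda}_{train}} \\ \nu_{ij}^{\tilde{\tilde{D}}^{\lambda}_{train}} \end{smallmatrix} \right\rangle \right]_{m_1 \times n}
\]
is called the picture fuzzification of $\tilde{D}_{train}$ and is defined by
\[
\mu_{ij}^{\tilde{\tilde{D}}^{\lambda}_{train}} := 1 - (1 - \tilde{d}_{ij\text{-}train})^{\lambda}, \quad
\eta_{ij}^{\tilde{\tilde{D}}^{\lambda}_{train}} := \frac{\tilde{d}_{ij\text{-}train}}{\lambda}, \quad \text{and} \quad
\nu_{ij}^{\tilde{\tilde{D}}^{\lambda}_{train}} := (1 - \tilde{d}_{ij\text{-}train})^{\lambda(\lambda+1)}
\]
such that $i \in I_{m_1}$, $j \in I_n$, and $\lambda \in [0, \infty)$.
Definition 36. Let $\tilde{D}_{test} = [\tilde{d}_{ij\text{-}test}]_{m_2 \times n}$ be a feature-fuzzification matrix of $(D_{test})_{m_2 \times n}$. Then,
the matrix
\[
\tilde{\tilde{D}}^{\lambda}_{test} = [\tilde{\tilde{d}}^{\lambda}_{test\text{-}ij}] = \left[ \left\langle \begin{smallmatrix} \mu_{ij}^{\tilde{\tilde{D}}^{\lambda}_{test}} \\ \eta_{ij}^{\tilde{\tilde{D}}^{\lambda}_{test}} \\ \nu_{ij}^{\tilde{\tilde{D}}^{\lambda}_{test}} \end{smallmatrix} \right\rangle \right]_{m_2 \times n}
\]
is called the picture fuzzification of $\tilde{D}_{test}$ and is defined by
\[
\mu_{ij}^{\tilde{\tilde{D}}^{\lambda}_{test}} := 1 - (1 - \tilde{d}_{ij\text{-}test})^{\lambda}, \quad
\eta_{ij}^{\tilde{\tilde{D}}^{\lambda}_{test}} := \frac{\tilde{d}_{ij\text{-}test}}{\lambda}, \quad \text{and} \quad
\nu_{ij}^{\tilde{\tilde{D}}^{\lambda}_{test}} := (1 - \tilde{d}_{ij\text{-}test})^{\lambda(\lambda+1)}
\]
such that $i \in I_{m_2}$, $j \in I_n$, and $\lambda \in [0, \infty)$.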
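A possible rendering of the picture fuzzification step is sketched below. It rests on two assumptions that should be checked against the published formulas: the neutral degree is read as the fuzzified value divided by λ, and λ is restricted to positive values so that this division is defined. The function names are hypothetical.

```python
def picture_fuzzify_value(d, lam):
    """Map a fuzzified feature value d in [0, 1] to a triple (mu, eta, nu).

    Assumed reading of the definitions: mu = 1 - (1 - d)^lam,
    eta = d / lam, nu = (1 - d)^(lam * (lam + 1)), with lam > 0.
    """
    if lam <= 0:
        raise ValueError("this sketch assumes lam > 0")
    mu = 1.0 - (1.0 - d) ** lam
    eta = d / lam
    nu = (1.0 - d) ** (lam * (lam + 1.0))
    return (mu, eta, nu)


def picture_fuzzify(fuzzified_rows, lam):
    """Apply picture_fuzzify_value to every entry of a feature-fuzzification matrix."""
    return [[picture_fuzzify_value(d, lam) for d in row] for row in fuzzified_rows]


example = picture_fuzzify([[0.5, 1.0]], lam=1.0)
```

Since the exponent of ν is λ(λ + 1) ≥ λ for positive λ, the membership and non-membership degrees always satisfy μ + ν ≤ 1 under this reading.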
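Definition 37. Let $(\tilde{D}_{train})_{m_1 \times n}$ be a feature-fuzzification matrix of $(D_{train})_{m_1 \times n}$ and
\[
\tilde{\tilde{D}}^{\lambda}_{train} = [\tilde{\tilde{d}}^{\lambda}_{train\text{-}ij}] = \left[ \left\langle \begin{smallmatrix} \mu_{ij}^{\tilde{\tilde{D}}^{\lambda}_{train}} \\ \eta_{ij}^{\tilde{\tilde{D}}^{\lambda}_{train}} \\ \nu_{ij}^{\tilde{\tilde{D}}^{\lambda}_{train}} \end{smallmatrix} \right\rangle \right]_{m_1 \times n}
\]
be the picture fuzzification of $\tilde{D}_{train}$. Then, the pfs-matrix $\left[ b_{1j}^{\tilde{\tilde{D}}^{\lambda}_{k\text{-}train}} \right]_{1 \times n}$
is the training pfs-matrix attained by kth row of $\tilde{\tilde{D}}^{\lambda}_{train}$ and is defined by
\[
b_{1j}^{\tilde{\tilde{D}}^{\lambda}_{k\text{-}train}} := \left\langle \begin{smallmatrix} \mu_{kj}^{\tilde{\tilde{D}}^{\lambda}_{train}} \\ \eta_{kj}^{\tilde{\tilde{D}}^{\lambda}_{train}} \\ \nu_{kj}^{\tilde{\tilde{D}}^{\lambda}_{train}} \end{smallmatrix} \right\rangle
\]
such that $k \in I_{m_1}$ and $j \in I_n$.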
Secondly, a new classifier named PFS-kNN employing the Minkowski metric of pfs-matrices
is suggested; its pseudocode is presented in Algorithm 1. In Line 1, the feature
fuzzifications of the testing and training matrices, which are required for feature picture
fuzzification, are obtained. In Line 2, the feature picture fuzzifications of the
testing and training matrices are computed from their feature fuzzification versions. The aim herein is
to prepare the data for the distance calculation of pfs-matrices.
In Lines 3–4, the ith testing pfs-matrix is constructed by extracting the ith sample from the
feature picture fuzzification of the testing matrix. Similarly, in Lines 5–6, the jth training
pfs-matrix is constructed by extracting the jth sample from the feature picture fuzzification of
the training matrix. In Line 7, the distance between the ith test sample and the jth training
sample is calculated utilizing the Minkowski metric over the pfs-matrices in accordance
with Proposition 7, and Dm_{j1} is attained. In Line 9, the k nearest neighbors according to the
matrix of picture fuzzy soft distances, namely [Dm_{j1}], are determined. In Line 10, the most
frequent class label (predicted class label) among the determined k nearest neighbors is obtained.
In Line 11, the predicted class label, particularly the diagnosis label in medical diagnosis, is
assigned to the test sample. Finally, in Lines 12–13, the predicted label (class) matrix is
created for the test queries.
8: end for
9: Find k-nearest neighbor using [ Dm j1 ]
10: Find the most repetitive class label in the considered k-nearest neighbor
11: tk1 ← most repetitive class label (predicted class label)
12: end for
13: return Tm2 ×1
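The classification loop described for Algorithm 1 can be sketched as follows. This is a simplified illustration rather than the author's implementation: it assumes the samples are already picture-fuzzified rows of triples (μ, η, ν), takes the indeterminacy degree as π = 1 − μ − ν, and breaks voting ties in favor of the label seen first among the nearest neighbors. All names are hypothetical.

```python
from collections import Counter


def row_distance(row_a, row_b, p=1):
    """Minkowski metric between two 1 x n pfs-matrices (lists of triples)."""
    total = 0.0
    for (mu_a, eta_a, nu_a), (mu_b, eta_b, nu_b) in zip(row_a, row_b):
        pi_a = 1.0 - mu_a - nu_a  # assumed indeterminacy degree
        pi_b = 1.0 - mu_b - nu_b
        total += (abs(mu_a - mu_b) ** p + abs(eta_a - eta_b) ** p
                  + abs(nu_a - nu_b) ** p + abs(pi_a - pi_b) ** p)
    return (total / 3.0) ** (1.0 / p)


def pfs_knn_predict(train_rows, train_labels, test_rows, k=3, p=1):
    """Predict one label per test row by majority vote among the k nearest training rows."""
    predictions = []
    for test_row in test_rows:
        # Distances from the current test sample to every training sample.
        distances = [(row_distance(test_row, train_row, p), index)
                     for index, train_row in enumerate(train_rows)]
        distances.sort()
        nearest_labels = [train_labels[index] for _, index in distances[:k]]
        # Most frequent class label among the k nearest neighbors.
        predictions.append(Counter(nearest_labels).most_common(1)[0][0])
    return predictions


# Hypothetical single-feature data.
train_rows = [[(0.9, 0.1, 0.0)], [(0.8, 0.2, 0.1)], [(0.1, 0.1, 0.8)], [(0.0, 0.2, 0.9)]]
train_labels = ["pos", "pos", "neg", "neg"]
test_rows = [[(0.85, 0.1, 0.05)], [(0.05, 0.1, 0.9)]]
predicted = pfs_knn_predict(train_rows, train_labels, test_rows, k=3, p=1)
```

The sketch keeps the structure of the pseudocode (one distance vector per test sample, then a vote) but omits the fuzzification steps of Lines 1–2, which are assumed to have been applied beforehand.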
Figure 4. Simulation methodology of the present study via kNN-based classifiers [36–43]. The flowchart partitions each medical dataset into five partitions for testing and training and compares kNN (Cover and Hart, 1967), Fuzzy kNN (Keller et al., 1985), WkNN (Dubey and Pudi, 2013), IFROWANN (Ramentol et al., 2015), LCkNN (Gou et al., 2019a), GMkNN (Gou et al., 2019b), LMRkNN (Gou et al., 2019c), BM-Fuzzy kNN (Kumbure et al., 2020), and PFS-kNN (proposed, 2023).
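\[ \mathrm{Accuracy}(T, T') := \frac{TP + TN}{TP + TN + FP + FN} \]
\[ \mathrm{Precision}(T, T') := \frac{TP}{TP + FP} \]
\[ \mathrm{Recall}(T, T') := \frac{TP}{TP + FN} \]
\[ \mathrm{F1\text{-}Score}(T, T') := \frac{2TP}{2TP + FP + FN} \]
where true positive (TP), true negative (TN), false positive (FP), and false negative (FN)
are defined as follows:
\[ TP := \left| \{ y_j \mid 1 \in T_j \wedge 1 \in T'_j,\ 1 \le j \le l \} \right| \]
\[ TN := \left| \{ y_j \mid 0 \notin T_j \wedge 0 \notin T'_j,\ 1 \le j \le l \} \right| \]
\[ FP := \left| \{ y_j \mid 0 \notin T_j \wedge 1 \in T'_j,\ 1 \le j \le l \} \right| \]
\[ FN := \left| \{ y_j \mid 1 \in T_j \wedge 0 \notin T'_j,\ 1 \le j \le l \} \right| \]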
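\[ \mathrm{Accuracy}(T, T') := \frac{1}{l} \sum_{i=1}^{l} \frac{TP_i + TN_i}{TP_i + TN_i + FP_i + FN_i} \]
\[ \mathrm{Precision}(T, T') := \frac{1}{l} \sum_{i=1}^{l} \frac{TP_i}{TP_i + FP_i} \]
\[ \mathrm{Recall}(T, T') := \frac{1}{l} \sum_{i=1}^{l} \frac{TP_i}{TP_i + FN_i} \]
\[ \mathrm{F1\text{-}Score}(T, T') := \frac{1}{l} \sum_{i=1}^{l} \frac{2TP_i}{2TP_i + FP_i + FN_i} \]
where ith true positive ($TP_i$), ith true negative ($TN_i$), ith false positive ($FP_i$), and ith false
negative ($FN_i$) for the class i are defined as follows:
\[ TP_i := \left| \{ x_j \mid i \in T_j \wedge i \in T'_j,\ 1 \le k \le l \} \right| \]
\[ TN_i := \left| \{ x_j \mid i \notin T_j \wedge i \notin T'_j,\ 1 \le k \le l \} \right| \]
\[ FP_i := \left| \{ x_j \mid i \notin T_j \wedge i \in T'_j,\ 1 \le k \le l \} \right| \]
\[ FN_i := \left| \{ x_j \mid i \in T_j \wedge i \notin T'_j,\ 1 \le k \le l \} \right| \]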
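The class-averaged measures can be computed directly from per-class counts. The sketch below is an illustration under stated assumptions: each sample carries exactly one ground-truth label and one predicted label, the counts are taken one class against the rest, and a ratio with a zero denominator is set to 0 (a convention not taken from the source). The function name is hypothetical.

```python
def macro_metrics(truth, predicted, classes):
    """Average per-class Accuracy, Precision, Recall, and F1-Score over the classes."""
    sums = {"accuracy": 0.0, "precision": 0.0, "recall": 0.0, "f1": 0.0}
    for c in classes:
        pairs = list(zip(truth, predicted))
        tp = sum(1 for t, q in pairs if t == c and q == c)
        tn = sum(1 for t, q in pairs if t != c and q != c)
        fp = sum(1 for t, q in pairs if t != c and q == c)
        fn = sum(1 for t, q in pairs if t == c and q != c)
        sums["accuracy"] += (tp + tn) / (tp + tn + fp + fn)
        sums["precision"] += tp / (tp + fp) if tp + fp else 0.0
        sums["recall"] += tp / (tp + fn) if tp + fn else 0.0
        sums["f1"] += 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    return {name: value / len(classes) for name, value in sums.items()}


# Hypothetical labels for four test samples.
scores = macro_metrics(["a", "a", "b", "b"], ["a", "b", "b", "b"], classes=["a", "b"])
```

With two classes, the same function covers the binary case, although the averaged values then differ from the single-class formulas unless only the positive class is passed in `classes`.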
selected for training and the other for testing (for more details about CV, see [47]), randomly
split the considered dataset into five parts. Table 3 presents the average Accuracy, Precision,
Recall, and F1-Score results of PFS-kNN, kNN, Fuzzy kNN, WkNN, IFROWANN, LCkNN,
GMkNN, LMRkNN, and BM-Fuzzy kNN for the datasets.
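The evaluation protocol, ten repetitions of five-fold cross-validation, can be sketched as an index generator. This is a minimal illustration assuming a plain random split without stratification, which the text does not specify; the function name, the seed, and the sample count in the usage line are hypothetical.

```python
import random


def repeated_kfold_indices(n_samples, n_folds=5, n_repeats=10, seed=0):
    """Yield (train_indices, test_indices) for every fold of every repetition."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # new random partition in each repetition
        folds = [indices[f::n_folds] for f in range(n_folds)]
        for f in range(n_folds):
            test = folds[f]
            train = [i for g in range(n_folds) if g != f for i in folds[g]]
            yield train, test


splits = list(repeated_kfold_indices(n_samples=23))
```

Each repetition uses one part for testing and the remaining four for training, so ten repetitions yield 50 train/test pairs whose results can then be averaged.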
Based on the results obtained from Accuracy, it is evident that PFS-kNN surpasses
all other kNN-based classifiers that were compared. This is similarly observed when it
comes to F1-Score results. However, it should be noted that the proposed approach has
lower Precision and Recall results when compared to the other classifiers. Nevertheless,
the results are still close to the highest score in general.
These simulation results show that pfs-matrices and PFS-kNN can model uncertainty
and real-world problems, such as medical diagnosis and machine learning. Applying
these models can improve the classification accuracy on such problems,
leading to more reliable solutions. Therefore, using PFS-kNN and
pfs-matrices is recommended when dealing with similar problems.
In this study, we evaluated the Accuracy performance values of various algorithms
on four medical datasets. To obtain a comprehensive understanding of the algorithms’
performance, we ran each algorithm 50 times (10 times five-fold cross-validation) and
plotted the results as box plots in Figure 5.
[Figure 5: four box-plot panels (a)–(d). Horizontal axis: Classifier (kNN, Fuzzy kNN, WkNN, IFROWANN, LCkNN, GMkNN, LMRkNN, BM-Fuzzy kNN, PFS-kNN). Vertical axis: Accuracy Ratio.]
Figure 5. Box plot of Accuracy results of 50 runs for the classifiers: (a) Breast Tissue, (b) Parkinson’s,
(c) Breast Cancer, (d) Indian Liver.
From the visual results in Figure 5a–d, we can observe that PFS-kNN outperforms the
other algorithms, with the highest performance value and a performance value distribution
that is close to normal distribution. This indicates that PFS-kNN is a reliable algorithm for
these medical datasets.
Similarly, in Figure 5b, we see that PFS-kNN produces the highest performance re-
sults, with the 50 performance values almost following a normal distribution. Moreover,
the distance between quartiles is relatively low, suggesting that PFS-kNN is consistent
in performance.
Overall, the box plots in Figure 5 demonstrate that PFS-kNN is a superior algorithm
compared to the others evaluated in this study, and it is a promising option for medical
data analysis.
9. Conclusions
This paper redefined the idea of pfs-matrices, and their fundamental properties were
examined extensively. Afterward, distance measures of pfs-matrices were introduced.
Then, PFS-kNN, via the aforementioned distance measures, was suggested and applied
to medical diagnosis. The results manifested that the concept of pfs-matrices and the
proposed PFS-kNN approach can model uncertainty and real-world problems such as
medical diagnosis.
The current study, which focuses on soft sets, contributes to the
literature in both theoretical and practical aspects. It makes three
additions: it redefines the mathematics underlying pfs-matrices, proposes new distance
measures between pfs-matrices, and proposes PFS-kNN. By doing so, this paper expands
the understanding of this field and enhances its applicability to real-world problems.
These contributions open up new avenues for further exploration and research in
the field.
In future works, there is potential for further investigation into the algebraic and
topological structures of pfs-matrices and the exploration of new distance and similarity
measures. While pfs-matrices have proven effective in addressing specific problems, it is
essential to acknowledge their limitations when dealing with picture fuzzy parameters.
To overcome this issue, research can be conducted on several related concepts, such as
intuitionistic fuzzy parameterized intuitionistic fuzzy soft matrices (ifpifs-matrices) [49,50],
aggregation operators of pfs-matrices [51,52], picture fuzzy parameterized picture fuzzy
soft sets (pfppfs-sets) [53], and picture fuzzy parameterized picture fuzzy soft matrices
(pfppfs-matrices). Additionally, interval-valued intuitionistic fuzzy parameterized interval-
valued intuitionistic fuzzy soft sets (d-sets) [4] and interval-valued intuitionistic fuzzy
parameterized interval-valued intuitionistic fuzzy soft matrices (d-matrices) [5] are other
related concepts that may be worth exploring. We can better understand their potential
applications and limitations by studying and applying these concepts to different real-
world problems. For instance, the applications of pfs-matrices to different real-world problems,
such as trend prediction of component stock [54], remote sensing image fusion [55], and
Landsat image fusion [56], can be investigated.
References
1. Zadeh, L.A. Fuzzy Sets. Inf. Control 1965, 8, 338–353. [CrossRef]
2. Memiş, S.; Enginoğlu, S.; Erkan, U. Numerical Data Classification via Distance-Based Similarity Measures of Fuzzy Parameterized
Fuzzy Soft Matrices. IEEE Access 2021, 9, 88583–88601. [CrossRef]
3. Memiş, S.; Enginoğlu, S.; Erkan, U. Fuzzy Parameterized Fuzzy Soft k-Nearest Neighbor Classifier. Neurocomputing 2022,
500, 351–378. [CrossRef]
4. Aydın, T.; Enginoğlu, S. Interval-Valued Intuitionistic Fuzzy Parameterized Interval-Valued Intuitionistic Fuzzy Soft Sets and
Their Application in Decision-Making. J. Ambient. Intell. Humaniz. Comput. 2021, 12, 1541–1558. [CrossRef]
5. Aydın, T.; Enginoğlu, S. Interval-Valued Intuitionistic Fuzzy Parameterized Interval-Valued Intuitionistic Fuzzy Soft Matrices
and Their Application to Performance-Based Value Assignment to Noise-Removal Filters. Comput. Appl. Math. 2022, 41, 192.
[CrossRef]
6. Mushrif, M.M.; Sengupta, S.; Ray, A.K. Texture Classification Using a Novel, Soft-Set Theory Based Classification Algorithm. In
Proceedings of the 7th Asian Conference on Computer Vision, Hyderabad, India, 13–16 January 2006; pp. 246–254.
7. Çağman, N.; Enginoğlu, S. Soft Matrix Theory and Its Decision Making. Comput. Math. Appl. 2010, 59, 3308–3314. [CrossRef]
8. Zimmermann, H.J. Fuzzy Set Theory and Its Applications; Springer Science+Business Media: New York, NY, USA, 2011.
9. Çağman, N.; Enginoğlu, S. Fuzzy Soft Matrix Theory and Its Application in Decision Making. Iran. J. Fuzzy Syst. 2012, 9, 109–119.
10. Atanassov, K.T. Intuitionistic Fuzzy Sets. Fuzzy Sets Syst. 1986, 20, 87–96. [CrossRef]
11. Molodtsov, D. Soft Set Theory-First Results. Comput. Math. Appl. 1999, 37, 19–31. [CrossRef]
12. Maji, P.K.; Biswas, R.; Roy, A.R. Fuzzy Soft Sets. J. Fuzzy Math. 2001, 9, 589–602.
13. Maji, P.K.; Biswas, R.; Roy, A.R. Intuitionistic Fuzzy Soft Sets. J. Fuzzy Math. 2001, 9, 677–692.
14. Chetia, B.; Das, P.K. Some Results of Intuitionistic Fuzzy Soft Matrix Theory. Adv. Appl. Sci. Res. 2012, 3, 412–423.
15. Yager, R.R. Pythagorean Fuzzy Subsets. In Proceedings of the 2013 Joint IFSA World Congress and NAFIPS Annual Meeting
(IFSA/NAFIPS) Conference, Edmonton, AB, Canada, 24–28 June 2013; pp. 57–61.
16. Cuong, B.C. Picture Fuzzy Sets. J. Comput. Sci. Cybern. 2014, 30, 409–420.
17. Peng, X.; Yang, Y.; Song, J.; Jiang, Y. Pythagorean Fuzzy Soft Set and Its Application. Comput. Eng. 2015, 41, 224–229.
18. Yang, Y.; Liang, C.; Ji, S.; Liu, T. Adjustable Soft Discernibility Matrix Based on Picture Fuzzy Soft Sets and Its Applications in
Decision Making. J. Intell. Fuzzy Syst. 2015, 29, 1711–1722. [CrossRef]
19. Guleria, A.; Bajaj, R.K. On Pythagorean Fuzzy Soft Matrices, Operations and Their Applications in Decision Making and Medical
Diagnosis. Soft Comput. 2018, 23, 7889–7900. [CrossRef]
20. Arikrishnan, A.; Sriram, S. Algebraic Operations on Picture Fuzzy Soft Matrices. Adv. Math. Sci. J. 2020, 9, 6349–6358. [CrossRef]
21. Memiş, S. A Study on Picture Fuzzy Sets. In Proceedings of the 7th IFS and Contemporary Mathematics Conference, Mersin,
Turkey, 25–29 May 2021; pp. 125–132.
22. Memiş, S. Another View on Picture Fuzzy Soft Sets and Their Product Operations with Soft Decision-Making. J. New Theory 2022,
2022, 1–13. [CrossRef]
23. Atanassov, K.T. On Intuitionistic Fuzzy Sets Theory; Springer: Berlin/Heidelberg, Germany, 2012.
24. Naeem, K.; Memiş, S. Picture Fuzzy Soft σ-Algebra and Picture Fuzzy Soft Measure and Their Applications to Multi-Criteria
Decision-Making. Granul. Comput. 2023, 8, 397–410. [CrossRef]
25. Thao, N.X.; Dinh, N.V. Rough Picture Fuzzy Set and Picture Fuzzy Topologies. J. Comput. Sci. Cybern. 2015, 31, 245–253.
[CrossRef]
26. Sezgin, A. A New Approach to Semigroup Theory I: Soft Union Semigroups, Ideals and Bi-Ideals. Algebra Lett. 2016, 2016, 3.
27. Jin, J.; Garg, H.; You, T. Generalized Picture Fuzzy Distance and Similarity Measures on the Complete Lattice and Their
Applications. Expert Syst. Appl. 2023, 220, 119710. [CrossRef]
28. Wang, T.; Wu, X.; Garg, H.; Liu, Q.; Chen, G. A Prospect Theory-Based MABAC Algorithm with Novel Similarity Measures and
Interactional Operations for Picture Fuzzy Sets and Its Applications. Eng. Appl. Artif. Intell. 2023, 126, 106787. [CrossRef]
29. Khan, M.J.; Kumam, P.; Liu, P.; Kumam, W.; Rehman, H. An Adjustable Weighted Soft Discernibility Matrix Based on Generalized
Picture Fuzzy Soft Set and Its Applications in Decision Making. J. Intell. Fuzzy Syst. 2020, 38, 2103–2118. [CrossRef]
30. Memiş, S.; Enginoğlu, S.; Erkan, U. A Classification Method in Machine Learning Based on Soft Decision-Making via Fuzzy
Parameterized Fuzzy Soft Matrices. Soft Comput. 2022, 26, 1165–1180. [CrossRef]
31. Memiş, S.; Enginoğlu, S.; Erkan, U. A New Classification Method Using Soft Decision-Making Based on An Aggregation Operator
of Fuzzy Parameterized Fuzzy Soft Matrices. Turk. J. Electr. Eng. Comput. Sci. 2022, 30, 871–890. [CrossRef]
32. Sahu, R.; Dash, S.R.; Das, S. Career Selection of Students Using Hybridized Distance Measure Based on Picture Fuzzy Set and
Rough Set Theory. Decis. Mak. Appl. Manag. Eng. 2021, 4, 104–126. [CrossRef]
33. Singh, A.; Kumar, S. Picture Fuzzy Set and Quality Function Deployment Approach Based Novel Framework for Multi-Criteria
Group Decision Making Method. Eng. Appl. Artif. Intell. 2021, 104, 104395. [CrossRef]
34. Lu, H.; Khalil, A.M.; Alharbi, W.; El-Gayar, M.A. A New Type of Generalized Picture Fuzzy Soft Set and Its Application in
Decision Making. J. Intell. Fuzzy Syst. 2021, 40, 12459–12475. [CrossRef]
35. Dua, D.; Graff, C. UCI Machine Learning Repository. 2019. Available online: [Link] (accessed on 30 May
2023).
36. Cover, T.M.; Hart, P.E. Nearest Neighbor Pattern Classification. IEEE Trans. Inf. Theory 1967, 13, 21–27. [CrossRef]
37. Keller, J.M.; Gray, M.R.; Givens, J.A. A Fuzzy K-Nearest Neighbor Algorithm. IEEE Trans. Syst. Man Cybern. 1985, 15, 580–585.
[CrossRef]
38. Dubey, H.; Pudi, V. Class Based Weighted k-Nearest Neighbor over Imbalance Dataset. In Proceedings of the 17th Pacific-Asia
Conference on Advances in Knowledge Discovery and Data Mining, Gold Coast, Australia, 14–17 April 2013; pp. 305–316.
39. Ramentol, E.; Vluymans, S.; Verbiest, N.; Caballero, Y.; Bello, R.; Cornelis, C.; Herrera, F. IFROWANN: Imbalanced Fuzzy-Rough
Ordered Weighted Average Nearest Neighbor Classification. IEEE Trans. Fuzzy Syst. 2015, 23, 1622–1636. [CrossRef]
40. Gou, J.; Qiu, W.; Yi, Z.; Shen, X.; Zhan, Y.; Ou, W. Locality Constrained Representation-Based K-Nearest Neighbor Classification.
Knowl.-Based Syst. 2019, 167, 38–52. [CrossRef]
41. Gou, J.; Ma, H.; Ou, W.; Zheng, S.; Rao, Y.; Yang, H. A Generalized Mean Distance-Based k-Nearest Neighbor Classifier. Expert
Syst. Appl. 2019, 115, 356–372. [CrossRef]
42. Gou, J.; Qiu, W.; Yi, Z.; Xu, Y.; Mao, Q.; Zhan, Y. A Local Mean Representation-Based k-Nearest Neighbor Classifier. ACM Trans.
Intell. Syst. Technol. 2019, 10, 29:1–29:25. [CrossRef]
43. Kumbure, M.M.; Luukka, P.; Collan, M. A New Fuzzy k-Nearest Neighbor Classifier Based on the Bonferroni mean. Pattern
Recognit. Lett. 2020, 140, 172–178. [CrossRef]
44. Bennett, K.P. Decision Tree Construction Via Linear Programming; Technical Report; University of Wisconsin-Madison Department
of Computer Sciences: Madison, WI, USA, 1992.
45. Bennett, K.P.; Mangasarian, O.L. Robust Linear Programming Discrimination of Two Linearly Inseparable Sets. Optim. Methods
Softw. 2006, 1, 23–34. [CrossRef]
46. Fawcett, T. An Introduction to ROC Analysis. Pattern Recognit. Lett. 2006, 27, 861–874. [CrossRef]
47. Stone, M. Cross-Validatory Choice and Assessment of Statistical Predictions. J. R. Stat. Soc. Ser. B (Methodol.) 1974, 36, 111–147.
[CrossRef]
48. Erkan, U. A Precise and Stable Machine Learning Algorithm: Eigenvalue Classification (EigenClass). Neural Comput. Appl. 2021,
33, 5381–5392. [CrossRef]
49. Enginoğlu, S.; Arslan, B. Intuitionistic Fuzzy Parameterized Intuitionistic Fuzzy Soft Matrices and Their Application in Decision-
Making. Comput. Appl. Math. 2020, 39, 325. [CrossRef]
50. Memiş, S.; Arslan, B.; Aydın, T.; Enginoğlu, S.; Camcı, Ç. Distance and Similarity Measures of Intuitionistic Fuzzy Parameterized
Intuitionistic Fuzzy Soft Matrices and Their Applications to Data Classification in Supervised Learning. Axioms 2023, 12, 463.
[CrossRef]
51. Dhumras, H.; Bajaj, R.K. Modified EDAS Method for MCDM in Robotic Agrifarming with Picture Fuzzy Soft Dombi Aggregation
Operators. Soft Comput. 2023, 27, 5077–5098. [CrossRef]
52. Mahmood, T.; Ali, Z.; Naeem, M. Aggregation Operators and CRITIC-VIKOR Method for Confidence Complex q-Rung Orthopair
Normal Fuzzy Information and Their Applications. CAAI Trans. Intell. Technol. 2023, 8, 40–63. [CrossRef]
53. Memiş, S. Picture Fuzzy Parameterized Picture Fuzzy Soft Sets and Their Application in a Performance-Based Value Assignment
Problem to Salt-and-Pepper Noise Removal Filters. Int. J. Fuzzy Syst. 2023, 2023, 1–15.
54. Li, P.; Gu, H.; Yin, L.; Li, B. Research on Trend Prediction of Component Stock in Fuzzy Time Series Based on Deep Forest. CAAI
Trans. Intell. Technol. 2022, 7, 617–626. [CrossRef]
55. Singh, D.; Kaur, M.; Singh, H. Remote Sensing Image Fusion Using Fuzzy Logic and Gyrator Transform. Remote. Sens. Lett. 2018,
9, 5077–5098. [CrossRef]
56. Singh, D.; Garg, D.; Pannu, H.S. Efficient Landsat Image Fusion Using Fuzzy and Stationary Discrete Wavelet Transform. Imaging
Sci. J. 2017, 65, 108–114. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
electronics
Article
Meteorological Variables Forecasting System Using Machine
Learning and Open-Source Software
Jenny Aracely Segovia *, Jonathan Fernando Toaquiza *, Jacqueline Rosario Llanos *
and David Raimundo Rivas
Department of Electrical and Electronic Engineering, Universidad de las Fuerzas Armadas (ESPE),
Sangolquí 171103, Ecuador
* Correspondence: jasegovia4@[Link] (J.A.S.); tcjonathan@[Link] (J.F.T.); jdllanos1@[Link] (J.R.L.)
Abstract: Techniques for forecasting meteorological variables are widely studied because prior
knowledge of these variables allows for the efficient management of renewable energies and supports
other applications in fields such as agriculture, health, engineering, and energy. In this research,
forecasting models for meteorological variables were designed, implemented, and compared
using different Machine Learning techniques in Python open-source software.
The techniques implemented include multiple linear regression, polynomial regression, random
forest, decision tree, XGBoost, and multilayer perceptron neural network (MLP). To identify the
best technique, the root mean square error (RMSE), mean absolute percentage error (MAPE), mean
absolute error (MAE), and coefficient of determination (R²) are used as evaluation metrics. The most
efficient technique depends on the variable to be forecast; however, for most variables, the
random forest and XGBoost techniques perform best. For temperature, the best
performing technique was Random Forest, with an R² of 0.8631, MAE of 0.4728 °C, MAPE of 2.73%,
and RMSE of 0.6621 °C; for relative humidity, it was Random Forest, with an R² of 0.8583, MAE of
2.1380 RH, MAPE of 2.50%, and RMSE of 2.9003 RH; for solar radiation, it was Random Forest, with
an R² of 0.7333, MAE of 65.8105 W/m², and RMSE of 105.9141 W/m²; and for wind speed, it was
Random Forest, with an R² of 0.3660, MAE of 0.1097 m/s, and RMSE of 0.2136 m/s.
makes it impossible to optimally manage renewable energies and obtain a greater benefit
from them.
There are multiple scientific studies of modeling and prediction in order to forecast
future conditions of phenomena in various fields; among the most prominent are ARIMA,
Chaos Theory, and Neural Networks [6]. Forecasting models have evolved in recent
decades, from smart systems with formal rules and logical theories, to the emergence of
artificial intelligence techniques that allow us to propose alternatives in the treatment of
information [7].
Currently, forecasting models have a high impact and are used in several applications,
such as the management of energy units for renewable-resource microgrids [8,9]; load
estimation methods for isolated communities that do not receive energy or receive it only
for a limited time each day [10,11]; the operation of energy systems [12,13]; in agriculture,
to predict the water consumption of plants and plan the irrigation sheet [14]; in agriculture
4.0, for the prediction of variables that affect the quality of crops, micronutrient analysis
and prediction of soil chemical parameters [15], optimization of agricultural procedures,
and increasing productivity in the field; the forecasting of the SPI and meteorological drought
based on the artificial neural network and the M5P model tree [16]; and controllers based
on forecasting models and predictive controllers. They are also used in the health field to
predict the solar radiation index and to obtain a correct assessment in people with skin
cancer [17]. All of the applications mentioned above therefore need forecasting models with
the lowest possible error for their effective operation.
Operating a forecasting system can be costly because it often relies on software
packages with significant licensing costs. Free software is an option for reducing
these costs. This research proposes a system based on free software (Python), which
is currently used at an industrial level for its reliability, for example in the following
works: Advanced Time Series: Application of Neural Networks for Time Series
Forecasting [18]; Machine Learning in Python: Main Developments and Technology
Trends in Data Science, Machine Learning, and Artificial Intelligence [19]; and Development of
a Smart Tool Focused on Artificial Vision and Neural Networks for Weed Recognition in Rice
Plantations, Using the Python Programming Language [20].
In this research, different prediction techniques were evaluated and compared (multiple
linear regression, polynomial regression, random forest, decision tree, XGBoost, and
multilayer perceptron neural network) in order to identify the best performing
strategy, using evaluation metrics such as the root mean square error (RMSE) and the
coefficient of determination (R²). The variables to be predicted are temperature, relative
humidity, solar radiation, and wind speed, using data from the weather station located
in Baños, Tungurahua province, Ecuador. The predicted variables will be the inputs for a
smart irrigation system and for the energy management system of a microgrid based
on predictive control; therefore, models that closely approximate online measurements
are required.
The contributions of this work are as follows: (i) to design, validate, and compare
different machine learning techniques and select the technique best suited
to climate variables for agriculture and energy applications; (ii) to develop a low-cost
forecasting system for climate variables based on free software (Python); and (iii) to generate
forecasting models that can be replicated for other types of variables in smart control systems
based on forecasting models.
Electronics 2023, 12, 1007
Figure 1. Flowchart of the methodology used to obtain forecasting models for meteorological
variables.
important to note that the values of temperature, solar radiation (net solar radiation at
surface), and relative humidity were measured at a distance of 2 m, while the wind speed
was measured at 10 m.
$$y = a + b_1 X_1 + b_2 X_2 + \cdots + b_n X_n \qquad (1)$$
$$y = a + b_1 X_i + b_2 X_i^{2} + b_3 X_i^{3} + \cdots + b_n X_i^{n} \qquad (2)$$
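As a minimal illustration of Equations (1) and (2), the following Python sketch evaluates both model forms for given coefficients. The function names and the coefficient values are hypothetical; they are chosen only to make the arithmetic easy to follow and are not taken from the fitted models.

```python
from typing import Sequence


def linear_predict(a: float, b: Sequence[float], x: Sequence[float]) -> float:
    """Equation (1): y = a + b1*X1 + ... + bn*Xn for one sample."""
    if len(b) != len(x):
        raise ValueError("b and x must have the same length")
    return a + sum(bj * xj for bj, xj in zip(b, x))


def polynomial_predict(a: float, b: Sequence[float], x: float) -> float:
    """Equation (2): y = a + b1*x + b2*x**2 + ... + bn*x**n for one input x."""
    return a + sum(bj * x ** (j + 1) for j, bj in enumerate(b))


# Hypothetical coefficients (not from the paper).
y_lin = linear_predict(1.0, [2.0, 3.0], [10.0, 20.0])  # 1 + 2*10 + 3*20
y_poly = polynomial_predict(1.0, [2.0, 3.0], 2.0)      # 1 + 2*2 + 3*2**2
```

Equation (2) is written for a single input; with several inputs, a library implementation would typically also generate cross terms between inputs, which this sketch does not attempt.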
Polynomial Regression

| Predicted Variable | Input Variables | Degree of the Polynomial |
|---|---|---|
| Temperature | Solar radiation, relative humidity, wind speed | 4 |
| Solar radiation | Temperature, relative humidity, wind speed | 5 |
| Wind speed | Temperature, solar radiation, relative humidity | 6 |
| Relative Humidity | Temperature, solar radiation, wind speed | 4 |
where $P_{i,k}$ is the ratio of class $k$ instances among the training instances in the $i$th node,
$m$ is the number of class labels, and $G_i$ (Gini impurity) is the measure used for constructing
decision trees.
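The definition above matches the usual Gini impurity, that is, one minus the sum over the m classes of the squared ratios P_{i,k}. That form is assumed here because the equation itself is not reproduced in this section. A small Python sketch of the quantity for the class labels in one node (the function name is illustrative):

```python
from collections import Counter
from typing import Hashable, Sequence


def gini_impurity(labels: Sequence[Hashable]) -> float:
    """Gini impurity of one node: 1 minus the sum of squared class ratios."""
    n = len(labels)
    if n == 0:
        raise ValueError("a node must contain at least one instance")
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


pure = gini_impurity(["a", "a", "a", "a"])   # a single class in the node
mixed = gini_impurity(["a", "a", "b", "b"])  # two equally frequent classes
```

A node containing a single class has impurity 0, and the value grows as the classes become more evenly mixed. For continuous targets, tree libraries commonly split on a variance-type criterion instead; the source describes Gini impurity, so that is what is sketched.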
After performing different heuristic tests and using sensitivity analysis for this forecast
technique, it is deduced that the best parameters for tuning are those described in Table 3.
Decision Tree

| Predicted Variable | Input Variables | Max_Depth | Min_Samples_Leaf |
|---|---|---|---|
| Temperature | Solar radiation, relative humidity, wind speed | 10 | 18 |
| Solar radiation | Temperature, relative humidity, wind speed | 10 | 7 |
| Wind speed | Temperature, solar radiation, relative humidity | 19 | 6 |
| Relative Humidity | Temperature, solar radiation, wind speed | 9 | 16 |
This technique has several parameters that can be configured, including the following.
N° estimators: the number of trees in the forest. Max leaf nodes: the maximum
number of leaf nodes; this hyperparameter sets a condition for splitting the tree nodes and
thus restricts the growth of the tree. If splitting would produce more terminal nodes than the
specified number, the splitting stops and the tree does not continue to grow, which helps to
avoid overfitting. Max features: the maximum number of features that are evaluated
for splitting at each node; increasing max_features generally improves model performance,
since each node then has a greater number of options to consider [23].
After performing different heuristic tests and using sensitivity analysis for this forecast
technique, it is deduced that the best parameters for tuning are those described in Table 4.
Random Forest

| Predicted Variable | Input Variables | N° Estimators | Max Leaf Nodes | Max Features |
|---|---|---|---|---|
| Temperature | Solar radiation, relative humidity, wind speed | 100 | 3000 | 0.1 |
| Solar radiation | Temperature, relative humidity, wind speed | 100 | 3000 | 0.1 |
| Wind speed | Temperature, solar radiation, relative humidity | 100 | 2000 | 0.3 |
| Relative Humidity | Temperature, solar radiation, wind speed | 100 | 2000 | 0.2 |
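The settings in Table 4 map naturally onto the `n_estimators`, `max_leaf_nodes`, and `max_features` hyperparameters of scikit-learn's `RandomForestRegressor`. The source does not state which library class was used, so this mapping is an assumption, and the dictionary and function names below are hypothetical.

```python
# Hyperparameters per predicted variable, copied from Table 4.
RF_PARAMS = {
    "temperature": {"n_estimators": 100, "max_leaf_nodes": 3000, "max_features": 0.1},
    "solar_radiation": {"n_estimators": 100, "max_leaf_nodes": 3000, "max_features": 0.1},
    "wind_speed": {"n_estimators": 100, "max_leaf_nodes": 2000, "max_features": 0.3},
    "relative_humidity": {"n_estimators": 100, "max_leaf_nodes": 2000, "max_features": 0.2},
}


def build_random_forest(target: str):
    """Return an unfitted RandomForestRegressor for one predicted variable.

    Assumes scikit-learn is installed; it is imported lazily so that the
    parameter table above can be inspected without it.
    """
    from sklearn.ensemble import RandomForestRegressor

    return RandomForestRegressor(**RF_PARAMS[target])
```

Calling `build_random_forest("temperature")` would then give a model ready for `fit`. In scikit-learn, a float `max_features` is interpreted as a fraction of the input features, so with only three inputs the fractions listed here would likely resolve to a single feature per split.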
After performing different heuristic tests and using sensitivity analysis for this forecast
technique, it is deduced that the best parameters for its tuning are those described in Table 5.
XGBoost

| Predicted Variable | Input Variables | Max Depth | N° Estimators |
|---|---|---|---|
| Temperature | Solar radiation, relative humidity, wind speed | 2 | 100 |
| Solar radiation | Temperature, relative humidity, wind speed | 2 | 20 |
| Wind speed | Temperature, solar radiation, relative humidity | 5 | 19 |
| Relative Humidity | Temperature, solar radiation, wind speed | 7 | 19 |
In general, all weights and biases will be different. The output of the multilayer
perceptron neural network is defined by Equation (4), where $y_k$ is the output, $f_k$ is the
activation function of the output layer, $\theta_k$ is the bias of the output layer, $W_{ij}$ are the
hidden layer weights, $y_j$ is the output of the hidden layer, $f_j$ is the activation function of the
hidden layer, $X_i$ are the neuron inputs, $W_{jk}$ are the output layer weights, $\theta_j$ is the bias
of the hidden layer, $r$ is the number of inputs for neuron $j$ of the hidden layer, and $u$ is the
number of inputs for neuron $k$ of the output layer [27].

$$y_j = f_j\left(\sum_{i=1}^{r} X_i W_{ij} - \theta_j\right), \qquad y_k = f_k\left(\sum_{j=1}^{u} y_j W_{jk} - \theta_k\right) \qquad (4)$$
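The forward pass in Equation (4) can be sketched in plain Python as follows. The tanh hidden activation, the linear output activation, and all weight values are assumptions made for illustration; the source does not specify them at this point, and the function names are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def layer_output(
    inputs: Sequence[float],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
    activation: Callable[[float], float],
) -> List[float]:
    """Apply f(sum_i inputs[i] * weights[i][j] - biases[j]) for every neuron j."""
    return [
        activation(sum(x * w_row[j] for x, w_row in zip(inputs, weights)) - theta)
        for j, theta in enumerate(biases)
    ]


def mlp_forward(x, w_hidden, theta_hidden, w_out, theta_out):
    """Equation (4): hidden outputs y_j first, then network outputs y_k."""
    y_hidden = layer_output(x, w_hidden, theta_hidden, math.tanh)
    return layer_output(y_hidden, w_out, theta_out, lambda s: s)  # linear output


# Hypothetical network: 2 inputs, 2 hidden neurons, 1 output.
y = mlp_forward(
    x=[1.0, 2.0],
    w_hidden=[[0.5, -0.5], [0.25, 0.25]],  # w_hidden[i][j] plays the role of W_ij
    theta_hidden=[1.0, -0.5],
    w_out=[[2.0], [4.0]],                  # w_out[j][k] plays the role of W_jk
    theta_out=[0.0],
)
```

Note that the biases are subtracted, following the sign convention of Equation (4), whereas many libraries add them.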
For this research, backpropagation was used as a training technique. After performing
different heuristic tests and using sensitivity analysis for this forecasting technique, it is
deduced that the best parameters for its tuning are those described in Table 6.
Table 6. Tuning parameters for the multilayer perceptron neural network technique.
3. Results
3.1. Indicators for Assessing the Performance of Weather Forecasting Models
To measure the performance of the forecasting techniques for each of the variables
described above, two types of metrics were used. To evaluate the forecast accuracy, the root
mean square error (RMSE) is used, which allows the results to be compared and identifies the
technique with the lowest error and, therefore, the best method for each variable to be predicted.
In addition, to determine whether the implemented models perform well in training and to
assess their predictive ability, the coefficient of determination R² is used.
where $y_c$ are the values taken by the target variable, $\hat{y}_c$ are the values of the prediction,
and $\bar{y}$ is the mean of the values taken by the target variable.
where $y_c$ are the values taken by the target variable, $\hat{y}_c$ are the values of the prediction,
and $o$ is the sample size.
where $y_c$ are the values taken by the target variable, $\hat{y}_c$ are the values of the prediction,
and $o$ is the sample size.
Equation (7) helps to explain one of the important caveats of MAPE: to calculate this
metric, the difference is divided by the actual value. If any actual value is equal to 0,
the calculation fails with a division by zero, and if the actual values are close to 0,
the MAPE becomes extremely high. Therefore, MAPE is not recommended when the
actual values are close to 0 [30].
the difference between the true value and the predicted value for the instance [16,31]. It is
given by Equation (8):
$$\mathrm{MAE} = \frac{1}{o} \sum_{c=1}^{o} \left| y_c - \hat{y}_c \right| \qquad (8)$$
where: yc : are the values taken by the target variable, ŷc : are the values of the prediction,
and o: is the sample size.
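The four indicators can be sketched in plain Python as follows. MAE follows Equation (8); for R², RMSE, and MAPE the standard definitions are assumed, since those equations are not reproduced in this section. The function names and the sample values are illustrative only.

```python
import math
from typing import Sequence


def _check(y_true: Sequence[float], y_pred: Sequence[float]) -> int:
    """Validate the inputs and return the sample size o."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return len(y_true)


def rmse(y_true, y_pred):
    o = _check(y_true, y_pred)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / o)


def mae(y_true, y_pred):
    o = _check(y_true, y_pred)
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / o


def mape(y_true, y_pred):
    """Mean absolute percentage error; undefined when any actual value is 0."""
    o = _check(y_true, y_pred)
    if any(y == 0 for y in y_true):
        raise ZeroDivisionError("MAPE is undefined for actual values equal to 0")
    return 100.0 * sum(abs((y - p) / y) for y, p in zip(y_true, y_pred)) / o


def r2(y_true, y_pred):
    o = _check(y_true, y_pred)
    mean = sum(y_true) / o
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values, not measurements from the weather station.
actual = [10.0, 20.0, 40.0]
predicted = [12.0, 18.0, 40.0]
```

The explicit check in `mape` makes the caveat about actual values at 0 visible; values that are merely close to 0 would not raise an error but would inflate the result.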
Table 7 shows that the R² values obtained from the implemented algorithms are adequate,
i.e., the predicted temperature closely approximates the real temperature. This indicates
that the algorithms perform well and allows their performance to be compared in terms of
forecast error. Comparison of the root mean square errors (RMSE), mean absolute percentage
errors (MAPE), and mean absolute errors (MAE), together with analysis of the coefficient of
determination R² of the different techniques implemented, shows that the best performing
technique for forecasting the temperature variable is Random Forest, with an R² of 0.8631,
MAE of 0.4728 °C, MAPE of 2.73%, and RMSE of 0.6621 °C. This is followed by XGBoost, with
an R² of 0.8599, MAE of 0.5335 °C, MAPE of 3.09%, and RMSE of 0.7565 °C.
Figure 5 shows the real (red) and predicted (blue) profiles obtained with the different Machine
Learning techniques for the temperature variable: (a) multiple linear regression,
(b) polynomial regression, (c) decision tree, (d) random forest, (e) XGBoost, and
(f) multilayer perceptron neural network. Figure 5c,d validate that the best performance
corresponds to the Decision tree and Random forest techniques.
Table 8 shows that the R² values obtained from the implemented algorithms are adequate,
i.e., the predicted relative humidity closely approximates the real relative humidity. This
indicates that the algorithms perform well and allows their performance to be compared in
terms of forecast error. Comparison of the root mean square errors (RMSE), mean absolute
percentage errors (MAPE), and mean absolute errors (MAE), together with analysis of the
coefficient of determination R² of the different techniques implemented, shows that the best
performing techniques for forecasting the relative humidity variable are Random Forest, with
an R² of 0.8583, MAE of 2.1380 RH, MAPE of 2.50%, and RMSE of 2.9003 RH; and XGBoost,
with an R² of 0.8597, MAE of 2.2907 RH, MAPE of 2.67%, and RMSE of 3.1444 RH.
Figure 6 shows the real (red) and predicted (blue) profiles obtained with the different Machine
Learning techniques for the relative humidity variable: (a) multiple linear regression,
(b) polynomial regression, (c) decision tree, (d) random forest, (e) XGBoost, and
(f) multilayer perceptron neural network. Figure 6c,d validate that the best performance
corresponds to the Decision tree and Random forest techniques.
Figure 5. Temperature forecast techniques: (a) Multiple linear regression, (b) Polynomial regression,
(c) Decision tree, (d) Random forest, (e) XGboost, (f) Multilayer perceptron neural network.
Figure 6. Techniques for relative humidity forecasting: (a) Multiple linear regression, (b) Polynomial
regression, (c) Decision tree, (d) Random forest, (e) XGboost, (f) Multilayer perceptron neural network.
| Technique | Coefficient of Determination (R²) | Mean Absolute Error (MAE) [W/m²] | Root Mean Square Error (RMSE) [W/m²] |
|---|---|---|---|
| Multiple linear regression | 0.6689 | 106.9741 | 164.7435 |
| Polynomial regression | 0.7394 | 76.6667 | 129.1836 |
| Decision tree | 0.7253 | 75.8177 | 127.3530 |
| Random forest | 0.7333 | 65.8105 | 105.9141 |
| XGBoost | 0.7075 | 87.6137 | 145.0170 |
| Multilayer perceptron | 0.7423 | 88.5897 | 140.0681 |
Table 9 shows that the R² values obtained from the implemented algorithms are adequate,
i.e., the predicted solar radiation closely approximates the real solar radiation. This
indicates that the algorithms perform well and allows their performance to be compared in
terms of forecast error. Comparison of the root mean square errors (RMSE) and mean absolute
errors (MAE), together with analysis of the coefficient of determination R² of the different
techniques implemented, shows that the best performing techniques for forecasting the solar
radiation variable are Random Forest, with an R² of 0.7333, MAE of 65.8105 W/m², and RMSE of
105.9141 W/m²; and Decision Tree, with an R² of 0.7253, MAE of 75.8177 W/m², and RMSE of
127.3530 W/m².
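The comparison for solar radiation can be reproduced with a short Python sketch that ranks the techniques using the values reported in Table 9. The variable names are hypothetical, and ranking by RMSE first is an assumption about how the selection was made.

```python
# (R², MAE in W/m², RMSE in W/m²) per technique, copied from Table 9.
SOLAR_RESULTS = {
    "Multiple linear regression": (0.6689, 106.9741, 164.7435),
    "Polynomial regression": (0.7394, 76.6667, 129.1836),
    "Decision tree": (0.7253, 75.8177, 127.3530),
    "Random forest": (0.7333, 65.8105, 105.9141),
    "XGBoost": (0.7075, 87.6137, 145.0170),
    "Multilayer perceptron": (0.7423, 88.5897, 140.0681),
}

# Rank by RMSE (index 2), lowest error first.
by_rmse = sorted(SOLAR_RESULTS, key=lambda name: SOLAR_RESULTS[name][2])

# Rank by R² (index 0), highest value first, for comparison.
by_r2 = sorted(SOLAR_RESULTS, key=lambda name: SOLAR_RESULTS[name][0], reverse=True)
```

Ranking by RMSE places Random forest first and Decision tree second, which agrees with the text. Ranking by R² alone would instead place the Multilayer perceptron first, which suggests that the reported choice rests mainly on the error metrics.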
Figure 7 shows the real (red) and predicted (blue) profiles obtained with the different Machine
Learning techniques for the solar radiation variable: (a) multiple linear regression,
(b) polynomial regression, (c) decision tree, (d) random forest, (e) XGBoost, and
(f) multilayer perceptron neural network. Figure 7d validates that the best performance
corresponds to the Random forest technique.
not taken into account because it is used only when it is known that the quantity to be
predicted remains well above 0.
Figure 7. Solar radiation forecast techniques: (a) Multiple linear regression, (b) Polynomial regression,
(c) Decision tree, (d) Random forest, (e) XGboost, (f) Multilayer perceptron neural network.
Figure 8 shows the real (red) and predicted (blue) profiles obtained with the different Machine
Learning techniques for the wind speed variable: (a) multiple linear regression,
(b) polynomial regression, (c) decision tree, (d) random forest, (e) XGBoost, and
(f) multilayer perceptron neural network. Figure 8d validates that the best performance
corresponds to the Random forest technique.
Figure 8. Techniques for wind speed forecast: (a) Multiple linear regression, (b) Polynomial regression,
(c) Decision tree, (d) Random forest, (e) XGboost, (f) Multilayer perceptron neural network.
4. Conclusions
In this research, meteorological variables were forecast using information obtained
from the Parque de la Familia Baños meteorological station located in Ecuador, and the
following prediction techniques were tested: multiple linear regression, polynomial
regression, decision tree, random forest, XGBoost, and multilayer perceptron neural
network. For the temperature variable, the best result was obtained with Random
Forest, with an R² of 0.8631, MAE of 0.4728 °C, MAPE of 2.73%, and RMSE of 0.6621 °C;
XGBoost also performed well, with an R² of 0.8599, MAE of 0.5335 °C, MAPE
of 3.09%, and RMSE of 0.7565 °C. For the relative humidity variable, the best
result was obtained with Random Forest, with an R² of 0.8583, MAE of 2.1380 RH, MAPE
of 2.50%, and RMSE of 2.9003 RH; XGBoost also performed well, with an R²
of 0.8597, MAE of 2.2907 RH, MAPE of 2.67%, and RMSE of 3.1444 RH. For
the solar radiation variable, the best result was obtained with Random Forest, with an
R² of 0.7333, MAE of 65.8105 W/m², and RMSE of 105.9141 W/m²; Decision
Tree also performed well, with an R² of 0.7253, MAE of 75.8177 W/m², and RMSE
of 127.3530 W/m². For the wind speed variable, the best result was obtained with
Random Forest, with an R² of 0.3660, MAE of 0.1097 m/s, and RMSE of 0.2136 m/s;
XGBoost also performed well, with an R² of 0.3866, MAE of 0.1439 m/s, and
RMSE of 0.3131 m/s.
Wind speed has the highest variability of the predicted variables; accordingly, the
coefficient of determination R² obtained for this variable is lower for all of the techniques
implemented. This is due to the type of signal being predicted; nevertheless, acceptable
predictions were obtained.
The prediction of meteorological variables (temperature, solar radiation, wind speed,
and relative humidity) will allow future projects to be implemented in the study area, such
as smart agriculture to help address food problems in that area and a microgrid based
on renewable resources, in which prediction models will support planning and real-time
operation. Such a microgrid would supply clean energy to the locality and contribute to
reducing the use of fossil resources, a goal that different countries have set as part of
their policies.
Author Contributions: Conceptualization, J.A.S., J.F.T., J.R.L. and D.R.R.; methodology, J.A.S., J.F.T.,
J.R.L. and D.R.R.; software, J.A.S. and J.F.T.; validation, J.A.S. and J.F.T.; formal analysis, J.A.S., J.F.T.,
J.R.L. and D.R.R.; investigation, J.A.S., J.F.T. and J.R.L.; resources, J.A.S. and J.F.T.; data curation,
J.A.S. and J.F.T.; writing—original draft preparation, J.A.S., J.F.T., J.R.L. and D.R.R.; writing—review
and editing, J.A.S., J.F.T., J.R.L. and D.R.R.; visualization, J.A.S., J.F.T., J.R.L. and D.R.R.; supervision,
J.R.L. and D.R.R.; project administration, J.R.L.; funding acquisition, J.R.L. All authors have read and
agreed to the published version of the manuscript.
Funding: This research received no external funding.
Data Availability Statement: Not applicable.
Acknowledgments: This work was supported in part by the Universidad de las Fuerzas Armadas
ESPE through the Project “Optimal energy management systems for hybrid generation systems”,
under Project 2023-pis-03. In addition, the authors would like to thank the projects
EE-GNP-0043-2021-ESPE, REDTPI4.0-CYTED, Conv-2022-05-UNACH, and “SISMO-ROSAS”–UPS.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Ayala, M.F. Analisis de la Dinamica Caotica para la Series Temporales de Variables Meteorologicas en la Estacion Climatica de Chone;
Universidad de las Fuerzas Armadas ESPE: Sangolquí, Ecuador, 2017; Available online: [Link]
21000/13629 (accessed on 24 November 2022).
2. Erdil, A.; Arcaklioglu, E. The prediction of meteorological variables using artificial neural network. Neural Comput. Appl. 2013, 22,
1677–1683. [CrossRef]
3. Ruiz-Ayala, D.C.; Vides-Herrera, C.A.; Pardo-García, A. Monitoreo de variables meteorológicas a través de un sistema inalámbrico
de adquisición de datos. Rev. Investig. Desarro. Innov. 2018, 8, 333–341. [CrossRef]
4. Inzunza, J.C. Meteorologia Descriptiva. Univ. Concepción Dep. Geofísica 2015, 1–34. Available online: [Link]
~jinzunza/meteo/[Link] (accessed on 24 November 2022).
5. Millán, H.; Kalauzi, A.; Cukic, M.; Biondi, R. Nonlinear dynamics of meteorological variables: Multifractality and chaotic
invariants in daily records from Pastaza, Ecuador. Theor. Appl. Climatol. 2010, 102, 75–85. [CrossRef]
6. Acurio, W.; Pilco, V. Técnicas Estadísticas para la Modelación y Predicción de la Temperatura y Velocidad del Viento en la Provincia de
Chimborazo; Escuela Superior Politécnica de Chimborazo: Riobamba, Ecuador, 2019. Available online: [Link]
ec/handle/123456789/10955 (accessed on 28 November 2022).
7. Tong, H. Non-Linear Time Series: A Dynamical System Approach; Oxford University Press: Oxford, UK, 1990.
8. Palma-Behnke, R.; Benavides, C.; Aranda, E.; Llanos, J.; Sáez, D. Energy management system for a renewable based microgrid
with a demand side management mechanism. In Proceedings of the IEEE Symposium on Computational Intelligence Applications
in Smart Grid 2011, Paris, France, 11–15 April 2011; pp. 1–8. [CrossRef]
9. Rodríguez, M.; Salazar, A.; Arcos-Aviles, D.; Llanos, J.; Martínez, W.; Motoasca, E. A Brief Approach of Microgrids Implementation
in Ecuador: A Review. In Lecture Notes in Electrical Engineering; Springer: Cham, Switzerland, 2021; Volume 762, pp. 149–163.
[CrossRef]
10. Llanos, J.; Morales, R.; Núñez, A.; Sáez, D.; Lacalle, M.; Marín, L.G.; Hernández, R.; Lanas, F. Load estimation for microgrid
planning based on a self-organizing map methodology. Appl. Soft Comput. 2017, 53, 323–335. [CrossRef]
11. Caquilpan, V.; Saez, D.; Hernandez, R.; Llanos, J.; Roje, T.; Nunez, A. Load estimation based on self-organizing maps and
Bayesian networks for microgrids design in rural zones. In Proceedings of the 2017 IEEE PES Innovative Smart Grid Technologies
Conference—Latin America (ISGT Latin America), Quito, Ecuador, 20–22 September 2017; pp. 1–6. [CrossRef]
12. Palma-Behnke, R.; Benavides, C.; Lanas, F.; Severino, B.; Reyes, L.; Llanos, J.; Saez, D. A microgrid energy management system
based on the rolling horizon strategy. IEEE Trans. Smart Grid 2013, 4, 996–1006. [CrossRef]
13. Rey, J.M.; Vera, G.A.; Acevedo-Rueda, P.; Solano, J.; Mantilla, M.A.; Llanos, J.; Sáez, D. A Review of Microgrids in Latin America:
Laboratories and Test Systems. IEEE Lat. Am. Trans. 2022, 20, 1000–1011. [CrossRef]
14. Javier, G.; Quevedo-Nolasco, A.; Castro-Popoca, M.; Arteaga-Ramírez, R.; Vázquez-Peña, M.A.; Zamora-Morales, B.P.; Aguado-
Rodríguez, G.J.; Quevedo-Nolasco, A.; Castro-Popoca, M.; Arteaga-Ramírez, R.; et al. Predicción de Variables Meteorológicas por
Medio de Modelos Arima. Agrociencia 2016, 50, 1–13.
15. Gulhane, V.A.; Rode, S.V.; Pande, C.B. Correlation Analysis of Soil Nutrients and Prediction Model Through ISO Cluster
Unsupervised Classification with Multispectral Data. Springer Link 2022, 82, 2165–2184. [CrossRef]
16. Pande, C.B.; Al-Ansari, N.; Kushwaha, N.L.; Srivastava, A.; Noor, R.; Kumar, M.; Moharir, K.N.; Elbeltagi, A. Forecasting of SPI
and Meteorological Drought Based on the Artificial Neural Network and M5P Model Tree. Land 2022, 11, 2040. [CrossRef]
17. Mora Cunllo, V.E. Diseño e Implementación de un Modelo Software Basado en Técnicas de Inteligencia Artificial, para Predecir el
índice de Radiación Solar en Riobamba-Ecuador. 2015. Available online: [Link]
1/[Link] (accessed on 24 November 2022).
Electronics 2023, 12, 1007
18. Universitario, S.; Estad, E.N.; Aplicada, S.; Fern, R.A.; Javier, F.; Morales, A. Series Temporales Avanzadas: Aplicación de Redes
Neuronales para el Pronóstico de Series de Tiempo; Universidad de Granada: Granada, Spain, 2021.
19. Raschka, S.; Patterson, J.; Nolet, C. Machine learning in python: Main developments and technology trends in data science,
machine learning, and artificial intelligence. Information 2020, 11, 193. [CrossRef]
20. Carlos, J.; Rodriguez, M. Desarrollo de una Herramienta Inteligente Centrada en Visión Plantaciones de Arroz, Usando Lenguaje
de Programación Python. Ph.D. Thesis, Universidad de Guayaquil, Guayaquil, Ecuador, 2022.
21. Ben Bouallègue, Z.; Cooper, F.; Chantry, M.; Düben, P.; Bechtold, P.; Sandu, I. Statistical modelling of 2m temperature and 10m
wind speed forecast errors. Mon. Weather. Rev. 2022. Available online: [Link]
MWR-D-22-0107.1/[Link] (accessed on 18 January 2023). [CrossRef]
22. Montero Granados, R. Modelos de Regresión Lineal Múltiple; Technical Report; Documentos de Trabajo en Economía Aplicada;
Universidad de Granada: Granada, Spain, 2006.
23. Aurélien, G. Hands-on Machine Learning with Scikit-Learn & Tensorflow; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2017.
24. Elbeltagi, A.; Kumar, M.; Kushwaha, N.L.; Pande, C.B.; Ditthakit, P.; Vishwakarma, D.K.; Subeesh, A. Drought indicator analysis
and forecasting using data driven models: Case study in Jaisalmer, India. Stoch. Environ. Res. Risk Assess. 2022, 37, 113–131.
[CrossRef]
25. Luckner, M.; Topolski, B.; Mazurek, M. Application of XGBoost Algorithm. Data Anal. 2017, 10244, 661–671. [CrossRef]
26. Menacho Chiok, C.H. Modelos de regresión lineal con redes neuronales. An. Científicos 2014, 75, 253. [CrossRef]
27. Popescu, M.C.; Balas, V.E.; Perescu-Popescu, L.; Mastorakis, N. Multilayer perceptron and neural networks. WSEAS Trans.
Circuits Syst. 2009, 8, 579–588.
28. Soto-Bravo, F.; González-Lutz, M.I. Analysis of statistical methods to evaluate the performance of simulation models in horticul-
tural crops. Agron. Mesoam. 2019, 30, 517–534. [CrossRef]
29. Gopi, A.; Sharma, P.; Sudhakar, K.; Ngui, W.K.; Kirpichnikova, I.; Cuce, E. Weather Impact on Solar Farm Performance: A
Comparative Analysis of Machine Learning Techniques. Sustainability 2023, 15, 439. [CrossRef]
30. de Myttenaere, A.; Golden, B.; Le Grand, B.; Rossi, F. Mean Absolute Percentage Error for regression models. Neurocomputing
2016, 192, 38–48. [CrossRef]
31. Chai, T.; Draxler, R.R. Root mean square error (RMSE) or mean absolute error (MAE)? -Arguments against avoiding RMSE in the
literature. Geosci. Model Dev. 2014, 7, 1247–1250. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
Review
Bias Mitigation via Synthetic Data Generation: A Review
Mohamed Ashik Shahul Hameed, Asifa Mehmood Qureshi and Abhishek Kaushik *
Abstract: Artificial intelligence (AI) is widely used in healthcare applications to perform various
tasks. Although these models have great potential to improve the healthcare system, they have also
raised significant ethical concerns, including biases that increase the risk of health disparities in
medical applications. The under-representation of a specific group can lead to bias in the datasets,
which is then replicated in the AI models. These disadvantaged groups are disproportionately affected
by bias because algorithmic forecasts for them may be less accurate or may underestimate their need for
treatment. One solution to eliminate bias is to use synthetic samples or artificially generated data
to balance datasets. Therefore, the purpose of this study is to review and evaluate how synthetic
data can be generated and used to mitigate biases, specifically focusing on the medical domain.
We explored high-quality peer-reviewed articles that were focused on synthetic data generation to
eliminate bias. These studies were selected based on our defined inclusion criteria and exclusion
criteria and the quality of the content. The findings reveal that generated synthetic data can help
improve accuracy, precision, and fairness. However, the effectiveness of synthetic data is closely
dependent on the quality of the data generation process and the initial datasets used. The study also
highlights the need for continuous improvement in synthetic data generation techniques and the
importance of evaluation metrics for fairness in AI models.
Keywords: synthetic data; artificial data; bias; fairness; AI; data generation
Citation: Shahul Hameed, M.A.; Qureshi, A.M.; Kaushik, A. Bias Mitigation via Synthetic Data
Generation: A Review. Electronics 2024, 13, 3909. [Link] 10.3390/electronics13193909
Academic Editor: Grzegorz Dudek
Received: 29 August 2024
Revised: 23 September 2024
Accepted: 24 September 2024
Published: 2 October 2024
Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution (CC BY)
license (https://[Link]/licenses/by/4.0/).
1. Introduction
Bias in AI models can be defined as systematic errors that affect the algorithm and unfairly favor
certain outcomes over others [1]. These biases can originate from various sources, such as imbalanced
or unrepresentative training datasets, the algorithms themselves, and the ways in which humans deploy
AI models. In healthcare, such biases can lead to disparities in treatment outcomes for specific
racial, gender, or demographic groups, resulting in unequal treatment that may be life-threatening.
They directly affect patient outcomes: biased models can lead to misdiagnosis, inappropriate treatment
plans, and unequal access to medical resources [2]. For instance, a model for estimating breast cancer
density that is trained on data under-representing African-American women will generate
recommendations that are not well suited to that population [3]. Moreover, Kiyasseh et al. [4] deployed
a surgical AI model to assess skill levels across different surgical activities in robotic surgery and
found that the model exhibited underskilling and overskilling biases. Underskilling occurs when an AI
model incorrectly predicts that a surgical skill is of a lower grade than it actually is, thereby
downgrading the assessed surgical performance. In contrast, overskilling occurs when the model
incorrectly upgrades the assessed performance by predicting that a skill is of a higher caliber than
it actually is.
Therefore, the broad aim of bias mitigation in AI models within the medical domain is to ensure
fairness, accuracy, and equality, creating fair and equal outcomes for diverse populations.
There are several techniques to eliminate biases from datasets. These techniques include the
collection of diverse and representative data [5], handling at the algorithmic level [6],
pre-processing of datasets and post-processing of the model output to reduce bias [7], and synthetic
data generation [8].
Synthetic data generation is one of several strategies to mitigate bias. Synthetic data can be
defined as artificial data that mimic the statistical properties and patterns of real-world
information [9]. While other techniques tend to reduce or process the datasets to ensure fairness,
which may result in information loss, synthetic data generation helps to preserve the data
distribution and adds statistically similar data samples to reduce bias. In healthcare settings,
synthetic data can be used as a substitute for real patient data, which avoids privacy issues and
allows sensitive patient data to be simulated and shared. Multiple methods exist to generate synthetic
data, including Generative Adversarial Networks (GANs), Variational Auto-Encoders (VAEs), the
Synthetic Minority Over-sampling Technique (SMOTE), Bayesian networks, and statistical sampling
methods. These methods have been widely used to generate new artificial datasets that capture the
structure and distribution of actual datasets. Such synthetic datasets can help in situations where
the data are biased or no additional data are available, while improving the fairness and accuracy of
AI algorithms [4]. Moreover, synthetic data allow large, diverse datasets to be created for training
models.
This article reviews research studies that focus on bias mitigation via synthetic data generation
techniques.
Motivation
AI systems can produce biased results that both reflect and amplify human prejudices
within a community, including historical and contemporary social injustice. Bias may exist
in the algorithm, the original training set, or the predictions the algorithm makes. It can be
passed on through racial, religious, and gender stereotypes [10]. Synthetic data can be a
viable solution to preserve patient data privacy and mitigate bias in the data. Synthetic data
methodologies tend to capture the structure and distributions found in real datasets while
minimizing bias and protecting individually identifiable information [11]. Moreover, synthetic data
allows us to create large, diverse datasets to train AI models. Therefore, this paper explores
the literature to highlight the application of synthetic data generation to mitigate bias. The
study provides a comprehensive review of the methods and techniques used to generate
synthetic data. Also, the limitations of these models are discussed. The hypothesis and
formulated research questions are as follows:
Hypothesis: Synthetic data can be used to mitigate bias in medical data effectively, resulting in
unbiased, accurate, and equitable healthcare datasets.
RQ1. What are the most common methods or techniques to generate synthetic data to
handle biases in the dataset?
RQ2. What are the limitations of the existing techniques used for synthetic data generation?
The rest of the article is structured as follows:
Section 2 explains the search methodology. Section 3 gives an overview of the bias
mitigation techniques. Section 4 discusses the hypothesis and research questions. Section 5
concludes the discussion.
2. Search Methodology
Figure 1 shows the detailed search methodology used to conduct this review. Several major databases,
including Google Scholar, IEEE Xplore, PubMed, ScienceDirect, and ACM Digital Library, were explored.
Google Scholar was used for its broad and inclusive nature. It covers many topics and indexes many
articles from interdisciplinary fields, which makes it a useful resource for literature reviews.
PubMed was chosen for its focus on biomedical and life science literature. Since the focus is on AI
applications in healthcare, PubMed provides high-quality reviews, survey papers, and peer-reviewed
medical research articles on the impact and implementation of AI in clinical settings. The ACM Digital
Library was included for its focus on computing and information technology research. It is well suited
to finding research on algorithm development, computational methods, and technical advancements in
synthetic data generation and bias mitigation in AI. IEEE Xplore was chosen for its large collection
of technology research and application implementation papers, especially in AI, ML, and synthetic
data. It contains technical papers and conference articles on the latest AI technologies and their
applications in healthcare.
The selections are based on the scope, relevance, and credibility of the databases and articles in
the fields of AI, Machine Learning (ML), synthetic data, and healthcare. A search query was formulated
to find research articles on bias in AI, ML, Deep Learning (DL), and synthetic data generation in
healthcare:
“((Bias) AND (Artificial Intelligence OR Machine Learning OR Deep Learning) AND (Synthetic Data) AND (text OR tabular dataset))”
Slight adjustments were made to the query to refine the results further. These adjustments included
using synonyms of the query terms, such as fairness, biases, data bias, artificial data, generated
data, and medical data.
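As a rough illustration of how the base query and a synonym-expanded variant could be assembled, the following Python sketch builds the Boolean string from groups of terms. The helper name `build_query` and the way the synonyms are assigned to groups are assumptions made for illustration; the exact variants submitted to each database are not stated in the text.

```python
def build_query(groups):
    """Join alternatives inside a group with OR and the groups with AND."""
    parts = ["(" + " OR ".join(terms) + ")" for terms in groups]
    return "(" + " AND ".join(parts) + ")"


# Term groups of the base query, as quoted in the text.
base_groups = [
    ["Bias"],
    ["Artificial Intelligence", "Machine Learning", "Deep Learning"],
    ["Synthetic Data"],
    ["text", "tabular dataset"],
]
base_query = build_query(base_groups)

# Hypothetical variant: the synonyms listed in the text are added as
# alternatives. Which group each synonym belongs to is an assumption.
variant_groups = [
    ["Bias", "Biases", "Data Bias", "Fairness"],
    ["Artificial Intelligence", "Machine Learning", "Deep Learning"],
    ["Synthetic Data", "Artificial Data", "Generated Data"],
    ["text", "tabular dataset", "Medical Data"],
]
variant_query = build_query(variant_groups)

print(base_query)
print(variant_query)
```

Keeping the terms in lists rather than in a hand-written string makes it easier to record which variant produced which set of results.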
To obtain the most relevant and recent studies, inclusion and exclusion criteria were
devised as follows:
• Research articles published between 2020 and 2024 were included to review the
relevant, latest techniques.
• Research studies that were published in conferences, journals, book chapters, or
proceedings were included.
• Only research articles written in the English language were selected.
The proportion of selected research studies from each database is shown in Figure 2.
Figure 3 shows the year-wise distribution of the selected articles. The year 2021 accounted for the
highest number of selected articles, followed by the years 2022 and 2023.
Moreover, the categories of the selected articles are given in Table 2. A total of 12 journal
articles, 4 conference papers, and 1 doctoral dissertation were reviewed.
Figure 3. Year-wise distribution of the selected articles (x-axis: year, 2020 to 2023).
Table 3. Overview of the methodology, dataset, modality, evaluation metric, strengths, and limitations
of the reviewed research articles.
dataset comprising 499,344 patient records with 21 features. The results show an
improvement in the Area Under the Curve (AUC), Receiver Operating Characteristic (ROC),
and Precision–Recall curves. The proposed model is effective in privacy preservation and
bias identification. However, the approach is complex and relies on the quality of the initial
dataset, requiring adjustments for different datasets.
Also, van Breugel et al. [16] investigate the DEbiasing CAusal Fairness (DECAF) framework, which
uses a structural causal model and biased edge removal to produce fair synthetic data. The work
centers on data generation while ensuring high-quality synthetic data. Fairness is evaluated using
demographic parity and equal opportunity. The framework is compatible with several fairness
definitions, but it requires an understanding of the causal relationships in the data and focuses
mainly on tabular data, so expert knowledge is needed to apply it in further studies.
Gujar et al. [14] proposed a framework named GenEthos that uses GANs and an interactive Graphical
User Interface (GUI) to produce synthetic data with integrated bias detection. The tool is assessed
on datasets including the German Credit and Adult datasets, using statistical fairness metrics such
as Statistical Parity Difference (SPD) and Disparate Impact (DI). Learning Fair Representations (LFR)
improved fairness by 62% and reduced SPD by 93%. Despite its interactive GUI, effective bias
mitigation, and comprehensive assessment, the approach is specific to the datasets and algorithms
used, which limits its generalizability.
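The two metrics named above can be made concrete with a minimal Python sketch. It uses the common definitions SPD = P(favorable | unprivileged) - P(favorable | privileged) and DI = P(favorable | unprivileged) / P(favorable | privileged). The function names and the toy predictions are assumptions for illustration only and are not taken from GenEthos.

```python
def selection_rate(y_pred, groups, value):
    """Fraction of favorable predictions (1) among the samples of one group."""
    preds = [p for p, g in zip(y_pred, groups) if g == value]
    if not preds:
        raise ValueError(f"no samples in group {value!r}")
    return sum(preds) / len(preds)


def statistical_parity_difference(y_pred, groups, unprivileged, privileged):
    """SPD: difference in selection rates; 0 indicates parity."""
    return (selection_rate(y_pred, groups, unprivileged)
            - selection_rate(y_pred, groups, privileged))


def disparate_impact(y_pred, groups, unprivileged, privileged):
    """DI: ratio of selection rates; 1 indicates parity."""
    privileged_rate = selection_rate(y_pred, groups, privileged)
    if privileged_rate == 0:
        raise ZeroDivisionError("privileged group has no favorable predictions")
    return selection_rate(y_pred, groups, unprivileged) / privileged_rate


# Made-up predictions; group "a" is treated as privileged in this example.
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

spd = statistical_parity_difference(y_pred, groups, unprivileged="b", privileged="a")
di = disparate_impact(y_pred, groups, unprivileged="b", privileged="a")
print(f"SPD = {spd:.2f}, DI = {di:.2f}")
```

In this toy case the favorable rate is 0.75 for group "a" and 0.25 for group "b", so SPD is -0.5 and DI is about 0.33; a bias mitigation step would aim to move SPD toward 0 and DI toward 1.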
Sharafutdinov et al. [15] use mechanistic virtual patient (VP) modeling to reduce biases in
heterogeneous ICU datasets used to identify acute respiratory distress syndrome (ARDS) with machine
learning techniques. Using observational data of mixed origin from different hospitals, the VP
model-based clustering approach improves cluster discovery and mitigates bias compared with
clustering based solely on the original patient data. However, the approach depends on the
availability of heterogeneous datasets and relevant observational data, which makes it complex to use.
Paladugu et al. [19] review the application of GANs in medical imaging and AI training, emphasizing
synthetic data generation and privacy preservation. The review compares different GAN models on
medical datasets such as Magnetic Resonance Imaging (MRI), Computed Tomography (CT) scans, and retinal
images. Metrics such as precision, accuracy, recall, F1-score, and visual quality assessments are
used to evaluate the generated data. The metrics vary with each use case, and GANs appear promising
for producing high-quality synthetic medical images and patterns. However, the work centers on
theoretical perspectives and offers limited empirical data.
Moreover, Celi et al. [20] explore how bias in AI emerges from clinical medicine data, focusing on
population and data-source disparities. Using clinical papers published in PubMed in 2019, the study
surveys disparities in dataset country sources and clinical demographics (nationality, sex, and
expertise). The authors manually tagged a subsample of articles to train a model using
transfer-learning techniques with the Bidirectional Encoder Representations from Transformers for
Biomedical Text Mining (BioBERT) model to make predictions. The survey highlights marked disparities:
datasets came predominantly from the US (40.8%) and China (13.7%), and, among clinical specialties,
radiology accounted for 40.4%. The comprehensive analysis, which covered 300,000 articles and relied
on transfer learning with BioBERT, highlights these disparities in data. However, the
transfer-learning and automated tools used to analyze the datasets are mostly trained on US and
Chinese data.
Also, Fletcher et al. [21] investigate a framework for assessing fairness, bias, and the use of
AI/ML in global healthcare, specifically in low- and middle-income countries. Logistic regression
models trained on clinical records from 200 patients achieved around 89.2% accuracy for both genders.
However, the ROC may vary across groups, and the framework depends on the quality and diversity of
the training data.
Additionally, Yogarajan et al. [22] analyze data and algorithmic bias in data collection and in
model development, training, and testing, using healthcare records collected in New Zealand and
focusing on the Māori population. Fairness measures, including impact scores, equal opportunity, and
equalized odds, were used to assess bias and disparities. The results show evidence of bias when
changes were made to algorithmic designs. However, employing fairness metrics enhances transparency
and helps to reduce bias against underrepresented populations.
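Equal opportunity and equalized odds both compare error rates between groups, and a short Python sketch may help to separate them. Equal opportunity compares true positive rates (TPR) only, whereas equalized odds also compares false positive rates (FPR). Summarizing equalized odds as the larger of the two absolute gaps is one common convention and is an assumption here, as are the function names and the toy labels; none of this is taken from the reviewed study.

```python
def group_rates(y_true, y_pred, groups, value):
    """Return (true positive rate, false positive rate) for one group."""
    tp = fp = fn = tn = 0
    for t, p, g in zip(y_true, y_pred, groups):
        if g != value:
            continue
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 1 and p == 0:
            fn += 1
        else:
            tn += 1
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError(f"group {value!r} needs both positive and negative labels")
    return tp / (tp + fn), fp / (fp + tn)


def equal_opportunity_difference(y_true, y_pred, groups, unprivileged, privileged):
    """TPR(unprivileged) - TPR(privileged); 0 indicates equal opportunity."""
    tpr_u, _ = group_rates(y_true, y_pred, groups, unprivileged)
    tpr_p, _ = group_rates(y_true, y_pred, groups, privileged)
    return tpr_u - tpr_p


def equalized_odds_gap(y_true, y_pred, groups, unprivileged, privileged):
    """Largest absolute gap in TPR or FPR; 0 indicates equalized odds."""
    tpr_u, fpr_u = group_rates(y_true, y_pred, groups, unprivileged)
    tpr_p, fpr_p = group_rates(y_true, y_pred, groups, privileged)
    return max(abs(tpr_u - tpr_p), abs(fpr_u - fpr_p))


# Made-up labels and predictions for two groups of four patients each.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

eod = equal_opportunity_difference(y_true, y_pred, groups, "b", "a")
gap = equalized_odds_gap(y_true, y_pred, groups, "b", "a")
print(f"equal opportunity difference = {eod:.2f}, equalized odds gap = {gap:.2f}")
```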
Yang et al. [23] present a Deep Reinforcement Learning (DRL) framework for mitigating bias acquired
during the data collection process. The model is evaluated on COVID-19 prediction, with the aim of
eliminating hospital- and ethnicity-based biases. The technique includes adversarial training to
decrease biases during model development. The framework is also evaluated on the Electronic Intensive
Care Unit (eICU) Collaborative Research Database to predict patient discharge status. The reported
AUC-ROC scores range from 0.818 to 0.875 using the XGBoost and Random Forest (RF) models.
Libbi et al. [25] explore the use of synthetic data in healthcare, particularly for preserving
individual privacy, by generating synthetic Electronic Health Record (EHR) text. Using neural
language models such as Long Short-Term Memory (LSTM) networks and Generative Pre-trained
Transformer 2 (GPT-2), the study aims to generate synthetic EHR text that incorporates in-text
Protected Health Information (PHI) labels as annotations for Named-Entity Recognition (NER). Privacy
was evaluated using Recall-Oriented Understudy for Gisting Evaluation (ROUGE) n-gram scores and Best
Match 25 (BM25) scoring. A dataset of one million Dutch EHR notes was used to train these models. The
results show that combining real and synthetic data achieved 95% accuracy in de-identification, with
the LSTM approach offering higher utility than GPT-2. However, the findings are specific to Dutch EHR
datasets.
Pettit et al. [26] discuss the application of AI, ML, and DL in healthcare, with a specific focus on
the data processing model, clinical outcome prediction, utilization of data, and AI fairness.
Different AI models and algorithms for reducing bias in datasets are discussed, including linear and
regression models, decision trees, ensemble learning, and neural networks. The paper emphasizes
accuracy, precision, recall, and the F1-score as measures of the performance of AI models.
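For reference, the four performance measures mentioned above can be computed from the confusion counts as in the following Python sketch. The function name and the toy labels are assumptions for illustration; the example is deliberately imbalanced to show why accuracy alone can hide poor performance on a minority class.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1-score for binary labels (1 = positive)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Undefined ratios are reported as 0.0 in this sketch.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up, imbalanced example: 3 positive and 7 negative cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here the accuracy is 0.8 even though only one of the three positive cases is detected (recall of about 0.33, F1-score of 0.5), which is why recall and the F1-score are usually reported alongside accuracy.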
Rodriguez et al. [24] examine synthetic data generation to address the problem of small and
imbalanced medical datasets containing tabular data of different sizes. The proposed method uses
techniques such as Gaussian Copulas, Conditional Tabular Generative Adversarial Networks (CTGAN), and
the Synthetic Data Vault (SDV). The models are evaluated on eight medical datasets: MNCD, MNCD-RED,
BANG, EarlyDM, HeartDis, Kidney, PIMA, and SACardio. The results show that synthetic data effectively
improved machine learning model training, even without the original dataset. Although evaluated on
multiple datasets, advanced synthetic data generation techniques such as CTGAN and SDV depend on the
quality of the original datasets and are not time-efficient.
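To give an intuition for the Gaussian copula technique mentioned above, the following Python sketch generates synthetic pairs for two numeric columns using only the standard library. It is a simplified illustration under stated assumptions, not the SDV implementation: it handles two columns only, uses rank-based normal scores, and maps samples back through the empirical quantiles of the original columns. All function names and the toy values are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def to_normal_scores(column):
    """Rank-transform a column to uniform scores, then to standard normal scores."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def empirical_quantile(sorted_column, u):
    """Map a probability u in [0, 1] back to an observed value."""
    idx = min(int(u * len(sorted_column)), len(sorted_column) - 1)
    return sorted_column[idx]


def sample_gaussian_copula(col_a, col_b, n_samples, seed=0):
    """Draw synthetic (a, b) pairs that keep the rank dependence of the inputs."""
    rng = random.Random(seed)
    rho = pearson(to_normal_scores(col_a), to_normal_scores(col_b))
    rho = max(-1.0, min(1.0, rho))  # guard against rounding just outside [-1, 1]
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    synthetic = []
    for _ in range(n_samples):
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        z_a = e1
        z_b = rho * e1 + math.sqrt(1 - rho ** 2) * e2  # correlated normal pair
        synthetic.append((empirical_quantile(sorted_a, STD_NORMAL.cdf(z_a)),
                          empirical_quantile(sorted_b, STD_NORMAL.cdf(z_b))))
    return synthetic


# Made-up values standing in for age and systolic blood pressure.
ages = [25, 32, 41, 47, 53, 60, 68, 74]
pressures = [112, 118, 115, 130, 127, 141, 138, 152]

synthetic = sample_gaussian_copula(ages, pressures, n_samples=200, seed=1)
print(synthetic[:5])
```

Because the sketch resamples observed values, it cannot produce values outside the original range; library implementations typically fit parametric marginal distributions instead.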
Baumann et al. [27] present a strategy to generate synthetic data with controlled bias in order to
analyze the impact of bias on model performance. They propose an open-source toolkit that generates
synthetic data with different types of bias. The resulting datasets with predefined bias are used to
increase awareness of bias in AI and of how it affects individuals and society. Although the toolkit
offers control over the biases to be introduced, it may not capture real-world complexities.
Moreover, Draghi et al. [11] compare different synthetic data generation methods to reduce bias in
healthcare data. Using large-scale anonymized CPRD datasets comprising both real and synthetic data,
the results show a reduction in bias of 15–20% and accuracy changes of 10–12%, with a
re-identification risk below 5%. Overall, data utility was retained at 90–95%. The approach is
effective in bias mitigation and ensures privacy. However, it is resource-intensive.
The next section analyzes these studies in light of our hypothesis and research questions.
4. Discussion
Based on our investigation, synthetic data generation provides opportunities to minimize biases in
the data by generating artificial samples that are statistically similar to the real data. Therefore,
we fail to reject our hypothesis that synthetic data can be used to mitigate bias in medical datasets
effectively. The previously formulated research questions are answered below:
RQ1. What are the methods or techniques to generate synthetic data to handle biases
in datasets?
• Generative Adversarial Network (GAN): GANs use two neural networks, the generator and the
discriminator, which compete to produce realistic synthetic data. The generator creates data, while
the discriminator compares it against real data; this feedback iteratively improves the generator’s
output. For example, SynSigGAN uses an LSTM as a generator and a CNN as a discriminator to produce
high-quality synthetic biomedical time-series signals such as ECG, EEG, and EMG [18].
• Bayesian network: Bayesian networks use probabilistic models to simulate data with controlled
biases. They can model complex dependencies between variables, providing a framework for generating
synthetic datasets. For example, the BayesBoost method employs Bayesian networks to generate
synthetic samples for under-represented groups in healthcare data, showing improvements in AUC and
ROC curve metrics [17].
• Structural Causal Models (SCMs): SCMs focus on the causal relationships within data. Biased edges
are removed from the causal graph to generate fair synthetic data, so that the generated data adhere
to fairness criteria such as demographic parity and equal opportunity. For example, DECAF is a
framework that uses SCMs to create fair synthetic data, maintaining high data utility and achieving
fairness through causally aware generative networks [16].
• Synthetic Minority Over-sampling Technique (SMOTE): SMOTE is an oversampling technique designed to
address class imbalance in datasets. It is a data augmentation approach that creates additional data
points by interpolating between a minority-class sample and one of its k-nearest minority-class
neighbors. This makes the data more balanced, which enhances the minority class’s representation and
reduces bias in classification tasks. For example, if a dataset has 100 instances of class A and
10 instances of class B, SMOTE would generate additional synthetic instances of class B to balance
the dataset [28].
• Gaussian Copulas: Copula-based techniques generate synthetic data by modeling the dependence
structure between variables separately from their marginal distributions. A Gaussian copula
transforms each variable to a uniform marginal and represents the dependence between the transformed
variables with a multivariate normal correlation structure, which captures the relations between
features and helps mitigate bias among them. This technique is especially helpful for producing
high-dimensional data with complex dependencies, such as medical records with multiple variables like
age, blood pressure, and cholesterol levels [24].
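The SMOTE interpolation step described in the list above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the reference implementation; the function name `smote`, the default `k=3`, and the toy points are assumptions. The example mirrors the case in the text: 10 instances of class B are topped up with 90 synthetic ones to match 100 instances of class A.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest neighbours of the base point (excluding itself).
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        neighbour = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Made-up 2-D feature vectors for the 10 instances of minority class B.
class_b = [(float(i), float(i % 3)) for i in range(10)]

# Class A has 100 instances, so 90 synthetic class-B points balance the data.
new_b = smote(class_b, n_new=90, k=3, seed=42)
print(len(class_b) + len(new_b))  # 100
```

Each synthetic point lies on the line segment between two existing minority samples, so SMOTE densifies the minority region rather than duplicating records; it also means that noise or bias already present in the minority samples is carried over.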
RQ2. What are the limitations of the existing techniques used for synthetic data generation?
Despite the wide application of synthetic data generation models, some limitations still need to be
addressed. These limitations include the following:
• Techniques like GANs, CTGAN, and adversarial training require significant computational resources
and high-quality initial datasets with meaningful data, which can be a barrier to the widespread
adoption and implementation of these methods [18].
• In Bayesian networks, the generation and effectiveness of synthetic data depend heavily on the
quality of the original datasets. Poor-quality or biased input data can lead to synthetic data that
are also biased [17].
5. Conclusions
This article reviewed recent research studies to investigate the application of synthetic data to
mitigate bias in healthcare datasets. For this purpose, different databases, including Google
Scholar, PubMed, ACM Digital Library, and IEEE Xplore, were explored to find relevant research
studies. A total of 17 articles were selected and reviewed, including conference papers, journal
articles, and a doctoral dissertation. For each article, the methodology, dataset, modality,
evaluation metric, strengths, and weaknesses were analyzed. The overall findings show that synthetic
data generation can enhance the fairness and accuracy of datasets used to train AI models by
artificially generating samples representing diverse patient groups. AI models trained on unbiased
datasets can, in turn, produce fair outcomes. Different techniques, including Gaussian Copulas,
SMOTE, Structural Causal Models, and GANs, are discussed in this review. Each data generation method
has its own limitations, and the efficacy of synthetic data is contingent upon the quality of the
generation process and the initial datasets. Challenges such as ensuring high-quality initial
datasets, managing computational complexity, ethical considerations, and privacy concerns need to be
addressed to fully realize the potential of synthetic data generation. While the study provides
useful insights into the application of synthetic data to mitigate bias, it is essential to highlight
certain limitations that may have influenced the outcome. One significant limitation is the selection
of databases; we did not include the Scopus and Web of Science databases, which limits the scope of
our study. These platforms provide a greater selection of research studies that could have enhanced
the findings. Therefore, in the future, a wider range of databases will be used to provide a more
comprehensive analysis. Despite this, the current study lays a strong foundation for future
investigation in the area of artificial data generation to handle biases.
Data Availability Statement: No new data were created or analyzed in this study. Data sharing is
not applicable.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Tavares, S.; Ferrara, E. Fairness and Bias in Artificial Intelligence: A Brief Survey of Sources, Impacts, and Mitigation Strategies.
Sci 2024, 6, 3. [CrossRef]
2. Jain, A.; Brooks, J.R.; Alford, C.C.; Chang, C.S.; Mueller, N.M.; Umscheid, C.A.; Bierman, A.S. Awareness of racial and ethnic
bias and potential solutions to address bias with use of health care algorithms. JAMA Health Forum 2023, 4, e231197.
[CrossRef] [PubMed]
3. Babic, B.; Gerke, S.; Evgeniou, T.; Glenn Cohen, I. Algorithms on Regulatory Lockdown in Medicine. Science 2019, 366,
1202–1204. [CrossRef]
4. Kiyasseh, D.; Laca, J.; Haque, T.F.; Miles, B.J.; Wagner, C.; Donoho, D.A.; Anandkumar, A.; Hung, A.J. A Multi-Institutional Study
Using Artificial Intelligence to Provide Reliable and Fair Feedback to Surgeons. Commun. Med. 2023, 3, 42. [CrossRef] [PubMed]
5. Mandal, A.; Leavy, S.; Little, S. Dataset Diversity: Measuring and Mitigating Geographical Bias in Image Search and Retrieval. In
Proceedings of the 1st International Workshop on Trustworthy AI for Multimedia Computing, Co-Located with ACM MM 2021,
Virtual, 20–24 October 2021; Volume 21, pp. 19–25. [CrossRef]
6. Kordzadeh, N.; Ghasemaghaei, M.; Mikalef, P.; Popovic, A.; Lundström, J.E.; Conboy, K. Algorithmic Bias: Review, Synthesis, and
Future Research Directions. Eur. J. Inf. Syst. 2022, 31, 388–409. [CrossRef]
7. Suresh, H.; Guttag, J. A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle. In
Proceedings of the 1st ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, Virtually, 5–9
October 2021. [CrossRef]
8. Naresh Mandhala, V.; Bhattacharyya, D.; Midhunchakkaravarthy, D.; Kim, H.-J. Detecting and Mitigating Bias in Data Using
Machine Learning with Pre-Training Metrics. Ingénierie Syst. d’Inf. 2022, 27, 119–125. [CrossRef]
9. Raghunathan, T.E. Synthetic Data. Annu. Rev. Stat. Appl. 2021, 8, 129–140. [CrossRef]
10. Kandpal, N.; Deng, H.; Roberts, A.; Wallace, E.; Raffel, C. Large Language Models Struggle to Learn Long-Tail Knowledge. In
Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 15696–15707.
11. Draghi, B.; Wang, Z.; Myles, P.; Tucker, A. Identifying and Handling Data Bias within Primary Healthcare Data Using Synthetic
Data Generators. Heliyon 2024, 10, e24164. [CrossRef] [PubMed]
12. Oblizanov, A.; Shevskaya, N.; Kazak, A.; Rudenko, M.; Dorofeeva, A. Evaluation Metrics Research for Explainable Artificial
Intelligence Global Methods Using Synthetic Data. Appl. Syst. Innov. 2023, 6, 26. [CrossRef]
13. Bhanot, K.; Bennett, K.P.; Hendler, J.A.; Zaki, M.J.; Guyon, I.; Baldini, I. Synthetic Data Generation and Evaluation for Fairness.
Doctoral Dissertation, Rensselaer Polytechnic Institute, Troy, NY, USA, 2023.
14. Gujar, S.; Shah, T.; Honawale, D.; Bhosale, V.; Khan, F.; Verma, D.; Ranjan, R. GenEthos: A Synthetic Data Generation System with
Bias Detection and Mitigation. In Proceedings of the International Conference on Computing, Communication, Security and
Intelligent Systems, IC3SIS 2022, Kochi, India, 23–25 June 2022. [CrossRef]
15. Sharafutdinov, K.; Fritsch, S.J.; Iravani, M.; Ghalati, P.F.; Saffaran, S.; Bates, D.G.; Hardman, J.G.; Polzin, R.; Mayer, H.; Marx, G.;
et al. Computational Simulation of Virtual Patients Reduces Dataset Bias and Improves Machine Learning-Based Detection of
ARDS from Noisy Heterogeneous ICU Datasets. IEEE Open J. Eng. Med. Biol. 2023, 5, 611–620. [CrossRef]
16. Van Breugel, B.; Kyono, T.; Berrevoets, J.; van der Schaar, M. DECAF: Generating Fair Synthetic Data Using Causally-Aware
Generative Networks. Adv. Neural Inf. Process. Syst. 2021, 34, 22221–22233.
17. Draghi, B.; Wang, Z.; Myles, P.; Tucker, A.; Moniz, N.; Branco, P.; Torgo, L.; Japkowicz, N.; Wo, M.; Wang, S. BayesBoost:
Identifying and Handling Bias Using Synthetic Data Generators. In Proceedings of the Third International Workshop on Learning
with Imbalanced Domains: Theory and Applications, Bilbao, Spain, 17 September 2021; Volume 154.
18. Hazra, D.; Byun, Y.C. SynSigGAN: Generative Adversarial Networks for Synthetic Biomedical Signal Generation. Biology 2020,
9, 441. [CrossRef] [PubMed]
19. Paladugu, P.S.; Ong, J.; Nelson, N.; Kamran, S.A.; Waisberg, E.; Zaman, N.; Kumar, R.; Dias, R.D.; Lee, A.G.; Tavakkoli, A.
Generative Adversarial Networks in Medicine: Important Considerations for This Emerging Innovation in Artificial Intelligence.
Ann. Biomed. Eng. 2023, 51, 2130–2142. [CrossRef]
20. Celi, L.A.; Cellini, J.; Charpignon, M.-L.; Dee, E.C.; Dernoncourt, F.; Eber, R.; Mitchell, W.G.; Moukheiber, L.; Schirmer, J.; Situ, J.;
et al. Sources of Bias in Artificial Intelligence That Perpetuate Healthcare Disparities—A Global Review. PLoS Digit. Health 2022,
1, e0000022. [CrossRef]
21. Fletcher, R.R.; Nakeshimana, A.; Olubeko, O. Addressing Fairness, Bias, and Appropriate Use of Artificial Intelligence and
Machine Learning in Global Health. Front. Artif. Intell. 2021, 3, 561802. [CrossRef]
22. Yogarajan, V.; Dobbie, G.; Leitch, S.; Keegan, T.T.; Bensemann, J.; Witbrock, M.; Asrani, V.; Reith, D. Data and Model Bias in
Artificial Intelligence for Healthcare Applications in New Zealand. Front. Comput. Sci. 2022, 4, 1070493. [CrossRef]
23. Yang, J.; Soltan, A.A.S.; Eyre, D.W.; Clifton, D.A. Algorithmic Fairness and Bias Mitigation for Clinical Machine Learning with
Deep Reinforcement Learning. Nat. Mach. Intell. 2023, 5, 884–894. [CrossRef] [PubMed]
24. Rodriguez-Almeida, A.J.; Fabelo, H.; Ortega, S.; Deniz, A.; Balea-Fernandez, F.J.; Quevedo, E.; Soguero-Ruiz, C.; Wagner, A.M.;
Callico, G.M. Synthetic Patient Data Generation and Evaluation in Disease Prediction Using Small and Imbalanced Datasets.
IEEE J. Biomed. Health Inf. 2023, 27, 2670–2680. [CrossRef] [PubMed]
25. Libbi, C.A.; Trienes, J.; Trieschnigg, D.; Seifert, C. Generating Synthetic Training Data for Supervised De-Identification of Electronic
Health Records. Future Internet 2021, 13, 136. [CrossRef]
26. Pettit, R.W.; Fullem, R.; Cheng, C.; Amos, C.I. Artificial Intelligence, Machine Learning, and Deep Learning for Clinical Outcome
Prediction. Emerg. Top. Life Sci. 2021, 5, 729–745. [CrossRef]
27. Baumann, J.; Castelnovo, A.; Cosentini, A.; Crupi, R.; Inverardi, N.; Regoli, D. Bias On Demand: Investigating Bias with a
Synthetic Data Generator. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI-23)
Demonstrations Track, Macao, China, 19–25 August 2023.
28. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-Sampling Technique. J. Artif. Intell.
Res. 2002, 16, 321–357. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
MDPI AG
Grosspeteranlage 5
4052 Basel
Switzerland
Tel.: +41 61 683 77 34
Disclaimer/Publisher’s Note: The title and front matter of this reprint are at the discretion of the
Guest Editor. The publisher is not responsible for their content or any associated concerns. The
statements, opinions and data contained in all individual articles are solely those of the individual
Editor and contributors and not of MDPI. MDPI disclaims responsibility for any injury to people or
property resulting from any ideas, methods, instructions or products referred to in the content.