Synthetic Data in Healthcare: A Review
ARTICLE INFO

Keywords: Synthetic data generation; Data privacy; Healthcare; Artificial intelligence; Tabular data; Imaging data; Radiomics data; Time-series data; Omics data; Multimodal data

ABSTRACT

Synthetic data generation has emerged as a promising solution to overcome the challenges which are posed by data scarcity and privacy concerns, as well as, to address the need for training artificial intelligence (AI) algorithms on unbiased data with sufficient sample size and statistical power. Our review explores the application and efficacy of synthetic data methods in healthcare considering the diversity of medical data. To this end, we systematically searched the PubMed and Scopus databases with a great focus on tabular, imaging, radiomics, time-series, and omics data. Studies involving multi-modal synthetic data generation were also explored. The type of method used for the synthetic data generation process was identified in each study and was categorized into statistical, probabilistic, machine learning, and deep learning. Emphasis was given to the programming languages used for the implementation of each method. Our evaluation revealed that the majority of the studies utilize synthetic data generators to: (i) reduce the cost and time required for clinical trials for rare diseases and conditions, (ii) enhance the predictive power of AI models in personalized medicine, (iii) ensure the delivery of fair treatment recommendations across diverse patient populations, and (iv) enable researchers to access high-quality, representative multimodal datasets without exposing sensitive patient information, among others. We underline the wide use of deep learning based synthetic data generators in 72.6 % of the included studies, with 75.3 % of the generators being implemented in Python. A thorough documentation of open-source repositories is finally provided to accelerate research in the field.
* Correspondence to: Dept. of Materials Science and Engineering, University of Ioannina, Ioannina GR45110, Greece.
E-mail address: [email protected] (D.I. Fotiadis).
1 ORCID: 0000-0002-7362-5082
https://doi.org/10.1016/j.csbj.2024.07.005
Received 10 June 2024; Received in revised form 4 July 2024; Accepted 4 July 2024
Available online 9 July 2024
2001-0370/© 2024 The Authors. Published by Elsevier B.V. on behalf of Research Network of Computational and Structural Biotechnology. This is an open access
article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
V.C. Pezoulas et al. Computational and Structural Biotechnology Journal 23 (2024) 2892–2910
data to improve the generalizability of the AI models across diverse populations. This is particularly crucial in the case where data can be skewed or underrepresented in the context of harmful bias (e.g. age, race, gender), where synthetic data generation can be utilized as a mitigation methodology.

Data privacy is a critical concern in the healthcare domain considering the sensitive nature of personal health information [3–5,11]. Any kind of data misuse or data breach can have severe implications for patients, which in turn undermines their trust in AI systems. Synthetic data can ensure that personal identifiers are completely absent, thereby safeguarding patient confidentiality, while allowing researchers to harness meaningful knowledge. This is particularly important when developing AI models, where access to large scale data is crucial to ensure their increased accuracy and reliability. Considering that the more data researchers access, the higher the risk of exposing sensitive information, the use of synthetic data mitigates the risk of exposing the real patient data. In this way, access to high-quality data is democratized, which accelerates innovation in AI and data science. In addition, the use of synthetic data can lead to more robust and generalized AI models that perform well across various demographics and conditions, thereby improving their equity and effectiveness. On the other hand, synthetic data must maintain a balance between realism and privacy. This balance is critical especially in the healthcare sector, where the predictive accuracy of the AI models has significant effects on patient outcomes. Harmful biases which are often introduced in real data such as gender identity and sexual orientation, cultural and religious beliefs, language and communication barriers, geographic location, occupational hazards, and health insurance status, can be mitigated by creating balanced data that reflect the diversity of the affected populations [12].

Synthetic data can serve as a substitute for real data when training AI models. But how can we generate synthetic data? Synthetic data can be generated by capturing the statistical properties of the real data to create new data points with similar properties. According to the literature, a variety of methods has been proposed for the generation of high-quality synthetic tabular, imaging, radiomics, time-series, and omics data, which are categorized into: (i) statistical-based methods, like the multivariate normal distribution (MVND) and bootstrapping to generate virtual populations for hypertension drug programs [13], (ii) probabilistic-based methods, like the Stochastic Block Models (SBM) [14] to integrate multi‑omics data with consistent (common) and differential cluster patterns and the time-evolving graphs with metastability [15] to validate methods for capturing temporary changes in the time-evolving graphs for human microbiome analysis, (iii) machine-learning based methods like the tree ensembles [16–18] for data augmentation to improve the performance of disease progression and risk stratification models for cardiovascular and autoimmune diseases, the Gaussian Mixture Models (GMM) [19–21] to generate large-scale virtual populations, at reduced complexity, for in silico clinical trials, and the Hidden Markov Models (HMMs) [22] to generate realistic synthetic behavior-based sensor data for activity recognition in smart homes, and (iv) deep-learning based methods, which dominate the literature, like the variational autoencoders (VAEs) to generate synthetic PPG signals [23] and myriad variations of the generative adversarial networks (GANs) like the Adaptive Deconfounding Synthetic GAN (ADS-GAN) to generate high-fidelity privacy-conscious synthetic patient data for causal effect estimation with multiple treatments [24], the Conditional GAN (CGAN) to generate realistic synthetic tabular data for benchmarking [25], the Wasserstein GAN with Gradient Penalty (WGAN-GP) to generate synthetic radiomics data from RT and CT images [26], the Copula GAN (CopulaGAN) for the generation of digital twins [27], the Multi-label Time Series GAN (MTGAN) to generate EHRs and simultaneously improve the quality of uncommon disease generation [28], the Transformer-Based Time Series GAN (TTS-GAN) to generate human heartbeat signals, timesteps, accelerator values, and sinusoidal waves [29], the Cycle-Consistent GAN (CycleGAN) [30–33], and the Dual-Discriminator Conditional GAN (DDcGAN) for multi-resolution PET and MR image fusion [34], among many others. However, two fundamental aspects should be taken into consideration prior to the training of any synthetic data generator: (i) data anonymization, and (ii) data fidelity. Data anonymization refers to the process of removing personally identifiable information from the data, so that the patients remain anonymous, whereas data fidelity refers to the degree to which synthetic data “mimic” the real data, as measured by a variety of metrics like the goodness of fit, correlation, and the Kullback-Leibler divergence, among many others [4,6,16]. High fidelity is vital to ensure that synthetic data can reliably replace real data without compromising data integrity.

Multimodality in healthcare refers to the use of multiple forms of data inputs (modalities) to aid in decision-making and patient care. These modalities can include tabular data (e.g., demographics, laboratory examinations, therapies, conditions), imaging data (e.g., CT, MRI, PET; and image based quantitative features which are referred to as radiomics), time-series data (e.g., ECG, EEG, PPG), and omics data (e.g., genomics, proteomics, lipidomics, metabolomics), among others, each providing different perspectives on patient health. The integration of these diverse data types presents unique challenges in data analysis, but it also offers a more holistic view of patient health leading to better outcomes. Synthetic data have a crucial role in this interplay since they can provide large and diverse data. However, privacy is an important factor which is not guaranteed by data fidelity. To this end, best practices for data protection, clearer standards for assessing identifiability, and proportionate regulatory approaches should be adopted to facilitate innovation while ensuring privacy. Thus, the availability of high-quality synthetic data can enable researchers to develop multimodal AI models. Furthermore, synthetic data can enable the simulation of complex patient scenarios that might not be frequently encountered in real datasets, thereby enhancing the robustness of healthcare systems against rare but critical conditions. Moreover, by utilizing synthetic data, researchers can bypass many logistical and ethical hurdles that occur during the aggregation and analysis of multimodal data, thus accelerating the pace of research. Ultimately, the use of synthetic data can significantly advance personalized medicine, improving treatment efficacy and patient outcomes while upholding stringent data privacy standards.

The current review aims to provide a thorough analysis of synthetic data generation methodologies, open-source repositories with codes and synthetic data to drive innovation and address common challenges more effectively across various healthcare domains, as well as to improve the impact of synthetic data in targeted medical research and practice. The primary objectives of this review are the following: (i) to provide a better understanding of the methods that are used to generate synthetic tabular, imaging, omics, and time-series data in healthcare, (ii) to provide open source repositories to implement these methods, (iii) to explore applications and benefits of using synthetic data in healthcare, (iv) to evaluate the impact of synthetic data on patient privacy and regulatory compliance, (v) to highlight the challenges and limitations of synthetic data, and (vi) to suggest future directions for research and development in this area.

2. Methods

2.1. Review process

We conducted a systematic review of the existing literature based on the PubMed and Scopus databases to ensure a robust thematic analysis of the different use cases on synthetic data generation technologies in healthcare based on high quality peer reviewed journals and international conferences. Our analysis focuses on five main types of data: tabular data, imaging data, radiomics data (image-based quantitative features), time-series data, and omics data. The special case of multimodal synthetic data generation was also investigated. A custom Python
script was developed to automate the retrieval process. The script iterates over each year from 2015 to 2024 to apply the respective search query for each data type and retrieves the count of publications per year. The Scopus API (https://api.elsevier.com/content/search/scopus) and the PubMed API (https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi) were utilized to send HTTP requests with specific queries and additional parameters to obtain the total number of results. The function iterates over each year within the defined range (from 2015 to 2024), updating the query to include the publication date for that year, and retrieves the count of publications. These counts are stored in a dictionary for each data type. Once the metadata have been collected, the total counts per year and the final counts for each data type are calculated and saved into CSV files for further analysis.

Six individual database queries were designed and executed with a focus on the retrieval of studies which are related to the use/generation of: (i) synthetic tabular data (or virtual data or virtual population) by excluding papers related to imaging, text, videos, and time series data, to better capture advances in the healthcare domain focusing on clinical and lifestyle data (e.g. demographics, conditions, therapies, patient history), (ii) synthetic imaging data with a focus on the GANs or similar deep-learning architectures for the generation of synthetic medical images while excluding text, tabular data, videos, and time series to tailor the query for medical imaging applications, (iii) synthetic radiomics data by exploring studies within the radiomics field involving the extraction of large amounts of features from medical images using data-characterization algorithms, (iv) synthetic time-series data by identifying papers with a focus on the generation and use of longitudinal, temporal data, and various biosignals like EEG, ECG, PPG, MEG, wearables, and vital sensors, (v) synthetic omics data by excluding text, tabular data, demographics, videos, imaging, and time series data with a focus on diverse biological fields like genomics, proteomics, and metabolomics, among others, and (vi) synthetic multimodal data by specifically targeting papers that combine multiple data modalities such as omics and imaging, time series and clinical data, imaging and clinical data, time series and imaging.

2.2. PRISMA flowchart

The PRISMA flowchart of the study is presented in Fig. 1 to summarize the multi-phase process of identifying, screening, assessing, and including studies in the review. The identification stage involves the collection of records through extensive database searches (966 records; 719 from Scopus and 247 from PubMed, and 11 additional records from other sources). After compiling these records, the screening phase follows to remove duplicate records due to overlapping indexing in the two databases, yielding 462 records. The unique records then underwent an initial screening based on their titles and abstracts by 4 independent researchers to quickly filter out clearly irrelevant studies. The eligibility phase involves a detailed examination process, where full-text articles of the screened records were assessed against predefined criteria to ensure that only the studies that truly fit the review’s scope and quality requirements are included.

To this end, the number of full-text articles assessed was 124 (tabular data = 29, imaging data = 30, radiomics data = 6, time-series data = 20, omics data = 24, multimodal data = 15). From those, 42 were excluded by filtering out articles which were: (i) not related to the fields of engineering, mathematics, and computer science, (ii) written in a non-English language, (iii) pre-prints. The final phase lists the 82 studies that passed the eligibility criteria. Those studies are presented as part of the qualitative synthesis of this review.

2.3. The synthetic data generation workflow

Fig. 2 depicts the core stages of the synthetic data generation workflow. It consists of four stages, including: (i) data acquisition, (ii) data preparation, (iii) data modeling, and (iv) data quality evaluation. The first stage involves the retrieval and management of real data. This includes ensuring proper permissions, data governance, and privacy compliance to handle sensitive information responsibly. The second stage involves the curation and transformation of the real data to make them suitable for modeling. This stage is crucial to make the data suitable for subsequent modeling. It involves handling missing values,
Fig. 1. PRISMA flowchart for the systematic review including the database searches, the number of abstracts screened, and the full texts retrieved.
normalizing data, and possibly augmenting the dataset to enhance its quality and representativeness. The third stage involves the development of models (statistical, probabilistic, machine learning, deep learning) to generate synthetic data that mimic the properties of the real data. The objective is to create synthetic datasets that retain the essential characteristics and patterns of the original data without revealing any sensitive information. The final stage focuses on the assessment of the generated synthetic data quality to ensure that they meet the required standards of fidelity, privacy, and utility.

3. Results

3.1. Summary of trends in the field

Fig. 3 summarizes the trends of the existing studies on synthetic data generation technologies in healthcare, from 2015 to mid-2024, which highlights the growing interest within the field. More specifically, Fig. 3(A) illustrates the total number (counts) of publications per year, from 2015 to 2024, as indexed by PubMed and Scopus. It shows a significant increase in the number of publications over the years, with a marked rise observed in 2023. Fig. 3(B) presents the distribution of the publications across various data types which are involved in synthetic data generation, including tabular, imaging, radiomics, time-series, omics, and multimodal cases. Each axis shows the count of publications from Scopus and PubMed, suggesting that synthetic imaging and tabular data generation are the most researched areas, whereas the time-series, omics, radiomics and multimodal data generation studies are fewer.

On the other hand, the types of methods and the programming languages which are used for synthetic data generation are depicted in
Fig. 3. An overview of: (A) the total number of synthetic data generation studies in healthcare per year by PubMed and Scopus, and (B) the final number of studies
across different data types (five main data types and multimodal data cases) by PubMed and Scopus.
Fig. 4. Overview of methods and programming languages used for synthetic data generation in healthcare: (A) Types of methods used in the studies, (B) Programming languages used for the implementation.
Fig. 4 for the studies presented in Fig. 3(B), including publication trends, data type usage, methodological approaches, and programming languages. Deep learning appears to be the predominant method for synthetic data generation at 72.6 % of the studies, followed by statistical methods at 15.1 %, machine learning at 9.6 %, and probabilistic methods at 2.7 %. According to Fig. 4(B), Python is the most widely used language for 75.3 % of the studies, followed by R for 14.8 %, and other languages like C++, Java, and Matlab for 9.9 %.

3.2. Synthetic data generation methods and implementations per data type

3.2.1. Tabular data

The current methods for synthetic tabular data generation (including demographics, laboratory examinations, medical conditions, therapies, lifestyle data) can be grouped into statistical- and probabilistic-based, machine learning (ML)-based, and deep learning (DL)-based. The statistical- and probabilistic-based methods utilize statistical or probabilistic models to generate synthetic data based on the statistical distributions and relationships of the variables in the real data. Examples of such methods (Table 1) include bootstrapping, the multivariate normal distribution (MVND) and the log MVND [13,18,35], the Bayesian models [36–38], the vine copula models [39], the probabilistic Bayesian networks [18,38,40], and the Bayesian (hierarchical) generalized linear models (hGLM) [37]. According to Table 1, these methods have been used for: (i) the simulation of covariates in clinical trials, (ii) the generation of high-fidelity, large scale patient data, (iii) disease progression modeling, (iv) data augmentation to enhance the performance of disease classification and risk stratification models, and (v) the simulation of augmented clinical trials. The ML-based methods can overcome the statistical assumptions for specific distributions in the real data by capturing complex patterns. Examples of such methods (Table 1) include the supervised and unsupervised tree ensembles, the radial basis function (RBF)-based artificial neural networks (ANNs) [18,36], the state-transition machines [41,42], the sequential decision tree-based synthesizers [27,43–45], the Gaussian Mixture Models (GMM), the Gaussian Mixture Models with Bayesian inference (BGMM) and the BGMM with optimal components estimation (BGMM-OCE) [19,20]. These methods have been widely used (Table 1) for: (i) data augmentation for disease progression, (ii) transforming clinical patient data and modeling of disease progression, which are applied in various contexts including digital twin generation and replicability evaluation, (iii) large scale virtual population generation for in silico clinical trials. The DL-based methods leverage multi-layer artificial neural network (ANN) architectures to better capture nonlinearities and complex data interactions. Examples of such methods (Table 1) include different variations of the GANs, such as the Adaptive Deconfounding Synthetic GAN (ADS-GAN), the Conditional GAN (CGAN), the Wasserstein GAN (WGAN) [24,25], the Copula GAN (CopulaGAN), the Conditional Tabular GAN (CTGAN), the Medical GAN (MedGAN), and the RadialGAN (radial basis functions within a GAN), as well as the Tabular Variational Autoencoder (TVAE), the Variational Autoencoders (VAEs), and the Tabular Denoising Diffusion Probabilistic Model (TabDDPM) [27,43,44,46,47]. These methods (Table 1) have been used for: (i) privacy-conscious synthetic data generation for clinical decision support, (ii) generating synthetic populations and digital twins, and (iii) improving the predictive performance on minority groups.

The metrics which are used to measure synthetic tabular data fidelity and quality include descriptive statistics (mean, median, standard deviation, variance-covariance, range, and proportions for categorical data), and more straightforward metrics, such as the relative predictor error (RPE), the relative bias (RB), the Wasserstein distance (WD), the Pearson’s correlation coefficient (CC), the Spearman correlation (SC), the Kendall’s rank coefficient (KRC), the goodness of fit (GOF), the KL divergence (KLD), the relative error (RE), the polynomial regression coefficients (PRC), and the density plots (DPs). These metrics assess how well the synthetic data preserve the statistical properties, feature relationships and context of the real data. Additional statistical measures like the KS test (Kolmogorov-Smirnov test), the CS test (Chi-Squared test), the cosine similarity distance (CSD), the Jaccard similarity index (JSI), the pairwise correlation difference (PCD), the maximum mean discrepancy (MMD), the coefficient of variation (cV), the Jensen-Shannon distance (JSD), and the bias-eliminated coverage (BEC) are employed to ensure the synthetic data fidelity. Privacy metrics focus on ensuring the synthetic data do not compromise individual privacy, using measures such as the ε-identifiability, the K-anonymity, the K-map, and the L-diversity to evaluate re-identification risks. The majority of the metrics for the evaluation of the tabular data fidelity and quality, as well as for the other types of data which are described next, are presented in [7].

3.2.2. Imaging data

The current advances in synthetic medical imaging data generation mainly rely on the deployment of GANs and several proposed variations of the GANs, as well as on DL-oriented, specialized algorithms. GANs play a critical role in image synthesis. Examples (Table 2) include the Enhanced Balancing GAN, which is utilized for generating minority class images in imbalanced datasets [48], and other forms of GANs such as the Attention-based GAN [49], the CycleGAN [30–33], and the Dual-Discriminator Conditional GAN (DDcGAN) [34] applied in tasks ranging from medical image enhancement to cross-modality image synthesis. Other specialized GAN variants, such as the Progressively Growing GANs [50] and the Style Distribution GAN (SD-GAN) [51], focus on generating clinically realistic X-rays and transferring style distributions in images, respectively. Other DL-oriented approaches include a variety of neural network architectures beyond traditional GANs, such as the Conditional Variational Autoencoder [52] and the Contrastive Diffusion Model [53], which are notable for their performance in generating realistic, high-resolution images and fine-detail PET reconstruction. Furthermore, Vision Transformers [54] have shown their potential in fast MRI reconstruction by harnessing the capabilities of transformer models, which have been successful in natural language processing and have lately been used for image classification and segmentation tasks. Furthermore, the Ensemble of Convolutional Neural Networks [55], which is a combination of DL approaches, enhances the detection of out-of-distribution objects in imaging data, which is crucial for reliable medical diagnosis. Similarly, Normalizing Flows [56] have been employed to mitigate the effects of CT acquisition and reconstruction anomalies, providing more accurate and consistent imaging outputs. Finally, a Python library containing multiple pre-trained GAN-based models (CT-GAN, WGAN, SinGAN, PGGAN, FastGAN, pix2pix) has also been reported, named medigan [57], to allow researchers to access, generate, and benefit from synthetic medical imaging data, including mammographies, brain MRI, endoscopy, chest

Table 1
A summary of the scope, algorithms, programming languages, open-source codes or libraries, and metrics to measure synthetic data quality which are used by the studies that focus on the generation of synthetic tabular data.

Study | Scope | Statistical approaches / algorithm(s) used | Programming language / Software | Open-source codes or libraries used | Metrics used to measure synthetic data fidelity/privacy
[13] | Simulation of covariates for clinical trials | Bootstrapping, MVND | R | https://cran.r-project.org/web/packages/mice/index.html | Summary statistics, RPE, RB, CC
[24] | Privacy-conscious synthetic data generation for causal effect estimation in treatment analysis | ADS-GAN | Python | https://github.com/tensorflow/tensorflow | WD, CC, SC, KRC, ε-identifiability
[36] | Data augmentation for disease classification and risk stratification | Bayesian models, tree ensembles, RBF-based ANNs | R, Python | https://cran.r-project.org/web/packages/semiArtificial/index.html | GOF, KLD, CC
[39] | To generate realistic virtual patient data in pharmacometrics | Vine copula models | R | https://github.com/vanhasseltlab/copula_vps | RECC, mean, standard deviation, median RE, PRC, DPs
[20] | Virtual population generation for in-silico clinical trials in HCM | BGMM | Python | https://github.com/scikit-learn/scikit-learn | CC, GOF, KLD
[40] | Synthetic dataset generation using Bayesian methods for clinical applications | Probabilistic Bayesian networks | OpenMarkov software | - | -
[19] | To generate high-quality, large-scale synthetic data at reduced computational complexity | BGMM-OCE | Python | https://github.com/vpz4/BGMM-OCE | cV, GOF, KLD, CC
[27] | Digital twin generation for personalized clinical trials | TabularSimulationBase, GaussianCopula, CopulaGAN, TVAE, CTGAN, MedGAN | Python | https://github.com/RyanWangZf/PyTrial | CC, WD
[43] | A comparative analysis of five distinct approaches for creating virtual data populations from individuals suffering from chronic coronary disorders | Tabular Preset, Gaussian Copula, GANs, CTGAN, VAEs | Python | https://github.com/sdv-dev/SDV | KS test, CS test, CC, CSD
[37] | A Bayesian hierarchical method for combining in silico and in vivo data onto an augmented clinical trial with binary endpoints | Bayesian (hierarchical) generalised linear models (hGLM) | R | https://cran.r-project.org/web/packages/rstan/index.html | KS test, CS test, CC
[41] | To develop a pipeline for transforming clinical patient data to conform with a model designed using OBO Foundry ontologies using synthetic data | State-transition machines | Java | https://github.com/synthetichealth/synthea | GOF, CC, KS test, CS test
[18] | To predict disease progression for patients diagnosed with HCM during a 10-year period using synthetic data | MVND, log-MVND, RBF-based ANNs, tree ensembles, Bayesian networks | R, Python | https://cran.r-project.org/web/packages/deal/index.html, https://cran.r-project.org/web/packages/semiArtificial/index.html, scipy | CC, KS test, CS test
[25] | To develop realistic synthetic datasets suitable for validating digital health applications with a focus on clinical decision support systems | CGAN, WGAN | Python | - | CC, JSI, GOF
[46] | To examine the usability of synthetic data in decision support systems, with a focus on data quality and security | CTGAN | Python | https://github.com/sdv-dev/SDV | -
[44] | To overcome the lack of high-fidelity datasets and ensure patient’s privacy | CTGAN, Gaussian Copula | Python | sklearn, imblearn, sdv | PCD, MMD, KLD
[42] | To develop a model of novel coronavirus (COVID-19) disease progression and treatment | State-transition machines | Java | https://github.com/synthetichealth/synthea | CC, KS test, CS test
[47] | To improve predictive performance on minority groups | RadialGAN, TabDDPM, CTGAN, TVAE | Python | https://github.com/vanderschaarlab/synthcity | JSD, WD, KLD, KS test, MMD, K-anonymity, K-map, L-diversity
[38] | To generate high-fidelity synthetic patient data based on UK primary care patient data | Bayesian networks | R | https://github.com/zhenchenwang/latent_model | -
[45] | To evaluate the replicability of analyses using synthetic data | Sequential decision tree-based synthesizer, GANs | Python, R | https://osf.io/vsku2/ | BEC
Table 2
A summary of the scope, algorithms, programming languages, open-source codes or libraries, and metrics to measure synthetic data quality which are used by the studies that focus on the generation of synthetic imaging data.
Study | Scope | Statistical approaches / algorithm(s) used | Programming language / Software | Open-source codes or libraries used | Metrics used to measure synthetic data fidelity/privacy
[48] | Enhanced Balancing GAN for Minority Class Image Generation | Enhanced Balancing GAN | Python | https://github.com/GH920/improved-bagan-gp | FID, IS
[55] | Efficient Data Augmentation Network for Out-of-Distribution Image Detection | Ensemble of Convolutional Neural Networks | Python | https://github.com/majic0626/Data-Augmentation-Network | -
[49] | Blind Degradation Modelling for High-Resolution Medical Images (BliMSR) | Attention-based GAN | Python | https://github.com/Samiran-Dey/BliMSR | -
[52] | Conditional Variational Autoencoder with Balanced Pre-training for GANs | Conditional Variational Autoencoder, GAN | Python | https://github.com/alibraytee/CAPGAN | FID, SSIM
[53] | Contrastive Diffusion Model with Auxiliary Guidance for Coarse-to-Fine PET Reconstruction | Contrastive Diffusion Model | Python | https://github.com/Show-han/PET-Reconstruction | PSNR, SSIM, NMSE
[30] | Correction of Out-of-Focus Microscopic Images by Deep Learning | CycleGAN | Python | https://github.com/jiangdat/COMI | PSNR, SSIM, CC
[56] | CTFlow: Mitigating Effects of CT Acquisition and Reconstruction with Normalizing Flows | Normalizing Flows | Python | https://github.com/hsu-lab/ctflow | PSNR, SSIM, LPIPS
[34] | Dual-Discriminator Conditional GAN for Multi-Resolution Image Fusion | Dual-Discriminator Conditional GAN (DDcGAN) | Python | https://github.com/jiayi-ma/DDcGAN | entropy, mean gradient, spatial frequency, PSNR, SSIM, CC, VIF
[31] | Endoscopic Ultrasound Image Synthesis Using a Cycle-Consistent Adversarial Network | Cycle-Consistent Adversarial Network | - | https://ebonmati.github.io/ | FID
[32] | DC-cycleGAN: Bidirectional CT-to-MR synthesis from unpaired data | CycleGAN | Python | https://github.com/JiayuanWang-JW/DC-cycleGAN | PSNR, SSIM, MAE
[54] | Fast MRI Reconstruction: How Powerful Transformers Are? | Vision Transformer | Python | https://github.com/ayanglab/SwinGANMR | PSNR, SSIM, FID
[58] | Flow-Based Visual Quality Enhancer for Super-Resolution Magnetic Resonance Spectroscopic Imaging | Flow-Based Network | Python | https://github.com/dsy199610/Flow-Enhancer-SR-MRSI | PSNR, SSIM, LPIPS
[59] | HQG-Net: Unpaired Medical Image Enhancement with High-Quality Guidance | Combination of Enlighten & Still GANs | Python | https://github.com/ChunmingHe/HQG-Net | PSNR, average gradient, ENIQE, BRISQUE
[60] | Image Augmentation Using a Task-Guided Generative Adversarial Network for Age Estimation on Brain MRI | Task-Guided GAN | Python | https://github.com/ruizhe-l/tgb-gan | MSE, MAE
[61] | On Data Augmentation for GAN Training | Data Augmentation for GANs | Python | https://github.com/sutd-visual-computing-group/dag-gans | FID, IS, KLD
[50] | Evaluating the Clinical Realism of Synthetic Chest X-Rays Generated Using Progressively Growing GANs | Progressively Growing GANs | Python | https://github.com/BradSegal/CXR_PGGAN | FID, Human eYe Perceptual Evaluation
[51] | SD-GAN: A Style Distribution Transfer Generative Adversarial Network | Style Distribution GAN (SD-GAN) | Python | https://github.com/tasleem-hello/SD-GAN/tree/SD-GAN | PSNR, SSIM
[62] | Self-Supervised Visual Representation Learning for Histopathological Images | CS-CO: hybrid self-supervised visual representation learning method tailored for H&E-stained histopathological images | Python | https://github.com/easonyang1996/CS-CO | -
[63] | Slice Profile Estimation From 2D MRI Acquisition Using Generative Adversarial Networks | GAN | Python, Docker | https://github.com/shuohan/espreso | MAE, PSNR, SSIM
[33] | StainGAN: Stain Style Transfer for Digital Histological Images | CycleGAN | Python | https://github.com/xtarx/StainGAN | PSNR, SSIM, FSIM, CC
[57] | medigan: A complete pythonic library with multiple pre-trained GANs for the generation of synthetic medical imaging data (mammograms, brain MRI, endoscopy, chest X-ray, cardiac MRI, breast DCE-MRI) | CDGAN, CycleGAN, WGAN-GP, C-DCGAN, PGGAN, FastGAN, SinGAN, pix2pix | Python | https://github.com/RichardObi/medigan | FID
X-ray, cardiac MRI, and breast DCE-MRI, among others.
The metrics most widely used to assess the fidelity and quality of synthetic imaging data include the Fréchet Inception Distance (FID) and the Inception Score (IS), which assess the similarity of synthetic data to real data by comparing feature distributions and evaluating the performance of image classifiers. The structural similarity index measure (SSIM), the peak signal-to-noise ratio (PSNR), the normalized mean squared error (NMSE), and the mean absolute error (MAE) are also used to quantify the visual and statistical similarity between synthetic and real images. Furthermore, the Learned Perceptual Image Patch Similarity (LPIPS), the entropy, the mean gradient, the spatial frequency, the correlation coefficient (CC), and the Visual Information Fidelity (VIF) are used to assess the perceptual and statistical properties of the synthetic imaging data. Additional metrics, such as the Natural Image Quality Evaluator (ENIQE), the Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE), the mean squared error (MSE), and the feature similarity index for image quality assessment (FSIM), can provide more detailed evaluations of the synthetic image quality.
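As an illustration of the pixel-wise fidelity metrics named above, the following is a minimal NumPy sketch of MAE, MSE, NMSE and PSNR for a pair of real and synthetic images. The function names, the `data_range` parameter and the choice of normalizing the NMSE by the energy of the real image are assumptions made for this example; the reviewed studies may use different conventions, and perceptual metrics such as SSIM, FID or LPIPS require dedicated libraries and are not covered here.

```python
import numpy as np


def mae(real, synth):
    """Mean absolute error between two images of equal shape."""
    real, synth = np.asarray(real, dtype=float), np.asarray(synth, dtype=float)
    return float(np.mean(np.abs(real - synth)))


def mse(real, synth):
    """Mean squared error between two images of equal shape."""
    real, synth = np.asarray(real, dtype=float), np.asarray(synth, dtype=float)
    return float(np.mean((real - synth) ** 2))


def nmse(real, synth):
    """Squared error normalized by the energy of the real image (assumed convention)."""
    real, synth = np.asarray(real, dtype=float), np.asarray(synth, dtype=float)
    return float(np.sum((real - synth) ** 2) / np.sum(real ** 2))


def psnr(real, synth, data_range=1.0):
    """Peak signal-to-noise ratio in dB; data_range is the maximum possible intensity."""
    err = mse(real, synth)
    if err == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(data_range ** 2 / err))


if __name__ == "__main__":
    # Toy 2x2 "images" with intensities assumed to lie in [0, 1].
    real = np.array([[0.0, 0.5], [1.0, 0.5]])
    synth = real + 0.1
    print(mae(real, synth), mse(real, synth), nmse(real, synth), psnr(real, synth))
```

Lower MAE, MSE and NMSE and a higher PSNR indicate a synthetic image that is closer to its real reference; all four require a paired reference image, unlike distribution-level scores such as the FID.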
3.2.2.1. Radiomics data (Image-based quantitative features). Radiomics data consist of quantitative features which are extracted from medical images. They form a critical subfield of medical imaging data. According to Table 3, most of the studies which focus on synthetic radiomics data generation are DL-based, using methods such as the WGAN-GP [26], the CTGAN [52], the TVAE and the Copula GAN to offer enhanced flexibility and capacity to capture complex data distributions. On the other hand, the Tabular Preset and the Gaussian Copula are the two statistical methods that have been used for synthetic radiomics data generation, relying on the statistical properties of the real-world training data [64]. The GAN-based methods, in contrast, harness the power of adversarial networks to learn the underlying data distribution and generate synthetic data that closely resemble real-world radiomic features. Several attempts have also been reported towards the generation of synthetic radiomic images, like the RadSynth [64], which is a deep CNN-based model that produces synthetic GLCM (Grey Level Co-occurrence Matrix) entropy images.
The metrics which are used to measure the fidelity and quality of the synthetic radiomics data include the t-distributed Stochastic Neighbor Embedding (t-SNE), which is a dimensionality reduction technique used to visualize high-dimensional data and assess clustering and distribution similarities between synthetic and real data. The correlation coefficient (CC) is also used to measure the linear relationship between real and synthetic data. The Bland-Altman (BA) plot, a tool for comparing two measurement techniques, plots the differences between synthetic and real data against their averages, helping to identify any systematic differences. In addition, the Chi-Square (CS) test is often deployed to compare the distributions of categorical variables in synthetic and real data, assessing how well the synthetic data match the distribution of real data. Basic statistical correlation tests further evaluate the preservation of statistical properties in synthetic data.

3.2.3. Time series data
The methods for synthetic time series data generation (including electrocardiogram (ECG), photoplethysmographic (PPG), sensor-based measurements, longitudinal observations, and other biosignals) can be split into statistical- and probabilistic-based, ML-based, and DL-based. The statistical-based methods rely on several statistical principles and probabilistic models. One notable approach is the Guided Evolutionary Synthesizer (GES), which integrates genetic algorithms, concept maps, and randomness operators [67]. Another significant statistical-based method is the statistical feature space selection, which involves identifying critical features and using them for representative sampling [23]. The Synthetic Acute Syndromes Creator (SASC) utilizes summary statistics and internal correlations, maintaining cross-patient consistency [68]. Additionally, SASC utilizes random generation under constraints focusing on single-parameter distributions and their relative correlations [68]. The above-mentioned approaches (Table 4) demonstrate the adaptability and robustness of statistical-based methods in generating synthetic time series data. According to Table 4, time series generation methods have been widely used for: (i) adversarial learning on biosignal data, (ii) generating synthetic data considering metadata as part of the generation process, (iii) augmenting sensor-based data, (iv) synthesizing time series EHR data and tackling the imbalance of uncommon diseases, (v) multivariate time series generation, (vi) employing existing generative models to produce medical time series, (vii) generating realistic synthetic time series data sequences of arbitrary length and (viii) generating ECG data.
The ML-based methods for synthetic time series data generation vary from conventional supervised learning algorithms to advanced AI modeling. An example of such a method (Table 4) is the two-level Hidden Markov Model (HMM) with regression learners [22], where the first-level HMM generates realistic sequences of activities, while the second-level HMM creates sensor events reflective of those activities. The regression learners capture the time gaps and the duration of each activity, ensuring accurate representation of time series data. Such ML-based methods are typically more flexible and adaptive than statistical-based methods and are widely used for generating synthetic time series data composed of nested sequences. On the other hand, DL-based methods lie at the core of synthetic generation of healthcare-related time-series data. They often rely on GANs [69], which consist of a generator, trained on the real dataset, that produces the synthetic data and a discriminator that evaluates how realistic the generated data are. One notable approach is the Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP) [70], which enhances traditional GANs by stabilizing training and improving convergence. DoppelGANger (DGAN) [70] introduces a unique approach by generating metadata with a Multi-Layer Perceptron (MLP). The Time Series Generative Adversarial Network (TS-GAN) [71] relies on Long Short-Term Memory (LSTM) networks to maintain temporal dependencies. Other GAN-based methods include the Multi-label Time series Generative Adversarial Network (MTGAN) [28], designed to generate synthetic data with multiple labels, and the COmmon Source CoordInated Generative Adversarial Network (COSCI-GAN) [72], which manages inter-channel correlations to preserve relationships between time series. HealthGAN [73], built on the Wasserstein GAN architecture, targets healthcare applications, while the Transformer-Based Time Series Generative Adversarial Network (TTS-GAN) [29] employs the transformer model’s self-attention mechanism. The Modality Transfer Generative Adversarial Network [69] uses GANs to generate synthetic time series data by transferring modalities. In addition to GAN-based approaches, other DL algorithms contribute to synthetic time series data generation, such as the diffusion-based conditional models combined with structured state space models (SSSMs) [74], the causal recurrent variational autoencoder (CR-VAE) [75], the Variational Autoencoders (VAEs) [23] and the Adversarial Autoencoders (AAEs) [69]. The above-mentioned DL-based methods showcase the adaptability and potential of DL in synthetic time series data generation.
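The two-level generation idea described for [22] (a first level that produces activity sequences, a second level that produces sensor events for each activity, and a regression component for durations and time gaps) can be illustrated with a heavily simplified sketch. This is not the implementation of [22]: each level is reduced here to a plain Markov transition or categorical emission table, the "regression learner" is a fixed linear function of the hour of day, and every name and number (`ACTIVITIES`, `SENSORS`, `A`, `B`, `DURATION_MODEL`, `generate_events`) is hypothetical.

```python
import numpy as np

ACTIVITIES = ["sleep", "cook", "walk"]       # hypothetical activity labels
SENSORS = ["bed", "kitchen", "hallway"]      # hypothetical sensor labels

# Level 1: activity-to-activity transition probabilities (rows sum to 1).
A = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])

# Level 2: probability of each sensor firing during each activity (rows sum to 1).
B = np.array([[0.90, 0.05, 0.05],
              [0.05, 0.85, 0.10],
              [0.10, 0.20, 0.70]])

# Stand-in for the regression learners: duration (minutes) = intercept + slope * hour.
DURATION_MODEL = np.array([[60.0, 2.0],
                           [20.0, 0.5],
                           [10.0, 0.2]])


def generate_events(n_activities, events_per_activity, seed=0):
    """Return a list of (timestamp_minutes, activity, sensor) tuples."""
    rng = np.random.default_rng(seed)
    t = 0.0
    state = int(rng.integers(len(ACTIVITIES)))
    rows = []
    for _ in range(n_activities):
        intercept, slope = DURATION_MODEL[state]
        hour = (t / 60.0) % 24.0
        # Duration from the linear model plus noise, never shorter than one minute.
        duration = max(1.0, float(intercept + slope * hour + rng.normal(0.0, 1.0)))
        gap = duration / events_per_activity
        for k in range(events_per_activity):
            sensor = int(rng.choice(len(SENSORS), p=B[state]))
            rows.append((round(t + k * gap, 2), ACTIVITIES[state], SENSORS[sensor]))
        t += duration
        state = int(rng.choice(len(ACTIVITIES), p=A[state]))
    return rows


if __name__ == "__main__":
    for row in generate_events(n_activities=3, events_per_activity=2, seed=1):
        print(row)
```

In a real setting the transition, emission and duration parameters would be estimated from recorded sensor data rather than fixed by hand, which is the part that makes the synthetic sequences reflect the nested structure of the original time series.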
Table 3
A summary of the scope, algorithms, programming languages, open-source codes or libraries, and metrics to measure synthetic data quality which are used by the studies that focus on the generation of synthetic radiomics data.
Study | Scope | Statistical approaches / algorithm(s) used | Programming language / Software | Open-source codes or libraries used | Metrics used to measure synthetic data fidelity/privacy
[26] | To apply the WGAN-GP algorithm to generate radiomics data. | WGAN-GP | Python | https://github.com/EmilienDupont/wgan-gp | t-SNE
[66] | Developed a CNN model to efficiently generate radiomics data. | RadSynth | - | - | CC, BA plot
[65] | To combine MRI-Based Radiomics with DL-based data augmentation for differentiating IDH-mutant grade 4 astrocytomas from IDH-wild-type glioblastomas. | CTGAN | R, Python | https://github.com/sdv-dev/CTGAN, https://github.com/kasaai/ctgan?tab=readme-ov-file | -
[64] | To evaluate the potential of synthetic radiomic data generation in addressing data scarcity in radiomics/radiogenomics models. | Tabular Preset, Gaussian Copula, TVAE, CTGAN, Copula GAN | Python | https://github.com/sdv-dev/SDV | CS test, basic statistical correlation test
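Three of the radiomics fidelity checks listed in Table 3 (the correlation coefficient, the Bland-Altman analysis and the Chi-Square statistic) can be sketched with NumPy alone. The sketch below is illustrative and assumes that real and synthetic feature values are paired (for example, a feature computed on a real image and on its synthetic counterpart), which the Bland-Altman analysis requires; the function names and the 1.96 multiplier for approximate 95 % limits of agreement are choices made for this example, not details taken from the cited studies.

```python
import numpy as np


def pearson_cc(real, synth):
    """Pearson correlation coefficient between paired real and synthetic values."""
    return float(np.corrcoef(real, synth)[0, 1])


def bland_altman(real, synth):
    """Return (bias, lower limit, upper limit) of agreement for paired values."""
    real, synth = np.asarray(real, dtype=float), np.asarray(synth, dtype=float)
    diff = synth - real
    bias = float(diff.mean())
    sd = float(diff.std(ddof=1))          # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def chi_square_stat(real_counts, synth_counts):
    """Chi-square statistic of synthetic category counts against real proportions."""
    real_counts = np.asarray(real_counts, dtype=float)
    synth_counts = np.asarray(synth_counts, dtype=float)
    # Expected synthetic counts if the synthetic data followed the real proportions.
    expected = real_counts / real_counts.sum() * synth_counts.sum()
    return float(np.sum((synth_counts - expected) ** 2 / expected))


if __name__ == "__main__":
    real = [1.0, 2.0, 3.0, 4.0]
    synth = [1.2, 1.9, 3.1, 4.2]
    print(pearson_cc(real, synth))
    print(bland_altman(real, synth))
    print(chi_square_stat([50, 30, 20], [60, 20, 20]))
```

A bias close to zero with narrow limits of agreement suggests no systematic offset between synthetic and real feature values, while a large chi-square statistic flags a categorical variable whose synthetic distribution departs from the real one; turning the statistic into a p-value would additionally need the chi-square distribution, for instance from SciPy.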
Table 4
A summary of the scope, algorithms, programming languages, open-source codes or libraries, and metrics to measure synthetic data quality which are used by the studies that focus on the generation of synthetic time-series data.
Study | Scope | Statistical approaches / algorithm(s) used | Programming language / Software | Open-source codes or libraries used | Metrics used to measure synthetic data fidelity/privacy
[70] | Develop a platform for providing synthetic data considering metadata as part of the time series generation process. | WGAN-GP, DGAN | - | - | PRD plots, DLA, Autocorrelation, MAE, CC
[74] | Generate synthetic ECG data utilizing diffusion-based techniques. | SSSD-ECG model based on the DiffWave architecture, WaveGAN*, Pulse2Pulse | Python | https://github.com/AI4HealthUOL/SSSD-ECG | Utilizing a reference model for assessing the realism of the synthetic data
[75] | Novel generative model for medical time series generation. | Causal Recurrent Variational AutoEncoder (CRVAE) | Python | https://github.com/hongmingli1995/CR-VAE | MMD, MSE
[67] | Framework for bias analysis in healthcare time series data | Guided Evolutionary Synthesizer (GES) | - | - | Bias score for bias mitigation
[71] | A Generative Adversarial Network (GAN) architecture for sensor-based health data augmentation | TS-GAN | - | - | Discriminator loss, MMD, t-SNE and PCA
[23] | Generation of synthetic PPG data using an in-silico cardiac model | Variational Autoencoder (VAE) | - | - | Mainly based on classification/prediction performance
[68] | An efficient approach for generating longitudinal observational patient cohorts | Classical statistical distribution, Summary statistics, Internal correlations | R | https://github.com/Fraunhofer-ITMP/SASC | Correlation plots between the correlations of real and synthetic data
[28] | Generate time series EHR data and tackle the imbalance of uncommon diseases. | Multi-label Time series GAN (MTGAN) | Python | https://github.com/LuChang-CS/MTGAN | GT, JSD, ND
[72] | A novel framework for multivariate time series generation | COmmon Source CoordInated GAN (COSCI-GAN) | Python | https://github.com/aliseyfi75/COSCI-GAN | AED, WD, MAE, Frobenius norm, SC, KRC
[73] | Employing existing generative models to produce medical time series | HealthGAN, Wasserstein GAN, TimeGAN | Python | https://bitbucket.org/mvdschaar/mlforhealthlabpub/src/master/alg/timegan/ | AHEC, Welch’s t-test
[29] | A transformer-based GAN generating realistic synthetic time series data sequences of arbitrary length | TTS-GAN | Python | https://github.com/imics-lab/tts-gan | t-SNE, PCA, ACS, JSD
[69] | A broad analysis on adversarial learning on biosignal data | GAN, Adversarial AutoEncoder, Modality Transfer GAN | Python | https://github.com/theekshanadis/biosignalGANs | Mainly based on classification/prediction performance
[22] | Synthetic time series data generation that is composed of nested sequences | Combination of HMM and regression algorithms, Time series distance measures | Python | https://github.com/jb3dahmen/SynSys-Updated | AED, DTW
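Several of the time-series fidelity metrics that appear in Table 4 have compact definitions. The sketch below gives illustrative NumPy versions of the Maximum Mean Discrepancy (MMD, here the biased squared estimate with an RBF kernel), the Jensen-Shannon Divergence (JSD, base-2 logarithm) between two histograms, and the Dynamic Time Warping (DTW) distance between two univariate sequences. The kernel choice, the `gamma` value, the absolute-difference local cost in DTW and all function names are assumptions for this example; the cited studies may configure these metrics differently.

```python
import numpy as np


def rbf_mmd2(x, y, gamma=1.0):
    """Biased estimate of squared MMD between samples x (n, d) and y (m, d)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)

    def kernel(a, b):
        sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * sq_dist)

    return float(kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean())


def jsd(p, q):
    """Jensen-Shannon divergence (base 2, in [0, 1]) between two histograms."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0                      # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))


def dtw_distance(a, b):
    """Classic O(n*m) dynamic time warping distance with absolute-difference cost."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])


if __name__ == "__main__":
    print(jsd([1, 0], [0, 1]))
    print(dtw_distance([0, 1, 2], [0, 1, 1, 2]))
    print(rbf_mmd2([[0.0], [1.0]], [[0.0], [1.0]]))
```

MMD and JSD compare distributions (of windows, features or histogram bins) and are zero for identical inputs, whereas DTW compares individual sequences and tolerates local stretching in time, which is why it is often reported alongside pointwise errors such as MAE.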
The metrics utilized to assess the quality and fidelity of the generated synthetic time-series data include a variety of statistical, visual, and performance-based measures. Metrics and visualization techniques, including the Distribution (PRD) plots, the Data Labelling Analysis (DLA), the autocorrelation, the Mean Absolute Error (MAE), and the correlation coefficient, are used to evaluate how well the synthetic data preserve the distribution and relationships present in the real data. A reference model can be used to assess the realism of synthetic data by comparing model performance on synthetic versus real data. The Maximum Mean Discrepancy (MMD) and the Mean Squared Error (MSE) are used to measure the difference in distributions and the errors between synthetic and real data. The bias score evaluates the effectiveness of synthetic data in mitigating biases. The discriminator loss in generative models, the visual inspection using t-SNE and PCA, and the correlation plots provide insights into the synthetic data’s visual and structural quality. Additional metrics, such as the Generated Disease Types (GT), the Jensen-Shannon Divergence (JSD), the Normalized Distance (ND), the Average Euclidean Distance (AED), the Wasserstein Distance (WD), the Frobenius norm, Spearman’s ρ, and Kendall’s τ, are used to further quantify the similarity in statistical properties. Moreover, metrics like the Average Hourly Energy Consumption (AHEC), the Welch’s t-test, the Average Cosine Similarity (ACS), and the Dynamic Time Warping (DTW) are also deployed in specific domain applications. Furthermore, classification and prediction performance-based metrics are crucial for evaluating the practical utility of synthetic data in predictive modeling.

3.2.4. Omics data
According to the literature, the majority of the existing synthetic omics generation approaches rely heavily on established statistical principles and models to simulate multi-omics data (e.g. transcriptomics, metabolomics, proteomics, gene expression). Examples of such methods (Table 5) include the randomly selected and randomly permuted enriched pathways [76], statistical assumptions based on the number of pathways and clusters together with causal feature clusters [77], the random covariance method (RCM) and the Cascade method [78], probabilistic modeling [79], random generation from uniform distributions [80], MVND [81], power law degree distribution [76], random perturbations [82], the simulated linear test (s-test) [83], the Stochastic Block Models (SBM) [14] and the time-evolving graphs with metastability based on stochastic differential equations [15]. According to Table 5, these methods have been used to: (i) produce semi-synthetic metabolomics data preserving underlying distributions, (ii) validate stratified causal discovery approaches in synthetic omics data, (iii) simulate gene expression data, accounting for additive biases, (iv) model real data distributions in metabolomics and other omics data, (v) generate network topologies for tumor and normal cells in co-expression networks, (vi) mimic realistic complexities in multi-omics heterogeneous data analysis, (vii) improve proteomics data analysis through synthetic data generation, (viii) overcome challenges in multi-omics data integration, (ix) study human microbiome dynamics, (x) generate synthetic transcriptomics data reflecting specific trends, and (xi) model complex multi-omics data related to cancer. DL-based methods have also been deployed (Table 5), but to a smaller extent, including the WGAN-GP [84], the omicsGAN [85], the Variational Autoencoders (VAEs), and the Deep Boltzmann Machines (DBMs) [86], to: (i) address class imbalance problems in high-dimensional microarray and lipidomics data, (ii) enhance disease phenotype predictions, and (iii) enhance the
Table 5
A summary of the scope, algorithms, programming languages, open-source codes or libraries, and metrics to measure synthetic data quality which are used by the studies that focus on the generation of synthetic omics data.
Study | Scope | Statistical approaches / algorithm(s) used | Programming language / Software | Open-source codes or libraries used | Metrics used to measure synthetic data fidelity/privacy
[87] | To evaluate the performance of single-sample pathway analysis (ssPA) methods on semi-synthetic COVID-19 metabolomics data | Randomly selected and randomly permuted enriched pathways to produce semi-synthetic metabolomics data preserving the underlying distributions (both joint and marginal) | Python, R | https://github.com/cwieder/py-ssPA | Classification performance/prediction metrics, OC
[88] | To demonstrate the benefit of grouping molecules into pathways using semi-synthetic COPD and COVID-19 metabolomics, proteomics and transcriptomics data | Randomly selected and randomly permuted enriched pathways to produce semi-synthetic metabolomics data preserving the underlying distributions (both joint and marginal) | Python, R | https://github.com/cwieder/PathIntegrate | Classification performance/prediction metrics, Sensitivity to Low Signal-to-Noise Signals, Significance of Pathway Feature VIP or MB-VIP Value
[84] | To address the class imbalance problem in high-dimensional microarray and lipidomics data using synthetic data | WGAN-GP | Python | https://github.com/sjcusworth/GAN_Scripts | Welch’s t-test, standard deviation, mean difference in scores, distance metric on generator loss
[77] | To validate a stratified causal discovery approach using synthetic omics data | Statistical assumptions based on the number of pathways, clusters, causal feature clusters | Matlab | https://github.com/MehrdadMansouri/Aristotle | Classification performance/prediction metrics
[14] | To overcome the challenges posed by the integration of multi-omics data (miRNA, DNA methylation, gene expression) in five different types of cancer using synthetic data for validation | Stochastic Block Model (SBM) | Matlab | https://github.com/hamas200/MVCPM | Classification performance/prediction metrics
[15] | To study the dynamic processes of the human microbiome using synthetic data for validation | Time-evolving graphs with metastability using a model based on stochastic differential equations | C++, Python | https://github.com/k-melnyk/graphKKE | Classification performance/prediction metrics, CC, visual inspection of temporal patterns
[78] | To simulate real-world gene expression data, including the effects of additive biases | Random covariance method (RCM), Cascade method | Matlab | https://github.com/evcphd/C-SHIFT | Classification performance/prediction metrics, CC
[85] | To enhance the prediction of disease phenotypes by generating synthetic data that better reflect the underlying biological mechanisms | omicsGAN (uses two Wasserstein GANs with a gradient penalty (wGAN-GP)) | Python | https://github.com/CompbioLabUCF/omicsGAN | Classification performance/prediction metrics, Student’s t-test, Kaplan-Meier Survival Plots and Log-Rank Test P-values, heat maps and bar graphs, comparing the empirical correlations and normalization performance
[79] | To identify the biological relevance of different variables in metabolomics, transcriptomics and proteomics data using synthetic data for validation | Probabilistic modeling of the real data distributions given mass-to-charge ratios, peak intensities and noise levels | R | https://bitbucket.org/cesaremov/targetdecoy_mining/src/master/ | Classification performance/prediction metrics
[89] | To identify and analyze gene expression profiles with distinct spatial patterns based on synthetic spatial transcriptomics data | Image based (uses a black and white image to create a structured grid of gene expression values) and Turing based (uses mathematical models to simulate Turing patterns) | Python | https://github.com/almaan/sepal | Diffusion time, entropy, pattern families
[80] | To recover significant circadian and non-circadian trends from transcriptomic data using synthetic data for validation | Random generation from uniform distributions with given parameters (e.g., slope, phase sight, growth rate, equilibrium shift) and value ranges | R | https://github.com/delosh653/MOSAIC | Distance between correlation matrices, heat maps to visualize the relative error between the correlation matrices of real and synthetic data after normalization
[86] | To analyze patterns and interactions of complex omics data (single-cell RNA-Seq data) using synthetic data to enhance the interpretability of biological processes | Variational Autoencoders (VAEs), Deep Boltzmann Machines (DBMs), log-linear models | Python | https://github.com/ssehztirom/Exploring-generative-deep-learning-for-omics-data-by-using-log-linear-models | Discrimination ability between different cell types by varying the number of selected genes for annotation, DBI, Robustness Against Dichotomization, NMF
[81] | To enable multi-insight data visualization using synthetic and simulated multi-omics data (mRNA expression, DNA methylation) related to ovarian and breast cancer | Multivariate normal distribution to model methylation, gene and protein expression data | R | https://cran.r-project.org/web/packages/InterSIM/index.html | Classification performance/prediction metrics, Student’s t-test
[76] | To generate gene/protein co-expression networks specifically for tumor cells | A power law degree distribution is used to randomly generate tumor and normal network topologies | R | https://github.com/petraf01/TSNet | Classification performance/prediction metrics, standard deviation, Welch’s t-test
[83] | To improve the analysis of proteomics data, particularly in | A simulated linear test (s-test) using adaptive Gauss-Hermite | R, Matlab | https://tvpham.github.io/stest/ | s-test, RMSE, Log-Likelihood Ratio Test, cV, Gauss-Hermite Quadrature
interpretability of complex omics data patterns and interactions.
The metrics which are used to measure the fidelity and quality of synthetic omics data include a wide range of performance-based, statistical, and visual techniques to ensure the synthetic data closely mirror the real data. Performance metrics such as recall, precision, AUC (Area Under the Curve), adjusted Rand index (ARI), overlap coefficient (OC), and variable importance in projection (VIP) are used to evaluate classification performance, clustering similarity, and feature significance. The sensitivity to low signal-to-noise signals and the significance of pathway features are assessed to ensure robustness. Statistical tests and summaries like the Welch’s t-test, the standard deviation, the mean difference in scores, the Student’s t-test, the Kaplan-Meier survival plots, and the Log-Rank Test P-values provide comparative analysis between synthetic and real data distributions. Correlation coefficients are used to check that linear relationships between variables are preserved. Visual inspection techniques, including heat maps and bar graphs, are often employed to compare empirical correlations and normalization performance. Advanced metrics, such as the distance between correlation matrices, the Frobenius norm, the diffusion time, the entropy, and the pattern families, are used to assess temporal and structural fidelity. Additional metrics like the Davies-Bouldin index (DBI), the robustness against dichotomization, and comparisons with non-negative matrix factorization (NMF) are deployed to measure clustering quality and robustness. The module detection score (MDS) is used to evaluate the detection of similar patterns in the synthetic data.
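To make the statistical route to synthetic omics data concrete, the sketch below fits a multivariate normal distribution (MVND) to a real feature matrix, samples synthetic profiles from it, and then scores fidelity with the Frobenius norm of the difference between the real and synthetic correlation matrices, one of the metrics mentioned above. It is a minimal illustration with hypothetical function names and toy data, not the procedure of any cited tool; real omics matrices usually have far more features than samples, in which case the empirical covariance needs regularization that is omitted here.

```python
import numpy as np


def simulate_mvnd(real, n_samples, seed=0):
    """Sample synthetic rows from an MVND fitted to the rows of `real` (samples x features)."""
    real = np.asarray(real, dtype=float)
    rng = np.random.default_rng(seed)
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)      # empirical feature covariance
    return rng.multivariate_normal(mean, cov, size=n_samples)


def corr_matrix_distance(real, synth):
    """Frobenius norm of the difference between the two feature correlation matrices."""
    diff = np.corrcoef(real, rowvar=False) - np.corrcoef(synth, rowvar=False)
    return float(np.linalg.norm(diff, ord="fro"))


if __name__ == "__main__":
    # Toy "real" data: 2000 samples of 3 correlated features (assumed values).
    rng = np.random.default_rng(42)
    true_cov = np.array([[1.0, 0.8, 0.2],
                         [0.8, 1.0, 0.3],
                         [0.2, 0.3, 1.0]])
    real = rng.multivariate_normal(np.zeros(3), true_cov, size=2000)
    synth = simulate_mvnd(real, n_samples=2000, seed=1)
    print(corr_matrix_distance(real, synth))
```

A small distance indicates that pairwise linear dependencies between features survive in the synthetic data; it says nothing about non-Gaussian structure, which is one reason the performance-based and clustering metrics are reported alongside it.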
Table 6
A summary of the scope, algorithms, programming languages, open-source codes or libraries, and metrics to measure synthetic data quality which are used by the studies that focus on the generation of synthetic multimodal data.
Study | Scope | Statistical approaches / algorithm(s) used | Programming language / Software | Open-source codes or libraries used | Metrics used to measure synthetic data fidelity/privacy
[90] | To generate synthetic patient-level data using a novel approach which integrates both static and longitudinal data | Multimodal Neural Ordinary Differential Equations (MultiNODEs) | Python | https://github.com/SCAI-BIO/MultiNODEs | JSD, CSA, MTC, Classification performance/prediction metrics
[91] | To overcome the limitation of sparse annotated data in medical image registration by synthesizing multimodal 4D datasets (CT, CBCT, and MR images) | CycleGAN | - | - | MAE, SSIM, FSIM, EPR, EGR, NPS, CC, NM, HistCC, DSC
[92] | To generate synthetic free-text and tabular data in electronic health records (EHRs) using deep learning algorithms to enhance data sharing and privacy | Encoder-decoder models based on LSTM RNNs | Python | https://github.com/scotthlee/nrc | Classification performance/prediction metrics, COR
[93] | To generate missing MRI modalities (T1, T1ce, FLAIR) from existing T2 modality images to address the issue of incomplete multimodal datasets in clinical settings | RAGAN, Modified U-Net, Multi-Branch Convolutional Neural Network | Python | tensorflow and keras libraries | PSNR, SSIM, FSIM, EPR, EGR, NPS, NCC, DSC
[94] | To generate synthetic clinical, laboratory, genetic data mimicking real AML patient data from clinical trials | CTAB-GAN+ and normalizing flows (NFlow) | Python | https://github.com/waldemar93/synthetic_data_pipeline | Summary statistics, log-transformed correlation score, Kaplan-Meier-Divergence, PLC
[95] | Synthetic data generation of real-time multimodal electronic health and physical records (MHR, wearable biometric and behavioral data, and self-assessment surveys in the standard FHIR format) | Temporally Correlated Multimodal GAN (TC-MultiGAN), Document Sequence Generator (DSG) | Python | https://github.com/GATEKEEPER-OU/synthetic-data | WD, KS test, JSD, PCD
[96] | MRI synchronous construction from a single T1-weighted (T1) image for MRIgRT synthetic CT (sCT) image generation | CMSG-Net compared against Pix2pix, CUT, TransUNet, ResViT, SE2SD-Net | Python | pytorch | MAE, NRMSE, PSNR, SSIM
[97] | To synthesize pseudo-medical images between multimodal datasets (CBCT -> CT, CBCT -> MRI, MRI -> CT) | TGAN (cGAN and CycleGAN) | Python | tensorflow | PSNR, SSIM, MAE, NMI, Dose Distribution and Gamma Analysis
[98] | To generate synthetic X-ray images and corresponding text reports | End-to-end MultImodal X-ray genERative model (EMIXER) | Python | pytorch | Classification performance/prediction metrics, BLEU 1-4, CIDEr Score, FID
[99] | To generate synthetic EHRs (including numerical and categorical data as well as text) | PromptEHR (based on language models) compared against LSTM+MedGA, SynTEG, LSTM+MLP and GPT-2 | Python | https://github.com/RyanWangZf/PromptEHR | Perplexity, Recall@10 and Recall@20, t-test, Wilcoxon test, Fisher’s exact test
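Two of the distributional metrics listed in Table 6, the Wasserstein Distance (WD) and the Kolmogorov-Smirnov (KS) statistic, are easy to state for a single numeric variable. The sketch below is an illustrative NumPy version under simplifying assumptions: the Wasserstein distance is computed only for two samples of equal size (where it reduces to the mean absolute difference of the sorted values), and the KS function returns the test statistic without a p-value. The function names are hypothetical; library implementations such as those in SciPy handle the general cases.

```python
import numpy as np


def wasserstein_1d(x, y):
    """1-Wasserstein distance between two equal-size one-dimensional samples."""
    x, y = np.sort(np.asarray(x, dtype=float)), np.sort(np.asarray(y, dtype=float))
    if len(x) != len(y):
        raise ValueError("this simplified version needs samples of equal size")
    return float(np.mean(np.abs(x - y)))


def ks_statistic(x, y):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    x, y = np.sort(np.asarray(x, dtype=float)), np.sort(np.asarray(y, dtype=float))
    grid = np.concatenate([x, y])         # the gap can only change at observed points
    cdf_x = np.searchsorted(x, grid, side="right") / len(x)
    cdf_y = np.searchsorted(y, grid, side="right") / len(y)
    return float(np.max(np.abs(cdf_x - cdf_y)))


if __name__ == "__main__":
    real = [0.0, 1.0, 2.0, 3.0]
    synth = [1.0, 2.0, 3.0, 4.0]
    print(wasserstein_1d(real, synth), ks_statistic(real, synth))
```

For multimodal records these checks are typically applied per variable (for example, to each wearable signal summary or laboratory value), with correlation-based measures covering the relationships across variables.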
3.2.5. Multimodal data
Table 6 presents significant efforts that have been made in the literature towards synthetic multimodal data generation. Most of these efforts focused on the development of AI-based methods, including the Multimodal Neural Ordinary Differential Equations (MultiNODEs) [90], the CycleGAN [91], LSTM-based encoder-decoder models [92], the RAGAN combined with Modified U-Net and Multi-Branch Convolutional Neural Network [93], the CTAB-GAN+ alongside normalizing flows (NFlow) [94], the Temporally Correlated Multimodal Generative Adversarial Networks (TC-MultiGAN) with Document Sequence Generators [95], CMSG-Net in comparison with Pix2pix [96], the TGAN [97], which combines cGAN and CycleGAN, an End-to-end MultImodal X-ray genERative model (EMIXER) [98], and the PromptEHR [99], compared against a suite of LSTM and GPT-2 based models. The applications of these methods are diverse and focused on enhancing the utility and privacy of healthcare data. Those include the generation of: (i) synthetic patient-level data that integrate static and longitudinal elements, (ii) multimodal 4D datasets for medical image registration, (iii) synthetic text and tabular data for electronic health records, (iv) missing MRI modalities to complete clinical datasets, (v) data mimicking real clinical trial data, (vi) real-time multimodal electronic health records, (vii) MRI synchronous images from single modalities, (viii) pseudo-medical images between various imaging modalities, (ix) synthetic X-ray images and corresponding textual reports, and (x) synthetic EHRs. These advancements underscore the pivotal role of synthetic data in improving data availability while ensuring privacy in healthcare settings.
The metrics used to measure synthetic multimodal data fidelity and quality, in this context, include a diverse array of performance, statistical, and visual measures. The Jensen-Shannon divergence (JSD) and the correlation structure analysis (CSA) are used to evaluate the distributional similarities and correlations between synthetic and real data, while performance metrics like the AUC and the median trajectory comparison (MTC) provide insights into the overall predictive performance and temporal alignment. The Mean Absolute Error (MAE), the

dependencies, effectiveness in addressing class imbalance, and suitability for the healthcare domain. The evaluations draw on insights from recent literature reviews and empirical studies, highlighting both the potential and limitations of various synthetic data generation methods [9,100–102]. In the case of probabilistic-based models, bootstrapping and MVND offer a straightforward implementation but might not capture complex data dependencies adequately. Vine Copula Models stand out for their ability to model intricate dependencies between variables, although they are complex to set up and interpret. SSM are well-suited for modeling sequential data and transitions, particularly in applications with clear state definitions, but are limited to such specific scenarios. Bayesian Networks are characterized by their powerful probabilistic modeling and inference capabilities which incorporate causal relationships, though they may struggle with big data and require complex structuring. In the field of omics, the SBM effectively models complex relationships and community structures yet demands precise parameter tuning and can be computationally demanding. Similarly, the RCM and the Cascade Method aim to simulate realistic gene expression data, including various biases, but might oversimplify and not capture all underlying biological complexities.
Bayesian Models and Tree Ensembles are noted for their flexibility and effectiveness in handling non-linear data patterns. They excel at incorporating uncertainty into predictions, which is crucial for decision-making processes, where risk assessment is significant. However, their performance is restricted by the scale of the data, and they are computationally demanding, a fact that limits their use in real-time or resource-constrained environments. On the other hand, the Radial Basis Function (RBF)-based ANNs are designed to handle complex, non-linear interactions within data. They offer a powerful mechanism for pattern recognition and classification tasks but require significant computational resources, particularly in tuning and training phases. The BGMM algorithm is efficient for clustering and for density estimation. It offers a probabilistic framework that helps to determine the number of components (clusters) in a dataset. The primary challenges with BGMM
structural Similarity Index Measure (SSIM), the Feature Similarity Index involve: (i) the sensitivity to initial parameters, and (ii) the selection of
Measure (FSIM), the Edge Preservation Ratio (EPR), the Edge Genera the number of Gaussian components, which can significantly affect the
tion Ratio (EGR), the Noise Power Spectrum (NPS), the Noise Magnitude model’s performance.
(NM), the Histogram Correlation Coefficient (HistCC), and the Dice The Probabilistic BNs, which are built on probabilistic reasoning, are
Similarity Coefficient (DSC) are used to assess various aspects of image excellent for causal inference and they are particularly useful in fields
quality and feature preservation. Classification/prediction performance like epidemiology and genetics where understanding causal relation
metrics, such as, the recall, the F1 score, the accuracy, the crude odds ships is crucial. The downside lies in their complexity in structure and
ratios (COR), and the removal of Personally Identifiable Information computational demands, especially in the case of big data which may
(PII) are crucial to ensure both accuracy and privacy. The PSNR, the slow down the inference process. In omics studies, the SBM effectively
SSIM, the FSIM, and other noise-related metrics evaluate the visual and models complex relationships and community structures within bio
structural fidelity of synthetic data. Statistical measures, including logical data. It requires precise parameter tuning and substantial
mean, median, standard deviation, log-transformed correlation scores, computational power, which may limit its practical application in
and Kaplan-Meier-Divergence, alongside the Privacy Leakage Coeffi resource-constrained settings. In addition, the RCM and the Cascade
cient (PLC), provide an indication of data integrity and privacy. The method aim to simulate realistic gene expression data, considering
Wasserstein distance, the KS test, and the distance pairwise correlation various biases to enhance the realism of synthetic datasets. They might
further measure the statistical similarity between datasets. Additional simplify complex biological interactions, but they might miss some un
metrics like normalized mutual information (NMI), the dose distribu derlying dynamics.
tion, the gamma analysis, the BLEU scores, the CIDEr score, the Fréchet Table 7 also presents various DL-based generators, each designed to
Inception Distance (FID), the perplexity, and the recall@ 10 and handle specific challenges. These methods leverage the capabilities of
recall@ 20 are also used to assess both the fidelity and utility of syn DL to learn complex patterns, and to generate synthetic data in an
thetic data along with statistical tests, such as, the t-test, the Wilcoxon effective way. To this end, the ADS-GAN, the CTGAN, and other GAN-
test, and the Fisher’s exact test. based variations are particularly effective for generating synthetic
tabular data that preserve privacy. However, although they are known
4. Discussion for their ability to handle high-dimensional data, they require hyper
parameter tuning to avoid issues like mode collapse. The Enhanced
A thorough overview of the above-mentioned synthetic data gener Balancing GAN and the Attention-GAN are designed for imaging data to
ators utilized in the assessed studies are presented in Table 7. The table tackle crucial problems, such as, class imbalance and contrast
presents also the advantages and weaknesses of each methodological enhancement during image synthesis. Although they are powerful for
approach. The advantages and weaknesses of each synthetic data gen capturing intricate details, they may be prone to overfitting, especially
eration approach are defined on the basis of diverse criteria, such as, in small datasets. The Contrastive Diffusion Model and the Flow-Based
implementation simplicity, computational efficiency, flexibility in Network model excel in generating high-resolution and fine-grained
handling non-linear data, robustness in modeling complex images. They offer precise likelihood computation and are effective in
V.C. Pezoulas et al. Computational and Structural Biotechnology Journal 23 (2024) 2892–2910
Table 7
A thorough report of the advantages and weaknesses of the synthetic data generation algorithms deployed in the studies from Tables 1–6.
No | Algorithm [Indicative study] | Type of method | Supported type(s) of data | Advantages | Weaknesses | Programming language
1 | Bootstrapping, MVND [13] | Statistical | Tabular | Simple to implement, robust statistical foundations. | May not capture complex dependencies in data. | R, Python
2 | ADS-GAN [24] | Deep learning | Tabular | Good for generating privacy-preserving synthetic data. | Requires careful tuning to prevent mode collapse. | Python
3 | Bayesian models [40] | Machine learning | Tabular | Flexible, good for non-linear data, incorporates uncertainty. | Computationally intensive, requires substantial data. | R, Python
4 | Tree ensembles [16] | Machine learning | Tabular | Combine multiple decision trees to improve the robustness of the generated data. | Training can be computationally expensive, especially with large datasets and a high number of trees, leading to longer processing times and higher resource usage. | R, Python
5 | RBF-based ANNs [103] | Deep learning | Tabular | Suitable for generating high-quality synthetic data that accurately reflects the underlying patterns in the original dataset. | Scalability issues as the number of data points increases, leading to higher computational costs and potential difficulties in managing large datasets. | R, Python
6 | Vine Copula Models [39] | Statistical | Tabular | Excellent at modeling complex dependencies between variables. | Complex to set up and interpret. | R
7 | BGMM [19] | Machine learning | Tabular | Efficient at clustering and density estimation. | Sensitive to the initialization and number of components. | Python
8 | BGMMO-CE [19] | Machine learning | Tabular | Optimized for computational efficiency. | May lose some nuances of data complexity. | Python
9 | TabularSimulationBase [27] | Deep learning | Tabular | Versatile and capable of generating diverse synthetic datasets. | Can be challenging to tune multiple models effectively. | Python
10 | GaussianCopula [43] | Statistical | Tabular | Effectively captures complex dependencies between multiple variables, allowing for a more accurate representation of multivariate relationships. | Assumption of normality. | Python
11 | CopulaGAN [27] | Deep learning | Tabular | Leverages the flexibility of copula models to capture complex dependencies between variables and the generative power of GANs to produce realistic synthetic data. | Training can be computationally intensive, requiring significant computational resources and time, especially for high-dimensional datasets. | Python
12 | TVAE [47] | Deep learning | Tabular | Specifically designed to model tabular data, capturing complex relationships. | Training can be complex and computationally intensive, requiring careful tuning of hyperparameters and sufficient computational resources to achieve optimal performance. | Python
13 | MedGAN [57] | Deep learning | Tabular | It can generate realistic synthetic healthcare data, including high-dimensional EHRs. | Need for substantial computational resources and expertise in fine-tuning GAN models. | Python
14 | Tabular Preset [64] | Deep learning | Tabular, Radiomics | Handles high-dimensional data well. | High complexity and computational demand. | Python
15 | Bayesian (hierarchical) Generalized Linear Models (hGLM) [37] | Machine learning | Tabular | Excellent for data with hierarchical structures. | Requires extensive computational resources. | R
16 | State-transition Machines [41] | Statistical | Tabular | Good for modeling sequential data and transitions. | Limited to applications with clear state transitions. | Java
17 | CGAN [25] | Deep learning | Tabular | Advanced GAN models capable of generating highly realistic data. | Training stability can be an issue. | Python
18 | CTGAN [46] | Deep learning | Tabular | Specialized for tabular data, helps mitigate class imbalance. | Requires careful hyperparameter tuning. | Python
19 | RadialGAN [47] | Deep learning | Tabular | Cutting-edge methods for detailed synthetic data generation. | Complex architectures that require significant training. | Python
20 | TabDDPM [47] | Deep learning | Tabular | Models the data generation process through a series of diffusion steps, capturing complex data distributions and dependencies accurately. | Iterative nature of denoising diffusion models, requires significant computational resources and time to train, especially on large datasets. | Python
21 | Bayesian Networks [38] | Machine learning | Tabular | Powerful for probabilistic modeling and inference. | Graph structure may be hard to specify with limited data. | R, Software
22 | Sequential Decision Tree-Synthesizer [45] | Deep learning | Tabular | Flexible and scalable to different data types. | Complexity increases with data dimensionality. | R, Python
23 | Enhanced Balancing GAN [48] | Deep learning | Imaging | Specifically designed to address class imbalance in image data. | Potentially limited to specific image-related tasks. | Python
24 | Ensemble of Convolutional Neural Networks [55] | Deep learning | Imaging | Effective for robust image analysis and out-of-distribution data. | Requires significant computational power and data for training. | Python
25 | Attention-GAN [49] | Deep learning | Imaging | Capable of capturing intricate details in image synthesis. | May be prone to overfitting on small datasets. | Python
26 | Conditional Variational Autoencoder [52] | Deep learning | Imaging | Combines the strengths of CVAE and GAN for improved generation. | Complex to implement and tune. | Python
27 | Contrastive Diffusion Model [53] | Deep learning | Imaging | Excels at generating high-resolution, fine-grained images. | Computationally demanding and requires tuning. | Python
28 | Normalizing Flows [56] | Deep learning | Imaging | Offers exact likelihood computation and invertibility. | Requires careful design to ensure effective flow architectures. | Python
29 | Dual-Discriminator Conditional GAN [34] | Deep learning | Imaging | Enhances detail and realism in multi-resolution image fusion. | May introduce high complexity and training difficulty. | Python
30 | Vision Transformer [54] | Deep learning | Imaging | Harnesses the power of transformers for image processing. | Requires large datasets and extensive training time. | Python
31 | Flow-Based Network [58] | Deep learning | Imaging | Useful for enhancing visual quality in super-resolution tasks. | Relatively new with potentially unexplored limitations. | Python
32 | Task-Guided GAN [60] | Deep learning | Imaging | Tailors the generation process to specific tasks, enhancing utility. | Task-specific tuning can limit general application. | Python
33 | Progressively Growing GANs [50] | Deep learning | Imaging | Allows for gradual building of image resolution, enhancing detail. | High resource consumption and complex training dynamics. | Python
34 | Style Distribution GAN (SD-GAN) [51] | Deep learning | Imaging | Focuses on transferring and blending diverse style features. | Managing style variations effectively can be challenging. | Python
35 | CS-CO (Self-Supervised Learning) [62] | Deep learning | Imaging | Self-supervised learning method for histopathological images. | Limited by the quality and variation of unlabeled data. | Python
36 | SinGAN [57] | Deep learning | Imaging | Capable of generating high-quality images from a single training image. | May not perform well with complex scenes containing multiple objects. | Python
37 | FastGAN [57] | Deep learning | Imaging | Faster and more efficient training compared to traditional GANs. | Limited research and applications compared to more established GAN models. | Python
38 | PGGAN [57] | Deep learning | Imaging | Can generate very high-resolution images (e.g., 1024 × 1024). | Training can be computationally intensive and time-consuming. | Python
39 | pix2pix [57] | Deep learning | Imaging | Effective for image-to-image translation tasks. | Can suffer from mode collapse, where the generator produces limited diversity in outputs. | Python
40 | WGAN-GP [57] | Deep learning | Radiomics | Effective at generating realistic samples and stable training. | May require extensive computational resources. | Python
41 | RadSynth [66] | Deep learning | Radiomics | Specifically designed for radiomic image synthesis. | Limited information available; potential specificity to tasks. | Software
42 | SSSD-ECG [74] | Deep learning | Time-series | Specifically tailored for synthetic ECG data generation. | Specifically tailored for synthetic ECG data generation. | Python
43 | DiffWave [74] | Deep learning | Time-series | Particularly effective in generating high-fidelity synthetic data by leveraging the power of diffusion models to produce realistic and high-quality outputs. | Requires significant computational power and expertise in deep learning and diffusion model techniques to achieve optimal results. | Python
44 | WaveGAN [74] | Deep learning | Time-series | Effective for applications that require realistic and coherent audio data generation, such as speech and music synthesis. | Can be unstable and requires careful tuning of hyperparameters. | Python
45 | Pulse2Pulse [74] | Deep learning | Time-series | Tailored for generating realistic physiological pulse signals, such as ECG or PPG data. | Requires extensive hyperparameter tuning and significant computational resources to accurately capture the nuances of physiological signals. | Python
46 | Causal Recurrent Variational AutoEncoder (CRVAE) [75] | Deep learning | Time-series | Excels at generating time-series data with underlying causality. | Potentially complex to implement and requires substantial data. | Python
47 | Guided Evolutionary Synthesizer (GES) [67] | Deep learning | Time-series | Adaptable to different bias scenarios in time-series data. | May require expert knowledge to configure and operate. | Python
48 | TS-GAN [71] | Deep learning | Time-series | Tailored for sensor health data augmentation. | Specific to sensor data, may not generalize across domains. | Python
49 | Variational Autoencoder (VAE) [23] | Deep learning | Time-series | Good for modeling distribution of data for simulation. | Sometimes struggles with the quality of generated samples. | Python
50 | Multi-label Time series GAN (MTGAN) [28] | Deep learning | Time-series | Effective for handling time series data with multiple labels. | Requires careful tuning and extensive dataset preparation. | Python
51 | COmmon Source CoordInated GAN (COSCI-GAN) [72] | Deep learning | Time-series | Innovative for generating multivariate time series. | New approach with potential untested scenarios. | Python
52 | HealthGAN, Wasserstein GAN, TimeGAN [73] | Deep learning | Time-series | Advanced suite of models for comprehensive time series generation. | High computational demand and complexity. | Python
53 | Transformer-Based GAN (TTS-GAN) [29] | Deep learning | Time-series | Utilizes transformer architectures for high fidelity synthesis. | Requires extensive computational resources and data. | Python
54 | HMM and Regression Algorithms [22] | Machine learning | Time-series | Effective for capturing sequences and transitions in time series data. | Complex integration of multiple modeling techniques. | Python
55 | Randomly Selected and Randomly Permuted Enriched Pathways [87], [88] | Statistical | Omics | Efficiently preserves the statistical distributions for semi-synthetic metabolomics data analysis. | Limited to the statistical properties available in the data; may not introduce novel biological insights. | R, Python
56 | Stochastic Block Model (SBM) [14] | Probabilistic | Omics | Effectively models complex relationships and community structures within multi-omics data. | Requires careful parameter tuning and can be computationally intensive. | Matlab
57 | Time-evolving Graphs with Metastability [15] | Probabilistic | Omics | Captures dynamic processes effectively, useful for studying temporal changes in microbiomes. | Complex to implement and requires understanding of differential equations and graph theory. | C++, Python
58 | Random Covariance Method (RCM) [78] | Statistical | Omics | Simulates real-world gene expression data including various biases, enhancing realism. | Potentially oversimplified, might not capture all underlying biological complexities. | Matlab
59 | Cascade Method [78] | Statistical | Omics | Effective in handling hierarchical or sequential processes by breaking down complex problems into simpler, smaller stages, which can improve the accuracy and manageability of modeling efforts. | Errors can accumulate and propagate through the stages of the cascade, potentially leading to reduced overall accuracy and reliability in the final synthetic data generation if not carefully managed. | Matlab
60 | omicsGAN [85] | Deep learning | Omics | Utilizes advanced GAN technology to generate high-fidelity omics data, improving phenotype prediction. | GANs can be challenging to train and require large amounts of data to avoid mode collapse. | Python
61 | Image-based and Turing-based Methods [89] | Deep learning | Omics | Innovative use of visual and mathematical models to simulate spatial gene expression patterns. | May require specific expertise in both image processing and mathematical modeling. | Python
62 | Random Generation from Uniform Distributions [80] | Statistical | Omics | Simple and effective for generating data with specified statistical properties. | Lacks complexity, might not be suitable for capturing non-linear relationships or interactions. | R
63 | Deep Boltzmann Machines (DBMs) [86] | Deep learning | Omics | Capable of capturing complex and high-dimensional data distributions. | Computationally intensive and challenging due to the need for layer-wise pre-training and fine-tuning. | Python
64 | Power Law Degree Distribution [76] | Statistical | Omics | Useful for generating network topologies that mimic natural biological networks. | Assumes network connectivity that follows a power law, which might not be appropriate for all types of biological data. | R
65 | Simulated Linear Test (s-test) [83] | Statistical | Omics | Adapts well to small sample sizes and can handle technical variations in proteomics data. | Specific to scenarios with small sample sizes and may not generalize to larger or different datasets. | R, Matlab
66 | Structured and Random Perturbations [82] | Statistical | Omics | Allows for the generation of complex multi-omics data, enhancing the realism and applicability of synthetic datasets. | Requires careful calibration to ensure the perturbations reflect realistic biological variability. | Python
67 | Multimodal Neural Ordinary Differential Equations (MultiNODEs) [90] | Deep learning | Multimodal | Integrates static and longitudinal data effectively for patient-level data synthesis. | Requires careful configuration and understanding of both differential equations and neural networks. | Python
68 | CycleGAN [91] | Deep learning | Multimodal | Excellent for image-to-image translation tasks without needing paired data, useful in medical imaging. | Can struggle with maintaining consistency in synthesized images where there is a large variation between input modalities. | Python
69 | Encoder-decoder models based on LSTM RNNs [92] | Deep learning | Multimodal | Effective for generating coherent and contextually relevant text and tabular data. | May face challenges with very long sequences or extremely diverse datasets. | Python
70 | RAGAN, Modified U-Net, Multi-Branch Convolutional Neural Network [93] | Deep learning | Multimodal | Combines multiple advanced techniques to fill missing MRI modalities, enhancing dataset completeness. | Complex to train and requires substantial computational resources. | Python
71 | CTAB-GAN+ and Normalizing Flows (NFlow) [94] | Deep learning | Multimodal | Allows for detailed control over the statistical properties of synthetic data, suitable for clinical and laboratory data simulation. | Configuration and tuning can be complex, and understanding statistical underpinnings is essential. | Python
72 | Temporally Correlated Multimodal Generative Adversarial Network (TC-MultiGAN) [95] | Deep learning | Multimodal | Tailored for generating time-correlated multimodal datasets, particularly in dynamic and real-time environments. | Can be challenging to synchronize multiple data streams effectively. | Python
73 | Document Sequence Generator (DSG) [95] | Deep learning | Multimodal | Particularly useful for tasks involving document and text data generation by capturing complex temporal dependencies within sequences. | Requires substantial computational power and careful tuning of model parameters to achieve high-quality and coherent text generation. | Python
74 | CMSG-Net [96] | Deep learning | Multimodal | A robust set of tools for MRIgRT synthetic CT image generation, utilizing both established and cutting-edge techniques. | Each model brings its own set of parameters and complexities, potentially complicating integration and optimization. | Python
75 | TGAN [97] | Deep learning | Multimodal | Enables effective synthesis of medical images between different modalities, addressing the scarcity of annotated medical images. | Requires careful adjustment to ensure high fidelity and avoid artifacts common in synthesized images. | Python
76 | End-to-end MultImodal X-ray generative model (EMIXER) [98] | Deep learning | Multimodal | Specifically designed to generate synthetic X-ray images along with corresponding textual reports, enhancing data utility for training AI models. | Integrating text and image generation smoothly can be technically challenging and requires extensive data for training. | Python
77 | PromptEHR (based on language models) [99] | Deep learning | Multimodal | Utilizes advanced language models to generate synthetic EHRs, enabling a high degree of realism and complexity. | Balancing the generation of coherent and realistic EHRs while ensuring privacy can be difficult. | Python
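
To make the simplest entries of Table 7 concrete, the following minimal Python sketch draws synthetic rows by bootstrapping and by sampling a fitted multivariate normal distribution (MVND), and then scores each synthetic table with three of the fidelity measures named in this review (the KS test statistic, the Wasserstein distance, and the JSD). It is an illustration under stated assumptions and not the implementation used in any of the reviewed studies: the function names, the toy two-column "real" table, and the choice of 20 histogram bins for the JSD are all hypothetical.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import ks_2samp, wasserstein_distance


def bootstrap_sample(real, n_samples, rng):
    """Resample whole rows of the real table with replacement."""
    idx = rng.integers(0, real.shape[0], size=n_samples)
    return real[idx]


def mvnd_sample(real, n_samples, rng):
    """Draw rows from a multivariate normal fitted to the real table."""
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n_samples)


def fidelity_report(real, synthetic, bins=20):
    """Per-column KS statistic, Wasserstein distance, and JSD (base 2)."""
    report = []
    for j in range(real.shape[1]):
        r, s = real[:, j], synthetic[:, j]
        # Shared bin edges so both histograms are defined on the same support.
        edges = np.histogram_bin_edges(np.concatenate([r, s]), bins=bins)
        p, _ = np.histogram(r, bins=edges)
        q, _ = np.histogram(s, bins=edges)
        # SciPy returns the Jensen-Shannon *distance*; squaring gives the divergence.
        jsd = jensenshannon(p, q, base=2) ** 2
        report.append({
            "column": j,
            "ks": ks_2samp(r, s).statistic,
            "wasserstein": wasserstein_distance(r, s),
            "jsd": jsd,
        })
    return report


rng = np.random.default_rng(0)
# Toy "real" table with two correlated columns (hypothetical values, not patient data).
real = rng.multivariate_normal(
    [120.0, 80.0], [[225.0, 90.0], [90.0, 100.0]], size=500
)
synthetic_boot = bootstrap_sample(real, 500, rng)
synthetic_mvnd = mvnd_sample(real, 500, rng)

for name, synthetic in [("bootstrap", synthetic_boot), ("MVND", synthetic_mvnd)]:
    for row in fidelity_report(real, synthetic):
        print(name, row)
```

Both generators are marginal-friendly by construction, which is consistent with the weakness listed for them in Table 7: the per-column scores above say nothing about dependencies that a normal distribution or plain resampling cannot express, so a correlation-level comparison would be needed in practice.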
tasks like super-resolution, though they require significant computational resources and careful tuning to perform effectively.

VAEs and Transformer-Based Models are widely used for modeling time-series and imaging data. Transformer-based models, like the Vision Transformers, utilize attention mechanisms to handle big data, but they require extensive training time and resources. Multimodal approaches, such as the CycleGAN and the encoder-decoder models, are well suited to image-to-image translation tasks without the need for paired data. Furthermore, encoder-decoder models based on LSTM RNNs can effectively generate coherent text and tabular data. Although these methods manage to synthesize data robustly across different modalities, they often struggle to maintain consistency or handle very long sequences. In medical imaging, models like TGAN and CMSG-Net are widely used to synthesize medical images that can address the scarcity of annotated images and thus enhance data utility for training AI models. In omics, methods like omicsGAN and Probabilistic Modeling utilize advanced GAN technology and probabilistic approaches to simulate complex biological data patterns, improving phenotype prediction but requiring large datasets to avoid overfitting.

There is no doubt that synthetic data generation has been a point of interest in a broad spectrum of studies in the healthcare domain. However, the generators often require significant computational resources and configuration to optimize their performance and utility. The DL-based generators demonstrate a broad capability to generate, enhance, and analyze data in healthcare. They are marked by their ability to handle complex and high-dimensional data, but often at the cost of high computational demand and the need for extensive data and model tuning. The ML-based generators are robust, capable of modeling complex, non-linear relationships, and can be computationally efficient. Their effectiveness, however, often comes at the cost of increased computational requirements and complexity in tuning and operation, which necessitates an optimized implementation that reduces resource constraints. Probabilistic models are characterized by their ability to incorporate uncertainty into the modeling process. However, they often require careful design and parameter tuning and can be computationally intensive, with limited scalability when handling complex data.

In addition, synthetic data can significantly contribute to the principles of trustworthy AI (TwAI) by enhancing privacy, fairness, and robustness. The generation of synthetic data that can "mimic" the real data without containing any personal or sensitive information safeguards individuals' privacy and mitigates the risks of data breaches. Synthetic data can also be used to correct biases which are present in real-world data (e.g., by populating unprivileged groups to reduce demographic disparities), thereby promoting fairness and reducing discrimination in AI models. Moreover, synthetic data can enable the development of robust AI models by offering diverse and high-quality data for augmentation, as well as by reducing vulnerabilities to adversarial attacks.

5. Conclusion and future directions

The current review reveals a noteworthy and exponentially increasing number of studies which focus on the development and deployment of synthetic data generation technologies in healthcare across various data modalities, including tabular, imaging, radiomics, time-series, and omics. These studies make use of synthetic data not only to address privacy concerns but also to enhance the availability and diversity of the real data, which is crucial for training AI-driven diagnostic and predictive models to improve patient outcomes and to support healthcare research. In addition, the current work presents the advantages and weaknesses of a variety of statistical, probabilistic, ML-based, and DL-based synthetic data generators. Great emphasis was given to reporting open-source tools to promote collaborative efforts within the research community and to accelerate advancements in the field.

The ability to synthesize tabular, imaging, radiomics, time-series, and omics data is essential for the development of robust AI models that can deliver more accurate and personalized healthcare solutions. Towards this direction, there has been a reported increase in the use of statistical and probabilistic methods, machine learning methods, and deep learning methods for generating synthetic data with improved fidelity and utility. For tabular data, statistical methods like MVND and bootstrapping, and probabilistic methods like Bayesian Models, are widely used for generating synthetic distributions that preserve the underlying statistical properties of the real data. These methods are valuable for simulations in clinical trials and disease progression modeling. Machine learning methods, such as GMM and tree ensembles, can effectively capture complex patterns within the data, aiding in the generation of large-scale virtual populations for in silico clinical trials. DL-based methods like GANs and VAEs have been utilized to enhance privacy-conscious data generation, supporting applications such as clinical decision support and predictive modeling. For imaging data, CycleGAN and Enhanced Balancing GAN are instrumental in generating synthetic medical images, including functionalities to address minority classes in datasets or to perform style transfer between different imaging modalities. Conditional Variational Autoencoder (CVAE) and Attention-based GANs are deployed for specific tasks like image augmentation and high-resolution image synthesis, showcasing their adaptability in handling varied imaging data challenges. For radiomics data, WGAN-GP and CTGAN have been employed to generate synthetic radiomic data, which are crucial for training models to differentiate between various medical conditions using radiomic features. Copula GAN and Diffusion-based Models are cutting-edge methods that enhance the capacity to generate realistic and statistically coherent radiomic images and features.

For time-series data, Wasserstein GAN with Gradient Penalty (WGAN-GP) and Multi-label Time Series GAN (MTGAN) offer robust solutions for generating realistic time-series data, which is crucial for medical applications where temporal dynamics are essential. Transformer-Based GANs and Causal Recurrent Variational Autoencoders (CRVAEs) underscore the evolution towards using complex architectures to maintain temporal dependencies and enhance the fidelity of synthetic time-series datasets. For omics data, Randomly Selected Pathways and Causal Feature Clusters are statistical-based methods used for generating synthetic omics data, which are vital for addressing issues like class imbalance and enhancing disease phenotype predictions. OmicsGAN and DBMs are advanced deep learning methods focusing on the generation of complex omics datasets, facilitating better interpretations of intricate biological processes. As for multimodal data generation, MultiNODEs and TC-MultiGAN illustrate the integration of various data types through advanced neural networks, tackling challenges in multimodal data synthesis like generating comprehensive electronic health records or synthetic MRI images. CycleGAN and End-to-End Multimodal X-ray Generative Model (EMIXER) demonstrate the versatility of GANs in creating synthetic datasets that span multiple medical imaging modalities and in integrating imaging with textual data.

The variety of methods discussed highlights a significant advancement in the field of synthetic data generation, tailored to diverse needs across different types of medical data. Each algorithm or tool brings specific strengths to the table, addressing the challenges posed by the vast and varied data landscape in healthcare. Despite the advancements, there are ongoing challenges related to the quality, representativeness, and ethical use of synthetic data. Challenges such as data fidelity, potential biases introduced in the generated data, and the need for big, diverse data for AI model training remain critical areas for improvement. Future research in the field needs to continue exploring these technologies, with a particular focus on improving the accuracy, reliability, and ethical aspects of synthetic data generation. This will not only enhance the robustness of the AI models but also ensure their applicability in real-world medical settings, ultimately leading to better patient outcomes and more efficient healthcare systems. Furthermore, future research should focus on improving the
2907
V.C. Pezoulas et al. Computational and Structural Biotechnology Journal 23 (2024) 2892–2910
fidelity of synthetic data to ensure that they can mimic real-world data. [13] Smania G, Jonsson EN. Conditional distribution modeling as an alternative
method for covariates simulation: Comparison with joint multivariate normal and
This includes the development of more sophisticated models that can
bootstrap techniques. CPT Pharmacomet Syst Pharmacol 2021;vol. 10(4):330–9.
capture complex dependencies and interactions within the real data. https://doi.org/10.1002/psp4.12613.
Addressing biases in synthetic data generation is another critical factor [14] AL-kuhali HA, et al. Multiview clustering of multi-omics data integration by using
to ensure fairness and equity in the AI models. Emphasis should be given a penalty model. BMC Bioinforma 2022;vol. 23(1):288. https://doi.org/10.1186/
s12859-022-04826-4.
to identifying and mitigating potential biases, particularly in data with [15] Melnyk K, Klus S, Montavon G, Conrad TOF. GraphKKE: graph Kernel Koopman
underrepresented populations. As healthcare data continues to grow, embedding for human microbiome analysis. Appl Netw Sci 2020;vol. 5(1):96.
scalable and efficient synthetic data generation methods are needed https://doi.org/10.1007/s41109-020-00339-2.
[16] Pezoulas VC, Grigoriadis GI, Tachos NS, Barlocco F, Olivotto I, Fotiadis DI.
with reduced computational complexity while maintaining high-quality Generation of virtual patient data for in-silico cardiomyopathies drug
outcomes. Emphasis should also be given on the improvement and development using tree ensembles: a comparative study. 2020 42nd Annual
refinement of ethical guidelines and regulatory frameworks for the use International Conference of the IEEE Engineering in Medicine & Biology Society
(EMBC). IEEE; 2020. p. 5343–6.
of synthetic data in healthcare to ensure transparency in data generation [17] Robnik-Šikonja M. Dataset comparison workflows. Int J Data Sci 2018;vol. 3(2):
and strict adherence to privacy standards. 126–45.
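To make the notion of fidelity more concrete, the following is a minimal sketch, not a method taken from the reviewed studies, of two simple checks that could be applied to tabular synthetic data. It assumes purely numeric features and uses a per-feature two-sample Kolmogorov-Smirnov statistic for the marginals together with the largest absolute difference between correlation matrices for the pairwise dependencies. The function names `ks_statistic` and `fidelity_report`, the toy Gaussian data, and the two toy generators are hypothetical and serve only as an illustration.

```python
import numpy as np


def ks_statistic(real_col, synth_col):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    real_sorted = np.sort(real_col)
    synth_sorted = np.sort(synth_col)
    # Evaluate both empirical CDFs at every observed value.
    grid = np.concatenate([real_sorted, synth_sorted])
    cdf_real = np.searchsorted(real_sorted, grid, side="right") / real_sorted.size
    cdf_synth = np.searchsorted(synth_sorted, grid, side="right") / synth_sorted.size
    return float(np.max(np.abs(cdf_real - cdf_synth)))


def fidelity_report(real, synth):
    """Summarise marginal fidelity (per-feature KS) and dependency fidelity (correlation gap)."""
    ks_values = [ks_statistic(real[:, j], synth[:, j]) for j in range(real.shape[1])]
    corr_gap = np.abs(np.corrcoef(real, rowvar=False) - np.corrcoef(synth, rowvar=False))
    return {"max_ks": max(ks_values), "max_corr_gap": float(corr_gap.max())}


rng = np.random.default_rng(0)

# Toy stand-in for real data: two correlated numeric features (not patient data).
real = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=2000)

# Generator A: samples from a Gaussian fitted to the real data (keeps the correlation).
synth_joint = rng.multivariate_normal(real.mean(axis=0), np.cov(real, rowvar=False), size=2000)

# Generator B: shuffles each feature independently (exact marginals, no dependency).
synth_shuffled = np.column_stack([rng.permutation(real[:, j]) for j in range(real.shape[1])])

report_joint = fidelity_report(real, synth_joint)
report_shuffled = fidelity_report(real, synth_shuffled)
print("joint Gaussian:", report_joint)
print("shuffled columns:", report_shuffled)
```

In this toy setting, the shuffled generator reproduces every marginal exactly, so its KS statistic is zero, yet it loses the correlation between the two features. This illustrates why marginal checks alone are not sufficient and why dependencies and interactions should be evaluated as well. Real evaluations would likely need measures suited to categorical, longitudinal, or imaging data.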
CRediT authorship contribution statement

Vasileios C. Pezoulas: Writing – review & editing, Writing – original draft, Visualization, Validation, Methodology, Investigation, Conceptualization. Dimitrios Zaridis: Writing – original draft, Methodology, Investigation. Eugenia Mylona: Writing – original draft, Methodology, Investigation. Christos Androutsos: Writing – original draft, Methodology, Investigation. Dimitrios Fotiadis: Writing – review & editing, Writing – original draft, Supervision, Funding acquisition, Conceptualization. Kosmas Apostolidis: Writing – original draft, Methodology, Investigation. Nikolaos S. Tachos: Writing – review & editing, Writing – original draft, Methodology, Investigation, Conceptualization.

Declaration of Competing Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgements

The research work has received funding from the European Commission under GA 101135932 (FAITH Project).
References in Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and
Data Mining. Long Beach CA USA: ACM; 2023. p. 402–13. https://doi.org/
10.1145/3580305.3599534.
[1] Shilo S, Rossman H, Segal E. Axes of a revolution: challenges and promises of big
[28] Lu C, Reddy CK, Wang P, Nie D, Ning Y. Multi-label clinical time-series generation
data in healthcare. Nat Med 2020;vol. 26(1):29–38. https://doi.org/10.1038/
via conditional GAN. IEEE Trans Knowl Data Eng 2024;vol. 36(4):1728–40.
s41591-019-0727-5.
https://doi.org/10.1109/TKDE.2023.3310909.
[2] Agrawal R, Prabakaran S. Big data in digital healthcare: lessons learnt and
[29] X. Li, V. Metsis, H. Wang, A.H.H. Ngu, TTS-GAN: A Transformer-based Time-
recommendations for general practice. Heredity 2020;vol. 124(4):525–34.
Series Generative Adversarial Network. arXiv, Jun. 26, 2022. Accessed: May 23,
https://doi.org/10.1038/s41437-020-0303-2.
2024. [Online]. Available: 〈http://arxiv.org/abs/2202.02691〉.
[3] Appenzeller A, Leitner M, Philipp P, Krempel E, Beyerer J. Privacy and utility of
[30] Zhang C, et al. Correction of out-of-focus microscopic images by deep learning.
private synthetic data for medical data analyses. Appl Sci 2022;vol. 12(23):
Comput Struct Biotechnol J 2022;vol. 20:1957–66. https://doi.org/10.1016/j.
12320. https://doi.org/10.3390/app122312320.
csbj.2022.04.003.
[4] S.M. Bellovin, P.K. Dutta, N. Reitinger, Privacy and Synthetic Datasets, vol. 22.
[31] Grimwood A, et al. Endoscopic Ultrasound Image Synthesis Using a Cycle-
[5] Yale A, Dash S, Dutta R, Guyon I, Pavao A, Bennett KP. Generation and evaluation
Consistent Adversarial Network, in Simplifying Medical Ultrasound. vol. 12967.
of privacy preserving synthetic health data. Neurocomputing 2020;vol. 416:
In: Noble JA, Aylward S, Grimwood A, Min Z, Lee S-L, Hu Y, editors. Lecture
244–55. https://doi.org/10.1016/j.neucom.2019.12.136.
Notes in Computer Science, vol. 12967. Cham: Springer International Publishing;
[6] Gonzales A, Guruswamy G, Smith SR. Synthetic data in health care: a narrative
2021. p. 169–78. https://doi.org/10.1007/978-3-030-87583-1_17. vol. 12967.
review. PLOS Digit Health 2023;vol. 2(1):e0000082. https://doi.org/10.1371/
[32] Wang J, Wu QMJ, Pourpanah F. DC-cycleGAN: bidirectional CT-to-MR synthesis
journal.pdig.0000082.
from unpaired data. Comput Med Imaging Graph 2023;vol. 108:102249. https://
[7] Murtaza H, Ahmed M, Khan NF, Murtaza G, Zafar S, Bano A. Synthetic data
doi.org/10.1016/j.compmedimag.2023.102249.
generation: state of the art in health care domain. Comput Sci Rev 2023;vol. 48:
[33] Shaban MT, Baur C, Navab N, Albarqouni S. Staingan: Stain Style Transfer for
100546. https://doi.org/10.1016/j.cosrev.2023.100546.
Digital Histological Images. 2019 IEEE 16th International Symposium on
[8] J. Jordon et al., “Synthetic Data – what, why and how?” arXiv, May 06, 2022.
Biomedical Imaging (ISBI 2019). Venice, Italy: IEEE; 2019. p. 953–6. https://doi.
Accessed: May 28, 2024. [Online]. Available: 〈http://arxiv.org/abs/2205.0
org/10.1109/ISBI.2019.8759152.
3257〉.
[34] Ma J, Xu H, Jiang J, Mei X, Zhang X-P. DDcGAN: a dual-discriminator conditional
[9] Figueira A, Vaz B. Survey on synthetic data generation, evaluation methods and
generative adversarial network for multi-resolution image fusion. IEEE Trans
GANs. Mathematics 2022;vol. 10(15):2733. https://doi.org/10.3390/
Image Process 2020;vol. 29:4980–95. https://doi.org/10.1109/
math10152733.
TIP.2020.2977573.
[10] O. Mendelevitch, “Review of Methods and Experimental Results”.
[35] Pezoulas V, Tachos N, Fotiadis D. Generation of virtual patients for in silico
[11] Cheng V, Suriyakumar VM, Dullerud N, Joshi S, Ghassemi M. Can You Fake It
cardiomyopathies drug development. 2019 IEEE 19th Int Conf Bioinforma Bioeng
Until You Make It?: Impacts of Differentially Private Synthetic Data on
(BIBE) 2019:671–4. https://doi.org/10.1109/BIBE.2019.00126.
Downstream Classification Fairness. Proceedings of the 2021 ACM Conference on
[36] Pezoulas VC, et al. A computational pipeline for data augmentation towards the
Fairness, Accountability, and Transparency, Virtual Event. Canada: ACM; 2021.
improvement of disease classification and risk stratification models: a case study
p. 149–60. https://doi.org/10.1145/3442188.3445879.
in two clinical domains. Comput Biol Med 2021;vol. 134:104520. https://doi.
[12] Ferrara E. Fairness and bias in artificial intelligence: a brief survey of sources,
org/10.1016/j.compbiomed.2021.104520.
impacts, and mitigation strategies. Sci 2023;vol. 6(1):3. https://doi.org/10.3390/
sci6010003.
2908
V.C. Pezoulas et al. Computational and Structural Biotechnology Journal 23 (2024) 2892–2910
[37] Kiagias D, Russo G, Sgroi G, Pappalardo F, Juárez MA. Bayesian augmented [60] Li R, Bastiani M, Auer D, Wagner C, Chen X. Image Augmentation Using a Task
clinical trials in TB therapeutic vaccination. Front Med Technol 2021;vol. 3: Guided Generative Adversarial Network for Age Estimation on Brain MRI.
719380. https://doi.org/10.3389/fmedt.2021.719380. Medical Image Understanding and Analysis, vol. 12722. In: Papież BW, Yaqub M,
[38] Tucker A, Wang Z, Rotalinti Y, Myles P. Generating high-fidelity synthetic patient Jiao J, Namburete AIL, Noble JA, editors. Lecture Notes in Computer Science, vol.
data for assessing machine learning healthcare software. Npj Digit Med 2020;vol. 12722. Cham: Springer International Publishing; 2021. p. 350–60. https://doi.
3(1):147. https://doi.org/10.1038/s41746-020-00353-9. org/10.1007/978-3-030-80432-9_27. Medical Image Understanding and
[39] Zwep LB, Guo T, Nagler T, Knibbe CAJ, Meulman JJ, Van Hasselt JGC. Virtual Analysis, vol. 12722.
patient simulation using copula modeling. Clin Pharmacol Ther 2024;vol. 115(4): [61] Tran N-T, Tran V-H, Nguyen N-B, Nguyen T-K, Cheung N-M. On data
795–804. https://doi.org/10.1002/cpt.3099. augmentation for GAN training. IEEE Trans Image Process 2021;vol. 30:1882–97.
[40] Kharya S, Soni S, Swarnkar T. Generation of synthetic datasets using weighted https://doi.org/10.1109/TIP.2021.3049346.
bayesian association rules in clinical world. Int J Inf Technol 2022;vol. 14(6): [62] Yang P, Hong Z, Yin X, Zhu C, Jiang R. Self-supervised Visual Representation
3245–51. https://doi.org/10.1007/s41870-022-01081-x. Learning for Histopathological Images. Medical Image Computing and Computer
[41] H. Freedman, M.A. Miller, H. Williams, C. J. S. Jr, “Scaling and Querying a Assisted Intervention – MICCAI 2021, vol. 12902. In: De Bruijne M, Cattin PC,
Semantically Rich, Electronic Healthcare Graph”. Cotin S, Padoy N, Speidel S, Zheng Y, Essert C, editors. in Lecture Notes in
[42] Walonoski J, et al. Synthea™ Novel coronavirus (COVID-19) model and synthetic Computer Science, vol. 12902. Cham: Springer International Publishing; 2021.
data set. Intell -Based Med 2020;vol. 1–2:100007. https://doi.org/10.1016/j. p. 47–57. https://doi.org/10.1007/978-3-030-87196-3_5. Medical Image
ibmed.2020.100007. Computing and Computer Assisted Intervention – MICCAI 2021, vol. 12902.
[43] Koloi A, et al. A comparison study on creating simulated patient data for [63] Han S, Carass A, Schar M, Calabresi PA, Prince JL. Slice Profile Estimation From
individuals suffering from chronic coronary disorders. 2023 45th Annual 2D MRI Acquisition Using Generative Adversarial Networks. 2021 IEEE 18th
International Conference of the IEEE Engineering in Medicine & Biology Society International Symposium on Biomedical Imaging (ISBI). Nice, France: IEEE; 2021.
(EMBC). Sydney, Australia: IEEE; 2023. p. 1–4. https://doi.org/10.1109/ p. 145–9. https://doi.org/10.1109/ISBI48211.2021.9434137.
EMBC40787.2023.10340194. [64] Ahmadian M, et al. Overcoming data scarcity in radiomics/radiogenomics using
[44] Rodriguez-Almeida AJ, et al. Synthetic patient data generation and evaluation in synthetic radiomic features. Comput Biol Med 2024;vol. 174:108389. https://doi.
disease prediction using small and imbalanced datasets. IEEE J Biomed Health org/10.1016/j.compbiomed.2024.108389.
Inform 2023;vol. 27(6):2670–80. https://doi.org/10.1109/JBHI.2022.3196697. [65] Hosseini S, et al. MRI-based radiomics combined with deep learning for
[45] El Emam K, Mosquera L, Fang X, El-Hussuna A. An evaluation of the replicability distinguishing IDH-mutant WHO grade 4 astrocytomas from IDH-wild-type
of analyses using synthetic health data. Sci Rep Mar. 2024;vol. 14(1):6978. glioblastomas. Cancers 2023;vol. 15(3):951. https://doi.org/10.3390/
https://doi.org/10.1038/s41598-024-57207-7. cancers15030951.
[46] Lohaj O, Paralič J, Kushnir D, Vanko JI. Usability of a synthetically generated [66] Parekh VS, Jacobs MA. Radiomic Synthesis Using Deep Convolutional Neural
dataset for decision support. 2024 IEEE 22nd World Symposium on Applied Networks. 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI
Machine Intelligence and Informatics (SAMI). Stará Lesná, Slovakia: IEEE; 2024. 2019). Venice, Italy: IEEE; 2019. p. 1114–7. https://doi.org/10.1109/
p. 000435–40. https://doi.org/10.1109/SAMI60510.2024.10432913. ISBI.2019.8759491.
[47] Z. Qian and R. Davis, Synthcity: a benchmark framework for diverse use cases of [67] Dakshit S, Dakshit S, Khargonkar N, Prabhakaran B. Bias analysis in healthcare
tabular synthetic data. time series (BAHT) decision support systems from meta data. J Healthc Inform
[48] Huang G, Jafari AH. Enhanced balancing GAN: minority-class image generation. Res 2023;vol. 7(2):225–53. https://doi.org/10.1007/s41666-023-00133-6.
Neural Comput Appl 2023;vol. 35(7):5145–54. https://doi.org/10.1007/s00521- [68] Khorchani T, Gadiya Y, Witt G, Lanzillotta D, Claussen C, Zaliani A. SASC: a
021-06163-8. simple approach to synthetic cohorts for generating longitudinal observational
[49] Dey S, Basuchowdhuri P, Mitra D, Augustine R, Saha SK, Chakraborti T. BliMSR: patient cohorts from COVID-19 clinical data. Patterns 2022;vol. 3(4):100453.
Blind Degradation Modelling for Generating High-Resolution Medical Images. https://doi.org/10.1016/j.patter.2022.100453.
Medical Image Understanding and Analysis, vol. 14122. In: Waiter G, Lambrou T, [69] Dissanayake T, Fernando T, Denman S, Sridharan S, Fookes C. Generalized
Leontidis G, Oren N, Morris T, Gordon S, editors. in Lecture Notes in Computer generative deep learning models for biosignal synthesis and modality transfer.
Science, vol. 14122. Cham: Springer Nature Switzerland; 2024. p. 64–78. https:// IEEE J Biomed Health Inform 2023;vol. 27(2):968–79. https://doi.org/10.1109/
doi.org/10.1007/978-3-031-48593-0_5. Medical Image Understanding and JBHI.2022.3223777.
Analysis, vol. 14122. [70] Isasa I, et al. Effect of incorporating metadata to the generation of synthetic time
[50] Segal B, Rubin DM, Rubin G, Pantanowitz A. Evaluating the clinical realism of series in a healthcare context. 2023 IEEE 36th Int Symp Comput-Based Med Syst
synthetic chest X-rays generated using progressively growing GANs. SN Comput (CBMS) 2023:910–6. https://doi.org/10.1109/CBMS58004.2023.00341.
Sci 2021;vol. 2(4):321. https://doi.org/10.1007/s42979-021-00720-7. [71] Yang Z, Li Y, Zhou G. TS-GAN: time-series GAN for sensor-based health data
[51] Kausar T, Lu Y, Kausar A, Ali M, Yousaf A. SD-GAN: a style distribution transfer augmentation. ACM Trans Comput Healthc 2023;vol. 4(2):1–21. https://doi.org/
generative adversarial network for covid-19 detection through X-ray images. IEEE 10.1145/3583593.
Access 2023;vol. 11:24545–60. https://doi.org/10.1109/ACCESS.2023.3253282. [72] A. Seyfi, J.-F. Rajotte,R.T. Ng, Generating multivariate time series with COmmon
[52] Yao Y, et al. Conditional Variational Autoencoder with Balanced Pre-training for Source CoordInated GAN (COSCI-GAN).
Generative Adversarial Networks. 2022 IEEE 9th International Conference on [73] Dash S, Yale A, Guyon I, Bennett KP. Medical Time-Series Data Generation Using
Data Science and Advanced Analytics (DSAA). Shenzhen, China: IEEE; 2022. Generative Adversarial Networks. Artificial Intelligence in Medicine, vol. 12299.
p. 1–10. https://doi.org/10.1109/DSAA54385.2022.10032367. In: Michalowski M, Moskovitch R, editors. in Lecture Notes in Computer Science,
[53] Han Z, et al. Contrastive Diffusion Model with Auxiliary Guidance for Coarse-to- vol. 12299. Cham: Springer International Publishing; 2020. p. 382–91. https://
Fine PET Reconstruction, in Medical Image Computing and Computer Assisted doi.org/10.1007/978-3-030-59137-3_34. Artificial Intelligence in Medicine, vol.
Intervention – MICCAI 2023. vol. 14229. In: Greenspan H, Madabhushi A, 12299.
Mousavi P, Salcudean S, Duncan J, Syeda-Mahmood T, Taylor R, editors. Lecture [74] Alcaraz JML, Strodthoff N. Diffusion-based conditional ECG generation with
Notes in Computer Science, vol. 14229. Cham: Springer Nature Switzerland; structured state space models. Comput Biol Med 2023;vol. 163:107115. https://
2023. p. 239–49. https://doi.org/10.1007/978-3-031-43999-5_23. vol. 14229. doi.org/10.1016/j.compbiomed.2023.107115.
[54] Huang J, Wu Y, Wu H, Yang G. Fast MRI Reconstruction: How Powerful [75] Li H, Yu S, Principe J. Causal recurrent variational autoencoder for medical time
Transformers Are?. 2022 44th Annual International Conference of the IEEE series generation. Proc AAAI Conf Artif Intell 2023;vol. 37(7):8562–70. https://
Engineering in Medicine & Biology Society (EMBC). Glasgow, Scotland, United doi.org/10.1609/aaai.v37i7.26031.
Kingdom: IEEE; 2022. p. 2066–70. https://doi.org/10.1109/ [76] Petralia F, Wang L, Peng J, Yan A, Zhu J, Wang P. A new method for constructing
EMBC48229.2022.9871475. tumor specific gene co-expression networks based on samples with tumor purity
[55] Lin C-H, Lin C-S, Chou P-Y, Hsu C-C. An efficient data augmentation network for heterogeneity. Bioinformatics 2018;vol. 34(13):i528–36. https://doi.org/
out-of-distribution image detection. IEEE Access 2021;vol. 9:35313–23. https:// 10.1093/bioinformatics/bty280.
doi.org/10.1109/ACCESS.2021.3062187. [77] Mansouri M, Khakabimamaghani S, Chindelevitch L, Ester M. Aristotle: stratified
[56] Wei L, Yadav A, Hsu W. CTFlow: mitigating effects of computed tomography causal discovery for omics data. BMC Bioinforma 2022;vol. 23(1):42. https://doi.
acquisition and reconstruction with normalizing flows. Medical Image Computing org/10.1186/s12859-021-04521-w.
and Computer Assisted Intervention – MICCAI 2023, vol. 14226. In: Greenspan H, [78] Chunikhina E, Logan P, Kovchegov Y, Yambartsev A, Mondal D, Morgun A. The C-
Madabhushi A, Mousavi P, Salcudean S, Duncan J, Syeda-Mahmood T, Taylor R, SHIFT algorithm for normalizing covariances. IEEE/ACM Trans Comput Biol
editors. Lecture Notes in Computer Science, vol. 14226. Cham: Springer Nature Bioinform 2023;vol. 20(1):720–30. https://doi.org/10.1109/
Switzerland; 2023. p. 413–22. https://doi.org/10.1007/978-3-031-43990-2_39. TCBB.2022.3151840.
Medical Image Computing and Computer Assisted Intervention – MICCAI 2023, [79] Ovando-Vázquez C, Cázarez-García D, Winkler R. Target–Decoy MineR for
vol. 14226. determining the biological relevance of variables in noisy datasets. Bioinformatics
[57] Osuala R, et al. medigan: a Python library of pretrained generative models for 2021;vol. 37(20):3595–603. https://doi.org/10.1093/bioinformatics/btab369.
medical image synthesis. J Med Imaging 2023;vol. 10. [80] De Los Santos H, Bennett KP, Hurley JM. MOSAIC: a joint modeling methodology
[58] Dong S, et al. Flow-Based Visual Quality Enhancer for Super-Resolution Magnetic for combined circadian and non-circadian analysis of multi-omics data.
Resonance Spectroscopic Imaging, in Deep Generative Models. vol. 13609. In: Bioinformatics 2021;vol. 37(6):767–74. https://doi.org/10.1093/bioinformatics/
Mukhopadhyay A, Oksuz I, Engelhardt S, Zhu D, Yuan Y, editors. Lecture Notes in btaa877.
Computer Science, vol. 13609. Cham: Springer Nature Switzerland; 2022. [81] Fanaee-T H, Thoresen M. Multi-insight visualization of multi-omics data via
p. 3–13. https://doi.org/10.1007/978-3-031-18576-2_1. vol. 13609. ensemble dimension reduction and tensor factorization. Bioinformatics 2019;vol.
[59] He C, et al. HQG-Net: unpaired medical image enhancement with high-quality 35(10):1625–33. https://doi.org/10.1093/bioinformatics/bty847.
guidance. IEEE Trans Neural Netw Learn Syst 2024:1–15. https://doi.org/
10.1109/TNNLS.2023.3315307.
2909
V.C. Pezoulas et al. Computational and Structural Biotechnology Journal 23 (2024) 2892–2910
[82] Yang Z, Michailidis G. A non-negative matrix factorization method for detecting [93] Jiang Y, Zhang S, Chi J. Multi-modal brain tumor data completion based on
modules in heterogeneous omics multi-modal data. Bioinformatics 2016;vol. 32 reconstruction consistency loss. J Digit Imaging 2023;vol. 36(4):1794–807.
(1):1–8. https://doi.org/10.1093/bioinformatics/btv544. https://doi.org/10.1007/s10278-022-00697-6.
[83] Pham T, Jimenez C. Simulated linear test applied to quantitative proteomics. [94] Eckardt J-N, et al. Mimicking clinical trials with synthetic acute myeloid
Bioinformatics 2016;vol. 32(17):i702–9. https://doi.org/10.1093/ leukemia patients using generative artificial intelligence. Npj Digit Med 2024;vol.
bioinformatics/btw440. 7(1):76. https://doi.org/10.1038/s41746-024-01076-x.
[84] Cusworth S, Gkoutos GV, Acharjee A. A novel generative adversarial networks [95] Haleem MS, Ekuban A, Antonini A, Pagliara S, Pecchia L, Allocca C. Deep-
modelling for the class imbalance problem in high dimensional omics data. BMC learning-driven techniques for real-time multimodal health and physical data
Med Inform Decis Mak 2024;vol. 24(1):90. https://doi.org/10.1186/s12911-024- synthesis. Electronics 2023;vol. 12(9):1989. https://doi.org/10.3390/
02487-2. electronics12091989.
[85] Ahmed KT, Sun J, Cheng S, Yong J, Zhang W. Multi-omics data integration by [96] Zhou X, et al. Multimodality MRI synchronous construction based deep learning
generative adversarial network. Bioinformatics 2021;vol. 38(1):179–86. https:// framework for MRI-guided radiotherapy synthetic CT generation. Comput Biol
doi.org/10.1093/bioinformatics/btab608. Med 2023;vol. 162:107054. https://doi.org/10.1016/j.
[86] Hess M, Hackenberg M, Binder H. Exploring generative deep learning for omics compbiomed.2023.107054.
data using log-linear models. Bioinformatics 2020;vol. 36(20):5045–53. https:// [97] Sun H, et al. Research on new treatment mode of radiotherapy based on pseudo-
doi.org/10.1093/bioinformatics/btaa623. medical images. Comput Methods Prog Biomed 2022;vol. 221:106932. https://
[87] Wieder C, Lai RPJ, Ebbels TMD. Single sample pathway analysis in metabolomics: doi.org/10.1016/j.cmpb.2022.106932.
performance evaluation and application. BMC Bioinforma 2022;vol. 23(1):481. [98] S. Biswal, P. Zhuang, A. Pyrros, N. Siddiqui, S. Koyejo, J. Sun, EMIXER: End-to-
https://doi.org/10.1186/s12859-022-05005-1. end Multimodal X-ray Generation via Self-supervision. arXiv, Jan. 15, 2021.
[88] Wieder C, et al. PathIntegrate: Multivariate modelling approaches for pathway- Accessed: May 23, 2024. [Online]. Available: 〈http://arxiv.org/abs/2007.0
based multi-omics data integration. PLOS Comput Biol 2024;vol. 20(3): 5597〉.
e1011814. https://doi.org/10.1371/journal.pcbi.1011814. [99] Z. Wang and J. Sun, “PromptEHR: Conditional Electronic Healthcare Records
[89] Andersson A, Lundeberg J. sepal: identifying transcript profiles with spatial Generation with Prompt Learning.” arXiv, Oct. 11, 2022. Accessed: May 23, 2024.
patterns by diffusion-based modeling. Bioinformatics 2021;vol. 37(17):2644–50. [Online]. Available: 〈http://arxiv.org/abs/2211.01761〉.
https://doi.org/10.1093/bioinformatics/btab164. [100] Paulin G, Ivasic-Kos M. Review and analysis of synthetic dataset generation
[90] Wendland P, Birkenbihl C, Gomez-Freixa M, Sood M, Kschischo M, Fröhlich H. methods and techniques for application in computer vision. Artif Intell Rev 2023;
Generation of realistic synthetic data using multimodal neural ordinary vol. 56(9):9221–65. https://doi.org/10.1007/s10462-022-10358-3.
differential equations. Npj Digit Med 2022;vol. 5(1):122. https://doi.org/ [101] Y. Lu et al., Machine Learning for Synthetic Data Generation: A Review. arXiv,
10.1038/s41746-022-00666-x. Jun. 30, 2024. Accessed: Jul. 03, 2024. [Online]. Available: 〈http://arxiv.
[91] Bauer DF, et al. Generation of annotated multimodal ground truth datasets for org/abs/2302.04062〉.
abdominal medical image registration. Int J Comput Assist Radiol Surg 2021;vol. [102] X. Guo and Y. Chen, Generative AI for Synthetic Data Generation: Methods,
16(8):1277–85. https://doi.org/10.1007/s11548-021-02372-7. Challenges and the Future.” arXiv, Mar. 06, 2024. Accessed: Jul. 03, 2024.
[92] Lee SH. Natural language generation for electronic health records. Npj Digit Med [Online]. Available: 〈http://arxiv.org/abs/2403.04190〉.
2018;vol. 1(1):63. https://doi.org/10.1038/s41746-018-0070-0. [103] Robnik-Sikonja M. Data generators for learning systems based on RBF networks.
IEEE Trans Neural Netw Learn Syst 2016;vol. 27(5):926–38. https://doi.org/
10.1109/TNNLS.2015.2429711.
2910