Briefings in Bioinformatics, 2022, 23(5), 1–14
https://doi.org/10.1093/bib/bbac313
Problem Solving Protocol
GE-Impute: graph embedding-based imputation for
single-cell RNA-seq data
Xiaobin Wu and Yuan Zhou
Corresponding author: Yuan Zhou, Department of Biomedical Informatics, Center for Noncoding RNA Medicine, School of Basic Medical Sciences, Peking
University, 38 Xueyuan Rd, Haidian District, Beijing 100191, China. Tel.: 86-10-82801585; E-mail:
[email protected] Downloaded from https://academic.oup.com/bib/article/23/5/bbac313/6651303 by guest on 14 March 2025
Abstract
Single-cell RNA-sequencing (scRNA-seq) has been widely used to depict gene expression profiles at the single-cell resolution. However, its relatively high dropout rate often results in artificial zero expressions of genes and therefore compromised reliability of results. To overcome such unwanted sparsity of scRNA-seq data, several imputation algorithms have been developed to recover the single-cell expression profiles. Here, we propose a novel approach, GE-Impute, to impute the dropout zeros in scRNA-seq data with a graph embedding-based neural network model. GE-Impute learns the neural graph representation for each cell and reconstructs the cell–cell similarity network accordingly, which enables better imputation of dropout zeros based on the more accurately allocated neighbors in the similarity network. Gene expression correlation analysis between true expression data and simulated dropout data suggests significantly better performance of GE-Impute on recovering dropout zeros for both droplet- and plate-based scRNA-seq data. GE-Impute also outperforms other imputation methods in identifying differentially expressed genes and improving the unsupervised clustering on datasets from various scRNA-seq techniques. Moreover, GE-Impute enhances the identification of marker genes, facilitating the cell type assignment of clusters. In trajectory analysis, GE-Impute improves time-course scRNA-seq data analysis and the reconstruction of differentiation trajectories. The above results together demonstrate that GE-Impute could be a useful method to recover the single-cell expression profiles, thus enabling better biological interpretation of scRNA-seq data. GE-Impute is implemented in Python and is freely available at https://github.com/wxbCaterpillar/GE-Impute.
Keywords: single-cell RNA-sequencing, imputation, graph embedding, neural graph representation, similarity network
Xiaobin Wu is a PhD candidate at Department of Biomedical Informatics, Peking University. His research interest is single-cell omics data modeling and analysis.
Yuan Zhou is an associate investigator at Department of Biomedical Informatics, Peking University. His research interest includes transcriptomic and epitranscriptomic bioinformatics.
Received: April 13, 2022. Revised: June 27, 2022. Accepted: July 11, 2022
© The Author(s) 2022. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
Introduction
Single-cell RNA-sequencing (scRNA-seq) has emerged as a powerful technique to characterize cellular heterogeneity, advancing our understanding of human disease by measuring gene expression and transcriptome states at the single-cell resolution [1–3]. Based on the protocols for single-cell library generation, the methods for scRNA-seq can be summarized into two categories: 1) the plate-based methods, which sort one single cell into one well of a multi-well plate, such as Fluidigm C1 [4] and Smart-Seq2 [5]; 2) the droplet-based methods, which distribute each cell into a tiny droplet containing reagents and a specific barcode to uniquely quantify the transcriptome, such as 10x Genomics [6]. The plate-based methods are often of lower throughput but higher sensitivity, which enables the detection of more genes for each cell, while the droplet-based methods are of higher throughput but lower sensitivity in comparison with the plate-based methods. Despite rapid growth in the scale and robustness of the scRNA-seq protocols, dropout events (i.e. missed detection of gene expression which results in artificial zero expressions of many genes) in scRNA-seq data have remained the major obstacle in downstream functional analysis for either plate- or droplet-based scRNA-seq methods [7, 8]. Therefore, it is necessary to develop efficient algorithms to overcome this unwanted sparsity in the single-cell expression matrix and recover the incomplete expression profiles.
Recently, several computational methods have been established to impute the dropout zeros in scRNA-seq data. Generally, these computational methods can be categorized into three classes [9]. The first class consists of methods that focus on smoothing all expression values among the cells with similar expression profiles, such as MAGIC [10], kNN-smoothing [11] and DrImpute [12]. MAGIC imputes the dropout values of the scRNA-seq count matrix through data diffusion across similar cells. kNN-smoothing reconstructs the count matrix for each cell by smoothing the expression values of its k-nearest neighbors. DrImpute first performs cell clustering to identify similar cells and further imputes data by averaging the expression values from similar cells. The second class of methods reconstructs the expression matrix from the latent spaces estimated by low-rank matrix-based methods or deep-learning methods, like WEDGE [13], scScope [14], DeepImpute [15], scVI [16] and scGNN [17]. WEDGE is a recently proposed algorithm to impute gene expression matrix by using biased low-
rank matrix decomposition method. scScope is a deep-learning-based method that employs a recurrent network layer to iteratively impute scRNA-seq data matrix. DeepImpute and scVI both apply a deep neural network to learn expression patterns of the scRNA-seq data, thus allowing for fast imputation of missing values. scGNN is a graph neural network-based imputation method which learns cell–cell relationships in the graph autoencoder. The third class of methods models the sparsity using probabilistic models, such as SAVER [18]. SAVER imputes the gene expression of cells by estimating prior parameters for an empirical Bayes-like method with a Poisson least absolute shrinkage and selection operator (LASSO) regression model.
It is noteworthy that there are multiple aspects when assessing the performance of an imputation method. First, the primary objective of the scRNA-seq imputation methods is to recover the real expression profiles. Since there is no gold-standard single-cell expression matrix, one alternative approach is to randomly mask non-zero expression values to simulate the dropout zeros in scRNA-seq datasets and assess the similarity between imputation data and true background data. Moreover, there are several important tasks when analyzing scRNA-seq data, including identification of differentially expressed genes, unsupervised clustering of cells, marker genes-based cell-type annotation and trajectory analysis. Therefore, how one imputation method could facilitate these downstream analysis tasks is also of prominent biological significance. One recent benchmarking study [19] has systematically evaluated the imputation methods for recovering biological signals in downstream analysis. In this evaluation, MAGIC, SAVER and kNN-smoothing have outperformed other imputation methods in denoising scRNA-seq data. Nonetheless, the ability of these methods to improve the quality of analysis in all aspects is still lacking. Therefore, more robust methods should be developed to improve the feasibility of data analysis while preserving the original information of single-cell data as much as possible.
Here, we propose a new method, GE-Impute, to impute single-cell data matrix based on cell–cell similarity links predicted by graph embedding neural network. We first constructed a raw cell–cell similarity network (graph) and embedded all cells into low-dimension vectors using biased random walks and skip-gram model [20, 21]. After training feature embeddings, the new links between cells were predicted based on the embedded low-dimension features to obtain a reconstructed cell–cell similarity network. Finally, dropout zero values for each cell were estimated by smoothing the expression values of all neighbors in the reconstructed cell–cell similarity network. We applied GE-Impute on computer-simulated, droplet- and plate-based scRNA-seq data and compared it with nine other state-of-the-art methods, including MAGIC [10], kNN-smoothing [11], DrImpute [12], WEDGE [13], scScope [14], DeepImpute [15], scVI [16], scGNN [17] and SAVER [18], and found GE-Impute outperformed other imputation methods across multiple evaluations on data quality and analysis feasibility. Sections below will firstly describe the method framework of GE-Impute, and then provide the detailed performance evaluation results.
Methods
The design of GE-Impute pipeline
Graph embedding (or graph representation learning) emerges as a promising technique in various machine learning tasks, such as node classification, link prediction and community detection [22]. In recent years, the graph embedding method has been further applied to several important biological issues. For example, Zhao et al. [23] performed graph embedding on a heterogeneous network to predict novel drug-disease associations. Zhang et al. [24] utilized a deep learning model of graph convolution network and graph factorization to predict the potential association of circRNA and disease. In this study, we applied the node2vec [21] graph embedding algorithm, which simultaneously considers breadth-first sampling (BFS) and depth-first sampling (DFS) search strategies for random walk sampling. The skip-gram model is a neural network to create a word vector and is widely used in natural language processing. We further applied the skip-gram model to learn continuous feature representations for cells in the raw cell–cell similarity network based on the sampling walks. Since there are multiple effective graph representation learning methods, to test if other models work better than node2vec, we have also tried four other commonly used graph embedding methods including DeepWalk [22], LINE [25], Struct2vec [26] and SDNE [27]. We compared the Pearson correlation coefficients calculated by different models using the 10x Genomics cell lines dataset (see sections below) and found node2vec performs best in missing values recovering analysis (Supplementary Figure 1A). To further evaluate the accuracy of new links predicted by different methods in the cell–cell similarity network, we calculated the ratio of true links (i.e. links within the same cell type) to total predicted links (Supplementary Figure 1B). The result shows that all links predicted by node2vec and DeepWalk (100%) connect cells of the same cell type, while other methods may predict false links (i.e. links between different cell types). Since DeepWalk took more time and memory than node2vec when learning feature representations, we selected node2vec as the graph representation learning neural network model to develop our imputation task. Taking advantage of the node2vec algorithm, GE-Impute mapped cells into a low-dimension space and maximized the likelihood of co-occurrence of their neighbors in the network. The similarities among cells could be re-calculated from low-dimension feature representations to predict new link-neighbors for the cells and reconstruct the cell–cell similarity network. Finally, imputation for scRNA-seq
Figure 1. Overall workflow of GE-Impute algorithm pipeline. GE-Impute constructs a raw cell–cell similarity network based on Euclidean distance. For
each cell, it simulates a random walk of fixed length using BFS and DFS strategy. Next, graph embedding-based neural network model was employed to
train the embedding matrix for each cell based on sampling walks. The similarity among cells could be re-calculated from embedding matrix to predict
new link-neighbors for the cells and reconstruct cell–cell similarity network. Finally, GE-Impute imputes the dropout zeros for each cell by averaging
the expression value of its neighbors in reconstructed similarity network.
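As a rough illustration of the first and last steps named in the Figure 1 caption, the following Python sketch builds a symmetric KNN adjacency matrix from Euclidean distances on Freeman-Tukey-transformed counts and then fills zeros by neighbor averaging. It is a minimal sketch under stated assumptions, not the GE-Impute implementation: the function names, the toy matrix and k = 1 are hypothetical, the embedding and link-prediction steps are omitted (so the raw KNN network stands in for the reconstructed one), and averaging is applied to the raw counts.

```python
import numpy as np


def freeman_tukey(counts):
    """Freeman-Tukey transform: Y = sqrt(X) + sqrt(X + 1)."""
    counts = np.asarray(counts, dtype=float)
    return np.sqrt(counts) + np.sqrt(counts + 1.0)


def knn_adjacency(expr, k):
    """Symmetric 0/1 adjacency over cells (rows of `expr`).

    W[i, j] = 1 if cell i is among the k nearest neighbours of cell j
    or cell j is among the k nearest neighbours of cell i, else 0.
    """
    n_cells = expr.shape[0]
    # Squared Euclidean distances; they rank neighbours like Euclidean ones.
    sq = np.sum(expr ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * expr @ expr.T
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :k]
    W = np.zeros((n_cells, n_cells), dtype=int)
    W[np.repeat(np.arange(n_cells), k), nearest.ravel()] = 1
    return np.maximum(W, W.T)  # the "or" rule makes W symmetric


def impute_zeros(counts, W):
    """Fill zero entries with the mean of the same gene over neighbours in W.

    Non-zero entries are left unchanged. Assumption: averaging is done on
    the raw counts and W is the network used for imputation.
    """
    counts = np.asarray(counts, dtype=float)
    degree = W.sum(axis=1, keepdims=True)
    neighbour_mean = (W @ counts) / np.maximum(degree, 1)
    return np.where(counts == 0, neighbour_mean, counts)


# Hypothetical toy data: 5 cells (rows) x 3 genes (columns).
counts = np.array([
    [5, 0, 3],
    [4, 2, 3],
    [6, 2, 0],
    [0, 9, 9],
    [1, 8, 0],
])
W = knn_adjacency(freeman_tukey(counts), k=1)
imputed = impute_zeros(counts, W)
```

In this toy case a zero stays zero when every neighbor of the cell also has a zero for that gene, which is one reason the choice of neighbors matters for the quality of the imputation.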
expression data matrix was implemented based on the reconstructed cell–cell similarity network. The workflow of GE-Impute is summarized in Figure 1.
More specifically, we firstly normalized the raw scRNA-seq count matrix using the Freeman-Tukey transform to reduce the technical variance.
$$Y = \sqrt{X} + \sqrt{X + 1}$$
where X denotes raw expression value and Y represents normalized value. The Freeman-Tukey transform [28] was proposed to stabilize the variance of Poisson-distributed variables and was verified to outperform the regular logarithm transcript per million (log-TPM) transform when calculating cell–cell distance [11]. The Euclidean distances between cells were calculated and the adjacency matrix of the raw cell–cell similarity network was established based on the k nearest neighbors (KNN) of each cell:
$$W_{ij} = W_{ji} = \begin{cases} 0, & C_i \notin \mathrm{KNN}(C_j) \text{ and } C_j \notin \mathrm{KNN}(C_i) \\ 1, & C_i \in \mathrm{KNN}(C_j) \text{ or } C_j \in \mathrm{KNN}(C_i) \end{cases}$$
As for the sampling strategy, the biased random walk was used to explore the neighbors considering both breadth-first and depth-first sampling strategy (Figure 1). Let G = (V, E) be the raw cell–cell similarity network. Given a source node u, let $IM_i$ be the ith intermediate cell in a sampling walk of given length L. The transition probability is defined as follows:
$$P(IM_i = x \mid IM_{i-1} = v) = \begin{cases} \dfrac{w_{vx}}{Z}, & \text{if } (v, x) \in E \\ 0, & \text{otherwise} \end{cases}$$
where $w_{vx}$ denotes transition probability between cell v and cell x, and Z is the normalizing constant. To achieve a moderate sampling strategy, two parameters p and q were used to calculate the bias of the walk. Let t be the cell visited immediately before v, and suppose the walk has just traversed edge (t, v). For a candidate next cell x, the bias α is defined as follows:
$$\alpha_{pq}(t, x) = \begin{cases} 1/p, & \text{if } d_{tx} = 0 \\ 1, & \text{if } d_{tx} = 1 \\ 1/q, & \text{if } d_{tx} = 2 \end{cases}$$
where $w_{vx} = \alpha_{pq}(t, x)$ and $d_{tx}$ denotes the distance between cell t and cell x. The values p and q control the bias of walks. If p > 1, the walk strategy is biased toward searching cells away from t. If p < 1, the walk strategy is biased toward revisiting t. If q > 1, the walk strategy is biased toward breadth-first sampling. If q < 1, the walk strategy is biased toward depth-first sampling. By default, p and q are set to 0.25 and 4, respectively, according to the preliminary optimization. After sampling walks from the similarity network, GE-Impute trained feature representations for each cell using the skip-gram model (Figure 1). The model aims to optimize the following objective function:
$$\max_{S} \sum_{u \in V} \log \Pr\left(N_g(u) \mid S(u)\right)$$
where S(u) is the mapping function from cells to feature representations and $N_g(u)$ is defined as the neighborhood set of node u derived from sampling function g. To make this optimization solvable, the assumptions of conditional independence and symmetry in feature space were proposed, which simplify the objective function [29] as follows:
$$\max_{S} \sum_{u \in V} \left[ -\log Z_u + \sum_{v_i \in N_g(u)} S(v_i) \cdot S(u) \right],$$
where $Z_u = \sum_{w \in V} e^{S(u) \cdot S(w)}$.
The learned features were used to predict new links and therefore reconstruct the cell–cell similarity network. In detail, for one cell i, the distances to all linked cells were calculated. Let $E_i$ represent the number of its neighbor links with other cells in the initial adjacency matrix W. We ranked the distance scores in ascending order and the top $E_i$ neighbors of cell i were described as ‘features-related neighbors’ ($N_{feature}$). By further combining with W, a union of neighbor links of cell i was used to impute its dropout values. For a specific gene y that got 0 value in the raw expression data matrix, GE-Impute filled it by averaging the expression data of gene y in its neighbors:
$$\mathrm{Expression}_{C_i}^{y=0} = \mathrm{average}\left(\mathrm{Expression}_{C_n}^{y}\right)$$
where $C_n \in N_{feature} \cup \{V_j \mid W_{i,j} = 1\}$.
To demonstrate the improvement of GE-Impute by the graph neural network model, we compared the performance of GE-Impute with the original similarity network derived from the KNN algorithm as well as the raw features from node2vec. The result shows that neither the raw KNN similarity network nor the raw learning features obtained by node2vec alone performs as well as GE-Impute in missing values recovering analysis on either the 10x Genomics dataset or the Fluidigm C1 dataset (Supplementary Figure 2). Moreover, we tested whether excluding outlier cells when averaging expression values would help improve the imputation performance. The interquartile range (IQR) metric is used to define outlier cells. Instead of averaging all the similar cells, the cells whose expression levels are more than Median + 1.5 × IQR or less than Median - 1.5 × IQR are removed and the average expression value is calculated from the remaining cells. However, we did not observe a significant improvement in the missing value recovering analysis (Supplementary Figure 3). Therefore, to simplify the algorithm, we did not add this procedure to our imputation model.
For parameter setting in GE-Impute, we have run an optimization for p and q (i.e. the two bias parameters of walks), L (i.e. the length of each random walk), NW (i.e. the number of random walks) and WS (i.e. the window size). For p and q, different combinations of values (from 0.25, 0.5, 1, 2 and 4) are used to explore the most suitable combination of p and q (Supplementary Table 1). For parameters L, NW and WS, we calculate performance in missing value recovering analysis over different ranges of values (Supplementary Table 1). According to the optimization results, we have determined the values of those parameters in the GE-Impute model, with p = 0.25, q = 4, L = 5, NW = 20, WS = 3.
Dataset collection and imputation
The scRNA-seq and bulk RNA-seq datasets used to test the performance of GE-Impute are summarized in Supplementary Table 2. Notably, these datasets have been commonly used in previous studies and have proved to be effective for imputation methods benchmarking [19, 30]. To comprehensively evaluate the performance of GE-Impute for different scRNA-seq protocols, we considered datasets from droplet-based methods (e.g. 10x Genomics) and plate-based methods (e.g. Fluidigm C1 and Smart-Seq2). Several datasets were used to perform missing value recovering analysis and differentially expressed gene identification, including a 10x Genomics scRNA-seq data of five cell lines (i.e. A549, H1975, H2228, H838 and HCC828) [31] and a Fluidigm C1 scRNA-seq data of five cell lines (i.e. A549, GM12878, H1, K562 and IMR90) [32]. For the 10x Genomics scRNA-seq data, the corresponding bulk RNA-seq data including A549, H1975, H2228, H838 and HCC828 was downloaded
from GEO database [33], and for the Fluidigm C1 scRNA-seq data, the corresponding bulk RNA-seq data including A549, GM12878, H1, K562 and IMR90 was downloaded from ENCODE database [34] (Supplementary Table 2). One dataset containing six cell types of peripheral blood mononuclear cells (PBMC) [6] from 10x Genomics and one dataset including four conventional dendritic cell subtypes from Smart-Seq2 [35] were utilized to perform clustering analysis and marker genes visualization. One dataset which contains 1529 cells from five stages of human preimplantation embryonic development from E3 to E7 was used to perform trajectory analysis [36]. In addition to experimentally derived datasets, we also considered several scRNA-seq datasets simulated by the splatSimulate function in the Splatter R package [37]. One dataset including five cell groups without dropout rate was simulated to perform missing value recovering analysis and two other datasets with batch effects were simulated to evaluate batch effect benchmark. For scRNA-seq data, only cells with at least 500 detected genes and genes expressed in at least 10% of cells were retained to ensure the quality of data. We compared GE-Impute with several extensively used imputation methods, including DeepImpute, DrImpute, MAGIC, kNN-smoothing, SAVER, scGNN, scScope, scVI and WEDGE. All methods were implemented in R version 4.0.3 or Python version 3.8.8 with respective default parameters.
Comparison of imputation methods for dropout zero recovering and differentially expressed gene identification
To evaluate our imputation method for recovering the dropout zeros in scRNA-seq data, the similarity between imputation data and true background data was calculated based on Pearson correlation analysis. Firstly, we randomly masked 10%, 20% and 30% of non-zero values for each cell in the 10x Genomics dataset, Fluidigm C1 dataset and Splatter-generated dataset to simulate the dropout events in scRNA-seq data. After imputation for the simulated dropout data, the raw data and imputed data were both adjusted for library size with the NormalizeData function in the Seurat 4.0 R package. The Pearson correlation coefficients for each cell between imputation data and true background data were calculated. To test the ability of GE-Impute on capturing and identifying differentially expressed genes (DEGs) among different cell states, we regarded DEGs identified by bulk RNA-seq data as the ‘gold standard’ gene set following the idea of the previous benchmarking [19]. We first identified DEGs between all pairs of cell types for bulk RNA-seq data using package DESeq2 [38] in R version 4.0.3. Genes with absolute value of log2-fold change >1 and adjusted P-value <0.05 were retained and considered as ‘gold standard’ DEG sets from bulk data. For each pair of cell types in the (imputed) scRNA-seq data, the Seurat normalized log2-transformed expression profiles were used to identify DEGs. We applied the Wilcoxon Rank-Sum test [39] to calculate P-values for all genes and further corrected them using the Benjamini-Hochberg method. Genes with absolute value of log2-fold change >0.5 or 1 and FDR < 0.05 were identified as the single-cell DEG sets. To evaluate the similarity between bulk-derived gold-standard DEGs and single-cell DEGs, the Jaccard index [40] was used to measure the amount of overlap between these two gene sets and was defined as follows:
$$\mathrm{Jaccard}(B, S) = \frac{|B \cap S|}{|B \cup S|}$$
where B and S denote bulk-derived gold standard DEGs and single-cell DEGs, respectively.
Evaluation of GE-Impute for unsupervised clustering of cells
To investigate whether GE-Impute can outperform other methods in improving clustering of cells that belong to the same cell type or subtypes, we considered two datasets from the 10x Genomics and Smart-Seq2 platforms, respectively. The 10x Genomics dataset contains 61,213 sorted PBMCs [6] including CD14+ monocytes, CD19+ B cells, CD34+ cells, CD4+ T cells and CD8+ cytotoxic T cells; while the Smart-Seq2 dataset contains 957 conventional dendritic cells [35] with four predefined subtypes (i.e. blood pre-cDCs, cord pre-cDCs, CD141+ cDC and CD1c+ cDC). We employed the Seurat 4.0 pipeline [41], the most commonly used scRNA-seq data analysis pipeline, to perform cell clustering for the imputed expression matrix of each method. Briefly, the expression profiles were first normalized using the NormalizeData function with default parameters. Then highly variable genes were identified using the FindVariableGenes function and scaled by the ScaleData function. The top 30 significant principal components were selected to perform Louvain clustering using the FindNeighbors function and FindClusters function. For a comparable configuration, we adjusted the resolution parameter of the FindClusters function until the number of clusters reached the number of predefined cell types or subtypes. For different imputation methods, the expression characteristics of imputed data are different, so their final resolutions to get the same number of clusters are also different. The exact resolution parameters of clustering for each method are summarized in Supplementary Table 4.
Purity, Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) are commonly used indices to compare clustering results against known labels. Therefore, the cluster labels and known cell (sub)type labels were employed to evaluate the performance of each imputation method in improving unsupervised clustering. The Purity was defined as the percent of the total number of cells that were classified correctly and was implemented by the purity function in the NMF package [42]. Let K be the number of clusters inferred from N cells, $P_i$ be the
cluster i and $T_j$ be the true cell (sub)type j. Formally:
$$\mathrm{Purity}(P, T) = \frac{1}{N} \sum_{i=1}^{K} \max_j \left| P_i \cap T_j \right|$$
The Adjusted Rand Index measures the similarity between two clustering results by counting pairs of cells that are assigned to the same or different clusters in the predicted and true clustering results:
$$\mathrm{ARI}(P, T) = \frac{\sum_{i,j} \binom{N_{ij}}{2} - \left[ \sum_i \binom{N_i}{2} \sum_j \binom{N_j}{2} \right] / \binom{N}{2}}{\frac{1}{2} \left[ \sum_i \binom{N_i}{2} + \sum_j \binom{N_j}{2} \right] - \left[ \sum_i \binom{N_i}{2} \sum_j \binom{N_j}{2} \right] / \binom{N}{2}}$$
where $N_{ij}$ denotes the number of cells of the cell type label $T_j$ assigned to cluster $P_i$. $N_i$ is the number of cells in cluster $P_i$, while $N_j$ is the number of cells in cell type $T_j$. The ARI was calculated using the adjustedRandIndex function in the mclust package [43]. The Normalized Mutual Information (NMI) is also a good measure for estimating clustering quality and is implemented by the NMI function in the aricode package (https://github.com/jchiquet/aricode). Let L be the number of true cell (sub)type labels and the NMI is defined as:
$$\mathrm{NMI}(P, T) = \frac{\sum_{i=1}^{L} \sum_{j=1}^{K} N_{ij} \log \frac{N \cdot N_{ij}}{N_i \cdot N_j}}{\max\left( -\sum_{i=1}^{L} N_i \log \frac{N_i}{N},\; -\sum_{j=1}^{K} N_j \log \frac{N_j}{N} \right)}$$
where the numerator denotes the mutual information between P and T and the denominator denotes the entropy of P and T.
Results
GE-Impute shows effective improvement in recovering missing value in scRNA-seq data
To evaluate the ability of GE-Impute in imputing the missing values of scRNA-seq expression data, nine other state-of-the-art imputation methods (DeepImpute, MAGIC, kNN-smoothing, SAVER, scGNN, scVI, WEDGE, DrImpute and scScope) were used to perform comparison analysis. Three datasets, from a droplet-based protocol (10x Genomics), a plate-based protocol (Fluidigm C1) and Splatter simulation, were used to systematically evaluate the performance of GE-Impute on various scRNA-seq data. We simulated the dropout events in scRNA-seq data by randomly masking 10%, 20% and 30% of non-zero values for each cell. The three simulated dropout datasets were first imputed following each method’s guideline (see Methods). Pearson correlation coefficients (PCCs) between true background data and imputation data were calculated to measure the difference, where larger PCCs indicate better performance of the imputation method. As shown in Figure 2, GE-Impute has shown excellent performance in recovering the missing values and provides higher PCCs than any other methods in all cell lines (groups) of three datasets. We observed that DrImpute performed better on 10x Genomics dataset imputation than on the Fluidigm C1 and simulated datasets, while DeepImpute performed better on the Fluidigm C1 and simulated datasets than on the 10x Genomics dataset, indicating these two methods are applicable to scRNA-seq data from different protocols. We also found that several methods, such as scGNN and scScope, show more compromised performance in missing value imputation. scGNN utilized the imputation autoencoder and pre-processed matrix to recover the gene expression matrix, which may lead to an exaggerated deviation between the raw data matrix and the imputation data matrix. scScope allows the recurrent network layer to perform imputation on dropout entries iteratively, which may overcorrect the raw expression data. Overall, GE-Impute can successfully recover missing values in scRNA-seq data and obtain an imputed matrix similar to the real data matrix.
GE-Impute promotes correct identification of differentially expressed genes in downstream analysis
One of the important tasks in scRNA-seq downstream analysis is to identify cell type-specific DEGs under various conditions [44] (e.g. healthy versus disease samples). Through DEG analysis, one can further explore which biological pathways are related to the variation between cells under different conditions. Therefore, accurate acquisition of DEGs in the context of dropout noises is one of the hallmarks to demonstrate the biological significance of imputation results. Here, considering the higher sensitivity of bulk RNA-seq technology in detecting differential expression at the transcriptome scale, DEGs that were calculated based on bulk RNA-seq data were treated as the “gold standard”. The DEGs of bulk RNA-seq and scRNA-seq data were determined following the method described above (see Methods). Also, we compared GE-Impute with other imputation methods for capturing DEGs of bulk RNA-seq. The Jaccard index was used to measure the overlap between DEGs from scRNA-seq data and bulk RNA-seq data. We also measured the performance of raw data (no imputation) in identifying DEGs of bulk RNA-seq as the baseline. As a result, GE-Impute can significantly improve the performance of DEG identification compared with other imputation methods as well as the no imputation baseline when considering different fold change thresholds (Figure 3 and Supplementary Figure 4). In the 10x Genomics dataset, GE-Impute can improve the identification of DEGs in all pairs of cell types compared with the no imputation baseline, while kNN-smoothing and DrImpute can only improve several pairs of cell types. In the Fluidigm C1 dataset, in addition to GE-Impute, scVI can also facilitate DEG identification compared with the no imputation baseline. Since the identification of DEGs has a great impact on downstream analysis, it is crucial to reduce false positives and false negatives due to the technical noises. In our results, GE-Impute can significantly promote the
Figure 2. Performance comparison of GE-Impute with other imputation methods in recovering missing values in scRNA-seq data. The barplot shows a comparison of Pearson correlation coefficients between real and imputed expression profiles for different imputation methods on the 10x Genomics dataset (A), Fluidigm C1 dataset (B) and simulated dataset (C), respectively. Different colors represent different imputation methods and each cell line is grouped accordingly.
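The evaluation behind Figure 2 (mask a fraction of each cell's non-zero values, impute, normalize for library size, then compute a per-cell Pearson correlation against the true data) can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: the function names and the Poisson toy matrix are hypothetical, `log_normalize` is an assumed stand-in for Seurat's default log-normalization (scale factor 10,000), and the masked matrix is scored directly where an imputed matrix would normally be used.

```python
import numpy as np


def mask_nonzero(counts, fraction, rng):
    """Set a random `fraction` of each cell's non-zero entries to zero."""
    masked = np.array(counts, dtype=float)  # work on a copy
    for row in masked:  # each row is a view into `masked`
        nonzero = np.flatnonzero(row)
        n_mask = int(round(fraction * nonzero.size))
        if n_mask > 0:
            row[rng.choice(nonzero, size=n_mask, replace=False)] = 0.0
    return masked


def log_normalize(counts, scale_factor=1e4):
    """Library-size normalization followed by log1p (assumed stand-in)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    return np.log1p(counts / totals * scale_factor)


def per_cell_pearson(truth, estimate):
    """Pearson correlation between matching rows (cells) of two matrices."""
    t = truth - truth.mean(axis=1, keepdims=True)
    e = estimate - estimate.mean(axis=1, keepdims=True)
    denom = np.sqrt((t ** 2).sum(axis=1) * (e ** 2).sum(axis=1))
    return (t * e).sum(axis=1) / denom


# Hypothetical toy data: 4 cells x 50 genes drawn from a Poisson distribution.
rng = np.random.default_rng(0)
truth = rng.poisson(5, size=(4, 50)).astype(float)
masked = mask_nonzero(truth, 0.3, rng)

# Baseline score of the masked data; an imputed matrix would replace `masked`.
pcc = per_cell_pearson(log_normalize(truth), log_normalize(masked))
```

Scoring the masked matrix itself gives a no-imputation reference point against which the correlation obtained after imputation can be compared.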
identification of DEGs from bulk RNA-seq, indicating its potential in single cell RNA-seq analysis.
GE-Impute significantly improves the performance of unsupervised clustering of cells
Unsupervised clustering is essential for defining cell type heterogeneity and cell type annotation in scRNA-seq data analysis [45]. Mapping unbiased clusters to known cell types is one of the commonly used methods for cell type annotation, thus the clustering result would directly affect the accuracy of downstream interpretation [46]. Accordingly, we used the standard Seurat pipeline to cluster cells, and compared GE-Impute with other imputation methods on improving the performance of unsupervised clustering (see Methods). In this part of analysis, the known cell types or cell subtypes were regarded as the true labels and the clustering results were treated as predicted labels. Three indices were introduced to evaluate the consistency between the true and predicted labels, including Purity, ARI and NMI (see Methods). We first explored the effect of GE-Impute on the droplet-based 10x Genomics PBMC dataset including CD14+ monocytes, CD19+ B cells, CD34+ cells, CD56+ cells, CD4+ T cells and CD8+/cytotoxic T cells. The clustering results were visualized by the uniform manifold approximation and projection (UMAP) method. In the result of no imputation data, several CD8 T cells (Cluster 1) are dispersedly distributed on the UMAP plot and show prominent discrepancy between unsupervised clustering labels and true cell type labels (Figure 4A). In contrast, in the UMAP plot of GE-Impute imputed data, cells of the same cell type are more cohesively distributed and show better consistency between unsupervised clustering results and true cell type labels (Figure 4B). Cluster 1 and Cluster 5 are dominated by CD8 T cells and CD4 T cells, respectively, though they distribute very closely to each other on the UMAP plot. Moreover, the clustering results of all imputation methods on 10x Genomics data were quantitatively evaluated using the above-mentioned three indices (Figure 4C and Table 1). In general, most imputation methods can improve the unsupervised clustering compared with no imputation (Purity = 0.757, ARI = 0.563, NMI = 0.674), except kNN-smoothing (Purity = 0.809, ARI = 0.425, NMI = 0.552) and scScope (Purity = 0.755, ARI = 0.578, NMI = 0.649) which show reduced performance for one or more indices. The result also shows that GE-Impute could achieve the best clustering accuracies among all imputation methods, with Purity = 0.972, ARI = 0.936 and NMI = 0.894. Meanwhile, WEDGE (Purity = 0.968, ARI = 0.927, NMI = 0.886) and DrImpute (Purity = 0.965, ARI = 0.899, NMI = 0.854) also performed well in improving the clustering accuracy, although their performances in missing value recovering and differential gene identification are not as satisfactory, suggesting these methods are particularly suitable for cell clustering analysis. The clustering accuracy can
Figure 3. Comparison of different imputation methods on identifying differentially expressed genes in 10x Genomics dataset and Fluidigm C1 dataset.
Heatmap showing the value of the Jaccard index between bulk (as the gold standard) and single-cell DEGs (log2-fold change >0.5) on the 10x Genomics
dataset (A) and Fluidigm C1 dataset (B), respectively. Each row represents a pair of cell types and each column represents an imputation method.
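The Jaccard index reported in Figure 3 measures the overlap between two DEG sets. A minimal sketch of the calculation is given below. The gene names and fold-change values are hypothetical, the helpers `select_degs` and `jaccard_index` are illustrative, and applying the 0.5 threshold to the absolute log2 fold change (without any significance filter) is an assumption made only for this example.

```python
def select_degs(log2_fold_changes, threshold=0.5):
    """Return genes whose absolute log2 fold change exceeds the threshold.

    Using the absolute value is an assumption of this sketch.
    """
    return {gene for gene, lfc in log2_fold_changes.items() if abs(lfc) > threshold}

def jaccard_index(set_a, set_b):
    """Jaccard index: size of the intersection divided by size of the union."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

# Hypothetical log2 fold changes for one pair of cell types (toy values).
bulk_lfc = {"geneA": 1.2, "geneB": -0.9, "geneC": 0.1, "geneD": 0.7}
single_cell_lfc = {"geneA": 0.8, "geneB": -0.2, "geneC": 0.6, "geneD": 1.1}

bulk_degs = select_degs(bulk_lfc)                # gold standard set
single_cell_degs = select_degs(single_cell_lfc)  # set from (imputed) scRNA-seq data
overlap = jaccard_index(bulk_degs, single_cell_degs)
print(f"Jaccard index: {overlap:.2f}")
```

A value of 1 would mean that the single-cell DEG set is identical to the bulk DEG set, and a value of 0 would mean that the two sets share no genes.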
also be intuitively reflected by the UMAP plots. For example, in the clustering results of DeepImpute, MAGIC, SAVER, scGNN, scScope and scVI, a substantial fraction of CD4 T cells were wrongly assigned to the clusters of CD8 T cells, while kNN-smoothing showed a more dispersed distribution of clusters in the UMAP plot, suggesting that this algorithm significantly changed the cell clustering topology.

In comparison with droplet-based scRNA-seq methods like 10x Genomics, plate-based scRNA-seq platforms like Smart-Seq2 often result in scRNA-seq datasets with far fewer cells, which are therefore more challenging for clustering analysis. Here, a Smart-Seq2 dataset which contains four dendritic cell subtypes was introduced for the clustering accuracy assessment. Similar to the aforementioned method, we treated the clustering results as predicted labels and the known cell subtypes as true labels. Notably, GE-Impute (Purity = 0.727, ARI = 0.272, NMI = 0.391) showed better clustering accuracy than the raw data (Purity = 0.601, ARI = 0.147, NMI = 0.343) and the other nine imputation methods for nearly all cases except the NMI index of DrImpute (Supplementary Figure 5 and Table 1). DrImpute (Purity = 0.704, ARI = 0.205, NMI = 0.440) performed well in
Figure 4. Performance comparison of unsupervised clustering on 10x Genomics PBMC dataset. The UMAP plots show unsupervised clustering of
PBMC raw data (A) or GE-Impute data (B). The left sub-panel presents the cluster information of unsupervised clustering. The right sub-panel presents
the known cell type label information. Each color represents a cell type. (C) Unsupervised clustering for other imputation methods. The upper sub-panel
represents the unsupervised clustering information and the lower sub-panel represents the known cell labels. Intuitively, higher consistency between
clustering labels and known cell type labels indicates a better cell clustering result.
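The consistency between clustering labels and known cell type labels can be quantified with Purity, ARI and NMI. The sketch below shows one common way to compute the three indices from a pair of label vectors in plain Python. The labels are toy values, the function names are illustrative, and the NMI variant (arithmetic mean of the two entropies as normalizer) is an assumption that may differ from the exact definition given in the Methods.

```python
import math
from collections import Counter

def contingency(true_labels, pred_labels):
    """Count cells for every (predicted cluster, true cell type) pair."""
    return Counter(zip(pred_labels, true_labels))

def purity(true_labels, pred_labels):
    """Fraction of cells that belong to the majority true type of their cluster."""
    best = {}
    for (cluster, _), count in contingency(true_labels, pred_labels).items():
        best[cluster] = max(best.get(cluster, 0), count)
    return sum(best.values()) / len(true_labels)

def pairs(k):
    """Number of unordered pairs that can be drawn from k items."""
    return k * (k - 1) / 2

def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index computed from pair counts (needs at least two cells)."""
    n = len(true_labels)
    sum_cells = sum(pairs(c) for c in contingency(true_labels, pred_labels).values())
    sum_true = sum(pairs(c) for c in Counter(true_labels).values())
    sum_pred = sum(pairs(c) for c in Counter(pred_labels).values())
    expected = sum_true * sum_pred / pairs(n)
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        return 1.0
    return (sum_cells - expected) / (maximum - expected)

def entropy(counts, n):
    """Shannon entropy (natural log) of a label distribution."""
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def normalized_mutual_info(true_labels, pred_labels):
    """Mutual information divided by the mean of the two label entropies."""
    n = len(true_labels)
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)
    mi = 0.0
    for (cluster, cell_type), count in contingency(true_labels, pred_labels).items():
        mi += (count / n) * math.log(count * n / (pred_counts[cluster] * true_counts[cell_type]))
    denom = (entropy(true_counts, n) + entropy(pred_counts, n)) / 2
    return mi / denom if denom > 0 else 1.0

# Toy example: six cells, two known types, two predicted clusters (assumption).
true_types = ["CD4", "CD4", "CD4", "CD8", "CD8", "CD8"]
clusters = [0, 0, 1, 1, 1, 1]

print(f"Purity = {purity(true_types, clusters):.3f}")
print(f"ARI    = {adjusted_rand_index(true_types, clusters):.3f}")
print(f"NMI    = {normalized_mutual_info(true_types, clusters):.3f}")
```

All three indices reach 1 when the clusters reproduce the known cell types exactly, and ARI is close to 0 for a random assignment, which is why ARI can rank methods differently from Purity.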
both the 10x Genomics dataset and the Smart-Seq2 dataset, which could be attributed to its cell clustering-based expression imputation nature that enhances the intra-cluster expression homogeneity [12]. On the other hand, DeepImpute, MAGIC, scGNN, scVI and WEDGE did not perform as well on the Smart-Seq2 dataset as they did on the 10x Genomics dataset, suggesting these methods are not the recommended choice for clustering analysis of plate-based scRNA-seq data. In all, GE-Impute can significantly improve the clustering analysis accuracy of scRNA-seq data and make the expression characteristics of different cell types more straightforward, on both droplet- and plate-based scRNA-seq datasets.

Batch effect is common when analyzing scRNA-seq datasets and emerges as an obstacle in downstream analysis. Therefore, effective batch correction is vital in scRNA-seq data analysis. To investigate if GE-Impute
affects the batch effect benchmarks in scRNA-seq data analysis, we applied the FindIntegrationAnchors and IntegrateData functions of the Seurat 4.0 R package to perform batch correction and evaluate the performance of raw and imputed data. We considered both experimentally derived [6] and Splatter-simulated [37] datasets with different batches (Supplementary Figure 6 and Supplementary Table 2). The performance was evaluated using local inverse Simpson’s index (LISI) [47] and adjusted Rand index (ARI) [48]. LISI and ARI are two commonly used metrics to measure batch mixing. A higher LISI indicates superior batch correction while a low ARI denotes superior batch mixing. The result shows that GE-Impute does not significantly affect the batch correction of scRNA-seq datasets. LISI and ARI calculated from GE-Impute data are almost equivalent to those from the raw non-imputed data (Supplementary Table 5).

Table 1. Performance comparison of different imputation methods with multiple clustering evaluation indices

Index            Purity   ARI     NMI
10x Genomics_PBMC dataset
GE-Impute        0.972    0.936   0.894
DeepImpute       0.874    0.666   0.801
DrImpute         0.965    0.899   0.854
kNN-smoothing    0.809    0.425   0.552
MAGIC            0.865    0.657   0.794
SAVER            0.866    0.651   0.772
scGNN            0.865    0.522   0.701
scScope          0.755    0.578   0.649
scVI             0.874    0.664   0.802
WEDGE            0.968    0.927   0.886
Raw              0.757    0.563   0.674
Smart-Seq2_cDC dataset
GE-Impute        0.727    0.272   0.391
DeepImpute       0.596    0.095   0.250
DrImpute         0.704    0.205   0.440
kNN-smoothing    0.587    0.147   0.211
MAGIC            0.581    0.029   0.240
SAVER            0.615    0.157   0.332
scGNN            0.575    0.061   0.223
scScope          0.531    0.028   0.032
scVI             0.527    0.029   0.207
WEDGE            0.553    0.120   0.222
Raw              0.601    0.147   0.343

GE-Impute enhances the identification and visualization of cell type marker genes

To investigate whether GE-Impute could help facilitate cell type annotation through enhancing the expression of cell type marker genes, we used the Seurat package [41] to identify the expression of several key marker genes. We first explored the expression of CD14 (marker gene for CD14+ cells) and CD8A (marker gene for CD8+ T cells) in 10x_Genomics PBMC raw data (Figure 5A). As shown in the violin plot, CD14 was identified as a marker gene in Cluster 5 while CD8A was not significantly enriched in any cluster, and their expression signatures were not obvious in the feature plot. After GE-Impute processing, both CD14 and CD8A were found to be stably expressed in the CD14+ cell cluster and the CD8+ T cell cluster, respectively (Figure 5B). The feature plot shows that the expression characteristic of these two marker genes is also more distinguishable between different clusters. Furthermore, the expression of several other marker genes is found to be enhanced after GE-Impute processing, such as the CD1C marker for CD19+ B cells, the GZMH and PTGDS markers for CD56+ NK cells, the EGFL7 marker for CD34+ cells, and the CORO1B marker for CD4+ T cells (Figure 5C). In addition to the 10x Genomics dataset, we also observed significantly elevated expression of marker genes in dendritic cell subtypes in scRNA-seq data from Smart-Seq2, such as GBP1 for blood pre-cDCs, CCL23 for cord pre-cDCs, and PPY and ERICH5 for CD1c+ dendritic cells (Supplementary Figure 7). Although there are many outstanding methods and software available to automatically annotate cell types [49–51], clear expression of marker genes is still an essential feature for cell annotation in scRNA-seq data analysis [52]. These results suggest that GE-Impute can help identify cell types for scRNA-seq data by enhancing the expression of marker genes, thereby improving the efficiency of cell type annotation analysis.

GE-Impute improves the performance of cell trajectory inference

Trajectory analysis is also one of the important tasks in scRNA-seq data analysis. To evaluate if GE-Impute can improve the accuracy of trajectory inference and pseudotime ordering, we utilized a dataset containing 1529 cells from five stages of human preimplantation embryonic development from E3 to E7 [36]. We applied GE-Impute and the other nine imputation methods to the raw data and then reconstructed the trajectory using the SlingShot R package [53]. The results demonstrate that GE-Impute can improve cell trajectory reconstruction compared to the raw data in both t-SNE and UMAP reduction plots (Figure 6A and Supplementary Figure 8). Besides GE-Impute, some other imputation methods (but not all the methods) can also help to reconstruct the cell trajectory, such as DrImpute, SAVER and scGNN in the t-SNE plot. In the UMAP reduction plot, in contrast, MAGIC, scVI and WEDGE can also improve the trajectory inference but DrImpute fails to reconstruct the trajectory. To quantitatively compare their performance in improving the accuracy of pseudotime inference, the consistency between the true-time labels (i.e. E3 to E7) and pseudotime ordering was measured by the Pearson correlation coefficients. Two widely used methods, SlingShot [53] and PAGA [54], were used to predict the pseudotime labels. We found that GE-Impute outperforms other methods in pseudotime inference when the analysis was conducted by SlingShot, and it ranks second only to scGNN when using PAGA. In contrast, the pseudotime ordering of imputed data from MAGIC and kNN-smoothing cannot be inferred by PAGA, suggesting that these two methods overcorrect the transcriptome dynamics along the time course. In summary, these
Figure 5. GE-Impute facilitates identification of marker genes in specific cell types. Feature plot and violin plot showing CD14 (top panel) and CD8A
(lower panel) expression in (A) raw 10x_PBMC data or (B) GEImpute 10x_PBMC data. CD14 is the marker for CD14+ cell type and CD8A is the marker for
CD8+ T cell type. (C) Violin plot showing several marker genes for specific cell types in raw (top red panel) and GEImpute 10x_PBMC data (lower purple
panel), including CD1C (marker for CD19+ B cell), GZMH and PTGDS (marker for CD56+ NK cell), EGFL7 (marker for CD34+ stem cell), CORO1B (marker
for CD4+ T cell).
results demonstrate that GE-Impute can improve the performance of pseudotime inference.

Time and memory cost evaluation

To evaluate the efficiency of GE-Impute and other imputation algorithms, we measured the time and peak memory usage when imputing the aforementioned 10x Genomics scRNA-seq data of five cell lines (containing 3817 cells and 11 786 genes). We found that GE-Impute required only 39 s and 2242 MiB of memory to finish the imputation (Supplementary Figure 9A), which was comparable to the kNN-smoothing and MAGIC methods. Moreover, we applied GE-Impute to impute datasets of various sizes, ranging from 5 k to 50 k cells, which were sampled from the 10x Genomics PBMC dataset. The computational cost of GE-Impute is at a moderate level among all the methods (Supplementary Figure 9B and C), which indicates its effectiveness in the imputation task.

Discussion

Compared with bulk RNA-seq, dropout events are much more prevalent in scRNA-seq, resulting in a non-negligible impact on the accuracy of scRNA-seq analysis results. Generally, there are two approaches to solve this issue. The first is to capture more transcripts in scRNA-seq by improving the sensitivity of the sequencing platform, such as switching to plate-based platforms like Fluidigm C1 and Smart-Seq2. However, the cost per sample of these plate-based methods is much higher than that of droplet-based methods, and the library preparation of plate-based protocols is very complex. Besides, as also shown by the above analysis, plate-based scRNA-seq data are also more challenging for downstream cell clustering analysis. Therefore, another promising solution is to develop new bioinformatics methods to handle the sparsity and technical noise in scRNA-seq data, where the simplest and most effective way is to impute the
Figure 6. GE-Impute enhances the inference of cell trajectory. The trajectories reconstructed by SlingShot from raw and imputed scRNA-seq data (A).
Each color represents a specific time point. The barplot shows a comparison of Pearson correlation coefficients between true-time points and predicted
pseudotime labels which were calculated by SlingShot (B) and PAGA (C).
dropout zeros in the scRNA-seq data matrix, thus improving the flexibility and accuracy of downstream analysis.

In this study, we propose a novel imputation method, GE-Impute, based on graph embeddings. GE-Impute first constructs a similarity matrix to learn feature representations using a graph embedding neural network model and then predicts new links for the similarity network based on the learned features. Indeed, previous studies have applied similarity matrices to perform scRNA-seq clustering, such as spectral clustering [55, 56]. Those methods rely on the similarity metrics for categorizing individual cells and show good performance in the clustering results, revealing the significance of the similarity matrix in scRNA-seq data analysis. Compared with the original KNN cell similarity network, GE-Impute has significantly improved the performance of imputation, which indicates the advantage of the predicted new links between connected cells with similar characteristics. Unlike other graph-based imputation methods, GE-Impute only imputes the dropout values and retains the original expression characteristics as much as possible, while other imputation methods may overcorrect
the data, such as kNN-smoothing and MAGIC. In the missing value recovery analysis, GE-Impute provides higher Pearson correlation coefficients than the other nine methods on both experimentally derived and computer-simulated datasets. Through differential expression analysis, we compared the degree of overlap between DEGs from scRNA-seq and bulk RNA-seq. The results demonstrate that GE-Impute performs best in identifying DEGs in downstream analysis on both the 10x Genomics dataset and the Fluidigm C1 dataset. Moreover, GE-Impute can significantly improve the unsupervised clustering of cells and promote cell type annotation through enhancing visualization of the cell type marker genes. Finally, GE-Impute can also improve the performance of cell trajectory analysis. During the above performance assessment, we also note that expression smoothing-based methods like kNN-smoothing and MAGIC often show better expression recovery performance, but are not very effective in dealing with cell clustering tasks since such methods do not emphasize the latent topology of cell clusters. On the contrary, machine learning and deep learning methods can better depict the cell cluster topology and improve the cell clustering results, but often show compromised performance in the expression recovery test, perhaps due to their overestimation of cell heterogeneity and topology complexity. The pipeline of GE-Impute is somewhat in between. The overall cell–cell similarity network is reconstructed by a sophisticated neural graph representation model, which is conceptually similar to the methods that depend on the latent topology of cell clusters. But after reconstruction of the cell–cell similarity network, a simple KNN-like approach is used for expression smoothing. Therefore, it is plausible that GE-Impute takes advantage of the traits of these two distant categories of imputation methods to achieve more robust performance. We believe future improvement of either network embedding models or expression smoothing algorithms is likely to further improve GE-Impute and similar methods.

Key Points
• GE-Impute is an imputation method for scRNA-seq data based on graph embedding.
• GE-Impute has significantly better performance on recovering dropout zeros in both droplet- and plate-based scRNA-seq data than other imputation methods.
• GE-Impute outperforms other imputation methods in identifying biological differentially expressed genes and improving the accuracy of unsupervised clustering analysis.
• GE-Impute enhances the identification and visualization of cell type-specific marker genes.
• GE-Impute improves the performance of cell trajectory inference.

Supplementary data
Supplementary data are available online at http://bib.oxfordjournals.org/.

Funding
National Key Research and Development Program of China (2021YFF1201201).

Data availability
The source code of GE-Impute is freely available at https://github.com/wxbCaterpillar/GE-Impute.

References
1. Tang F, Barbacioru C, Wang Y, et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nat Methods 2009;6(5):377–82.
2. Maynard A, McCoach CE, Rotow JK, et al. Therapy-induced evolution of human lung cancer revealed by single-cell RNA sequencing. Cell 2020;182(5):1232–1251.e22.
3. Paik DT, Cho S, Tian L, et al. Single-cell RNA sequencing in cardiovascular development, disease and medicine. Nat Rev Cardiol 2020;17(8):457–73.
4. Xin Y, Kim J, Ni M, et al. Use of the Fluidigm C1 platform for RNA sequencing of single mouse pancreatic islet cells. Proc Natl Acad Sci U S A 2016;113(12):3293–8.
5. Picelli S, Faridani OR, Björklund AK, et al. Full-length RNA-seq from single cells using Smart-seq2. Nat Protoc 2014;9(1):171–81.
6. Zheng GX, Terry JM, Belgrader P, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun 2017;8(1):14049.
7. Hicks SC, Townes FW, Teng M, et al. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics 2018;19(4):562–78.
8. Ding J, Adiconis X, Simmons SK, et al. Systematic comparison of single-cell and single-nucleus RNA-sequencing methods. Nat Biotechnol 2020;38(6):737–46.
9. Lähnemann D, Köster J, Szczurek E, et al. Eleven grand challenges in single-cell data science. Genome Biol 2020;21(1):31.
10. van Dijk D, Sharma R, Nainys J, et al. Recovering gene interactions from single-cell data using data diffusion. Cell 2018;174(3):716–729.e27.
11. Wagner F, Yan Y, Yanai I. K-nearest neighbor smoothing for high-throughput single-cell RNA-Seq data. bioRxiv 2017. https://doi.org/10.1101/217737.
12. Gong W, Kwak IY, Pota P, et al. DrImpute: imputing dropout events in single cell RNA sequencing data. BMC Bioinform 2018;19(1):220.
13. Hu Y, Li B, Zhang W, et al. WEDGE: imputation of gene expression values from single-cell RNA-seq datasets using biased matrix decomposition. Brief Bioinform 2021;22(5):bbab085.
14. Deng Y, Bao F, Dai Q, et al. Scalable analysis of cell-type composition from single-cell transcriptomics using deep recurrent learning. Nat Methods 2019;16(4):311–4.
15. Arisdakessian C, Poirion O, Yunits B, et al. DeepImpute: an accurate, fast, and scalable deep neural network method to impute single-cell RNA-seq data. Genome Biol 2019;20(1):211.
16. Lopez R, Regier J, Cole MB, et al. Deep generative modeling for single-cell transcriptomics. Nat Methods 2018;15(12):1053–8.
17. Wang J, Ma A, Chang Y, et al. scGNN is a novel graph neural network framework for single-cell RNA-Seq analyses. Nat Commun 2021;12(1):1882.
18. Huang M, Wang J, Torre E, et al. SAVER: gene expression recovery for single-cell RNA sequencing. Nat Methods 2018;15(7):539–42.
19. Hou W, Ji Z, Ji H, et al. A systematic evaluation of single-cell RNA-sequencing imputation methods. Genome Biol 2020;21(1):218.
20. Perozzi B, Al-Rfou R, Skiena S. DeepWalk: online learning of social representations. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: Association for Computing Machinery, 2014, 701–10.
21. Grover A, Leskovec J. Node2vec: scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 2016, 855–64. New York, NY, USA: Association for Computing Machinery.
22. Palash G, Emilio F. Graph embedding techniques, applications, and performance: A survey. Knowl Based Syst 2018;151:78–94.
23. Zhao BW, Hu L, You ZH, et al. HINGRL: predicting drug-disease associations with graph representation learning on heterogeneous information networks. Brief Bioinform 2022;23(1):bbab515.
24. Zhang HY, Wang L, You ZH, et al. iGRLCDA: identifying circRNA-disease association based on graph representation learning. Brief Bioinform 2022;23(3):bbac083.
25. Tang J, Qu M, Wang M, et al. LINE: large-scale information network embedding. In: Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, Florence, Italy, 2015, 1067–77. Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee.
26. Ribeiro LFR, Savarese PHP, Figueiredo DR. struc2vec: learning node representations from structural identity. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, Halifax, NS, Canada, 2017, 385–94. New York, NY, USA: Association for Computing Machinery.
27. Mežnar S, Lavrač N, Škrlj B. SNoRe: Scalable Unsupervised Learning of Symbolic Node Representations. IEEE Access 2020;8:212568–88.
28. Freeman MF, Tukey JW. Transformations Related to the Angular and the Square Root. Annals of Mathematical Statistics 1950;21:607–11.
29. Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality. In: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, Lake Tahoe, NV, USA, 2013, 3111–9. Red Hook, NY, USA: Curran Associates Inc.
30. Li X, Li S, Huang L, et al. High-throughput single-cell RNA-seq data imputation and characterization with surrogate-assisted automated deep learning. Brief Bioinform 2022;23(1):bbab368.
31. Tian L, Su S, Dong X, et al. scPipe: A flexible R/Bioconductor preprocessing pipeline for single-cell RNA-sequencing data. PLoS Comput Biol 2018;14(8):e1006361.
32. Li H, Courtois ET, Sengupta D, et al. Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nat Genet 2017;49(5):708–18.
33. Barrett T, Wilhite SE, Ledoux P, et al. NCBI GEO: archive for functional genomics data sets–update. Nucleic Acids Res 2013;41(Database issue):D991–5.
34. Davis CA, Hitz BC, Sloan CA, et al. The Encyclopedia of DNA elements (ENCODE): data portal update. Nucleic Acids Res 2018;46(D1):D794–D801.
35. Breton G, Zheng S, Valieris R, et al. Human dendritic cells (DCs) are derived from distinct circulating precursors that are precommitted to become CD1c+ or CD141+ DCs. J Exp Med 2016;213(13):2861–70.
36. Petropoulos S, Edsgärd D, Reinius B, et al. Single-Cell RNA-Seq Reveals Lineage and X Chromosome Dynamics in Human Preimplantation Embryos. Cell 2016;165(4):1012–26.
37. Zappia L, Phipson B, Oshlack A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol 2017;18(1):174.
38. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 2014;15(12):550.
39. Bauer DF. Constructing Confidence Sets Using Rank Statistics. J Am Stat Assoc 1972;67(339):687–90.
40. Levandowsky M, Winter D. Distance between Sets. Nature 1971;234(5323):34–5.
41. Hao Y, Hao S, Andersen-Nissen E, et al. Integrated analysis of multimodal single-cell data. Cell 2021;184(13):3573–3587.e29.
42. Gaujoux R, Seoighe C. A flexible R package for nonnegative matrix factorization. BMC Bioinform 2010;11(1):367.
43. Scrucca L, Fop M, Murphy TB, et al. mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models. R J 2016;8(1):289–317.
44. Wu X, Zhao X, Xiong Y, et al. Deciphering Cell-Type-Specific Gene Expression Signatures of Cardiac Diseases Through Reconstruction of Bulk Transcriptomes. Front Cell Dev Biol 2022;10:792774.
45. Lukowski SW, Lo CY, Sharov AA, et al. A single-cell transcriptome atlas of the adult human retina. EMBO J 2019;38(18):e100811.
46. Kiselev VY, Andrews TS, Hemberg M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat Rev Genet 2019;20(5):273–82.
47. Büttner M, Miao Z, Wolf FA, et al. A test metric for assessing single-cell RNA-seq batch correction. Nat Methods 2019;16(1):43–9.
48. Hubert L, Arabie P. Comparing partitions. Journal of Classification 1985;2(1):193–218.
49. Xu Y, Baumgart SJ, Stegmann CM, et al. MACA: marker-based automatic cell-type annotation for single-cell expression data. Bioinformatics 2021;38:1756–60.
50. Wei Z, Zhang S. CALLR: a semi-supervised cell-type annotation method for single-cell RNA sequencing data. Bioinformatics 2021;37:i51–8.
51. Shao X, Yang H, Zhuang X, et al. scDeepSort: a pre-trained cell-type annotation method for single-cell transcriptomics using deep learning with a weighted graph neural network. Nucleic Acids Res 2021;49(21):e122.
52. Zhang X, Lan Y, Xu J, et al. CellMarker: a manually curated resource of cell markers in human and mouse. Nucleic Acids Res 2019;47(D1):D721–D728.
53. Street K, Risso D, Fletcher RB, et al. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics 2018;19(1):477.
54. Wolf FA, Hamey FK, Plass M, et al. PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol 2019;20(1):59.
55. Qi R, Wu J, Guo F, et al. A spectral clustering with self-weighted multiple kernel learning method for single-cell RNA-seq data. Brief Bioinform 2021;22(4):bbaa216.
56. Li Y, Luo P, Lu Y, et al. Identifying cell types from single-cell data based on similarities and dissimilarities between cells. BMC Bioinform 2021;22(S3):255.