Using Multi-Scale Genomics To Associate Poorly Annotated Genes With Rare Diseases
Abstract
Background Next-generation sequencing (NGS) has significantly transformed the landscape of identifying disease-causing genes associated with genetic disorders. However, a substantial portion of sequenced patients remains undiagnosed. This may be attributed not only to the challenges posed by harder-to-detect variants, such as non-coding and structural variations, but also to the existence of variants in genes not previously associated with the patient’s clinical phenotype. This study introduces EvORanker, an algorithm that integrates unbiased data from 1,028 eukaryotic genomes to link mutated genes to clinical phenotypes.
Methods EvORanker utilizes clinical data, multi-scale phylogenetic profiling, and other omics data to prioritize disease-associated genes. It was evaluated on solved exomes and simulated genomes, compared with existing methods, and applied to 6260 knockout genes with mouse phenotypes lacking human associations. Additionally, EvORanker was made accessible as a user-friendly web tool.
Results In the analyzed exomic cohort, EvORanker accurately identified the “true” disease gene as the top candidate in 69% of cases and within the top 5 candidates in 95% of cases, consistent with results from the simulated dataset. Notably, EvORanker outperformed existing methods, particularly for poorly annotated genes. In the case of the 6260 knockout genes with mouse phenotypes, EvORanker linked 41% of these genes to observed human disease phenotypes. Furthermore, in two unsolved cases, EvORanker successfully identified DLGAP2 and LPCAT3 as disease candidates for previously uncharacterized genetic syndromes.
Conclusions We highlight clade-based phylogenetic profiling as a powerful systematic approach for prioritizing
potential disease genes. Our study showcases the efficacy of EvORanker in associating poorly annotated genes
to disease phenotypes observed in patients. The EvORanker server is freely available at [Link]
EvORanker/.
Keywords EvORanker, Gene-based prioritization, DLGAP2, LPCAT3
† Dana Sherill-Rofe and Lara Kamal contributed equally to this work.
† Ephrat Levy-Lahad, Moien Kanaan and Yuval Tabach contributed equally to this work.
*Correspondence:
Yuval Tabach
[Link]@[Link]
Full list of author information is available at the end of the article
© The Author(s) 2023. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which
permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or
other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line
to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory
regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this
licence, visit [Link] The Creative Commons Public Domain Dedication waiver ([Link]
[Link]/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Canavati et al. Genome Medicine (2024) 16:4 Page 2 of 22
Fig. 1 Graphical abstract of the EvORanker pipeline. Starting from a list of annotated variants obtained from a patient’s exome/genome sequencing
data and following variant filtering, a list of predicted patient candidate genes harboring putatively pathogenic variants is input to EvORanker.
The second input is the HPO terms corresponding to the patient’s phenotypes. The first step of the pipeline is to rank the genes listed in the HPO
database according to the input HPO terms using the OntologySimilarity tool. If any of the patient candidate genes is a known disease-causing
gene or ranked high using OntologySimilarity, then a genetic diagnosis is achieved. If not, then each patient candidate gene in addition
to the ranked HPO gene list is input into a co-evolution and STRING-based algorithm. The algorithm analyzes two lists of genes, the co-evolving
and STRING-interacting genes with each patient candidate gene. A one-sided Kolmogorov-Smirnov (K-S) test is then used to test if the co-evolving
and interacting genes rank significantly high within the patient’s phenotype-related genes. The p-values obtained from running the K-S test using
each dataset are combined using Fisher’s combined test. The output is a list of patient candidate genes ranked based on Fisher’s combined test
p-values (from more significant to less significant). A disease-causing candidate is identified among the patient genes where a significant number
of co-evolving and/or interacting genes are enriched towards the genes highly related to the patient’s input phenotypes relative to the genes
that are unrelated
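The scoring step described in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the `partners` data layout are hypothetical, the K-S p-value uses the asymptotic approximation exp(-2 D^2 nm/(n+m)) rather than an exact test, and partner genes missing from the HPO-ranked list are simply skipped.

```python
import math
from bisect import bisect_right


def ks_one_sided(sample, background):
    """One-sided two-sample K-S test for 'sample scores are shifted upwards'.

    D = max_x {F_background(x) - F_sample(x)}; the p-value uses the
    asymptotic approximation exp(-2 * D^2 * n*m / (n+m)), not an exact test.
    """
    s, b = sorted(sample), sorted(background)
    n, m = len(s), len(b)
    d = 0.0
    for x in sorted(set(s) | set(b)):
        f_s = bisect_right(s, x) / n  # ECDF of the sample at x
        f_b = bisect_right(b, x) / m  # ECDF of the background at x
        d = max(d, f_b - f_s)
    return d, math.exp(-2.0 * d * d * n * m / (n + m))


def fisher_combine(p_values):
    """Fisher's method: -2 * sum(log p_i) follows chi-square with 2k df under H0.

    For even degrees of freedom the chi-square upper tail has a closed form,
    so no statistics library is needed.
    """
    half = -sum(math.log(max(p, 1e-300)) for p in p_values)  # X^2 / 2
    term, total = 1.0, 1.0
    for j in range(1, len(p_values)):
        term *= half / j
        total += term
    return math.exp(-half) * total


def rank_candidates(partners, hpo_scores):
    """partners: {candidate: {"coevolved": [...], "string": [...]}} (hypothetical layout)."""
    background = list(hpo_scores.values())
    ranked = []
    for gene, sources in partners.items():
        p_values = []
        for name in ("coevolved", "string"):
            # Assumption: partner genes absent from the HPO-ranked list are skipped.
            scores = [hpo_scores[g] for g in sources.get(name, []) if g in hpo_scores]
            p_values.append(ks_one_sided(scores, background)[1] if scores else 1.0)
        ranked.append((gene, fisher_combine(p_values)))
    return sorted(ranked, key=lambda item: item[1])  # most significant first


# Toy data: 20 HPO-ranked genes with semantic similarity scores 0, 1/19, ..., 1.
hpo_scores = {f"g{i}": i / 19 for i in range(20)}
partners = {
    "A": {"coevolved": ["g15", "g16", "g17", "g18", "g19"], "string": ["g14", "g17", "g19"]},
    "B": {"coevolved": ["g0", "g1", "g2", "g3", "g4"], "string": ["g1", "g5"]},
}
for gene, p in rank_candidates(partners, hpo_scores):
    print(gene, f"{p:.4g}")
```

In the toy data, candidate A has partners concentrated among the highest-scoring phenotype genes and is therefore ranked ahead of candidate B, whose partners sit at the low-scoring end.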
with limited information. The results of our analysis showed that EvORanker was able to identify disease genes that were ranked low by other gene-based methods, demonstrating the complementarity of EvORanker to those of other gene-based tools. Moreover, we employed EvORanker to investigate two patients with unresolved genetic syndromes. Our analysis revealed DLGAP2 and LPCAT3 as potential candidates for disease-causing genes. Notably, only clade-based NPP analysis was able to detect LPCAT3 as a disease candidate. To enhance its practical utility, we designed EvORanker as a user-friendly gene-prioritization web tool that can be used by researchers and clinicians studying genetic disorders.
Fig. 2 Phenotypic diversity in A a cohort of 109 patients from the exome database and B a simulated dataset of 300 individuals with 300
pathogenic variants from ClinVar inserted into their genomes. The patients exhibit a wide range of phenotypes. Notably, various shared phenotypes,
especially related to metabolic and neurological diseases, are observed among the patients. Key: ID, intellectual disability; GI, gastrointestinal
disorders
1. Variant frequency
gnomAD [42] | <= 0.02 | < 0.001
AF_popmax [42] | <= 0.02 | < 0.001
In-house exome database | <= 0.02 | < 0.001
2. Splicing and synonymous variants
dbscSNV_RF_SCORE and dbscSNV_ADA_SCORE [43]; SpliceAI [44] | dbscSNV_RF_SCORE >= 0.5 or dbscSNV_ADA_SCORE >= 0.5 or SpliceAI >= 0.5 | dbscSNV_RF_SCORE >= 0.5 or dbscSNV_ADA_SCORE >= 0.5 or SpliceAI >= 0.5
3. Nonsynonymous variants
Polyphen2_HDIV_score [5]; REVEL score [6]; SIFT score [45] | Polyphen2_HDIV_score >= 0.5 or REVEL score >= 0.5 or SIFT score <= 0.5 | Polyphen2_HDIV_score >= 0.5 or REVEL score >= 0.5 or SIFT score <= 0.5
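The filtering criteria above can be read as a single predicate over an annotated variant. The sketch below is illustrative only: the dictionary keys, the `kind` labels and the handling of missing annotations are assumptions, not part of the published pipeline. The table lists two frequency cutoffs (<= 0.02 and < 0.001); because the rows above do not say which cutoff belongs to which analysis mode, the cutoff is left as a parameter.

```python
def passes_filter(variant, max_af=0.02, inclusive=True):
    """Return True if an annotated variant (a dict) passes the three filter groups."""

    def rare(key):
        af = variant.get(key)
        if af is None:  # Assumption: a variant absent from a database counts as rare.
            return True
        return af <= max_af if inclusive else af < max_af

    def at_least(key, cutoff):
        value = variant.get(key)
        return value is not None and value >= cutoff

    def at_most(key, cutoff):
        value = variant.get(key)
        return value is not None and value <= cutoff

    # 1. Variant frequency: every population source must satisfy the cutoff.
    if not all(rare(key) for key in ("gnomAD", "AF_popmax", "in_house")):
        return False
    kind = variant.get("kind")
    # 2. Splicing and synonymous variants: any splice predictor at or above 0.5.
    if kind in ("splicing", "synonymous"):
        return (at_least("dbscSNV_RF_SCORE", 0.5)
                or at_least("dbscSNV_ADA_SCORE", 0.5)
                or at_least("SpliceAI", 0.5))
    # 3. Nonsynonymous variants: any of the three predictors calls it damaging.
    if kind == "nonsynonymous":
        return (at_least("Polyphen2_HDIV_score", 0.5)
                or at_least("REVEL_score", 0.5)
                or at_most("SIFT_score", 0.5))
    # Assumption: other classes (e.g. frameshift, stop-gain) are kept on frequency alone.
    return True


missense = {"kind": "nonsynonymous", "gnomAD": 0.01, "REVEL_score": 0.7}
silent = {"kind": "synonymous", "gnomAD": 0.0001, "SpliceAI": 0.2}

print(passes_filter(missense))                                # lenient cutoff (<= 0.02)
print(passes_filter(missense, max_af=0.001, inclusive=False))  # strict cutoff (< 0.001)
print(passes_filter(silent))
```

Keeping the frequency cutoff as an argument lets the same function serve both threshold columns without hard-coding which column is which.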
LPCAT3 variant are as follows: Forward: CGCATAGGGGTGACATGGTA and Reverse: TATGCATTTTGACGGGCCTG.
Ranking the patient candidate genes according to their association with each patient’s disease-phenotypes
In order to retrieve a list of genes already reported to be associated with each patient’s clinical condition, we encoded the abnormalities reported in each patient’s medical report (Additional file 1: Table S1) to standard HPO terms [46]. The combination of the patient HPO terms was then input into the OntologySimilarity package in R, a semantic similarity-based tool [51], and used to identify and rank the genes listed in the HPO database based on their associations with the queried HPO terms. According to the recommended parameters in the tool’s documentation [51], the semantic similarity score was calculated using Lin’s definition of semantic similarity in combination with the “best-match-average” approach. Additional semantic similarity measures, such as the product measure based on Resnik’s similarity expression, were also evaluated, yielding results that were nearly identical (Additional file 2: Fig. S1). The output is a ranked list of the 4900 genes listed in the HPO database (as of January 2023) scored between zero and one (termed HPO-ranked genes). This score is based on the degree of similarity between the patient’s set of HPO terms and the HPO terms annotated to each gene in the HPO database. The higher-scoring genes, referred to as phenotype-related genes in this study, are defined as genes that exhibit stronger associations with the queried phenotypes.
Defining an exact threshold for the output of OntologySimilarity [51] is challenging. In most cases, it is hard to point to a clear threshold above which the genes are “phenotype-related.” Furthermore, such a potential threshold is highly dependent on the user-defined HPO terms, as HPOs are variable in their level of complexity and how well they are defined or studied. As such, in our analysis, the genes were ranked and ordered based on their association with the set of input phenotypes, ranging from highly associated to not associated (top to bottom). We utilized a one-sided Kolmogorov-Smirnov test to assess the significance of a skewed distribution of co-evolved (or interacting) genes towards the upper end of the ranked genes (see below).
Building the EvORanker algorithm
The main goal of EvORanker is to establish a link between the patient candidate genes and the patient’s phenotype. This link was evaluated based on two different sources of data on known and predicted gene functional interactions: clade-wise phylogenetic profiling and the STRING database [33]. Our working hypothesis is that, of the patient candidate genes, the disease-causing gene would be functionally linked to the genes that cause similar phenotypes to those of the patient (e.g., the disease-causing gene in a patient with ciliopathy would show significant co-evolution/co-expression/interaction with cilia genes). We used the 109-patient exome and 900-simulated datasets to tune the parameters of the algorithm using exclusively phylogenetic profiling (PP). Subsequently, we compared and eventually integrated the PP-based analysis with the STRING-based analysis to maximize the performance of EvORanker.
Clade-based phylogenetic profiling
The normalized phylogenetic profiling (NPP) matrix was constructed as previously described [24]. Briefly, a matrix of BLASTP scores for all human genes against the genomes of 1,028 eukaryotic species was constructed. First, the bitscore of each best BLAST hit was normalized by the bitscore of the query protein self-hit. Then log2 transformation of the normalized bitscore was applied. Finally, to avoid any biases due to phylogenetic distance, the conservation score was scaled for each species to their overall distribution by transforming the values in the column (corresponding to a species) into z-scores.
Clade-based analysis
To have a comprehensive mapping of protein-correlated evolution, we used 16 representative clades spanning the eukaryotic tree as previously described [24, 30]. In addition to including all eukaryotes, the following clades were used: Chordata, Ecdysozoa, Platyhelminthes, Alveolates, Stramenopiles, Fungi, Viridiplantae, Mammalia, Archelosauria, Arthropoda, Nematoda, Basidiomycota, Ascomycota, Fungi incertae sedis, Liliopsida, and Eudicotyledons [24]. These 16 representative clades spanning the eukaryotic tree show wide coverage (span most of the eukaryotic tree), mutual exclusivity (preferring non-nested clades), and uniformness (similar depth in the tree) in clade types (Additional file 2: Fig. S2) [24].
Retrieving coevolving genes for each patient candidate gene
The degree of co-evolution between two genes was evaluated using the Pearson correlation coefficient between their respective rows in the NPP matrix. As we demonstrated before [24], for each patient candidate gene, we selected the genes with the top 100 correlation coefficients in each clade and ranked them from 1 to 100
according to the correlation coefficient in each clade where the gene is found to have an ortholog.
Using the Kolmogorov-Smirnov test to prioritize patient candidate genes based on phylogenetic profiling
Per patient exome/simulated genome, we analyzed each of the patient candidate genes separately. For each of these genes, we determined whether the genes that co-evolve with it were associated with the patient’s phenotype (i.e., the co-evolved genes were significantly enriched towards the phenotype-related genes). For that, we examined the ranking of the coevolving genes in the list of the HPO-ranked genes using a one-tailed, two-sample Kolmogorov-Smirnov (K-S) test [52].
The K-S test is used to test whether two samples come from the same distribution. The K-S D statistic quantifies the distance between the empirical cumulative distribution function (ECDF) of the sample and the cumulative distribution function of the reference distribution. Let i denote the co-evolving genes and j denote the ranked HPO genes.
The null hypothesis: $H_0: F_i(x) \geq F_j(x)$
The alternative hypothesis: $H_1: F_i(x) < F_j(x)$
The D statistic: $D^- = \max_x \{F_j(x) - F_i(x)\}$, where $F_j$ is the ECDF of j and similarly for $F_i$.
The H1 hypothesis for the one-sided K-S test is that the cumulative distribution function of the ranking of the coevolving genes is enriched within the higher-scoring side of HPO-ranked genes (the phenotype-related genes). A p-value was computed using the [Link] function in the stats package in R [53]. The patient candidate genes were finally ranked by the resulting K-S test p-value (from more significant to less significant).
Tuning the parameters of the EvORanker phylogenetic profiling-based analysis
We evaluated the performance of the phylogenetic profiling-based analysis in identifying the “true” disease-causing gene using the 109-patient exome and the 900-simulated databases. We examined different parameters and cutoff values using phylogenetic profiling. We compared different cutoff values of the ranked coevolving genes (top 10, 25, 50, 75, 100) with each patient candidate gene. In both datasets, a threshold of the top 50 coevolved genes yielded slightly better accuracy in ranking the “true” gene in comparison to the other patient candidate genes (Additional file 2: Fig. S3), which we used for the rest of the analysis.
Applying the Kolmogorov-Smirnov test using STRING-interacting genes
In addition to phylogenetic profiling, other known and predicted functionally associated genes (protein-protein interactions, text mining, and co-expression) were retrieved from the STRING database [33]. STRING uses a scoring system that reflects the evidence of predicted interactions. We included interactions with a combined score of at least 0.5, which corresponds to a medium-confidence network. For each patient exome/simulated genome in the datasets, we applied the K-S test. The analysis was done for each patient candidate gene, with the STRING-interacting genes similarly as described above. We examined whether a substantial portion of STRING-interacting genes were also linked to phenotypes resembling those found in the patient.
The final EvORanker gene prediction scoring system
The two p-values obtained from each K-S test using phylogenetic profiling and STRING were finally combined by Fisher’s combined probability test [54, 55] (Eq. 1), which is the final EvORanker scoring system. Additionally, we assessed Simes’ method for combining the p-values, which produced similar results (Additional file 2: Fig. S1). The Fisher’s combined probability test was computed using the [Link] function in the survcomp package in R [56].
$$X^2_{2k} \sim -2 \sum_{i=1}^{k} \log(p_i) \qquad (1)$$
Applying EvORanker on genes with knockout phenotypes in mice that lack corresponding human annotation
A list of 6395 human genes with mouse knockout phenotypes but not yet associated with a phenotype in humans was compiled from Jackson laboratory’s Mouse Genome Informatics (MGI) [57] (downloaded March 1, 2023). The knockout mouse gene phenotype terms were then mapped to human HPO terms using uPheno ontology inter-ontology closest matches obtained from the OBO Phenotype Ontology Github repository [58], ending up with 6260 genes with mapped HPO terms. Then, for each gene, the corresponding HPO terms and a list of randomly sampled genes were input to EvORanker. The same data was input to Phenolyzer [59] for comparison. We were unable to compare to other tools (e.g., ExomeWalker or PHIVE) due to the impracticability of simulating 6,260 × 100 variants in 6,[Link] files as input.
Tool comparison
We compared EvORanker to the gene prioritization stage (second stage) of ExomeWalker and PHIVE algorithms [3, 8]. We used the 109-patient exome and 900-simulated benchmarking datasets to compare the tools with the same input HPO terms and patient candidate genes. We omitted one exome from the exome dataset where
a large deletion was identified containing the NPRL3 gene, leaving us with 108 exomes. We [Link] files for each of the 108 patient exomes and the 900 genomes as input for Exomiser which includes ExomeWalker [8] and PHIVE [3]. Additionally, we [Link] files containing the same HPO terms as input for each patient exome and simulated genome.
In vitro splicing analysis
In vitro splicing minigene assays were carried out as previously described [60, 61]. Briefly, the genomic sequence at chr8:1626251-1627026 (hg19), which includes exon 9 (417 bp) plus 128 and 231 nucleotides from the 5′ and 3′ flanking sequences, respectively, of DLGAP2 (NM_001346810) was PCR amplified from a DNA sample homozygous (II-3, Fig. 10A) and wildtype (II-2, Fig. 10A) for the c.2702 A > T variant using gene-specific primers designed with embedded XhoI and BamHI restriction enzyme recognition sites. After digestion, the PCR fragments were ligated into a pre-constructed pET01 Exontrap vector (MoBiTec, Goettingen, Germany). Selected colonies were then sequenced to confirm the proper orientation of the cloned fragment and identify both wild-type and variant colonies. Subsequently, the variant and wild-type minigenes were transfected into HEK293 cells in triplicate, followed by total RNA extraction 48 h post-transfection, using the Quick-RNA MiniPrep Plus kit (ZYMO Research). cDNA was then synthesized using the qScript Flex cDNA synthesis kit (Quanta Biosciences) with a specific primer to the 3′ native exon of the pET01 Exontrap vector. Following PCR amplification, the products were then visualized on a 1.5% agarose gel and were later extracted and then Sanger sequenced. The primer sequences used for the PCR amplification (XhoI + BamHI) are Forward: AAA-CTCGAG-AACACTACCTGCCCTTGAGC, and Reverse: AAA-GGATCC-ACTTACCTGACAAAACACACACA.
Data analysis and figure creation
All data in this study were analyzed using R software [53]. The EvORanker web interface was created using the R Shiny package [62]. The majority of the figures were created using R software. Figures 9D and 10D were created using Cytoscape v3.9.1 [63].
Results
Overview
In this work, we developed EvORanker, a phylogenetic profiling-based algorithm, to identify disease-causing genes. To optimize and evaluate the performance of EvORanker, we employed three different approaches: (1) analyzing a private cohort of well phenotypically characterized patients with rare diseases; (2) simulating a dataset of 900 patients with 300 unique “genetic diseases” by spiking disease-causing mutations into real genomes; (3) evaluating EvORanker’s ability to identify human disease candidate genes using genes with knockout phenotypes in mice that lack corresponding human annotation. We demonstrate the contribution of clade-based phylogenetic profiling (PP) to the improved prediction of the disease-causing gene. This unbiased approach was compared and integrated with gene interaction data obtained from the STRING database [33]. To evaluate the potential for bias in disease-gene prediction, we compared well-annotated genes to recently published ones. Finally, EvORanker was compared to other gene-based prioritization tools and applied to two unresolved exomes to demonstrate its efficacy in disease gene discovery.
Benchmarking EvORanker using an exome-patient dataset
We analyzed an in-house database of 109 patient exomes with a genetic diagnosis. The patients suffer from various rare hereditary diseases, exhibiting diverse phenotype groups (e.g., skeletal, immunological, neurological, and metabolic diseases) (Fig. 2A). The dataset included 91 recessive and 18 dominant gene variants that explained the patients’ phenotype (Additional file 1: Table S1). All these variants are reported to be pathogenic/likely pathogenic in the ClinVar database [38] and co-segregated with the phenotype in each corresponding family. The dataset includes 108 unique known disease genes (the CLCN1 gene appears twice, once as autosomal recessive and once as autosomal dominant). For each patient in the exome dataset, we encoded each of the clinical abnormalities found in the patient’s medical record into Human Phenotype Ontology (HPO) [46] terms (Additional file 1: Table S1).
Benchmarking EvORanker using a simulated dataset
Next, we aimed to assess our ability to identify disease-causing mutations in simulated data. Simulating genetic diseases can be achieved by introducing pathogenic mutations into genomic data from an unaffected individual. To accomplish this, we utilized 300 unaffected genomes sourced from the 1000 Genomes Project [39] as a benchmark for our evaluations. To introduce pathogenicity, we randomly integrated 300 distinct pathogenic/likely pathogenic variants from the ClinVar database [38] into the annotated genomes. These ClinVar variants are associated with genes showcasing diverse phenotypes including complex neoplastic disorders (Fig. 2B). Of these variants, 181 followed an autosomal or X-linked recessive mode of inheritance, while 119 variants followed a dominant inheritance pattern (Additional file 1: Table S2). Phenotypic information for each spiked ClinVar gene variant was obtained from the HPO database
(Additional file 1: Table S2) and was assigned to the respective “patient.” We conducted this process thrice, simulating a total of 900 artificial patients with 300 different genetic diseases. Each pathogenic mutation was inserted into three distinct genomes.
Ranking genes based on each patient’s set of phenotypes
Using each set of patient HPO terms, we calculated the semantic similarity score [51] (see the “Methods” section) for each gene in the HPO database [46]. The output is a list of genes scored from lower association to higher association with the patient’s set of HPO terms (which we term the phenotype-related genes).
Retrieving the patient’s candidate genes
We applied our routine variant filtering criteria [35, 36] to the annotated variants for each of the 109 exomes and simulated genomes (Table 1). After variant filtering, each exome/genome contained gene variants that are considered to be pathogenic and predicted to affect protein function (we term the genes in which these variants were observed as patient candidate genes). In autosomal and X-linked recessive cases, each patient harbored 11–80 homozygous/hemizygous or compound heterozygous deleterious variants, while 80–170 heterozygous/hemizygous deleterious variants were observed in autosomal and X-linked dominant cases (Additional file 2: Fig. S4). We confirmed that all the “true” causative variants passed the filtering criteria and remained within the gene variant list for each patient exome.
Using multi-clade phylogenetic profiling to rank the patient’s candidate genes according to the patient’s phenotype
Our working hypothesis is that out of all the patient candidate genes, the one responsible for the disease will be associated (e.g., co-evolved) with other genes that are known to be associated with the disease (phenotype-related genes). For each patient candidate gene, we obtained a list of 50 co-evolved genes that exhibit a strong correlation based on global and local co-evolution signatures across 1028 eukaryotic species (details in the “Methods” section) [24]. For each patient candidate gene, we retrieved the top co-evolving genes in 16 clades (Chordata, Ecdysozoa, Platyhelminthes, Alveolates, Stramenopiles, Fungi, Viridiplantae, Mammalia, Archelosauria, Arthropoda, Nematoda, Basidiomycota, Ascomycota, Fungi incertae sedis, Liliopsida, and Eudicotyledons) (Additional file 2: Fig. S2). The output per patient candidate gene is a table of genes that are strongly co-evolved with it in each clade in addition to all Eukaryotes.
To determine which of the patient candidate genes is most likely linked to the patient’s disease phenotype, we evaluated the intersection between the co-evolved genes and the phenotype-ranked genes using a one-sided Kolmogorov-Smirnov (K-S) test (Fig. 1). A significant p-value is obtained if the co-evolving genes rank high within the phenotype-related genes. For each patient exome/simulated genome in our dataset, we ranked the patient candidate genes based on the resulting p-value, with the most significant p-value ranked first. By analyzing the co-evolved genes across the 16 clades in addition to all Eukaryotes, the “true” disease-causing gene was ranked as the top gene in 46% of the autosomal and X-linked recessive cases and within the top 5 in 72% (Fig. 3). In autosomal and X-linked dominant cases, the “true” gene was ranked as the top gene in 50% of the cases and within the top 10 genes in 78%. These results surpass those obtained from using only the co-evolving genes across Eukaryotes or within the Animalia clades (Chordata, Mammalia, Archelosauria, Ecdysozoa, Nematoda, Arthropoda, and Platyhelminthes) (Fig. 3). This indicates the added value of incorporating all 16 clades in the analysis. The same analysis was applied on the simulated genomes, yielding results consistent with those obtained from the real exome dataset (Fig. 3).
Phylogenetic profiling analysis in different evolutionary scales improves the prediction of the disease-causing gene
We then aimed to assess the contribution of each of the 16 clades, in addition to all Eukaryota, towards the prediction of the “true” disease-causing gene. Using the 109 patient exomes, this was accomplished by focusing on the genes that obtained a significant p-value (< 0.05) through co-evolution analysis, totaling 71 identified genes. We applied the K-S test to these 71 genes using the co-evolving genes within each clade. Results showed that each clade outperformed others in at least one case, thus highlighting the importance of combining information from different clades to enhance the performance of EvORanker (Fig. 4). Interestingly, the Fungi Incertae Sedis clade outperformed other clades in 14% (10/71) of the cases, followed by Chordata, Ascomycota, Arthropoda, and Eukaryota, each outperforming others in 10% of the cases (Additional file 2: Fig. S5). Taken together, these results emphasize that clades differentially specialize in detecting functional interactions in different pathways [24, 29, 30].
Phylogenetic profiling is complementary to other existing omics datasets
NPP represents an unbiased approach that can annotate gene function independently of the literature. We sought to evaluate whether clade-wise NPP could identify disease-associated genes that are overlooked by other existing omics. For that, we chose to use the STRING database since it integrates information on
Fig. 3 Using clades improves the performance of EvORanker phylogenetic profiling-based analysis. For each patient candidate gene list
in the 109-patient exome and the 900-simulated genomes datasets (300 unique genetic disorders), we compared the accuracy of the phylogenetic
profiling-based algorithm by retrieving the top 50 coevolved genes with each patient candidate gene across all Eukaryotes versus: (1) using all
16 clades where the query gene has an ortholog in addition to Eukaryotes. (2) Across only Animalia clades (Chordata, Mammalia, Archelosauria,
Ecdysozoa, Nematoda, Arthropoda, and Platyhelminthes). Performance was measured by examining the ranking of the “true” disease-causing gene
relative to the other patient candidate genes. The upper bar plot shows results for the autosomal and X-linked recessive cases for the real-exome
dataset (left) and the simulated dataset (right). The simulated dataset contains 181 unique recessive cases and 119 unique dominant cases. The
results present a compilation of three separate independent shuffles totaling 900 simulations. The lower bar plot shows results for the autosomal
and X-linked dominant cases. The y-axis indicates the tested clades, and the x-axis indicates the percentage of cases where the “true” disease gene
was ranked at the top or within the top 3 or top 5 genes relative to the other candidate genes in recessive cases. In dominant cases, the percentage
is for the “true” gene being ranked at the top or within the top 10 genes. Overall, the best performance of ranking the “true” causative gene
was achieved by merging together the co-evolving genes within all clades (the 16 clades in addition to all Eukaryota) in both datasets
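The comparison in this figure rests on retrieving, for a query gene, the most correlated profiles within a chosen set of species columns and then merging the per-clade lists. A minimal Python sketch of that idea follows. The toy matrix, the clade names and the merge-by-union rule are illustrative assumptions, and the ortholog requirement for a clade is not modeled.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # Assumption: a flat profile is treated as uncorrelated.
    return sxy / math.sqrt(sxx * syy)


def top_coevolved(npp, query, columns, top_n=50):
    """Genes whose profiles, restricted to `columns`, correlate best with the query."""
    q = [npp[query][c] for c in columns]
    scored = [(pearson(q, [row[c] for c in columns]), gene)
              for gene, row in npp.items() if gene != query]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [gene for _, gene in scored[:top_n]]


def merge_across_clades(npp, query, clades, top_n=50):
    """Union of the per-clade top lists, keeping first-seen order (assumed merge rule)."""
    merged = []
    for columns in clades.values():
        for gene in top_coevolved(npp, query, columns, top_n):
            if gene not in merged:
                merged.append(gene)
    return merged


# Toy NPP-like matrix: rows are genes, columns are six species.
npp = {
    "Q":  [1, 2, 3, 5, 1, 2],
    "G1": [2, 4, 6, 0, 0, 1],   # tracks Q in the first three species only
    "G2": [3, 2, 1, 10, 2, 4],  # tracks Q in the last three species only
    "G3": [1, 1, 1, 1, 1, 1],   # flat profile
}
clades = {"cladeA": [0, 1, 2], "cladeB": [3, 4, 5]}
print(merge_across_clades(npp, "Q", clades, top_n=1))
```

In the toy matrix each clade recovers a different partner of the query gene, which is the kind of clade-specific signal that a single correlation over all species can dilute.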
protein associations from multiple sources, including interaction experiments, known complexes and pathways, scientific literature, co-expression studies, and conserved genomic context [33]. We conducted a comparison between the NPP and STRING-based analysis [33] using both the patient exome and simulated datasets (Fig. 5). NPP outperformed STRING in 29/109 (27%) of the cases, whereas STRING outperformed NPP in 50/109 (46%) of the cases (Fig. 5, Additional file 2: Fig. S6).
Fig. 5 Comparative performance of NPP, STRING, and EvORanker using the 109-patient exome and the simulated datasets. The performance
of each dataset was measured by examining the ranking of the “true” disease-causing gene relative to the other genes in each exome/genome
in both datasets. The upper bar plot shows results for the autosomal and X-linked recessive cases for the real-exome dataset (left) and the simulated
dataset (right). The simulated dataset contains 181 unique recessive cases and 119 unique dominant cases. The results present a compilation
of three separate independent shuffles totaling 900 simulations. The lower bar plot shows results for the autosomal and X-linked dominant cases.
The y-axis indicates the tested datasets: NPP (using the top 50 coevolved genes), STRING versions 9.1, 11.5, and EvORanker (combining NPP
and the newer version of STRING). The x-axis indicates the percentage of cases where the “true” disease gene was ranked at the top, or within the
top 3 or top 5 genes relative to the other candidate genes in recessive cases. In dominant cases, the percentage is for the “true” gene being ranked
at the top or within the top 10 genes. Overall, the best performance was achieved using the combined approach (EvORanker) in both datasets
Considering the presence of complementarity of co-evolution and the STRING-based analysis, we integrated the two datasets by combining their respective p-values using Fisher’s combined probability test [54]. This combined scoring system, which we termed EvORanker, yielded the highest accuracy in comparison to each dataset alone (Fig. 5). Using the exome dataset, we showed that integrating NPP and STRING improved the results by 43% compared to NPP alone and by 30% compared to STRING alone (Additional file 2: Fig. S6). Overall, in autosomal and X-linked recessive cases, EvORanker ranked the “true” disease-causing gene as the top gene in 63/91 (69%) and within the top 5 genes in 86/91 (95%) cases (Fig. 5). In autosomal and X-linked dominant cases, the “true” gene was ranked as the top gene in 12/18 (67%) cases and among the top 10 genes in 17/18 (95%) (Fig. 5). On the other hand, the “true” disease genes did not achieve high scores in a total of 6/109 (5.5%) of the exomes (within the top 5 for recessive diseases and within the top 10 for dominant diseases); 5/91 recessive cases, and 1/18 dominant cases (Additional file 2: Fig. S6).
We observed similar trends when analyzing the simulated dataset, providing further validation and affirming the consistency of our findings (Fig. 5). In autosomal and X-linked recessive cases within the simulated dataset, EvORanker ranked the “true” disease-causing gene as the top gene in 75% of cases and within the top 5 in 96% of cases. Conversely, for autosomal and X-linked dominant cases, the “true” gene held the top position in 55% of cases and was within the top 10 in 85% of cases. This parallel in results strongly underscores the robustness of EvORanker across both real and simulated datasets. Furthermore, to validate the stability of our method, we conducted three independent spike shuffles, consistently yielding coherent and reliable results (Additional file 2: Fig. S7).
Performance of phylogenetic profiling versus STRING on new gene entries (2020–2022)
As STRING is based on publicly available data, it is suited to identify well-researched genes. We hypothesized that STRING performance would be better the more information it has accrued over time and that our unbiased PP approach would have a particular advantage for genes that have not been extensively characterized. To test this hypothesis, we compared the performance of
Canavati et al. Genome Medicine (2024) 16:4 Page 13 of 22
STRING v.11.5 [33] with that of the older version STRING v.9.1 [64] (Fig. 5). We found that the performance of the newer version was indeed better than that of the older version. Furthermore, the performance of STRING v.11.5 decreased dramatically for genes that only recently became associated with disease. For example, the performance of STRING in ranking the “true” disease gene within the top 5 is around 85% for genes identified by the end of 2015 compared to 29% for genes identified between 2016 and 2020 (Fig. 6, Additional file 2: Fig. S8).
We then evaluated the performance of NPP and STRING on newly discovered or recently published genes. We retrieved a list of 94 new gene entries that were added to the most recent version of the HPO database (2022) compared to an older version (2020) (Additional file 1: Table S3). We then applied the K-S test separately using NPP and STRING and the HPO terms associated with each gene as input. We found that for those genes newly associated with human phenotypes, the K-S test yielded significant p-values using NPP in 45% of the genes compared to 38% using STRING (Fig. 7A, B). These results emphasize the success of phylogenetic profiling in predicting the phenotype associations of newly discovered or less studied genes and highlight the complementarity observed when comparing these two datasets.
Performance of EvORanker on genes with knockout phenotypes in mice that lack corresponding human annotation
We aimed to assess EvORanker’s capability to identify disease candidate genes that lack a known phenotype association in humans but possess mouse orthologs linked to phenotypes. Specifically, we aimed to identify human genes without established phenotype links, yet having a corresponding mouse ortholog with a phenotype association. These genes were considered as the “true” disease gene candidates for the purpose of this evaluation. We compiled a list of 6260 human ortholog genes with mouse knockout phenotypes, yet not associated with a phenotype in humans. For each of these genes, we input a set of HPOs mapped from the respective mouse knockout phenotypes. The goal was to evaluate EvORanker’s ability to correctly pinpoint the “true” disease gene candidate in comparison to 100 randomly sampled human genes. The same dataset was used as input for Phenolyzer [59] for comparative analysis. EvORanker yielded significant p-values for 41% of the tested genes (Fig. 8A). Moreover, both EvORanker and Phenolyzer ranked the “true” gene among the top 10 in 16% of the cases (Fig. 8B). Notably, EvORanker identified genes that Phenolyzer failed to identify, and vice versa, highlighting the complementarity of the tools (Additional file 2: Fig. S9).
Tool comparison
Using the 109-exome and simulated datasets, we compared the performance of EvORanker to the gene-prioritization stage of ExomeWalker [8] and PHIVE [3]. ExomeWalker prioritizes genes based on protein-protein interaction, while PHIVE uses mouse phenotypic data. To ensure a fair comparison, we chose to compare to ExomeWalker [8] and PHIVE [3] because both adopt a similar strategy to EvORanker. Unlike Phenolyzer [59],
Fig. 6 The effect of years elapsed on the performance of NPP versus STRING, using the 109-patient exome dataset. The x-axis indicates the calendar
years (divided into 5-year windows) in which a gene was described to be associated with a disease phenotype. The y-axis indicates the percentage
of “true” disease genes that ranked at the top (top 1) relative to the other patient candidates using NPP (red bars) or STRING (blue bars)
Fig. 7 Comparison of NPP versus STRING for genes with recent (2020–2022) annotation. A The x-axis indicates -log(10) p-values obtained
from running the K-S test using NPP. The y-axis indicates -log(10) p-values obtained from running the K-S test using the STRING dataset. The red
dots represent the genes where NPP performed better than STRING, while the blue dots indicate the opposite. The marginal histogram indicates
the distribution of the -log(10) p-values of both datasets. The correlation score between the two datasets is 0.046, suggesting that the two datasets
exhibit a complex relationship, where a subset of the data displays complementarity, while another subset shows correlation. B Density distribution
of the -log(10) p-values obtained from the K-S test using the NPP, STRING, and both (combined). Significance was calculated using the Wilcoxon test
(*p-value < 0.05, **p-value < 0.01; ns, nonsignificant). Combining NPP and STRING yielded significantly stronger results than either approach alone
Fig. 8 EvORanker’s performance in identifying candidate disease genes using mouse knockout genes without corresponding human annotation. A
The graph shows the percentage of genes with mouse knockout phenotypes that were tested for significant p-values using EvORanker. Out of 6260
genes, 41% showed significant p-values. B Comparison of EvORanker and Phenolyzer [49] in identifying true disease gene candidates. The graph
shows the count of genes with mouse knockout phenotypes and their respective ranking, each in comparison to 100 randomly sampled genes
by EvORanker and Phenolyzer. Among the tested genes, 16% were ranked in the top 10 by both tools
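The ranking in panel B places each tested gene against 100 randomly sampled genes. The sketch below shows one way such a rank could be derived from p-values. It is an illustration under stated assumptions, not code from EvORanker: the function names, the fixed random seed, the tie-handling rule, and all gene labels and p-values are hypothetical.

```python
import random


def sample_background(all_genes, true_gene, n=100, seed=0):
    """Draw n background genes at random, never including the tested gene."""
    pool = [g for g in all_genes if g != true_gene]
    return random.Random(seed).sample(pool, n)


def rank_against_background(true_p, background_p):
    """1-based rank of the tested gene; a smaller p-value ranks higher.

    Ties are resolved in favour of the tested gene (an arbitrary choice).
    """
    return 1 + sum(1 for p in background_p if p < true_p)


# Toy example with made-up gene names and p-values.
genes = [f"GENE_{i}" for i in range(500)]
background = sample_background(genes, "GENE_7")
p_values = {g: (i + 1) / 200 for i, g in enumerate(background)}  # 0.005 ... 0.5
rank = rank_against_background(0.012, p_values.values())
print(rank, rank <= 10)
```

Counting a gene as a top-10 hit then amounts to checking `rank <= 10` for each tested gene.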
these methods do not rely on pre-existing knowledge about known disease genes. The comparison was performed using the exome and simulated datasets with the same input HPO terms. However, since ExomeWalker and PHIVE are not well-suited for CNV analysis, we omitted from this analysis one exome where the causative
variant was a large deletion encompassing the NPRL3 gene. The results of the 108-exome dataset showed that EvORanker outperformed either one or both ExomeWalker and PHIVE in 74% (80/108) of the cases and outperformed both tools in 30% (32/108) of the cases (Fig. 9, Additional file 2: Fig. S10). On the other hand, either one or both of the other tools outperformed EvORanker in 20% (22/108) of the cases (Additional file 2: Fig. S10). For the simulated dataset, EvORanker outperformed both ExomeWalker and PHIVE (Fig. 9).
Solving the unsolved: candidate genes in reanalysis of patient exomes
We then applied EvORanker to identify novel disease-causing candidate genes in families with negative clinical exome results. To illustrate its effectiveness, we present two cases where we successfully resolved previously unsolved exomes.
Family 1
We utilized EvORanker to analyze the exome data of a patient with an undiagnosed neurodevelopmental disorder for which no disease-causing variant was identified. The patient and one of her siblings displayed symptoms of global psychomotor delay, dysphasia, and attention-deficit hyperactivity disorder (ADHD) (Fig. 10A). By employing the previously described analysis steps and inputting the HPO terms HP:0001263, HP:0002357, HP:0000752, and HP:0000736, EvORanker prioritized DLGAP2 as the top candidate gene (Fig. 10B). Further analysis revealed a strong correlation between DLGAP2 and several genes related to phenotypes similar to those of the patient, such as GRIN2A, NLGN1, CNTNAP2, SRPX2, SYNGAP1, GABRA5, DLG3, SATB1, PTCHD1, ARHGEF6, and NLGN4X (Fig. 10C, D, Additional file 2: Fig. S11). These “phenotype-related” genes were significantly enriched within the top co-evolving and STRING-interacting genes with DLGAP2 (combined Fisher p-value = 1.65 × 10⁻⁶) (Fig. 10C, Additional file 2: Fig. S11). Additionally, DLGAP2 was ranked as the top gene by both PHIVE [3] and ExomeWalker [8] but ranked 10th by the Phenolyzer tool [59].
The high ranking of DLGAP2 by EvORanker prompted us to further research the DLGAP2 variant. The DLGAP2 variant (NM_001346810:c.A2702T, p.Glu901Val) is strongly conserved and found neither in the gnomAD population frequency database [42] nor in our in-house database. Both affected siblings were homozygous for the variant, and it was the only variant that co-segregated with the phenotype in the family (Fig. 10A). The variant is positioned on the third nucleotide preceding the splice donor site within exon
Fig. 9 EvORanker outperforms two other algorithms (ExomeWalker and PHIVE). The performance of each algorithm in the 108-exome dataset
and the simulated dataset (shuffled three times) was measured by examining the ranking of the “true” disease-causing gene relative to the other
patient genes. The upper bar plot shows results for the autosomal and X-linked recessive cases for the real-exome dataset (left) and the simulated
dataset (right). The simulated dataset contains 181 unique recessive cases and 119 unique dominant cases. The results present a compilation
of three separate independent shuffles totaling 900 simulations. The lower bar plot shows results for the autosomal and X-linked dominant cases.
The y-axis indicates the tested algorithms, and the x-axis indicates the percentage of cases where the “true” disease gene was ranked at the top
or within the top 5 genes relative to the other candidate genes in recessive cases. In dominant cases, the percentage indicates whether the “true”
gene was ranked at the top or within the top 10 genes. EvORanker outperformed ExomeWalker and PHIVE in both recessive and dominant diseases
in both datasets
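The percentages plotted in the bar charts of Figs. 5 and 9 are top-k rates: the share of cases in which the “true” gene is ranked at or above a cutoff k. The sketch below shows that calculation as an illustration, not the authors' code. The counts (63 of 91 recessive exomes at rank 1 and 86 of 91 within the top 5) are the ones reported for EvORanker in the Results; the individual ranks used to reach those counts are invented, and the function name is hypothetical.

```python
def top_k_rates(true_gene_ranks, ks=(1, 5)):
    """Fraction of cases whose true gene is ranked within each cutoff k.

    A rank of None means the true gene was not ranked at all.
    """
    n = len(true_gene_ranks)
    if n == 0:
        raise ValueError("no cases supplied")
    return {
        k: sum(1 for r in true_gene_ranks if r is not None and r <= k) / n
        for k in ks
    }


# Made-up ranks arranged to match the reported recessive exome counts:
# 63 of 91 cases at rank 1 and 86 of 91 within the top 5.
ranks = [1] * 63 + [3] * 23 + [12] * 5
rates = top_k_rates(ranks, ks=(1, 5))
print({k: f"{100 * v:.0f}%" for k, v in rates.items()})
```

For dominant cases the same function can be called with `ks=(1, 10)`, matching the cutoffs used in the lower bar plots.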
Fig. 10 EvORanker identifies DLGAP2 as a novel gene underlying a neurodevelopmental phenotype. A Pedigree: In a consanguineous family
affected children have psychomotor delay and dysphasia, hyperactivity, and poor attention span. Shown is the segregation of the DLGAP2
NM_001346810:c.A2702T, p.Glu901Val variant. N, normal allele; V, variant allele. B EvORanker results: DLGAP2 is ranked as the top candidate relative
to the other patient candidates. The x-axis indicates the proband (patient II-3), and the y-axis indicates the EvORanker -log(10) p-value obtained
from running the K-S test using the co-evolved and STRING-interacting genes with each patient gene. Red dots indicate significant p-values,
and dark blue dots indicate non-significant p-values. DLGAP2 was the only gene that co-segregated with the phenotype in family 1. C One-sided,
two-sample Kolmogorov–Smirnov model. The x-axis indicates the semantic similarity score obtained by the OntologySimilarity tool in relation
to the patient’s (II-3, family 1) phenotypes (HP:0001263, HP:0002357, HP:0000752, HP:0000736). The y-axis indicates the cumulative distribution.
The orange line corresponds to the empirical distribution of all genes listed in the HPO database, ranked according to semantic similarity.
The red line represents the empirical distribution of the genes coevolved with DLGAP2, and the blue line represents the empirical distribution
of the genes interacting with DLGAP2 based on STRING. The red dashed line indicates the D statistic representing the maximum vertical distance
between the empirical cumulative distribution functions of the HPO-ranked genes and the genes coevolved with DLGAP2. The blue dashed line
indicates the D statistic measured by the distance between the empirical cumulative distribution functions of the HPO-ranked genes and the genes
interacting with DLGAP2 based on STRING. Both coevolution and STRING-based analysis yielded significant p-values corresponding to the D
statistic. D Coevolution and STRING-based subnetwork showing the patient’s phenotype-related genes coevolving with the DLGAP2 gene. The dark
grey node in the network indicates DLGAP2 and the light grey nodes represent the phenotype-related genes. The black edges represent STRING
interactions, and the colored edges represent the clade where two genes co-evolve. The network exhibits a group of phenotype-related correlated
genes that have not been identified by the STRING database (EHMT1, IL1RAPL1, SATB2, GABRA5, SRPX2, SEMA3E, CACNG2)
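Panel C rests on a one-sided, two-sample Kolmogorov–Smirnov comparison between the semantic similarity scores of all HPO genes and those of the genes linked to the candidate. The sketch below computes the D statistic as the maximum vertical distance between the two empirical cumulative distribution functions. Several points are assumptions made for illustration and may differ from the authors' implementation: the direction of the test (linked genes shifted toward higher scores), the large-sample p-value approximation exp(−2·D²·nm/(n+m)), the function names, and the toy scores.

```python
import bisect
import math


def ecdf(sorted_sample, x):
    """Empirical CDF of a sorted sample evaluated at x."""
    return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)


def ks_one_sided(background, linked):
    """Return (D, approximate p-value) for the one-sided two-sample K-S test.

    D is the largest amount by which the background ECDF exceeds the ECDF of
    the linked genes, so it is large when the linked genes have higher scores.
    """
    bg, lk = sorted(background), sorted(linked)
    # Both ECDFs are step functions, so the maximum occurs at an observed value.
    d = max(0.0, max(ecdf(bg, x) - ecdf(lk, x) for x in bg + lk))
    n, m = len(bg), len(lk)
    p_approx = math.exp(-2.0 * d * d * n * m / (n + m))  # leading-order asymptotic
    return d, p_approx


# Toy similarity scores: the linked genes sit at the high end of the range.
all_hpo_scores = [i / 10 for i in range(1, 11)]  # 0.1, 0.2, ..., 1.0
linked_scores = [0.8, 0.9, 1.0]
d, p = ks_one_sided(all_hpo_scores, linked_scores)
print(round(d, 3), round(p, 3))
```

Running the function once with the co-evolved genes and once with the STRING-interacting genes gives the two D statistics drawn as dashed lines in the panel.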
9 (out of 12 exons) of the DLGAP2 gene. It is predicted to alter gene splicing by different prediction tools (e.g., SpliceAI [44]). Since the DLGAP2 gene is minimally expressed in whole blood, a minigene splicing assay was performed to assess the effect of the c.A2702T variant on gene splicing (Additional file 2: Fig. S12). The minigene assay results showed that the variant led to the activation of a cryptic splice site and aberrant splicing (Additional file 2: Fig. S12). Sequencing of the RT-PCR product of the mutant construct showed a 4-bp deletion (GAAA del) (Chr8:1,626,792–1,626,795), resulting in a frameshift and premature termination after 59 codons (Additional file 2: Fig. S12).
Family 2
We applied EvORanker to the exome data of a patient diagnosed with a multisystem disease including failure to thrive, recurring abdominal pain, chronic diarrhea, skeletal muscle wasting, elevated liver enzymes, and high levels of creatine kinase (Fig. 11A). The patient
Fig. 11 EvORanker identifies LPCAT3 as a novel gene underlying a multisystem disorder. A Pedigree of a consanguineous family. The affected
son has failure to thrive, chronic diarrhea with recurrent abdominal pain, muscle atrophy, elevated liver enzymes, and high creatine kinase
levels. Shown is the segregation of the LPCAT3 NM_005768:c.G939A, p.Trp313Ter variant. N, normal allele; V, variant allele. B EvORanker results:
LPCAT3 is ranked as the top candidate relative to other candidate genes. The x-axis indicates the proband (patient II-4), and the y-axis indicates
the combined -log10 p-value obtained from running the K-S test using the co-evolved and STRING-interacting genes with each patient gene.
Red dots indicate significant p-values, and dark blue dots indicate non-significant p-values. LPCAT3 was the only gene that co-segregated
with the phenotype in family 2. C One-sided, two-sample Kolmogorov–Smirnov model. The x-axis indicates the semantic similarity score obtained
by the OntologySimilarity tool in relation to the patient’s (II-4, family 2) phenotypes (HP:0001508, HP:0002910, HP:0002574, HP:0002028, HP:0003236,
HP:0003202). The y-axis indicates the cumulative distribution. The orange line corresponds to the empirical distribution of all genes listed in the HPO
database, ranked according to semantic similarity. The red line indicates the empirical distribution of the genes coevolved with LPCAT3, and the blue
line indicates the empirical distribution of the genes interacting with LPCAT3 based on STRING. The red dashed line indicates the D statistic
representing the maximum vertical distance between the empirical cumulative distribution functions of the HPO-ranked genes and the genes
coevolved with LPCAT3. The blue dashed line indicates the D statistic measured by the distance between the empirical cumulative distribution
functions of the HPO-ranked genes and the genes interacting with LPCAT3 based on STRING. Only coevolution-based analysis yielded significant
p-values corresponding to the D statistic. D Coevolution and STRING-based subnetwork showing the patient’s phenotype-related genes coevolving
with the LPCAT3 gene. The yellow node in the network indicates LPCAT3 and the light grey nodes represent the phenotype-related genes. The black
edges represent STRING interactions, and the colored edges represent the clade where two genes co-evolve. We demonstrate that our clade-wise
NPP approach uncovered correlations between LPCAT3 and phenotype-related genes that were not captured by STRING
is of consanguineous parentage and is the sole affected individual in the family (Fig. 11A). Using HPO terms corresponding to the patient’s phenotype (HP:0001508, HP:0002910, HP:0002574, HP:0002028, HP:0003236, HP:0003202), EvORanker prioritized LPCAT3 as the top patient candidate gene (Fig. 11B). LPCAT3 demonstrated strong coevolution signals with genes related to the patient’s phenotype (PYGL, DLD, TXNRD2, COG8, SUCLG1, MVK, SMAD4, CPT1A) in the plant (Viridiplantae and Eudicotyledons), Mammalia, and Fungi kingdoms (Fig. 11C, D, Additional file 2: Figs. S13, S14 and S15). The genes that showed the strongest coevolution with LPCAT3 were significantly enriched within the phenotype-related genes (p-value = 7.93 × 10⁻¹⁵) (Fig. 11C, Additional file 2: Fig. S13). Conversely, the genes that interacted with LPCAT3 through STRING did not exhibit significant enrichment within the phenotype-related genes (p-value = 0.53) (Fig. 11C, Additional file 2: Fig. S13). Despite this, LPCAT3 still had the most significant p-value among all candidates based
on the combined EvORanker score (Fisher combined p-value = 1.42 × 10⁻¹³). LPCAT3 was ranked third by ExomeWalker [8], excluded by PHIVE [3] and ranked 8th by Phenolyzer [59]. The proband (II-4) was homozygous for a truncating variant in exon 9 of the LPCAT3 gene (NM_005768:c.G939A, p.W313X) (Fig. 11A). This variant was found neither in the gnomAD population frequency database [42] nor in our in-house database. The LPCAT3 variant was the only variant among the patient candidate genes that co-segregated with the phenotype in the family (Fig. 11A).
Complete knockout of LPCAT3 in mice results in premature death. However, tissue-specific knockouts in the liver and intestines have been documented, with the latter causing impaired growth and abnormal enterocyte morphology along with enterocyte lipid accumulation (Rong et al., 2015). Liver-specific knockouts in mice display a decrease in plasma triglycerides and an occurrence of hepatosteatosis (Rong et al., 2015). The patient from family 2 demonstrated anomalies in both the intestine and liver. Duodenal biopsies showed nodular lesions in the duodenal bulb and the descending portion of the duodenum with atrophic mucosa suggestive of severe enteropathy. Fragments of duodenal mucosa showed partial villous blunting with a mild increase of lamina propria lymphoplasmacytic cell infiltrate. Liver enzyme testing revealed a reduced aspartate aminotransferase (AST)/alanine aminotransferase (ALT) ratio, suggesting fatty liver disease, along with reduced plasma triglycerides (34 mg/dL) and HDL levels (27.1 mg/dL). These findings suggest LPCAT3 as a potential causative gene for the disease in the proband of this family.
EvORanker web tool
The EvORanker web tool ([Link] EvORanker/) is an easy-to-use decision support tool built for geneticists and researchers in the NGS field (Additional file 2: Fig. S16). The user submits a set of HPO terms describing the patient’s medical condition and the patient’s candidate genes, preferably genes that survived variant filtering. The algorithm then performs the aforementioned analyses and returns the outputs in two stages:
(1) Step 1: If the queried gene is already listed in the HPO database, a semantic similarity score (ranging from 0 to 1) reflecting the similarity of the gene’s associated HPO terms to the user’s input HPO terms is calculated using the OntologySimilarity package [51] and is indicated in a table output in the “Step 1: Semantic Similarity-based Prioritization” tab.
(2) Step 2: In the case where none of the queried genes are listed in the HPO dataset or where none had a high or sufficient semantic similarity score (i.e., a non-diagnostic case), the user can navigate to co-evolution and STRING-based gene prioritization. The output is a table containing each queried gene and the corresponding EvORanker p-value. The EvORanker p-value is the result of Fisher’s combined test obtained by integration of multi-clade phylogenetic profiling and STRING-based analysis as described above.
EvORanker also provides useful visualizations of the results, including a bar plot of the ranked genes by EvORanker, and a co-evolution and STRING subnetwork generated when any queried gene in the “Step 2” results table is clicked. The network highlights the HPO-related genes co-evolving with the query gene in addition to edges representing STRING interactions. Additionally, the user can retrieve more detailed co-evolutionary information including the clade where every two genes co-evolve, the co-evolutionary rank of the HPO-related genes with each query gene, and can inspect gene enrichment results of the coevolving genes and STRING-interacting genes with each query gene. The web interface is available at the following link: [Link] apps.io/EvORanker/. Recognizing the need to analyze a larger number of genes than recommended for the web tool due to memory constraints, we have established a GitHub repository ([Link] nker) [65]. This repository allows users to access the tool and input an expanded number of genes, accommodating their requirements.
Discussion
Clinical elucidation of genetic variants in connection to a patient’s phenotype is a time-consuming and costly element in the genomic diagnosis of rare genetic diseases. To address this issue, several computational algorithms have been developed over the years to prioritize candidate genes based on the patient’s phenotype using different sources of information, such as protein-protein interactions, data mining, and gene expression [3, 8–10, 12–14]. Nevertheless, although PP was successfully used to identify novel disease genes [25, 26, 29, 30, 66], we are not aware of any tool that systematically utilizes clade-based phylogenetic profiling to prioritize patient candidate genes. Herein, we describe EvORanker, an algorithm that employs multi-scale phylogenetic profiling and gene interaction data from the STRING database [33] to analyze “unsolved” WES/WGS cases in search of novel genetic causes of disease. This algorithm integrates
unbiased comparative genomic analysis with publicly available gene data, including function and interactions.
Multi-scale phylogenetic profiling is particularly valuable for identifying disease associations for poorly annotated genes. The ability to conduct analysis of every gene independently of existing knowledge expands the scope of disease-gene discovery. This is particularly important in light of the “rich get richer” phenomenon, where genes that have already been studied receive disproportionate attention, while poorly annotated genes are often overlooked. Among the 6260 tested knockout genes that exhibit a phenotype in mice and have an ortholog in humans, EvORanker was able to link 41% of these genes to the disease phenotype observed in mice (Fig. 8, Additional file 2: Fig. S9). This highlights the potential of EvORanker to discover new disease genes and expand our understanding of disease mechanisms.
Furthermore, our study demonstrates the power of our multi-clade concept in capturing co-evolution, as shown by our ability to more effectively identify the “true” disease-causing genes across multiple clades, beyond just Eukaryota or Animalia clades (Figs. 3 and 4, Additional file 2: Fig. S5). This is aligned with the notion that multi-clade phylogenetic profiling-based methods more effectively capture co-evolution [29, 30]. Importantly, our clade-wise NPP approach revealed correlations between genes that could not be anticipated using other omics (Fig. 11C, D, Additional file 2: Fig. S13). The integration of NPP with STRING leads to increased efficiency of EvORanker (Fig. 5), especially for newly annotated genes, and highlights the complementarity of these two datasets. In future studies, we may contemplate incorporating additional datasets into the algorithm, such as mouse and zebrafish knockout data, and other sources for protein-protein interaction networks, by utilizing similar concepts.
We benchmarked our tool using both real patient exome data and simulated data. The utilization of actual patient data enhances the translational potential of our findings and underscores the clinical relevance of our tool. EvORanker ranked the “true” gene within the top 5 in 95% of the patient-exome dataset. On the other hand, EvORanker failed to rank the “true” gene within the top 5 for recessive diseases and within the top 10 for dominant diseases in 6/109 exomes. Further investigation revealed that in 3 of those cases (TBL1XR1, NHLRC2, ADGRG1), the HPO terms used as input into the algorithm were both insufficient and non-specific (Additional file 1: Table S1). This highlights the importance of precise selection of HPO terms to achieve accurate results. Notably, our results remained consistent across both the real-patient and simulation datasets, further validating the practical utility and effectiveness of our tool in real-world applications.
We applied EvORanker on two unresolved exomes in which previous clinical whole exome sequencing (WES) did not identify a known genetic cause. In the first case (family 1), the DLGAP2 gene was ranked as the top candidate for a proband with a neurodevelopmental disorder (Fig. 10B). DLGAP2 plays a role in the molecular organization of neuronal synapses and neuronal cell signaling [67]. The pathogenicity of the NM_001346810:c.A2702T, p.Glu901Val variant observed in DLGAP2 was validated by demonstrating its effect on splicing (Additional file 2: Fig. S12). Homozygous knockout mice for DLGAP2 exhibit novelty-induced hyperactivity, increased aggression, impaired reverse learning, decreased dendritic spine density, and synaptopathy [68], providing further support for the association of DLGAP2 with the patient’s phenotype. Furthermore, DLGAP2 was hypothesized to be a strong candidate for neurodevelopmental and behavioral phenotypes observed in patients harboring 8p23.2-pter microdeletions including DLGAP2 and four other genes [69]. Notably, our analysis using NPP revealed a group of DLGAP2-associated genes not detected by STRING (Fig. 10D), providing new avenues for investigating the role of DLGAP2 in the nervous system.
In the second “unsolved” exome, only NPP ranked the LPCAT3 gene as the top candidate (family 2, Fig. 11C, Additional file 2: Fig. S13). This ranking of LPCAT3 was achieved by the detection of novel functional associations with phenotype-related genes based on co-evolution (PYGL, DLD, TXNRD2, COG8, SUCLG1, MVK, SMAD4, CPT1A) (Fig. 11D). These phenotype-related genes showed significant coevolution with LPCAT3 in the clades of Viridiplantae, Eudicotyledons, Mammalia, and Fungi (Fig. 11D, Additional file 2: Figs. S11 and S14), pointing towards novel associations not captured by STRING [33]. These findings are supported by phenotypic similarities between the patient and liver and intestinal knockout mice [70], including failure to thrive, enteropathy, and low levels of triglycerides and high-density lipoprotein. LPCAT3 nullizygous mice exhibit postnatal death [70], making it difficult to study the global effects of LPCAT3 knockdown. Although a recent report linked LPCAT3 overexpression to skeletal muscle myopathy [71], further research is needed to understand the role and mechanism of LPCAT3 in this condition. Taken together, these results underscore the potential of clade-based NPP to predict functional associations with phenotype-relevant genes. Further validation in additional patients with variants in these genes is warranted to confirm their roles as novel disease-associated genes. Subsequent functional validation studies are crucial to better understand the mechanisms of disease pathogenesis.
EvORanker is entirely gene-based, making it adaptable Additional file 2: Figure S1. Parameter combinations and EvORanker
to various sequencing experiments and accessible for users performance. Figure S2. The 16 clades used in the phylogenetic profiling-
with minimal computational knowledge. In addition to based algorithm. Figure S3. Cutoff values and EvORanker performance.
Figure S4. Distribution of the number of patient candidate genes that
passed the variant filtering criteria in autosomal and x-linked recessive (red) and dominant (dark blue) cases in the (A) patient exome dataset and the (B) simulated dataset (shuffled three times).
Figure S5. Contribution of each of the 16 clades to the overall performance of the EvORanker.
Figure S6. Radar plot showing the ranking of the “true” disease-causing gene (top 1, top 10, or NULL) using EvORanker (red), NPP (golden), and STRING (dark blue).
Figure S7. Evaluating EvORanker Performance across three independent spike shuffles.
Figure S8. Performance of NPP versus STRING using the 109-patient exome dataset across the years.
Figure S9. Comparison of EvORanker and Phenolyzer in identifying true disease gene candidates.
Figure S10. Radar plot showing the ranking of the “true” disease-causing gene (top 1, top 10, or NULL) using EvORanker (red), PHIVE (golden), and ExomeWalker (blue).
Figure S11. Distributions of the HPO-ranked genes, the co-evolved genes, and STRING-interacting genes with DLGAP2.
Figure S12. Effect of DLGAP2 p.E901V on splicing.
Figure S13. Density distributions of the HPO-ranked genes, the co-evolved genes, and STRING-interacting genes with LPCAT3.
Figure S14. Clades differentially predict the functional interaction between the phenotype-related genes and LPCAT3.
Figure S15. The Phylogenetic profiles of LPCAT3 and patient HPO-related genes across 1,028 eukaryotes.
Figure S16. Homepage of the EvoRanker web interface.
providing a ranked gene list, EvORanker offers the ability to explore evolutionary and STRING-based gene networks across multiple clades. A recommended strategy for users is to first examine the ranking of genes based on the OntologySimilarity semantic similarity score [51], in the event that one of the candidate genes is already listed in the HPO database. If not, the user can then evaluate the ranking of genes based on the EvORanker score, where a novel association between the gene and input phenotypes may be discovered.
The EvORanker server is freely available at [Link] vati.shinyapps.io/EvORanker/, which will be updated on a regular basis. We also created a GitHub repository ([Link] github.com/ccanavati/EvoRanker) [65] which allows users to access the tool and input an expanded number of genes.
Conclusions
In summary, our work introduces EvORanker as a
powerful tool in the genomic diagnostic landscape.
By integrating multi-scale phylogenetic profiling and STRING-based gene interaction data, EvORanker offers a unique and effective approach to prioritize candidate genes in “unsolved” cases identified through whole-exome and whole-genome sequencing. Our validation using real patient exome data and simulation data demonstrates EvORanker’s robust capability to consistently prioritize the “true” gene, showcasing its reliability and translational potential in research applications.
The effectiveness of EvORanker in identifying candidate disease genes, as demonstrated by the identification of DLGAP2 and LPCAT3 in previously unresolved cases, highlights its potential to contribute to our understanding of disease mechanisms. Moreover, its adaptability, user-friendly interface, and accessibility without extensive computational expertise make EvORanker a valuable asset for researchers. As we navigate the intricate landscape of rare genetic diseases, EvORanker stands as a promising tool, offering not only a ranked gene list but also insights into STRING-based and evolutionary gene networks across multiple clades. We believe that the adoption of EvORanker will contribute significantly to advancing genomic diagnostics in the pursuit of unraveling the genetic mysteries underlying rare diseases.
Supplementary Information
The online version contains supplementary material available at [Link] org/10.1186/s13073-023-01276-2.
Additional file 1: Table S1. Clinical and Genetic Characteristics of the 109-Patient Exome Dataset. Table S2. ClinVar-Simulated Genetic Variants and Associated Phenotypes. Table S3. Newly added disease-gene entries in the Human Phenotype Ontology (HPO) database.
Acknowledgements
We would like to acknowledge The Carole and Andrew Harper Diversity Program for their support in this research.
Authors’ contributions
Conceptualization: CC, DSR, YT, MK, PR, and ELL. Algorithm development: CC, IB, DSR, ES, BT, and YT. Web-tool development: CC. Bioinformatics and data analysis: CC and FZ. Exome sequencing and subsequent data analysis for the benchmarking dataset comprising 109 cases, as well as the two “unsolved” cases, along with collaboration and communication with clinicians: CC, LK, GR, MK, HK, and MK. In vitro splicing assay: LK, KBA, and IAA. Supervision: YT, ELL, MK, and PR. Manuscript: CC wrote the manuscript with help from YT, DSR, and ELL. All authors read and approved the final manuscript.
Funding
This study was supported by the Israel Science Foundation (grant no. 3797/21) to Yuval Tabach, Ephrat Levy-Lahad, and Paul Renbaum; The Koum Foundation to Ephrat Levy-Lahad and Paul Renbaum; The National Institutes of Health/NIDCD (R01DC011835) to Karen B. Avraham and Moien Kanaan; The Ministry of Innovation, Science and Technology (grant no. 3-17417) to Yuval Tabach.
Availability of data and materials
All the algorithm-related data files on which the conclusions of the paper rely are publicly available on GitHub at the following link: [Link] ccanavati/EvoRanker [65].
The dataset comprising the exome sequencing data of 109 patients at Istishari Arab Hospital in Ramallah, Palestine, is available through Professor Moien Kanaan. It was employed for benchmarking purposes in the current study. However, due to licensing constraints, these data are not publicly accessible. Researchers interested in accessing the dataset may do so upon making a reasonable request. To initiate the request, individuals are required to contact Professor Moien Kanaan directly via email ([Link]@[Link]). Please note that the process of obtaining access involves a meeting with Professor Moien Kanaan, during which the interested party will be required to provide details on how the data will be utilized, secured, and maintained to ensure the privacy and confidentiality of the patients. This may include a discussion on data security measures, protocols for restricting access to authorized personnel, and assurances that the data will not be made available to third parties. The timeline for granting access will be determined on a case-by-case basis, taking into consideration the nature of the request and the required ethical and legal approvals.
The VCF files for the 300 genomes utilized in the simulation analysis were obtained from the 1000 Genomes Project [39], accessible at ([Link] enomes.ebi.ac.uk/vol1/ftp/release/20100804/) [40].
Canavati et al. Genome Medicine (2024) 16:4 Page 21 of 22
Declarations
Ethics approval and consent to participate
The studies involving human participants received ethical approval from the Institutional Review Board (IRB) of the Shaare Zedek Medical Center, Jerusalem, Israel (IRB no. 20/10). All procedures were conducted in accordance with the principles outlined in the Declaration of Helsinki. Prior to their inclusion in the study, written informed consent was obtained from all participants or their parents. Additionally, participants or their parents granted permission to access their medical records. The study was conducted with due consideration of ethical guidelines, ensuring the confidentiality and voluntary participation of all individuals involved.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Author details
1 Department of Developmental Biology and Cancer Research, Institute of Medical Research - Israel-Canada, The Hebrew University of Jerusalem, Jerusalem 9112102, Israel. 2 Molecular Genetics Lab, Istishari Arab Hospital, Ramallah, Palestine. 3 Department of Human Molecular Genetics and Biochemistry, Faculty of Medicine and Sagol School of Neuroscience, Tel Aviv University, Tel Aviv 6997801, Israel. 4 Medical Genetics Institute, Shaare Zedek Medical Center, Jerusalem 91031, Israel. 5 Faculty of Medicine, The Hebrew University of Jerusalem, Jerusalem 9112102, Israel. 6 Hereditary Research Laboratory and Department of Life Sciences, Bethlehem University, Bethlehem 72372, Palestine.
Received: 29 April 2023 Accepted: 15 December 2023
References
1. Bamshad MJ, Nickerson DA, Chong JX. Mendelian gene discovery: fast and furious with no end in sight. Am J Hum Genet. 2019;105:448–55.
2. Online Mendelian Inheritance in Man, OMIM®. McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University (Baltimore, MD). 2023. [Link] Accessed 18 Sept 2023.
3. Robinson PN, Köhler S, Oellrich A, Project SMG, Wang K, Mungall CJ, et al. Improved exome prioritization of disease genes through cross-species phenotype comparison. Genome Res. 2014;24:340–8.
4. Davydov EV, Goode DL, Sirota M, Cooper GM, Sidow A, Batzoglou S. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput Biol. 2010;6:e1001025.
5. Adzhubei I, Jordan DM, Sunyaev SR. Predicting functional effect of human missense mutations using PolyPhen-2. Curr Protoc Hum Genet. 2013;Chapter 7:Unit7.20.
6. Ioannidis NM, Rothstein JH, Pejaver V, Middha S, McDonnell SK, Baheti S, et al. REVEL: an ensemble method for predicting the pathogenicity of rare missense variants. Am J Hum Genet. 2016;99:877–85.
7. Labes S, Stupp D, Wagner N, Bloch I, Lotem M, Lahad EL, et al. Machine-learning of complex evolutionary signals improves classification of SNVs. NAR Genomics Bioinform. 2022;4:lqac025.
8. Smedley D, Köhler S, Czeschik JC, Amberger J, Bocchini C, Hamosh A, et al. Walking the interactome for candidate prioritization in exome sequencing studies of Mendelian diseases. Bioinformatics Oxf Engl. 2014;30:3215–22.
9. Zemojtel T, Köhler S, Mackenroth L, Jäger M, Hecht J, Krawitz P, et al. Effective diagnosis of genetic disease by computational phenotype analysis of the disease-associated genome. Sci Transl Med. 2014;6:252ra123.
10. Tranchevent L-C, Ardeshirdavani A, ElShal S, Alcaide D, Aerts J, Auboeuf D, et al. Candidate gene prioritization with Endeavour. Nucleic Acids Res. 2016;44:W117-121.
11. Zolotareva O, Kleine M. A Survey of gene prioritization tools for Mendelian and complex human diseases. J Integr Bioinform. 2019;16:/j/[Link]-4/jib-2018-0069/[Link].
12. Birgmeier J, Haeussler M, Deisseroth CA, Steinberg EH, Jagadeesh KA, Ratner AJ, et al. AMELIE speeds Mendelian diagnosis by matching patient phenotype and genotype to primary literature. Sci Transl Med. 2020;12:eaau9113.
13. Dias R, Torkamani A. Artificial intelligence in clinical and genomic diagnostics. Genome Med. 2019;11:70.
14. De La Vega FM, Chowdhury S, Moore B, Frise E, McCarthy J, Hernandez EJ, et al. Artificial intelligence enables comprehensive genome interpretation and nomination of candidate diagnoses for rare genetic diseases. Genome Med. 2021;13:153.
15. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A. 1999;96:4285–8.
16. Enault F, Suhre K, Poirot O, Abergel C, Claverie J-M. Phydbac2: improved inference of gene function using interactive phylogenomic profiling and chromosomal location analysis. Nucleic Acids Res. 2004;32:W336-339.
17. Kim Y, Subramaniam S. Locally defined protein phylogenetic profiles reveal previously missed protein interactions and functional relationships. Proteins. 2006;62:1115–24.
18. Eisen JA, Wu M. Phylogenetic analysis and gene functional predictions: phylogenomics in action. Theor Popul Biol. 2002;61:481–7.
19. Jiang Z. Protein function predictions based on the phylogenetic profile method. Crit Rev Biotechnol. 2008;28:233–8.
20. Dey G, Meyer T. Phylogenetic profiling for probing the modular architecture of the human genome. Cell Syst. 2015;1:106–15.
21. Tabach Y, Billi AC, Hayes GD, Newman MA, Zuk O, Gabel H, et al. Identification of small RNA pathway genes using patterns of phylogenetic conservation and divergence. Nature. 2013;493:694–8.
22. Tabach Y, Golan T, Hernández-Hernández A, Messer AR, Fukuda T, Kouznetsova A, et al. Human disease locus discovery and mapping to molecular pathways through phylogenetic profiling. Mol Syst Biol. 2013;9:692.
23. Dey G, Jaimovich A, Collins SR, Seki A, Meyer T. Systematic discovery of human gene function and principles of modular organization through phylogenetic profiling. Cell Rep. 2015;10:993–1006.
24. Tsaban T, Stupp D, Sherill-Rofe D, Bloch I, Sharon E, Schueler-Furman O, et al. CladeOScope: functional interactions through the prism of clade-wise co-evolution. NAR Genomics Bioinform. 2021;3:lqab024.
25. Omar I, Guterman-Ram G, Rahat D, Tabach Y, Berger M, Levaot N. Schlafen2 mutation in mice causes an osteopetrotic phenotype due to a decrease in the number of osteoclast progenitors. Sci Rep. 2018;8:13005.
26. Arkadir D, Lossos A, Rahat D, Abu Snineh M, Schueler-Furman O, Nitschke S, et al. MYORG is associated with recessive primary familial brain calcification. Ann Clin Transl Neurol. 2019;6:106–13.
27. Date SV, Marcotte EM. Discovery of uncharacterized cellular systems by genome-wide analysis of functional linkages. Nat Biotechnol. 2003;21:1055–62.
28. Liu C, Wright B, Allen-Vercoe E, Gu H, Beiko R. Phylogenetic clustering of genes reveals shared evolutionary trajectories and putative gene functions. Genome Biol Evol. 2018;10:2255–65.
29. Sherill-Rofe D, Rahat D, Findlay S, Mellul A, Guberman I, Braun M, et al. Mapping global and local coevolution across 600 species to identify novel homologous recombination repair genes. Genome Res. 2019;29:439–48.
30. Stupp D, Sharon E, Bloch I, Zitnik M, Zuk O, Tabach Y. Co-evolution based machine-learning for predicting functional interactions between human genes. Nat Commun. 2021;12:6454.
31. Unterman I, Bloch I, Cazacu S, Kazimirsky G, Ben-Zeev B, Berman BP, et al. Expanding the MECP2 network using comparative genomics reveals potential therapeutic targets for Rett syndrome. eLife. 2021;10:e67085.
32. Braun M, Sharon E, Unterman I, Miller M, Shtern AM, Benenson S, et al. ACE2 co-evolutionary pattern suggests targets for pharmaceutical intervention in the COVID-19 pandemic. iScience. 2020;23:101384.
33. Szklarczyk D, Gable AL, Lyon D, Junge A, Wyder S, Huerta-Cepas J, et al. STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 2019;47:D607–13.
34. Yue F, Cheng Y, Breschi A, Vierstra J, Wu W, Ryba T, et al. A comparative encyclopedia of DNA elements in the mouse genome. Nature. 2014;515:355–64.
35. Canavati C, Klein KM, Afawi Z, Pendziwiat M, Abu Rayyan A, Kamal L, et al. Inclusion of hemimegalencephaly into the phenotypic spectrum of NPRL3 pathogenic variants in familial focal epilepsy with variable foci. Epilepsia. 2019;60:e67-73.
36. Kamal L, Pierce SB, Canavati C, Rayyan AA, Jaraysa T, Lobel O, et al. Helicase-inactivating BRIP1 mutation yields Fanconi anemia with microcephaly and other congenital abnormalities. Cold Spring Harb Mol Case Stud. 2020;6:a005652.
37. Elson A, Stein M, Rabie G, Barnea-Zohar M, Winograd-Katz S, Reuven N, et al. Sorting Nexin 10 as a key regulator of membrane trafficking in bone-resorbing osteoclasts: lessons learned from osteopetrosis. Front Cell Dev Biol. 2021;9:671210.
38. Landrum MJ, Lee JM, Benson M, Brown GR, Chao C, Chitipiralla S, et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 2018;46:D1062–7.
39. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526:68–74.
40. 1000 Genomes Project. Data Release 20100804. [Link] ebi.ac.uk/vol1/ftp/release/20100804/. Accessed 23 Aug 2023.
41. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010;38:e164.
42. Chen S, Francioli LC, Goodrich JK, Collins RL, Kanai M, Wang Q, et al. A genome-wide mutational constraint map quantified from variation in 76,156 human genomes. Genetics. 2022. Available from: [Link] org/lookup/doi/10.1101/2022.03.20.485034.
43. Jian X, Boerwinkle E, Liu X. In silico prediction of splice-altering single nucleotide variants in the human genome. Nucleic Acids Res. 2014;42:13534–44.
44. Jaganathan K, Panagiotopoulou SK, McRae JF, Darbandi SF, Knowles D, Li YI, et al. Predicting splicing from primary sequence with deep learning. Cell. 2019;176:535-548.e24.
45. Ng PC, Henikoff S. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003;31:3812.
46. Köhler S, Gargano M, Matentzoglu N, Carmody LC, Lewis-Smith D, Vasilevsky NA, et al. The human phenotype ontology in 2021. Nucleic Acids Res. 2021;49:D1207–17.
47. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv; 2013. Available from: [Link] 3997. Cited 2022 Sep 15.
48. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–303.
49. Garcia FADO, de Andrade ES, Palmero EI. Insights on variant analysis in silico tools for pathogenicity prediction. Front Genet. 2022;13:1010327.
50. Fromer M, Purcell SM. Using XHMM software to detect copy number variation in whole-exome sequencing data. Curr Protoc Hum Genet Editor Board Jonathan Haines Al. 2014;81:7.23.1-7.23.21.
51. Greene D, Richardson S, Turro E. ontologyX: a suite of R packages for working with ontological data. Bioinformatics. 2017;33:1104–6.
52. Schröer G, Trenkler D. Exact and randomization distributions of Kolmogorov-Smirnov tests two or three samples. Comput Stat Data Anal. 1995;20:185–202.
53. R Core Team. R: a language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2022. Available from: [Link]
54. Fisher R. Statistical methods for research workers. Edinburgh: Oliver and Boyd; 1925.
55. Mosteller F, Fisher RA. Questions and answers. Am Stat. 1948;2:30–1.
56. Schröder MS, Culhane AC, Quackenbush J, Haibe-Kains B. survcomp: an R/Bioconductor package for performance assessment and comparison of survival models. Bioinformatics. 2011;27:3206–8.
57. Blake JA, Baldarelli R, Kadin JA, Richardson JE, Smith CL, Bult CJ, et al. Mouse Genome Database (MGD): knowledgebase for mouse-human comparative biology. Nucleic Acids Res. 2021;49:D981–7.
58. OBO Phenotype Ontology. HPO to MP best matches. 2023. [Link] github.com/obophenotype/upheno/blob/master/mappings/hp-to-mp-bestmatches.tsv. Accessed 15 Feb 2023.
59. Yang H, Robinson PN, Wang K. Phenolyzer: phenotype-based prioritization of candidate genes for human diseases. Nat Methods. 2015;12:841–3.
60. Booth KT, Azaiez H, Kahrizi K, Wang D, Zhang Y, Frees K, et al. Exonic mutations and exon skipping: lessons learned from DFNA5. Hum Mutat. 2018;39:433–40.
61. Hirsch Y, Tangshewinsirikul C, Booth KT, Azaiez H, Yefet D, Quint A, et al. A synonymous variant in MYO15A enriched in the Ashkenazi Jewish population causes autosomal recessive hearing loss due to abnormal splicing. Eur J Hum Genet. 2021;29:988–97.
62. Chang W, Cheng J, Allaire J, Sievert C, Schloerke B, Xie Y, Allen J, et al. shiny: web application framework for R. R package version 1.8.0.9000. 2023. Available from: [Link] [Link] posit.co/.
63. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13:2498–504.
64. Franceschini A, Szklarczyk D, Frankild S, Kuhn M, Simonovic M, Roth A, et al. STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res. 2013;41:D808-15.
65. Canavati C. EvoRanker: a phylogenetic profiling-based algorithm for prioritizing candidate genes. 2023. Available from: [Link] ccanavati/EvoRanker.
66. Findlay S, Heath J, Luo VM, Malina A, Morin T, Coulombe Y, et al. SHLD2/FAM35A co-operates with REV7 to coordinate DNA double-strand break repair pathway choice. EMBO J. 2018;37:e100158.
67. Rasmussen AH, Rasmussen HB, Silahtaroglu A. The DLGAP family: neuronal expression, function and role in brain disorders. Mol Brain. 2017;10:43.
68. Luo J, Norris RH, Gordon SL, Nithianantharajah J. Neurodevelopmental synaptopathies: Insights from behaviour in rodent models of synapse gene mutations. Prog Neuropsychopharmacol Biol Psychiatry. 2018;84:424–39.
69. Catusi I, Garzo M, Capra AP, Briuglia S, Baldo C, Canevini MP, et al. 8p23.2-pter microdeletions: seven new cases narrowing the candidate region and review of the literature. Genes. 2021;12:652.
70. Rong X, Wang B, Dunham MM, Hedde PN, Wong JS, Gratton E, et al. Lpcat3-dependent production of arachidonoyl phospholipids is a key determinant of triglyceride secretion. eLife. 2015;4:e06557.
71. Ferrara PJ, Verkerke ARP, Maschek JA, Shahtout JL, Siripoksup P, Eshima H, et al. Low lysophosphatidylcholine induces skeletal muscle myopathy that is aggravated by high-fat diet feeding. FASEB J Off Publ Fed Am Soc Exp Biol. 2021;35:e21867.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.