How are orthologs calculated?

NCBI's Eukaryotic Genome Annotation Pipeline identifies ortholog gene groups for the NCBI Gene dataset using a combination of protein sequence similarity and local synteny information.

Orthology is determined between a genome being annotated and a reference genome, for example, human or zebrafish, and pairs of one-to-one orthologs are tracked as a set. Reference genomes are selected both to optimize the ortholog calls within a given taxonomic group and to provide a useful source of names for genes. Ortholog groups are expanded through transitive relationships. For example, orthologs are computed in a two-layer process for all fish other than zebrafish: one-to-one orthologs are first computed against zebrafish, and these are then consolidated with the zebrafish orthologs computed against human. If the fish gene has a zebrafish ortholog that has a human ortholog, then the fish gene is combined into the human ortholog group. Conversely, if a fish gene is an ortholog of a zebrafish gene that lacks a human ortholog, it is added to the zebrafish ortholog group unique to fishes. Currently, over 1,200 species spanning a limited range of taxonomic groups, including vertebrates, arthropods, protists, and fungi, are used for ortholog calculations.
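The two-layer consolidation for fish can be illustrated with a minimal Python sketch. This is not NCBI's implementation; the function name, the dictionary-based inputs, and the gene identifiers are all hypothetical and chosen only to show how a transitive relationship places a fish gene in either a human-anchored or a zebrafish-anchored group.

```python
def assign_ortholog_groups(fish_to_zebrafish, zebrafish_to_human):
    """Group fish genes by the reference gene that anchors their ortholog group.

    fish_to_zebrafish: dict mapping a fish gene to its one-to-one zebrafish ortholog.
    zebrafish_to_human: dict mapping a zebrafish gene to its one-to-one human ortholog.
    Both inputs are assumed to have been computed already (hypothetical data model).
    """
    groups = {}
    for fish_gene, zebrafish_gene in fish_to_zebrafish.items():
        human_gene = zebrafish_to_human.get(zebrafish_gene)
        if human_gene is not None:
            # Transitive case: fish -> zebrafish -> human, so join the human group.
            anchor = ("human", human_gene)
        else:
            # No human ortholog: the group stays anchored on the zebrafish gene.
            anchor = ("zebrafish", zebrafish_gene)
        groups.setdefault(anchor, []).append(fish_gene)
    return groups


# Hypothetical identifiers, for illustration only.
fish_to_zebrafish = {"fishA_gene1": "zf_gene1", "fishA_gene2": "zf_gene2"}
zebrafish_to_human = {"zf_gene1": "HS_GENE1"}

groups = assign_ortholog_groups(fish_to_zebrafish, zebrafish_to_human)
for anchor, members in groups.items():
    print(anchor, members)
```

In this toy input, the first fish gene joins the group anchored on a human gene, while the second stays in a group anchored on the zebrafish gene because that zebrafish gene has no human ortholog.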

NCBI’s workflow for computing orthologs begins with all-to-all alignment of proteins from the two genomes under consideration. For each protein from the genome being annotated, the reference genome is searched for best and near-best matches based on protein sequence similarity. Candidates are further analyzed for nucleotide sequence similarity across all exons (including UTRs) and an additional 2 kb of sequence on either side of the gene, as well as for microsynteny within the local genomic neighborhood (+/- 10 genes). Orthology relationships are assigned only when there is a clear 1:1 relationship between a pair of gene loci, using the microsynteny information to help resolve closely related paralogs. The assigned relationships may be reviewed by a RefSeq curator to further refine the set.
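As a rough illustration of how candidate selection and microsynteny could work together, the Python sketch below keeps near-best protein hits and then prefers the candidate whose neighbors match the query gene's neighbors. The actual scoring used by the pipeline is not described here, so everything in the sketch is an assumption: the 0.9 score fraction, the neighbor-counting measure, the "strictly better than the runner-up" rule, and all function and gene names are hypothetical. Nucleotide similarity of exons and flanking sequence is omitted for brevity.

```python
def near_best_candidates(hits, fraction=0.9):
    """Keep reference genes scoring within `fraction` of the best protein hit.

    hits: dict mapping reference gene -> alignment score (hypothetical input).
    The 0.9 cutoff is an assumed value, not a documented one.
    """
    if not hits:
        return []
    best = max(hits.values())
    return [gene for gene, score in hits.items() if score >= fraction * best]


def neighborhood(gene_order, gene, flank=10):
    """Genes within +/- `flank` positions of `gene` in an ordered gene list."""
    i = gene_order.index(gene)
    return set(gene_order[max(0, i - flank):i]) | set(gene_order[i + 1:i + 1 + flank])


def synteny_support(query, candidate, query_order, ref_order, provisional_pairs, flank=10):
    """Count query neighbors whose provisional partner lies near the candidate."""
    ref_neighbors = neighborhood(ref_order, candidate, flank)
    return sum(
        1
        for gene in neighborhood(query_order, query, flank)
        if provisional_pairs.get(gene) in ref_neighbors
    )


def resolve(query, candidates, query_order, ref_order, provisional_pairs, flank=10):
    """Return the single best-supported candidate, or None if the call is ambiguous."""
    if not candidates:
        return None
    scores = {
        c: synteny_support(query, c, query_order, ref_order, provisional_pairs, flank)
        for c in candidates
    }
    ranked = sorted(scores.values(), reverse=True)
    # Assign only when one candidate is clearly ahead (assumed decision rule).
    if len(ranked) == 1 or ranked[0] > ranked[1]:
        return max(scores, key=scores.get)
    return None


# Toy data: qX has two closely related reference candidates, rA and rB.
query_order = ["q1", "q2", "qX", "q3", "q4"]
ref_order = ["r1", "r2", "rA", "r3", "r4", "r9", "rB", "r8"]
provisional_pairs = {"q1": "r1", "q2": "r2", "q3": "r3", "q4": "r4"}

candidates = near_best_candidates({"rA": 100, "rB": 95, "rC": 40})
chosen = resolve("qX", candidates, query_order, ref_order, provisional_pairs, flank=2)
print(candidates, chosen)
```

Returning None when no candidate is clearly ahead mirrors the requirement for a clear 1:1 relationship: an ambiguous pair is left unassigned rather than forced.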