How are orthologs calculated?
NCBI's Eukaryotic Genome Annotation Pipeline identifies ortholog gene groups for the NCBI Gene dataset using a combination of protein sequence similarity and local synteny information.
Orthology is determined between a genome being annotated and a reference genome, for example, human or zebrafish, and pairs of one-to-one orthologs are tracked as a set. Reference genomes are selected both to optimize the ortholog calls within a given taxonomic group and to provide a useful source of names for genes. Ortholog groups are expanded through transitive relationships. For example, orthologs are computed in a two-layer process for all fish other than zebrafish: one-to-one orthologs are computed against zebrafish and consolidated with zebrafish orthologs computed against human. If the fish gene has a zebrafish ortholog that has a human ortholog, then the fish gene is combined into the human ortholog group. Conversely, if a fish gene is an ortholog of a zebrafish gene that lacks a human ortholog, it is added to the zebrafish ortholog group unique to fishes. Currently, over 1,200 species spanning a limited range of taxonomic groups, including vertebrates, arthropods, protists, and fungi, are used for ortholog calculations.
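The two-layer consolidation described above can be illustrated with a minimal sketch. It is not NCBI's code: the function name, the dictionary-based inputs, and the gene identifiers in the usage example are hypothetical, and it assumes the one-to-one ortholog pairs for each layer have already been computed.

```python
from typing import Dict, Optional, Tuple


def assign_ortholog_group(
    fish_gene: str,
    fish_to_zebrafish: Dict[str, str],
    zebrafish_to_human: Dict[str, str],
) -> Optional[Tuple[str, str]]:
    """Return (anchor_species, anchor_gene) for the group a fish gene joins.

    Both mappings are assumed to hold one-to-one ortholog pairs only.
    """
    zebrafish_gene = fish_to_zebrafish.get(fish_gene)
    if zebrafish_gene is None:
        # No zebrafish ortholog: the gene joins no group in this sketch.
        return None

    human_gene = zebrafish_to_human.get(zebrafish_gene)
    if human_gene is not None:
        # Transitive step: fish -> zebrafish -> human.
        return ("human", human_gene)

    # The zebrafish gene has no human ortholog: fish-specific group.
    return ("zebrafish", zebrafish_gene)
```

For instance, with made-up identifiers, a fish gene mapped to a zebrafish gene that in turn maps to a human gene is placed in the group anchored on the human gene, while a fish gene whose zebrafish ortholog has no human counterpart stays in a group anchored on the zebrafish gene.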
NCBI’s workflow for computing orthologs begins with all-to-all alignment of proteins from the two genomes under consideration. For each protein from the genome being annotated, the reference genome is searched for best and near-best matches based on protein sequence similarity. Candidates are further analyzed for nucleotide sequence similarity across all exons (including UTRs) and an additional 2 kb of sequence on either side of the gene, as well as for microsynteny within the local genomic neighborhood (+/- 10 genes). Orthology relationships are assigned only when there is a clear 1:1 relationship between a pair of gene loci. Microsynteny information is used to help resolve closely related paralogs, and the assignments may be reviewed by a RefSeq curator to further refine the set.
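To make the role of microsynteny concrete, the sketch below shows one way local gene order could break a tie between closely related paralogs. It is an illustration under stated assumptions, not NCBI's scoring: the support count, the tie rule, and all function and parameter names are hypothetical. It assumes each genome is given as an ordered list of gene identifiers along one sequence, and that `candidates` maps each gene in the annotated genome to its best and near-best reference matches from the protein similarity search. It covers only the synteny step and does not check nucleotide similarity or enforce 1:1 pairing across all loci.

```python
from typing import Dict, List, Optional, Set

WINDOW = 10  # neighborhood size taken from the text: +/- 10 genes


def neighbors(order: List[str], gene: str, window: int = WINDOW) -> Set[str]:
    """Genes within `window` positions of `gene` on either side."""
    i = order.index(gene)
    lo = max(0, i - window)
    return set(order[lo:i] + order[i + 1:i + 1 + window])


def synteny_support(
    query: str,
    ref: str,
    query_order: List[str],
    ref_order: List[str],
    candidates: Dict[str, List[str]],
    window: int = WINDOW,
) -> int:
    """Count neighbors of `query` that have a candidate match near `ref`."""
    ref_nbrs = neighbors(ref_order, ref, window)
    return sum(
        1
        for q in neighbors(query_order, query, window)
        if ref_nbrs.intersection(candidates.get(q, []))
    )


def pick_candidate(
    query: str,
    query_order: List[str],
    ref_order: List[str],
    candidates: Dict[str, List[str]],
    window: int = WINDOW,
) -> Optional[str]:
    """Return the single best-supported candidate, or None if ambiguous."""
    scored = sorted(
        (
            (synteny_support(query, r, query_order, ref_order, candidates, window), r)
            for r in candidates.get(query, [])
        ),
        reverse=True,
    )
    if not scored:
        return None
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        # Tied support: no clear winner, so make no call in this sketch.
        return None
    return scored[0][1]
```

Returning no call on a tie mirrors the requirement that a relationship be assigned only when it is clearly 1:1. A fuller version would also confirm that the chosen reference gene is not claimed by another gene in the annotated genome.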