
DNA Sequence Formats - Various Databases

The document outlines various DNA sequence formats including Plain, EMBL, FASTA, GCG, GCG-RSF, GenBank, and IG formats, detailing their structure and examples. It also includes information on IUPAC nucleic acid codes for representing ambiguity in sequences and NCBI accession ID conventions. Each format has specific rules regarding sequence representation, annotations, and identifiers.

Uploaded by

Vidhi Rana

7/11/23, 6:46 PM DNA Sequence formats

DNA Sequence formats


[Plain] [EMBL] [FASTA] [GCG] [GenBank] [IG] [IUPAC]

Plain sequence format


A sequence in plain format may contain only IUPAC characters and spaces (no numbers!).

Note: A file in plain sequence format may only contain one sequence, while most other formats accept several sequences in one file.

An example sequence in plain format is:


AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCC
TGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC
CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACT
TTCAACAATGGATCTCTTGGTTCCGGC
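
The plain-format rule above can be checked mechanically. The following is a minimal sketch, not a reference implementation: the helper names `is_plain_sequence` and `read_plain` are hypothetical, and it assumes that line breaks count as allowed whitespace and that lowercase letters are acceptable.

```python
# Hypothetical helpers illustrating the plain-format rule:
# only IUPAC nucleotide letters and whitespace, no digits.
IUPAC_DNA = set("ACGTURYKMSWBDHVN")


def is_plain_sequence(text: str) -> bool:
    """Return True if text holds only IUPAC letters and whitespace."""
    residues = [ch for ch in text if not ch.isspace()]
    # An empty file is treated as invalid (assumption).
    return bool(residues) and all(ch.upper() in IUPAC_DNA for ch in residues)


def read_plain(text: str) -> str:
    """Join all lines of a plain-format file into one uppercase sequence."""
    if not is_plain_sequence(text):
        raise ValueError("not a plain-format sequence")
    return "".join(text.split()).upper()


print(is_plain_sequence("AACCTGCGGA AGGATCATTA\nCCGAGTGCGG"))  # True
print(is_plain_sequence("1 aacctgcgga"))                       # False (digit)
```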

EMBL format
A sequence file in EMBL format can contain several sequences.
One sequence entry starts with an identifier line ("ID "), followed by further annotation lines. The start of the sequence is marked by a line
starting with "SQ" and the end of the sequence is marked by two slashes ("//").

An example sequence in EMBL format is:


ID   AA03518    standard; DNA; FUN; 237 BP.
XX
AC   U03518;
XX
DE   Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE   rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ   Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
     aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc        60
     tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg       120
     ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc       180
     tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc          237
//
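
The three markers described above ("ID", "SQ" and "//") are enough to split an EMBL file into entries. The sketch below is only an illustration under that assumption: `parse_embl` is a hypothetical name, all annotation lines other than "ID" are ignored, and position counters are dropped by keeping letters only.

```python
def parse_embl(text: str) -> list[dict]:
    """Split EMBL text into entries with an identifier and a sequence."""
    entries = []
    current = None
    in_sequence = False
    for line in text.splitlines():
        if current is None:
            # Wait for the identifier line that opens an entry.
            if line.startswith("ID"):
                fields = line[2:].split()
                name = fields[0].rstrip(";") if fields else ""
                current = {"id": name, "sequence": ""}
            continue
        if line.startswith("//"):
            # Two slashes close the entry.
            entries.append(current)
            current, in_sequence = None, False
        elif in_sequence:
            # Keep letters only: drops block spacing and position counters.
            current["sequence"] += "".join(ch for ch in line if ch.isalpha())
        elif line.startswith("SQ"):
            in_sequence = True
    return entries


sample = """\
ID   AA03518    standard; DNA; FUN; 237 BP.
XX
AC   U03518;
XX
SQ   Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
     aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc        60
     tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg       120
     ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc       180
     tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc          237
//
"""

for entry in parse_embl(sample):
    print(entry["id"], len(entry["sequence"]))
```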

FASTA format
A sequence file in FASTA format can contain several sequences.
One sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line must begin with
a greater-than (">") symbol in the first column.

An example sequence in FASTA format is:


>U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1)
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCC
TGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC
CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACT
TTCAACAATGGATCTCTTGGTTCCGGC
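
Because every record starts with a ">" line, a FASTA file can be read with a single pass. This is a minimal sketch rather than a full parser: `parse_fasta` is a hypothetical name, and the second record in the sample (`seq2`) is made up purely to show that several sequences may share one file.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Return (description, sequence) pairs from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for raw in text.splitlines():
        line = raw.strip()
        if raw.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


sample = """\
>U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1)
AACCTGCGGAAGGATCATTACCGAGTGCGG
GTCCTTTGGG
>seq2 made-up second record
ACGT
"""

for description, sequence in parse_fasta(sample):
    print(description.split()[0], len(sequence))
```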

GCG format
A sequence file in GCG format contains exactly one sequence. The file begins with annotation lines, and the start of the sequence is marked by a line
ending with two dots (".."). This line also contains the sequence identifier, the sequence length and a checksum. This format should
only be used if the file was created with the GCG package.

An example sequence in GCG format is:


ID   AA03518    standard; DNA; FUN; 237 BP.
XX
AC   U03518;
XX
DE   Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE   rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ   Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
AA03518  Length: 237  Check: 4514  ..

       1  aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc

      61  tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg

     121  ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc

     181  tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
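
The line ending in ".." carries the identifier, length and checksum, so a reader can locate it and cross-check the declared length against the sequence that follows. The sketch below assumes the header layout seen in the example; `parse_gcg` and the `HEADER` pattern are hypothetical, and the checksum is only extracted, not recomputed.

```python
import re

# Assumed header layout: "<name> ... Length: <n> ... Check: <n> .."
HEADER = re.compile(r"^\s*(\S+)\s+Length:\s*(\d+)\b.*?Check:\s*(\d+)\s*\.\.\s*$")


def parse_gcg(text: str) -> dict:
    """Read the single sequence of a GCG file and verify its declared length."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if not line.rstrip().endswith(".."):
            continue  # still inside the annotation lines
        match = HEADER.match(line)
        if match is None:
            raise ValueError("unrecognised '..' header line")
        # Letters only: removes position numbers and block spacing.
        sequence = "".join(ch for ch in "".join(lines[i + 1:]) if ch.isalpha())
        if len(sequence) != int(match.group(2)):
            raise ValueError("declared length does not match sequence")
        return {"name": match.group(1), "length": int(match.group(2)),
                "check": int(match.group(3)), "sequence": sequence}
    raise ValueError("no line ending in '..' found")


sample = """\
ID   AA03518    standard; DNA; FUN; 237 BP.
XX
SQ   Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
AA03518  Length: 237  Check: 4514  ..

       1  aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc

      61  tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg

     121  ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc

     181  tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
"""

record = parse_gcg(sample)
print(record["name"], record["length"], record["check"])
```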

GCG-RSF (rich sequence format)


The new GCG-RSF can contain several sequences in one file. This format should only be used if the file was created with the GCG package.

GenBank format
A sequence file in GenBank format can contain several sequences.
One sequence in GenBank format starts with a line containing the word LOCUS and a number of annotation lines. The start of the sequence is
marked by a line containing "ORIGIN" and the end of the sequence is marked by two slashes ("//"). (ref: Keys)

An example sequence in GenBank format is:


LOCUS       AAU03518      237 bp    DNA             PLN       04-FEB-1995
DEFINITION  Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
            rRNA and 5.8S rRNA genes, partial sequence.
ACCESSION   U03518
ORIGIN
        1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
       61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
      121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
      181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
//
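
A GenBank file can be read in the same spirit, keyed on "LOCUS", "ORIGIN" and "//". This is a rough sketch with a hypothetical `parse_genbank` function. It assumes these keywords start their lines and keeps only the locus name, the accession and the sequence.

```python
def parse_genbank(text: str) -> list[dict]:
    """Collect locus name, accession and sequence for each GenBank record."""
    records = []
    current = None
    in_origin = False
    for line in text.splitlines():
        if current is None:
            if line.startswith("LOCUS"):
                fields = line.split()
                current = {"locus": fields[1] if len(fields) > 1 else "",
                           "accession": "", "sequence": ""}
            continue
        if line.startswith("//"):
            # Two slashes end the record.
            records.append(current)
            current, in_origin = None, False
        elif in_origin:
            # Letters only: strips the leading position numbers and spaces.
            current["sequence"] += "".join(ch for ch in line if ch.isalpha())
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("ACCESSION"):
            fields = line.split()
            current["accession"] = fields[1] if len(fields) > 1 else ""
    return records


sample = """\
LOCUS       AAU03518      237 bp    DNA             PLN       04-FEB-1995
DEFINITION  Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
            rRNA and 5.8S rRNA genes, partial sequence.
ACCESSION   U03518
ORIGIN
        1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
       61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
      121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
      181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
//
"""

for record in parse_genbank(sample):
    print(record["locus"], record["accession"], len(record["sequence"]))
```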

IG format
A sequence file in IG format can contain several sequences. Each sequence consists of a number of comment lines, which must begin with a semicolon
(";"), followed by a line with the sequence name (it may not contain spaces!) and then the sequence itself. The sequence ends with a termination
character: '1' for linear or '2' for circular sequences.

An example sequence in IG format is:


; comment
; comment
U03518
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCC
TGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC
CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACT
TTCAACAATGGATCTCTTGGTTCCGGC1
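
The IG layout (comment lines, a name line, then sequence lines closed by '1' or '2') maps onto a small state machine. The sketch below is an illustration only: `parse_ig` is a hypothetical name, and the second record in the sample (`plasmidX`) is made up to show the circular terminator.

```python
def parse_ig(text: str) -> list[dict]:
    """Parse IG records: ';' comments, a name line, sequence ending in 1 or 2."""
    records = []
    comments, name, chunks = [], None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if name is None:
            if line.startswith(";"):
                comments.append(line[1:].strip())
            else:
                name = line  # first non-comment line is the sequence name
        elif line[-1] in "12":
            # The terminator closes the record and encodes the topology.
            chunks.append(line[:-1])
            records.append({
                "name": name,
                "comments": comments,
                "topology": "linear" if line[-1] == "1" else "circular",
                "sequence": "".join(chunks),
            })
            comments, name, chunks = [], None, []
        else:
            chunks.append(line)
    return records


sample = """\
; comment
; comment
U03518
AACCTGCGGAAGGATCATTACCGAGTGCGG
GTCCTTTGGG1
; made-up circular record
plasmidX
ACGTACGT2
"""

for record in parse_ig(sample):
    print(record["name"], record["topology"], len(record["sequence"]))
```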

IUPAC nucleic acid codes


To represent ambiguity in DNA sequences the following letters can be used (following the rules of the International Union of Pure and Applied
Chemistry (IUPAC)):
A = adenine
C = cytosine
G = guanine
T = thymine
U = uracil
R = G or A (purine)
Y = T or C (pyrimidine)
K = G or T (keto)
M = A or C (amino)
S = G or C (strong)
W = A or T (weak)
B = G, T or C (not A)
D = G, A or T (not C)
H = A, C or T (not G)
V = G, C or A (not T)
N = A, G, C or T (any)
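
One common use of these codes is searching a sequence for a motif that contains ambiguous positions. The sketch below turns the table above into a lookup and builds a regular expression from it. `IUPAC_CODES` and `to_regex` are hypothetical names, and the motif `GGAYCA` is an arbitrary example.

```python
import re

# Each IUPAC code mapped to the bases it stands for (from the table above).
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}


def to_regex(motif: str) -> re.Pattern:
    """Translate a motif written in IUPAC codes into a compiled regex."""
    parts = []
    for code in motif.upper():
        bases = IUPAC_CODES[code]  # KeyError signals a non-IUPAC character
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return re.compile("".join(parts))


pattern = to_regex("GGAYCA")
print(pattern.pattern)                                  # GGA[TC]CA
print(pattern.search("AACCTGCGGAAGGATCATTACC").start()) # 11
```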

NCBI accession ID conventions


Prefixes                    Description

GenBank:
AE | CP | CY              : Genome projects (nucleotide)
U  | AF | AY              : Direct submissions (nucleotide)
DQ | EF | EU
FJ | GQ | GU
HM | HQ | JF
JN | JQ | JX
KC | KF | KJ
KM | KP | KR
KT | KU | KX
AAAA - AZZZ,              : Whole genome shotgun sequences (nucleotide)
JAAA - JZZZ,
LAAA - LZZZ,
MAAA - MZZZ,
NAAA - NZZZ,
PAAA - PZZZ,
QAAA - QZZZ,
RAAA - RZZZ
AAA - AZZ                 : Protein ID
EAA - EZZ, KAA - KZZ      : WGS protein ID
O/P/Q                     : Swiss-Prot (protein)

RefSeq:
AC_ : Genomic; complete genomic molecule, usually alternate assembly
AP_ : Protein; annotated on AC_ alternate assembly
NC_ : Genomic; complete genomic molecule, usually reference assembly, curated
NG_ : Genomic; incomplete genomic region, curated
NM_ : mRNA; curated
NR_ : ncRNA; curated
NP_ : Protein; curated, associated with an NM_ or NC_ accession
NS_ : Genomic; environmental sequence
NT_ : Genomic; automated, contig or scaffold, clone-based or WGS
NZ_ : Genomic; unfinished WGS
NW_ : Genomic; automated, contig or scaffold, primarily WGS
XM_ : mRNA; automated, predicted model
XP_ : Protein; automated, predicted model, associated with an XM_ accession
XR_ : ncRNA; automated, predicted model
YP_ : Protein
ZP_ : Protein; predicted model, annotated on NZ_ genomic records
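
The RefSeq prefixes listed above lend themselves to a simple lookup. This is a minimal sketch, assuming accessions of the form two uppercase letters, an underscore, digits and an optional version suffix. `REFSEQ_PREFIXES` and `classify_refseq` are hypothetical names, the descriptions are abbreviated from the list above, and `NM_000546.6` serves only as a sample input.

```python
import re
from typing import Optional

# Abbreviated descriptions taken from the RefSeq list above.
REFSEQ_PREFIXES = {
    "AC_": "genomic, complete molecule (alternate assembly)",
    "AP_": "protein, annotated on AC_",
    "NC_": "genomic, complete molecule (reference assembly)",
    "NG_": "genomic, incomplete region",
    "NM_": "mRNA, curated",
    "NR_": "ncRNA, curated",
    "NP_": "protein, curated",
    "NS_": "genomic, environmental sequence",
    "NT_": "genomic, contig or scaffold (clone-based or WGS)",
    "NZ_": "genomic, unfinished WGS",
    "NW_": "genomic, contig or scaffold (primarily WGS)",
    "XM_": "mRNA, predicted model",
    "XP_": "protein, predicted model",
    "XR_": "ncRNA, predicted model",
    "YP_": "protein",
    "ZP_": "protein, predicted model on NZ_",
}

# Assumed accession shape: prefix, digits, optional ".version".
ACCESSION = re.compile(r"^([A-Z]{2}_)\d+(\.\d+)?$")


def classify_refseq(accession: str) -> Optional[str]:
    """Return the description for a RefSeq-style accession, else None."""
    match = ACCESSION.match(accession.strip())
    if match is None:
        return None  # not RefSeq-shaped, e.g. a GenBank accession
    return REFSEQ_PREFIXES.get(match.group(1))


print(classify_refseq("NM_000546.6"))  # mRNA, curated
print(classify_refseq("U03518"))       # None
```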

https://www.animalgenome.org/bioinfo/resources/manuals/seqformats 5/5
