GenBank (.gb) File Format
GenBank file format
Description
Details on the GenBank format
Notes
Examples
References
GenBank is a plaintext format for storing DNA data as character sequences. It is a popular interchange format for molecular biology software.
It is the DNA encoding format used by the U.S. National Center for Biotechnology Information (NCBI).
The commands Import and Export support this format.
The GenBank format employs the following standard IUB/IUPAC conventions for encoding protein or nucleic acid sequences as alphabetic characters.
In addition to codes for specific nucleic acids or amino acids, the convention provides ambiguity codes for positions that may be occupied by more than one nucleic acid or amino acid. For example, the code R matches either adenine (A) or guanine (G).
Table 1: Nucleic Acid Codes
Code   Bases         Meaning
A      {A}           Adenine
B      {C,G,T,U}     Not A
C      {C}           Cytosine
D      {A,G,T,U}     Not C
G      {G}           Guanine
H      {A,C,T,U}     Not G
T      {T}           Thymine
V      {A,C,G}       Not T or U
U      {U}           Uracil
N      {A,C,G,T,U}   Any nucleic acid
R      {A,G}         Purine
Y      {C,T,U}       Pyrimidine
K      {G,T,U}       Ketone
M      {A,C}         Amino
S      {C,G}         Strong interaction
W      {A,T,U}       Weak interaction
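The ambiguity codes in Table 1 can be treated as sets of allowed bases. The following Python sketch is one possible way to test whether a concrete sequence is compatible with a pattern written in these codes. The names IUPAC_NUCLEOTIDES, code_matches, and pattern_matches are illustrative and are not part of the GenBank format or of any particular library.

```python
# Each IUPAC nucleotide code from Table 1, mapped to the bases it can stand for.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG", "Y": "CTU", "K": "GTU", "M": "AC", "S": "CG", "W": "ATU",
    "B": "CGTU", "D": "AGTU", "H": "ACTU", "V": "ACG", "N": "ACGTU",
}


def code_matches(code: str, base: str) -> bool:
    """Return True if the concrete base is one of those allowed by the code."""
    return base.upper() in IUPAC_NUCLEOTIDES[code.upper()]


def pattern_matches(pattern: str, sequence: str) -> bool:
    """Return True if each position of sequence is allowed by the pattern."""
    return len(pattern) == len(sequence) and all(
        code_matches(code, base) for code, base in zip(pattern, sequence)
    )


print(code_matches("R", "G"))         # R allows A or G
print(pattern_matches("ARN", "AGT"))  # A, then a purine, then any base
```

A dictionary of strings keeps the table readable; converting the values to sets would be a reasonable alternative when the lookup is performed many times.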
Table 2: Amino Acid Codes
Code   Meaning
A      Alanine
B      D or N
C      Cysteine
D      Aspartic acid
E      Glutamic acid
F      Phenylalanine
G      Glycine
H      Histidine
I      Isoleucine
J      I or L
K      Lysine
L      Leucine
M      Methionine
N      Asparagine
O      Pyrrolysine
P      Proline
Q      Glutamine
R      Arginine
S      Serine
T      Threonine
U      Selenocysteine
V      Valine
W      Tryptophan
X      Any amino acid
Y      Tyrosine
Z      E or Q
*      Translation stop
-      Gap of indeterminate length
Content-Type: chemical/seq-na-genbank
Import a DNA sequence from a GenBank file.
Examine positions 200 through 250.
Count the frequency of each nucleotide base within the sequence.
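The three steps above can be sketched outside of Import as well. The following Python example is a minimal illustration, not the implementation used by Import. It assumes the usual GenBank flat-file layout, in which the sequence follows a line beginning with ORIGIN and the record ends with a line beginning with //. The record in SAMPLE is invented for illustration, and the helper names read_genbank_sequence and subsequence are hypothetical. Because the sample holds only 24 bases, it examines positions 5 through 10 rather than 200 through 250.

```python
from collections import Counter


def read_genbank_sequence(text: str) -> str:
    """Extract the sequence of the first record from GenBank flat-file text."""
    chunks = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break  # end of the first record
        elif in_origin:
            # Drop the leading position number and the spaces between blocks.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(chunks).upper()


def subsequence(sequence: str, start: int, end: int) -> str:
    """Return positions start through end, counted from 1 and inclusive."""
    return sequence[start - 1:end]


# Hypothetical record, invented for illustration only.
SAMPLE = """\
LOCUS       EXAMPLE        24 bp    DNA
ORIGIN
        1 gatcctccat atacaacggt atct
//
"""

sequence = read_genbank_sequence(SAMPLE)
window = subsequence(sequence, 5, 10)
counts = Counter(sequence)

print(sequence)
print(window)
print(sorted(counts.items()))
```

For a real file, the text would be read from disk first, and subsequence(sequence, 200, 250) would give the range described above.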
IUPAC code for incomplete nucleic acid specification, National Center for Biotechnology Information.
A One-Letter Notation for Amino Acid Sequences, International Union of Pure and Applied Chemistry.
See Also
Formats
Formats,FASTA
Formats,FASTQ