8.2. MolType – specifying biological sequence type
True biological sequences are polymers of monomeric units with distinct chemical properties. These are represented by specific alphabets of allowed characters, a collection of ambiguous characters [1], and various operations that can be performed on them. In cogent3, we represent some of these concepts using an object type called a MolType. The available moltypes are displayed by calling a convenience function, available_moltypes().
from cogent3 import available_moltypes
available_moltypes()
Abbreviation | Number of states | Moltype |
---|---|---|
'ab' | 2 | MolType(('a', 'b')) |
'dna' | 4 | MolType(('T', 'C', 'A', 'G')) |
'rna' | 4 | MolType(('U', 'C', 'A', 'G')) |
'protein' | 21 | MolType(('A', 'C', 'D', 'E', 'F', 'G', ... |
'protein_with_stop' | 22 | MolType(('A', 'C', 'D', 'E', 'F', 'G', ... |
'text' | 52 | MolType(('a', 'b', 'c', 'd', 'e', 'f', ... |
'bytes' | 256 | MolType(('\x00', '\x01', '\x02', '\x03'... |
7 rows x 3 columns
Note
Typically, you specify the moltype of your data by passing the string abbreviation of the appropriate molecular type.
To illustrate the object capabilities, we load the DNA moltype and use some of the methods.
from cogent3 import get_moltype
dna = get_moltype("dna")
dna
MolType(('T', 'C', 'A', 'G'))
For this moltype, there’s a notion of the complement of a sequence
dna.complement("TTGG")
'AACC'
and of the reverse complement.
dna.rc("TTGG")
'CCAA'
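To make the two operations above concrete, here is a minimal plain-Python sketch of what complementing and reverse complementing a DNA string involve. It is not how cogent3 implements them; the helper names `complement` and `reverse_complement` are hypothetical, and only the four canonical bases are handled.

```python
# Illustrative only: NOT cogent3's implementation.
# Translation table pairing each canonical base with its complement
# (T<->A, C<->G). Ambiguity codes and gaps are deliberately ignored.
_COMPLEMENT = str.maketrans("TCAG", "AGTC")


def complement(seq: str) -> str:
    """Return the complement of a DNA string, base by base."""
    return seq.translate(_COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Return the complement of a DNA string, read in reverse order."""
    return complement(seq)[::-1]


print(complement("TTGG"))          # AACC
print(reverse_complement("TTGG"))  # CCAA
```

The outputs match those of dna.complement() and dna.rc() shown above for the same input.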
The IUPAC ambiguities for DNA are accessed as an attribute (which is just a dict).
dna.ambiguities
{'?': ('T', 'C', 'A', 'G', '-'),
'-': ('-',),
'N': ('A', 'C', 'T', 'G'),
'R': ('A', 'G'),
'Y': ('C', 'T'),
'W': ('A', 'T'),
'S': ('C', 'G'),
'K': ('T', 'G'),
'M': ('C', 'A'),
'B': ('C', 'T', 'G'),
'D': ('A', 'T', 'G'),
'H': ('A', 'C', 'T'),
'V': ('A', 'C', 'G'),
'T': ('T',),
'C': ('C',),
'A': ('A',),
'G': ('G',)}
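Because the ambiguities are a plain mapping from a character to the states it can stand for, they are easy to work with directly. The sketch below copies a subset of the mapping shown above by hand, so it runs without cogent3, and uses it to check whether a sequence is degenerate and to enumerate the concrete sequences it could represent. The names `AMBIGUITIES`, `is_degenerate` and `expand` are hypothetical and are not part of the cogent3 API.

```python
from itertools import product

# Hand-copied subset of the DNA ambiguity mapping, for illustration only.
AMBIGUITIES = {
    "N": ("A", "C", "T", "G"),
    "R": ("A", "G"),
    "Y": ("C", "T"),
    "T": ("T",),
    "C": ("C",),
    "A": ("A",),
    "G": ("G",),
}


def is_degenerate(seq: str) -> bool:
    """True if any character can stand for more than one state."""
    return any(len(AMBIGUITIES[char]) > 1 for char in seq)


def expand(seq: str) -> list[str]:
    """List every concrete sequence consistent with the ambiguity codes."""
    options = (AMBIGUITIES[char] for char in seq)
    return ["".join(chars) for chars in product(*options)]


print(is_degenerate("ARY"))  # True
print(expand("ARY"))         # ['AAC', 'AAT', 'AGC', 'AGT']
```

Note that the number of expansions grows multiplicatively with each ambiguous position, so this approach is only practical for short sequences.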
The alphabet attribute defines the alphabet states and provides, among other things, mappings between characters ("A", "C", etc.) and integers (which is how some data structures represent sequences).
dna.alphabet
('T', 'C', 'A', 'G')
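As a rough illustration of the character-to-integer mapping described above, the following plain-Python sketch assigns each state its position in the alphabet order shown and converts a sequence in both directions. It assumes the index of a character is simply its position in the tuple; the helper names `encode` and `decode` are hypothetical and are not the methods cogent3 provides.

```python
# Illustrative only: the state order is copied from the output above.
STATES = ("T", "C", "A", "G")

# Assumption: each character maps to its position in STATES.
CHAR_TO_INDEX = {char: index for index, char in enumerate(STATES)}


def encode(seq: str) -> list[int]:
    """Convert a string of canonical states into a list of integers."""
    return [CHAR_TO_INDEX[char] for char in seq]


def decode(indices: list[int]) -> str:
    """Convert a list of integers back into a string of states."""
    return "".join(STATES[index] for index in indices)


print(encode("TTGG"))        # [0, 0, 3, 3]
print(decode([0, 0, 3, 3]))  # TTGG
```

Storing sequences as small integers makes them suitable for compact array storage and fast numerical operations such as counting states.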