MolType – specifying biological sequence type

8.2. MolType – specifying biological sequence type

True biological sequences are polymers of monomeric units with distinct chemical properties. Each sequence type is described by a specific alphabet of allowed characters, a collection of ambiguous characters [1], and the operations that can be performed on its sequences. In cogent3, we represent some of these concepts using an object type called a MolType. The available moltypes are displayed by calling the convenience function available_moltypes().

from cogent3 import available_moltypes

available_moltypes()
Specify a moltype by the Abbreviation (case insensitive).

Abbreviation         Number of states  Moltype
'ab'                 2                 MolType(('a', 'b'))
'dna'                4                 MolType(('T', 'C', 'A', 'G'))
'rna'                4                 MolType(('U', 'C', 'A', 'G'))
'protein'            21                MolType(('A', 'C', 'D', 'E', 'F', 'G', ...
'protein_with_stop'  22                MolType(('A', 'C', 'D', 'E', 'F', 'G', ...
'text'               52                MolType(('a', 'b', 'c', 'd', 'e', 'f', ...
'bytes'              256               MolType(('\x00', '\x01', '\x02', '\x03'...

7 rows x 3 columns

Note

Typically, you specify the moltype of your data simply by passing the string abbreviation of the appropriate molecular type.

To illustrate the object's capabilities, we load the DNA moltype and use some of its methods.

from cogent3 import get_moltype

dna = get_moltype("dna")
dna
MolType(('T', 'C', 'A', 'G'))

For this moltype, there’s a notion of the complement of a sequence

dna.complement("TTGG")
'AACC'

and of the reverse complement.

dna.rc("TTGG")
'CCAA'
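
To make these two operations concrete, here is a minimal pure-Python sketch of what they compute. It is not cogent3's implementation: it assumes an unambiguous, uppercase DNA string, and the names COMPLEMENT, complement and reverse_complement are hypothetical helpers introduced only for illustration.

```python
# Hypothetical helpers for illustration only; in practice, use the MolType methods.
# Translation table pairing each base with its Watson-Crick partner.
COMPLEMENT = str.maketrans("TCAG", "AGTC")


def complement(seq: str) -> str:
    """Swap each base for its complementary base."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Complement the sequence, then reverse its order."""
    return complement(seq)[::-1]


print(complement("TTGG"))          # AACC
print(reverse_complement("TTGG"))  # CCAA
```

Applying the reverse complement twice returns the original sequence, which is a handy sanity check for any implementation.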

The IUPAC ambiguities for DNA are accessed as an attribute (which is just a dict).

dna.ambiguities
{'?': ('T', 'C', 'A', 'G', '-'),
 '-': ('-',),
 'N': ('A', 'C', 'T', 'G'),
 'R': ('A', 'G'),
 'Y': ('C', 'T'),
 'W': ('A', 'T'),
 'S': ('C', 'G'),
 'K': ('T', 'G'),
 'M': ('C', 'A'),
 'B': ('C', 'T', 'G'),
 'D': ('A', 'T', 'G'),
 'H': ('A', 'C', 'T'),
 'V': ('A', 'C', 'G'),
 'T': ('T',),
 'C': ('C',),
 'A': ('A',),
 'G': ('G',)}
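
One use of such a mapping is to enumerate every unambiguous sequence that an ambiguous one could stand for. The sketch below is a standalone illustration under stated assumptions: it copies a small subset of the dict shown above by hand rather than querying cogent3, and the names AMBIGUITIES and expand are hypothetical.

```python
from itertools import product

# A hand-copied subset of the IUPAC DNA ambiguity codes shown above
# (assumption: this sketch does not read them from cogent3).
AMBIGUITIES = {
    "R": ("A", "G"),
    "Y": ("C", "T"),
    "N": ("A", "C", "T", "G"),
}


def expand(seq: str) -> list[str]:
    """Return every unambiguous sequence consistent with seq."""
    # Characters without an entry stand only for themselves.
    choices = [AMBIGUITIES.get(char, (char,)) for char in seq]
    return ["".join(combo) for combo in product(*choices)]


print(expand("ARY"))  # ['AAC', 'AAT', 'AGC', 'AGT']
```

The number of expansions is the product of the number of choices at each position, so this approach is only practical for short sequences with few ambiguous characters.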

The alphabet attribute defines the alphabet states and provides, among other things, mappings between characters ("A", "C", etc.) and integers (which is how some data structures represent sequences).

dna.alphabet
('T', 'C', 'A', 'G')
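
The idea behind mapping characters to integers can be sketched in plain Python: each character is encoded as its position in the ordered tuple of states. This is an illustration under stated assumptions, not the cogent3 API; STATES is copied from the output above, and CHAR_TO_INDEX, to_indices and from_indices are hypothetical names.

```python
# State order copied from the DNA alphabet shown above.
STATES = ("T", "C", "A", "G")

# Hypothetical lookup table: each character maps to its position in STATES.
CHAR_TO_INDEX = {char: index for index, char in enumerate(STATES)}


def to_indices(seq: str) -> list[int]:
    """Encode a sequence as a list of integer state indices."""
    return [CHAR_TO_INDEX[char] for char in seq]


def from_indices(indices: list[int]) -> str:
    """Decode a list of integer state indices back into a string."""
    return "".join(STATES[index] for index in indices)


print(to_indices("GATTC"))  # [3, 2, 0, 0, 1]
```

Because the encoding depends on the order of the states, the same sequence would map to different integers under an alphabet with a different ordering.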