8.3. Alignments, Sequence Collections and Sequences#

Biological sequences are represented by several objects in cogent3. Typically, these are created by loading from file or by making them directly from standard python types. There is no more fundamental type than that of sequences. So let’s just get cracking and actually load some data from a file. Along the way, I’ll point out different aspects of the process of creating objects that you need to pay attention to.

8.3.1. Alignments#

8.3.1.1. Loading aligned sequences from a file#

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/brca1-bats.fasta", moltype="dna")
aln
0
LittleBroTGTGGCACAGATACTCATGCCAGCTCATTA
FlyingFox.........A..G.............T...
DogFaced.........A............A.......
FreeTaile..............................
TombBat.........AG................G..

5 x 3009 (truncated to 5 x 30) dna alignment

The moltype argument specifies the molecular type of the sequence. See MolType – specifying biological sequence type for more information.

Note

Replace the path in the above code snippet with this link (control+click to copy the url).

type(aln)
cogent3.core.alignment.ArrayAlignment

The ArrayAlignment is one of the two different cogent3 types for handling sequence alignments and is built using numpy.arrays [1]. So, under the hood, sequences are represented as arrays of integers with the integer values corresponding to the index of the character in the alphabet. The class handles conversions to / from the standard text representation.

For all the methods, see ArrayAlignment.

When we talk about an alignment with cogent3, this is what we mean.

8.3.1.2. Getting the length of an alignment#

len(aln)
3009

8.3.1.3. Getting the lengths of individual sequences#

Wait! Aren’t all sequences in a alignment exactly the same length? Yes, but only because the include gaps. It’s often the case that you want to know how much actual sequence data there is for each sequence. In this case, we use the get_lengths() method.

aln.get_lengths()
FlyingFoxDogFacedFreeTaileLittleBroTombBat
27012685274727862713

8.3.1.4. Getting the number of sequences#

aln.num_seqs
5

8.3.1.5. Getting the sequence names#

These are available as the names attribute.

aln.names
['FlyingFox', 'DogFaced', 'FreeTaile', 'LittleBro', 'TombBat']

8.3.1.6. Getting the individual sequences#

Alignments have many useful attributes, including the individual sequences. These can be accessed either by the seqs attribute of the instance (this is just a list of sequence objects).

aln.seqs[0]
0
FlyingFoxTGTGGCACAAATGCTCATGCCAGCTCTTTA

DnaSequence, length=3,009 (truncated to 30)

by the named_seqs attribute, which is a dictionary.

aln.named_seqs["TombBat"]
0
TombBatTGTGGCACAAGTACTCATGCCAGCTCAGTA

DnaSequence, length=3,009 (truncated to 30)

or by the method get_seq().

aln.get_seq(aln.names[0])
0
FlyingFoxTGTGGCACAAATGCTCATGCCAGCTCTTTA

DnaSequence, length=3,009 (truncated to 30)

8.3.1.7. Alignments are aligned column based#

This means when you slice them, you are slicing alignment columns.

aln[10:20]
0
DogFacedATACTCATGC
FlyingFox..G.......
FreeTaile..........
LittleBro..........
TombBatG.........

5 x 10 dna alignment

You can also use a “stride”.

aln[10:20:3]
0
DogFacedACAC
FlyingFox....
FreeTaile....
LittleBro....
TombBatG...

5 x 4 dna alignment

Warning

Slicing with a stride only works for the ArrayAlignment class.

8.3.1.8. cogent3 Alignment types are immutable!#

So any method that modifies their data returns a new instance.

8.3.1.9. Getting a subset of sequences#

This is done via a method.

subset = aln.take_seqs(["TombBat", "FlyingFox"])
subset
0
TombBatTGTGGCACAAGTACTCATGCCAGCTCAGTA
FlyingFox..........A.G.............TT..

2 x 3009 (truncated to 2 x 30) dna alignment

8.3.1.10. Converting sequences into a standard Python dict#

This is useful if you want to directly manipulate the strings, for instance [2]

data = subset[:21].to_dict()
data
{'TombBat': 'TGTGGCACAAGTACTCATGCC', 'FlyingFox': 'TGTGGCACAAATGCTCATGCC'}

8.3.1.11. Creating an alignment from a Python dict#

We use a different function for building an alignment from standard Python types. The function has a very similar interface to load_unaligned_seqs().

from cogent3 import make_aligned_seqs

subset2 = make_aligned_seqs(data=data, moltype="dna")
subset2
0
FlyingFoxTGTGGCACAAATGCTCATGCC
TombBat..........G.A........

2 x 21 dna alignment

8.3.1.12. Writing sequences to file#

The various alignment and sequence collection objects have a write() method. Providing a file path with a known suffix generates a text file with that format. For example

subset2.write("some_dir/subset2.fasta")

will produce a fasta formatted sequence file.

8.3.1.13. Interpreting the display of alignments in Jupyter notebooks#

The visualisation you see is a style known as a pretty print. The "." character indicates a match to the character in the first sequence in that column. We refer to this first sequence as the reference.

Colouring is provided for alignments with RNA, DNA or PROTEIN moltypes. If you do not specify a moltype on loading / creating an alignment, the display will not be coloured.

8.3.1.14. Controlling the display in Jupyter notebooks#

This is done via modifying the representation policy. You can change the number of sequences (num_seqs), the number of aligned positions that will be shown (num_pos), how many columns to display per line (wrap).

aln.set_repr_policy(num_pos=15, wrap=10)
aln
0
LittleBroTGTGGCACAG
FlyingFox.........A
DogFaced.........A
FreeTaile..........
TombBat.........A
10
LittleBroATACT
FlyingFox..G..
DogFaced.....
FreeTaile.....
TombBatG....

5 x 3009 (truncated to 5 x 15) dna alignment

Warning

Rendering the html can take a long time if the number of positions (and/or sequences) is large.

You can also specify the sequence to be used as a reference (the default is to use the longest sequence without gaps).

aln.set_repr_policy(ref_name="FreeTaile")
aln
0
FreeTaileTGTGGCACAG
FlyingFox.........A
DogFaced.........A
LittleBro..........
TombBat.........A
10
FreeTaileATACT
FlyingFox..G..
DogFaced.....
LittleBro.....
TombBatG....

5 x 3009 (truncated to 5 x 15) dna alignment

8.3.1.15. Translating nucleic acids to protein#

There are a few factors to consider here. First, some sequences may be incomplete – meaning the actual sequence does not cover the entire gene and may end with an incomplete codon. Second, the sequence may be complete but terminate with a stop codon. Both of those will cause the translation method to fail. In this case, the data has an incomplete codon (it contains a gap character), which we address as follows

aa_aln = aln.get_translation(incomplete_ok=True)
aa_aln
0
LittleBroCGTDTHASSLQHENSSLLLTKDRMNVEKAE
FlyingFox...NA.........-...Y.........TD
DogFaced...N...N..........Y.........TD
FreeTaile..............................
TombBat...S.....V..................LD

5 x 1003 (truncated to 5 x 30) protein alignment

If the failure is due to having a stop codon, using the trim_stop_codons() method first will do the trick, so long as the stop is at the end.

Another key consideration for translation is to specify the genetic code. The default is to use the standard vertebrate code. (See Genetic Codes for more details on what cogent3 provides.) We will demonstrate specifying the standard code explicitly (using gc=1).

aa_aln = aln.get_translation(incomplete_ok=True, gc=1)

8.3.1.16. Getting the reverse complement of nucleic acid sequences#

Use the rc() method!

subset_rc = subset.rc()
subset_rc
0
TombBat-------------------------GGCTC
FlyingFox.............TCCTCTTCTGAG.A...

2 x 3009 (truncated to 2 x 30) dna alignment

8.3.2. SequenceCollection – for unaligned collections of sequences#

If your sequences are not aligned, they may not be of the same length. To load such sequence data from file, or create from Python objects, you use the functions load_unaligned_seqs() and make_unaligned_seqs(). The signatures of these functions match those of their counterparts for aligned sequences. Likewise, many of the methods on SequenceCollection are the same as for the alignment data types (see SequenceCollection for documentation). But note that a SequenceCollection cannot be sliced.

8.3.2.1. Making from a collection of unaligned sequences from dict#

from cogent3 import make_unaligned_seqs

data = {"seq-0": "ACGGT", "seq-1": "AGGGACGTA"}
coll = make_unaligned_seqs(data=data, moltype="dna")
coll
<cogent3.core.alignment.SequenceCollection at 0x7faa073e9760>
seq_0 = coll.named_seqs["seq-0"]
seq_0
0
seq-0ACGGT

DnaSequence, length=5

8.3.2.2. Making from a collection of unaligned sequences an Alignment#

Just use the degap() method. This strips all gap characters (“-”) from the sequences.

seq_coll = aln.degap()
seq_coll
<cogent3.core.alignment.SequenceCollection at 0x7faa4c113370>

8.3.2.3. Reverse complement and many other methods are available as for alignment data types#

rc_ed = coll.rc()
rc_ed.named_seqs["seq-0"]
0
seq-0ACCGT

DnaSequence, length=5

8.3.3. Sequences#

Collections and alignments give you an organised interface to manipulate groups of sequences. There is also a specific set of sequence data types. These consist of classes that are specific to the different molecular types. (See DnaSequence and ProteinSequence for the documentation.)

We can make a sequence from Python data types.

from cogent3 import make_seq

seq = make_seq("ACGTTTAAA", name="seq-0", moltype="dna")
seq
0
seq-0ACGTTTAAA

DnaSequence, length=9

Sequences are loaded from file using the load data functions for collections (load_unaligned_seqs), or alignments (load_aligned_seqs).

8.4. Exercises#

Download the alignment of bat BRCA1 sequences, or using Python.

  1. Set the incomplete_ok argument in the get_translation() method to False. What happens and why? (Use data = dict(seq1=”ACGAC-”, seq2=”TCGACA”) as your data.)

  2. Create an alignment from a dict with sequences that you make up [3]. Slice the alignment to remove the last 3 aligned columns.

  3. Create an alignment from a dict with sequences that you make up. Slice the alignment to get every second codon position.

  4. Using the downloaded alignment, count the number of second codon positions that have variation.

  5. Load the downloaded alignment without specifying the moltype. Use a method on the object to convert it to the DNA moltype.