8.5. Exploring genetic variation using Alignments#
Alignments and sequence collections have methods to facilitate exploring genetic variation.
8.5.1. Find the variable positions#
from cogent3 import load_aligned_seqs
aln = load_aligned_seqs("data/brca1-bats.fasta", moltype="dna")
aln = aln[:30]
var_pos = aln.variable_positions(include_gap_motif=False)
var_pos
[9, 10, 12, 22, 26, 27]
8.5.2. Entropy per aligned position#
We start by looking at entropy measures (see Measuring information using Shannon’s entropy for background). We will focus on alignments and column wise measurement.
aln = load_aligned_seqs("data/brca1.fasta", moltype="dna")
aln = aln[9:120]
aln.entropy_per_pos()
array([1.09747064, 0.34246377, 0. , 0.48896604, 0.25387844,
0.25387844, 0.25387844, 0.6682137 , 0. , 0. ,
0. , 0. , 0. , 0.45668363, 0.13709948,
0. , 0.3182153 , 0.62190414, 0.23519338, 0.13709948,
0.23519338, 0.139233 , 0.45668363, 0.3711939 , 0.1350362 ,
0.96799794, 0.1350362 , 0.31381296, 0.1350362 , 0.50951572,
0.2695489 , 0.80824065, 0.23181305, 0.46025743, 0.37677436,
0. , 0. , 0.1350362 , 0.2695489 , 0.1350362 ,
0. , 0.23181305, 0.3860189 , 0.1350362 , 0.65727298,
0.3860189 , 0.7452063 , 0.1350362 , 0. , 0.65120959,
0.50951572, 0.61219613, 0.44724744, 0.3860189 , 0.44724744,
0.23181305, 0. , 0.1350362 , 0.58012635, 0.1350362 ,
0.58310316, 0.1350362 , 0.23181305, 0.58310316, 0.36579197,
0.1350362 , 0.13709948, 0.52609979, 0.27365492, 0.1350362 ,
0.1350362 , 0. , 0. , 0. , 0.13303965,
0.51191005, 0.13303965, 0. , 0. , 0.22853814,
0.44104126, 0.51191005, 0. , 0.36055986, 0.26557518,
0.13303965, 0.22853814, 0.60529121, 0.30954343, 0.36055986,
0.38094659, 0.13303965, 0.44104126, 0. , 0.30954343,
0.22853814, 0.2695489 , 0. , 0.55642157, 0.13303965,
0. , 0.26557518, 0.30954343, 0. , 0.22853814,
0.13303965, 0.90550832, 0.26557518, 0.36055986, 0.22853814,
0.13303965])
There are a number of options here, including the fact that we can specifically ask for entropy measurement for \(k\)-mers > 1. In cogent3
, we use the argument motif_length
to specify the value of \(k\).
8.5.3. Visual display of information per aligned position#
8.5.3.1. Sequence logo’s to represent information#
You’ve seen how sequence logos [SS90] are used to visualise PSSMs. They can also be used to visualise the amount of information at specific positions. The alignment objects have a seqlogo()
method that computes the Information at a position. The resulting display has the letter stack height equalling the information.
logo = aln.seqlogo(width=700, height=300, wrap=60, vspace=0.07)
logo.show()
The wrap
argument indicates how many alignment columns to include on a row. The vspace
argument how much vertical space between rows. Typically, it takes a bit of tweaking to get the plot width / height right to make the display sensible.
Warning
This is a compute intensive type of display! Start with a short stretch of alignment first and gradually increase the length.
Note
cogent3
uses plotly for visualisation. Methods that return a drawable return a custom cogent3
object that behaves “like” a plotly object, but is not one.
8.5.3.2. Information plots#
This is a line plot that smoothes the information scores and also displays them with information about gaps.
aln = load_aligned_seqs("data/brca1.fasta", moltype="dna")
aln = aln[:1500]
info_plot = aln.information_plot(window=30)
info_plot.show(width=600, height=250)
You can remove one of the traces by clicking on it’s member in the figures legend. You can also zoom in on parts of the plot by click and drag to include the portion you want. Double click the plot to revert back.
8.5.4. Comparing sequences using dotplots#
cogent3
implements an advanced dotplot algorithm with some very useful features for exploring the relationship between sequences, and the quality of your alignment. Note that this method also exists on the SequenceCollection
class.
8.5.4.1. Dotplot between random sequences#
subaln = aln[:750]
dp = subaln.dotplot()
dp.show(width=500, height=500)
Note
The alignment
item in the legend shows the path the alignment algorithm found. Hopefully, that sits precisely on top of the main displayed diagonal!
8.5.4.2. Include the reverse complement in the dotplot#
dp = subaln.dotplot(rc=True)
dp.show(width=500, height=500)
8.5.4.3. Change the match criteria#
Two key arguments to the dotplot()
method that affect the definition of a match are window
and threshold
. The former corresponds to specifying the size of the \(k\)-mer being compared. The latter controls how many characters within the \(k\)-mer must be identical for it to be considered a match.
dp = subaln.dotplot(rc=True, window=6, threshold=6)
dp.show(width=500, height=500)
8.5.4.4. Set a plot title#
dp = subaln.dotplot(rc=True, window=6, threshold=6, title="Demo dotplot")
dp.show(width=500, height=500)
8.5.4.5. Specify the sequences to be compared#
dp = subaln.dotplot(name1="Human", name2="Wombat")
dp.show(width=500, height=500)
If you just specify name1
, then the second sequence will be chosen at random.
8.5.4.6. Dotplot a sequence to itself#
This can be useful for examining the existence of inversions, repeated sequence features, etc…
dp = subaln.dotplot(name1="Wombat", name2="Wombat", title="Wombats are awesome!")
dp.show(width=500, height=500)
8.6. Exercises#
Download the large alignment of BRCA1 sequences
, or using Python.
Look at the ArrayAlignment documentation and identify methods that can be used to select the positions that are variable.
Search the internet for the definition of a moving average. Then experiment with changing the
window
argument toinformation_plot()
. How do you interpret the impact of increasing the value ofwindow
?Select a smallish segment from one of the sequences within the downloaded data set (say < 50 bases). Manually edit that so it contains an inversion [1]. Use
make_unaligned_seqs()
to create a sequence collection with just this edited sequence and dotplot this synthetic sequence to itself usingrc=True
.Modify your synthetic sequence to have some repeats and see what the effect of changing
window
andthreshold
are on the detection of those.How does interpretation of a sequence logo from an alignment of sequences descended from a common ancestor differ from that described in The role of sequence in encoding life?