Exploring genetic variation using Alignments

8.6. Exploring genetic variation using Alignments#

Alignments and sequence collections have methods to facilitate exploring genetic variation.

8.6.1. Find the variable positions#

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/brca1-bats.fasta", moltype="dna")
aln = aln[:30]
var_pos = aln.variable_positions(include_gap_motif=False)
var_pos

[9, 10, 12, 22, 26, 27]

8.6.2. Entropy per aligned position#

We start by looking at entropy measures (see Measuring information using Shannon’s entropy for background). We will focus on alignments and column wise measurement.

aln = load_aligned_seqs("data/brca1.fasta", moltype="dna")
aln = aln[9:120]

aln.entropy_per_pos()

array([1.09747064, 0.34246377, 0.        , 0.48896604, 0.25387844,
       0.25387844, 0.25387844, 0.6682137 , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.45668363, 0.13709948,
       0.        , 0.3182153 , 0.62190414, 0.23519338, 0.13709948,
       0.23519338, 0.139233  , 0.45668363, 0.3711939 , 0.1350362 ,
       0.96799794, 0.1350362 , 0.31381296, 0.1350362 , 0.50951572,
       0.2695489 , 0.80824065, 0.23181305, 0.46025743, 0.37677436,
       0.        , 0.        , 0.1350362 , 0.2695489 , 0.1350362 ,
       0.        , 0.23181305, 0.3860189 , 0.1350362 , 0.65727298,
       0.3860189 , 0.7452063 , 0.1350362 , 0.        , 0.65120959,
       0.50951572, 0.61219613, 0.44724744, 0.3860189 , 0.44724744,
       0.23181305, 0.        , 0.1350362 , 0.58012635, 0.1350362 ,
       0.58310316, 0.1350362 , 0.23181305, 0.58310316, 0.36579197,
       0.1350362 , 0.13709948, 0.52609979, 0.27365492, 0.1350362 ,
       0.1350362 , 0.        , 0.        , 0.        , 0.13303965,
       0.51191005, 0.13303965, 0.        , 0.        , 0.22853814,
       0.44104126, 0.51191005, 0.        , 0.36055986, 0.26557518,
       0.13303965, 0.22853814, 0.60529121, 0.30954343, 0.36055986,
       0.38094659, 0.13303965, 0.44104126, 0.        , 0.30954343,
       0.22853814, 0.2695489 , 0.        , 0.55642157, 0.13303965,
       0.        , 0.26557518, 0.30954343, 0.        , 0.22853814,
       0.13303965, 0.90550832, 0.26557518, 0.36055986, 0.22853814,
       0.13303965])

There are a number of options here, including the fact that we can specifically ask for entropy measurement for \(k\)-mers > 1. In cogent3, we use the argument motif_length to specify the value of \(k\).

8.6.3. Visual display of information per aligned position#

8.6.3.1. Sequence logo’s to represent information#

You’ve seen how sequence logos [SS90] are used to visualise PSSMs. They can also be used to visualise the amount of information at specific positions. The alignment objects have a seqlogo() method that computes the Information at a position. The resulting display has the letter stack height equalling the information.

logo = aln.seqlogo(width=700, height=300, wrap=60, vspace=0.07)
logo.show()

The wrap argument indicates how many alignment columns to include on a row. The vspace argument how much vertical space between rows. Typically, it takes a bit of tweaking to get the plot width / height right to make the display sensible.

Warning

This is a compute intensive type of display! Start with a short stretch of alignment first and gradually increase the length.

Note

cogent3 uses plotly for visualisation. Methods that return a drawable return a custom cogent3 object that behaves “like” a plotly object, but is not one.

8.6.3.2. Information plots#

This is a line plot that smoothes the information scores and also displays them with information about gaps.

aln = load_aligned_seqs("data/brca1.fasta", moltype="dna")
aln = aln[:1500]

info_plot = aln.information_plot(window=30)
info_plot.show(width=600, height=250)

You can remove one of the traces by clicking on it’s member in the figures legend. You can also zoom in on parts of the plot by click and drag to include the portion you want. Double click the plot to revert back.

8.6.4. Comparing sequences using dotplots#

cogent3 implements an advanced dotplot algorithm with some very useful features for exploring the relationship between sequences, and the quality of your alignment. Note that this method also exists on the SequenceCollection class.

8.6.4.1. Dotplot between random sequences#

subaln = aln[:750]
dp = subaln.dotplot()
dp.show(width=500, height=500)

Note

The alignment item in the legend shows the path the alignment algorithm found. Hopefully, that sits precisely on top of the main displayed diagonal!

8.6.4.2. Include the reverse complement in the dotplot#

dp = subaln.dotplot(rc=True)
dp.show(width=500, height=500)

8.6.4.3. Change the match criteria#

Two key arguments to the dotplot() method that affect the definition of a match are window and threshold. The former corresponds to specifying the size of the \(k\)-mer being compared. The latter controls how many characters within the \(k\)-mer must be identical for it to be considered a match.

dp = subaln.dotplot(rc=True, window=6, threshold=6)
dp.show(width=500, height=500)

8.6.4.4. Set a plot title#

dp = subaln.dotplot(rc=True, window=6, threshold=6, title="Demo dotplot")
dp.show(width=500, height=500)

8.6.4.5. Specify the sequences to be compared#

dp = subaln.dotplot(name1="Human", name2="Wombat")
dp.show(width=500, height=500)

If you just specify name1, then the second sequence will be chosen at random.

8.6.4.6. Dotplot a sequence to itself#

This can be useful for examining the existence of inversions, repeated sequence features, etc…

dp = subaln.dotplot(name1="Wombat", name2="Wombat", title="Wombats are awesome!")
dp.show(width=500, height=500)

8.7. Exercises#

Download the large alignment of BRCA1 sequences, or using Python.

Look at the ArrayAlignment documentation and identify methods that can be used to select the positions that are variable.
Search the internet for the definition of a moving average. Then experiment with changing the window argument to information_plot(). How do you interpret the impact of increasing the value of window?
Select a smallish segment from one of the sequences within the downloaded data set (say < 50 bases). Manually edit that so it contains an inversion [1]. Use make_unaligned_seqs() to create a sequence collection with just this edited sequence and dotplot this synthetic sequence to itself using rc=True.
Modify your synthetic sequence to have some repeats and see what the effect of changing window and threshold are on the detection of those.
How does interpretation of a sequence logo from an alignment of sequences descended from a common ancestor differ from that described in The role of sequence in encoding life?