8.10. Modelling sequence evolution#

cogent3 has a lot of substitution models! The following displays just the nucleotide models.

from cogent3 import available_models

available_models(model_types="nucleotide")
Specify a model using 'Abbreviation' (case sensitive).
Model TypeAbbreviationDescription
nucleotideBHBarry and Hartigan Discrete Time substitution model Barry and Hartigan 1987. Biometrics 43: 261–76.
nucleotideDTDiscrete Time substitution model (non-stationary, non-reversible). motif_length=2 makes this a dinucleotide model, motif_length=3 a trinucleotide model.
nucleotideGNGeneral Markov Nucleotide (non-stationary, non-reversible). Kaehler, Yap, Zhang, Huttley, 2015, Sys Biol 64 (2): 281–93
nucleotidessGNstrand-symmetric general Markov nucleotide (non-stationary, non-reversible). Kaehler, 2017, Journal of Theoretical Biology 420: 144–51
nucleotideK80Kimura 1980
nucleotideJC69Jukes and Cantor's 1969 model
nucleotideGTRGeneral Time Reversible nucleotide substitution model.
nucleotideTN93Tamura and Nei 1993 model
nucleotideHKY85Hasegawa, Kishino and Yano 1985 model
nucleotideF81Felsenstein's 1981 model

10 rows x 3 columns

These are cogent3 apps. There’s also an app for performing hypothesis testing using these models.

8.10.1. Using the model app#

Using the model app involves specifying the substitution model name and a tree. If you are analysing just 3 sequences, the tree is not required. Otherwise, the tree can be a path to a file or a newick formatted string.

The resulting model instance can then be called as if it was a function, passing it an alignment instance. It will maximise the likelihood of the model for the alignment and return a custom result data type.

I’m creating an app to apply the F81 substitution model to an alignment and specifying the a tree as a file path.

from cogent3 import get_app

f81 = get_app("model", "F81", tree="data/brca1_primate.tree")

Warning

If you are providing a tree, the tip names must exactly match the names in the alignments that will be analysed.

Note

If you are analysing 3 sequences, there is only one possible unrooted tree and the model object will automatically construct it for you.

We now load our alignment and fit the model to it.

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/brca1_primate.fasta", moltype="dna")
result = f81(aln)

8.10.2. model_result#

The app returns a model_result object whose representation provides an overview of the fitted model.

result
F81
keylnLnfpDLCunique_Q
'F81'-7175.775611TrueTrue

The column titles in that summary display have the following meaning

Column Heading
keyThe dictionary key for accessing this model
lnLThe log-likelihood value
nfpThe number of free parameters in the model
DLCDiagonal largest in column in all P matrices.
unique_QWhether the Q uniquely map to P

The last two are concerned with model identifiability (if either is False, the results could be unreliable).

To see the parameter MLEs, we access the lf attribute [1]

result.lf

F81

log-likelihood = -7175.7756

number of free parameters = 11

Edge params
edgeparentlength
Galagoroot0.1666
HowlerMonroot0.0438
Rhesusedge.30.0215
Orangutanedge.20.0077
Gorillaedge.10.0025
Humanedge.00.0061
Chimpanzeeedge.00.0028
edge.0edge.10.0000
edge.1edge.20.0035
edge.2edge.30.0117
edge.3root0.0085
Motif params
ACGT
0.37570.17420.20950.2406

There are 2 tables always present in this display – “edge params” and “motif params”. The former will always show the branch lengths. The following displays the tree, with the branch length defined by the values in the “edge params” table. The columns “edge” and “parent” denote which branch the “length” column value corresponds to [2]. That value is the MLE for the expected number of substitutions per site (our measure of evolutionary time).

dnd = result.tree.get_figure()
dnd.scale_bar = "top left"
dnd.show()

We can get all those statistics out as cogent3 tables using the tabulate_stats app

tabulate = get_app("tabulate_stats")
tables = tabulate(result)
tables
2x tabular_result('edge params': Table, 'motif params': Table)

which allows us to get at the parts from the lf display

tables["edge params"]
edge params
edgeparentlength
Galagoroot0.1666
HowlerMonroot0.0438
Rhesusedge.30.0215
Orangutanedge.20.0077
Gorillaedge.10.0025
Humanedge.00.0061
Chimpanzeeedge.00.0028
edge.0edge.10.0000
edge.1edge.20.0035
edge.2edge.30.0117
edge.3root0.0085

11 rows x 3 columns

8.10.3. Motif params are the state probabilities#

The “motif params” table corresponds to \({\mathbf \pi}\), the frequencies of nucleotides in the unobserved ancestor [3] of the alignment. More generally, cogent3 uses the variable motif_probs to denote this. For instance, the alignment has a method for getting the motif probabilities as the average across all sequences.

aln.get_motif_probs()
{'T': 0.24063356685957965,
 'C': 0.17418011980911768,
 'A': 0.37572342369783734,
 'G': 0.20946288963346532}

With default model settings, the values returned by the model will be the same as this.

tables["motif params"]
motif params
ACGT
0.37570.17420.20950.2406

1 rows x 4 columns

8.10.4. Refining the model app settings#

8.10.4.1. Optimising the motif probabilities#

For continuous-time substitution models, the default model settings are that the process is time-homogeneous (one rate matrix for the entire tree) and \({\mathbf \pi}\), the state (or motif) probabilities at the root is just that returned by the alignment method. The latter can be changed so that the motif probabilities are treated as free parameters and optimised.

f81_mprobs = get_app("model",
    "F81",
    name="F81-mprobs free",
    optimise_motif_probs=True,
    tree="data/brca1_primate.tree",
)

result = f81_mprobs(aln)

Note that the number of free parameters has now increased by 3.

result
F81-mprobs free
keylnLnfpDLCunique_Q
'F81-mprobs free'-7174.844514TrueTrue

and that the motif probs are different to those above.

result.lf

F81-mprobs free

log-likelihood = -7174.8445

number of free parameters = 14

Edge params
edgeparentlength
Galagoroot0.1666
HowlerMonroot0.0438
Rhesusedge.30.0215
Orangutanedge.20.0077
Gorillaedge.10.0025
Humanedge.00.0061
Chimpanzeeedge.00.0028
edge.0edge.10.0000
edge.1edge.20.0035
edge.2edge.30.0117
edge.3root0.0085
Motif params
ACGT
0.37440.17550.21700.2331

8.10.4.2. Constraining a substitution model parameter#

If we wish to modify the definition of a likelihood function we need to set rules for the parameter of interest. We do this using the param_rules argument to model().

Let’s create a new instance of the HKY85 and constrain the likelihood function so that it is F81. To do that, we need to set the \(\kappa\) parameter to be equal to 1, and constant (so the numerical optimiser does not modify it during it’s work).

hky_as_f81 = get_app("model",
    "HKY85",
    name="HKY85 with κ=1",
    tree="data/brca1_primate.tree",
    param_rules=[dict(par_name="kappa", is_constant=True, value=1)],
)

result = hky_as_f81(aln)
result
HKY85 with κ=1
keylnLnfpDLCunique_Q
'HKY85 with κ=1'-7175.775611TrueTrue

As there can be more than one parameter rule applied to a function, we provide them as a list. Each element of the list must be a dictionary with at least one key "par_name". This is the name of the parameter we wish to modify and the value must be a string. The is_constant=True sets that parameter as a constant with value=1.

result.lf

HKY85 with κ=1

log-likelihood = -7175.7756

number of free parameters = 11

Global params
kappa
1
Edge params
edgeparentlength
Galagoroot0.1666
HowlerMonroot0.0438
Rhesusedge.30.0215
Orangutanedge.20.0077
Gorillaedge.10.0025
Humanedge.00.0061
Chimpanzeeedge.00.0028
edge.0edge.10.0000
edge.1edge.20.0035
edge.2edge.30.0117
edge.3root0.0085
Motif params
ACGT
0.37570.17420.20950.2406

From that you can see the lnL and nfp are identical to our first fitting of the F81 model. While the display includes “kappa” (in the “global params” table), it has the value 1 and was not changed by the optimiser.

More generally, the “global params” table shows MLEs for the exchangeability terms in \({\mathbf{\rm Q}}\).

8.10.5. Testing hypotheses using the hypothesis app#

To perform a hypothesis test, we need two models. I will create a HKY85 model which will serve as our alternate to the null of F81.

hky85 = get_app("model","HKY85", tree="data/brca1_primate.tree")

We construct the hypothesis app by providing the null and alternate models.

hyp = get_app("hypothesis", f81, hky85)

We apply it to our alignment.

result = hyp(aln)

The hypothesis app fits both models to the alignment. One important point to make is, in the case of nested models, it will use the MLEs from the null as “seed” values to alternate as a pre-step to numerical optimisation. This is done for 2 reasons. First, it typically speeds up fitting. Second, it guarantees the likelihood from the alternate will be ≥ that of the null.

8.10.6. hypothesis_result#

The hypothesis app returns a hypothesis_result object, which also provides an overview of the hypothesis test outcome.

result
Statistics
LRdfpvalue
341.837512.54e-76
hypothesiskeylnLnfpDLCunique_Q
null'F81'-7175.775611TrueTrue
alt'HKY85'-7004.856912TrueTrue

The column titles in that summary display have the following meaning

Column Heading
LRLikelihood ratio statistic
dfdegrees of freedom
pvaluep-value
hypothesisnull or alternate

The LR statistic (computed using this equation) and df are computed from the lnL and nfp values, respectively, of the different models under “hypothesis”. The \(p\)-value is the probability of a LR ≥ the observed statistic from the theoretical \(\chi^2\) distribution with df degrees of freedom.

The model_result for each of the models is available as .null and .alt attributes.

result.alt
HKY85
keylnLnfpDLCunique_Q
'HKY85'-7004.856912TrueTrue

hypothesis_result also behaves like a dictionary, with the keys being the model name. So this display is identical to the previous one.

result["HKY85"]
HKY85
keylnLnfpDLCunique_Q
'HKY85'-7004.856912TrueTrue

8.10.7. The NotCompleted object#

If your analysis returns one of these, it means there was an error or the algorithm could not complete. That object contains information about where the error occurred and what type it was. See the cogent3 documentation for more details.

8.11. Exercises#

The following require you to repeat the above hypothesis test. Download the alignment of primate BRCA1 sequences and the tree, or using Python.

  1. Select 3 sequences from the alignment and perform a hypothesis test comparing F81 and HKY85.

  2. From the test result, which model do you choose and why?

  3. Displaying the MLE’s for the null hypothesis and the alternate hypothesis in separate cells. Which specific parameter value(s) is responsible for the difference in likelihoods?

  4. Give a biological interpretation of the result.

  5. What are the assumptions of these models?