8.10. Modelling sequence evolution#
cogent3
has a lot of substitution models! The following displays just the nucleotide models.
from cogent3 import available_models
available_models(model_types="nucleotide")
Model Type | Abbreviation | Description |
---|---|---|
nucleotide | BH | Barry and Hartigan Discrete Time substitution model Barry and Hartigan 1987. Biometrics 43: 261–76. |
nucleotide | DT | Discrete Time substitution model (non-stationary, non-reversible). motif_length=2 makes this a dinucleotide model, motif_length=3 a trinucleotide model. |
nucleotide | GN | General Markov Nucleotide (non-stationary, non-reversible). Kaehler, Yap, Zhang, Huttley, 2015, Sys Biol 64 (2): 281–93 |
nucleotide | ssGN | strand-symmetric general Markov nucleotide (non-stationary, non-reversible). Kaehler, 2017, Journal of Theoretical Biology 420: 144–51 |
nucleotide | K80 | Kimura 1980 |
nucleotide | JC69 | Jukes and Cantor's 1969 model |
nucleotide | GTR | General Time Reversible nucleotide substitution model. |
nucleotide | TN93 | Tamura and Nei 1993 model |
nucleotide | HKY85 | Hasegawa, Kishino and Yano 1985 model |
nucleotide | F81 | Felsenstein's 1981 model |
10 rows x 3 columns
These are cogent3 apps. There’s also an app for performing hypothesis testing using these models.
8.10.1. Using the model
app#
Using the model
app involves specifying the substitution model name and a tree. If you are analysing just 3 sequences, the tree is not required. Otherwise, the tree can be a path to a file or a newick formatted string.
The resulting model instance can then be called as if it was a function, passing it an alignment instance. It will maximise the likelihood of the model for the alignment and return a custom result data type.
I’m creating an app to apply the F81 substitution model to an alignment and specifying the a tree as a file path.
from cogent3 import get_app
f81 = get_app("model", "F81", tree="data/brca1_primate.tree")
Warning
If you are providing a tree, the tip names must exactly match the names in the alignments that will be analysed.
Note
If you are analysing 3 sequences, there is only one possible unrooted tree and the model object will automatically construct it for you.
We now load our alignment and fit the model to it.
from cogent3 import load_aligned_seqs
aln = load_aligned_seqs("data/brca1_primate.fasta", moltype="dna")
result = f81(aln)
8.10.2. model_result
#
The app returns a model_result
object whose representation provides an overview of the fitted model.
result
key | lnL | nfp | DLC | unique_Q |
---|---|---|---|---|
'F81' | -7175.7756 | 11 | True | True |
The column titles in that summary display have the following meaning
Column Heading | |
---|---|
key | The dictionary key for accessing this model |
lnL | The log-likelihood value |
nfp | The number of free parameters in the model |
DLC | Diagonal largest in column in all P matrices. |
unique_Q | Whether the Q uniquely map to P |
The last two are concerned with model identifiability (if either is False
, the results could be unreliable).
To see the parameter MLEs, we access the lf
attribute [1]
result.lf
F81
log-likelihood = -7175.7756
number of free parameters = 11
edge | parent | length |
---|---|---|
Galago | root | 0.1666 |
HowlerMon | root | 0.0438 |
Rhesus | edge.3 | 0.0215 |
Orangutan | edge.2 | 0.0077 |
Gorilla | edge.1 | 0.0025 |
Human | edge.0 | 0.0061 |
Chimpanzee | edge.0 | 0.0028 |
edge.0 | edge.1 | 0.0000 |
edge.1 | edge.2 | 0.0035 |
edge.2 | edge.3 | 0.0117 |
edge.3 | root | 0.0085 |
A | C | G | T |
---|---|---|---|
0.3757 | 0.1742 | 0.2095 | 0.2406 |
There are 2 tables always present in this display – “edge params” and “motif params”. The former will always show the branch lengths. The following displays the tree, with the branch length defined by the values in the “edge params” table. The columns “edge” and “parent” denote which branch the “length” column value corresponds to [2]. That value is the MLE for the expected number of substitutions per site (our measure of evolutionary time).
If you hover your mouse over the internal nodes on the tree, the name of the node will appear.
dnd = result.tree.get_figure()
dnd.scale_bar = "top left"
dnd.show()
We can get all those statistics out as cogent3
tables using the tabulate_stats
app
tabulate = get_app("tabulate_stats")
tables = tabulate(result)
tables
2x tabular_result('edge params': Table, 'motif params': Table)
which allows us to get at the parts from the lf
display
tables["edge params"]
edge | parent | length |
---|---|---|
Galago | root | 0.1666 |
HowlerMon | root | 0.0438 |
Rhesus | edge.3 | 0.0215 |
Orangutan | edge.2 | 0.0077 |
Gorilla | edge.1 | 0.0025 |
Human | edge.0 | 0.0061 |
Chimpanzee | edge.0 | 0.0028 |
edge.0 | edge.1 | 0.0000 |
edge.1 | edge.2 | 0.0035 |
edge.2 | edge.3 | 0.0117 |
edge.3 | root | 0.0085 |
11 rows x 3 columns
8.10.3. Motif params are the state probabilities#
The “motif params” table corresponds to \({\mathbf \pi}\), the frequencies of nucleotides in the unobserved ancestor [3] of the alignment. More generally, cogent3
uses the variable motif_probs
to denote this. For instance, the alignment has a method for getting the motif probabilities as the average across all sequences.
At the node labelled “root”.
aln.get_motif_probs()
{'T': 0.24063356685957965,
'C': 0.17418011980911768,
'A': 0.37572342369783734,
'G': 0.20946288963346532}
With default model settings, the values returned by the model will be the same as this.
tables["motif params"]
A | C | G | T |
---|---|---|---|
0.3757 | 0.1742 | 0.2095 | 0.2406 |
1 rows x 4 columns
8.10.4. Refining the model app settings#
8.10.4.1. Optimising the motif probabilities#
For continuous-time substitution models, the default model
settings are that the process is time-homogeneous (one rate matrix for the entire tree) and \({\mathbf \pi}\), the state (or motif) probabilities at the root is just that returned by the alignment method. The latter can be changed so that the motif probabilities are treated as free parameters and optimised.
f81_mprobs = get_app("model",
"F81",
name="F81-mprobs free",
optimise_motif_probs=True,
tree="data/brca1_primate.tree",
)
result = f81_mprobs(aln)
Note that the number of free parameters has now increased by 3.
result
key | lnL | nfp | DLC | unique_Q |
---|---|---|---|---|
'F81-mprobs free' | -7174.8445 | 14 | True | True |
and that the motif probs are different to those above.
result.lf
F81-mprobs free
log-likelihood = -7174.8445
number of free parameters = 14
edge | parent | length |
---|---|---|
Galago | root | 0.1666 |
HowlerMon | root | 0.0438 |
Rhesus | edge.3 | 0.0215 |
Orangutan | edge.2 | 0.0077 |
Gorilla | edge.1 | 0.0025 |
Human | edge.0 | 0.0061 |
Chimpanzee | edge.0 | 0.0028 |
edge.0 | edge.1 | 0.0000 |
edge.1 | edge.2 | 0.0035 |
edge.2 | edge.3 | 0.0117 |
edge.3 | root | 0.0085 |
A | C | G | T |
---|---|---|---|
0.3744 | 0.1755 | 0.2170 | 0.2331 |
8.10.4.2. Constraining a substitution model parameter#
If we wish to modify the definition of a likelihood function we need to set rules for the parameter of interest. We do this using the param_rules
argument to model()
.
Let’s create a new instance of the HKY85 and constrain the likelihood function so that it is F81. To do that, we need to set the \(\kappa\) parameter to be equal to 1, and constant (so the numerical optimiser does not modify it during it’s work).
hky_as_f81 = get_app("model",
"HKY85",
name="HKY85 with κ=1",
tree="data/brca1_primate.tree",
param_rules=[dict(par_name="kappa", is_constant=True, value=1)],
)
result = hky_as_f81(aln)
result
key | lnL | nfp | DLC | unique_Q |
---|---|---|---|---|
'HKY85 with κ=1' | -7175.7756 | 11 | True | True |
As there can be more than one parameter rule applied to a function, we provide them as a list. Each element of the list must be a dictionary with at least one key "par_name"
. This is the name of the parameter we wish to modify and the value must be a string. The is_constant=True
sets that parameter as a constant with value=1
.
result.lf
HKY85 with κ=1
log-likelihood = -7175.7756
number of free parameters = 11
kappa |
---|
1 |
edge | parent | length |
---|---|---|
Galago | root | 0.1666 |
HowlerMon | root | 0.0438 |
Rhesus | edge.3 | 0.0215 |
Orangutan | edge.2 | 0.0077 |
Gorilla | edge.1 | 0.0025 |
Human | edge.0 | 0.0061 |
Chimpanzee | edge.0 | 0.0028 |
edge.0 | edge.1 | 0.0000 |
edge.1 | edge.2 | 0.0035 |
edge.2 | edge.3 | 0.0117 |
edge.3 | root | 0.0085 |
A | C | G | T |
---|---|---|---|
0.3757 | 0.1742 | 0.2095 | 0.2406 |
From that you can see the lnL and nfp are identical to our first fitting of the F81 model. While the display includes “kappa” (in the “global params” table), it has the value 1 and was not changed by the optimiser.
More generally, the “global params” table shows MLEs for the exchangeability terms in \({\mathbf{\rm Q}}\).
8.10.5. Testing hypotheses using the hypothesis
app#
To perform a hypothesis test, we need two models. I will create a HKY85 model which will serve as our alternate to the null of F81.
hky85 = get_app("model","HKY85", tree="data/brca1_primate.tree")
We construct the hypothesis app by providing the null and alternate models.
hyp = get_app("hypothesis", f81, hky85)
We apply it to our alignment.
result = hyp(aln)
The hypothesis
app fits both models to the alignment. One important point to make is, in the case of nested models, it will use the MLEs from the null as “seed” values to alternate as a pre-step to numerical optimisation. This is done for 2 reasons. First, it typically speeds up fitting. Second, it guarantees the likelihood from the alternate will be ≥ that of the null.
8.10.6. hypothesis_result
#
The hypothesis
app returns a hypothesis_result
object, which also provides an overview of the hypothesis test outcome.
result
LR | df | pvalue |
---|---|---|
341.8375 | 1 | 2.54e-76 |
hypothesis | key | lnL | nfp | DLC | unique_Q |
---|---|---|---|---|---|
null | 'F81' | -7175.7756 | 11 | True | True |
alt | 'HKY85' | -7004.8569 | 12 | True | True |
The column titles in that summary display have the following meaning
Column Heading | |
---|---|
LR | Likelihood ratio statistic |
df | degrees of freedom |
pvalue | p-value |
hypothesis | null or alternate |
The LR statistic (computed using this equation) and df are computed from the lnL
and nfp
values, respectively, of the different models under “hypothesis”. The \(p\)-value is the probability of a LR ≥ the observed statistic from the theoretical \(\chi^2\) distribution with df degrees of freedom.
The model_result
for each of the models is available as .null
and .alt
attributes.
result.alt
key | lnL | nfp | DLC | unique_Q |
---|---|---|---|---|
'HKY85' | -7004.8569 | 12 | True | True |
hypothesis_result
also behaves like a dictionary, with the keys being the model name. So this display is identical to the previous one.
result["HKY85"]
key | lnL | nfp | DLC | unique_Q |
---|---|---|---|---|
'HKY85' | -7004.8569 | 12 | True | True |
8.10.7. The NotCompleted
object#
If your analysis returns one of these, it means there was an error or the algorithm could not complete. That object contains information about where the error occurred and what type it was. See the cogent3 documentation for more details.
8.11. Exercises#
The following require you to repeat the above hypothesis test. Download the alignment of primate BRCA1 sequences
and the tree
, or using Python.
Select 3 sequences from the alignment and perform a hypothesis test comparing F81 and HKY85.
From the test result, which model do you choose and why?
Displaying the MLE’s for the null hypothesis and the alternate hypothesis in separate cells. Which specific parameter value(s) is responsible for the difference in likelihoods?
Give a biological interpretation of the result.
What are the assumptions of these models?