Modelling sequence evolution

8.11. Modelling sequence evolution#

cogent3 has a lot of substitution models! The following displays just the nucleotide models.

from cogent3 import available_models

available_models(model_types="nucleotide")

Specify a model using 'Abbreviation' (case sensitive).
Model Type	Abbreviation	Description
nucleotide	BH	Barry and Hartigan Discrete Time substitution model Barry and Hartigan 1987. Biometrics 43: 261–76.
nucleotide	DT	Discrete Time substitution model (non-stationary, non-reversible). motif_length=2 makes this a dinucleotide model, motif_length=3 a trinucleotide model.
nucleotide	GN	General Markov Nucleotide (non-stationary, non-reversible). Kaehler, Yap, Zhang, Huttley, 2015, Sys Biol 64 (2): 281–93
nucleotide	ssGN	strand-symmetric general Markov nucleotide (non-stationary, non-reversible). Kaehler, 2017, Journal of Theoretical Biology 420: 144–51
nucleotide	K80	Kimura 1980
nucleotide	JC69	Jukes and Cantor's 1969 model
nucleotide	GTR	General Time Reversible nucleotide substitution model.
nucleotide	TN93	Tamura and Nei 1993 model
nucleotide	HKY85	Hasegawa, Kishino and Yano 1985 model
nucleotide	F81	Felsenstein's 1981 model

10 rows x 3 columns

These are cogent3 apps. There’s also an app for performing hypothesis testing using these models.

8.11.1. Using the `model` app#

Using the model app involves specifying the substitution model name and a tree. If you are analysing just 3 sequences, the tree is not required. Otherwise, the tree can be a path to a file or a newick formatted string.

The resulting model instance can then be called as if it was a function, passing it an alignment instance. It will maximise the likelihood of the model for the alignment and return a custom result data type.

I’m creating an app to apply the F81 substitution model to an alignment and specifying the a tree as a file path.

from cogent3 import get_app

f81 = get_app("model", "F81", tree="data/brca1_primate.tree")

Warning

If you are providing a tree, the tip names must exactly match the names in the alignments that will be analysed.

Note

If you are analysing 3 sequences, there is only one possible unrooted tree and the model object will automatically construct it for you.

We now load our alignment and fit the model to it.

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/brca1_primate.fasta", moltype="dna")
result = f81(aln)

8.11.2. `model_result`#

The app returns a model_result object whose representation provides an overview of the fitted model.

result

F81
key	lnL	nfp	DLC	unique_Q
'F81'	-7175.7784	11	True	True

The column titles in that summary display have the following meaning

Column Heading
key	The dictionary key for accessing this model
lnL	The log-likelihood value
nfp	The number of free parameters in the model
DLC	Diagonal largest in column in all P matrices.
unique_Q	Whether the Q uniquely map to P

The last two are concerned with model identifiability (if either is False, the results could be unreliable).

To see the parameter MLEs, we access the lf attribute [1]

result.lf

F81

log-likelihood = -7175.7784

number of free parameters = 11

Edge params
edge	parent	length
Galago	root	0.1666
HowlerMon	root	0.0438
Rhesus	edge.3	0.0215
Orangutan	edge.2	0.0077
Gorilla	edge.1	0.0025
Human	edge.0	0.0061
Chimpanzee	edge.0	0.0028
edge.0	edge.1	0.0000
edge.1	edge.2	0.0035
edge.2	edge.3	0.0117
edge.3	root	0.0085

Motif params
A	C	G	T
0.3757	0.1742	0.2095	0.2406

There are 2 tables always present in this display – “edge params” and “motif params”. The former will always show the branch lengths. The following displays the tree, with the branch length defined by the values in the “edge params” table. The columns “edge” and “parent” denote which branch the “length” column value corresponds to [2]. That value is the MLE for the expected number of substitutions per site (our measure of evolutionary time).

dnd = result.tree.get_figure()
dnd.scale_bar = "top left"
dnd.show()

We can get all those statistics out as cogent3 tables using the tabulate_stats app

tabulate = get_app("tabulate_stats")
tables = tabulate(result)
tables

2x tabular_result('edge params': Table, 'motif params': Table)

which allows us to get at the parts from the lf display

tables["edge params"]

edge params
edge	parent	length
Galago	root	0.1666
HowlerMon	root	0.0438
Rhesus	edge.3	0.0215
Orangutan	edge.2	0.0077
Gorilla	edge.1	0.0025
Human	edge.0	0.0061
Chimpanzee	edge.0	0.0028
edge.0	edge.1	0.0000
edge.1	edge.2	0.0035
edge.2	edge.3	0.0117
edge.3	root	0.0085

11 rows x 3 columns

8.11.3. Motif params are the state probabilities#

The “motif params” table corresponds to \({\mathbf \pi}\), the frequencies of nucleotides in the unobserved ancestor [3] of the alignment. More generally, cogent3 uses the variable motif_probs to denote this. For instance, the alignment has a method for getting the motif probabilities as the average across all sequences.

aln.get_motif_probs()

{'T': 0.24063356685957965,
 'C': 0.17418011980911768,
 'A': 0.37572342369783734,
 'G': 0.20946288963346532}

With default model settings, the values returned by the model will be the same as this.

tables["motif params"]

motif params
A	C	G	T
0.3757	0.1742	0.2095	0.2406

1 rows x 4 columns

8.11.4. Refining the model app settings#

8.11.4.1. Optimising the motif probabilities#

For continuous-time substitution models, the default model settings are that the process is time-homogeneous (one rate matrix for the entire tree) and \({\mathbf \pi}\), the state (or motif) probabilities at the root is just that returned by the alignment method. The latter can be changed so that the motif probabilities are treated as free parameters and optimised.

f81_mprobs = get_app("model",
    "F81",
    name="F81-mprobs free",
    optimise_motif_probs=True,
    tree="data/brca1_primate.tree",
)

result = f81_mprobs(aln)

Note that the number of free parameters has now increased by 3.

result

F81-mprobs free
key	lnL	nfp	DLC	unique_Q
'F81-mprobs free'	-7174.8473	14	True	True

and that the motif probs are different to those above.

result.lf

F81-mprobs free

log-likelihood = -7174.8473

number of free parameters = 14

Edge params
edge	parent	length
Galago	root	0.1666
HowlerMon	root	0.0438
Rhesus	edge.3	0.0215
Orangutan	edge.2	0.0077
Gorilla	edge.1	0.0025
Human	edge.0	0.0061
Chimpanzee	edge.0	0.0028
edge.0	edge.1	0.0000
edge.1	edge.2	0.0035
edge.2	edge.3	0.0117
edge.3	root	0.0085

Motif params
A	C	G	T
0.3744	0.1755	0.2170	0.2331

8.11.4.2. Constraining a substitution model parameter#

If we wish to modify the definition of a likelihood function we need to set rules for the parameter of interest. We do this using the param_rules argument to model().

Let’s create a new instance of the HKY85 and constrain the likelihood function so that it is F81. To do that, we need to set the \(\kappa\) parameter to be equal to 1, and constant (so the numerical optimiser does not modify it during it’s work).

hky_as_f81 = get_app("model",
    "HKY85",
    name="HKY85 with κ=1",
    tree="data/brca1_primate.tree",
    param_rules=[dict(par_name="kappa", is_constant=True, value=1)],
)

result = hky_as_f81(aln)
result

HKY85 with κ=1
key	lnL	nfp	DLC	unique_Q
'HKY85 with κ=1'	-7175.7784	11	True	True

As there can be more than one parameter rule applied to a function, we provide them as a list. Each element of the list must be a dictionary with at least one key "par_name". This is the name of the parameter we wish to modify and the value must be a string. The is_constant=True sets that parameter as a constant with value=1.

result.lf

HKY85 with κ=1

log-likelihood = -7175.7784

number of free parameters = 11

Global params
kappa
1

Edge params
edge	parent	length
Galago	root	0.1666
HowlerMon	root	0.0438
Rhesus	edge.3	0.0215
Orangutan	edge.2	0.0077
Gorilla	edge.1	0.0025
Human	edge.0	0.0061
Chimpanzee	edge.0	0.0028
edge.0	edge.1	0.0000
edge.1	edge.2	0.0035
edge.2	edge.3	0.0117
edge.3	root	0.0085

Motif params
A	C	G	T
0.3757	0.1742	0.2095	0.2406

From that you can see the lnL and nfp are identical to our first fitting of the F81 model. While the display includes “kappa” (in the “global params” table), it has the value 1 and was not changed by the optimiser.

More generally, the “global params” table shows MLEs for the exchangeability terms in \({\mathbf{\rm Q}}\).

8.11.5. Testing hypotheses using the `hypothesis` app#

To perform a hypothesis test, we need two models. I will create a HKY85 model which will serve as our alternate to the null of F81.

hky85 = get_app("model","HKY85", tree="data/brca1_primate.tree")

We construct the hypothesis app by providing the null and alternate models.

hyp = get_app("hypothesis", f81, hky85)

We apply it to our alignment.

result = hyp(aln)

The hypothesis app fits both models to the alignment. One important point to make is, in the case of nested models, it will use the MLEs from the null as “seed” values to alternate as a pre-step to numerical optimisation. This is done for 2 reasons. First, it typically speeds up fitting. Second, it guarantees the likelihood from the alternate will be ≥ that of the null.

8.11.6. `hypothesis_result`#

The hypothesis app returns a hypothesis_result object, which also provides an overview of the hypothesis test outcome.

result

Statistics
LR	df	pvalue
341.8375	1	2.54e-76

hypothesis	key	lnL	nfp	DLC	unique_Q
null	'F81'	-7175.7784	11	True	True
alt	'HKY85'	-7004.8597	12	True	True

The column titles in that summary display have the following meaning

Column Heading
LR	Likelihood ratio statistic
df	degrees of freedom
pvalue	p-value
hypothesis	null or alternate

The LR statistic (computed using this equation) and df are computed from the lnL and nfp values, respectively, of the different models under “hypothesis”. The \(p\)-value is the probability of a LR ≥ the observed statistic from the theoretical \(\chi^2\) distribution with df degrees of freedom.

The model_result for each of the models is available as .null and .alt attributes.

result.alt

HKY85
key	lnL	nfp	DLC	unique_Q
'HKY85'	-7004.8597	12	True	True

hypothesis_result also behaves like a dictionary, with the keys being the model name. So this display is identical to the previous one.

result["HKY85"]

HKY85
key	lnL	nfp	DLC	unique_Q
'HKY85'	-7004.8597	12	True	True

8.11.7. The `NotCompleted` object#

If your analysis returns one of these, it means there was an error or the algorithm could not complete. That object contains information about where the error occurred and what type it was. See the cogent3 documentation for more details.

8.12. Exercises#

The following require you to repeat the above hypothesis test. Download the alignment of primate BRCA1 sequences and the tree, or using Python.

Select 3 sequences from the alignment and perform a hypothesis test comparing F81 and HKY85.
From the test result, which model do you choose and why?
Displaying the MLE’s for the null hypothesis and the alternate hypothesis in separate cells. Which specific parameter value(s) is responsible for the difference in likelihoods?
Give a biological interpretation of the result.
What are the assumptions of these models?