7.7. Substitution models#

As we demonstrated, calculating a likelihood requires a matrix of substitution probabilities. We can define such matrices in two ways. In the first, we simply specify a matrix of probabilities which we have denoted \({\mathbf{\rm P}}\). This corresponds to a discrete-time Markov process. This class of Markov process (substitution model) is not widely used because every possible exchange requires a separate probability and this means the number of free parameters in the model is proportional to the square of the matrix dimensions multiplied by the number of edges on a phylogenetic genetic tree.

The second approach uses an instantaneous transition rate matrix which we denote \({\mathbf{\rm Q}}\). This is a mathematical precursor to \({\mathbf{\rm P}}\) that corresponds to a continuous-time Markov process. In this class of model, we do not define a matrix of probabilities but instead define a matrix of instantaneous rates. This is the most commonly used category of substitution models for molecular evolutionary analyses. Their popularity stems in part because they enable an intuitive measurement of elapsed time (described below) and because they can be defined with a very small number of parameters. Their construction can also be related to chemical and or biochemical properties of DNA. We will focus on this category from now on.

7.7.1. Time reversible continuous-time substitution models#

Recall the following equation.

\[P_t = \exp^{Q}\]

This shows how the substitution probabilities needed to compute a likelihood are obtained from a rate matrix \({\mathbf{\rm Q}}\). The properties of \({\mathbf{\rm P}}\) are that all elements are valid probabilities and the row sums are 1. There are constraints on the construction of \({\mathbf{\rm Q}}\) such that it can be used to generate a valid \({\mathbf{\rm P}}\). Those constraints are:

  • off-diagonal elements are positive

  • row sums are 0

The following is an example of such a matrix. The constraint that rows sum to 0 is achieved by setting the diagonal element as equal to the negative of the sum of the remaining elements.

Q
TCAG
T-0.01570.00360.00780.0044
C0.0050-0.01710.00780.0044
A0.00500.0036-0.01300.0044
G0.00500.00360.0078-0.0164

It is typically the case that the Markov process is assumed to be time-reversible. This means that, the probability of the forward and reverse substitutions are equal. For example, for a given time period \(t\), a C\(\rightarrow\)T occurs with the same probability as T\(\rightarrow\)C.

π
TCAG
0.24020.17440.37480.2106
from_t = pi[0, "T"] * Q["T", "C"]
to_t = pi[0, "C"] * Q["C", "T"]
from_t, to_t
(0.0008680857395685076, 0.0008680857395685077)

Note

These values are not identical due to the limits of numerical precision.

7.7.1.1. The Jukes-Cantor substitution model#

The Jukes-Cantor 1969 model (abbreviated JC69) [JC69] was the first published substitution model. It assumes all bases are equally frequent and that all substitutions occur at the same rate, irrespective of the identify of the bases being exchanged.

\[\begin{split}\begin{matrix} & \begin{matrix} {\bf A}\; & {\bf C}\; & {\bf G}\; & {\bf T} \end{matrix} \\ Q = \begin{matrix} {\bf A} \\[1mm] {\bf C} \\[1mm] {\bf G} \\[1mm] {\bf T} \end{matrix} & \begin{bmatrix} \frac{-3}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4}\\ \frac{1}{4} & \frac{-3}{4} & \frac{1}{4} & \frac{1}{4}\\ \frac{1}{4} & \frac{1}{4} & \frac{-3}{4} & \frac{1}{4}\\ \frac{1}{4} & \frac{1}{4} & \frac{1}{4} & \frac{-3}{4}\\ \end{bmatrix} \end{matrix}\end{split}\]

7.7.1.2. F81#

Described by Felsenstein in his seminal 1981 paper [Fel81]. This matrix is time reversible and, therefore, stationary. The \({\mathbf \pi}\) elements correspond to the stationary nucleotide frequencies. F81 allows nucleotide frequencies to differ (e.g. \(\pi_A \neq \pi_C\)) but assumes all exchanges otherwise occur at the same rate.

Note

In the following definition, “\(-\)“ is being used for diagonal elements to simplify the expression. Those elements are computed as the negative sum of the remaining elements on the row.

\[\begin{split}\begin{matrix} & \begin{matrix} {\bf A}\; & {\bf C}\; & {\bf G}\; & {\bf T} \end{matrix} \\ Q = \begin{matrix} {\bf A} \\ {\bf C} \\ {\bf G} \\ {\bf T} \end{matrix} & \begin{bmatrix} - & \pi_C & \pi_G & \pi_T \\ \pi_A & - & \pi_G & \pi_T \\ \pi_A & \pi_C & - & \pi_T \\ \pi_A & \pi_C & \pi_G & -\\ \end{bmatrix} \end{matrix}\end{split}\]

7.7.1.3. HKY85#

Named after its authors [HKY85], the Hasegawa-Kishino-Yano (HKY or HKY85) model includes an additional exchangeability parameter [1] \(\kappa\). This parameter is within matrix cells that correspond to transition substitutions. Because the parameter is applied multiplicatively in the matrix (discussed below), it is the ratio of transition / transversion substitution rates.

\[\begin{split}\begin{matrix} & \begin{matrix} {\bf A}\quad & {\bf C}\quad & {\bf G}\quad & {\bf T} \end{matrix} \\ Q = \begin{matrix} {\bf A} \\ {\bf C} \\ {\bf G} \\ {\bf T} \end{matrix} & \begin{bmatrix} - & \pi_C & \pi_G {\color{blue}\bf\kappa} & \pi_T \\ \pi_A & - & \pi_G & \pi_T {\color{blue}\bf\kappa}\\ \pi_A {\color{blue}\bf\kappa} & \pi_C & - & \pi_T \\ \pi_A & \pi_C {\color{blue}\bf\kappa} & \pi_G & -\\ \end{bmatrix} \end{matrix}\end{split}\]

The rate matrix of the HKY85 model illustrates a fundamental structure shared with all time-reversible Markov processes. Namely, that the exchangeability parameters are symmetric across the diagonal. As a consequence, time-reversible matrices can be expressed as the product of a diagonal matrix of the \({\mathbf \pi}\) terms and a symmetric matrix consisting of the exchangeability terms.

7.7.1.4. GTR#

GTR stands for General Time Reversible [LS84]. It is the most general nucleotide substitution model that is time-reversible. This model contains 6 additional parameters: \(\alpha\), \(\beta\), \(\gamma\), \(\delta\), \(\epsilon\) and \(\zeta\). This model is saying that all of the possible nucleotide exchanges have distinctive rates.

\[\begin{split}\begin{matrix} & \begin{matrix} {\bf A}\quad & {\bf C}\quad & {\bf G}\quad & {\bf T} \end{matrix} \\ Q = \begin{matrix} {\bf A} \\ {\bf C} \\ {\bf G} \\ {\bf T} \end{matrix} & \begin{bmatrix} - & \pi_C \alpha & \pi_G \beta & \pi_T \gamma\\ \pi_A \alpha & - & \pi_G \delta & \pi_T \epsilon\\ \pi_A \beta & \pi_C \delta & - & \pi_T \zeta\\ \pi_A \gamma & \pi_C \epsilon & \pi_G \zeta & -\\ \end{bmatrix} \end{matrix}\end{split}\]

Notice again that the exchangeability parameters are symmetric across the diagonal.

7.7.2. Stationarity – base frequencies constant through time#

A \({\mathbf{\rm P}}\) matrix specifies the probabilities of changing from one base into another for a specific time period. So what will the base frequencies be at the end of that time period?

Let us set \({\mathbf \pi}\)\((0)\) as the base frequencies at the beginning of our time interval and \({\mathbf{\rm P}}\)\((t)\) as the substitution probability matrix for time \(t\). We then obtain the base frequencies at time \(t\) as

\[{\mathbf \pi}(t) = {\mathbf \pi}(0) \cdot {\mathbf {\rm P}}(t)\]

where “\(\cdot\)” is matrix multiplication [2]. If \({\mathbf \pi}\)\((0)\) equals \({\mathbf \pi}\)\((t)\) then the base frequencies have not changed and we can state that \({\mathbf \pi}\)\((0)\) is the stationary frequency vector of \({\mathbf{\rm P}}\). This is the condition of stationarity. For simplicity, we just denote this stationary vector as \({\mathbf \pi}\).

In the case of a continuous-time Markov process, we can operate directly on the rate matrix \({\mathbf{\rm Q}}\) without needing to specify a time period. Recall that the rate matrix specifies the instantaneous rates of exchanges between states. If \({\mathbf \pi}\) is the stationarity frequency vector, then there will be no exchanges and we expect a vector of zeros. Thus,

\[{\mathbf \pi} \cdot {{\mathbf\rm Q}} = \mathbf{0}\]

where \(\mathbf{0}\) is a vector of zeros and indicates that \({\mathbf \pi}\) is the stationary frequency vector of \({\mathbf{\rm Q}}\).

All time-reversible substitution models are stationary Markov processes. However, not all stationary Markov processes are time-reversible.

7.7.3. The behaviour of parameters in \({\mathbf{\rm Q}}\)#

For the models I’m presenting, exchangeability parameters are applied “multiplicatively” in the rate matrix. By this I mean the parameters in a “rate matrix cell” are multiplied together [3]. From the definition of a rate matrix, the only constraint operating on exchangeability parameters is that they are ≥0. This construction means that setting an exchangeability parameter value to 1 means it has no effect on the instantaneous rate of change. Setting \(\kappa=1\) in HKY85 simplifies the expression to produce the F81 rate matrix definition.

The multiplicative construction of the rate matrices also provides guidance for interpreting parameter estimates from a model. Parameters whose estimate is <1 are reducing the instantaneous rate of change for the substitutions they influence compared to the remainder. The converse applies when they are > 1.

As you will see below, we typically define one exchangeability parameter as the reference parameter, setting it to the value 1. This effectively means that all other exchangeability parameters are relative to this term. Which parameter is chosen to be the reference is arbitrary. This strategy is used to allow including a time parameter in our models of sequence evolution.

7.7.4. Time in molecular evolution#

What we call “time” in molecular evolution is not chronological time as we know it in our daily lives, but a measure of sequence change. Recall that Kimura showed the rate of substitution is equal to the neutral mutation rate for neutrally evolving genetic variants. Time comes into this via the neutral mutation rate as this quantity is measured as the probability of mutation per generation.

In molecular evolutionary analyses, we measure time (\(t\)) as the expected number of substitutions per position. This is a product of the amount of elapsed chronological time and the mutation rate. For a stationary Markov process, we obtain the expected number of substitutions by multiplying the base frequencies by the flow away from the base (the diagonal element of \({\mathbf{\rm Q}}\)) [4].

\[t=-\sum_{i=1}^4\pi_i Q_{i,i}\]

Although it is not possible to extract chronological time from just sequence divergence data alone, methods exist that use datings from the fossil record to facilitate estimation of chronological times.

7.7.5. Calibrating rate matrices to facilitate estimation of time on a tree#

In order to estimate the expected number of substitutions on all the branches on a tree, molecular evolutionary software often employs a trick. It uses a rate matrix that has been calibrated such that \(t\)=1.

The calibration step involves calculating \(t\) for a given \({\mathbf{\rm Q}}\) according to the equation above and dividing both sides by it.

There is a major computational advantage of this approach. Specifically, the eigendecomposition algorithm for matrix exponentiation allows us to do the algorithmically “hard part” once for an entire tree and store the next to final step in the computation [SYE+08]. That final step — producing the different substitution probability matrices for edges with different \(t\) (\({\mathbf{\rm P}}\)\((t)\)) — is very efficiently obtained by applying the corresponding branch length to this intermediate product.

By specifying time as a parameter in our models, we must eliminate another parameter to avoid over specifying the model [5]. This is achieved by selecting a reference exchangeability parameter and setting it to 1.

7.7.6. Assumptions of substitution process#

time-reversible

That the rate of forward substitutions is equal to their reverse rate. The consequence is that evolution looks the same going forwards or backwards.

stationary

The sequence composition, or frequencies of the states, is the same across the data set. An assumption of all time-reversible models and some others too. This becomes less likely with increasing time since species shared a common ancestor.

time-homogeneous

The substitution process remains the same along each branch. Translated, this means the same intrinsic dynamics governing mutation, and therefore substitution, operate across the entire branch for every branch that we might be analysing. If the lineages aren’t too long then this seems a pretty reasonable assumption. If we assume a single rate matrix for the entire tree we are further assuming the mutation and substitution processes have not changed between the different edges. As you increase the time depth of your sample, this assumption can be more problematic since the potential for changes affecting mutation increase. Accordingly, in such circumstances this property should be checked for [VYP+13].

7.8. Exercises#

  1. What constraints would you impose on HKY85 to make it F81?

  2. Assume I have a rate matrix cell \(q_{i,j}=r_j\theta\). Holding \(r_j\) fixed, what will be the effect on the magnitude of \(q_{i,j}\) if

    a.) \(\theta<1\)? b.) \(\theta>1\)? c.) \(\theta=1\)?

7.8.1. Advanced exercises#

  1. Show that JC is a time-reversible process.

  2. Scale the JC matrix such that \(t=1\).

  3. With \({\mathbf \pi}\)

    pi = {
        "T": 0.2402306967984934,
        "C": 0.17443502824858756,
        "A": 0.3747645951035782,
        "G": 0.21056967984934086,
    }
    

    is the following a calibrated matrix?

    TCAG
    T-0.8501620.5305270.2046480.114986
    C0.730639-1.0502730.2046480.114986
    A0.1311830.095254-0.8668650.640428
    G0.1311830.0952541.139811-1.366248
  4. In order to apply a statistical procedure to data we need to check whether the assumptions made by the model are satisfied by the data. Many of the substitution models used in molecular evolution are time-reversible. As a consequence of this property, the models are also stationary. Test if the following orthologous sequences could have been generated by the same stationary process.

seq1 = (
    "TCCAAGTGCCTAACTACAGTAAGTACTTACAGTCAAGCATCAGTATGAAT"
    "TTGGTCCAAGATGTTTCGTGAAAGTGAGACAGTTATTATTTGAAATCCTG"
    "ATTGGTCATTAGATTTCATTGGTAATCAATTAGCTATGATATTTTAGAAC"
    "AGCTTTTGTAATATAATCCAAAGTTACAATGACTGGGACCCCACTATATA"
    "TAAATTTGAGAAAGTCCATAAGTAGATAACTTTGTTCGAATGATAGTTAG"
    "ATGATCAGGGTTAGGTTTTTTTGTAAATTTTGTGATTCAAAACAAATTCA"
    "GATATACCTACTGACAATCCTAAATAGTGGGGGTTCGTTTGTAAACTATA"
    "CATTTTAGATTTTTCTAGAGAAGCCAGACGCCACAACGATATATACGGTC"
    "GATAGATAATCCTTCAGGGAATATTTTTGTATCTATAATCTTCTAAAAAA"
    "GAAAATATTACCAGATAAGTGATAATAGTCTTAGATTTTTCTGATCGAGA"
)

seq2 = (
    "TCTAATTACCTAGCCACAGTAAGTACCTACAGTCAAGCAGCTCTATGAAT"
    "CTGCATCTGGTGGTCTTGTGGACAGGGGGCTGCTATTGCGTACAGACCTG"
    "ATTGGACATTCTACTTCCCTGCTGGCAGGTTCAATGTCAGATTCTTGGTC"
    "AGTCCTTTCGATGTAACTCGTAACTATAATAACTGAGATCTTGTTATACG"
    "TTATGTTTCGCTAGTCACCGAGTAAGCATCGGTGTCCGGGCTATCACTAG"
    "ATCTTCATCAACAGGCGTCTTTGGACATTTAGAGATTGAAGTGGAACTCA"
    "TAAGTATCCACTGGTAATTATCAGAAACGGGGGTACGCCTTTGCATTACA"
    "CTTTGTAGGCTCTCCTAGAGAATTAGGACACCAGTGGAAAATGTACGGCT"
    "GATCAAAATTCTATCAAAAAACGTCTTTCTATTTGTAGATCCTCAAAATA"
    "GCGAACACCACATGCGTAGGGATAACAACGGTGGTTTTTGCTGCTCAGTA"
)

Citations

[Fel81]

J Felsenstein. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol., 17:368–376, 1981.

[HKY85]

M Hasegawa, H Kishino, and T Yano. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol, 22:160–174, 1985.

[JC69]

T Jukes and C Cantor. Mammalian Protein Metabolism, pages 21–132. Academic Press, New York, 1969.

[KYZH15]

Benjamin D Kaehler, Von Bing Yap, Rongli Zhang, and Gavin A Huttley. Genetic distance for a general non-stationary Markov substitution process. Systematic Biology, 64:281–93, 2015. URL: http://www.ncbi.nlm.nih.gov/pubmed/25503772.

[LS84]

Cecilia Lanave and C Saccone. A New Method for Calculating Evolutionary Substitution Rates. J Molecular Evolution, 20:86–93, 1984.

[SYE+08]

Harold W Schranz, Von Bing Yap, Simon Easteal, Rob Knight, and Gavin A Huttley. Pathological rate matrices: from primates to pathogens. BMC Bioinformatics, 9:550, Dec 2008. doi:10.1186/1471-2105-9-550.

[VYP+13]

Klara L Verbyla, Von Bing Yap, Anuj Pahwa, Yunli Shao, and Gavin A Huttley. The embedding problem for Markov models of nucleotide substitution. PLoS One, 8:e69187, 2013. doi:10.1371/journal.pone.0069187.