Statistical methods for molecular evolutionary analyses

7.7. Statistical methods for molecular evolutionary analyses#

The most widely used phylogenetic methods are maximum likelihood or Bayesian. The Bayesian methods incorporate the likelihood calculations and thus inherit many of the assumptions.

Both these approaches are considered “probabilistic”, meaning they involve probability calculations under a specific model of evolution. Their popularity in phylogenetics stems from their well developed theoretical foundations. Those foundations facilitate objective inference into the evolution of biological sequences.

7.7.1. Maximum likelihood#

\[\mathcal{L}(M|D) \propto P(D|M)\]

The likelihood (\(\mathcal{L}\)) of a model (\(M\)) given some data (\(D\)) is proportional to the conditional probability of \(D\) given \(M\).

Maximum likelihood is a method of estimation. If I have an algebraic representation of how something happens (\(M\)) I can implement that in code. By assigning some specific values to those terms in \(M\), I can calculate a probability the sequences in a data sample were generated by that model. The maximum likelihood method proceeds by modifying model parameter values until the likelihood is maximised.

7.7.1.1. The elements to compute likelihood#

Shown in the Elements of a phylogenetic model figure are the basic elements required to compute a likelihood.

A tree shape, or tree topology#

To be consistent with the vast majority of models that are in current use, we place the “root” of the tree at the internal node edge.0 (hover your mouse over the nodes to find it) [1] [2].

The particular tree shown includes the chronological root. This is the point from which time flows forward until the present. However, the vast majority of models are unable to resolve which way times arrow points. As a consequence, the chronological root cannot be identified and an unrooted tree is used.

\({\mathbf \pi}\) – the state frequencies in the common ancestor#

The symbol \({\mathbf \pi}\) (a variable) at the “root” denotes the frequencies of the different sequence states. If we are modelling evolution as a nucleotide process, then \({\mathbf \pi}\) is the frequencies of the nucleotides in the unobserved ancestor. If we were modelling amino acid sequences, it would be the frequency of the amino acids in the unobserved ancestor. The one shown below is again estimated from one alignment [3]. And these were our estimate of the nucleotide frequencies. So the frequency T is 0.24, C is 0.174, etc.

pi

π
T	C	A	G
0.240	0.174	0.375	0.211

\({\mathbf{\rm P}}\) – the substitution probability matrices#

We have \({\mathbf{\rm P}}\) matrices, one for each edge (branch) on the tree. These represent the probabilities of changing from one state into another in some period of time. In the example below, the row labels correspond to the state being changed “from”, the column labels the state being changed “to”. Every element in this matrix is a probability, which means they are bounded between zero and one. Given the row based frame of reference, all elements in a row correspond to the full set of possible outcomes for the “from” state. Accordingly, the row sum must be 1.

P_Human
	T	C	A	G
T	0.994	0.001	0.003	0.002
C	0.002	0.993	0.003	0.002
A	0.002	0.001	0.995	0.002
G	0.002	0.001	0.003	0.993

Notice that the diagonal element is the largest value in each row. This means it is more likely that nucleotides remain unchanged. This is just one \({\mathbf{\rm P}}\) substitution probability matrix [3]. Given the modest amount of time elapsed since we shared a common ancestor with the Chimpanzee, there is not a huge amount of genetic change and why \({\mathbf{\rm P}}\) is dominated by the diagonal.

7.7.1.2. Example calculation of the probability of an alignment column#

We now use just the above two components (\({\mathbf \pi}\), \({\mathbf{\rm P}}\)) to calculate the likelihood of the sample alignment displayed in the tree. Normally we would have a separate \({\mathbf{\rm P}}\) for each edge on the tree, but to keep this calculation simple we just use the one.

We calculate this for the first alignment column in my little example, which has a C, T, C as the observed states. We start by asking what is the likelihood that the ancestral base was a C at this column? We obtain the probability of a C in the ancestor from the \({\mathbf \pi}\) vector.

pi[0, "C"]

np.float64(0.17443502824858756)

Conditioned on having a C, we then extract the probabilities from \({\mathbf{\rm P}}\) that correspond to C being observed on the green and blue edges, and changing into T on the orange edge. The likelihood of a C at alignment column 1 [4] is specified by this equation

\[\mathcal{L}_1(\text{C})=\pi_\text{C} \times P_1(C,C) \times P_2(C,T) \times P_3(C,C)\]

and this code

pi[0, "C"] * P["C", "C"] * P["C", "T"] * P["C", "C"]

np.float64(0.0003512099641637212)

This is simply the probability conditioned on whatever the values are in our matrices and vector. We repeat this calculation for T as the ancestral state, i.e.

\[\mathcal{L}_1(\text{T})=\pi_\text{T} \times P_1(T,C) \times P_2(T,T) \times P_3(T,C)\]

and

pi[0, "T"] * P["T", "C"] * P["T", "T"] * P["T", "C"]

np.float64(5.247164711389856e-07)

for A as the ancestral base and then G as the ancestral base. The likelihood under our model for the first position is simply the sum of these individual likelihoods.

\[\mathcal{L}_1 = \mathcal{L}_1(\text{T}) + \mathcal{L}_1(\text{C}) + \mathcal{L}_1(\text{A}) + \mathcal{L}_1(\text{G})\]

To obtain the likelihood of the entire alignment we perform the above steps for every alignment column and take the product of all the resulting values (there would be 6 for this sample alignment).

\[\mathcal{L}_{\text{Alignment}} = \mathcal{L}_1 \times \mathcal{L}_2 \times \mathcal{L}_3 \times \mathcal{L}_4 \times \mathcal{L}_5 \times \mathcal{L}_6\]

or more simply

\[\mathcal{L}_{\text{Alignment}} = \prod_{i=1}^6\mathcal{L}_i\]

For 3 taxa, the calculation is not particularly complicated. If you’ve got a big tree with many internal nodes, this calculation is solved using a dynamic programming algorithm [Fel81].

So this representation is one of sequence evolution from a common ancestor to the tips. We have described, by specifying the substitution probability matrices, how nucleic acid sequences change in time. We have used the resulting substitution probability matrices to compute the likelihood of observing our data, given the model.

To avoid underflow errors this expression is converted into logs

\[\log\mathcal{L}_{\text{Alignment}} = \sum_1^6 \log \mathcal{L}_i\]

and as a result, we talk of log-likelihoods.

The log-likelihood statistic is a summary statistic that represents the accumulated evidence of observing that entire alignment, given the model.

7.7.2. Practical issues#

We typically do not specify our models in terms of \({\mathbf{\rm P}}\) matrices – the substitution probability matrices. We typically work with a rate matrix \({\mathbf{\rm Q}}\). While \({\mathbf{\rm P}}\) matrices have very good mathematical and statistical properties, they are too “parameter rich” [5]. They make interrogating the dynamics of the process of divergence more complicated. We can obtain the substitution probability matrix from a rate matrix by the matrix exponential

\[P(t) = \exp^{Qt}\]

where \({\mathbf{\rm Q}}\) is the instantaneous rate matrix, \(\exp\) the matrix exponential and \(t\) is the expected number of substitutions per site.

So now we have slightly revised components of the model. Because we were just dealing with a substitution probability matrix, time was embedded as a part of the matrix. It was not a separate parameter. Now, we have all of those things but we have swapped \({\mathbf{\rm Q}}\) plus branch lengths for \({\mathbf{\rm P}}\).

We will discuss substitution models in more detail in the next section.

7.7.3. What’s this maximising stuff…?#

I’ll give you a stripped down example of how this sort of operates. Let’s make some very strict assumptions to make the demonstration easier. I assume

a star phylogeny for three taxa
the branch length on each edge is identical (so there’s only one length)
the sequences evolve according to the Jukes Cantor substitution model (a very simple model, The Jukes-Cantor substitution model).
Jukes-Cantor assumes the nucleotides occur with equal frequency in the ancestor

So we only have one free parameter, the branch length which I will refer to as \(t\). I am now going to start with \(t\)=0.001. I compute my \({\mathbf{\rm P}}\) matrix for that value of time and then the conditional probability of observing an alignment. I increment \(t\) by 0.001 and repeat until I get to some upper limit of \(t\).

The results of this calculation are shown below with the log-likelihood on the \(y\)-axis and the branch length (\(t\)) on the \(x\)-axis. This is a brute-force “line search” and not how we normally do things!

The peak of these data points (when \(t \approx 0.015\)) is the maximum likelihood. That is, the value of \(t\) that maximises the likelihood. This value of \(t\) gets a special label – we call it the maximum likelihood estimator (or MLE) [6].

When you have a really complex function, a probability model with lots of parameters, you don’t have one line search, you have a line search for every parameter. Numerical optimisation algorithms handle that. It’s a bit of magic, and it’s definitely beyond the scope of this course.

7.7.4. What are we actually measuring?#

In a statistical sense, when using a likelihood model we are fitting it to the distribution of distinct alignment columns. For example, the alignment shown above has 5 distinct alignment columns, one of which occurs twice. In our working through the calculation of the likelihood for an alignment column above, we produced a \(\mathcal{L}_i\), the likelihood value for alignment column \(i\). This value is the expected frequency of that alignment column under the model.

Our model arrives at those expected values by specifying the relationships amongst the sequences (the tree) and the way sequences change through time (the substitution model). The model is thus measuring variation in the sequence through time.

In a biological sense, it is the factors that influence genetic variation that we measure. Those changes arise from the influence of the processes illustrated earlier. When we compare sequences between vertebrate species, for instance, we are observing genetic differences that originated from mutagenic events in sex cells. It is events that occur in the germline that shape the distribution of genetic variation in the next generation. The formation of DNA lesions and their repair in the germline that give rise to mutations and the processes that shape those events shape polymorphism and the processes that shape polymorphism shape substitutions. With these simple models, we measure the aggregate of all of these factors.

Precisely how much we can learn about the origins of genetic variation depends on the model of sequence change that we use. In other words, it is the definition of our substitution models that dictates what information we can extract from genetic variation. We will address that in the next section.

7.7.5. Assumptions#

A single tree topology: All positions of a sequence share exactly the same evolutionary history with respect to the genealogy. This can be violated when there are recombination events amongst members of a gene family, for instance. Or, in the case of microbes, there have been horizontal (or lateral) gene transfer events.
Independence of alignment columns: In the calculation of the likelihood of an alignment, we take the product of the likelihoods from all alignment columns. That is an assumption of independence. Stated another way, we assume the alignment columns are evolving independently of each other. We know that’s not true. Recall that we previously tested whether nucleotides occurred independently of each other and rejected the null hypothesis in that case.
Independent and identically distributed, or iid: This is a further refinement of the above. We applied exactly the same model to every alignment column. This corresponds to assuming that every position in the alignment is evolving according to exactly the same process. So we are explicitly imposing the same process for every column of the alignment. Those two conditions are referred to as iid – independent and identically distributed.
Independence between lineages (branches): We assume that the different lineages on a tree are not interacting with each other in some way that’s affecting the substitution rate.

Note

There are other assumptions of the substitution models which we will cover in that section.

Citations

[Fel81] (1,2)

J Felsenstein. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol., 17:368–376, 1981.