7.9. Hypothesis Testing#

The likelihood models are parametric models, they are defined in terms of specific parameters whose values we seek to estimate from data. One of the major advantages of the development of phylogeny-based maximum likelihood methods is they allow us to take advantage of the large wealth of theoretical developments relating to using maximum likelihood for statistical inference. Developments that include hypothesis testing.

The ratio of the likelihood of two different models is a natural measure for hypothesis testing. Let’s first define the LR statistic [1].

(1)#\[\text{LR} = 2 (\ln \mathcal{L}_{\text{alt}} - \ln \mathcal{L}_{\text{null}})\]

Note that the likelihood from the alternate comes first. For nested models, it is guaranteed that the likelihood for the alternate will be ≥ that for the null [2].

We obtain \(\ln \mathcal{L}_{\text{null}}\) and \(\ln \mathcal{L}_{\text{alt}}\) after fitting [3] both models to the data. If \(H_o\) and \(H_a\) are nested (more on that below), then the LR statistic computed according to the equation has the convenient property of being \(\chi^2\) distributed with degrees of freedom (abbreviated df) equal to the difference in the number of free parameters (abbreviated nfp) between those two models [4].

(2)#\[\text{df} = \text{nfp}_{\text{alt}} - \text{nfp}_{\text{null}}\]

What being “\(\chi^2\) distributed” means is that we obtain the \(p\)-value for a LR from the theoretical \(\chi^2\) distribution. For example if LR=3.84 and df=1, then the \(p\)-value=0.05.

Note

The mathematical proof that LR statistics are \(\chi^2\) distributed assumes there is infinite data. In our case, the amount of data is specified by the length of the alignment. Clearly, alignments are never infinitely long so can we rely on this assumption? It has been demonstrated in some circumstances that alignments as short as a couple of hundred bases long are sufficient for the assumption of a \(\chi^2\) to be reasonable. The exact length will depend on the amount of variation between the sequences.

7.9.1. Nesting hypotheses#

Hypotheses are considered nested if one model can be specified as a mathematical restriction of another. Recall the rate matrices from F81 and HKY85. These rate matrices are distinguished by the \(\kappa\) parameter in HKY85. Because of the multiplicative nature of the definition of HKY85, imposing the constraint \(\kappa\)=1 transforms the HKY85 rate matrix into F81. Thus, we state that F81 is nested within HKY85. The null model in this case is \(\kappa=1\).

If we perform a hypothesis test and wind up in not rejecting the null, we may be accepting [5] that the sequences could reasonably have been generated by the F81 model. If we rejected the null, we are concluding the sequences were more likely to be generated by a model in which transition substitutions had a distinctive rate. In this case, the MLE for \(\kappa\) indicates whether the transition substitutions occurred faster / slower than the transversion substitutions (see The behaviour of parameters in Q).

To summarise, the hypothesis test outcome guides our choice of which model to prefer. The parameter values from that fitted model inform us as to why a model was superior (or adequate if we failed to reject the null).

This F81 vs HKY85 example illustrates the general pattern of nested relationships. There are a multitude of ways this approach is employed to construct hypotheses. Here are some specific examples.

7.9.1.1. Testing the molecular clock#

In this case, it is branch length parameters that are constrained. In our usual applications, branch lengths are typically free parameters. The clock model imposes a constraint on that which I’ll illustrate with an example.

For the above tree, we can test whether the Human and Chimpanzee lineages have been evolving in a clock-like manner since their descent from a common ancestor. We specify a clock model by constraining the Human and Chimpanzee branch lengths to be exactly the same. We would contrast that with a model in which the Human and Chimpanzee are allowed to be different.

7.9.1.2. Testing time-homogeneity#

Time-homogeneity is the case when only one calibrated rate matrix is used (see calibrating rate matrices). The alternate hypothesis is time-heterogeneity, in which their are multiple rate matrices. For example, if we take a time-homogeneous HKY85 model as our null we could specify the alternate as different values of \(\kappa\) are used for the Human and Chimpanzee branch [6].

7.9.1.3. Rate heterogeneity models#

Independent sites#

It has been known for a very long time [KJ69] that some sequence positions evolve at different rates. The standard assumption of the likelihood analyses thus far is that all alignment columns are treated as evolving under precisely the same substitution process and thus at the same rate. This is a rate-homogeneous model. Rate heterogeneity models seek to account for observed differences in rate.

In a nutshell, the typical rate-heterogeneity models are a “mixture model”. I illustrate this with the widely used \(\Gamma\) [7] distributed rate heterogeneity model. The \(\Gamma\) distribution shape is controlled by parameters that are included [8] as free parameters in the model. The distribution represents a scalar on branch length and is defined so its mean is 1. Typically, the \(\Gamma\) distribution is split into four bins [9] and the likelihood is computed for each bin separately. The mean value of each bin is used as a multiplier for all the branch lengths on a tree. The full likelihood is computed by summing the likelihoods computed for each bin.

Note

These models do not readily nest the rate-homogeneous models, so more complex methods are required for hypothesis testing.

Dependency amongst sites – phylo-HMMs#

In this form of rate heterogeneity, the rate categories of adjacent sites are correlated. So if one site is evolving slowly, the adjacent site is likely to belong to the same category. This form can be combined with the \(\Gamma\) models described above by addition of a switching parameter that controls whether to stay in the previous sites rate category or pick a new one.

This class of model are termed phylogenetic Hidden Markov Models (or phylo-HMM). (See the example application in the PyCogent publication [KMB+07].)

Note

These models also do not readily nest the rate-homogeneous models, so more complex methods are required for hypothesis testing.

Jointly modelling loci — concatenating alignments#

A common analysis strategy involves concatenating the alignments of different genes from the same group of species [10]. For instance, say I have alignments of one-to-one orthologs for gene A and B from 2 species. I concatenate these by simply concatenating the sequence strings in the same order for each species. For instance

from cogent3 import make_aligned_seqs

a = make_aligned_seqs(dict(B="AGA", A="AAA"), moltype="dna")
a
0
AAAA
B.G.

2 x 3 dna alignment

b = make_aligned_seqs(dict(A="TTT", B="TCT"), moltype="dna")
b
0
ATTT
B.C.

2 x 3 dna alignment

concat = a + b
concat
0
AAAATTT
B.G..C.

2 x 6 dna alignment

The modelling is then done on the concatenated alignment (concat in our example).

This corresponds to a null model in which the two loci are constrained to be evolving in an identical manner – the rate of evolution is the same. The alternate in this case is to model the evolution of the two genes separately – the rate of evolution differs between the genes.

7.9.1.4. Mixed time and rate heterogeneity models#

These are focussed on attempting to identify specific codons at which lineage specific natural selection has operated.

7.10. Exercises#

  1. In the case of the molecular clock, the two models are: (a) all branches are free parameters, (b) a pair of branches are constrained to be the same. Which of these is the null?

  2. What will be the degrees of freedom?


Citations

[KJ69]

J L King and T H Jukes. Non-Darwinian evolution. Science, 164:788–798, 1969.

[KMB+07]

Rob Knight, Peter Maxwell, Amanda Birmingham, Jason Carnes, J Gregory Caporaso, Brett C Easton, Michael Eaton, Micah Hamady, Helen Lindsay, Zongzhi Liu, Catherine Lozupone, Daniel McDonald, Michael Robeson, Raymond Sammut, Sandra Smit, Matthew J Wakefield, Jeremy Widmann, Shandy Wikman, Stephanie Wilson, Hua Ying, and Gavin A Huttley. PyCogent: a toolkit for making sense from sequence. Genome Biol, 8(8):R171, 2007. doi:10.1186/gb-2007-8-8-r171.