7.9. Hypothesis Testing#
The likelihood models are parametric models: they are defined in terms of specific parameters whose values we seek to estimate from data. One of the major advantages of the development of phylogeny-based maximum likelihood methods is that they allow us to take advantage of the large wealth of theoretical developments relating to using maximum likelihood for statistical inference, including hypothesis testing.
The ratio of the likelihood of two different models is a natural measure for hypothesis testing. Let’s first define the LR statistic [1].

\[\text{LR} = 2\,(\ln \mathcal{L}_{\text{alt}} - \ln \mathcal{L}_{\text{null}})\]
Note that the likelihood from the alternate comes first. For nested models, it is guaranteed that the likelihood for the alternate will be ≥ that for the null [2].
If this is not the case, it indicates that the likelihood of the alternate model has not been fully maximised during the fitting procedure. This typically occurs because the optimiser was trapped in a local optimum. Such cases are common with certain types of models, in particular the rate-heterogeneity models.
We obtain \(\ln \mathcal{L}_{\text{null}}\) and \(\ln \mathcal{L}_{\text{alt}}\) after fitting [3] both models to the data. If \(H_o\) and \(H_a\) are nested (more on that below), then the LR statistic computed according to the equation has the convenient property of being \(\chi^2\) distributed with degrees of freedom (abbreviated df) equal to the difference in the number of free parameters (abbreviated nfp) between those two models [4].
Fitting is another way of saying we use numerical optimisation to maximise the likelihood.
If the models are not nested, then we can still use the LR statistic but we can no longer use the \(\chi^2\) distribution to obtain the \(p\)-value. In that case, a parametric bootstrap procedure is an option.
What being “\(\chi^2\) distributed” means is that we obtain the \(p\)-value for a LR from the theoretical \(\chi^2\) distribution. For example, if LR=3.84 and df=1, then the \(p\)-value \(\approx 0.05\).
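To see this in practice, the tail probability of the \(\chi^2\) distribution can be obtained with scipy (assuming it is installed; this snippet is just an illustration and is not part of the cogent3 workflow).

from scipy.stats import chi2

# probability of a chi-squared value >= 3.84 with 1 degree of freedom
pvalue = chi2.sf(3.84, df=1)
round(pvalue, 3)  # ~0.05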
Note
The mathematical proof that LR statistics are \(\chi^2\) distributed assumes there is infinite data. In our case, the amount of data is specified by the length of the alignment. Clearly, alignments are never infinitely long, so can we rely on this assumption? It has been demonstrated in some circumstances that alignments as short as a couple of hundred bases are sufficient for the \(\chi^2\) assumption to be reasonable. The exact length required will depend on the amount of variation between the sequences.
7.9.1. Nesting hypotheses#
A model is a hypothesis. Hypothesis testing is a procedure for selecting the hypothesis (model) that best explains the data. Typically, hypothesis testing involves comparing only two hypotheses. For cases where there are more than two to choose from, we refer to the procedure as “model selection”.
As you can see from that short paragraph, the term which is used depends on the context.
Hypotheses are considered nested if one model can be specified as a mathematical restriction of another. Recall the rate matrices for F81 and HKY85. These rate matrices are distinguished by the \(\kappa\) parameter in HKY85. Because of the multiplicative nature of the definition of HKY85, imposing the constraint \(\kappa=1\) transforms the HKY85 rate matrix into that of F81. Thus, we say that F81 is nested within HKY85. The null model in this case is the constrained model with \(\kappa=1\).
If we perform a hypothesis test and wind up not rejecting the null, we may be accepting [5] that the sequences could reasonably have been generated by the F81 model. If we reject the null, we conclude the sequences were more likely to have been generated by a model in which transition substitutions had a distinct rate. In this case, the MLE for \(\kappa\) indicates whether transition substitutions occurred faster or slower than transversion substitutions (see The behaviour of parameters in Q).
It is often stated that one “accepts” the null hypothesis (e.g. when a \(p\)-value>0.05). However, this is not accurate since one may “fail to reject the null” because of low statistical power.
To summarise, the hypothesis test outcome guides our choice of which model to prefer. The parameter values from that fitted model inform us as to why a model was superior (or adequate if we failed to reject the null).
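As a concrete sketch, the F81 versus HKY85 comparison might be set up with cogent3 and scipy along these lines. The alignment file name and tree are placeholders, and the method names should be checked against the cogent3 documentation for your version.

from scipy.stats import chi2
from cogent3 import get_model, load_aligned_seqs, make_tree

# placeholder data: substitute your own alignment and a matching tree
aln = load_aligned_seqs("primate_brca1.fasta", moltype="dna")
tree = make_tree("(Human,Chimpanzee,Galago)")

# the alternate: HKY85 with kappa a free parameter
alt = get_model("HKY85").make_likelihood_function(tree)
alt.set_alignment(aln)
alt.optimise(show_progress=False)

# the null: constraining kappa=1 reduces HKY85 to F81
null = get_model("HKY85").make_likelihood_function(tree)
null.set_alignment(aln)
null.set_param_rule("kappa", is_constant=True, value=1.0)
null.optimise(show_progress=False)

LR = 2 * (alt.get_log_likelihood() - null.get_log_likelihood())
df = alt.get_num_free_params() - null.get_num_free_params()  # 1 in this case
pvalue = chi2.sf(LR, df)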
This F81 vs HKY85 example illustrates the general pattern of nested relationships. There are a multitude of ways this approach is employed to construct hypotheses. Here are some specific examples.
7.9.1.1. Testing the molecular clock#
In this case, it is branch length parameters that are constrained. In our usual applications, branch lengths are free parameters. The clock model imposes a constraint on those branch lengths, which I’ll illustrate with an example.
For the above tree, we can test whether the Human and Chimpanzee lineages have been evolving in a clock-like manner since their descent from a common ancestor. We specify a clock model by constraining the Human and Chimpanzee branch lengths to be exactly the same. We contrast that with a model in which the Human and Chimpanzee branch lengths are allowed to differ.
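A minimal sketch of these two models with cogent3, using placeholder data and F81 for simplicity (check the method names against your cogent3 version):

from cogent3 import get_model, load_aligned_seqs, make_tree

aln = load_aligned_seqs("primate_brca1.fasta", moltype="dna")  # placeholder file
tree = make_tree("(Human,Chimpanzee,Galago)")
lf = get_model("F81").make_likelihood_function(tree)
lf.set_alignment(aln)

# null (clock) model: the Human and Chimpanzee branch lengths share one value
lf.set_param_rule("length", edges=["Human", "Chimpanzee"], is_independent=False)
lf.optimise(show_progress=False)
clock_lnL = lf.get_log_likelihood()

# alternate model: the two branch lengths are free to differ
lf.set_param_rule("length", edges=["Human", "Chimpanzee"], is_independent=True)
lf.optimise(show_progress=False)
no_clock_lnL = lf.get_log_likelihood()

LR = 2 * (no_clock_lnL - clock_lnL)  # chi-squared distributed with df=1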
7.9.1.2. Testing time-homogeneity#
Time-homogeneity is the case where only one calibrated rate matrix is used (see calibrating rate matrices). The alternate hypothesis is time-heterogeneity, in which there are multiple rate matrices. For example, if we take a time-homogeneous HKY85 model as our null, we could specify an alternate in which different values of \(\kappa\) are used for the Human and Chimpanzee branches [6].
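A sketch of how such an alternate could be specified (same placeholder data and caveats as the earlier sketches):

from cogent3 import get_model, load_aligned_seqs, make_tree

aln = load_aligned_seqs("primate_brca1.fasta", moltype="dna")  # placeholder file
tree = make_tree("(Human,Chimpanzee,Galago)")
lf = get_model("HKY85").make_likelihood_function(tree)
lf.set_alignment(aln)

# alternate: the Human and Chimpanzee branches each get their own kappa,
# distinct from the kappa shared by the remaining branches
lf.set_param_rule("kappa", edges=["Human", "Chimpanzee"], is_independent=True)
lf.optimise(show_progress=False)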
This style of test is most often applied to codon models (which we have not covered) but is also used to examine non-stationary nucleotide substitution models.
7.9.1.3. Rate heterogeneity models#
Independent sites#
It has been known for a very long time [KJ69] that some sequence positions evolve at different rates. The standard assumption of the likelihood analyses thus far is that all alignment columns are treated as evolving under precisely the same substitution process and thus at the same rate. This is a rate-homogeneous model. Rate heterogeneity models seek to account for observed differences in rate.
In a nutshell, the typical rate-heterogeneity models are “mixture models”. I illustrate this with the widely used \(\Gamma\) [7] distributed rate heterogeneity model. The shape of the \(\Gamma\) distribution is controlled by parameters that are included [8] as free parameters in the model. The distribution represents a scaling factor on branch lengths and is defined so its mean is 1. Typically, the \(\Gamma\) distribution is split into four bins [9] and the likelihood is computed for each bin separately. The mean value of each bin is used as a multiplier for all the branch lengths on the tree. The full likelihood is then computed by summing, for each alignment column, the per-bin likelihoods weighted by the bin probabilities.
That’s the symbol capital gamma.
I’m skipping quite a bit of detail!
You can think of each bin as corresponding to a rate category. For the 4-bin case, really slow, somewhat slow, somewhat fast and really fast.
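To make the binning concrete, here is a small sketch (using numpy and scipy, not cogent3) of one common way to compute the mean rate for each of four equal-probability bins of a mean-1 \(\Gamma\) distribution.

import numpy as np
from scipy.stats import gamma

def discrete_gamma_rates(shape, nbins=4):
    scale = 1 / shape  # mean of the gamma is shape * scale = 1
    # boundaries of nbins equal-probability bins
    cuts = gamma.ppf(np.arange(1, nbins) / nbins, a=shape, scale=scale)
    bounds = np.concatenate([[0.0], cuts, [np.inf]])
    # mean rate within each bin, via the incomplete-gamma identity
    upper = gamma.cdf(bounds[1:], a=shape + 1, scale=scale)
    lower = gamma.cdf(bounds[:-1], a=shape + 1, scale=scale)
    return (upper - lower) * nbins  # each bin has probability 1/nbins

rates = discrete_gamma_rates(0.5)  # a strongly skewed rate distribution
# each rate multiplies every branch length for its bin; the rates average to 1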
Note
These models do not readily nest the rate-homogeneous models, so more complex methods are required for hypothesis testing.
Dependency amongst sites – phylo-HMMs#
In this form of rate heterogeneity, the rate categories of adjacent sites are correlated. So if one site is evolving slowly, the adjacent site is likely to belong to the same category. This form can be combined with the \(\Gamma\) models described above by adding a switching parameter that controls whether a site stays in the previous site’s rate category or picks a new one.
These models are termed phylogenetic Hidden Markov Models (or phylo-HMMs). (See the example application in the PyCogent publication [KMB+07].)
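Purely as an illustration of the switching idea (this is not the cogent3 implementation), the transition probabilities between the rate categories of adjacent sites can be written as a small matrix governed by a single switch parameter.

import numpy as np

nbins = 4
bin_probs = np.full(nbins, 1 / nbins)  # equal-probability rate categories
switch = 0.2  # hypothetical value; 0 would mean adjacent sites always share a category

# with probability (1 - switch) the next site keeps the current category,
# otherwise its category is drawn afresh from bin_probs
T = (1 - switch) * np.eye(nbins) + switch * np.tile(bin_probs, (nbins, 1))
assert np.allclose(T.sum(axis=1), 1.0)  # each row is a probability distribution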
Note
These models also do not readily nest the rate-homogeneous models, so more complex methods are required for hypothesis testing.
Jointly modelling loci — concatenating alignments#
A common analysis strategy involves concatenating the alignments of different genes from the same group of species [10]. For instance, say I have alignments of one-to-one orthologs for genes A and B from 2 species. I concatenate these by simply joining the sequence strings in the same order for each species. For instance
from cogent3 import make_aligned_seqs
a = make_aligned_seqs(dict(B="AGA", A="AAA"), moltype="dna")
a
A    AAA
B    .G.
2 x 3 dna alignment
b = make_aligned_seqs(dict(A="TTT", B="TCT"), moltype="dna")
b
A    TTT
B    .C.
2 x 3 dna alignment
concat = a + b
concat
A    AAATTT
B    .G..C.
2 x 6 dna alignment
The modelling is then done on the concatenated alignment (concat in our example).
This is typically a bad idea from a statistical perspective (you’re making an additional assumption of rate homogeneity across the loci). It is a good idea if you are only concerned with computational speed, want to finish fast and don’t care about the reliability of the results.
This corresponds to a null model in which the two loci are constrained to be evolving in an identical manner – the rate of evolution is the same. The alternate in this case is to model the evolution of the two genes separately – the rate of evolution differs between the genes.
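Using the toy alignments above purely as an illustration (they are far too short for a meaningful test, and with only two sequences the branch lengths are confounded), the comparison could be sketched as follows.

from scipy.stats import chi2
from cogent3 import get_model, make_tree

tree = make_tree("(A,B)")  # the two species from the toy alignments

def fit(aln):
    lf = get_model("F81").make_likelihood_function(tree)
    lf.set_alignment(aln)
    lf.optimise(show_progress=False)
    return lf

null = fit(concat)  # one shared set of parameters for both loci
alt_a, alt_b = fit(a), fit(b)  # each locus fitted with its own parameters

LR = 2 * (alt_a.get_log_likelihood() + alt_b.get_log_likelihood() - null.get_log_likelihood())
df = alt_a.get_num_free_params() + alt_b.get_num_free_params() - null.get_num_free_params()
pvalue = chi2.sf(LR, df)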
7.9.1.4. Mixed time and rate heterogeneity models#
These are focussed on attempting to identify specific codons at which lineage-specific natural selection has operated.
7.10. Exercises#
In the case of the molecular clock, the two models are: (a) all branches are free parameters, (b) a pair of branches are constrained to be the same. Which of these is the null?
What will be the degrees of freedom?
Citations
J L King and T H Jukes. Non-Darwinian evolution. Science, 164:788–798, 1969.
Rob Knight, Peter Maxwell, Amanda Birmingham, Jason Carnes, J Gregory Caporaso, Brett C Easton, Michael Eaton, Micah Hamady, Helen Lindsay, Zongzhi Liu, Catherine Lozupone, Daniel McDonald, Michael Robeson, Raymond Sammut, Sandra Smit, Matthew J Wakefield, Jeremy Widmann, Shandy Wikman, Stephanie Wilson, Hua Ying, and Gavin A Huttley. PyCogent: a toolkit for making sense from sequence. Genome Biol, 8(8):R171, 2007. doi:10.1186/gb-2007-8-8-r171.