Step 4 – Interpretation and looking past the p-value

9.4. Step 4 – Interpretation and looking past the \(p\)-value#

In the listing of 4 essential steps for data analysis, the final listing concerns interpreting the results of our analysis with respect to our original question.

9.4.1. Case Study 1#

I reproduce below the result from analysis of case study 1 sequence.

Chisq-test for independence
chisqdfpvalue
22.57790.0072

As explained earlier, the \(p\)-value measures the probability that seq originated from a process whereby nucleotides occur independently. A \(p\)-value\(\approx\)0.007 indicates this is a rare occurrence. Adopting the conventional threshold of \(\alpha\)=0.05 as our cutoff for declaring we have obtained a “significant” result, we reject that null model as a likely explanation for seq. This leads us to the view that nucleotides actually occur in a dependent manner. That seems OK, but we need to challenge our own interpretation.

Simply drawing a conclusion about our data based on the \(p\)-value can be misleading. The amount of data has a direct effect on the power of a statistical test. This translates to \(p\)-values being very sensitive to sample size. For instance, what if we had an enormously long sequence? The statistical power in such a case is so high that we would almost certainly wind up with a tiny \(p\)-value [1]. We might be able to convince ourselves that this is not the case here since the sequence is only 600bp long (as distinct from billions of nucleotides long in the case).

A more valuable measure is to consider how the data depart from the expectation of the null hypothesis. For categorical data analysis, as we have done here, one very informative quantity are the model residuals. A residual is a scaled measure of the difference between the observed and expected quantities.

Residuals
ACGT
A0.1782-1.00111.7219-1.1510
C1.48251.4470-2.6494-0.2130
G-1.0507-0.57820.89220.8981
T-1.2023-0.35201.00000.5352

The other merit of examining residuals is that we move from saying that our data differs from the null to demonstrating how that difference occurs. In the above case, the largest absolute residual is -2.7 for the CG dinucleotide. That the residual is negtaive indicates a striking deficit in our data compared to what our null model predicts. This matches up with published data about mutagenesis. Namely, that the CG (or more precisely CpG) dinucleotide is the sequence context within which C becomes methylated (to produce 5-methyl-cytosine, or 5mC). It has been demonstrated that the mutation rate of \(5mC\rightarrow T\) is ~10x the mutation rate of \(C\rightarrow T\). This mutagenic mechanism thus predicts a deficit of the CpG dinucleotide. So in this case, we have a result which is consistent with evidence from independent studies [KBC98].

9.4.2. Case Study 2#

By now you’ve probably been wondering why I bothered even having the second case study. Fear not, its time has finally come!

Case Study 2 was selected because it is nearly the reverse of a conventional statistical analysis. Typically, the scientist secretly hopes their statistical analysis will have provided evidence (i.e. a \(p\)-value) that enables them to conclude their null model is unlikely. Thus, their interest in the outcome is maximised when the null is rejected. Recall that our null model for this case study is that a query sample is drawn from the same distribution as our reference sample which is the V. cholerae 16S gene. If the test returns a \(p\)-value>0.05 we face the unpleasant prospect of dealing with an outbreak of cholera. In other words, our interest is maximised when the null is accepted.

There are a multitude of reasons why this approach to detecting cholera is super bad and should never be used. In order to explore these we need to think about what underpins our choices. We could reasonably expect that the DNA sequence of V. cholera encodes the pathogenicity of this bacteria. In order to fit that genomic sequence into a conventional statistical analysis we have adopted an extreme data reduction approach – we selected a single gene and we reduced that gene into just the counts of nucleotides. In doing such a radical data transformation we are assuming that we can uniquely identify a pathogenic cholera bacterium purely from this approach.

The test we employed required the assumption that the genome of cholera exhibits a distinctive nucleotide composition to the sequences of other bacteria. Not only do we require all other bacteria to be markedly different to cholera, we also require all pathogenic cholera to not differ markedly from one another. Ultimately, the mechanism that causes species to differ in the nucleotide abundance is mutagenesis. As our first case study highlighted, mutagenesis is a process in which interactions amongs neighbouring bases can play a profound influence. But the analysis we employed assumed this influence does not exist.

In brief, we should not be confident in the results we obtain from this approach. Clearly, the model used is seriously wrong. In the next section we will demonstrate how model choice can profoundly affect your results.


Citations

[CBS+04]

Gregory M. Cooper, Michael Brudno, Eric A. Stone, Inna Dubchak, Serafim Batzoglou, and Arend Sidow. Characterization of Evolutionary Rates and Constraints in Three Mammalian Genomes. Genome Research, 14(4):539–548, 2004. doi:10.1101/gr.2034704.

[KBC98]

M Krawczak, E V Ball, and D N Cooper. Neighboring-nucleotide effects on the rates of germ-line single-base-pair substitution in human genes. Am J Hum Genet, 63(2):474–488, 1998. doi:10.1086/301965.