9.5. Acknowledging the limits of an analysis
“Since all models are wrong the scientist must be alert to what is importantly wrong.”
—Professor George Box, [Box76]
9.5.1. The impact of model choice
It’s easy to claim that all models are wrong and that we should be cautious, but nothing beats a demonstration of how a seemingly simple analysis can prove remarkably misleading. In this spirit, I have generated two synthetic sequences based on the reference V. cholerae genome.
We first apply the \(\chi^2\) test for homogeneity to the nucleotide counts of the two synthetic sequences. With a \(p\)-value \(\approx\) 0.69, we cannot reject the null hypothesis that the two sequences share the same nucleotide composition.
chisq | df | pvalue |
---|---|---|
1.474 | 3 | 0.6882 |
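
To make the calculation behind a table like this concrete, here is a minimal sketch of a \(\chi^2\) test for homogeneity on two vectors of counts, using only the Python standard library. It is not the code that produced the table. The function names `chi2_homogeneity` and `chi2_sf` and the counts in the usage lines are hypothetical, chosen for illustration. The tail probability uses the closed-form expression that holds for integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    y = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-y) * sum_{k=0}^{df/2 - 1} y^k / k!
        term = math.exp(-y)
        total = term
        for k in range(1, df // 2):
            term *= y / k
            total += term
        return total
    # Odd df: erfc(sqrt(y)) + exp(-y) * sum_{k=1}^{(df-1)/2} y^(k-1/2) / Gamma(k+1/2)
    total = math.erfc(math.sqrt(y))
    term = math.sqrt(y) * math.exp(-y) / math.gamma(1.5)
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= y / (k + 0.5)
    return total


def chi2_homogeneity(counts_a, counts_b):
    """Return (statistic, df, p-value) for two equally ordered count vectors."""
    if len(counts_a) != len(counts_b):
        raise ValueError("both samples must use the same categories")
    total_a, total_b = sum(counts_a), sum(counts_b)
    grand_total = total_a + total_b
    stat = 0.0
    for obs_a, obs_b in zip(counts_a, counts_b):
        category_total = obs_a + obs_b
        if category_total == 0:
            raise ValueError("every category must be observed at least once")
        # Expected count = row total * column total / grand total
        exp_a = total_a * category_total / grand_total
        exp_b = total_b * category_total / grand_total
        stat += (obs_a - exp_a) ** 2 / exp_a + (obs_b - exp_b) ** 2 / exp_b
    df = len(counts_a) - 1  # (categories - 1) * (samples - 1), with 2 samples
    return stat, df, chi2_sf(stat, df)


# Hypothetical A, C, G, T counts for two sequences (illustration only)
stat, df, pvalue = chi2_homogeneity([250, 240, 260, 250], [245, 255, 250, 250])
print(f"chisq={stat:.3f} df={df} pvalue={pvalue:.4f}")
```

With four nucleotide categories and two sequences, the degrees of freedom are \((4-1)\times(2-1)=3\), which matches the `df` column above.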
We now apply the same test to the dinucleotide counts of the two synthetic sequences (rather than the nucleotide counts). The resulting \(p\)-value is so small that it falls below the limit of my computer's floating-point precision, so it is reported as zero.
chisq | df | pvalue |
---|---|---|
2932.022 | 15 | 0.00e+00 |
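
The contrast between the two tables can be reproduced in miniature. The sketch below uses two toy sequences (hypothetical, not the synthetic sequences analysed here) that have identical nucleotide counts but entirely different dinucleotide counts. The helper `kmer_counts` is my own illustrative name. I assume dinucleotides are counted in non-overlapping windows, since the text does not say how they were counted.

```python
from collections import Counter


def kmer_counts(seq, k, overlapping=False):
    """Count k-mers in seq, using non-overlapping windows by default."""
    step = 1 if overlapping else k
    return Counter(seq[i:i + k] for i in range(0, len(seq) - k + 1, step))


# Toy sequences: same length, same nucleotide composition, different ordering
seq_a = "ACGT" * 50
seq_b = "AACCGGTT" * 25

nuc_a, nuc_b = kmer_counts(seq_a, 1), kmer_counts(seq_b, 1)
dinuc_a, dinuc_b = kmer_counts(seq_a, 2), kmer_counts(seq_b, 2)

print(nuc_a == nuc_b)      # True: 50 each of A, C, G and T in both
print(dinuc_a == dinuc_b)  # False: AC/GT in seq_a versus AA/CC/GG/TT in seq_b
```

A test on nucleotide counts cannot distinguish these two sequences, while a test on dinucleotide counts separates them completely. With 16 possible dinucleotides and two sequences, the degrees of freedom are \((16-1)\times(2-1)=15\), as in the `df` column above.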
This example was deliberately constructed to demonstrate that applying different models to the same data can produce contradictory outcomes. Because I generated the data, I knew this problem existed. Analogous situations do arise in the analysis of real data.
So how do you address such a possibility when analysing real data? You must employ your expert knowledge of the scientific domain. Specifically, based on what you understand about the experimental procedures and the biological and/or chemical properties of the system you are examining, is the model you have used reasonable? Has it been used by others [1]? In our first case study we demonstrated that nucleotides do not occur randomly, a property that has been shown to be characteristic of our own genome and that has been argued to originate from mechanisms of mutagenesis [SH20].
The most critical step you can take is to acknowledge the uncertainty present in any analysis. Your conclusions rest on the assumptions required by the model and, if those are incorrect, your conclusions may be too.
Citations
[Box76] George E P Box. Science and Statistics. Journal of the American Statistical Association, 71(356):791–799, 1976. doi:10.2307/2286841.
[SH20] Helmut Simon and Gavin Huttley. Quantifying Influences on Intragenomic Mutation Rate. G3: Genes|Genomes|Genetics, 10(8):g3.401335.2020, June 2020. doi:10.1534/g3.120.401335.