9.2. Case studies#
We will develop a perspective on statistical analysis through addressing two different problems. In these case studies, I’m describing the problem in a very procedural way, focussing on the essential steps to deliver a \(p\)-value. I am taking this approach because many of the students (and in fact many of the scientists) I have interacted with believe that this is all statistical analysis entails.
The point of this topic is explain why this a manifestly superficial approach to statistical analysis. In subsequent sections I develop our interpretation of the analysis results I obtain here.
9.2.1. Case study 1 – do nucleotides occur randomly in a DNA sequence?#
Much of my own research concerns understanding organisation [1] of DNA sequences and the mechanisms by which those properties arise. Inevitably, this leads me to asking questions about the role that mutation plays in creating patterns in DNA. This particular case study is focussed on a fundamental property of DNA that has important implications for a large number of fields, including molecular evolutionary analyses such as phylogenetics, and cancer genetics. We now follow the 4-step procedure.
9.2.1.1. Step 1 - the biological problem#
0 | |
demo | ATGAAATCCA |
DnaSequence, length=10
A 10bp snippet of the sample sequence. Click to download the sequence data
.
For the single DNA sequence, we want to evaluate whether nucleotides occur randomly along the sequence [2]. In other words, given a nucleotide C
, is its 3’- neighbour a random draw from all 4 nucleotides?
We are referring to nucleotides on a single strand of the double-helix.
9.2.1.2. Step 2 - selecting the measurement#
DNA sequences are represented by the discrete series of symbols [3] which we can view as categories. As I’m interested in asking whether these symbols occur randomly, my measurement needs to allow me to distinguish between randomness and non-randomness.
In the statistical parlance of categorical analysis, the synonyms for randomness and non-randomness are independence and dependence respectively. These synonyms help me identify the \(\chi^2\) test of independence as the test I will use. (Selecting this statistical test means we have selected a model for the null distribution, but more on that when we get to the Interpretation step.)
These symbols are the letters of the DNA alphabet A, C, G, T.
That all sounds great, but how can I transform my data in such a way that it can be used for this analysis?
9.2.1.3. Step 3 - writing the algorithm#
In short, we need to produce what’s referred to as a contingency-table. This is just a table of counts of the number of occurrences of each possibility. Consider the nucleotide A
. For seq
, how many times does that occur as: AA
, AC
, AG
, AT
? For the nucleotide C
, we need to count the number of occurrences of CA
, CC
, CG
, CT
. Then for G
and T
. The pattern here is we need to count dinucleotides!
Awesome, I need to write an algorithm to convert a sequence into dinucleotides. For example, 'ATGAAT'
should be converted to ['AT', 'GA', 'AT']
. The next part is to convert that series of dinucleotides into a \(4\times 4\) table with the row labels corresponding to the first (the 5’-) nucleotide of the dinucleotide and the column labels corresponding to the second (3’-) nucleotide of the dinucleotide.
Applying my algorithm to seq
I get
A | C | G | T | |
---|---|---|---|---|
A | 13 | 19 | 27 | 5 |
C | 29 | 52 | 19 | 14 |
G | 11 | 26 | 28 | 13 |
T | 5 | 15 | 17 | 7 |
Further applying the \(\chi^2\) test to this data produces the following result, a \(\chi^2\) test statistic, associated degrees-of-freedom and \(p\)-value. (We will expand on these later.)
chisq | df | pvalue |
---|---|---|
22.577 | 9 | 0.0072 |
Note
What we do with this result corresponds to the last of the essential steps, which we do later.
9.2.2. Case study 2 – evaluate whether a query sequence belongs to a pathogen#
ID | A | C | G | T |
---|---|---|---|---|
CP043554.1 | 399 | 345 | 488 | 321 |
ID | A | C | G | T |
---|---|---|---|---|
AP023375.1 | 394 | 341 | 485 | 320 |
CP053802.1 | 398 | 345 | 488 | 322 |
CP040172.1 | 397 | 343 | 487 | 322 |
CP090387.1 | 396 | 342 | 485 | 320 |
CP007634.1 | 397 | 343 | 487 | 322 |
... | ... | ... | ... | ... |
CP016987.1 | 393 | 341 | 484 | 318 |
CP025936.1 | 397 | 343 | 488 | 322 |
CP010812.1 | 393 | 338 | 483 | 320 |
CP003330.1 | 396 | 342 | 486 | 320 |
LT992486.1 | 391 | 339 | 486 | 316 |
Top 5 and bottom 5 rows from 87 rows x 5 columns
Nucleotide counts from the 16S rRNA gene of the reference and query genomes.
The properties of DNA are often used for identification purposes. In this case study we are interested in the detection of the cholera pathogen in amplicon sequence data. A standard approach in analyses of microbial communities is to use the DNA sequence from the 16S rRNA gene (hereafter abbreviated 16S) as a species marker. These can be sampled from an environment by PCR amplification and subsequent high-throughput DNA sequencing.
9.2.2.1. Step 1 - the biological problem#
Our question is whether a “query” 16S DNA sequence could be a member of Vibrio cholerae (some strains of which cause cholera).
If the query sequence belongs to V. cholerae, then its DNA sequence should be very similar to the reference [4] V. cholerae pathogen 16S sequence.
We use the NCBI defined reference for V. cholerae.
9.2.2.2. Step 2 - selecting the measurement#
As for Case study 1, because DNA sequences are our basic data type we are dealing with categorical data. We can measure the similarity between DNA sequences in a very large number of ways. One computationally efficient way is to just compare nucleotide counts. If the query belongs to a different species than the reference sequence, we expect the abundance of nucleotides will be less similar than if it is from the same species. In other words, I’m expressing this measurement problem as one of establishing whether a query is not V. cholerae.
Conveniently, this problem can also be evaluated using a \(\chi^2\) test. In this case, it is a homogeneity test [5].
Which tests whether the query comes from the same population as the reference.
9.2.2.3. Step 3 - writing the algorithm#
The first part of the algorithm here is quite simple. For each sequence, we count the number of A
, C
, G
and T
. The result of this is displayed in the case study data overview. The second part of the algorithm selects the counts for a single query and combines those with the counts from the reference, to produce a \(2 \times 4\) contingency table. For example, with AP023375.1
as the query we generate
ID | A | C | G | T |
---|---|---|---|---|
CP043554.1 | 399 | 345 | 488 | 321 |
AP023375.1 | 394 | 341 | 485 | 320 |
Performing the \(\chi^2\) homogeneity test produces a \(\chi^2\) test statistic, associated degrees-of-freedom and \(p\)-value.
chisq | df | pvalue |
---|---|---|
0.011 | 3 | 0.9997 |
This procedure needs to be applied to all the query samples. Again, we will perform the last, interpretation, of out essential setsp later.