9.3. Distributions and the \(p\)-value#
In the Case studies, we executed the first 3 of the 4 essential steps for statistical data analysis. In each case study, the final output was a \(\chi^2\) test statistic, degrees-of-freedom and a \(p\)-value. In this section we are going to examine the following questions as a precursor to completing the 4th step.
1. What is a \(p\)-value and where does it come from?
2. What is a distribution?
3. Where do we get distributions from?
In order to answer (1) we really need to answer (2) and (3) first.
9.3.1. What is a distribution?#
In statistics, we use the term distribution to describe the relative frequency of occurrence of the values of a random variable. Say we were measuring plant height for a particular species. Plant height will be a "continuously distributed" random variable, meaning heights can assume an infinite number of values [1]. If we were to plot those heights along an \(x\)-axis, then as we obtained new data points we would eventually have to place points on top of each other. If we continued this sampling and plotting, a shape would emerge. This shape can be thought of as the distribution of the plant height random variable.
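As a minimal sketch of this idea, we can simulate plant heights and plot a histogram; the shape that emerges approximates the distribution. The assumption that heights are normally distributed, and the mean, standard deviation and sample size used, are all hypothetical choices for illustration only.

```python
from numpy.random import normal
import matplotlib.pyplot as plt

# hypothetical plant heights: normally distributed with
# mean 50 cm and standard deviation 8 cm (illustrative values)
heights = normal(loc=50, scale=8, size=1000)

plt.hist(heights, bins=30)  # the histogram shape approximates the distribution
plt.xlabel("Plant height (cm)")
plt.ylabel("Count")
plt.show()
```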
9.3.2. Where do we get distributions from?#
In the hypothetical example described above, the distribution of plant heights was obtained from the data itself. This is referred to as an empirical distribution.
(Figure: the distribution of \(\chi^2\) values when the null hypothesis is true. The values were generated using pseudo-random numbers drawn from a \(\chi^2\) distribution function with df=9.)
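Values like these can be drawn directly with `numpy`; a minimal sketch (the sample size is arbitrary):

```python
from numpy.random import chisquare

# draw pseudo-random values from a chi-squared distribution with df=9;
# the sample size of 10000 is arbitrary
null_chisq = chisquare(df=9, size=10000)
null_chisq[:5]
```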
For those of you who have done introductory courses in statistics, you have likely encountered so-called known theoretical distributions [2]. Examples include the Normal (or Gaussian) distribution, the Gamma distribution and the uniform distribution [3].
These may also be referred to as theoretical or analytical distributions because there are equations that describe them.
The uniform distribution is of particular use for understanding \(p\)-values.
If you choose a measurement that belongs to a known statistical distribution then you get a number of things for free, one being the ability to look up the \(p\)-value for a test statistic very efficiently.
9.3.3. Where do we get \(p\)-values from and how do we interpret them?#
A \(p\)-value is a fundamental measure in hypothesis testing that quantifies the consistency of the test statistic with the null hypothesis.
In both of our case studies we defined a reference condition which we sought to contrast our data with. For example, in case study 1 the reference condition is one of independence between adjacent nucleotides. I use the word reference because both the test statistic and its associated \(p\)-value are obtained in reference to the distribution of the test statistic when the null hypothesis is true. Stated another way, the reference condition corresponds to our null hypothesis [4].
To really demonstrate what the null, or reference, distribution corresponds to we will now generate one for the case study 1 problem.
In statistics, the words hypothesis and model can be synonyms.
9.3.3.1. Estimating a \(p\)-value computationally#
With the case study 1 sample data, we will evaluate whether nucleotides occur randomly along this sequence by writing an algorithm that generates the distribution of our test statistic.
Before we do anything, we first need to consider what the data should "look" like if our null hypothesis is correct. This will help us decide how to approach the problem algorithmically.
Our null posits that DNA sequences are just a random ordering of nucleotides. If this were true, we could make a DNA sequence by sequentially drawing nucleotides at random from a nucleotide pool until we reach the desired sequence length. This process will generate a DNA sequence whose dinucleotide frequencies are consistent with the following probability calculation.
Here, given the nucleotide frequencies \(p_i\) and \(p_j\), the expected frequency of the corresponding dinucleotide \(i,j\) under independence is

\[p_{i,j} = p_i \times p_j\]

We convert this expected frequency into an expected count (\(E_{i,j}\)) for a sequence of \(\ell\) dinucleotides as

\[E_{i,j} = \ell \times p_{i,j}\]
This is a crucial quantity for performing a \(\chi^2\) test for independence.
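To make the null generative process concrete, here is a minimal sketch of drawing nucleotides at random to build a synthetic sequence. The equal nucleotide frequencies and the sequence length are arbitrary assumptions for illustration.

```python
from numpy.random import choice

# build a random sequence by drawing nucleotides with equal
# probability; length 8 matches the toy example used below
random_seq = "".join(choice(list("ACGT"), size=8))
random_seq
```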
Algorithmic steps for computing a \(\chi^2\) statistic#
Before proceeding to generate the distribution, let us break the calculations down so that we can write our algorithm. We will use this simple DNA sequence, `"AACCCCGT"`, to illustrate the steps we need to take in order to compute a \(\chi^2\) statistic.
Split the sequence into dinucleotides: From our sample sequence, we need to produce the series of dinucleotides `["AA", "CC", "CC", "GT"]`.

```python
def seq_to_dinucs(seq):
    seq = "".join(seq)  # for the case when we get seq as a list
    dinucs = [seq[i: i + 2] for i in range(0, len(seq) - 1, 2)]
    return dinucs

dinucs = seq_to_dinucs("AACCCCGT")
dinucs
```

```
['AA', 'CC', 'CC', 'GT']
```
Define a nucleotide order: We need to make a square matrix of counts where rows and columns correspond to specific nucleotides. To enable this we define nucleotides to be in alphabetical order, e.g. `A` has index `0`.

```python
nucleotide_order = "ACGT"
```
Convert dinucleotides into pairs of indices: We use the nucleotide order to convert a dinucleotide string into two integers representing array coordinates, e.g. dinucleotide `"AA"` has indices `(0, 0)` while `"GT"` has indices `(2, 3)`. We will do this by writing a function that converts a single dinucleotide into coordinates. Applying this to the sample sequence we get

```python
def dinuc_to_indices(dinuc):
    return tuple(nucleotide_order.index(nuc) for nuc in dinuc)

coords = [dinuc_to_indices(dinuc) for dinuc in dinucs]
coords
```

```
[(0, 0), (1, 1), (1, 1), (2, 3)]
```
Use dinucleotide indices to make a dinucleotide counts matrix: We use a `numpy` array for the counts. Think of the row and column labels for this array as corresponding to the nucleotides present at the first and second position respectively of a dinucleotide. For our example, we get the following

```python
from numpy import zeros

def make_counts_matrix(coords):
    counts = zeros((4, 4), dtype=int)
    for i, j in coords:
        counts[i, j] += 1
    return counts

observed = make_counts_matrix(coords)
observed
```

```
array([[1, 0, 0, 0],
       [0, 2, 0, 0],
       [0, 0, 0, 1],
       [0, 0, 0, 0]])
```
Use counts of observed dinucleotides to compute their expected values: This can be achieved by first generating row and column sums, converting those to frequencies, plus a couple of other steps (detail is below).

```python
from numpy import outer

def get_expected(counts):
    total = counts.sum()  # number of dinucleotides
    row_sums = counts.sum(axis=1)
    col_sums = counts.sum(axis=0)
    # converting to frequencies
    row_probs = row_sums / total
    col_probs = col_sums / total
    # outer product gives us matrix of expected frequencies
    # multiplying by total gives matrix of expected counts
    expecteds = outer(row_probs, col_probs) * total
    return expecteds

expected = get_expected(observed)
expected
```

```
array([[0.25, 0.5 , 0.  , 0.25],
       [0.5 , 1.  , 0.  , 0.5 ],
       [0.25, 0.5 , 0.  , 0.25],
       [0.  , 0.  , 0.  , 0.  ]])
```
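As a quick sanity check of one entry: the first positions of our four dinucleotides give \(p_A=0.25\) and the second positions give \(p_C=0.5\), so

\[E_{A,C} = \ell \times p_A \times p_C = 4 \times 0.25 \times 0.5 = 0.5\]

which matches row \(A\), column \(C\) of the expected matrix above.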
Generate the \(\mathbf{\chi^2}\) statistic: This is defined as follows

\[\chi^2=\sum_i\sum_j\frac{(O_{i,j}-E_{i,j})^2}{E_{i,j}}\]

where \(O_{i,j}\) and \(E_{i,j}\) correspond to the observed and expected counts for dinucleotide \(i,j\). We express this as a Python function and apply it to our simple example. (The `numpy` array operations greatly simplify the calculation.)

```python
def calc_chisq(observed, expected):
    # observed and expected are both numpy arrays
    chisq = (observed - expected) ** 2 / expected
    return chisq.sum()

calc_chisq(observed, expected)
```

```
nan
```
Note

The `nan` that was output from `calc_chisq()` was generated because we were doing a division with 0 in the denominator. (See the output from `get_expected()` above.) So it's time to switch to using the full sequence now.
```python
from cogent3 import load_seq

seq = load_seq("data/case_study1.fasta", moltype="dna")
```
Let’s provide a simplified interface to all these function calls such that if we provide our sequence, all the above steps are invoked and we get back a \(\chi^2\) statistic [5].
Shuffling generates a permutation of the sequence, which is equivalent to sampling without replacement.
```python
def chiqsq_independent_nucs(seq):
    dinucs = seq_to_dinucs(seq)
    coords = [dinuc_to_indices(dinuc) for dinuc in dinucs]
    observed = make_counts_matrix(coords)
    expected = get_expected(observed)
    return calc_chisq(observed, expected)

chiqsq_independent_nucs(seq)
```

```
22.577013388423254
```
So that's nice: we are now able to compute the statistic of interest given a sequence. How do we generate the null?
Generating the null distribution computationally#
To produce a distribution of \(\chi^2\) test statistics we need a collection of sequences drawn from the null distribution!
We can generate synthetic sequences consistent with the null by randomly sampling from our actual data. This requires we have a means for producing randomised nucleotide orders from our observed data. Algorithms for generating pseudo-random numbers are important for scientific computing and, as you might expect, there are numerous choices. (Both the Python standard library and `numpy` come with built-in capabilities for generating such numbers using well regarded algorithms. We will use the one distributed with `numpy`.)
In our case, we will use a `shuffle()` function. Note that `shuffle()` works "in place", meaning it modifies the data you provide, so we need to convert our sequence into a list.
```python
from numpy.random import shuffle

tmp = list("AACCCCGT")
shuffle(tmp)
tmp
```

```
['G', 'C', 'A', 'C', 'A', 'T', 'C', 'C']
```
Will our functions still work if we give them a list?
```python
chiqsq_independent_nucs(list(seq))
```

```
22.577013388423254
```
Yup!
To recap, the `chiqsq_independent_nucs()` function takes a sequence and returns the \(\chi^2\) statistic for the independence of the nucleotides at the first and second positions of dinucleotides in that sequence. We want to generate the null distribution for this statistic so that we can assess how unusual the statistic from the case study 1 data is.
We first decide how big a distribution, i.e. how many synthetic sequences, we will generate. Each of these synthetic sequences is generated in accordance with the null by shuffling the original sequence.
The quantities (and corresponding variables) required to estimate the \(p\)-value are

- \(\chi^2_o\) (`obs_chisq`): The statistic from the original (observed) sequence.
- \(\chi^2_s\) (`syn_chisq`): The statistic computed from a synthetic sequence.
- \(\mathcal{N}\) (`num_reps`): The number of synthetic sequences to evaluate.
- \(g\) (`num_gt`): The number of synthetic sequences for which \(\chi^2_s \ge \chi^2_o\).

From these quantities we estimate the probability of a \(\chi^2_o\) of equal or greater magnitude occurring by chance under the null model as \(p\)-value\(=\frac{g}{\mathcal{N}}\).
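For example, if \(g=35\) of \(\mathcal{N}=4000\) synthetic statistics equal or exceed \(\chi^2_o\), the estimated \(p\)-value is \(35/4000=0.00875\), which corresponds to the result we obtain below.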
So here’s the final function.
```python
def estimate_chisq_pval(seq, num_reps):
    obs_chisq = chiqsq_independent_nucs(seq)
    seq = list(seq)
    num_gt = 0
    for _ in range(num_reps):
        shuffle(seq)
        syn_chisq = chiqsq_independent_nucs(seq)
        if syn_chisq >= obs_chisq:
            num_gt += 1
    return num_gt / num_reps

estimate_chisq_pval(seq, 4000)
```

```
0.00875
```
This type of approach to statistical estimation, which relies on resampling from observed data, is also known as bootstrapping.
Note

There is a relationship between \(p\)-values and quantiles. The value returned by `estimate_chisq_pval()` is 1 minus the quantile of 22.577 in the simulated \(\chi^2_s\) distribution.
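A minimal sketch of that relationship, using pseudo-random draws from a \(\chi^2\) distribution with df=9 as a stand-in for the shuffled-sequence statistics (the sample size is arbitrary):

```python
from numpy.random import chisquare

# stand-in for the simulated null distribution of the statistic
syn_chisq = chisquare(df=9, size=10000)

obs_chisq = 22.577
quantile = (syn_chisq < obs_chisq).mean()  # empirical CDF at the observed value
pval = (syn_chisq >= obs_chisq).mean()     # equals 1 - quantile
quantile + pval  # 1.0
```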
9.3.3.2. Estimating a \(p\)-value using a theoretical distribution#
The above is my attempt at making concrete the origins of \(p\)-values, an essential component of hypothesis testing. I have developed this discourse with a focus on the first of our case studies.
For case study 1, we can also take the more conventional approach of assuming the theoretical \(\chi^2\) distribution is appropriate and just use that to "look up" the \(p\)-value. Doing so requires we know the degrees-of-freedom (\(df\)) for our test. In most statistical analysis tools, the df is automatically computed. Certainly for this case, it's not a hard calculation! For the \(\chi^2\) test applied to a contingency table with \(m\) rows and \(n\) columns, \(df=(m-1) \times (n-1)\), which is 9 in this case. The result is what we saw earlier.
| chisq  | df | pvalue |
|--------|----|--------|
| 22.577 | 9  | 0.0072 |
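Such a lookup can be done via the survival function of the theoretical distribution; a minimal sketch using `scipy` (assuming it is installed):

```python
from scipy.stats import chi2

# probability of a chi-squared value >= 22.577 with df=9
chi2.sf(22.577, df=9)  # ~0.0072
```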
As you can see, this \(p\)-value is close to that estimated above [6]. The interpretation is the same: we would expect a \(\chi^2_9 \ge 22.577\) to occur by chance 0.0072 of the time when the null hypothesis is true.
For resampling approaches, the estimated \(p\)-value will converge on the theoretical value with increasing `num_reps`. That said, this statement is not universally true: for a 4 bp long sequence there are only \(4^4=256\) possible synthetic sequences.
The definition of \(p\)-values has some very important corollaries.
- You can obtain a very small \(p\)-value even if the null hypothesis is true, just by chance.
- A small \(p\)-value does not mean the null hypothesis is disproven.
- A large \(p\)-value does not mean the null hypothesis is proven.