9.3. Distributions and the \(p\)-value#
In the Case studies, we executed the first 3 of the 4 essential steps for statistical data analysis. In each case study, the final output was a \(\chi^2\) test statistic, degrees-of-freedom and a \(p\)-value. In this section we are going to examine the following questions as a precursor to completing the 4th step.
1. What is a \(p\)-value and where does it come from?
2. What is a distribution?
3. Where do we get distributions from?
In order to answer (1) we really need to answer (2) and (3) first.
9.3.1. What is a distribution?#
In statistics, we use the term distribution to describe the relative frequency of occurrence of the values of a random variable. Say we were measuring plant height for a particular species. Plant height will be a "continuously distributed" random variable, meaning heights can assume an infinite number of values [1]. If we were to plot those heights along an \(x\)-axis, then as we obtained new data points we would eventually have to place points on top of each other. If we continued this sampling and plotting, a shape would emerge. This shape can be thought of as the distribution of the plant height random variable.
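As a minimal sketch of this idea, we can simulate plant heights and plot a histogram; the shape that emerges approximates the distribution. The assumption that heights are normally distributed, and the mean, standard deviation and sample size used, are all hypothetical choices for illustration only.

```python
from numpy.random import normal
import matplotlib.pyplot as plt

# hypothetical plant heights: normally distributed with
# mean 50 cm and standard deviation 8 cm (illustrative values)
heights = normal(loc=50, scale=8, size=1000)

plt.hist(heights, bins=30)  # the histogram shape approximates the distribution
plt.xlabel("Plant height (cm)")
plt.ylabel("Count")
plt.show()
```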
9.3.2. Where do we get distributions from?#
In the hypothetical example described above, the distribution of plant heights was obtained from the data itself. This is referred to as an empirical distribution.
(Figure: the distribution of \(\chi^2\) values when the null hypothesis is true. The values were generated using pseudo-random numbers drawn from a \(\chi^2\) distribution function with df=9.)
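Values like these can be drawn directly with `numpy`; a minimal sketch (the sample size is arbitrary):

```python
from numpy.random import chisquare

# draw pseudo-random values from a chi-squared distribution with df=9;
# the sample size of 10000 is arbitrary
null_chisq = chisquare(df=9, size=10000)
null_chisq[:5]
```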
For those of you who have done introductory courses in statistics, you have likely encountered so-called known theoretical distributions [2]. Examples include the Normal (or Gaussian) distribution, the Gamma distribution and the uniform distribution [3].
These may also be referred to as theoretical or analytical distributions because there are equations that describe them.
The uniform distribution is of particular use for understanding \(p\)-values.
If you choose a measurement that belongs to a known statistical distribution then you get a number of things for free, one being the ability to look up the \(p\)-value for a test statistic very efficiently.
9.3.3. Where do we get \(p\)-values from and how do we interpret them?#
A \(p\)-value is a fundamental measure in hypothesis testing that quantifies the consistency of the test statistic with the null hypothesis.
In both of our case studies we defined a reference condition which we sought to contrast our data with. For example, in case study 1 the reference condition is one of independence between adjacent nucleotides. I use the word reference because both the test statistic and its associated \(p\)-value are obtained in reference to the distribution of the test statistic when the null hypothesis is true. Stated another way, the reference condition corresponds to our null hypothesis [4].
To really demonstrate what the null, or reference, distribution corresponds to we will now generate one for the case study 1 problem.
In statistics, the words hypothesis and model can be synonyms.
9.3.3.1. Estimating a \(p\)-value computationally#
With the case study 1 sample data, we will evaluate whether nucleotides occur randomly along this sequence by writing an algorithm that generates the distribution of our test statistic.
Before we do anything, we first need to consider what the data should "look" like if our null hypothesis is correct. This will help us decide how to approach the problem algorithmically.
Our null posits that DNA sequences are just a random ordering of nucleotides. If this were true, we could make a DNA sequence by sequentially drawing nucleotides at random from a nucleotide pool until we reach the desired sequence length. This process will generate a DNA sequence whose dinucleotide frequencies are consistent with the following probability calculation.
Here, given the nucleotide frequencies \(p_i\) and \(p_j\), the expected frequency of the corresponding dinucleotide \(i,j\) under independence is

\[p_{i,j} = p_i \times p_j\]

We convert this expected frequency into an expected count (\(E_{i,j}\)) for a sequence of \(\ell\) dinucleotides as

\[E_{i,j} = \ell \times p_{i,j}\]
This is a crucial quantity for performing a \(\chi^2\) test for independence.
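To make the null generative process concrete, here is a minimal sketch of drawing nucleotides at random to build a synthetic sequence. The equal nucleotide frequencies and the sequence length are arbitrary assumptions for illustration.

```python
from numpy.random import choice

# build a random sequence by drawing nucleotides with equal
# probability; length 8 matches the toy example used below
random_seq = "".join(choice(list("ACGT"), size=8))
random_seq
```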
Algorithmic steps for computing a \(\chi^2\) statistic#
Before proceeding to generate the distribution, let us break the calculations down so that we can write our algorithm. We will use this simple DNA sequence, `"AACCCCGT"`, to illustrate the steps we need to take in order to compute a \(\chi^2\) statistic.
Split the sequence into dinucleotides: From our sample sequence, we need to produce the series of dinucleotides `["AA", "CC", "CC", "GT"]`.

```python
def seq_to_dinucs(seq):
    seq = "".join(seq)  # for the case when we get seq as a list
    dinucs = [seq[i: i + 2] for i in range(0, len(seq) - 1, 2)]
    return dinucs

dinucs = seq_to_dinucs("AACCCCGT")
dinucs
```

```
['AA', 'CC', 'CC', 'GT']
```
Define a nucleotide order: We need to make a square matrix of counts where rows and columns correspond to specific nucleotides. To enable this we define nucleotides to be in alphabetical order, e.g. `A` has index `0`.

```python
nucleotide_order = "ACGT"
```
Convert dinucleotides into pairs of indices: We use the nucleotide order to convert a dinucleotide string into two integers representing array coordinates, e.g. dinucleotide `"AA"` has indices `(0, 0)` while `"GT"` has indices `(2, 3)`. We will do this by writing a function that converts a single dinucleotide into coordinates. Applying this to the sample sequence we get

```python
def dinuc_to_indices(dinuc):
    return tuple(nucleotide_order.index(nuc) for nuc in dinuc)

coords = [dinuc_to_indices(dinuc) for dinuc in dinucs]
coords
```

```
[(0, 0), (1, 1), (1, 1), (2, 3)]
```
Use dinucleotide indices to make a dinucleotide counts matrix: We use a `numpy` array for the counts. Think of the row and column labels for this array as corresponding to the nucleotides present at the first and second position respectively of a dinucleotide. For our example, we get the following

```python
from numpy import zeros

def make_counts_matrix(coords):
    counts = zeros((4, 4), dtype=int)
    for i, j in coords:
        counts[i, j] += 1
    return counts

observed = make_counts_matrix(coords)
observed
```

```
array([[1, 0, 0, 0],
       [0, 2, 0, 0],
       [0, 0, 0, 1],
       [0, 0, 0, 0]])
```
Use counts of observed dinucleotides to compute their expected values: This can be achieved by first generating row and column sums, converting those to frequencies, plus a couple of other steps (detail is below).

```python
from numpy import outer

def get_expected(counts):
    total = counts.sum()  # number of dinucleotides
    row_sums = counts.sum(axis=1)
    col_sums = counts.sum(axis=0)
    # converting to frequencies
    row_probs = row_sums / total
    col_probs = col_sums / total
    # outer product gives us matrix of expected frequencies
    # multiplying by total gives matrix of expected counts
    expecteds = outer(row_probs, col_probs) * total
    return expecteds

expected = get_expected(observed)
expected
```

```
array([[0.25, 0.5 , 0.  , 0.25],
       [0.5 , 1.  , 0.  , 0.5 ],
       [0.25, 0.5 , 0.  , 0.25],
       [0.  , 0.  , 0.  , 0.  ]])
```
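As a quick sanity check of one entry: the first positions of our four dinucleotides give \(p_A=0.25\) and the second positions give \(p_C=0.5\), so

\[E_{A,C} = \ell \times p_A \times p_C = 4 \times 0.25 \times 0.5 = 0.5\]

which matches row \(A\), column \(C\) of the expected matrix above.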
Generate the \(\mathbf{\chi^2}\) statistic: This is defined as follows

\[\chi^2=\sum_i\sum_j\frac{(O_{i,j}-E_{i,j})^2}{E_{i,j}}\]

where \(O_{i,j}\) and \(E_{i,j}\) correspond to the observed and expected counts for dinucleotide \(i,j\). We express this as a Python function and apply it to our simple example. (The `numpy` array operations greatly simplify the calculation.)

```python
def calc_chisq(observed, expected):
    # observed and expected are both numpy arrays
    chisq = (observed - expected) ** 2 / expected
    return chisq.sum()

calc_chisq(observed, expected)
```

```
nan
```
Note

The `nan` that was output from `calc_chisq()` was generated because we were doing a division with 0 in the denominator. (See the output from `get_expected()` above.) So it's time to switch to using the full sequence now.
```python
from cogent3 import load_seq

seq = load_seq("data/case_study1.fasta", moltype="dna")
```
Let’s provide a simplified interface to all these function calls such that if we provide our sequence, all the above steps are invoked and we get back a \(\chi^2\) statistic [5].
Shuffling generates a permutation of the sequence, which is equivalent to sampling without replacement.
```python
def chiqsq_independent_nucs(seq):
    dinucs = seq_to_dinucs(seq)
    coords = [dinuc_to_indices(dinuc) for dinuc in dinucs]
    observed = make_counts_matrix(coords)
    expected = get_expected(observed)
    return calc_chisq(observed, expected)

chiqsq_independent_nucs(seq)
```

```
22.577013388423254
```
So that's nice: we are now able to compute the statistic of interest given a sequence. How do we generate the null?
Generating the null distribution computationally#
To produce a distribution of \(\chi^2\) test statistics we need a collection of sequences drawn from the null distribution!
We can generate synthetic sequences consistent with the null by randomly sampling from our actual data. This requires we have a means for producing randomised nucleotide orders from our observed data. Algorithms for generating pseudo-random numbers are important for scientific computing and, as you might expect, there are numerous choices. (Both the Python standard library and `numpy` come with built-in capabilities for generating such numbers using well regarded algorithms. We will use the one distributed with `numpy`.)
In our case, we will use a `shuffle()` function. Note that `shuffle()` works "in place", meaning it modifies the data you provide, so we need to convert our sequence into a list.
```python
from numpy.random import shuffle

tmp = list("AACCCCGT")
shuffle(tmp)
tmp
```

```
['G', 'C', 'A', 'C', 'A', 'T', 'C', 'C']
```
Will our functions still work if we give them a list?
```python
chiqsq_independent_nucs(list(seq))
```

```
22.577013388423254
```
Yup!
To recap, the `chiqsq_independent_nucs()` function takes a sequence and returns the \(\chi^2\) statistic for the independence of the nucleotides at the first and second positions of dinucleotides in that sequence. We want to generate the null distribution for this statistic so that we can assess how unusual the statistic from the case study 1 data is.
We first decide how big a distribution, i.e. how many synthetic sequences, we will generate. Each of these synthetic sequences is generated in accordance with the null by shuffling the original sequence.
The quantities (and corresponding variables) required to estimate the \(p\)-value are

- \(\chi^2_o\) (`obs_chisq`): The statistic from the original (observed) sequence.
- \(\chi^2_s\) (`syn_chisq`): The statistic computed from a synthetic sequence.
- \(\mathcal{N}\) (`num_reps`): The number of synthetic sequences to evaluate.
- \(g\) (`num_gt`): The number of synthetic sequences for which \(\chi^2_s \ge \chi^2_o\).

From these quantities we estimate the probability of a \(\chi^2_o\) of equal or greater magnitude occurring by chance under the null model as \(p\)-value\(=\frac{g}{\mathcal{N}}\).
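For example, if \(g=35\) of \(\mathcal{N}=4000\) synthetic statistics equal or exceed \(\chi^2_o\), the estimated \(p\)-value is \(35/4000=0.00875\), which corresponds to the result we obtain below.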
So here’s the final function.
```python
def estimate_chisq_pval(seq, num_reps):
    obs_chisq = chiqsq_independent_nucs(seq)
    seq = list(seq)
    num_gt = 0
    for _ in range(num_reps):
        shuffle(seq)
        syn_chisq = chiqsq_independent_nucs(seq)
        if syn_chisq >= obs_chisq:
            num_gt += 1
    return num_gt / num_reps

estimate_chisq_pval(seq, 4000)
```

```
0.00875
```
This type of approach to statistical estimation, which relies on resampling from observed data, is also known as bootstrapping.
Note

There is a relationship between \(p\)-values and quantiles. The value returned by `estimate_chisq_pval()` is 1 minus the quantile of 22.577 in the simulated \(\chi^2_s\) distribution.
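A minimal sketch of that relationship, using pseudo-random draws from a \(\chi^2\) distribution with df=9 as a stand-in for the shuffled-sequence statistics (the sample size is arbitrary):

```python
from numpy.random import chisquare

# stand-in for the simulated null distribution of the statistic
syn_chisq = chisquare(df=9, size=10000)

obs_chisq = 22.577
quantile = (syn_chisq < obs_chisq).mean()  # empirical CDF at the observed value
pval = (syn_chisq >= obs_chisq).mean()     # equals 1 - quantile
quantile + pval  # 1.0
```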
9.3.3.2. Estimating a \(p\)-value using a theoretical distribution#
The above is my attempt at making concrete the origins of \(p\)-values, an essential component of hypothesis testing. I have developed this discourse with a focus on the first of our case studies.
For case study 1, we can also take the more conventional approach of assuming the theoretical \(\chi^2\) distribution is appropriate and just use that to "look up" the \(p\)-value. Doing so requires we know the degrees-of-freedom (\(df\)) for our test. In most statistical analysis tools, the df is automatically computed. Certainly for this case, it's not a hard calculation! For the \(\chi^2\) test applied to a contingency table with \(m\) rows and \(n\) columns, \(df=(m-1) \times (n-1)\), which is 9 in this case. The result is what we saw earlier.
| chisq  | df | pvalue |
|--------|----|--------|
| 22.577 | 9  | 0.0072 |
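Such a lookup can be done via the survival function of the theoretical distribution; a minimal sketch using `scipy` (assuming it is installed):

```python
from scipy.stats import chi2

# probability of a chi-squared value >= 22.577 with df=9
chi2.sf(22.577, df=9)  # ~0.0072
```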
As you can see, this \(p\)-value is close to that estimated above [6]. The interpretation is the same: we would expect a \(\chi^2_9 \ge 22.577\) to occur by chance 0.0072 of the time when the null hypothesis is true.
For resampling approaches, the estimated \(p\)-value will converge on the theoretical value with increasing `num_reps`. That said, this statement is not universally true: for a 4 bp long sequence there are only \(4^4=256\) possible synthetic sequences.
The definition of \(p\)-values has some very important corollaries.
- You can obtain a very small \(p\)-value even if the null hypothesis is true, just by chance.
- A small \(p\)-value does not mean the null hypothesis is disproven.
- A large \(p\)-value does not mean the null hypothesis is proven.