8.13. General purpose statistical tests using cogent3#

The following are a subset of the statistical tests available within cogent3, organised by analysis problem.

8.13.1. Analyses of categorical data#

This involves contrasting counts data against the expectation defined either by another data set or by a prior null hypothesis. The analysis of categorical data is provided by the CategoryCounts object.

Note

The methods on CategoryCounts also support estimating the \(p\)-value using a resampling statistic (sketched after the goodness-of-fit example below).

8.13.1.1. Goodness-of-fit tests#

This type of test compares counts against expected values specified via a prior hypothesis. Here I’ll analyse nucleotide counts data, testing the null hypothesis that all bases occur with equal frequency.

from cogent3.maths.stats.contingency import CategoryCounts

counts = CategoryCounts({"A": 887, "C": 547, "G": 623, "T": 535})
counts
             A        C        G        T
Observed    887.00   547.00   623.00   535.00
Expected    648.00   648.00   648.00   648.00
Residuals     9.39    -3.97    -0.98    -4.44

You can see that the object displays the observed counts (what we provided), the expected values (all categories equally frequent), and the standardised residuals [1]. Residuals help you interpret where the observed data depart most strongly from the expected values. In this case, the largest absolute value is for the base A. That the residual is positive indicates the observed data have an excess of this base compared to expected (a negative residual indicates a deficit).
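If you want those values programmatically, they appear to be exposed as attributes on the object. The attribute names below are my assumption from the displayed table; verify them against the cogent3 API reference.

# assumed attribute names; each behaves like a mapping keyed by category
counts.expected["A"]   # 648.0, expected count for base A
counts.residuals["A"]  # ~9.39, the largest departure from expected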

CategoryCounts provides several different methods for the analysis of categorical data. In this case, we could use either G_fit() [2] or chisq_test() (the latter is demonstrated after this example).

result = counts.G_fit()
result
G-test goodness-of-fit (with Williams correction)
G          df    pvalue
117.487     3    2.68e-25

             A        C        G        T
Observed    887.00   547.00   623.00   535.00
Expected    648.00   648.00   648.00   648.00
Residuals     9.39    -3.97    -0.98    -4.44

The statistic, degrees of freedom and associated \(p\)-value are all accessible as attributes on the result object.

result.pvalue
2.6826158015006287e-25
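As promised above, chisq_test() is invoked the same way and, as far as I can tell, returns an analogous result object. The second call sketches the resampling approach from the earlier note; the shuffled keyword is an assumption on my part, so check the cogent3 API reference for the actual argument name.

# same counts object; the result should likewise expose a pvalue attribute
chisq_result = counts.chisq_test()

# resampling-based p-value estimate; `shuffled` (assumed keyword) is the
# number of random resamplings used instead of the asymptotic distribution
resampled = counts.chisq_test(shuffled=1000)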

8.13.1.2. Tests of independence#

In this type of test, our null hypothesis is that our counts occur independently of specific combinations of categories. Let’s assess whether the DNA sequence from which those counts were derived was strand symmetric. The null hypothesis (\(H_o\)) is that the base counts satisfy A=T and G=C. The alternate hypothesis (\(H_a\)) is that they don’t, i.e. A≠T, G≠C.

In this case, we need a 2x2 table with strand on one axis and base on the other. I pick the purines as my bases on the plus strand; these will be the top-level keys in my 2D dict. The value for each base is a dict with the counts of that base on the plus and minus strands. To reiterate, the keys in the top-level dict become the row labels. The nested dicts must all have the same keys, and those keys become the column labels.

data = {"A": {"+": 887, "-": 535}, "G": {"+": 623, "-": 547}}

This can now be used to construct a CategoryCounts object.

ssymm = CategoryCounts(data)
ssymm
Observed
      +         -
A     887       535
G     623       547

Expected
      +          -
A     828.4028   593.5972
G     681.5972   488.4028

Residuals
      +         -
A      2.0359   -2.4051
G     -2.2445    2.6515

We use the G_independence() test to assess the null hypothesis that the cell counts arise simply as the product of the probabilities of their corresponding categories.

result = ssymm.G_independence()
result
G-test for independence (with Williams correction)
G         df    pvalue
21.975     1    2.76e-06

Observed
      +         -
A     887       535
G     623       547

Expected
      +          -
A     828.4028   593.5972
G     681.5972   488.4028

Residuals
      +         -
A      2.0359   -2.4051
G     -2.2445    2.6515

We reject \(H_o\) in this case and conclude the sequence is strand-asymmetric with respect to nucleotides.

8.13.2. Analyses of correlations#

In cases where we have bivariate data we may be interested in whether the two values are correlated. Of course, it would be remiss of me not to remind you of the limits to drawing inferences from correlations.

We will evaluate these data.

x = (44.4, 45.9, 41.9, 53.3, 44.4, 44.1, 50.7, 45.2, 60.1)
y = (2.6, 3.1, 2.5, 5.0, 3.6, 4.0, 5.2, 2.8, 3.8)

Compute the Pearson product-moment correlation coefficient and its \(p\)-value (taken from the \(t\)-distribution with \(n-2\) degrees of freedom).

from cogent3.maths.stats.test import pearson_correlation

rho, pval = pearson_correlation(x, y)
rho, pval
(0.5692780764158853, 0.10962375448181438)

If the data of interest are not normally distributed, one approach to assessing the existence of an association is to use a non-parametric test. In this case we use Kendall’s \(\tau\) (the coefficient of rank correlation), which transforms the data into ranks and compares those.

from cogent3.maths.stats.test import kendall_correlation

tau, pval = kendall_correlation(x, y, alt="two sided")
tau, pval
(0.4225771273642583, 0.11585149752593009)

8.13.3. Analyses of distributions#

8.13.3.1. Paired distributions#

Say you have paired data – observations that are coupled in some way, such as from the same individual at different time-points. Specific statistical procedures for this case include the paired \(t\)-test and the sign test. The former is a parametric statistical procedure, the latter non-parametric.

The paired \(t\)-test is used to test the null hypothesis that the mean (\(\mu_d\)) of the differences (\(d\)) between two samples equals 0. There can be different alternate hypotheses (which you pre-specify), e.g. \(\mu_d>0\) (a one-tailed test). The test has numerous assumptions, including that \(d\) is normally distributed.

from cogent3.maths.stats.test import t_paired

x = [7.33, 7.49, 7.27, 7.93, 7.56, 7.81, 7.46, 6.94,
     7.49, 7.44, 7.95, 7.47, 7.04, 7.1, 7.64]
y = [7.53, 7.70, 7.46, 8.21, 7.81, 8.01, 7.72, 7.13,
     7.68, 7.66, 8.11, 7.66, 7.20, 7.25, 7.79]

t, pval = t_paired(x, y)
t, pval
(-19.72026594366534, 1.3014695016944488e-11)
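To pre-specify a one-tailed alternative, e.g. \(\mu_d<0\) (the direction suggested by these data, since every x value is below its paired y), the module's tests conventionally take a tails argument. The keyword and its "low"/"high" values below are my assumption from the older PyCogent signature, so verify against the current API.

# one-tailed test of the alternate hypothesis that mean(x - y) < 0;
# tails="low" is an assumed argument, not confirmed against cogent3
t, pval = t_paired(x, y, tails="low")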

The sign test is basically a binomial test where the probability of success is 0.5. Here we have the same expectation: no difference between our two groups. We turn our paired observations into successes / failures (x > y) and sum the number of successes. We need this integer, and the number of “trials” (i.e. how many paired records there are). (Note the use of numpy to simplify the element-wise comparison of x and y and to sum the number of True values.)

import numpy

from cogent3.maths.stats.test import sign_test

x = numpy.array(x)
y = numpy.array(y)

num_x_gt_y = (x > y).sum()
pval = sign_test(num_x_gt_y, len(x))
pval
6.103515625e-05
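Because the sign test is just a binomial test with success probability 0.5, we can reproduce this \(p\)-value from first principles. In these data every y exceeds its paired x, so there are 0 successes out of 15, and the two-tailed probability is twice the lower tail of a Binomial(15, 0.5) distribution.

import math

n = len(x)           # 15 paired observations
k = int(num_x_gt_y)  # 0 successes (x > y) in these data

# two-tailed p-value: twice the probability of observing k or fewer
# successes under Binomial(n, 0.5)
lower_tail = sum(math.comb(n, i) for i in range(k + 1)) * 0.5**n
print(2 * lower_tail)  # 6.103515625e-05, matching sign_test above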

8.13.3.2. Distribution properties#

We can compare continuously distributed variables using standard statistical procedures, such as the two sample \(t\)-test. We can also employ non-parametric alternatives to these.

Two sample t-test#

x = [134, 146, 104, 119, 124, 161, 107, 83, 113, 129, 97, 123]
y = [70, 118, 101, 85, 107, 132, 94]
from cogent3.maths.stats.test import t_two_sample

t_two_sample(x, y)
(1.89143639744233, 0.07573012895667763)

Mann-Whitney U-test#

Like the t-test, the Mann-Whitney (MW) test compares distributions by comparing their locations. However, this is a non-parametric test. It converts the original observations into ranks and compares the means of those ranks.

from cogent3.maths.stats.test import mw_test

mw_test(x, y)
(62.5, 0.08303763192789855)

Note

There is also a bootstrap version of the MW test available as cogent3.maths.stats.test.mw_boot.
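As a sketch of how the bootstrap version might be called: the num_reps keyword below is my assumption based on the PyCogent-era signature, so verify it before use. The ks_boot function noted later should follow the same pattern.

from cogent3.maths.stats.test import mw_boot

# num_reps (assumed keyword) sets the number of bootstrap resamplings
u_stat, pval = mw_boot(x, y, num_reps=1000)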

Kolmogorov-Smirnov test#

The Kolmogorov-Smirnov (or KS) test is also a non-parametric statistical procedure but, unlike the Mann-Whitney test, it compares the cumulative distributions (both location and shape).

from cogent3.maths.stats.test import ks_test

k_stat, pval = ks_test(x, y)

Note

There is also a bootstrap version of the KS test available as cogent3.maths.stats.test.ks_boot (see the mw_boot sketch above for the likely calling pattern).

8.13.3.3. Using the Kolmogorov-Smirnov test to assess the distribution#

We evaluate whether the data from x are distributed normally. To do this, we obtain the theoretical quantiles from the normal distribution. (These are from the standard normal distribution, with mean of 0 and standard deviation of 1.)

import numpy

from cogent3.maths.stats.distribution import theoretical_quantiles

norm_quants = theoretical_quantiles(len(x), "normal")
norm_quants
array([-1.7316644 , -1.15034938, -0.8122178 , -0.54852228, -0.31863936,
       -0.10463346,  0.10463346,  0.31863936,  0.54852228,  0.8122178 ,
        1.15034938,  1.7316644 ])

So we need to transform x into standard z-scores before we perform the KS test.

n_x = numpy.array(x, dtype=float)
mean = n_x.mean()
# ddof=1 applies Bessel's correction, giving an unbiased variance estimate
std = n_x.std(ddof=1)

zscores = (n_x - mean) / std

ks_stat, pval = ks_test(zscores, norm_quants)

print(f"D={ks_stat:.4f}  p-value={pval:.4f}")
D=0.0833  p-value=1.0000

8.13.4. Quantile-quantile plots#

A quantile-quantile (QQ) plot is a graphical way of assessing whether two data sets come from the same statistical distribution. In this case, we compare x with the theoretical quantiles from the normal distribution. If the data do come from the same distribution, their points will fall along the diagonal [3]. In this case, the data are very close to that, consistent with the KS test result.

To make the plot, the sample data must be sorted. I also add a diagonal line between the minimum and maximum points. If the data truly lie on a diagonal, the points will scatter very close to this line.

import plotly.express as px

n_x.sort()

fig = px.scatter(
    x=norm_quants,
    y=n_x,
    width=400,
    height=400,
    labels={
        "x": "Theoretical Quantiles",
        "y": "Sample Quantiles",
    },
)

fig.add_scatter(x=[norm_quants.min(), norm_quants.max()],
                y=[n_x.min(), n_x.max()],
                mode="lines",
                showlegend=False)
fig.show()