Beyond hypothesis testing – exploratory data analysis for genomics

9.7. Beyond hypothesis testing – exploratory data analysis for genomics

For many problems in genomics, analysing the distribution of measurements derived from genomic data is an effective strategy. The measurements in question can even be the \(p\)-values from tests applied to genomic data. Alternatively, we can consider the distribution of quantiles, a measure analogous to a \(p\)-value that is obtained by a simple transformation of a raw statistic.

9.7.1. The uniform distribution

We now describe the uniform distribution, which we will use to introduce a measure related to the \(p\)-value as a cumulative measure of probability density.

Consider a random variable that can take any value in [0, 1] (including the boundaries, see uniform distribution histogram). We call such a random variable uniformly distributed if all possible values are equally likely. A value near 0.2 is as likely to be observed as one near 0.8, 0.9, or 0.0. (Strictly, for a continuous variable it is the probability density that is constant, so any two intervals of equal width within [0, 1] have the same probability.)
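The "equally likely" property can be checked empirically. The following is a minimal sketch, not the code used to produce the data in this section: the variable name `x_demo`, the seed, and the choice of 10 bins are all assumptions made for illustration. Note that NumPy's generator samples from the half-open interval [0, 1), which makes no practical difference here.

```python
import numpy as np

# Seed chosen arbitrarily so the sketch is reproducible (an assumption).
rng = np.random.default_rng(seed=42)

# Hypothetical sample; NumPy draws from [0, 1), excluding the upper boundary.
x_demo = rng.uniform(0.0, 1.0, size=50_000)

# Split [0, 1] into 10 equal-width bins and count the values in each.
counts, edges = np.histogram(x_demo, bins=10, range=(0.0, 1.0))

# For a uniform distribution each bin should hold roughly 1/10 of the values.
fractions = counts / x_demo.size
print(fractions)
```

Each printed fraction should be close to 0.1, with small deviations due to sampling.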

9.7.2. Quantiles as distribution descriptors

Quantiles are rank-order statistics: they are locations in a sorted collection of values. One example of a quantile you are likely familiar with is the median, which cuts a distribution such that 1/2 of all values are less than it. Following this example, the 0.05 quantile is the point that is greater than 1/20th of all values. We can think of a value's quantile as its relative rank within a data set, which can be computed as \(\frac{r}{n}\), where \(r\) is the value's rank among the \(n\) values.
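To make the \(\frac{r}{n}\) definition concrete, here is a small sketch that computes relative ranks by hand. The ten numbers in `values` are made up for illustration, and ranks are taken to start at 1 for the smallest value (an assumption about the convention; ties are not handled).

```python
import numpy as np

# Ten made-up, distinct measurements (illustrative only).
values = np.array([12.0, 3.0, 7.0, 20.0, 15.0, 1.0, 9.0, 5.0, 18.0, 11.0])
n = values.size

# Applying argsort twice gives each value's 0-based position in the sorted
# order (valid when all values are distinct); add 1 so the smallest is rank 1.
ranks = values.argsort().argsort() + 1

# Relative rank r / n: the fraction of values at or below each value.
relative_rank = ranks / n

print(relative_rank[0])  # 12.0 is the 7th smallest of 10 values
print(np.quantile(values, 0.5))  # the median, i.e. the 0.5 quantile
```

The value 12.0 has rank 7 among 10 values, so its relative rank is 0.7. `numpy.quantile` goes in the other direction: given a relative rank such as 0.5, it returns the corresponding location in the data, interpolating between neighbouring values by default.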

Let’s play with the quantiles from the uniform distribution that I generated. We use the numpy.quantile function for this purpose. Since we’re using a uniform distribution, and following from the definition of this distribution, we can expect that 5% of all uniform random values will be \(\le 0.05\). Does our data support this?

from numpy import quantile

quantile(x_uniform, 0.05)
0.04955803304583062

Conversely, we expect that 5% of all uniform random values will be \(\ge 0.95\).

1 - quantile(x_uniform, 0.95)
0.04942746566771039

We generated these data using a sample size of 50,000. As the sample size increases, the estimates of the quantiles from the uniform distribution converge on their expected values. We can make this statement more general: as the sample size increases, a value's quantile becomes an increasingly good approximation of its \(p\)-value.
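The effect of sample size can be explored directly. This is a sketch under stated assumptions: the seed and the three sample sizes are arbitrary choices, and the samples are freshly generated rather than being the data used above.

```python
import numpy as np

# Arbitrary seed for reproducibility (an assumption).
rng = np.random.default_rng(seed=1)

expected = 0.05
sample_sizes = (100, 10_000, 1_000_000)
errors = []

for n in sample_sizes:
    sample = rng.uniform(size=n)
    # Estimate the 0.05 quantile from this sample.
    estimate = np.quantile(sample, expected)
    errors.append(abs(estimate - expected))
    print(n, estimate, errors[-1])
```

The absolute error tends to shrink as the sample size grows, although any single small sample can land close to the expected value by chance.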

Quantiles have advantages over \(p\)-values in exploratory data analysis, not least that they are derived from the actual data rather than from an idealised (theoretical) description. Numerous exploratory data analysis techniques are based upon this quantity (for example, quantile-quantile plots for comparing the distributions of two data sets).
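The idea behind a quantile-quantile comparison can be sketched numerically, without plotting. In this illustrative example the two data sets `a` and `b`, their sizes, the seed, and the grid of 99 probabilities are all assumptions. Both are drawn from the same uniform distribution, so their quantiles should nearly coincide.

```python
import numpy as np

# Arbitrary seed for reproducibility (an assumption).
rng = np.random.default_rng(seed=7)

# Two hypothetical data sets of different sizes from the same distribution.
a = rng.uniform(size=5_000)
b = rng.uniform(size=8_000)

# Evaluate both data sets at the same set of quantiles (0.01, 0.02, ..., 0.99).
probs = np.linspace(0.01, 0.99, 99)
qa = np.quantile(a, probs)
qb = np.quantile(b, probs)

# In a quantile-quantile plot, (qa, qb) pairs would be the plotted points.
# If the distributions match, the points lie close to the line y = x.
max_gap = np.max(np.abs(qa - qb))
print(max_gap)
```

A small `max_gap` indicates the two data sets have similar distributions. Plotting `qa` against `qb` gives the usual visual form, where systematic departures from the diagonal point to differences between the distributions.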