6.9. Finding motifs#
So how do we find occurrences of a motif that is “noisy”? If we use the consensus approach, it’s trivial – exact string match. For the IUPAC ambiguity case, it’s also quite trivial – we could use a regular expression [1], e.g.
import re
re.findall("T[CT][ACT][GCT][AC]", "AGCGTTTCTCAGATGCA")
['TTCTC']
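If we also want the location of each match, not just the matched text, re.finditer() returns match objects that carry the start and end coordinates.
import re

for match in re.finditer("T[CT][ACT][GCT][AC]", "AGCGTTTCTCAGATGCA"):
    print(match.group(), match.start(), match.end())
TTCTC 5 10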
Alternatively, we could use a “probabilistic model” [2]. That’s what we’re going to do now.
A probabilistic model is one where a specific outcome is quantified via explicit probability calculation.
6.9.1. Simple probability model for generating sequences#
Assume the probabilities of nucleotides are \(p_A=0.1\), \(p_C=0.2\), \(p_G=0.3\) and \(p_T=0.4\). We can calculate the probability of observing the following sequence as
seq = "AGGTC"
probs = dict(A=0.1, C=0.2, G=0.3, T=0.4)
p_seq = 1.0
for base in seq:
    p_seq *= probs[base]
p_seq
0.00072
For this probability to be an accurate reflection of the true expected frequency of seq, the nucleotides must occur independently of each other. Only under that assumption can we simply take the product of the individual nucleotide probabilities. (The alternative is a Markov process with order ≥ 1.)
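For contrast, here is a minimal sketch of that first-order Markov alternative, in which each base's probability depends on the preceding base. The transition probabilities below are invented purely for illustration.
# invented first-order Markov model: P(base | preceding base); each row sums to 1
transition = {
    "A": dict(A=0.1, C=0.2, G=0.4, T=0.3),
    "C": dict(A=0.3, C=0.1, G=0.2, T=0.4),
    "G": dict(A=0.2, C=0.3, G=0.3, T=0.2),
    "T": dict(A=0.4, C=0.4, G=0.1, T=0.1),
}

seq = "AGGTC"
p_seq = probs[seq[0]]  # probability of the first base, from the earlier probs
for prev, base in zip(seq, seq[1:]):
    p_seq *= transition[prev][base]
p_seq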
When we consider modelling a sequence motif, however, we use position specific probabilities, i.e. there would be a distinct probs
for each position of a motif. This is the foundation of PSSMs.
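To sketch that idea (with invented numbers), a motif of length 3 could be modelled with one probs-style dictionary per position.
# one (invented) probability distribution per motif position
position_probs = [
    dict(A=0.7, C=0.1, G=0.1, T=0.1),  # position 1
    dict(A=0.1, C=0.1, G=0.1, T=0.7),  # position 2
    dict(A=0.1, C=0.7, G=0.1, T=0.1),  # position 3
]

p_seq = 1.0
for base, dist in zip("ATC", position_probs):
    p_seq *= dist[base]
p_seq  # 0.7 * 0.7 * 0.7 ≈ 0.34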
6.9.2. PSSMs: Position Specific Scoring Matrices#
Among the simplest probabilistic approaches are PSSMs. A PSSM, also referred to as a profile, is a matrix of log-odds ratios, one per base per position of a sequence motif. It provides a means of computing the match odds for any new sequence. PSSMs are typically applied to finding transcription factor binding sites (TFBS), but are also used to characterise protein domains.
Great, where can I get one? Well … recall JASPAR (the open access database of TFBS profiles) [3].
Sandelin et al. (2004). JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Research, 32(Database issue), D91–D94.
Below is the TATA box position weight matrix from JASPAR.
A [ 61 16 352 3 354 268 360 222 155 56 83 82 82 68 77 ]
C [145 46 0 10 0 0 3 2 44 135 147 127 118 107 101 ]
G [152 18 2 2 5 0 20 44 157 150 128 128 128 139 140 ]
T [ 31 309 35 374 30 121 6 121 33 48 31 52 61 75 71 ]
In this case, the PSSM has length 15. This motif is on display at JASPAR (but note their web server can be quite slow).
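As a minimal sketch of working with this matrix in Python, we can enter it as a numpy array (rows ordered A, C, G, T as above). Every column sums to the same total, the number of aligned sites used to build the profile.
from numpy import array

# JASPAR TATA-box counts; rows are A, C, G, T, columns are motif positions
tata_counts = array(
    [
        [61, 16, 352, 3, 354, 268, 360, 222, 155, 56, 83, 82, 82, 68, 77],
        [145, 46, 0, 10, 0, 0, 3, 2, 44, 135, 147, 127, 118, 107, 101],
        [152, 18, 2, 2, 5, 0, 20, 44, 157, 150, 128, 128, 128, 139, 140],
        [31, 309, 35, 374, 30, 121, 6, 121, 33, 48, 31, 52, 61, 75, 71],
    ]
)
tata_counts.sum(axis=0)  # every column sums to 389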
6.9.3. Formal definition of the probability model#
We now lay out the equations for computing the probability of a sequence. To start, we make some simplifying assumptions: that a counts matrix reflects the true underlying probabilities of bases per position of a motif, and that each base in a motif occurs independently of the other bases. We can compute the probability under the PPM (position probability matrix) model \(M\) of a sequence \(x\) with length \(L\) as

\[
P(x \mid M) = \prod_{i=1}^{L} p_i(x_i)
\]

where \(x_i\) is the base at position \(i\) and \(p_i(x_i)\) is the PPM probability of that base at position \(i\). This probability can also be referred to as the odds of observing the sequence given the model \(M\).
Note
the \(\prod\) symbol means “take the product of the series”. What \(\sum\) is to addition, \(\prod\) is to multiplication.
6.9.3.1. A worked example#
Consider the following hypothetical matrix
A [ 61 16 389 3 ]
C [145 46 0 10 ]
G [152 18 0 2 ]
T [ 31 309 0 374 ]
which produces the following position specific probability matrix
A [ 0.16 0.04 1.00 0.01 ]
C [ 0.37 0.12 0.00 0.03 ]
G [ 0.39 0.05 0.00 0.01 ]
T [ 0.08 0.79 0.00 0.96 ]
The conditional probability of each of the following sequences under this PPM is:
1. GTAT: \(0.39 \times 0.79 \times 1.0 \times 0.96 \approx 0.3\)
2. GTCT: \(0.39 \times 0.79 \times 0.0 \times 0.96 = 0.0\)
The latter case could be real (perhaps binding only happens with "A" at the 3rd position), but it is most likely an artefact of a finite sized training data set. In other words, if the training set had been much larger, we may have observed the other bases at that position.
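We can check these numbers with a small sketch that stores the (rounded) position specific probability matrix above as one dictionary per position.
# the position specific probability matrix from above, one dict per position
ppm_by_position = [
    dict(A=0.16, C=0.37, G=0.39, T=0.08),
    dict(A=0.04, C=0.12, G=0.05, T=0.79),
    dict(A=1.00, C=0.00, G=0.00, T=0.00),
    dict(A=0.01, C=0.03, G=0.01, T=0.96),
]


def seq_probability(seq, ppm):
    """product of the position specific probabilities for seq"""
    p = 1.0
    for base, dist in zip(seq, ppm):
        p *= dist[base]
    return p


seq_probability("GTAT", ppm_by_position), seq_probability("GTCT", ppm_by_position)
The first value is approximately 0.3 and the second is exactly 0.0, matching the hand calculations above.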
6.9.3.2. Pseudo-counts – handling missing data#
Handling small sample sizes is a substantial problem [4]. The easiest way to tackle it, which we will use here, is to employ a pseudo-count. A pseudo-count is a “synthetic observation” that is added to all the elements in the counts matrix. It eliminates 0 counts and thus precludes cases such as (2) above, where a sequence is otherwise considered impossible. I’ll illustrate that using the above example.
When sample sizes are large, the effect of adding a pseudo-count is small.
from numpy import array

# rows correspond to A, C, G, T; columns to positions in the motif
counts = array(
    [[61, 16, 389, 3], [145, 46, 0, 10], [152, 18, 0, 2], [31, 309, 0, 374]]
)
Then we add our pseudo-count.
counts += 1
counts
array([[ 62, 17, 390, 4],
[146, 47, 1, 11],
[153, 19, 1, 3],
[ 32, 310, 1, 375]])
We determine our new column sums
col_sums = counts.sum(axis=0)
col_sums
array([393, 393, 393, 393])
and produce a revised position specific probability matrix
ppm = counts / col_sums
ppm
array([[0.15776081, 0.043257 , 0.99236641, 0.01017812],
[0.37150127, 0.11959288, 0.00254453, 0.02798982],
[0.38931298, 0.04834606, 0.00254453, 0.00763359],
[0.08142494, 0.78880407, 0.00254453, 0.95419847]])
In order to assess whether a sequence might be consistent with a PSSM, we need a way of scoring it. This is done by converting the position probability matrix (PPM) into log-odds ratios. In short, we compare the odds of observing a base under the PPM with its odds under a background distribution.
For convenience, define the background distribution \(B\) of sequence states to be equally frequent, i.e. \(p_b = 0.25\) for every base \(b\). Then the odds ratio is

\[
\frac{P(x \mid M)}{P(x \mid B)} = \prod_{i=1}^{L} \frac{p_i(x_i)}{0.25}
\]

which is expressed as a log-odds score \(S\):

\[
S = \log_2 \frac{P(x \mid M)}{P(x \mid B)} = \sum_{i=1}^{L} \log_2 \frac{p_i(x_i)}{0.25}
\]
We then interpret the values of \(S\) as
if \(S=0\), the sequence is equally likely under the PSSM and the background
if \(S<0\), the sequence is less likely under the PSSM than background
if \(S>0\), the sequence is more likely under the PSSM than background
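Putting this together, here is a minimal sketch that converts the pseudo-count adjusted ppm from above into a PSSM and scores the two example sequences. It assumes a uniform background of 0.25 per base and logarithms to base 2.
from numpy import log2

background = 0.25  # uniform background probability for each base
pssm = log2(ppm / background)  # log-odds score per base, per position

row_index = dict(A=0, C=1, G=2, T=3)  # map a base to its row in the matrix


def log_odds_score(seq, pssm):
    """sum of the per-position log-odds scores for seq"""
    return sum(pssm[row_index[base], i] for i, base in enumerate(seq))


log_odds_score("GTAT", pssm), log_odds_score("GTCT", pssm)
GTAT scores well above zero, i.e. it is more likely under the PSSM than under the background, while GTCT now receives a finite negative score rather than being treated as impossible.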
6.9.3.3. PSSM limitations#
if the training data is limited, we need to handle zero counts which may introduce bias
we assume bases in a sequence are independent of each other