6.12. Position Specific Scoring Matrices (PSSMs)

A Position Specific Scoring Matrix, or PSSM, is a matrix of log-odds ratios, one per base at each position of a sequence motif. (PSSMs are also called profiles.) They provide a means of computing, for any new sequence, the odds that it matches the motif rather than the background. They are typically applied to finding transcription factor binding sites (TFBS) but are also used to characterise protein domains.

See Experimental procedures for detecting functionally related sequences for how the counts data are derived. To illustrate the transformation of those counts data into a PSSM, we will start with a simple worked example. First we take the counts table presented in Transformation of the data for analysis.

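We convert the PWM (the table of counts) into a position specific probability matrix (PPM) by dividing each count by the total for its position. I’m restricting the examples to just 4 positions.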

PPM
Position specific probability matrix
Base \ Position	0	1	2	3
T	0.2	1.0	0.1	0.6
C	0.0	0.0	0.0	0.0
A	0.8	0.0	0.9	0.4
G	0.0	0.0	0.0	0.0
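
The conversion from counts to a PPM can be sketched with numpy as follows. The counts used here are an assumption: they are illustrative values for 10 aligned sequences, chosen because they reproduce the PPM above, and they stand in for the original counts table, which is not reproduced in this section.

```python
import numpy as np

BASES = "TCAG"  # row order used by the tables in this section

# ASSUMPTION: illustrative counts for 10 aligned sequences, chosen so that
# they reproduce the PPM shown above. Rows follow BASES, columns are positions.
counts = np.array(
    [
        [2, 10, 1, 6],  # T
        [0, 0, 0, 0],   # C
        [8, 0, 9, 4],   # A
        [0, 0, 0, 0],   # G
    ],
    dtype=float,
)

# Divide every column by its total so that each position sums to 1.
ppm = counts / counts.sum(axis=0)
print(ppm)
```

Dividing by the column totals relies on numpy broadcasting the 4 totals across the 4 rows.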

6.12.1. A worked example

6.12.1.1. Calculating the Expected Under the Background

Let’s use the sequence

seq1 = "GTAT"

We define the background distribution as one where all four bases are equally frequent, so each base has probability 0.25. The probability of the sequence under the background, \(p(seq1|background)\), is then

seq_len = len(seq1)
p_seq1_background = 0.25 ** seq_len
print(f"{p_seq1_background:.6f}")

0.003906

6.12.1.2. Calculating the Expected Under the Alternate

Here I’m doing the calculation “manually”. First, note that the row order of the bases is T, C, A, G [1]. The following pseudo-code describes the calculation:

PPM is a 2D matrix with rows corresponding to bases, columns to positions
define bases, the index order of the bases, as T at index 0, C at index 1, A at index 2, G at index 3
prob_of_seq = 1.0
for each seq_index from 0 to (length of sequence - 1)
    set base as the character of sequence at seq_index
    set base_index as the index of base in bases
    probability_of_base_at_position = PPM[base_index, seq_index]
    prob_of_seq = prob_of_seq * probability_of_base_at_position
    if prob_of_seq is 0, exit the loop
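
One way to turn the pseudo-code into Python is sketched below. The function name `prob_of_seq` and the second test sequence "ATAT" are my own choices for illustration; the PPM values are those from the table above.

```python
BASES = "TCAG"  # T at index 0, C at index 1, A at index 2, G at index 3

# PPM from the table above: rows follow BASES, columns are positions.
PPM = [
    [0.2, 1.0, 0.1, 0.6],  # T
    [0.0, 0.0, 0.0, 0.0],  # C
    [0.8, 0.0, 0.9, 0.4],  # A
    [0.0, 0.0, 0.0, 0.0],  # G
]


def prob_of_seq(seq, ppm, bases=BASES):
    """Return the probability of seq as a product of per-position PPM entries."""
    prob = 1.0
    for seq_index, base in enumerate(seq):
        base_index = bases.index(base)
        prob *= ppm[base_index][seq_index]
        if prob == 0:
            # Once the product is 0 it stays 0, so stop early.
            break
    return prob


print(prob_of_seq("GTAT", PPM))  # G at position 0 has probability 0.0
print(prob_of_seq("ATAT", PPM))  # every base here has a non-zero entry
```

Plain nested lists are enough here; with a numpy array the lookup would be written `ppm[base_index, seq_index]`, as in the pseudo-code.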

At sequence position 0, seq1 has base G. The PPM entry for G at position 0 is 0.0, so the product is 0 and we stop.

This raises the question of whether a G at index 0 is truly impossible. More likely, the 0 reflects the limited sample size of the experiment. One approach is to add a “small” number to all elements of the PWM. (This is akin to imagining the next observation would have been of the unobserved type.) This number is referred to as a pseudocount and, typically, a pseudocount ≤ 1 is chosen.

6.12.1.3. Adjusting the PWM with a pseudocount

We add a pseudocount of 0.5 to the PWM and then convert to a PPM as before, producing

PPM
Position specific probability matrix after adding 0.5 to the PWM cells
Base \ Position	0	1	2	3
T	0.208	0.875	0.125	0.542
C	0.042	0.042	0.042	0.042
A	0.708	0.042	0.792	0.375
G	0.042	0.042	0.042	0.042

4 rows x 5 columns
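
The pseudocount adjustment can be sketched as follows. As before, the counts are an assumption: illustrative values for 10 aligned sequences that reproduce the tables in this section. They are not taken from the original counts table.

```python
import numpy as np

# ASSUMPTION: illustrative counts for 10 aligned sequences (rows T, C, A, G;
# columns are positions 0 to 3), chosen to reproduce the tables in this section.
counts = np.array(
    [
        [2, 10, 1, 6],  # T
        [0, 0, 0, 0],   # C
        [8, 0, 9, 4],   # A
        [0, 0, 0, 0],   # G
    ],
    dtype=float,
)

pseudocount = 0.5

# Add the pseudocount to every cell, then renormalise each column.
adjusted = counts + pseudocount
ppm_adjusted = adjusted / adjusted.sum(axis=0)
print(ppm_adjusted.round(3))
```

With 4 bases and a pseudocount of 0.5, each column total grows from 10 to 12, which is why a previously unobserved base ends up with probability 0.5 / 12 ≈ 0.042.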

Selecting the elements for seq1 from this adjusted PPM gives 0.042 (G at position 0), 0.875 (T at position 1), 0.792 (A at position 2) and 0.542 (T at position 3), leading to

\[p(seq1|alternate)=0.042\times0.875\times0.792\times0.542\approx0.015775\]
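
The product can be checked with a few lines of Python. This sketch uses the rounded entries exactly as they appear in the table, so it reproduces the figure in the equation; carrying more decimal places through the calculation would change the later digits slightly.

```python
import math

# Rounded entries read from the adjusted PPM for G, T, A, T at positions 0 to 3.
selected = [0.042, 0.875, 0.792, 0.542]

# The probability of the sequence is the product of the selected entries.
p_seq1_alternate = math.prod(selected)
print(f"{p_seq1_alternate:.6f}")
```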

6.12.1.4. The odds-ratio

We can form an odds-ratio as

\[OR = \frac{p(seq1|alternate)}{p(seq1|background)}\approx4.0384\]
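
As a quick check of the arithmetic, here is a minimal sketch that takes the two probabilities from the worked example as given. The variable names are my own.

```python
# Probability of seq1 under the alternate, as reported in the worked example.
p_seq1_alternate = 0.015775

# Probability of seq1 under the uniform background: 0.25 for each of 4 positions.
p_seq1_background = 0.25 ** 4

odds_ratio = p_seq1_alternate / p_seq1_background
print(f"{odds_ratio:.4f}")
```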

How should you interpret this? Look at the OR equation!

6.12.1.5. Computing the PSSM

The PSSM is a log-odds matrix, i.e. it’s the log of the odds-ratio matrix. Because we assume a uniform background in which every base has probability 0.25, we can compute it very simply as log2(ppm) - log2(0.25), applied to each cell of the pseudocount-adjusted PPM.

Base \ Position	0	1	2	3
T	-0.263	1.807	-1.000	1.115
C	-2.585	-2.585	-2.585	-2.585
A	1.503	-2.585	1.663	0.585
G	-2.585	-2.585	-2.585	-2.585

4 rows x 5 columns
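
The whole transformation, from counts to PSSM, can be sketched in numpy as below. The counts are again an assumption (illustrative values for 10 aligned sequences that reproduce the tables in this section). I start from counts rather than the rounded PPM because taking logs of the rounded entries shifts the third decimal place.

```python
import numpy as np

# ASSUMPTION: illustrative counts for 10 aligned sequences (rows T, C, A, G;
# columns are positions 0 to 3), chosen to reproduce the tables in this section.
counts = np.array(
    [
        [2, 10, 1, 6],  # T
        [0, 0, 0, 0],   # C
        [8, 0, 9, 4],   # A
        [0, 0, 0, 0],   # G
    ],
    dtype=float,
)

# Pseudocount-adjusted PPM.
adjusted = counts + 0.5
ppm = adjusted / adjusted.sum(axis=0)

# Log-odds against a uniform background of 0.25 per base.
background = 0.25
pssm = np.log2(ppm) - np.log2(background)
print(pssm.round(3))
```

Positive entries mark bases that are more probable at that position than under the background; negative entries mark bases that are less probable.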

6.12.1.6. Computing the PSSM score for the sequence

We now select elements from the PSSM, just as we did above from the PPM – we use the sequence position number to specify the column of the PSSM, and the base at that position to specify the row. With that, for the sequence “GTAT”, we select the following log-odds scores: -2.585, 1.807, 1.663, 1.115.

From these, the log-odds of seq1 being derived from the experimental sample instead of the background is:

\[score = -2.585 + 1.807 + 1.663 + 1.115 = 2.000\]

Since \(2^{2} = 4\), this agrees with the odds-ratio of approximately 4 computed above.
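
Scoring a sequence is then a lookup and a sum. The sketch below uses the rounded PSSM entries from the table; the function name `pssm_score` is my own choice for illustration.

```python
BASES = "TCAG"  # row order of the PSSM

# Rounded PSSM entries from the table above: rows follow BASES, columns are positions.
PSSM = [
    [-0.263, 1.807, -1.000, 1.115],   # T
    [-2.585, -2.585, -2.585, -2.585],  # C
    [1.503, -2.585, 1.663, 0.585],     # A
    [-2.585, -2.585, -2.585, -2.585],  # G
]


def pssm_score(seq, pssm, bases=BASES):
    """Sum the log-odds entries selected by each base and its position."""
    return sum(pssm[bases.index(base)][i] for i, base in enumerate(seq))


score = pssm_score("GTAT", PSSM)
print(f"{score:.3f}")       # log2 odds
print(f"{2 ** score:.3f}")  # back-transformed to an odds ratio
```

Summing log-odds is equivalent to multiplying the per-position odds ratios, which is why raising 2 to the power of the score recovers the odds ratio.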

For more on the interpretation of odds ratios, see Odds-ratio’s.

6.13. Exercises

What does an OR equal to 1 mean? What about an OR > 1? Or an OR < 1?
What does a log-odds ratio > 1 mean? What about a log-odds ratio equal to 0?
Write a function that takes a numpy array of odds-ratios and returns their \(\log_2\).
Write a function that takes a numpy array of log odds-ratios (assume the base is 2, i.e. \(\log_2\)) and returns their odds-ratios.