6.6. \(k\)-mers and Motifs#

6.6.1. \(k\)-mers#

One way of describing DNA sequences is simply counting the occurrences of DNA words of a fixed size. Such words are often referred to as \(k\)-mers where \(k\) refers to the word length. If we set \(k=1\), then the complete set of possible 1-mers is the listing of the nucleotides, i.e. {A, C, G, T}.

But why do it? Why is knowing \(k\)-mers informative? There are a multitude of ways in which \(k\)-mer counting is used in bioinformatics. One concerns evolutionary relatedness, and another concerns functional encoding. Before we tackle those, let’s go ahead and define \(k\)-mers in more detail.

We define the set of possible \(k\)-mers as the full enumeration of characters in a state space. In English, since DNA has four letters there are \(4^k\) distinct \(k\)-mers. So for \(k=2\), that’s \(4^2=16\). These are all the possible dinucleotides.

AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT

Setting \(k=3\), we have 64 possible trinucleotides, for \(k=4\), we have 256 tetranucleotides and so on.

We typically define the \(k\)-mer distribution of a biological sequence as the counts of all possible occurrences of each \(k\)-mer. Consider the following 11bp long DNA sequence.

'ACGGTGCCAGG'

The distribution of \(k\)-mers for \(k=1\) is

{'A': 2, 'C': 3, 'G': 5, 'T': 1}

Each value in the dict corresponds to the number of occurrences of that nucleotide.

The distribution of \(k\)-mers for \(k=2\) is

{'AC': 1, 'CG': 1, 'GG': 2, 'GT': 1, 'TG': 1, 'GC': 1, 'CC': 1, 'CA': 1, 'AG': 1}

where each value corresponds to the number of occurrences of that dinucleotide. If you sum the values you get 10, from a sequence of length 11. Wait, what … how does that work! Surely there are only 5 dinucleotides in a sequence of length 11!

We derive a \(k\)-mer distribution by considering all possible words of length \(k\), Let’s simplify the problem by working on just the first 3 nucleotides, ACG. This has two possible dinucleotides – AC and CG. Thus, we slide the start position along the sequence by 1 and take the \(k\) nucleotides from that position (see Sliding window).

So how do \(k\)-mers get used? They are used as a measure of evolutionary relatedness. The “distance” between \(k\)-mer distributions of closely related species is typically smaller than the distance between distantly related species. The genomic signature statistics of Karlin and colleagues are related to these measures. Moreover, \(k\)-mers are employed in machine learning algorithms such as that of Zhu et al (from my own lab).

6.6.2. Motifs#

Another application of \(k\)-mers is in how they relate to the concept of motifs. A motif is a short sequence that occurs multiple times in a DNA, RNA or protein sequence. The phrase can also be applied quite generally. For instance, the "ATG" motif corresponds to the start codon of protein coding genes. The phrase is typically applied to sequences that have some functional association. Arguably, this notion is exemplified by sequence logos, a statistical method used for visualisation of motifs. The DNA binding motif of the TBP protein is visually represented by a sequence logo. A related visualisation technique was developed by Zhu et al (from my own lab) for identifying sequence motifs associated with mutation processes.

As the start codon example illustrates, motifs represent a fundamental concept in the description of information encoding by DNA sequences.

6.7. Exercises#

  1. Consider the sequence seq. How many \(k\)-mers are there for \(k=1,2,3\)?

    seq = "ACG"
    
  2. For a sequence of length 7, how many \(k\)-mers are there for \(k=1,2,3\)?

  3. Write an equation for the number of \(k\)-mers in a sequence of length \(n\). When you set \(n=3, 7\) and \(k=1, 2, 3\) you should get the same answers as above.

  4. Write an algorithm that produces all the dinucleotides for seq.

  5. Then do it for all the trinucleotides in seq.

  6. The Python standard library has lots of very useful goodies. Investigate the Counter class from collections and apply it to your result from (4) and (5) to determine the \(k\)-mer counts. (Use google!)