5.25. The set data type

This data type represents the mathematical object of a set. A set stores only unique values, no matter how it was initialised. The elements of a set are unordered, so a set cannot be indexed (or subscripted): there is no meaningful “first” element.

Sets are defined either explicitly, using {} with the members separated by commas

a = {0, 1, 1, 3, 1}
a
{0, 1, 3}

or by calling the builtin function set(), passing the data in as an iterable such as a list.

a = set([0, 1, 1, 3, 1])
a
{0, 1, 3}

As the second form indicates, the input to set() must be an iterable. This matters! The following two definitions are not equivalent.

a = {"ABC"}
a
{'ABC'}

Using braces states explicitly that the string “ABC” is a set member, so the set has one member.

Using the builtin function instead states that the items in “ABC” (its individual characters) are the set members.

b = set("ABC")
b
{'A', 'B', 'C'}

To make set() produce the same result as the braces, place the string inside another iterable.

b = set(["ABC"])
b
{'ABC'}
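A related pitfall when choosing between the two forms concerns the empty set. The short sketch below (variable names are illustrative) shows that empty braces produce a dict, so an empty set has to be created with the builtin function.

```python
# Empty braces create a dict, not a set.
empty_braces = {}
print(type(empty_braces).__name__)  # dict

# An empty set has to come from the builtin function.
empty_set = set()
print(type(empty_set).__name__)  # set
print(len(empty_set))  # 0
```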

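The set type is iterable. Because a set is unordered, the items can appear in a different order from the one used in the definition, and that order can change between runs.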

dinucs = {"AA", "CG", "GA"}
for item in dinucs:
    print(item)
CG
GA
AA
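If a reproducible order is needed, one option is to sort the members before iterating. This is a minimal sketch using the builtin sorted(), which returns a list; the name ordered is illustrative.

```python
dinucs = {"AA", "CG", "GA"}

# sorted() accepts any iterable and returns a new, ordered list.
ordered = sorted(dinucs)
for item in ordered:
    print(item)
# AA
# CG
# GA
```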

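Only hashable objects can be members of sets. In practice this means the immutable builtin types, such as numbers, strings and tuples of these (so not lists, dicts or sets).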

{"ABC", []}
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[7], line 1
----> 1 {"ABC", []}

TypeError: unhashable type: 'list'
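As a hedged illustration of the same rule, the sketch below stores pairs of bases as tuples, which are hashable, and then shows that the equivalent list is rejected. The names pairs and list_allowed are made up for this example.

```python
# A tuple of strings is hashable, so it can be a set member.
pairs = {("A", "T"), ("C", "G")}
print(("A", "T") in pairs)  # True

# A list with the same content is not hashable.
try:
    {["A", "T"]}
    list_allowed = True
except TypeError as err:
    list_allowed = False
    print(err)  # unhashable type: 'list'
```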

A great strength of this data type is that it allows very succinct comparisons. These reuse the bitwise operator symbols. For instance, we identify the overlap (intersection) of two sets using the & character (bitwise AND).

a = set("ACGGCCT")
b = set("ACGGAAA")
a & b
{'A', 'C', 'G'}

We can establish whether an object is a member of a set using the in membership operator

"X" in a
False

or whether one set is a proper subset of another (a subset that is not equal to it) using the < comparison operator

bases = {"A", "C", "G", "T"}
b < bases
True
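The distinction between a proper subset and a subset only shows up when the two sets are equal. The sketch below uses illustrative names (canonical, observed) to compare < with <=, and also shows the mirror-image > operator for supersets.

```python
canonical = {"A", "C", "G", "T"}
observed = {"A", "C", "G"}

print(observed < canonical)    # True: every member is in canonical, and the sets differ
print(canonical < canonical)   # False: a set is not a proper subset of itself
print(canonical <= canonical)  # True: <= allows the two sets to be equal
print(canonical > observed)    # True: canonical is a proper superset
```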

We can compute the difference (which nucleotides b is missing) using the standard - operator

bases - b
{'T'}

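Or a “symmetric” difference, the elements that belong to exactly one of the two sets, using the ^ character (bitwise exclusive OR). Because b is a subset of bases, the result here is the same as bases - b.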

bases ^ b
{'T'}
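The symmetric difference is easier to see when neither set contains the other. In this sketch (the names left and right are illustrative) each one-sided difference is printed first, followed by the symmetric difference, which collects both. The results are sorted only to make the printed order predictable.

```python
left = {"A", "C", "G"}
right = {"G", "T"}

print(sorted(left - right))  # ['A', 'C']  only in left
print(sorted(right - left))  # ['T']       only in right

sym = left ^ right
print(sorted(sym))           # ['A', 'C', 'T']  in exactly one of the two
```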

We can take the union of two sets using the | character (bitwise inclusive OR).

a = {0, 2, 3}
b = {1, 4}

a | b
{0, 1, 2, 3, 4}

These operations are also available as methods on the set instances.
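The sketch below pairs each operator with the corresponding method from the standard set API. It uses fresh names (seq_a, seq_b) so as not to disturb the variables defined above. One practical difference is that the methods accept any iterable, whereas the operators require both operands to be sets.

```python
seq_a = set("ACGGCCT")
seq_b = set("ACGGAAA")

# Each method gives the same result as its operator.
print(seq_a.intersection(seq_b) == seq_a & seq_b)          # True
print(seq_a.union(seq_b) == seq_a | seq_b)                 # True
print(seq_a.difference(seq_b) == seq_a - seq_b)            # True
print(seq_a.symmetric_difference(seq_b) == seq_a ^ seq_b)  # True
print(seq_b.issubset(seq_a) == (seq_b <= seq_a))           # True

# Methods accept any iterable, here a plain string.
shared = seq_a.intersection("ACGGAAA")
print(sorted(shared))  # ['A', 'C', 'G']

# The operator form does not.
try:
    seq_a & "ACGGAAA"
    operator_accepts_str = True
except TypeError:
    operator_accepts_str = False
print(operator_accepts_str)  # False
```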

Having created a set, you can add new elements using the add() method.

a.add(22)
a
{0, 2, 3, 22}

Or remove elements using the remove() method.

a.remove(22)
a
{0, 2, 3}
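Note that remove() raises a KeyError when the element is not present. Sets also provide a discard() method, which does nothing in that case. A minimal sketch, using an illustrative set called numbers:

```python
numbers = {0, 2, 3}

# remove() fails if the element is absent.
try:
    numbers.remove(22)
    remove_failed = False
except KeyError as err:
    remove_failed = True
    print("KeyError:", err)  # KeyError: 22

# discard() silently ignores a missing element.
numbers.discard(22)
print(sorted(numbers))  # [0, 2, 3]
```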

Because a set is mutable (and therefore not hashable), you cannot have a set as a member of another set. Python provides an immutable set type, frozenset, that can be a set member. It is created using the builtin function of that name.

f = frozenset("ABCD")
f
frozenset({'A', 'B', 'C', 'D'})
a.add(f)
a
{0, 2, 3, frozenset({'A', 'B', 'C', 'D'})}

Note

Once created, a frozenset instance cannot be changed.
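To make the note concrete, the sketch below (names are illustrative) shows that a frozenset has no add() method, while the set operators still work and return a new object rather than modifying the original.

```python
frozen = frozenset("ABCD")

# frozenset has no mutating methods such as add().
try:
    frozen.add("E")
    add_worked = True
except AttributeError:
    add_worked = False
print(add_worked)  # False

# Operators build a new frozenset and leave the original untouched.
extended = frozen | {"E"}
print(type(extended).__name__)  # frozenset
print(sorted(extended))         # ['A', 'B', 'C', 'D', 'E']
print(sorted(frozen))           # ['A', 'B', 'C', 'D']
```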

5.26. Exercises

  1. For the following data, create a set using either set() or a set comprehension.

    data = ['GC', 'CA', 'AA', 'AG', 'GG', 'GA', 'AG',
            'GC', 'CC', 'CA', 'AA', 'AC', 'CA', 'AT',
            'TA', 'AA', 'AC', 'CA', 'AG']
    
  2. How many unique dinucleotides are there in data?

  3. Create a set from the following and compare it to the set you created from data. How big is the intersection of the two sets? How big is their symmetric difference?

    data2 = ['GC', 'CA', 'AA', 'AG', 'GG', 'GC', 'CG',
             'GC', 'CC', 'CA', 'AA', 'AC', 'CA', 'AG',
             'GA', 'AG', 'GC', 'CA', 'AG']
    
  4. Provide an example of a use that a frozenset() supports but a set() does not. In showing this, include any errors and explain why they occur.

  5. For the following data, you want to create a set that excludes dinucleotides containing a non-canonical DNA character (see Expected Output). Solve this problem in two different ways. (a) Create the set of unique dinucleotides, then derive the correct set from it. (b) Create an empty set, iterate over the dinucleotides in data, and add each one only if it consists of canonical nucleotides. Which algorithm is faster and why?

    data = ['GC', 'CA', 'AA', 'NG', 'GG', 'GA', 'AG',
            'GC', 'CC', 'CR', 'AA', 'AC', 'CA', 'NN',
            'TA', 'AA', 'AY', 'CA', 'AG']
    
    Expected Output

    {'AC', 'GG', 'AA', 'CC', 'CA', 'GC', 'GA', 'AG', 'TA'}