5.43. numpy – numerical routines for python#

The numpy library is the foundation of the vast majority of scientific computing packages that use Python. It is popular because it provides a greatly simplified interface to complicated algorithms that have fast implementations. Conventional wisdom holds that converting a standard Python program to use numpy will deliver a 10x speedup. In fact, it can be much faster than that. But that’s not the focus of this extremely brief summary of numpy. Instead, we introduce you to the major capabilities and usage patterns that numpy enables. In short, using numpy objects can often eliminated the need for loops entirely. As a consequence, this greatly simplify the algorithms you have to write. So it’s truly worthwhile becoming familiar with this library [3].

numpy is designed for numerical calculation and the primary object the library provides is an array. The array [4] object has several key attributes, including:

  • array.ndim attribute, which indicates how many dimensions the array has

  • array.shape attribute, which indicates the number of elements on each dimension.

  • array.dtype attribute, which indicates the numpy data type [5]. This can also be determined by the input data, or by using the dtype argument.

5.43.1. Creating an array from existing data#

import numpy

data = numpy.array([0, 1, 2, 3], dtype=int)
data
array([0, 1, 2, 3])
data.ndim
1
data.shape
(4,)
data.dtype
dtype('int64')

Once created, you cannot extend an array, i.e. it’s total number of elements is immutable. However, the array “shape” (and thus dimensions) can be changed and the value at individual coordinates can be changed.

data.resize((2, 2))
data
array([[0, 1],
       [2, 3]])
data[0][0] = 42
data
array([[42,  1],
       [ 2,  3]])

5.43.2. Conversion to standard python data types#

raw = data.tolist()
raw
[[42, 1], [2, 3]]

5.43.3. Conversion to a different dtype#

There is a method on arrays for converting an array of one type into an array of a different type. For instance

x = numpy.array(["0.12", "0.33"])
x.dtype, x
(dtype('<U4'), array(['0.12', '0.33'], dtype='<U4'))
cast = x.astype(float)
cast.dtype, cast
(dtype('float64'), array([0.12, 0.33]))

So numpy has converted an array of strings into an array of 64-bit precision floats, in one line. Sweet!

5.43.4. Implicit type casting#

The dtype of an array instance dictates what assignment operations mean. For example, say we have an integer array

data.dtype, data
(dtype('int64'),
 array([[42,  1],
        [ 2,  3]]))

If we try to assign a float to the first element, it will not work because the value is implicitly cast to the dtype of the instance. In this example, only the integer component of the float 5.92132 is assigned.

data[0, 0] = 5.92132
data
array([[5, 1],
       [2, 3]])

Warning

Implicit type casting is never what you want! Because numpy does not raise an exception for this case, it is up to the programmer (you) to ensure the array dtype is appropriate. For this example, if you want to be able to assign floats to data you have to convert it to be floats firste, e.g. data.astype(float).

5.43.5. Constructing matrices#

Matrices can be specified on construction by providing, for example, lists of lists. In this example we use a list consisting of two lists, each with 4 elements. This results in a \(2\times4\) array.

data = numpy.array([[0, 1, 2, 3], [4, 5, 6, 7]])
data.shape
(2, 4)
data
array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

Or, by combining other arrays [1].

a = numpy.arange(4)
a
array([0, 1, 2, 3])
b = numpy.arange(4, 8)
b
array([4, 5, 6, 7])
# from the above numpy arrays
m = numpy.array([a, b])
m
array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

5.43.6. Scalar operations on arrays#

A major convenience for arrays is the ability to express element-wise operations as a single statement, instead of having to use a for loop.

Here’s an element-wise addition using a standard for loop on the raw nested list data structure.

5.43.6.1. The laborious (and slow) way#

for i in range(len(raw)):
    for j in range(len(raw[i])):
        raw[i][j] += 20
raw
[[62, 21], [22, 23]]

5.43.6.2. The simple and fast numpy way#

data += 20
data
array([[20, 21, 22, 23],
       [24, 25, 26, 27]])

Nice!

5.43.7. Standard mathematical operations on arrays#

If two or more arrays have the same shape, then element-wise operations between corresponding elements is also very simply expressed.

print("Before:", a, b, sep="\n")
c = a * b
print("After:", c, sep="\n")
Before:
[0 1 2 3]
[4 5 6 7]
After:
[ 0  5 12 21]

If they do not have a compatible shape, a ValueError exception is raised and the text indicates “… operands could not be broadcast together with shapes…”.

d = numpy.arange(5)
a * d
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[20], line 2
      1 d = numpy.arange(5)
----> 2 a * d

ValueError: operands could not be broadcast together with shapes (4,) (5,) 

5.43.8. Array iteration#

Behaves the same as iterating over a standard Python list (or tuple) with the same dimensions. This corresponds to iterating over axis=0.

for e in data:
    print(e)
[20 21 22 23]
[24 25 26 27]

5.43.9. Indexing and slicing#

In the following, we are working on this array.

array([[20, 21, 22, 23],
       [24, 25, 26, 27]])

We can select an individual element using the standard looking slice notation.

data[0][1]
21

Note that each dimension requires successive [] pairs. The numpy extended slicing notation allows using one set of [].

data[0, 1]
21

The slicing capabilities of arrays is rich and very useful! We can slice a matrix for a single column across all rows

data[:, 1] # the [1] column
array([21, 25])

or a single row across all columns. In both cases the : represents the complete set.

data[1, :] # the [1] row
array([24, 25, 26, 27])

5.43.10. Ensuring array shapes are compatible for mathematical operations#

There are rules that numpy uses to determine how arrays are broadcast together. The best resource to understanding this is the official documentation on broadcasting. That said, here’s a very condensed explanation.

When the array shapes are not the same, numpy compares the shapes element wise from right to left. The dimensions of two arrays are considered compatible when they are same or one of them is 1. Consider the arrays x and y

x = numpy.array([[0, 1], [2, 3], [4, 5], [6, 7]])
x
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7]])
x.shape
(4, 2)
y = numpy.array([1, 5, 9, 13])
y
array([ 1,  5,  9, 13])
y.shape
(4,)

Applying the broadcast rule, these are incompatible.

1x * y
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[31], line 1
----> 1 x * y

ValueError: operands could not be broadcast together with shapes (4,2) (4,) 

This is because, the first value read from the right of x.shape is 2 and from the right of y.shape is 4.

For our example, one solution that ensures the result of the * operation has the same shape as x is to add a “new axis” to y. This can be done via a combination of slicing and using numpy.newaxis

x * y[:, numpy.newaxis]
array([[ 0,  1],
       [10, 15],
       [36, 45],
       [78, 91]])

or, equivalently, by explicitly reshaping y.

x * y.reshape((4,1))
array([[ 0,  1],
       [10, 15],
       [36, 45],
       [78, 91]])

We could also solve this using the transpose x (which flips the matrix, reversing it’s dimensions)

x.T * y
array([[ 0, 10, 36, 78],
       [ 1, 15, 45, 91]])

but this has the effect of meaning the result is also transposed with respect to the original orientation, which is typically inconvenient.

5.43.11. Array assignment#

Consider the following data.

a = numpy.array([[38, 28, 93], [96, 95, 70]])
l = a.tolist()

Assignment to individual elements of an array is more flexible than the comparable standard python objects. For instance, to assign 0 to all values of a is simply

a[:] = 0
a
array([[0, 0, 0],
       [0, 0, 0]])

Trying that on a list, however, raises an exception.

1l[:] = 0
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[37], line 1
----> 1 l[:] = 0

TypeError: can only assign an iterable

As the exception indicates, looping is required.

We can assign to an individual element using the numpy notation.

data[1, 2] = -99
data
array([[ 20,  21,  22,  23],
       [ 24,  25, -99,  27]])

5.43.12. Evaluation operations#

Using standard python evaluation operations on numpy arrays returns element wise bool arrays. We show uses for these below.

indices = data < 0
indices
array([[False, False, False, False],
       [False, False,  True, False]])

5.43.13. Advanced indexing#

There are two types of advanced indexing, boolean and integer.

5.43.13.1. Boolean indexing#

This applies when the object being used to slice the array is of type bool. These typically result from some array comparison operation.

m = numpy.array([[1, 2], [-3, 4], [5, -6]])
m
array([[ 1,  2],
       [-3,  4],
       [ 5, -6]])

Let’s identify all elements that are \(<0\).

negative = m < 0
negative
array([[False, False],
       [ True, False],
       [False,  True]])

The result is an array with boolean elements indicating whether the corresponding value in m satisfied (indicated by True) or not (indicated by False) the condition (\(<0\)). We can use bool arrays to slice the others with the same shape.

m[negative]
array([-3, -6])

As this shows, using a bool array for indexing on the original returns just those elements as a flat array. If you want your operation to generate a result with the same shape you need to “index in place”. For instance, you can use the index to restrict specific operations to just those elements represented by the index such as this assignment statement.

m[negative] = 0
m
array([[1, 2],
       [0, 4],
       [5, 0]])

5.43.13.2. Integer indexing#

This involves as many series of integers as there are dimensions to the array (e.g. 2 in the case of m).

Before we start using actual integer series, I’ll start by using conventional indexing to get the value of a single item. Specifically, I select row 1, column 1.

row_index = 1
col_index = 1
m[row_index, col_index]
4

We now enclose those indices in lists, such that each successive value corresponds to another row, another column. As such these sequential arrays correspond to array coordinates and thus must have the same dimension (length in our example below).

row_indices = [1, 2, 0]
col_indices = [1, 0, 1]
m[row_indices, col_indices]
array([4, 5, 2])

This corresponds to the following array coordinates: (1, 1), (2, 0), (0, 1). Thus, the returned value from advanced indexing is an array with same length as the indexing array length (3 in our case).

5.43.14. The numpy array axis#

As illustrated, the axis argument specifies whether a method / function operates on rows or columns [2].

Working on this array.

array([[ 20,  21,  22,  23],
       [ 24,  25, -99,  27]])
data.sum(axis=0)
array([ 44,  46, -77,  50])

5.43.15. Getting useful statistical quantities#

# Overall mean, all elements
data.mean()
7.875
# Unbiased estimate of standard deviation, all elements
data.std(ddof=1)
43.24163833291109
# Column means, operating on rows
data.mean(axis=0)
array([ 22. ,  23. , -38.5,  25. ])
# Row means, operating on columns
data.mean(axis=1)
array([21.5 , -5.75])

5.43.16. Linear algebra – matrix multiplication#

Matrix multiplication is a fundamental operation in linear algebra and is central to many statistical procedures (e.g. fitting linear models, taking the exponential of a matrix, likelihood of a phylogeny).

data1 = numpy.array([0, 1, 2, 3])
data2 = numpy.array([4, 5, 6, 7])

ip = numpy.inner(data1, data2)
ip
38

The @ symbol also serves as a special operator for matrix multiplication.

data1 @ data2
38

5.43.17. Conditionals on arrays#

Conditional operations on numpy arrays are important. We illustrate the utility of these operations with some simple examples.

data = numpy.array([[1, 2, 1, 9], [9, 1, 1, 3]])
matched = data > 3
matched
array([[False, False, False,  True],
       [ True, False, False, False]])

The above expression is evaluated element wise and returns a numpy array of type bool.

We use the standard Python in operator.

if 3 in data:
    print("Yes")
else:
    print("No")
Yes

We apply a conditional to an array and use the any() method, which will return True if any single element satisfied this condition.

if (data > 3).any():
    print("Yes")
else:
    print("No")
Yes

Using the all() method, which will return True only if all elements satisfied the condition.

if (data > 3).all():
    print("Yes")
else:
    print("No")
No

5.43.18. Comparisons of multiple arrays#

numpy provides tools for element-wise comparisons. This is more complicated than just using the standard python syntax.

x = numpy.array([True, False, False, True], dtype=bool)
y = numpy.array([False, False, False, True], dtype=bool)

Applying equivalence operators to arrays can result in exceptions because the result is ambiguous.

x or y
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[59], line 1
----> 1 x or y

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Instead, you should use special functions which will operate element wise. Here’s a couple of examples.

numpy.logical_or(x, y)
array([ True, False, False,  True])
numpy.logical_and(x, y)
array([False, False, False,  True])

5.43.19. Using the result of array comparisons to count#

Scenario, you want to count (from multiple arrays that consist of a continuously distributed random variable) the number of times a specific threshold is reached for each “position” on a reference coordinate system.

data = [
    numpy.array([0.923, 0.022, 0.360, 0.970, 0.585]),
    numpy.array([0.480, 0.282, 0.055, 0.873, 0.960]),
]

# create an array that will be used to count how often
# a certain threshold is met
counts = numpy.zeros((5,), dtype=int)
counts
array([0, 0, 0, 0, 0])
print(data[0] > 0.5)
for da in data:
    counts[da > 0.5] += 1

counts
[ True False False  True  True]
array([1, 0, 0, 2, 2])
data = numpy.array(data)

(data > 0.5).sum(axis=0)
array([1, 0, 0, 2, 2])

5.44. Exercises#

  1. Create a list of 10 positive integers and convert it into a numpy array. Use array methods to compute the total. Divide the original array by the total to produce a normalised array, which you assign to a variable freqs. Using numpy logical operations to show that all elements are between 0 and 1. Use array methods to show the array sum is 1.

  2. Many methods on numpy arrays have an axis argument, one of which is sum(). Construct a 2-dimensional (2D) array that has the same number of rows and columns, e.g.

    [[0, 0],
     [0, 0]]
    

    is a 2D array. Assign values that make it easy to distinguish operations that operate across rows versus those which operate across columns [6]. Demonstrate this matrix serves that purpose using sum().

  3. bool data types can be summed. Create a sample array with dtype=bool and show that the sum of this array equals the number of occurrences of True.

  4. Look at the array data and identify the array coordinates where the values equal 9. Now use advanced array indexing to extract those coordinates in a single line statement.

    data = numpy.array([[1, 9, 0, 3, 9],
                        [9, 2, 8, 2, 1],
                        [3, 1, 9, 9, 5]])
    

    The result should be

    array([9, 9, 9, 9, 9])
    
  5. Same as the previous question except in a single line statement extract the values ≠9. The result should be

    array([1, 0, 3, 2, 8, 2, 1, 3, 1, 5])
    
  6. Use boolean array indexing to assign -3 to all values of data less than 2. The result should be

    array([[-3,  9, -3,  3,  9],
           [ 9,  2,  8,  2, -3],
           [ 3, -3,  9,  9,  5]])
    
  7. For the following boolean array indices, what is the result of ~indices?

    indices = numpy.array([True, True, False, True], dtype=bool)
    
  8. Convert the following code into using numpy – without for loops. After converting counts to a numpy array, my solution is 3 lines long.

    from math import log10
    
    counts = [[-4, 3, 4, -3, 4],
              [4, -1, -2, -3, 4],
              [-4, -1, 2, 0, 3],
              [2, -2, -2, -4, -5]]
    result = []
    for i in range(4):
        row = []
        for j in range(4):
            val = counts[i][j]
            val = 0 if val <= 0 else log10(val)
            row.append(val)
        result.append(row)
    
    result
    
    [[0, 0.47712125471966244, 0.6020599913279624, 0],
     [0.6020599913279624, 0, 0, 0],
     [0, 0, 0.3010299956639812, 0],
     [0.3010299956639812, 0, 0, 0]]
    
    array([[0.        , 0.47712125, 0.60205999, 0.        , 0.60205999],
           [0.60205999, 0.        , 0.        , 0.        , 0.60205999],
           [0.        , 0.        , 0.30103   , 0.        , 0.47712125],
           [0.30103   , 0.        , 0.        , 0.        , 0.        ]])
    
  9. What happens when you slice the following 1D array using newaxis on the first axis, or the second axis

    x = numpy.array([1, 9, 0, 3, 9])
    
  10. Comparing performance of pure Python and numpy implementations. Investigate usage of numpy.where() to obtain the row and column coordinates of a 2D array where the value equals 1 (that’s a one). Write a function called np_where() that takes a matrix as an argument and returns the row coordinates and column coordinates.

    First, use the following code to generate a random square matrix.

    from numpy.random import randint
    
    dim = 5
    mat = randint(0, 2, size=dim * dim)
    mat.resize(dim, dim)
    mat
    
    array([[0, 0, 0, 1, 0],
           [1, 1, 1, 0, 1],
           [1, 0, 0, 0, 0],
           [1, 1, 1, 1, 0],
           [0, 1, 0, 1, 1]])
    

    Compare np_where() to the performance of a function implemented using only pure python called py_where() that takes the matrix as an argument and returns the <row coordinates>, <column coordinates> as lists. For mat, it should return the following.

    ([0, 1, 1, 1, 1, 2, 3, 3, 3, 3, 4, 4, 4], [3, 0, 1, 2, 4, 0, 0, 1, 2, 3, 1, 3, 4])
    

    Use the “magic” %timeit command builtin to Jupyter to assess performance of each function on the same value of mat.

    %timeit py_where(mat)
    
    7.56 µs ± 81.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
    

    Then try setting dim=20 and repeat. Which is faster, and by how much?

  11. Do some googling for testing numpy arrays using assert_allclose. Then use this to check your array freqs created above sums to 1.