numpy argsort incantations
“You seem to be a big fan of the numpy.argsort method, Fabian, since you recommend it so frequently…”, my colleague Raj approached me last Friday in the lab.
So I lowered my voice and replied in conspiratorial whisper: Well, you know, numpy works a little bit like magic. Once you call an operation by its true name, it just comes true.
Here, I’m going to give away some arcane abras and cadabras that every numpilogist should know about.
Let’s create some artificial data to illustrate how argsort can be used.
import numpy as np
np.set_printoptions(precision=2)
N = 10
data = 100*np.random.rand(N)
data
>>> array([77.09, 2.78, 79.87, 57.14, 89.41, 70.59, 9.51,
0.53, 98.36, 87.73])
The primary usage of argsort
(before we venture into the deeper territories of numpilogy) is to sort the array that is given as its argument. But instead of returning the sorted array, argsort
returns a list of indices that can be used to sort the array subsequently.
I like to call this return value the order
of things.
order = np.argsort(data)
order
>>> array([7, 1, 6, 3, 5, 0, 2, 9, 4, 8])
In numpy, an array of integers can be used to index another array.
If we index data
with order
, we obtain the sorted array.
data[order]
>>> array([ 0.53, 2.78, 9.51, 57.14, 70.59, 77.09, 79.87,
87.73, 89.41, 98.36])
Now observe that order
is a list of non-repeating integers between 0 and len(data)-1. Obviously, the function \(i\rightarrow\text{order}[i]\) must then be a bijection (one-to-one and onto) and therefore possess an inverse.
How can one find this inverse? And what are its uses? Let’s first try to answer the first question.
Generally speaking, for a list of integers forward_map
the inverse inverse_map
can be defined by one of the identities:
inverse_map[forward_map] = identity
forward_map[inverse_map] = identity
In our application to ordering and sorting, identity
is the list l
of numbers with l[i]=i
.
That is identity=np.arange(N)
.
Continuing with the application to sorting, we seek inverse_order
such that
order[inverse_order] = np.arange(N)
Note that np.arange(N)
is an ordered list and contains the same integers as order
(but in a different order).
Now from what we have learned above, we know that the list of indices that sorts the elements of order
is np.argsort(order)
. Therefore
inverse_order = np.argsort(order)
inverse_order
>>> array([5, 1, 6, 3, 8, 4, 2, 0, 9, 7])
It’s easy to check the two identities:
print(order[inverse_order])
print(inverse_order[order])
[0 1 2 3 4 5 6 7 8 9]
[0 1 2 3 4 5 6 7 8 9]
Now that we have found the inverse_order, why is it useful? To see this let’s go back to the step, where we actually produce the sorted data:
sorted_data = data[order]
sorted_data
>>> array([ 0.53, 2.78, 9.51, 57.14, 70.59, 77.09, 79.87,
87.73, 89.41, 98.36])
Now instead of having the indexing operation data[order]
on the right-hand side of the assignment operation, numpy also allows to have the indexing operation on the left-hand side of the assignment.
sorted_data = np.empty_like(data)
sorted_data[order] = data
sorted_data
>>> array([70.59, 2.78, 9.51, 57.14, 98.36, 89.41, 79.87,
77.09, 87.73, 0.53])
Obviously something is wrong here, despite the operation being syntactically correct. As you already will have guessed, moving order
to the other side of the equation requires to take its inverse. The correct solution is
sorted_data = np.empty_like(data)
sorted_data[inverse_order] = data
sorted_data
>>> array([ 0.53, 2.78, 9.51, 57.14, 70.59, 77.09, 79.87,
87.73, 89.41, 98.36])
Writing this in mathematical form might make a little clearer what is going on. I abbreviate data
by \(d\), sorted_data
by \(s\), and order
by \(\sigma\). Rewriting sorted_data = data[order]
in this short form gives:
Now define \(j = \sigma(i)\) that is equivalent to \(i = \sigma^{-1}(j)\). Substituting \(j\) for \(i\), we obtain
\[s(\sigma^{-1}(j)) = d(j) \text{ for all } j\]Apprentice’s exercise:
As a small side note, there exists a more direct algorithm for converting the order into its inverse which runs in linear time. Can you program it?
np.arsgort
has \(N \log N\) time complexity and
is therefore a little less efficient than the direct solution. Still I like the
convenience of numpy in providing a single function that carries out the conversion for me.
Summary and Discussion
In sum, we can write the sorting operation as:
sorted_data = np.empty_like(data)
sorted_data[np.argsort(np.argsort(data))] = data
where the inner argsort
computes the permutation to sort data
and the outer argsort
takes the inverse of that permutation.
The inverse is required because the indexing happens at the left-hand side of the assignment operation.
Finally, let me discuss why such numpilogical contortions can in fact be useful. Sometimes, putting the indexing operation on the left-hand side can be super convenient, like in the following:
processed_data[chunk_indices, :] = data_chunk
Imagine that processed_data
is a very large array that stores all your results and that you don’t want to hold twice in RAM. Maybe you have broken down a large computation into multiple function calls (possibly using multiprocessing). Assume that processed_data
must hold the
results in a specific order that you specified in advance. In this case, you need to insert
the data chunk into processed_data
at the correct location. For this, it is useful to be able to convert “right-hand side” indices to “left-hand side” indices. This is especially appealing if there is already a numpy function for this that works like a charm.