4 votes

What is the most Pythonic equivalent of MATLAB's dummyvar function for dealing with categorical variables nicely?

Here is an example illustrating my problem: an MxN matrix whose M rows each partition the same N data points into <=N categories.

>>> partitions
array([[1, 1, 2, 2, 1, 2, 2, 2, 1, 1],
       [1, 2, 2, 1, 2, 1, 2, 2, 2, 1],
       [1, 1, 1, 2, 2, 2, 1, 3, 3, 2]])

The task is to efficiently count, for every pair of data points, the number of partitions that place both points in the same category, and store the result in an NxN matrix. In MATLAB this can be accomplished as a one-liner with dummyvar, which creates an indicator column for each category of each partition.

>> dummyvar(partitions)*dummyvar(partitions)'
ans =
 3     2     1     1     1     1     1     0     1     2
 2     3     2     0     2     0     2     1     2     1
 1     2     3     1     1     1     3     2     1     0
 1     0     1     3     1     3     1     1     0     2
 1     2     1     1     3     1     1     1     2     2
 1     0     1     3     1     3     1     1     0     2
 1     2     3     1     1     1     3     2     1     0
 0     1     2     1     1     1     2     3     2     0
 1     2     1     0     2     0     1     2     3     1
 2     1     0     2     2     2     0     0     1     3

The most efficient way that I can think of to solve this task is writing an O(n*m) loop that emulates dummyvar's behavior. (Note that the code below prefers partitions.shape[0] << partitions.shape[1], which is likely to be true in general but is unsafe to assume.)

import numpy as np

dv = np.zeros((0, partitions.shape[1]))   # don't hardcode the number of points
for row in partitions:
    # one indicator row per category present in this partition
    for val in range(1, np.max(row) + 1):
        dv = np.vstack((dv, row == val))
np.dot(dv.T, dv)

And of course, because vstack in a loop is very inefficient, this can be improved by computing the required size up front and preallocating the array (see the sketch below), but I am really looking for a one-liner that does it just as in MATLAB.
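For reference, a minimal sketch of what that preallocated version could look like (dummyvar_loop is just an illustrative name, and it assumes each row labels its categories 1..max consecutively, as in the example above):

import numpy as np

def dummyvar_loop(partitions):
    counts = partitions.max(axis=1)          # categories used by each partition
    dv = np.zeros((counts.sum(), partitions.shape[1]))
    i = 0
    for row, c in zip(partitions, counts):
        for val in range(1, c + 1):          # one indicator row per category
            dv[i] = row == val
            i += 1
    return dv

np.dot(dummyvar_loop(partitions).T, dummyvar_loop(partitions)) then reproduces the MATLAB result above.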

Edit: Some more information about what I am doing, just for added context. I am writing library functions in Python (where no Python implementation exists) for a library for analyzing brain networks. Existing, working MATLAB source is available. Due to domain-specific constraints, the inputs are at most networks of a few thousand nodes. However, basically all of the functions I write have to scale well to large inputs.


1 Answer

5 votes

You can do a little broadcasting magic to get your dummy arrays fast:

>>> partitions = np.array([[1, 1, 2, 2, 1, 2, 2, 2, 1, 1],
...                        [1, 2, 2, 1, 2, 1, 2, 2, 2, 1],
...                        [1, 1, 1, 2, 2, 2, 1, 3, 3, 2]])
>>> n = np.max(partitions)
>>> # Broadcast to shape (N, n, M): entry (i, k, j) is True iff partition j
>>> # puts point i into category k+1.
>>> d = (partitions.T[:, None, :] == np.arange(1, n+1)[:, None]).astype(int)
>>> d = d.reshape(partitions.shape[1], -1)
>>> d.dot(d.T)
array([[3, 2, 1, 1, 1, 1, 1, 0, 1, 2],
       [2, 3, 2, 0, 2, 0, 2, 1, 2, 1],
       [1, 2, 3, 1, 1, 1, 3, 2, 1, 0],
       [1, 0, 1, 3, 1, 3, 1, 1, 0, 2],
       [1, 2, 1, 1, 3, 1, 1, 1, 2, 2],
       [1, 0, 1, 3, 1, 3, 1, 1, 0, 2],
       [1, 2, 3, 1, 1, 1, 3, 2, 1, 0],
       [0, 1, 2, 1, 1, 1, 2, 3, 2, 0],
       [1, 2, 1, 0, 2, 0, 1, 2, 3, 1],
       [2, 1, 0, 2, 2, 2, 0, 0, 1, 3]])

There is the obvious drawback that, even if a row uses only a few distinct values, the dummy array we are creating gets as many columns for that row as the row with the most values needs. But unless you have huge arrays, it is probably going to be faster than any other approach.
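To make that drawback concrete, here is a quick check on the example data (assuming d from the snippet above): the first two partitions use only 2 of the 3 category slots, so 2 indicator columns are all zeros.

>>> d.shape                               # (N, M * n) == (10, 3 * 3)
(10, 9)
>>> int((d.sum(axis=0) == 0).sum())       # all-zero indicator columns
2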


Well, if you are after a scalable solution, you want to use a sparse array for your dummy matrix. The following code may be hard to follow if you are not familiar with the details of the CSR sparse format:

import numpy as np
import scipy.sparse as sps

def sparse_dummyvar(partitions):
    num_rows = np.sum(np.max(partitions, axis=1))   # total number of categories
    nnz = np.prod(partitions.shape)                 # one entry per point per partition
    as_part = np.argsort(partitions, axis=1)
    # You could get s_part from the indices in as_part, left as
    # an exercise for the reader...
    s_part = np.sort(partitions, axis=1)
    # The first occurrence of each category in a sorted row marks the start
    # of a new row of the CSR matrix.
    mask = np.hstack(([[True]] * len(partitions),
                      s_part[:, :-1] != s_part[:, 1:]))
    indptr = np.where(mask.ravel())[0]
    indptr = np.append(indptr, nnz)

    return sps.csr_matrix((np.repeat([1], nnz), as_part.ravel(), indptr),
                          shape=(num_rows, partitions.shape[1]))

This returns the transpose of dummyvar(partitions). You could get the array without transposing simply by calling csc_matrix instead of csr_matrix and swapping the shape values (see the sketch at the end of this answer). But since you are only after the product of the matrix with its transpose, and scipy converts everything to CSR format before multiplying, it is probably slightly faster like this. You can now do:

>>> dT = sparse_dummyvar(partitions)
>>> dT.T.dot(dT)
<10x10 sparse matrix of type '<type 'numpy.int32'>'
    with 84 stored elements in Compressed Sparse Column format>
>>> dT.T.dot(dT).A
array([[3, 2, 1, 1, 1, 1, 1, 0, 1, 2],
       [2, 3, 2, 0, 2, 0, 2, 1, 2, 1],
       [1, 2, 3, 1, 1, 1, 3, 2, 1, 0],
       [1, 0, 1, 3, 1, 3, 1, 1, 0, 2],
       [1, 2, 1, 1, 3, 1, 1, 1, 2, 2],
       [1, 0, 1, 3, 1, 3, 1, 1, 0, 2],
       [1, 2, 3, 1, 1, 1, 3, 2, 1, 0],
       [0, 1, 2, 1, 1, 1, 2, 3, 2, 0],
       [1, 2, 1, 0, 2, 0, 1, 2, 3, 1],
       [2, 1, 0, 2, 2, 2, 0, 0, 1, 3]])
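For completeness, here is a minimal sketch of the csc_matrix variant mentioned above (sparse_dummyvar_csc is an illustrative name, not part of the original code). The same (data, indices, indptr) triple interpreted column-wise is exactly the transpose, so this returns the un-transposed dummy matrix directly:

import numpy as np
import scipy.sparse as sps

def sparse_dummyvar_csc(partitions):
    # Identical construction to sparse_dummyvar above, but read column-wise,
    # so the result has shape (num_points, num_categories).
    num_rows = np.sum(np.max(partitions, axis=1))
    nnz = np.prod(partitions.shape)
    as_part = np.argsort(partitions, axis=1)
    s_part = np.sort(partitions, axis=1)
    mask = np.hstack(([[True]] * len(partitions),
                      s_part[:, :-1] != s_part[:, 1:]))
    indptr = np.append(np.where(mask.ravel())[0], nnz)
    return sps.csc_matrix((np.repeat([1], nnz), as_part.ravel(), indptr),
                          shape=(partitions.shape[1], num_rows))

With d = sparse_dummyvar_csc(partitions), d.dot(d.T).A gives the same 10x10 result as above.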