What is the most Pythonic equivalent of MATLAB's dummyvar function for dealing with categorical variables nicely?
Here is an example illustrating my problem, with an M×N matrix (as stored in NumPy below) denoting M different ways of partitioning the same N data points into at most N categories.
>>> partitions
array([[1, 1, 2, 2, 1, 2, 2, 2, 1, 1],
       [1, 2, 2, 1, 2, 1, 2, 2, 2, 1],
       [1, 1, 1, 2, 2, 2, 1, 3, 3, 2]])
The task is to efficiently count the number of times that any two data points are classified into the same category and store the result in an N×N matrix. In MATLAB this can be accomplished as a one-liner with dummyvar, which creates a dummy column for each category of each partition.
>> dummyvar(partitions)*dummyvar(partitions)'
ans =
3 2 1 1 1 1 1 0 1 2
2 3 2 0 2 0 2 1 2 1
1 2 3 1 1 1 3 2 1 0
1 0 1 3 1 3 1 1 0 2
1 2 1 1 3 1 1 1 2 2
1 0 1 3 1 3 1 1 0 2
1 2 3 1 1 1 3 2 1 0
0 1 2 1 1 1 2 3 2 0
1 2 1 0 2 0 1 2 3 1
2 1 0 2 2 2 0 0 1 3
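For intuition, here is a rough NumPy sketch (my own illustration, not MATLAB's implementation) of what dummyvar amounts to for a single partition row, assuming category labels start at 1; row and dummies are just hypothetical names:

import numpy as np

# One 0/1 indicator column per category of a single partition row.
row = np.array([1, 1, 2, 2, 1, 2, 2, 2, 1, 1])
dummies = (row[:, None] == np.arange(1, row.max() + 1)[None, :]).astype(int)
# dummies has shape (10, 2): column 0 marks membership in category 1,
# column 1 marks membership in category 2.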
The most efficient way that I can think of to solve this task is to write an O(n*m) loop that emulates dummyvar's behavior. (Note that the code below prefers partitions.shape[0] << partitions.shape[1], which is likely to be true in general but is unsafe to assume.)
import numpy as np

# Build one indicator row per (partition, category) pair, then count
# co-classifications with a single dot product.
dv = np.zeros((0, partitions.shape[1]))
for row in partitions:
    for val in range(1, np.max(row) + 1):
        dv = np.vstack((dv, row == val))
np.dot(dv.T, dv)
And of course, because calling vstack in a loop is very inefficient, this can be improved by computing the required size up front and preallocating the array, but I am really looking for a one-liner that does it just as in MATLAB.
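For completeness, here is a rough sketch of that preallocated version (n_dummies and co_occurrence are names introduced only for illustration); it assumes the labels in each row run from 1 up to that row's maximum:

import numpy as np

# Size the indicator matrix once, fill it in place, then take one dot product.
n_dummies = int(np.sum(partitions.max(axis=1)))  # one row per category per partition
dv = np.zeros((n_dummies, partitions.shape[1]), dtype=int)
i = 0
for row in partitions:
    for val in range(1, row.max() + 1):
        dv[i] = (row == val)
        i += 1
co_occurrence = np.dot(dv.T, dv)  # N x N co-classification counts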
Edit: Some more information about what I am doing, just for added context. I am writing functions in Python for a library for analyzing brain networks, for which no Python implementation exists. Existing, working MATLAB source is available. Due to domain-specific constraints, the input size is roughly bounded at networks of a few thousand nodes. However, basically all of the functions I write have to scale well to large inputs.