
I have a large number of binary-encoded vectors like:

[0, 1, 0, 0, 1, 0]  # but with many more elements in each one

They are all stored in a 2D NumPy array like:

[
 [0, 1, 0, 0, 1, 0],
 [0, 0, 1, 0, 0, 1],
 [0, 1, 0, 0, 1, 0],
]

I want to get a frequency table of each label set. So, in this example, the frequency table will be:

[2,1] 

Because the 1st label set has two appearances and the 2nd label set just one.

In other words, I want the equivalent of itemfreq from SciPy or histogram from NumPy, but counting whole rows (lists) instead of single elements.

I currently have the following implementation:

import numpy as np

def get_label_set_freq_table(labels):
    uniques = []     # distinct rows seen so far, in order of first appearance
    freq_table = []  # freq_table[i] counts the occurrences of uniques[i]

    for row in labels:
        for lbl_idx, label_set in enumerate(uniques):
            if np.array_equal(row, label_set):
                freq_table[lbl_idx] += 1
                break
        else:
            # No match found: this row is a new label set.
            uniques.append(row)
            freq_table.append(1)

    return np.array(freq_table)

where labels is the 2D array of binary-encoded vectors.

It works, but it's extremely slow when the number of vectors is large (>58,000) and the number of elements in each one is also large (>8,000).

How can this be done in a more efficient way?

That doesn't look one-hot to me. - Divakar
You are right, I'll edit the question to say "binary" vectors. Thanks. Also, @Divakar is right with the same observation. - Alber8295

1 Answer


I am assuming you meant an array with 1s and 0s only. For those, we can reduce each row to a scalar by weighting column j with 2**j (binary scaling) and then use np.unique -

In [52]: a
Out[52]: 
array([[0, 1, 0, 0, 1, 0],
       [0, 0, 1, 0, 0, 1],
       [0, 1, 0, 0, 1, 0]])

In [53]: s = 2**np.arange(a.shape[1])

In [54]: a1D = a.dot(s)

In [55]: _, start, count = np.unique(a1D, return_index=True, return_counts=True)

In [56]: a[start]
Out[56]: 
array([[0, 1, 0, 0, 1, 0],
       [0, 0, 1, 0, 0, 1]])

In [57]: count
Out[57]: array([2, 1])
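
One caveat with binary scaling: each row has to fit into a single integer, so with int64 weights it overflows once there are more than 63 columns. For wider rows, one possible workaround is to pack each row's bits into bytes with np.packbits and count the unique packed rows. This is only a sketch, not the method from the session above: the helper name count_rows_packed is hypothetical, and the input is assumed to hold only 0s and 1s.

```python
import numpy as np

def count_rows_packed(a):
    """Count duplicate rows of a 0/1 array by comparing bit-packed rows.

    Hypothetical helper: assumes `a` is 2D and contains only 0s and 1s.
    """
    a = np.asarray(a)
    # Pack every 8 columns into one uint8, so each row becomes 8x shorter.
    packed = np.packbits(a.astype(np.uint8), axis=1)
    # Unique packed rows; `start` is the first row index of each group.
    _, start, count = np.unique(
        packed, axis=0, return_index=True, return_counts=True
    )
    return a[start], count

a = np.array([[0, 1, 0, 0, 1, 0],
              [0, 0, 1, 0, 0, 1],
              [0, 1, 0, 0, 1, 0]])
rows, count = count_rows_packed(a)
```

The rows come back sorted by their packed bytes rather than in order of appearance, so for this input count pairs with rows, not with the original row order.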

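Here's a generalized version that works for arbitrary values and any number of columns (it needs NumPy 1.13 or newer for the axis argument) -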

In [33]: unq_rows, freq = np.unique(a, axis=0, return_counts=True)

In [34]: unq_rows
Out[34]: 
array([[0, 0, 1, 0, 0, 1],
       [0, 1, 0, 0, 1, 0]])

In [35]: freq
Out[35]: array([1, 2])
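
Note that np.unique returns the rows in sorted order, which is why the counts come out as [1, 2] rather than the [2, 1] asked for in the question. If order of first appearance matters, one way to recover it is to also request return_index and reorder by it. This is a minimal sketch, and the helper name row_freq_first_seen is hypothetical:

```python
import numpy as np

def row_freq_first_seen(a):
    """Unique rows and their counts, ordered by first appearance in `a`."""
    unq_rows, first_idx, freq = np.unique(
        a, axis=0, return_index=True, return_counts=True
    )
    # first_idx[i] is the row index where unq_rows[i] first occurs;
    # sorting by it restores the order in which rows were first seen.
    order = np.argsort(first_idx)
    return unq_rows[order], freq[order]

a = np.array([[0, 1, 0, 0, 1, 0],
              [0, 0, 1, 0, 0, 1],
              [0, 1, 0, 0, 1, 0]])
unq_rows, freq = row_freq_first_seen(a)
```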