I have a number (a lot, actually) of binary-encoded vectors like:
[0, 1, 0, 0, 1, 0] # but with many more elements in each one
and they are all stored into a numpy (2D) array like:
[
[0, 1, 0, 0, 1, 0],
[0, 0, 1, 0, 0, 1],
[0, 1, 0, 0, 1, 0],
]
I want to get a frequency table of the label sets. For this example, the frequency table would be:
[2, 1]
because the first label set appears twice and the second one just once.
In other words, I want something like scipy.stats.itemfreq or numpy's histogram, but operating on whole rows instead of single elements.
Now I have the following code implemented:
import numpy as np

def get_label_set_freq_table(labels):
    uniques = np.zeros_like(labels)       # distinct rows seen so far
    freq_table = np.zeros(labels.shape[0], dtype=int)
    n_uniques = 0                         # how many rows of `uniques` are filled
    for row in labels:
        for lbl_idx in range(n_uniques):
            if np.array_equal(row, uniques[lbl_idx]):
                freq_table[lbl_idx] += 1
                break
        else:                             # no match: register a new unique row
            uniques[n_uniques] = row
            freq_table[n_uniques] = 1
            n_uniques += 1
    return freq_table[:n_uniques]
where labels is the 2D numpy array of binary-encoded vectors.
It works, but it is extremely slow when the number of vectors is large (>58,000) and the number of elements in each one is also large (>8,000).
How can this be done in a more efficient way?
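For reference, one direction I have considered (a sketch, assuming NumPy >= 1.13, which added the axis argument to np.unique): np.unique can collapse duplicate rows and count them in a single vectorized call. Note that it returns the unique rows in lexicographic order, so return_index plus an argsort is needed if the counts should follow first-appearance order, as in my example.

```python
import numpy as np

labels = np.array([
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
])

# Collapse duplicate rows and count them in one call.
uniques, first_idx, counts = np.unique(
    labels, axis=0, return_index=True, return_counts=True
)

# np.unique sorts the rows lexicographically; reorder the counts so they
# follow the order in which each label set first appears in `labels`.
order = np.argsort(first_idx)
freq_table = counts[order]  # -> array([2, 1]) for this example
```

I am not sure how this scales with ~58,000 rows of ~8,000 elements each, which is why I am asking for the most efficient approach.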