I have labeled 2D data. There are 4 labels in the set, and I know which label every point belongs to. Given a new, arbitrary data point, I'd like to find the probability that it has each of the 4 labels. It must belong to one and only one of the labels, so the probabilities should sum to 1.

What I've done so far is train 4 independent sklearn GMMs (sklearn.mixture.GaussianMixture) on the data points associated with each label. Note that I do not want to train a single GMM with 4 components, because I already know the labels and don't want to re-cluster in a way that is worse than my known labels. (It appears there is a way to pass y= labels to the fit() function, but I can't seem to get it to have any effect.)
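Here is a minimal sketch of that setup. The data-generation lines are only a toy stand-in for my real points, and the variable names (X, y, models) are illustrative:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)

    # Toy stand-in for the labeled 2D data: four Gaussian blobs, one per label.
    centers = np.array([[0, 0], [6, 0], [0, 6], [6, 6]])
    X = np.vstack([rng.normal(c, 1.0, size=(100, 2)) for c in centers])
    y = np.repeat(np.arange(4), 100)

    # One single-component GMM per label, fitted only on that label's points,
    # so the known label assignment is never re-clustered.
    models = {
        label: GaussianMixture(n_components=1, covariance_type="full").fit(X[y == label])
        for label in np.unique(y)
    }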

[Figure: the labeled 2D data points with density contours of the four fitted GMMs]

In the above plot, points are colored by their known labels, and the contours represent the four independent GMMs fitted to these 4 sets of points.

For a new point, I attempted to compute the probability of its label in a couple of ways:

  1. GaussianMixture.predict_proba(): Since each independent GMM has only one component, this simply returns a probability of 1 for every model.

  2. GaussianMixture.score_samples(): According to the documentation, this returns the "weighted log probabilities for each sample". My procedure is, for a single new point, to call this function once on each of the four independently trained GMMs representing the distributions above (see the short sketch below). I do get semi-sensible results here: typically a positive number from the correct model and negative numbers from each of the three incorrect models, with more muddled results for points near intersecting distribution boundaries. Here's a typical clear-cut result:

2.904136, -60.881554, -20.824841, -30.658509

This point is actually associated with the first label and is least likely to belong to the second label (it is farthest from the second distribution). My question is: how do I convert the above scores into probabilities that sum to 1 and accurately represent the chance that the given point belongs to one and only one of the four distributions? Given that these are 4 independent models, is this possible? If not, is there another method I have overlooked that would let me train GMM(s) based on known labels and obtain probabilities that sum to 1?
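For reference, this is the scoring loop I mean, continuing from the fitting sketch above (models is the per-label dict of fitted GMMs, and the query point coordinates are just illustrative):

    import numpy as np

    # `models` is the per-label dict of fitted GaussianMixture objects from the
    # earlier sketch. score_samples expects a 2D array of shape (n_samples, 2).
    new_point = np.array([[0.5, -0.3]])

    # score_samples returns the log density log p(x | model) for each sample,
    # so this collects one log-likelihood per label for the new point.
    log_likelihoods = np.array(
        [models[label].score_samples(new_point)[0] for label in sorted(models)]
    )
    print(log_likelihoods)  # four scores, analogous to the example above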

1 Answer

In general, if you don't know how the scores are calculated but you do know that there is a monotonic relationship between the scores and the probability, you can use the softmax function to turn the scores into a probability distribution, with an optional temperature parameter that controls how peaked the distribution is.

Let V be a NumPy array of your scores and tau the temperature. Then,

p = np.exp(V/tau) / np.sum(np.exp(V/tau))

is your answer.

PS: Luckily, we do know how sklearn's GMM scoring works: score_samples returns the log of the density p(x | model), so exponentiating and normalizing, i.e. softmax with tau=1, gives exactly the posterior probability of each label (assuming equal class priors).
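For completeness, here is a minimal sketch of that calculation, applied to the example scores from the question; subtracting the maximum before exponentiating is just a standard numerical-stability trick and does not change the result:

    import numpy as np

    def softmax(scores, tau=1.0):
        # Subtracting the max before exponentiating avoids overflow/underflow
        # without changing the normalized result.
        z = np.asarray(scores, dtype=float) / tau
        z = z - z.max()
        p = np.exp(z)
        return p / p.sum()

    # The example log-likelihood scores from the question:
    scores = [2.904136, -60.881554, -20.824841, -30.658509]
    print(softmax(scores))  # puts essentially all of the probability on the first label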