To simplify my question, I have created a dummy problem: I have two sets of training data, labelled 1 and 2 respectively. Both training datasets are assumed to follow a mixture-of-Gaussians (MoG) distribution. I can easily use the MATLAB toolbox function gmdistribution.fit to estimate their means and covariances.

Then I have a testing dataset that is assumed to be generated by an MoG similar to that of training dataset 2, but with noise. I would like to calculate something like a likelihood that my testing dataset was generated by the MoG of training dataset 2 rather than that of training dataset 1. In other words, I would like to get the likelihood that my testing dataset has the label 2.

Could you please point me in the right direction? Thanks very much.

N.B.:

  1. The sizes of my two training datasets are different.
  2. The distributions of the two training datasets overlap.
  3. The size of the testing dataset is much smaller than the training datasets.

Some MATLAB code:

%% Mixture of Gaussians 1 (Training set 1)
mean1                                   = [1 -2];
cov1                                    = [2 0; 0 .5];
mean2                                   = [0.5 -5];
cov2                                    = [1 0; 0 1];
trainingDataset1                        = [mvnrnd(mean1, cov1, 1000); mvnrnd(mean2, cov2, 1000)];

MoGOptions                              = statset('Display', 'final');
MoGObj1                                 = gmdistribution.fit(trainingDataset1, 2, 'Options', MoGOptions);

figure,
scatter(trainingDataset1(:,1), trainingDataset1(:,2), 10, '.')
hold on
ezcontour(@(x,y)pdf(MoGObj1,[x y]), [-8 6], [-8 2]);

%% Mixture of Gaussians 2 (Training set 2)
mean4                                   = [0.5 -1];
cov4                                    = [1.5 0; 0 .8];
mean5                                   = [-2 -3];
cov5                                    = [1 0; 0 1];
mean6                                   = [-4 -2];
cov6                                    = [1 0; 0 1];
trainingDataset2                        = [mvnrnd(mean4, cov4, 500); mvnrnd(mean5, cov5, 500); mvnrnd(mean6, cov6, 500)];

MoGOptions                              = statset('Display', 'final');
% Training set 2 is drawn from three Gaussians, so fit three components
MoGObj2                                 = gmdistribution.fit(trainingDataset2, 3, 'Options', MoGOptions);

figure,
scatter(trainingDataset2(:,1), trainingDataset2(:,2), 10, '.')
hold on
ezcontour(@(x,y)pdf(MoGObj2,[x y]), [-8 6], [-8 2]);

%% Test set
mean7                                   = [1.1 -2.1];
cov7                                    = [2.2 0; 0 .4];
mean8                                   = [0.3 -5.4];
cov8                                    = [1.2 0; 0 1.1];
testingDataset1                         = [mvnrnd(mean7, cov7, 100); mvnrnd(mean8, cov8, 100)];

figure,
scatter(testingDataset1(:,1), testingDataset1(:,2), 10, '.')

1 Answer

I find AIC and BIC to be very convenient.

Try "struct(MoGObj2)" to get an idea of the fields in your fit.

One of them is NlogL, which is the negative of the log-likelihood. I think this is what you are looking for.

http://www.mathworks.com/help/stats/gmdistributionclass.html
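
Note that NlogL stored in the fitted object refers to the data used for fitting, not to new data. To score the testing dataset, one option is to evaluate it under both fitted mixtures and compare the results. The following is only a rough sketch of that idea: it assumes MoGObj1, MoGObj2 and testingDataset1 from the question are in the workspace, that the test points are independent, and that both labels are equally likely a priori. The variable names logL1, logL2, logLR and pLabel2 are made up for illustration.

```matlab
% Sketch only: assumes MoGObj1, MoGObj2 and testingDataset1 already exist.

% Per-point log-densities of the test data under each fitted mixture
logp1 = log(pdf(MoGObj1, testingDataset1));
logp2 = log(pdf(MoGObj2, testingDataset1));

% Log-likelihood of the whole test set under each model
% (assumption: the test points are independent)
logL1 = sum(logp1);
logL2 = sum(logp2);

% Log-likelihood ratio: positive values favour the MoG of training set 2
logLR = logL2 - logL1;

% Posterior probability of label 2
% (assumption: equal prior probability for labels 1 and 2)
pLabel2 = 1 / (1 + exp(-logLR));
```

Working with sums of logs rather than products of densities avoids numerical underflow. With even a few hundred test points pLabel2 will usually be very close to 0 or 1, so logLR itself is often the more informative number to report.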

Best of luck