
I'm studying Gaussian Mixture Models and came across this code, which draws a number of samples from two bivariate Gaussian distributions. What I don't understand is the technique used in the code:

import numpy as np

# Number of samples per component
n_samples = 500

# Generate random sample, two components
np.random.seed(0)
C = np.array([[0., -0.1], [1.7, .4]])
X = np.r_[np.dot(np.random.randn(n_samples, 2), C),
          .7 * np.random.randn(n_samples, 2) + np.array([-6, 3])]

(Original link: http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_selection.html#sphx-glr-auto-examples-mixture-plot-gmm-selection-py)

According to this Wikipedia link, we can generate multivariate Gaussian samples by Cholesky-decomposing the covariance matrix and then multiplying the resulting factor by a vector of components drawn from a standard normal distribution.
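To make sure I follow, here is how I understand that recipe in code (a minimal sketch with a made-up mean and covariance):

import numpy as np

np.random.seed(0)
mean = np.array([0., 0.])            # example mean (made up)
cov = np.array([[2.0, 0.6],
                [0.6, 1.0]])         # example covariance (made up)

L = np.linalg.cholesky(cov)          # lower triangular, cov == L.dot(L.T)
z = np.random.randn(2)               # vector of standard normal components
x = mean + L.dot(z)                  # one sample with covariance cov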

My question is: the C variable in the code is not a lower triangular matrix, so how does it fit into this recipe for generating multivariate Gaussian samples?

Not an answer, but just FYI: NumPy has numpy.random.multivariate_normal for generating samples from the multivariate normal distribution. - Warren Weckesser
@WarrenWeckesser Thank you. Yes, I know about numpy.random.multivariate_normal. What I want here is to understand the theory behind the code. - Anh Tuan

1 Answer


X is a mixture of two bivariate normal distributions. Half the samples are computed with np.dot(np.random.randn(n_samples, 2), C), where C = np.array([[0., -0.1], [1.7, .4]]). Each row z of np.random.randn(n_samples, 2) is a standard normal vector with identity covariance, so the transformed row z.dot(C) has covariance C.T.dot(C). That is, you could generate a sample from the same distribution by using np.random.multivariate_normal([0, 0], C.T.dot(C), n_samples).
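As a quick numerical sanity check, the sample covariance of a large number of such draws should be close to C.T.dot(C) (a minimal sketch; the sample size here is arbitrary):

import numpy as np

np.random.seed(0)
C = np.array([[0., -0.1], [1.7, .4]])

Z = np.random.randn(200000, 2)       # rows are standard normal vectors
X = Z.dot(C)                         # transformed samples

print(np.cov(X, rowvar=False))       # sample covariance of the draws
print(C.T.dot(C))                    # theoretical covariance; nearly equal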

See these notes that I wrote some time ago: "Correlated Random Samples". (In those notes, the 3x3 matrix C is multiplied on the right by a sample with shape (3, num_samples). In other words, those notes use the transpose of the formula used here, so there the covariance matrix is C.dot(C.T).) C doesn't have to be lower triangular. But commonly you are given a covariance matrix and want to find C, and if you use the Cholesky decomposition to do that, the factor is lower triangular by construction. (With the row-vector convention of the code in the question, where the covariance is C.T.dot(C), the Cholesky factor L enters as C = L.T, which is upper triangular.)
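In fact, any matrix C for which C.T.dot(C) equals the desired covariance will do, since multiplying C on the left by an orthogonal matrix Q leaves C.T.dot(C) unchanged: (Q C).T (Q C) = C.T C. A short sketch illustrating this (the rotation angle is arbitrary):

import numpy as np

C = np.array([[0., -0.1], [1.7, .4]])
cov = C.T.dot(C)                      # target covariance

L = np.linalg.cholesky(cov)           # lower triangular factor
print(np.allclose(L.dot(L.T), cov))   # True

theta = 0.5                           # arbitrary rotation angle
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
C2 = Q.dot(C)                         # not triangular...
print(np.allclose(C2.T.dot(C2), cov)) # ...but it gives the same covariance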

This IPython session (run in pylab mode, so plot is available; n_samples is 500, as in the question) demonstrates that the two methods generate samples from the same distribution:

In [60]: C = np.array([[0., -0.1], [1.7, .4]])

In [61]: X1 = np.dot(np.random.randn(n_samples, 2), C)

In [62]: X2 = np.random.multivariate_normal([0, 0], C.T.dot(C), n_samples)

In [63]: plot(X1[:,0], X1[:,1], 'g*', alpha=0.2)
Out[63]: [<matplotlib.lines.Line2D at 0x113c17550>]

In [64]: plot(X2[:,0], X2[:,1], 'ko', alpha=0.2, ms=4)
Out[64]: [<matplotlib.lines.Line2D at 0x113c3ba58>]

[Plot: the green stars (X1) and the black dots (X2) overlap, showing that the two samples follow the same distribution.]