I have data that is a 50:50 mix of a normal distribution and a constant value:
numdata = 10000
data = np.random.normal(0.0,1.0,numdata).astype(np.float32)
data[int(numdata/2):] = 0.0
plt.hist(data,30,density=True)
My task is to fit a mixture density to that data. I'm using a tfd.Mixture of tfd.Normal and tfd.Deterministic. The known ratio of Normal to Deterministic in the sample data is 0.5, but my MCMC instead returns a ratio of about 0.83 in favor of the Normal component.
Is there a better way to fit this distribution with the correct ratio?
Here is a complete sample code:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
import tensorflow as tf
import tensorflow_probability as tfp
import matplotlib.pyplot as plt
tfd = tfp.distributions
tfb = tfp.bijectors
import numpy as np
from time import time
numdata = 10000
data = np.random.normal(0.0,1.0,numdata).astype(np.float32)
data[int(numdata/2):] = 0.0
_=plt.hist(data,30,density=True)
root = tfd.JointDistributionCoroutine.Root
def dist_fn(rv_p, rv_mu):
    # Mixture of a Normal(mu, 1) and a point mass at zero, weighted by p.
    rv_cat = tfd.Categorical(probs=tf.stack([rv_p, 1.-rv_p], -1))
    rv_norm = tfd.Normal(rv_mu, 1.0)
    rv_zero = tfd.Deterministic(tf.zeros_like(rv_mu))
    rv_mix = tfd.Independent(
        tfd.Mixture(cat=rv_cat,
                    components=[rv_norm, rv_zero]),
        reinterpreted_batch_ndims=1)
    return rv_mix

def model_fn():
    # Priors on the mixture weight p and the Normal location mu.
    rv_p = yield root(tfd.Sample(tfd.Uniform(0.0, 1.0), 1))
    rv_mu = yield root(tfd.Sample(tfd.Uniform(-1., 1.), 1))
    rv_mix = yield dist_fn(rv_p, rv_mu)
jd = tfd.JointDistributionCoroutine(model_fn)
unnormalized_posterior_log_prob = lambda *args: jd.log_prob(args + (data,))
n_chains = 1
p_init = [0.3]
p_init = tf.cast(p_init,dtype=tf.float32)
mu_init = 0.1
mu_init = tf.stack([mu_init]*n_chains,axis=0)
initial_chain_state = [
    p_init,
    mu_init,
]
bijectors = [
    tfb.Sigmoid(),   # p lives in (0, 1)
    tfb.Identity(),  # mu is unconstrained
]
step_size = 0.01
num_results = 50000
num_burnin_steps = 50000
kernel = tfp.mcmc.TransformedTransitionKernel(
    inner_kernel=tfp.mcmc.HamiltonianMonteCarlo(
        target_log_prob_fn=unnormalized_posterior_log_prob,
        num_leapfrog_steps=2,
        step_size=step_size,
        state_gradients_are_stopped=True),
    bijector=bijectors)
kernel = tfp.mcmc.SimpleStepSizeAdaptation(
    inner_kernel=kernel, num_adaptation_steps=int(num_burnin_steps * 0.8))
# XLA optimization
@tf.function(autograph=False, experimental_compile=True)
def graph_sample_chain(*args, **kwargs):
    return tfp.mcmc.sample_chain(*args, **kwargs)
st = time()
trace, stats = graph_sample_chain(
    num_results=num_results,
    num_burnin_steps=num_burnin_steps,
    current_state=initial_chain_state,
    kernel=kernel)
et = time()
print(et-st)
ptrace, mutrace = trace
plt.subplot(121)
_=plt.hist(ptrace.numpy(),100,density=True)
plt.subplot(122)
_=plt.hist(mutrace.numpy(),100,density=True)
print(np.mean(ptrace),np.mean(mutrace))
The resulting distributions of p and mu are like this:
Obviously, the mean of p should be 0.5. I suspect there might be something wrong with model_fn(). I tried evaluating the log_prob of the model at different p values, and indeed the "optimal" value is around 0.83. I just don't understand why, or how to fix it so that the original mixture is reconstructed.
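A check along those lines can be done with something like the following (a minimal sketch reusing jd and data from the code above; mu is simply held fixed at 0.0 here, which is only a convenient guess):
# Scan the joint log-prob over a grid of p values, holding mu fixed at 0.0.
p_grid = np.linspace(0.01, 0.99, 99, dtype=np.float32)
lp = np.array([jd.log_prob((np.array([p], np.float32),    # rv_p, shape [1]
                            np.array([0.0], np.float32),  # rv_mu, shape [1]
                            data)).numpy()
               for p in p_grid])
print(p_grid[lp.argmax()])  # lands around 0.83, not 0.5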
[EDIT] A "simpler" demo code with pymc3. Still the same behavior, result is 0.83 instead of 0.5
import pymc3 as pm
import numpy as np
import arviz as az
import matplotlib.pyplot as plt
numdata = 1000
data1 = np.random.normal(0.0,1.0,numdata).astype(np.float32)
data2 = np.zeros(numdata).astype(np.float32)
data = np.concatenate((data1,data2))
_=plt.hist(data,30,density=True)
with pm.Model() as model:
    norm = pm.Normal.dist(0.0, 1.0)
    zero = pm.Constant.dist(0.0)
    components = [norm, zero]
    w = pm.Dirichlet('p', a=np.array([1, 1]))  # two mixture component weights
    like = pm.Mixture('data', w=w, comp_dists=components, observed=data)
    posterior = pm.sample()
idata = az.from_pymc3(posterior)
az.plot_posterior(posterior)
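The weight can also be read off numerically rather than from the plot (one extra line, assuming the idata object built above):
print(az.summary(idata, var_names=["p"]))  # mean of p[0] comes out around 0.83 rather than 0.5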