Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

STAT 238 - Bayesian Statistics Lecture Twenty Nine

Spring 2026, UC Berkeley

Gibbs Sampler for Mixture Models

Known variances and proportion

We observe real-valued data y1,,yny_1, \dots, y_n and consider the model:

yii.i.d(1w)N(μ0,1)+wN(μ1,1)\begin{align} y_i \overset{\text{i.i.d}}{\sim} (1 - w) N(\mu_0, 1) + w N(\mu_1, 1) \end{align}

with ww known and fixed (e.g., w=0.3w = 0.3 or 0.7 as in the simulations) and μ0,μ1\mu_0, \mu_1 being unknown. We will describe the Gibbs sampler algorithm in this setting first, and then generalize it to the case where ww is unknown and we have variance parameters σ02\sigma_0^2 and σ12\sigma_1^2.

The log-likelihood corresponding to (1) is:

i=1nlog(pϕ(yi,μ1,1)+(1p)ϕ(yi,μ2,1)),\begin{align*} \sum_{i=1}^n \log \left(p \phi(y_i, \mu_1, 1) + (1 - p) \phi(y_i, \mu_2, 1) \right), \end{align*}

where ϕ(x,μ,σ2)\phi(x, \mu, \sigma^2) is the normal density with mean μ\mu, variance σ2\sigma^2 evaluated at xx.

We will take a standard prior for μ0,μ1\mu_0, \mu_1 e.g., μ0,μ1i.i.dN(0,C)\mu_0, \mu_1 \overset{\text{i.i.d}}{\sim} N(0, C) for a large CC. The posterior of μ0,μ1\mu_0, \mu_1 is then:

π(μ0,μ1data)ϕ(μ0,0,C)ϕ(μ1,0,C)i=1n((1w)ϕ(yi,μ0,1)+wϕ(yi,μ1,1)).\begin{align*} \pi(\mu_0, \mu_1 \mid \text{data}) \propto \phi(\mu_0, 0, C) \phi(\mu_1, 0, C) \prod_{i=1}^n \left((1 - w) \phi(y_i, \mu_0, 1) + w \phi(y_i, \mu_1, 1) \right). \end{align*}

This posterior cannot be evaluated in closed form and numerical methods need to be used. A standard approach is to use the Gibbs sampler with augmentation. First observe that the model (1) can be rewritten in the following way:

zii.i.dBernoulli(w)   and   yizi=1N(μ1,1)   and   yizi=0N(μ0,1).\begin{align*} z_i \overset{\text{i.i.d}}{\sim} \text{Bernoulli}(w) ~~ \text{ and } ~~ y_i \mid z_i = 1 \sim N(\mu_1, 1) ~~ \text{ and } ~~ y_i \mid z_i = 0 \sim N(\mu_0, 1). \end{align*}

It should be clear that, under the above model, the marginal distribution of yiy_i coincides with (1). z1,,znz_1, \dots, z_n can be thought of as unobserved latent variables which represent which of the two populations (corresponding to the distributions N(μ0,1)N(\mu_0, 1) and N(μ1,1)N(\mu_1, 1) respectively) the observation yiy_i comes from.

Gibbs sampler is implemented for jointly sampling from the posterior of μ0,μ1,z1,,zn\mu_0, \mu_1, z_1, \dots, z_n given the data. This requires being able to sample from the full conditionals

zμ0,μ1,y   and   (μ0,μ1)z,y.\begin{align*} z \mid \mu_0, \mu_1, y ~~ \text{ and } ~~ (\mu_0, \mu_1) \mid z, y. \end{align*}

where y=(y1,,yn)y = (y_1, \dots, y_n) is the data and z=(z1,,zn)z = (z_1, \dots, z_n). It is easy to see that these full conditionals can be written in closed form as follows. Given μ0,μ1,y1,,yn\mu_0, \mu_1, y_1, \dots, y_n, the variables z1,,znz_1, \dots, z_n are independent with

ziμ0,μ1,yBernoulli(wϕ(yi,μ1,1)wϕ(yi,μ1,1)+(1w)ϕ(yi,μ0,1)).\begin{align*} z_i \mid \mu_0, \mu_1, y \sim \text{Bernoulli} \left(\frac{w \phi(y_i, \mu_1, 1)}{w \phi(y_i, \mu_1, 1) + (1 - w) \phi(y_i, \mu_0, 1)}\right). \end{align*}

On the other hand, given z1,,zn,y1,,ynz_1, \dots, z_n, y_1, \dots, y_n, the variables μ0,μ1\mu_0, \mu_1 are independent with

μ1z,yN(i=1nyizin1+(1/C),1n1+(1/C)) and μ0z,yN(i=1nyi(1zi)n0+(1/C),1n0+(1/C))\begin{align*} \mu_1 \mid z, y \sim N \left( \frac{\sum_{i=1}^n y_i z_i}{n_1 + (1/C)}, \frac{1}{n_1 + (1/C)} \right) \text{ and } \mu_0 \mid z, y \sim N \left( \frac{\sum_{i=1}^n y_i (1-z_i)}{n_0 + (1/C)}, \frac{1}{n_0 + (1/C)} \right) \end{align*}

where n0n_0 is the number of ziz_i’s that are equal to 0, and n1n_1 is the number of ziz_i’s that are equal to 1. Note that in the limit as CC \rightarrow \infty, we can write

μ1z,yN(i=1nyizii=1nzi,1i=1nzi)   and   μ0z,yN(i=1nyi(1zi)i=1n(1zi),1i=1n(1zi))\begin{align*} \mu_1 \mid z, y \sim N \left(\frac{\sum_{i=1}^n y_i z_i}{\sum_{i=1}^n z_i}, \frac{1}{\sum_{i=1}^n z_i} \right) ~~ \text{ and } ~~ \mu_0 \mid z, y \sim N \left(\frac{\sum_{i=1}^n y_i (1 - z_i)}{\sum_{i=1}^n (1 -z_i)}, \frac{1}{\sum_{i=1}^n (1 - z_i)} \right) \end{align*}

Based on these full conditional distributions, the Gibbs sampler algorithm takes the following form:

  1. Initialize μ0(0),μ1(0)\mu_0^{(0)}, \mu_1^{(0)}.

  2. Repeat the following for t=0,1,2,t = 0, 1, 2, \dots:

    1. Generate z1(t),,zn(t)z_1^{(t)}, \dots, z_n^{(t)} via:

      zi(t)Bernoulli(wϕ(yi,μ1(t),1)wϕ(yi,μ1(t),1)+(1w)ϕ(yi,μ0(t),1)).\begin{align*} z_i^{(t)} \sim \text{Bernoulli} \left(\frac{w \phi(y_i, \mu^{(t)}_1, 1)}{w \phi(y_i, \mu^{(t)}_1, 1) + (1 - w) \phi(y_i, \mu^{(t)}_0, 1)}\right). \end{align*}
    2. Generate μ0(t+1),μ1(t+1)\mu_0^{(t+1)}, \mu_1^{(t+1)} via:

      μ1(t+1)N(i=1nyizi(t)i=1nzi(t),1i=1nzi(t))μ0(t+1)N(i=1nyi(1zi(t))i=1n(1zi(t)),1i=1n(1zi(t))).\begin{align*} & \mu^{(t+1)}_1 \sim N \left(\frac{\sum_{i=1}^n y_i z^{(t)}_i}{\sum_{i=1}^n z^{(t)}_i}, \frac{1}{\sum_{i=1}^n z^{(t)}_i} \right) \\ & \mu^{(t+1)}_0 \sim N \left(\frac{\sum_{i=1}^n y_i (1 - z^{(t)}_i)}{\sum_{i=1}^n (1 -z^{(t)}_i)}, \frac{1}{\sum_{i=1}^n (1 - z^{(t)}_i)} \right). \end{align*}

As we saw in the simulations in the last couple of lectures, initialization is very important. The log-likelihood can have multiple modes only one of which is the correct mode (in the sense of having large likelihood). If the Gibbs sampler is not properly initialized, the algorithm would sample from a spurious peak.

Unknown variances and proportion

Now consider the model:

yii.i.d(1w)N(μ0,σ02)+wN(μ1,σ12).\begin{align*} y_i \overset{\text{i.i.d}}{\sim} (1 - w) N(\mu_0, \sigma_0^2) + w N(\mu_1, \sigma_1^2). \end{align*}

The unknown parameters are now w,μ0,μ1,σ02,σ12w, \mu_0, \mu_1, \sigma^2_0, \sigma_1^2. We place the following independent priors on these variables:

wBeta(a,b)   and   μ0,μ1N(m,s2)   and   σ02,σ12IG(α,β).\begin{align} w \sim \text{Beta}(a, b)~~ \text{ and } ~~ \mu_0, \mu_1 \sim N(m, s^2) ~~ \text{ and } ~~ \sigma_0^2, \sigma_1^2 \sim IG(\alpha, \beta). \end{align}

The proportion parameter ww is constrained to lie in [0,1][0, 1] so it is natural to take the Beta prior for it. A standard choice here is a=b=1a = b = 1 (which corresponds to the uniform prior). For μ0,μ1\mu_0, \mu_1, we are using the N(m,s2)N(m, s^2) prior. Standard choice is m=0m = 0 and ss very large. The Inverse Gamma prior IG(α,β)IG(\alpha, \beta) corresponds to the density:

xα1exp(β/x)I{x>0}.\begin{align*} \propto x^{-\alpha - 1} \exp(-\beta/x) I\{x > 0\}. \end{align*}

The standard uninformative prior for σ\sigma is σ1I{σ>0}\propto \sigma^{-1} I\{\sigma > 0\}. The corresponding prior density for σ2\sigma^2 is also x1I{x>0}\propto x^{-1} I\{x > 0\}. This is a special case of IG(α,β)IG(\alpha, \beta) corresponding to α=β=0\alpha = \beta = 0.

The prior (12) is therefore uses conjugate families while including the standard uninformative choices as special cases. The reason for conjugacy is that the full conditional distributions corresponding to the posterior distribution can be written in closed form. We shall look at the formulae in the next lecture.