STAT 238 - Bayesian Statistics Lecture Thirty
Spring 2026, UC Berkeley
The Gibbs Sampler for Gaussian Mixture Models ¶ We observe real-valued data y 1 , … , y n y_1, \dots, y_n y 1 , … , y n and we consider the model:
y i ∼ i.i.d ∑ j = 1 k w j N ( μ j , σ j 2 ) \begin{align*}
y_i \overset{\text{i.i.d}}{\sim} \sum_{j=1}^k w_j N(\mu_j, \sigma_j^2)
\end{align*} y i ∼ i.i.d j = 1 ∑ k w j N ( μ j , σ j 2 ) with unknown parameters ( w 1 , … , w k ) (w_1, \dots, w_k) ( w 1 , … , w k ) and ( μ j , σ j 2 ) (\mu_j, \sigma_j^2) ( μ j , σ j 2 ) for j = 1 , … , k j = 1, \dots, k j = 1 , … , k . ( w 1 , … , w k ) (w_1, \dots, w_k) ( w 1 , … , w k ) is a probablity vector (i.e., w j ≥ 0 w_j \geq 0 w j ≥ 0 and ∑ j w j = 1 \sum_j w_j = 1 ∑ j w j = 1 ).
We use the following priors:
( w 1 , … , w k ) ∼ Dirichlet ( a 1 , … , a k ) μ 1 , … , μ k ∼ i.i.d N ( m , s 2 ) σ 1 2 , … , σ k 2 ∼ i.i.d I G ( α , β ) . \begin{align*}
& (w_1, \dots, w_k) \sim \text{Dirichlet}(a_1, \dots, a_k) \\
& \mu_1, \dots, \mu_k \overset{\text{i.i.d}}{\sim} N(m, s^2) \\
& \sigma_1^2, \dots, \sigma_k^2 \overset{\text{i.i.d}}{\sim}
IG(\alpha, \beta).
\end{align*} ( w 1 , … , w k ) ∼ Dirichlet ( a 1 , … , a k ) μ 1 , … , μ k ∼ i.i.d N ( m , s 2 ) σ 1 2 , … , σ k 2 ∼ i.i.d I G ( α , β ) . The default choices are a j = 1 a_j = 1 a j = 1 for all j j j , m = 0 m = 0 m = 0 and s 2 s^2 s 2 to be large, and α , β \alpha, \beta α , β to be near zero. Note that the density of the Inverse Gamma distribution I G ( α , β ) IG(\alpha, \beta) I G ( α , β ) is given by x − α − 1 exp ( − β / x ) I { x > 0 } x^{-\alpha - 1} \exp(-\beta/x) I\{x > 0\} x − α − 1 exp ( − β / x ) I { x > 0 } . It is easy to check that this is simply the density of X − 1 X^{-1} X − 1 where X ∼ Gamma ( α , β ) X \sim
\text{Gamma}(\alpha, \beta) X ∼ Gamma ( α , β ) .
To use the Gibbs sampler, we introduce latent variables z 1 , … , z n z_1, \dots,
z_n z 1 , … , z n which represent group memberships. Specifically, z i z_i z i represents which of the k k k Gaussians the observation y i y_i y i is coming from. More precisely, z i z_i z i takes the values 1 , … , k 1, \dots, k 1 , … , k with probabilities w 1 , … , w k w_1, \dots, w_k w 1 , … , w k (i.e., z i ∼ Categorical ( w ) z_i \sim \text{Categorical}(w) z i ∼ Categorical ( w ) ), and
y i ∣ z i = j ∼ N ( μ j , σ j 2 ) . \begin{align*}
y_i \mid z_i = j \sim N(\mu_j, \sigma_j^2).
\end{align*} y i ∣ z i = j ∼ N ( μ j , σ j 2 ) . The Gibbs sampler will then jointly sample from all the unknown variables:
w 1 , … , w k , μ 1 , σ 1 2 , … , μ k , σ k 2 , z 1 , … , z n ∣ y 1 , … , y n \begin{align*}
w_1, \dots, w_k, \mu_1, \sigma_1^2, \dots, \mu_k, \sigma_k^2, z_1,
\dots, z_n \mid y_1, \dots, y_n
\end{align*} w 1 , … , w k , μ 1 , σ 1 2 , … , μ k , σ k 2 , z 1 , … , z n ∣ y 1 , … , y n The full conditionals corresponding to all the variables can be written in closed form as described below. We will use the notation w = ( w 1 , … , w k ) w
= (w_1, \dots, w_k) w = ( w 1 , … , w k ) , μ = ( μ 1 , … , μ k ) \mu = (\mu_1, \dots, \mu_k) μ = ( μ 1 , … , μ k ) and σ = ( σ 1 , … , σ k ) \sigma =
(\sigma_1, \dots, \sigma_k) σ = ( σ 1 , … , σ k ) , also y = ( y 1 , … , y n ) y = (y_1, \dots, y_n) y = ( y 1 , … , y n ) and z = ( z 1 , … , z n ) z =
(z_1, \dots, z_n) z = ( z 1 , … , z n ) .
For the full conditional corresponding to z 1 , … , z n z_1, \dots, z_n z 1 , … , z n (i.e., the conditional distribution of z 1 , … , z n z_1, \dots, z_n z 1 , … , z n given y 1 , … , y n y_1, \dots, y_n y 1 , … , y n as well as w 1 , … , w k , μ 1 , σ 1 2 , … , μ k , σ k 2 w_1, \dots, w_k, \mu_1, \sigma_1^2, \dots, \mu_k,
\sigma_k^2 w 1 , … , w k , μ 1 , σ 1 2 , … , μ k , σ k 2 ) is given by:
P { z i = j ∣ y i , w , μ , σ } = r i j : = w j ϕ ( y i , μ j , σ j 2 ) ∑ j = 1 k w j ϕ ( y i , μ j , σ j 2 ) . \begin{align*}
\P \left\{z_i = j \mid y_i, w, \mu, \sigma \right\} = r_{ij} := \frac{w_j
\phi(y_i, \mu_j, \sigma_j^2)}{\sum_{j=1}^k w_j
\phi(y_i, \mu_j, \sigma_j^2)}.
\end{align*} P { z i = j ∣ y i , w , μ , σ } = r ij := ∑ j = 1 k w j ϕ ( y i , μ j , σ j 2 ) w j ϕ ( y i , μ j , σ j 2 ) . In other words,
z i ∣ y i , w , μ , σ ∼ Categorical ( r i ) where r i = ( r i 1 , … , r i k ) . \begin{align*}
z_i \mid y_i, w, \mu, \sigma \sim \text{Categorical}(r_i) ~~
\text{where $r_i = (r_{i1}, \dots, r_{ik})$}.
\end{align*} z i ∣ y i , w , μ , σ ∼ Categorical ( r i ) where r i = ( r i 1 , … , r ik ) . The full conditional of w = ( w 1 , … , w k ) w = (w_1, \dots, w_k) w = ( w 1 , … , w k ) (i.e., the conditional distribution of w w w given z , y , μ , σ z, y, \mu, \sigma z , y , μ , σ ) is:
w ∣ z , y , μ , σ ∼ Dirichlet ( a 1 + n 1 , … , a k + n k ) where n j : = ∑ i = 1 n I { z i = j } . \begin{align*}
w \mid z, y, \mu, \sigma \sim \text{Dirichlet}(a_1 + n_1, \dots, a_k
+ n_k) ~~ \text{where $n_j := \sum_{i=1}^n I\{z_i = j\}$}.
\end{align*} w ∣ z , y , μ , σ ∼ Dirichlet ( a 1 + n 1 , … , a k + n k ) where n j := ∑ i = 1 n I { z i = j } . The full conditional of μ j \mu_j μ j (i.e., the conditional distribution of μ j \mu_j μ j given z , y , σ z, y, \sigma z , y , σ ) is:
μ j ∣ z , y , σ ∼ N ( m / s 2 + ∑ i : z i = j y i / σ j 2 1 / s 2 + n j / σ j 2 , 1 1 / s 2 + n j / σ j 2 ) . \begin{align*}
\mu_j \mid z, y, \sigma \sim N \left(\frac{m/s^2 + \sum_{i : z_i =
j} y_i/\sigma_j^2}{1/s^2 + n_j/\sigma_j^2}, \frac{1}{1/s^2 +
n_j/\sigma_j^2} \right).
\end{align*} μ j ∣ z , y , σ ∼ N ( 1/ s 2 + n j / σ j 2 m / s 2 + ∑ i : z i = j y i / σ j 2 , 1/ s 2 + n j / σ j 2 1 ) . Note that we are also conditioning on σ j \sigma_j σ j above. If we do not condition on σ j \sigma_j σ j , then the conditional distribution of μ j \mu_j μ j will be t t t -distributed and not normal distributed.
The full conditional of σ j 2 \sigma_j^2 σ j 2 (i.e., the conditional distribution of σ j 2 \sigma_j^2 σ j 2 given z , y , μ z, y, \mu z , y , μ ) is:
σ j 2 ∣ z , y , σ ∼ I G ( α + n j 2 , β + 1 2 ∑ i : z i = j ( y i − μ j ) 2 ) , \begin{align*}
\sigma_j^2 \mid z, y, \sigma \sim IG \left(\alpha + \frac{n_j}{2},
\beta + \frac{1}{2} \sum_{i : z_i = j} (y_i - \mu_j)^2 \right),
\end{align*} σ j 2 ∣ z , y , σ ∼ I G ( α + 2 n j , β + 2 1 i : z i = j ∑ ( y i − μ j ) 2 ) , where again n j = ∑ i = 1 n I { z i = j } n_j = \sum_{i=1}^n I\{z_i = j\} n j = ∑ i = 1 n I { z i = j } .
So the Gibbs sampler algorithm is given by:
Initialize the parameters w ( 0 ) , μ ( 0 ) , σ ( 0 ) w^{(0)}, \mu^{(0)}, \sigma^{(0)} w ( 0 ) , μ ( 0 ) , σ ( 0 ) .
Repeat the following for t = 0 , 1 , 2 , … t = 0, 1, 2, \dots t = 0 , 1 , 2 , … :
Generate z 1 ( t ) , … , z n ( t ) z_1^{(t)}, \dots, z_n^{(t)} z 1 ( t ) , … , z n ( t ) via:
z i ( t ) ∼ Categorical ( w j ( t ) ϕ ( y i , μ j ( t ) , ( σ j ( t ) ) 2 ) ∑ j = 1 k w j ( t ) ϕ ( y i , μ j ( t ) , ( σ j ( t ) ) 2 ) , j = 1 , … , k ) \begin{align*}
z_i^{(t)} \sim \text{Categorical} \left(\frac{w_j^{(t)}
\phi \left(y_i, \mu^{(t)}_j, (\sigma_j^{(t)})^2 \right)}{\sum_{j=1}^k w_j^{(t)}
\phi \left(y_i, \mu^{(t)}_j, (\sigma_j^{(t)})^2 \right)}, j = 1, \dots, k \right)
\end{align*} z i ( t ) ∼ Categorical ⎝ ⎛ ∑ j = 1 k w j ( t ) ϕ ( y i , μ j ( t ) , ( σ j ( t ) ) 2 ) w j ( t ) ϕ ( y i , μ j ( t ) , ( σ j ( t ) ) 2 ) , j = 1 , … , k ⎠ ⎞ Calculate
n j ( t ) : = ∑ i = 1 n I { z i ( t ) = j } for j = 1 , … , k . \begin{align*}
n_j^{(t)} := \sum_{i=1}^n I\{z_i^{(t)} = j\} ~~ \text{ for $j =
1, \dots, k$}.
\end{align*} n j ( t ) := i = 1 ∑ n I { z i ( t ) = j } for j = 1 , … , k . Generate w ( t + 1 ) = ( w 1 ( t + 1 ) , … , w k ( t + 1 ) ) w^{(t+1)} = (w_1^{(t+1)}, \dots, w_k^{(t+1)}) w ( t + 1 ) = ( w 1 ( t + 1 ) , … , w k ( t + 1 ) ) via
w ( t + 1 ) ∼ Dirichlet ( a 1 + n 1 ( t ) , … , a k + n k ( t ) ) . \begin{align*}
w^{(t+1)} \sim \text{Dirichlet}(a_1 + n_1^{(t)}, \dots, a_k +
n_k^{(t)}).
\end{align*} w ( t + 1 ) ∼ Dirichlet ( a 1 + n 1 ( t ) , … , a k + n k ( t ) ) . Generate μ 1 ( t + 1 ) , … , μ k ( t + 1 ) \mu_1^{(t+1)}, \dots, \mu_k^{(t+1)} μ 1 ( t + 1 ) , … , μ k ( t + 1 ) via
μ j ( t + 1 ) ∼ N ( m / s 2 + ∑ i : z i ( t ) = j y i / ( σ j ( t ) ) 2 1 / s 2 + n j ( t ) / ( σ j ( t ) ) 2 , 1 1 / s 2 + n j ( t ) / ( σ j ( t ) ) 2 ) . \begin{align*}
\mu_j^{(t+1)} \sim N \left(\frac{m/s^2 + \sum_{i : z^{(t)}_i =
j} y_i/(\sigma_j^{(t)})^2}{1/s^2 + n_j^{(t)}/(\sigma_j^{(t)})^2},
\frac{1}{1/s^2 +
n^{(t)}_j/(\sigma_j^{(t)})^2} \right).
\end{align*} μ j ( t + 1 ) ∼ N ⎝ ⎛ 1/ s 2 + n j ( t ) / ( σ j ( t ) ) 2 m / s 2 + ∑ i : z i ( t ) = j y i / ( σ j ( t ) ) 2 , 1/ s 2 + n j ( t ) / ( σ j ( t ) ) 2 1 ⎠ ⎞ . Generate σ 1 ( t + 1 ) , … , σ k ( t + 1 ) \sigma_1^{(t+1)}, \dots, \sigma_k^{(t+1)} σ 1 ( t + 1 ) , … , σ k ( t + 1 ) via
( σ j 2 ) ( t + 1 ) ∼ I G ( α + n j ( t ) 2 , β + 1 2 ∑ i : z i ( t ) = j ( y i − μ j ( t + 1 ) ) 2 ) , \begin{align*}
(\sigma_j^2)^{(t+1)} \sim IG \left(\alpha + \frac{n^{(t)}_j}{2},
\beta + \frac{1}{2} \sum_{i : z^{(t)}_i = j} (y_i - \mu^{(t+1)}_j)^2 \right),
\end{align*} ( σ j 2 ) ( t + 1 ) ∼ I G ⎝ ⎛ α + 2 n j ( t ) , β + 2 1 i : z i ( t ) = j ∑ ( y i − μ j ( t + 1 ) ) 2 ⎠ ⎞ , The EM Algorithm ¶ The EM algorithm can be seen as a deterministic variant of the Gibbs sampler. We shall look at the EM in more generality later. In the case of fitting finite normal mixtures, the EM algorithm is given by the following:
Initialize the parameters w ( 0 ) , μ ( 0 ) , σ ( 0 ) w^{(0)}, \mu^{(0)}, \sigma^{(0)} w ( 0 ) , μ ( 0 ) , σ ( 0 ) .
Repeat the following for t = 0 , 1 , 2 , … t = 0, 1, 2, \dots t = 0 , 1 , 2 , … until convergence:
E-step. Compute the responsibility of component j j j for observation i i i :
r i j ( t ) : = w j ( t ) ϕ ( y i , μ j ( t ) , ( σ j ( t ) ) 2 ) ∑ l = 1 k w l ( t ) ϕ ( y i , μ l ( t ) , ( σ l ( t ) ) 2 ) , j = 1 , … , k . \begin{align*}
r_{ij}^{(t)} := \frac{w_j^{(t)}
\phi\left(y_i, \mu_j^{(t)}, (\sigma_j^{(t)})^2\right)}
{\sum_{l=1}^k w_l^{(t)}
\phi\left(y_i, \mu_l^{(t)}, (\sigma_l^{(t)})^2\right)},
\quad j = 1, \dots, k.
\end{align*} r ij ( t ) := ∑ l = 1 k w l ( t ) ϕ ( y i , μ l ( t ) , ( σ l ( t ) ) 2 ) w j ( t ) ϕ ( y i , μ j ( t ) , ( σ j ( t ) ) 2 ) , j = 1 , … , k . Compute the effective counts
n j ( t ) : = ∑ i = 1 n r i j ( t ) , j = 1 , … , k . \begin{align*}
n_j^{(t)} := \sum_{i=1}^n r_{ij}^{(t)}, \quad j = 1, \dots, k.
\end{align*} n j ( t ) := i = 1 ∑ n r ij ( t ) , j = 1 , … , k . M-step. Update the weights:
w j ( t + 1 ) : = n j ( t ) n , j = 1 , … , k . \begin{align*}
w_j^{(t+1)} := \frac{n_j^{(t)}}{n}, \quad j = 1, \dots, k.
\end{align*} w j ( t + 1 ) := n n j ( t ) , j = 1 , … , k . Update the means:
μ j ( t + 1 ) : = ∑ i = 1 n r i j ( t ) y i n j ( t ) , j = 1 , … , k . \begin{align*}
\mu_j^{(t+1)} := \frac{\sum_{i=1}^n r_{ij}^{(t)}\, y_i}{n_j^{(t)}},
\quad j = 1, \dots, k.
\end{align*} μ j ( t + 1 ) := n j ( t ) ∑ i = 1 n r ij ( t ) y i , j = 1 , … , k . Update the variances:
( σ j 2 ) ( t + 1 ) : = ∑ i = 1 n r i j ( t ) ( y i − μ j ( t + 1 ) ) 2 n j ( t ) , j = 1 , … , k . \begin{align*}
(\sigma_j^2)^{(t+1)} := \frac{\sum_{i=1}^n r_{ij}^{(t)}
\left(y_i - \mu_j^{(t+1)}\right)^2}{n_j^{(t)}},
\quad j = 1, \dots, k.
\end{align*} ( σ j 2 ) ( t + 1 ) := n j ( t ) ∑ i = 1 n r ij ( t ) ( y i − μ j ( t + 1 ) ) 2 , j = 1 , … , k . There are the following close connections between the EM algorithm and the Gibbs sampler:
The E-step replaces sampling z i ∼ Categorical ( … ) z_i \sim \text{Categorical}(\ldots) z i ∼ Categorical ( … ) with computing its expectation
r i j ( t ) = E [ I { z i = j } ∣ y i , w ( t ) , μ ( t ) , σ ( t ) ] , \begin{align*}
r_{ij}^{(t)} = \mathbb{E}\left[I\{z_i = j\} \mid y_i, w^{(t)}, \mu^{(t)},
\sigma^{(t)}\right],
\end{align*} r ij ( t ) = E [ I { z i = j } ∣ y i , w ( t ) , μ ( t ) , σ ( t ) ] , which are the same normalised weights used in the Categorical draw.
The effective count n j ( t ) = ∑ i = 1 n r i j ( t ) n_j^{(t)} = \sum_{i=1}^n r_{ij}^{(t)} n j ( t ) = ∑ i = 1 n r ij ( t ) replaces the integer count ∑ i = 1 n I { z i ( t ) = j } \sum_{i=1}^n I\{z_i^{(t)} = j\} ∑ i = 1 n I { z i ( t ) = j } .
The weight update replaces the Dirichlet draw and takes the maximum likelihood estimate
w j ( t + 1 ) = n j ( t ) n , \begin{align*}
w_j^{(t+1)} = \frac{n_j^{(t)}}{n},
\end{align*} w j ( t + 1 ) = n n j ( t ) , which is equivalent to the posterior mean of the Dirichlet as the prior becomes uninformative.
The mean update replaces the Gaussian conjugate draw and is the responsibility-weighted sample mean
μ j ( t + 1 ) = ∑ i = 1 n r i j ( t ) y i n j ( t ) , \begin{align*}
\mu_j^{(t+1)} = \frac{\sum_{i=1}^n r_{ij}^{(t)}\, y_i}{n_j^{(t)}},
\end{align*} μ j ( t + 1 ) = n j ( t ) ∑ i = 1 n r ij ( t ) y i , the maximum likelihood estimate with no prior shrinkage toward m m m .
The variance update replaces the inverse-Gamma draw and is the responsibility-weighted sample variance
( σ j 2 ) ( t + 1 ) = ∑ i = 1 n r i j ( t ) ( y i − μ j ( t + 1 ) ) 2 n j ( t ) , \begin{align*}
(\sigma_j^2)^{(t+1)} = \frac{\sum_{i=1}^n r_{ij}^{(t)}
\left(y_i - \mu_j^{(t+1)}\right)^2}{n_j^{(t)}},
\end{align*} ( σ j 2 ) ( t + 1 ) = n j ( t ) ∑ i = 1 n r ij ( t ) ( y i − μ j ( t + 1 ) ) 2 , the maximum likelihood estimate with no α , β \alpha, \beta α , β regularisation terms.