Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm

Probit Regression

Consider the frogs dataset that we previously used to illustrate Logistic Regression. Here we apply fit the probit regression model to this data.

frogs = pd.read_csv("frogs.csv")

# Design matrix
X = frogs[['altitude','distance','NoOfPools','NoOfSites','avrain','meanmin','meanmax']].copy()
X['distance'] = np.log(X['distance'])
X['NoOfPools'] = np.log(X['NoOfPools'])
X = sm.add_constant(X)
y = frogs['pres.abs'].values

The GLM function in statsmodels can be used to fit the probit model.

# GLM (probit + logit)
print(sm.GLM(y, X, family=sm.families.Binomial(link=sm.families.links.Probit())).fit().summary())
print(sm.GLM(y, X, family=sm.families.Binomial()).fit().summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                      y   No. Observations:                  212
Model:                            GLM   Df Residuals:                      204
Model Family:                Binomial   Df Model:                            7
Link Function:                 Probit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -99.536
Date:                Mon, 06 Apr 2026   Deviance:                       199.07
Time:                        18:01:24   Pearson chi2:                     229.
No. Iterations:                     6   Pseudo R-squ. (CS):             0.3173
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         40.9617     73.193      0.560      0.576    -102.493     184.417
altitude      -0.0090      0.021     -0.426      0.670      -0.050       0.032
distance      -0.4170      0.144     -2.896      0.004      -0.699      -0.135
NoOfPools      0.3396      0.122      2.794      0.005       0.101       0.578
NoOfSites      0.0044      0.062      0.071      0.944      -0.117       0.126
avrain        -0.0102      0.034     -0.299      0.765      -0.077       0.057
meanmin        2.9654      0.863      3.437      0.001       1.274       4.656
meanmax       -2.4367      2.700     -0.902      0.367      -7.729       2.856
==============================================================================
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                      y   No. Observations:                  212
Model:                            GLM   Df Residuals:                      204
Model Family:                Binomial   Df Model:                            7
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -98.812
Date:                Mon, 06 Apr 2026   Deviance:                       197.62
Time:                        18:01:24   Pearson chi2:                     234.
No. Iterations:                     6   Pseudo R-squ. (CS):             0.3219
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         40.8988    132.663      0.308      0.758    -219.116     300.914
altitude      -0.0066      0.039     -0.172      0.863      -0.082       0.069
distance      -0.7593      0.255     -2.973      0.003      -1.260      -0.259
NoOfPools      0.5727      0.216      2.649      0.008       0.149       0.997
NoOfSites     -0.0009      0.107     -0.008      0.993      -0.211       0.210
avrain        -0.0068      0.060     -0.113      0.910      -0.124       0.111
meanmin        5.3048      1.543      3.439      0.001       2.281       8.328
meanmax       -3.1730      4.840     -0.656      0.512     -12.659       6.313
==============================================================================

Below is the code for the Gibbs sampler. We are using augmentation and sampling from the joint posterior of w,βw, \beta given the data. Here wii.i.dN(xiTβ,1)w_i \overset{\text{i.i.d}}{\sim} N(x_i^T \beta, 1) and yi=I{wi>0}y_i = I\{w_i > 0\}. The conditional distribution of β\beta given ww and the data is

βw,dataN((XTX)1XTw,(XTX)1).\begin{align*} \beta \mid w, \text{data} \sim N((X^T X)^{-1} X^T w, (X^T X)^{-1}). \end{align*}

The conditional distribution of ww given β\beta and the data is:

wiindependentN(xiTβ,1) truncated to I{wi>0}=yi.\begin{align*} w_i \overset{\text{independent}}{\sim} N(x_i^T \beta, 1) \text{ truncated to } I\{w_i > 0\} = y_i. \end{align*}

In other words, if yi=1y_i = 1, then wiw_i is distributed as N(xiTβ,1)N(x_i^T \beta, 1) conditioned to be positive. And if yi=0y_i = 0, then wiw_i is distributed as N(xiTβ,1)N(x_i^T \beta, 1) conditioned to be negative. Simulating from the truncated normal distribution can be done as described in Truncated normal distribution#Generating values from the truncated normal distribution.

from scipy.stats import norm
Xm = X.values
linmod_cov = np.linalg.inv(Xm.T @ Xm)
n, p = Xm.shape
XtX_inv = linmod_cov
Xt = Xm.T
A = XtX_inv @ Xt   # (X^T X)^{-1} X^T

beta_init = A @ y
w_init = Xm @ beta_init
state = np.concatenate([beta_init, w_init])

nit = 100000
path = np.zeros((nit, p+n))
path[0] = state

for t in range(1, nit):
    if np.random.rand() < 0.5:
        #this part updates beta and leaves w fixed
        w = state[p:]
        beta = A @ w  
        path[t,:p] = np.random.multivariate_normal(beta, XtX_inv)
        path[t,p:] = w
    else:
        #this part updates w and leaves beta fixed
        beta = state[:p]
        path[t,:p] = beta

        mu = Xm @ beta   # vectorized
        u = np.random.rand(n)

        mask1 = (y == 1)
        mask0 = ~mask1

        path[t,p:][mask1] = mu[mask1] + norm.ppf(
            u[mask1] + (1-u[mask1])*norm.cdf(-mu[mask1])
        )
        path[t,p:][mask0] = mu[mask0] + norm.ppf(
            u[mask0]*norm.cdf(-mu[mask0])
        )

    state = path[t]
    if t % 2000 == 0: print(t)
2000
4000
6000
8000
10000
12000
14000
16000
18000
20000
22000
24000
26000
28000
30000
32000
34000
36000
38000
40000
42000
44000
46000
48000
50000
52000
54000
56000
58000
60000
62000
64000
66000
68000
70000
72000
74000
76000
78000
80000
82000
84000
86000
88000
90000
92000
94000
96000
98000

Below we compute the posterior means and standard deviations.

# Posterior summaries
pth = pd.DataFrame(path[:, :p], columns=[f'b{i}' for i in range(p)])
bayes = pd.DataFrame({
    'est': pth.mean(),
    'std.err': pth.std()
})

print(bayes)
          est    std.err
b0  37.712413  73.834205
b1  -0.007667   0.021202
b2  -0.433096   0.143792
b3   0.351701   0.122949
b4   0.002300   0.062390
b5  -0.011269   0.035240
b6   3.142313   0.877072
b7  -2.375448   2.727773

Below we compare the posterior means and standard deviations, with the frequentist estimates and standard errors. Remember that the frequentist estimates and standard errors coincide with the Bayesian solution corresponding to the Laplace posterior normal approximation.

# GLM (probit)
glm_res = sm.GLM(
    y, Xm,
    family=sm.families.Binomial(link=sm.families.links.Probit())
).fit()

# Combine results
comp = pd.DataFrame({
    'Bayes_est': bayes['est'].values,
    'GLM_est': glm_res.params,
    'Bayes_se': bayes['std.err'].values,
    'GLM_se': glm_res.bse
}, index=[f'b{i}' for i in range(len(glm_res.params))])

print(comp)
    Bayes_est    GLM_est   Bayes_se     GLM_se
b0  37.712413  40.961673  73.834205  73.192736
b1  -0.007667  -0.009011   0.021202   0.021139
b2  -0.433096  -0.417042   0.143792   0.144002
b3   0.351701   0.339612   0.122949   0.121531
b4   0.002300   0.004392   0.062390   0.062026
b5  -0.011269  -0.010206   0.035240   0.034128
b6   3.142313   2.965350   0.877072   0.862714
b7  -2.375448  -2.436715   2.727773   2.700422

It is clear that the Gibbs sampler gives very similar results to those obtained by the Laplace posterior normal approximation.

Simulation Setting

The following simulation setting uses a very small sample size n=15n = 15. Here there will be more discrepancy between the Gibbs sampler and the Laplace posterior normal approximation.

n = 15
beta_0 = -4.0
beta_1 = 0.8

# for reproducibility
np.random.seed(42)

# generate x
x = np.random.normal(loc=5, scale=0.75, size=n)

# compute probit probabilities
probs = norm.cdf(beta_0 + beta_1 * x)

# generate y
y = np.random.binomial(1, probs, size=n)

print("x =", x)
print("p =", probs)
print("y =", y)
x = [5.37253561 4.89630177 5.4857664  6.14227239 4.82438497 4.82439728
 6.18440961 5.57557605 4.64789421 5.40692003 4.65243673 4.65070268
 5.1814717  3.56503982 3.70631163]
p = [0.61715929 0.46694224 0.65121882 0.81959373 0.44413563 0.44413952
 0.82831502 0.67740726 0.38909267 0.62761229 0.39048675 0.38995441
 0.55771458 0.12549086 0.15034597]
y = [1 0 1 1 0 1 1 1 0 1 0 0 1 1 1]

Below we compute the Laplace posterior normal approximation (i.e., the frequentist estimates and standard errors).

# design matrix with intercept
X = sm.add_constant(x)

# fit probit regression
model = sm.GLM(
    y,
    X,
    family=sm.families.Binomial(link=sm.families.links.Probit())
)

result = model.fit()

print(result.summary())
print("\nEstimated coefficients:")
print("beta_0_hat =", result.params[0])
print("beta_1_hat =", result.params[1])
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                      y   No. Observations:                   15
Model:                            GLM   Df Residuals:                       13
Model Family:                Binomial   Df Model:                            1
Link Function:                 Probit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -8.9535
Date:                Mon, 06 Apr 2026   Deviance:                       17.907
Time:                        18:05:03   Pearson chi2:                     14.1
No. Iterations:                     5   Pseudo R-squ. (CS):            0.07617
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -2.2020      2.453     -0.898      0.369      -7.010       2.606
x1             0.5351      0.493      1.086      0.278      -0.431       1.501
==============================================================================

Estimated coefficients:
beta_0_hat = -2.201962764161696
beta_1_hat = 0.5351076507609054

Next we implement the Gibbs sampler.

n, p = X.shape
linmod_cov = np.linalg.inv(X.T @ X)
print(linmod_cov)
[[ 3.28868367 -0.64340466]
 [-0.64340466  0.12848149]]
Xm = X
XtX_inv = linmod_cov
Xt = Xm.T
A = XtX_inv @ Xt   # (X^T X)^{-1} X^T

beta_init = A @ y
w_init = Xm @ beta_init
state = np.concatenate([beta_init, w_init])

nit = 500000
path = np.zeros((nit, p+n))
path[0] = state

for t in range(1, nit):
    if np.random.rand() < 0.5:
        w = state[p:]
        beta = A @ w  
        path[t,:p] = np.random.multivariate_normal(beta, XtX_inv)
        path[t,p:] = w
    else:
        beta = state[:p]
        path[t,:p] = beta

        mu = Xm @ beta   # vectorized
        u = np.random.rand(n)

        mask1 = (y == 1)
        mask0 = ~mask1

        path[t,p:][mask1] = mu[mask1] + norm.ppf(
            u[mask1] + (1-u[mask1])*norm.cdf(-mu[mask1])
        )
        path[t,p:][mask0] = mu[mask0] + norm.ppf(
            u[mask0]*norm.cdf(-mu[mask0])
        )

    state = path[t]
    if t % 2000 == 0: print(t)
2000
4000
6000
8000
10000
12000
14000
16000
18000
20000
22000
24000
26000
28000
30000
32000
34000
36000
38000
40000
42000
44000
46000
48000
50000
52000
54000
56000
58000
60000
62000
64000
66000
68000
70000
72000
74000
76000
78000
80000
82000
84000
86000
88000
90000
92000
94000
96000
98000
100000
102000
104000
106000
108000
110000
112000
114000
116000
118000
120000
122000
124000
126000
128000
130000
132000
134000
136000
138000
140000
142000
144000
146000
148000
150000
152000
154000
156000
158000
160000
162000
164000
166000
168000
170000
172000
174000
176000
178000
180000
182000
184000
186000
188000
190000
192000
194000
196000
198000
200000
202000
204000
206000
208000
210000
212000
214000
216000
218000
220000
222000
224000
226000
228000
230000
232000
234000
236000
238000
240000
242000
244000
246000
248000
250000
252000
254000
256000
258000
260000
262000
264000
266000
268000
270000
272000
274000
276000
278000
280000
282000
284000
286000
288000
290000
292000
294000
296000
298000
300000
302000
304000
306000
308000
310000
312000
314000
316000
318000
320000
322000
324000
326000
328000
330000
332000
334000
336000
338000
340000
342000
344000
346000
348000
350000
352000
354000
356000
358000
360000
362000
364000
366000
368000
370000
372000
374000
376000
378000
380000
382000
384000
386000
388000
390000
392000
394000
396000
398000
400000
402000
404000
406000
408000
410000
412000
414000
416000
418000
420000
422000
424000
426000
428000
430000
432000
434000
436000
438000
440000
442000
444000
446000
448000
450000
452000
454000
456000
458000
460000
462000
464000
466000
468000
470000
472000
474000
476000
478000
480000
482000
484000
486000
488000
490000
492000
494000
496000
498000

Below are the posterior means and standard errors (after removing a burnin).

# Posterior summaries with burn-in
burnin = 5000

pth = pd.DataFrame(path[burnin:, :p], columns=[f'b{i}' for i in range(p)])
bayes = pd.DataFrame({
    'est': pth.mean(),
    'std.err': pth.std()
})

print(bayes)
         est   std.err
b0 -2.376342  2.525091
b1  0.580037  0.516160

Below we print the posteior means and standard errors side by side with the frequentist estimates and standard errors.

# GLM (probit)
glm_res = sm.GLM(
    y, Xm,
    family=sm.families.Binomial(link=sm.families.links.Probit())
).fit()

# Combine results
comp = pd.DataFrame({
    'Bayes_est': bayes['est'].values,
    'GLM_est': glm_res.params,
    'Bayes_se': bayes['std.err'].values,
    'GLM_se': glm_res.bse
}, index=[f'b{i}' for i in range(len(glm_res.params))])

print(comp)
    Bayes_est   GLM_est  Bayes_se    GLM_se
b0  -2.376342 -2.201963  2.525091  2.453266
b1   0.580037  0.535108  0.516160  0.492885

Which of these two outputs better approximates the true posterior? This question is harder to answer in general but since this is a scenario with only two parameters, we can attempt to compute the actual posterior using a dense grid-based numerical approximation. We do this below.

from scipy.special import logsumexp
# -----------------------------
# Grid posterior computation
# -----------------------------

# center grid around MLE
b0_hat, b1_hat = result.params
se0, se1 = result.bse

# choose a grid range (we are placing the grid around a large range)
b0_grid = np.linspace(-12.0, 12.0, 1000)
b1_grid = np.linspace(-12.0, 12.0, 1000)

B0, B1 = np.meshgrid(b0_grid, b1_grid, indexing="xy")

log_post = np.zeros_like(B0)

# compute log-likelihood on the grid
for i in range(len(b1_grid)):
    for j in range(len(b0_grid)):
        beta = np.array([B0[i, j], B1[i, j]])
        eta = X @ beta
        p = norm.cdf(eta)

        # avoid log(0)
        p = np.clip(p, 1e-14, 1 - 1e-14)

        loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
        log_post[i, j] = loglik   # flat prior => log posterior = log likelihood + constant

# normalize numerically
log_post_norm = log_post - logsumexp(log_post)
post = np.exp(log_post_norm)

# account for grid cell area so posterior integrates to about 1
db0 = b0_grid[1] - b0_grid[0]
db1 = b1_grid[1] - b1_grid[0]
post = post / (post.sum() * db0 * db1)

# -----------------------------
# Posterior summaries from grid
# -----------------------------
# marginals
post_b0 = post.sum(axis=0) * db1
post_b1 = post.sum(axis=1) * db0

# posterior means
mean_b0 = np.sum(b0_grid * post_b0) * db0
mean_b1 = np.sum(b1_grid * post_b1) * db1

# posterior variances
var_b0 = np.sum((b0_grid - mean_b0)**2 * post_b0) * db0
var_b1 = np.sum((b1_grid - mean_b1)**2 * post_b1) * db1

print("\nGrid posterior summaries:")
print("Posterior mean beta_0 =", mean_b0)
print("Posterior sd   beta_0 =", np.sqrt(var_b0))
print("Posterior mean beta_1 =", mean_b1)
print("Posterior sd   beta_1 =", np.sqrt(var_b1))

# -----------------------------
# Plots
# -----------------------------
plt.figure(figsize=(15, 4))

# marginal for beta_0
plt.subplot(1, 3, 2)
plt.plot(b0_grid, post_b0)
plt.xlabel(r'$\beta_0$')
plt.ylabel("Density")
plt.title(r"Marginal posterior of $\beta_0$")

# marginal for beta_1
plt.subplot(1, 3, 3)
plt.plot(b1_grid, post_b1)
plt.xlabel(r'$\beta_1$')
plt.ylabel("Density")
plt.title(r"Marginal posterior of $\beta_1$")

plt.tight_layout()
plt.show()

Grid posterior summaries:
Posterior mean beta_0 = -2.371029301192905
Posterior sd   beta_0 = 2.5215968840131513
Posterior mean beta_1 = 0.5788161540054721
Posterior sd   beta_1 = 0.5159341483896762
<Figure size 1500x400 with 2 Axes>

Below we plot all the estimates (and standard errors) side-by-side for comparison.

comp = pd.DataFrame({
    'Bayes_est': bayes['est'].values[:2],
    'GLM_est': glm_res.params[:2],
    'Grid_est': [mean_b0, mean_b1],
    'Bayes_se': bayes['std.err'].values[:2],
    'GLM_se': glm_res.bse[:2],
    'Grid_se': [np.sqrt(var_b0), np.sqrt(var_b1)]
}, index=['b0', 'b1'])

print(comp)
    Bayes_est   GLM_est  Grid_est  Bayes_se    GLM_se   Grid_se
b0  -2.376342 -2.201963 -2.371029  2.525091  2.453266  2.521597
b1   0.580037  0.535108  0.578816  0.516160  0.492885  0.515934

It is clear the Gibbs sampler and grid-based posterior answers are very close to each other. This indicates that the Gibbs sampler is a much better approximation for the actual posterior.

Mixture Models

The next example we shall study is mixture models. One should be careful while using methods such as the Gibbs sampler for mixture models. Let us discuss this using a simple simulation example.

We observe data y1,,yny_1, \dots, y_n. Consider the model: yii.i.d(1p)N(μ1,1)+pN(μ2,1)y_i \overset{\text{i.i.d}}{\sim} (1 - p) N(\mu_1, 1) + p N(\mu_2, 1). The unknown parameters are μ1,μ2,p\mu_1, \mu_2, p. For simplicity, let us first assume that p(0,1)p \in (0, 1) is also known so the unknowns are μ1\mu_1 and μ2\mu_2. The likelihood is given by:

i=1n((1p)ϕ(yiμ1)+pϕ(yiμ2))\begin{align*} \prod_{i=1}^n \left((1 - p) \phi(y_i - \mu_1) + p \phi(y_i - \mu_2) \right) \end{align*}

where ϕ()\phi(\cdot) is the standard normal pdf.

Consider the following simulated dataset from this model. This simulation example is from Chapter 6 of the book “Bayesian Essentials with R” by Marin and Robert. I recommend reading this book chapter for more details.

n = 500
p = 0.7
mu1_true = 2.5
mu2_true = 0.0

np.random.seed(0)
# latent component indicators
z = np.random.rand(n) < p

# generate data
y = np.where(z, np.random.normal(mu2_true, 1, n), np.random.normal(mu1_true, 1, n))

Below we plot the histogram for this dataset along with the true density.

x = np.linspace(y.min(), y.max(), 500)

true_density = (1 - p) * norm.pdf(x, mu1_true, 1) + p * norm.pdf(x, mu2_true, 1)

plt.figure(figsize=(6,5))
plt.hist(y, bins=40, density=True, alpha=0.6)
plt.plot(x, true_density, linewidth=2)

plt.xlabel("y")
plt.ylabel("Density")
plt.title("Histogram with true mixture density")

plt.show()
<Figure size 600x500 with 1 Axes>

Bayesian methods (and also non-Bayesian methods such as the EM algorithm for computing the MLE) crucially rely on the likelihood (as well as the log-likelihood). Below we plot the log-likelihood as a function of μ1\mu_1 and μ2\mu_2.

def loglik(mu1, mu2, y, p):
    comp1 = (1 - p) * norm.pdf(y[:, None, None] - mu1)
    comp2 = p * norm.pdf(y[:, None, None] - mu2)
    return np.sum(np.log(comp1 + comp2), axis=0)

mu1_grid = np.linspace(-5, 5, 200)
mu2_grid = np.linspace(-5, 5, 200)

M1, M2 = np.meshgrid(mu1_grid, mu2_grid, indexing="xy")
LL = loglik(M1, M2, y, p)
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(M1, M2, LL, alpha=0.8)

ax.set_xlabel(r'$\mu_1$')
ax.set_ylabel(r'$\mu_2$')
ax.set_zlabel('log-likelihood')

plt.title("Log-likelihood surface")
plt.show()
<Figure size 800x600 with 1 Axes>

The log-likelihood does not seem like having a single mode; in fact, it has two modes as can be seen from the contour plot below. Notice however the scale of the z-axis in the log-likelihood plot.

plt.figure(figsize=(6,5))
plt.contour(M1, M2, LL, levels=100)
plt.xlabel(r'$\mu_1$')
plt.ylabel(r'$\mu_2$')
plt.title("Log-likelihood contours")
plt.show()
<Figure size 600x500 with 1 Axes>

Of these two modes, one of them is quite close to the true values of μ1\mu_1 and μ2\mu_2 (which are 2.5 and 0 respectively). The other one is a spurious mode. We should careful with any algorithm that tries to optimize (or sample from the posterior) based on the log-likelihood because it might be attracted by the spurious mode.

If we look at the likelihood function (instead of the log-likelihood), it turns out it is extremely peaked with only one visible mode (which is close to the true values of μ1\mu_1 and μ2\mu_2). This is shown below.

# stabilize before exponentiating
LL_shifted = LL - np.max(LL)
L = np.exp(LL_shifted)

fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111, projection='3d')

ax.plot_surface(M1, M2, L, alpha=0.8)

ax.set_xlabel(r'$\mu_1$')
ax.set_ylabel(r'$\mu_2$')
ax.set_zlabel('likelihood (scaled)')

plt.title("Likelihood surface (scaled)")
plt.show()
<Figure size 800x600 with 1 Axes>

It is remarkable that the bimodal structure in the log-likelihood seemingly disappears when we look at the likelihood which has a very clear and sharp peak. However, algorithms cannot directly run on the likelihood because it is mostly zero except for a very tiny region around the MLE. Note also that we have normalized the likelihood here so that the maximum is equal to 1.

The EM Algorithm

Before discussing the Gibbs sampler, let us first see how the EM algorithm works. The EM algorithm is an iterative algorithm applied to the log-likelihood for computing the MLE. It is very similar in spirit to the Gibbs sampler. The EM algorithm is coded below, we shall look at the algorithm in the next lecture.

def em_gaussian_mixture(y, mu1_init, mu2_init, p=0.7,
                        max_iter=200, tol=1e-6):
    
    mu1 = mu1_init
    mu2 = mu2_init
    
    loglik_trace = []
    mu1_trace = [mu1]
    mu2_trace = [mu2]
    
    for t in range(max_iter):
        
        # ----- E-step -----
        f1 = (1 - p) * norm.pdf(y, loc=mu1, scale=1)
        f2 = p * norm.pdf(y, loc=mu2, scale=1)
        
        gamma = f2 / (f1 + f2)
        
        # ----- M-step -----
        mu1_new = np.sum((1 - gamma) * y) / np.sum(1 - gamma)
        mu2_new = np.sum(gamma * y) / np.sum(gamma)
        
        # ----- log-likelihood -----
        ll = np.sum(np.log(f1 + f2))
        loglik_trace.append(ll)
        
        mu1_trace.append(mu1_new)
        mu2_trace.append(mu2_new)
        
        # convergence
        if np.abs(mu1_new - mu1) < tol and np.abs(mu2_new - mu2) < tol:
            break
        
        mu1, mu2 = mu1_new, mu2_new
    
    return {
        "mu1": mu1,
        "mu2": mu2,
        "loglik": loglik_trace,
        "mu1_path": mu1_trace,
        "mu2_path": mu2_trace
    }

The EM algorithm needs an initial value for the parameters. Its behaviour could be different for different starting values as shown below.

#consider the following initial values for the EM. 
#these values are chosen to be far from the true values (mu1_true = 2.5, mu2_true = 0.0)
mu1_init = 0.0
mu2_init = 3.0 
res = em_gaussian_mixture(y, mu1_init=mu1_init, mu2_init=mu2_init, p=0.7)
print(res["mu1"], res["mu2"])
-0.6608832392943692 1.481701314811594

Here the EM algorithm converged to two values which are far from the true values. We can plot the EM iterates on the contour plot of the log-likelihood as shown below.

plt.figure(figsize=(6,5))

# contour
plt.contour(M1, M2, LL, levels=100)

# EM path
mu1_path = np.array(res["mu1_path"])
mu2_path = np.array(res["mu2_path"])

plt.plot(mu1_path, mu2_path, marker='o', markersize = 5, linewidth=2)

# highlight start and end
plt.scatter(mu1_path[0], mu2_path[0], s=80, marker='x', label='start')
plt.scatter(mu1_path[-1], mu2_path[-1], s=80, marker='*', label='end')

plt.xlabel(r'$\mu_1$')
plt.ylabel(r'$\mu_2$')
plt.title("Log-likelihood contours with EM trajectory")

plt.legend()
plt.show()
<Figure size 600x500 with 1 Axes>

It is clear from the above plot that the EM algorithm is converging to the spurious mode (as opposed to the correct mode). This solution does not explain the data well as is clear from the histogram of the data superimposed on the fitted density.

x = np.linspace(y.min(), y.max(), 500)
true_density = (1 - p) * norm.pdf(x, mu1_true, 1) + p * norm.pdf(x, mu2_true, 1)

mu1_alt = res["mu1"]
mu2_alt = res["mu2"]
alt_density = (1 - p) * norm.pdf(x, mu1_alt, 1) + p * norm.pdf(x, mu2_alt, 1)

plt.figure(figsize=(6,5))
plt.hist(y, bins=40, density=True, alpha=0.6)
plt.plot(x, true_density, linewidth=2, label = 'true density')
plt.plot(x, alt_density, linewidth=2, color='green', linestyle='--', label='EM fit')

plt.xlabel("y")
plt.ylabel("Density")
plt.title("Histogram with true mixture density")
plt.legend()

plt.show()
<Figure size 600x500 with 1 Axes>

Below we take a different initialization for which the EM algorithm converges to the actual MLE.

mu1_init = 4.0
mu2_init = 3.0 
res = em_gaussian_mixture(y, mu1_init=mu1_init, mu2_init=mu2_init, p=0.7)
print(res["mu1"], res["mu2"])

plt.figure(figsize=(6,5))

# contour
plt.contour(M1, M2, LL, levels=100)

# EM path
mu1_path = np.array(res["mu1_path"])
mu2_path = np.array(res["mu2_path"])

plt.plot(mu1_path, mu2_path, marker='o', markersize = 5, linewidth=2)

# highlight start and end
plt.scatter(mu1_path[0], mu2_path[0], s=80, marker='x', label='start')
plt.scatter(mu1_path[-1], mu2_path[-1], s=80, marker='*', label='end')

plt.xlabel(r'$\mu_1$')
plt.ylabel(r'$\mu_2$')
plt.title("Log-likelihood contours with EM trajectory")

plt.legend()
plt.show()
2.4614802836432386 -0.14232302968100252
<Figure size 600x500 with 1 Axes>

If we initialize the EM algorithm randomly, one can still end up at the spurious mode. To see this, run the following a few times.

mu1_init = np.random.uniform(-5, 5)
mu2_init = np.random.uniform(-5, 5)
res = em_gaussian_mixture(y, mu1_init=mu1_init, mu2_init=mu2_init, p=0.7)
print(res["mu1"], res["mu2"])

plt.figure(figsize=(6,5))

# contour
plt.contour(M1, M2, LL, levels=100)

# EM path
mu1_path = np.array(res["mu1_path"])
mu2_path = np.array(res["mu2_path"])

plt.plot(mu1_path, mu2_path, marker='o', markersize = 5, linewidth=2)

# highlight start and end
plt.scatter(mu1_path[0], mu2_path[0], s=80, marker='x', label='start')
plt.scatter(mu1_path[-1], mu2_path[-1], s=80, marker='*', label='end')

plt.xlabel(r'$\mu_1$')
plt.ylabel(r'$\mu_2$')
plt.title("Log-likelihood contours with EM trajectory")

plt.legend()
plt.show()
-0.6608846315602165 1.4816987482385464
<Figure size 600x500 with 1 Axes>

Another interesting initialization for the EM algorithm is when μ1=μ2\mu_1 = \mu_2. In this case, it turns out that the EM algorithm converges (in one step!) to (yˉ,yˉ)(\bar{y}, \bar{y}). So in this case, the algorithm does not even converge to one of the two modes. Instead it converges to the MLE in the single Gaussian model N(μ,1)N(\mu, 1). This can be verified below.

mu1_init = 4.0
mu2_init = 4.0
res = em_gaussian_mixture(y, mu1_init=mu1_init, mu2_init=mu2_init, p=0.7)
print(res["mu1"], res["mu2"])

plt.figure(figsize=(6,5))

# contour
plt.contour(M1, M2, LL, levels=100)

# EM path
mu1_path = np.array(res["mu1_path"])
mu2_path = np.array(res["mu2_path"])

plt.plot(mu1_path, mu2_path, marker='o', markersize = 5, linewidth=2)

# highlight start and end
plt.scatter(mu1_path[0], mu2_path[0], s=80, marker='x', label='start')
plt.scatter(mu1_path[-1], mu2_path[-1], s=80, marker='*', label='end')

plt.xlabel(r'$\mu_1$')
plt.ylabel(r'$\mu_2$')
plt.title("Log-likelihood contours with EM trajectory")

plt.legend()
plt.show()
0.6318113722352241 0.631811372235224
<Figure size 600x500 with 1 Axes>

The Gibbs Sampler

Next let us look at the Gibbs sampler. Gibbs sampler works by augmentation (similar to the probit regression example) and we shall look at the formulae in the next lecture. The code for the Gibbs sampler is given below.

def gibbs_gaussian_mixture(y, mu1_init, mu2_init, p=0.7,
                          tau2=100, n_iter=5000):
    
    n = len(y)
    
    mu1 = mu1_init
    mu2 = mu2_init
    
    mu1_trace = []
    mu2_trace = []
    
    # initialize Z randomly
    z = np.random.rand(n) < p
    
    for t in range(n_iter):
        
        # ----- Sample Z -----
        f1 = (1 - p) * norm.pdf(y, loc=mu1)
        f2 = p * norm.pdf(y, loc=mu2)
        
        prob = f2 / (f1 + f2)
        z = np.random.rand(n) < prob
        
        # ----- Sample mu1 -----
        y1 = y[~z]
        n1 = len(y1)
        
        var1 = 1 / (n1 + 1/tau2)
        mean1 = var1 * np.sum(y1)
        
        mu1 = np.random.normal(mean1, np.sqrt(var1))
        
        # ----- Sample mu2 -----
        y2 = y[z]
        n2 = len(y2)
        
        var2 = 1 / (n2 + 1/tau2)
        mean2 = var2 * np.sum(y2)
        
        mu2 = np.random.normal(mean2, np.sqrt(var2))
        
        mu1_trace.append(mu1)
        mu2_trace.append(mu2)
    
    return {
        "mu1": np.array(mu1_trace),
        "mu2": np.array(mu2_trace)
    }

Similar to the EM algorithm, the Gibbs sampler also needs an initialization. We can run the Gibbs sampler with the same initial values that we used previously for the EM. Let us start with the first set of initial values.

mu1_init = 0.0
mu2_init = 3.0 

res_gibbs = gibbs_gaussian_mixture(y, mu1_init=mu1_init, mu2_init=mu2_init, p=0.7)

mu1_samples = res_gibbs["mu1"]
mu2_samples = res_gibbs["mu2"]

Let us plot the Gibbs iterates below on contours.

plt.figure(figsize=(6,5))

# contour
plt.contour(M1, M2, LL, levels=100)

# Gibbs samples
mu1_samp = res_gibbs["mu1"]
mu2_samp = res_gibbs["mu2"]

mu1_post = mu1_samp
mu2_post = mu2_samp

# plot full path (light line)
plt.plot(mu1_post, mu2_post, linewidth=0.5, alpha=0.5)

# scatter samples (posterior cloud)
plt.scatter(mu1_post, mu2_post, s=5, alpha=0.5)

# highlight start and end
plt.scatter(mu1_init, mu2_init, s=60, marker='x', label='start')
plt.scatter(mu1_samp[-1], mu2_samp[-1], s=60, marker='*', label='end')

plt.xlabel(r'$\mu_1$')
plt.ylabel(r'$\mu_2$')
plt.title("Log-likelihood contours with Gibbs samples")

plt.legend()
plt.show()
<Figure size 600x500 with 1 Axes>

Similar to the EM algorithm, the Gibbs sampler is only exploring the first peak of the posterior (note that this peak corresponds to the spurious mode). Let us now look at the second set of initial values.

mu1_init = 4.0
mu2_init = 3.0 

res_gibbs = gibbs_gaussian_mixture(y, mu1_init=mu1_init, mu2_init=mu2_init, p=0.7)

mu1_samples = res_gibbs["mu1"]
mu2_samples = res_gibbs["mu2"]

plt.figure(figsize=(6,5))

# contour
plt.contour(M1, M2, LL, levels=100)

# Gibbs samples
mu1_samp = res_gibbs["mu1"]
mu2_samp = res_gibbs["mu2"]

mu1_post = mu1_samp
mu2_post = mu2_samp

# plot full path (light line)
plt.plot(mu1_post, mu2_post, linewidth=0.5, alpha=0.5)

# scatter samples (posterior cloud)
plt.scatter(mu1_post, mu2_post, s=5, alpha=0.5)

# highlight start and end
plt.scatter(mu1_init, mu2_init, s=60, marker='x', label='start')
plt.scatter(mu1_samp[-1], mu2_samp[-1], s=60, marker='*', label='end')

plt.xlabel(r'$\mu_1$')
plt.ylabel(r'$\mu_2$')
plt.title("Log-likelihood contours with Gibbs samples")

plt.legend()
plt.show()
<Figure size 600x500 with 1 Axes>

In this case, the Gibbs samples are concentrated at the correct mode. Next let us look at random initializations. Repeat the following multiple times. There will be a few times the Gibbs samples focus on the spurious peak.

mu1_init = np.random.uniform(-5, 5)
mu2_init = np.random.uniform(-5, 5)
res_gibbs = gibbs_gaussian_mixture(y, mu1_init=mu1_init, mu2_init=mu2_init, p=0.7)

plt.figure(figsize=(6,5))

# contour
plt.contour(M1, M2, LL, levels=100)

# Gibbs samples
mu1_samp = res_gibbs["mu1"]
mu2_samp = res_gibbs["mu2"]

mu1_post = mu1_samp
mu2_post = mu2_samp

# plot full path (light line)
plt.plot(mu1_post, mu2_post, linewidth=0.5, alpha=0.5)

# scatter samples (posterior cloud)
plt.scatter(mu1_post, mu2_post, s=5, alpha=0.5)

# highlight start and end
plt.scatter(mu1_init, mu2_init, s=60, marker='x', label='start')
plt.scatter(mu1_samp[-1], mu2_samp[-1], s=60, marker='*', label='end')

plt.xlabel(r'$\mu_1$')
plt.ylabel(r'$\mu_2$')
plt.title("Log-likelihood contours with Gibbs samples")

plt.legend()
plt.show()
<Figure size 600x500 with 1 Axes>

Next look at the case of the initialization μ1=μ2\mu_1 = \mu_2. In this case, the EM algorithm converges in one step to (yˉ,yˉ)(\bar{y}, \bar{y}). Let us see how the Gibbs sampler performs. In this case (as can be seen from the code below), some times, the Gibbs sampler goes to the correct mode, but some other times, it will go to the spurious mode. This is an unstable initialization.

mu1_init = 4.0
mu2_init = 4.0
res_gibbs = gibbs_gaussian_mixture(y, mu1_init=mu1_init, mu2_init=mu2_init, p=0.7)

plt.figure(figsize=(6,5))

# contour
plt.contour(M1, M2, LL, levels=100)

# Gibbs samples
mu1_samp = res_gibbs["mu1"]
mu2_samp = res_gibbs["mu2"]

mu1_post = mu1_samp
mu2_post = mu2_samp

# plot full path (light line)
plt.plot(mu1_post, mu2_post, linewidth=0.5, alpha=0.5)

# scatter samples (posterior cloud)
plt.scatter(mu1_post, mu2_post, s=5, alpha=0.5)

# highlight start and end
plt.scatter(mu1_init, mu2_init, s=60, marker='x', label='start')
plt.scatter(mu1_samp[-1], mu2_samp[-1], s=60, marker='*', label='end')

plt.xlabel(r'$\mu_1$')
plt.ylabel(r'$\mu_2$')
plt.title("Log-likelihood contours with Gibbs samples")

plt.legend()
plt.show()
<Figure size 600x500 with 1 Axes>

Overall, the Gibbs sampler has very similar problems to the EM in this example. Initialization is important and it is quite possible that the Gibbs samples are exploring the wrong peak of the log-posterior.

Metropolis Algorithm

Next let us look at the performance of the Random-Walk Metropolis Algorithm. The code is given below. We are generating proposals by y=x+zy = x + z where zN(0,s2I2)z \sim N(0, s^2 I_2) for a step size ss.

def loglik(mu1, mu2, y, p):
    return np.sum(np.log((1 - p) * norm.pdf(y, mu1, 1) +
                         p * norm.pdf(y, mu2, 1)))

def rwm_mixture(y, mu1_init, mu2_init, p=0.7,
                n_iter=5000, step_size=1.0):
    
    mu1 = mu1_init
    mu2 = mu2_init
    
    mu1_trace = []
    mu2_trace = []
    
    accept = 0
    
    current_ll = loglik(mu1, mu2, y, p)
    
    for t in range(n_iter):
        
        # ----- proposal -----
        proposal = np.random.normal(0, step_size, size=2)
        mu1_prop = mu1 + proposal[0]
        mu2_prop = mu2 + proposal[1]
        
        # ----- compute log-likelihood -----
        prop_ll = loglik(mu1_prop, mu2_prop, y, p)
        
        # ----- accept/reject -----
        log_alpha = prop_ll - current_ll
        
        if np.log(np.random.rand()) < log_alpha:
            mu1, mu2 = mu1_prop, mu2_prop
            current_ll = prop_ll
            accept += 1
        
        mu1_trace.append(mu1)
        mu2_trace.append(mu2)
    
    return {
        "mu1": np.array(mu1_trace),
        "mu2": np.array(mu2_trace),
        "accept_rate": accept / n_iter
    }

The step size is crucial for the algorithm. When the step size equals 1, the Metropolis algorithm works correctly (i.e., has samples concentrated at the correct log-posterior mode) for all the initializations that we looked at previously. However the acceptance rate of the chain will be small.

#First initialization
mu1_init = 0.0
mu2_init = 3.0

#Second initialization
mu1_init = 4.0
mu2_init = 3.0

#Random Initialization
mu1_init = np.random.uniform(-5, 5)
mu2_init = np.random.uniform(-5, 5)

#Fourth Initialization
mu1_init = 4.0
mu2_init = 4.0

res_rwm = rwm_mixture(y, mu1_init=mu1_init, mu2_init=mu2_init, p=0.7, n_iter=100000, step_size = 1.0)

print("Acceptance rate:", res_rwm["accept_rate"])

plt.figure(figsize=(6,5))
plt.contour(M1, M2, LL, levels=100)

burn = 0
mu1_post = res_rwm["mu1"][burn:]
mu2_post = res_rwm["mu2"][burn:]

plt.plot(mu1_post, mu2_post, linewidth=0.5, alpha=0.5)
plt.scatter(mu1_post, mu2_post, s=5, alpha=0.5)

plt.scatter(mu1_init, mu2_init, s=60, marker='x', label='start')
plt.scatter(res_rwm["mu1"][-1], res_rwm["mu2"][-1], s=60, marker='*', label='end')

plt.xlabel(r'$\mu_1$')
plt.ylabel(r'$\mu_2$')
plt.title("Log-likelihood contours with RWM samples")

plt.legend()
plt.show()
Acceptance rate: 0.01288
<Figure size 600x500 with 1 Axes>

On the other hand, when the step size is smaller (in this example, 0.5 or smaller), then the Metropolis algorithm behaves similarly to the Gibbs sampler case (with the possibility of exploring the spurious mode for certain initializations).

#First initialization
mu1_init = 0.0
mu2_init = 3.0

#Second initialization
#mu1_init = 4.0
#mu2_init = 3.0

#Random Initialization
#mu1_init = np.random.uniform(-5, 5)
#mu2_init = np.random.uniform(-5, 5)

#Fourth Initialization
#mu1_init = 4.0
#mu2_init = 4.0

res_rwm = rwm_mixture(y, mu1_init=mu1_init, mu2_init=mu2_init, p=0.7, n_iter=100000, step_size = 0.5)

print("Acceptance rate:", res_rwm["accept_rate"])

plt.figure(figsize=(6,5))
plt.contour(M1, M2, LL, levels=100)

burn = 0
mu1_post = res_rwm["mu1"][burn:]
mu2_post = res_rwm["mu2"][burn:]

plt.plot(mu1_post, mu2_post, linewidth=0.5, alpha=0.5)
plt.scatter(mu1_post, mu2_post, s=5, alpha=0.5)

plt.scatter(mu1_init, mu2_init, s=60, marker='x', label='start')
plt.scatter(res_rwm["mu1"][-1], res_rwm["mu2"][-1], s=60, marker='*', label='end')

plt.xlabel(r'$\mu_1$')
plt.ylabel(r'$\mu_2$')
plt.title("Log-likelihood contours with RWM samples")

plt.legend()
plt.show()
Acceptance rate: 0.04871
<Figure size 600x500 with 1 Axes>

Overall, the summary from this simple simulation exercise is that the Gibbs sampler is sensitive to initialization in this simple mixture model. The Random-Walk Metropolis algorithm works with a large enough scale for the proposal chain.