
We shall look at how the Gibbs sampler works in some simple examples.

import numpy as np
import matplotlib.pyplot as plt

Example One: Bivariate Normal

Below we use the Gibbs sampler to generate samples from the bivariate normal distribution with mean $(y_1, y_2) = (0, 0)$, variances equal to 1 and correlation equal to $\rho$. The main message here is that the Gibbs sampler works well unless $|\rho|$ is very close to one.
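The updates in the sampler below use the standard full conditionals of this bivariate normal: with unit variances,

\begin{align*} X_1 \mid X_2 = x_2 &\sim N\big(y_1 + \rho (x_2 - y_2), \; 1 - \rho^2\big), \\ X_2 \mid X_1 = x_1 &\sim N\big(y_2 + \rho (x_1 - y_1), \; 1 - \rho^2\big), \end{align*}

which is why the conditional standard deviation in the code is $\sqrt{1 - \rho^2}$.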

y1, y2 = 0, 0
rho = 0.8  # correlation between the two variables
#rho = 0.9999  # try this value to see the sampler struggle
state = np.array([0.0, 0.0], dtype = float) #this is the initial state
nit = 100000 #number of iterations for the Gibbs sampler

path = np.zeros((nit, 2))
path[0] = state

sd = np.sqrt(1 - rho**2) #this is the standard deviation of the conditional distribution

for t in range(1, nit):
    j = np.random.randint(2)
    if j == 0:
        path[t, 0] = y1 + rho * (state[1] - y2) + sd * np.random.randn()
        path[t, 1] = state[1]
    else:
        path[t, 0] = state[0]
        path[t, 1] = y2 + rho * (state[0] - y1) + sd * np.random.randn()
    state = path[t]

print(path[:8]) #first 8 samples from the Gibbs sampler
print(path[-8:]) #last 8 samples from the Gibbs sampler
[[ 0.          0.        ]
 [ 0.          0.32019321]
 [ 0.          0.1621067 ]
 [-0.53972888  0.1621067 ]
 [ 0.133507    0.1621067 ]
 [-0.85725147  0.1621067 ]
 [-0.85725147 -0.64325011]
 [-0.85725147 -0.5493405 ]]
[[ 0.20994665  0.63532072]
 [-0.72248697  0.63532072]
 [-0.72248697  0.24007978]
 [-1.07082005  0.24007978]
 [-0.28484312  0.24007978]
 [-0.44242363  0.24007978]
 [-0.44242363  0.54204849]
 [ 0.22792695  0.54204849]]
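As a quick sanity check, we can rerun the sampler in a self-contained form (with its own fixed seed, chosen here just for reproducibility) and confirm that the empirical correlation and standard deviations of the samples match the target distribution:

```python
import numpy as np

# Random-scan Gibbs sampler for the bivariate normal with mean (0, 0),
# unit variances, and correlation rho = 0.8.
rng = np.random.default_rng(0)
rho = 0.8
sd = np.sqrt(1 - rho**2)  # conditional standard deviation
nit = 100_000
path = np.zeros((nit, 2))
state = np.zeros(2)
for t in range(1, nit):
    if rng.integers(2) == 0:
        state[0] = rho * state[1] + sd * rng.standard_normal()
    else:
        state[1] = rho * state[0] + sd * rng.standard_normal()
    path[t] = state

emp_corr = np.corrcoef(path[:, 0], path[:, 1])[0, 1]  # should be near 0.8
emp_std = path.std(axis=0)                            # should be near (1, 1)
print(emp_corr, emp_std)
```

With this many iterations the empirical correlation lands within a couple of percent of $\rho$ and the marginal standard deviations within a couple of percent of 1.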
plt.figure(figsize=(6, 6))
plt.scatter(path[:, 0], path[:, 1], alpha=0.5)
plt.title("Scatter plot of Gibbs samples")
plt.xlabel("X1")
plt.ylabel("X2")
plt.axis("equal")
plt.show()
plt.figure(figsize=(6, 6))
nplot = 60
plt.plot(path[:nplot, 0], path[:nplot, 1], "-o", lw=0.8)   # path of the chain
plt.plot(path[0, 0], path[0, 1], "o", label="start")   # starting point
plt.title("Path of the Gibbs sampler")
plt.xlabel("X1")
plt.ylabel("X2")
plt.axis("equal")
plt.legend()
plt.show()

Below we plot the trace plots for each of the two variables $X_1$ and $X_2$. The trace plot for $X_1$ is the plot of $X_1^{(t)}$ versus $t$ and the trace plot for $X_2$ is the plot of $X_2^{(t)}$ versus $t$. One uses these to check whether the Markov chain has mixed properly.

# Grid and standard normal density, used below when comparing histograms to N(0, 1):
xx = np.linspace(-4, 4, 400)
dnorm = (1 / np.sqrt(2 * np.pi)) * np.exp(-xx**2 / 2)

plt.figure(figsize=(12, 8))

plt.subplot(2, 1, 1)
plt.plot(path[:2000, 0], lw=0.7)
plt.title("Trace plot for X1")
plt.xlabel("Iteration")
plt.ylabel("X1")

plt.subplot(2, 1, 2)
plt.plot(path[:2000, 1], lw=0.7)
plt.title("Trace plot for X2")
plt.xlabel("Iteration")
plt.ylabel("X2")

plt.tight_layout()
plt.show()

Below we plot the histograms of the samples (separately for $X_1$ and $X_2$) and check how close they are to the true standard normal densities.

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.hist(path[:, 0], bins=90, density=True, alpha=0.6)
plt.plot(xx, dnorm, lw=2)
plt.title("Histogram of X1 vs N(0,1)")
plt.xlabel("X1")
plt.ylabel("Density")

plt.subplot(1, 2, 2)
plt.hist(path[:, 1], bins=90, density=True, alpha=0.6)
plt.plot(xx, dnorm, lw=2)
plt.title("Histogram of X2 vs N(0,1)")
plt.xlabel("X2")
plt.ylabel("Density")

plt.tight_layout()
plt.show()

If you go back and change $\rho$ to a value close to 1 (say $\rho = 0.9999$), the marginal histograms will be quite far from the standard normal densities. This shows that the Gibbs sampler has trouble when the variables are highly correlated.
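The slow mixing can also be quantified with autocorrelations. The following self-contained experiment (the chain length and lag are chosen purely for illustration) compares the lag-100 autocorrelation of the $X_1$ samples for $\rho = 0.8$ and $\rho = 0.9999$:

```python
import numpy as np

def gibbs_chain(rho, nit, seed=0):
    # Random-scan Gibbs sampler for the bivariate normal with correlation rho.
    rng = np.random.default_rng(seed)
    sd = np.sqrt(1 - rho**2)
    path = np.zeros((nit, 2))
    state = np.zeros(2)
    for t in range(1, nit):
        if rng.integers(2) == 0:
            state[0] = rho * state[1] + sd * rng.standard_normal()
        else:
            state[1] = rho * state[0] + sd * rng.standard_normal()
        path[t] = state
    return path

def lag_autocorr(x, lag):
    # Empirical lag-k autocorrelation of a 1-d series.
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

nit, lag = 20_000, 100
ac_low = lag_autocorr(gibbs_chain(0.8, nit)[:, 0], lag)      # essentially zero
ac_high = lag_autocorr(gibbs_chain(0.9999, nit)[:, 0], lag)  # close to one
print(ac_low, ac_high)
```

For $\rho = 0.8$ the lag-100 autocorrelation is essentially zero, while for $\rho = 0.9999$ it stays close to one: successive samples barely move, which is exactly the mixing failure visible in the histograms.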

Example Two: Linear Regression

The second example is linear regression. The model is $y = X \beta + \epsilon$ with $\epsilon \sim N(0, \sigma^2 I_n)$, and the prior takes each $\beta_j$ and $\log \sigma$ to be uniform on $(-\infty, \infty)$.

For illustration, we use the heights dataset.

import pandas as pd
height = pd.read_csv("PearsonHeightData.txt", sep = r"\s+")
print(height)
      Father   Son
0       65.0  59.8
1       63.3  63.2
2       65.0  63.3
3       65.8  62.8
4       61.1  64.3
...      ...   ...
1073    67.0  70.8
1074    71.3  68.3
1075    71.8  69.3
1076    70.7  69.3
1077    70.3  67.0

[1078 rows x 2 columns]
plt.figure(figsize = (10, 4))
plt.scatter(height["Father"], height["Son"])
plt.xlabel("Father's height")
plt.ylabel("Son's height")
plt.title("Scatter plot of heights data")
plt.show()
n = height.shape[0]
X = np.column_stack([np.ones(n), height["Father"]]) #this is the X matrix (consisting of covariates and the column of ones)
y = height["Son"] #this is the y vector consisting of response values

The usual (frequentist) linear regression fit uses the `OLS` function in statsmodels.

import statsmodels.api as sm
linmod = sm.OLS(y, X).fit()
print(linmod.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    Son   R-squared:                       0.251
Model:                            OLS   Adj. R-squared:                  0.250
Method:                 Least Squares   F-statistic:                     360.9
Date:                Sat, 04 Apr 2026   Prob (F-statistic):           1.27e-69
Time:                        00:45:11   Log-Likelihood:                -2489.4
No. Observations:                1078   AIC:                             4983.
Df Residuals:                    1076   BIC:                             4993.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         33.8928      1.833     18.491      0.000      30.296      37.489
x1             0.5140      0.027     18.997      0.000       0.461       0.567
==============================================================================
Omnibus:                       17.527   Durbin-Watson:                   0.765
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               30.642
Skew:                          -0.052   Prob(JB):                     2.22e-07
Kurtosis:                       3.819   Cond. No.                     1.67e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.67e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

The estimate of $\sigma$ is obtained as follows.

bhat = linmod.params.values
print(bhat)
Sbhat = np.sum(linmod.resid ** 2) #residual sum of squares
m = X.shape[1] - 1 #in this case, m = 1
sighat = np.sqrt(Sbhat/(n-m-1)) #this is also called the residual standard error
print(sighat)
[33.89280054  0.51400591]
2.438134379352427

The Gibbs sampler for linear regression is implemented below. It is based on the following full conditionals:

\begin{align*} \beta \mid \sigma, \text{data} \sim N_p \left(\hat{\beta}, \sigma^2 (X^T X)^{-1} \right) \end{align*}

where $\hat{\beta}$ is the least squares estimate, and

\begin{align*} \frac{1}{\sigma^2} \mid \beta, \text{data} \sim \text{Gamma} \left(\frac{n}{2}, \frac{1}{2} \|y - X \beta\|^2 \right). \end{align*}
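One way to see where both conditionals come from: the flat prior on $(\beta, \log \sigma)$ corresponds to the density $\pi(\beta, \sigma) \propto 1/\sigma$, so the joint posterior is

\begin{align*} \pi(\beta, \sigma \mid \text{data}) \propto \sigma^{-n-1} \exp \left( -\frac{\|y - X\beta\|^2}{2 \sigma^2} \right). \end{align*}

For fixed $\sigma$, this is proportional (as a function of $\beta$) to a normal density centered at $\hat{\beta}$ with covariance $\sigma^2 (X^T X)^{-1}$. For fixed $\beta$, substituting $u = 1/\sigma^2$ gives a density proportional to $u^{n/2 - 1} \exp\left(-u \|y - X\beta\|^2 / 2\right)$, which is the Gamma density above.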

So the Gibbs sampler algorithm is:

  1. Initialize at arbitrary $\beta^{(0)}$ and $\sigma^{(0)}$.

  2. Repeat the following for $t = 0, 1, \dots, N-1$. Pick $J$ to be either 1 or 2 with equal probability. If $J = 1$, take $\beta^{(t+1)} \sim N_p(\hat{\beta}, (\sigma^{(t)})^2 (X' X)^{-1})$ and $\sigma^{(t+1)} = \sigma^{(t)}$. If $J = 2$, take $\beta^{(t+1)} = \beta^{(t)}$, generate an observation $x$ from the Gamma distribution with parameters $n/2$ and $\|y - X\beta^{(t)}\|^2/2$, and take $\sigma^{(t+1)} = x^{-1/2}$.

This algorithm is implemented below.
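One implementation detail worth flagging: NumPy's gamma sampler is parameterized by shape and *scale*, while the conditional above is written with shape $n/2$ and *rate* $\|y - X\beta\|^2/2$, so the code passes `scale = 2/rss`. A quick check of the convention (the numbers here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
shape, rate = 5.0, 2.0
# NumPy's Generator.gamma takes a scale argument, so a rate r becomes scale 1/r.
samples = rng.gamma(shape=shape, scale=1 / rate, size=200_000)
emp_mean = samples.mean()
print(emp_mean)  # Gamma(shape=5, rate=2) has mean shape/rate = 2.5
```

Passing the rate where the scale is expected is an easy bug to make, and it silently produces a posterior for $\sigma$ that is off by a multiplicative factor.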

MM = np.linalg.inv(X.T @ X)

state = np.array([-30.0, 15.0, 250.0])   # bad initial value; still works
nit = 100000
path = np.zeros((nit, 3))
path[0] = state

for t in range(1, nit):
    j = np.random.randint(2)

    if j == 0:
        Cov_mat = (state[2] ** 2) * MM
        path[t, :2] = np.random.multivariate_normal(mean=bhat, cov=Cov_mat)
        path[t, 2] = state[2]

    else:
        path[t, :2] = state[:2]
        rss = np.sum((y - state[0] - state[1] * height["Father"]) ** 2)
        gg = np.random.gamma(shape=n / 2, scale=2 / rss)
        path[t, 2] = gg ** (-0.5)

    state = path[t]

    if (t + 1) % 2000 == 0:
        print(t + 1)
2000
4000
6000
...
98000
100000

Below we plot the histograms of the posterior samples (after removing a burn-in of 500) along with the exact marginal posterior densities (which are $t$-densities for $\beta_0$ and $\beta_1$, and chi-squared for $RSS/\sigma^2$, where $RSS$ denotes the residual sum of squares).

from scipy.stats import t, chi2, norm

df = n - 2
path2 = path[500:, :]

z_samp = Sbhat / (path2[:, 2] ** 2)

scale_beta0 = sighat * np.sqrt(MM[0, 0])
scale_beta1 = sighat * np.sqrt(MM[1, 1])

x0 = np.linspace(path2[:, 0].min(), path2[:, 0].max(), 400)
x1 = np.linspace(path2[:, 1].min(), path2[:, 1].max(), 400)
xz = np.linspace(max(1e-6, z_samp.min()), z_samp.max(), 400)

plt.figure(figsize=(15, 4))

plt.subplot(1, 3, 1)
plt.hist(path2[:, 0], bins=60, density=True, alpha=0.6)
plt.plot(x0, t.pdf(x0, df=df, loc=bhat[0], scale=scale_beta0), lw=2, label="t")
plt.plot(x0, norm.pdf(x0, loc=bhat[0], scale=scale_beta0), lw=2, ls="--", label="normal")
plt.title(r"Posterior of $\beta_0$")
plt.xlabel(r"$\beta_0$")
plt.ylabel("Density")
plt.legend()

plt.subplot(1, 3, 2)
plt.hist(path2[:, 1], bins=60, density=True, alpha=0.6)
plt.plot(x1, t.pdf(x1, df=df, loc=bhat[1], scale=scale_beta1), lw=2, label="t")
plt.plot(x1, norm.pdf(x1, loc=bhat[1], scale=scale_beta1), lw=2, ls="--", label="normal")
plt.title(r"Posterior of $\beta_1$")
plt.xlabel(r"$\beta_1$")
plt.ylabel("Density")
plt.legend()

plt.subplot(1, 3, 3)
plt.hist(z_samp, bins=60, density=True, alpha=0.6)
plt.plot(xz, chi2.pdf(xz, df=df), lw=2)
plt.title(r"Posterior of RSS/$\sigma^2$")
plt.xlabel(r"RSS/$\sigma^2$")
plt.ylabel("Density")

plt.tight_layout()
plt.show()

The histograms above are in close agreement with the exact marginal posterior densities, so the Gibbs sampler is working well.

Example Three: Probit Regression

Consider the frogs dataset that we previously used for logistic regression. Below we fit both the probit model and the logistic regression model using standard software. Note that the fitted coefficients in the two models agree up to a roughly constant scale factor, reflecting the different spreads of the normal and logistic latent distributions.

frogs = pd.read_csv("frogs.csv")

# Design matrix
X = frogs[['altitude','distance','NoOfPools','NoOfSites','avrain','meanmin','meanmax']].copy()
X['distance'] = np.log(X['distance'])
X['NoOfPools'] = np.log(X['NoOfPools'])
X = sm.add_constant(X)
y = frogs['pres.abs'].values

Xm = X.values
n, p = Xm.shape
linmod_cov = np.linalg.inv(Xm.T @ Xm)

# GLM (probit + logit)
print(sm.GLM(y, Xm, family=sm.families.Binomial(link=sm.families.links.Probit())).fit().summary())
print(sm.GLM(y, Xm, family=sm.families.Binomial()).fit().summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                      y   No. Observations:                  212
Model:                            GLM   Df Residuals:                      204
Model Family:                Binomial   Df Model:                            7
Link Function:                 Probit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -99.536
Date:                Sat, 04 Apr 2026   Deviance:                       199.07
Time:                        00:52:50   Pearson chi2:                     229.
No. Iterations:                     6   Pseudo R-squ. (CS):             0.3173
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         40.9617     73.193      0.560      0.576    -102.493     184.417
x1            -0.0090      0.021     -0.426      0.670      -0.050       0.032
x2            -0.4170      0.144     -2.896      0.004      -0.699      -0.135
x3             0.3396      0.122      2.794      0.005       0.101       0.578
x4             0.0044      0.062      0.071      0.944      -0.117       0.126
x5            -0.0102      0.034     -0.299      0.765      -0.077       0.057
x6             2.9654      0.863      3.437      0.001       1.274       4.656
x7            -2.4367      2.700     -0.902      0.367      -7.729       2.856
==============================================================================
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                      y   No. Observations:                  212
Model:                            GLM   Df Residuals:                      204
Model Family:                Binomial   Df Model:                            7
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -98.812
Date:                Sat, 04 Apr 2026   Deviance:                       197.62
Time:                        00:52:50   Pearson chi2:                     234.
No. Iterations:                     6   Pseudo R-squ. (CS):             0.3219
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         40.8988    132.663      0.308      0.758    -219.116     300.914
x1            -0.0066      0.039     -0.172      0.863      -0.082       0.069
x2            -0.7593      0.255     -2.973      0.003      -1.260      -0.259
x3             0.5727      0.216      2.649      0.008       0.149       0.997
x4            -0.0009      0.107     -0.008      0.993      -0.211       0.210
x5            -0.0068      0.060     -0.113      0.910      -0.124       0.111
x6             5.3048      1.543      3.439      0.001       2.281       8.328
x7            -3.1730      4.840     -0.656      0.512     -12.659       6.313
==============================================================================
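The scale factor relating the two fits comes from a standard approximation: the logistic CDF is close to a normal CDF stretched by a factor of about 1.7, which is why the logit coefficients above are roughly 1.7 times the probit ones. A quick check of the approximation (the factor 1.7 is the conventional choice, not something fitted here):

```python
import numpy as np
from scipy.stats import norm
from scipy.special import expit

# Maximum gap between the logistic CDF and a normal CDF with scale 1.7.
x = np.linspace(-4, 4, 400)
err = np.max(np.abs(expit(x) - norm.cdf(x / 1.7)))
print(err)  # the two curves agree to within about 0.01
```

Because the two link functions are this close, probit and logit fits typically give very similar fitted probabilities even though the raw coefficients sit on different scales.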

In the next lecture, we shall see how to perform Bayesian inference in this model (with the uniform prior on $\beta$) using the Gibbs sampler.