Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

STAT 238 - Bayesian Statistics Lecture Twenty Six

Spring 2026, UC Berkeley

The Gibbs Sampler

Let us introduce some notation. The goal is to obtain samples from a target distribution with density π\pi (in our applications, π\pi will be the posterior density). π\pi represents the density on Rd\R^d of a random vector Θ\Theta of size d×1d \times 1:

Θπ.\Theta \sim \pi.

We assume that Θ\Theta can be decomposed as:

Θ=(Θ(1),,Θ(k))\Theta = (\Theta_{(1)}, \dots, \Theta_{(k)})

for kk subvectors Θ(1),,Θ(k)\Theta_{(1)}, \dots, \Theta_{(k)}. The jthj^{th} subvector is Θ(j)\Theta_{(j)}. We shall also denote the vector obtained by dropping the jthj^{th} subvector Θ(j)\Theta_{(j)} from Θ\Theta by Θ(j)\Theta_{-(j)}. For example,

Θ(1)=(Θ(2),,Θ(k))   and   Θ(2)=(Θ(1),Θ(3),,Θ(k)).\Theta_{-(1)} = (\Theta_{(2)}, \dots, \Theta_{(k)}) ~~ \text{ and } ~~ \Theta_{-(2)} = (\Theta_{(1)}, \Theta_{(3)}, \dots, \Theta_{(k)}).

Gibbs sampling works according to the following algorithm.

  1. Initialize with an arbitrary value θ(0)\theta^{(0)}.

  2. Repeat the following for t=1,,Nt = 1, \dots, N (for a large number NN):

    1. Pick JJ uniformly at random from 1,,k1, \dots, k. Suppose JJ turned out to be jj.

    2. Set θ(j)(t+1)=θ(j)(t)\theta^{(t+1)}_{-(j)} = \theta^{(t)}_{-(j)} i.e., θ(i)(t+1)=θ(i)(t)\theta^{(t+1)}_{(i)} = \theta^{(t)}_{(i)} for iji \neq j.

    3. θ(j)(t+1)\theta^{(t+1)}_{(j)} is randomly drawn according to the conditional distribution of Θ(j)\Theta_{(j)} given Θ(j)=θ(j)(t)\Theta_{-(j)} = \theta^{(t)}_{-(j)}.

This algorithm assumes that we have some way of generating observations from the conditional distribution of Θ(j)\Theta_{(j)} given Θ(j)\Theta_{-(j)} for each jj.

Examples of the Gibbs Sampler

In the first two examples below, the posterior can be written in closed form so MCMC is not really necessary. These are still useful for evaluating the performance of the Gibbs sampler. The third example is more non-trivial and illustrates the usefulness of the Gibbs sampler.

Bivariate Normal

We want to sample (θ1,θ2)(\theta_1, \theta_2) from the following bivariate distribution:

(θ1θ2)(y1y2)N((y1y2),(1ρρ1)).\begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix} \Biggr| \begin{pmatrix}y_1 \\ y_2 \end{pmatrix} \sim N \left(\begin{pmatrix}y_1 \\ y_2 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right).

Here y=(y1,y2)Ty = (y_1, y_2)^T is fixed. One can, of course, directly sample from this normal distribution without any MCMC. We can use this example to illustrate the performance of the Gibbs sampler. To use the Gibbs sampler, observe that

θ1θ2,yN(y1+ρ(θ2y2),1ρ2)   and   θ2θ1,yN(y2+ρ(θ1y1),1ρ2)\theta_1 | \theta_2, y \sim N(y_1 + \rho(\theta_2 - y_2), 1 - \rho^2) ~~ \text{ and } ~~ \theta_2 | \theta_1, y \sim N(y_2 + \rho(\theta_1 - y_1), 1 - \rho^2)

The Gibbs sampler algorithm then would be the following:

  1. Initialize at arbitrary θ1(0)\theta_1^{(0)} and θ2(0)\theta_2^{(0)}.

  2. Repeat the following for t=1,,Nt = 1, \dots, N:

    1. Pick JJ to be either 1 or 2 with probability 0.5.

    2. If J=1J = 1, take θ2(t+1)=θ2(t)\theta_2^{(t+1)} = \theta_2^{(t)} and generate θ1(t+1)\theta_1^{(t+1)} from the normal distribution with mean y1+ρ(θ2(t)y2)y_1 + \rho (\theta_2^{(t)} - y_2) and variance 1ρ21 - \rho^2.

    3. If J=2J = 2, take θ1(t+1)=θ1(t)\theta_1^{(t+1)} = \theta_1^{(t)} and generate θ2(t+1)\theta_2^{(t+1)} from the normal distribution with mean y2+ρ(θ1(t)y1)y_2 + \rho (\theta_1^{(t)} - y_1) and variance 1ρ21 - \rho^2.

This example is also useful for illustrating one potential problem with the Gibbs sampler. Suppose ρ=1\rho = 1. In this case θ1y1N(0,1)\theta_1 - y_1 \sim N(0, 1) and θ2y2=θ1y1\theta_2 - y_2 = \theta_1 - y_1. The Gibbs iterates in this case will be

θ1(t+1)y1=θ2(t)y2   if J=1\theta_1^{(t+1)} - y_1 = \theta_2^{(t)} - y_2 ~~~ \text{if $J = 1$}

and

θ2(t+1)y2=θ1(t)y1   if J=2\theta_2^{(t+1)} - y_2 = \theta_1^{(t)} - y_1 ~~~ \text{if $J = 2$}

This results in the chain not moving after the first iterate. For example, if the first step is:

(θ1(0)θ2(0))(y1y2+θ2(0)θ2(0)),\begin{pmatrix} \theta_1^{(0)} \\ \theta_2^{(0)}\end{pmatrix} \rightarrow \begin{pmatrix} y_1 - y_2 + \theta_2^{(0)}\\ \theta_2^{(0)}\end{pmatrix},

then the chain will stay at (y1y2+θ2(0)θ2(0))\begin{pmatrix} y_1 - y_2 + \theta_2^{(0)}\\ \theta_2^{(0)}\end{pmatrix} forever. On the other hand, if the first step is

(θ1(0)θ2(0))(θ1(0)y2y1+θ1(0)),\begin{pmatrix} \theta_1^{(0)} \\ \theta_2^{(0)}\end{pmatrix} \rightarrow \begin{pmatrix} \theta_1^{(0)}\\ y_2 - y_1 + \theta_1^{(0)}\end{pmatrix},

then the chain will stay forever at (θ1(0)y2y1+θ1(0))\begin{pmatrix} \theta_1^{(0)}\\ y_2 - y_1 + \theta_1^{(0)}\end{pmatrix}. This chain will obviously not converge unless the initial iterate already has the right distribution. A similar problem arises when ρ=1\rho = -1.

The lesson to be drawn from this is that the Gibbs sampler has issues of convergence when there is high correlation between the variables.

Linear Regression

Consider the linear regression model that we discussed previously. We observe data (y1,x1),,(yn,xn)(y_1, x_1), \dots, (y_n, x_n) with y1,,yny_1, \dots, y_n real-valued and x1,,xnRpx_1, \dots, x_n \in \R^p. We treat x1,,xnx_1, \dots, x_n as non-random and y1,,yny_1, \dots, y_n as random. The parameter is θRp+1\theta \in \R^{p+1} with θ=(β,σ)\theta = (\beta, \sigma) (here βRp\beta \in \R^p represents the vector of regression coefficients and σ\sigma is the noise standard deviation). Consider the model

yiN(xiβ,σ2)y_i \sim N(x_i' \beta, \sigma^2)

with y1,,yny_1, \dots, y_n being independent conditional on θ\theta along with the (improper) prior:

π(β,σ)1σI{σ>0}.\pi(\beta, \sigma) \propto \frac{1}{\sigma} I\{\sigma > 0\}.

The exact joint posterior is given by

π(β,σdata)σ(n+1)exp(12σ2YXβ2)I{σ>0}\pi(\beta, \sigma | \text{data}) \propto \sigma^{-(n+1)} \exp \left(\frac{-1}{2 \sigma^2} \|Y - X \beta\|^2 \right) I\{\sigma > 0\}

We can directly sample from this posterior (e.g., by first sampling from the marginal posterior of σ\sigma and then from the conditional posterior of β\beta given σ\sigma. Alternatively, one can first sample from the marginal posterior of β\beta and then from the conditional posterior of σ\sigma given β\beta).

Now let us use the Gibbs sample to simulate from (1). Note that:

π(βσ,data)exp(12σ2YXβ2)exp(12σ2XβXβ^2)=exp(12σ2(ββ^)(XX)(ββ^))\pi(\beta | \sigma, \text{data}) \propto \exp \left(\frac{-1}{2 \sigma^2} \|Y - X \beta\|^2 \right) \propto \exp \left(\frac{-1}{2 \sigma^2} \|X \beta - X \hat \beta\|^2 \right) = \exp \left(\frac{-1}{2 \sigma^2} (\beta - \hat \beta)' (X' X) (\beta - \hat \beta) \right)

which gives

βσ,dataNp(β^,σ2(XX)1).\beta | \sigma, \text{data} \sim N_p \left(\hat \beta, \sigma^2 (X' X)^{-1} \right).

Here β^\hat{\beta} is the least squares estimator. For the conditional posterior of σ\sigma given β\beta, note that

π(σβ,data)σ(n+1)exp(12σ2YXβ2)I{σ>0}.\pi(\sigma | \beta, \text{data}) \propto \sigma^{-(n+1)} \exp \left(\frac{-1}{2 \sigma^2} \|Y - X \beta\|^2 \right) I\{\sigma > 0\}.

Using the formula f1/σ2(x)=fσ(x1/2)x3/2f_{1/\sigma^2}(x) = f_{\sigma}(x^{-1/2}) x^{-3/2}, it is now easy to check that the density of 1/σ21/\sigma^2 conditional on β\beta and the data is proportional to

x(n2)/2exp(x2YXβ2)I{x>0}x^{(n-2)/2} \exp \left(\frac{-x}{2} \|Y - X \beta\|^2 \right) I \{x > 0\}

which means that

1σ2β,dataGamma(n2,12YXβ2).\frac{1}{\sigma^2} \biggr| \beta, \text{data} \sim \text{Gamma} \left(\frac{n}{2}, \frac{1}{2} \|Y - X\beta\|^2 \right).

The Gibbs sampler algorithm is thus:

  1. Initialize at arbitrary β(0)\beta^{(0)} and σ(0)\sigma^{(0)}.

  2. Repeat the following for t=1,,Nt = 1, \dots, N:

    1. Pick JJ to be either 1 or 2 with equal probability.

    2. If J=1J =1, take β(t+1)Np(β^,(σ(t))2(XX)1)\beta^{(t+1)} \sim N_p(\hat{\beta}, (\sigma^{(t)})^2 (X' X)^{-1}) and σ(t+1)=σ(t)\sigma^{(t+1)} = \sigma^{(t)}.

    3. If J=2J = 2, take β(t+1)=β(t)\beta^{(t+1)} = \beta^{(t)}. Generate an observation xx from the Gamma distribution with parameters n/2n/2 and YXβ(t)2/2\|Y - X\beta^{(t)}\|^2/2. Then take σ(t+1)=x1/2\sigma^{(t+1)} = x^{-1/2}.

Probit Regression

Our next application of the Gibbs sampler is to Probit Regression which was introduced in Homework Four. The model is given by the following. We observe data (y1,x1),,(yn,xn)(y_1, x_1), \dots, (y_n, x_n) with y1,,yny_1, \dots, y_n binary and x1,,xnRpx_1, \dots, x_n \in \R^p. x1,,xnx_1, \dots, x_n are treated non-random and y1,,yny_1, \dots, y_n random. The parameter is βRp\beta \in \R^p. The probit regression model assumes that y1,yny_1 \dots, y_n are independent given β\beta with

yiβBernoulli(Φ(xiβ))y_i | \beta \sim \text{Bernoulli} \left(\Phi(x_i' \beta) \right)

where Φ()\Phi(\cdot) is the standard normal cdf. Let π()\pi(\cdot) be our prior for β\beta. The posterior is then given by:

π(βdata)π(β)i=1n(Φ(xiβ))yi(1Φ(xiβ))1yi.\pi(\beta | \text{data}) \propto \pi(\beta) \prod_{i=1}^n \left(\Phi(x_i' \beta) \right)^{y_i} \left(1 - \Phi(x_i' \beta) \right)^{1-y_i}.

Suppose we take the (improper) uniform prior for β\beta so that the posterior becomes

π(βdata)i=1n(Φ(xiβ))yi(1Φ(xiβ))1yi.\pi(\beta | \text{data}) \propto \prod_{i=1}^n \left(\Phi(x_i' \beta) \right)^{y_i} \left(1 - \Phi(x_i' \beta) \right)^{1-y_i}.

It is not actually clear how to use the Gibbs sampler on the above posterior. We can try decomposing β\beta as (β1,,βp)(\beta_1, \dots, \beta_p) and then attempt to obtain samples from the conditional:

π(βjβj,data)i=1n(Φ(xiβ))yi(1Φ(xiβ))1yi.\pi(\beta_j | \beta_{-j}, \text{data}) \propto \prod_{i=1}^n \left(\Phi(x_i' \beta) \right)^{y_i} \left(1 - \Phi(x_i' \beta) \right)^{1-y_i}.

I am not sure how to obtain samples from the above conditional.

There is a trick that people use to run the Gibbs sampler for probit regression. Let us introduce random variables w1,,wnw_1, \dots, w_n which, conditional on β\beta, are independent with

wiβN(xiβ,1).w_i | \beta \sim N(x_i' \beta, 1).

We then let

yi=I{wi>0}.y_i = I\{w_i > 0\}.

It is then clear that y1,,yny_1, \dots, y_n defined thus have the same likelihood as in the probit regression model. Let w=(w1,,wn)w = (w_1, \dots, w_n). The trick is to use the Gibbs sampler with (w,β)(w, \beta). We shall see how this is done in the next class.