Let us introduce some notation. The goal is to obtain samples from a target distribution with density π (in our applications, π will be the posterior density). π represents the density on Rd of a random vector Θ of size d×1:
Θ∼π.
We assume that Θ can be decomposed as:
Θ=(Θ(1),…,Θ(k))
for k subvectors Θ(1),…,Θ(k). The jth subvector is Θ(j). We shall also denote the vector obtained by dropping the jth subvector Θ(j) from Θ by Θ−(j). For example,
Θ−(1)=(Θ(2),…,Θ(k)) and Θ−(2)=(Θ(1),Θ(3),…,Θ(k)).
Gibbs sampling works according to the following algorithm.
Initialize with an arbitrary value θ(0).
Repeat the following for t=1,…,N (for a large number N):
Pick J uniformly at random from 1,…,k. Suppose J turned out to be j.
Set θ−(j)(t+1)=θ−(j)(t) i.e., θ(i)(t+1)=θ(i)(t) for i=j.
θ(j)(t+1) is randomly drawn according to the conditional distribution of Θ(j) given Θ−(j)=θ−(j)(t).
This algorithm assumes that we have some way of generating observations from the conditional distribution of Θ(j) given Θ−(j) for each j.
In the first two examples below, the posterior can be written in closed form so MCMC is not really necessary. These are still useful for evaluating the performance of the Gibbs sampler. The third example is more non-trivial and illustrates the usefulness of the Gibbs sampler.
We want to sample (θ1,θ2) from the following bivariate distribution:
(θ1θ2)∣∣(y1y2)∼N((y1y2),(1ρρ1)).
Here y=(y1,y2)T is fixed. One can, of course, directly sample from this normal distribution without any MCMC. We can use this example to illustrate the performance of the Gibbs sampler. To use the Gibbs sampler, observe that
θ1∣θ2,y∼N(y1+ρ(θ2−y2),1−ρ2) and θ2∣θ1,y∼N(y2+ρ(θ1−y1),1−ρ2)
The Gibbs sampler algorithm then would be the following:
Initialize at arbitrary θ1(0) and θ2(0).
Repeat the following for t=1,…,N:
Pick J to be either 1 or 2 with probability 0.5.
If J=1, take θ2(t+1)=θ2(t) and generate θ1(t+1) from the normal distribution with mean y1+ρ(θ2(t)−y2) and variance 1−ρ2.
If J=2, take θ1(t+1)=θ1(t) and generate θ2(t+1) from the normal distribution with mean y2+ρ(θ1(t)−y1) and variance 1−ρ2.
This example is also useful for illustrating one potential problem with the Gibbs sampler. Suppose ρ=1. In this case θ1−y1∼N(0,1) and θ2−y2=θ1−y1. The Gibbs iterates in this case will be
θ1(t+1)−y1=θ2(t)−y2if J=1
and
θ2(t+1)−y2=θ1(t)−y1if J=2
This results in the chain not moving after the first iterate. For example, if the first step is:
(θ1(0)θ2(0))→(y1−y2+θ2(0)θ2(0)),
then the chain will stay at (y1−y2+θ2(0)θ2(0)) forever. On the other hand, if the first step is
(θ1(0)θ2(0))→(θ1(0)y2−y1+θ1(0)),
then the chain will stay forever at (θ1(0)y2−y1+θ1(0)). This chain will obviously not converge unless the initial iterate already has the right distribution. A similar problem arises when ρ=−1.
The lesson to be drawn from this is that the Gibbs sampler has issues of convergence when there is high correlation between the variables.
Consider the linear regression model that we discussed previously. We observe data (y1,x1),…,(yn,xn) with y1,…,yn real-valued and x1,…,xn∈Rp. We treat x1,…,xn as non-random and y1,…,yn as random. The parameter is θ∈Rp+1 with θ=(β,σ) (here β∈Rp represents the vector of regression coefficients and σ is the noise standard deviation). Consider the model
yi∼N(xi′β,σ2)
with y1,…,yn being independent conditional on θ along with the (improper) prior:
We can directly sample from this posterior (e.g., by first sampling from the marginal posterior of σ and then from the conditional posterior of β given σ. Alternatively, one can first sample from the marginal posterior of β and then from the conditional posterior of σ given β).
Now let us use the Gibbs sample to simulate from (1). Note that:
Our next application of the Gibbs sampler is to Probit Regression which was introduced in Homework Four. The model is given by the following. We observe data (y1,x1),…,(yn,xn) with y1,…,yn binary and x1,…,xn∈Rp. x1,…,xn are treated non-random and y1,…,yn random. The parameter is β∈Rp. The probit regression model assumes that y1…,yn are independent given β with
yi∣β∼Bernoulli(Φ(xi′β))
where Φ(⋅) is the standard normal cdf. Let π(⋅) be our prior for β. The posterior is then given by:
Suppose we take the (improper) uniform prior for β so that the posterior becomes
π(β∣data)∝i=1∏n(Φ(xi′β))yi(1−Φ(xi′β))1−yi.
It is not actually clear how to use the Gibbs sampler on the above posterior. We can try decomposing β as (β1,…,βp) and then attempt to obtain samples from the conditional:
I am not sure how to obtain samples from the above conditional.
There is a trick that people use to run the Gibbs sampler for probit regression. Let us introduce random variables w1,…,wn which, conditional on β, are independent with
wi∣β∼N(xi′β,1).
We then let
yi=I{wi>0}.
It is then clear that y1,…,yn defined thus have the same likelihood as in the probit regression model. Let w=(w1,…,wn). The trick is to use the Gibbs sampler with (w,β). We shall see how this is done in the next class.