
STAT 238 - Bayesian Statistics Lecture Thirty Two

Spring 2026, UC Berkeley

We are discussing standard MCMC techniques to obtain samples from a probability distribution $\pi$. All these techniques are based on the Metropolis-Hastings scheme: given the current value $x$ of the chain, a proposal $y$ is first generated according to some transition kernel $Q(x, y)$, and then the proposal is accepted with probability:

\begin{align} \min \left(1, \frac{\pi(y) Q(y, x)}{\pi(x) Q(x, y)} \right). \end{align}

Different MCMC methods differ in how the proposal kernel $Q(x, y)$ is chosen. We shall look at three main techniques for choosing $Q(x, y)$: Random Walk Metropolis (RWM), the Metropolis Adjusted Langevin Algorithm (MALA), and Hamiltonian Monte Carlo (HMC).
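The scheme above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the names `log_pi`, `propose`, and `log_q` stand for a user-supplied log-target and proposal kernel, and everything is done on the log scale for numerical stability.

```python
import numpy as np

def mh_step(x, log_pi, propose, log_q, rng):
    """One Metropolis-Hastings step.

    log_pi : log-density of the target pi (up to an additive constant)
    propose: draws y ~ Q(x, .)
    log_q  : log Q(x, y) (up to an additive constant)
    """
    y = propose(x, rng)
    # log of the acceptance ratio pi(y) Q(y, x) / (pi(x) Q(x, y))
    log_alpha = log_pi(y) + log_q(y, x) - log_pi(x) - log_q(x, y)
    if np.log(rng.uniform()) < min(0.0, log_alpha):
        return y, True   # accept the proposal
    return x, False      # reject: stay at x
```

Iterating `mh_step` produces a Markov chain whose stationary distribution is $\pi$, whatever proposal kernel is plugged in.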

RWM

Here $y$ is generated from $x$ via:

\begin{align*} y = x + \sigma z ~~ \text{ where $z \sim N(0, I_d)$}. \end{align*}

In other words, the proposal kernel $Q(x, y)$ is given by:

\begin{align*} Q(x, y) = \left(\frac{1}{\sqrt{2 \pi} \sigma} \right)^{d} \exp \left(-\frac{\|y - x\|^2}{2 \sigma^2} \right). \end{align*}

Clearly this proposal is symmetric in the sense that $Q(x, y) = Q(y, x)$. As a result the acceptance probability above becomes simply

\begin{align*} \min \left(1, \frac{\pi(y)}{\pi(x)} \right). \end{align*}

When the dimension $d$ is small, this method works reasonably well. When $d$ is large, it will work only if $\sigma$ is chosen very small. However, with such a small $\sigma$, one would need to run the chain for a very long time to ensure adequate mixing. The consequence is that RWM does not work well when $d$ is large.
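A minimal RWM sampler can be sketched as follows (assuming only NumPy; `log_pi` is the user-supplied log-target, up to an additive constant):

```python
import numpy as np

def rwm_chain(log_pi, x0, sigma, n_steps, rng):
    """Random Walk Metropolis for a target with log-density log_pi.

    Proposal: y = x + sigma * z with z ~ N(0, I_d). The kernel is
    symmetric, so the acceptance probability reduces to min(1, pi(y)/pi(x)).
    Returns the chain and the empirical acceptance rate.
    """
    x = np.array(x0, dtype=float)
    lp_x = log_pi(x)
    chain = np.empty((n_steps, x.size))
    n_acc = 0
    for t in range(n_steps):
        y = x + sigma * rng.standard_normal(x.size)
        lp_y = log_pi(y)
        if np.log(rng.uniform()) < lp_y - lp_x:   # accept with prob min(1, pi(y)/pi(x))
            x, lp_x, n_acc = y, lp_y, n_acc + 1
        chain[t] = x
    return chain, n_acc / n_steps
```

The returned acceptance rate is the quantity whose behaviour in high dimensions is analysed below.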

Example: $\pi$ is $N(0, I_d)$

As an illustrative example, consider the case where:

\begin{align*} \pi(x) \propto \exp(-\|x\|^2/2). \end{align*}

In this case, we have

\begin{align*} \frac{\pi(y)}{\pi(x)} = \exp \left[\frac{1}{2} \left(\|x\|^2 - \|y\|^2 \right) \right]. \end{align*}

Because the proposed value $y$ will be $x + \sigma z$ for $z \sim N(0, I_d)$, we have

\begin{align*} \frac{\pi(y)}{\pi(x)} = \exp \left[\frac{1}{2} \left(\|x\|^2 - \|x + \sigma z\|^2\right) \right] = \exp(-A) \end{align*}

where

\begin{align*} A := \sigma \left<z, x \right> + \frac{\sigma^2}{2} \|z\|^2. \end{align*}

Because $z \sim N(0, I_d)$, it is easy to check that the mean and variance of $A$ satisfy:

\begin{align*} \E A = \frac{d \sigma^2}{2} \end{align*}

and

\begin{align*} \text{var}(A) &= \E \left( \sigma \left<z, x \right> + \frac{\sigma^2}{2} (\|z\|^2 - d) \right)^2 \\ &\leq 2 \sigma^2 \text{var}(\left<z, x \right>) + \frac{\sigma^4}{2} \text{var}(\|z\|^2) \leq 2 \sigma^2\|x\|^2 + 2 d \sigma^4 \end{align*}

which means that the standard deviation of $A$ is $O(\sigma \|x\| + \sigma^2 \sqrt{d})$. Suppose now that $\|x\| = O(\sqrt{d})$ and $\sigma \leq 1$; then the standard deviation of $A$ is:

\begin{align*} \text{std}(A) = O \left(\sigma \|x\| + \sigma^2 \sqrt{d} \right) = O \left(\sigma \sqrt{d} + \sigma^2 \sqrt{d} \right) = O(\sigma \sqrt{d}). \end{align*}

We thus have:

\begin{align*} \E A = \frac{d\sigma^2}{2} ~~ \text{ and } ~~ \text{std}(A) = O(\sigma \sqrt{d}). \end{align*}
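These two facts are easy to confirm by direct simulation. The sketch below (with arbitrary illustrative values of $d$ and $\sigma$, not values from the lecture) draws many copies of $z$ at a fixed typical $x$ and compares the empirical moments of $A$ with the formulas above:

```python
import numpy as np

# Monte Carlo check that E[A] = d * sigma^2 / 2 and std(A) = O(sigma * sqrt(d))
# for A = sigma * <z, x> + (sigma^2 / 2) * ||z||^2 with z ~ N(0, I_d).
rng = np.random.default_rng(0)
d, sigma = 400, 0.05                      # illustrative values
x = rng.standard_normal(d)                # a typical point: ||x|| is about sqrt(d)
z = rng.standard_normal((20_000, d))      # 20k independent draws of z
A = sigma * (z @ x) + 0.5 * sigma**2 * (z ** 2).sum(axis=1)
print(A.mean(), d * sigma**2 / 2)         # empirical mean vs d * sigma^2 / 2
print(A.std(), sigma * np.sqrt(d))        # std is of order sigma * sqrt(d)
```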

We can now consider two regimes of $\sigma$.

  1. Case 1: $\sigma \gg d^{-1/2}$. Then $\sigma \sqrt{d} \rightarrow \infty$ so that $\E A \gg \text{std}(A)$. Then (by standard concentration), $A$ is tightly concentrated around its mean $d\sigma^2/2 \rightarrow \infty$. Therefore the acceptance probability satisfies

    \begin{align*} \alpha(x, y) = \min \left(1, \frac{\pi(y)}{\pi(x)} \right) = \min \left(1, e^{-A} \right) \approx e^{-d\sigma^2/2} \rightarrow 0, \end{align*}

    which means that proposals are almost always rejected and the chain cannot mix.

  2. Case 2: $\sigma = O(d^{-1/2})$. Writing $\sigma = c/\sqrt{d}$ for a constant $c > 0$, we get

    \begin{align*} \E A = \frac{c^2}{2} = O(1), ~~ \text{std}(A) = O(c) = O(1). \end{align*}

    Now $A$ is an $O(1)$ random variable fluctuating around an $O(1)$ mean, so $e^{-A}$ is a positive $O(1)$ quantity. The acceptance probability is then bounded away from zero, so the chain can actually move.

The conclusion is that the acceptance probability of RWM is non-degenerate if and only if $\sigma = O(d^{-1/2})$. Equivalently, the proposal step must shrink as the dimension grows, with $\sigma \sim c/\sqrt{d}$ being the critical scaling. This is the fundamental tuning constraint for RWM in high dimensions. When $d$ is large, a step size of order $d^{-1/2}$ tends to be too small, with the consequence that the chain needs a very large number of steps to explore the whole of $\pi$. This makes RWM very inefficient in high dimensions.
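The dimension dependence derived above is easy to observe empirically. The sketch below (illustrative step sizes, $N(0, I_d)$ target) estimates the RWM acceptance rate both at a fixed $\sigma$ and at $\sigma = c/\sqrt{d}$:

```python
import numpy as np

def rwm_accept_rate(d, sigma, n_steps=2000, seed=0):
    """Empirical RWM acceptance rate for the N(0, I_d) target."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(d)   # start roughly in the bulk of pi
    n_acc = 0
    for _ in range(n_steps):
        y = x + sigma * rng.standard_normal(d)
        # symmetric proposal: accept with prob min(1, pi(y)/pi(x))
        if np.log(rng.uniform()) < 0.5 * (x @ x - y @ y):
            x, n_acc = y, n_acc + 1
    return n_acc / n_steps

# Fixed sigma: the acceptance rate collapses as d grows.
# sigma = 2/sqrt(d): the acceptance rate stays bounded away from 0.
for d in (10, 100, 1000):
    print(d, rwm_accept_rate(d, 0.5), rwm_accept_rate(d, 2 / np.sqrt(d)))
```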

MALA

Here $y$ is generated from $x$ via:

\begin{align} y = x + \frac{\sigma^2}{2} \nabla \log \pi(x) + \sigma z ~~ \text{ where $z \sim N(0, I_d)$}. \end{align}

In other words, the proposal kernel $Q(x, y)$ is given by:

\begin{align*} Q(x, y) = \left(\frac{1}{\sqrt{2 \pi} \sigma} \right)^{d} \exp \left(-\frac{\|y - x - (\sigma^2/2) \nabla \log \pi(x)\|^2}{2 \sigma^2} \right). \end{align*}

Clearly this proposal is non-symmetric in the sense that $Q(x, y) \neq Q(y, x)$, so the acceptance probability is now:

\begin{align*} \min \left(1, \frac{\pi(y) Q(y, x)}{\pi(x) Q(x, y)} \right). \end{align*}

As in RWM, it is important to tune $\sigma$ appropriately. When $d$ is large, one still needs to choose $\sigma$ small. However, $\sigma$ does not have to be as small as in the case of RWM for MALA to work well. This can be easily illustrated in the $N(0, I_d)$ example.
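A single MALA step can be sketched as follows (assuming NumPy; `log_pi` and `grad_log_pi` stand for the user-supplied log-target and its gradient):

```python
import numpy as np

def mala_step(x, log_pi, grad_log_pi, sigma, rng):
    """One MALA step: drift the proposal along grad log pi, then
    Metropolis-correct using the asymmetric kernel Q."""
    def log_q(a, b):
        # log Q(a, b) up to the normalising constant:
        # b is Gaussian around the drifted mean a + (sigma^2/2) grad log pi(a)
        m = a + 0.5 * sigma**2 * grad_log_pi(a)
        return -float((b - m) @ (b - m)) / (2 * sigma**2)

    y = x + 0.5 * sigma**2 * grad_log_pi(x) + sigma * rng.standard_normal(x.size)
    log_alpha = log_pi(y) + log_q(y, x) - log_pi(x) - log_q(x, y)
    if np.log(rng.uniform()) < min(0.0, log_alpha):
        return y, True
    return x, False
```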

Example: $\pi = N(0, I_d)$

Here $\pi(x) \propto \exp(-\|x\|^2/2)$ so that $\nabla \log \pi(x) = -x$. Then (below $a = 1 - (\sigma^2/2)$):

\begin{align*} \frac{\pi(y) Q(y, x)}{\pi(x) Q(x, y)} &= \exp \left(\frac{\|x\|^2 - \|y\|^2}{2} \right) \exp \left(\frac{\|y - a x\|^2 - \|x - a y\|^2}{2 \sigma^2} \right) \\ &= \exp \left(\frac{\|x\|^2 - \|y\|^2}{2} \right) \exp \left(\frac{1 - a^2}{2 \sigma^2} \left(\|y\|^2 - \|x\|^2 \right) \right) \\ &= \exp \left(\frac{\sigma^2}{8} \left(\|x\|^2 - \|y\|^2 \right) \right). \end{align*}

Plugging in

\begin{align*} y = x + \frac{1}{2} \sigma^2 \nabla \log \pi(x) + \sigma z = a x + \sigma z, \end{align*}

we obtain

\begin{align*} \frac{\pi(y) Q(y, x)}{\pi(x) Q(x, y)} &= \exp \left(\frac{\sigma^2}{8} \left(\|x\|^2 - \|a x + \sigma z\|^2 \right) \right) \\ &= \exp \left(\frac{\sigma^2}{8} \left[\sigma^2 (1 - (\sigma^2/4)) \|x\|^2 - \sigma^2 \|z\|^2 - 2 a \sigma \left< x, z \right>\right] \right) = \exp(-B) \end{align*}

where

\begin{align*} B := \frac{\sigma^2}{8} \left[ \sigma^2 \|z\|^2 + 2 a \sigma \left< x, z \right> - \sigma^2 (1 - (\sigma^2/4)) \|x\|^2 \right]. \end{align*}

Below we use $C$ for a generic positive constant whose value might change from occurrence to occurrence. Clearly

\begin{align*} \E B = \frac{\sigma^4}{8} \left[d - (1 - (\sigma^2/4)) \|x\|^2 \right] = \frac{\sigma^4}{8} \left((d - \|x\|^2) + \frac{\sigma^2}{4} \|x\|^2 \right). \end{align*}

Also

\begin{align*} \text{var}(B) \leq \frac{\sigma^4}{32} \left(\text{var}(\sigma^2 \|z\|^2) + \text{var}(2 a \sigma \left<x, z \right>) \right) = \frac{\sigma^4}{32} \left(2 \sigma^4 d + 4 a^2 \sigma^2 \|x\|^2 \right). \end{align*}

Because $a = 1 - \sigma^2/2 \leq 1$, we can write

\begin{align*} \text{var}(B) \leq C \sigma^6 \left(d \sigma^2 + \|x\|^2 \right) \end{align*}

so that

\begin{align*} \text{std}(B) \leq C \sigma^3 (\sigma \sqrt{d} + \|x\|). \end{align*}

Suppose now that $x$ is such that:

\begin{align} d - d^{2/3} \leq \|x\|^2 \leq C d \end{align}

for some constant $C$. Now let

\begin{align} \sigma = \ell d^{-1/6} \end{align}

for a constant $\ell$. Note that this value of $\sigma$ is much larger than $d^{-1/2}$, so that RWM proposals would almost always be rejected at this step size. But for the random variable $B$, we have

\begin{align*} \E B = \frac{\sigma^4}{8} \left((d - \|x\|^2) + \frac{\sigma^2}{4} \|x\|^2 \right) \leq \frac{\sigma^4}{8} \left(d^{2/3} + \frac{\sigma^2}{4} C d \right) \leq \frac{\ell^4}{8} + \frac{C \ell^6}{32} \end{align*}

which is bounded by a constant. Also

\begin{align*} \text{std}(B) \leq C \sigma^4 \sqrt{d} + C \sigma^3 \|x\| \leq C \ell^4 d^{-1/6} + C \ell^3 \end{align*}

which is also bounded by a constant. Thus for this choice of $\sigma$ (and under the above assumption on $\|x\|^2$), the random variable $B$ will be bounded above by a constant with high probability, which means that the acceptance probability stays of constant order. This is in sharp contrast to RWM, where the acceptance probability would be extremely small at this step size. This argument indicates that MALA will be more effective than RWM when $d$ is large. Specifically, MALA can be expected to work with much larger values of $\sigma$, which leads to faster mixing.
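This comparison can be checked numerically. The sketch below (illustrative choices $d = 1000$ and $\ell = 1$, Gaussian target) estimates the empirical acceptance rates of RWM and MALA at the common step size $\sigma = d^{-1/6}$:

```python
import numpy as np

def gaussian_accept_rates(d, sigma, n_steps=2000, seed=0):
    """Empirical RWM and MALA acceptance rates for pi = N(0, I_d),
    started from a typical point with ||x||^2 of order d."""
    rates = []
    for mala in (False, True):
        rng = np.random.default_rng(seed)
        x = rng.standard_normal(d)
        n_acc = 0
        for _ in range(n_steps):
            # grad log pi(x) = -x, so the MALA drift is -(sigma^2/2) x
            drift = -0.5 * sigma**2 * x if mala else 0.0
            y = x + drift + sigma * rng.standard_normal(d)
            log_ratio = 0.5 * (x @ x - y @ y)        # log pi(y) - log pi(x)
            if mala:
                # correction log Q(y, x) - log Q(x, y) for the drifted kernel
                my = y - 0.5 * sigma**2 * y          # proposal mean from y
                mx = x - 0.5 * sigma**2 * x          # proposal mean from x
                log_ratio += (-(x - my) @ (x - my) + (y - mx) @ (y - mx)) / (2 * sigma**2)
            if np.log(rng.uniform()) < log_ratio:
                x, n_acc = y, n_acc + 1
        rates.append(n_acc / n_steps)
    return tuple(rates)

d = 1000
sigma = d ** (-1 / 6)     # ell = 1: far larger than the RWM scale d^{-1/2}
print(gaussian_accept_rates(d, sigma))   # (rwm_rate, mala_rate)
```

At this step size the RWM rate is essentially zero while the MALA rate remains of constant order, matching the calculation above.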

HMC

In HMC, the proposal $y$ is generated by solving the following ODE:

\begin{align} x(0) = x ~~~~ \dot{x}(0) = z ~~~~ \ddot{x}(t) = \nabla \log \pi(x(t)) ~~~~ y = x(\sigma). \end{align}

Note that if $\nabla \log \pi(x(t))$ is replaced by the constant $\nabla \log \pi(x)$, then $x(t) = x + t z + (t^2/2) \nabla \log \pi(x)$, so that $y = x(\sigma)$ equals the MALA proposal.

This ODE is, in most cases, impossible to solve exactly. But in the Gaussian case, it is easy to write a formula for $y = x(\sigma)$. This is illustrated next.

$\pi = N(0, I_d)$

Here the ODE becomes

\begin{align*} x(0) = x~~~~ \dot{x}(0) = z ~~~~ \ddot{x}(t) = -x(t) ~~~~ y = x(\sigma). \end{align*}

The general solution to $\ddot{x}(t) = -x(t)$ is given by $\alpha \cos t + \beta \sin t$. The initial conditions $x(0) = x$ and $\dot{x}(0) = z$ then give

\begin{align*} x(t) = x \cos t + z \sin t \end{align*}

which implies that

\begin{align*} y = x(\sigma) = x \cos \sigma + z \sin \sigma. \end{align*}

The proposal transition kernel $Q(x, y)$ is therefore

\begin{align*} Q(x, y) = \left(\frac{1}{\sqrt{2 \pi} |\sin \sigma|} \right)^d \exp \left(- \frac{\|y - x \cos \sigma\|^2}{2 \sin^2 \sigma} \right). \end{align*}

Note now that

\begin{align*} \frac{\pi(y) Q(y, x)}{\pi(x) Q(x, y)} &= \exp \left(\frac{1}{2} (\|x\|^2 - \|y\|^2) \right) \exp \left(\frac{\|y - x \cos \sigma\|^2- \|x - y \cos \sigma\|^2}{2 \sin^2 \sigma} \right) \\ &= \exp \left(\frac{1}{2} (\|x\|^2 - \|y\|^2) \right)\exp \left(\frac{1}{2} (\|y\|^2 - \|x\|^2) \right) = 1. \end{align*}

So the acceptance probability of this chain equals 1 regardless of the value of $\sigma$. This means that the HMC chain can be evolved for a large value of $\sigma$ without affecting the acceptance probability at all. This is quite unlike RWM and MALA, where $\sigma$ has to be small, otherwise the acceptance probability will be very small. The possibility of taking large steps (in terms of $\sigma$) is one of the main advantages of HMC.

We shall see, in the next lecture, that for general $\pi$ (not just Gaussian), the HMC proposal satisfies detailed balance with respect to $\pi$ for every $\sigma$ (so that the subsequent M-H acceptance probability equals 1). In practice, however, the HMC proposal cannot be used directly because the ODE cannot be solved exactly. One instead uses a numerical approximation to the ODE solution. This approximation spoils the detailed balance property, so subsequent Metropolization is necessary. However, the acceptance probability (while not exactly equal to 1) will usually still be high for HMC, even for moderately large $\sigma$. More details will be provided next week.
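For the Gaussian example above, where the ODE can be solved exactly, the cancellation in the acceptance ratio is easy to verify numerically. The following sketch (illustrative values of $d$ and $\sigma$) computes the log acceptance ratio for the exact-flow proposal and checks that it vanishes up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 50, 1.3                 # the integration time sigma can be large
x = rng.standard_normal(d)
z = rng.standard_normal(d)
# exact solution of the Hamiltonian ODE for pi = N(0, I_d)
y = x * np.cos(sigma) + z * np.sin(sigma)

def log_q(a, b):
    # log Q(a, b) up to a constant, with Q induced by the exact flow
    r = b - a * np.cos(sigma)
    return -float(r @ r) / (2 * np.sin(sigma) ** 2)

# log of pi(y) Q(y, x) / (pi(x) Q(x, y))
log_ratio = 0.5 * (x @ x - y @ y) + log_q(y, x) - log_q(x, y)
print(log_ratio)   # zero up to floating-point error: always accept
```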