We are discussing standard MCMC techniques to obtain samples from a probability distribution π. All these techniques are based on the Metropolis-Hastings scheme: given the current value x of the chain, a proposal y is first generated according to some transition kernel Q(x,y), and the proposal is then accepted with probability
min{1, π(y)Q(y,x) / (π(x)Q(x,y))}.
Different MCMC methods differ in how the proposal kernel Q(x,y) is chosen. We shall look at three main techniques for choosing Q(x,y): Random Walk Metropolis (RWM), the Metropolis Adjusted Langevin Algorithm (MALA) and Hamiltonian Monte Carlo (HMC).
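As a concrete illustration (a minimal sketch, not from the notes; the function and argument names are mine), one Metropolis-Hastings step can be written as:

```python
import numpy as np

def mh_step(x, log_pi, propose, log_q, rng):
    """One Metropolis-Hastings step.

    log_pi : log-density of the target pi (up to an additive constant)
    propose: draws y ~ Q(x, .)
    log_q  : log Q(x, y) up to a constant; needed for asymmetric proposals
    """
    y = propose(x, rng)
    # log of the M-H ratio pi(y) Q(y, x) / (pi(x) Q(x, y))
    log_alpha = log_pi(y) + log_q(y, x) - log_pi(x) - log_q(x, y)
    if np.log(rng.uniform()) < log_alpha:
        return y, True   # accept
    return x, False      # reject
```

For a symmetric proposal (as in RWM), the log_q terms cancel and can be set to zero.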
When the dimension d is small, RWM works reasonably well. When d is large, it works only if σ is chosen to be very small. However, with such a small σ, one needs to run the chain for a very long time to ensure adequate mixing. The consequence is that RWM does not work well when d is large.
Case 1: σ ≫ d^{-1/2}. Then σ²d → ∞ so that EA ≫ std(A). Then (by standard concentration), A is tightly concentrated around its mean dσ²/2 → ∞. Therefore the acceptance probability satisfies
min{1, e^{−A}} ≈ e^{−dσ²/2} → 0,
so proposals are essentially never accepted.
Case 2: σ ∼ d^{-1/2}. Now A is an O(1) random variable fluctuating around an O(1) mean, so e^{−A} is a positive O(1) quantity. The acceptance probability is then bounded away from zero, so the chain can actually move.
The conclusion is that the acceptance probability of RWM is non-degenerate if and only if σ = O(d^{-1/2}). Equivalently, the proposal step must shrink as the dimension grows, with σ ∼ c/√d being the critical scaling. This is the fundamental tuning constraint for RWM in high dimensions. When d is large, a step size of order d^{-1/2} tends to be too small, with the consequence that the chain needs a very large number of steps to sample from the whole of π. This makes RWM very inefficient in high dimensions.
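A quick experiment (a sketch under the N(0, I_d) target, with illustrative constants of my choosing) makes the scaling visible: with σ = 1/√d the acceptance rate stays moderate as d grows, while a dimension-independent σ drives it toward zero.

```python
import numpy as np

def rwm_acceptance_rate(d, sigma, n_steps=5000, seed=0):
    """Average RWM acceptance rate for the N(0, I_d) target."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=d)          # start the chain in stationarity
    accepts = 0
    for _ in range(n_steps):
        y = x + sigma * rng.normal(size=d)
        # log pi(y) - log pi(x) for pi = N(0, I_d); the proposal is symmetric
        log_alpha = 0.5 * (x @ x - y @ y)
        if np.log(rng.uniform()) < log_alpha:
            x, accepts = y, accepts + 1
    return accepts / n_steps
```

With d = 100, the critical scaling σ = 1/√d = 0.1 keeps the acceptance rate around one half, whereas a fixed σ = 0.5 collapses it to a few percent.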
As in RWM, it is important to tune σ appropriately. When d is large, one still needs to choose σ small; however, σ does not have to be as small as in the case of RWM for MALA to work well. This is easily illustrated in the N(0, I_d) example.
for a constant ℓ. Note that this value of σ is much larger than d^{-1/2}, so that RWM proposals would essentially always be rejected with this σ. But for the random variable B, we have
which is also bounded by a constant. Thus, for (27) (and under the assumption (26) on x), the random variable B is bounded above by a constant with high probability, which means that the acceptance probability remains bounded away from zero. This is in sharp contrast to the case of RWM, where the acceptance probability is extremely small for the choice (27). This argument indicates that MALA will be more effective than RWM when d is large. Specifically, MALA can be expected to work for much larger values of σ, which leads to faster mixing.
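The MALA step can be sketched as follows (a minimal Python sketch; I use the standard parametrization y = x + (σ²/2)∇log π(x) + σZ, which may differ in constants from (15), and the names are mine):

```python
import numpy as np

def mala_step(x, log_pi, grad_log_pi, sigma, rng):
    """One MALA step: proposal y = x + (sigma^2/2) grad log pi(x) + sigma Z,
    followed by a Metropolis-Hastings accept/reject correction."""
    d = x.shape[0]

    def log_q(frm, to):
        # log density (up to a constant) of the proposal kernel
        # N(frm + (sigma^2/2) grad_log_pi(frm), sigma^2 I) evaluated at `to`
        mean = frm + 0.5 * sigma**2 * grad_log_pi(frm)
        return -np.sum((to - mean) ** 2) / (2 * sigma**2)

    y = x + 0.5 * sigma**2 * grad_log_pi(x) + sigma * rng.normal(size=d)
    # the kernel is asymmetric, so the q-ratio does not cancel
    log_alpha = log_pi(y) + log_q(y, x) - log_pi(x) - log_q(x, y)
    if np.log(rng.uniform()) < log_alpha:
        return y, True
    return x, False
```

On the N(0, I_d) example one can check that moderate step sizes still give a high acceptance rate, in line with the argument above.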
Note that if ∇logπ(x(t)) is replaced by ∇logπ(x), then it is easy to check that y equals the MALA proposal (15).
The ODE (30) is, in most cases, impossible to solve exactly. But in the Gaussian case, it is easy to write a formula for y=x(σ). This is illustrated next.
So the acceptance probability of this chain equals 1 regardless of the value of σ. This means that the HMC chain can be run with a large value of σ without affecting the acceptance probability at all. This is quite unlike the case of RWM and MALA, where σ has to be small, otherwise the acceptance probability becomes very small. The possibility of taking large steps (in terms of σ) is one of the main advantages of HMC.
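For the N(0, I_d) target this can be checked numerically (a sketch; I take the Hamiltonian H = |x|²/2 + |p|²/2, whose flow x' = p, p' = −x is the rotation x(t) = x₀ cos t + p₀ sin t, p(t) = −x₀ sin t + p₀ cos t):

```python
import numpy as np

def gaussian_hmc_flow(x0, p0, t):
    """Exact Hamiltonian flow for H = |x|^2/2 + |p|^2/2 (target N(0, I_d)):
    x' = p, p' = -x, i.e. a rotation in (x, p) space."""
    x_t = x0 * np.cos(t) + p0 * np.sin(t)
    p_t = -x0 * np.sin(t) + p0 * np.cos(t)
    return x_t, p_t

def hamiltonian(x, p):
    return 0.5 * (x @ x + p @ p)

# Energy is conserved exactly for any integration time t = sigma,
# so the M-H acceptance probability equals 1 regardless of sigma.
```

Since the flow is a rotation, H(x(t), p(t)) = H(x₀, p₀) exactly, which is why the acceptance probability is identically 1 in this case.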
We shall see, in the next lecture, that, for general π (not just Gaussian), the HMC proposal (30) satisfies detailed balance with respect to π for every σ (so that the subsequent M-H acceptance probability equals 1). In practice, however, the HMC proposal (30) cannot be used directly because the ODE cannot be solved exactly. One instead uses an approximate numerical solution to (30). This approximation spoils the detailed balance property, however, so a subsequent Metropolis accept/reject step is necessary. The acceptance probability, while not exactly equal to 1, will usually still be high for HMC even when σ is not small. More details will be provided next week.
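The standard numerical scheme for (30) is the leapfrog (Störmer–Verlet) integrator. A sketch (with step size h and n_steps steps, both names mine; I write the dynamics as x' = p, p' = ∇log π(x)):

```python
import numpy as np

def leapfrog(x, p, grad_log_pi, h, n_steps):
    """Leapfrog integration of x' = p, p' = grad log pi(x).

    The scheme is volume-preserving and time-reversible, so the only
    detailed-balance error comes from the O(h^2) energy drift, which the
    subsequent M-H accept/reject step corrects exactly.
    """
    p = p + 0.5 * h * grad_log_pi(x)       # initial half step in momentum
    for _ in range(n_steps - 1):
        x = x + h * p                       # full step in position
        p = p + h * grad_log_pi(x)          # full step in momentum
    x = x + h * p
    p = p + 0.5 * h * grad_log_pi(x)        # final half step in momentum
    return x, p
```

Because the energy error stays O(h²) rather than accumulating, the acceptance probability remains close to 1 even over long trajectories, which is the numerical counterpart of the exact-flow argument above.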