Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

STAT 238 - Bayesian Statistics Lecture Thirty One

Spring 2026, UC Berkeley

We will study the MALA and HMC algorithms. MALA stands for Metropolis Adjusted Langevin Algorithm. HMC stands for Hamiltonian Monte Carlo. Before we discuss MALA, let us first recall the basics of markov chains.

Recap: Markov Chains, detailed balance and the Metropolis-Hastings Algorithm

Let SS be a state space. A sequence of random variables Θ(0),Θ(1),\Theta^{(0)}, \Theta^{(1)}, \dots taking values in SS is a Markov Chain on SS provided the conditional distribution of Θ(t+1)\Theta^{(t+1)} given Θ(t)=θ(t),Θ(t1)=θ(t1),,Θ(0)=θ(0)\Theta^{(t)} = \theta^{(t)}, \Theta^{(t-1)} = \theta^{(t-1)}, \dots, \Theta^{(0)} = \theta^{(0)} only depends on θ(t)\theta^{(t)}. Further, if this conditional distribution is the same for all tt, then the Markov chain is said to be time-homogeneous. We will only deal with time-homogeneous Markov Chains in this course.

A (time-homogeneous) Markov Chain is described via the conditional distribution of Θ(t+1)\Theta^{(t+1)} given Θ(t)=x\Theta^{(t)} = x, which is described by a probability transition kernel P(x,y)P(x, y). If SS is discrete, P(x,y)P(x, y) is the conditional probability of Θ(t+1)=y\Theta^{(t+1)} = y given Θ(t)=x\Theta^{(t)} = x. If SS is an open subset of Rd\R^d, P(x,y)P(x, y) represents the conditional density of Θ(t+1)\Theta^{(t+1)} given Θ(t)=x\Theta^{(t)} = x.

The Markov chain given by P(x,y)P(x, y) satisfies detailed balance with respect to a probability π(x)\pi(x) on SS if the following condition is satisfied:

π(y)P(y,x)=π(x)P(x,y)  for all x,yS.\begin{align} \pi(y) P(y, x) = \pi(x) P(x, y) ~~\text{for all $x, y \in S$}. \end{align}

This detailed balance condition implies that π\pi is stationary for the chain PP in the sense that Θ(t)π\Theta^{(t)} \sim \pi implies Θ(t+1)π\Theta^{(t+1)} \sim \pi.

Stationarity along with the additional condition of irreducibility (which means that for every pair of points xx and yy in the state space, there is a positive probability that the chain goes from xx to yy in finite time) imply ergodicity which means:

limN1Nt=1Ng(Θ(t))=g(θ)dπ(θ)\begin{align*} \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{t=1}^N g(\Theta^{(t)}) = \int g(\theta) d\pi(\theta) \end{align*}

almost surely.

Metropolis-Hastings is the most widely used method for constructing markov chains satisfying detailed balance with respect to π\pi. This method starts with an arbitrary transition kernel Q(x,y)Q(x, y) that may not even have anything to do with π\pi (in particular, it is not at all necessary that QQ satisfies detailed balance with respect to π\pi). Metropolis-Hastings uses Q(,)Q(\cdot, \cdot) and π()\pi(\cdot) to construct a new Markov chain {Θ(t)}\{\Theta^{(t)}\} which moves as follows. Given Θ(t)=x\Theta^{(t)} = x, first use Q(x,)Q(x, \cdot) to generate yy, then take Θ(t+1)\Theta^{(t+1)} to be:

Undefined control sequence: \1 at position 123: …(x,y)}\right), \̲1̲em]
x, & \text{…

\Theta^{(t+1)} =
\begin{cases}
y, & \text{with probability } \min\!\left(1, \frac{\pi(y)\,Q(y,x)}{\pi(x)\,Q(x,y)}\right), \1em]
x, & \text{with probability } 1 - \min\!\left(1, \frac{\pi(y)\,Q(y,x)}{\pi(x)\,Q(x,y)}\right).
\end{cases}

The resulting Metropolis-Hastings chain satisfies detailed balance with respect to π\pi, as we saw previously in Lecture 25.

Random Walk Metropolis (RWM)

The Random Walk Metropolis (RWM) algorithm is arguably the simplest example of a markov chain which satisfies detailed balance with respect to π\pi. It uses the transition kernel Q(x,y)Q(x, y) which generates proposals via:

y=x+σz  where zN(0,Id).\begin{align} y = x + \sigma z ~~ \text{where $z \sim N(0, I_d)$}. \end{align}

In other words, the proposal transition kernel is given by

Q(x,y)=(12πσ)dexp(xy22σ2).\begin{align} Q(x, y) = \left(\frac{1}{\sqrt{2 \pi} \sigma} \right)^d \exp \left(-\frac{\|x - y\|^2}{2\sigma^2} \right) . \end{align}

Clearly Q(x,y)=Q(y,x)Q(x, y) = Q(y, x) so the resulting acceptance probability (in the Metropolis-Hastings algorithm) is simply

min(1,π(y)π(x)).\begin{align*} \min \left(1, \frac{\pi(y)}{\pi(x)} \right). \end{align*}

The choice of σ\sigma (in (5)) is crucial to the practical performance of the algorithm. If σ\sigma is not small, then Q(x,)Q(x, \cdot) may result in proposals yy that are far from xx. If π()\pi(\cdot) is highly concentrated (as is typically the case with posteriors), then π(y)\pi(y) might be much smaller than π(x)\pi(x) leading to the chain rejecting often and getting stuck. On the other hand if σ\sigma is too small, then the chain might take a long time to explore the full distribution π()\pi(\cdot).

Metropolis Adjusted Langevin Algorithm (MALA)

In MALA, the rwm proposal (4) is modified with an expression that depends on π\pi. To find this formula, let us first consider the following proposal:

y=x+g(x)+σz  with zN(0,Id)\begin{align} y = x + g(x) + \sigma z ~~\text{with $z \sim N(0, I_d)$} \end{align}

for some function g(x)g(x) depending on π(x)\pi(x). Clearly (4) can be seen as a special case of (7) corresponding to g(x)=0g(x) = 0. The proposal transition kernel corresponding to (7) is given by:

Q(x,y)=(12πσ)dexp(12σ2yxg(x)2).\begin{align*} Q(x, y) = \left(\frac{1}{\sqrt{2 \pi} \sigma} \right)^d \exp \left(-\frac{1}{2 \sigma^2} \|y - x - g(x) \|^2 \right). \end{align*}

Note that Q(x,y)Q(x, y) is no longer equal to Q(y,x)Q(y, x). The key question is how to choose g(x)g(x). It is natural to choose g(x)g(x) so that the detailed balance condition holds with respect to π\pi:

π(x)Q(x,y)π(y)Q(y,x)\begin{align*} \pi(x) Q(x, y) \approx \pi(y) Q(y, x) \end{align*}

at least approximately when yy is close to xx. Check that

Q(y,x)Q(x,y)=exp(1σ2<yx,g(x)+g(y)>+12σ2(g(x)2g(y)2)).\begin{align*} \frac{Q(y, x)}{Q(x, y)} &= \exp \left(- \frac{1}{\sigma^2}\left<y - x, g(x) + g(y) \right> + \frac{1}{2 \sigma^2} \left(\|g(x)\|^2 - \|g(y)\|^2 \right) \right). \end{align*}

As a result:

π(y)Q(y,x)π(x)Q(x,y)=exp(logπ(y)logπ(x))exp(1σ2<yx,g(x)+g(y)>+12σ2(g(x)2g(y)2)).\begin{align*} & \frac{\pi(y)Q(y, x)}{\pi(x) Q(x, y)} \\ &= \exp \left(\log \pi(y) - \log \pi(x) \right) \exp \left(- \frac{1}{\sigma^2}\left<y - x, g(x) + g(y) \right> + \frac{1}{2 \sigma^2} \left(\|g(x)\|^2 - \|g(y)\|^2 \right) \right). \end{align*}

For yy close to xx, we use the approximations:

logπ(y)logπ(x)<yx,logπ(x)>1σ2<yx,g(x)+g(y)>2σ2<yx,g(x)>12σ2(g(x)2g(y)2)0.\begin{align*} & \log \pi(y) - \log \pi(x) \approx \left<y - x, \nabla \log \pi(x) \right> \\ & \frac{1}{\sigma^2}\left<y - x, g(x) + g(y) \right> \approx \frac{2}{\sigma^2} \left<y - x, g(x) \right> \\ & \frac{1}{2 \sigma^2} \left(\|g(x)\|^2 - \|g(y)\|^2 \right) \approx 0. \end{align*}

This leads to

π(y)Q(y,x)π(x)Q(x,y)exp(<yx,logπ(x)2g(x)σ2>)\begin{align*} \frac{\pi(y)Q(y, x)}{\pi(x) Q(x, y)} \approx \exp \left(\left<y - x, \nabla \log \pi(x) - \frac{2g(x)}{\sigma^2} \right> \right) \end{align*}

when yy is close to xx. This clearly suggests taking

g(x)=σ22logπ(x).\begin{align*} g(x) = \frac{\sigma^2}{2} \nabla \log \pi(x). \end{align*}

This leads to the MALA proposal:

y=x+12σ2logπ(x)+σz   where zN(0,Id).\begin{align} y = x + \frac{1}{2} \sigma^2 \nabla \log \pi(x) + \sigma z ~~ \text{ where $z \sim N(0, I_d)$}. \end{align}

When σ\sigma is small, yy is automatically close to xx, so the above argument can be used to show that the Metropolis-Hastings acceptance ratio for this proposal is nearly equal to 1.

Mechanics Interpretation

We can interpret the RWM proposal (4) in physical terms as follows:

x(0)=x,  x˙(0)=zN(0,Id)and, motion with constant velocity z for time [0,σ]    x(σ)=RWM proposal\begin{split} & x(0) = x, ~~ \dot{x}(0) = z \sim N(0, I_d) \\ & \text{and, motion with constant velocity $z$ for time $[0, \sigma]$} \\ & \implies x(\sigma) = \text{RWM proposal} \end{split}

Similarly, we can interpret the MALA proposal (15) in physical terms as follows:

x(0)=x,  x˙(0)=zN(0,Id)and, motion with constant acceleration logπ(x) for time [0,σ]    x(σ)=MALA proposal\begin{split} & x(0) = x, ~~ \dot{x}(0) = z \sim N(0, I_d) \\ & \text{and, motion with constant acceleration $\nabla \log \pi(x)$ for time $[0, \sigma]$} \\ & \implies x(\sigma) = \text{MALA proposal} \end{split}

The physical analogy makes the difference between RWM and MALA transparent. In both cases the particle starts at xx with a random velocity zN(0,Id)z \sim N(0, I_d), and the proposal y=x(σ)y = x(\sigma) is the position after time σ\sigma. The distinction lies in the dynamics:

As a consequence, MALA proposals are systematically better aligned with π\pi than RWM proposals. In practice this allows MALA to take much larger steps σ\sigma while maintaining a reasonable acceptance rate, leading to faster exploration of the target distribution.

In the coming lectures, we shall look at HMC, which is a further modification of (17). Specifically, the HMC proposal is obtained via:

x(0)=x,  x˙(0)=zN(0,Id)x¨(t)=logπ(x(t))    x(σ)=HMC proposal\begin{split} & x(0) = x, ~~ \dot{x}(0) = z \sim N(0, I_d) \\ & \ddot{x}(t) = \nabla \log \pi(x(t)) \\ & \implies x(\sigma) = \text{HMC proposal} \end{split}

The step from MALA to HMC is precisely the step from constant acceleration to position-dependent acceleration. In MALA the acceleration logπ(x)\nabla\log\pi(x) is evaluated once at the current position xx and held fixed for the entire flight time [0,σ][0,\sigma]. In HMC the acceleration is updated continuously along the trajectory: at every time tt the particle feels logπ(x(t))\nabla\log\pi(x(t)). One downside to this time-dependent acceleration is that x(σ)x(\sigma) usually cannot be written in closed form as a function of xx and zz (unlike RWM and MALA), and some numerical integration scheme has to be used to figure out x(σ)x(\sigma).