
STAT 238 - Bayesian Statistics Lecture Thirty Four

Spring 2026, UC Berkeley

Hamiltonian Monte Carlo

There is a given target probability density $\pi$ and our goal is to construct a Markov chain which satisfies detailed balance with respect to $\pi$. Given a current value $x$, HMC constructs the next value $y$ by solving the following second-order ODE:

\begin{align} \ddot{x}(t) = \nabla \log \pi(x(t)) ~~ \text{ with initialization $x(0) = x$ and $\dot{x}(0) = z \sim N(0, I_d)$}. \end{align}

This second-order ODE can alternatively be written in the following first-order form:

\begin{align} \begin{pmatrix} \dot{x}(t) \\ \dot{v}(t) \end{pmatrix} = \begin{pmatrix} v(t) \\ \nabla \log \pi(x(t)) \end{pmatrix} ~\text{with initialization}~ \begin{pmatrix} x(0) \\ v(0) \end{pmatrix} = \begin{pmatrix} x \\ z \end{pmatrix} \end{align}

which can also be written in terms of the Hamiltonian:

\begin{align} H(x, v) = H(x_1, \dots, x_d, v_1, \dots, v_d) := -\log \pi(x) + \frac{1}{2}\|v\|^2 \end{align}

as

\begin{align} \begin{pmatrix} \dot{x}(t) \\ \dot{v}(t) \end{pmatrix} = \begin{pmatrix} \frac{\partial H}{\partial v}(x(t), v(t)) \\ -\frac{\partial H}{\partial x}(x(t), v(t)) \end{pmatrix} ~\text{with initialization}~ \begin{pmatrix} x(0) \\ v(0) \end{pmatrix} = \begin{pmatrix} x \\ z \end{pmatrix} \end{align}
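As a concrete example (a standard fact, not spelled out in the original notes): for the standard Gaussian target $\pi = N(0, I_d)$ we have $\nabla \log \pi(x) = -x$, so the ODE becomes $\ddot{x}(t) = -x(t)$, with explicit solution

\begin{align*} x(t) = x \cos t + z \sin t, \qquad \dot{x}(t) = -x \sin t + z \cos t. \end{align*}

The Hamiltonian flow is thus a rotation in phase space; in particular, $\|x(t)\|^2 + \|\dot{x}(t)\|^2$ stays constant in $t$, a first glimpse of the conservation property discussed below.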

Most authors (see e.g., neal2011mcmc) use the Hamiltonian formulation to describe HMC. The advantage of the second-order formulation is that it makes the connection to MALA very clear (essentially, MALA proposals are obtained by replacing $\nabla \log \pi(x(t))$ by $\nabla \log \pi(x(0))$).

On the other hand, the main advantage of the Hamiltonian formulation is that it also works for choices of the Hamiltonian that differ from the Gaussian kinetic energy above. For example, consider the following alternative choices of the Hamiltonian:

  1. $H(x, v) = -\log \pi(x) + v^T M v/2$ for a positive definite matrix $M$. Here the Hamiltonian dynamics become

    \begin{align*} \begin{pmatrix} \dot{x}(t) \\ \dot{v}(t) \end{pmatrix} = \begin{pmatrix} M v(t) \\ \nabla \log \pi(x(t)) \end{pmatrix} \end{align*}

    because $\frac{\partial H}{\partial x} = -\nabla \log \pi(x)$ and $\frac{\partial H}{\partial v} = M v$. This formulation of the Hamiltonian is meaningful when the variables $x_1, \dots, x_d$ have different scales. One can also make $M$ depend on $x$. This is related to Riemannian Manifold HMC (see girolami2011riemann).

  2. $H(x, v) = -\log \pi(x) + \|v\|_1$. Here the Hamiltonian dynamics become

    \begin{align*} \begin{pmatrix} \dot{x}(t) \\ \dot{v}(t) \end{pmatrix} = \begin{pmatrix} \text{sign}(v(t)) \\ \nabla \log \pi(x(t)) \end{pmatrix} \end{align*}

    because $\frac{\partial H}{\partial v} = \text{sign}(v)$ (for coordinates with $v_i \neq 0$). Also note that $v(t) = (v_1(t), \dots, v_d(t))$ and $\text{sign}(v(t))$ is interpreted coordinatewise as $(\text{sign}(v_1(t)), \dots, \text{sign}(v_d(t)))$. These are interesting dynamics that cannot be written in a second-order form as before.
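To make this generality concrete, here is a minimal sketch (my own illustration, not from the notes) that integrates the first-order Hamiltonian system by forward Euler with a pluggable kinetic-energy gradient; the names `grad_log_pi` and `grad_K` are invented for this example. It then checks that, for a Gaussian target with kinetic energy $v^T M v/2$, the Hamiltonian is approximately conserved (up to Euler discretization error).

```python
import numpy as np

def euler_hamiltonian(x, v, grad_log_pi, grad_K, t_final, h=1e-4):
    """Crude forward-Euler integration of
    x' = dH/dv = grad_K(v),  v' = -dH/dx = grad_log_pi(x)."""
    for _ in range(int(t_final / h)):
        x, v = x + h * grad_K(v), v + h * grad_log_pi(x)
    return x, v

# Standard Gaussian target pi = N(0, I_2), kinetic energy K(v) = v^T M v / 2.
M = np.diag([1.0, 4.0])
grad_log_pi = lambda x: -x
grad_K = lambda v: M @ v

x0, v0 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x1, v1 = euler_hamiltonian(x0, v0, grad_log_pi, grad_K, t_final=0.5)

# H = ||x||^2/2 + v^T M v/2 should be (approximately) conserved.
H = lambda x, v: 0.5 * x @ x + 0.5 * v @ M @ v
print(abs(H(x1, v1) - H(x0, v0)))  # small (discretization error only)
```

Forward Euler is used only for transparency here; in practice one would use the leapfrog scheme described later in these notes.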

Hamiltonian Dynamics

Some understanding of Hamiltonian dynamics will be useful for HMC. Hamiltonian dynamics refers to the ODE

\begin{align} \begin{pmatrix} \dot{x}(t) \\ \dot{v}(t) \end{pmatrix} = \begin{pmatrix} \frac{\partial H}{\partial v}(x(t), v(t)) \\ -\frac{\partial H}{\partial x}(x(t), v(t)) \end{pmatrix} ~\text{with initialization}~ \begin{pmatrix} x(0) \\ v(0) \end{pmatrix} = \begin{pmatrix} x \\ v \end{pmatrix} \end{align}

The above is the same as before except that we denote the initial velocity by $v$ (instead of $z$). In this section, we will not use the specific Gaussian form of the Hamiltonian, but will take the more general form:

\begin{align} H(x, v) = -\log \pi(x) + K(v) \end{align}

where $K(v)$ is a symmetric function of $v$, i.e., $K(-v) = K(v)$.

We will view the Hamiltonian dynamics (run for a fixed time $\sigma$) as a mapping from $(x, v)$ (the initial state) to a final state $(x(\sigma), v(\sigma))$. We will denote this mapping by $T_{\sigma}(x, v)$:

\begin{align*} T_{\sigma}(x, v) = (x(\sigma), v(\sigma)). \end{align*}

The space of all pairs of position and velocity $(x, v)$ will be referred to as the phase space.

The basic properties of $T_{\sigma}(\cdot)$ that we need for HMC are summarized in this section. For more details, you can refer to neal2011mcmc.

Property One: Invertibility or Reversibility

The map $T_{\sigma}(x, v)$ is invertible and its inverse can be written in a simple manner. Let $S(x, v) = (x, -v)$ be the operator which flips the velocity. Then

\begin{align} T_{\sigma}^{-1} = S T_{\sigma} S. \end{align}

In other words, to compute $T_{\sigma}^{-1}$ at a point $(y, w)$, we need to take the following three steps:

  1. First, flip the velocity to obtain $(y, -w)$.

  2. Second, run the Hamiltonian dynamics for time $\sigma$ starting at $(y, -w)$ to obtain $(x, -v)$ at time $\sigma$.

  3. Third, flip the velocity again to obtain $(x, v)$.

The formula $T_{\sigma}^{-1} = S T_{\sigma} S$ has the following simple consequence:

\begin{align} T_{\sigma} S = (T_{\sigma} S)^{-1}. \end{align}

To see this, simply note

\begin{align*} (T_{\sigma} S)^{-1} = S^{-1} T_{\sigma}^{-1} = S^{-1} S T_{\sigma} S = T_{\sigma} S. \end{align*}

A function $g$ for which $g^{-1} = g$ (equivalently, $g(g(x)) = x$ for all $x$) is called an involution. The above statement is therefore equivalent to saying that $T_{\sigma} S$ is an involution. Involutions have recently been used to unify many MCMC algorithms (see e.g., glatt2024sacred).
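The involution property can be checked numerically. The sketch below (my own, not from the notes) uses the fact that for the standard Gaussian target the flow $T_\sigma$ is exactly the phase-space rotation by angle $\sigma$, and verifies that applying $T_\sigma S$ twice returns the starting point:

```python
import numpy as np

def T(sigma, x, v):
    # Exact Hamiltonian flow for pi = N(0, 1): rotation by angle sigma.
    return (x * np.cos(sigma) + v * np.sin(sigma),
            -x * np.sin(sigma) + v * np.cos(sigma))

def S(x, v):
    return x, -v  # flip the velocity

sigma, x, v = 0.7, 1.3, -0.4
y, w = T(sigma, *S(*T(sigma, *S(x, v))))  # apply (T_sigma S) twice
print(y, w)  # (1.3, -0.4) up to floating-point error
```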

Property Two: Hamiltonian Conservation

Hamiltonian dynamics conserve the Hamiltonian in the sense that:

\begin{align} H(x(t), v(t)) = \text{constant}. \end{align}

To see this, just note that

\begin{align*} \frac{d}{dt} H(x(t), v(t)) &= \sum_{i=1}^d \left( \frac{\partial H}{\partial x_i} \dot{x}_i(t) + \frac{\partial H}{\partial v_i} \dot{v}_i(t) \right) \\ &= \sum_{i=1}^d \left(\frac{\partial H}{\partial x_i} \frac{\partial H}{\partial v_i} + \frac{\partial H}{\partial v_i} \left(- \frac{\partial H}{\partial x_i}\right) \right) = \sum_{i=1}^d \left(\frac{\partial H}{\partial x_i} \frac{\partial H}{\partial v_i} - \frac{\partial H}{\partial v_i} \frac{\partial H}{\partial x_i} \right) = 0 \end{align*}
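Conservation holds for non-Gaussian targets too. Here is a small numerical check (my own sketch): we integrate the dynamics for the double-well target $\pi(x) \propto e^{-(x^2-1)^2}$ with a standard fourth-order Runge-Kutta scheme and confirm that $H$ stays essentially constant along the trajectory.

```python
import numpy as np

def rk4_flow(x, v, grad_log_pi, t_final, h=1e-3):
    """RK4 integration of the Hamiltonian system x' = v, v' = grad_log_pi(x)."""
    f = lambda s: np.array([s[1], grad_log_pi(s[0])])
    s = np.array([x, v], dtype=float)
    for _ in range(int(t_final / h)):
        k1 = f(s); k2 = f(s + h/2*k1); k3 = f(s + h/2*k2); k4 = f(s + h*k3)
        s = s + h/6 * (k1 + 2*k2 + 2*k3 + k4)
    return s[0], s[1]

grad_log_pi = lambda x: -4 * x * (x**2 - 1)  # pi(x) ∝ exp(-(x^2 - 1)^2)
H = lambda x, v: (x**2 - 1)**2 + v**2 / 2    # Hamiltonian for this target

x0, v0 = 0.5, 1.0
x1, v1 = rk4_flow(x0, v0, grad_log_pi, t_final=2.0)
print(abs(H(x1, v1) - H(x0, v0)))  # tiny: H is conserved along the flow
```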

Property Three: Volume Preservation

Volume preservation means that if we take a region $R$ in the phase space, then

\begin{align} \text{vol}(R) = \text{vol}(T_{\sigma}(R)) \end{align}

where $T_{\sigma}(R) = \{T_{\sigma}(x, v): (x, v) \in R\}$.

To analyze the volume preservation property, note that

\begin{align*} \text{vol}(T_{\sigma}(R)) = \int I\{(y, w) \in T_{\sigma}(R)\} d(y, w) = \int I\{T_{\sigma}^{-1} (y, w) \in R\} d(y, w). \end{align*}

Using the change of variables $(x, v) = T_{\sigma}^{-1}(y, w)$ above, we get

\begin{align*} \int I\{T_{\sigma}^{-1} (y, w) \in R\} d(y, w) = \int I\{(x, v) \in R\} |\det J T_{\sigma}(x, v)| d(x, v) \end{align*}

where $J T_{\sigma}$ denotes the Jacobian of $T_{\sigma}$. Thus volume preservation can be proved by showing that the Jacobian $J T_{\sigma}$ has determinant equal to 1 for all $(x, v)$:

\begin{align} \det J T_{\sigma}(x, v) = 1 ~~ \text{ for all $(x, v)$}. \end{align}

Here is a sketch of the proof. Assume first that $\sigma$ is small so that $T_{\sigma}$ can be well-approximated by a linear function:

\begin{align*} T_{\sigma}(x, v) \approx \begin{pmatrix} x \\ v \end{pmatrix} + \sigma \begin{pmatrix} \frac{\partial H}{\partial v}(x, v) \\ - \frac{\partial H}{\partial x}(x, v) \end{pmatrix} \end{align*}

As a result

\begin{align*} J T_{\sigma} \approx I + \sigma \begin{pmatrix} \frac{\partial^2 H}{\partial x \partial v} & \frac{\partial^2 H}{\partial v^2} \\ -\frac{\partial^2 H}{\partial x^2} & - \frac{\partial^2 H}{\partial x \partial v} \end{pmatrix} \end{align*}

which shows (writing the blocks as scalars, i.e., for $d = 1$; the general case is analogous) that

\begin{align*} \det (J T_{\sigma}) \approx \det \begin{pmatrix} 1 + \sigma \frac{\partial^2 H}{\partial x \partial v} & \sigma\frac{\partial^2 H}{\partial v^2} \\ -\sigma\frac{\partial^2 H}{\partial x^2} & 1 - \sigma \frac{\partial^2 H}{\partial x \partial v} \end{pmatrix} = 1 + O(\sigma^2). \end{align*}

If $\sigma$ is not small, then break $[0, \sigma]$ into $N$ intervals each of length $\sigma/N$, and decompose $T_{\sigma}$ as $T_N \circ T_{N-1} \circ \dots \circ T_1$, where $T_j$ is the mapping corresponding to Hamiltonian dynamics from $t = (j-1)\sigma/N$ to $t = j\sigma/N$. For each $j$, the previous argument gives $\det J T_j = 1 + O(\sigma^2/N^2)$. Taking the product over $j$, we get

\begin{align*} \det J T_{\sigma} = \prod_{j=1}^N \left(1 + O(\sigma^2/N^2) \right) \rightarrow 1 \end{align*}

as $N \rightarrow \infty$. This completes the heuristic argument.

Discretization of the Hamiltonian dynamics

The Hamiltonian ODE cannot be solved exactly for most $\pi(\cdot)$, so we have to solve it approximately using a discretization technique. We fix a step size $\epsilon$ and approximate the ODE at $t = 0, \epsilon, 2\epsilon, \dots$. More specifically, we will construct:

\begin{align*} \begin{pmatrix} x(0) \\ v(0) \end{pmatrix}, \begin{pmatrix} x(\epsilon) \\ v(\epsilon) \end{pmatrix}, \begin{pmatrix} x(2 \epsilon) \\ v(2 \epsilon) \end{pmatrix}, \dots \end{align*}

Below we describe how to obtain $(x(t+\epsilon), v(t + \epsilon))$ from $(x(t), v(t))$ (this process is then applied iteratively starting at $t = 0$).

The standard approach to discretizing the HMC equation is the leapfrog discretization, which uses the following formulae to construct $x(t+\epsilon)$ and $v(t + \epsilon)$ from $x(t), v(t)$:

\begin{split} & v\left(t + \frac{\epsilon}{2}\right) = v(t) + \frac{\epsilon}{2} \nabla \log \pi(x(t)) \\ & x(t + \epsilon) = x(t) + \epsilon v\left(t + \frac{\epsilon}{2} \right) \\ & v(t + \epsilon) = v \left(t + \frac{\epsilon}{2} \right) + \frac{\epsilon}{2} \nabla \log \pi(x(t + \epsilon)) \end{split}

If we denote this map from $(x(t), v(t))$ to $(x(t + \epsilon), v(t + \epsilon))$ by $T^{\text{disc}}_{\epsilon}(x, v)$, then it is straightforward to verify that

\begin{split} &T_{\epsilon}^{\text{disc}}(x, v)\\ &= \left(x + \epsilon v + \frac{\epsilon^2}{2} \nabla \log \pi(x),\; v + \frac{\epsilon}{2} \nabla \log \pi(x) + \frac{\epsilon}{2}\nabla \log \pi \left(x + \epsilon v + \frac{\epsilon^2}{2} \nabla \log \pi(x) \right) \right). \end{split}

Note that the position update above is reminiscent of MALA.
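The three leapfrog formulae translate directly into code. The sketch below (my own, with invented names like `leapfrog_step`) also checks one step against the explicit composed formula for $T^{\text{disc}}_{\epsilon}$:

```python
import numpy as np

def leapfrog_step(x, v, grad_log_pi, eps):
    """One leapfrog step: half velocity kick, full position drift, half kick."""
    v = v + eps / 2 * grad_log_pi(x)   # v(t + eps/2)
    x = x + eps * v                    # x(t + eps)
    v = v + eps / 2 * grad_log_pi(x)   # v(t + eps)
    return x, v

grad_log_pi = lambda x: -x  # standard Gaussian target
eps = 0.1
x, v = np.array([1.0, -2.0]), np.array([0.5, 0.3])

x1, v1 = leapfrog_step(x, v, grad_log_pi, eps)

# Explicit composed formula for T_eps^disc:
x_explicit = x + eps * v + eps**2 / 2 * grad_log_pi(x)
v_explicit = v + eps / 2 * grad_log_pi(x) + eps / 2 * grad_log_pi(x_explicit)
print(np.allclose(x1, x_explicit) and np.allclose(v1, v_explicit))  # True
```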

If we apply $N$ leapfrog steps in succession, then the overall mapping is given by:

\begin{align*} T^{\text{disc}, N}_{\epsilon} = T_{\epsilon}^{\text{disc}} \circ \dots \circ T_{\epsilon}^{\text{disc}}. \end{align*}

This function $T^{\text{disc}, N}_{\epsilon}$ shares the following two properties of the continuous Hamiltonian dynamics:

  1. Invertibility or Time Reversibility: One can check that $S T_{\epsilon}^{\text{disc}, N} S$ serves as the inverse of $T_{\epsilon}^{\text{disc}, N}$. This can be verified by proving that

    \begin{align*} \left(T_{\epsilon}^{\text{disc}} \right)^{-1} = S T_{\epsilon}^{\text{disc}} S. \end{align*}

    Given the explicit formula for $T_{\epsilon}^{\text{disc}}$ above, this verification is straightforward.

  2. Volume Preservation: $T_{\epsilon}^{\text{disc}, N}$ is volume-preserving. This can be verified by showing that the determinant of the Jacobian of $T_{\epsilon}^{\text{disc}, N}$ equals one. This, in turn, follows from the determinant of the Jacobian of $T_{\epsilon}^{\text{disc}}$ equalling 1, which can be checked directly from the explicit formula, or by noting that $T_{\epsilon}^{\text{disc}}$ is the composition of three mappings:

    \begin{align*} T_{\epsilon}^{\text{disc}} = A_{\epsilon/2} \circ B_{\epsilon} \circ A_{\epsilon/2} \end{align*}

    where

    \begin{align*} A_{\epsilon/2}(x, v) = \left(x, v + \frac{\epsilon}{2} \nabla \log \pi(x)\right) ~~ \text{ and } ~~ B_{\epsilon}(x, v) = (x + \epsilon v, v). \end{align*}

    These mappings are very simple and one can directly verify that they are volume preserving (i.e., the determinants of their Jacobians equal 1).
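Both properties can be checked numerically. The sketch below (my own) verifies time reversibility for the double-well target $\pi(x) \propto e^{-(x^2-1)^2}$, and approximates the Jacobian determinant of one leapfrog step by finite differences:

```python
import numpy as np

def leapfrog(x, v, grad_log_pi, eps, N):
    for _ in range(N):
        v = v + eps / 2 * grad_log_pi(x)
        x = x + eps * v
        v = v + eps / 2 * grad_log_pi(x)
    return x, v

grad_log_pi = lambda x: -4 * x * (x**2 - 1)  # pi(x) ∝ exp(-(x^2 - 1)^2), d = 1
eps, N = 0.05, 10
x0, v0 = 0.3, 0.8

# Time reversibility: flip velocity, run forward, flip again recovers the start.
xN, vN = leapfrog(x0, v0, grad_log_pi, eps, N)
xb, vb = leapfrog(xN, -vN, grad_log_pi, eps, N)
print(xb, -vb)  # recovers (x0, v0) up to floating-point error

# Volume preservation: finite-difference Jacobian of one step has det ≈ 1.
h = 1e-6
step = lambda s: np.array(leapfrog(s[0], s[1], grad_log_pi, eps, 1))
s0 = np.array([x0, v0])
J = np.column_stack([(step(s0 + h * e) - step(s0)) / h for e in np.eye(2)])
print(np.linalg.det(J))  # ≈ 1
```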

While the discretized Hamiltonian dynamics satisfy invertibility (or time-reversibility) and volume preservation, they do not satisfy Hamiltonian conservation (unlike continuous Hamiltonian dynamics). Because of this, a Metropolis acceptance correction has to be applied when implementing HMC with leapfrog discretization. This will be described in the next lecture.

On stationarity of $\pi$ for the continuous Hamiltonian dynamics

For the continuous Hamiltonian dynamics, it turns out that if the initial conditions $x$ and $v$ are distributed independently according to $x \sim \pi$ and $v \sim N(0, I_d)$, then $x(\sigma)$ and $v(\sigma)$ are also independently distributed as $x(\sigma) \sim \pi$ and $v(\sigma) \sim N(0, I_d)$ for every $\sigma$. This can be proved using the continuity equation that we discussed in the last lecture. The argument is given below.

Continuity equation

Consider the ODE

\begin{align} \dot{X}(t) = V(t, X(t)) ~~\text{for $t \geq 0$}. \end{align}

Here $X(t) \in \mathbb{R}^d$ and $V(t, \cdot) : \mathbb{R}^d \rightarrow \mathbb{R}^d$. Suppose we initialize the ODE at $t = 0$ with a random variable having density $\rho_b$:

\begin{align*} X(0) \sim \rho_b. \end{align*}

What then is the density of $X(t)$? Let $\rho(t, x)$ denote the density of $X(t)$ evaluated at $x$. Then $\rho(t, x)$ satisfies the following PDE:

\begin{align} \frac{\partial}{\partial t} \rho(t, x) = -\nabla \cdot \left(V(t, x) \rho(t, x) \right) ~~ \text{ with initialization } \rho(0, x) = \rho_b(x), \end{align}

where

\begin{align*} \nabla \cdot \left(V(t, x) \rho(t, x) \right) = \text{div}(V(t, x) \rho(t, x)) := \sum_{i=1}^d \frac{\partial}{\partial x_i} \left(V_i(t, x) \rho(t, x) \right) \end{align*}

This PDE is known as the continuity equation or transport equation (it is closely related to the Fokker-Planck and Kolmogorov forward equations).

The continuity equation has been popular in the recent literature on generative modeling (see e.g., lai2025principles).
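As a sanity check (my own, not from the notes): for the 1D linear drift $V(t, x) = -x$ with $X(0) \sim N(0, 1)$, the solution is $X(t) = X(0) e^{-t} \sim N(0, e^{-2t})$, and one can verify by finite differences that this family of densities satisfies the continuity equation:

```python
import numpy as np

def rho(t, x):
    # Density of X(t) = X(0) e^{-t} with X(0) ~ N(0, 1), i.e., N(0, e^{-2t}).
    s2 = np.exp(-2 * t)
    return np.exp(-x**2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

V = lambda t, x: -x
t, x = 0.4, np.linspace(-2, 2, 9)
dt, dx = 1e-5, 1e-5

lhs = (rho(t + dt, x) - rho(t - dt, x)) / (2 * dt)    # d rho / dt
flux = lambda xx: V(t, xx) * rho(t, xx)
rhs = -(flux(x + dx) - flux(x - dx)) / (2 * dx)       # -d(V rho)/dx
print(np.max(np.abs(lhs - rhs)))  # small: the continuity equation holds
```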

Application to Hamiltonian Monte Carlo

Observe that the Hamiltonian dynamics are a special case of the above ODE, with $x$ there corresponding to the pair $(x, v)$ and

\begin{align*} V(t, x, v) = \begin{pmatrix} \frac{\partial H}{\partial v} \\ -\frac{\partial H}{\partial x} \end{pmatrix} \end{align*}

Suppose now that $x \sim \pi$ and $v \sim N(0, I_d)$ independently, so that the initial density of $(x(0), v(0))$ is

\begin{align*} \rho_b(x, v) = \pi(x) \phi(v) = (2 \pi)^{-d/2} \exp \left(-H(x, v) \right). \end{align*}

We then claim that

\begin{align*} \rho(t, x, v) = (2 \pi)^{-d/2} \exp(-H(x, v)) ~~ \text{for every $t \geq 0$} \end{align*}

satisfies the continuity equation. To see this, simply note that

\begin{align*} \nabla \cdot \left(V \rho \right) &= (2 \pi)^{-d/2} \nabla \cdot \left(e^{-H} \begin{pmatrix} \frac{\partial H}{\partial v} \\ -\frac{\partial H}{\partial x} \end{pmatrix} \right) \\ &= (2 \pi)^{-d/2} \sum_{i=1}^d \left[\frac{\partial}{\partial x_i} \left(e^{-H} \frac{\partial H}{\partial v_i} \right) - \frac{\partial}{\partial v_i} \left(e^{-H} \frac{\partial H}{\partial x_i} \right) \right] \\ &= (2 \pi)^{-d/2} \sum_{i=1}^d \left[e^{-H} \left(\frac{\partial^2 H}{\partial x_i \partial v_i} - \frac{\partial H}{\partial x_i} \frac{\partial H}{\partial v_i} \right) - e^{-H} \left(\frac{\partial^2 H}{\partial x_i \partial v_i} - \frac{\partial H}{\partial x_i} \frac{\partial H}{\partial v_i} \right)\right] = 0. \end{align*}

This implies that $\rho(t, x, v) = \rho_b(x, v)$ solves the continuity equation (observe that $\frac{\partial}{\partial t} \rho(t, x, v) = 0$ because $\rho(t, x, v)$ is constant in $t$).

This shows that if $x \sim \pi$, then the solution $x(\sigma)$ to the Hamiltonian dynamics at any time $\sigma$ is distributed according to $\pi$. This means that $\pi$ is stationary for the Hamiltonian Markov chain.
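Stationarity can also be seen numerically (a sketch of my own, again using the fact that the exact flow for the standard Gaussian target is a phase-space rotation): starting from $x \sim \pi$ and $v \sim N(0, 1)$ independently, the position after time $\sigma$ still looks like a draw from $\pi$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 100_000, 0.9

x = rng.standard_normal(n)  # x ~ pi = N(0, 1)
v = rng.standard_normal(n)  # v ~ N(0, 1), independent of x

# Exact Hamiltonian flow for pi = N(0, 1): rotation by angle sigma.
x_sigma = x * np.cos(sigma) + v * np.sin(sigma)

print(x_sigma.mean(), x_sigma.var())  # ≈ 0 and ≈ 1: still N(0, 1)
```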