
STAT 238 - Bayesian Statistics Lecture Thirty Four

Spring 2026, UC Berkeley

Hamiltonian Monte Carlo

There is a given target probability density $\pi$ and our goal is to construct a Markov chain which satisfies detailed balance with respect to $\pi$. Given a current value $x$, HMC constructs the next value $y$ by solving the following second-order ODE:

\begin{align} \ddot{x}(t) = \nabla \log \pi(x(t)) ~~ \text{ with initialization $x(0) = x$ and $\dot{x}(0) = z \sim N(0, I_d)$}. \end{align}

This second-order ODE can alternatively be written in the following first-order form:

\begin{align} \begin{pmatrix} \dot{x}(t) \\ \dot{v}(t) \end{pmatrix} = \begin{pmatrix} v(t) \\ \nabla \log \pi(x(t)) \end{pmatrix} ~\text{with initialization}~ \begin{pmatrix} x(0) \\ v(0) \end{pmatrix} = \begin{pmatrix} x \\ z \end{pmatrix} \end{align}

which can also be written in terms of the Hamiltonian:

\begin{align} H(x, v) = H(x_1, \dots, x_d, v_1, \dots, v_d) := -\log \pi(x) + \frac{1}{2}\|v\|^2 \end{align}

as

\begin{align} \begin{pmatrix} \dot{x}(t) \\ \dot{v}(t) \end{pmatrix} = \begin{pmatrix} \frac{\partial H}{\partial v}(x(t), v(t)) \\ -\frac{\partial H}{\partial x}(x(t), v(t)) \end{pmatrix} ~\text{with initialization}~ \begin{pmatrix} x(0) \\ v(0) \end{pmatrix} = \begin{pmatrix} x \\ z \end{pmatrix} \end{align}
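As a concrete example (a standard fact, not spelled out in the original notes): for the standard Gaussian target $\pi = N(0, I_d)$ we have $\nabla \log \pi(x) = -x$, so the ODE becomes $\ddot{x}(t) = -x(t)$, with explicit solution

\begin{align*} x(t) = x \cos t + z \sin t, \qquad \dot{x}(t) = -x \sin t + z \cos t. \end{align*}

The Hamiltonian flow is thus a rotation in phase space; in particular, $\|x(t)\|^2 + \|\dot{x}(t)\|^2$ stays constant in $t$, a first glimpse of the conservation property discussed below.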

Most authors (see e.g., neal2011mcmc) use the Hamiltonian formulation to describe HMC. The advantage of the second-order formulation is that it makes the connection to MALA very clear (essentially, MALA proposals are obtained by replacing $\nabla \log \pi(x(t))$ by $\nabla \log \pi(x(0))$).

On the other hand, the main advantage of the Hamiltonian formulation is that it also works for choices of the Hamiltonian that differ from the Gaussian kinetic energy above. For example, consider the following alternative choices of the Hamiltonian:

  1. $H(x, v) = -\log \pi(x) + v^T M v/2$ for a positive definite matrix $M$. Here the Hamiltonian dynamics become

    \begin{align*} \begin{pmatrix} \dot{x}(t) \\ \dot{v}(t) \end{pmatrix} = \begin{pmatrix} M v(t) \\ \nabla \log \pi(x(t)) \end{pmatrix} \end{align*}

    because $\frac{\partial H}{\partial x} = -\nabla \log \pi(x)$ and $\frac{\partial H}{\partial v} = M v$. This formulation of the Hamiltonian is meaningful when the variables $x_1, \dots, x_d$ have different scales. One can also make $M$ depend on $x$. This is related to Riemannian Manifold HMC (see girolami2011riemann).

  2. $H(x, v) = -\log \pi(x) + \|v\|_1$. Here the Hamiltonian dynamics become

    \begin{align*} \begin{pmatrix} \dot{x}(t) \\ \dot{v}(t) \end{pmatrix} = \begin{pmatrix} \text{sign}(v(t)) \\ \nabla \log \pi(x(t)) \end{pmatrix} \end{align*}

    because $\frac{\partial H}{\partial v} = \text{sign}(v)$ (for coordinates with $v_i \neq 0$). Also note that $v(t) = (v_1(t), \dots, v_d(t))$ and $\text{sign}(v(t))$ is interpreted coordinatewise as $(\text{sign}(v_1(t)), \dots, \text{sign}(v_d(t)))$. These are interesting dynamics that cannot be written in a second-order form as before.
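To make this generality concrete, here is a minimal sketch (my own illustration, not from the notes) that integrates the first-order Hamiltonian system by forward Euler with a pluggable kinetic-energy gradient; the names `grad_log_pi` and `grad_K` are invented for this example. It then checks that, for a Gaussian target with kinetic energy $v^T M v/2$, the Hamiltonian is approximately conserved (up to Euler discretization error).

```python
import numpy as np

def euler_hamiltonian(x, v, grad_log_pi, grad_K, t_final, h=1e-4):
    """Crude forward-Euler integration of
    x' = dH/dv = grad_K(v),  v' = -dH/dx = grad_log_pi(x)."""
    for _ in range(int(t_final / h)):
        x, v = x + h * grad_K(v), v + h * grad_log_pi(x)
    return x, v

# Standard Gaussian target pi = N(0, I_2), kinetic energy K(v) = v^T M v / 2.
M = np.diag([1.0, 4.0])
grad_log_pi = lambda x: -x
grad_K = lambda v: M @ v

x0, v0 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x1, v1 = euler_hamiltonian(x0, v0, grad_log_pi, grad_K, t_final=0.5)

# H = ||x||^2/2 + v^T M v/2 should be (approximately) conserved.
H = lambda x, v: 0.5 * x @ x + 0.5 * v @ M @ v
print(abs(H(x1, v1) - H(x0, v0)))  # small (discretization error only)
```

Forward Euler is used only for transparency here; in practice one would use the leapfrog scheme described later in these notes.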

Hamiltonian Dynamics

Some understanding of Hamiltonian dynamics will be useful for HMC. Hamiltonian dynamics refers to the ODE

\begin{align} \begin{pmatrix} \dot{x}(t) \\ \dot{v}(t) \end{pmatrix} = \begin{pmatrix} \frac{\partial H}{\partial v}(x(t), v(t)) \\ -\frac{\partial H}{\partial x}(x(t), v(t)) \end{pmatrix} ~\text{with initialization}~ \begin{pmatrix} x(0) \\ v(0) \end{pmatrix} = \begin{pmatrix} x \\ v \end{pmatrix} \end{align}

The above is the same as before except that we denote the initial velocity by $v$ (instead of $z$). In this section, we will not use the specific Gaussian form of the Hamiltonian, but will take the more general form:

\begin{align} H(x, v) = -\log \pi(x) + K(v) \end{align}

where $K(v)$ is a symmetric function of $v$, i.e., $K(-v) = K(v)$.

We will view the Hamiltonian dynamics (run for a fixed time $\sigma$) as a mapping from $(x, v)$ (the initial state) to a final state $(x(\sigma), v(\sigma))$. We will denote this mapping by $T_{\sigma}(x, v)$:

\begin{align*} T_{\sigma}(x, v) = (x(\sigma), v(\sigma)). \end{align*}

The space of all pairs of position and velocity $(x, v)$ will be referred to as the phase space.

The basic properties of $T_{\sigma}(\cdot)$ that we need for HMC are summarized in this section. For more details, you can refer to neal2011mcmc.

Property One: Invertibility or Reversibility

The map $T_{\sigma}(x, v)$ is invertible and its inverse can be written in a simple manner. Let $S(x, v) = (x, -v)$ be the operator which flips the velocity. Then

\begin{align} T_{\sigma}^{-1} = S T_{\sigma} S. \end{align}

In other words, to compute $T_{\sigma}^{-1}$ at a point $(y, w)$, we need to take the following three steps:

  1. First, flip the velocity to obtain $(y, -w)$.

  2. Second, run the Hamiltonian dynamics for time $\sigma$ starting at $(y, -w)$ to obtain $(x, -v)$ at time $\sigma$.

  3. Third, flip the velocity again to obtain $(x, v)$.

The formula $T_{\sigma}^{-1} = S T_{\sigma} S$ has the following simple consequence:

\begin{align} T_{\sigma} S = (T_{\sigma} S)^{-1}. \end{align}

To see this, simply note

\begin{align*} (T_{\sigma} S)^{-1} = S^{-1} T_{\sigma}^{-1} = S^{-1} S T_{\sigma} S = T_{\sigma} S. \end{align*}

A function $g$ for which $g^{-1} = g$ (equivalently, $g(g(x)) = x$ for all $x$) is called an involution. The above statement is therefore equivalent to saying that $T_{\sigma} S$ is an involution. Involutions have recently been used to unify many MCMC algorithms (see e.g., glatt2024sacred).
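The involution property can be checked numerically. The sketch below (my own, not from the notes) uses the fact that for the standard Gaussian target the flow $T_\sigma$ is exactly the phase-space rotation by angle $\sigma$, and verifies that applying $T_\sigma S$ twice returns the starting point:

```python
import numpy as np

def T(sigma, x, v):
    # Exact Hamiltonian flow for pi = N(0, 1): rotation by angle sigma.
    return (x * np.cos(sigma) + v * np.sin(sigma),
            -x * np.sin(sigma) + v * np.cos(sigma))

def S(x, v):
    return x, -v  # flip the velocity

sigma, x, v = 0.7, 1.3, -0.4
y, w = T(sigma, *S(*T(sigma, *S(x, v))))  # apply (T_sigma S) twice
print(y, w)  # (1.3, -0.4) up to floating-point error
```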

Property Two: Hamiltonian Conservation

Hamiltonian dynamics conserve the Hamiltonian in the sense that:

\begin{align} H(x(t), v(t)) = \text{constant}. \end{align}

To see this, just note that

\begin{align*} \frac{d}{dt} H(x(t), v(t)) &= \sum_{i=1}^d \left( \frac{\partial H}{\partial x_i} \dot{x}_i(t) + \frac{\partial H}{\partial v_i} \dot{v}_i(t) \right) \\ &= \sum_{i=1}^d \left(\frac{\partial H}{\partial x_i} \frac{\partial H}{\partial v_i} + \frac{\partial H}{\partial v_i} \left(- \frac{\partial H}{\partial x_i}\right) \right) = \sum_{i=1}^d \left(\frac{\partial H}{\partial x_i} \frac{\partial H}{\partial v_i} - \frac{\partial H}{\partial v_i} \frac{\partial H}{\partial x_i} \right) = 0 \end{align*}
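Conservation holds for non-Gaussian targets too. Here is a small numerical check (my own sketch): we integrate the dynamics for the double-well target $\pi(x) \propto e^{-(x^2-1)^2}$ with a standard fourth-order Runge-Kutta scheme and confirm that $H$ stays essentially constant along the trajectory.

```python
import numpy as np

def rk4_flow(x, v, grad_log_pi, t_final, h=1e-3):
    """RK4 integration of the Hamiltonian system x' = v, v' = grad_log_pi(x)."""
    f = lambda s: np.array([s[1], grad_log_pi(s[0])])
    s = np.array([x, v], dtype=float)
    for _ in range(int(t_final / h)):
        k1 = f(s); k2 = f(s + h/2*k1); k3 = f(s + h/2*k2); k4 = f(s + h*k3)
        s = s + h/6 * (k1 + 2*k2 + 2*k3 + k4)
    return s[0], s[1]

grad_log_pi = lambda x: -4 * x * (x**2 - 1)  # pi(x) ∝ exp(-(x^2 - 1)^2)
H = lambda x, v: (x**2 - 1)**2 + v**2 / 2    # Hamiltonian for this target

x0, v0 = 0.5, 1.0
x1, v1 = rk4_flow(x0, v0, grad_log_pi, t_final=2.0)
print(abs(H(x1, v1) - H(x0, v0)))  # tiny: H is conserved along the flow
```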

Property Three: Volume Preservation

Volume preservation means that if we take a region $R$ in the phase space, then

\begin{align} \text{vol}(R) = \text{vol}(T_{\sigma}(R)) \end{align}

where $T_{\sigma}(R) = \{T_{\sigma}(x, v): (x, v) \in R\}$.

To analyze the volume preservation property, note that

\begin{align*} \text{vol}(T_{\sigma}(R)) = \int I\{(y, w) \in T_{\sigma}(R)\} d(y, w) = \int I\{T_{\sigma}^{-1} (y, w) \in R\} d(y, w). \end{align*}

Using the change of variables $(x, v) = T_{\sigma}^{-1}(y, w)$ above, we get

\begin{align*} \int I\{T_{\sigma}^{-1} (y, w) \in R\} d(y, w) = \int I\{(x, v) \in R\} |\det J T_{\sigma}(x, v)| d(x, v) \end{align*}

where $J T_{\sigma}$ denotes the Jacobian of $T_{\sigma}$. Thus volume preservation can be proved by showing that the Jacobian $J T_{\sigma}$ has determinant equal to 1 for all $(x, v)$:

\begin{align} \det J T_{\sigma}(x, v) = 1 ~~ \text{ for all $(x, v)$}. \end{align}

Here is a sketch of the proof. Assume first that $\sigma$ is small so that $T_{\sigma}$ can be well-approximated by a linear function:

\begin{align*} T_{\sigma}(x, v) \approx \begin{pmatrix} x \\ v \end{pmatrix} + \sigma \begin{pmatrix} \frac{\partial H}{\partial v}(x, v) \\ - \frac{\partial H}{\partial x}(x, v) \end{pmatrix} \end{align*}

As a result

\begin{align*} J T_{\sigma} \approx I + \sigma \begin{pmatrix} \frac{\partial^2 H}{\partial x \partial v} & \frac{\partial^2 H}{\partial v^2} \\ -\frac{\partial^2 H}{\partial x^2} & - \frac{\partial^2 H}{\partial x \partial v} \end{pmatrix} \end{align*}

which shows (writing the blocks as scalars, i.e., for $d = 1$; the general case is analogous) that

\begin{align*} \det (J T_{\sigma}) \approx \det \begin{pmatrix} 1 + \sigma \frac{\partial^2 H}{\partial x \partial v} & \sigma\frac{\partial^2 H}{\partial v^2} \\ -\sigma\frac{\partial^2 H}{\partial x^2} & 1 - \sigma \frac{\partial^2 H}{\partial x \partial v} \end{pmatrix} = 1 + O(\sigma^2). \end{align*}

If $\sigma$ is not small, then break $[0, \sigma]$ into $N$ intervals each of length $\sigma/N$, and decompose $T_{\sigma}$ as $T_N \circ T_{N-1} \circ \dots \circ T_1$, where $T_j$ is the mapping corresponding to Hamiltonian dynamics from $t = (j-1)\sigma/N$ to $t = j\sigma/N$. For each $j$, the previous argument gives $\det J T_j = 1 + O(\sigma^2/N^2)$. Taking the product over $j$, we get

\begin{align*} \det J T_{\sigma} = \prod_{j=1}^N \left(1 + O(\sigma^2/N^2) \right) \rightarrow 1 \end{align*}

as $N \rightarrow \infty$. This completes the heuristic argument.

Discretization of the Hamiltonian dynamics

The Hamiltonian ODE cannot be solved exactly for most $\pi(\cdot)$, so we have to solve it approximately using a discretization technique. We fix a step size $\epsilon$ and approximate the ODE at $t = 0, \epsilon, 2\epsilon, \dots$. More specifically, we will construct:

\begin{align*} \begin{pmatrix} x(0) \\ v(0) \end{pmatrix}, \begin{pmatrix} x(\epsilon) \\ v(\epsilon) \end{pmatrix}, \begin{pmatrix} x(2 \epsilon) \\ v(2 \epsilon) \end{pmatrix}, \dots \end{align*}

Below we describe how to obtain $(x(t+\epsilon), v(t + \epsilon))$ from $(x(t), v(t))$ (this process is then applied iteratively starting at $t = 0$).

The standard approach to discretizing the HMC equation is the leapfrog discretization, which uses the following formulae to construct $x(t+\epsilon)$ and $v(t + \epsilon)$ from $x(t), v(t)$:

\begin{split} & v\left(t + \frac{\epsilon}{2}\right) = v(t) + \frac{\epsilon}{2} \nabla \log \pi(x(t)) \\ & x(t + \epsilon) = x(t) + \epsilon v\left(t + \frac{\epsilon}{2} \right) \\ & v(t + \epsilon) = v \left(t + \frac{\epsilon}{2} \right) + \frac{\epsilon}{2} \nabla \log \pi(x(t + \epsilon)) \end{split}

If we denote this map from $(x(t), v(t))$ to $(x(t + \epsilon), v(t + \epsilon))$ by $T^{\text{disc}}_{\epsilon}(x, v)$, then it is straightforward to verify that

\begin{split} &T_{\epsilon}^{\text{disc}}(x, v)\\ &= \left(x + \epsilon v + \frac{\epsilon^2}{2} \nabla \log \pi(x),\; v + \frac{\epsilon}{2} \nabla \log \pi(x) + \frac{\epsilon}{2}\nabla \log \pi \left(x + \epsilon v + \frac{\epsilon^2}{2} \nabla \log \pi(x) \right) \right). \end{split}

Note that the position update above is reminiscent of MALA.
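The three leapfrog formulae translate directly into code. The sketch below (my own, with invented names like `leapfrog_step`) also checks one step against the explicit composed formula for $T^{\text{disc}}_{\epsilon}$:

```python
import numpy as np

def leapfrog_step(x, v, grad_log_pi, eps):
    """One leapfrog step: half velocity kick, full position drift, half kick."""
    v = v + eps / 2 * grad_log_pi(x)   # v(t + eps/2)
    x = x + eps * v                    # x(t + eps)
    v = v + eps / 2 * grad_log_pi(x)   # v(t + eps)
    return x, v

grad_log_pi = lambda x: -x  # standard Gaussian target
eps = 0.1
x, v = np.array([1.0, -2.0]), np.array([0.5, 0.3])

x1, v1 = leapfrog_step(x, v, grad_log_pi, eps)

# Explicit composed formula for T_eps^disc:
x_explicit = x + eps * v + eps**2 / 2 * grad_log_pi(x)
v_explicit = v + eps / 2 * grad_log_pi(x) + eps / 2 * grad_log_pi(x_explicit)
print(np.allclose(x1, x_explicit) and np.allclose(v1, v_explicit))  # True
```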

If we apply $N$ leapfrog steps in succession, then the overall mapping is given by:

\begin{align*} T^{\text{disc}, N}_{\epsilon} = T_{\epsilon}^{\text{disc}} \circ \dots \circ T_{\epsilon}^{\text{disc}}. \end{align*}

This function $T^{\text{disc}, N}_{\epsilon}$ shares the following two properties of the continuous Hamiltonian dynamics:

  1. Invertibility or Time Reversibility: One can check that $S T_{\epsilon}^{\text{disc}, N} S$ serves as the inverse of $T_{\epsilon}^{\text{disc}, N}$. This can be verified by proving that

    \begin{align*} \left(T_{\epsilon}^{\text{disc}} \right)^{-1} = S T_{\epsilon}^{\text{disc}} S. \end{align*}

    Given the explicit formula for $T_{\epsilon}^{\text{disc}}$ above, this verification is straightforward.

  2. Volume Preservation: $T_{\epsilon}^{\text{disc}, N}$ is volume-preserving. This can be verified by showing that the determinant of the Jacobian of $T_{\epsilon}^{\text{disc}, N}$ equals one. This, in turn, follows from the determinant of the Jacobian of $T_{\epsilon}^{\text{disc}}$ equalling 1, which can be checked directly from the explicit formula, or by noting that $T_{\epsilon}^{\text{disc}}$ is the composition of three mappings:

    \begin{align*} T_{\epsilon}^{\text{disc}} = A_{\epsilon/2} \circ B_{\epsilon} \circ A_{\epsilon/2} \end{align*}

    where

    \begin{align*} A_{\epsilon/2}(x, v) = \left(x, v + \frac{\epsilon}{2} \nabla \log \pi(x)\right) ~~ \text{ and } ~~ B_{\epsilon}(x, v) = (x + \epsilon v, v). \end{align*}

    These mappings are very simple and one can directly verify that they are volume preserving (i.e., the determinants of their Jacobians equal 1).
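Both properties can be checked numerically. The sketch below (my own) verifies time reversibility for the double-well target $\pi(x) \propto e^{-(x^2-1)^2}$, and approximates the Jacobian determinant of one leapfrog step by finite differences:

```python
import numpy as np

def leapfrog(x, v, grad_log_pi, eps, N):
    for _ in range(N):
        v = v + eps / 2 * grad_log_pi(x)
        x = x + eps * v
        v = v + eps / 2 * grad_log_pi(x)
    return x, v

grad_log_pi = lambda x: -4 * x * (x**2 - 1)  # pi(x) ∝ exp(-(x^2 - 1)^2), d = 1
eps, N = 0.05, 10
x0, v0 = 0.3, 0.8

# Time reversibility: flip velocity, run forward, flip again recovers the start.
xN, vN = leapfrog(x0, v0, grad_log_pi, eps, N)
xb, vb = leapfrog(xN, -vN, grad_log_pi, eps, N)
print(xb, -vb)  # recovers (x0, v0) up to floating-point error

# Volume preservation: finite-difference Jacobian of one step has det ≈ 1.
h = 1e-6
step = lambda s: np.array(leapfrog(s[0], s[1], grad_log_pi, eps, 1))
s0 = np.array([x0, v0])
J = np.column_stack([(step(s0 + h * e) - step(s0)) / h for e in np.eye(2)])
print(np.linalg.det(J))  # ≈ 1
```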

While the discretized Hamiltonian dynamics satisfy invertibility (or time-reversibility) and volume preservation, they do not satisfy Hamiltonian conservation (unlike continuous Hamiltonian dynamics). Because of this, a Metropolis acceptance correction has to be applied when implementing HMC with leapfrog discretization. This will be described in the next lecture.

On stationarity of $\pi$ for the continuous Hamiltonian dynamics

For the continuous Hamiltonian dynamics, it turns out that if the initial conditions $x$ and $v$ are distributed independently according to $x \sim \pi$ and $v \sim N(0, I_d)$, then $x(\sigma)$ and $v(\sigma)$ are also independently distributed as $x(\sigma) \sim \pi$ and $v(\sigma) \sim N(0, I_d)$ for every $\sigma$. This can be proved using the continuity equation that we discussed in the last lecture. The argument is given below.

Continuity equation

Consider the ODE

\begin{align} \dot{X}(t) = V(t, X(t)) ~~\text{for $t \geq 0$}. \end{align}

Here $X(t) \in \mathbb{R}^d$ and $V(t, \cdot) : \mathbb{R}^d \rightarrow \mathbb{R}^d$. Suppose we initialize the ODE at $t = 0$ with a random variable having density $\rho_b$:

\begin{align*} X(0) \sim \rho_b. \end{align*}

What then is the density of $X(t)$? Let $\rho(t, x)$ denote the density of $X(t)$ evaluated at $x$. Then $\rho(t, x)$ satisfies the following PDE:

\begin{align} \frac{\partial}{\partial t} \rho(t, x) = -\nabla \cdot \left(V(t, x) \rho(t, x) \right) ~~ \text{ with initialization } \rho(0, x) = \rho_b(x), \end{align}

where

\begin{align*} \nabla \cdot \left(V(t, x) \rho(t, x) \right) = \text{div}(V(t, x) \rho(t, x)) := \sum_{i=1}^d \frac{\partial}{\partial x_i} \left(V_i(t, x) \rho(t, x) \right) \end{align*}

This PDE is known as the continuity equation or transport equation (it is closely related to the Fokker-Planck and Kolmogorov forward equations).

The continuity equation has been popular in the recent literature on generative modeling (see e.g., lai2025principles).
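As a sanity check (my own, not from the notes): for the 1D linear drift $V(t, x) = -x$ with $X(0) \sim N(0, 1)$, the solution is $X(t) = X(0) e^{-t} \sim N(0, e^{-2t})$, and one can verify by finite differences that this family of densities satisfies the continuity equation:

```python
import numpy as np

def rho(t, x):
    # Density of X(t) = X(0) e^{-t} with X(0) ~ N(0, 1), i.e., N(0, e^{-2t}).
    s2 = np.exp(-2 * t)
    return np.exp(-x**2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

V = lambda t, x: -x
t, x = 0.4, np.linspace(-2, 2, 9)
dt, dx = 1e-5, 1e-5

lhs = (rho(t + dt, x) - rho(t - dt, x)) / (2 * dt)    # d rho / dt
flux = lambda xx: V(t, xx) * rho(t, xx)
rhs = -(flux(x + dx) - flux(x - dx)) / (2 * dx)       # -d(V rho)/dx
print(np.max(np.abs(lhs - rhs)))  # small: the continuity equation holds
```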

Application to Hamiltonian Monte Carlo

Observe that the Hamiltonian dynamics are a special case of the above ODE, with $x$ there corresponding to the pair $(x, v)$ and

\begin{align*} V(t, x, v) = \begin{pmatrix} \frac{\partial H}{\partial v} \\ -\frac{\partial H}{\partial x} \end{pmatrix} \end{align*}

Suppose now that $x \sim \pi$ and $v \sim N(0, I_d)$ independently, so that the initial density of $(x(0), v(0))$ is

\begin{align*} \rho_b(x, v) = \pi(x) \phi(v) = (2 \pi)^{-d/2} \exp \left(-H(x, v) \right). \end{align*}

We then claim that

\begin{align*} \rho(t, x, v) = (2 \pi)^{-d/2} \exp(-H(x, v)) ~~ \text{for every $t \geq 0$} \end{align*}

satisfies the continuity equation. To see this, simply note that

\begin{align*} \nabla \cdot \left(V \rho \right) &= (2 \pi)^{-d/2} \nabla \cdot \left(e^{-H} \begin{pmatrix} \frac{\partial H}{\partial v} \\ -\frac{\partial H}{\partial x} \end{pmatrix} \right) \\ &= (2 \pi)^{-d/2} \sum_{i=1}^d \left[\frac{\partial}{\partial x_i} \left(e^{-H} \frac{\partial H}{\partial v_i} \right) - \frac{\partial}{\partial v_i} \left(e^{-H} \frac{\partial H}{\partial x_i} \right) \right] \\ &= (2 \pi)^{-d/2} \sum_{i=1}^d \left[e^{-H} \left(\frac{\partial^2 H}{\partial x_i \partial v_i} - \frac{\partial H}{\partial x_i} \frac{\partial H}{\partial v_i} \right) - e^{-H} \left(\frac{\partial^2 H}{\partial x_i \partial v_i} - \frac{\partial H}{\partial x_i} \frac{\partial H}{\partial v_i} \right)\right] = 0. \end{align*}

This implies that $\rho(t, x, v) = \rho_b(x, v)$ solves the continuity equation (observe that $\frac{\partial}{\partial t} \rho(t, x, v) = 0$ because $\rho(t, x, v)$ is constant in $t$).

This shows that if $x \sim \pi$, then the solution $x(\sigma)$ to the Hamiltonian dynamics at any time $\sigma$ is distributed according to $\pi$. This means that $\pi$ is stationary for the Hamiltonian Markov chain.
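Stationarity can also be seen numerically (a sketch of my own, again using the fact that the exact flow for the standard Gaussian target is a phase-space rotation): starting from $x \sim \pi$ and $v \sim N(0, 1)$ independently, the position after time $\sigma$ still looks like a draw from $\pi$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 100_000, 0.9

x = rng.standard_normal(n)  # x ~ pi = N(0, 1)
v = rng.standard_normal(n)  # v ~ N(0, 1), independent of x

# Exact Hamiltonian flow for pi = N(0, 1): rotation by angle sigma.
x_sigma = x * np.cos(sigma) + v * np.sin(sigma)

print(x_sigma.mean(), x_sigma.var())  # ≈ 0 and ≈ 1: still N(0, 1)
```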