We are discussing standard MCMC techniques to obtain samples from a probability distribution π. All these techniques are based on the Metropolis-Hastings scheme: given the current value x of the chain, a proposal y is first generated according to some transition kernel Q(x,y), and the proposal is then accepted with probability
min{1, π(y)Q(y,x) / (π(x)Q(x,y))}.
Different MCMC methods differ in how the proposal kernel Q(x,y) is chosen. We shall look at three main techniques for choosing Q(x,y): Random Walk Metropolis (RWM), the Metropolis Adjusted Langevin Algorithm (MALA) and Hamiltonian Monte Carlo (HMC).
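As a concrete illustration (a minimal sketch, not from the notes; the function and argument names are mine), one Metropolis-Hastings step can be written as:

```python
import numpy as np

def mh_step(x, log_pi, propose, log_q, rng):
    """One Metropolis-Hastings step.

    log_pi : log-density of the target pi (up to an additive constant)
    propose: draws y ~ Q(x, .)
    log_q  : log Q(x, y) up to a constant; needed for asymmetric proposals
    """
    y = propose(x, rng)
    # log of the M-H ratio pi(y) Q(y, x) / (pi(x) Q(x, y))
    log_alpha = log_pi(y) + log_q(y, x) - log_pi(x) - log_q(x, y)
    if np.log(rng.uniform()) < log_alpha:
        return y, True   # accept
    return x, False      # reject
```

For a symmetric proposal (as in RWM), the log_q terms cancel and can be set to zero.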
When the dimension d is small, RWM works reasonably well. When d is large, it works only if σ is chosen to be very small. However, with such a small σ, one needs to run the chain for a very long time to ensure adequate mixing. The consequence is that RWM does not work well when d is large.
Case 1: σ ≫ d^{-1/2}. Then σ²d → ∞ so that EA ≫ std(A). Then (by standard concentration), A is tightly concentrated around its mean dσ²/2 → ∞. Therefore the acceptance probability satisfies
min{1, e^{−A}} ≈ e^{−dσ²/2} → 0,
so proposals are essentially never accepted.
Case 2: σ ∼ d^{-1/2}. Now A is an O(1) random variable fluctuating around an O(1) mean, so e^{−A} is a positive O(1) quantity. The acceptance probability is then bounded away from zero, so the chain can actually move.
The conclusion is that the acceptance probability of RWM is non-degenerate if and only if σ = O(d^{-1/2}). Equivalently, the proposal step must shrink as the dimension grows, with σ ∼ c/√d being the critical scaling. This is the fundamental tuning constraint for RWM in high dimensions. When d is large, a step size of order d^{-1/2} tends to be too small, with the consequence that the chain needs a very large number of steps to sample from the whole of π. This makes RWM very inefficient in high dimensions.
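A quick experiment (a sketch under the N(0, I_d) target, with illustrative constants of my choosing) makes the scaling visible: with σ = 1/√d the acceptance rate stays moderate as d grows, while a dimension-independent σ drives it toward zero.

```python
import numpy as np

def rwm_acceptance_rate(d, sigma, n_steps=5000, seed=0):
    """Average RWM acceptance rate for the N(0, I_d) target."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=d)          # start the chain in stationarity
    accepts = 0
    for _ in range(n_steps):
        y = x + sigma * rng.normal(size=d)
        # log pi(y) - log pi(x) for pi = N(0, I_d); the proposal is symmetric
        log_alpha = 0.5 * (x @ x - y @ y)
        if np.log(rng.uniform()) < log_alpha:
            x, accepts = y, accepts + 1
    return accepts / n_steps
```

With d = 100, the critical scaling σ = 1/√d = 0.1 keeps the acceptance rate around one half, whereas a fixed σ = 0.5 collapses it to a few percent.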
As in RWM, it is important to tune σ appropriately. When d is large, one still needs to choose σ small; however, σ does not have to be as small as in the case of RWM for MALA to work well. This is easily illustrated in the N(0, I_d) example.
for a constant ℓ. Note that this value of σ is much larger than d^{-1/2}, so that RWM proposals would essentially always be rejected with this σ. But for the random variable B, we have
which is also bounded by a constant. Thus, for (27) (and under the assumption (26) on x), the random variable B is bounded above by a constant with high probability, which means that the acceptance probability remains bounded away from zero. This is in sharp contrast to the case of RWM, where the acceptance probability is extremely small for the choice (27). This argument indicates that MALA will be more effective than RWM when d is large. Specifically, MALA can be expected to work for much larger values of σ, which leads to faster mixing.
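The MALA step can be sketched as follows (a minimal Python sketch; I use the standard parametrization y = x + (σ²/2)∇log π(x) + σZ, which may differ in constants from (15), and the names are mine):

```python
import numpy as np

def mala_step(x, log_pi, grad_log_pi, sigma, rng):
    """One MALA step: proposal y = x + (sigma^2/2) grad log pi(x) + sigma Z,
    followed by a Metropolis-Hastings accept/reject correction."""
    d = x.shape[0]

    def log_q(frm, to):
        # log density (up to a constant) of the proposal kernel
        # N(frm + (sigma^2/2) grad_log_pi(frm), sigma^2 I) evaluated at `to`
        mean = frm + 0.5 * sigma**2 * grad_log_pi(frm)
        return -np.sum((to - mean) ** 2) / (2 * sigma**2)

    y = x + 0.5 * sigma**2 * grad_log_pi(x) + sigma * rng.normal(size=d)
    # the kernel is asymmetric, so the q-ratio does not cancel
    log_alpha = log_pi(y) + log_q(y, x) - log_pi(x) - log_q(x, y)
    if np.log(rng.uniform()) < log_alpha:
        return y, True
    return x, False
```

On the N(0, I_d) example one can check that moderate step sizes still give a high acceptance rate, in line with the argument above.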
Note that if ∇logπ(x(t)) is replaced by ∇logπ(x), then it is easy to check that y equals the MALA proposal (15).
The ODE (30) is, in most cases, impossible to solve exactly. But in the Gaussian case, it is easy to write a formula for y=x(σ). This is illustrated next.
So the acceptance probability of this chain equals 1 regardless of the value of σ. This means that the HMC chain can be run with a large value of σ without affecting the acceptance probability at all. This is quite unlike the case of RWM and MALA, where σ has to be small, otherwise the acceptance probability becomes very small. The possibility of taking large steps (in terms of σ) is one of the main advantages of HMC.
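For the N(0, I_d) target this can be checked numerically (a sketch; I take the Hamiltonian H = |x|²/2 + |p|²/2, whose flow x' = p, p' = −x is the rotation x(t) = x₀ cos t + p₀ sin t, p(t) = −x₀ sin t + p₀ cos t):

```python
import numpy as np

def gaussian_hmc_flow(x0, p0, t):
    """Exact Hamiltonian flow for H = |x|^2/2 + |p|^2/2 (target N(0, I_d)):
    x' = p, p' = -x, i.e. a rotation in (x, p) space."""
    x_t = x0 * np.cos(t) + p0 * np.sin(t)
    p_t = -x0 * np.sin(t) + p0 * np.cos(t)
    return x_t, p_t

def hamiltonian(x, p):
    return 0.5 * (x @ x + p @ p)

# Energy is conserved exactly for any integration time t = sigma,
# so the M-H acceptance probability equals 1 regardless of sigma.
```

Since the flow is a rotation, H(x(t), p(t)) = H(x₀, p₀) exactly, which is why the acceptance probability is identically 1 in this case.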
We shall see, in the next lecture, that, for general π (not just Gaussian), the HMC proposal (30) satisfies detailed balance with respect to π for every σ (so that the subsequent M-H acceptance probability equals 1). In practice, however, the HMC proposal (30) cannot be used directly because the ODE cannot be solved exactly. One instead uses an approximate numerical solution to (30). This approximation spoils the detailed balance property, however, so a subsequent Metropolis accept/reject step is necessary. The acceptance probability, while not exactly equal to 1, will usually still be high for HMC even when σ is not small. More details will be provided next week.
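The standard numerical scheme for (30) is the leapfrog (Störmer–Verlet) integrator. A sketch (with step size h and n_steps steps, both names mine; I write the dynamics as x' = p, p' = ∇log π(x)):

```python
import numpy as np

def leapfrog(x, p, grad_log_pi, h, n_steps):
    """Leapfrog integration of x' = p, p' = grad log pi(x).

    The scheme is volume-preserving and time-reversible, so the only
    detailed-balance error comes from the O(h^2) energy drift, which the
    subsequent M-H accept/reject step corrects exactly.
    """
    p = p + 0.5 * h * grad_log_pi(x)       # initial half step in momentum
    for _ in range(n_steps - 1):
        x = x + h * p                       # full step in position
        p = p + h * grad_log_pi(x)          # full step in momentum
    x = x + h * p
    p = p + 0.5 * h * grad_log_pi(x)        # final half step in momentum
    return x, p
```

Because the energy error stays O(h²) rather than accumulating, the acceptance probability remains close to 1 even over long trajectories, which is the numerical counterpart of the exact-flow argument above.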