Variational Inference is a method for approximating a posterior distribution by a member of a simpler, tractable family of distributions. It uses optimization to select the distribution $q$ in that family that is closest to the true posterior under some measure of discrepancy. In practice, variational inference can be much faster than MCMC, though it is sometimes less accurate in representing posterior uncertainty.
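The basic idea behind variational inference is simple. Consider the standard Bayesian setup with data $y$ and parameters $\theta$. The prior is $f_{\theta}(\theta)$ and the likelihood is $f_{y \mid \theta}(y)$. By Bayes' rule, the posterior is then:
$$f_{\theta \mid y}(\theta) = \frac{f_{y \mid \theta}(y)\, f_{\theta}(\theta)}{\int f_{y \mid \theta}(y)\, f_{\theta}(\theta)\, d\theta}.$$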
The denominator is the marginal density $f_y(y)$ of $y$, which is also known as the Evidence. Writing the denominator as $f_y(y)$, we get
$$f_{\theta \mid y}(\theta) = \frac{f_{y \mid \theta}(y)\, f_{\theta}(\theta)}{f_y(y)}.$$
The posterior is usually intractable because of the integral needed to compute the Evidence (the denominator). In variational inference, one chooses a simpler family of distributions $Q$ and then uses optimization to find the distribution in $Q$ that is as close as possible to the posterior $f_{\theta \mid y}(\theta)$. Closeness is most commonly measured by the following Kullback-Leibler divergence:
$$\mathrm{KL}(q \,\|\, f_{\theta \mid y}) = \int q(\theta) \log \frac{q(\theta)}{f_{\theta \mid y}(\theta)}\, d\theta.$$
Recall that the KL divergence between two densities $q$ and $p$ is given by $\mathrm{KL}(q \,\|\, p) = \int q \log(q/p)$. It is always nonnegative, and it equals zero if and only if $q = p$. It is not symmetric: in general, $\mathrm{KL}(q \,\|\, p) \neq \mathrm{KL}(p \,\|\, q)$.
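These properties can be checked numerically. The following is a minimal sketch that uses the discrete analogue of the KL divergence (a sum in place of the integral). The function name `kl_divergence` and the two example distributions are illustrative assumptions, not part of the original text.

```python
import math

def kl_divergence(q, p):
    """Discrete KL(q || p) = sum_i q_i * log(q_i / p_i).

    Assumes q and p are probability vectors of equal length with p_i > 0
    wherever q_i > 0; terms with q_i == 0 contribute zero.
    """
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Two hypothetical distributions on three outcomes.
q = [0.5, 0.4, 0.1]
p = [0.2, 0.3, 0.5]

kl_qp = kl_divergence(q, p)  # KL(q || p)
kl_pq = kl_divergence(p, q)  # KL(p || q), generally a different number
kl_qq = kl_divergence(q, q)  # zero, since the two arguments coincide
print(kl_qp, kl_pq, kl_qq)
```

Both divergences come out positive but unequal, which is why the order of the arguments matters when the KL divergence is used as the objective in variational inference.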
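The optimization problem that we need to solve is therefore:
$$\min_{q \in Q}\; \mathrm{KL}(q \,\|\, f_{\theta \mid y}).$$
Substituting $f_{\theta \mid y}(\theta) = f_{y \mid \theta}(y)\, f_{\theta}(\theta) / f_y(y)$ into the KL divergence gives $\mathrm{KL}(q \,\|\, f_{\theta \mid y}) = \log f_y(y) - \mathrm{ELBO}(q)$, where $\mathrm{ELBO}(q) = \int q(\theta) \log \frac{f_{y \mid \theta}(y)\, f_{\theta}(\theta)}{q(\theta)}\, d\theta$. Since $\log f_y(y)$ does not depend on $q$, minimizing the KL divergence over $q \in Q$ is the same as maximizing $\mathrm{ELBO}(q)$ over $q \in Q$.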
So the main goal in Variational Inference can equivalently be stated as maximizing the ELBO. The name ELBO comes from the fact that $\mathrm{ELBO}(q)$ always satisfies:
$$\mathrm{ELBO}(q) \le \log f_y(y).$$
This follows from expression (5), which says that the difference between $\log f_y(y)$ and $\mathrm{ELBO}(q)$ equals the Kullback-Leibler divergence $\mathrm{KL}(q \,\|\, f_{\theta \mid y})$, which is nonnegative. Because $f_y(y)$ is called the evidence and $\mathrm{ELBO}(q)$ is a lower bound on the logarithm of the evidence, it is given the name Evidence Lower Bound. It is also easy to see that if we maximize $\mathrm{ELBO}(q)$ over all probability densities $q$, then we obtain equality in (9), because the choice $q = f_{\theta \mid y}$ makes the KL divergence zero:
$$\max_{q}\, \mathrm{ELBO}(q) = \log f_y(y).$$
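The lower-bound property can be verified in a model where every quantity is available in closed form. The sketch below assumes a toy conjugate model that is not in the original text: $\theta \sim N(0, 1)$ and $y \mid \theta \sim N(\theta, 1)$, so that the posterior is $N(y/2, 1/2)$ and marginally $y \sim N(0, 2)$. The variational family is taken to be $q = N(m, s^2)$, and all function names are hypothetical.

```python
import math

def elbo(m, s2, y):
    """ELBO of q = N(m, s2) for the toy model theta ~ N(0, 1), y | theta ~ N(theta, 1).

    Computed in closed form as E_q[log likelihood] + E_q[log prior] + entropy(q).
    """
    e_loglik = -0.5 * math.log(2 * math.pi) - 0.5 * ((y - m) ** 2 + s2)
    e_logprior = -0.5 * math.log(2 * math.pi) - 0.5 * (m ** 2 + s2)
    entropy = 0.5 * math.log(2 * math.pi * math.e * s2)
    return e_loglik + e_logprior + entropy

def log_evidence(y):
    """log f_y(y): marginally y ~ N(0, 2) in this toy model."""
    return -0.5 * math.log(2 * math.pi * 2.0) - y ** 2 / 4.0

def kl_to_posterior(m, s2, y):
    """KL(N(m, s2) || N(y/2, 1/2)), where N(y/2, 1/2) is the exact posterior."""
    pm, ps2 = y / 2.0, 0.5
    return 0.5 * (math.log(ps2 / s2) + (s2 + (m - pm) ** 2) / ps2 - 1.0)

y = 1.3  # hypothetical observation
candidates = [(0.0, 1.0), (1.0, 0.2), (y / 2.0, 0.5)]  # last one is the exact posterior
for m, s2 in candidates:
    gap = log_evidence(y) - elbo(m, s2, y)
    print(f"m={m:.2f}, s2={s2:.2f}: ELBO={elbo(m, s2, y):.4f}, "
          f"gap={gap:.4f}, KL={kl_to_posterior(m, s2, y):.4f}")
```

For each candidate, the gap between the log evidence and the ELBO matches the KL divergence to the posterior, and the gap closes when $q$ is the exact posterior.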
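In other words, $\mathrm{ELBO}(q)$ is the average log-likelihood (with respect to $q$) minus the KL divergence between $q$ and the prior: $\mathrm{ELBO}(q) = \mathbb{E}_{\theta \sim q} \log f_{y \mid \theta}(y) - \mathrm{KL}(q \,\|\, f_{\theta})$.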
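This decomposition follows in a few lines from the relation between $\log f_y(y)$, $\mathrm{ELBO}(q)$, and the KL divergence to the posterior. The derivation below is a sketch that assumes $q$ integrates to one and uses Bayes' rule in the form $f_{\theta \mid y}(\theta) = f_{y \mid \theta}(y)\, f_{\theta}(\theta) / f_y(y)$:

$$\begin{aligned}
\mathrm{ELBO}(q) &= \log f_y(y) - \mathrm{KL}(q \,\|\, f_{\theta \mid y}) \\
&= \log f_y(y) - \int q(\theta) \log \frac{q(\theta)\, f_y(y)}{f_{y \mid \theta}(y)\, f_{\theta}(\theta)}\, d\theta \\
&= \int q(\theta) \log f_{y \mid \theta}(y)\, d\theta - \int q(\theta) \log \frac{q(\theta)}{f_{\theta}(\theta)}\, d\theta \\
&= \mathbb{E}_{\theta \sim q} \log f_{y \mid \theta}(y) - \mathrm{KL}(q \,\|\, f_{\theta}).
\end{aligned}$$

The term $\log f_y(y)$ cancels in the third line because it is constant in $\theta$ and $q$ integrates to one.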
Usually, the main tasks in Variational Inference are (a) to choose a class of distributions Q, and (b) to maximize ELBO(q) (or, equivalently, to minimize F(q)) over q∈Q. We will see how to do this via some examples.
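Let us now take $Q$ to be the class of all normal distributions $N(m, \Sigma)$, where $m$ ranges over $\mathbb{R}^p$ and $\Sigma$ ranges over the class of all $p \times p$ positive definite matrices. The entropy of the multivariate normal distribution $N(m, \Sigma)$ (see Multivariate normal distribution) is:
$$\frac{p}{2}\left(1 + \log(2\pi)\right) + \frac{1}{2} \log \det \Sigma.$$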
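As a small illustration, the entropy of $N(m, \Sigma)$, namely $\frac{p}{2}(1 + \log 2\pi) + \frac{1}{2}\log\det\Sigma$, can be computed with NumPy. This is a sketch under stated assumptions: the function name `mvn_entropy` and the example covariance matrix are hypothetical, and the mean $m$ is omitted because the entropy does not depend on it.

```python
import numpy as np

def mvn_entropy(Sigma):
    """Entropy of N(m, Sigma): (p/2) * (1 + log(2*pi)) + (1/2) * log det(Sigma).

    The mean m does not enter. slogdet avoids overflow/underflow in the determinant.
    """
    Sigma = np.asarray(Sigma, dtype=float)
    p = Sigma.shape[0]
    sign, logdet = np.linalg.slogdet(Sigma)
    # A positive determinant is necessary (not sufficient) for positive
    # definiteness; this is only a cheap sanity check.
    if sign <= 0:
        raise ValueError("Sigma must be positive definite")
    return 0.5 * p * (1.0 + np.log(2.0 * np.pi)) + 0.5 * logdet

Sigma = np.array([[2.0, 0.3], [0.3, 0.5]])  # hypothetical covariance matrix
h = mvn_entropy(Sigma)
print(h)
```

For a diagonal $\Sigma$ the result reduces to the sum of the univariate entropies $\frac{1}{2}\log(2\pi e\,\sigma_i^2)$, which is a convenient sanity check.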
The resulting ELBO can be maximized over $m$ and $\Sigma$ to obtain the posterior approximation $N(\hat{m}, \hat{\Sigma})$. Here $m$ can be any vector in $\mathbb{R}^p$, but $\Sigma$ is constrained to be positive definite. Constrained optimization is usually difficult for gradient-based methods, so we convert this to an unconstrained problem by taking
$$\Sigma = L L^T,$$
where $L$ is a $p \times p$ lower-triangular matrix with non-zero diagonal entries. We can also assume, without loss of generality, that $L$ has strictly positive diagonal entries: given any $L$, we can replace it by $LD$, where $D$ is the diagonal matrix whose $i$-th diagonal entry is $+1$ if the $i$-th diagonal entry of $L$ is positive and $-1$ otherwise; this leaves $LL^T$ unchanged because $DD^T = I$. In code, we can represent the diagonal entries of $L$ by $\exp(a_1), \dots, \exp(a_p)$. With this parametrization, we have
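As a concrete illustration of this parametrization, the sketch below builds a lower-triangular $L$ from unconstrained numbers, with the diagonal stored on the log scale, and checks two consequences: $LL^T$ is positive definite, and $\log\det(LL^T) = 2\sum_i a_i$. The function name `build_L`, the row-by-row layout of the off-diagonal entries, and the random inputs are assumptions made for this example.

```python
import numpy as np

def build_L(a, off_diag, p):
    """Map unconstrained parameters to a lower-triangular L with positive diagonal.

    a        : length-p vector; the diagonal of L is exp(a).
    off_diag : length p*(p-1)/2 vector filling the strictly lower triangle row by row.
    """
    L = np.zeros((p, p))
    L[np.tril_indices(p, k=-1)] = off_diag
    L[np.diag_indices(p)] = np.exp(a)
    return L

p = 3
rng = np.random.default_rng(0)
a = rng.normal(size=p)                         # unconstrained log-diagonal
off_diag = rng.normal(size=p * (p - 1) // 2)   # unconstrained off-diagonal entries

L = build_L(a, off_diag, p)
Sigma = L @ L.T
eigvals = np.linalg.eigvalsh(Sigma)            # all positive if Sigma is positive definite
logdet = np.linalg.slogdet(Sigma)[1]
print(eigvals.min() > 0, np.isclose(logdet, 2 * a.sum()))
```

Because every real value of `a` and `off_diag` yields a valid covariance matrix, a gradient-based optimizer can update these parameters freely without a projection step.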
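In order to maximize the above function, we need to make the term $\mathbb{E}_{\beta \sim N(m, LL^T)}\, \ell(\beta)$ more explicit. For this, we use the reparametrization trick (see Reparameterization trick) and write
$$\beta = m + L z, \qquad z \sim N(0, I_p),$$
so that $\mathbb{E}_{\beta \sim N(m, LL^T)}\, \ell(\beta) = \mathbb{E}_{z \sim N(0, I_p)}\, \ell(m + L z)$.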
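The reparametrization $\beta = m + Lz$ with $z \sim N(0, I_p)$ can be sketched as a Monte Carlo estimator. Since $\ell$ is not specified here, the example assumes a hypothetical test function $\ell(\beta) = -\frac{1}{2}\|\beta\|^2$, chosen only because its expectation under $N(m, LL^T)$ is known exactly: $-\frac{1}{2}(\|m\|^2 + \operatorname{tr}(LL^T))$. The function name `reparam_expectation` and the values of `m` and `L` are likewise illustrative.

```python
import numpy as np

def reparam_expectation(ell, m, L, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E[ell(beta)] for beta ~ N(m, L L^T).

    Draws z ~ N(0, I) and sets beta = m + L z, so the source of randomness
    does not depend on the parameters (m, L).
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_samples, m.shape[0]))
    beta = m + z @ L.T  # row i is m + L z_i
    return np.mean(ell(beta))

def ell(beta):
    """Hypothetical test function, applied row-wise: -0.5 * ||beta||^2."""
    return -0.5 * np.sum(beta ** 2, axis=1)

m = np.array([1.0, -0.5])
L = np.array([[1.0, 0.0], [0.4, 0.7]])

estimate = reparam_expectation(ell, m, L)
exact = -0.5 * (m @ m + np.trace(L @ L.T))
print(estimate, exact)
```

The estimate agrees with the exact value up to Monte Carlo error. Because the draws `z` are independent of `m` and `L`, the same construction can be differentiated with respect to those parameters in an automatic-differentiation framework.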