Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

STAT 238 - Bayesian Statistics Lecture Thirty Six

Spring 2026, UC Berkeley

Variational Inference

Variational Inference is a method for approximating a posterior distribution by a simpler tractable family of distributions. It uses optimization to select a distribution qq that is closest to the true posterior in some metric. In practice, variational inference can be much faster than MCMC, though sometimes less accurate in representing posterior uncertainty.

The basic idea behind variational inference is very simple. Consider the standard Bayesian setup with data given by yy and parameters given by θ\theta. The prior is fθ(θ)f_{\theta}(\theta) and the likelihood is fyθ(y)f_{y \mid \theta}(y). The posterior is then:

fθy(θ)=fθ(θ)fyθ(y)fθ(θ)fyθ(y)dθ.\begin{align*} f_{\theta \mid y}(\theta) = \frac{f_{\theta}(\theta) f_{y \mid \theta}(y)}{\int f_{\theta}(\theta) f_{y \mid \theta}(y) d\theta}. \end{align*}

The denominator is the marginal density fy(y)f_{y}(y) of yy. The marginal density fy(y)f_{y}(y)is also known as the Evidence. Using the marginal density for the denominator, we get

fθy(θ)=fθ(θ)fyθ(y)fy(y).\begin{align*} f_{\theta \mid y}(\theta) = \frac{f_{\theta}(\theta) f_{y \mid \theta}(y)}{f_{y}(y)}. \end{align*}

The posterior is usually intractable because of the integration involved in the computation of the Evidence (denominator). In variational inference, one chooses a simpler family of distributions Q\Qcal and then employs optimization techniques to choose a distribution in Q\Qcal that is as close as possible to the posterior fθy(θ)f_{\theta \mid y}(\theta). The distance-measure used to measure closeness here is most commonly the following Kullback-Leibler divergence:

KL(qfθy)=q(θ)logq(θ)fθy(θ)dθ.\begin{align*} KL(q \| f_{\theta \mid y}) = \int q(\theta) \log \frac{q(\theta)}{f_{\theta \mid y}(\theta)} d\theta. \end{align*}

Recall that the KL divergence between two densities qq and pp is given by qlog(q/p)\int q \log (q/p). It is always nonnegative and it equals zero if and only if q=pq = p. It is not symmetric in general (i.e., KL(qp)KL(pq)KL(q \| p) \neq KL(p \| q)).

The optimization that we need to solve therefore is:

argminqQ KL(qfθy).\begin{align} \underset{q \in \Qcal}{\text{argmin}}~ KL(q \| f_{\theta \mid y}). \end{align}

The choice of this specific KL divergence is mostly for technical convenience as it makes the resulting optimization tractable. Observe that

KL(qfθy)=q(θ)logq(θ)fθy(θ)dθ=q(θ)logq(θ)fy,θ(y,θ)dθ+logfy(y).\begin{align} KL(q \| f_{\theta \mid y}) = \int q(\theta) \log \frac{q(\theta)}{f_{\theta \mid y}(\theta)} d\theta = \int q(\theta) \log \frac{q(\theta)}{f_{y, \theta}(y, \theta)} d\theta + \log f_{y}(y). \end{align}

The second term above (which involves the usually intractable fy(y)f_y(y)) does not depend on qq so the optimization (4) is equivalent to:

argminqQ q(θ)logq(θ)fy,θ(y,θ)dθ.\begin{align*} \underset{q \in \Qcal}{\text{argmin}}~ \int q(\theta) \log \frac{q(\theta)}{f_{y, \theta}(y, \theta)} d\theta. \end{align*}

The above objective function is called the Variational Free Energy, or simply, Free Energy F(q)F(q):

F(q)=q(θ)logq(θ)fy,θ(y,θ)dθ=q(θ)logfy,θ(y,θ)q(θ)dθ.\begin{align} F(q) = \int q(\theta) \log \frac{q(\theta)}{f_{y, \theta}(y, \theta)} d\theta = -\int q(\theta) \log \frac{f_{y, \theta}(y, \theta)}{q(\theta)} d\theta. \end{align}

So the main goal in Variational Inference is to minimize the Free Energy. The negative of the Free Energy is called the Evidence Lower Bound (ELBO):

ELBO(q)=F(q)=q(θ)logfy,θ(y,θ)q(θ)dθ.\begin{align} \text{ELBO}(q) = -F(q) = \int q(\theta) \log \frac{f_{y, \theta}(y, \theta)}{q(\theta)} d\theta. \end{align}

So the main goal in Variational Inference can also be said to be the maximization of the ELBO. The name ELBO comes from the fact that ELBO(q)\text{ELBO}(q) always satisfies:

ELBO(q)=q(θ)logfy,θ(y,θ)q(θ)dθlogfy(y).\begin{align} \text{ELBO}(q) = \int q(\theta) \log \frac{f_{y, \theta}(y, \theta)}{q(\theta)} d\theta \leq \log f_y(y). \end{align}

This is because of the expression (5) which says that the difference between logfy(y)\log f_{y}(y) and ELBO(q)\text{ELBO}(q) equals the Kullback Leibler divergence KL(qfθy)KL(q \| f_{\theta \mid y}) which is nonnegative. Because fy(y)f_y(y) is called the evidence and ELBO(q)\text{ELBO}(q) gives a lower bound for the logarithm of the evidence, it is given the name Evidence Lower Bound. It is also easy to see that if we maximize ELBO(q)\text{ELBO}(q) over all probability densities qq, then we obtain equality in (9):

supqELBO(q)=logfy(y).\begin{align} \sup_q \text{ELBO}(q) = \log f_{y}(y). \end{align}

The above follows from (5) because

ELBO(q)=logfy(y)KL(qfθy).\begin{align*} \text{ELBO}(q) = \log f_{y}(y) - KL(q \| f_{\theta \mid y}). \end{align*}

Because fy,θ(y,θ)=fθ(θ)fyθ(y)f_{y, \theta}(y, \theta) = f_{\theta}(\theta) f_{y \mid \theta}(y), the ELBO has the following alternative expression:

ELBO(q)=F(q)=q(θ)logfy,θ(y,θ)q(θ)dθ=q(θ)logfyθ(y)fθ(θ)q(θ)dθ=q(θ)logfyθ(y)dθq(θ)logq(θ)fθ(θ)dθ=q(θ)logfyθ(y)dθKL(qfθ)\begin{align} \text{ELBO}(q) &= -F(q) \nonumber \\ &= \int q(\theta) \log \frac{f_{y, \theta}(y, \theta)}{q(\theta)} d\theta \nonumber \\ &= \int q(\theta) \log \frac{f_{y\mid \theta}(y) f_{\theta}(\theta)}{q(\theta)} d\theta \nonumber \\ &= \int q(\theta) \log f_{y \mid \theta}(y) d\theta - \int q(\theta) \log \frac{q(\theta)}{f_{\theta}(\theta)} d\theta \nonumber \\ &= \int q(\theta) \log f_{y \mid \theta}(y) d\theta - KL(q \| f_{\theta}) \end{align}

In other words, ELBO(q)\text{ELBO}(q) represents the average likelihood (with respect to qq) minus the KL divergence between qq and the prior.

Usually, the main tasks in Variational Inference are (a) to choose a class of distributions Q\Qcal, and (b) to maximize ELBO(q)\text{ELBO}(q) (or, equivalently, to minimize F(q)F(q)) over qQq \in \Qcal. We will see how to do this via some examples.

Bayesian Logistic Regression

Consider again the logistic regression setting where we observe data (xi,yi),i=1,,n(x_i, y_i), i = 1, \dots, n with xiRdx_i \in \R^d and yi{0,1}y_i \in \{0, 1\}. The likelihood model is:

yixi,βBernoulli(exiTβ1+exiTβ)\begin{align*} y_i \mid x_i, \beta \sim \text{Bernoulli}\left(\frac{e^{x_i^T \beta}}{1 + e^{x_i^T \beta}} \right) \end{align*}

and we take the improper uniform prior for β\beta. We then have:

fy,β(y,β)=i=1n(exiTβ1+exiTβ)yi(1exiTβ1+exiTβ)1yi=i=1nexp(yixiTβ)1+exp(xiTβ)=exp((β))\begin{align*} f_{y, \beta}(y, \beta) &= \prod_{i=1}^n \left(\frac{e^{x_i^T \beta}}{1 + e^{x_i^T \beta}} \right)^{y_i} \left(1 - \frac{e^{x_i^T \beta}}{1 + e^{x_i^T \beta}} \right)^{1-y_i} \\ &= \prod_{i=1}^n \frac{\exp(y_i x_i^T \beta)}{1 + \exp(x_i^T \beta)} = \exp(\ell(\beta)) \end{align*}

where (β)\ell(\beta) is the log-likelihood is

(β)=i=1n(yixiTβlog(1+exp(xiTβ))).\begin{align} \ell(\beta) = \sum_{i=1}^n \left(y_i x_i^T \beta - \log \left(1 + \exp(x_i^T \beta) \right) \right). \end{align}

Note that, because the prior is one, fy,β(y,β)f_{y, \beta}(y, \beta) equals the likelihood fyβ(y)f_{y \mid \beta}(y). We can thus calculate the ELBO as:

ELBO(q)=q(β)logfy,β(y,β)q(β)dβ=q(β)(β)dβq(β)logq(β)dβ.\begin{align*} \text{ELBO}(q) &= \int q(\beta) \log \frac{f_{y, \beta}(y, \beta)}{q(\beta)} d\beta \\ &= \int q(\beta) \ell(\beta) d\beta - \int q(\beta) \log q(\beta) d\beta. \end{align*}

The quantity qlogq-\int q \log q is known as the entropy H(q)H(q) of qq. We thus have

ELBO(q)=q(β)(β)dβH(q).\begin{align*} \text{ELBO}(q) = \int q(\beta) \ell(\beta) d\beta - H(q). \end{align*}

Let us now take Q\Qcal to be the class of all normal distributions N(m,Σ)N(m, \Sigma) as mRpm \in \R^p and Σ\Sigma varies over the class of all p×pp \times p positive definite matrices Σ\Sigma. The entropy of the multivariate normal distribution (see Multivariate normal distribution) is:

H(N(m,Σ))=p2(1+log(2π))+12logΣ\begin{align*} H(N(m, \Sigma)) = \frac{p}{2} \left(1 + \log (2 \pi) \right) + \frac{1}{2} \log |\Sigma| \end{align*}

where Σ|\Sigma| is the determinant of Σ\Sigma. We thus have:

ELBO(m,Σ)=EβN(m,Σ)(β)+12logΣ+p2(1+log(2π)).\begin{align*} \text{ELBO}(m, \Sigma) = \E_{\beta \sim N(m, \Sigma)} \ell(\beta) + \frac{1}{2} \log |\Sigma| + \frac{p}{2} \left(1 + \log(2 \pi) \right). \end{align*}

This can be maximized over mm and Σ\Sigma to obtain the posterior approximation N(m^,Σ^)N(\hat{m}, \hat{\Sigma}). Here mm can be any vector over Rp\R^p but Σ\Sigma is constrained to be positive definite. Constrained optimization is usually difficult (for gradient-based methods) so we convert this to unconstrained optimization by taking

Σ=LLT\begin{align*} \Sigma = L L^T \end{align*}

where LL is a p×pp \times p lower-triangular matrix with non-zero diagonal entries. We can also, without loss of generality, assume that LL has strictly positive diagonal entries (given any LL, we can replace it by LDL D where DD is a diagonal matrix with entries +1 if the corresponding entry of LL is positive and -1 otherwise). In code, we can represent the diagonal entries of LL by exp(a1),,exp(ap)\exp(a_1), \dots, \exp(a_p). With this parametrization, we have

Σ=LLT=L2=j=1pLjj2,\begin{align*} |\Sigma| = |L L^T| = |L|^2 = \prod_{j=1}^p L_{jj}^2, \end{align*}

and thus

ELBO(m,L)=EβN(m,LLT)(β)+j=1plogLjj+p2(1+log(2π)).\begin{align*} \text{ELBO}(m, L) = \E_{\beta \sim N(m, L L^T)} \ell(\beta) + \sum_{j=1}^p \log L_{jj} + \frac{p}{2} \left(1 + \log(2 \pi) \right). \end{align*}

In order to maximize the above function, we need to make the term EβN(m,LLT)(β)\E_{\beta \sim N(m, L L^T)} \ell(\beta) more explicit. For this, we use the reparametrization trick (see Reparameterization trick) and write

β=m+Lz   where zN(0,Ip).\begin{align*} \beta = m + L z ~~ \text{ where $z \sim N(0, I_p)$}. \end{align*}

This gives

ELBO(m,L)=EzN(0,Ip)(m+Lz)+j=1plogLjj+p2(1+log(2π)).\begin{align*} \text{ELBO}(m, L) = \E_{z \sim N(0, I_p)} \ell(m + L z) + \sum_{j=1}^p \log L_{jj} + \frac{p}{2} \left(1 + \log(2 \pi) \right). \end{align*}

The expectation above can be approximated by Monte Carlo. Specifically generate

z(1),,z(S)i.i.dN(0,Ip)\begin{align*} z^{(1)}, \dots, z^{(S)} \overset{\text{i.i.d}}{\sim} N(0, I_p) \end{align*}

and use the approximation:

ELBO(m,L)1Ss=1S(m+Lz(s))+j=1plogLjj+p2(1+log(2π)).\begin{align*} \text{ELBO}(m, L) \approx \frac{1}{S} \sum_{s=1}^S \ell(m + L z^{(s)}) + \sum_{j=1}^p \log L_{jj} + \frac{p}{2} \left(1 + \log(2 \pi) \right). \end{align*}

This can be maximized using standard gradient-based optimization software such as PyTorch.