STAT 238 - Bayesian Statistics Lecture Thirty Six

Spring 2026, UC Berkeley

Variational Inference

Variational inference is a method for approximating a posterior distribution by a member of a simpler, tractable family of distributions. It uses optimization to select the distribution $q$ in that family that is closest to the true posterior under a chosen measure of discrepancy. In practice, variational inference can be much faster than MCMC, though it is sometimes less accurate in representing posterior uncertainty.

The basic idea behind variational inference is very simple. Consider the standard Bayesian setup with data given by $y$ and parameters given by $\theta$ . The prior is $f_{\theta}(\theta)$ and the likelihood is $f_{y \mid \theta}(y)$ . The posterior is then:

\begin{align*} f_{\theta \mid y}(\theta) = \frac{f_{\theta}(\theta) f_{y \mid \theta}(y)}{\int f_{\theta}(\theta) f_{y \mid \theta}(y) d\theta}. \end{align*}

(1)

The denominator is the marginal density $f_{y}(y)$ of $y$ . The marginal density $f_{y}(y)$ is also known as the Evidence. Using the marginal density for the denominator, we get

\begin{align*} f_{\theta \mid y}(\theta) = \frac{f_{\theta}(\theta) f_{y \mid \theta}(y)}{f_{y}(y)}. \end{align*}

(2)

The posterior is usually intractable because of the integral needed to compute the Evidence (the denominator). In variational inference, one chooses a simpler family of distributions $\Qcal$ and then uses optimization to pick the distribution in $\Qcal$ that is as close as possible to the posterior $f_{\theta \mid y}(\theta)$ . Closeness is most commonly measured by the following Kullback-Leibler divergence:

\begin{align*} KL(q \| f_{\theta \mid y}) = \int q(\theta) \log \frac{q(\theta)}{f_{\theta \mid y}(\theta)} d\theta. \end{align*}

(3)

Recall that the KL divergence between two densities $q$ and $p$ is given by $\int q \log (q/p)$ . It is always nonnegative and it equals zero if and only if $q = p$ . It is not symmetric in general (i.e., $KL(q \| p) \neq KL(p \| q)$ ).
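The properties just recalled can be checked on a small example. The sketch below uses the standard closed form for the KL divergence between two univariate normal densities; the helper name `kl_normal` and the particular means and standard deviations are illustrative choices, not part of the lecture.

```python
import math

def kl_normal(mu_q, sd_q, mu_p, sd_p):
    """KL(q || p) for q = N(mu_q, sd_q^2) and p = N(mu_p, sd_p^2)."""
    return (math.log(sd_p / sd_q)
            + (sd_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sd_p ** 2)
            - 0.5)

# q = N(0, 1) and p = N(1, 4): the two orderings give different values.
kl_qp = kl_normal(0.0, 1.0, 1.0, 2.0)   # about 0.443
kl_pq = kl_normal(1.0, 2.0, 0.0, 1.0)   # about 1.307
kl_qq = kl_normal(0.0, 1.0, 0.0, 1.0)   # zero, since both densities coincide

print(kl_qp, kl_pq, kl_qq)
```

Both divergences are positive, they differ from each other, and the divergence of a density from itself is zero.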

The optimization that we need to solve therefore is:

\begin{align} \underset{q \in \Qcal}{\text{argmin}}~ KL(q \| f_{\theta \mid y}). \end{align}

(4)

The choice of this specific KL divergence is mostly for technical convenience as it makes the resulting optimization tractable. Observe that

\begin{align} KL(q \| f_{\theta \mid y}) = \int q(\theta) \log \frac{q(\theta)}{f_{\theta \mid y}(\theta)} d\theta = \int q(\theta) \log \frac{q(\theta)}{f_{y, \theta}(y, \theta)} d\theta + \log f_{y}(y). \end{align}

(5)
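The second equality in (5) can be spelled out as follows. It uses only the fact that the posterior equals the joint density $f_{y, \theta}(y, \theta) = f_{\theta}(\theta) f_{y \mid \theta}(y)$ divided by the evidence, as in (2), and that $q$ integrates to one:

\begin{align*} \int q(\theta) \log \frac{q(\theta)}{f_{\theta \mid y}(\theta)} d\theta &= \int q(\theta) \log \frac{q(\theta) f_{y}(y)}{f_{y, \theta}(y, \theta)} d\theta \\ &= \int q(\theta) \log \frac{q(\theta)}{f_{y, \theta}(y, \theta)} d\theta + \log f_{y}(y) \int q(\theta) d\theta \\ &= \int q(\theta) \log \frac{q(\theta)}{f_{y, \theta}(y, \theta)} d\theta + \log f_{y}(y). \end{align*}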

The second term in (5) (which involves the usually intractable $f_y(y)$ ) does not depend on $q$ , so the optimization (4) is equivalent to:

\begin{align*} \underset{q \in \Qcal}{\text{argmin}}~ \int q(\theta) \log \frac{q(\theta)}{f_{y, \theta}(y, \theta)} d\theta. \end{align*}

(6)

The above objective function is called the Variational Free Energy, or simply, Free Energy $F(q)$ :

\begin{align} F(q) = \int q(\theta) \log \frac{q(\theta)}{f_{y, \theta}(y, \theta)} d\theta = -\int q(\theta) \log \frac{f_{y, \theta}(y, \theta)}{q(\theta)} d\theta. \end{align}

(7)

So the main goal in Variational Inference is to minimize the Free Energy. The negative of the Free Energy is called the Evidence Lower Bound (ELBO):

\begin{align} \text{ELBO}(q) = -F(q) = \int q(\theta) \log \frac{f_{y, \theta}(y, \theta)}{q(\theta)} d\theta. \end{align}

(8)

So the main goal in Variational Inference can also be said to be the maximization of the ELBO. The name ELBO comes from the fact that $\text{ELBO}(q)$ always satisfies:

\begin{align} \text{ELBO}(q) = \int q(\theta) \log \frac{f_{y, \theta}(y, \theta)}{q(\theta)} d\theta \leq \log f_y(y). \end{align}

(9)

This is because of the expression (5), which says that the difference between $\log f_{y}(y)$ and $\text{ELBO}(q)$ equals the Kullback-Leibler divergence $KL(q \| f_{\theta \mid y})$ , which is nonnegative. Because $f_y(y)$ is called the evidence and $\text{ELBO}(q)$ is a lower bound for the logarithm of the evidence, it is given the name Evidence Lower Bound. Moreover, if we maximize $\text{ELBO}(q)$ over all probability densities $q$ , the maximum is attained at $q = f_{\theta \mid y}$ (where the KL divergence is zero), and we obtain equality in (9):

\begin{align} \sup_q \text{ELBO}(q) = \log f_{y}(y). \end{align}

(10)

The above follows from (5) because

\begin{align*} \text{ELBO}(q) = \log f_{y}(y) - KL(q \| f_{\theta \mid y}). \end{align*}

(11)
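The identity (11) and the bound (9) can be verified numerically in a toy setting. The sketch below assumes a parameter that takes only three values, so that every integral becomes a finite sum; the prior, the likelihood values, and the trial distribution `q` are made up for illustration.

```python
import math

# Hypothetical discrete model: theta takes three values.
prior = [0.5, 0.3, 0.2]            # f_theta(theta)
likelihood = [0.10, 0.40, 0.70]    # f_{y|theta}(y) at the observed y

joint = [p * l for p, l in zip(prior, likelihood)]   # f_{y,theta}(y, theta)
evidence = sum(joint)                                # f_y(y)
posterior = [j / evidence for j in joint]            # f_{theta|y}(theta)

def elbo(q):
    """ELBO(q) = sum_theta q(theta) log( joint(theta) / q(theta) )."""
    return sum(qi * math.log(ji / qi) for qi, ji in zip(q, joint) if qi > 0)

def kl(q, p):
    """KL(q || p) = sum_theta q(theta) log( q(theta) / p(theta) )."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.2, 0.5, 0.3]                       # an arbitrary trial distribution
gap = math.log(evidence) - elbo(q)        # should equal KL(q || posterior)

print(gap, kl(q, posterior))
print(elbo(posterior), math.log(evidence))  # equal when q is the posterior
```

The gap between the log evidence and the ELBO matches the KL divergence to the posterior, and it closes when `q` is the posterior itself.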

Because $f_{y, \theta}(y, \theta) = f_{\theta}(\theta) f_{y \mid \theta}(y)$ , the ELBO has the following alternative expression:

\begin{align} \text{ELBO}(q) &= -F(q) \nonumber \\ &= \int q(\theta) \log \frac{f_{y, \theta}(y, \theta)}{q(\theta)} d\theta \nonumber \\ &= \int q(\theta) \log \frac{f_{y\mid \theta}(y) f_{\theta}(\theta)}{q(\theta)} d\theta \nonumber \\ &= \int q(\theta) \log f_{y \mid \theta}(y) d\theta - \int q(\theta) \log \frac{q(\theta)}{f_{\theta}(\theta)} d\theta \nonumber \\ &= \int q(\theta) \log f_{y \mid \theta}(y) d\theta - KL(q \| f_{\theta}) \end{align}

(12)

In other words, $\text{ELBO}(q)$ is the average log-likelihood (with respect to $q$ ) minus the KL divergence between $q$ and the prior.
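As a worked instance of (12), consider a model that is not from the lecture and is assumed here only because every term has a closed form: a single observation $y \mid \theta \sim N(\theta, 1)$ with prior $\theta \sim N(0, 1)$ and $q = N(m, s^2)$. The posterior is then $N(y/2, 1/2)$ and the marginal of $y$ is $N(0, 2)$. The sketch evaluates the two terms of (12) and checks that the ELBO reaches the log evidence at the posterior.

```python
import math

def elbo_normal(m, s, y):
    """ELBO of q = N(m, s^2) for y | theta ~ N(theta, 1), theta ~ N(0, 1)."""
    # E_q[log f(y | theta)], using E_q[(y - theta)^2] = (y - m)^2 + s^2.
    expected_loglik = -0.5 * math.log(2 * math.pi) - 0.5 * ((y - m) ** 2 + s ** 2)
    # KL(q || prior) between N(m, s^2) and N(0, 1).
    kl_to_prior = -math.log(s) + 0.5 * (s ** 2 + m ** 2) - 0.5
    return expected_loglik - kl_to_prior

y = 1.3                                                    # hypothetical observation
log_evidence = -0.5 * math.log(4 * math.pi) - y ** 2 / 4   # log density of N(0, 2) at y

elbo_at_posterior = elbo_normal(y / 2, math.sqrt(0.5), y)  # q equal to the posterior
elbo_elsewhere = elbo_normal(0.0, 1.0, y)                  # q equal to the prior

print(elbo_at_posterior, log_evidence, elbo_elsewhere)
```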

Usually, the main tasks in Variational Inference are (a) to choose a class of distributions $\Qcal$ , and (b) to maximize $\text{ELBO}(q)$ (or, equivalently, to minimize $F(q)$ ) over $q \in \Qcal$ . We will see how to do this via some examples.

Bayesian Logistic Regression

Consider again the logistic regression setting where we observe data $(x_i, y_i), i = 1, \dots, n$ with $x_i \in \R^p$ and $y_i \in \{0, 1\}$ . The likelihood model is:

\begin{align*} y_i \mid x_i, \beta \sim \text{Bernoulli}\left(\frac{e^{x_i^T \beta}}{1 + e^{x_i^T \beta}} \right) \end{align*}

(13)

and we take the improper uniform prior for $\beta$ . We then have:

\begin{align*} f_{y, \beta}(y, \beta) &= \prod_{i=1}^n \left(\frac{e^{x_i^T \beta}}{1 + e^{x_i^T \beta}} \right)^{y_i} \left(1 - \frac{e^{x_i^T \beta}}{1 + e^{x_i^T \beta}} \right)^{1-y_i} \\ &= \prod_{i=1}^n \frac{\exp(y_i x_i^T \beta)}{1 + \exp(x_i^T \beta)} = \exp(\ell(\beta)) \end{align*}

(14)

where $\ell(\beta)$ is the log-likelihood:

\begin{align} \ell(\beta) = \sum_{i=1}^n \left(y_i x_i^T \beta - \log \left(1 + \exp(x_i^T \beta) \right) \right). \end{align}

(15)
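The log-likelihood (15) is straightforward to code. The sketch below assumes NumPy and uses `np.logaddexp(0, eta)` for $\log(1 + e^{\eta})$ , which avoids overflow when $x_i^T \beta$ is large; the small design matrix and responses are made up for illustration.

```python
import numpy as np

def log_likelihood(beta, X, y):
    """ell(beta) = sum_i [ y_i x_i^T beta - log(1 + exp(x_i^T beta)) ]."""
    eta = X @ beta                                   # linear predictors x_i^T beta
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))

# Hypothetical data: n = 3 observations, p = 2 covariates (first is an intercept).
X = np.array([[1.0, 0.5],
              [1.0, -1.2],
              [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])

ell_zero = log_likelihood(np.zeros(2), X, y)    # equals -3 log 2 at beta = 0
ell_big = log_likelihood(np.array([0.0, 1000.0]), X, y)   # finite, no overflow
print(ell_zero, ell_big)
```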

Note that, because the prior is one, $f_{y, \beta}(y, \beta)$ equals the likelihood $f_{y \mid \beta}(y)$ . We can thus calculate the ELBO as:

\begin{align*} \text{ELBO}(q) &= \int q(\beta) \log \frac{f_{y, \beta}(y, \beta)}{q(\beta)} d\beta \\ &= \int q(\beta) \ell(\beta) d\beta - \int q(\beta) \log q(\beta) d\beta. \end{align*}

(16)

The quantity $-\int q \log q$ is known as the entropy $H(q)$ of $q$ . We thus have

\begin{align*} \text{ELBO}(q) = \int q(\beta) \ell(\beta) d\beta + H(q). \end{align*}

(17)

Let us now take $\Qcal$ to be the class of all normal distributions $N(m, \Sigma)$ , where $m$ ranges over $\R^p$ and $\Sigma$ ranges over all $p \times p$ positive definite matrices. The entropy of the multivariate normal distribution (see Multivariate normal distribution) is:

\begin{align*} H(N(m, \Sigma)) = \frac{p}{2} \left(1 + \log (2 \pi) \right) + \frac{1}{2} \log |\Sigma| \end{align*}

(18)

where $|\Sigma|$ is the determinant of $\Sigma$ . We thus have:

\begin{align*} \text{ELBO}(m, \Sigma) = \E_{\beta \sim N(m, \Sigma)} \ell(\beta) + \frac{1}{2} \log |\Sigma| + \frac{p}{2} \left(1 + \log(2 \pi) \right). \end{align*}

(19)
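The entropy term (18) that enters this ELBO can be evaluated as in the following sketch. It assumes NumPy and uses `np.linalg.slogdet` so that $\log |\Sigma|$ is computed without forming a possibly tiny or huge determinant; the function name `normal_entropy` is an illustrative choice.

```python
import numpy as np

def normal_entropy(Sigma):
    """Entropy of N(m, Sigma): (p/2)(1 + log(2 pi)) + (1/2) log|Sigma|."""
    p = Sigma.shape[0]
    sign, logdet = np.linalg.slogdet(Sigma)
    if sign <= 0:
        raise ValueError("Sigma must have a positive determinant")
    return 0.5 * p * (1.0 + np.log(2.0 * np.pi)) + 0.5 * logdet

h_identity = normal_entropy(np.eye(3))           # (3/2)(1 + log(2 pi))
h_scaled = normal_entropy(4.0 * np.eye(3))       # larger: adds (3/2) log 4
print(h_identity, h_scaled)
```

The entropy does not depend on $m$ , which is why only $\Sigma$ appears in the last two terms of the ELBO.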

The ELBO in (19) can be maximized over $m$ and $\Sigma$ to obtain the posterior approximation $N(\hat{m}, \hat{\Sigma})$ . Here $m$ can be any vector in $\R^p$ , but $\Sigma$ is constrained to be positive definite. Constrained optimization is awkward for gradient-based methods, so we convert the problem to an unconstrained one by taking

\begin{align*} \Sigma = L L^T \end{align*}

(20)

where $L$ is a $p \times p$ lower-triangular matrix with non-zero diagonal entries (every positive definite $\Sigma$ admits such a Cholesky factorization). We can also, without loss of generality, assume that $L$ has strictly positive diagonal entries: given any $L$ , we can replace it by $L D$ , where $D$ is a diagonal matrix whose $j$-th entry is $+1$ if $L_{jj}$ is positive and $-1$ otherwise; since $D D^T = I_p$ , this leaves $L L^T$ unchanged. In code, we can represent the diagonal entries of $L$ by $\exp(a_1), \dots, \exp(a_p)$ with unconstrained real numbers $a_1, \dots, a_p$ , while the entries below the diagonal are left free. With this parametrization, we have

\begin{align*} |\Sigma| = |L L^T| = |L|^2 = \prod_{j=1}^p L_{jj}^2, \end{align*}

(21)

and thus

\begin{align*} \text{ELBO}(m, L) = \E_{\beta \sim N(m, L L^T)} \ell(\beta) + \sum_{j=1}^p \log L_{jj} + \frac{p}{2} \left(1 + \log(2 \pi) \right). \end{align*}

(22)
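The parametrization $\Sigma = L L^T$ with diagonal entries $\exp(a_j)$ can be sketched as follows. NumPy is assumed, and the helper `build_L` together with the way the free parameters are stored (one vector `a` for the log-diagonal, one vector `off_diag` for the entries below the diagonal) is one possible layout rather than a prescribed one.

```python
import numpy as np

def build_L(a, off_diag, p):
    """Lower-triangular L with diagonal exp(a) and free entries below it."""
    L = np.zeros((p, p))
    L[np.tril_indices(p, k=-1)] = off_diag    # p(p-1)/2 unconstrained entries
    L[np.diag_indices(p)] = np.exp(a)         # strictly positive diagonal
    return L

p = 3
a = np.array([0.1, -0.4, 0.7])                # any real numbers are allowed
off_diag = np.array([0.5, -1.0, 0.3])
L = build_L(a, off_diag, p)
Sigma = L @ L.T

# log|Sigma| = 2 * sum_j log L_jj = 2 * sum_j a_j, as in (21).
sign, logdet = np.linalg.slogdet(Sigma)
print(logdet, 2.0 * a.sum())
print(np.linalg.eigvalsh(Sigma))              # all positive
```

Whatever real values the free parameters take, the resulting $\Sigma$ is positive definite, so a gradient step never leaves the feasible set.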

In order to maximize the ELBO in (22), we need to make the term $\E_{\beta \sim N(m, L L^T)} \ell(\beta)$ more explicit. For this, we use the reparametrization trick (see Reparameterization trick) and write

\begin{align*} \beta = m + L z ~~ \text{ where $z \sim N(0, I_p)$}. \end{align*}

(23)

This gives

\begin{align*} \text{ELBO}(m, L) = \E_{z \sim N(0, I_p)} \ell(m + L z) + \sum_{j=1}^p \log L_{jj} + \frac{p}{2} \left(1 + \log(2 \pi) \right). \end{align*}

(24)
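The reparametrization (23) can be checked by simulation: draws of $m + L z$ with standard normal $z$ should have mean $m$ and covariance $L L^T$ . The sketch below assumes NumPy; the values of `m` and `L`, the seed, and the sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

m = np.array([1.0, -2.0])
L = np.array([[1.0, 0.0],
              [0.5, 0.8]])

S = 200_000
z = rng.standard_normal((S, 2))    # rows are z^{(s)} ~ N(0, I_2)
beta = m + z @ L.T                 # row s equals m + L z^{(s)}

emp_mean = beta.mean(axis=0)
emp_cov = np.cov(beta, rowvar=False)
print(emp_mean)                    # close to m
print(emp_cov)                     # close to L @ L.T
```

Because the randomness now sits in $z$ , whose distribution does not involve $m$ or $L$ , the expectation in (24) can be differentiated with respect to $m$ and $L$ by differentiating inside the expectation.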

The expectation in (24) can be approximated by Monte Carlo. Specifically, generate

\begin{align*} z^{(1)}, \dots, z^{(S)} \overset{\text{i.i.d}}{\sim} N(0, I_p) \end{align*}

(25)

and use the approximation:

\begin{align*} \text{ELBO}(m, L) \approx \frac{1}{S} \sum_{s=1}^S \ell(m + L z^{(s)}) + \sum_{j=1}^p \log L_{jj} + \frac{p}{2} \left(1 + \log(2 \pi) \right). \end{align*}

(26)
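A minimal sketch of the estimator (26), assuming NumPy: the function `elbo_mc` takes the draws $z^{(1)}, \dots, z^{(S)}$ as an argument so that the same draws can be reused when comparing parameter values. The simulated data, seed, and parameter values are illustrative only.

```python
import numpy as np

def elbo_mc(m, L, X, y, z):
    """Monte Carlo estimate (26) of ELBO(m, L); z has shape (S, p)."""
    p = m.shape[0]
    betas = m + z @ L.T                        # row s is m + L z^{(s)}
    eta = betas @ X.T                          # shape (S, n): x_i^T beta^{(s)}
    loglik = np.sum(eta * y - np.logaddexp(0.0, eta), axis=1)   # ell(beta^{(s)})
    entropy = np.sum(np.log(np.diag(L))) + 0.5 * p * (1.0 + np.log(2.0 * np.pi))
    return loglik.mean() + entropy

rng = np.random.default_rng(1)
n, p, S = 50, 2, 2000
X = rng.standard_normal((n, p))                # hypothetical covariates
y = (rng.random(n) < 0.5).astype(float)        # hypothetical binary responses
z = rng.standard_normal((S, p))

value = elbo_mc(np.zeros(p), 0.3 * np.eye(p), X, y, z)
print(value)
```

A useful sanity check is the case where all covariates are zero: then $\ell(\beta) = -n \log 2$ for every $\beta$ , so the estimate is exact regardless of the draws.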

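The approximation (26) can be maximized over $m$ and $L$ using standard gradient-based optimization software such as PyTorch, which computes the required gradients by automatic differentiation.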
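The whole procedure can be sketched end to end. To keep the example dependent on NumPy alone, the sketch below does not use PyTorch: it replaces automatic differentiation with hand-derived gradients of (26), using $\nabla \ell(\beta) = X^T (y - \sigma(X \beta))$ with $\sigma$ the logistic function, and runs plain stochastic gradient ascent with fresh draws of $z$ at each step. The simulated data, the step size, the number of iterations, and all function names are assumptions made for illustration.

```python
import numpy as np

def sigmoid(t):
    # Numerically stable logistic function 1 / (1 + exp(-t)).
    return np.exp(-np.logaddexp(0.0, -t))

def assemble_L(a, L_off):
    # Lower-triangular factor: free entries below the diagonal, exp(a) on it.
    return np.tril(L_off, k=-1) + np.diag(np.exp(a))

def elbo_mc(m, a, L_off, X, y, z):
    # Monte Carlo ELBO (26); note sum_j log L_jj = sum_j a_j.
    eta = (m + z @ assemble_L(a, L_off).T) @ X.T
    loglik = np.sum(eta * y - np.logaddexp(0.0, eta), axis=1)
    p = m.shape[0]
    return loglik.mean() + a.sum() + 0.5 * p * (1.0 + np.log(2.0 * np.pi))

def elbo_gradients(m, a, L_off, X, y, z):
    betas = m + z @ assemble_L(a, L_off).T
    g = (y - sigmoid(betas @ X.T)) @ X       # row s: gradient of ell at beta^{(s)}
    G = g.T @ z / z.shape[0]                 # average of g_s z_s^T
    grad_m = g.mean(axis=0)
    grad_a = np.diag(G) * np.exp(a) + 1.0    # chain rule through L_jj = exp(a_j)
    grad_off = np.tril(G, k=-1)              # only entries below the diagonal
    return grad_m, grad_a, grad_off

# Simulated data with a hypothetical true coefficient vector.
rng = np.random.default_rng(0)
n, p, S = 200, 2, 32
beta_true = np.array([1.0, -1.0])
X = rng.standard_normal((n, p))
y = (rng.random(n) < sigmoid(X @ beta_true)).astype(float)

m, a, L_off = np.zeros(p), np.zeros(p), np.zeros((p, p))
z_eval = rng.standard_normal((5000, p))      # fixed draws for before/after comparison
elbo_before = elbo_mc(m, a, L_off, X, y, z_eval)

step = 0.005
for _ in range(3000):
    z = rng.standard_normal((S, p))          # fresh draws at every iteration
    grad_m, grad_a, grad_off = elbo_gradients(m, a, L_off, X, y, z)
    m = m + step * grad_m
    a = a + step * grad_a
    L_off = L_off + step * grad_off

elbo_after = elbo_mc(m, a, L_off, X, y, z_eval)
m_hat = m
L_hat = assemble_L(a, L_off)
Sigma_hat = L_hat @ L_hat.T

print(elbo_before, elbo_after)
print(m_hat)
print(Sigma_hat)
```

With an automatic differentiation library, `elbo_gradients` would not be needed: one would code only the estimate (26) and let the library differentiate it with respect to `m`, `a`, and the off-diagonal entries. In practice an adaptive optimizer is often preferred to the fixed step size used here.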