Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

STAT 238 - Bayesian Statistics Lecture Thirty Seven

Spring 2026, UC Berkeley

Variational Inference

We looked at the basics of variational inference in the last lecture. Consider the standard Bayesian setup with data given by yy and parameters given by θ\theta. The prior is fθ(θ)f_{\theta}(\theta) and the likelihood is fyθ(y)f_{y \mid \theta}(y). The posterior is then:

fθy(θ)=fθ(θ)fyθ(y)fy(y)\begin{align*} f_{\theta \mid y}(\theta) = \frac{f_{\theta}(\theta) f_{y \mid \theta}(y)}{f_y(y)} \end{align*}

where the denominator is the evidence fy(y)=fyθ(y)fθ(θ)dθf_{y}(y) = \int f_{y \mid \theta}(y) f_{\theta}(\theta) d\theta.

The most important quantity in variational inference is the ELBO defined as:

ELBO(q):=q(θ)logfy,θ(y,θ)q(θ)dθ\begin{align} \text{ELBO}(q) := \int q(\theta) \log \frac{f_{y, \theta}(y, \theta)}{q(\theta)} d\theta \end{align}

where qq is an arbitrary probability density. The key facts about the ELBO are:

ELBO(q)logfy(y)   for every q.\begin{align} \text{ELBO}(q) \leq \log f_{y}(y) ~~ \text{ for every } q. \end{align}

and

maxqELBO(q)=logfy(y)   with the maximum achieved at q(θ)=fθy(θ).\begin{align} \max_{q} \text{ELBO}(q) = \log f_y(y) ~~ \text{ with the maximum achieved at } q(\theta) = f_{\theta \mid y}(\theta). \end{align}

Both these facts are consequences of:

ELBO(q)=logfy(y)KL(qfθy).\begin{align*} \text{ELBO}(q) = \log f_{y}(y) - KL\left(q \| f_{\theta \mid y} \right). \end{align*}

Another useful fact about the ELBO (which follows directly from the definition (8) when fy,θ(y,θ)f_{y, \theta}(y, \theta) is replaced by fyθ(y)fθ(θ)f_{y \mid \theta}(y) f_{\theta}(\theta)) is:

ELBO(q)=q(θ)logfyθ(y)dθKL(qfθ).\begin{align*} \text{ELBO}(q) = \int q(\theta) \log f_{y \mid \theta}(y) d\theta - KL(q \| f_{\theta}). \end{align*}

Connection to the EM Algorithm

The relation maxqELBO(q)=logfy(y)\max_q \text{ELBO}(q) = \log f_{y}(y) (with the maximum being achieved at the posterior q(θ)=fθy(θ)q(\theta) = f_{\theta \mid y}(\theta)) can be used to understand the EM algorithm.

Consider the same Bayesian setting as in the previous section, but suppose now that there is an additional hyperparameter α\alpha, and that both the likelihood fyθ,α(y)f_{y \mid \theta, \alpha}(y) as well as the prior fθα(θ)f_{\theta \mid \alpha}(\theta) depend on α\alpha.

The ELBO now becomes:

ELBO(q,α)=q(θ)logfy,θα(y,θ)q(θ)dθ=q(θ)logfyθ,α(y)fθα(θ)q(θ)dθ\begin{align*} \text{ELBO}(q, \alpha) = \int q(\theta) \log \frac{f_{y, \theta \mid \alpha}(y, \theta)}{q(\theta)} d\theta = \int q(\theta) \log \frac{f_{y \mid \theta, \alpha}(y) f_{\theta \mid \alpha}(\theta)}{q(\theta)} d\theta \end{align*}

and equation (4) now becomes:

maxqELBO(q,α)=logfyα(y)   with the maximum achieved at q(θ)=fθy,α(θ).\begin{align} \max_{q} \text{ELBO}(q, \alpha) = \log f_{y\mid \alpha}(y) ~~ \text{ with the maximum achieved at } q(\theta) = f_{\theta \mid y, \alpha}(\theta). \end{align}

Suppose now that we want to obtain the MLE α^\hat{\alpha} of α\alpha i.e., we want to solve the optimization:

Maximizeα logfyα(y).\begin{align*} \underset{\alpha}{\text{Maximize}}~ \log f_{y \mid \alpha}(y). \end{align*}

Using (8), we can rewrite the optimization above as:

Maximizeα maxqELBO(q,α),\begin{align*} \underset{\alpha}{\text{Maximize}}~ \max_{q} \text{ELBO}(q, \alpha), \end{align*}

which can be viewed as the two parameter maximization problem:

Maximizeq,α ELBO(q,α).\begin{align*} \underset{q, \alpha}{\text{Maximize}} ~ \text{ELBO}(q, \alpha). \end{align*}

It is natural to solve this two-parameter optimization alternatively. Fix α\alpha and optimize over qq, and then fix qq and optimize over α\alpha. This leads to the following algorithm.

  1. Initialize α=α(0)\alpha = \alpha^{(0)}.

  2. Repeat the following for t=0,1,2,t = 0, 1, 2, \dots (until some kind of convergence):

    1. Take q(t)q^{(t)} to be the maximizer of ELBO(q,α(t))\text{ELBO}(q, \alpha^{(t)}) over all qq (for fixed α(t)\alpha^{(t)})

    2. Take α(t+1)\alpha^{(t+1)} to be the maximizer of ELBO(q(t),α)\text{ELBO}(q^{(t)}, \alpha) over all α\alpha.

By (8), it is clear that q(t)q^{(t)} which maximizes ELBO(q,α(t))\text{ELBO}(q, \alpha^{(t)}) is simply equal to:

q(t)=fθy,α(t).q^{(t)} = f_{\theta \mid y, \alpha^{(t)}}.

Thus the alternating minimization algorithm above is equivalent to:

  1. Initialize α=α(0)\alpha = \alpha^{(0)}.

  2. Repeat the following for t=0,1,2,t = 0, 1, 2, \dots (until some kind of convergence):

    1. Take α(t+1)\alpha^{(t+1)} to be the maximizer of ELBO(fθ,y,α(t),α)\text{ELBO}(f_{\theta, \mid y, \alpha^{(t)}}, \alpha) over all α\alpha.

The quantity ELBO(fθ,y,α(t),α)\text{ELBO}(f_{\theta, \mid y, \alpha^{(t)}}, \alpha) equals:

ELBO(fθ,y,α(t),α)=fθy,α(t)(θ)logfy,θα(y,θ)fθy,α(t)(θ)dθ=fθy,α(t)(θ)logfy,θα(θ)dθfθy,α(t)(θ)logfθy,α(t)(θ)dθ.\begin{align*} \text{ELBO}(f_{\theta, \mid y, \alpha^{(t)}}, \alpha) &= \int f_{\theta \mid y, \alpha^{(t)}}(\theta) \log \frac{f_{y, \theta \mid \alpha}(y, \theta)}{f_{\theta \mid y, \alpha^{(t)}}(\theta)} d\theta \\ &= \int f_{\theta \mid y, \alpha^{(t)}}(\theta) \log f_{y, \theta \mid \alpha}(\theta) d\theta - \int f_{\theta \mid y, \alpha^{(t)}}(\theta) \log f_{\theta \mid y, \alpha^{(t)}}(\theta) d\theta. \end{align*}

The second term actually does not depend on α\alpha, so maximizing ELBO(fθ,y,α(t),α)\text{ELBO}(f_{\theta, \mid y, \alpha^{(t)}}, \alpha) over α\alpha is equivalent to maximizing the first term above:

E(t,α):=fθy,α(t)(θ)logfy,θα(θ)dθ.\begin{align} E(t, \alpha) := \int f_{\theta \mid y, \alpha^{(t)}}(\theta) \log f_{y, \theta \mid \alpha}(\theta) d\theta. \end{align}

The alternating maximization algorithm above then becomes:

  1. Initialize α=α(0)\alpha = \alpha^{(0)}.

  2. Repeat the following for t=0,1,2,t = 0, 1, 2, \dots (until some kind of convergence):

    1. Calculate E(t,α)E(t, \alpha) given in (13)

    2. Take α(t+1)\alpha^{(t+1)} to be the maximizer of E(t,α)E(t, \alpha) over all α\alpha.

This is the EM algorithm. The calculation of E(t,α)E(t, \alpha) is called the EE-step and the maximization of E(t,α)E(t, \alpha) over α\alpha is called the MM-step. For more on this connection between Variational Inference (VI) and EM, see neal1998view.

Coordinatewise Variational Inference

The key problem that we need to solve in VI is that of maximizing ELBO(q)\text{ELBO}(q). The full maximizer of ELBO(q)\text{ELBO}(q) over qqis, as already mentioned, fθyf_{\theta \mid y}. But this posterior is intractable in general and the very reason for using variational inerference is this intractability. The strategy is to restrict the class of qq over which we look for minimizers. This hopefully will lead to a tractable minimizer which is close to the actual posterior fθyf_{\theta \mid y}.

Suppose θ\theta can be decomposed into kk separate classes of parameters as θ=(θ1,,θk)\theta = (\theta_1, \dots, \theta_k). We look for variational approximations of the form q(θ)=q1(θ1)qk(θk)q(\theta) = q_1(\theta_1) \dots q_k(\theta_k). The ELBO maximization problem is:

maximizeq1,,qk ELBO(q1,,qk)= ⁣q1(θ1)qk(θk)logfy,θ(y,θ)q1(θ1)qk(θk)dθ.\underset{q_1, \dots, q_k}{\text{maximize}}~ \text{ELBO}(q_1, \dots, q_k) = \int \dots \int q_1(\theta_1) \dots q_k(\theta_k) \log \frac{f_{y, \theta}(y, \theta)}{q_1(\theta_1) \dots q_k(\theta_k)} d\theta.

Alternating minimization leads to the following result.

The CAVI update equation (15) is equivalent to the following:

logqj(θj)=constant+Eθjlogfy,θ1,,θk(y,θ1,,θk)\begin{align} \log q_j^*(\theta_j) = \text{constant} + \E_{\theta_{-j}} \log f_{y, \theta_1, \dots, \theta_k}(y, \theta_1, \dots, \theta_k) \end{align}

where the expectation is taken with respect to θjqj\theta_{-j} \sim q_{-j}.

Proposition 1 is a consequence of the following simple fact.

Simple CAVI Example

As a simple illustration of how CAVI works, let us consider the problem of estimating μ,σ\mu, \sigma from observations y1,,yny_1, \dots, y_n that are i.i.d N(μ,σ2)N(\mu, \sigma^2). As usual, we use the prior

μ,logσi.i.duniform(,).\begin{align*} \mu, \log \sigma \overset{\text{i.i.d}}{\sim} \text{uniform}(-\infty, \infty). \end{align*}

In other words, the prior is fμ,σ2(μ,σ2)=1σ2f_{\mu, \sigma^2}(\mu, \sigma^2)=\frac{1}{\sigma^2} (note that we are treating σ2\sigma^2 and not σ\sigma as the parameter).

The joint density of the data y=(y1,,yn)y = (y_1, \dots, y_n) and the parameter θ=(μ,σ2)\theta = (\mu, \sigma^2) is given by:

fy,θ(y,θ)=(i=1n12πσexp((yiμ)22σ2))1σ2\begin{align*} f_{y, \theta}(y, \theta) = \left(\prod_{i=1}^n \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{(y_i - \mu)^2}{2 \sigma^2} \right) \right) \frac{1}{\sigma^2} \end{align*}

so that

logfy,θ(y,θ)=i=1n(yiμ)22σ2(n2+1)logσ2+constant=n2σ2(yˉμ)212σ2i=1n(yiyˉ)2(n2+1)logσ2+constant.\begin{align*} \log f_{y, \theta}(y, \theta) &= -\sum_{i=1}^n \frac{(y_i - \mu)^2}{2 \sigma^2} - \left(\frac{n}{2} + 1 \right) \log \sigma^2 + \text{constant} \\ &= -\frac{n}{2 \sigma^2} (\bar{y} - \mu)^2 - \frac{1}{2 \sigma^2} \sum_{i=1}^n (y_i - \bar{y})^2 - \left(\frac{n}{2} + 1 \right) \log \sigma^2 + \text{constant}. \end{align*}

CAVI approximates the posterior of θ=(μ,σ2)\theta = (\mu, \sigma^2) by a product distribution qμ(μ)qσ2(σ2)q_{\mu}(\mu) q_{\sigma^2}(\sigma^2). Below we figure out how to update qμq_{\mu} given qσ2q_{\sigma^2} and also qσ2q_{\sigma^2} given qμq_{\mu}.

By (16), given qσ2q_{\sigma^2}, we get

logqμ(μ)=n2(yˉμ)2Eqσ2(1σ2)+constant,\begin{align*} \log q^*_{\mu}(\mu) = -\frac{n}{2} (\bar{y} - \mu)^2 \E_{q_{\sigma^2}} \left(\frac{1}{\sigma^2} \right) + \text{constant}, \end{align*}

which means that qμq^*_{\mu} is the density of the normal distribution: with mean μ\mu and variance

qμ= density of N(yˉ,1nEqσ2(1/σ2)).\begin{align} q^*_{\mu} = \text{ density of } N\left(\bar{y}, \frac{1}{n \E_{q_{\sigma^2}}(1/\sigma^2)} \right). \end{align}

To get qσ2(σ2)q^*_{\sigma^2}(\sigma^2) for fixed qμq_{\mu}, we again use (16) to get

logqσ2(σ2)=12σ2(nEqμ(yˉμ)2+i=1n(yiyˉ)2)(n2+1)logσ2+constant.\begin{align*} \log q^*_{\sigma^2}(\sigma^2) = -\frac{1}{2 \sigma^2} \left(n \E_{q_{\mu}} (\bar{y} - \mu)^2 + \sum_{i=1}^n (y_i - \bar{y})^2 \right) - \left(\frac{n}{2} + 1 \right) \log \sigma^2 + \text{constant}. \end{align*}

This has the form of the Inverse Gamma density IG(a,b)IG(a, b) given by:

bσ2(a+1)logσ2.\begin{align*} -\frac{b}{\sigma^2} - (a + 1) \log \sigma^2. \end{align*}

Comparing terms, we get

qσ2= density of IG(n2,n2Eqμ(yˉμ)2+12i=1n(yiyˉ)2).\begin{align} q^*_{\sigma^2} = \text{ density of } IG \left(\frac{n}{2}, \frac{n}{2} \E_{q_{\mu}} (\bar{y} - \mu)^2 + \frac{1}{2} \sum_{i=1}^n (y_i - \bar{y})^2 \right). \end{align}

(23) and (26) give the following CAVI algorithm:

  1. Initialize with some density qσ2(0)q^{(0)}_{\sigma^2} of σ2\sigma^2. Suppose s0:=(nEqσ2(0)(1/σ2))1s_0 := \left(n \E_{q^{(0)}_{\sigma^2}}(1/\sigma^2) \right)^{-1}.

  2. Repeat the following for k=0,1,2,k = 0, 1, 2, \dots:

    1. Take qμ(k+1)q_{\mu}^{(k+1)} to be the density of N(yˉ,sk2)N(\bar{y}, s_k^2).

    2. Take qσ2(k+1)q_{\sigma^2}^{(k+1)} to be the density of

      IG(n2,n2Eqμ(k+1)(yˉμ)2+12i=1n(yiyˉ)2)=IG(n2,n2sk2+12i=1n(yiyˉ)2)\begin{align*} IG \left(\frac{n}{2}, \frac{n}{2}\E_{q^{(k+1)}_{\mu}} (\bar{y} - \mu)^2 + \frac{1}{2} \sum_{i=1}^n (y_i - \bar{y})^2 \right) = IG \left(\frac{n}{2}, \frac{n}{2}s_k^2 + \frac{1}{2} \sum_{i=1}^n (y_i - \bar{y})^2 \right) \end{align*}
    3. Take sk+12=(nEqσ2(k+1)(1/σ2))1s_{k+1}^2 = \left(n \E_{q_{\sigma^2}^{(k+1)}} (1/\sigma^2) \right)^{-1} which leads to (using the fact Eσ2=a/b\E \sigma^{-2} = a/b when σ2IG(a,b)\sigma^2 \sim IG(a, b))

      sk+12=sk2n+1n2i=1n(yiyˉ)2.\begin{align} s_{k+1}^2 = \frac{s_k^2}{n} + \frac{1}{n^2} \sum_{i=1}^n (y_i - \bar{y})^2. \end{align}

The limit of the iteration for {sk2}\{s_k^2\} in (28) can be computed in closed form. Indeed if s2=limksk2s^2 = \lim_{k \rightarrow \infty} s_k^2, then (28) gives

s2=s2n+1n2i=1n(yiyˉ)2    s2=1n(n1)i=1n(yiyˉ)2.\begin{align*} s^2 = \frac{s^2}{n} + \frac{1}{n^2} \sum_{i=1}^n (y_i - \bar{y})^2 \implies s^2 = \frac{1}{n(n-1)} \sum_{i=1}^n (y_i - \bar{y})^2. \end{align*}

Thus the VI solution qμ(μ)qσ2(σ2)q^*_{\mu}(\mu) q^*_{\sigma^2}(\sigma^2) is given by:

qμ= density of N(yˉ,1n(n1)i=1n(yiyˉ)2)\begin{align} q_{\mu}^* = \text{ density of } N \left(\bar{y}, \frac{1}{n(n-1)} \sum_{i=1}^n (y_i - \bar{y})^2 \right) \end{align}

and

qσ2= density of IG(n2,12(n1)i=1n(yiyˉ)2+12i=1n(yiyˉ)2)= density of IG(n2,n2(n1)i=1n(yiyˉ)2)\begin{align} q_{\sigma^2}^* &= \text{ density of } IG\left(\frac{n}{2}, \frac{1}{2(n-1)} \sum_{i=1}^n (y_i - \bar{y})^2 + \frac{1}{2} \sum_{i=1}^n (y_i - \bar{y})^2 \right) \nonumber \\ &= \text{ density of } IG\left(\frac{n}{2}, \frac{n}{2(n-1)} \sum_{i=1}^n (y_i - \bar{y})^2 \right) \end{align}

We can compare these VI marginals to those of the exact posterior which we know is given by:

μdatatn1(yˉ,1n(n1)i=1n(yiyˉ)2)   and   σ2dataIG(n12,12i=1n(yiyˉ)2).\begin{align*} \mu \mid \text{data} \sim t_{n-1} \left(\bar{y}, \frac{1}{n(n-1)} \sum_{i=1}^n (y_i - \bar{y})^2 \right) ~~ \text{ and } ~~ \sigma^2 \mid \text{data} \sim IG \left(\frac{n-1}{2}, \frac{1}{2} \sum_{i=1}^n (y_i - \bar{y})^2 \right). \end{align*}

Comparing the above to (30), we see that the VI marginal posterior has a narrower normal density instead of tn1t_{n-1}. Comparing the above to (31), we see that the parameters of the IG distribution are changed slightly (the changes are negligible for large nn). Note also that μ\mu and σ2\sigma^2 are independent in the actual posterior but this VI analysis assumes they are independent.