Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

STAT 238 - Bayesian Statistics Lecture Thirty Eight

Spring 2026, UC Berkeley

As an example of variational inference, we shall look at an example of Variational Auto-Encoders (VAE) which are used for generative modeling.

Variational AutoEncoders

Consider data points y1,,yny_1, \dots, y_n, where each yiy_i is a binary vector in {0,1}dy\{0, 1\}^{d_y}. One can think of each yiy_i as an image, flattened from its 2D pixel grid into a single vector of length dyd_y, where each entry corresponds to one pixel. In general, pixel values can take many forms—for instance, grayscale images use intensities in {0,1,,255}\{0, 1, \dots, 255\}, and color images use three such values per pixel (one per RGB channel). Here we focus on the simplest case: binary images (also called black-and-white or bilevel images), in which each pixel is either off (0) or on (1). For example, a 28×2828 \times 28 binary image yields dy=282=784d_y = 28^2 = 784.

We build up a model for the data in stages. Since each coordinate of yiy_i is binary, the most basic assumption is

yipiindependentBernoulli(pi),\begin{align*} y_i \mid p_i \overset{\text{independent}}{\sim} \mathrm{Bernoulli}(p_i), \end{align*}

where pi[0,1]dyp_i \in [0, 1]^{d_y} specifies the probability that each pixel is on (Bernoulli is interpreted componentwise, so pixels are conditionally independent given pip_i). In other words, we are assuming that each yi{0,1}dyy_i \in \{0, 1\}^{d_y} is Bernoulli with its own parameter pi[0,1]dyp_i \in [0, 1]^{d_y}. We need to further model p1,,pnp_1, \dots, p_n (otherwise, there will be too many parameters to yield anything useful). Here it is more convenient to work on the logit scale: let

ηi=log ⁣pi1piRdy,\begin{align*} \eta_i = \log\!\frac{p_i}{1 - p_i} \in \mathbb{R}^{d_y}, \end{align*}

applied componentwise.

We shall assume η1,,ηn\eta_1, \dots, \eta_n are i.i.d with some common distribution on Rdy\R^{d_y}. Further, we assume that this common distribution, despite living in Rdy\mathbb{R}^{d_y}, is in fact concentrated on (or near) a much lower-dimensional set. A natural way to encode this is to write

ηi=fθ(zi)   where zii.i.d.N(0,Id),\begin{align*} \eta_i = f_\theta(z_i) ~~ \text{ where } z_i \overset{\text{i.i.d.}}{\sim} N(0, I_d), \end{align*}

where fθ:RdRdyf_\theta : \mathbb{R}^d \to \mathbb{R}^{d_y} is a neural network with parameters θ\theta and ddyd \ll d_y. The pushforward of N(0,Id)N(0, I_d) under fθf_\theta defines a dd-dimensional family of distributions on Rdy\mathbb{R}^{d_y}, and the flexibility of fθf_\theta ensures this family is rich enough to model the structure we care about. The Gaussian prior on ziz_i is essentially a convention: any continuous distribution on Rd\mathbb{R}^d can be written as a deterministic transformation of a standard Gaussian, so fixing ziN(0,Id)z_i \sim N(0, I_d) and letting fθf_\theta be flexible loses no generality (it is also analytically convenient, as we will see when designing an inference procedure). Putting the pieces together, and writing σ\sigma for the componentwise sigmoid, the overall model can be written in the following alternative way:

zii.i.dN(0,Id)   and   yiziBernoulli(σ(fθ(zi)))\begin{align*} z_i \overset{\text{i.i.d}}{\sim} N(0, I_d) ~~ \text{ and } ~~ y_i \mid z_i \sim \mathrm{Bernoulli}\bigl(\sigma(f_\theta(z_i))\bigr) \end{align*}

with, for example,

fθ(zi)=B ReLU(Azi+a)+b   with θ=(A,a,B,b)\begin{align} f_\theta(z_i) = B~\text{ReLU}(A z_i + a) + b ~~ \text{ with } \theta = (A, a, B, b) \end{align}

The unknown parameters in this model are the neural network parameters θ\theta. The latent variables z1,,znz_1, \dots, z_n are also unobserved. The ziz_i’s can be integrated out and the model becomes

yii.i.dpθ\begin{align*} y_i \overset{\text{i.i.d}}{\sim} p_{\theta} \end{align*}

where

pθ(y)=pθ(yz)φ(z)dz\begin{align*} p_{\theta}(y) = \int p_{\theta}(y \mid z) \varphi(z) dz \end{align*}

with φ(z)\varphi(z) being the density of N(0,Id)N(0, I_d), and

pθ(yz)=j=1dy(σ(fθ,j(z)))yj(1σ(fθ,j(z)))1yj   for   y=(y1,,ydy){0,1}dy.\begin{align*} p_{\theta}(y \mid z) = \prod_{j=1}^{d_y} \left(\sigma(f_{\theta, j}(z)) \right)^{y_j} \left(1 - \sigma(f_{\theta, j}(z))\right)^{1 - y_j} ~~ \text{ for } ~~ y = (y_1, \dots, y_{d_y}) \in \{0, 1\}^{d_y}. \end{align*}

The likelihood then is given by:

fdataθ(y1,,yn)=i=1n[pθ(yizi)φ(zi)dzi]\begin{align*} f_{\text{data} \mid \theta}(y_1, \dots, y_n) = \prod_{i=1}^n \left[ \int p_{\theta}(y_i \mid z_i) \varphi(z_i) dz_i \right] \end{align*}

and the corresponding log-likelihood is

logfdataθ(y1,,yn)=i=1nlog[pθ(yizi)φ(zi)dzi].\begin{align} \log f_{\text{data} \mid \theta}(y_1, \dots, y_n) = \sum_{i=1}^n \log \left[ \int p_{\theta}(y_i \mid z_i) \varphi(z_i) dz_i \right]. \end{align}

This log-likelihood cannot be directly used as the objective function in optimization software such as PyTorch because of the presence of the integral. A natural idea is to discretize the integral by sampling zi(l),l=1,,Mz^{(l)}_i, l = 1, \dots, M from N(0,Id)N(0, I_d) and forming the approximate log-likelihood:

logfdataθ(y1,,yn)i=1nlog[1Ml=1Mpθ(yizi(l))].\begin{align} \log f_{\text{data} \mid \theta}(y_1, \dots, y_n) \approx \sum_{i=1}^n \log \left[ \frac{1}{M} \sum_{l = 1}^M p_{\theta}(y_i \mid z_i^{(l)}) \right]. \end{align}

The above term (11) is unlikely to be a good approximation to the actual log-likelihood (10). This is because the integral pθ(yizi)φ(zi)dzi\int p_{\theta}(y_i \mid z_i) \varphi(z_i) dz_i in (10) is dominated by the region where pθ(yizi)p_{\theta}(y_i \mid z_i) is large i.e., by the posterior pθ(ziyi)p_{\theta}(z_i \mid y_i). This posterior typically is very concentrated in the sense that only a tiny region of ziRdz_i \in \R^d yields non-negligibe values of pθ(yizi)p_{\theta}(y_i \mid z_i). The prior ziN(0,Id)z_i \sim N(0, I_d), by contrast, spreads mass diffusely over Rd\R^d. The consequence will be that almost every sample zi(l)z_i^{(l)} lands in a region where pθ(yizi(l))p_{\theta}(y_i \mid z_i^{(l)}) is small, so that the average in (11) is dominated by one or two (or even zero) lucky samples. This makes the average in (11) a highly variance estimate of the integral in (10), and hence unreliable as an optimization objective.

A better approach is to use ideas from variational inference and to maximize the ELBO instead.

Use of the ELBO for maximization of logfdataθ(y1,,yn)\log f_{\text{data} \mid \theta}(y_1, \dots, y_n)

We shall use the following formula whose proof we saw in Lecture 36 (see also Lecture 37):

logfdataθ(y1,,yn)=maxqELBO(q,θ)\begin{align} \log f_{\text{data} \mid \theta}(y_1, \dots, y_n) = \max_{q} \text{ELBO}(q, \theta) \end{align}

where

ELBO(q,θ)= ⁣q(z1,,zn)logi=1npθ(yi,zi)q(z1,,zn)dz1dzn= ⁣q(z1,,zn)logi=1npθ(yizi)dz1dzn ⁣q(z1,,zn)logq(z1,,zn)φ(z1),,φ(zn)dz1dzn=i=1nlogpθ(yizi)qi(zi)dziKL(qφ××φ).\begin{align*} & \text{ELBO}(q, \theta) \\ &= \int \dots \int q(z_1, \dots, z_n) \log \frac{\prod_{i=1}^np_{\theta}(y_i, z_i)}{q(z_1, \dots, z_n)} dz_1 \dots dz_n \\ &= \int \dots \int q(z_1, \dots, z_n) \log \prod_{i=1}^n p_{\theta}(y_i \mid z_i) dz_1 \dots dz_n - \int \dots \int q(z_1, \dots, z_n) \log \frac{q(z_1, \dots, z_n)}{\varphi(z_1), \dots, \varphi(z_n)} dz_1 \dots dz_n \\ &= \sum_{i=1}^n \int \log p_{\theta}(y_i \mid z_i) q_i(z_i) dz_i - KL(q \| \varphi \times \dots \times \varphi). \end{align*}

where qiq_i is the ii-th marginal corresponding to qq, and φ××φ\varphi \times \dots \times \varphi denotes the distribution of (z1,,zn)(z_1, \dots, z_n) corresponding to zii.i.dN(0,Id)z_i \overset{\text{i.i.d}}{\sim} N(0, I_d).

The maximization in (12) is over all probability densities qq of z1,,znz_1, \dots, z_n. We have also seen (again in Lecture 36) that the maximum in (12) is achieved when qq is the conditional density of z1,,znz_1, \dots, z_n given y1,,yny_1, \dots, y_n and θ\theta:

q(z1,,zn)i=1n[pθ(yizi)φ(zi)].\begin{align*} q^*(z_1, \dots, z_n) \propto \prod_{i=1}^n\left[ p_{\theta}(y_i \mid z_i) \varphi(z_i) \right]. \end{align*}

It is clear that this qq^* is a product measure i=1nqi(ziyi)\prod_{i=1}^n q_i^*(z_i \mid y_i) where

qi(ziyi)pθ(yizi)φ(zi)    qi(ziyi)=pθ(yizi)φ(zi)pθ(yizi)φ(zi)dzi.\begin{align*} q_i^*(z_i \mid y_i) \propto p_{\theta}(y_i \mid z_i) \varphi(z_i) \implies q_i^*(z_i \mid y_i) = \frac{p_{\theta}(y_i \mid z_i) \varphi(z_i)}{\int p_{\theta}(y_i \mid z_i) \varphi(z_i) dz_i}. \end{align*}

However because of the somewhat complicated form of pθ(yizi)p_{\theta}(y_i \mid z_i), the density qiq_i^* above is not of a simple form (for example, it is not a Gaussian). We therefore take a simpler class of densities Q\Qcal and then maximize the ELBO over Q\Qcal. Specifically, we shall take Q\Qcal to be the class of all product densities qi,ϕ(ziyi)\prod q_{i, \phi}(z_i \mid y_i) for which each marginal qϕ(ziyi)q_{\phi}(z_i \mid y_i) is Gaussian:

qϕ(ziyi)= density of N(μϕ(yi),Σϕ(yi))\begin{align*} q_{\phi}(z_i \mid y_i) = ~\text{density of}~ N(\mu_{\phi}(y_i), \Sigma_{\phi}(y_i)) \end{align*}

with

μϕ(yi)=(μϕ,j(yi),j=1,,d)   and   Σϕ(yi)=diag(σϕ,j2(yi),j=1,,d).\begin{align*} \mu_{\phi}(y_i) = (\mu_{\phi, j}(y_i), j = 1, \dots, d) ~~ \text{ and } ~~ \Sigma_{\phi}(y_i) = \text{diag} \left(\sigma^2_{\phi, j}(y_i), j = 1, \dots, d \right). \end{align*}

With this choice Q\Qcal, the ELBO becomes

ELBO(q,θ)=i=1nlogpθ(yizi)qϕ(ziyi)dzii=1nKL(q(yi)ϕ),\begin{align*} \text{ELBO}(q, \theta) = \sum_{i=1}^n \int \log p_{\theta}(y_i \mid z_i) q_{\phi}(z_i \mid y_i) dz_i - \sum_{i=1}^n KL(q(\cdot \mid y_i) \| \phi), \end{align*}

where we use the fact that the KL between two product distributions becomes the sum of the KL between the marginals. We can further write this as:

ELBO(q,θ)=i=1nEziqϕ(yi)logpθ(yizi)i=1nKL(N(μϕ(yi),Σϕ(yi))N(0,Id)).\begin{align*} \text{ELBO}(q, \theta) = \sum_{i=1}^n \E_{z_i \sim q_{\phi}(\cdot \mid y_i)} \log p_{\theta}(y_i \mid z_i) - \sum_{i=1}^n KL(N(\mu_{\phi}(y_i), \Sigma_{\phi}(y_i)) \| N(0, I_d)). \end{align*}

The second term above can be written in closed form as (see e.g., {https://en.wikipedia.org/wiki/Kullback

KL(N(μϕ(yi),Σϕ(yi))N(0,Id))=12j=1d{σϕ,j2(yi)+μϕ,j2(yi)logσϕ,j2(yi)1}.\begin{align*} KL(N(\mu_{\phi}(y_i), \Sigma_{\phi}(y_i)) \| N(0, I_d)) = \frac{1}{2} \sum_{j=1}^d \left\{\sigma^2_{\phi, j}(y_i) + \mu^2_{\phi, j}(y_i) - \log \sigma^2_{\phi, j}(y_i) - 1 \right\}. \end{align*}

For the first term in the ELBO expression above, we need to compute

Eziqϕ(yi)logpθ(yizi)\begin{align*} \E_{z_i \sim q_{\phi}(\cdot \mid y_i)} \log p_{\theta}(y_i \mid z_i) \end{align*}

For this, we can use Monte Carlo averaging. First write ziN(μϕ(yi),Σϕ(yi))z_i \sim N(\mu_{\phi}(y_i), \Sigma_{\phi}(y_i)) as

zi=μϕ(yi)+σϕ(yi)ui  where uii.i.dN(0,Id).\begin{align*} z_i = \mu_{\phi}(y_i) + \sigma_{\phi}(y_i) \odot u_i ~~ \text{where}~ u_i \overset{\text{i.i.d}}{\sim} N(0, I_d). \end{align*}

Here \odot stands for the Hadamard product (elementwise multiplication). Note that the components of Σϕ(yi)1/2ui\Sigma_{\phi}(y_i)^{1/2} u_i are simply σϕ,j(yi)ui,j\sigma_{\phi, j}(y_i) u_{i, j}. Thus

Eziqϕ(yi)logpθ(yizi)=EuiN(0,Id)logpθ(yiμϕ(yi)+σϕ(yi)ui)1Ll=1Llogpθ(yiμϕ(yi)+σϕ(yi)ui(l))\begin{align*} \E_{z_i \sim q_{\phi}(\cdot \mid y_i)} \log p_{\theta}(y_i \mid z_i) &= \E_{u_i \sim N(0, I_d)} \log p_{\theta}(y_i \mid \mu_{\phi}(y_i) + \sigma_{\phi}(y_i) \odot u_i) \\ &\approx \frac{1}{L} \sum_{l=1}^L \log p_{\theta}(y_i \mid \mu_{\phi}(y_i) + \sigma_{\phi}(y_i) \odot u_i^{(l)}) \end{align*}

where ui(l),1lL,1inu_i^{(l)}, 1 \leq l \leq L, 1 \leq i \leq n are i.i.d samples generated from N(0,Id)N(0, I_d). Thus the ELBO becomes:

ELBO(ϕ,θ)=1Li=1nl=1Lj=1dy{yi,jlogσ(fθ,j(μϕ(yi)+σϕ(yi)ui(l)))+(1yi,j)log[1σ(fθ,j(μϕ(yi)+σϕ(yi)ui(l)))]}12i=1nj=1d{σϕ,j2(yi)+μϕ,j2(yi)logσϕ,j2(yi)1}.\begin{align*} & \text{ELBO}(\phi, \theta) \\ &= \frac{1}{L} \sum_{i=1}^n \sum_{l = 1}^L \sum_{j=1}^{d_y} \left\{y_{i, j} \log \sigma\left( f_{\theta, j}(\mu_{\phi}(y_i) + \sigma_{\phi}(y_i) \odot u_i^{(l)})\right) \right. \\ & \left.+ (1 - y_{i, j}) \log \left[1 - \sigma\left( f_{\theta, j}(\mu_{\phi}(y_i) + \sigma_{\phi}(y_i) \odot u_i^{(l)}) \right) \right] \right\} \\ &- \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^d \left\{\sigma^2_{\phi, j}(y_i) + \mu^2_{\phi, j}(y_i) - \log \sigma^2_{\phi, j}(y_i) - 1 \right\}. \end{align*}

Here σ\sigma denotes the sigmoid function while σϕ\sigma_{\phi} denotes the variance parameter of the variational distribution qϕq_{\phi}.

We simply maximize ELBO(ϕ,θ)\text{ELBO}(\phi, \theta) jointly over ϕ\phi and θ\theta to estimate them. We will parametrize μϕ\mu_{\phi} and σϕ\sigma_{\phi} using neural networks. For example,

μϕ(y)=WμReLU(Wy+w)+bμlogσϕ2(y)=WσReLU(Wy+w)+bσ,\begin{align*} & \mu_{\phi}(y) = W_{\mu} \text{ReLU}(Wy + w) + b_{\mu} \\ & \log \sigma^2_{\phi}(y) = W_{\sigma} \text{ReLU}(Wy + w) + b_{\sigma}, \end{align*}

with ϕ=(Wμ,W,w,bμ,Wσ,bσ)\phi = (W_{\mu}, W, w, b_{\mu}, W_{\sigma}, b_{\sigma}). Recall also that fθf_{\theta} will also be parametrized by a neural network as in (5).

With these parametrizations, ELBO(ϕ,θ)\text{ELBO}(\phi, \theta) can be maximized using optimization software such as PyTorch.