Consider data points $y_1, \dots, y_n$, where each $y_i$ is a binary vector in $\{0,1\}^{d_y}$. One can think of each $y_i$ as an image, flattened from its 2D pixel grid into a single vector of length $d_y$, where each entry corresponds to one pixel. In general, pixel values can take many forms: for instance, grayscale images use intensities in $\{0, 1, \dots, 255\}$, and color images use three such values per pixel (one per RGB channel). Here we focus on the simplest case: binary images (also called black-and-white or bilevel images), in which each pixel is either off (0) or on (1). For example, a $28 \times 28$ binary image yields $d_y = 28^2 = 784$.
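The flattening step can be sketched in a few lines of NumPy. The image below (a filled square) is made up for illustration, and row-major flattening is an assumption; any fixed ordering of the pixels would do.

```python
import numpy as np

# Hypothetical 28x28 binary image: a filled square on an empty background.
H, W = 28, 28
image = np.zeros((H, W), dtype=np.uint8)
image[10:18, 10:18] = 1          # pixels that are "on"

# Flatten row by row into a single vector of length d_y = 28 * 28.
y = image.reshape(-1)
d_y = y.shape[0]

# Pixel (r, c) of the grid sits at index r * W + c of the flattened vector.
r, c = 12, 15
print(d_y, y[r * W + c], int(y.sum()))
```

Reshaping `y` back with `y.reshape(H, W)` recovers the original grid, so no information is lost by working with vectors.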
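We build up a model for the data in stages. Since each coordinate of $y_i$ is binary, the most basic assumption is

$$y_i \mid p_i \overset{\text{independent}}{\sim} \text{Bernoulli}(p_i), \qquad i = 1, \dots, n,$$
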
where $p_i \in [0,1]^{d_y}$ specifies the probability that each pixel is on (Bernoulli is interpreted componentwise, so pixels are conditionally independent given $p_i$). In other words, we are assuming that each $y_i \in \{0,1\}^{d_y}$ is Bernoulli with its own parameter $p_i \in [0,1]^{d_y}$. We need to further model $p_1, \dots, p_n$: otherwise there are $n d_y$ free parameters, one per observed pixel, which is too many to yield anything useful. Here it is more convenient to work on the logit scale: let

$$\eta_i := \text{logit}(p_i) = \log \frac{p_i}{1 - p_i} \in \mathbb{R}^{d_y} \qquad \text{(componentwise)}.$$

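We shall assume $\eta_1, \dots, \eta_n$ are i.i.d. with some common distribution on $\mathbb{R}^{d_y}$. Further, we assume that this common distribution, despite living in $\mathbb{R}^{d_y}$, is in fact concentrated on (or near) a much lower-dimensional set. A natural way to encode this is to write

$$\eta_i = f_\theta(z_i), \qquad z_i \overset{\text{i.i.d.}}{\sim} N(0, I_d),$$
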
where $f_\theta : \mathbb{R}^d \to \mathbb{R}^{d_y}$ is a neural network with parameters $\theta$ and $d \ll d_y$. For each $\theta$, the pushforward of $N(0, I_d)$ under $f_\theta$ is a distribution on $\mathbb{R}^{d_y}$ supported on a set of dimension at most $d$, and the flexibility of $f_\theta$ ensures that the resulting family of distributions (indexed by $\theta$) is rich enough to model the structure we care about. The Gaussian prior on $z_i$ is essentially a convention: any continuous distribution on $\mathbb{R}^d$ can be written as a deterministic transformation of a standard Gaussian, so fixing $z_i \sim N(0, I_d)$ and letting $f_\theta$ be flexible loses no generality (it is also analytically convenient, as we will see when designing an inference procedure). Putting the pieces together, and writing $\sigma$ for the componentwise sigmoid, the overall model can be written in the following alternative way:

$$z_i \overset{\text{i.i.d.}}{\sim} N(0, I_d) \quad \text{and} \quad y_i \mid z_i \sim \text{Bernoulli}\big(\sigma(f_\theta(z_i))\big).$$

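To make the two-stage model concrete, here is a minimal NumPy sketch of sampling from it. The decoder is an assumption for illustration only: a one-hidden-layer tanh network with random, untrained weights, and the names `make_decoder`, `f_theta` and `sample_model` are hypothetical. In practice $f_\theta$ would be a trained network, for example in PyTorch.

```python
import numpy as np

def sigmoid(eta):
    """Componentwise logistic function, the inverse of the logit."""
    return 1.0 / (1.0 + np.exp(-eta))

def make_decoder(d, d_hidden, d_y, rng):
    """Random weights for a one-hidden-layer network f_theta: R^d -> R^{d_y}.

    The architecture and the untrained weights are assumptions for illustration.
    """
    return {
        "W1": rng.normal(scale=1.0, size=(d, d_hidden)),
        "b1": np.zeros(d_hidden),
        "W2": rng.normal(scale=1.0, size=(d_hidden, d_y)),
        "b2": np.zeros(d_y),
    }

def f_theta(z, theta):
    """Map latent vectors z (shape (n, d)) to logits eta (shape (n, d_y))."""
    h = np.tanh(z @ theta["W1"] + theta["b1"])
    return h @ theta["W2"] + theta["b2"]

def sample_model(n, d, theta, rng):
    """Draw z_i ~ N(0, I_d), then y_i | z_i ~ Bernoulli(sigmoid(f_theta(z_i)))."""
    z = rng.standard_normal((n, d))
    p = sigmoid(f_theta(z, theta))
    y = (rng.random(p.shape) < p).astype(np.uint8)   # independent pixels given z
    return z, p, y

rng = np.random.default_rng(0)
d, d_hidden, d_y, n = 2, 16, 784, 5
theta = make_decoder(d, d_hidden, d_y, rng)
z, p, y = sample_model(n, d, theta, rng)
print(z.shape, p.shape, y.shape)
```

Each row of `y` is one binary image of length $d_y$, generated from only $d = 2$ latent coordinates. This is the sense in which the data are concentrated near a low-dimensional set.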
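The unknown parameters in this model are the neural network parameters $\theta$. The latent variables $z_1, \dots, z_n$ are also unobserved. The $z_i$'s can be integrated out, and the log-likelihood of the model becomes

$$\log f_{\text{data} \mid \theta}(y_1, \dots, y_n) = \sum_{i=1}^n \log \int p_\theta(y_i \mid z_i)\, \phi(z_i)\, dz_i, \tag{10}$$

where $p_\theta(y_i \mid z_i) = \prod_{j=1}^{d_y} \sigma(f_\theta(z_i))_j^{\,y_{ij}} \big(1 - \sigma(f_\theta(z_i))_j\big)^{1 - y_{ij}}$ is the Bernoulli likelihood and $\phi$ is the $N(0, I_d)$ density.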
This log-likelihood cannot be used directly as the objective function in optimization software such as PyTorch because of the integral. A natural idea is to approximate the integral by Monte Carlo, sampling $z_i^{(l)}$, $l = 1, \dots, M$, from $N(0, I_d)$ and forming the approximate log-likelihood:

$$\sum_{i=1}^n \log \left( \frac{1}{M} \sum_{l=1}^M p_\theta\big(y_i \mid z_i^{(l)}\big) \right). \tag{11}$$

The approximation (11) is unlikely to be close to the actual log-likelihood (10). This is because the integral $\int p_\theta(y_i \mid z_i)\, \phi(z_i)\, dz_i$ in (10) is dominated by the region where $p_\theta(y_i \mid z_i)$ is large, i.e., the region where the posterior $p_\theta(z_i \mid y_i)$ places its mass. This posterior is typically very concentrated, in the sense that only a tiny region of $z_i \in \mathbb{R}^d$ yields non-negligible values of $p_\theta(y_i \mid z_i)$. The prior $z_i \sim N(0, I_d)$, by contrast, spreads its mass diffusely over $\mathbb{R}^d$. As a consequence, almost every sample $z_i^{(l)}$ lands in a region where $p_\theta(y_i \mid z_i^{(l)})$ is small, so the average in (11) is dominated by one or two lucky samples (or by none at all). This makes the average in (11) a high-variance estimate of the integral in (10), and hence unreliable as an optimization objective.
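The instability of the prior-sampling estimate can be probed numerically. The sketch below is a toy under stated assumptions: a linear decoder with random weights stands in for $f_\theta$, there is a single data point, and `naive_estimate` is a hypothetical helper name. It repeats the estimate with fresh prior samples and reports the effective sample size $1 / \sum_l w_l^2$ of the normalized weights $w_l \propto p_\theta(y \mid z^{(l)})$, which equals $M$ when all samples contribute equally and $1$ when a single sample dominates.

```python
import numpy as np

def bernoulli_loglik(y, eta):
    """log p(y | z) for logits eta = f_theta(z): sum_j [y_j * eta_j - log(1 + exp(eta_j))]."""
    return np.sum(y * eta - np.logaddexp(0.0, eta), axis=-1)

def log_mean_exp(a):
    """Numerically stable log of the average of exp(a)."""
    m = np.max(a)
    return m + np.log(np.mean(np.exp(a - m)))

def naive_estimate(y, W, b, M, rng):
    """Prior-sampling estimate of log p(y), plus the effective sample size."""
    z = rng.standard_normal((M, W.shape[0]))      # z^(l) ~ N(0, I_d)
    logp = bernoulli_loglik(y, z @ W + b)          # log p(y | z^(l)), shape (M,)
    w = np.exp(logp - np.max(logp))
    w /= w.sum()                                   # normalized weights
    ess = 1.0 / np.sum(w ** 2)                     # between 1 and M
    return log_mean_exp(logp), ess

rng = np.random.default_rng(0)
d, d_y, M = 2, 200, 100
W = rng.normal(size=(d, d_y))                      # hypothetical linear decoder
b = np.zeros(d_y)
z_true = rng.standard_normal(d)
p_true = 1.0 / (1.0 + np.exp(-(z_true @ W + b)))
y = (rng.random(d_y) < p_true).astype(float)

# Repeat the estimate with fresh prior samples to see how much it moves.
results = [naive_estimate(y, W, b, M, rng) for _ in range(20)]
estimates = np.array([r[0] for r in results])
ess_values = np.array([r[1] for r in results])
print("spread of estimates:", estimates.min(), estimates.max())
print("median effective sample size:", np.median(ess_values), "out of", M)
```

In a setup like this one would expect the effective sample size to be far below $M$ and the repeated estimates to differ noticeably. The exact numbers depend on the seed and on the assumed decoder. The computation is carried out in log space because $p_\theta(y \mid z)$ is a product of $d_y$ probabilities and can underflow.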
A better approach is to use ideas from variational inference and to maximize the ELBO instead.
Use of the ELBO for maximization of $\log f_{\text{data} \mid \theta}(y_1, \dots, y_n)$
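We shall use the following formula whose proof we saw in Lecture 36 (see also Lecture 37):

$$\log f_{\text{data} \mid \theta}(y_1, \dots, y_n) = \max_{q} \left[ \sum_{i=1}^n \mathbb{E}_{z_i \sim q_i} \log p_\theta(y_i \mid z_i) - \mathrm{KL}\big(q \,\|\, \phi \times \dots \times \phi\big) \right], \tag{12}$$
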
where $q_i$ is the $i$-th marginal corresponding to $q$, and $\phi \times \dots \times \phi$ denotes the distribution of $(z_1, \dots, z_n)$ corresponding to $z_i \overset{\text{i.i.d.}}{\sim} N(0, I_d)$.
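The variational identity for the log-likelihood can be sanity-checked numerically in a toy case. The sketch below makes several assumptions for illustration: a single data point ($n = 1$), a scalar latent $z$ restricted to a fine grid so that integrals become finite sums, and a made-up linear decoder with three pixels. It compares the expected log-likelihood minus the KL divergence to the prior (the ELBO) with the exact log-likelihood, for two choices of $q$.

```python
import numpy as np

# Scalar latent z on a fine grid; the grid turns integrals into finite sums.
z = np.linspace(-6.0, 6.0, 2001)
prior = np.exp(-0.5 * z ** 2)
prior /= prior.sum()                      # discretized N(0, 1) prior

# Hypothetical decoder for d_y = 3 pixels: logits eta_j = w_j * z + b_j.
w = np.array([2.0, -1.0, 0.5])
b = np.array([0.0, 0.5, -1.0])
y = np.array([1.0, 0.0, 1.0])             # one observed binary vector

eta = z[:, None] * w + b                  # shape (grid, d_y)
log_lik = np.sum(y * eta - np.logaddexp(0.0, eta), axis=1)   # log p(y | z)

log_evidence = np.log(np.sum(prior * np.exp(log_lik)))       # log p(y)
posterior = prior * np.exp(log_lik)
posterior /= posterior.sum()              # conditional distribution of z given y

def elbo(q):
    """E_q[log p(y | z)] - KL(q || prior) for a strictly positive q on the grid."""
    return np.sum(q * log_lik) - np.sum(q * np.log(q / prior))

def gaussian_on_grid(mean, std):
    """Gaussian shape evaluated on the grid and normalized to sum to one."""
    q = np.exp(-0.5 * ((z - mean) / std) ** 2)
    return q / q.sum()

print("log p(y)          :", log_evidence)
print("ELBO at posterior :", elbo(posterior))
print("ELBO at N(0.5, 1) :", elbo(gaussian_on_grid(0.5, 1.0)))
```

On the grid, the ELBO at the conditional distribution of $z$ given $y$ matches $\log p(y)$ up to floating-point error, while any other $q$ gives a smaller value. With $n$ data points, the same computation would be repeated for each $i$ and summed.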
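The maximization in (12) is over all probability densities $q$ of $z_1, \dots, z_n$. We have also seen (again in Lecture 36) that the maximum in (12) is achieved when $q$ is the conditional density of $z_1, \dots, z_n$ given $y_1, \dots, y_n$ and $\theta$:

$$q^*(z_1, \dots, z_n) = \prod_{i=1}^n q_i^*(z_i), \qquad q_i^*(z_i) = p_\theta(z_i \mid y_i) = \frac{p_\theta(y_i \mid z_i)\, \phi(z_i)}{\int p_\theta(y_i \mid u)\, \phi(u)\, du}.$$
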
However, because of the somewhat complicated form of $p_\theta(y_i \mid z_i)$, the density $q_i^*$ above is not of a simple form (for example, it is not a Gaussian). We therefore take a simpler class of densities $\mathcal{Q}$ and then maximize the ELBO over $\mathcal{Q}$. Specifically, we shall take $\mathcal{Q}$ to be the class of all product densities $\prod_{i=1}^n q_{i,\phi}(z_i \mid y_i)$ for which each marginal $q_\phi(z_i \mid y_i)$ is Gaussian: