STAT 238 - Bayesian Statistics Lecture Thirty Seven
Spring 2026, UC Berkeley
Variational Inference ¶ We looked at the basics of variational inference in the last lecture. Consider the standard Bayesian setup with data given by y y y and parameters given by θ \theta θ . The prior is f θ ( θ ) f_{\theta}(\theta) f θ ( θ ) and the likelihood is f y ∣ θ ( y ) f_{y \mid \theta}(y) f y ∣ θ ( y ) . The posterior is then:
f θ ∣ y ( θ ) = f θ ( θ ) f y ∣ θ ( y ) f y ( y ) \begin{align*}
f_{\theta \mid y}(\theta) = \frac{f_{\theta}(\theta) f_{y \mid
\theta}(y)}{f_y(y)}
\end{align*} f θ ∣ y ( θ ) = f y ( y ) f θ ( θ ) f y ∣ θ ( y ) where the denominator is the evidence f y ( y ) = ∫ f y ∣ θ ( y ) f θ ( θ ) d θ f_{y}(y) = \int f_{y \mid
\theta}(y) f_{\theta}(\theta) d\theta f y ( y ) = ∫ f y ∣ θ ( y ) f θ ( θ ) d θ .
The most important quantity in variational inference is the ELBO defined as:
ELBO ( q ) : = ∫ q ( θ ) log f y , θ ( y , θ ) q ( θ ) d θ \begin{align}
\text{ELBO}(q) := \int q(\theta) \log \frac{f_{y, \theta}(y,
\theta)}{q(\theta)} d\theta
\end{align} ELBO ( q ) := ∫ q ( θ ) log q ( θ ) f y , θ ( y , θ ) d θ where q q q is an arbitrary probability density. The key facts about the ELBO are:
ELBO ( q ) ≤ log f y ( y ) for every q . \begin{align}
\text{ELBO}(q) \leq \log f_{y}(y) ~~ \text{ for every } q.
\end{align} ELBO ( q ) ≤ log f y ( y ) for every q . and
max q ELBO ( q ) = log f y ( y ) with the maximum achieved at q ( θ ) = f θ ∣ y ( θ ) . \begin{align}
\max_{q} \text{ELBO}(q) = \log f_y(y) ~~ \text{ with the maximum
achieved at } q(\theta) = f_{\theta \mid y}(\theta).
\end{align} q max ELBO ( q ) = log f y ( y ) with the maximum achieved at q ( θ ) = f θ ∣ y ( θ ) . Both these facts are consequences of:
ELBO ( q ) = log f y ( y ) − K L ( q ∥ f θ ∣ y ) . \begin{align*}
\text{ELBO}(q) = \log f_{y}(y) - KL\left(q \| f_{\theta \mid y}
\right).
\end{align*} ELBO ( q ) = log f y ( y ) − K L ( q ∥ f θ ∣ y ) . Another useful fact about the ELBO (which follows directly from the definition (8) when f y , θ ( y , θ ) f_{y, \theta}(y, \theta) f y , θ ( y , θ ) is replaced by f y ∣ θ ( y ) f θ ( θ ) f_{y \mid \theta}(y) f_{\theta}(\theta) f y ∣ θ ( y ) f θ ( θ ) ) is:
ELBO ( q ) = ∫ q ( θ ) log f y ∣ θ ( y ) d θ − K L ( q ∥ f θ ) . \begin{align*}
\text{ELBO}(q) = \int q(\theta) \log f_{y \mid \theta}(y) d\theta -
KL(q \| f_{\theta}).
\end{align*} ELBO ( q ) = ∫ q ( θ ) log f y ∣ θ ( y ) d θ − K L ( q ∥ f θ ) . Connection to the EM Algorithm ¶ The relation max q ELBO ( q ) = log f y ( y ) \max_q \text{ELBO}(q) = \log f_{y}(y) max q ELBO ( q ) = log f y ( y ) (with the maximum being achieved at the posterior q ( θ ) = f θ ∣ y ( θ ) q(\theta) = f_{\theta \mid
y}(\theta) q ( θ ) = f θ ∣ y ( θ ) ) can be used to understand the EM algorithm.
Consider the same Bayesian setting as in the previous section, but suppose now that there is an additional hyperparameter α \alpha α , and that both the likelihood f y ∣ θ , α ( y ) f_{y \mid \theta, \alpha}(y) f y ∣ θ , α ( y ) as well as the prior f θ ∣ α ( θ ) f_{\theta \mid \alpha}(\theta) f θ ∣ α ( θ ) depend on α \alpha α .
The ELBO now becomes:
ELBO ( q , α ) = ∫ q ( θ ) log f y , θ ∣ α ( y , θ ) q ( θ ) d θ = ∫ q ( θ ) log f y ∣ θ , α ( y ) f θ ∣ α ( θ ) q ( θ ) d θ \begin{align*}
\text{ELBO}(q, \alpha) = \int q(\theta) \log \frac{f_{y, \theta \mid
\alpha}(y, \theta)}{q(\theta)} d\theta = \int q(\theta) \log
\frac{f_{y \mid \theta, \alpha}(y) f_{\theta \mid
\alpha}(\theta)}{q(\theta)} d\theta
\end{align*} ELBO ( q , α ) = ∫ q ( θ ) log q ( θ ) f y , θ ∣ α ( y , θ ) d θ = ∫ q ( θ ) log q ( θ ) f y ∣ θ , α ( y ) f θ ∣ α ( θ ) d θ and equation (4) now becomes:
max q ELBO ( q , α ) = log f y ∣ α ( y ) with the maximum achieved at q ( θ ) = f θ ∣ y , α ( θ ) . \begin{align}
\max_{q} \text{ELBO}(q, \alpha) = \log f_{y\mid \alpha}(y) ~~ \text{ with the maximum
achieved at } q(\theta) = f_{\theta \mid y, \alpha}(\theta).
\end{align} q max ELBO ( q , α ) = log f y ∣ α ( y ) with the maximum achieved at q ( θ ) = f θ ∣ y , α ( θ ) . Suppose now that we want to obtain the MLE α ^ \hat{\alpha} α ^ of α \alpha α i.e., we want to solve the optimization:
Maximize α log f y ∣ α ( y ) . \begin{align*}
\underset{\alpha}{\text{Maximize}}~ \log f_{y \mid \alpha}(y).
\end{align*} α Maximize log f y ∣ α ( y ) . Using (8) , we can rewrite the optimization above as:
Maximize α max q ELBO ( q , α ) , \begin{align*}
\underset{\alpha}{\text{Maximize}}~ \max_{q} \text{ELBO}(q, \alpha),
\end{align*} α Maximize q max ELBO ( q , α ) , which can be viewed as the two parameter maximization problem:
Maximize q , α ELBO ( q , α ) . \begin{align*}
\underset{q, \alpha}{\text{Maximize}} ~ \text{ELBO}(q, \alpha).
\end{align*} q , α Maximize ELBO ( q , α ) . It is natural to solve this two-parameter optimization alternatively. Fix α \alpha α and optimize over q q q , and then fix q q q and optimize over α \alpha α . This leads to the following algorithm.
Initialize α = α ( 0 ) \alpha = \alpha^{(0)} α = α ( 0 ) .
Repeat the following for t = 0 , 1 , 2 , … t = 0, 1, 2, \dots t = 0 , 1 , 2 , … (until some kind of convergence):
Take q ( t ) q^{(t)} q ( t ) to be the maximizer of ELBO ( q , α ( t ) ) \text{ELBO}(q, \alpha^{(t)}) ELBO ( q , α ( t ) ) over all q q q (for fixed α ( t ) \alpha^{(t)} α ( t ) )
Take α ( t + 1 ) \alpha^{(t+1)} α ( t + 1 ) to be the maximizer of ELBO ( q ( t ) , α ) \text{ELBO}(q^{(t)},
\alpha) ELBO ( q ( t ) , α ) over all α \alpha α .
By (8) , it is clear that q ( t ) q^{(t)} q ( t ) which maximizes ELBO ( q , α ( t ) ) \text{ELBO}(q, \alpha^{(t)}) ELBO ( q , α ( t ) ) is simply equal to:
q ( t ) = f θ ∣ y , α ( t ) . q^{(t)} = f_{\theta \mid y, \alpha^{(t)}}. q ( t ) = f θ ∣ y , α ( t ) . Thus the alternating minimization algorithm above is equivalent to:
Initialize α = α ( 0 ) \alpha = \alpha^{(0)} α = α ( 0 ) .
Repeat the following for t = 0 , 1 , 2 , … t = 0, 1, 2, \dots t = 0 , 1 , 2 , … (until some kind of convergence):
Take α ( t + 1 ) \alpha^{(t+1)} α ( t + 1 ) to be the maximizer of ELBO ( f θ , ∣ y , α ( t ) , α ) \text{ELBO}(f_{\theta, \mid y, \alpha^{(t)}},
\alpha) ELBO ( f θ , ∣ y , α ( t ) , α ) over all α \alpha α .
The quantity ELBO ( f θ , ∣ y , α ( t ) , α ) \text{ELBO}(f_{\theta, \mid y, \alpha^{(t)}},
\alpha) ELBO ( f θ , ∣ y , α ( t ) , α ) equals:
ELBO ( f θ , ∣ y , α ( t ) , α ) = ∫ f θ ∣ y , α ( t ) ( θ ) log f y , θ ∣ α ( y , θ ) f θ ∣ y , α ( t ) ( θ ) d θ = ∫ f θ ∣ y , α ( t ) ( θ ) log f y , θ ∣ α ( θ ) d θ − ∫ f θ ∣ y , α ( t ) ( θ ) log f θ ∣ y , α ( t ) ( θ ) d θ . \begin{align*}
\text{ELBO}(f_{\theta, \mid y, \alpha^{(t)}},
\alpha) &= \int f_{\theta \mid y, \alpha^{(t)}}(\theta) \log
\frac{f_{y, \theta \mid \alpha}(y, \theta)}{f_{\theta \mid
y, \alpha^{(t)}}(\theta)} d\theta \\
&= \int f_{\theta \mid y, \alpha^{(t)}}(\theta) \log f_{y, \theta
\mid \alpha}(\theta) d\theta - \int f_{\theta \mid y,
\alpha^{(t)}}(\theta) \log f_{\theta \mid y, \alpha^{(t)}}(\theta)
d\theta.
\end{align*} ELBO ( f θ , ∣ y , α ( t ) , α ) = ∫ f θ ∣ y , α ( t ) ( θ ) log f θ ∣ y , α ( t ) ( θ ) f y , θ ∣ α ( y , θ ) d θ = ∫ f θ ∣ y , α ( t ) ( θ ) log f y , θ ∣ α ( θ ) d θ − ∫ f θ ∣ y , α ( t ) ( θ ) log f θ ∣ y , α ( t ) ( θ ) d θ . The second term actually does not depend on α \alpha α , so maximizing ELBO ( f θ , ∣ y , α ( t ) , α ) \text{ELBO}(f_{\theta, \mid y, \alpha^{(t)}}, \alpha) ELBO ( f θ , ∣ y , α ( t ) , α ) over α \alpha α is equivalent to maximizing the first term above:
E ( t , α ) : = ∫ f θ ∣ y , α ( t ) ( θ ) log f y , θ ∣ α ( θ ) d θ . \begin{align}
E(t, \alpha) := \int f_{\theta \mid y, \alpha^{(t)}}(\theta) \log f_{y, \theta
\mid \alpha}(\theta) d\theta.
\end{align} E ( t , α ) := ∫ f θ ∣ y , α ( t ) ( θ ) log f y , θ ∣ α ( θ ) d θ . The alternating maximization algorithm above then becomes:
Initialize α = α ( 0 ) \alpha = \alpha^{(0)} α = α ( 0 ) .
Repeat the following for t = 0 , 1 , 2 , … t = 0, 1, 2, \dots t = 0 , 1 , 2 , … (until some kind of convergence):
Calculate E ( t , α ) E(t, \alpha) E ( t , α ) given in (13)
Take α ( t + 1 ) \alpha^{(t+1)} α ( t + 1 ) to be the maximizer of E ( t , α ) E(t, \alpha) E ( t , α ) over all α \alpha α .
This is the EM algorithm. The calculation of E ( t , α ) E(t, \alpha) E ( t , α ) is called the E E E -step and the maximization of E ( t , α ) E(t, \alpha) E ( t , α ) over α \alpha α is called the M M M -step. For more on this connection between Variational Inference (VI) and EM, see neal1998view .
Coordinatewise Variational Inference ¶ The key problem that we need to solve in VI is that of maximizing ELBO ( q ) \text{ELBO}(q) ELBO ( q ) . The full maximizer of ELBO ( q ) \text{ELBO}(q) ELBO ( q ) over q q q is, as already mentioned, f θ ∣ y f_{\theta \mid y} f θ ∣ y . But this posterior is intractable in general and the very reason for using variational inerference is this intractability. The strategy is to restrict the class of q q q over which we look for minimizers. This hopefully will lead to a tractable minimizer which is close to the actual posterior f θ ∣ y f_{\theta \mid y} f θ ∣ y .
Suppose θ \theta θ can be decomposed into k k k separate classes of parameters as θ = ( θ 1 , … , θ k ) \theta = (\theta_1, \dots, \theta_k) θ = ( θ 1 , … , θ k ) . We look for variational approximations of the form q ( θ ) = q 1 ( θ 1 ) … q k ( θ k ) q(\theta) = q_1(\theta_1)
\dots q_k(\theta_k) q ( θ ) = q 1 ( θ 1 ) … q k ( θ k ) . The ELBO maximization problem is:
maximize q 1 , … , q k ELBO ( q 1 , … , q k ) = ∫ ⋯ ∫ q 1 ( θ 1 ) … q k ( θ k ) log f y , θ ( y , θ ) q 1 ( θ 1 ) … q k ( θ k ) d θ . \underset{q_1, \dots, q_k}{\text{maximize}}~ \text{ELBO}(q_1, \dots,
q_k) = \int \dots \int
q_1(\theta_1) \dots q_k(\theta_k) \log \frac{f_{y, \theta}(y, \theta)}{q_1(\theta_1) \dots
q_k(\theta_k)} d\theta. q 1 , … , q k maximize ELBO ( q 1 , … , q k ) = ∫ ⋯ ∫ q 1 ( θ 1 ) … q k ( θ k ) log q 1 ( θ 1 ) … q k ( θ k ) f y , θ ( y , θ ) d θ . Alternating minimization leads to the following result.
Consider the maximization problem (14) . Fix j = 1 , … , k j = 1,
\dots, k j = 1 , … , k and fix q i , i ≠ j q_i, i \neq j q i , i = j . Then the minimizing density q j = q j ∗ q_j
= q_j^* q j = q j ∗ is given by
q j ∗ ( θ j ) ∝ exp [ ∫ q − j ( θ − j ) log f θ j ∣ y , θ − j ( θ j ) d θ − j ] . q_j^*(\theta_j) \propto \exp \left[\int q_{-j}(\theta_{-j}) \log
f_{\theta_j \mid y, \theta_{-j}}(\theta_j) d\theta_{-j}
\right]. q j ∗ ( θ j ) ∝ exp [ ∫ q − j ( θ − j ) log f θ j ∣ y , θ − j ( θ j ) d θ − j ] . where q − j ( θ j ) q_{-j}(\theta_j) q − j ( θ j ) denotes the product of q l ( θ l ) q_{l}(\theta_l) q l ( θ l ) over all l ≠ j l \neq j l = j and d θ j d\theta_j d θ j is the product of d θ l d\theta_l d θ l over all l ≠ j l \neq j l = j .
The CAVI update equation (15) is equivalent to the following:
log q j ∗ ( θ j ) = constant + E θ − j log f y , θ 1 , … , θ k ( y , θ 1 , … , θ k ) \begin{align}
\log q_j^*(\theta_j) = \text{constant} + \E_{\theta_{-j}} \log f_{y,
\theta_1, \dots, \theta_k}(y, \theta_1, \dots, \theta_k)
\end{align} log q j ∗ ( θ j ) = constant + E θ − j log f y , θ 1 , … , θ k ( y , θ 1 , … , θ k ) where the expectation is taken with respect to θ − j ∼ q − j \theta_{-j} \sim
q_{-j} θ − j ∼ q − j .
Proposition 1 is a consequence of the following simple fact.
Suppose g g g is a nonnegative function that is not necessarily a density function. Then the minimizer of
L ( f ) : = ∫ f log f g L(f) := \int f \log \frac{f}{g} L ( f ) := ∫ f log g f over all densities f f f is given by
f ∗ ( x ) ∝ g ( x ) f^*(x) \propto g(x) f ∗ ( x ) ∝ g ( x ) which of course means that f ∗ = g / ( ∫ g ) f^* = g/(\int g) f ∗ = g / ( ∫ g ) .
Suppose f 2 ( ⋅ ) f_2(\cdot) f 2 ( ⋅ ) below is a fixed probability density function, and g ( x 1 , x 2 ) g(x_1, x_2) g ( x 1 , x 2 ) is a fixed nonnegative function that is not necessarily a density. Then the minimizer of
Γ ( f 1 ) : = ∫ ∫ f 1 ( x 1 ) f 2 ( x 2 ) log f 1 ( x 1 ) f 2 ( x 2 ) g ( x 1 , x 2 ) d x 1 d x 2 \Gamma(f_1) := \int \int f_1(x_1) f_2(x_2) \log
\frac{f_1(x_1)f_2(x_2)}{g(x_1, x_2)} dx_1 dx_2 Γ ( f 1 ) := ∫∫ f 1 ( x 1 ) f 2 ( x 2 ) log g ( x 1 , x 2 ) f 1 ( x 1 ) f 2 ( x 2 ) d x 1 d x 2 over all densities f 1 f_1 f 1 is given by
f 1 ∗ ( x 1 ) ∝ exp ( ∫ f 2 ( x 2 ) log g ( x 1 , x 2 ) d x 2 ) . f_1^*(x_1) \propto \exp \left( \int f_2(x_2) \log g(x_1, x_2)
dx_2
\right). f 1 ∗ ( x 1 ) ∝ exp ( ∫ f 2 ( x 2 ) log g ( x 1 , x 2 ) d x 2 ) . Simple CAVI Example ¶ As a simple illustration of how CAVI works, let us consider the problem of estimating μ , σ \mu, \sigma μ , σ from observations y 1 , … , y n y_1,
\dots, y_n y 1 , … , y n that are i.i.d N ( μ , σ 2 ) N(\mu, \sigma^2) N ( μ , σ 2 ) . As usual, we use the prior
μ , log σ ∼ i.i.d uniform ( − ∞ , ∞ ) . \begin{align*}
\mu, \log \sigma \overset{\text{i.i.d}}{\sim}
\text{uniform}(-\infty, \infty).
\end{align*} μ , log σ ∼ i.i.d uniform ( − ∞ , ∞ ) . In other words, the prior is f μ , σ 2 ( μ , σ 2 ) = 1 σ 2 f_{\mu, \sigma^2}(\mu,
\sigma^2)=\frac{1}{\sigma^2} f μ , σ 2 ( μ , σ 2 ) = σ 2 1 (note that we are treating σ 2 \sigma^2 σ 2 and not σ \sigma σ as the parameter).
The joint density of the data y = ( y 1 , … , y n ) y = (y_1, \dots, y_n) y = ( y 1 , … , y n ) and the parameter θ = ( μ , σ 2 ) \theta = (\mu, \sigma^2) θ = ( μ , σ 2 ) is given by:
f y , θ ( y , θ ) = ( ∏ i = 1 n 1 2 π σ exp ( − ( y i − μ ) 2 2 σ 2 ) ) 1 σ 2 \begin{align*}
f_{y, \theta}(y, \theta) = \left(\prod_{i=1}^n \frac{1}{\sqrt{2 \pi}
\sigma} \exp \left(-\frac{(y_i - \mu)^2}{2 \sigma^2} \right) \right)
\frac{1}{\sigma^2}
\end{align*} f y , θ ( y , θ ) = ( i = 1 ∏ n 2 π σ 1 exp ( − 2 σ 2 ( y i − μ ) 2 ) ) σ 2 1 so that
log f y , θ ( y , θ ) = − ∑ i = 1 n ( y i − μ ) 2 2 σ 2 − ( n 2 + 1 ) log σ 2 + constant = − n 2 σ 2 ( y ˉ − μ ) 2 − 1 2 σ 2 ∑ i = 1 n ( y i − y ˉ ) 2 − ( n 2 + 1 ) log σ 2 + constant . \begin{align*}
\log f_{y, \theta}(y, \theta) &= -\sum_{i=1}^n \frac{(y_i - \mu)^2}{2
\sigma^2} - \left(\frac{n}{2} + 1 \right) \log \sigma^2 +
\text{constant} \\
&= -\frac{n}{2 \sigma^2} (\bar{y} - \mu)^2 - \frac{1}{2 \sigma^2}
\sum_{i=1}^n (y_i - \bar{y})^2 - \left(\frac{n}{2} + 1 \right)
\log \sigma^2 + \text{constant}.
\end{align*} log f y , θ ( y , θ ) = − i = 1 ∑ n 2 σ 2 ( y i − μ ) 2 − ( 2 n + 1 ) log σ 2 + constant = − 2 σ 2 n ( y ˉ − μ ) 2 − 2 σ 2 1 i = 1 ∑ n ( y i − y ˉ ) 2 − ( 2 n + 1 ) log σ 2 + constant . CAVI approximates the posterior of θ = ( μ , σ 2 ) \theta = (\mu, \sigma^2) θ = ( μ , σ 2 ) by a product distribution q μ ( μ ) q σ 2 ( σ 2 ) q_{\mu}(\mu) q_{\sigma^2}(\sigma^2) q μ ( μ ) q σ 2 ( σ 2 ) . Below we figure out how to update q μ q_{\mu} q μ given q σ 2 q_{\sigma^2} q σ 2 and also q σ 2 q_{\sigma^2} q σ 2 given q μ q_{\mu} q μ .
By (16) , given q σ 2 q_{\sigma^2} q σ 2 , we get
log q μ ∗ ( μ ) = − n 2 ( y ˉ − μ ) 2 E q σ 2 ( 1 σ 2 ) + constant , \begin{align*}
\log q^*_{\mu}(\mu) = -\frac{n}{2} (\bar{y} - \mu)^2 \E_{q_{\sigma^2}}
\left(\frac{1}{\sigma^2} \right) + \text{constant},
\end{align*} log q μ ∗ ( μ ) = − 2 n ( y ˉ − μ ) 2 E q σ 2 ( σ 2 1 ) + constant , which means that q μ ∗ q^*_{\mu} q μ ∗ is the density of the normal distribution: with mean μ \mu μ and variance
q μ ∗ = density of N ( y ˉ , 1 n E q σ 2 ( 1 / σ 2 ) ) . \begin{align}
q^*_{\mu} = \text{ density of } N\left(\bar{y}, \frac{1}{n
\E_{q_{\sigma^2}}(1/\sigma^2)} \right).
\end{align} q μ ∗ = density of N ( y ˉ , n E q σ 2 ( 1/ σ 2 ) 1 ) . To get q σ 2 ∗ ( σ 2 ) q^*_{\sigma^2}(\sigma^2) q σ 2 ∗ ( σ 2 ) for fixed q μ q_{\mu} q μ , we again use (16) to get
log q σ 2 ∗ ( σ 2 ) = − 1 2 σ 2 ( n E q μ ( y ˉ − μ ) 2 + ∑ i = 1 n ( y i − y ˉ ) 2 ) − ( n 2 + 1 ) log σ 2 + constant . \begin{align*}
\log q^*_{\sigma^2}(\sigma^2) = -\frac{1}{2 \sigma^2} \left(n \E_{q_{\mu}}
(\bar{y} - \mu)^2 + \sum_{i=1}^n (y_i - \bar{y})^2 \right) -
\left(\frac{n}{2} + 1 \right) \log \sigma^2 + \text{constant}.
\end{align*} log q σ 2 ∗ ( σ 2 ) = − 2 σ 2 1 ( n E q μ ( y ˉ − μ ) 2 + i = 1 ∑ n ( y i − y ˉ ) 2 ) − ( 2 n + 1 ) log σ 2 + constant . This has the form of the Inverse Gamma density I G ( a , b ) IG(a,
b) I G ( a , b ) given by:
− b σ 2 − ( a + 1 ) log σ 2 . \begin{align*}
-\frac{b}{\sigma^2} - (a + 1) \log \sigma^2.
\end{align*} − σ 2 b − ( a + 1 ) log σ 2 . Comparing terms, we get
q σ 2 ∗ = density of I G ( n 2 , n 2 E q μ ( y ˉ − μ ) 2 + 1 2 ∑ i = 1 n ( y i − y ˉ ) 2 ) . \begin{align}
q^*_{\sigma^2} = \text{ density of } IG \left(\frac{n}{2}, \frac{n}{2}
\E_{q_{\mu}} (\bar{y} - \mu)^2 + \frac{1}{2} \sum_{i=1}^n (y_i - \bar{y})^2
\right).
\end{align} q σ 2 ∗ = density of I G ( 2 n , 2 n E q μ ( y ˉ − μ ) 2 + 2 1 i = 1 ∑ n ( y i − y ˉ ) 2 ) . (23) and (26) give the following CAVI algorithm:
Initialize with some density q σ 2 ( 0 ) q^{(0)}_{\sigma^2} q σ 2 ( 0 ) of σ 2 \sigma^2 σ 2 . Suppose s 0 : = ( n E q σ 2 ( 0 ) ( 1 / σ 2 ) ) − 1 s_0 := \left(n
\E_{q^{(0)}_{\sigma^2}}(1/\sigma^2) \right)^{-1} s 0 := ( n E q σ 2 ( 0 ) ( 1/ σ 2 ) ) − 1 .
Repeat the following for k = 0 , 1 , 2 , … k = 0, 1, 2, \dots k = 0 , 1 , 2 , … :
Take q μ ( k + 1 ) q_{\mu}^{(k+1)} q μ ( k + 1 ) to be the density of N ( y ˉ , s k 2 ) N(\bar{y},
s_k^2) N ( y ˉ , s k 2 ) .
Take q σ 2 ( k + 1 ) q_{\sigma^2}^{(k+1)} q σ 2 ( k + 1 ) to be the density of
I G ( n 2 , n 2 E q μ ( k + 1 ) ( y ˉ − μ ) 2 + 1 2 ∑ i = 1 n ( y i − y ˉ ) 2 ) = I G ( n 2 , n 2 s k 2 + 1 2 ∑ i = 1 n ( y i − y ˉ ) 2 ) \begin{align*}
IG \left(\frac{n}{2}, \frac{n}{2}\E_{q^{(k+1)}_{\mu}} (\bar{y} -
\mu)^2 + \frac{1}{2} \sum_{i=1}^n (y_i - \bar{y})^2 \right) =
IG \left(\frac{n}{2}, \frac{n}{2}s_k^2 + \frac{1}{2}
\sum_{i=1}^n (y_i - \bar{y})^2 \right)
\end{align*} I G ( 2 n , 2 n E q μ ( k + 1 ) ( y ˉ − μ ) 2 + 2 1 i = 1 ∑ n ( y i − y ˉ ) 2 ) = I G ( 2 n , 2 n s k 2 + 2 1 i = 1 ∑ n ( y i − y ˉ ) 2 ) Take s k + 1 2 = ( n E q σ 2 ( k + 1 ) ( 1 / σ 2 ) ) − 1 s_{k+1}^2 = \left(n \E_{q_{\sigma^2}^{(k+1)}}
(1/\sigma^2) \right)^{-1} s k + 1 2 = ( n E q σ 2 ( k + 1 ) ( 1/ σ 2 ) ) − 1 which leads to (using the fact E σ − 2 = a / b \E \sigma^{-2} = a/b E σ − 2 = a / b when σ 2 ∼ I G ( a , b ) \sigma^2 \sim IG(a, b) σ 2 ∼ I G ( a , b ) )
s k + 1 2 = s k 2 n + 1 n 2 ∑ i = 1 n ( y i − y ˉ ) 2 . \begin{align}
s_{k+1}^2 = \frac{s_k^2}{n} + \frac{1}{n^2} \sum_{i=1}^n (y_i -
\bar{y})^2.
\end{align} s k + 1 2 = n s k 2 + n 2 1 i = 1 ∑ n ( y i − y ˉ ) 2 . The limit of the iteration for { s k 2 } \{s_k^2\} { s k 2 } in (28) can be computed in closed form. Indeed if s 2 = lim k → ∞ s k 2 s^2 =
\lim_{k \rightarrow \infty} s_k^2 s 2 = lim k → ∞ s k 2 , then (28) gives
s 2 = s 2 n + 1 n 2 ∑ i = 1 n ( y i − y ˉ ) 2 ⟹ s 2 = 1 n ( n − 1 ) ∑ i = 1 n ( y i − y ˉ ) 2 . \begin{align*}
s^2 = \frac{s^2}{n} + \frac{1}{n^2} \sum_{i=1}^n (y_i -
\bar{y})^2 \implies s^2 = \frac{1}{n(n-1)} \sum_{i=1}^n (y_i -
\bar{y})^2.
\end{align*} s 2 = n s 2 + n 2 1 i = 1 ∑ n ( y i − y ˉ ) 2 ⟹ s 2 = n ( n − 1 ) 1 i = 1 ∑ n ( y i − y ˉ ) 2 . Thus the VI solution q μ ∗ ( μ ) q σ 2 ∗ ( σ 2 ) q^*_{\mu}(\mu) q^*_{\sigma^2}(\sigma^2) q μ ∗ ( μ ) q σ 2 ∗ ( σ 2 ) is given by:
q μ ∗ = density of N ( y ˉ , 1 n ( n − 1 ) ∑ i = 1 n ( y i − y ˉ ) 2 ) \begin{align}
q_{\mu}^* = \text{ density of } N \left(\bar{y}, \frac{1}{n(n-1)}
\sum_{i=1}^n (y_i - \bar{y})^2 \right)
\end{align} q μ ∗ = density of N ( y ˉ , n ( n − 1 ) 1 i = 1 ∑ n ( y i − y ˉ ) 2 ) and
q σ 2 ∗ = density of I G ( n 2 , 1 2 ( n − 1 ) ∑ i = 1 n ( y i − y ˉ ) 2 + 1 2 ∑ i = 1 n ( y i − y ˉ ) 2 ) = density of I G ( n 2 , n 2 ( n − 1 ) ∑ i = 1 n ( y i − y ˉ ) 2 ) \begin{align}
q_{\sigma^2}^* &= \text{ density of } IG\left(\frac{n}{2},
\frac{1}{2(n-1)} \sum_{i=1}^n (y_i - \bar{y})^2 + \frac{1}{2}
\sum_{i=1}^n (y_i - \bar{y})^2 \right) \nonumber \\
&= \text{ density of } IG\left(\frac{n}{2},
\frac{n}{2(n-1)} \sum_{i=1}^n (y_i - \bar{y})^2 \right)
\end{align} q σ 2 ∗ = density of I G ( 2 n , 2 ( n − 1 ) 1 i = 1 ∑ n ( y i − y ˉ ) 2 + 2 1 i = 1 ∑ n ( y i − y ˉ ) 2 ) = density of I G ( 2 n , 2 ( n − 1 ) n i = 1 ∑ n ( y i − y ˉ ) 2 ) We can compare these VI marginals to those of the exact posterior which we know is given by:
μ ∣ data ∼ t n − 1 ( y ˉ , 1 n ( n − 1 ) ∑ i = 1 n ( y i − y ˉ ) 2 ) and σ 2 ∣ data ∼ I G ( n − 1 2 , 1 2 ∑ i = 1 n ( y i − y ˉ ) 2 ) . \begin{align*}
\mu \mid \text{data} \sim t_{n-1} \left(\bar{y}, \frac{1}{n(n-1)}
\sum_{i=1}^n (y_i - \bar{y})^2 \right) ~~ \text{ and } ~~ \sigma^2
\mid \text{data} \sim IG \left(\frac{n-1}{2}, \frac{1}{2}
\sum_{i=1}^n (y_i - \bar{y})^2 \right).
\end{align*} μ ∣ data ∼ t n − 1 ( y ˉ , n ( n − 1 ) 1 i = 1 ∑ n ( y i − y ˉ ) 2 ) and σ 2 ∣ data ∼ I G ( 2 n − 1 , 2 1 i = 1 ∑ n ( y i − y ˉ ) 2 ) . Comparing the above to (30) , we see that the VI marginal posterior has a narrower normal density instead of t n − 1 t_{n-1} t n − 1 . Comparing the above to (31) , we see that the parameters of the IG distribution are changed slightly (the changes are negligible for large n n n ). Note also that μ \mu μ and σ 2 \sigma^2 σ 2 are independent in the actual posterior but this VI analysis assumes they are independent.