
STAT 238 - Bayesian Statistics Lecture Eleven

Spring 2026, UC Berkeley

Bayesian Inference with Normal Likelihoods

If the data is $y$ and the parameter is $\theta$, the likelihood is given by:

\begin{align*} y \mid \theta \sim N(\theta, \sigma^2) \end{align*}

for a fixed $\sigma > 0$. We will look at the case of unknown $\sigma$ later. The frequentist estimator of $\theta$ is the MLE $y$. For Bayesian inference, we use the normal $N(\mu, \tau^2)$ prior on $\theta$. The basic fact is:

\begin{align*} \theta \sim N(\mu, \tau^2) ~ \text{ and } ~ y \mid \theta \sim N(\theta, \sigma^2) &\implies \theta \mid y \sim N \left(\frac{y/\sigma^2 + \mu/\tau^2}{1/\sigma^2 + 1/\tau^2}, \frac{1}{1/\sigma^2 + 1/\tau^2} \right) \\ &\qquad \text{ and } ~ y \sim N(\mu, \sigma^2 + \tau^2). \end{align*}

The mean of the posterior distribution is thus

\begin{align*} \frac{y/\sigma^2 + \mu/\tau^2}{1/\sigma^2 + 1/\tau^2} = \frac{1/\sigma^2}{1/\sigma^2 + 1/\tau^2}\, y + \frac{1/\tau^2}{1/\sigma^2 + 1/\tau^2}\, \mu \end{align*}

which is a weighted linear combination of the prior mean $\mu$ and the data $y$. The weights are inversely proportional to the variances of the corresponding normal distributions. For a normal distribution, the term ``precision'' denotes the inverse of the variance. Thus the weights in the linear combination above are proportional to the precisions of the prior and the likelihood.

Also note that the precision of the posterior equals the sum of the prior and likelihood precisions.

Note that when $\tau^2 \rightarrow \infty$ and $\mu$ is a fixed constant, we have $\theta \mid y \sim N(y, \sigma^2)$ (in this case, the posterior mean coincides with the MLE $y$). Operationally, the $N(\mu, \tau^2)$ prior with fixed $\mu$ and $\tau^2 = +\infty$ has the same behavior as the $\text{uniform}(-\infty, \infty)$ prior. These are uninformative priors in this problem.
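The precision-weighting update above can be sketched in a few lines of Python (the function name `normal_posterior` is hypothetical, not from the course materials):

```python
def normal_posterior(y, sigma2, mu, tau2):
    """Posterior of theta for one observation y ~ N(theta, sigma2)
    with prior theta ~ N(mu, tau2). Returns (mean, variance)."""
    # Precisions (inverse variances) of likelihood and prior.
    prec_lik, prec_prior = 1.0 / sigma2, 1.0 / tau2
    # Posterior precision is the sum of the two precisions.
    post_var = 1.0 / (prec_lik + prec_prior)
    # Posterior mean is the precision-weighted average of y and mu.
    post_mean = post_var * (y * prec_lik + mu * prec_prior)
    return post_mean, post_var

# With equal precisions, the posterior mean is halfway between y and mu.
mean, var = normal_posterior(y=2.0, sigma2=1.0, mu=0.0, tau2=1.0)
# With a very diffuse prior (huge tau2), the posterior approaches N(y, sigma2).
mean_flat, var_flat = normal_posterior(y=2.0, sigma2=1.0, mu=0.0, tau2=1e8)
```

Note how making $\tau^2$ enormous recovers the uninformative-prior behavior numerically.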

Often the data will consist of multiple numbers $y_1, \dots, y_n$ with the likelihood being:

\begin{align*} y_1, \dots, y_n \overset{\text{i.i.d}}{\sim} N(\theta, \sigma^2). \end{align*}

In this case (for the same prior $\theta \sim N(\mu, \tau^2)$), the posterior is

\begin{align*} \theta \mid y_1, \dots, y_n \sim N \left(\frac{n\bar{y}/\sigma^2 + \mu/\tau^2}{n/\sigma^2 + 1/\tau^2}, \frac{1}{n/\sigma^2 + 1/\tau^2} \right) \end{align*}

where $\bar{y} := (y_1 + \dots + y_n)/n$. This can be proved directly, or by using the fact that $\bar{y}$ is sufficient for $\theta$, so the data can be reduced to the single number $\bar{y}$ with likelihood $N(\theta, \sigma^2/n)$.
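As a quick numerical check (with simulated data; all numbers here are made up), the posterior computed from all $n$ observations agrees with the posterior computed from the reduced statistic $\bar{y}$ with likelihood $N(\theta, \sigma^2/n)$:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=1.5, scale=2.0, size=50)   # hypothetical data
sigma2, mu, tau2 = 4.0, 0.0, 1.0
n, ybar = len(y), y.mean()

# Posterior using all n observations.
post_var = 1.0 / (n / sigma2 + 1.0 / tau2)
post_mean = post_var * (n * ybar / sigma2 + mu / tau2)

# Same posterior via the sufficient statistic ybar ~ N(theta, sigma2 / n).
post_var_red = 1.0 / (1.0 / (sigma2 / n) + 1.0 / tau2)
post_mean_red = post_var_red * (ybar / (sigma2 / n) + mu / tau2)

assert np.isclose(post_mean, post_mean_red)
assert np.isclose(post_var, post_var_red)
```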

Multiple Instances of the Same Problem

Consider the problem of estimating $\theta_1, \dots, \theta_N$ from data $y_1, \dots, y_N$ under the likelihood:

\begin{align*} y_i \overset{\text{ind}}{\sim} N(\theta_i, \sigma^2) \end{align*}

for a fixed $\sigma^2$. We further assume:

\begin{align*} \theta_i \overset{\text{i.i.d}}{\sim} N(\mu, \tau^2). \end{align*}

If we fix $\mu$ to be some constant value (say $\mu = 0$) and $\tau^2$ to be very large, then the estimate of $\theta_i$ would be equal to $y_i$.

Instead of fixing these values, we can treat $\mu$ and $\tau^2$ also as unknown parameters and attempt to estimate them from the observed data. Marginalizing over $\theta_i$, it is easy to see that

\begin{align*} y_i \mid \mu, \tau \overset{\text{i.i.d}}{\sim} N(\mu, \sigma^2 + \tau^2). \end{align*}

One can estimate $\mu$ and $\tau$ from this model by some estimates $\hat{\mu}$ and $\hat{\tau}$; $\theta_i$ is then estimated by $\E(\theta_i \mid y_i, \mu = \hat{\mu}, \tau = \hat{\tau})$. How do we obtain $\hat{\mu}$ and $\hat{\tau}$? It is natural to take

\begin{align*} \hat{\mu} = \frac{y_1 + \dots + y_N}{N}. \end{align*}

The estimate of $\tau$ should be based on the sum of squared deviations, whose distribution is:

\begin{align*} \sum_{i=1}^N (y_i - \bar{y})^2 \sim \left(\sigma^2 + \tau^2\right) \chi^2_{N-1}. \end{align*}

The formula for $\E(\theta_i \mid y_i, \mu, \tau)$ is:

\begin{align*} \E(\theta_i \mid y_i, \mu, \tau) = \frac{y_i/\sigma^2 + \mu/\tau^2}{1/\sigma^2 + 1/\tau^2} = \mu + \left(1 - \frac{\sigma^2}{\sigma^2 + \tau^2} \right) (y_i - \mu). \end{align*}
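The second equality can be verified by multiplying the numerator and denominator by $\sigma^2 \tau^2$ and regrouping:

\begin{align*}
\frac{y_i/\sigma^2 + \mu/\tau^2}{1/\sigma^2 + 1/\tau^2}
= \frac{\tau^2 y_i + \sigma^2 \mu}{\sigma^2 + \tau^2}
= \mu + \frac{\tau^2}{\sigma^2 + \tau^2}\,(y_i - \mu)
= \mu + \left(1 - \frac{\sigma^2}{\sigma^2 + \tau^2}\right)(y_i - \mu).
\end{align*}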

So we need to estimate $1/(\sigma^2 + \tau^2)$, as opposed to $\tau^2$ directly. The following fact:

\begin{align*}
  \E \left(\frac{1}{\chi^2_{v}} \right) = \frac{1}{v-2} \quad \text{for } v > 2
\end{align*}

shows that

\begin{align*} \E \left(\frac{1}{\sum_{i=1}^N (y_i - \bar{y})^2} \right) = \frac{1}{\sigma^2 + \tau^2}\, \E \left(\frac{1}{\chi^2_{N-1}} \right) = \frac{1}{(N-3)(\sigma^2 + \tau^2)}. \end{align*}

So we use

\begin{align} \text{estimate of } \frac{1}{\sigma^2 + \tau^2} = \frac{N-3}{\sum_{i=1}^N (y_i - \bar{y})^2}. \end{align}

The estimate of $\tau^2$ implied by the display above can be nonpositive (depending on the data and $\sigma^2$); we shall ignore this issue, however.
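The inverse chi-squared moment $\E(1/\chi^2_v) = 1/(v-2)$ used above can be checked by Monte Carlo simulation (a sketch; the choice of $v$, sample size, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
v = 10
draws = rng.chisquare(df=v, size=1_000_000)

# Monte Carlo estimate of E(1 / chi^2_v) versus the exact value 1/(v-2).
mc_mean = (1.0 / draws).mean()
exact = 1.0 / (v - 2)   # = 0.125 for v = 10
```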

The estimates of $\theta_i$ are therefore given by:

\begin{align*} \hat{\theta}_i^{\text{JS}} := \bar{y} + \left(1 - \frac{(N-3)\, \sigma^2}{\sum_{i=1}^N (y_i - \bar{y})^2} \right) \left(y_i - \bar{y} \right). \end{align*}

This is the James-Stein estimator.
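A minimal sketch of this estimator in Python (the function name `james_stein` is hypothetical; the example numbers are made up):

```python
import numpy as np

def james_stein(y, sigma2):
    """James-Stein estimates shrinking each y_i toward the grand mean ybar,
    for y of length N > 3 and known observation variance sigma2."""
    y = np.asarray(y, dtype=float)
    N = len(y)
    ybar = y.mean()
    ss = np.sum((y - ybar) ** 2)
    # Shrinkage factor; can be negative for some data, which we ignore
    # here just as the notes do.
    shrink = 1.0 - (N - 3) * sigma2 / ss
    return ybar + shrink * (y - ybar)

# N = 5, ybar = 2, sum of squares = 10, so the shrinkage factor is 0.8:
theta_hat = james_stein([0.0, 1.0, 2.0, 3.0, 4.0], sigma2=1.0)
# theta_hat is [0.4, 1.2, 2.0, 2.8, 3.6]: every y_i moves toward ybar.
```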

The James-Stein estimator can thus be seen as an empirical Bayes procedure. It has the following remarkable frequentist property: its risk is uniformly better than that of the naive estimator $\hat{\theta}_{i, \text{naive}} = y_i$ in mean squared error, for all values of $\theta_1, \dots, \theta_N$ (this is true for $N \geq 3$; although we only defined it for $N > 3$, one can create a more naive version of James-Stein replacing $\bar{y}$ by any fixed constant like 0, and this would work also for $N = 3$). This property is beyond the scope of this class (and also irrelevant to us, as it is a frequentist property).

Let us now present the full Bayes estimate. This requires specifying (uninformative) priors on $\mu$ and $\tau$. For $\mu$, we use the $\text{uniform}(-\infty, \infty)$ prior. For $\tau$, we use the fact that the marginal distribution of the data given $\mu, \tau$ is $y_i \mid \mu, \tau \overset{\text{i.i.d}}{\sim} N(\mu, \sigma^2 + \tau^2)$, which is in terms of the parameter $\gamma^2 := \sigma^2 + \tau^2$. For the $N(\mu, \gamma^2)$ model, the standard uninformative prior for $\gamma$ is

\begin{align*} \log \gamma \sim \text{uniform}(-\infty, \infty) ~~ \text{ or } ~~ f_{\gamma}(\gamma) \propto \frac{I\{\gamma > 0\}}{\gamma}. \end{align*}

Now we have the additional information that $\gamma > \sigma$ (because $\gamma^2 = \sigma^2 + \tau^2$), so it is natural to modify the above prior to:

\begin{align*} f_{\gamma}(\gamma) \propto \frac{I\{\gamma > \sigma\}}{\gamma}. \end{align*}

It is easy to check that this prior on $\gamma$ leads to the following prior on $\tau = \sqrt{\gamma^2 - \sigma^2}$:

\begin{align*} f_{\tau}(\tau) \propto \frac{\tau}{\sigma^2 + \tau^2}\, I\{\tau > 0\}. \end{align*}

To recap, we are using the following prior for $\mu$ and $\tau$:

\begin{align*} \mu \sim \text{uniform}(-\infty, \infty) ~~ \text{ and } ~~ f_{\tau}(\tau) \propto \frac{\tau}{\sigma^2 + \tau^2}\, I\{\tau > 0\}. \end{align*}

We will combine this with the likelihood $y_i \mid \mu, \tau \overset{\text{i.i.d}}{\sim} N(\mu, \sigma^2 + \tau^2)$ to obtain the posterior of $\mu, \tau$. It can be checked that:

\begin{align*} \mu \mid \gamma, y_1, \dots, y_N \sim N\left(\bar{y}, \gamma^2/N \right) ~ \text{ and } ~ f_{\gamma \mid y_1, \dots, y_N}(\gamma) \propto \gamma^{-N} \exp \left(-\frac{\sum_{i=1}^N (y_i - \bar{y})^2}{2 \gamma^2} \right) I\{\gamma > \sigma\}. \end{align*}
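One simple way to simulate from this posterior (a sketch only, not necessarily the method used in the lab): discretize $\gamma$ on a grid over $(\sigma, \infty)$, draw $\gamma$ with probabilities proportional to the density above, then draw $\mu$ from its conditional normal. All data values and grid settings below are made up.

```python
import numpy as np

rng = np.random.default_rng(2026)

# Hypothetical data with known sigma.
sigma = 1.0
y = rng.normal(0.0, 2.0, size=30)
N, ybar = len(y), y.mean()
ss = np.sum((y - ybar) ** 2)

# Unnormalized log posterior of gamma, evaluated on a grid over (sigma, 10].
gamma_grid = np.linspace(sigma + 1e-6, 10.0, 5000)
log_post = -N * np.log(gamma_grid) - ss / (2 * gamma_grid ** 2)
weights = np.exp(log_post - log_post.max())   # stabilize before exponentiating
weights /= weights.sum()

# Draw gamma from the grid, then mu | gamma, y ~ N(ybar, gamma^2 / N).
gamma_draws = rng.choice(gamma_grid, size=4000, p=weights)
mu_draws = rng.normal(ybar, gamma_draws / np.sqrt(N))
```

The grid upper limit must be wide enough to cover essentially all the posterior mass of $\gamma$; in practice one checks that the weights near the boundary are negligible.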

You will see how to compute these posteriors given data in the next lab.