
STAT 238 - Bayesian Statistics Lecture Three Spring 2026, UC Berkeley

In the last lecture, we started discussing this example.

Example 3: Inference from measurements

Suppose a scientist makes 6 numerical measurements 26.6, 38.5, 34.4, 34, 31, 23.6 of an unknown real-valued physical quantity $\theta$. On the basis of these measurements, what can be inferred about $\theta$?

Here is the Bayesian solution to this problem. The first step is modeling, where we have to write down the likelihood and the prior. The likelihood represents the probability of the observed data conditional on parameter values. Here the main parameter is $\theta$. In order to write the probability of the observed data, it is helpful to introduce another parameter $\sigma$, which represents the scale of the noise inherent in the measurement process.

So our parameter vector is $(\theta, \sigma)$. We work with the normal likelihood:

$$\text{Likelihood} = \prod_{i=1}^n \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{(x_i - \theta)^2}{2 \sigma^2} \right)$$

where $n = 6$ and $x_1 = 26.6, x_2 = 38.5, x_3 = 34.4, x_4 = 34, x_5 = 31, x_6 = 23.6$ denote the observed data points. More formally, you can arrive at this likelihood in the following way. Denote potential measurements by $X_1, \dots, X_n$. Each actual measurement will have some rounding error, so the data point 26.6 should be understood as belonging to the interval $[26.6 - \delta, 26.6 + \delta]$ for some small rounding error $\delta$. So the likelihood is:

$$\begin{align*} \text{likelihood} &= \P\{\text{observed data} \mid \theta, \sigma\} \\ &= \P \left\{X_1 \in [x_1 - \delta, x_1 + \delta], \dots, X_n \in [x_n - \delta, x_n + \delta] \mid \theta, \sigma \right\}. \end{align*}$$

Assuming $\delta$ is small, we can use a probability-density approximation to write

$$\begin{align*} \text{likelihood} \approx \delta^n f_{X_1, \dots, X_n \mid \theta, \sigma}(x_1, \dots, x_n). \end{align*}$$

We are now assuming that:

$$\begin{align} f_{X_1, \dots, X_n \mid \theta, \sigma}(x_1, \dots, x_n) = \prod_{i=1}^n \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{(x_i - \theta)^2}{2 \sigma^2} \right). \end{align}$$

This leads to the likelihood above (note that $\delta^n$ is dropped, as it is a constant of proportionality which does not affect any further calculations).
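To make the likelihood concrete, it can be evaluated numerically. A minimal Python sketch (the value $\sigma = 5$ and the grid are arbitrary illustration choices, not part of the model):

```python
import numpy as np

# Observed measurements from the example.
x = np.array([26.6, 38.5, 34.4, 34.0, 31.0, 23.6])

def log_likelihood(theta, sigma, data=x):
    """Log of prod_i N(x_i | theta, sigma^2)."""
    return float(np.sum(-0.5 * np.log(2 * np.pi) - np.log(sigma)
                        - (data - theta) ** 2 / (2 * sigma ** 2)))

# For any fixed sigma, the likelihood is maximized over theta at the
# sample mean (it minimizes the sum of squares).
theta_grid = np.linspace(20, 40, 2001)          # step 0.01
best = theta_grid[np.argmax([log_likelihood(t, 5.0) for t in theta_grid])]
print(best, x.mean())  # both approximately 31.35
```

For any fixed $\sigma$ the maximizing $\theta$ is the sample mean $\bar{x} = 31.35$, which foreshadows the posterior calculations later in the lecture.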

Here is a short digression on the likelihood and the density assumption above.

Often, the likelihood assumption above is written as:

$$\begin{align} X_1, \dots, X_n \mid \theta,\sigma \overset{\text{i.i.d.}}{\sim} N(\theta, \sigma^2). \end{align}$$

Strictly speaking, the i.i.d. statement and the density assumption at the observed data are not the same. This is because the i.i.d. statement is equivalent to

$$\begin{align} f_{X_1, \dots, X_n \mid \theta, \sigma}(u_1,\dots, u_n) = \prod_{i=1}^n \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{(u_i - \theta)^2}{2 \sigma^2} \right) \qquad\text{for all $u_1, \dots, u_n \in (-\infty, \infty)$}. \end{align}$$

This is much stronger than the density assumption at the observed data, because $u_1, \dots, u_n$ are completely arbitrary, while the points $x_1, \dots, x_n$ are not arbitrary (they simply equal the observed data).

For example, note that if we assume

$$\begin{align*} f_{X_1, \dots, X_n \mid \theta, \sigma}(u_1,\dots, u_n) = \begin{cases} \displaystyle \prod_{i=1}^n \frac{1}{\sqrt{2 \pi} \sigma} \exp\!\left( -\frac{(u_i - \theta)^2}{2 \sigma^2} \right), & \text{if } u_1,\dots,u_n \in (20,40), \\ \text{completely arbitrary}, & \text{if } (u_1,\dots,u_n) \notin (20,40)^n, \end{cases} \end{align*}$$

then again we arrive at the same likelihood (every observed data point lies in the stated interval). But now the i.i.d. normal statement is no longer true.

To complete the modeling step, we need to describe the prior on $(\theta, \sigma)$. We assume

$$\begin{align} \theta, \log \sigma \overset{\text{i.i.d.}}{\sim} \text{uniform}(-C, C) \end{align}$$

for a very large positive constant $C$. The idea here is that we are allowing $\theta$ and $\log \sigma$ to take values essentially on the entire real line, without expressing preference for any one value over another. In terms of densities, this prior is the same as

$$\begin{align*} \text{prior} = f_{\theta, \sigma}(\theta, \sigma) &= f_{\theta}(\theta)\, f_{\sigma}(\sigma) \\ &= f_{\theta}(\theta)\, f_{\log \sigma}(\log \sigma)\, \frac{1}{\sigma} \\ &= \frac{\mathbf{1}\{-C < \theta < C\}}{2C}\, \frac{\mathbf{1}\{-C < \log \sigma < C\}}{2C}\, \frac{1}{\sigma} \\ &\propto \frac{\mathbf{1}\{-C < \theta < C,\,-C < \log \sigma < C\}}{\sigma}. \end{align*}$$
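The change-of-variables step $f_{\sigma}(\sigma) = f_{\log \sigma}(\log \sigma)\,\frac{1}{\sigma}$ can be sanity-checked by simulation; a sketch with a deliberately small illustrative $C = 3$, so that the $1/\sigma$ factor is clearly visible:

```python
import numpy as np

# Simulate the prior: log(sigma) ~ uniform(-C, C), with a small
# illustrative C (the lecture's C is "very large").
rng = np.random.default_rng(0)
C = 3.0
sigma = np.exp(rng.uniform(-C, C, size=1_000_000))

# Empirical density of sigma near s0 vs. the change-of-variables formula
# f_sigma(s) = 1{-C < log s < C} / (2 C s).
s0, h = 1.5, 0.05
empirical = np.mean((sigma > s0 - h) & (sigma < s0 + h)) / (2 * h)
predicted = 1.0 / (2 * C * s0)
print(empirical, predicted)  # both approximately 0.111
```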

To now get the posterior (which is the joint density of $\theta, \sigma$ given the observed data), we use Bayes' rule:

$$\begin{align*} \text{posterior} &= f_{\theta, \sigma \mid \text{data}}(\theta, \sigma) \\ &\propto f_{\theta, \sigma}(\theta, \sigma) \times \text{likelihood} \\ &\propto \frac{\mathbf{1}\{-C < \theta < C,\,-C < \log \sigma < C\}}{\sigma} \prod_{i=1}^n \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{(x_i - \theta)^2}{2 \sigma^2} \right) \\ &\propto \sigma^{-n-1} \exp\left(-\frac{1}{2 \sigma^2}\sum_{i=1}^n(x_i - \theta)^2 \right)\mathbf{1}\{-C < \theta < C,\,-C < \log \sigma < C\}. \end{align*}$$

The constant of proportionality above is determined by the requirement that the posterior integrate to one:

$$\begin{align*} \text{posterior} = f_{\theta, \sigma \mid \text{data}}(\theta, \sigma) = \frac{\sigma^{-n-1} \exp\left(-\frac{1}{2 \sigma^2}\sum_{i=1}^n(x_i - \theta)^2 \right)\mathbf{1}\{-C < \theta < C,\,-C < \log \sigma < C\}}{\int_{-C}^C \int_{e^{-C}}^{e^C} \sigma^{-n-1} \exp\left(-\frac{1}{2 \sigma^2}\sum_{i=1}^n(x_i - \theta)^2 \right) d\sigma \, d\theta}. \end{align*}$$

This is the joint posterior density of $\theta$ and $\sigma$. If we only want the posterior density of $\theta$, we use the sum rule of probability to integrate out $\sigma$:

$$\begin{align*} f_{\theta \mid \text{data}}(\theta) \propto \mathbf{1}\{-C < \theta < C\} \int_{e^{-C}}^{e^C} \sigma^{-n-1} \exp\left(-\frac{1}{2 \sigma^2}\sum_{i=1}^n(x_i - \theta)^2 \right) d\sigma. \end{align*}$$

Because $C$ is large, the limits of the integral can be taken to be $0$ and $\infty$, leading to

$$\begin{align*} f_{\theta \mid \text{data}}(\theta) &\propto \mathbf{1}\{-C < \theta < C\} \int_{0}^{\infty} \sigma^{-n-1} \exp\left(-\frac{1}{2 \sigma^2}\sum_{i=1}^n(x_i - \theta)^2 \right) d\sigma \\ &= \mathbf{1}\{-C < \theta < C\}\, 2^{(n/2) - 1}\, \Gamma(n/2)\left[\sum_{i=1}^n (x_i - \theta)^2 \right]^{-n/2}. \end{align*}$$
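The closed-form value of this $\sigma$-integral can be checked numerically, e.g. with `scipy.integrate.quad`; the values $n = 6$ and $\sum_i (x_i - \theta)^2 = 150$ below are illustrative choices for the check:

```python
import math
from scipy.integrate import quad

n, S = 6, 150.0  # illustrative values of n and the sum of squares

# Integrand: sigma^(-n-1) * exp(-S / (2 sigma^2)).
integrand = lambda sigma: sigma ** (-n - 1) * math.exp(-S / (2 * sigma ** 2))
numeric, _ = quad(integrand, 0, math.inf)

# Claimed closed form: 2^(n/2 - 1) * Gamma(n/2) * S^(-n/2).
closed_form = 2 ** (n / 2 - 1) * math.gamma(n / 2) * S ** (-n / 2)
print(numeric, closed_form)  # the two values agree
```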

The factor $2^{(n/2) - 1} \Gamma(n/2)$ does not depend on $\theta$, so it can be absorbed into the constant of proportionality as well, leading to:

$$\begin{align*} f_{\theta \mid \text{data}}(\theta) &\propto \mathbf{1}\{-C < \theta < C\} \left(\frac{1}{S(\theta)} \right)^{n/2} \end{align*}$$

where $S(\theta)$ is the sum-of-squares term:

$$\begin{align*} S(\theta) = \sum_{i=1}^n (x_i - \theta)^2. \end{align*}$$

If $C$ is large, the indicator can be dropped (it will essentially always equal 1), so the posterior becomes:

$$\begin{align*} f_{\theta \mid \text{data}}(\theta) &\propto \left(\frac{1}{S(\theta)} \right)^{n/2}. \end{align*}$$

Thus the posterior is inversely proportional to $S(\theta)^{n/2}$. This means that the posterior mode is the least squares estimator (the minimizer of $S(\theta)$), which is the sample mean $\hat{\theta} = \bar{x} = (x_1 + \dots + x_n)/n$. It is cleaner to write the above posterior as:

$$\begin{align*} f_{\theta \mid \text{data}}(\theta) &\propto \left(\frac{S(\hat{\theta})}{S(\theta)} \right)^{n/2}. \end{align*}$$

Because $S(\theta) = S(\hat{\theta}) + n(\theta - \hat{\theta})^2$, we can also rewrite the posterior as:

$$\begin{align*} f_{\theta \mid \text{data}}(\theta) &\propto \left(\frac{S(\hat{\theta})}{S(\hat{\theta}) + n(\theta - \hat{\theta})^2} \right)^{n/2} = \left(\frac{1}{1 + \frac{(\theta - \hat{\theta})^2}{S(\hat{\theta})/n}} \right)^{n/2}. \end{align*}$$

It can be shown (left as an exercise; see the Wikipedia article on Student's $t$-distribution) that this is related to the $t$-density: the above is equivalent to

$$\begin{align} \frac{\sqrt{n}(\theta - \hat{\theta})}{\sqrt{S(\hat{\theta})/(n-1)}} \mid \text{data} \sim t_{n-1} \end{align}$$

where $t_{n-1}$ denotes the $t$-density with $n-1$ degrees of freedom. Note that $\hat{\theta} = \bar{x}$ and $S(\hat{\theta}) = \sum_{i=1}^n (x_i - \bar{x})^2$.
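One way to check the $t_{n-1}$ claim without doing the exercise analytically is to normalize the unnormalized posterior on a grid and compare it with the density implied by the $t$ statement; a sketch using `scipy.stats.t` (the grid endpoints are arbitrary choices):

```python
import numpy as np
from scipy import stats

x = np.array([26.6, 38.5, 34.4, 34.0, 31.0, 23.6])
n, xbar = len(x), x.mean()
S_hat = float(np.sum((x - xbar) ** 2))   # S(theta_hat)

# Unnormalized posterior on a grid, normalized by a Riemann sum.
theta = np.linspace(10, 55, 20001)
dtheta = theta[1] - theta[0]
post = (S_hat / (S_hat + n * (theta - xbar) ** 2)) ** (n / 2)
post = post / (post.sum() * dtheta)

# Density of theta implied by the t_{n-1} statement:
# sqrt(n) (theta - xbar) / sqrt(S_hat / (n-1)) | data ~ t_{n-1}.
scale = np.sqrt(S_hat / (n - 1)) / np.sqrt(n)
t_density = stats.t.pdf((theta - xbar) / scale, df=n - 1) / scale

print(np.max(np.abs(post - t_density)))  # tiny: the two densities agree
```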

So the Bayesian point estimate is simply $\hat{\theta} = \bar{x}$ (this is the posterior mean, median and mode!). A $100(1 - \alpha)\%$ uncertainty interval for $\theta$ is given by:

$$\begin{align} \left[ \hat{\theta} - \frac{1}{\sqrt{n}}\, t_{n-1, \alpha/2} \sqrt{\frac{S(\hat{\theta})}{n-1}},\ \hat{\theta} + \frac{1}{\sqrt{n}}\, t_{n-1, \alpha/2} \sqrt{\frac{S(\hat{\theta})}{n-1}} \right] \end{align}$$

where $t_{n-1, \alpha/2}$ is the $(1 - \alpha/2)$ quantile of the Student $t$-distribution with $n-1$ degrees of freedom. This uncertainty interval is sometimes referred to as a Bayesian credible interval.

Check that $\alpha = 0.05$ leads to the interval $[25.598, 37.102]$.
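The interval can be computed directly from the data; a short check using `scipy.stats.t.ppf` for the quantile $t_{n-1, \alpha/2}$:

```python
import numpy as np
from scipy import stats

x = np.array([26.6, 38.5, 34.4, 34.0, 31.0, 23.6])
n = len(x)
theta_hat = x.mean()                         # posterior mean/median/mode
S_hat = float(np.sum((x - theta_hat) ** 2))  # S(theta_hat)

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)    # t_{n-1, alpha/2}
half_width = t_crit * np.sqrt(S_hat / (n - 1)) / np.sqrt(n)
lo, hi = theta_hat - half_width, theta_hat + half_width
print(round(lo, 3), round(hi, 3))  # 25.598 37.102
```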

Frequentist Solution

It turns out that the Bayesian credible interval above is also the standard frequentist confidence interval in this problem. Specifically, with $\bar{X} = (X_1 + \dots + X_n)/n$, we have

$$\P \left\{\bar{X} - \frac{t_{n-1, \alpha/2}}{\sqrt{n}} \sqrt{\frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2} \leq \theta \leq \bar{X} + \frac{t_{n-1, \alpha/2}}{\sqrt{n}} \sqrt{\frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2} \right\} = 1 - \alpha$$

under the assumption:

$$X_1, \dots, X_n \ \text{are i.i.d.} \ N(\theta, \sigma^2).$$

This is because

$$\frac{\sqrt{n} (\bar{X} - \theta)}{S} \sim t_{n-1} \qquad\text{where } S := \sqrt{\frac{1}{n-1} \sum_{i=1}^n \left(X_i - \bar{X} \right)^2}.$$

In the above probability statement, $\theta$ is held fixed, and the probability is taken with respect to $X_1, \dots, X_n$, which are i.i.d. $N(\theta, \sigma^2)$.
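This frequentist coverage statement can be verified by simulation: hold $\theta$ fixed, repeatedly draw fresh data, and count how often the interval covers $\theta$. A sketch (the values of $\theta$, $\sigma$, and the number of trials are arbitrary choices):

```python
import numpy as np
from scipy import stats

# Frequentist check: theta is FIXED; the data are redrawn each trial.
rng = np.random.default_rng(1)
theta_true, sigma_true, n, alpha = 31.35, 5.0, 6, 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)

trials = 20_000
X = rng.normal(theta_true, sigma_true, size=(trials, n))
means = X.mean(axis=1)
halves = t_crit * X.std(axis=1, ddof=1) / np.sqrt(n)
coverage = np.mean((means - halves <= theta_true)
                   & (theta_true <= means + halves))
print(coverage)  # close to the nominal 0.95
```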

Observe the difference between the Bayesian statement and the frequentist statement. In the Bayesian credible-interval statement, the data are held fixed at the observed values and the probability is with respect to $\theta$. In the frequentist coverage statement, $\theta$ is held fixed and the probability is with respect to the random variables $X_1, \dots, X_n$, which represent the data.

In this problem, the standard Bayesian inference and standard frequentist inference exactly coincide.

However, it is very easy to break this coincidence. We shall discuss this in the next lecture.