STAT 238 - Bayesian Statistics Lecture Three Spring 2026, UC Berkeley
In the last lecture, we started discussing this example.
Example 3: Inference from measurements

Suppose a scientist makes 6 numerical measurements $26.6, 38.5, 34.4, 34, 31, 23.6$ on an unknown real-valued physical quantity $\theta$. On the basis of these measurements, what can be inferred about $\theta$?
Here is the Bayesian solution to this problem. The first step is modeling, where we have to write down the likelihood and the prior. The likelihood represents the probability of the observed data conditional on parameter values. Here the main parameter is $\theta$. In order to write the probability of the observed data, it is helpful to introduce another parameter $\sigma$, which represents the scale of the noise inherent in the measurement process.

So our parameter vector is $(\theta, \sigma)$. We work with the normal likelihood:
\begin{align*}
\text{likelihood} = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x_i - \theta)^2}{2\sigma^2}\right),
\end{align*}
where $n = 6$ and $x_1 = 26.6, x_2 = 38.5, x_3 = 34.4, x_4 = 34, x_5 = 31, x_6 = 23.6$ denote the observed data points. More formally, you can arrive at this likelihood in the following way. Denote potential measurements by $X_1, \dots, X_n$. Each actual measurement will have some rounding error, so the data point $26.6$ should be understood as belonging to the interval $[26.6 - \delta, 26.6 + \delta]$ for some small rounding error $\delta$. So the likelihood is:
\begin{align*}
\text{likelihood} &= \P\{\text{observed data} \mid \theta, \sigma\} \\
&= \P\left\{X_1 \in [x_1 - \delta, x_1 + \delta], \dots, X_n \in [x_n - \delta, x_n + \delta] \mid \theta, \sigma\right\}.
\end{align*}
Assuming $\delta$ is small, we can use the probability-density approximation (the probability of a small interval is approximately proportional to the density at its center) to write
\begin{align*}
\text{likelihood} \approx (2\delta)^n f_{X_1, \dots, X_n \mid \theta, \sigma}(x_1, \dots, x_n).
\end{align*}
We are now assuming that:
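As a quick numerical check of this approximation: for a normal density, the probability of the interval $[x - \delta, x + \delta]$ approaches the interval length times the density at $x$ as $\delta$ shrinks. A minimal sketch using only the standard library (the values $\theta = 31.35$ and $\sigma = 5$ are arbitrary illustrative choices):

```python
import math

def normal_cdf(z, mean=0.0, sd=1.0):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf((z - mean) / (sd * math.sqrt(2.0))))

def normal_pdf(z, mean=0.0, sd=1.0):
    return math.exp(-(z - mean) ** 2 / (2 * sd ** 2)) / (math.sqrt(2 * math.pi) * sd)

theta, sigma, x = 31.35, 5.0, 26.6  # illustrative parameter values

for delta in [0.5, 0.05, 0.005]:
    prob = normal_cdf(x + delta, theta, sigma) - normal_cdf(x - delta, theta, sigma)
    approx = 2 * delta * normal_pdf(x, theta, sigma)
    print(delta, prob / approx)  # ratio tends to 1 as delta shrinks
```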
\begin{align}
f_{X_1, \dots, X_n \mid \theta, \sigma}(x_1, \dots, x_n) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x_i - \theta)^2}{2\sigma^2}\right).
\end{align}
This leads to the likelihood written above (note that the constant factor involving $\delta$ is being dropped, as it is a constant of proportionality which does not affect any further calculations).
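As a computational aside, this likelihood is straightforward to evaluate in code. A small sketch using only the standard library (working on the log scale to avoid underflow; the fixed $\sigma = 5$ is an arbitrary illustrative choice):

```python
import math

# Observed data from the example.
x = [26.6, 38.5, 34.4, 34, 31, 23.6]

def log_likelihood(theta, sigma, data=x):
    # Sum of log normal densities: -n*log(sqrt(2*pi)*sigma) - S(theta)/(2*sigma^2).
    n = len(data)
    s = sum((xi - theta) ** 2 for xi in data)
    return -n * math.log(math.sqrt(2 * math.pi) * sigma) - s / (2 * sigma ** 2)

# For fixed sigma, the log-likelihood in theta peaks at the sample mean.
xbar = sum(x) / len(x)
print(xbar, log_likelihood(xbar, 5.0))
```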
Here is a short digression on the likelihood and the density assumption above.

A lot of the time, this likelihood assumption is written as:
\begin{align}
X_1, \dots, X_n \mid \theta, \sigma \overset{\text{i.i.d.}}{\sim} N(\theta, \sigma^2).
\end{align}
Strictly speaking, the i.i.d. statement and the density assumption at the observed data points are not the same. This is because the i.i.d. statement is equivalent to
\begin{align}
f_{X_1, \dots, X_n \mid \theta, \sigma}(u_1, \dots, u_n) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(u_i - \theta)^2}{2\sigma^2}\right) \qquad \text{for all } u_1, \dots, u_n \in (-\infty, \infty).
\end{align}
This statement is much stronger than the density assumption at the observed data, because $u_1, \dots, u_n$ are completely arbitrary, while the points $x_1, \dots, x_n$ are not arbitrary (they simply equal the observed data).
For example, note that if we assume
\begin{align*}
f_{X_1, \dots, X_n \mid \theta, \sigma}(u_1, \dots, u_n)
=
\begin{cases}
\displaystyle \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(u_i - \theta)^2}{2\sigma^2}\right), & \text{if } (u_1, \dots, u_n) \in (20, 40)^n, \\
\text{completely arbitrary}, & \text{if } (u_1, \dots, u_n) \notin (20, 40)^n,
\end{cases}
\end{align*}
then we again arrive at the same likelihood (every observed data point lies in $(20, 40)$). But now the i.i.d. normal statement is no longer true.
To complete the modeling step, we need to describe the prior on $(\theta, \sigma)$. We assume
\begin{align}
\theta, \log \sigma \overset{\text{i.i.d.}}{\sim} \text{uniform}(-C, C)
\end{align}
for a very large positive constant $C$. The idea here is that we are allowing $\theta$ and $\log \sigma$ to take values essentially on the entire real line, without expressing preference for any one value over any other. In terms of densities, this is the same as
\begin{align*}
\text{prior} = f_{\theta, \sigma}(\theta, \sigma)
&= f_{\theta}(\theta)\, f_{\sigma}(\sigma) \\
&= f_{\theta}(\theta)\, f_{\log \sigma}(\log \sigma)\, \frac{1}{\sigma} \\
&= \frac{\mathbf{1}\{-C < \theta < C\}}{2C} \, \frac{\mathbf{1}\{-C < \log \sigma < C\}}{2C} \, \frac{1}{\sigma} \\
&\propto \frac{\mathbf{1}\{-C < \theta < C,\, -C < \log \sigma < C\}}{\sigma},
\end{align*}
where the second equality uses the change-of-variables formula $f_{\sigma}(\sigma) = f_{\log \sigma}(\log \sigma) \cdot \frac{1}{\sigma}$. To now get the posterior (which is the joint density of $\theta, \sigma$ given the observed data), we use Bayes rule:
\begin{align*}
\text{posterior} &= f_{\theta, \sigma \mid \text{data}}(\theta, \sigma) \\
&\propto f_{\theta, \sigma}(\theta, \sigma) \times \text{likelihood} \\
&\propto \frac{\mathbf{1}\{-C < \theta < C,\, -C < \log \sigma < C\}}{\sigma} \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x_i - \theta)^2}{2\sigma^2}\right) \\
&\propto \sigma^{-n-1} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \theta)^2\right) \mathbf{1}\{-C < \theta < C,\, -C < \log \sigma < C\}.
\end{align*}
The constant underlying the proportionality above is determined by the requirement that the overall integral equal one:
\begin{align*}
\text{posterior} = f_{\theta, \sigma \mid \text{data}}(\theta, \sigma) = \frac{\sigma^{-n-1} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \theta)^2\right) \mathbf{1}\{-C < \theta < C,\, -C < \log \sigma < C\}}{\int_{e^{-C}}^{e^C} \int_{-C}^{C} \sigma^{-n-1} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \theta)^2\right) d\theta\, d\sigma}.
\end{align*}
This is the joint posterior density of $\theta$ and $\sigma$. If we only want the posterior density for $\theta$, we use the sum rule of probability to integrate out $\sigma$:
\begin{align*}
f_{\theta \mid \text{data}}(\theta) \propto \mathbf{1}\{-C < \theta < C\} \int_{e^{-C}}^{e^C} \sigma^{-n-1} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \theta)^2\right) d\sigma.
\end{align*}
Because $C$ is large, the limits of the integral can be taken to be $0$ and $\infty$, leading to
\begin{align*}
f_{\theta \mid \text{data}}(\theta) &\propto \mathbf{1}\{-C < \theta < C\} \int_{0}^{\infty} \sigma^{-n-1} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \theta)^2\right) d\sigma \\
&= \mathbf{1}\{-C < \theta < C\} \, 2^{(n/2) - 1}\, \Gamma(n/2) \left[\sum_{i=1}^n (x_i - \theta)^2\right]^{-n/2}.
\end{align*}
The factor $2^{(n/2) - 1} \Gamma(n/2)$ does not depend on $\theta$, so it can be absorbed into the constant of proportionality as well, leading to:
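The closed form of the $\sigma$-integral can be sanity-checked numerically. A rough sketch using only the standard library (the midpoint rule, the cutoff at $\sigma = 200$, and the evaluation point $\theta = 31.35$ are ad hoc choices):

```python
import math

x = [26.6, 38.5, 34.4, 34, 31, 23.6]
n = len(x)

def S(theta):
    return sum((xi - theta) ** 2 for xi in x)

def integrand(sigma, theta):
    return sigma ** (-n - 1) * math.exp(-S(theta) / (2 * sigma ** 2))

def numeric_integral(theta, upper=200.0, steps=200_000):
    # Midpoint rule on (0, upper]; the integrand vanishes rapidly at both ends.
    h = upper / steps
    return sum(integrand((k + 0.5) * h, theta) for k in range(steps)) * h

def closed_form(theta):
    return 2 ** (n / 2 - 1) * math.gamma(n / 2) * S(theta) ** (-n / 2)

theta0 = 31.35
print(numeric_integral(theta0), closed_form(theta0))  # should agree closely
```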
\begin{align*}
f_{\theta \mid \text{data}}(\theta) \propto \mathbf{1}\{-C < \theta < C\} \left(\frac{1}{S(\theta)}\right)^{n/2},
\end{align*}
where $S(\theta)$ is the sum of squares term:
\begin{align*}
S(\theta) = \sum_{i=1}^n (x_i - \theta)^2.
\end{align*}
If $C$ is large, then the indicator can be dropped (because it will essentially always equal 1), so the posterior becomes:
\begin{align*}
f_{\theta \mid \text{data}}(\theta) \propto \left(\frac{1}{S(\theta)}\right)^{n/2}.
\end{align*}
Thus the posterior is inversely proportional to $S(\theta)^{n/2}$. This means that the posterior mode will be at the least squares estimator, which is the sample mean $\hat{\theta} = \bar{x} = (x_1 + \dots + x_n)/n$. It is cleaner to write the above posterior as:
\begin{align*}
f_{\theta \mid \text{data}}(\theta) \propto \left(\frac{S(\hat{\theta})}{S(\theta)}\right)^{n/2}.
\end{align*}
Because $S(\theta) = S(\hat{\theta}) + n(\theta - \hat{\theta})^2$, we can also rewrite the posterior as:
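The sum-of-squares decomposition used here holds for every $\theta$, and is easy to verify numerically on the data from the example:

```python
x = [26.6, 38.5, 34.4, 34, 31, 23.6]
n = len(x)
xbar = sum(x) / n  # least squares estimator, 31.35

def S(theta):
    return sum((xi - theta) ** 2 for xi in x)

# Check S(theta) = S(xbar) + n*(theta - xbar)^2 at a few arbitrary points.
for theta in [0.0, 25.0, 31.35, 100.0]:
    lhs = S(theta)
    rhs = S(xbar) + n * (theta - xbar) ** 2
    assert abs(lhs - rhs) < 1e-8 * max(1.0, lhs)

print("S(xbar) =", S(xbar))
```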
\begin{align*}
f_{\theta \mid \text{data}}(\theta) \propto \left(\frac{S(\hat{\theta})}{S(\hat{\theta}) + n(\theta - \hat{\theta})^2}\right)^{n/2} = \left(\frac{1}{1 + \frac{(\theta - \hat{\theta})^2}{S(\hat{\theta})/n}}\right)^{n/2}.
\end{align*}
It can be shown (left as an exercise) that this is related to the $t$-density (see the Wikipedia article on Student's $t$-distribution): the above is equivalent to
\begin{align}
\frac{\sqrt{n}\,(\theta - \hat{\theta})}{\sqrt{S(\hat{\theta})/(n-1)}} \,\Big|\, \text{data} \sim t_{n-1},
\end{align}
where $t_{n-1}$ denotes the $t$-density with $n-1$ degrees of freedom. Note that $\hat{\theta} = \bar{x}$ and $S(\hat{\theta}) = \sum_{i=1}^n (x_i - \bar{x})^2$.
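This equivalence can be verified numerically: normalize the posterior $\propto S(\theta)^{-n/2}$ on a grid, and compare it to the $t_{n-1}$ density after the change of variables. A sketch using only the standard library (the grid limits and step size are arbitrary choices):

```python
import math

x = [26.6, 38.5, 34.4, 34, 31, 23.6]
n = len(x)
nu = n - 1
xbar = sum(x) / n

def S(theta):
    return sum((xi - theta) ** 2 for xi in x)

def unnorm_post(theta):
    return S(theta) ** (-n / 2)

# Normalize the posterior numerically on a wide grid (crude but adequate).
lo, hi, steps = xbar - 100, xbar + 100, 100_000
h = (hi - lo) / steps
Z = sum(unnorm_post(lo + k * h) for k in range(steps + 1)) * h

def post_density(theta):
    return unnorm_post(theta) / Z

def t_pdf(u, df):
    # Density of the t distribution with df degrees of freedom.
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + u * u / df) ** (-(df + 1) / 2)

# With u = (theta - xbar)/scale, the posterior of theta is t_{n-1} in u.
scale = math.sqrt(S(xbar) / (nu * n))
for theta in [28.0, 31.35, 35.0]:
    u = (theta - xbar) / scale
    print(post_density(theta), t_pdf(u, nu) / scale)  # the two columns should match
```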
So the Bayesian point estimate is simply $\hat{\theta} = \bar{x}$ (this is the posterior mean, median, and mode!). A $100(1 - \alpha)\%$ uncertainty interval for $\theta$ is given by:
\begin{align}
\left[\hat{\theta} - \frac{1}{\sqrt{n}}\, t_{n-1, \alpha/2} \sqrt{\frac{S(\hat{\theta})}{n-1}},\ \hat{\theta} + \frac{1}{\sqrt{n}}\, t_{n-1, \alpha/2} \sqrt{\frac{S(\hat{\theta})}{n-1}}\right],
\end{align}
where $t_{n-1, \alpha/2}$ is the $(1 - \alpha/2)$ quantile of the Student $t$-distribution with $n-1$ degrees of freedom. This uncertainty interval is sometimes referred to as the Bayesian credible interval.
Check that $\alpha = 0.05$ leads to the interval $[25.598, 37.102]$.
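A quick way to carry out this check in code (the $t$ quantile $t_{5, 0.025} = 2.570582$ is hard-coded from standard tables to avoid a dependency outside the standard library):

```python
import math

x = [26.6, 38.5, 34.4, 34, 31, 23.6]
n = len(x)
xbar = sum(x) / n
S_hat = sum((xi - xbar) ** 2 for xi in x)

# 0.975 quantile of the t distribution with n-1 = 5 degrees of freedom.
t_5_0025 = 2.570582

half_width = t_5_0025 * math.sqrt(S_hat / (n - 1)) / math.sqrt(n)
interval = (xbar - half_width, xbar + half_width)
print([round(v, 3) for v in interval])  # → [25.598, 37.102]
```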
Frequentist Solution

It turns out that the Bayesian credible interval derived above is also the standard frequentist confidence interval in this problem. Specifically, we have (below $\bar{X} = (X_1 + \dots + X_n)/n$)
\begin{align*}
\P\left\{\bar{X} - \frac{t_{n-1, \alpha/2}}{\sqrt{n}} \sqrt{\frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2} \leq \theta \leq \bar{X} + \frac{t_{n-1, \alpha/2}}{\sqrt{n}} \sqrt{\frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2}\right\} = 1 - \alpha
\end{align*}
under the assumption:
\begin{align*}
X_1, \dots, X_n ~ \text{are i.i.d.} ~ N(\theta, \sigma^2).
\end{align*}
This is because
\begin{align*}
\frac{\sqrt{n}\,(\bar{X} - \theta)}{S} \sim t_{n-1} \qquad \text{where } S := \sqrt{\frac{1}{n-1} \sum_{i=1}^n \left(X_i - \bar{X}\right)^2}.
\end{align*}
In the above probability statement, $\theta$ is held fixed, and the probability is taken with respect to $X_1, \dots, X_n$, which are i.i.d. $N(\theta, \sigma^2)$.
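The frequentist coverage claim can be checked by simulation: hold $\theta$ and $\sigma$ fixed, repeatedly generate data, and count how often the interval covers $\theta$. A small Monte Carlo sketch (the "true" values $\theta = 31.35$, $\sigma = 5$ and the number of trials are arbitrary choices; the $t$ quantile is hard-coded from tables):

```python
import math
import random
import statistics

random.seed(0)
n, theta, sigma = 6, 31.35, 5.0   # hypothetical true values for the simulation
t_5_0025 = 2.570582               # 0.975 quantile of t with 5 degrees of freedom

hits, trials = 0, 20_000
for _ in range(trials):
    xs = [random.gauss(theta, sigma) for _ in range(n)]
    xbar = statistics.fmean(xs)
    s = math.sqrt(sum((xi - xbar) ** 2 for xi in xs) / (n - 1))
    half = t_5_0025 * s / math.sqrt(n)
    hits += (xbar - half <= theta <= xbar + half)

print(hits / trials)  # close to 0.95
```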
Observe the difference between the Bayesian statement and the frequentist statement above. In the Bayesian statement, the data are held fixed at the observed values and the probability is with respect to $\theta$. In the frequentist statement, $\theta$ is held fixed and the probability is with respect to the random variables $X_1, \dots, X_n$, which are supposed to represent data.
In this problem, the standard Bayesian inference and standard frequentist inference exactly coincide.
However, it is very easy to break this coincidence. We shall discuss this in the next lecture.