
STAT 238 - Bayesian Statistics, Lecture Four, Spring 2026, UC Berkeley

In the last lecture, we discussed the following problem.

Example 3: Inference from measurements

Frequentist Solution

Here is the standard frequentist solution to this problem. Use the model:

\begin{align} X_1, \dots, X_n \overset{\text{i.i.d}}{\sim} N(\theta, \sigma^2). \end{align}

It then follows that $\bar{X}_n := (X_1 + \dots + X_n)/n \sim N(\theta, \sigma^2/n)$, which implies:

\begin{align} \frac{\sqrt{n}(\bar{X}_n - \theta)}{\sigma} \sim N(0, 1). \end{align}

If $\sigma$ is known, this gives the confidence interval $\bar{X}_n \pm \frac{\sigma}{\sqrt{n}} z_{\alpha/2}$ for $\theta$ (here $z_{\alpha/2}$ satisfies $\P\{N(0, 1) > z_{\alpha/2}\} = \alpha/2$). But this confidence interval cannot be computed as $\sigma$ is unknown. It is natural to replace $\sigma$ by the estimator:

\begin{align*} \hat{\sigma} := \sqrt{\frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X}_n)^2}. \end{align*}

But then the normal distribution in (2) needs to be changed to the Student $t$ distribution with $n-1$ degrees of freedom:

\begin{align} \frac{\sqrt{n}(\bar{X}_n - \theta)}{\hat{\sigma}} \sim t_{n-1}. \end{align}

This leads to the confidence interval:

\begin{align} \left[ \bar{X}_n - \frac{t_{n-1, \alpha/2}}{\sqrt{n}} \sqrt{\frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X}_n)^2}, \ \bar{X}_n + \frac{t_{n-1, \alpha/2}}{\sqrt{n}} \sqrt{\frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X}_n)^2} \right]. \end{align}

When $\alpha = 0.05$ and one plugs the observed data $X_1 = 26.6, X_2 = 38.5, X_3 = 34.4, X_4 = 34, X_5 = 31, X_6 = 23.6$ (with $n = 6$) into the above interval, one obtains the interval $[25.598, 37.102]$.
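This calculation is easy to reproduce numerically. Below is a minimal sketch (not from the lecture) using `numpy` and `scipy.stats`; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Observed measurements from the lecture, n = 6
x = np.array([26.6, 38.5, 34.4, 34.0, 31.0, 23.6])
n = len(x)
alpha = 0.05

xbar = x.mean()
sigma_hat = x.std(ddof=1)                      # divides by n - 1, as in the text
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)  # t_{n-1, alpha/2}

half_width = t_crit * sigma_hat / np.sqrt(n)
lower, upper = xbar - half_width, xbar + half_width
print(f"[{lower:.3f}, {upper:.3f}]")           # [25.598, 37.102]
```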

Bayesian Solution

The Bayesian solution leads to the same interval, but with different reasoning. We went over the calculations in the last lecture; here are the main facts. We use the likelihood:

\text{Likelihood} = \prod_{i=1}^n \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{(x_i - \theta)^2}{2 \sigma^2} \right),

where $n = 6$ and $x_1 = 26.6, x_2 = 38.5, x_3 = 34.4, x_4 = 34, x_5 = 31, x_6 = 23.6$ denote the observed data points.

The unknown parameters are $\theta$ and $\sigma$. The prior is given by:

\begin{align} \theta, \log \sigma \overset{\text{i.i.d}}{\sim} \text{uniform}(-C, C) \end{align}

for a very large positive constant $C$. In terms of densities (by the change of variables formula: since $\frac{d}{d\sigma} \log \sigma = \frac{1}{\sigma}$, a uniform density for $\log \sigma$ becomes a density proportional to $1/\sigma$ for $\sigma$), (7) is the same as

\begin{align*} \text{prior} = f_{\theta, \sigma}(\theta, \sigma) \propto \frac{\mathbf{1}\{-C < \theta < C,\,-C < \log \sigma < C\}}{\sigma}. \end{align*}

For this model, the posterior becomes:

\begin{align*} \text{posterior} = f_{\theta, \sigma \mid \text{data}}(\theta, \sigma) \propto \sigma^{-n-1} \exp\left(-\frac{1}{2 \sigma^2}\sum_{i=1}^n(x_i - \theta)^2 \right)\mathbf{1}\{-C < \theta < C,\,-C < \log \sigma < C\}. \end{align*}

This is the joint posterior density of $\theta$ and $\sigma$. The posterior of $\theta$ alone is obtained by integrating out $\sigma$:

\begin{align*} f_{\theta \mid \text{data}}(\theta) \propto \mathbf{1}\{-C < \theta < C\} \int_{e^{-C}}^{e^C} \sigma^{-n-1} \exp\left(-\frac{1}{2 \sigma^2}\sum_{i=1}^n(x_i - \theta)^2 \right) d\sigma. \end{align*}

Because $C$ is large, the limits of the integral can be taken to be $0$ and $\infty$. The integral can then be calculated exactly to obtain

\begin{align*} f_{\theta \mid \text{data}}(\theta) \propto \mathbf{1}\{-C < \theta < C\} \left(\frac{1}{S(\theta)} \right)^{n/2}, \end{align*}

where $S(\theta)$ is the sum of squares term:

\begin{align*} S(\theta) = \sum_{i=1}^n (x_i - \theta)^2. \end{align*}
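As a numerical sanity check of the marginalization step, the integral $\int_0^\infty \sigma^{-n-1} e^{-S/(2\sigma^2)}\, d\sigma$ can be compared against its closed form $2^{n/2-1}\,\Gamma(n/2)\, S^{-n/2}$ (obtained via the substitution $u = S/(2\sigma^2)$); the constant does not depend on $S$, which is what makes the posterior proportional to $S(\theta)^{-n/2}$. The sketch below is an illustration, not part of the lecture:

```python
import math
from scipy.integrate import quad

n = 6
for S in (2.0, 10.0, 50.0):            # arbitrary values of the sum of squares
    integral, _ = quad(lambda s: s ** (-n - 1) * math.exp(-S / (2 * s ** 2)),
                       0, math.inf)
    # Closed form: 2^{n/2 - 1} * Gamma(n/2) * S^{-n/2}
    closed_form = 2 ** (n / 2 - 1) * math.gamma(n / 2) * S ** (-n / 2)
    assert math.isclose(integral, closed_form, rel_tol=1e-5)
```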

If $C$ is large, then the indicator can be dropped (because it will essentially always equal 1), so the posterior becomes:

\begin{align*} f_{\theta \mid \text{data}}(\theta) \propto \left(\frac{1}{S(\theta)} \right)^{n/2} \propto \left(\frac{S(\hat{\theta})}{S(\theta)} \right)^{n/2}, \end{align*}

where $\hat{\theta} = \bar{x} = (x_1 + \dots + x_n)/n$ is the least squares estimator (which minimizes $S(\theta)$ over all $\theta$). Thus the posterior mode is the sample mean.

It can be shown that this distribution is closely related to the $t$-distribution. Specifically,

\begin{align} \frac{\sqrt{n}(\theta - \hat{\theta})}{\sqrt{S(\hat{\theta})/(n-1)}} \mid \text{data} \sim t_{n-1}, \end{align}

where $t_{n-1}$ denotes the $t$-density with $n-1$ degrees of freedom. Note that $\hat{\theta} = \bar{x}$ and $S(\hat{\theta}) = \sum_{i=1}^n (x_i - \bar{x})^2$.

So the Bayesian point estimate is simply $\hat{\theta} = \bar{x}$ (this is the posterior mean, median and mode!). A $100(1 - \alpha)\%$ uncertainty interval for $\theta$ is given by:

\begin{align} \left[ \hat{\theta} - \frac{1}{\sqrt{n}} t_{n-1, \alpha/2} \sqrt{\frac{S(\hat{\theta})}{n-1}}, \ \hat{\theta} + \frac{1}{\sqrt{n}} t_{n-1, \alpha/2} \sqrt{\frac{S(\hat{\theta})}{n-1}} \right] \end{align}

where $t_{n-1, \alpha/2}$ is the $(1 - \alpha/2)$ quantile of the Student $t$-distribution with $n-1$ degrees of freedom. This uncertainty interval is referred to as the Bayesian Credible Interval.
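The posterior of $\theta$ can also be handled purely numerically, which is a useful check on the algebra. The sketch below (an illustration, not lecture code) evaluates the unnormalized posterior $(1/S(\theta))^{n/2}$ on a grid, using the standard identity $S(\theta) = n(\theta - \bar{x})^2 + S(\hat{\theta})$, and reads off the 95% equal-tail credible interval:

```python
import numpy as np

x = np.array([26.6, 38.5, 34.4, 34.0, 31.0, 23.6])
n = len(x)
xbar = x.mean()

theta = np.linspace(-100.0, 160.0, 2_000_001)    # wide grid around the data
dtheta = theta[1] - theta[0]

# S(theta) = n (theta - xbar)^2 + S(theta_hat)
S = n * (theta - xbar) ** 2 + ((x - xbar) ** 2).sum()
density = S ** (-n / 2)                          # unnormalized posterior
density /= density.sum() * dtheta                # normalize on the grid

cdf = np.cumsum(density) * dtheta
lower = theta[np.searchsorted(cdf, 0.025)]
upper = theta[np.searchsorted(cdf, 0.975)]
print(f"[{lower:.3f}, {upper:.3f}]")             # agrees with [25.598, 37.102]
```

The grid quantiles agree with the $t$-based formula up to the grid resolution.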

Thus, in Problem 1, the standard frequentist and Bayesian solutions coincide.

Frequentist vs Bayes

However, it is very easy to break this coincidence. For example, consider the following problem (Problem 2): measurements are taken one at a time, and the experiment stops as soon as an observation of at most 25 is recorded.

Note that the observed data are exactly the same as before.

The frequentist confidence interval (5) is no longer valid, because the frequentist probability statement (4) is no longer valid: the number of data points $n$ can no longer be taken to be deterministically equal to 6. The frequentist probability that we need to calculate is instead (below we denote the number of data points by $N$ and treat it as a random variable):

\P \left\{\bar{X}_N - \frac{t_{N-1, \alpha/2}}{\sqrt{N}} \sqrt{\frac{1}{N-1} \sum_{i=1}^N (X_i - \bar{X}_N)^2} \leq \theta \leq \bar{X}_N + \frac{t_{N-1, \alpha/2}}{\sqrt{N}} \sqrt{\frac{1}{N-1} \sum_{i=1}^N (X_i - \bar{X}_N)^2} \right\}

where $X_1, X_2, \dots$ are i.i.d. $N(\theta, \sigma^2)$ as before and

N := \inf \left\{n \geq 1 : X_n \leq 25 \right\}.

The probability above is complicated and there is no reason for it to be exactly equal to $1 - \alpha$. Constructing valid frequentist confidence intervals in the presence of stopping rules (such as the rule of stopping as soon as we observe a data point of at most 25) is, in fact, a problem of current research (see, e.g., the paper https://arxiv.org/abs/2210.01948).
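A small simulation makes this concrete. The sketch below (with illustrative values of $\theta$ and $\sigma$ chosen by us, not stated in the lecture) estimates the actual coverage of the naive $t$-interval when data collection stops at the first observation $\leq 25$; runs with $N = 1$ are skipped, since the sample standard deviation is then undefined:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
theta, sigma, alpha = 31.0, 5.5, 0.05   # illustrative true parameter values
covered = trials = 0

for _ in range(20_000):
    xs = []
    while True:                          # sample until X_n <= 25
        xi = rng.normal(theta, sigma)
        xs.append(xi)
        if xi <= 25:
            break
    n = len(xs)
    if n < 2:                            # t-interval undefined for one point
        continue
    xs = np.asarray(xs)
    half = stats.t.ppf(1 - alpha / 2, n - 1) * xs.std(ddof=1) / np.sqrt(n)
    covered += abs(xs.mean() - theta) <= half
    trials += 1

print(f"coverage given N >= 2: {covered / trials:.3f}")  # compare to the nominal 0.95
```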

In contrast to frequentist inference, the Bayesian inference procedure will not change, because the likelihood function in Problem 2 is the same as the likelihood function in Problem 1. To verify this, consider the likelihood in Problem 2 (below $\delta$ denotes the rounding error in the observations, which is extremely small).

The likelihood in Problem 2

\begin{align*} &= \P \left\{\text{observed data} \mid \theta, \sigma \right\} \\ &= \P\{X_1 \in [x_1 - \delta, x_1 + \delta], \dots, X_6 \in [x_6 - \delta, x_6 + \delta], X_1 > 25, \dots, X_5 > 25, X_6 \leq 25 \mid \theta, \sigma \}. \end{align*}

In other words, we are including in the likelihood the event that the first five observations all exceed 25 while the sixth is at most 25, in addition to the exact values of the observations. But it is clear that these additional constraints do not change the probability (as an example, just note that $\P(Z = 5) = \P(Z = 5, Z < 10)$; the additional restriction $Z < 10$ does not affect the probability because it is already implied by $Z = 5$). Thus

The likelihood in Problem 2

\begin{align*} &= \P\{X_1 \in [x_1 - \delta, x_1 + \delta], \dots, X_6 \in [x_6 - \delta, x_6 + \delta], X_1 > 25, \dots, X_5 > 25, X_6 \leq 25 \mid \theta, \sigma \} \\ &= \P\{X_1 \in [x_1 - \delta, x_1 + \delta], \dots, X_6 \in [x_6 - \delta, x_6 + \delta] \mid \theta, \sigma \}, \end{align*}

which is the likelihood in Problem 1. Since the likelihood is the same in both problems, Bayesian inference will be the same for both (the priors are also the same, as there is no reason to use different ones). Therefore, from the Bayesian perspective, stopping rules can be ignored when inferring $\theta$, because they do not affect the likelihood.

This example shows clearly that frequentist inference violates the Likelihood Principle, which states that “all the evidence in a sample relevant to model parameters is contained in the likelihood function”. See the Wikipedia article on the likelihood principle for more information.

On the other hand, Bayesian inference always satisfies the likelihood principle (assuming that priors are the same), because data enters the Bayesian posterior calculation only through the likelihood.

Here is another example of violation of the likelihood principle in frequentist inference.

Example 4: Coin Fairness Testing

Frequentist Solution

For the usual frequentist answer to this question, we assume that the observed sequence of outcomes is the realization of random variables $X_1, \dots, X_n$ (with $n = 12$) that are i.i.d. $\text{Bernoulli}(p)$ for some unknown $p$, so that the total number of heads follows the $\text{Bin}(n, p)$ distribution. We need to test the (null) hypothesis that $p = 0.5$ against, say, the alternative $p < 0.5$. This can be done by calculating the $p$-value, which is the probability (under the assumption $p = 0.5$) of getting 3 or fewer heads. The distribution of the number of heads under the null is $\text{Bin}(n, 0.5)$, so the $p$-value is

\left( \binom{12}{3} + \binom{12}{2} + \binom{12}{1} + \binom{12}{0} \right) \frac{1}{2^{12}} = \frac{299}{4096} \approx 0.073 = 7.3\%,

which does not lead to a rejection of the null hypothesis at the usual 5% level.
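This tail probability can be checked directly with `scipy.stats` (a one-line verification, not from the lecture):

```python
from scipy import stats

# P(number of heads <= 3) under Bin(12, 0.5)
p_value = stats.binom.cdf(3, n=12, p=0.5)
print(p_value)                       # 299/4096, about 0.0730
```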

In this $p$-value calculation, we implicitly assumed that the experiment consisted of tossing the coin 12 times, where 12 was chosen a priori by the coin tosser. Consider now the alternative scenario where the coin tosser wanted to toss the coin until 3 heads are observed. For the same outcome, the $p$-value will change: the random variable of interest now becomes $N = \text{number of tosses}$, and the $p$-value equals the probability of needing 12 or more tosses to get the 3 heads (assuming fairness). This is calculated using the negative binomial distribution as:

1 - \sum_{n = 3}^{11} \binom{n-1}{2} 2^{-n} = \frac{134}{4096} \approx 0.0327 = 3.27\%,

and this leads to rejection of the null hypothesis at the $5\%$ level.
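This probability can likewise be verified with `scipy.stats`. Note that scipy's `nbinom` counts failures before the $r$-th success, so $N \geq 12$ tosses corresponds to at least 9 tails before the 3rd head (again a verification sketch, not lecture code):

```python
from scipy import stats

# P(N >= 12) = P(at least 9 tails before the 3rd head) under fairness
p_value = stats.nbinom.sf(8, n=3, p=0.5)
print(p_value)                       # 134/4096, about 0.0327
```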

Note that the likelihood function is the same function $p^3(1 - p)^9$ (up to a constant factor) whether the sample size was predetermined or the coin was tossed until 3 heads were observed. But the procedure obtained for testing $p = 0.5$ has changed from the binomial to the negative binomial case. This means that $p$-value based frequentist inference violates the Likelihood Principle. Here is a story from the Wikipedia article on the Likelihood Principle which puts these numbers in an interesting context:

Suppose a number of scientists are assessing the probability of a certain outcome (which we shall call ‘success’) in experimental trials. Conventional wisdom suggests that if there is no bias towards success or failure then the success probability would be one half. Adam, a scientist, conducted 12 trials and obtained 3 successes and 9 failures. One of those successes was the 12th and last observation. Then Adam left the lab.

Bill, Adam’s boss in the same lab, continued Adam’s work and published Adam’s results, along with a significance test. He tested the null hypothesis that $\theta$, the success probability, is equal to a half, versus $\theta < 0.5$. The probability that out of 12 trials, 3 or fewer (i.e. more extreme) were successes, if $H_0$ is true, is $7.3\%$. Thus the null hypothesis is not rejected at the 5% significance level.

Adam actually stopped immediately after 3 successes, because his boss Bill had instructed him to do so. After the publication of the statistical analysis by Bill, Adam realizes that he has missed a later instruction from Bill to instead conduct 12 trials, and that Bill’s paper is based on this second instruction. Adam is very glad that he got his 3 successes after exactly 12 trials, and explains to his friend Charlotte that by coincidence he executed the second instruction. But Charlotte then explains to Adam that the $p$-value should now be changed to $3.27\%$ and the result becomes significant at the $5\%$ level. Adam is astonished to hear this.

For more comments on the violation of the likelihood principle by pp-values, read MacKay (2003, Section 37.2).

We shall look at the Bayesian solution to this problem in the next lecture.

References
  1. MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press.