
STAT 238 - Bayesian Statistics Lecture Five Spring 2026, UC Berkeley

In the last lecture, we started discussing the following problem.

Example 4: Coin Fairness Testing

We looked at frequentist solutions in the last lecture. These are generally based on $p$-values. Calculation of these $p$-values requires consideration of alternative datasets that could have appeared (but did not actually appear). As we discussed last time, such calculations violate the likelihood principle.

Here is a Bayesian solution to this problem. The goal is to calculate:

$$\P\{\text{fairness} \mid \text{data}\}$$

where $\text{data}$ refers to the sequence TTTTHTHTTTTH. By Bayes' rule, we can write

$$\P\{\text{fairness} \mid \text{data}\} = \frac{\P\{\text{data} \mid \text{fairness}\}\, \P\{\text{fairness}\}}{\P\{\text{data} \mid \text{fairness}\}\, \P\{\text{fairness}\} + \P\{\text{data} \mid \text{not fair}\}\, \P\{\text{not fair}\}}$$

We clearly have

$$\P\{\text{data} \mid \text{fairness}\} = 2^{-n}.$$

What assignments do we use for

$$\P\{\text{fairness}\}, \quad \P\{\text{not fair}\} \quad \text{and} \quad \P\{\text{data} \mid \text{not fair}\}?$$

For concreteness, let us assume

$$\P\{\text{fairness}\} = 0.5 \quad \text{and} \quad \P\{\text{not fair}\} = 0.5.$$

This is actually a very strong assumption in favor of fairness, because a coin can be unfair in many different ways. To assume that the probability of fairness equals the combined probability of all the ways in which the coin can be unfair is therefore quite a strong commitment.

Let us now come to $\P\{\text{data} \mid \text{not fair}\}$. If the coin is not fair, we can assume that it has some heads probability $p$ and that the coin tosses are still independent. We can then write

$$\P\{\text{data} \mid \text{not fair}\} = \int_0^1 \P\{\text{data} \mid \text{not fair}, p\}\, f_{p \mid \text{not fair}}(p)\, dp = \int_0^1 p^3 (1 - p)^9\, f_{p \mid \text{not fair}}(p)\, dp.$$

To proceed further, we need to assign $f_{p \mid \text{not fair}}(p)$. One concrete assumption might be that

$$f_{p \mid \text{not fair}}(p) = 1 \quad \text{for every } p \in [0, 1].$$

This corresponds to the assumption that, under the alternative (not fair), $p$ has the uniform distribution on $[0, 1]$. Then (using an online integrator)

$$\P\{\text{data} \mid \text{not fair}\} = \int_0^1 p^3 (1 - p)^9\, dp = \frac{1}{2860}.$$
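This integral can also be recognized as a Beta function, $\int_0^1 p^3 (1-p)^9\, dp = B(4, 10) = \frac{3!\, 9!}{13!}$, so no numerical integrator is needed; a quick exact check in Python (variable name is illustrative):

```python
from fractions import Fraction
from math import factorial

# The integral of p^3 (1 - p)^9 over [0, 1] is the Beta function
# B(4, 10) = 3! * 9! / 13!, computed here with exact arithmetic.
evidence_not_fair = Fraction(factorial(3) * factorial(9), factorial(13))
print(evidence_not_fair)  # 1/2860
```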

We then get

$$\P\{\text{fairness} \mid \text{data}\} = \frac{2^{-12} \times 0.5}{2^{-12} \times 0.5 + \frac{1}{2860} \times 0.5} \approx 0.4111558.$$

Note that this Bayesian probability calculation does not depend at all on whether the number of tosses ($n = 12$) was decided a priori or whether it was decided to toss until getting 3 heads. It is the same for both those cases.
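Putting the pieces together, the posterior probability of fairness can be reproduced exactly with rational arithmetic (a minimal sketch; the variable names are chosen here for illustration):

```python
from fractions import Fraction

prior_fair = Fraction(1, 2)          # P(fair) = P(not fair) = 1/2
lik_fair = Fraction(1, 2**12)        # P(data | fair) = 2^-12
lik_not_fair = Fraction(1, 2860)     # P(data | not fair) from the Beta integral

# Bayes' rule with the two hypotheses
posterior_fair = (lik_fair * prior_fair) / (
    lik_fair * prior_fair + lik_not_fair * (1 - prior_fair)
)
print(posterior_fair, float(posterior_fair))  # 715/1739 ≈ 0.4111558
```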

Also note that the Bayesian approach (based on the probability assignments above) only slightly supports the alternative hypothesis (roughly $60\%$ versus $40\%$ for the null), while the frequentist $p$-values are fairly small, indicating much stronger evidence for the alternative. This discrepancy also persists when the sample size is large. Consider the following example.

Example 5: MacKay sequence example

Here is another example of hypothesis testing or model selection in the Bayesian framework. This example comes from the book MacKay (2003, Chapter 28).

Most people would look at the sequence $-1, 3, 7, 11$ and guess the next number as 15. In other words, they recognize that the data form an arithmetic progression. If they are willing to consider alternative models, then we can pose the question as one of model selection, and use Bayesian methods (probability) to answer it formally.

Here are two possible models:

  1. Model 1: Arithmetic Progression, i.e., $a_1 = \alpha$ and $a_{n+1} = a_n + \beta$.

  2. Model 2: Random (will be made precise later).

A Bayesian solution to this problem will attempt to calculate

$$\P\{\text{Model}~i \mid \text{data}\} \quad \text{for } i = 1, 2.$$

What probability assignments would we need to calculate the above? We can use Bayes' rule to write

$$\P\{\text{Model}~i \mid \text{data}\} = \frac{\P\{\text{data} \mid \text{Model}~i\}\, \P\{\text{Model}~i\}}{\P\{\text{data} \mid \text{Model}~1\}\, \P\{\text{Model}~1\} + \P\{\text{data} \mid \text{Model}~2\}\, \P\{\text{Model}~2\}}$$

To be equally fair to both models, we shall take

$$\P\{\text{Model}~i\} = \frac{1}{2} \quad \text{for each } i = 1, 2.$$

We now need to calculate $\P\{\text{data} \mid \text{Model}~i\}$ for $i = 1, 2$. For $i = 1$, we have (below, $\alpha$ and $\beta$ are the parameters in Model 1):

$$\P\{\text{data} \mid \text{Model}~1\} = \P\{\alpha = -1, \beta = 4\}.$$

To calculate the above, we need to make a probability assignment for the values taken by $\alpha$ and $\beta$. MacKay (2003, Chapter 28) assumes that $\alpha$ and $\beta$ are integer-valued and that they are independently and uniformly distributed over the set $\{-50, -49, \dots, 49, 50\}$, which has cardinality 101. Then

$$\P\{\text{data} \mid \text{Model}~1\} = \P\{\alpha = -1, \beta = 4\} = \P\{\alpha = -1\}\, \P\{\beta = 4\} = \left(\frac{1}{101}\right)^2 \approx 9.8 \times 10^{-5}.$$

For the second model, we need to specify what we mean by “random”. We shall take this to mean that $a_1, a_2, a_3, a_4$ are independently distributed according to the uniform distribution on $\{-50, -49, \dots, 49, 50\}$. Then

$$\begin{align*} \P\{\text{data} \mid \text{Model}~2\} &= \P\{a_1 = -1, a_2 = 3, a_3 = 7, a_4 = 11\} \\ &= \P\{a_1 = -1\}\, \P\{a_2 = 3\}\, \P\{a_3 = 7\}\, \P\{a_4 = 11\} = \left(\frac{1}{101}\right)^4 \approx 9.6 \times 10^{-9}. \end{align*}$$

Plugging these values into the Bayes rule formula above, we get

$$\P\{\text{Model}~1 \mid \text{data}\} = \frac{101^{-2} \times 0.5}{101^{-2} \times 0.5 + 101^{-4} \times 0.5} \approx 0.999902 \quad \text{and} \quad \P\{\text{Model}~2 \mid \text{data}\} \approx 0.000098.$$
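With equal priors the normalizing constant cancels, and the posterior odds equal the likelihood ratio, so $\P\{\text{Model}~1 \mid \text{data}\} = 101^2/(101^2 + 1)$. A short exact computation (variable names are illustrative):

```python
from fractions import Fraction

lik_m1 = Fraction(1, 101**2)   # P(data | Model 1): one (alpha, beta) pair out of 101^2
lik_m2 = Fraction(1, 101**4)   # P(data | Model 2): four independent uniform draws
prior = Fraction(1, 2)         # equal prior probability on the two models

post_m1 = lik_m1 * prior / (lik_m1 * prior + lik_m2 * prior)
print(float(post_m1))  # ≈ 0.999902
```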

This analysis clearly favors Model 1 over Model 2. The most interesting feature of this analysis is that

$$\P\{\text{Model}~1 \mid \text{data}\} \gg \P\{\text{Model}~2 \mid \text{data}\}$$

even though

$$\P\{\text{Model}~1\} = \P\{\text{Model}~2\}.$$

In other words, we did not dogmatically assert that the data were generated by an arithmetic progression; rather, we gave both models a fair chance to explain the observed sequence.

In this example, some people argue in favor of Model 1 on the basis that Model 1 is “simpler” than Model 2. The Bayesian analysis above (based on probability theory) does not invoke any vague notion of simplicity; it performs formal calculations which, in this case, prefer Model 1 to Model 2. In another situation, Model 2 may well be the preferred model.

One can consider other alternative models in this problem. For example, MacKay (2003, Chapter 28) considered the following cubic model:

Model 3 (Cubic): These numbers were generated by the formula $a_1 = a$ and $a_{n+1} = b a_n^3 + c a_n^2 + d$ for an integer $a$ and rational numbers $b, c, d$.

This cubic model explains the given data perfectly if and only if its four parameters $a, b, c, d$ are chosen as $a = -1$, $b = -1/11$, $c = 9/11$, $d = 23/11$. As a result,

$$\P\{\text{data} \mid \text{Model}~3\} = \P\{a = -1,\, b = -1/11,\, c = 9/11,\, d = 23/11\}.$$
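One can check directly that these parameter values reproduce the observed sequence; a quick verification with exact rational arithmetic:

```python
from fractions import Fraction

# Cubic recursion a_{n+1} = b*a_n^3 + c*a_n^2 + d with the stated parameter values
a, b, c, d = Fraction(-1), Fraction(-1, 11), Fraction(9, 11), Fraction(23, 11)

seq = [a]
for _ in range(3):
    x = seq[-1]
    seq.append(b * x**3 + c * x**2 + d)
print([int(t) for t in seq])  # [-1, 3, 7, 11]
```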

In order to calculate the above explicitly, we need to make probability assignments for $a, b, c, d$. MacKay (2003) makes the following probability assignment: the four parameters are independent, with $a$ uniform on $\{-50, -49, \dots, 49, 50\}$ and each of $b, c, d$ having the distribution of $x/y$, where $x \sim \text{Unif}\{-50, -49, \dots, 49, 50\}$ and $y \sim \text{Unif}\{1, \dots, 50\}$ are independent. Under this assignment:

$$\begin{align*} \P\{a = -1, b = -1/11, c = 9/11, d = 23/11\} &= \P\{a = -1\}\, \P\{b = -1/11\}\, \P\{c = 9/11\}\, \P\{d = 23/11\} \\ &= \left(\frac{1}{101}\right) \left(4 \cdot \frac{1}{101} \cdot \frac{1}{50}\right) \left(4 \cdot \frac{1}{101} \cdot \frac{1}{50}\right) \left(2 \cdot \frac{1}{101} \cdot \frac{1}{50}\right) \\ &\approx 2.5 \times 10^{-12}. \end{align*}$$

In the above, we used $\P\{b = -1/11\} = 4 \cdot (1/101) \cdot (1/50)$ because $-1/11 = -2/22 = -3/33 = -4/44$ and each of these representations has probability $(1/101) \cdot (1/50)$. Similar reasoning gives the factor 4 for $\P\{c = 9/11\}$ and the factor 2 for $\P\{d = 23/11\}$ (only $23/11 = 46/22$ fits within the allowed ranges).
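These representation counts can be checked by brute force over the allowed $(x, y)$ pairs (a sketch; the helper name is made up):

```python
from fractions import Fraction

def n_representations(target):
    # Count pairs (x, y) with x in {-50,...,50}, y in {1,...,50} and x/y == target
    return sum(1 for x in range(-50, 51) for y in range(1, 51)
               if Fraction(x, y) == target)

print(n_representations(Fraction(-1, 11)),   # 4: -1/11, -2/22, -3/33, -4/44
      n_representations(Fraction(9, 11)),    # 4
      n_representations(Fraction(23, 11)))   # 2: 23/11, 46/22
```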

If Models 1, 2, and 3 are the only three models considered, the Bayes rule becomes

$$\P\{\text{Model}~i \mid \text{data}\} = \frac{\P\{\text{data} \mid \text{Model}~i\}\, \P\{\text{Model}~i\}}{\P\{\text{data}\}}$$

where the denominator should be calculated as:

$$\P\{\text{data}\} = \sum_{i=1}^3 \P\{\text{data} \mid \text{Model}~i\}\, \P\{\text{Model}~i\}.$$

Under the fair assumption

$$\P\{\text{Model}~i\} = \frac{1}{3} \quad \text{for each } i = 1, 2, 3,$$

we obtain

$$\P\{\text{Model}~1 \mid \text{data}\} = \frac{101^{-2} \times (1/3)}{101^{-2} \times (1/3) + 101^{-4} \times (1/3) + 2.5 \times 10^{-12} \times (1/3)} \approx 0.999902$$

and

$$\P\{\text{Model}~2 \mid \text{data}\} \approx 9.8 \times 10^{-5}$$

and

$$\P\{\text{Model}~3 \mid \text{data}\} \approx 2.55 \times 10^{-8}.$$
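These three posteriors can be reproduced numerically; the sketch below uses the exact Model 3 evidence from the counting argument rather than the rounded $2.5 \times 10^{-12}$, so its Model 3 posterior comes out very slightly smaller (variable names are illustrative):

```python
from fractions import Fraction

# Marginal likelihoods (evidences) under the three models
lik = {
    1: Fraction(1, 101**2),                                                  # arithmetic progression
    2: Fraction(1, 101**4),                                                  # random
    3: Fraction(1, 101) * Fraction(4, 101 * 50)**2 * Fraction(2, 101 * 50),  # cubic
}
prior = Fraction(1, 3)  # equal prior over the three models

marginal = sum(lik[i] * prior for i in lik)
posterior = {i: float(lik[i] * prior / marginal) for i in lik}
print(posterior)
```

Model 1's posterior is about 0.999902, Model 2's about $9.8 \times 10^{-5}$, and Model 3's is of order $10^{-8}$, in line with the values above.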

Our preference for Model 1 is still as strong as before (when we only considered the two models Model 1 and Model 2).

The analysis given here depends on the specific choices of priors used for the three models. One can of course use alternative priors but the qualitative preference for Model 1 is unlikely to change for most reasonable prior choices.

References
  1. MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press.