
Consider the following problem from Lectures 4 and 5. We observe toss outcomes TTTTHTHTTTTH from a coin. Based on this data, we need to assess whether the coin is fair or not.

Frequentist Inference

For frequentist inference, this can be formulated as a hypothesis testing problem with null hypothesis $H_0$: the coin is fair, against the alternative hypothesis that it is skewed towards tails. One then calculates the $p$-value, which is defined as the probability of obtaining results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct. One problem with $p$-values (taken from the Wikipedia article P-value) is that the "at least as extreme" definition of the $p$-value is deeply contextual and depends on what the experimenter planned to do, even in situations that did not occur. Consider the following three stopping rules, any of which could have generated this data:

  1. S1: Toss the coin exactly 12 times

  2. S2: Continue tossing until 3 heads appear

  3. S3: Continue tossing until the pattern TH appears the third time

Below we calculate the $p$-value under each of the above stopping rules by simulation, and we will see that the answers differ. For each rule, we generate alternative datasets following that stopping rule, and then compute the proportion of simulated samples in which the proportion of heads is at most 0.25 (note that the proportion of heads in the observed sample is exactly 0.25).

import numpy as np
from scipy import stats
from scipy.special import betaln, logsumexp
from scipy.stats import beta
# For the first stopping rule: 
np.random.seed(42)
N = 1000000
n_tosses = 12
threshold = 0.25

tosses = np.random.randint(0, 2, size=(N, n_tosses)) #0 is tails and 1 is heads
heads_count = tosses.sum(axis = 1)
proportions = heads_count / n_tosses
extreme_results = np.sum(proportions <= threshold)
empirical_p_value = extreme_results / N
print(f"Empirical p-value for first stopping rule: {empirical_p_value:.4f}")

#Actual p-value using the binomial distribution: 
p_value = stats.binom.cdf(threshold * n_tosses , n_tosses, 0.5)
print(f"Actual p-value for first stopping rule: {p_value:.4f}")
Empirical p-value for first stopping rule: 0.0730
Actual p-value for first stopping rule: 0.0730
#For the second stopping rule:
np.random.seed(42)
N = 100000
r = 3 #number of heads needed
threshold = 0.25

#np.random.negative_binomial generates number of failures before r successes
failures = np.random.negative_binomial(r, 0.5, size=N) 
total_tosses = failures + r

proportions = r / total_tosses
extreme_results = np.sum(proportions <= threshold)
empirical_p_value = extreme_results / N
print(f"Empirical p-value for second stopping rule: {empirical_p_value:.4f}")

#Actual p-value using the negative binomial distribution:
#Extremeness means at least k failures (tails) before the rth head, where k = r/threshold - r = 9
k = int((r / threshold) - r)
p_value = 1 - stats.nbinom.cdf(k - 1, r, 0.5) #P(at least 9 failures before the 3rd head)
print(f"Actual p-value for second stopping rule: {p_value:.4f}")
Empirical p-value for second stopping rule: 0.0326
Actual p-value for second stopping rule: 0.0327
#For the third stopping rule: 
np.random.seed(42)

N = 100000
threshold = 0.25
pattern_count_needed = 3

def simulate_until_pattern(n_simulations, pattern_count_needed):
    #this function simulates tosses until the pattern 'TH' appears pattern_count_needed times
    total_tosses = np.zeros(n_simulations, dtype=int)
    heads_count = np.zeros(n_simulations, dtype=int)
    for i in range(n_simulations):
        tosses = 0
        heads = 0
        th_count = 0
        prev_toss = None
        while th_count < pattern_count_needed:
            current_toss = np.random.randint(0, 2) #0 is tails and 1 is heads
            tosses += 1
            heads += current_toss

            #check for 'TH' pattern (prev = 0 = T, current = 1 = H)
            if prev_toss == 0 and current_toss == 1:
                th_count += 1
            
            prev_toss = current_toss
        total_tosses[i] = tosses
        heads_count[i] = heads
    return heads_count, total_tosses

#Run simulation
heads_count, total_tosses = simulate_until_pattern(N, pattern_count_needed)
proportions = heads_count / total_tosses
extreme_results = np.sum(proportions <= threshold)
empirical_p_value = extreme_results / N
print(f"Empirical p-value for third stopping rule: {empirical_p_value:.4f}")
Empirical p-value for third stopping rule: 0.0257
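For the third stopping rule, the $p$-value can also be computed essentially exactly, without simulation, by dynamic programming over the state (number of TH patterns so far, previous toss, number of heads). This is a sketch to check the simulation above, not part of the original notes; the truncation at 400 tosses is an arbitrary choice past which the remaining probability mass is negligible.

```python
import numpy as np
from collections import defaultdict

pattern_needed = 3
threshold = 0.25
max_tosses = 400  # truncation point; leftover probability mass is negligible

# state: (th_count, prev_toss, heads) -> probability, advanced one toss at a time
states = {(0, None, 0): 1.0}
p_exact = 0.0
for t in range(1, max_tosses + 1):
    new_states = defaultdict(float)
    for (th, prev, h), pr in states.items():
        # current toss is tails (0): pattern count unchanged
        new_states[(th, 0, h)] += 0.5 * pr
        # current toss is heads (1): completes a TH pattern iff previous toss was tails
        th_new = th + 1 if prev == 0 else th
        if th_new == pattern_needed:
            # stopping: this toss is a head, so heads = h + 1 out of t tosses
            if (h + 1) / t <= threshold:
                p_exact += 0.5 * pr
        else:
            new_states[(th_new, 1, h + 1)] += 0.5 * pr
    states = new_states

print(f"Exact p-value for third stopping rule: {p_exact:.4f}")
```

The observed data (3 heads in 12 tosses, proportion exactly 0.25) is counted as extreme, and the result should agree with the simulation above up to Monte Carlo error.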

Here are some things to note from the above:

  1. The $p$-value changes with the stopping rule. Sometimes it is below the conventional cutoff of 0.05 and sometimes it is not.

  2. The $p$-value is almost entirely based on calculations done with fictitious datasets that were never actually generated. From the observed dataset, we only record the proportion of heads, and use it to judge the extremeness of fictitious datasets.

Bayesian Inference

Bayesian inference is based on calculating the posterior probability that the coin is fair. This is calculated as:

\begin{align*} \mathbb{P} \left\{\text{fair} \mid \text{data} \right\} = \frac{2^{-n} * 0.5}{2^{-n} * 0.5 + 0.5 * \int_0^1 p^3 (1-p)^9 f_{p \mid \text{not fair}}(p) \, dp} \end{align*}

This calculation assumes that both hypotheses, fair and unfair, are assigned equal prior probability 0.5. Under unfairness, the probability of heads $p$ is assumed to have the density $f_{p \mid \text{not fair}}(p)$, and the answer above depends on this choice. The simplest choice is a $\textsf{Uniform}$ prior on $[0,1] \setminus \{0.5\}$, i.e., $f_{p \mid \text{not fair}}(p) = I\{0 < p < 1\}$ (excluding the single point 0.5 does not change any integral). This corresponds to assuming that $p$ has the uniform density on $(0, 1)$ when the coin is unfair, and leads to the formula:

\begin{align*} \mathbb{P} \left\{\text{fair} \mid \text{data} \right\} = \frac{2^{-n} * 0.5}{2^{-n} * 0.5 + 0.5 * \int_0^1 p^3 (1-p)^9 \, dp} = \frac{2^{-n} * 0.5}{2^{-n} * 0.5 + 0.5 * \text{Beta}(x+1, n-x+1)} \end{align*}

where $n = 12$, $x = 3$, and

\begin{align*} \text{Beta}(\alpha, \beta) := \int_0^1 p^{\alpha - 1} (1 - p)^{\beta - 1} \, dp \end{align*}

is the Beta function.
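As an aside, for integer arguments the Beta function has the closed form $\text{Beta}(x+1, n-x+1) = \frac{1}{(n+1)\binom{n}{x}}$, which gives an independent check of the `betaln`-based computation (a standard identity, stated here as a check rather than taken from the lecture):

```python
import numpy as np
from scipy.special import betaln, comb

# Check Beta(x+1, n-x+1) = 1 / ((n+1) * C(n, x)) for n = 12, x = 3
n, x = 12, 3
lhs = np.exp(betaln(x + 1, n - x + 1))   # Beta(4, 10) via log-gamma
rhs = 1.0 / ((n + 1) * comb(n, x))       # 1 / (13 * 220)
print(lhs, rhs)
```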

n = 12
x = 3
log_a = -n * np.log(2.0)             # log(2^{-n}), the log-likelihood under fairness
log_b = betaln(x + 1, n - x + 1)     # log Beta(x+1, n-x+1), the log marginal likelihood under unfairness
val = np.exp(log_a - logsumexp([log_a, log_b]))
print(val)
0.4111558366877518
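As a sanity check (a sketch, not part of the original analysis), the same posterior probability can be approximated by plain Monte Carlo: draw $p$ uniformly and average the likelihood $p^x (1-p)^{n-x}$ to estimate the marginal likelihood under unfairness.

```python
import numpy as np

rng = np.random.default_rng(0)
n, x = 12, 3
M = 1_000_000

# Estimate the marginal likelihood under unfairness: E[p^x (1-p)^(n-x)] with p ~ Uniform(0,1)
p = rng.uniform(0.0, 1.0, size=M)
marg_unfair = np.mean(p**x * (1 - p)**(n - x))

lik_fair = 0.5**n  # likelihood of the data under the fair coin
posterior_fair = (0.5 * lik_fair) / (0.5 * lik_fair + 0.5 * marg_unfair)
print(posterior_fair)  # close to the exact value 0.4112 above
```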

So the Bayesian method with these prior assumptions only slightly supports the alternative hypothesis (roughly 59% to 41%), while the frequentist $p$-values are fairly small, indicating stronger evidence for the alternative. This discrepancy persists with larger sample sizes, as seen in the following example from Lecture 5.

n = 98451
x = 49581
#frequentist p-value assuming the binomial distribution: 
p_value = 1 - stats.binom.cdf(x - 1, n, 0.5)
print(f"Frequentist p-value: {p_value:.4f}")

#Bayesian posterior probability of fairness: 
log_a = -n * np.log(2.0)
log_b = betaln(x + 1, n - x + 1)
val = np.exp(log_a - logsumexp([log_a, log_b]))
print(val)
Frequentist p-value: 0.0118
0.9505229569817306

This huge discrepancy between the Bayesian and frequentist results is called the Jeffreys-Lindley paradox (see https://en.wikipedia.org/wiki/Lindley's_paradox for some explanations).
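The paradox can be made vivid with a small sweep (an illustration with arbitrarily chosen sample sizes, not from the lecture): hold the frequentist evidence roughly fixed at $z \approx 2.25$ while $n$ grows, and the posterior probability of fairness under the uniform prior climbs towards 1.

```python
import numpy as np
from scipy import stats
from scipy.special import betaln, logsumexp

z = 2.25  # roughly the z-score of the n = 98451 example above
p_values, posteriors = [], []
for n in [100, 10_000, 1_000_000]:
    x = int(round(n / 2 + z * np.sqrt(n) / 2))  # heads count with z-score about 2.25
    p_value = 1 - stats.binom.cdf(x - 1, n, 0.5)
    log_a = -n * np.log(2.0)
    log_b = betaln(x + 1, n - x + 1)
    post_fair = np.exp(log_a - logsumexp([log_a, log_b]))
    p_values.append(p_value)
    posteriors.append(post_fair)
    print(f"n={n:>9}: p-value={p_value:.4f}, P(fair | data)={post_fair:.4f}")
```

The $p$-value stays near 0.01-0.02 throughout, while the posterior probability of fairness increases steadily with $n$.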

The Bayesian answer depends crucially on the prior density chosen for pp when the coin is not fair. Suppose we take:

\begin{align*} f_{p \mid \text{not fair}}(p) = \frac{I\{\text{low} < p < \text{high}\}}{\text{high} - \text{low}} \end{align*}

for two values $\text{low}$ and $\text{high}$. For example, we can take $\text{low} = 0.49$ and $\text{high} = 0.51$.

Then the posterior probability of fairness is:

\begin{align*} \mathbb{P} \left\{\text{fair} \mid \text{data} \right\} = \frac{2^{-n} * 0.5}{2^{-n} * 0.5 + \frac{0.5}{\text{high} - \text{low}} * \int_{\text{low}}^{\text{high}} p^x (1-p)^{n-x} \, dp} \end{align*}

The integral can be computed using the built-in CDF of the Beta distribution. Note that

\begin{align*} \int_{\text{low}}^{\text{high}} p^x (1-p)^{n-x} \, dp = \text{Beta}(x+1, n-x+1) \, \mathbb{P} \left\{\text{low} < Z < \text{high} \right\} \end{align*}

where $Z$ has the Beta density with parameters $x+1$ and $n - x + 1$.
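Before using this identity at large $n$ (where the integrand underflows and direct numerical integration fails), it can be sanity-checked at small illustrative values, say $x = 3$, $n = 12$, $\text{low} = 0.2$, $\text{high} = 0.6$ (these numbers are arbitrary choices for the check):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import beta as beta_fn
from scipy.stats import beta

x, n = 3, 12
lo, hi = 0.2, 0.6

# Left side: direct numerical integration of p^x (1-p)^(n-x) over (lo, hi)
lhs, _ = quad(lambda p: p**x * (1 - p)**(n - x), lo, hi)

# Right side: Beta(x+1, n-x+1) times P(lo < Z < hi) for Z ~ Beta(x+1, n-x+1)
rhs = beta_fn(x + 1, n - x + 1) * (beta.cdf(hi, x + 1, n - x + 1) - beta.cdf(lo, x + 1, n - x + 1))
print(lhs, rhs)  # the two sides agree
```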

x = 49581
n = 98451

lo = 0.49
hi = 0.51

log_a = -n * np.log(2.0)                 # log(2^{-n})
log_beta = betaln(x + 1, n - x + 1)      # log Beta(x+1, n-x+1)

cdf_diff = beta.cdf(hi, x + 1, n - x + 1) - beta.cdf(lo, x + 1, n - x + 1)
log_b = log_beta + np.log(cdf_diff) - np.log(hi - lo)

val = np.exp(log_a - logsumexp([log_a, log_b]))
print(val)
0.277581683742418

Now the alternative hypothesis gets a much higher probability (about 72%) than the null hypothesis of fairness (about 28%).
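The sensitivity to this prior choice can be seen by sweeping the half-width of the interval around 0.5 (the widths below are illustrative choices, not from the lecture); the half-width 0.01 reproduces the calculation above.

```python
import numpy as np
from scipy.special import betaln, logsumexp
from scipy.stats import beta

n, x = 98451, 49581
log_a = -n * np.log(2.0)                 # log(2^{-n})
log_beta = betaln(x + 1, n - x + 1)      # log Beta(x+1, n-x+1)

results = {}
for half_width in [0.001, 0.005, 0.01, 0.05, 0.25]:
    lo, hi = 0.5 - half_width, 0.5 + half_width
    cdf_diff = beta.cdf(hi, x + 1, n - x + 1) - beta.cdf(lo, x + 1, n - x + 1)
    log_b = log_beta + np.log(cdf_diff) - np.log(hi - lo)
    results[half_width] = np.exp(log_a - logsumexp([log_a, log_b]))
    print(f"prior on ({lo:.3f}, {hi:.3f}): P(fair | data) = {results[half_width]:.4f}")
```

The posterior probability of fairness depends strongly, and not even monotonically, on the chosen interval, which underlines how much the Bayesian answer hinges on the prior for $p$ under unfairness.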