Consider the following problem from Lectures 4 and 5. We observe toss outcomes TTTTHTHTTTTH from a coin. Based on this data, we need to assess whether the coin is fair or not.
Frequentist Inference
For frequentist inference, this can be formulated as a hypothesis testing problem with null hypothesis $H_0: \theta = 1/2$ (the coin is fair, $\theta$ being the probability of heads) and alternative hypothesis $H_1: \theta < 1/2$ (the coin is skewed towards tails). One then calculates the p-value, which is defined as the probability of obtaining results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct. One problem with p-values (taken from the Wikipedia article P-value) is that the "at least as extreme" definition of p-value is deeply contextual and depends on what the experimenter planned to do even in situations that did not occur. Consider the following three stopping rules which could have generated this data:
S1: Toss the coin exactly 12 times
S2: Continue tossing until 3 heads appear
S3: Continue tossing until the pattern TH appears the third time
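Before computing anything, it is worth checking that the observed sequence TTTTHTHTTTTH is indeed consistent with all three stopping rules (this quick check is not part of the lecture code):

```python
seq = "TTTTHTHTTTTH"

# S1: exactly 12 tosses
assert len(seq) == 12

# S2: the 3rd head appears exactly at the last toss
assert seq.count("H") == 3 and seq[-1] == "H"

# S3: the pattern "TH" completes for the 3rd time exactly at the last toss
th_positions = [i for i in range(1, len(seq)) if seq[i - 1] == "T" and seq[i] == "H"]
assert len(th_positions) == 3 and th_positions[-1] == len(seq) - 1

print("All three stopping rules are consistent with the observed data")
```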
Below we calculate the p-value under each of the above stopping rules. We will see that the answers will be different. We will calculate the p-values by simulation. We will simply generate alternative datasets following the stopping rule, and then compute the proportion of samples where the proportion of heads is at most 0.25 (note that the proportion of heads in the observed sample is exactly 0.25).
import numpy as np
from scipy import stats
from scipy.special import betaln, logsumexp
from scipy.stats import beta
#For the first stopping rule:
np.random.seed(42)
N = 1000000
n_tosses = 12
threshold = 0.25
tosses = np.random.randint(0, 2, size=(N, n_tosses)) #0 is tails and 1 is heads
heads_count = tosses.sum(axis = 1)
proportions = heads_count / n_tosses
extreme_results = np.sum(proportions <= threshold)
empirical_p_value = extreme_results / N
print(f"Empirical p-value for first stopping rule: {empirical_p_value:.4f}")
#Actual p-value using the binomial distribution:
p_value = stats.binom.cdf(threshold * n_tosses, n_tosses, 0.5)
print(f"Actual p-value for first stopping rule: {p_value:.4f}")
Empirical p-value for first stopping rule: 0.0730
Actual p-value for first stopping rule: 0.0730
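As a cross-check, the exact p-value under S1 can be obtained by direct counting: the probability of at most 3 heads in 12 fair tosses is $\binom{12}{0} + \binom{12}{1} + \binom{12}{2} + \binom{12}{3}$ divided by $2^{12}$, i.e. $299/4096 \approx 0.0730$.

```python
from math import comb

# Exact tail probability: P(at most 3 heads in 12 fair tosses)
n_tosses = 12
p_exact = sum(comb(n_tosses, k) for k in range(4)) / 2**n_tosses
print(p_exact)  # 299/4096 = 0.07300...
```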
#For the second stopping rule:
np.random.seed(42)
N = 100000
r = 3 #number of heads needed
threshold = 0.25
#np.random.negative_binomial generates number of failures before r successes
failures = np.random.negative_binomial(r, 0.5, size=N)
total_tosses = failures + r
proportions = r / total_tosses
extreme_results = np.sum(proportions <= threshold)
empirical_p_value = extreme_results / N
print(f"Empirical p-value for second stopping rule: {empirical_p_value:.4f}")
#Actual p-value using the negative binomial distribution:
#We need P(X >= k), where X is the number of tails (failures) before the 3rd head and k = 9
k = int((r / threshold) - r)
p_value = 1 - stats.nbinom.cdf(k - 1, r, 0.5) #probability of at least 9 failures before the 3rd head
print(f"Actual p-value for second stopping rule: {p_value:.4f}")
Empirical p-value for second stopping rule: 0.0326
Actual p-value for second stopping rule: 0.0327
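This tail probability also has an exact combinatorial form: "at least 9 failures before the 3rd head" is the same event as "at most 2 heads in the first 11 tosses" (the 3rd head then arrives on toss 12 or later), which gives $(\binom{11}{0} + \binom{11}{1} + \binom{11}{2})/2^{11} = 67/2048 \approx 0.0327$. A quick sketch of the equivalence:

```python
from math import comb
from scipy import stats

# "At least 9 failures before the 3rd head" <=> "at most 2 heads in first 11 tosses"
p_binom = sum(comb(11, k) for k in range(3)) / 2**11  # = 67/2048
p_nbinom = stats.nbinom.sf(8, 3, 0.5)  # P(X >= 9), X = failures before 3rd success
print(p_binom, p_nbinom)
```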
#For the third stopping rule:
np.random.seed(42)
N = 100000
threshold = 0.25
pattern_count_needed = 3
def simulate_until_pattern(n_simulations, pattern_count_needed):
    #this function simulates tosses until the pattern 'TH' appears pattern_count_needed times
    total_tosses = np.zeros(n_simulations, dtype=int)
    heads_count = np.zeros(n_simulations, dtype=int)
    for i in range(n_simulations):
        tosses = 0
        heads = 0
        th_count = 0
        prev_toss = None
        while th_count < pattern_count_needed:
            current_toss = np.random.randint(0, 2) #0 is tails and 1 is heads
            tosses += 1
            heads += current_toss
            #check for 'TH' pattern (prev = 0 = T, current = 1 = H)
            if prev_toss == 0 and current_toss == 1:
                th_count += 1
            prev_toss = current_toss
        total_tosses[i] = tosses
        heads_count[i] = heads
    return heads_count, total_tosses
#Run simulation
heads_count, total_tosses = simulate_until_pattern(N, pattern_count_needed)
proportions = heads_count / total_tosses
extreme_results = np.sum(proportions <= threshold)
empirical_p_value = extreme_results / N
print(f"Empirical p-value for third stopping rule: {empirical_p_value:.4f}")
Empirical p-value for third stopping rule: 0.0257
Here are some things to note from the above:
The p-value changes with the stopping rule. Sometimes it is below the conventional cutoff of 0.05 and sometimes it is not.
The p-value is almost entirely based on calculations done with fictitious datasets that were never actually generated. From the observed dataset, we only record the proportion of heads, and use it to determine the extremeness of fictitious datasets.
Bayesian Inference
Bayesian inference is based on calculating the posterior probability that the coin is fair. This is calculated as:

$$P(\text{fair} \mid \text{data}) = \frac{\frac{1}{2} \cdot 2^{-n}}{\frac{1}{2} \cdot 2^{-n} + \frac{1}{2} \int_0^1 \theta^x (1 - \theta)^{n - x} \pi(\theta) \, d\theta}$$

This calculation assumes that both hypotheses, fair and unfair, are assigned equal probability 0.5 a priori. Under unfairness, the probability of heads $\theta$ is supposed to have the density $\pi(\theta)$. The answer above depends on the choice of $\pi$. The simplest choice is a uniform prior on $[0, 1]$, i.e., $\pi(\theta) = 1$ for $0 \leq \theta \leq 1$. This corresponds to assuming that $\theta$ has the uniform density on $[0, 1]$ when the coin is unfair. This leads to the formula:

$$P(\text{fair} \mid \text{data}) = \frac{2^{-n}}{2^{-n} + B(x + 1, n - x + 1)}$$

where $n$ is the number of tosses, $x$ is the number of heads, and

$$B(a, b) = \int_0^1 \theta^{a - 1} (1 - \theta)^{b - 1} \, d\theta$$

is the Beta function.
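For integer arguments the Beta function reduces to factorials, $B(x + 1, n - x + 1) = x! \, (n - x)! / (n + 1)!$, so the marginal likelihood under the uniform prior equals $1 / \big((n + 1) \binom{n}{x}\big)$. A quick numerical check of this identity for the values used below:

```python
import numpy as np
from math import factorial
from scipy.special import betaln

# B(x+1, n-x+1) = x! (n-x)! / (n+1)! for integer x, n
n, x = 12, 3
exact = factorial(x) * factorial(n - x) / factorial(n + 1)
print(np.isclose(np.exp(betaln(x + 1, n - x + 1)), exact))
```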
n = 12
x = 3
log_a = -n * np.log(2.0) # log(2^{-n})
log_b = betaln(x + 1, n - x + 1) # log B(x+1, n-x+1)
val = np.exp(log_a - logsumexp([log_a, log_b])) # posterior probability of fairness
print(val)
0.4111558366877518
So the Bayesian method with these prior assumptions only slightly supports the alternative hypothesis (roughly 60% to 40%), while the frequentist p-values are fairly small, indicating more evidence for the alternative. This discrepancy also persists with larger sample sizes, as seen in the following example from Lecture 5.
n = 98451
x = 49581
#frequentist p-value assuming the binomial distribution:
p_value = 1 - stats.binom.cdf(x - 1, n, 0.5)
print(f"Frequentist p-value: {p_value:.4f}")
#Bayesian posterior probability of fairness:
log_a = -n * np.log(2.0)
log_b = betaln(x + 1, n - x + 1)
val = np.exp(log_a - logsumexp([log_a, log_b]))
print(val)
print(x / n) #observed proportion of heads
Frequentist p-value: 0.0118
0.9505229569817306
0.5036109333577109
This huge discrepancy between the Bayesian and frequentist results is called the Jeffreys-Lindley paradox (see https://en.wikipedia.org/wiki/Lindley's_paradox for some explanations).
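One way to see the paradox in action is a small sweep over sample sizes, keeping the observed proportion of heads near the 0.5036 from the example above (the (n, x) pairs other than the last are hypothetical, chosen for illustration):

```python
import numpy as np
from scipy.special import betaln, logsumexp

# Posterior probability of fairness (uniform prior on the alternative)
# for increasing n with the observed proportion held near 0.5036.
posteriors = []
for n, x in [(100, 50), (1000, 504), (10000, 5036), (98451, 49581)]:
    log_a = -n * np.log(2.0)
    log_b = betaln(x + 1, n - x + 1)
    post_fair = np.exp(log_a - logsumexp([log_a, log_b]))
    posteriors.append(post_fair)
    print(n, post_fair)
```

The posterior probability of fairness stays high throughout, even at sample sizes where the frequentist p-value is far below 0.05.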
The Bayesian answer depends crucially on the prior density $\pi$ chosen for the heads probability $\theta$ when the coin is not fair. Suppose we take:

$$\pi(\theta) = \frac{1}{b - a} \text{ for } a \leq \theta \leq b \text{ (and 0 otherwise)}$$

for two values $a < 1/2$ and $b > 1/2$. For example, we can take $a = 0.49$ and $b = 0.51$.

Then the posterior probability of fairness is:

$$P(\text{fair} \mid \text{data}) = \frac{2^{-n}}{2^{-n} + \frac{1}{b - a} \int_a^b \theta^x (1 - \theta)^{n - x} \, d\theta}$$

The integral can be computed using the inbuilt cdf of the Beta density. Note that

$$\int_a^b \theta^x (1 - \theta)^{n - x} \, d\theta = B(x + 1, n - x + 1) \left[ F(b) - F(a) \right]$$

where $F$ is the cdf of a random variable having the Beta density with parameters $x + 1$ and $n - x + 1$.
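The identity can be verified by direct numerical integration on a small example (using the earlier n = 12, x = 3, where the integrand does not underflow):

```python
import numpy as np
from scipy import integrate
from scipy.special import betaln
from scipy.stats import beta

# Check: integral of theta^x (1-theta)^(n-x) over [lo, hi]
# equals B(x+1, n-x+1) * [F(hi) - F(lo)] for F = Beta(x+1, n-x+1) cdf.
n, x, lo, hi = 12, 3, 0.49, 0.51
direct, _ = integrate.quad(lambda t: t**x * (1 - t)**(n - x), lo, hi)
via_beta = np.exp(betaln(x + 1, n - x + 1)) * (
    beta.cdf(hi, x + 1, n - x + 1) - beta.cdf(lo, x + 1, n - x + 1)
)
print(np.isclose(direct, via_beta))
```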
x = 49581
n = 98451
lo = 0.49
hi = 0.51
log_a = -n * np.log(2.0) # log(2^{-n})
log_beta = betaln(x + 1, n - x + 1) # log Beta(x+1, n-x+1)
cdf_diff = beta.cdf(hi, x + 1, n - x + 1) - beta.cdf(lo, x + 1, n - x + 1)
log_b = log_beta + np.log(cdf_diff) - np.log(hi - lo)
val = np.exp(log_a - logsumexp([log_a, log_b]))
print(val)
0.277581683742418
Now the alternative hypothesis gets a much higher probability (72%) compared to the null hypothesis of fairness (28%).