
More on Example 1 from Lecture

In Lectures 1 and 2, we studied the following problem.

Let $\Theta$ denote the binary parameter which represents whether I truly have Covid or not ($\Theta = 1$ when I have Covid and $\Theta = 0$ when I don't). Let $X$ denote the binary outcome of the Covid test, so that $X = 1$ represents a positive test. We need to calculate the probability:

$$\P \left\{\Theta = 1 \mid \text{test data and background information} \right\}$$

where the test data is simply $X = 1$, and the background information refers to things like “I have been strictly quarantining for the past 3 weeks”, “I do not have symptoms such as fever”, etc.

We used the probability model (below $B$ stands for background information):

$$\begin{split} & \text{prior: } \P(\Theta = 1 \mid B) = 0.02 \\ & \text{likelihood: } \P(X = 1 \mid \Theta = 1, B) = 0.99 \quad \P(X = 1 \mid \Theta = 0, B) = 0.04. \end{split}$$

With these probability assignments, we use the Bayes rule to compute the probability above as

$$\begin{align*} \P(\Theta = 1 \mid X = 1, B) &= \frac{\P(X = 1 \mid \Theta = 1, B)\, \P(\Theta = 1 \mid B)}{\P(X = 1 \mid \Theta = 1, B)\, \P(\Theta = 1 \mid B) + \P(X = 1 \mid \Theta = 0, B)\, \P(\Theta = 0 \mid B)} \\ &= \frac{0.99 \times 0.02}{0.99 \times 0.02 + 0.04 \times 0.98} = 0.3356. \end{align*}$$

This probability is not very high even though the test has very good false positive and false negative rates. This is because the prior probability $\P(\Theta = 1 \mid B)$ is very low (0.02). So, even with the positive test result, it is more likely than not that we are Covid-free.
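As a quick numerical check, the Bayes rule computation above can be reproduced in a few lines of Python (a minimal sketch; the variable names are mine):

```python
# Probability assignments from the example
prior = 0.02   # P(Theta = 1 | B)
sens = 0.99    # P(X = 1 | Theta = 1, B): true positive rate
fpr = 0.04     # P(X = 1 | Theta = 0, B): false positive rate

# Bayes rule: P(Theta = 1 | X = 1, B)
posterior = sens * prior / (sens * prior + fpr * (1 - prior))
print(round(posterior, 4))  # 0.3356
```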

On the other hand, if we pose the problem as hypothesis testing:

$$H_0 : \Theta = 0 \quad \text{versus} \quad H_1 : \Theta = 1$$

and calculate the $p$-value as

$$\P\{X = 1 \mid H_0\} = \P(X = 1 \mid \Theta = 0) = 0.04,$$

we get a different result (if we use the standard cutoff of 0.05 on the $p$-value): the null hypothesis is rejected and we declare that I have Covid.

The use of $p$-values has been linked to serious issues such as lack of reproducibility (see for example the paper “The reproducibility of research and the misinterpretation of $p$-values” by David Colquhoun). In this context, we can calculate the probability of reproducing the positive test as follows. Let $X_2$ denote the outcome of a second test (and $X_1 = X$ will now denote the outcome of the first test):

$$\begin{align*} \P(X_2 = 1 \mid X_1 = 1, B) &= \P(X_2 = 1 \mid \Theta = 1, X_1 = 1, B)\, \P(\Theta = 1 \mid X_1 = 1, B) \\ &\quad + \P(X_2 = 1 \mid \Theta = 0, X_1 = 1, B)\, \P(\Theta = 0 \mid X_1 = 1, B). \end{align*}$$

To calculate the probabilities on the right-hand side above, we make the following assignment:

$$\begin{align*} & \P(X_2 = 1 \mid \Theta = 1, X_1 = 1, B) = \P(X_2 = 1 \mid \Theta = 1) = 0.99 \\ & \P(X_2 = 1 \mid \Theta = 0, X_1 = 1, B) = \P(X_2 = 1 \mid \Theta = 0) = 0.04. \end{align*}$$

This assumption means that conditional on my Covid status $\Theta$, the two test outcomes $X_1$ and $X_2$ are independent. Using this assignment, it is straightforward to calculate the reproducibility probability as follows (note that we already calculated $\P(\Theta = 1 \mid X_1 = 1, B) = 1 - \P(\Theta = 0 \mid X_1 = 1, B) = 0.3356$):

$$\P(X_2 = 1 \mid X_1 = 1, B) = 0.99 \times 0.3356 + 0.04 \times (1 - 0.3356) = 0.35882.$$

Thus the positive test will be reproducible with probability only 35.88%, which means that a negative result is the more likely outcome of the second test.
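The reproducibility probability can be checked the same way (a sketch; `post` is the posterior 0.3356 computed earlier in the text):

```python
post = 0.3356  # P(Theta = 1 | X_1 = 1, B), computed above

# Law of total probability, conditioning on Theta:
# P(X_2 = 1 | X_1 = 1, B) = 0.99 * post + 0.04 * (1 - post)
repro = 0.99 * post + 0.04 * (1 - post)
print(round(repro, 5))  # 0.35882
```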

Law of Total Probability and Bayes Rule

In Bayesian statistics, the rules of probability are used mostly for the following:

  1. to compute the marginal distribution of $X$ based on knowledge of the conditional distribution of $X$ given $\Theta = \theta$ (i.e., the likelihood) as well as the marginal distribution of $\Theta$ (i.e., the prior);

  2. to compute the conditional distribution of $\Theta$ given $X = x$ (i.e., the posterior) based on the same knowledge of the conditional distribution of $X$ given $\Theta = \theta$ (i.e., the likelihood) as well as the marginal distribution of $\Theta$ (i.e., the prior).

The formula for the first item above is sometimes called the Law of Total Probability (LTP), while the formula for the second item is called the Bayes Rule. The precise formulae differ according to whether $X$ and $\Theta$ are discrete or continuous. It is natural to consider the following four separate cases.

$X$ and $\Theta$ are both discrete

The LTP is

$$\P\{X = x\} = \sum_{\theta} \P\{X = x \mid \Theta = \theta\}\, \P\{\Theta = \theta\}$$

and the Bayes rule is

$$\P\{\Theta = \theta \mid X = x\} = \frac{\P\{X = x \mid \Theta = \theta\}\, \P\{\Theta = \theta\}}{\P\{X = x\}} = \frac{\P\{X = x \mid \Theta = \theta\}\, \P\{\Theta = \theta\}}{\sum_{\theta} \P\{X = x \mid \Theta = \theta\}\, \P\{\Theta = \theta\}}.$$
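For the fully discrete case, the LTP and Bayes rule can be packaged as a small generic routine (a sketch; the function and argument names are my own):

```python
def discrete_posterior(prior, likelihood):
    """Bayes rule for discrete Theta and discrete X.

    prior: dict mapping theta -> P(Theta = theta)
    likelihood: dict mapping theta -> P(X = x | Theta = theta)
        for the observed value x.
    """
    # Numerator of the Bayes rule for each theta
    numer = {th: likelihood[th] * p for th, p in prior.items()}
    # Law of total probability: P(X = x)
    marginal = sum(numer.values())
    return {th: v / marginal for th, v in numer.items()}

# Sanity check against the Covid example (observed X = 1):
post = discrete_posterior({0: 0.98, 1: 0.02}, {0: 0.04, 1: 0.99})
print(round(post[1], 4))  # 0.3356
```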

$X$ and $\Theta$ are both continuous

Here the LTP is

$$f_X(x) = \int f_{X \mid \Theta = \theta}(x)\, f_{\Theta}(\theta)\, d\theta$$

and the Bayes rule is

$$f_{\Theta \mid X = x}(\theta) = \frac{f_{X \mid \Theta = \theta}(x)\, f_{\Theta}(\theta)}{f_X(x)} = \frac{f_{X \mid \Theta = \theta}(x)\, f_{\Theta}(\theta)}{\int f_{X \mid \Theta = \theta}(x)\, f_{\Theta}(\theta)\, d\theta}.$$
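When no closed form is available, these integrals can be approximated on a grid. A sketch with a hypothetical model of my choosing ($\Theta \sim N(0,1)$ prior and $X \mid \Theta = \theta \sim N(\theta, 1)$, not a model from the lecture), where the exact posterior is known to be $N(x/2, 1/2)$:

```python
import numpy as np

# Hypothetical model: Theta ~ N(0, 1), X | Theta = theta ~ N(theta, 1)
theta = np.linspace(-8.0, 8.0, 4001)
d = theta[1] - theta[0]

def normal_pdf(z, mean, sd):
    return np.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

x = 1.0
prior = normal_pdf(theta, 0.0, 1.0)   # f_Theta(theta)
lik = normal_pdf(x, theta, 1.0)       # f_{X | Theta = theta}(x)

marginal = np.sum(lik * prior) * d    # LTP: f_X(x) by Riemann sum
posterior = lik * prior / marginal    # Bayes rule evaluated on the grid

# Exact posterior is N(x/2, 1/2), so the grid posterior mean should be ~0.5
post_mean = np.sum(theta * posterior) * d
```

The same grid recipe works for any prior density and likelihood that can be evaluated pointwise.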

$X$ is discrete while $\Theta$ is continuous

The LTP is

$$\P\{X = x\} = \int \P\{X = x \mid \Theta = \theta\}\, f_{\Theta}(\theta)\, d\theta$$

and the Bayes rule is

$$f_{\Theta \mid X = x}(\theta) = \frac{\P\{X = x \mid \Theta = \theta\}\, f_{\Theta}(\theta)}{\P\{X = x\}} = \frac{\P\{X = x \mid \Theta = \theta\}\, f_{\Theta}(\theta)}{\int \P\{X = x \mid \Theta = \theta\}\, f_{\Theta}(\theta)\, d\theta}.$$
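A sketch of this mixed case with a hypothetical assignment of my choosing (uniform prior on $[0,1]$ and a single Bernoulli observation, not a model from the lecture), for which the posterior given $X = 1$ is known to be the density $2\theta$ on $[0,1]$:

```python
import numpy as np

# Hypothetical model: Theta ~ Uniform(0, 1), X | Theta = theta ~ Ber(theta)
theta = np.linspace(0.0, 1.0, 100001)
d = theta[1] - theta[0]

prior = np.ones_like(theta)          # f_Theta(theta) = 1 on [0, 1]
lik = theta                          # P(X = 1 | Theta = theta) = theta

marginal = np.sum(lik * prior) * d   # LTP: P(X = 1), exactly 1/2 here
posterior = lik * prior / marginal   # Bayes rule: density 2*theta on [0, 1]
```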

$X$ is continuous while $\Theta$ is discrete

The LTP is

$$f_X(x) = \sum_{\theta} f_{X \mid \Theta = \theta}(x)\, \P\{\Theta = \theta\}$$

and the Bayes rule is

$$\P\{\Theta = \theta \mid X = x\} = \frac{f_{X \mid \Theta = \theta}(x)\, \P\{\Theta = \theta\}}{f_X(x)} = \frac{f_{X \mid \Theta = \theta}(x)\, \P\{\Theta = \theta\}}{\sum_{\theta} f_{X \mid \Theta = \theta}(x)\, \P\{\Theta = \theta\}}.$$

Next is a simple application of the case where $\Theta$ is discrete and $X$ is continuous.

A Simple Model Selection Application

Suppose $\Theta$ has the $\mathrm{Ber}(0.5)$ distribution, i.e.,

$$\P\{\Theta = 0\} = \P\{\Theta = 1\} = 0.5.$$

Next assume that $X_1, \dots, X_n$ have the following distributions conditional on $\Theta = \theta$:

$$X_1, \dots, X_n \mid \Theta = 0 \overset{\text{i.i.d.}}{\sim} f_0 \quad \text{where } f_0(x) := \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$$

and

$$X_1, \dots, X_n \mid \Theta = 1 \overset{\text{i.i.d.}}{\sim} f_1 \quad \text{where } f_1(x) := \frac{1}{\sqrt{2\pi}} \exp\left(-|x|\sqrt{\frac{2}{\pi}}\right).$$

$f_0$ is the standard normal density and $f_1$ is a Laplace (double-exponential) density. Both densities have the same maximal value of $1/\sqrt{2\pi}$. Based on the information given, calculate the conditional distribution of $\Theta$ given $X_1 = x_1, X_2 = x_2, \dots, X_6 = x_6$ (i.e., $n = 6$) where

$$x_1 = -0.55, \quad x_2 = -1.11, \quad x_3 = 1.23, \quad x_4 = 0.29, \quad x_5 = 1.56, \quad x_6 = -1.64.$$

Here is the statistical context for this question. We observe data $x_1, \dots, x_n$ with $n = 6$. We want to use one of the models $f_0$ or $f_1$ for this data. The random variable $\Theta$ is used to describe the choice of the model. We want to treat both models on an equal footing, so we assume that $\Theta$ has the uniform prior distribution on $\{0, 1\}$.

To calculate the conditional distribution of $\Theta$ given the data, we use the Bayes rule for the case where $\Theta$ is discrete and the data $X_1, \dots, X_n$ are continuous. This gives

$$\begin{align*} &\P\{\Theta = 0 \mid X_1 = x_1, \dots, X_n = x_n\} \\ &= \frac{f_{X_1, \dots, X_n \mid \Theta = 0}(x_1, \dots, x_n)\, \P\{\Theta = 0\}}{f_{X_1, \dots, X_n \mid \Theta = 0}(x_1, \dots, x_n)\, \P\{\Theta = 0\} + f_{X_1, \dots, X_n \mid \Theta = 1}(x_1, \dots, x_n)\, \P\{\Theta = 1\}} \\ &= \frac{f_{X_1, \dots, X_n \mid \Theta = 0}(x_1, \dots, x_n) \times \frac{1}{2}}{f_{X_1, \dots, X_n \mid \Theta = 0}(x_1, \dots, x_n) \times \frac{1}{2} + f_{X_1, \dots, X_n \mid \Theta = 1}(x_1, \dots, x_n) \times \frac{1}{2}} \\ &= \frac{f_{X_1, \dots, X_n \mid \Theta = 0}(x_1, \dots, x_n)}{f_{X_1, \dots, X_n \mid \Theta = 0}(x_1, \dots, x_n) + f_{X_1, \dots, X_n \mid \Theta = 1}(x_1, \dots, x_n)} \\ &= \frac{f_0(x_1) f_0(x_2) \cdots f_0(x_n)}{f_0(x_1) f_0(x_2) \cdots f_0(x_n) + f_1(x_1) f_1(x_2) \cdots f_1(x_n)} \\ &= \frac{\left(\frac{1}{\sqrt{2\pi}}\right)^n \exp\left(-\frac{1}{2} \sum_{i=1}^n x_i^2\right)}{\left(\frac{1}{\sqrt{2\pi}}\right)^n \exp\left(-\frac{1}{2} \sum_{i=1}^n x_i^2\right) + \left(\frac{1}{\sqrt{2\pi}}\right)^n \exp\left(-\sqrt{\frac{2}{\pi}} \sum_{i=1}^n |x_i|\right)} \\ &= \frac{\exp\left(-\frac{1}{2} \sum_{i=1}^n x_i^2\right)}{\exp\left(-\frac{1}{2} \sum_{i=1}^n x_i^2\right) + \exp\left(-\sqrt{\frac{2}{\pi}} \sum_{i=1}^n |x_i|\right)}. \end{align*}$$

Similarly

$$\P\{\Theta = 1 \mid X_1 = x_1, \dots, X_n = x_n\} = \frac{\exp\left(-\sqrt{\frac{2}{\pi}} \sum_{i=1}^n |x_i|\right)}{\exp\left(-\frac{1}{2} \sum_{i=1}^n x_i^2\right) + \exp\left(-\sqrt{\frac{2}{\pi}} \sum_{i=1}^n |x_i|\right)}.$$

Plugging the data values given above for $x_1, \dots, x_6$ into these formulae, we obtain

$$\P\{\Theta = 0 \mid X_1 = x_1, \dots, X_6 = x_6\} = 0.72 \quad \text{and} \quad \P\{\Theta = 1 \mid X_1 = x_1, \dots, X_6 = x_6\} = 0.28.$$

Thus, conditional on the data, the normal model has probability $72\%$ compared to $28\%$ for the Laplace model. So we would prefer to use the normal distribution here.

Now suppose that we add an additional observation $x_7 = 5$. It can be checked that

$$\P\{\Theta = 0 \mid X_1 = x_1, \dots, X_7 = x_7\} = 0.001 \quad \text{and} \quad \P\{\Theta = 1 \mid X_1 = x_1, \dots, X_7 = x_7\} = 0.999.$$

Now there is an overwhelming preference for the Laplace model. This is because $x_7 = 5$ is an outlying observation to which the Laplace model assigns much higher probability than the normal model, owing to the heavy tails of the Laplace density.
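Both posterior computations above can be verified numerically (a sketch; the common prefactors $(1/\sqrt{2\pi})^n$ cancel, so only the exponents are needed):

```python
import math

x6 = [-0.55, -1.11, 1.23, 0.29, 1.56, -1.64]

def p_normal(xs):
    """P(Theta = 0 | data): posterior probability of the normal model
    under the 0.5/0.5 prior on {normal, Laplace}."""
    e0 = -0.5 * sum(v * v for v in xs)                      # normal exponent
    e1 = -math.sqrt(2 / math.pi) * sum(abs(v) for v in xs)  # Laplace exponent
    return math.exp(e0) / (math.exp(e0) + math.exp(e1))

print(round(p_normal(x6), 2))          # 0.72: prefer the normal model
print(round(p_normal(x6 + [5.0]), 3))  # 0.001: the outlier flips it to Laplace
```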