
More on Example 1 from Lecture

In Lectures 1 and 2, we studied the following problem.

Let $\Theta$ denote the binary parameter which represents whether I truly have Covid or not ($\Theta = 1$ when I have Covid and $\Theta = 0$ when I don't). Let $X$ denote the binary outcome of the Covid test, so that $X = 1$ represents a positive test. We need to calculate the probability:

$$\P \left\{\Theta = 1 \mid \text{test data and background information} \right\}$$

where the test data is simply $X = 1$, and the background information refers to things like “I have been strictly quarantining for the past 3 weeks”, “I do not have symptoms such as fever”, etc.

We used the probability model (below $B$ stands for background information):

$$\begin{split} & \text{prior: } \P(\Theta = 1 \mid B) = 0.02 \\ & \text{likelihood: } \P(X = 1 \mid \Theta = 1, B) = 0.99 \quad \P(X = 1 \mid \Theta = 0, B) = 0.04. \end{split}$$

With these probability assignments, we use the Bayes rule to compute the probability above as

$$\begin{align*} \P(\Theta = 1 \mid X = 1, B) &= \frac{\P(X = 1 \mid \Theta = 1, B)\, \P(\Theta = 1 \mid B)}{\P(X = 1 \mid \Theta = 1, B)\, \P(\Theta = 1 \mid B) + \P(X = 1 \mid \Theta = 0, B)\, \P(\Theta = 0 \mid B)} \\ &= \frac{0.99 \times 0.02}{0.99 \times 0.02 + 0.04 \times 0.98} = 0.3356. \end{align*}$$

This probability is not very high even though the test has very good false positive and false negative rates. This is because the prior probability $\P(\Theta = 1 \mid B)$ is very low (0.02). So, even with the positive test result, it is more likely than not that we are Covid-free.
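As a quick numerical check, the Bayes rule computation above can be reproduced in a few lines of Python (a minimal sketch; the variable names are mine):

```python
# Probability assignments from the example
prior = 0.02   # P(Theta = 1 | B)
sens = 0.99    # P(X = 1 | Theta = 1, B): true positive rate
fpr = 0.04     # P(X = 1 | Theta = 0, B): false positive rate

# Bayes rule: P(Theta = 1 | X = 1, B)
posterior = sens * prior / (sens * prior + fpr * (1 - prior))
print(round(posterior, 4))  # 0.3356
```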

On the other hand, if we pose the problem as hypothesis testing:

$$H_0 : \Theta = 0 \quad \text{versus} \quad H_1 : \Theta = 1$$

and calculate the $p$-value as

$$\P\{X = 1 \mid H_0\} = \P(X = 1 \mid \Theta = 0) = 0.04,$$

we get a different result (if we use the standard cutoff of 0.05 on the $p$-value): the null hypothesis is rejected and we declare that I have Covid.

The use of $p$-values has been linked to serious issues such as lack of reproducibility (see for example the paper “The reproducibility of research and the misinterpretation of $p$-values” by David Colquhoun). In this context, we can calculate the probability of reproducing the positive test as follows. Let $X_2$ denote the outcome of a second test (and $X_1 = X$ will now denote the outcome of the first test):

$$\begin{align*} \P(X_2 = 1 \mid X_1 = 1, B) &= \P(X_2 = 1 \mid \Theta = 1, X_1 = 1, B)\, \P(\Theta = 1 \mid X_1 = 1, B) \\ &\quad + \P(X_2 = 1 \mid \Theta = 0, X_1 = 1, B)\, \P(\Theta = 0 \mid X_1 = 1, B). \end{align*}$$

To calculate the probabilities on the right-hand side above, we make the following assignment:

$$\begin{align*} & \P(X_2 = 1 \mid \Theta = 1, X_1 = 1, B) = \P(X_2 = 1 \mid \Theta = 1) = 0.99 \\ & \P(X_2 = 1 \mid \Theta = 0, X_1 = 1, B) = \P(X_2 = 1 \mid \Theta = 0) = 0.04. \end{align*}$$

This assumption means that conditional on my Covid status $\Theta$, the two test outcomes $X_1$ and $X_2$ are independent. Using this assignment, it is straightforward to calculate the reproducibility probability as follows (note that we already calculated $\P(\Theta = 1 \mid X_1 = 1, B) = 1 - \P(\Theta = 0 \mid X_1 = 1, B) = 0.3356$):

$$\P(X_2 = 1 \mid X_1 = 1, B) = 0.99 \times 0.3356 + 0.04 \times (1 - 0.3356) = 0.35882.$$

Thus the positive test will be reproducible with probability only 35.88%, which means that a negative result is the more likely outcome of the second test.
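The reproducibility probability can be checked the same way (a sketch; `post` is the posterior 0.3356 computed earlier in the text):

```python
post = 0.3356  # P(Theta = 1 | X_1 = 1, B), computed above

# Law of total probability, conditioning on Theta:
# P(X_2 = 1 | X_1 = 1, B) = 0.99 * post + 0.04 * (1 - post)
repro = 0.99 * post + 0.04 * (1 - post)
print(round(repro, 5))  # 0.35882
```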

Law of Total Probability and Bayes Rule

In Bayesian statistics, the rules of probability are used mostly for the following:

  1. to compute the marginal distribution of $X$ based on knowledge of the conditional distribution of $X$ given $\Theta = \theta$ (i.e., the likelihood) as well as the marginal distribution of $\Theta$ (i.e., the prior);

  2. to compute the conditional distribution of $\Theta$ given $X = x$ (i.e., the posterior) based on the same knowledge of the conditional distribution of $X$ given $\Theta = \theta$ (i.e., the likelihood) as well as the marginal distribution of $\Theta$ (i.e., the prior).

The formula for the first item above is sometimes called the Law of Total Probability (LTP), while the formula for the second item is called the Bayes Rule. The precise formulae differ according to whether $X$ and $\Theta$ are discrete or continuous. It is natural to consider the following four separate cases.

$X$ and $\Theta$ are both discrete

The LTP is

$$\P\{X = x\} = \sum_{\theta} \P\{X = x \mid \Theta = \theta\}\, \P\{\Theta = \theta\}$$

and the Bayes rule is

$$\P\{\Theta = \theta \mid X = x\} = \frac{\P\{X = x \mid \Theta = \theta\}\, \P\{\Theta = \theta\}}{\P\{X = x\}} = \frac{\P\{X = x \mid \Theta = \theta\}\, \P\{\Theta = \theta\}}{\sum_{\theta} \P\{X = x \mid \Theta = \theta\}\, \P\{\Theta = \theta\}}.$$
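For the fully discrete case, the LTP and Bayes rule can be packaged as a small generic routine (a sketch; the function and argument names are my own):

```python
def discrete_posterior(prior, likelihood):
    """Bayes rule for discrete Theta and discrete X.

    prior: dict mapping theta -> P(Theta = theta)
    likelihood: dict mapping theta -> P(X = x | Theta = theta)
        for the observed value x.
    """
    # Numerator of the Bayes rule for each theta
    numer = {th: likelihood[th] * p for th, p in prior.items()}
    # Law of total probability: P(X = x)
    marginal = sum(numer.values())
    return {th: v / marginal for th, v in numer.items()}

# Sanity check against the Covid example (observed X = 1):
post = discrete_posterior({0: 0.98, 1: 0.02}, {0: 0.04, 1: 0.99})
print(round(post[1], 4))  # 0.3356
```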

$X$ and $\Theta$ are both continuous

Here the LTP is

$$f_X(x) = \int f_{X \mid \Theta = \theta}(x)\, f_{\Theta}(\theta)\, d\theta$$

and the Bayes rule is

$$f_{\Theta \mid X = x}(\theta) = \frac{f_{X \mid \Theta = \theta}(x)\, f_{\Theta}(\theta)}{f_X(x)} = \frac{f_{X \mid \Theta = \theta}(x)\, f_{\Theta}(\theta)}{\int f_{X \mid \Theta = \theta}(x)\, f_{\Theta}(\theta)\, d\theta}.$$
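When no closed form is available, these integrals can be approximated on a grid. A sketch with a hypothetical model of my choosing ($\Theta \sim N(0,1)$ prior and $X \mid \Theta = \theta \sim N(\theta, 1)$, not a model from the lecture), where the exact posterior is known to be $N(x/2, 1/2)$:

```python
import numpy as np

# Hypothetical model: Theta ~ N(0, 1), X | Theta = theta ~ N(theta, 1)
theta = np.linspace(-8.0, 8.0, 4001)
d = theta[1] - theta[0]

def normal_pdf(z, mean, sd):
    return np.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

x = 1.0
prior = normal_pdf(theta, 0.0, 1.0)   # f_Theta(theta)
lik = normal_pdf(x, theta, 1.0)       # f_{X | Theta = theta}(x)

marginal = np.sum(lik * prior) * d    # LTP: f_X(x) by Riemann sum
posterior = lik * prior / marginal    # Bayes rule evaluated on the grid

# Exact posterior is N(x/2, 1/2), so the grid posterior mean should be ~0.5
post_mean = np.sum(theta * posterior) * d
```

The same grid recipe works for any prior density and likelihood that can be evaluated pointwise.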

$X$ is discrete while $\Theta$ is continuous

The LTP is

$$\P\{X = x\} = \int \P\{X = x \mid \Theta = \theta\}\, f_{\Theta}(\theta)\, d\theta$$

and the Bayes rule is

$$f_{\Theta \mid X = x}(\theta) = \frac{\P\{X = x \mid \Theta = \theta\}\, f_{\Theta}(\theta)}{\P\{X = x\}} = \frac{\P\{X = x \mid \Theta = \theta\}\, f_{\Theta}(\theta)}{\int \P\{X = x \mid \Theta = \theta\}\, f_{\Theta}(\theta)\, d\theta}.$$
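A sketch of this mixed case with a hypothetical assignment of my choosing (uniform prior on $[0,1]$ and a single Bernoulli observation, not a model from the lecture), for which the posterior given $X = 1$ is known to be the density $2\theta$ on $[0,1]$:

```python
import numpy as np

# Hypothetical model: Theta ~ Uniform(0, 1), X | Theta = theta ~ Ber(theta)
theta = np.linspace(0.0, 1.0, 100001)
d = theta[1] - theta[0]

prior = np.ones_like(theta)          # f_Theta(theta) = 1 on [0, 1]
lik = theta                          # P(X = 1 | Theta = theta) = theta

marginal = np.sum(lik * prior) * d   # LTP: P(X = 1), exactly 1/2 here
posterior = lik * prior / marginal   # Bayes rule: density 2*theta on [0, 1]
```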

$X$ is continuous while $\Theta$ is discrete

The LTP is

$$f_X(x) = \sum_{\theta} f_{X \mid \Theta = \theta}(x)\, \P\{\Theta = \theta\}$$

and the Bayes rule is

$$\P\{\Theta = \theta \mid X = x\} = \frac{f_{X \mid \Theta = \theta}(x)\, \P\{\Theta = \theta\}}{f_X(x)} = \frac{f_{X \mid \Theta = \theta}(x)\, \P\{\Theta = \theta\}}{\sum_{\theta} f_{X \mid \Theta = \theta}(x)\, \P\{\Theta = \theta\}}.$$

Next is a simple application of the case where $\Theta$ is discrete and $X$ is continuous.

A Simple Model Selection Application

Suppose $\Theta$ has the $\mathrm{Ber}(0.5)$ distribution, i.e.,

$$\P\{\Theta = 0\} = \P\{\Theta = 1\} = 0.5.$$

Next assume that $X_1, \dots, X_n$ have the following distributions conditional on $\Theta = \theta$:

$$X_1, \dots, X_n \mid \Theta = 0 \overset{\text{i.i.d.}}{\sim} f_0 \quad \text{where } f_0(x) := \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$$

and

$$X_1, \dots, X_n \mid \Theta = 1 \overset{\text{i.i.d.}}{\sim} f_1 \quad \text{where } f_1(x) := \frac{1}{\sqrt{2\pi}} \exp\left(-|x|\sqrt{\frac{2}{\pi}}\right).$$

$f_0$ is the standard normal density and $f_1$ is a Laplace (double-exponential) density. Both densities have the same maximal value of $1/\sqrt{2\pi}$. Based on the information given, calculate the conditional distribution of $\Theta$ given $X_1 = x_1, X_2 = x_2, \dots, X_6 = x_6$ (i.e., $n = 6$) where

$$x_1 = -0.55, \quad x_2 = -1.11, \quad x_3 = 1.23, \quad x_4 = 0.29, \quad x_5 = 1.56, \quad x_6 = -1.64.$$

Here is the statistical context for this question. We observe data $x_1, \dots, x_n$ with $n = 6$. We want to use one of the models $f_0$ or $f_1$ for this data. The random variable $\Theta$ is used to describe the choice of the model. We want to treat both models on an equal footing, so we assume that $\Theta$ has the uniform prior distribution on $\{0, 1\}$.

To calculate the conditional distribution of $\Theta$ given the data, we use the Bayes rule for the case where $\Theta$ is discrete and the data $X_1, \dots, X_n$ are continuous. This gives

$$\begin{align*} &\P\{\Theta = 0 \mid X_1 = x_1, \dots, X_n = x_n\} \\ &= \frac{f_{X_1, \dots, X_n \mid \Theta = 0}(x_1, \dots, x_n)\, \P\{\Theta = 0\}}{f_{X_1, \dots, X_n \mid \Theta = 0}(x_1, \dots, x_n)\, \P\{\Theta = 0\} + f_{X_1, \dots, X_n \mid \Theta = 1}(x_1, \dots, x_n)\, \P\{\Theta = 1\}} \\ &= \frac{f_{X_1, \dots, X_n \mid \Theta = 0}(x_1, \dots, x_n) \times \frac{1}{2}}{f_{X_1, \dots, X_n \mid \Theta = 0}(x_1, \dots, x_n) \times \frac{1}{2} + f_{X_1, \dots, X_n \mid \Theta = 1}(x_1, \dots, x_n) \times \frac{1}{2}} \\ &= \frac{f_{X_1, \dots, X_n \mid \Theta = 0}(x_1, \dots, x_n)}{f_{X_1, \dots, X_n \mid \Theta = 0}(x_1, \dots, x_n) + f_{X_1, \dots, X_n \mid \Theta = 1}(x_1, \dots, x_n)} \\ &= \frac{f_0(x_1) f_0(x_2) \cdots f_0(x_n)}{f_0(x_1) f_0(x_2) \cdots f_0(x_n) + f_1(x_1) f_1(x_2) \cdots f_1(x_n)} \\ &= \frac{\left(\frac{1}{\sqrt{2\pi}}\right)^n \exp\left(-\frac{1}{2} \sum_{i=1}^n x_i^2\right)}{\left(\frac{1}{\sqrt{2\pi}}\right)^n \exp\left(-\frac{1}{2} \sum_{i=1}^n x_i^2\right) + \left(\frac{1}{\sqrt{2\pi}}\right)^n \exp\left(-\sqrt{\frac{2}{\pi}} \sum_{i=1}^n |x_i|\right)} \\ &= \frac{\exp\left(-\frac{1}{2} \sum_{i=1}^n x_i^2\right)}{\exp\left(-\frac{1}{2} \sum_{i=1}^n x_i^2\right) + \exp\left(-\sqrt{\frac{2}{\pi}} \sum_{i=1}^n |x_i|\right)}. \end{align*}$$

Similarly

$$\P\{\Theta = 1 \mid X_1 = x_1, \dots, X_n = x_n\} = \frac{\exp\left(-\sqrt{\frac{2}{\pi}} \sum_{i=1}^n |x_i|\right)}{\exp\left(-\frac{1}{2} \sum_{i=1}^n x_i^2\right) + \exp\left(-\sqrt{\frac{2}{\pi}} \sum_{i=1}^n |x_i|\right)}.$$

Plugging the data values given above for $x_1, \dots, x_6$ into these formulae, we obtain

$$\P\{\Theta = 0 \mid X_1 = x_1, \dots, X_6 = x_6\} = 0.72 \quad \text{and} \quad \P\{\Theta = 1 \mid X_1 = x_1, \dots, X_6 = x_6\} = 0.28.$$

Thus, conditional on the data, the normal model has probability $72\%$ compared to $28\%$ for the Laplace model. So we would prefer to use the normal distribution here.

Now suppose that we add an additional observation $x_7 = 5$. It can be checked that

$$\P\{\Theta = 0 \mid X_1 = x_1, \dots, X_7 = x_7\} = 0.001 \quad \text{and} \quad \P\{\Theta = 1 \mid X_1 = x_1, \dots, X_7 = x_7\} = 0.999.$$

Now there is an overwhelming preference for the Laplace model. This is because $x_7 = 5$ is an outlying observation to which the Laplace model assigns much higher probability than the normal model, owing to the heavy tails of the Laplace density.
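Both posterior computations above can be verified numerically (a sketch; the common prefactors $(1/\sqrt{2\pi})^n$ cancel, so only the exponents are needed):

```python
import math

x6 = [-0.55, -1.11, 1.23, 0.29, 1.56, -1.64]

def p_normal(xs):
    """P(Theta = 0 | data): posterior probability of the normal model
    under the 0.5/0.5 prior on {normal, Laplace}."""
    e0 = -0.5 * sum(v * v for v in xs)                      # normal exponent
    e1 = -math.sqrt(2 / math.pi) * sum(abs(v) for v in xs)  # Laplace exponent
    return math.exp(e0) / (math.exp(e0) + math.exp(e1))

print(round(p_normal(x6), 2))          # 0.72: prefer the normal model
print(round(p_normal(x6 + [5.0]), 3))  # 0.001: the outlier flips it to Laplace
```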