The main goal today is to go over Bayesian inference of a parameter θ∈[0,1] from observation X∼Bin(n,θ) (n is also observed) with a Beta distribution prior on θ.
For the Beta density to be proper, we need both a and b to be strictly positive. Here are some basic properties of the Beta distribution:
When a=1 and b=1, we get the uniform distribution on (0,1), which can be regarded as the simplest example of a Beta distribution.
The mean of the Beta distribution is
mean = a/(a+b).
For example, if a is much smaller than b, then the mean will be small.
The variance of the Beta distribution is
variance = [a/(a+b)] · [b/(a+b)] · [1/(a+b+1)] = ab/((a+b)²(a+b+1)).
An implication is that, when a+b is large, the variance tends to be small, so the Beta density will look skinny.
More generally, any moment E(θk) (for positive integers k) can be written in closed form as (here θ∼Beta(a,b))
E(θ^k) = [a(a+1)⋯(a+k−1)] / [(a+b)(a+b+1)⋯(a+b+k−1)] for every k ≥ 1.
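As a quick numerical sanity check (a Python sketch with illustrative function names), the product formula above can be compared against the equivalent gamma-function expression E(θ^k) = Γ(a+k)Γ(a+b)/(Γ(a)Γ(a+b+k)), using only the standard library:

```python
from math import gamma

def beta_moment_product(a, b, k):
    """E[theta^k] for theta ~ Beta(a, b) via the product formula
    prod_{j=0}^{k-1} (a + j) / (a + b + j)."""
    num, den = 1.0, 1.0
    for j in range(k):
        num *= a + j
        den *= a + b + j
    return num / den

def beta_moment_gamma(a, b, k):
    """Same moment via E[theta^k] = B(a+k, b) / B(a, b), written with gammas."""
    return gamma(a + k) * gamma(a + b) / (gamma(a) * gamma(a + b + k))

# The two expressions agree for several k:
a, b = 2.5, 4.0
for k in range(1, 6):
    assert abs(beta_moment_product(a, b, k) - beta_moment_gamma(a, b, k)) < 1e-12

# k = 1 recovers the mean a/(a+b); k = 2 gives mean^2 + variance.
mean = beta_moment_product(a, b, 1)
var = beta_moment_product(a, b, 2) - mean ** 2
print(mean, var)
```

The k = 2 check also confirms the variance formula stated earlier, since var = E(θ²) − (E θ)².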
In Bayesian inference, improper priors are also sometimes used. In the case of Beta priors, it is common to use the distribution Beta(0,0) as a prior. This corresponds to the density
f(θ) ∝ 1/(θ(1−θ)) for 0 < θ < 1.
This improper density cannot be normalized (its integral over (0,1) diverges at both endpoints), so we do not put any normalizing constant on the right-hand side.
If one wants to interpret Beta(0,0) as a probability distribution, the correct answer is the discrete distribution taking the values 0 and 1 with equal probability: (1/2)δ{0} + (1/2)δ{1}.
One way to formalize this equivalence is to argue that the distribution Beta(ϵ,ϵ) converges to Bernoulli(0.5) as ϵ↓0 in the usual sense of weak convergence. The easiest way to see this is that the moments of Beta(ϵ,ϵ) converge to the moments of Bernoulli(0.5).
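This moment convergence is easy to check numerically. The sketch below (illustrative Python, using the product formula for Beta moments) shows every moment of Beta(ϵ,ϵ) approaching 1/2, which matches E(Z^k) = 1/2 for Z ∼ Bernoulli(0.5):

```python
def beta_moment(a, b, k):
    """E[theta^k] for theta ~ Beta(a, b): prod_{j=0}^{k-1} (a+j)/(a+b+j)."""
    out = 1.0
    for j in range(k):
        out *= (a + j) / (a + b + j)
    return out

# As eps shrinks, every moment of Beta(eps, eps) approaches 1/2.
for k in [1, 2, 5, 10]:
    for eps in [1e-2, 1e-4, 1e-6]:
        print(k, eps, beta_moment(eps, eps, k))
```

Intuitively, the first factor eps/(2·eps) = 1/2 survives in the limit while every other factor (eps+j)/(2·eps+j) tends to 1.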
The goal is to infer θ from observations X (and n) where X∼Bin(n,θ). One can also formulate this problem as that of estimating θ from n i.i.d Bernoulli observations Z1,…,Zn ∼ i.i.d Ber(θ). The likelihood is given by:
L(θ) = (n choose x) θ^x (1−θ)^(n−x).
Combining this likelihood with the Beta(a,b) prior gives the posterior Beta(x+a, n−x+b), whose mean is
(x+a)/(n+a+b) = [n/(n+a+b)] · (x/n) + [(a+b)/(n+a+b)] · [a/(a+b)],
a weighted average of the MLE x/n and the prior mean a/(a+b). So if n is much larger than a+b, the posterior mean will be very close to the MLE. Conversely, if n is much smaller than a+b, the posterior mean will be very close to the prior mean.
When a=b=0, the posterior mean coincides exactly with the MLE. Thus if we use the Beta(0,0) prior, posterior inference will be close to frequentist inference. Note again that this prior is improper.
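The behavior of the posterior mean (x+a)/(n+a+b) can be illustrated with a short Python sketch (function names are illustrative), writing it explicitly as a convex combination of the MLE x/n and the prior mean a/(a+b):

```python
def posterior_mean(x, n, a, b):
    """Posterior mean of theta under a Beta(a, b) prior after x successes in n trials."""
    return (x + a) / (n + a + b)

def as_weighted_average(x, n, a, b):
    """Same quantity as a convex combination of the MLE and the prior mean."""
    w = n / (n + a + b)                     # weight on the data
    return w * (x / n) + (1 - w) * (a / (a + b))

a, b = 3.0, 7.0
# Large n: the posterior mean hugs the MLE x/n = 0.6.
print(posterior_mean(600, 1000, a, b))
# Small n: the posterior mean is pulled toward the prior mean a/(a+b) = 0.3.
print(posterior_mean(3, 5, a, b))
# a = b = 0 recovers the MLE exactly.
print(posterior_mean(7, 10, 0.0, 0.0))
```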
Suppose we use the Beta(0,0) prior and observe n=x=1 (i.e., we only have data on one Bernoulli trial, which resulted in a heads). Then the posterior is Beta(x+a, n−x+b) = Beta(1,0). Beta(1,0) is also improper, but it can be interpreted as the point mass δ{1} at 1.
This can be proved, for example, by computing the moments of Beta(1,ϵ) and showing that they all approach 1 (which are the moments of δ{1}) as ϵ→0. Similarly for x=0 and n=1, the posterior becomes Beta(0,1) which should be interpreted as δ{0}. If n=2 and x=1 (i.e., we observe one head and one tail), the posterior is the uniform distribution on (0,1).
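These two limits can likewise be checked via moments (a small Python sketch, assuming the same product formula for Beta moments as before):

```python
def beta_moment(a, b, k):
    """E[theta^k] for theta ~ Beta(a, b): prod_{j=0}^{k-1} (a+j)/(a+b+j)."""
    out = 1.0
    for j in range(k):
        out *= (a + j) / (a + b + j)
    return out

# Moments of Beta(1, eps) approach 1 (the moments of delta at 1), while
# moments of Beta(eps, 1) approach 0 for k >= 1 (the moments of delta at 0).
for eps in [1e-3, 1e-6]:
    print([beta_moment(1.0, eps, k) for k in (1, 2, 3)])
    print([beta_moment(eps, 1.0, k) for k in (1, 2, 3)])
```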
Jaynes describes a situation in which the update from the Beta(0,0) prior to the Beta(1,0) posterior when n=x=1 is intuitive: “...in a chemical laboratory we find a jar containing an unknown and unlabeled compound. We are at first completely ignorant as to whether a small sample of this compound will dissolve in water or not. But, having observed that one small sample does dissolve, we infer immediately that all samples of this compound are water soluble, and although the conclusion does not carry quite the force of deductive proof, we feel strongly that the inference was justified.”
Consider the problem of estimating θi from (Xi,ni) for each i=1,…,N, with Xi∼Bin(ni,θi). For a concrete example, take all counties in the United States with Xi denoting the number of deaths due to kidney cancer (in the period 1980−89) and ni denoting the average population (during the same period) for the i-th county.
It is easy to see that the naive frequentist estimate Xi/ni can be very bad for θi if ni is small. We will therefore use Bayes estimation with the prior: θ1,…,θN ∼ i.i.d Beta(a,b).
The choice of a and b is crucial here. Since we have a large dataset, it makes sense to learn good values of a and b from the observed data. We will look at the following three ways of doing this.
Method One: We place a threshold on ni and keep only those i for which ni meets the threshold (since Xi/ni is a reliable estimate of θi only when ni is large). We then select a and b by fitting the Beta(a,b) density to the data {Xi/ni : ni ≥ threshold}. The Beta density can be fit by simply matching the mean and variance. Let m and V denote the mean and variance of {Xi/ni : ni ≥ threshold}. Then we obtain a and b by solving:
a/(a+b) = m and [a/(a+b)] · [b/(a+b)] · [1/(a+b+1)] = V
which gives
â = m(m(1−m)/V − 1) and b̂ = (1−m)(m(1−m)/V − 1).
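A minimal Python sketch of this moment-matching step (the function name is illustrative; it assumes 0 < V < m(1−m), which is necessary for a proper Beta fit):

```python
def fit_beta_moments(m, V):
    """Method-of-moments fit of Beta(a, b) to sample mean m and variance V.

    Solves a/(a+b) = m and ab/((a+b)^2 (a+b+1)) = V.
    """
    if not (0.0 < V < m * (1.0 - m)):
        raise ValueError("need 0 < V < m(1 - m) for a proper Beta fit")
    c = m * (1.0 - m) / V - 1.0       # c = a + b
    return m * c, (1.0 - m) * c       # (a_hat, b_hat)

# Sanity check: the fitted Beta reproduces the target mean and variance.
a_hat, b_hat = fit_beta_moments(0.3, 0.01)
mean = a_hat / (a_hat + b_hat)
var = a_hat * b_hat / ((a_hat + b_hat) ** 2 * (a_hat + b_hat + 1.0))
print(a_hat, b_hat, mean, var)
```

Adding the two estimates shows â + b̂ = m(1−m)/V − 1, so a smaller observed variance V forces a larger â + b̂, i.e., a more concentrated fitted prior.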
One drawback of this method is that the choice of threshold is arbitrary. In the kidney cancer example, we take 300,000 as the threshold, but we could also have chosen some other value such as 350,000 or 400,000.
Method Two: We consider the marginal likelihood of Xi given a,b. Note that