
STAT 238 - Bayesian Statistics Lecture Eight

Spring 2026, UC Berkeley

The main goal today is to go over Bayesian inference of a parameter $\theta \in [0, 1]$ from an observation $X \sim Bin(n, \theta)$ ($n$ is also observed) with a Beta distribution prior on $\theta$.

Let us start by recalling the Beta distribution.

Beta Distribution

The Beta distribution with parameters $a$ and $b$ is given by the density:

\begin{align*} f_{\theta}(\theta) = \frac{\theta^{a-1} (1 - \theta)^{b-1}}{B(a, b)} I\{0 \leq \theta \leq 1\} \end{align*}

where the normalizing constant $B(a, b)$ is the Beta function given by

\begin{align*} B(a, b) = \int_0^1 \theta^{a-1} (1 - \theta)^{b-1} \, d\theta. \end{align*}

For the Beta density to be proper, we need both $a$ and $b$ to be strictly positive. Here are some basic properties of the Beta distribution:

  1. When $a = 1$ and $b = 1$, we get the uniform distribution on $(0, 1)$. We consider the uniform distribution as the simplest example of a Beta distribution.

  2. The mean of the Beta distribution is

    \text{mean} = \frac{a}{a + b}.

    For example, if $a$ is much smaller than $b$, then the mean will be small.

  3. The variance of the Beta distribution is

    \text{variance} = \frac{a}{a + b} \frac{b}{a + b} \frac{1}{a + b + 1}.

    An implication is that, when $a + b$ is large, the variance tends to be small, so the Beta density will look skinny.

  4. More generally, any moment $\E(\theta^k)$ (for positive integers $k$) can be written in closed form as (here $\theta \sim Beta(a, b)$):

    \E (\theta^k) = \frac{a(a + 1) \dots (a + k-1)}{(a + b) (a + b + 1) \dots (a + b + k - 1)} ~~ \text{for every $k \geq 1$}.
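As a quick sanity check, the moment formula above can be verified numerically. Here is a minimal sketch in Python (the helper name `beta_moment` is ours, not from any library):

```python
def beta_moment(a, b, k):
    """E[theta^k] for theta ~ Beta(a, b), via the closed-form product
    a (a+1) ... (a+k-1) / ((a+b) (a+b+1) ... (a+b+k-1))."""
    m = 1.0
    for i in range(k):
        m *= (a + i) / (a + b + i)
    return m

a, b = 2.0, 5.0
mean = beta_moment(a, b, 1)             # equals a / (a + b)
second = beta_moment(a, b, 2)
var = second - mean ** 2                # equals (a/(a+b)) (b/(a+b)) / (a+b+1)
```

The $k = 1$ case recovers the mean, and $\E(\theta^2) - (\E \theta)^2$ recovers the variance formula above.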

In Bayesian inference, improper priors are also sometimes used. In the case of Beta priors, it is common to use the distribution $Beta(0, 0)$ as a prior. This corresponds to the density

\begin{align*} f_{Beta(0, 0)}(\theta) \propto \frac{I\{0 < \theta < 1\}}{\theta (1 - \theta)}. \end{align*}

This improper density cannot be normalized so we don’t put any normalizing constants on the right hand side.

If one wants to interpret $Beta(0, 0)$ as a probability distribution, the correct answer is the discrete distribution taking the values 0 and 1 with equal probability:

\begin{align*} Beta(0, 0) = Bernoulli(0.5). \end{align*}

One way to formalize this equivalence is to argue that the distribution $Beta(\epsilon, \epsilon)$ converges to $Bernoulli(0.5)$ as $\epsilon \downarrow 0$ in the usual sense of weak convergence. The easiest way to see this is that the moments of $Beta(\epsilon, \epsilon)$ converge to the moments of $Bernoulli(0.5)$.
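This moment convergence is easy to check numerically. A small sketch (the `beta_moment` helper is ours, implementing the closed-form moment formula from the previous section):

```python
def beta_moment(a, b, k):
    """E[theta^k] for theta ~ Beta(a, b), via the closed-form product."""
    m = 1.0
    for i in range(k):
        m *= (a + i) / (a + b + i)
    return m

# As eps decreases, every moment of Beta(eps, eps) approaches 1/2,
# matching the moments of Bernoulli(0.5): E[theta^k] = 0.5 for all k >= 1.
moments = {eps: [beta_moment(eps, eps, k) for k in (1, 2, 3)]
           for eps in (1.0, 0.1, 1e-6)}
```

For $\epsilon = 1$ (the uniform distribution) the moments are $1/2, 1/3, 1/4, \dots$, while for tiny $\epsilon$ they are all nearly $1/2$.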

Beta-Binomial Inference

The goal is to infer $\theta$ from observations $X$ (and $n$) where $X \sim Bin(n, \theta)$. One can also formulate this problem as that of estimating $\theta$ from $n$ i.i.d. Bernoulli observations $Z_1, \dots, Z_n \overset{\text{i.i.d.}}{\sim} Ber(\theta)$. The likelihood is given by:

\begin{align*} f_{X \mid \theta}(x) = {n \choose x} \theta^x (1 - \theta)^{n-x} \propto \theta^{x} (1 - \theta)^{n-x}. \end{align*}

The frequentist estimator (MLE) for $\theta$ is $x/n$. For Bayesian inference, we will use the $Beta(a, b)$ prior:

\begin{align*} f_{\theta}(\theta) \propto \theta^{a-1} (1 - \theta)^{b-1} I\{0 \leq \theta \leq 1\} \end{align*}

leading to the posterior:

\begin{align*} f_{\theta \mid x}(\theta) &\propto \theta^{a-1} (1 - \theta)^{b-1} I\{0 \leq \theta \leq 1\} \cdot \theta^{x} (1 - \theta)^{n-x} \\ &= \theta^{x + a - 1} (1 - \theta)^{n-x + b - 1} I\{0 \leq \theta \leq 1\}. \end{align*}

Thus

\begin{align*} \theta \mid x \sim Beta(x + a, n-x + b). \end{align*}

A natural point estimate of $\theta$ is the posterior mean:

\begin{align*} \hat{\theta} = \E(\theta \mid x) = \frac{x+a}{n + a + b}. \end{align*}

Here are some things to note about the posterior mean:

  1. It is a convex combination of the MLE $x/n$ and the prior mean $a/(a+b)$:

    \begin{align*} \frac{x+a}{n + a + b} = \left(\frac{n}{a+b+n} \right) \frac{x}{n} + \left(\frac{a+b}{a+b+n} \right) \frac{a}{a+b}. \end{align*}

    So if $n$ is much larger than $a + b$, then the posterior mean will be very close to the MLE. Conversely, if $n$ is much smaller than $a + b$, then the posterior mean will be very close to the prior mean.

  2. When $a = b = 0$, the posterior mean exactly coincides with the MLE. Thus if we use the $Beta(0, 0)$ prior, then posterior inference will be close to frequentist inference. Note again that this prior is improper.

  3. Suppose we use the $Beta(0, 0)$ prior and observe $n = x = 1$ (i.e., we only have data on one Bernoulli trial, which resulted in a head). Then the posterior is $Beta(x + a, n-x + b) = Beta(1, 0)$. $Beta(1, 0)$ is also improper, but it can be interpreted as the point mass at 1:

    \begin{align*} Beta(1, 0) = Bernoulli(1) = \delta_{\{1\}}. \end{align*}

    This can be proved, for example, by computing the moments of $Beta(1, \epsilon)$ and showing that they all approach 1 (which are the moments of $\delta_{\{1\}}$) as $\epsilon \downarrow 0$. Similarly, for $x = 0$ and $n = 1$, the posterior becomes $Beta(0, 1)$, which should be interpreted as $\delta_{\{0\}}$. If $n = 2$ and $x = 1$ (i.e., we observe one head and one tail), the posterior is the uniform distribution on $(0, 1)$.

  4. The process of going from the $Beta(0, 0)$ prior to the $Beta(1, 0)$ posterior when $n = x = 1$ is described by Jaynes as intuitive in the following situation: "...in a chemical laboratory we find a jar containing an unknown and unlabeled compound. We are at first completely ignorant as to whether a small sample of this compound will dissolve in water or not. But, having observed that one small sample does dissolve, we infer immediately that all samples of this compound are water soluble, and although the conclusion does not carry quite the force of deductive proof, we feel strongly that the inference was justified."
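The conjugate update and the convex-combination identity above are easy to check in code. A minimal sketch (the function name is ours):

```python
def beta_binomial_posterior(a, b, x, n):
    """Beta(a, b) prior + observation x ~ Bin(n, theta) -> Beta posterior."""
    return x + a, n - x + b

a, b = 2.0, 2.0     # prior Beta(2, 2)
x, n = 7, 10        # observed 7 successes in 10 trials
a_post, b_post = beta_binomial_posterior(a, b, x, n)
post_mean = a_post / (a_post + b_post)      # (x + a) / (n + a + b)

# the same number, written as a convex combination of MLE and prior mean
w = n / (a + b + n)
combo = w * (x / n) + (1 - w) * (a / (a + b))
```

Here the posterior is $Beta(9, 5)$ and the posterior mean $9/14$ sits between the MLE $0.7$ and the prior mean $0.5$.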

Multiple Replications of the Same Problem

Consider the problem of estimating $\theta_i$ from $(X_i, n_i)$ for each $i = 1, \dots, N$, with $X_i \sim Bin(n_i, \theta_i)$. For a concrete example, take all counties in the United States, with $X_i$ denoting the number of deaths due to kidney cancer (in the period 1980–89) and $n_i$ denoting the average population (during the same period) of the $i$-th county.

It is easy to see that the naive frequentist estimate $X_i/n_i$ can be a very poor estimate of $\theta_i$ if $n_i$ is small. We will therefore use Bayesian estimation with the prior:

\begin{align*} \theta_i \overset{\text{i.i.d.}}{\sim} Beta(a, b). \end{align*}

The choice of $a$ and $b$ is crucial here. Since we have a large dataset, it makes sense to learn good values of $a$ and $b$ from the observed data. We will look at the following three ways of doing this.

  1. Method One: We place a threshold on $n_i$ and keep only those indices $i$ for which $n_i$ exceeds the threshold. We then select $a$ and $b$ by fitting the $Beta(a, b)$ density to the data $\{X_i/n_i : n_i \geq \text{threshold}\}$. The fitting can be done by just matching means and variances. Let $m$ and $V$ denote the mean and variance of $\{X_i/n_i : n_i \geq \text{threshold}\}$. Then we obtain $a$ and $b$ by solving:

    \frac{a}{a+b} = m ~~~ \text{ and } ~~~ \frac{a}{a+b} \frac{b}{a+b} \frac{1}{a+b+1} = V

    which gives

    \hat{a} = m \left(\frac{m(1-m)}{V} - 1 \right) \text{ and } \hat{b} = (1 - m) \left(\frac{m(1-m)}{V} - 1 \right).

    One drawback of this method is that the choice of threshold is somewhat arbitrary. In the kidney cancer example, we choose 300,000 as the threshold, but we could also have chosen some other threshold such as 350K or 400K.

  2. Method Two: We consider the marginal likelihood of $X_i$ given $a, b$. Note that

    \begin{align*} \P \{X_i = x \mid a, b\} &= \int_0^1 \P\{X_i = x \mid \theta_i = \theta\} f_{\theta_i \mid a, b}(\theta) \, d\theta \\ &= \int_0^1 {n_i \choose x} \theta^x (1 - \theta)^{n_i - x} \frac{\theta^{a-1} (1 - \theta)^{b-1}}{B(a, b)} \, d\theta \\ &= {n_i \choose x} \frac{1}{B(a, b)} \int_0^1 \theta^{x + a - 1} (1 - \theta)^{n_i - x + b - 1} \, d\theta \\ &= {n_i \choose x} \frac{B(x + a, n_i - x + b)}{B(a, b)}. \end{align*}

    As a result, the marginal likelihood of the full dataset, as a function of $a$ and $b$, is:

    \begin{align*} \prod_{i=1}^N {n_i \choose X_i} \frac{B(X_i + a, n_i - X_i + b)}{B(a, b)}. \end{align*}

    So the maximum likelihood estimates of $a$ and $b$ are given by:

    \begin{align*} (\hat{a}_{\text{MLE}}, \hat{b}_{\text{MLE}}) = \underset{a, b}{\text{argmax}} \prod_{i=1}^N {n_i \choose X_i} \frac{B(X_i + a, n_i - X_i + b)}{B(a, b)}. \end{align*}

    This maximization can be done numerically by taking a grid of $a$ and $b$ values.

  3. Method Three: This is the fully Bayesian solution, where we simply place priors on $a$ and $b$. It is sensible to reparametrize as:

    \begin{align*} \mu = \frac{a}{a+b} ~~ \text{ and } ~~ \kappa = a + b. \end{align*}

    Now we take the priors:

    \begin{align*} \mu \sim \text{uniform}(0, 1) ~~ \text{ and } ~~ \log \kappa \sim \text{uniform}(-\infty, \infty). \end{align*}

    Inference under this fully Bayesian model is nontrivial from a computational point of view. We shall discuss this in the next lecture.
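Method One is only a few lines of code. Here is a sketch on hypothetical rate data, using just the standard library (note that `statistics.variance` uses the $n - 1$ denominator; either convention works for this rough fit):

```python
import statistics

def fit_beta_moments(rates):
    """Method-of-moments fit of Beta(a, b) to observed rates X_i / n_i."""
    m = statistics.mean(rates)
    V = statistics.variance(rates)
    c = m * (1 - m) / V - 1      # this is a + b; requires V < m (1 - m)
    return m * c, (1 - m) * c

# hypothetical rates from counties whose n_i exceeds the threshold
rates = [0.02, 0.05, 0.03, 0.04, 0.06, 0.05, 0.03]
a_hat, b_hat = fit_beta_moments(rates)
```

By construction, the fitted $Beta(\hat{a}, \hat{b})$ has mean $m$ and variance $V$ exactly.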
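Method Two can be implemented by working with the log marginal likelihood (via log-gamma, for numerical stability) and searching over a grid of $(a, b)$ values. A sketch with made-up data (`xs` and `ns` are hypothetical; a real analysis would use the county counts):

```python
import math

def log_beta_fn(a, b):
    """log B(a, b) computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal(a, b, xs, ns):
    """Log marginal likelihood of the data, dropping the binomial
    coefficients since they do not depend on (a, b)."""
    return sum(log_beta_fn(x + a, n - x + b) - log_beta_fn(a, b)
               for x, n in zip(xs, ns))

# hypothetical data: (x_i, n_i) pairs
xs = [3, 1, 0, 5, 2]
ns = [10, 8, 12, 20, 9]

# coarse grid over (a, b); a finer grid or a numerical optimizer would refine this
grid = [0.2 * k for k in range(1, 51)]          # 0.2, 0.4, ..., 10.0
a_mle, b_mle = max(((a, b) for a in grid for b in grid),
                   key=lambda ab: log_marginal(ab[0], ab[1], xs, ns))
```

The grid maximizer is only as accurate as the grid spacing, but it is simple, robust, and easy to visualize as a heat map of the log marginal likelihood.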