
STAT 238 - Bayesian Statistics Lecture Twelve

Spring 2026, UC Berkeley

Today we shall generalize the Beta-Binomial analysis from last week to Dirichlet-Multinomial inference. First we shall look at the Multinomial distribution (which is a generalization of the Binomial distribution) and the Dirichlet distribution (which is a generalization of the Beta distribution).

Multinomial Distribution

The multinomial distribution is a generalization of the Binomial distribution. We first recall the binomial distribution. We say that $X \sim \text{Bin}(n, p)$ if

$$\begin{align*} \P\{X = x\} = \frac{n!}{x! (n-x)!} p^x (1 - p)^{n-x} \quad \text{ for } x = 0, 1, \dots, n. \end{align*}$$

$X$ represents the count of successes in $n$ repetitions of a simple binary experiment with two outcomes: success (probability $p$) and failure (probability $1 - p$).

The multinomial distribution is obtained by considering $n$ repetitions of an experiment with $k$ outcomes (for some $k \geq 1$) with probabilities $p_1, \dots, p_k$ (we need $p_i \geq 0$ and $\sum_{i=1}^k p_i = 1$). Let $X_i$ denote the count of the $i$-th outcome among the $n$ repetitions of the experiment (in other words, $X_i$ is the number of times the $i$-th outcome happens in the $n$ trials). The joint distribution of $(X_1, \dots, X_k)$ is written as $\text{Multinomial}(n; p_1, \dots, p_k)$. It corresponds to the probabilities:

$$\begin{align*} \P\!\left\{ X_1 = x_1, \ldots, X_k = x_k \right\} &= \begin{cases} \displaystyle \frac{n!}{x_1! \cdots x_k!} \prod_{i=1}^k p_i^{x_i}, & \text{if } x_i \ge 0 \text{ for all } i \text{ and } \sum_{i=1}^k x_i = n, \\ 0, & \text{otherwise}. \end{cases} \end{align*}$$
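As a quick numerical sanity check (a minimal Python sketch with illustrative values, not part of the lecture), the pmf above can be coded directly and verified to sum to 1 over all valid count vectors:

```python
# A minimal sketch (not from the lecture): the multinomial pmf coded
# directly from the formula above, checked to sum to 1 over all valid
# count vectors for a small illustrative example.
from math import factorial, isclose, prod

def multinomial_pmf(x, n, p):
    """P{X1 = x1, ..., Xk = xk} for counts x and probabilities p."""
    if any(xi < 0 for xi in x) or sum(x) != n:
        return 0.0
    coef = factorial(n) / prod(factorial(xi) for xi in x)
    return coef * prod(pi ** xi for pi, xi in zip(p, x))

n, p = 5, (0.2, 0.3, 0.5)
# Sum over all (x1, x2, x3) with x1 + x2 + x3 = n.
total = sum(
    multinomial_pmf((x1, x2, n - x1 - x2), n, p)
    for x1 in range(n + 1)
    for x2 in range(n + 1 - x1)
)
print(isclose(total, 1.0))  # → True
```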

The following consequences of $(X_1, \dots, X_k) \sim \text{Multinomial}(n; p_1, \dots, p_k)$ are straightforward to see:

  1. To derive the distribution of $X_i$, i.e., $\P\{X_i = x\}$, we can write

    $$\begin{align*} \P\{X_i = x\} &= \P\{x \text{ trials had outcome } i,\ (n-x) \text{ trials had outcome not } i \} \\ &= \frac{n!}{x! (n-x)!} p_i^{x} (1 - p_i)^{n-x}. \end{align*}$$

    In other words $X_i \sim \text{Bin}(n, p_i)$. In essence, here we consider the variant of the original experiment where, for each trial, we just record whether the outcome was $i$ or 'not $i$'. This converts the setup to binary trials and $X_i$ represents the number of successes (success = outcome is $i$), so $X_i \sim \text{Bin}(n, p_i)$.

  2. The joint distribution of $(X_i, X_j)$ (for fixed $i \neq j$) can be derived in the following way:

    $$\begin{align*} & \P\{X_i = x_1, X_j = x_2\} \\ &= \P\{x_1 \text{ trials were } i,\ x_2 \text{ trials were } j,\ n-x_1-x_2 \text{ trials were neither } i \text{ nor } j \}\\ &= \frac{n!}{x_1! x_2!(n - x_1 - x_2)!} p_i^{x_1} p_j^{x_2} (1 - p_i - p_j)^{n-x_1-x_2} \end{align*}$$

    provided $x_1, x_2 \in \{0, 1, \dots, n\}$ with $x_1 + x_2 \leq n$ (otherwise, the probability equals 0).

  3. It can be checked (see e.g., Multinomial distribution) that

    $$\begin{align*} \E X_i = n p_i, \quad \text{var}(X_i) = np_i(1 - p_i), \quad \text{and} \quad \text{cov}(X_i, X_j) = -np_i p_j \text{ if } i \neq j. \end{align*}$$
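All three facts can be checked numerically from the joint pmf. The following Python sketch (with assumed values and $k = 3$) verifies the marginal $X_1 \sim \text{Bin}(n, p_1)$ and the covariance formula:

```python
# Illustrative check (assumed values, k = 3): the marginal of X_1 from the
# joint pmf matches Bin(n, p_1), and cov(X_1, X_2) = -n p_1 p_2.
from math import comb, factorial, isclose, prod

def multinomial_pmf(x, n, p):
    if any(xi < 0 for xi in x) or sum(x) != n:
        return 0.0
    return (factorial(n) / prod(factorial(xi) for xi in x)
            * prod(pi ** xi for pi, xi in zip(p, x)))

n, p = 6, (0.2, 0.3, 0.5)

# Marginal of X_1: sum the joint pmf over x2 (with x3 = n - x1 - x2).
for x in range(n + 1):
    marginal = sum(multinomial_pmf((x, x2, n - x - x2), n, p)
                   for x2 in range(n + 1 - x))
    assert isclose(marginal, comb(n, x) * p[0] ** x * (1 - p[0]) ** (n - x))

# cov(X_1, X_2) computed exactly from the joint pmf.
def expect(f):
    return sum(f(x1, x2) * multinomial_pmf((x1, x2, n - x1 - x2), n, p)
               for x1 in range(n + 1) for x2 in range(n + 1 - x1))

cov = expect(lambda a, b: a * b) - expect(lambda a, b: a) * expect(lambda a, b: b)
print(isclose(cov, -n * p[0] * p[1]))  # → True
```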

Dirichlet Distribution

The Dirichlet distribution is a generalization of the Beta distribution. Recall that the $\text{Beta}(a, b)$ distribution is a distribution over $p \in [0, 1]$ corresponding to the density:

$$\begin{align*} f(p) = \frac{\Gamma(a + b)}{\Gamma(a) \Gamma(b)} p^{a-1} (1 - p)^{b-1} I\{0 \leq p \leq 1\}. \end{align*}$$

This density function is well-defined only when both $a$ and $b$ are strictly positive. However one can still define the $\text{Beta}(a, b)$ distribution even when one or both of $a$ and $b$ are zero. This can be done by a limiting argument:

  1. $\text{Beta}(0, 0) = \lim_{\epsilon \downarrow 0} \text{Beta}(\epsilon, \epsilon) = \text{Bernoulli}(0.5)$. In other words, $\text{Beta}(0, 0)$ becomes a discrete two-point distribution assigning equal probability to 0 and 1.

  2. For fixed $b > 0$, we have $\text{Beta}(0, b) = \lim_{\epsilon \downarrow 0} \text{Beta}(\epsilon, b) = \text{Bernoulli}(0) = \delta_{\{0\}}$. In other words, $\text{Beta}(0, b)$ is the point mass at 0.

  3. For fixed $a > 0$, we have $\text{Beta}(a, 0) = \lim_{\epsilon \downarrow 0} \text{Beta}(a, \epsilon) = \text{Bernoulli}(1) = \delta_{\{1\}}$. In other words, $\text{Beta}(a, 0)$ is the point mass at 1.

The limiting statements above can be made rigorous by moment calculations (e.g., by showing that the moments of $\text{Beta}(\epsilon, \epsilon)$ converge to the moments of $\text{Bernoulli}(0.5)$ as $\epsilon \downarrow 0$).
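The moment calculation can be seen concretely. The $m$-th moment of $\text{Beta}(a, b)$ is $\E p^m = \prod_{j=0}^{m-1} (a + j)/(a + b + j)$, so the following sketch shows every moment of $\text{Beta}(\epsilon, \epsilon)$ approaching $1/2$, the value of $\E p^m$ under $\text{Bernoulli}(0.5)$:

```python
# Sketch of the moment argument: the m-th moment of Beta(a, b) is
#   E p^m = prod_{j=0}^{m-1} (a + j) / (a + b + j),
# so every moment of Beta(eps, eps) tends to 1/2, matching E p^m = 1/2
# under Bernoulli(0.5), for every m >= 1.
def beta_moment(a, b, m):
    out = 1.0
    for j in range(m):
        out *= (a + j) / (a + b + j)
    return out

for eps in (1e-2, 1e-4, 1e-6):
    print([round(beta_moment(eps, eps, m), 4) for m in (1, 2, 3)])
```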

The Beta distribution can be viewed as the distribution over the probabilities ($p$ and $1-p$) of an experiment with two outcomes. The Dirichlet distribution is the distribution over the probabilities $p_1, \dots, p_k$ of an experiment with $k$ outcomes. Specifically, given $a_1, \dots, a_k$, the $\text{Dirichlet}(a_1, \dots, a_k)$ distribution corresponds to the density:

$$\begin{align*} \frac{\Gamma(a_1 + \dots + a_k)}{\Gamma(a_1) \dots \Gamma(a_k)} p_1^{a_1 - 1} \dots p_k^{a_k - 1} I\{p_1, \dots, p_k \geq 0, \sum_{i} p_i = 1\}. \end{align*}$$

Note that this should be viewed as a density on a $(k-1)$-dimensional space (as opposed to a $k$-dimensional space) because of the restriction $p_1 + \dots + p_k = 1$. In other words, for every $A \subseteq \R^k$, we have

$$\begin{align*} & \P\{(p_1, \dots, p_k) \in A\} \\ &= \frac{\Gamma(a_1 + \dots + a_k)}{\Gamma(a_1) \dots \Gamma(a_k)} \int_S p_1^{a_1 - 1} \dots p_{k-1}^{a_{k-1} - 1} \left(1 - p_1 - \dots - p_{k-1} \right)^{a_k - 1} \\ &\qquad \times I\{(p_1, \dots, p_{k-1}, 1 - p_1 - \dots - p_{k-1}) \in A\} \, dp_1 \dots dp_{k-1} \end{align*}$$

where

$$\begin{align*} S := \left\{(p_1, \dots, p_{k-1}) : p_{i} \geq 0,\ p_1 + \dots + p_{k-1} \leq 1 \right\}. \end{align*}$$

It can be checked that if $(p_1, \dots, p_k) \sim \text{Dirichlet}(a_1, \dots, a_k)$, then (see e.g., Dirichlet distribution)

$$\begin{align*} &\E p_i = \frac{a_i}{a_1 + \dots + a_k} \\ & \text{var}(p_i) = \left(\frac{a_i}{a_1 + \dots + a_k} \right) \left( \frac{(a_1 + \dots + a_k) - a_i}{a_1 + \dots + a_k} \right) \left( \frac{1}{a_1 + \dots + a_k + 1} \right) \\ &\text{cov}(p_i, p_j) = -\left(\frac{a_i}{a_1 + \dots + a_k} \right) \left( \frac{a_j}{a_1 + \dots + a_k} \right) \left( \frac{1}{a_1 + \dots + a_k + 1} \right). \end{align*}$$
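These formulas can be sanity-checked by Monte Carlo with NumPy's Dirichlet sampler (a sketch with illustrative parameters, not part of the lecture):

```python
# Monte Carlo check of the Dirichlet mean/variance/covariance formulas
# (illustrative parameters only).
import numpy as np

rng = np.random.default_rng(0)
a = np.array([2.0, 3.0, 5.0])
a0 = a.sum()
samples = rng.dirichlet(a, size=200_000)

# Empirical moments vs. the closed forms above.
print(np.allclose(samples.mean(axis=0), a / a0, atol=1e-2))
print(np.allclose(samples.var(axis=0),
                  (a / a0) * ((a0 - a) / a0) / (a0 + 1), atol=1e-3))
cov12 = np.cov(samples[:, 0], samples[:, 1])[0, 1]
print(abs(cov12 - (-(a[0] / a0) * (a[1] / a0) / (a0 + 1))) < 1e-3)
```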

As in the case of the Beta, the following facts about the Dirichlet can also be proved using moment calculations:

  1. $\text{Dirichlet}(0, \dots, 0) = \lim_{\epsilon \downarrow 0} \text{Dirichlet}(\epsilon, \dots, \epsilon)$ is the discrete uniform distribution on $e_1, \dots, e_k$, where $e_i$ is the vector which has 1 in the $i$-th location and 0 everywhere else, i.e.,

    $$\begin{align*} \text{Dirichlet}(0, \dots, 0) = \frac{1}{k} \sum_{i=1}^k \delta_{\{e_i\}}. \end{align*}$$
  2. If $k = 4$ and $a_2, a_4 > 0$, then $\text{Dirichlet}(0, a_2, 0, a_4) = \lim_{\epsilon \downarrow 0} \text{Dirichlet}(\epsilon, a_2, \epsilon, a_4)$ is supported on $\{(p_1, p_2, p_3, p_4) : p_1 = 0, p_2 \geq 0, p_3 = 0, p_4 \geq 0, p_1 + p_2 + p_3 + p_4 = 1 \}$, and the distribution of $(p_2, p_4)$ is $\text{Dirichlet}(a_2, a_4)$. Informally, we write

    $$\begin{align*} \text{Dirichlet}(0, a_2, 0, a_4) = \text{Dirichlet}(a_2, a_4). \end{align*}$$
  3. More generally, let $a = (a_1, \dots, a_k)$ with each $a_i \geq 0$. Let $I = \{i: a_i > 0\}$ and let $m$ be the cardinality of $I$. Then $\text{Dirichlet}(a) := \lim_{\epsilon \downarrow 0} \text{Dirichlet}(a_1 + \epsilon, \dots, a_k + \epsilon)$ satisfies the following.

    1. The support of $\text{Dirichlet}(a)$ equals $\{(p_1, \dots, p_k): p_i \geq 0, \sum_{i=1}^k p_i = 1, p_i = 0 \text{ for all } i \notin I\}$.

    2. If $p \sim \text{Dirichlet}(a)$, then the subvector $(p_i, i \in I)$ satisfies:

      $$\begin{align*} (p_i, i \in I) \sim \text{Dirichlet}(a_i, i \in I). \end{align*}$$

      Note that the above is a Dirichlet distribution for an $m$-dimensional probability vector.
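Fact 2 above can be illustrated empirically by taking $\epsilon$ small rather than exactly zero (a sketch with assumed values; NumPy requires strictly positive Dirichlet parameters, so we approximate the limit):

```python
# Sketch of the epsilon-limit: Dirichlet(eps, a2, eps, a4) for tiny eps
# puts essentially no mass on coordinates 1 and 3, and (p2, p4) behaves
# like Dirichlet(a2, a4).
import numpy as np

rng = np.random.default_rng(1)
eps, a2, a4 = 1e-6, 2.0, 3.0
samples = rng.dirichlet([eps, a2, eps, a4], size=100_000)

print(samples[:, [0, 2]].mean() < 1e-4)                   # p1, p3 collapse to 0
print(abs(samples[:, 1].mean() - a2 / (a2 + a4)) < 1e-2)  # E p2 = a2/(a2+a4)
```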

Dirichlet Prior and Multinomial Likelihood

We saw the following basic fact in Lecture 8:

$$\begin{align*} p \sim \text{Beta}(a, b) \quad \text{ and } \quad X \mid p \sim \text{Bin}(n, p) \implies p \mid X = x \sim \text{Beta}(a + x, b + n - x). \end{align*}$$

The generalization of this is:

$$\begin{align*} & (p_1, \dots, p_k) \sim \text{Dirichlet}(a_1, \dots, a_k) \text{ and } (X_1, \dots, X_k) \mid (p_1, \dots, p_k) \sim \text{Multinomial}(n; p_1, \dots, p_k) \\ & \implies (p_1, \dots, p_k) \mid X_1 = x_1, \dots, X_k = x_k \sim \text{Dirichlet}(a_1 + x_1, a_2 + x_2, \dots, a_k + x_k). \end{align*}$$

So the posterior mean estimate of $p_i$ is given by:

$$\begin{align*} \E (p_i \mid X_1 = x_1, \dots, X_k = x_k) = \frac{a_i + x_i}{(a_1 + x_1) + \dots + (a_k + x_k)} = \frac{a_i + x_i}{n + a_1 + \dots + a_k} \end{align*}$$

where we used $x_1 + \dots + x_k = n$.
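The conjugate update is simple enough to express in a few lines of Python (hypothetical hyperparameters and counts, purely illustrative):

```python
# The conjugate update in code (hypothetical values): prior Dirichlet(a),
# observed counts (x1, ..., xk) with n = sum of counts, posterior
# Dirichlet(a1 + x1, ..., ak + xk) with mean (ai + xi) / (n + sum(a)).
prior = [1.0, 1.0, 2.0]   # hypothetical prior hyperparameters a_i
counts = [4, 0, 6]        # observed counts x_i, so n = 10
n = sum(counts)

posterior = [a + x for a, x in zip(prior, counts)]
post_mean = [a / (n + sum(prior)) for a in posterior]
print(posterior)   # → [5.0, 1.0, 8.0]
print(post_mean)
```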

In the special case where all $a_i = 0$, the posterior mean of $p_i$ is simply $x_i/n$; so we are estimating the probability of the $i$-th outcome by the proportion of times the $i$-th outcome appeared in the $n$ trials. This is just the frequentist MLE. So the posterior mean estimate coincides with the frequentist MLE when the prior is $\text{Dirichlet}(0, \dots, 0)$.

More precisely, the posterior corresponding to the $\text{Dirichlet}(0, \dots, 0)$ prior equals $\text{Dirichlet}(x_1, \dots, x_k)$. A key property of this posterior is that if $x_i = 0$ for some $i$, then, by the properties of Dirichlet distributions with some zero hyperparameters discussed in the previous section, the posterior places all its mass on the set $\{p_i = 0\}$. In other words, any outcome that is not observed in the sample is assigned zero probability under the posterior and is therefore completely excluded from future consideration.
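To make the zero-probability phenomenon concrete (a sketch with hypothetical counts, not from the lecture): the standard remedy is a strictly positive prior such as $\text{Dirichlet}(1, \dots, 1)$, whose posterior mean $(x_i + 1)/(n + k)$ is the classical Laplace-smoothed estimate and keeps unobserved outcomes in play.

```python
# Hypothetical counts: under the Dirichlet(0,...,0) prior the posterior
# mean is the MLE x_i / n, so the unobserved third outcome gets posterior
# probability exactly 0; a Dirichlet(1,...,1) prior instead gives
# (x_i + 1) / (n + k), which never assigns zero probability.
counts = [7, 3, 0]        # outcome 3 never observed in n = 10 trials
n, k = sum(counts), len(counts)

mle = [x / n for x in counts]                    # posterior mean, all a_i = 0
smoothed = [(x + 1) / (n + k) for x in counts]   # posterior mean, all a_i = 1
print(mle)        # → [0.7, 0.3, 0.0]
print(smoothed)
```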