Today we generalize the Beta-Binomial analysis from last week to Dirichlet-Multinomial inference. We first look at the Multinomial distribution (a generalization of the Binomial distribution) and the Dirichlet distribution (a generalization of the Beta distribution).
Recall the Binomial setting: $X$ represents the count of successes in $n$ repetitions of a simple binary experiment with two outcomes, success (probability $p$) and failure (probability $1-p$).
The multinomial distribution is obtained by considering $n$ repetitions of an experiment with $k$ outcomes (for some $k \ge 1$) with probabilities $p_1,\dots,p_k$ (we need $p_i \ge 0$ and $\sum_{i=1}^k p_i = 1$). Let $X_i$ denote the count of the $i$-th outcome among the $n$ repetitions of the experiment (in other words, $X_i$ is the number of times the $i$-th outcome happens in the $n$ trials). The joint distribution of $(X_1,\dots,X_k)$ is written as $\text{Multinomial}(n; p_1,\dots,p_k)$. It corresponds to the probabilities:
$$\mathbb{P}\{X_1 = x_1, \dots, X_k = x_k\} = \begin{cases} \dfrac{n!}{x_1! \cdots x_k!} \displaystyle\prod_{i=1}^k p_i^{x_i}, & \text{if } x_i \ge 0 \text{ for all } i \text{ and } \sum_{i=1}^k x_i = n, \\[6pt] 0, & \text{otherwise.} \end{cases}$$
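As a quick sanity check, the pmf above can be evaluated directly and compared against `scipy.stats.multinomial` (the values of $n$, the probabilities, and the counts below are arbitrary choices):

```python
from math import factorial, prod

from scipy.stats import multinomial

# Arbitrary example: n = 6 trials, k = 3 outcomes
n, p = 6, [0.5, 0.3, 0.2]
x = [3, 2, 1]  # counts, summing to n

# Direct evaluation of n!/(x_1! ... x_k!) * prod_i p_i^{x_i}
manual = (factorial(n) / prod(factorial(xi) for xi in x)) * prod(
    pi**xi for pi, xi in zip(p, x)
)

# Matches scipy's multinomial pmf
assert abs(manual - multinomial.pmf(x, n=n, p=p)) < 1e-12
print(manual)  # approximately 0.135
```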
The marginal distribution of each count is $X_i \sim \text{Bin}(n, p_i)$. In essence, here we consider the variant of the original experiment where, for each trial, we record only whether the outcome was $i$ or "not $i$". This converts the setup to binary trials, and $X_i$ represents the number of successes (success = outcome is $i$), so $X_i \sim \text{Bin}(n, p_i)$.
The joint distribution of $(X_i, X_j)$ (for fixed $i \neq j$) can be derived in the following way:
$$\begin{aligned} \mathbb{P}\{X_i = x_1, X_j = x_2\} &= \mathbb{P}\{x_1 \text{ trials were } i,\; x_2 \text{ trials were } j,\; n - x_1 - x_2 \text{ trials were neither } i \text{ nor } j\} \\ &= \frac{n!}{x_1!\, x_2!\, (n - x_1 - x_2)!}\; p_i^{x_1}\, p_j^{x_2}\, (1 - p_i - p_j)^{n - x_1 - x_2}. \end{aligned}$$
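This lumping argument ($(X_i, X_j, n - X_i - X_j)$ is itself multinomial with three categories) can be checked by simulation; all parameters below are arbitrary choices:

```python
import numpy as np
from scipy.stats import multinomial

rng = np.random.default_rng(0)
n, p = 10, [0.4, 0.3, 0.2, 0.1]  # arbitrary: k = 4 outcomes
i, j = 0, 2                      # study the pair (X_1, X_3)

draws = rng.multinomial(n, p, size=200_000)

# Empirical frequency of the event {X_i = 4, X_j = 2}
emp = np.mean((draws[:, i] == 4) & (draws[:, j] == 2))

# Lumping: (X_i, X_j, n - X_i - X_j) ~ Multinomial(n; p_i, p_j, 1 - p_i - p_j)
theory = multinomial.pmf([4, 2, 4], n=n, p=[p[i], p[j], 1 - p[i] - p[j]])

assert abs(emp - theory) < 0.01
```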
The Dirichlet distribution is a generalization of the Beta distribution. Recall that the $\text{Beta}(a,b)$ distribution is a distribution over $p \in [0,1]$ corresponding to the density:

$$f(p) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\; p^{a-1} (1-p)^{b-1}, \qquad p \in [0,1].$$
This density function is well-defined only when both $a$ and $b$ are strictly positive. However, one can still define the $\text{Beta}(a,b)$ distribution even when one or both of $a$ and $b$ are zero. This can be done by a limiting argument:
$\text{Beta}(0,0) = \lim_{\epsilon \downarrow 0} \text{Beta}(\epsilon,\epsilon) = \text{Bernoulli}(0.5)$. In other words, $\text{Beta}(0,0)$ becomes a discrete two-point distribution assigning equal probability to $0$ and $1$.
For fixed $b > 0$, we have $\text{Beta}(0,b) = \lim_{\epsilon \downarrow 0} \text{Beta}(\epsilon,b) = \text{Bernoulli}(0) = \delta_{\{0\}}$. In other words, $\text{Beta}(0,b)$ is the point mass at $0$.
For fixed $a > 0$, we have $\text{Beta}(a,0) = \lim_{\epsilon \downarrow 0} \text{Beta}(a,\epsilon) = \text{Bernoulli}(1) = \delta_{\{1\}}$. In other words, $\text{Beta}(a,0)$ is the point mass at $1$.
The limiting statements above can be made rigorous by moment calculations (e.g., by showing that the moments of Beta(ϵ,ϵ) converge to the moments of Bernoulli(0.5) as ϵ↓0).
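For instance, the $m$-th moment of $\text{Beta}(a,b)$ is $\prod_{r=0}^{m-1}(a+r)/(a+b+r)$, while every moment of $\text{Bernoulli}(0.5)$ equals $1/2$. A small numerical sketch of this convergence using `scipy`:

```python
from scipy.stats import beta

# Every moment of Bernoulli(0.5) is 0.5; the moments of Beta(eps, eps)
# should approach 0.5 as eps decreases.
for eps in (0.1, 0.01, 0.001):
    devs = [abs(beta(eps, eps).moment(m) - 0.5) for m in (1, 2, 3)]
    print(eps, max(devs))  # the maximum deviation shrinks with eps

assert max(abs(beta(1e-3, 1e-3).moment(m) - 0.5) for m in (1, 2, 3)) < 5e-3
```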
The Beta distribution can be viewed as the distribution over the probabilities ($p$ and $1-p$) of an experiment with two outcomes. The Dirichlet distribution is the distribution over the probabilities $p_1,\dots,p_k$ of an experiment with $k$ outcomes. Specifically, given $a_1,\dots,a_k > 0$, the $\text{Dirichlet}(a_1,\dots,a_k)$ distribution corresponds to the density:

$$f(p_1,\dots,p_k) = \frac{\Gamma(a_1 + \dots + a_k)}{\Gamma(a_1)\cdots\Gamma(a_k)} \prod_{i=1}^k p_i^{a_i - 1}, \qquad p_i \ge 0,\; \sum_{i=1}^k p_i = 1.$$
Note that this should be viewed as a density on a $(k-1)$-dimensional space (as opposed to $k$-dimensional space) because of the restriction $p_1 + \dots + p_k = 1$. In other words, for every $A \subseteq \mathbb{R}^k$, we have

$$\mathbb{P}\{(p_1,\dots,p_k) \in A\} = \int \frac{\Gamma(a_1 + \dots + a_k)}{\Gamma(a_1)\cdots\Gamma(a_k)} \prod_{i=1}^{k} p_i^{a_i - 1}\; dp_1 \cdots dp_{k-1},$$

where $p_k := 1 - p_1 - \dots - p_{k-1}$ and the integral is over $\{(p_1,\dots,p_{k-1}) : p_i \ge 0,\; p_1 + \dots + p_{k-1} \le 1,\; (p_1,\dots,p_k) \in A\}$.
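Concretely, draws from a Dirichlet always land on the probability simplex, which is why the density is $(k-1)$-dimensional. A quick sketch with `numpy` (the parameter values are arbitrary; $\mathbb{E}[p_i] = a_i / \sum_j a_j$ is the standard Dirichlet mean):

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([2.0, 3.0, 5.0])  # arbitrary positive parameters, k = 3

samples = rng.dirichlet(a, size=100_000)

# Every draw has nonnegative coordinates summing to 1, i.e. it lies on the
# (k-1)-dimensional probability simplex.
assert np.all(samples >= 0)
assert np.allclose(samples.sum(axis=1), 1.0)

# Sanity check of the standard Dirichlet mean E[p_i] = a_i / sum_j(a_j)
assert np.allclose(samples.mean(axis=0), a / a.sum(), atol=0.01)
```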
As for the case of the Beta, the following facts about the Dirichlet can also be proved using moment calculations:
$\text{Dirichlet}(0,\dots,0) = \lim_{\epsilon \downarrow 0} \text{Dirichlet}(\epsilon,\dots,\epsilon)$ is the discrete uniform distribution on $e_1,\dots,e_k$, where $e_i$ is the vector which has $1$ in the $i$-th location and $0$ everywhere else, i.e., $\mathbb{P}\{p = e_i\} = 1/k$ for each $i = 1,\dots,k$.
If $k = 4$ and $a_2, a_4 > 0$, then $\text{Dirichlet}(0, a_2, 0, a_4) = \lim_{\epsilon \downarrow 0} \text{Dirichlet}(\epsilon, a_2, \epsilon, a_4)$ is supported on $\{(p_1,p_2,p_3,p_4) : p_1 = 0, p_2 \ge 0, p_3 = 0, p_4 \ge 0, p_1 + p_2 + p_3 + p_4 = 1\}$, and the distribution of $(p_2, p_4)$ is $\text{Dirichlet}(a_2, a_4)$. Informally, we write: $p_1 = p_3 = 0$ and $(p_2, p_4) \sim \text{Dirichlet}(a_2, a_4)$.
More generally, let a=(a1,…,ak) with each ai≥0. Let I={i:ai>0} and let m be the cardinality of I. Then Dirichlet(a):=limϵ↓0Dirichlet(a1+ϵ,…,ak+ϵ) satisfies the following.
The support of $\text{Dirichlet}(a)$ equals $\{(p_1,\dots,p_k) : p_i \ge 0,\; \sum_{i=1}^k p_i = 1,\; p_i = 0 \text{ for all } i \notin I\}$.
If $p \sim \text{Dirichlet}(a)$, then the subvector $(p_i)_{i \in I}$ satisfies $(p_i)_{i \in I} \sim \text{Dirichlet}((a_i)_{i \in I})$.
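These limit statements can again be sanity-checked by moment calculations. For the symmetric case $\text{Dirichlet}(\epsilon,\dots,\epsilon)$ with $k$ coordinates, the marginal of each $p_i$ is $\text{Beta}(\epsilon, (k-1)\epsilon)$ (the standard Dirichlet marginal), while under the claimed vertex limit (uniform on $e_1,\dots,e_k$) each $p_i$ is $\text{Bernoulli}(1/k)$, all of whose moments equal $1/k$:

```python
from scipy.stats import beta

# If p ~ Dirichlet(eps, ..., eps) with k coordinates, then marginally
# p_i ~ Beta(eps, (k-1)*eps).  Under the vertex limit, p_i ~ Bernoulli(1/k),
# whose m-th moment is 1/k for every m >= 1.
k = 4
for eps in (0.1, 0.01, 0.001):
    devs = [abs(beta(eps, (k - 1) * eps).moment(m) - 1 / k) for m in (1, 2, 3)]
    print(eps, max(devs))  # shrinks as eps decreases

assert max(abs(beta(1e-3, 3e-3).moment(m) - 0.25) for m in (1, 2, 3)) < 5e-3
```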
$$(p_1,\dots,p_k) \sim \text{Dirichlet}(a_1,\dots,a_k) \quad \text{and} \quad (X_1,\dots,X_k) \mid (p_1,\dots,p_k) \sim \text{Multinomial}(n; p_1,\dots,p_k)$$
$$\implies \quad (p_1,\dots,p_k) \mid X_1 = x_1, \dots, X_k = x_k \sim \text{Dirichlet}(a_1 + x_1,\, a_2 + x_2,\, \dots,\, a_k + x_k).$$
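The conjugate update is a one-line computation; a minimal sketch with arbitrary prior hyperparameters and counts:

```python
import numpy as np

rng = np.random.default_rng(0)

a = np.array([1.0, 1.0, 1.0])  # arbitrary prior hyperparameters
x = np.array([5, 2, 3])        # observed counts, n = 10

# Conjugacy: posterior is Dirichlet(a_1 + x_1, ..., a_k + x_k)
posterior = a + x

# Posterior mean of p_i is (a_i + x_i) / (n + sum_j a_j)
post_mean = posterior / posterior.sum()

# Agreement with samples drawn from the posterior
samples = rng.dirichlet(posterior, size=100_000)
assert np.allclose(samples.mean(axis=0), post_mean, atol=0.01)
```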
In general, the posterior mean of $p_i$ is $(a_i + x_i)/(n + \sum_{j=1}^k a_j)$. In the special case where all $a_i = 0$, this is simply $x_i/n$; so we are estimating the probability of the $i$-th outcome by the proportion of times the $i$-th outcome appeared in the $n$ trials. This is just the frequentist MLE. So the posterior mean estimate coincides with the frequentist MLE when the prior is $\text{Dirichlet}(0,\dots,0)$.
More precisely, the posterior corresponding to the $\text{Dirichlet}(0,\dots,0)$ prior equals $\text{Dirichlet}(x_1,\dots,x_k)$. A key property of this posterior is that if $x_i = 0$ for some $i$, then, by the properties of Dirichlet distributions with some zero hyperparameters discussed in the previous section, the posterior places all its mass on the set $\{p_i = 0\}$. In other words, any outcome that is not observed in the sample is assigned zero probability under the posterior and is therefore completely excluded from future consideration.
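This exclusion effect can be seen numerically by approximating the zero hyperparameter with a tiny $\epsilon$ (the counts below are arbitrary, with the third outcome unobserved):

```python
import numpy as np

rng = np.random.default_rng(0)

# Counts x = (6, 4, 0); the Dirichlet(0,0,0) prior gives posterior
# Dirichlet(6, 4, 0), approximated here by Dirichlet(6, 4, eps).
eps = 1e-6
samples = rng.dirichlet([6.0, 4.0, eps], size=100_000)

# The unobserved third outcome receives essentially zero posterior mass...
assert samples[:, 2].mean() < 1e-4

# ...while (p_1, p_2) behaves like Dirichlet(6, 4), with posterior means
# x_i / n = (0.6, 0.4).
assert np.allclose(samples[:, :2].mean(axis=0), [0.6, 0.4], atol=0.01)
```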