
STAT 238 - Bayesian Statistics Lecture Thirteen

Spring 2026, UC Berkeley

Dirichlet Prior and Multinomial Likelihood

We saw the following fact in the last lecture:

$$
\begin{align*}
& (p_1, \dots, p_k) \sim \text{Dirichlet}(a_1, \dots, a_k) \text{ and } (X_1, \dots, X_k) \mid (p_1, \dots, p_k) \sim \text{Multinomial}(n; p_1, \dots, p_k) \\
& \implies (p_1, \dots, p_k) \mid X_1 = x_1, \dots, X_k = x_k \sim \text{Dirichlet}(a_1 + x_1, a_2 + x_2, \dots, a_k + x_k).
\end{align*}
$$

Here the data are given by $x_1, \dots, x_k$, where $x_i$ represents the count of the $i$-th outcome in $n$ repeated trials (so $x_1 + \dots + x_k = n$). The probabilities $p_1, \dots, p_k$ associated with the $k$ outcomes are the unknown parameters to be estimated.

The posterior mean estimate of $p_i$ is given by:

$$
\begin{align*}
\E (p_i \mid X_1 = x_1, \dots, X_k = x_k) = \frac{a_i + x_i}{(a_1 + x_1) + \dots + (a_k + x_k)} = \frac{a_i + x_i}{n + a_1 + \dots + a_k},
\end{align*}
$$

where we used $x_1 + \dots + x_k = n$.
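As a quick numerical illustration, here is a minimal Python sketch of the posterior-mean formula above. The prior parameters and counts below are made-up values, not taken from the lecture:

```python
import numpy as np

# Hypothetical example: Dirichlet(a_1, ..., a_k) prior parameters and
# observed multinomial counts (x_1, ..., x_k). The numbers are invented.
a = np.array([1.0, 1.0, 1.0])  # prior parameters (a_1, a_2, a_3)
x = np.array([5, 2, 3])        # observed counts, so n = 10

n = x.sum()
# The posterior is Dirichlet(a + x), whose mean for coordinate i is
# (a_i + x_i) / (n + a_1 + ... + a_k).
posterior_mean = (a + x) / (n + a.sum())
print(posterior_mean)
```

With these values the posterior is $\text{Dirichlet}(6, 3, 4)$, so the posterior mean is $(6/13, 3/13, 4/13)$.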

Special Case: the $\text{Dirichlet}(0, \dots, 0)$ prior

In the special case where all $a_i = 0$, the posterior mean of $p_i$ is simply $x_i/n$; that is, we estimate the probability of the $i$-th outcome by the proportion of times the $i$-th outcome appeared in the $n$ trials. This is just the frequentist MLE. So the posterior mean estimate coincides with the frequentist MLE when the prior is $\text{Dirichlet}(0, \dots, 0)$.

More precisely, the posterior for the $\text{Dirichlet}(0, \dots, 0)$ prior is $\text{Dirichlet}(x_1, \dots, x_k)$. A key property of this posterior is that if $x_i = 0$ for some $i$, then the posterior places all its mass on the set $\{p_i = 0\}$. In other words, any outcome that is not observed in the sample is assigned zero probability under the posterior and is therefore completely excluded from future consideration.

As an explicit example, suppose $k = 6$, $n = 4$, and $x_1 = 0, x_2 = 0, x_3 = 0, x_4 = 0, x_5 = 3, x_6 = 1$. In other words, we rolled a six-faced die 4 times; in 3 of the 4 rolls, we observed the outcome 5, and in the other roll, we observed a 6. After observing this data (note that we started with the prior $\text{Dirichlet}(0, 0, 0, 0, 0, 0)$), our posterior becomes $\text{Dirichlet}(0, 0, 0, 0, 3, 1)$. This means that the posterior is concentrated on the probability vectors $(0, 0, 0, 0, p_5, 1 - p_5)$ with $p_5 \sim \text{Beta}(3, 1)$.
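A short sketch of how one might sample from this posterior in Python. Note that `numpy`'s Dirichlet sampler requires strictly positive parameters, so we exploit the structure above and sample the two nonzero coordinates via the Beta distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# The Dirichlet(0, 0, 0, 0, 3, 1) posterior concentrates on vectors of the
# form (0, 0, 0, 0, p5, 1 - p5) with p5 ~ Beta(3, 1). Since numpy's
# dirichlet() rejects zero parameters, we draw p5 directly from Beta(3, 1).
M = 1000
p5 = rng.beta(3, 1, size=M)
samples = np.zeros((M, 6))
samples[:, 4] = p5
samples[:, 5] = 1.0 - p5
```

Every sampled probability vector sums to 1 and assigns zero probability to the four unobserved faces, exactly as the posterior dictates.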

Further special case: $\text{Dirichlet}(0, \dots, 0)$ prior and data in which no outcome is repeated

Suppose the actual outcomes observed in the $n$ trials are $y_1, \dots, y_n$, and assume that these are all distinct (which means that we did not observe any repeated outcome). As before, we work with the $\text{Dirichlet}(0, \dots, 0)$ prior. Then the posterior can be written as:

$$
\begin{align*}
\sum_{i=1}^n w_i \delta_{\{y_i\}} ~~ \text{ where } (w_1, \dots, w_n) \sim \text{Dirichlet}(1, \dots, 1),
\end{align*}
$$

where $\delta_{\{y_i\}}$ denotes the point mass at the point $y_i$.

Note also that the density corresponding to the $\text{Dirichlet}(1, \dots, 1)$ distribution is constant, so the above posterior can be understood as the uniform distribution over all probability measures that are supported on the observed data points.

Posterior samples can therefore be generated by drawing:

$$
\begin{align*}
(w_1^{(j)}, \dots, w_n^{(j)}) \overset{\text{i.i.d.}}{\sim} \text{Dirichlet}(1, \dots, 1)
\end{align*}
$$

for $j = 1, \dots, M$ (for a large number $M$). Inference based on these samples is referred to as the Bayesian bootstrap (see Rubin, 1981). This is because the generation process for these samples is similar to the usual nonparametric bootstrap sample generation. Indeed, the usual bootstrap samples can be seen as being generated from:
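The Bayesian bootstrap described above can be sketched in a few lines of Python. The data below are made-up, and the functional of interest (the mean) is chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed data y_1, ..., y_n (any distinct sample would do).
y = rng.normal(loc=2.0, scale=1.0, size=25)
n, M = len(y), 5000

# Bayesian bootstrap: each posterior draw is a weight vector
# w^(j) ~ Dirichlet(1, ..., 1); a functional such as the mean is then
# computed as the w-weighted average of the observed points.
W = rng.dirichlet(np.ones(n), size=M)  # shape (M, n); each row sums to 1
posterior_means = W @ y                # M posterior draws of the mean of Y

# A 95% posterior credible interval for the mean:
interval = np.quantile(posterior_means, [0.025, 0.975])
```

Each row of `W` is one posterior draw of the random probability measure $\sum_i w_i \delta_{\{y_i\}}$, so `posterior_means` collects $M$ draws of its mean.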

$$
\begin{align*}
\sum_{i=1}^n v_i \delta_{\{y_i\}} ~~ \text{ where } (v_1, \dots, v_n) \sim \frac{1}{n} \, \text{Multinomial}(n; 1/n, \dots, 1/n).
\end{align*}
$$

The distributions $(w_1, \dots, w_n) \sim \text{Dirichlet}(1, \dots, 1)$ and $(v_1, \dots, v_n) \sim \text{Multinomial}(n; 1/n, \dots, 1/n)/n$ are quite close when $n$ is large. For example,

$$
\begin{align*}
& \E w_i = \E v_i = 1/n \\
& \text{var}(w_i) = \frac{n-1}{n^2(n+1)} \approx \text{var}(v_i) = \frac{n-1}{n^3} \\
& \text{correlation}(w_i, w_j) = \text{correlation}(v_i, v_j) = \frac{-1}{n-1} ~~ \text{ when } i \neq j.
\end{align*}
$$
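These moment formulas are easy to check by simulation. The following Python sketch (with an arbitrary choice of $n$) compares the empirical moments of the two weight distributions against the expressions above:

```python
import numpy as np

rng = np.random.default_rng(1)
n, M = 10, 200_000

# Bayesian-bootstrap weights: Dirichlet(1, ..., 1).
W = rng.dirichlet(np.ones(n), size=M)
# Classical-bootstrap weights: Multinomial(n; 1/n, ..., 1/n) scaled by 1/n.
V = rng.multinomial(n, np.ones(n) / n, size=M) / n

# Compare empirical moments of the first coordinate with the formulas above.
print(W[:, 0].mean(), V[:, 0].mean())             # both close to 1/n
print(W[:, 0].var(), (n - 1) / (n**2 * (n + 1)))  # Dirichlet variance
print(V[:, 0].var(), (n - 1) / n**3)              # scaled multinomial variance
```

For $n = 10$ the two variances, $9/1100$ and $9/1000$, are already close, and the gap shrinks as $n$ grows.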

So operationally, the bootstrap and the Bayesian bootstrap give very similar results, although conceptually they are quite different. For more information on the Bayesian bootstrap and its relation to the classical bootstrap, see the blog post by Rasmus Bååth: https://www.sumsar.net/blog/2015/04/the-non-parametric-bootstrap-as-a-bayesian-model/.