$$(p_1,\ldots,p_k) \sim \text{Dirichlet}(a_1,\ldots,a_k) \ \text{ and } \ (X_1,\ldots,X_k) \mid (p_1,\ldots,p_k) \sim \text{Multinomial}(n;\, p_1,\ldots,p_k)$$
$$\implies (p_1,\ldots,p_k) \mid X_1 = x_1, \ldots, X_k = x_k \sim \text{Dirichlet}(a_1 + x_1,\, a_2 + x_2, \ldots,\, a_k + x_k).$$
Here the data are given by $x_1,\ldots,x_k$, where $x_i$ is the count of the $i$-th outcome in $n$ repeated trials (so $x_1 + \cdots + x_k = n$). The probabilities $p_1,\ldots,p_k$ associated with the $k$ outcomes are the unknown parameters to be estimated.
In the special case where all $a_i = 0$ (an improper prior, but one that still yields a well-defined posterior), the posterior mean of $p_i$ is simply $x_i/n$; we are estimating the probability of the $i$-th outcome by the proportion of times it appeared in the $n$ trials. This is exactly the frequentist MLE, so the posterior mean estimate coincides with the frequentist MLE when the prior is Dirichlet$(0,\ldots,0)$.
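As a quick numerical sketch of this coincidence (the counts below are made up for illustration; numpy assumed), the posterior mean under a Dirichlet prior can be computed directly and compared with the MLE:

```python
import numpy as np

# Hypothetical data: counts of k = 3 outcomes in n = 10 trials.
x = np.array([2, 5, 3])
n = x.sum()

# Under a Dirichlet(a_1, ..., a_k) prior the posterior is
# Dirichlet(a_1 + x_1, ..., a_k + x_k), with mean (a_i + x_i) / (sum_j a_j + n).
a = np.zeros(3)  # the improper Dirichlet(0, ..., 0) prior
posterior_mean = (a + x) / (a.sum() + n)

# With all a_i = 0 this reduces to the frequentist MLE x_i / n.
mle = x / n
print(posterior_mean)  # [0.2 0.5 0.3]
print(np.allclose(posterior_mean, mle))  # True
```

Any nonzero $a_i$ would instead shrink the estimate away from the raw proportions.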
More precisely, the posterior for the Dirichlet$(0,\ldots,0)$ prior is Dirichlet$(x_1,\ldots,x_k)$. A key property of this posterior is that if $x_i = 0$ for some $i$, then the posterior places all its mass on the set $\{p_i = 0\}$. In other words, any outcome that is not observed in the sample is assigned zero probability under the posterior and is therefore completely excluded from future consideration.
As an explicit example, suppose $k = 6$, $n = 4$, $x_1 = x_2 = x_3 = x_4 = 0$, $x_5 = 3$, $x_6 = 1$. In other words, we rolled a six-faced die 4 times; in 3 of the 4 rolls, we observed the outcome 5, and in the other roll, we observed a 6. After observing this data (note that we started with the prior Dirichlet$(0,0,0,0,0,0)$), our posterior becomes Dirichlet$(0,0,0,0,3,1)$. This means that the posterior is concentrated on the probability vectors $(0,0,0,0,p_5,1-p_5)$ with $p_5 \sim \text{Beta}(3,1)$.
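This concentration can be checked by simulation (a small sketch using numpy; the number of draws is arbitrary): since the posterior puts zero mass on the unseen faces, a posterior draw is obtained by sampling $p_5 \sim \text{Beta}(3,1)$ and filling in the remaining coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Posterior Dirichlet(0, 0, 0, 0, 3, 1): all mass on faces 5 and 6,
# with p5 ~ Beta(3, 1) and p6 = 1 - p5.
p5 = rng.beta(3, 1, size=100_000)
draws = np.column_stack([np.zeros((p5.size, 4)), p5, 1 - p5])

# Every draw is a probability vector assigning zero to the four unseen faces;
# the posterior mean of p5 is E[Beta(3, 1)] = 3/4.
print(draws.mean(axis=0))  # approximately [0, 0, 0, 0, 0.75, 0.25]
```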
Further special case: Dirichlet$(0,\ldots,0)$ prior and data in which no outcome is repeated
Suppose the actual outcomes observed in the $n$ trials are $y_1,\ldots,y_n$ and assume that these are all distinct (which means that we did not observe any repeated outcome). As before, we work with the Dirichlet$(0,\ldots,0)$ prior. Then the posterior can be written as
$$\sum_{i=1}^{n} w_i \, \delta_{\{y_i\}} \quad \text{where } (w_1,\ldots,w_n) \sim \text{Dirichlet}(1,\ldots,1),$$
where $\delta_{\{y_i\}}$ denotes the point mass at the point $y_i$.
Note also that the density of the Dirichlet$(1,\ldots,1)$ distribution is constant, so the above posterior can be understood as the uniform distribution over all probability measures that are supported on the observed data points.
Samples from this posterior are easily generated: draw $(w_1^{(j)},\ldots,w_n^{(j)}) \sim \text{Dirichlet}(1,\ldots,1)$ and form the probability measure $\sum_{i=1}^{n} w_i^{(j)} \delta_{\{y_i\}}$, for $j = 1,\ldots,M$ (for a large number $M$). Inference done with these samples is referred to as the Bayesian Bootstrap (see Rubin1981BayesianBootstrap), because the process generating them is similar to the usual nonparametric bootstrap sample generation. Indeed, the usual bootstrap samples can be seen as being generated from
$$\sum_{i=1}^{n} v_i \, \delta_{\{y_i\}} \quad \text{where } (v_1,\ldots,v_n) \sim \frac{1}{n}\,\text{Multinomial}(n;\, 1/n,\ldots,1/n).$$
So operationally, the bootstrap and the Bayesian bootstrap give very similar results, although conceptually they are quite different. For more information on the Bayesian bootstrap and its relation to the classical bootstrap, see the blog post by Rasmus Bååth: https://www.sumsar.net/blog/2015/04/the-non-parametric-bootstrap-as-a-bayesian-model/.
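The operational similarity is easy to see in code. The sketch below (illustrative data; numpy assumed) approximates the distribution of the sample mean both ways: Dirichlet$(1,\ldots,1)$ weights for the Bayesian bootstrap, and multinomial counts divided by $n$ for the classical bootstrap.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sample of n distinct observations.
y = rng.normal(size=20)
n, M = y.size, 5000

# Bayesian bootstrap: Dirichlet(1, ..., 1) weights on the observed points.
w = rng.dirichlet(np.ones(n), size=M)               # shape (M, n)
bayes_means = w @ y

# Classical bootstrap: multinomial counts divided by n as weights.
v = rng.multinomial(n, np.ones(n) / n, size=M) / n  # shape (M, n)
classical_means = v @ y

# The two distributions of the weighted mean are nearly indistinguishable.
print(bayes_means.mean(), classical_means.mean())
print(bayes_means.std(), classical_means.std())
```

One visible (minor) difference: the Dirichlet weights are continuous while the multinomial weights are multiples of $1/n$, so the Bayesian bootstrap never produces exactly repeated resampled values.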