Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

STAT 238 - Bayesian Statistics Lecture Fourteen

Spring 2026, UC Berkeley

We will illustrate Dirichlet-Multinomial inference on a text data example. This analysis is taken from MacKay1995HierarchicalDirichlet.

A Text Example

We observe some text y1,,yny_1, \dots, y_n. The vocabulary consists of the number of distinct words (including punctuation marks) in the text. Assume the vocabulary size is kk.

A simple i.i.d model

The simplest model assumes that y1,,yny_1, \dots, y_n are i.i.d with an distribution PP that is supported on the vocabulary. The likelihood then becomes:

i=1npyi=i=1kpiNi\begin{align} \prod_{i=1}^n p_{y_i} = \prod_{i=1}^k p_i^{N_i} \end{align}

where NiN_i is the observed count for the ii-th word in the vocabulary. We can place a Dirichlet prior on (p1,,pk)(p_1, \dots, p_k) (e.g., Dirichlet(0,,0)\text{Dirichlet}(0, \dots, 0)) which would result in the Dirichlet(N1,,Nk)\text{Dirichlet}(N_1, \dots, N_k) posterior distribution. One can then generate new text in the following way:

  1. First generate (p1,,pk)(p_1, \dots, p_k) from the posterior distribution.

  2. Then generate new words yn+1,yn+2,y_{n+1}, y_{n+2}, \dots i.i.d from the multinomial distribution Mult(1;p1,,pk)\text{Mult}(1; p_1, \dots, p_k) until a specific word (e.g., the period symbol ’\cdot’)) is generated.

Obviously, this will not result in realistic text. This is because the i.i.d assumption collapses the text into a bag-of-words representation, discarding all sequential information. Indeed, in this model, we are effectively working with the counts N1,,NkN_1, \dots, N_k instead of the raw data y1,,yny_1, \dots, y_n. The likelihood in terms of the counts is:

nN1!Nk!p1N1pkNki=1kpiNi\begin{align*} \frac{n}{N_1! \dots N_k!} p_1^{N_1} \dots p_k^{N_k} \propto \prod_{i=1}^k p_i^{N_i} \end{align*}

which coincides with (1). As a result, it cannot capture even the most basic linguistic dependencies (e.g., that “the” is likely to be followed by a noun).

AR(1) model

A slightly better model will be obtained if we replace the i.i.d assumption with a Markovian one. This is also referred to as the first order AutoRegressive AR(1) model.

The likelihood here then becomes:

py1py2y1py3y2,y1pynyn1,,y1\begin{align*} p_{y_1} p_{y_2 \mid y_1} p_{y_3 \mid y_2, y_1} \dots p_{y_n \mid y_{n-1}, \dots, y_1} \end{align*}

We assume now that pyiyi1,,y1=pyiyi1p_{y_i \mid y_{i-1}, \dots, y_1} = p_{y_i \mid y_{i-1}} for each i=3,,ni = 3, \dots, n. This gives

Likelihood=py1i=2npyiyi1.\begin{align*} \text{Likelihood} = p_{y_1} \prod_{i=2}^n p_{y_i \mid y_{i-1}}. \end{align*}

We also assume that py1p_{y_1} is constant so we don’t need to model it. This gives:

Likelihoodi=2npyiyi1.\begin{align} \text{Likelihood} \propto \prod_{i=2}^n p_{y_i \mid y_{i-1}}. \end{align}

If you don’t like the assumption that py1p_{y_1} is constant, you can interpret the above as the conditional likelihood of (y2,,yn)(y_2, \dots, y_n) given y1y_1.

Just as in the i.i.d model where we could write the likelihood in terms of counts, we can rewrite the likelihood (5) using ’bi-gram counts’. For each ii and jj in {1,,k}\{1, \dots, k\}, let xjix_{j \mid i} denote the number of times the word jj follows word ii in the text. Also let xi=j=1kxjix_i = \sum_{j=1}^k x_{j \mid i} denote the number of times word ii appears as the “previous word” in the text.

Then the likelihood (5) is the same as:

i=1kj=1kpjixji\begin{align} \prod_{i=1}^k \prod_{j=1}^k p_{j \mid i}^{x_{j \mid i}} \end{align}

where pjip_{j \mid i} denotes the probability that word jj appears as the next word given that the previous word is ii.

We can also directly write the likelihood in terms of the bigram counts (xji,j=1,,k)(x_{j \mid i}, j = 1, \dots, k) and i=1,,ki = 1, \dots, k. The model here is:

(xji,j=1,,k)indMultinomial(xi;pji,j=1,,k).\begin{align*} (x_{j \mid i}, j = 1, \dots, k) \overset{\text{ind}}{\sim} \text{Multinomial}(x_i; p_{j \mid i}, j = 1, \dots, k). \end{align*}

so that the likelihood in terms of the bigram counts is:

i=1kxi!j=1kxji!j=1kpjixji.\begin{align} \prod_{i=1}^k \frac{x_i!}{\prod_{j=1}^k x_{j \mid i}!} \prod_{j=1}^k p_{j \mid i}^{x_{j \mid i}}. \end{align}

The unknown parameters in this model are the probability vectors (pji,1jk)(p_{j \mid i}, 1 \leq j \leq k) for each i=1,,ki = 1, \dots, k.

MacKay1995HierarchicalDirichlet perform Bayesian inference with the following prior:

(pji,j=1,,k)i.i.dDirichlet(a1,,ak)\begin{align*} (p_{j \mid i}, j = 1, \dots, k) \overset{\text{i.i.d}}{\sim} \text{Dirichlet}(a_1, \dots, a_k) \end{align*}

for some a1,,ak>0a_1, \dots, a_k > 0. This means that all the kk probability vectors (pji,1jk)(p_{j \mid i}, 1 \leq j \leq k) are assumed to be independently distributed according to the same Dirichlet distribution Dirichlet(a1,,ak)\text{Dirichlet}(a_1, \dots, a_k). The prior density is therefore

i=1kΓ(a1++ak)Γ(a1)Γ(ak)j=1kpjiaj1.\begin{align*} \prod_{i=1}^k \frac{\Gamma(a_1 + \dots + a_k)}{\Gamma(a_1) \dots \Gamma(a_k)} \prod_{j=1}^k p_{j \mid i}^{a_j - 1}. \end{align*}

The posterior is then:

i=1kxi!j=1kxji!j=1kpjixji×i=1kΓ(a1++ak)Γ(a1)Γ(ak)j=1kpjiaj1i=1kj=1kpjixji+aj1\begin{align*} \prod_{i=1}^k \frac{x_i!}{\prod_{j=1}^k x_{j \mid i}!} \prod_{j=1}^k p_{j \mid i}^{x_{j \mid i}} \times \prod_{i=1}^k \frac{\Gamma(a_1 + \dots + a_k)}{\Gamma(a_1) \dots \Gamma(a_k)} \prod_{j=1}^k p_{j \mid i}^{a_j - 1} \propto \prod_{i=1}^k \prod_{j=1}^k p_{j \mid i}^{x_{j \mid i} + a_j - 1} \end{align*}

which means that:

(pji,j=1,,k)indDirichlet(xji+aj,j=1,,k).\begin{align*} (p_{j \mid i}, j = 1, \dots, k) \overset{\text{ind}}{\sim} \text{Dirichlet}(x_{j \mid i} + a_j, j = 1, \dots, k). \end{align*}

So the posterior mean estimate of pjip_{j \mid i} is:

p^ji=xji+ajxi+a1++ak.\begin{align*} \hat{p}_{j \mid i} = \frac{x_{j \mid i} + a_j}{x_i + a_1 + \dots + a_k}. \end{align*}

The choice of a1,,aka_1, \dots, a_k controls the properties of the above posterior. Instead of plugging in some values for a1,,aka_1, \dots, a_k, MacKay1995HierarchicalDirichlet propose estimating them by maximizing the marginal likelihood of the data (xji,1i,jk)(x_{j \mid i}, 1 \leq i, j \leq k) over all values of a1,,aka_1, \dots, a_k.

The marginal distribution of the data given the Dirichlet hyperparameters a1,,aka_1, \dots, a_k is:

P{xji,1i,jka1,,ak}=i=1kxi!j=1kxji!j=1kpjixji×i=1kΓ(a1++ak)Γ(a1)Γ(ak)j=1kpjiaj1d(pji all j,i)=i=1kxi!j=1kxji!Γ(a1++ak)Γ(a1)Γ(ak)j=1kpjixji+aj1d(pji,1jk)=i=1kxi!j=1kxji![Γ(a1++ak)Γ(xi+a1++ak)j=1kΓ(xji+aj)Γ(aj)].\begin{align*} & \P \left\{x_{j \mid i}, 1 \leq i, j \leq k \mid a_1, \dots, a_k \right\} \\ &= \int \prod_{i=1}^k \frac{x_i!}{\prod_{j=1}^k x_{j \mid i}!} \prod_{j=1}^k p_{j \mid i}^{x_{j \mid i}} \times \prod_{i=1}^k \frac{\Gamma(a_1 + \dots + a_k)}{\Gamma(a_1) \dots \Gamma(a_k)} \prod_{j=1}^k p_{j \mid i}^{a_j - 1} d(p_{j \mid i} \text{ all } j, i) \\ &= \prod_{i=1}^k \frac{x_i!}{\prod_{j=1}^k x_{j \mid i}!} \frac{\Gamma(a_1 + \dots + a_k)}{\Gamma(a_1) \dots \Gamma(a_k)} \int \prod_{j=1}^k p_{j \mid i}^{x_{j \mid i} + a_j - 1} d(p_{j \mid i}, 1 \leq j \leq k)\\ &= \prod_{i=1}^k \frac{x_i!}{\prod_{j=1}^k x_{j \mid i}!}\left[\frac{\Gamma(a_1 + \dots + a_k)}{\Gamma(x_i + a_1 + \dots + a_k)} \prod_{j=1}^k \frac{\Gamma(x_{j \mid i} + a_j)}{\Gamma(a_j)} \right]. \end{align*}

In the machine learning literature, the marginal distribution of the data given the hyperparameters of the prior is referred to as the Evidence (see e.g., Marginal likelihood). So:

Evidence(a1,,ak)=i=1kxi!j=1kxji![Γ(a1++ak)Γ(xi+a1++ak)j=1kΓ(xji+aj)Γ(aj)].\begin{align*} \text{Evidence}(a_1, \dots, a_k) = \prod_{i=1}^k \frac{x_i!}{\prod_{j=1}^k x_{j \mid i}!}\left[\frac{\Gamma(a_1 + \dots + a_k)}{\Gamma(x_i + a_1 + \dots + a_k)} \prod_{j=1}^k \frac{\Gamma(x_{j \mid i} + a_j)}{\Gamma(a_j)} \right]. \end{align*}

The Evidence can be maximized over a1,,aka_1, \dots, a_k to get a good choice of the hyperparameters. Note that the factorial terms do not involve aja_js so they can dropped in this maximization:

Evidence(a1,,ak)i=1k[Γ(a1++ak)Γ(xi+a1++ak)j=1kΓ(xji+aj)Γ(aj)].\begin{align*} \text{Evidence}(a_1, \dots, a_k) \propto \prod_{i=1}^k \left[\frac{\Gamma(a_1 + \dots + a_k)}{\Gamma(x_i + a_1 + \dots + a_k)} \prod_{j=1}^k \frac{\Gamma(x_{j \mid i} + a_j)}{\Gamma(a_j)} \right]. \end{align*}

It is always better to optimize the logarithm of the evidence (rather than the evidence directly):

logEvidence(a1,,ak)=constant+i=1k{logΓ(A)logΓ(xi+A)+j=1k[logΓ(xji+aj)logΓ(aj)]}.\begin{align*} & \log \text{Evidence}(a_1, \dots, a_k) \\ &= \text{constant} + \sum_{i=1}^k \left\{ \log \Gamma(A) - \log \Gamma(x_i + A) + \sum_{j=1}^k \left[ \log \Gamma(x_{j \mid i} + a_j) - \log \Gamma(a_j) \right] \right\}. \end{align*}

where A:=a1++akA := a_1 + \dots + a_k and xi=j=1kxjix_i = \sum_{j=1}^k x_{j \mid i}. One can compute the gradient with respect to aa using the digamma function (see Digamma function):

Ψ(z):=ddzlogΓ(z).\begin{align*} \Psi(z) := \frac{d}{dz} \log \Gamma(z). \end{align*}

This gives:

ajlogEvidence(a1,,ak)=i=1k{Ψ(A)Ψ(xi+A)+Ψ(xji+aj)Ψ(aj)}.\begin{align*} \frac{\partial}{\partial a_j} \log \text{Evidence}(a_1, \dots, a_k) = \sum_{i=1}^k \left\{\Psi(A) - \Psi(x_i + A) + \Psi(x_{j \mid i} + a_j) - \Psi(a_j) \right\}. \end{align*}

With these gradients (scipy has an inbuilt function for digamma calculation), some numerical optimization routine can be used for maximizing the log-Evidence over a1,,aka_1, \dots, a_k (see code).