We observe some text y1,…,yn. The vocabulary consists of the number of distinct words (including punctuation marks) in the text. Assume the vocabulary size is k.
where Ni is the observed count for the i-th word in the vocabulary. We can place a Dirichlet prior on (p1,…,pk) (e.g., Dirichlet(0,…,0)) which would result in the Dirichlet(N1,…,Nk) posterior distribution. One can then generate new text in the following way:
First generate (p1,…,pk) from the posterior distribution.
Then generate new words yn+1,yn+2,… i.i.d from the multinomial distribution Mult(1;p1,…,pk) until a specific word (e.g., the period symbol ’⋅’)) is generated.
Obviously, this will not result in realistic text. This is because the i.i.d assumption collapses the text into a bag-of-words representation, discarding all sequential information. Indeed, in this model, we are effectively working with the counts N1,…,Nk instead of the raw data y1,…,yn. The likelihood in terms of the counts is:
which coincides with (1). As a result, it cannot capture even the most basic linguistic dependencies (e.g., that “the” is likely to be followed by a noun).
A slightly better model will be obtained if we replace the i.i.d assumption with a Markovian one. This is also referred to as the first order AutoRegressive AR(1) model.
If you don’t like the assumption that py1 is constant, you can interpret the above as the conditional likelihood of (y2,…,yn) given y1.
Just as in the i.i.d model where we could write the likelihood in terms of counts, we can rewrite the likelihood (5) using ’bi-gram counts’. For each i and j in {1,…,k}, let xj∣i denote the number of times the word j follows word i in the text. Also let xi=∑j=1kxj∣i denote the number of times word i appears as the “previous word” in the text.
for some a1,…,ak>0. This means that all the k probability vectors (pj∣i,1≤j≤k) are assumed to be independently distributed according to the same Dirichlet distribution Dirichlet(a1,…,ak). The prior density is therefore
The choice of a1,…,ak controls the properties of the above posterior. Instead of plugging in some values for a1,…,ak, MacKay1995HierarchicalDirichlet propose estimating them by maximizing the marginal likelihood of the data (xj∣i,1≤i,j≤k) over all values of a1,…,ak.
The marginal distribution of the data given the Dirichlet hyperparameters a1,…,ak is:
P{xj∣i,1≤i,j≤k∣a1,…,ak}=∫i=1∏k∏j=1kxj∣i!xi!j=1∏kpj∣ixj∣i×i=1∏kΓ(a1)…Γ(ak)Γ(a1+⋯+ak)j=1∏kpj∣iaj−1d(pj∣i all j,i)=i=1∏k∏j=1kxj∣i!xi!Γ(a1)…Γ(ak)Γ(a1+⋯+ak)∫j=1∏kpj∣ixj∣i+aj−1d(pj∣i,1≤j≤k)=i=1∏k∏j=1kxj∣i!xi![Γ(xi+a1+⋯+ak)Γ(a1+⋯+ak)j=1∏kΓ(aj)Γ(xj∣i+aj)].
In the machine learning literature, the marginal distribution of the data given the hyperparameters of the prior is referred to as the Evidence (see e.g., Marginal likelihood). So:
The Evidence can be maximized over a1,…,ak to get a good choice of the hyperparameters. Note that the factorial terms do not involve ajs so they can dropped in this maximization:
With these gradients (scipy has an inbuilt function for digamma calculation), some numerical optimization routine can be used for maximizing the log-Evidence over a1,…,ak (see code).