
STAT 238 - Bayesian Statistics Lecture Ten

Spring 2026, UC Berkeley

Additional Comments on the Kidney Cancer Data Analysis

Recall the kidney cancer dataset $(X_i, n_i), i = 1, \dots, N$, where $N$ denotes the total number of counties, $n_i$ is the average population of county $i$, and $X_i$ is the number of deaths due to kidney cancer in county $i$ during the fixed time period 1980-89. Consider the following three models for this dataset:

  1. Model One: $\theta_i \overset{\text{i.i.d}}{\sim} Beta(0, 0)$ and $X_i \mid \theta_i \overset{\text{ind}}{\sim} Bin(n_i, \theta_i)$. This model uses an uninformative prior on $\theta_i$, and the posterior mean estimate of each $\theta_i$ coincides with the frequentist MLE $X_i/n_i$. This model performs poorly for the present dataset. In particular, rankings of counties by estimated risk (e.g., the top 100 values of $\hat{\theta}_i$) are dominated by counties with small populations. Moreover, the model yields inaccurate predictions for future county-level death counts.

  2. Model Two: $\theta_i \overset{\text{i.i.d}}{\sim} Beta(15, 150000)$ and $X_i \mid \theta_i \overset{\text{ind}}{\sim} Bin(n_i, \theta_i)$. This model employs a highly informative prior on $\theta_i$, where the hyperparameters $(15, 150000)$ may be viewed as arising from a Beta distribution fitted to historical data. The prior encodes strong prior knowledge about the prevalence of kidney cancer mortality. This leads to substantial shrinkage of the posterior estimates towards the prior mean. Unsurprisingly, given the strength of the prior information, this model yields good empirical performance on the dataset.

  3. Model Three: $\log a, \log b \overset{\text{i.i.d}}{\sim} \text{uniform}(-\infty, \infty)$, $\theta_i \overset{\text{i.i.d}}{\sim} Beta(a, b)$, $X_i \mid \theta_i \overset{\text{ind}}{\sim} Bin(n_i, \theta_i)$. This model places an uninformative prior on the hyperparameters $a$ and $b$, and hence does not inject strong prior knowledge about the prevalence of kidney cancer mortality. Compared to Models One and Two, this is a substantially more flexible and principled model, as it allows the amount of shrinkage to be learned from the data rather than fixed a priori.

    When fitted to the data, the posterior distribution of $(a, b)$ concentrates sharply around values close to $(15, 150000)$ (see the code file for this lecture), indicating that the model successfully infers from the data that kidney cancer deaths are rare events. As a consequence, the posterior estimates of the individual $\theta_i$ are adaptively shrunk toward a common mean, with the degree of shrinkage depending on the corresponding population sizes $n_i$. This hierarchical borrowing of strength leads to stable estimation, sensible county-level rankings, and accurate predictions of future death counts, and overall the model performs very well.

    Note that this model does not require any a priori knowledge specific to kidney cancer and is therefore broadly applicable. For example, the same hierarchical framework can be used to predict batting averages in the baseball example, which will be explored in Homework Two.
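The conjugate Beta-Binomial updates behind Models One and Two can be sketched on simulated data. All populations below are illustrative, and the "true" rates are drawn from the same $Beta(15, 150000)$ used as the prior in Model Two; this is not the actual county dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical county data: populations span several orders of magnitude,
# and true per-person death rates are drawn from Beta(15, 150000).
a_true, b_true = 15.0, 150_000.0
N = 1000
n = rng.integers(100, 1_000_000, size=N)   # county populations (illustrative)
theta = rng.beta(a_true, b_true, size=N)   # true per-person death rates
x = rng.binomial(n, theta)                 # observed death counts

# Model One: posterior mean under the Beta(0, 0) prior is the raw MLE x/n.
mle = x / n

# Model Two: the Beta(a, b) prior is conjugate to the Binomial, so the
# posterior is Beta(a + x, b + n - x) with mean (a + x) / (a + b + n) --
# a convex combination of the prior mean a/(a+b) and the MLE x/n.
a, b = 15.0, 150_000.0
post_mean = (a + x) / (a + b + n)

# Small counties are shrunk almost entirely to the prior mean;
# large counties keep estimates close to their MLE.
small, large = np.argmin(n), np.argmax(n)
print(f"prior mean      : {a / (a + b):.2e}")
print(f"smallest county : MLE={mle[small]:.2e}, posterior={post_mean[small]:.2e}")
print(f"largest county  : MLE={mle[large]:.2e}, posterior={post_mean[large]:.2e}")
```

Since the posterior mean is a convex combination of the prior mean and the MLE, every estimate lies between the two, with the weight on the MLE growing with $n_i$.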

Normal Likelihoods

We will next discuss Bayesian inference with normal likelihoods, and also study some connections to the James-Stein estimator. Normal likelihoods are applicable even in Binomial situations, such as the baseball data analysis example from Efron. Here $X_i$ denotes the number of hits in $n_i = n = 45$ at bats for player $i$. The natural model is $X_i \sim Bin(n_i, p_i)$ with parameter $p_i$, but a normal likelihood (as opposed to a Binomial likelihood) can also be used: invoking the Central Limit Theorem, we can write

\begin{align*}
\frac{X_i}{n_i} \overset{\text{approx}}{\sim} N\left(p_i, \frac{p_i(1-p_i)}{n_i} \right).
\end{align*}

This is because:

\begin{align}
\sqrt{n} \left(\frac{X_i}{n_i} - p_i \right) \overset{\text{Law}}{\rightarrow} N(0, p_i(1 - p_i)) \text{ as } n \rightarrow \infty.
\end{align}
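This normal approximation is easy to check by simulation, using the baseball-style sample size $n = 45$ and an illustrative value of $p$ (not a real batting average):

```python
import numpy as np

rng = np.random.default_rng(1)

# CLT check for the baseball setting: n = 45 at bats, illustrative p.
# The empirical mean and variance of X/n should match p and p(1-p)/n.
n, p = 45, 0.26
x = rng.binomial(n, p, size=200_000)
phat = x / n

print(f"mean of X/n : {phat.mean():.4f}  (target p = {p})")
print(f"var  of X/n : {phat.var():.5f}  (target p(1-p)/n = {p * (1 - p) / n:.5f})")
```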

Note that the unknown parameter $p_i$ appears in the variance above. If we want to work with normal likelihoods with known variance, a clean approach is to use a variance stabilizing transformation, which is justified by the Delta method.

Informally, the Delta method states that if $T_n$ has a limiting normal distribution, then $g(T_n)$ also has a limiting normal distribution, and it gives an explicit formula for the asymptotic variance of $g(T_n)$. This is surprising because $g$ can be linear or non-linear; in general, non-linear functions of normal random variables do not have a normal distribution. But the Delta method works because, under the assumption that $\sqrt{n}(T_n - p) \overset{\text{Law}}{\rightarrow} N(0, \tau^2)$, $T_n$ converges in probability to $p$, so that $T_n$ will be close to $p$ at least for large $n$. In a neighborhood of $p$, the non-linear function $g$ can be approximated by a linear function, which means that $g$ effectively behaves like a linear function. Indeed, the Delta method is a consequence of the approximation:

\begin{align*}
g(T_n) - g(p) \approx g'(p) \left(T_n - p \right).
\end{align*}
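The linearization above can be illustrated numerically with a hypothetical non-linear $g$ and an arbitrary point $p$ (the choices $g = \sin$ and $p = 0.4$ below are purely for illustration): the first-order approximation improves as $t$ approaches $p$.

```python
import math

# First-order Taylor illustration behind the Delta method (hypothetical
# g and p): near p, g(t) - g(p) is well approximated by g'(p)(t - p).
g, dg = math.sin, math.cos   # example non-linear g with known derivative
p = 0.4
for t in (0.5, 0.45, 0.41):
    exact = g(t) - g(p)
    linear = dg(p) * (t - p)
    print(f"t = {t:<4}: exact = {exact:.6f}, linear approx = {linear:.6f}")
```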

By the Delta method and the limiting distribution above, we have

\begin{align*}
\sqrt{n_i} \left(g(X_i/n_i) - g(p_i) \right) \overset{\text{Law}}{\rightarrow} N(0, (g'(p_i))^2 p_i(1 - p_i)).
\end{align*}

The variance above will not depend on $p_i$ if we choose the function $g$ so that

\begin{align*}
g'(p_i) = \frac{1}{\sqrt{p_i(1 - p_i)}}.
\end{align*}

Solving this for $g$, we get $g(p_i) = 2 \arcsin(\sqrt{p_i})$. This is a variance stabilizing transformation for the Binomial. Thus, by the Delta method, we have

\begin{align*}
2 \sqrt{n} \left(\arcsin(\sqrt{X_i/n_i}) - \arcsin(\sqrt{p_i}) \right) \overset{\text{Law}}{\rightarrow} N(0, 1).
\end{align*}
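A quick Monte Carlo check confirms that the arcsine transformation stabilizes the variance: whatever $p$ is, the variance of $2\sqrt{n}\arcsin(\sqrt{X/n})$ stays close to 1 (the values of $n$, $p$, and the number of replications below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# For several values of p, the variance of the transformed variable
# 2*sqrt(n)*arcsin(sqrt(X/n)) should be close to 1, independent of p.
n, reps = 500, 100_000
variances = []
for p in (0.1, 0.3, 0.5, 0.8):
    x = rng.binomial(n, p, size=reps)
    y = 2 * np.sqrt(n) * np.arcsin(np.sqrt(x / n))
    variances.append(y.var())
    print(f"p = {p}: variance of transformed variable = {y.var():.3f}")
```

Note the contrast with the untransformed $X/n$, whose variance $p(1-p)/n$ changes with $p$.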

While working with binomial counts, it thus makes sense to transform the observation $X_i$ and the parameter $p_i$ into

\begin{align*}
Y_i := 2 \sqrt{n_i} \arcsin(\sqrt{X_i/n_i}) \quad \text{ and } \quad \theta_i := 2 \sqrt{n_i} \arcsin(\sqrt{p_i})
\end{align*}

respectively and then work with the likelihood

\begin{align*}
Y_i \mid \theta_i \sim N(\theta_i, 1).
\end{align*}

We will study Bayesian estimation under this model in the next lecture.