Spring 2026, UC Berkeley
Additional Comments on the Kidney Cancer Data Analysis
For the kidney cancer dataset, let $N$ denote the total number of counties, $n_i$ the average population of county $i$, and $y_i$ the number of deaths due to kidney cancer in county $i$ during the fixed time period, for $i = 1, \ldots, N$. Consider the following three models for this dataset:
Model One: $y_i \mid \theta_i \sim \text{Bin}(n_i, \theta_i)$ independently, and $\theta_i \overset{\text{i.i.d.}}{\sim} \text{Unif}(0, 1)$. This model uses an uninformative prior on $\theta_i$, and the posterior mean estimate $(y_i + 1)/(n_i + 2)$ of each $\theta_i$ essentially coincides with the frequentist MLE $\hat{\theta}_i = y_i / n_i$. This model performs poorly for the present dataset. In particular, rankings of counties by estimated risk (e.g., the top 100 values of $\hat{\theta}_i$) are dominated by counties with small populations. Moreover, the model yields inaccurate predictions for future county-level death counts.
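The small-county artifact is easy to reproduce in simulation. The sketch below uses synthetic data (not the actual kidney cancer dataset; the common rate and the population range are assumptions chosen for illustration) and shows that ranking counties by the raw MLE $y_i / n_i$ fills the top of the list with small counties:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed): a single true death rate shared by all
# counties, populations spanning several orders of magnitude, and
# Binomial death counts y_i ~ Bin(n_i, theta_true).
theta_true = 1e-4
n = rng.integers(100, 1_000_000, size=3000)   # county populations
y = rng.binomial(n, theta_true)               # observed death counts

# Model One's estimate: the raw MLE y_i / n_i.
mle = y / n

# Counties with the 100 largest estimated rates.
top = np.argsort(mle)[::-1][:100]

# The top-ranked counties are disproportionately small, because the
# noise in y_i / n_i is largest when n_i is small.
print(np.median(n[top]), np.median(n))
```

The same mechanism puts small counties at the *bottom* of the ranking too: a county with $y_i = 0$ has an estimated rate of exactly zero regardless of how little data supports that estimate.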
Model Two: $y_i \mid \theta_i \sim \text{Bin}(n_i, \theta_i)$ independently, and $\theta_i \overset{\text{i.i.d.}}{\sim} \text{Beta}(a, b)$ for fixed hyperparameters $a$ and $b$. This model employs a highly informative prior on $\theta_i$, where the hyperparameters $a, b$ may be viewed as arising from a Beta distribution fitted to historical data. The prior encodes strong prior knowledge about the prevalence of kidney cancer mortality, which leads to substantial shrinkage of the posterior estimates towards the prior mean $a/(a+b)$. Unsurprisingly, given the strength of the prior information, this model yields good empirical performance on the dataset.
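Since the Beta prior is conjugate to the Binomial likelihood, the posterior under Model Two is $\theta_i \mid y_i \sim \text{Beta}(a + y_i,\, b + n_i - y_i)$, whose mean is a weighted average of the MLE $y_i/n_i$ and the prior mean $a/(a+b)$. A minimal sketch (the hyperparameter values below are illustrative choices, not fitted to any real historical data):

```python
# Conjugate Beta-Binomial update for Model Two.
def posterior_mean(y, n, a, b):
    # theta | y ~ Beta(a + y, b + n - y); its mean shrinks the MLE y/n
    # toward the prior mean a/(a+b), with weight n/(a+b+n) on the data.
    return (a + y) / (a + b + n)

a, b = 2.0, 20_000.0   # strong prior centered at a/(a+b) = 1e-4 (assumed)

pm_small = posterior_mean(0, 100, a, b)            # tiny county, y = 0
pm_large = posterior_mean(120, 1_000_000, a, b)    # big county, MLE 1.2e-4
print(pm_small, pm_large)
```

The small county's estimate stays essentially at the prior mean, while the large county's estimate sits close to its MLE: the data dominate only when $n_i$ is large relative to $a + b$.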
Model Three: $y_i \mid \theta_i \sim \text{Bin}(n_i, \theta_i)$ independently, $\theta_i \mid a, b \overset{\text{i.i.d.}}{\sim} \text{Beta}(a, b)$, and a flat (uninformative) prior on $(a, b)$. This model places an uninformative prior on the hyperparameters $a$ and $b$, and hence does not inject strong prior knowledge about the prevalence of kidney cancer mortality. Compared to Models One and Two, this is a substantially more flexible and principled model, as it allows the amount of shrinkage to be learned from the data rather than fixed a priori.
When fitted to the data, the posterior distribution of the prior mean $a/(a+b)$ concentrates sharply around values close to zero (see the code file for this lecture), indicating that the model successfully infers from the data that kidney cancer deaths are rare events. As a consequence, the posterior estimates of the individual $\theta_i$ are adaptively shrunk toward a common mean, with the degree of shrinkage depending on the corresponding population sizes $n_i$. This hierarchical borrowing of strength leads to stable estimation, sensible county-level rankings, and accurate predictions of future death counts; overall the model performs very well.
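A rough way to see this numerically is to put a grid on the prior mean $a/(a+b)$ and evaluate the Beta-Binomial marginal likelihood of the data. The sketch below uses synthetic rare-event data and fixes the concentration $a + b$ at an assumed value for simplicity; it is much cruder than the full computation in the lecture's code file, but it shows the posterior mass piling up near small rates:

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(1)

# Synthetic rare-event data (assumed): per-county rates drawn from a
# Beta prior with small mean, then Binomial death counts.
a_true, b_true = 2.0, 20_000.0
n = rng.integers(100, 1_000_000, size=500)
theta = rng.beta(a_true, b_true, size=500)
y = rng.binomial(n, theta)

def beta_binom_loglik(y, n, a, b):
    # log of the Beta-Binomial marginal likelihood prod_i P(y_i | n_i, a, b),
    # written via log-gamma functions.
    out = 0.0
    for yi, ni in zip(y.tolist(), n.tolist()):
        out += (lgamma(ni + 1) - lgamma(yi + 1) - lgamma(ni - yi + 1)
                + lgamma(yi + a) + lgamma(ni - yi + b) - lgamma(ni + a + b)
                + lgamma(a + b) - lgamma(a) - lgamma(b))
    return out

# Crude grid posterior for the prior mean a/(a+b), holding the
# concentration a+b fixed (a simplification made for this sketch).
means = np.logspace(-6, -1, 40)
kappa = 20_000.0
loglik = np.array([beta_binom_loglik(y, n, m * kappa, (1 - m) * kappa)
                   for m in means])
post = np.exp(loglik - loglik.max())
post /= post.sum()

# The posterior over a/(a+b) concentrates near small values: the model
# learns from the data alone that deaths are rare.
map_mean = means[np.argmax(post)]
print(map_mean)
```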
Note that this model does not require any a priori knowledge specific to kidney cancer and is therefore broadly applicable. For example, the same hierarchical framework can be used to predict batting averages in the baseball example, which will be explored in Homework Two.
Normal Likelihoods
We will next discuss Bayesian inference with normal likelihoods, and also study some connections to the James–Stein estimator. Normal likelihoods are applicable even in Binomial situations, such as the baseball data analysis example from Efron. Here $y_i$ denotes the number of hits in $n$ at bats for player $i$. The natural model is $y_i \sim \text{Bin}(n, p_i)$ with parameter $p_i$, but a normal likelihood (as opposed to a Binomial likelihood) can also be used because, by invoking the Central Limit Theorem, we can write
$$\frac{y_i}{n} \approx N\left(p_i, \frac{p_i(1 - p_i)}{n}\right). \tag{2}$$
This is because:
$$\mathbb{E}\left[\frac{y_i}{n}\right] = p_i \qquad \text{and} \qquad \mathrm{Var}\left(\frac{y_i}{n}\right) = \frac{p_i(1 - p_i)}{n}.$$
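These two moment formulas are easy to confirm by Monte Carlo (the values of $p$ and $n$ below are illustrative choices, roughly matching batting-average magnitudes):

```python
import numpy as np

rng = np.random.default_rng(2)

# Monte Carlo sanity check of the mean and variance of y/n.
p, n = 0.27, 45
draws = rng.binomial(n, p, size=200_000) / n

print(draws.mean())   # close to p
print(draws.var())    # close to p * (1 - p) / n
```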
Note that the unknown parameter $p_i$ appears in the variance above. If we want to work with normal likelihoods with known variance, a clean way is to use a variance stabilizing transformation. This is justified by the Delta method.
Informally, the Delta method states that if $T_n$ has a limiting Normal distribution, then $g(T_n)$ also has a limiting normal distribution, and it gives an explicit formula for the asymptotic variance of $g(T_n)$. This is surprising because $g$ can be linear or non-linear. In general, non-linear functions of normal random variables do not have a normal distribution. But the Delta method works because, under the assumption that $\sqrt{n}(T_n - \theta) \xrightarrow{d} N(0, \sigma^2)$, it follows that $T_n$ converges in probability to $\theta$, so that $T_n$ will be close to $\theta$, at least for large $n$. In a neighborhood of $\theta$, the non-linear function $g$ can be approximated by a linear function, which means that $g$ effectively behaves like a linear function. Indeed, the Delta method is a consequence of the approximation:
$$g(T_n) \approx g(\theta) + g'(\theta)(T_n - \theta).$$
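A quick numerical illustration of the Delta method with the non-linear map $g(x) = x^2$ applied to a Binomial proportion (the parameter values below are arbitrary choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(3)

# Delta method check for g(x) = x**2 applied to phat = y/n.
p, n, reps = 0.3, 2_000, 100_000
phat = rng.binomial(n, p, size=reps) / n
g_of_phat = phat ** 2

# Delta method prediction:
# Var g(phat) ≈ g'(p)^2 * Var(phat) = (2p)^2 * p(1-p)/n
pred_var = (2 * p) ** 2 * p * (1 - p) / n

print(g_of_phat.var(), pred_var)
```

The empirical variance of $g(\hat{p})$ matches the Delta method prediction closely, even though $g$ is non-linear, because $\hat{p}$ stays in a small neighborhood of $p$ where the linearization is accurate.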
By the Delta method and (2), we have
$$g\left(\frac{y_i}{n}\right) \approx N\left(g(p_i),\; \left(g'(p_i)\right)^2 \frac{p_i(1 - p_i)}{n}\right).$$
The variance above will not depend on $p_i$ if we choose the function $g$ so that
$$g'(p) = \frac{1}{\sqrt{p(1 - p)}} \qquad \text{for all } p \in (0, 1).$$
Solving this for $g$, we get $g(p) = 2\arcsin(\sqrt{p})$. This is a variance stabilizing transformation for the Binomial. Thus by the Delta method, we have
$$2\arcsin\left(\sqrt{\frac{y_i}{n}}\right) \approx N\left(2\arcsin(\sqrt{p_i}),\; \frac{1}{n}\right).$$
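The stabilization can be checked by simulation: the variance of $2\arcsin(\sqrt{y/n})$ stays close to $1/n$ as $p$ varies. The values below are illustrative; $n = 45$ is chosen to match the at-bat count in the classical Efron–Morris baseball data:

```python
import numpy as np

rng = np.random.default_rng(4)

# Check that g(p) = 2*arcsin(sqrt(p)) stabilizes the variance:
# Var g(y/n) ≈ 1/n for several different values of p.
n, reps = 45, 200_000
variances = []
for p in [0.2, 0.3, 0.4]:
    phat = rng.binomial(n, p, size=reps) / n
    g = 2 * np.arcsin(np.sqrt(phat))
    variances.append(g.var())

print(variances, 1 / n)
```

All three empirical variances are close to $1/45 \approx 0.0222$, even though the raw variances $p(1-p)/n$ differ across the three values of $p$.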
While working with binomial counts, it thus makes sense to transform the observation and parameter into
$$X_i := 2\arcsin\left(\sqrt{\frac{y_i}{n}}\right) \qquad \text{and} \qquad \theta_i := 2\arcsin(\sqrt{p_i}),$$
respectively, and then work with the likelihood
$$X_i \mid \theta_i \sim N\left(\theta_i, \frac{1}{n}\right).$$
We will study Bayesian estimation under this model in the next lecture.