
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta
import pandas as pd

Lecture 9: Kidney Cancer Data Analysis

This is an example of “Bayesian disease mapping”, taken from Section 2.7 (on the estimation of kidney cancer rates) of the book “Bayesian Data Analysis, Third Edition” by Gelman et al. Some of the code is adapted from a blog post by Robin Ryder. Consider the following dataset.

#Read the kidney cancer dataset
d_full = pd.read_csv("KidneyCancerClean.csv", skiprows=4)
display(d_full.head(15))
summary = d_full.describe(include = 'all')
display(summary)

This is county-level data on population and deaths due to kidney cancer. There are quite a few columns in the dataset but we shall only work with the four columns: dc, dc.2, pop and pop.2:

  1. The columns dc and dc.2 represent death counts due to kidney cancer in 1980-84 and 1985-89 respectively.

  2. The columns pop and pop.2 represent some measure of the average population in 1980-84 and 1985-89 respectively.

Let us create a smaller dataframe with only these columns.

d = d_full[['state', 'Location', 'fips', 'dc', 'pop', 'dc.2', 'pop.2']].copy()
display(d.head(20))

The purpose of this data analysis is to estimate the kidney cancer death rates for each county and, more importantly, to flag the counties with high kidney cancer death rates. Before proceeding to the analysis, let us first combine the death counts for the two periods 80-84 and 85-89 to get one overall death count for each county. Further, let us take the average of pop and pop.2 to get a measure of the average population of each county in the period 80-89.

# Combine the death and population counts for the two periods 80-84 and 85-89
d['dct'] = d['dc'] + d['dc.2'] #dct stands for death count total
d['popm'] = (d['pop'] + d['pop.2']) / 2
display(d.head(15))

A simple way to estimate the kidney cancer rate of each county is to use the proportion:

\begin{align*} \text{Naive Estimate of Kidney Cancer Death Rate} = \frac{\text{Number of Kidney Cancer Deaths}}{\text{Population}} \end{align*}

We can then rank all the counties according to their naive proportions and flag the counties for which the naive proportion is high.

d['naiveproportion'] = d['dct'] / d['popm']
summary = d['naiveproportion'].describe()
print(summary)
#Histogram of naive proportion estimates:
import matplotlib.pyplot as plt
plt.hist(d['naiveproportion'], bins=300)
plt.xlabel('naiveproportion')
plt.ylabel('Frequency')
plt.title('Histogram of Naive Proportions')
plt.grid(True)
plt.show()
#Some observations on the histogram: 
#There is a spike at zero (corresponding to counties with no kidney cancer deaths)
#Most proportions are around 0.0001 but there are a few counties which have large proportions (more than 0.0004)
count    3110.000000
mean        0.000108
std         0.000068
min         0.000000
25%         0.000067
50%         0.000100
75%         0.000139
max         0.000691
Name: naiveproportion, dtype: float64
<Figure size 640x480 with 1 Axes>
#Flagging Counties with High Kidney Cancer Death Proportions. 
#One way to do this is to get the counties with the 100 highest naive proportions:
threshold = d['naiveproportion'].nlargest(100).iloc[-1]
print(threshold)

d['cancerhigh'] = d['naiveproportion'] >= threshold
with pd.option_context('display.max_rows', None):
    display(d[d['cancerhigh']])
0.00025143527636927463
#It will be cool to plot the high cancer counties (in terms of the naive proportions) on the US map
#This can be done in the following way.
import plotly.express as px
#First convert fips to standard form:
d['modfips'] = d['fips'].apply(lambda x: str(x).zfill(5))
#Plotting the counties with high cancer proportion:
fig = px.choropleth(
    d, 
    geojson="https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json", # This is a dataset containing county geometries
    locations='modfips', # Use the 'fips' column for county identification
    color='cancerhigh',
    hover_name='Location',  # Display the 'Location' column on hover
    scope="usa",
    title="Top 100 kidney cancer death rate counties (in terms of naive proportions)",
    color_continuous_scale="YlOrRd"
)
fig.update_geos(fitbounds="locations")
fig.update_layout(margin={"r":0,"t":40,"l":0,"b":0})
fig.show()

Question: Is this a good way of flagging the high kidney cancer rate counties? Should we now look into specific reasons (such as dietary patterns, pollution, lack of access to quality medical care, etc.) why these counties have high rates?

It turns out that this analysis is misleading because of the use of naive proportions as estimates of kidney cancer death rates. In particular, the counties that we flagged are mostly counties with low populations. This can be understood as follows. The typical rate of kidney cancer is about 0.0001 as can be seen, for example, from the naive proportion estimates of counties with large populations.

sorted_populations = d['popm'].sort_values()
print(list(sorted_populations))
#Select counties with large populations
#By large population, let us take populations more than 300000
largecounties = d[d['popm'] >= 300000]
print(largecounties.shape[0])
#So there are 323 counties with population more than 300000
#Let us evaluate the naive proportions for these counties. 
summary = d[d['popm'] >= 300000]['naiveproportion'].describe()
display(summary)
#the mean and median are about 0.0001 which can therefore be viewed as a typical value of the kidney cancer rate
count    323.000000
mean       0.000098
std        0.000029
min        0.000034
25%        0.000081
50%        0.000095
75%        0.000115
max        0.000227
Name: naiveproportion, dtype: float64

Now we can explain why our initial flagged set of high cancer rate counties is dominated by counties with low populations. Consider a county with a small population of 1000. Suppose this county has 1 kidney cancer death in the given ten-year period. The naive proportion would then be 1/1000 = 0.001, which is already 10 times the typical rate of 0.0001. So this county would be flagged as a high cancer rate county on the basis of a single death. Go back to our flagged list of counties and check their population sizes.
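A small self-contained sketch of this phenomenon (the population sizes and the typical rate 0.0001 come from the discussion above):

```python
import numpy as np
from scipy.stats import binom

p = 1e-4  # typical kidney cancer death rate (from the large counties)

# One death in a county of 1000 already gives a naive proportion
# ten times the typical rate
small_naive = 1 / 1000
print(small_naive / p)

# At the typical rate, roughly one in ten such counties will see at
# least one death over the period purely by chance
p_at_least_one = 1 - binom.pmf(0, 1000, p)
print(p_at_least_one)

# The standard deviation of the naive proportion is sqrt(p(1-p)/n),
# so it is far larger for small counties than for large ones
sd = {n: np.sqrt(p * (1 - p) / n) for n in (1000, 500000)}
print(sd)
```

So even if every county had exactly the typical rate, about one in ten counties of this size would show a naive proportion ten times the typical value.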

d[d['naiveproportion'] >= threshold]['popm'].describe()
count      100.000000
mean     16940.950000
std      11922.792393
min       1448.000000
25%       8442.625000
50%      14204.250000
75%      21550.375000
max      60380.500000
Name: popm, dtype: float64

The largest population in this list is around 60000, and there are counties with populations as small as 1448. So this entire list of supposedly high kidney cancer rate counties is dominated by counties with small populations.

Note that there are many large counties (e.g., with population more than 300000) in the dataset.

#Summary statistics of counties with population more than 300000:
print(d[d['popm'] >= 300000]['popm'].describe())
print(d.shape)
#So there are 323 counties with population more than 300000 (total number of counties in the dataset is 3110). 
count    3.230000e+02
mean     9.672240e+05
std      1.224408e+06
min      3.024855e+05
25%      4.001872e+05
50%      6.118650e+05
75%      1.091832e+06
max      1.593715e+07
Name: popm, dtype: float64
(3110, 12)

Bayesian estimates of the kidney cancer death rates

For Bayesian estimates, we need an appropriate prior. One way of obtaining such a prior is to consider the naive proportions of the large population counties (say, with population at least 300000) and to fit a Beta density to the naive proportions. This can be done via mean and variance as follows.

proportions_largecounties = d['naiveproportion'][d['popm'] >= 300000] #select the large-population counties
mean_largecounties = np.mean(proportions_largecounties)
var_largecounties = np.var(proportions_largecounties)
print(mean_largecounties, var_largecounties)
9.813238615946435e-05 8.172661645662482e-10

For what values of $a$ and $b$ would the $\text{Beta}(a, b)$ density have mean $m$ and variance $V$? To answer this, we basically need to solve

\begin{align*} \frac{a}{a+b} = m ~~ \text{ and } ~~ \frac{a}{a+b} \frac{b}{a+b} \frac{1}{a+b+1} = V. \end{align*}

Plugging $m$ for $\frac{a}{a+b}$ and $1-m$ for $\frac{b}{a+b}$ in the second equation, we obtain

\begin{align*} \frac{m(1-m)}{a+b+1} = V \implies a+b = \frac{m(1-m)}{V} - 1. \end{align*}

Combining this with $\frac{a}{a+b} = m$, we obtain

\begin{align*} a = m\left(\frac{m(1-m)}{V} - 1 \right) ~~~ \text{ and } ~~~ b = (1-m)\left(\frac{m(1-m)}{V} - 1 \right) \end{align*}

Of course, $m$ and $V$ should be such that $a$ and $b$ as computed above are strictly positive. If not, this method of determining $a$ and $b$ from the mean and variance does not work.
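These formulas can be packaged into a small helper that also enforces the positivity condition just mentioned (the function name `fit_beta_mm` is ours; the mean and variance plugged in below are the large-county values computed earlier):

```python
def fit_beta_mm(m, V):
    """Method-of-moments fit of a Beta(a, b) density to mean m and variance V."""
    c = m * (1 - m) / V - 1  # this quantity equals a + b
    a, b = m * c, (1 - m) * c
    if a <= 0 or b <= 0:
        raise ValueError("this mean/variance pair is not attainable by a Beta density")
    return a, b

# Check against the mean and variance of the large-county naive proportions
a, b = fit_beta_mm(9.813238615946435e-05, 8.172661645662482e-10)
print(a, b)
```

This reproduces the values computed in the cell below.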

#Using the above formula, we obtain
m = mean_largecounties
V = var_largecounties
ahat = ((m*m*(1-m))/V) - m
bhat = (((1-m)*(1-m)*m)/V) - (1-m)
a = ahat
b = bhat
print(a, b)
11.781889938777498 120049.39668603126

Note that here $a$ is much smaller than $b$, which means that the Beta density is concentrated on very small values (this makes sense because it is intended as a prior for the kidney cancer rate, which we believe to be small, with typical values around 0.0001).
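As a quick sanity check on this concentration, we can look at the prior mean and a few quantiles (a small sketch using the values of $a$ and $b$ fitted above):

```python
from scipy.stats import beta

# Prior fitted above: a is about 11.78 and b is about 120049
a, b = 11.781889938777498, 120049.39668603126
prior = beta(a, b)
print(prior.mean())                   # close to the typical rate 0.0001
print(prior.ppf([0.05, 0.5, 0.95]))  # central 90% of the prior mass
```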

#Plot of this Beta density
x = np.arange(0, 8e-4, 1e-6)
y = beta.pdf(x, a, b)
plt.plot(x, y, label = f"Beta({a}, {b})")
plt.title(f'Beta Density for a = {a} and b = {b}')
plt.xlabel("x")
plt.ylabel("Density")
plt.legend()
plt.show()
<Figure size 640x480 with 1 Axes>

We now compute the Bayesian estimates of kidney cancer death rates. For a county with population $POP$ and death count $DC$, our posterior density of the rate would be

\begin{align*} \text{Beta}(a + DC, b + POP - DC) \end{align*}

Thus the posterior mean (which will be our Bayesian estimate of the kidney cancer death rate) is

\begin{align*} \frac{a + DC}{a + b + POP}. \end{align*}
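It is worth noting that this posterior mean is a weighted average of the prior mean $\frac{a}{a+b}$ and the naive proportion $\frac{DC}{POP}$, with weight $\frac{a+b}{a+b+POP}$ on the prior. A minimal sketch (the fitted prior values and the two illustrative county sizes are taken from elsewhere in this lecture):

```python
a, b = 11.781889938777498, 120049.39668603126  # prior parameters fitted earlier

def posterior_mean(dc, pop, a, b):
    return (a + dc) / (a + b + pop)

# The posterior mean equals w * prior_mean + (1 - w) * naive,
# where w = (a + b) / (a + b + pop) is the weight on the prior
for dc, pop in [(1, 1448), (120, 529653)]:  # a small and a large illustrative county
    w = (a + b) / (a + b + pop)
    blended = w * a / (a + b) + (1 - w) * dc / pop
    print(pop, w, posterior_mean(dc, pop, a, b), blended)
```

For the small county the weight on the prior is close to 1, so the posterior mean barely moves from the prior mean; for the large county most of the weight goes to the data.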
d['bayes1'] = (a + d['dct']) / (a + b + d['popm'])
#We can now flag the top 100 counties using the Bayesian estimate:
threshold = d['bayes1'].nlargest(100).iloc[-1]
d['bayeshigh1'] = d['bayes1'] >= threshold
#Let us look at these counties to see if they are different from the previous flagged list:
with pd.option_context('display.max_rows', None):
    display(d[d['bayeshigh1']])

Clearly now there are many counties with large populations in the list. The list of counties flagged as ‘high rate counties’ by the Bayesian method is quite different from the corresponding list by the naive frequentist proportions.

#Table of the top 100 counties flagged by the naive proportion and the Bayesian estimate:
#2 by 2 cross-tabulation of the two lists:
ct = pd.crosstab(d['cancerhigh'], d['bayeshigh1'])
display(ct)
#There are only 18 (out of 100) counties that are flagged by both the naive and the Bayesian method.

Here is a plot of the high Bayesian estimate counties on the US map.

#Plotting counties with high Bayesian estimates of kidney cancer rates on the US map
import plotly.express as px
#First convert fips to standard form:
d['modfips'] = d['fips'].apply(lambda x: str(x).zfill(5))
#Plotting the counties with high cancer proportion:

fig = px.choropleth(
    d, 
    geojson="https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json", # This is a dataset containing county geometries
    locations='modfips', # Use the 'fips' column for county identification
    color='bayeshigh1',
    hover_name='Location',  # Display the 'Location' column on hover
    scope="usa",
    title="Top 100 kidney cancer death rate counties (in terms of Bayes estimates)",
    color_continuous_scale="YlOrRd"
)
fig.update_geos(fitbounds="locations")
fig.update_layout(margin={"r":0,"t":40,"l":0,"b":0})
fig.show()

Here are both plots superimposed.

d['category'] = "Neither"
d.loc[d['bayeshigh1'] & ~d['cancerhigh'], 'category'] = "Bayes only"
d.loc[~d['bayeshigh1'] & d['cancerhigh'], 'category'] = "Naive only"
d.loc[d['bayeshigh1'] & d['cancerhigh'], 'category'] = "Both"
fig = px.choropleth(
    d,
    geojson="https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json",
    locations='modfips',
    color='category',
    scope="usa",
    hover_name='Location',
    color_discrete_map={
        "Bayes only": "crimson",
        "Naive only": "dodgerblue",
        "Both": "lime",        # bright green
        "Neither": "lightgray"
    }
)

fig.update_geos(fitbounds="locations")
fig.show()
#Look at the states for the high counties (under both the naive and Bayes estimates)
counts = d.groupby('state').agg({'cancerhigh': 'sum', 'bayeshigh1': 'sum'})
print(counts)
                      cancerhigh  bayeshigh1
state                                       
ALABAMA                        1           1
ARIZONA                        0           3
ARKANSAS                       2           2
CALIFORNIA                     0           2
COLORADO                       1           0
CONNECTICUT                    0           1
DELAWARE                       0           0
DISTRICT.OF.COLUMBIA           0           0
FLORIDA                        0          14
GEORGIA                        4           0
HAWAII                         0           0
IDAHO                          2           0
ILLINOIS                       3           6
INDIANA                        0           2
IOWA                           5           6
KANSAS                         6           3
KENTUCKY                       2           1
LOUISIANA                      0           3
MAINE                          0           0
MARYLAND                       0           1
MASSACHUSETTS                  0           1
MICHIGAN                       4           4
MINNESOTA                      1           7
MISSISSIPPI                    8           0
MISSOURI                       7           6
MONTANA                        6           0
NEBRASKA                       5           1
NEVADA                         1           0
NEW.HAMPSHIRE                  0           0
NEW.JERSEY                     0           3
NEW.MEXICO                     0           0
NEW.YORK                       0           1
NORTH.CAROLINA                 2           3
NORTH.DAKOTA                   4           0
OHIO                           0           0
OKLAHOMA                       5           6
OREGON                         2           2
PENNSYLVANIA                   0           8
RHODE.ISLAND                   0           1
SOUTH.CAROLINA                 1           2
SOUTH.DAKOTA                   7           0
TENNESSEE                      1           2
TEXAS                         13           4
UTAH                           0           0
VERMONT                        0           0
VIRGINIA                       4           0
WASHINGTON                     0           0
WEST.VIRGINIA                  1           0
WISCONSIN                      2           4
WYOMING                        0           0

Clearly, there are quite a few differences between the two flagged lists.

Let us now see how the Bayes rule converts prior to posterior for some specific counties.

#Consider the county in row 1681 of the dataset:
print(d.iloc[1681])
state                                NEBRASKA
Location           McPherson County, Nebraska
fips                                    31117
dc                                          0
pop                                      1504
dc.2                                        1
pop.2                                    1392
dct                                         1
popm                                   1448.0
naiveproportion                      0.000691
cancerhigh                               True
modfips                                 31117
bayes1                               0.000105
bayeshigh1                              False
category                           Naive only
Name: 1681, dtype: object
#This county has low population (1448) with one death. The naive proportion is 0.000691 (which is roughly seven times the typical value of 0.0001)
#So this has been flagged as a high risk county by the naive estimate
#This is NOT a high risk county according to the Bayesian estimate
#Below we plot the prior and the posterior for this county to see why this is the case.
x = np.arange(0, 8e-4, 1e-6)
Tot = round(d.loc[1681, 'popm'])
Pos = d.loc[1681, 'dct']
Neg = Tot - Pos
plt.plot(x, beta.pdf(x, a, b), color = 'red', label = 'Prior')
plt.plot(x, beta.pdf(x, a + Pos, b + Neg), label = 'Posterior')
plt.axvline(x = Pos/Tot, color = 'blue') #line at the naive proportion
plt.legend()
plt.show()
#Basically there is very little difference between the prior and the posterior
#In other words, the Bayesian posterior largely ignores the observed death count and goes almost entirely by the prior
#This makes sense because a single count from such a small county carries very little information.
<Figure size 640x480 with 1 Axes>
#Next let us look at the county in row 1624. This is Arthur County, again in Nebraska
#It has a small average population of 1233.5 and a death count of 0.
print(d.iloc[1624])
x = np.arange(0, 8e-4, 1e-6)
Tot = round(d.loc[1624, 'popm'])
Pos = d.loc[1624, 'dct']
Neg = Tot - Pos
plt.plot(x, beta.pdf(x, a, b), color = 'red', label = 'Prior')
plt.plot(x, beta.pdf(x, a + Pos, b + Neg), label = 'Posterior')
plt.axvline(x = Pos/Tot, color = 'blue') #line at the naive proportion
plt.legend()
plt.show()
#Similar story here; not much difference between prior and posterior
state                             NEBRASKA
Location           Arthur County, Nebraska
fips                                 31005
dc                                       0
pop                                   1295
dc.2                                     0
pop.2                                 1172
dct                                      0
popm                                1233.5
naiveproportion                        0.0
cancerhigh                           False
modfips                              31005
bayes1                            0.000097
bayeshigh1                           False
category                           Neither
Name: 1624, dtype: object
<Figure size 640x480 with 1 Axes>

Now let us look at an example of a county which was flagged by the Bayesian analysis but not by the frequentist analysis.

print(d.iloc[346])
#This is a large county (population around half a million) and a large death count (120)
x = np.arange(0, 8e-4, 1e-6)
Tot = round(d.loc[346, 'popm'])
Pos = d.loc[346, 'dct']
Neg = Tot - Pos
plt.plot(x, beta.pdf(x, a, b), color = 'red', label = 'Prior')
plt.plot(x, beta.pdf(x, a + Pos, b + Neg), label = 'Posterior')
plt.axvline(x = Pos/Tot, color = 'blue') #line at the naive proportion
plt.legend()
plt.show()
#The Bayesian posterior moves quite a bit in the direction of the frequentist estimate
#The frequentist analysis does not flag this as a high-risk county because many low-population counties
#had even larger naive proportions, pushing it out of the top 100.
state                               FLORIDA
Location           Sarasota County, Florida
fips                                  12115
dc                                       55
pop                                  485595
dc.2                                     65
pop.2                                573711
dct                                     120
popm                               529653.0
naiveproportion                    0.000227
cancerhigh                            False
modfips                               12115
bayes1                             0.000203
bayeshigh1                             True
category                         Bayes only
Name: 346, dtype: object
<Figure size 640x480 with 1 Axes>

Bayesian estimates are often said to perform “shrinkage”: they pull the frequentist estimates towards the prior. Overall, the variability of the Bayesian estimates is quite a bit smaller than that of the frequentist estimates.

h1, _, _ = plt.hist(d['naiveproportion'], bins=100, color='blue', label='Naive Estimate', alpha=0.5)
h2, _, _ = plt.hist(d['bayes1'], bins=100, color='red', label='Bayes Estimate', alpha=0.5)

plt.title("Histograms of the Naive and Bayesian estimates")
plt.xlabel("Estimates")
plt.legend() 
plt.show()
<Figure size 640x480 with 1 Axes>

This shrinkage is one reason why the Bayes method was able to flag Sarasota County, Florida as a high-risk county even though its Bayes estimate was only 0.000203: this value is extreme relative to the other Bayesian estimates (though not extreme relative to the naive proportions).
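The shrinkage effect can also be reproduced in a small simulation, entirely with synthetic data (the prior parameters and population range below are our choices for illustration): we draw true rates from a Beta prior, simulate death counts, and compare the spread of the two estimators.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 12.0, 120000.0                         # a prior similar to the fitted one
pops = rng.integers(1000, 100000, size=2000)  # synthetic county populations
theta = rng.beta(a, b, size=2000)             # true (unobserved) rates
deaths = rng.binomial(pops, theta)            # simulated death counts

naive = deaths / pops
bayes = (a + deaths) / (a + b + pops)

# The Bayes estimates are far less variable than the naive proportions
print(np.std(naive), np.std(bayes))
```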

Recall that the original dataset had two time periods 80-84 and 85-89. We now repeat the analysis only for the first time period and attempt to predict the proportions for the second time period. We can compare the performances of the naive proportions and the Bayes estimates in this prediction problem.

d['naiveproportion1'] = d['dc'] / d['pop'] #these are the naive estimates

#For the Bayes estimates, we first need to estimate the prior:
proportions1_largecounties = d['naiveproportion1'][d['pop'] >= 300000] #select the large-population counties
m1 = np.mean(proportions1_largecounties)
V1 = np.var(proportions1_largecounties)
a1 = ((m1*m1*(1-m1))/V1) - m1
b1 = (((1-m1)*(1-m1)*m1)/V1) - (1-m1)

d['bayesestimate1'] = (d['dc'] + a1)/(d['pop'] + a1 + b1)

#Proportions from the second half of the data:
d['proportion2'] = d['dc.2'] / d['pop.2']

#Prediction Accuracy:
diff_naiveproportion1 = d['naiveproportion1'] - d['proportion2']
sum_abs_diff_naiveproportion1 = diff_naiveproportion1.abs().sum()
sum_squared_diff_naiveproportion1 = (diff_naiveproportion1**2).sum()
kl_loss_naiveproportion1 = (d['proportion2'] * np.log(d['proportion2'] / (d['naiveproportion1'] + 1e-10) + 1e-10)).sum() + ((1 - d['proportion2']) * np.log((1 - d['proportion2']) / (1 - d['naiveproportion1'] + 1e-10) + 1e-10)).sum()
negloglik_naiveproportion1 = -(d['dc.2'] * np.log(d['naiveproportion1'] + 1e-10)).sum() - ((d['pop.2'] - d['dc.2']) * np.log(1 - d['naiveproportion1'] + 1e-10)).sum()


diff_bayesestimate1 = d['bayesestimate1'] - d['proportion2']
sum_abs_diff_bayesestimate1 = diff_bayesestimate1.abs().sum()
sum_squared_diff_bayesestimate1 = (diff_bayesestimate1**2).sum()
kl_loss_bayesestimate1 = (d['proportion2'] * np.log(d['proportion2'] / (d['bayesestimate1'] + 1e-10) + 1e-10)).sum() + ((1 - d['proportion2']) * np.log((1 - d['proportion2']) / (1 - d['bayesestimate1'] + 1e-10) + 1e-10)).sum()
negloglik_bayesestimate1 = -(d['dc.2'] * np.log(d['bayesestimate1'] + 1e-10)).sum() - ((d['pop.2'] - d['dc.2']) * np.log(1 - d['bayesestimate1'] + 1e-10)).sum() 

bothresults_absdiff = np.array([sum_abs_diff_naiveproportion1, sum_abs_diff_bayesestimate1])
bothresults_squareddiff = np.array([sum_squared_diff_naiveproportion1, sum_squared_diff_bayesestimate1])
bothresults_klloss = np.array([kl_loss_naiveproportion1, kl_loss_bayesestimate1])
bothresults_negloglik = np.array([negloglik_naiveproportion1, negloglik_bayesestimate1])
print(bothresults_absdiff)
print(bothresults_squareddiff)
print(bothresults_klloss)
print(bothresults_negloglik)
[0.12508483 0.1053034 ]
[1.29012723e-05 8.07678746e-06]
[0.50184892 0.06705131]
[290602.92682912 282151.84239388]

There is thus an improvement in prediction accuracy using the Bayes estimates as opposed to the naive estimates, under all four comparison metrics.

Maximum Marginal Likelihood Empirical Bayes

In the above Bayesian analysis, we used a frequentist method for selecting $a$ and $b$ in the Beta(a, b) prior. Specifically, we looked at the large counties (defined as counties with population more than 300000), calculated their naive proportions, and then used the method of moments (matching mean and variance) to determine $a$ and $b$. This approach might be unsatisfactory for several reasons:

  1. The number 300000 was arbitrarily chosen. How would the results change if we instead used 250000 or 350000 or some other cutoff for the definition of a large county?

  2. The method of moments approach for determining $a$ and $b$ might also seem somewhat arbitrary

Here is an alternative method for estimating $a$ and $b$. We treat $a$ and $b$ as unknown parameters, write the likelihood of the data in terms of $a$ and $b$ (by marginalizing over $\theta_1, \dots, \theta_n$), and then maximize this marginal likelihood.

The basic fact is:

\begin{align*} X \sim \text{Bin}(n, \theta) \text{ and } \theta \sim \text{Beta}(a, b) \implies \mathbb{P}\{X = x\} = {n \choose x} \frac{B(x + a, n-x + b)}{B(a, b)} \end{align*}

for every $x = 0, 1, \dots, n$. So the marginal likelihood of the data $X_1, \dots, X_N$ in terms of $a$ and $b$ is given by:

\begin{align*} f_{\text{data} \mid a, b}(X_1, \dots, X_N) = \prod_{i=1}^N {n_i \choose X_i} \frac{B(X_i + a, n_i-X_i + b)}{B(a, b)} \end{align*}

The term ${n_i \choose X_i}$ above can be dropped as it does not involve $a$ and $b$, leading to:

\begin{align*} \text{marginal likelihood for } a, b \propto \prod_{i=1}^N \frac{B(X_i + a, n_i-X_i + b)}{B(a, b)}. \end{align*}

A natural way of estimating $a$ and $b$ is to maximize this marginal likelihood. This is the approach we take below, employing a simple grid search over $a$ and $b$.
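Before setting up the grid search, the beta-binomial marginal pmf above can be checked numerically against `scipy.stats.betabinom` (the particular values of $n$, $a$ and $b$ below are arbitrary test values):

```python
import numpy as np
from scipy.stats import betabinom
from scipy.special import betaln, comb

n, a, b = 50, 2.5, 7.0  # arbitrary test values
x = np.arange(n + 1)

# P{X = x} = C(n, x) * B(x + a, n - x + b) / B(a, b), computed on the log scale
log_pmf = np.log(comb(n, x)) + betaln(x + a, n - x + b) - betaln(a, b)
pmf = np.exp(log_pmf)
print(np.allclose(pmf, betabinom.pmf(x, n, a, b)))
```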

from scipy.special import betaln

We will place the grid for both $a$ and $b$ in the range $[e^{-20}, e^{20}]$, equi-spaced on the logarithmic scale.

print(np.exp(-20), np.exp(20))
2.061153622438558e-09 485165195.4097903

For each pair of values of $a$ and $b$ in the grid, we compute the marginal likelihood and select the pair with the highest likelihood.

x = d['dct'].to_numpy()
n = d['popm'].to_numpy()
nmx = n-x
N = len(x)

gridres = 400
#loga_low = np.log(1) #use this for getting good enough resolution for obtaining the maximizer
#loga_high = np.log(100) #use this for getting good enough resolution for obtaining the maximizer
#loga_low = np.log(12) #for getting a good plot of the likelihood surface, use this
#loga_high = np.log(20) #for getting a good plot of the likelihood surface, use this
loga_low = -20 #try this first
loga_high = 20 #try this first

#logb_low = np.log(50000) #use this for getting good enough resolution for obtaining the maximizer
#logb_high = np.log(300000) #use this for getting good enough resolution for obtaining the maximizer
#logb_low = np.log(120000) #for getting a good plot of the likelihood surface, use this
#logb_high = np.log(200000) #for getting a good plot of the likelihood surface, use this
logb_low = -20 #try this first
logb_high = 20 #try this first
a_grid = np.exp(np.linspace(loga_low, loga_high, gridres))
b_grid = np.exp(np.linspace(logb_low, logb_high, gridres))

logB_ab = betaln(a_grid[:, None], b_grid[None, :]) #this is a gridres by gridres matrix where the (i, j)th entry is log(B(a_grid[i], b_grid[j]))

loglik = np.empty((gridres, gridres), dtype = float)

for ia, a in enumerate(a_grid): 
    s = betaln(x[:, None] + a, nmx[:, None] + b_grid[None, :]).sum(axis = 0)
    loglik[ia, :] = s - N*logB_ab[ia, :]

idx = np.unravel_index(np.argmax(loglik), loglik.shape)
#smallest negative loglikelihood: 
print(-np.max(loglik))
a_mle = a_grid[idx[0]]
b_mle = b_grid[idx[1]]
print(a_mle, b_mle)
503419.78963374044
15.750816899734765 159524.03423062366

Below we plot the likelihood function to see if it has multiple peaks.

lik = np.exp(loglik - np.max(loglik))
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

A, B = np.meshgrid(np.log(a_grid), np.log(b_grid), indexing="ij")

fig = plt.figure(figsize=(14, 7))
ax = fig.add_subplot(111, projection="3d")

surf = ax.plot_surface(
    A, B, lik,
    cmap="viridis",
    linewidth=0,
    antialiased=True
)

ax.set_xlabel("log(a)")
ax.set_ylabel("log(b)")
ax.set_zlabel("Likelihood (relative)")
ax.set_title("Likelihood surface (rescaled)")

# mark MLE
ax.scatter(
    np.log(a_mle),
    np.log(b_mle),
    1.0,
    color="red",
    s=60
)

fig.colorbar(surf, shrink=0.5, aspect=12)

plt.show()
<Figure size 1400x700 with 2 Axes>

The likelihood is essentially zero everywhere except for a single peak. To get a better picture of the likelihood near the peak, we need to increase the grid resolution there. Go back and change the grid to lie in the interval $[12, 20]$ for $a$ and $[120000, 200000]$ for $b$.
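As an aside, instead of manually refining the grid one could maximize the marginal likelihood with a continuous optimizer. Here is a sketch on purely synthetic data (the simulated counts, starting point, and optimizer choice are ours, not part of the lecture): we simulate counts from a known beta-binomial model and recover $a$ and $b$ with `scipy.optimize.minimize` on the $(\log a, \log b)$ scale.

```python
import numpy as np
from scipy.special import betaln
from scipy.optimize import minimize

rng = np.random.default_rng(1)
a_true, b_true = 15.0, 150000.0
n = rng.integers(1000, 1000000, size=1000)   # synthetic county populations
theta = rng.beta(a_true, b_true, size=1000)  # true rates drawn from the prior
x = rng.binomial(n, theta)                   # simulated death counts

def negloglik(log_ab):
    a, b = np.exp(log_ab)
    # negative log marginal likelihood (binomial coefficients dropped)
    return len(x) * betaln(a, b) - betaln(x + a, n - x + b).sum()

res = minimize(negloglik, x0=[np.log(10.0), np.log(1e5)], method="Nelder-Mead")
a_mle, b_mle = np.exp(res.x)
print(a_mle, b_mle)
```

Optimizing on the log scale keeps $a$ and $b$ positive without explicit constraints; on the real data this would play the same role as the refined grid search.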

x = d['dct'].to_numpy()
n = d['popm'].to_numpy()
nmx = n-x
N = len(x)

gridres = 400
#loga_low = np.log(1) #use this for getting good enough resolution for obtaining the maximizer
#loga_high = np.log(100) #use this for getting good enough resolution for obtaining the maximizer
loga_low =  np.log(12) #for getting a good plot of the likelihood surface, use this
loga_high = np.log(20) #for getting a good plot of the likelihood surface, use this
#loga_low = -20 #try this first (it does not give an accurate MLE because of poor resolution in the grid)
#loga_high = 20 #try this first (it does not give an accurate MLE because of poor resolution in the grid)

#logb_low = np.log(50000) #use this for getting good enough resolution for obtaining the maximizer
#logb_high = np.log(300000) #use this for getting good enough resolution for obtaining the maximizer
logb_low = np.log(120000) #for getting a good plot of the likelihood surface, use this
logb_high = np.log(200000) #for getting a good plot of the likelihood surface, use this
#logb_low = -20 #try this first (it does not give an accurate MLE because of poor resolution in the grid)
#logb_high = 20 #try this first (it does not give an accurate MLE because of poor resolution in the grid)
a_grid = np.exp(np.linspace(loga_low, loga_high, gridres))
b_grid = np.exp(np.linspace(logb_low, logb_high, gridres))

logB_ab = betaln(a_grid[:, None], b_grid[None, :]) #this is a gridres by gridres matrix where the (i, j)th entry is log(B(a_grid[i], b_grid[j]))

loglik = np.empty((gridres, gridres), dtype = float)

for ia, a in enumerate(a_grid): 
    s = betaln(x[:, None] + a, nmx[:, None] + b_grid[None, :]).sum(axis = 0)
    loglik[ia, :] = s - N*logB_ab[ia, :]

idx = np.unravel_index(np.argmax(loglik), loglik.shape)
#smallest negative loglikelihood: 
print(-np.max(loglik))
a_mle = a_grid[idx[0]]
b_mle = b_grid[idx[1]]
print(a_mle, b_mle)

lik = np.exp(loglik - np.max(loglik))
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

A, B = np.meshgrid(np.log(a_grid), np.log(b_grid), indexing="ij")

fig = plt.figure(figsize=(14, 7))
ax = fig.add_subplot(111, projection="3d")

surf = ax.plot_surface(
    A, B, lik,
    cmap="viridis",
    linewidth=0,
    antialiased=True
)

ax.set_xlabel("log(a)")
ax.set_ylabel("log(b)")
ax.set_zlabel("Likelihood (relative)")
ax.set_title("Likelihood surface (rescaled)")

# mark MLE
ax.scatter(
    np.log(a_mle),
    np.log(b_mle),
    1.0,
    color="red",
    s=60
)

fig.colorbar(surf, shrink=0.5, aspect=12)

plt.show()
503402.3817747312
15.701594084486949 151487.01656615367
<Figure size 1400x700 with 2 Axes>

With the estimates of $a$ and $b$ obtained by grid maximization as above, we can compute the Bayes estimates for each county.

a = a_mle
b = b_mle
d['bayesestimate_margmle'] = (a + d['dct']) / (a + b + d['popm'])
#We can now flag the top 100 counties using the Bayesian estimate:
threshold_margmle = d['bayesestimate_margmle'].nlargest(100).iloc[-1]
d['bayeshigh_margmle'] = d['bayesestimate_margmle'] >= threshold_margmle
#Let us see how different this list is from the previous Bayes list,
#via a 2 by 2 cross-tabulation of the two flag columns:
ct = pd.crosstab(d['bayeshigh1'], d['bayeshigh_margmle'])
display(ct)
Loading...

The lists of the top 100 counties flagged by the two Bayes methods are identical.
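To see how the cross-tabulation reveals (dis)agreement between two flag lists, here is a toy example with hypothetical flag columns (not the real county flags): the off-diagonal cells of the crosstab count the disagreements.

```python
import pandas as pd

# toy flag columns (hypothetical, for illustration only)
df = pd.DataFrame({'flag_a': [True, True, False, False],
                   'flag_b': [True, True, False, True]})
ct = pd.crosstab(df['flag_a'], df['flag_b'])

# off-diagonal cells count items on which the two lists disagree
disagreements = ct.loc[True, False] + ct.loc[False, True]
print(disagreements)  # → 1 (the last row is flagged by flag_b only)
```

In the kidney cancer crosstab above, both off-diagonal cells are zero, which is why the two top-100 lists coincide.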

#scatter plot of d['bayes1'] vs d['bayesestimate_margmle']
plt.figure(figsize=(12,6))
plt.scatter(d['bayes1'], d['bayesestimate_margmle'], alpha = 0.5)
plt.xlabel('bayes1')
plt.ylabel('bayesestimate_margmle')
#superimpose y = x line
min_val = min(d['bayes1'].min(), d['bayesestimate_margmle'].min())
max_val = max(d['bayes1'].max(), d['bayesestimate_margmle'].max())
plt.plot([min_val, max_val], [min_val, max_val],
         linestyle='--', linewidth=1, color='red')
plt.title('Scatter plot of bayes1 vs bayesestimate_margmle')
plt.grid(True)
plt.show()
<Figure size 1200x600 with 1 Axes>

Below we check the prediction accuracy of all the methods (the naive frequentist proportion and the two Bayes estimates). We fit each method to the data from 1980-84 and then use it to predict the death proportions for the period 1985-89.
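For reference, the method-of-moments step in the cell below inverts the Beta moment formulas. With $m = \frac{a}{a+b}$ and $V = \frac{ab}{(a+b)^2(a+b+1)}$, solving for $a$ and $b$ gives

\begin{align*} a = \frac{m^2(1-m)}{V} - m, \qquad b = \frac{m(1-m)^2}{V} - (1-m), \end{align*}

which is what a1 and b1 compute from the empirical mean and variance of the large-county proportions.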

d['naiveproportion1'] = d['dc'] / d['pop'] #these are the naive estimates (frequentist)

#Bayes Estimate ONE
proportions1_largecounties = d['naiveproportion1'][d['pop'] >= 300000] #restrict to the high-population counties
m1 = np.mean(proportions1_largecounties)
V1 = np.var(proportions1_largecounties)
a1 = ((m1*m1*(1-m1))/V1) - m1
b1 = (((1-m1)*(1-m1)*m1)/V1) - (1-m1)
print(f"Estimated a1: {a1}, Estimated b1: {b1}")

d['bayes1_firsthalf'] = (d['dc'] + a1)/(d['pop'] + a1 + b1)

#Bayes Estimate TWO
loga_low = np.log(1)
loga_high = np.log(100)

logb_low = np.log(50000)
logb_high = np.log(300000)

a_grid = np.exp(np.linspace(loga_low, loga_high, gridres))
b_grid = np.exp(np.linspace(logb_low, logb_high, gridres))

x = d['dc'].to_numpy()
n = d['pop'].to_numpy()
nmx = n-x
N = len(x)

logB_ab = betaln(a_grid[:, None], b_grid[None, :]) #this is a gridres by gridres matrix where the (i, j)th entry is log(B(a_grid[i], b_grid[j]))
loglik = np.empty((gridres, gridres), dtype = float)
for ia, a in enumerate(a_grid): 
    s = betaln(x[:, None] + a, nmx[:, None] + b_grid[None, :]).sum(axis = 0)
    loglik[ia, :] = s - N*logB_ab[ia, :]
idx = np.unravel_index(np.argmax(loglik), loglik.shape)
a_mle_1 = a_grid[idx[0]]
b_mle_1 = b_grid[idx[1]]
print(f"Estimated amle: {a_mle_1}, Estimated bmle: {b_mle_1}")


d['bayesestimate_margmle_firsthalf'] = (d['dc'] + a_mle_1)/(d['pop'] + a_mle_1 + b_mle_1)


#Proportions from the second half of the data:
d['proportion2'] = d['dc.2'] / d['pop.2']



#Prediction Accuracy:
diff_naiveproportion1 = d['naiveproportion1'] - d['proportion2']
sum_abs_diff_naiveproportion1 = diff_naiveproportion1.abs().sum()
sum_squared_diff_naiveproportion1 = (diff_naiveproportion1**2).sum()
kl_loss_naiveproportion1 = (d['proportion2'] * np.log(d['proportion2'] / (d['naiveproportion1'] + 1e-10) + 1e-10)).sum() + ((1 - d['proportion2']) * np.log((1 - d['proportion2']) / (1 - d['naiveproportion1'] + 1e-10) + 1e-10)).sum()
negloglik_naiveproportion1 = -(d['dc.2'] * np.log(d['naiveproportion1'] + 1e-10)).sum() - ((d['pop.2'] - d['dc.2']) * np.log(1 - d['naiveproportion1'] + 1e-10)).sum()


diff_bayes1 = d['bayes1_firsthalf'] - d['proportion2']
sum_abs_diff_bayes1 = diff_bayes1.abs().sum()
sum_squared_diff_bayes1 = (diff_bayes1**2).sum()
kl_loss_bayes1_firsthalf = (d['proportion2'] * np.log(d['proportion2'] / (d['bayes1_firsthalf'] + 1e-10) + 1e-10)).sum() + ((1 - d['proportion2']) * np.log((1 - d['proportion2']) / (1 - d['bayes1_firsthalf'] + 1e-10) + 1e-10)).sum()
negloglik_bayes1_firsthalf = -(d['dc.2'] * np.log(d['bayes1_firsthalf'] + 1e-10)).sum() - ((d['pop.2'] - d['dc.2']) * np.log(1 - d['bayes1_firsthalf'] + 1e-10)).sum() 

diff_bayes2 = d['bayesestimate_margmle_firsthalf'] - d['proportion2']
sum_abs_diff_bayes2 = diff_bayes2.abs().sum()
sum_squared_diff_bayes2 = (diff_bayes2**2).sum()
kl_loss_bayes2_firsthalf = (d['proportion2'] * np.log(d['proportion2'] / (d['bayesestimate_margmle_firsthalf'] + 1e-10) + 1e-10)).sum() + ((1 - d['proportion2']) * np.log((1 - d['proportion2']) / (1 - d['bayesestimate_margmle_firsthalf'] + 1e-10) + 1e-10)).sum()
negloglik_bayes2_firsthalf = -(d['dc.2'] * np.log(d['bayesestimate_margmle_firsthalf'] + 1e-10)).sum() - ((d['pop.2'] - d['dc.2']) * np.log(1 - d['bayesestimate_margmle_firsthalf'] + 1e-10)).sum() 



allresults_absdiff = np.array([sum_abs_diff_naiveproportion1, sum_abs_diff_bayes1, sum_abs_diff_bayes2])
allresults_squareddiff = np.array([sum_squared_diff_naiveproportion1, sum_squared_diff_bayes1, sum_squared_diff_bayes2])
allresults_klloss = np.array([kl_loss_naiveproportion1, kl_loss_bayes1_firsthalf, kl_loss_bayes2_firsthalf])
allresults_negloglik = np.array([negloglik_naiveproportion1, negloglik_bayes1_firsthalf, negloglik_bayes2_firsthalf])
print(allresults_absdiff)
print(allresults_squareddiff)
print(allresults_klloss)
print(allresults_negloglik)
Estimated a1: 9.642510227719821, Estimated b1: 203435.3366143135
Estimated amle: 14.551483398731948, Estimated bmle: 298655.83276364877
[0.12508483 0.1053034  0.10488608]
[1.29012723e-05 8.07678746e-06 8.00891733e-06]
[0.50184892 0.06705131 0.06625212]
[290602.92682912 282151.84239388 282097.23532142]
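The four loss metrics above are computed with the same pattern for each method; they could be factored into a single helper (the function name prediction_losses is mine, not from the lecture):

```python
import numpy as np

def prediction_losses(p_hat, p2, deaths2, pop2, eps=1e-10):
    """Losses of predicted rates p_hat against realized second-period rates p2:
    (sum of |errors|, sum of squared errors, KL loss, negative Binomial log-likelihood)."""
    p_hat = np.asarray(p_hat, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    deaths2 = np.asarray(deaths2, dtype=float)
    pop2 = np.asarray(pop2, dtype=float)
    diff = p_hat - p2
    # eps guards against log(0) for counties with zero deaths, as in the cells above
    kl = (p2 * np.log(p2 / (p_hat + eps) + eps)).sum() \
        + ((1 - p2) * np.log((1 - p2) / (1 - p_hat + eps) + eps)).sum()
    nll = -(deaths2 * np.log(p_hat + eps)).sum() \
        - ((pop2 - deaths2) * np.log(1 - p_hat + eps)).sum()
    return np.abs(diff).sum(), (diff ** 2).sum(), kl, nll

# toy call with made-up numbers
losses = prediction_losses([0.1, 0.2], [0.1, 0.25], [1, 2], [10, 8])
```

In the notebook this would be called as, e.g., `prediction_losses(d['bayes1_firsthalf'], d['proportion2'], d['dc.2'], d['pop.2'])`, once per method.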

The prediction losses of this second Empirical Bayes method, based on the marginal MLE, are almost the same as, and slightly smaller than, those of the previous method-of-moments approach.

Full Hierarchical Bayes

Instead of relying on point estimates such as the marginal MLE or the method of moments, we can use a full probability model and place priors on $a$ and $b$. We will use the following noninformative prior:

\begin{align*} \log a, \log b \overset{\text{i.i.d.}}{\sim} \text{uniform}(-\infty, \infty). \end{align*}

In other words,

\begin{align*} f_{a, b}(a, b) \propto \frac{I\{a > 0, b > 0\}}{a b}. \end{align*}

The posterior density of $(a, b)$ is therefore given by:

\begin{align*} f_{a, b \mid \text{data}}(a, b) \propto \frac{I\{a > 0, b > 0\}}{a b} \times {n \choose x} \frac{B(x + a, n-x + b)}{B(a, b)} \propto \frac{I\{a > 0, b > 0\}}{a b} \frac{B(x + a, n-x + b)}{B(a, b)} \end{align*}

We compute the unnormalized posterior for each point $(a, b)$ in the same grid that we previously used for marginal maximum likelihood. After that, we normalize it by dividing each weight by the sum of the unnormalized posterior weights. This gives a discrete approximation to the posterior of $a$ and $b$, which we can use for computing posterior means, obtaining posterior samples of $(a, b)$, etc.

from scipy.special import logsumexp
#same grids that we used for the marginal MLE estimation:
loga_low = np.log(1)
loga_high = np.log(100)

logb_low = np.log(50000)
logb_high = np.log(300000)

a_grid = np.exp(np.linspace(loga_low, loga_high, gridres))
b_grid = np.exp(np.linspace(logb_low, logb_high, gridres))

x = d['dct'].to_numpy()
n = d['popm'].to_numpy()
nmx = n-x
N = len(x)

log_post_ab = np.empty((len(a_grid), len(b_grid)))
log_b = np.log(b_grid)

for i, a in enumerate(a_grid):
    ll = np.sum(betaln(x[:, None] + a, nmx[:, None] + b_grid[None, :]),axis=0) - N * betaln(a, b_grid)  
    log_post_ab[i, :] = ll -  np.log(a) - log_b #log of the unnormalized posterior

post_ab = np.exp(log_post_ab - logsumexp(log_post_ab))

a_postmean = np.sum(a_grid[:, None] * post_ab)
b_postmean = np.sum(b_grid[None, :] * post_ab)
print(f"Estimated a (post mean): {a_postmean}, Estimated b (post mean): {b_postmean}")
Estimated a (post mean): 15.602962225114446, Estimated b (post mean): 150546.7020690343
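The discrete grid approximation also makes posterior sampling easy: draw grid cells with probabilities given by post_ab and map the flat indices back to the grid. Here is a self-contained sketch with a toy 3-by-2 grid and made-up weights; in practice one would use the a_grid, b_grid, and post_ab computed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy grid and normalized posterior weights (stand-ins for a_grid, b_grid, post_ab)
a_grid = np.array([1.0, 2.0, 4.0])
b_grid = np.array([10.0, 20.0])
post_ab = np.array([[0.05, 0.05],
                    [0.10, 0.40],
                    [0.10, 0.30]])  # entries sum to 1

# draw flat cell indices with the posterior weights, then unravel to (i, j)
flat = rng.choice(post_ab.size, size=1000, p=post_ab.ravel())
ia, ib = np.unravel_index(flat, post_ab.shape)
a_samples, b_samples = a_grid[ia], b_grid[ib]

# sample means should be close to the exact grid posterior means (2.7 and 17.5 here)
print(a_samples.mean(), b_samples.mean())
```

The same two lines of sampling code, applied to the real grid, would give draws of $(a, b)$ from the approximate posterior.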

We shall get back to this full Bayes model in the next lecture.