import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta
import pandas as pd

Kidney Cancer Data Analysis¶

This is an example on “Bayesian Disease Mapping”, and is taken from Section 2.7 on estimation of kidney cancer rates of the book “Bayesian Data Analysis Edition 3” by Gelman et al. Some of the code is taken from a blog post by Robin Ryder (google this).

#Read the kidney cancer dataset
d_full = pd.read_csv("KidneyCancerClean.csv", skiprows=4)
display(d_full.head(15))
summary = d_full.describe(include = 'all')
display(summary)

This is county-level data on population and deaths due to kidney cancer. There are quite a few columns in the dataset but we shall only work with the four columns: dc, dc.2, pop and pop.2:

The columns dc and dc.2 represent death counts due to kidney cancer in 1980-84 and 1985-89 respectively.
The columns pop and pop.2 represent some measure of the average population in 1980-84 and 1985-89 respectively.

Let us create a smaller dataframe with only these columns.

d = d_full[['state', 'Location', 'fips', 'dc', 'pop', 'dc.2', 'pop.2']].copy()
display(d.head(20))

The purpose of this data analysis is to estimate the kidney cancer death rates for each county and, more importantly, to flag the counties with high kidney cancer death rates. Before proceeding to the analysis, let us first combine the death counts for the two periods 80-84 and 85-89 to get one overall death count for each county. Further, let us take the average of pop and pop.2 to get a measure of the average population of each county in the period 80-89.

# Combine the death and population counts for the two periods 80-84 and 85-89
d['dct'] = d['dc'] + d['dc.2'] #dct stands for death count total
d['popm'] = (d['pop'] + d['pop.2']) / 2
display(d.head(15))

A simple way to estimate the kidney cancer rate of each county is to use the proportion:

\begin{align*} \text{Naive Estimate of Kidney Cancer Death Rate} = \frac{\text{Number of Kidney Cancer Deaths}}{\text{Population}} \end{align*}

(1)

We can then rank all the counties according to their naive proportions and flag the counties for whom the naive proportion is high.

d['naiveproportion'] = d['dct'] / d['popm']
summary = d['naiveproportion'].describe()
print(summary)
#Histogram of naive proportion estimates:
import matplotlib.pyplot as plt
plt.hist(d['naiveproportion'], bins=300)
plt.xlabel('naiveproportion')
plt.ylabel('Frequency')
plt.title('Histogram of Naive Proportions')
plt.grid(True)
plt.show()
#Some observations on the histogram: 
#There is a spike at zero (corresponding to counties with no kidney cancer deaths)
#Most proportions are around 0.0001 but there are a few counties which have large proportions (more than 0.0004)

count    3110.000000
mean        0.000108
std         0.000068
min         0.000000
25%         0.000067
50%         0.000100
75%         0.000139
max         0.000691
Name: naiveproportion, dtype: float64

#Flagging Counties with High Kidney Cancer Death Proportions. 
#One way to do this is to get the counties with the 100 highest naive proportion:
threshold = d['naiveproportion'].nlargest(100).iloc[-1]
print(threshold)

d['cancerhigh'] = d['naiveproportion'] >= threshold
with pd.option_context('display.max_rows', None):
    display(d[d['cancerhigh'] >= True])

0.00025143527636927463

#It will be cool to plot the high cancer counties (in terms of the naive proportions) on the US map
#This can be done in the following way.
import plotly.express as px
#First convert fips to standard form:
d['modfips'] = d['fips'].apply(lambda x: str(x).zfill(5))
#Plotting the counties with high cancer proportion:
fig = px.choropleth(
    d, 
    geojson="https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json", # This is a dataset containing county geometries
    locations='modfips', # Use the 'fips' column for county identification
    color='cancerhigh',
    hover_name='Location',  # Display the 'Location' column on hover
    scope="usa",
    title="Top 100 kidney cancer death rate counties (in terms of naive proportions)",
    color_continuous_scale="YlOrRd"
)
fig.update_geos(fitbounds="locations")
fig.update_layout(margin={"r":0,"t":40,"l":0,"b":0})
fig.show()

Question: Is this a good way of flagging the high kidney cancer rate counties? Should we now look into specific reasons (such as dietary patterns, pollution etc, lack of access to quality medical care etc.) why these counties have high rates?

It turns out that this analysis is misleading because of the use of naive proportions as estimates of kidney cancer death rates. In particular, the counties that we flagged are mostly counties with low populations. This can be understood as follows. The typical rate of kidney cancer is about 0.0001 as can be seen, for example, from the naive proportion estimates of counties with large populations.

sorted_populations = d['popm'].sort_values()
print(list(sorted_populations))
#Select counties with large populations
#By large population, let us take populations more than 300000
largecounties = d[d['popm'] >= 300000]
largecounties.shape[0]
#So there are 323 counties with population more than 300000
#Let us evaluate the naive proportions for these counties. 
summary = d[d['popm'] >= 300000]['naiveproportion'].describe()
display(summary)
#the mean and median are about 0.0001 which can therefore be viewed as a typical value of the kidney cancer rate

Fetching long content....

count    323.000000
mean       0.000098
std        0.000029
min        0.000034
25%        0.000081
50%        0.000095
75%        0.000115
max        0.000227
Name: naiveproportion, dtype: float64

Now we can explain why our initial flagged set of high cancer rate counties is dominated by counties with low populations. Consider a county with a small population of 1000. In the given ten-year period, suppose this county has 1 kidney cancer death. The naive proportion would then be 1/1000 = 0.001 which is already 10 times the typical rate of 0.0001. So this county would immediately be flagged as a high cancer rate county just because of the single death. Go back to our flagged list of high population counties and check the population sizes.

d[d['naiveproportion'] >= threshold]['popm'].describe()

count      100.000000
mean     16940.950000
std      11922.792393
min       1448.000000
25%       8442.625000
50%      14204.250000
75%      21550.375000
max      60380.500000
Name: popm, dtype: float64

The county with the largest population in this list is 60000 but there are counties with population as small as 1448. So this entire list of high kidney cancer rate counties is dominated by counties with small populations.

We shall study a Bayesian solution to this problem in the next lecture.