import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta
from scipy.stats import invgamma
from scipy.special import betaln
import pandas as pd
from scipy.special import logsumexp
from mpl_toolkits.mplot3d import Axes3D
Lecture 11
Baseball Dataset
Consider the following dataset from Chapter 1 in Bradley Efron’s book on Large Scale Inference. It corresponds to baseball batting averages for 18 players from the 1970 baseball season. From the data on the number of hits for each player in their first 45 at bats, the goal is to predict their end of season batting averages. The end of season batting averages are also provided in the dataset, but we will only use them to verify the accuracy of estimators.
baseball = pd.DataFrame({
"players": [
"Clemente", "F Robinson", "F Howard", "Johnstone", "Berry",
"Spencer", "Kessinger", "L Alvarado", "Santo", "Swoboda",
"Unser", "Williams", "Scott", "Petrocelli", "E Rodriguez",
"Campaneris", "Munson", "Alvis"
],
"hits": [
18, 17, 16, 15, 14, 14, 13, 12, 11,
11, 10, 10, 10, 10, 10, 9, 8, 7
],
"atbats": [45] * 18,
"EoSaverage": [
0.346, 0.298, 0.276, 0.222, 0.273, 0.270, 0.263, 0.210,
0.269, 0.230, 0.264, 0.256, 0.303, 0.264, 0.226,
0.286, 0.316, 0.200
]
})
print(baseball)
players hits atbats EoSaverage
0 Clemente 18 45 0.346
1 F Robinson 17 45 0.298
2 F Howard 16 45 0.276
3 Johnstone 15 45 0.222
4 Berry 14 45 0.273
5 Spencer 14 45 0.270
6 Kessinger 13 45 0.263
7 L Alvarado 12 45 0.210
8 Santo 11 45 0.269
9 Swoboda 11 45 0.230
10 Unser 10 45 0.264
11 Williams 10 45 0.256
12 Scott 10 45 0.303
13 Petrocelli 10 45 0.264
14 E Rodriguez 10 45 0.226
15 Campaneris 9 45 0.286
16 Munson 8 45 0.316
17 Alvis 7 45 0.200
It is natural here to use the model $X_i \sim \text{Binomial}(n, p_i)$, where $p_i$ denotes the end of season average (estimation target) of player $i$, and $X_i$ is the number of hits for player $i$ in their first $n = 45$ at bats. As in the kidney cancer dataset, we can model $p_i \overset{\text{i.i.d.}}{\sim} \text{Beta}(a, b)$ with a further uninformative prior for $a$ and $b$. This will be done in a homework problem.
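Although the Beta-Binomial analysis is left to a homework problem, the marginal likelihood it requires has a closed form that can be evaluated with the `betaln` function imported above. A minimal sketch (the function name `betabinom_logpmf` is illustrative, not from the lecture):

```python
import numpy as np
from scipy.special import betaln, gammaln

def betabinom_logpmf(x, n, a, b):
    # log P(X = x) when X | p ~ Binomial(n, p) and p ~ Beta(a, b):
    # the marginal pmf is C(n, x) * B(x + a, n - x + b) / B(a, b)
    log_comb = gammaln(n + 1) - gammaln(x + 1) - gammaln(n - x + 1)
    return log_comb + betaln(x + a, n - x + b) - betaln(a, b)
```

With a uniform prior ($a = b = 1$) the marginal pmf reduces to $1/(n+1)$ for every $x$, which gives a quick sanity check.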
In this lecture, we shall use the normal distribution to model the likelihood here. We assume
$$\frac{X_i}{n} \Big| p_i \sim N\left(p_i, \frac{p_i(1-p_i)}{n}\right) \quad \text{independently for } i = 1, \dots, N.$$
This can be justified by the normal approximation (Central Limit Theorem) applied to $X_i \sim \text{Binomial}(n, p_i)$. Here $N$ denotes the number of players.
Because the variance in the normal distribution depends on the unknown parameter $p_i$, we use the variance stabilizing transformation as discussed in the last lecture. Specifically, we take $Y_i = 2\sqrt{n} \arcsin\left(\sqrt{X_i/n}\right)$ and $\theta_i = 2\sqrt{n} \arcsin\left(\sqrt{p_i}\right)$, and set $\sigma = 1$. Now the unknown parameters are $\theta_1, \dots, \theta_N$, the data are $Y_1, \dots, Y_N$, and we use the likelihood:
$$Y_i \mid \theta_i \sim N(\theta_i, \sigma^2) \quad \text{independently for } i = 1, \dots, N.$$
The naive estimator of $\theta_i$ is simply $Y_i$. The accuracy of this estimator is computed below.
# Apply the transformation: 2 * sqrt(n) * arcsin(sqrt(p))
n = 45
baseball["norm_data"] = 2 * np.sqrt(n) * np.arcsin(np.sqrt(baseball["hits"] / n))
baseball["true_mean"] = 2 * np.sqrt(n) * np.arcsin(np.sqrt(baseball["EoSaverage"]))
print(baseball)
#Accuracy of the naive estimator:
print(f"Accuracy of the naive estimator: {np.sum((baseball['norm_data'] - baseball['true_mean']) ** 2)}")
players hits atbats EoSaverage norm_data true_mean
0 Clemente 18 45 0.346 9.186472 8.436950
1 F Robinson 17 45 0.298 8.880653 7.747378
2 F Howard 16 45 0.276 8.571275 7.421080
3 Johnstone 15 45 0.222 8.257527 6.582296
4 Berry 14 45 0.273 7.938497 7.375984
5 Spencer 14 45 0.270 7.938497 7.330733
6 Kessinger 13 45 0.263 7.613147 7.224523
7 L Alvarado 12 45 0.210 7.280268 6.386664
8 Santo 11 45 0.269 6.938425 7.315614
9 Swoboda 11 45 0.230 6.938425 6.710614
10 Unser 10 45 0.264 6.585882 7.239751
11 Williams 10 45 0.256 6.585882 7.117400
12 Scott 10 45 0.303 6.585882 7.820536
13 Petrocelli 10 45 0.264 6.585882 7.239751
14 E Rodriguez 10 45 0.226 6.585882 6.646656
15 Campaneris 9 45 0.286 6.220485 7.570326
16 Munson 8 45 0.316 5.839490 8.009187
17 Alvis 7 45 0.200 5.439288 6.220485
Accuracy of the naive estimator: 17.610557418678166
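As a quick sanity check (not part of the lecture code), we can verify by simulation that the arcsine transformation indeed stabilizes the variance near 1 for a binomial with $n = 45$; the success probability 0.27 below is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 45, 0.27
x = rng.binomial(n, p, size=200_000)            # simulated hit counts
y = 2 * np.sqrt(n) * np.arcsin(np.sqrt(x / n))  # variance stabilizing transform
print(y.var())  # close to 1, approximately independent of p in the interior
```

By the delta method, the transformed variable has variance $\left(\sqrt{n}/\sqrt{p(1-p)}\right)^2 \cdot p(1-p)/n = 1$, which is why $\sigma = 1$ is treated as known.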
Normal Mean Estimation with Normal Prior
Consider the Bayesian model $Y_i \mid \theta_i \sim N(\theta_i, \sigma^2)$ (with $\sigma = 1$) with the prior $\theta_i \overset{\text{i.i.d.}}{\sim} N(\mu, \tau^2)$. The posterior distribution for $\theta_i$ is given by:
$$\theta_i \mid Y_1, \dots, Y_N \sim N\left(\mu + \frac{\tau^2}{\sigma^2 + \tau^2}(Y_i - \mu), \; \frac{\sigma^2 \tau^2}{\sigma^2 + \tau^2}\right).$$
The naive estimator for $\theta_i$ is simply $Y_i$. The James-Stein estimator is given by:
$$\hat{\theta}_i^{\mathrm{JS}} = \bar{Y} + \left(1 - \frac{N - 3}{\sum_{j=1}^{N} (Y_j - \bar{Y})^2}\right)(Y_i - \bar{Y}),$$
where $\bar{Y}$ is the mean of the observations $Y_1, \dots, Y_N$.
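The normal-normal posterior formula above can be checked by simulation: draw $(\theta, Y)$ pairs from the joint model and condition on $Y$ falling near a fixed value. A small sketch with arbitrary illustrative values $\mu = 0$, $\tau = 2$, $\sigma = 1$:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, tau, sigma, y0 = 0.0, 2.0, 1.0, 1.5
theta = rng.normal(mu, tau, size=1_000_000)  # draw theta from the prior
y = rng.normal(theta, sigma)                 # draw Y | theta from the likelihood
near = np.abs(y - y0) < 0.05                 # condition on Y close to y0
emp_mean = theta[near].mean()                # empirical posterior mean
post_mean = mu + tau**2 / (tau**2 + sigma**2) * (y0 - mu)  # closed form (= 1.2)
print(emp_mean, post_mean)  # the two agree up to Monte Carlo error
```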
Below we compute the James-Stein estimator and compare the squared errors for both the James-Stein and the naive estimator. We are able to compute these squared errors because the ground truth values are known.
# Mean of normalized data
mu = baseball["norm_data"].mean()
# James–Stein estimator
fctr = 1 - (len(baseball["norm_data"]) - 3) / np.sum((baseball["norm_data"] - mu) ** 2)
baseball["js"] = mu + (baseball["norm_data"] - mu) * fctr
# Compare squared errors in transformed space
print(np.sum((baseball["norm_data"] - baseball["true_mean"]) ** 2), np.sum((baseball["js"] - baseball["true_mean"]) ** 2))
17.610557418678166 5.015274172196248
It is easy to see that the James-Stein estimator is better by more than a factor of 3 compared to the naive estimator. We can convert back to the original scale (by applying the inverse of the transformation $p \mapsto 2\sqrt{n}\arcsin\sqrt{p}$, namely $\theta \mapsto \sin^2\big(\theta/(2\sqrt{n})\big)$), and check prediction accuracy on the original scale.
Convert to original scale.
baseball["orig_js"] = np.sin(baseball["js"] / (2 * np.sqrt(n))) ** 2
print(np.sum((baseball["hits"] / n - baseball["EoSaverage"]) ** 2), np.sum((baseball["orig_js"] - baseball["EoSaverage"]) ** 2))
print(np.sum(np.abs(baseball["hits"] / n - baseball["EoSaverage"])), np.sum(np.abs(baseball["orig_js"] - baseball["EoSaverage"])))
Here also the James-Stein estimator is much better than the naive estimator.
For Bayesian inference, we take $\theta_i \overset{\text{i.i.d.}}{\sim} N(\mu, \gamma^2 - \sigma^2)$ with $\gamma \geq \sigma$, so that marginally $Y_i \overset{\text{i.i.d.}}{\sim} N(\mu, \gamma^2)$. The prior on $\mu$ is uniform on $\mathbb{R}$ and the prior on $\gamma$ is proportional to $1/\gamma$ on $(\sigma, \infty)$. The posterior of $\gamma^2$ is then:
$$\gamma^2 \mid Y_1, \dots, Y_N \sim \text{Inverse-Gamma}\left(\frac{N-1}{2}, \frac{S}{2}\right) \text{ truncated to } [\sigma^2, \infty), \quad \text{where } S = \sum_{i=1}^{N}(Y_i - \bar{Y})^2.$$
The posterior of $\mu$ given $\gamma$ is:
$$\mu \mid \gamma, Y_1, \dots, Y_N \sim N\left(\bar{Y}, \frac{\gamma^2}{N}\right).$$
Details behind these calculations will be given in the next lab.
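The code below draws from the truncated inverse gamma posterior by inverse-CDF sampling: generate uniforms on $(F_0, 1)$, where $F_0$ is the CDF at the truncation point, and map them through the quantile function. A minimal illustration of the same trick on an exponential truncated at an arbitrary point $t_0 = 2$:

```python
import numpy as np
from scipy.stats import expon

rng = np.random.default_rng(1)
t0 = 2.0
F0 = expon.cdf(t0)                    # CDF at the truncation point
u = rng.uniform(F0, 1, size=200_000)  # uniforms restricted to (F0, 1)
samples = expon.ppf(u)                # quantile function maps them to [t0, inf)
print(samples.min())                  # approximately t0: samples start at the truncation point
print(samples.mean())                 # close to t0 + 1 = 3 by memorylessness of the exponential
```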
The following code generates the posterior samples.
y = baseball["norm_data"].to_numpy()
N = len(y)
M = 10000 #number of posterior samples
sigma = 1 # known noise standard deviation from the variance stabilizing transformation
ybar = np.mean(y)
S = np.sum((y - ybar) ** 2)
alpha = (N - 1) / 2
beta_param = S / 2  # inverse gamma scale parameter (named to avoid shadowing scipy.stats.beta)
v0 = sigma ** 2
F0 = invgamma.cdf(v0, a = alpha, scale = beta_param)  # CDF at the truncation point v0
np.random.seed(123)
u = np.random.uniform(F0, 1, size = M)  # uniforms restricted to (F0, 1)
v = invgamma.ppf(u, a = alpha, scale = beta_param)  # truncated inverse gamma samples of gamma^2
gamma_samples = np.sqrt(v)
mu_samples = np.random.normal(loc = ybar, scale = gamma_samples/np.sqrt(N), size = M)
#plotting the histograms of gamma_samples and mu_samples
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))
axes[0].hist(gamma_samples, bins=30, color='blue', alpha=0.7, label='gamma_samples')
axes[0].set_title('Histogram of gamma_samples')
axes[0].set_xlabel('Value')
axes[0].set_ylabel('Frequency')
axes[1].hist(mu_samples, bins=30, color='red', alpha=0.7, label='mu_samples')
axes[1].set_title('Histogram of mu_samples')
axes[1].set_xlabel('Value')
axes[1].set_ylabel('Frequency')
plt.tight_layout()
plt.show()

For each posterior sample of $\mu$ and $\gamma$, we generate $\theta_1, \dots, \theta_N$ using
$$\theta_i \mid \mu, \gamma, Y_1, \dots, Y_N \sim N\left(\mu + \left(1 - \frac{\sigma^2}{\gamma^2}\right)(Y_i - \mu), \; \sigma^2\left(1 - \frac{\sigma^2}{\gamma^2}\right)\right) \quad \text{independently for } i = 1, \dots, N.$$
shrink = 1 - (sigma ** 2)/(gamma_samples ** 2)  # shrinkage factor 1 - sigma^2/gamma^2 for each posterior sample
theta_mean = mu_samples[:, None] + shrink[:, None] * (y[None, :] - mu_samples[:, None])  # (M, N) array of posterior means
theta_var = (sigma ** 2) * shrink
theta_sd = np.sqrt(theta_var)[:, None]
theta_samples = np.random.normal(loc = theta_mean, scale = theta_sd, size = (M, N))
The accuracy of the Bayes estimates is computed below.
theta_post_mean = theta_samples.mean(axis = 0)
print(np.sum((baseball["norm_data"] - baseball["true_mean"]) ** 2), np.sum((baseball["js"] - baseball["true_mean"]) ** 2), np.sum((theta_post_mean - baseball["true_mean"]) ** 2))
17.610557418678166 5.015274172196248 5.174785770311955
baseball["orig_js"] = np.sin(baseball["js"] / (2 * np.sqrt(n))) ** 2
orig_bayes = np.sin(theta_post_mean / (2 * np.sqrt(n))) ** 2
print(np.sum((baseball["hits"] / n - baseball["EoSaverage"]) ** 2), np.sum((baseball["orig_js"] - baseball["EoSaverage"]) ** 2), np.sum((orig_bayes - baseball["EoSaverage"]) ** 2))
print(np.sum(np.abs(baseball["hits"] / n - baseball["EoSaverage"])), np.sum(np.abs(baseball["orig_js"] - baseball["EoSaverage"])), np.sum(np.abs(orig_bayes - baseball["EoSaverage"])))
0.07548795061728396 0.02158970459666336 0.02223911010853747
0.996 0.47556635567349737 0.48859205956799184
The accuracy of the full Bayes posterior estimates are very similar to that of the James-Stein estimator.
Below we print the James-Stein estimates as well as the full Bayes estimates of each $p_i$ (after applying the inverse transformation from $\theta_i$ back to $p_i$).
baseball["average"] = baseball["hits"] / n
baseball["orig_bayes"] = orig_bayes
print(baseball[["players", "hits", "average", "EoSaverage", "orig_js", "orig_bayes"]])
       players  hits   average  EoSaverage   orig_js  orig_bayes
0 Clemente 18 0.400000 0.346 0.290127 0.298750
1 F Robinson 17 0.377778 0.298 0.285813 0.293906
2 F Howard 16 0.355556 0.276 0.281469 0.287929
3 Johnstone 15 0.333333 0.222 0.277084 0.281900
4 Berry 14 0.311111 0.273 0.272647 0.276093
5 Spencer 14 0.311111 0.270 0.272647 0.275933
6 Kessinger 13 0.288889 0.263 0.268145 0.269990
7 L Alvarado 12 0.266667 0.210 0.263564 0.264650
8 Santo 11 0.244444 0.269 0.258886 0.258069
9 Swoboda 11 0.244444 0.230 0.258886 0.257935
10 Unser 10 0.222222 0.264 0.254090 0.251453
11 Williams 10 0.222222 0.256 0.254090 0.251932
12 Scott 10 0.222222 0.303 0.254090 0.251634
13 Petrocelli 10 0.222222 0.264 0.254090 0.251395
14 E Rodriguez 10 0.222222 0.226 0.254090 0.251542
15 Campaneris 9 0.200000 0.286 0.249151 0.245233
16 Munson 8 0.177778 0.316 0.244035 0.238545
17 Alvis 7 0.155556 0.200 0.238700 0.231536
The James-Stein and full Bayes estimates differ only slightly from each other; both shrink each player's raw average toward the overall mean.