
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta
from scipy.stats import invgamma
from scipy.special import betaln
import pandas as pd
from scipy.special import logsumexp
from mpl_toolkits.mplot3d import Axes3D

Lecture 11

Baseball Dataset

Consider the following dataset from Chapter 1 in Bradley Efron’s book on Large Scale Inference. It corresponds to baseball batting averages for 18 players from the 1970 baseball season. From the data on the number of hits for each player in their first 45 at bats, the goal is to predict their end of season batting averages. The end of season batting averages are also provided in the dataset, but we will only use them to verify the accuracy of estimators.

baseball = pd.DataFrame({
    "players": [
        "Clemente", "F Robinson", "F Howard", "Johnstone", "Berry",
        "Spencer", "Kessinger", "L Alvarado", "Santo", "Swoboda",
        "Unser", "Williams", "Scott", "Petrocelli", "E Rodriguez",
        "Campaneris", "Munson", "Alvis"
    ],
    "hits": [
        18, 17, 16, 15, 14, 14, 13, 12, 11,
        11, 10, 10, 10, 10, 10, 9, 8, 7
    ],
    "atbats": [45] * 18,
    "EoSaverage": [
        0.346, 0.298, 0.276, 0.222, 0.273, 0.270, 0.263, 0.210,
        0.269, 0.230, 0.264, 0.256, 0.303, 0.264, 0.226,
        0.286, 0.316, 0.200
    ]
})
print(baseball)
        players  hits  atbats  EoSaverage
0      Clemente    18      45       0.346
1    F Robinson    17      45       0.298
2      F Howard    16      45       0.276
3     Johnstone    15      45       0.222
4         Berry    14      45       0.273
5       Spencer    14      45       0.270
6     Kessinger    13      45       0.263
7    L Alvarado    12      45       0.210
8         Santo    11      45       0.269
9       Swoboda    11      45       0.230
10        Unser    10      45       0.264
11     Williams    10      45       0.256
12        Scott    10      45       0.303
13   Petrocelli    10      45       0.264
14  E Rodriguez    10      45       0.226
15   Campaneris     9      45       0.286
16       Munson     8      45       0.316
17        Alvis     7      45       0.200

It is natural here to use the model $X_i \sim Bin(n, p_i)$, where $p_i$ denotes the end of season average (the estimation target) of player $i$, and $X_i$ is the number of hits for player $i$ in their first $n = 45$ at bats. As in the kidney cancer dataset, we can model $p_i \sim Beta(a, b)$ with a further uninformative prior for $a$ and $b$. This will be done in a homework problem.

In this lecture, we shall use the normal distribution to model the likelihood here. We assume

\begin{align*} \frac{X_i}{n} \overset{\text{ind}}{\sim} N\left(p_i, \frac{p_i(1-p_i)}{n} \right) ~\text{ for $i = 1, \dots, N$} \end{align*}

This can be justified by the normal approximation (Central Limit Theorem) applied to $Bin(n, p_i)$. Here $N = 18$ denotes the number of players.
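As a sanity check on the normal approximation, we can compare the moments of $X/n$ under $Bin(n, p)$ with those of the approximating normal by simulation (the value of $p$ below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 45, 0.27  # n at bats; p is an illustrative batting average
draws = rng.binomial(n, p, size=100_000) / n

# Empirical mean and variance of X/n versus the normal approximation N(p, p(1-p)/n)
print(draws.mean(), p)
print(draws.var(), p * (1 - p) / n)
```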

Because the variance $p_i(1-p_i)/n$ in the normal distribution depends on the unknown parameter $p_i$, we use the variance stabilizing transformation as discussed in the last lecture. Specifically, we take $g(x) = 2 \sqrt{n} \arcsin(\sqrt{x})$ and set $Y_i = g(X_i/n)$ and $\theta_i = g(p_i)$. Now the unknown parameters are $\theta_1, \dots, \theta_N$, the data are $Y_1, \dots, Y_N$, and we use the likelihood:

\begin{align*} Y_i \overset{\text{ind}}{\sim} N(\theta_i, 1). \end{align*}
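The claim that the transformed data have variance close to 1, regardless of the value of $p$, can also be checked by simulation (the $p$ values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 45
variances = {}
for p in [0.2, 0.3, 0.4]:  # a few illustrative batting averages
    x = rng.binomial(n, p, size=200_000)
    # Variance stabilizing transformation g(x) = 2 sqrt(n) arcsin(sqrt(x))
    y = 2 * np.sqrt(n) * np.arcsin(np.sqrt(x / n))
    variances[p] = y.var()
    print(p, variances[p])  # close to 1 for every p
```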

The naive estimator of $\theta_i$ is simply $Y_i$. The accuracy of this estimator is computed below.

# Apply the transformation: 2 * sqrt(n) * arcsin(sqrt(p))
n = 45
baseball["norm_data"] = 2 * np.sqrt(n) * np.arcsin(np.sqrt(baseball["hits"] / n))
baseball["true_mean"] = 2 * np.sqrt(n) * np.arcsin(np.sqrt(baseball["EoSaverage"]))
print(baseball)

#Accuracy of the naive estimator: 
print(f"Accuracy of the naive estimator: {np.sum((baseball['norm_data'] - baseball['true_mean']) ** 2)}")
        players  hits  atbats  EoSaverage  norm_data  true_mean
0      Clemente    18      45       0.346   9.186472   8.436950
1    F Robinson    17      45       0.298   8.880653   7.747378
2      F Howard    16      45       0.276   8.571275   7.421080
3     Johnstone    15      45       0.222   8.257527   6.582296
4         Berry    14      45       0.273   7.938497   7.375984
5       Spencer    14      45       0.270   7.938497   7.330733
6     Kessinger    13      45       0.263   7.613147   7.224523
7    L Alvarado    12      45       0.210   7.280268   6.386664
8         Santo    11      45       0.269   6.938425   7.315614
9       Swoboda    11      45       0.230   6.938425   6.710614
10        Unser    10      45       0.264   6.585882   7.239751
11     Williams    10      45       0.256   6.585882   7.117400
12        Scott    10      45       0.303   6.585882   7.820536
13   Petrocelli    10      45       0.264   6.585882   7.239751
14  E Rodriguez    10      45       0.226   6.585882   6.646656
15   Campaneris     9      45       0.286   6.220485   7.570326
16       Munson     8      45       0.316   5.839490   8.009187
17        Alvis     7      45       0.200   5.439288   6.220485
Accuracy of the naive estimator: 17.610557418678166

Normal Mean Estimation with Normal Prior

Consider the Bayesian model $Y_i \sim N(\theta_i, \sigma^2)$ (with $\sigma = 1$) with the prior $\theta_i \sim N(\mu, \tau^2)$. By standard normal-normal conjugacy, the posterior distribution for $\theta_i$ is given by:

\begin{align*} \theta_i \mid Y_i \sim N\left( \frac{\tau^2 Y_i + \sigma^2 \mu}{\sigma^2 + \tau^2}, \frac{\sigma^2 \tau^2}{\sigma^2 + \tau^2} \right). \end{align*}
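With $\gamma^2 = \sigma^2 + \tau^2$, the conjugate posterior mean can be written as $\mu + (1 - \sigma^2/\gamma^2)(Y_i - \mu)$, a shrinkage of $Y_i$ toward $\mu$. A quick Monte Carlo check of this update (the values of $\mu$, $\tau$, and $y_0$ below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, tau, sigma = 5.0, 2.0, 1.0  # illustrative prior and likelihood parameters
gamma2 = sigma**2 + tau**2

# Simulate (theta, Y) pairs from the joint model
theta = rng.normal(mu, tau, size=500_000)
y = rng.normal(theta, sigma)

# Empirical E[theta | Y near y0] versus the conjugate shrinkage formula
y0 = 6.0
sel = np.abs(y - y0) < 0.05
print(theta[sel].mean())
print(mu + (1 - sigma**2 / gamma2) * (y0 - mu))
```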

The naive estimator for $\theta_i$ is simply $Y_i$. The James-Stein estimator is given by:

\begin{align*} \hat{\theta}^{JS}_i = \bar{Y}_N + \left(1 - \frac{N-3}{\sum_{i=1}^N(Y_i - \bar{Y}_N)^2} \right) \left(Y_i - \bar{Y}_N \right) \end{align*}

where $\bar{Y}_N := (Y_1 + \dots + Y_N)/N$ is the mean of the observations $Y_i$.

Below we compute the James-Stein estimator and compare the squared errors of the James-Stein and the naive estimators. We are able to compute these squared errors because the ground truth $\theta_i$ values are known.

# Mean of normalized data
mu = baseball["norm_data"].mean()
# James–Stein estimator
fctr = 1 - (len(baseball["norm_data"]) - 3) / np.sum((baseball["norm_data"] - mu) ** 2)
baseball["js"] = mu + (baseball["norm_data"] - mu) * fctr
# Compare squared errors in transformed space
print(np.sum((baseball["norm_data"] - baseball["true_mean"]) ** 2), np.sum((baseball["js"] - baseball["true_mean"]) ** 2))
17.610557418678166 5.015274172196248

The James-Stein estimator is thus better by more than a factor of 3 compared to the naive estimator. We can convert back to the original scale $p_i$ by applying the inverse transformation of $g$, namely $g^{-1}(y) = \sin^2\left(y/(2\sqrt{n})\right)$, and check prediction accuracy on the original scale.

Convert to original scale.

baseball["orig_js"] = np.sin(baseball["js"] / (2 * np.sqrt(n))) ** 2
print(np.sum((baseball["hits"] / n - baseball["EoSaverage"]) ** 2), np.sum((baseball["orig_js"] - baseball["EoSaverage"]) ** 2))
print(np.sum(np.abs(baseball["hits"] / n - baseball["EoSaverage"])),np.sum(np.abs(baseball["orig_js"] - baseball["EoSaverage"])))

Here also the James-Stein estimator is much better than the naive estimator.
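The improvement is not an accident of this dataset. A quick simulation on a synthetic normal-means problem (the parameter values below are illustrative) shows the James-Stein estimator beating the naive estimator in total squared error on average:

```python
import numpy as np

rng = np.random.default_rng(2)
N, reps = 18, 2000
theta = rng.normal(3.0, 1.5, size=N)  # fixed synthetic means (illustrative values)

naive_err = js_err = 0.0
for _ in range(reps):
    y = theta + rng.normal(size=N)  # Y_i ~ N(theta_i, 1)
    ybar = y.mean()
    # James-Stein shrinkage toward the sample mean, with the N-3 factor
    fctr = 1 - (N - 3) / np.sum((y - ybar) ** 2)
    js = ybar + fctr * (y - ybar)
    naive_err += np.sum((y - theta) ** 2)
    js_err += np.sum((js - theta) ** 2)

# Average total squared errors: naive risk is N = 18; JS risk is smaller
print(naive_err / reps, js_err / reps)
```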

For Bayesian inference, we take $\gamma = \sqrt{\sigma^2 + \tau^2}$ with $\sigma = 1$. The prior on $\mu$ is uniform on $(-\infty, \infty)$ and the prior on $\gamma$ is $(1/\gamma) I\{\gamma > \sigma\}$. The posterior of $\gamma$ is then:

\begin{align*} \propto \gamma^{-N} \exp \left(-\frac{\sum_{i=1}^N (y_i - \bar{y})^2}{2 \gamma^2} \right) I\{\gamma > \sigma\}. \end{align*}

The posterior of $\mu$ given $\gamma$ is:

\begin{align*} \mu \mid \gamma, \text{data} \sim N(\bar{y}, \gamma^2/N). \end{align*}

Details behind these calculations will be given in the next lab.

Below we use the fact that $v = \gamma^2$ has an Inverse-Gamma posterior truncated to $v \geq \sigma^2$. This density is:

\begin{align*} p(v \mid y_1, \dots, y_N) \propto v^{-(\alpha + 1)} \exp\left(-\beta/v \right) I\{v > \sigma^2\} \end{align*}

with $\alpha = (N-1)/2$ and $\beta = \sum_{i=1}^N (y_i - \bar{y})^2/2$.
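The Inverse-Gamma form follows from the change of variables $v = \gamma^2$ (so $d\gamma/dv = \tfrac{1}{2} v^{-1/2}$) applied to the posterior of $\gamma$ above. Writing $S = \sum_{i=1}^N (y_i - \bar{y})^2$:

\begin{align*} p(v \mid y_1, \dots, y_N) \propto v^{-N/2} \exp\left(-\frac{S}{2v}\right) \cdot \frac{1}{2} v^{-1/2} \, I\{v > \sigma^2\} \propto v^{-\frac{N+1}{2}} \exp\left(-\frac{S/2}{v}\right) I\{v > \sigma^2\}, \end{align*}

and matching exponents with $v^{-(\alpha+1)} \exp(-\beta/v)$ gives $\alpha = (N-1)/2$ and $\beta = S/2$.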

The following code generates the posterior samples.

y = baseball["norm_data"].to_numpy()
N = len(y)
M = 10000 #number of posterior samples
sigma = 1 #known number

ybar = np.mean(y)
S = np.sum((y - ybar) ** 2)

alpha = (N - 1) / 2
beta = S / 2    

v0 = sigma ** 2
F0 = invgamma.cdf(v0, a = alpha, scale = beta)
np.random.seed(123)
u = np.random.uniform(F0, 1, size = M)
v = invgamma.ppf(u, a = alpha, scale = beta)

gamma_samples = np.sqrt(v)
mu_samples = np.random.normal(loc = ybar, scale = gamma_samples/np.sqrt(N), size = M)

#plotting the histograms of gamma_samples and mu_samples
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))

axes[0].hist(gamma_samples, bins=30, color='blue', alpha=0.7, label='gamma_samples')
axes[0].set_title('Histogram of gamma_samples')
axes[0].set_xlabel('Value')
axes[0].set_ylabel('Frequency')

axes[1].hist(mu_samples, bins=30, color='red', alpha=0.7, label='mu_samples')
axes[1].set_title('Histogram of mu_samples')
axes[1].set_xlabel('Value')
axes[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()
[Figure: histograms of the posterior samples of $\gamma$ (left, blue) and $\mu$ (right, red)]

For each sampled $\mu$ and $\gamma$, we generate $\theta_1, \dots, \theta_N$ using

\begin{align*} \theta_i \mid \mu, \gamma, \text{data} \sim N \left(\mu + \left(1 - \frac{\sigma^2}{\gamma^2} \right) (y_i - \mu), \sigma^2 \left(1 - \frac{\sigma^2}{\gamma^2} \right)\right). \end{align*}

shrink = 1 - (sigma ** 2)/(gamma_samples ** 2)
theta_mean = mu_samples[:, None] + shrink[:, None] * (y[None, :] - mu_samples[:, None])
theta_var = (sigma ** 2) * shrink
theta_sd = np.sqrt(theta_var)[:, None]
theta_samples = np.random.normal(loc = theta_mean, scale = theta_sd, size = (M, N))

The accuracy of the Bayes estimates is computed below.

theta_post_mean = theta_samples.mean(axis = 0)
print(np.sum((baseball["norm_data"] - baseball["true_mean"]) ** 2), np.sum((baseball["js"] - baseball["true_mean"]) ** 2), np.sum((theta_post_mean - baseball["true_mean"]) ** 2))
17.610557418678166 5.015274172196248 5.174785770311955
baseball["orig_js"] = np.sin(baseball["js"] / (2 * np.sqrt(n))) ** 2
orig_bayes = np.sin(theta_post_mean / (2 * np.sqrt(n))) ** 2
print(np.sum((baseball["hits"] / n - baseball["EoSaverage"]) ** 2), np.sum((baseball["orig_js"] - baseball["EoSaverage"]) ** 2), np.sum((orig_bayes - baseball["EoSaverage"]) ** 2))
print(np.sum(np.abs(baseball["hits"] / n - baseball["EoSaverage"])),np.sum(np.abs(baseball["orig_js"] - baseball["EoSaverage"])), np.sum(np.abs(orig_bayes - baseball["EoSaverage"])))
0.07548795061728396 0.02158970459666336 0.02223911010853747
0.996 0.47556635567349737 0.48859205956799184

The accuracy of the full Bayes posterior estimates is very similar to that of the James-Stein estimator.

Below we print the James-Stein estimates as well as the full Bayes estimates of each $p_i$ (after applying the inverse transformation from $\theta_i$ back to $p_i$).

baseball["average"] = baseball["hits"] / n
baseball["orig_bayes"] = orig_bayes
print(baseball[["players", "hits",  "average", "EoSaverage", "orig_js", "orig_bayes"]])
        players  hits   average  EoSaverage   orig_js  orig_bayes
0      Clemente    18  0.400000       0.346  0.290127    0.298750
1    F Robinson    17  0.377778       0.298  0.285813    0.293906
2      F Howard    16  0.355556       0.276  0.281469    0.287929
3     Johnstone    15  0.333333       0.222  0.277084    0.281900
4         Berry    14  0.311111       0.273  0.272647    0.276093
5       Spencer    14  0.311111       0.270  0.272647    0.275933
6     Kessinger    13  0.288889       0.263  0.268145    0.269990
7    L Alvarado    12  0.266667       0.210  0.263564    0.264650
8         Santo    11  0.244444       0.269  0.258886    0.258069
9       Swoboda    11  0.244444       0.230  0.258886    0.257935
10        Unser    10  0.222222       0.264  0.254090    0.251453
11     Williams    10  0.222222       0.256  0.254090    0.251932
12        Scott    10  0.222222       0.303  0.254090    0.251634
13   Petrocelli    10  0.222222       0.264  0.254090    0.251395
14  E Rodriguez    10  0.222222       0.226  0.254090    0.251542
15   Campaneris     9  0.200000       0.286  0.249151    0.245233
16       Munson     8  0.177778       0.316  0.244035    0.238545
17        Alvis     7  0.155556       0.200  0.238700    0.231536

Consistent with the squared-error comparison above, the James-Stein and full Bayes estimates differ only slightly.