
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta
from scipy.stats import invgamma
from scipy.special import betaln
import pandas as pd
from scipy.special import logsumexp
from mpl_toolkits.mplot3d import Axes3D

Lecture 11

Baseball Dataset

Consider the following dataset from Chapter 1 in Bradley Efron’s book on Large Scale Inference. It corresponds to baseball batting averages for 18 players from the 1970 baseball season. From the data on the number of hits for each player in their first 45 at bats, the goal is to predict their end of season batting averages. The end of season batting averages are also provided in the dataset, but we will only use them to verify the accuracy of estimators.

baseball = pd.DataFrame({
    "players": [
        "Clemente", "F Robinson", "F Howard", "Johnstone", "Berry",
        "Spencer", "Kessinger", "L Alvarado", "Santo", "Swoboda",
        "Unser", "Williams", "Scott", "Petrocelli", "E Rodriguez",
        "Campaneris", "Munson", "Alvis"
    ],
    "hits": [
        18, 17, 16, 15, 14, 14, 13, 12, 11,
        11, 10, 10, 10, 10, 10, 9, 8, 7
    ],
    "atbats": [45] * 18,
    "EoSaverage": [
        0.346, 0.298, 0.276, 0.222, 0.273, 0.270, 0.263, 0.210,
        0.269, 0.230, 0.264, 0.256, 0.303, 0.264, 0.226,
        0.286, 0.316, 0.200
    ]
})
print(baseball)
        players  hits  atbats  EoSaverage
0      Clemente    18      45       0.346
1    F Robinson    17      45       0.298
2      F Howard    16      45       0.276
3     Johnstone    15      45       0.222
4         Berry    14      45       0.273
5       Spencer    14      45       0.270
6     Kessinger    13      45       0.263
7    L Alvarado    12      45       0.210
8         Santo    11      45       0.269
9       Swoboda    11      45       0.230
10        Unser    10      45       0.264
11     Williams    10      45       0.256
12        Scott    10      45       0.303
13   Petrocelli    10      45       0.264
14  E Rodriguez    10      45       0.226
15   Campaneris     9      45       0.286
16       Munson     8      45       0.316
17        Alvis     7      45       0.200

It is natural here to use the model $X_i \sim Bin(n, p_i)$, where $p_i$ denotes the end of season average (the estimation target) of player $i$, and $X_i$ is the number of hits for player $i$ in their first $n = 45$ at bats. As in the kidney cancer dataset, we can model $p_i \sim Beta(a, b)$ with a further uninformative prior for $a$ and $b$. This will be done in a homework problem.

In this lecture, we shall use the normal distribution to model the likelihood here. We assume

\begin{align*} \frac{X_i}{n} \overset{\text{ind}}{\sim} N\left(p_i, \frac{p_i(1-p_i)}{n} \right) ~\text{ for $i = 1, \dots, N$} \end{align*}

This can be justified by the normal approximation (Central Limit Theorem) applied to $Bin(n, p_i)$. Here $N = 18$ denotes the number of players.
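As a sanity check on the normal approximation, we can compare the moments of $X/n$ under $Bin(n, p)$ with those of the approximating normal by simulation (the value of $p$ below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 45, 0.27  # n at bats; p is an illustrative batting average
draws = rng.binomial(n, p, size=100_000) / n

# Empirical mean and variance of X/n versus the normal approximation N(p, p(1-p)/n)
print(draws.mean(), p)
print(draws.var(), p * (1 - p) / n)
```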

Because the variance $p_i(1-p_i)/n$ in the normal distribution depends on the unknown parameter $p_i$, we use the variance stabilizing transformation as discussed in the last lecture. Specifically, we take $g(x) = 2 \sqrt{n} \arcsin(\sqrt{x})$ and set $Y_i = g(X_i/n)$ and $\theta_i = g(p_i)$. Now the unknown parameters are $\theta_1, \dots, \theta_N$, the data are $Y_1, \dots, Y_N$, and we use the likelihood:

\begin{align*} Y_i \overset{\text{ind}}{\sim} N(\theta_i, 1). \end{align*}
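The claim that the transformed data have variance close to 1, regardless of the value of $p$, can also be checked by simulation (the $p$ values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 45
variances = {}
for p in [0.2, 0.3, 0.4]:  # a few illustrative batting averages
    x = rng.binomial(n, p, size=200_000)
    # Variance stabilizing transformation g(x) = 2 sqrt(n) arcsin(sqrt(x))
    y = 2 * np.sqrt(n) * np.arcsin(np.sqrt(x / n))
    variances[p] = y.var()
    print(p, variances[p])  # close to 1 for every p
```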

The naive estimator of $\theta_i$ is simply $Y_i$. The accuracy of this estimator is computed below.

# Apply the transformation: 2 * sqrt(n) * arcsin(sqrt(p))
n = 45
baseball["norm_data"] = 2 * np.sqrt(n) * np.arcsin(np.sqrt(baseball["hits"] / n))
baseball["true_mean"] = 2 * np.sqrt(n) * np.arcsin(np.sqrt(baseball["EoSaverage"]))
print(baseball)

#Accuracy of the naive estimator: 
print(f"Accuracy of the naive estimator: {np.sum((baseball['norm_data'] - baseball['true_mean']) ** 2)}")
        players  hits  atbats  EoSaverage  norm_data  true_mean
0      Clemente    18      45       0.346   9.186472   8.436950
1    F Robinson    17      45       0.298   8.880653   7.747378
2      F Howard    16      45       0.276   8.571275   7.421080
3     Johnstone    15      45       0.222   8.257527   6.582296
4         Berry    14      45       0.273   7.938497   7.375984
5       Spencer    14      45       0.270   7.938497   7.330733
6     Kessinger    13      45       0.263   7.613147   7.224523
7    L Alvarado    12      45       0.210   7.280268   6.386664
8         Santo    11      45       0.269   6.938425   7.315614
9       Swoboda    11      45       0.230   6.938425   6.710614
10        Unser    10      45       0.264   6.585882   7.239751
11     Williams    10      45       0.256   6.585882   7.117400
12        Scott    10      45       0.303   6.585882   7.820536
13   Petrocelli    10      45       0.264   6.585882   7.239751
14  E Rodriguez    10      45       0.226   6.585882   6.646656
15   Campaneris     9      45       0.286   6.220485   7.570326
16       Munson     8      45       0.316   5.839490   8.009187
17        Alvis     7      45       0.200   5.439288   6.220485
Accuracy of the naive estimator: 17.610557418678166

Normal Mean Estimation with Normal Prior

Consider the Bayesian model $Y_i \sim N(\theta_i, \sigma^2)$ (with $\sigma = 1$) with the prior $\theta_i \sim N(\mu, \tau^2)$. By standard normal-normal conjugacy, the posterior distribution for $\theta_i$ is given by:

\begin{align*} \theta_i \mid Y_i \sim N\left( \frac{\tau^2 Y_i + \sigma^2 \mu}{\sigma^2 + \tau^2}, \frac{\sigma^2 \tau^2}{\sigma^2 + \tau^2} \right). \end{align*}
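With $\gamma^2 = \sigma^2 + \tau^2$, the conjugate posterior mean can be written as $\mu + (1 - \sigma^2/\gamma^2)(Y_i - \mu)$, a shrinkage of $Y_i$ toward $\mu$. A quick Monte Carlo check of this update (the values of $\mu$, $\tau$, and $y_0$ below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, tau, sigma = 5.0, 2.0, 1.0  # illustrative prior and likelihood parameters
gamma2 = sigma**2 + tau**2

# Simulate (theta, Y) pairs from the joint model
theta = rng.normal(mu, tau, size=500_000)
y = rng.normal(theta, sigma)

# Empirical E[theta | Y near y0] versus the conjugate shrinkage formula
y0 = 6.0
sel = np.abs(y - y0) < 0.05
print(theta[sel].mean())
print(mu + (1 - sigma**2 / gamma2) * (y0 - mu))
```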

The naive estimator for $\theta_i$ is simply $Y_i$. The James-Stein estimator is given by:

\begin{align*} \hat{\theta}^{JS}_i = \bar{Y}_N + \left(1 - \frac{N-3}{\sum_{i=1}^N(Y_i - \bar{Y}_N)^2} \right) \left(Y_i - \bar{Y}_N \right) \end{align*}

where $\bar{Y}_N := (Y_1 + \dots + Y_N)/N$ is the mean of the observations $Y_i$.

Below we compute the James-Stein estimator and compare the squared errors of the James-Stein and the naive estimators. We are able to compute these squared errors because the ground truth $\theta_i$ values are known.

# Mean of normalized data
mu = baseball["norm_data"].mean()
# James–Stein estimator
fctr = 1 - (len(baseball["norm_data"]) - 3) / np.sum((baseball["norm_data"] - mu) ** 2)
baseball["js"] = mu + (baseball["norm_data"] - mu) * fctr
# Compare squared errors in transformed space
print(np.sum((baseball["norm_data"] - baseball["true_mean"]) ** 2), np.sum((baseball["js"] - baseball["true_mean"]) ** 2))
17.610557418678166 5.015274172196248

The James-Stein estimator is thus better by more than a factor of 3 compared to the naive estimator. We can convert back to the original scale $p_i$ by applying the inverse transformation of $g$, namely $g^{-1}(y) = \sin^2\left(y/(2\sqrt{n})\right)$, and check prediction accuracy on the original scale.

Convert to original scale.

baseball["orig_js"] = np.sin(baseball["js"] / (2 * np.sqrt(n))) ** 2
print(np.sum((baseball["hits"] / n - baseball["EoSaverage"]) ** 2), np.sum((baseball["orig_js"] - baseball["EoSaverage"]) ** 2))
print(np.sum(np.abs(baseball["hits"] / n - baseball["EoSaverage"])),np.sum(np.abs(baseball["orig_js"] - baseball["EoSaverage"])))

Here also the James-Stein estimator is much better than the naive estimator.
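The improvement is not an accident of this dataset. A quick simulation on a synthetic normal-means problem (the parameter values below are illustrative) shows the James-Stein estimator beating the naive estimator in total squared error on average:

```python
import numpy as np

rng = np.random.default_rng(2)
N, reps = 18, 2000
theta = rng.normal(3.0, 1.5, size=N)  # fixed synthetic means (illustrative values)

naive_err = js_err = 0.0
for _ in range(reps):
    y = theta + rng.normal(size=N)  # Y_i ~ N(theta_i, 1)
    ybar = y.mean()
    # James-Stein shrinkage toward the sample mean, with the N-3 factor
    fctr = 1 - (N - 3) / np.sum((y - ybar) ** 2)
    js = ybar + fctr * (y - ybar)
    naive_err += np.sum((y - theta) ** 2)
    js_err += np.sum((js - theta) ** 2)

# Average total squared errors: naive risk is N = 18; JS risk is smaller
print(naive_err / reps, js_err / reps)
```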

For Bayesian inference, we take $\gamma = \sqrt{\sigma^2 + \tau^2}$ with $\sigma = 1$. The prior on $\mu$ is uniform on $(-\infty, \infty)$ and the prior on $\gamma$ is $(1/\gamma) I\{\gamma > \sigma\}$. The posterior of $\gamma$ is then:

\begin{align*} \propto \gamma^{-N} \exp \left(-\frac{\sum_{i=1}^N (y_i - \bar{y})^2}{2 \gamma^2} \right) I\{\gamma > \sigma\}. \end{align*}

The posterior of $\mu$ given $\gamma$ is:

\begin{align*} \mu \mid \gamma, \text{data} \sim N(\bar{y}, \gamma^2/N). \end{align*}

Details behind these calculations will be given in the next lab.

Below we use the fact that $v = \gamma^2$ has an Inverse-Gamma posterior truncated to $v \geq \sigma^2$. This density is:

\begin{align*} p(v \mid y_1, \dots, y_N) \propto v^{-(\alpha + 1)} \exp\left(-\beta/v \right) I\{v > \sigma^2\} \end{align*}

with $\alpha = (N-1)/2$ and $\beta = \sum_{i=1}^N (y_i - \bar{y})^2/2$.
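The Inverse-Gamma form follows from the change of variables $v = \gamma^2$ (so $d\gamma/dv = \tfrac{1}{2} v^{-1/2}$) applied to the posterior of $\gamma$ above. Writing $S = \sum_{i=1}^N (y_i - \bar{y})^2$:

\begin{align*} p(v \mid y_1, \dots, y_N) \propto v^{-N/2} \exp\left(-\frac{S}{2v}\right) \cdot \frac{1}{2} v^{-1/2} \, I\{v > \sigma^2\} \propto v^{-\frac{N+1}{2}} \exp\left(-\frac{S/2}{v}\right) I\{v > \sigma^2\}, \end{align*}

and matching exponents with $v^{-(\alpha+1)} \exp(-\beta/v)$ gives $\alpha = (N-1)/2$ and $\beta = S/2$.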

The following code generates the posterior samples.

y = baseball["norm_data"].to_numpy()
N = len(y)
M = 10000 #number of posterior samples
sigma = 1 #known number

ybar = np.mean(y)
S = np.sum((y - ybar) ** 2)

alpha = (N - 1) / 2
beta = S / 2    

v0 = sigma ** 2
F0 = invgamma.cdf(v0, a = alpha, scale = beta)
np.random.seed(123)
u = np.random.uniform(F0, 1, size = M)
v = invgamma.ppf(u, a = alpha, scale = beta)

gamma_samples = np.sqrt(v)
mu_samples = np.random.normal(loc = ybar, scale = gamma_samples/np.sqrt(N), size = M)

#plotting the histograms of gamma_samples and mu_samples
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))

axes[0].hist(gamma_samples, bins=30, color='blue', alpha=0.7, label='gamma_samples')
axes[0].set_title('Histogram of gamma_samples')
axes[0].set_xlabel('Value')
axes[0].set_ylabel('Frequency')

axes[1].hist(mu_samples, bins=30, color='red', alpha=0.7, label='mu_samples')
axes[1].set_title('Histogram of mu_samples')
axes[1].set_xlabel('Value')
axes[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()
[Figure: histograms of the posterior samples of $\gamma$ (left, blue) and $\mu$ (right, red)]

For each sampled $\mu$ and $\gamma$, we generate $\theta_1, \dots, \theta_N$ using

\begin{align*} \theta_i \mid \mu, \gamma, \text{data} \sim N \left(\mu + \left(1 - \frac{\sigma^2}{\gamma^2} \right) (y_i - \mu), \sigma^2 \left(1 - \frac{\sigma^2}{\gamma^2} \right)\right). \end{align*}

shrink = 1 - (sigma ** 2)/(gamma_samples ** 2)
theta_mean = mu_samples[:, None] + shrink[:, None] * (y[None, :] - mu_samples[:, None])
theta_var = (sigma ** 2) * shrink
theta_sd = np.sqrt(theta_var)[:, None]
theta_samples = np.random.normal(loc = theta_mean, scale = theta_sd, size = (M, N))

The accuracy of the Bayes estimates is computed below.

theta_post_mean = theta_samples.mean(axis = 0)
print(np.sum((baseball["norm_data"] - baseball["true_mean"]) ** 2), np.sum((baseball["js"] - baseball["true_mean"]) ** 2), np.sum((theta_post_mean - baseball["true_mean"]) ** 2))
17.610557418678166 5.015274172196248 5.174785770311955
baseball["orig_js"] = np.sin(baseball["js"] / (2 * np.sqrt(n))) ** 2
orig_bayes = np.sin(theta_post_mean / (2 * np.sqrt(n))) ** 2
print(np.sum((baseball["hits"] / n - baseball["EoSaverage"]) ** 2), np.sum((baseball["orig_js"] - baseball["EoSaverage"]) ** 2), np.sum((orig_bayes - baseball["EoSaverage"]) ** 2))
print(np.sum(np.abs(baseball["hits"] / n - baseball["EoSaverage"])),np.sum(np.abs(baseball["orig_js"] - baseball["EoSaverage"])), np.sum(np.abs(orig_bayes - baseball["EoSaverage"])))
0.07548795061728396 0.02158970459666336 0.02223911010853747
0.996 0.47556635567349737 0.48859205956799184

The accuracy of the full Bayes posterior estimates is very similar to that of the James-Stein estimator.

Below we print the James-Stein estimates as well as the full Bayes estimates of each $p_i$ (after applying the inverse transformation from $\theta_i$ back to $p_i$).

baseball["average"] = baseball["hits"] / n
baseball["orig_bayes"] = orig_bayes
print(baseball[["players", "hits",  "average", "EoSaverage", "orig_js", "orig_bayes"]])
        players  hits   average  EoSaverage   orig_js  orig_bayes
0      Clemente    18  0.400000       0.346  0.290127    0.298750
1    F Robinson    17  0.377778       0.298  0.285813    0.293906
2      F Howard    16  0.355556       0.276  0.281469    0.287929
3     Johnstone    15  0.333333       0.222  0.277084    0.281900
4         Berry    14  0.311111       0.273  0.272647    0.276093
5       Spencer    14  0.311111       0.270  0.272647    0.275933
6     Kessinger    13  0.288889       0.263  0.268145    0.269990
7    L Alvarado    12  0.266667       0.210  0.263564    0.264650
8         Santo    11  0.244444       0.269  0.258886    0.258069
9       Swoboda    11  0.244444       0.230  0.258886    0.257935
10        Unser    10  0.222222       0.264  0.254090    0.251453
11     Williams    10  0.222222       0.256  0.254090    0.251932
12        Scott    10  0.222222       0.303  0.254090    0.251634
13   Petrocelli    10  0.222222       0.264  0.254090    0.251395
14  E Rodriguez    10  0.222222       0.226  0.254090    0.251542
15   Campaneris     9  0.200000       0.286  0.249151    0.245233
16       Munson     8  0.177778       0.316  0.244035    0.238545
17        Alvis     7  0.155556       0.200  0.238700    0.231536

Consistent with the squared-error comparison above, the James-Stein and full Bayes estimates differ only slightly.