
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

High-Dimensional Linear Regression

Today we shall look at the idea of regularization in the context of regression with many covariates (also known as ‘high-dimensional’ regression). We work with a very simple dataset to illustrate the key ideas. This dataset, which comes from the introductory statistics textbook “The Statistical Sleuth” by Ramsey and Schafer, contains weekly wages in 1987 for a sample of 25437 males between the ages of 18 and 70 who worked full-time, along with their years of education, years of experience, an indicator for whether they worked near a city, and a code for the US region where they worked.

#Load the dataset wagedata.csv
dt = pd.read_csv("wagedata.csv")
print(dt.head(10))
      Region   MetropolitanStatus  Exper  Educ  WeeklyEarnings
0      South  NotMetropolitanArea      8    12          859.71
1    Midwest     MetropolitanArea     30    12          786.73
2       West     MetropolitanArea     31    14         1424.50
3       West     MetropolitanArea     17    16          959.16
4       West     MetropolitanArea      6    12          154.32
5    Midwest     MetropolitanArea     24    17          282.19
6  Northeast     MetropolitanArea     14    12          444.44
7  Northeast     MetropolitanArea     28    12          451.09
8      South     MetropolitanArea     40    12          490.27
9      South     MetropolitanArea      7    16          493.83

We shall fit regressions to this dataset using the logarithm of weekly earnings as the response variable:

$$y = \log \left( \text{weekly earnings} \right)$$

and years of experience as the covariate:

$$x = \text{Exper} = \text{Years of Experience}.$$

The following is the scatter plot of $y$ versus $x$.

import numpy as np
y = np.log(dt['WeeklyEarnings'])
x = dt['Exper']
import matplotlib.pyplot as plt
plt.figure(figsize = (10, 6))
plt.scatter(x, y, s = 1)
plt.xlabel('Years of Experience')
plt.ylabel('Log of Weekly Earnings')
plt.title('Scatter Plot of Years of Experience vs Log(Weekly Earnings)')
plt.show()

A simple quadratic regression model can be fit to this dataset as:

X = sm.add_constant(x)
X['Exper_Square'] = np.square(x)
model_2 = sm.OLS(y, X).fit()

print(model_2.summary())

#Plotting the fitted quadratic on the scatter plot

b0, b1, b2 = model_2.params
x_range = np.linspace(x.min(), x.max(), 100)
fitted_model_2 = b0 + b1*x_range + b2*np.square(x_range)

plt.figure(figsize = (10, 6))
plt.scatter(x, y, s = 1)
plt.plot(x_range,fitted_model_2, color = 'red', label = 'Fitted Quadratic')
plt.xlabel('Years of Experience')
plt.ylabel('Log of Weekly Earnings')
plt.title('Quadratic Regression of Log(Weekly Earnings) on Experience')
plt.legend()
plt.show()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:         WeeklyEarnings   R-squared:                       0.132
Model:                            OLS   Adj. R-squared:                  0.132
Method:                 Least Squares   F-statistic:                     1939.
Date:                Mon, 09 Mar 2026   Prob (F-statistic):               0.00
Time:                        13:52:38   Log-Likelihood:                -22294.
No. Observations:               25437   AIC:                         4.459e+04
Df Residuals:                   25434   BIC:                         4.462e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const            5.6996      0.010    569.429      0.000       5.680       5.719
Exper            0.0601      0.001     58.167      0.000       0.058       0.062
Exper_Square    -0.0011    2.2e-05    -49.400      0.000      -0.001      -0.001
==============================================================================
Omnibus:                      706.284   Durbin-Watson:                   2.013
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1008.724
Skew:                          -0.302   Prob(JB):                    9.09e-220
Kurtosis:                       3.766   Cond. No.                     2.12e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.12e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

High-dimensional linear regression model

In the last lecture, we motivated the following high-dimensional linear regression model for this dataset. We arrived at this model by considering broken-stick regression models and increasing the number of knots.

In this dataset, the variable $x$ takes the values $0, 1, \dots, 63$. Considering all of these values as knots (with the exception of the last value 63, because $(x - 63)_+$ is always zero) leads us to the model equation:

$$y = \beta_0 + \beta_1 x + \beta_2 (x - 1)_+ + \beta_3 (x - 2)_+ + \dots + \beta_{63} (x - 62)_+ \tag{1}$$

This is a linear regression model with 64 coefficients. The interpretation of the regression coefficients is as follows. $\beta_0$ is simply the value of $y = \log(\text{Earnings})$ when $x = 0$ (i.e., for someone just joining the workforce). To obtain the interpretation for $\beta_1$, first plug in $x = 1$ in the model equation, then $x = 0$, and subtract the second equation from the first. This gives

$$\beta_1 = y_1 - y_0 = \log(E_1) - \log(E_0) = \log\left(\frac{E_1}{E_0} \right) \approx \frac{E_1 - E_0}{E_0}.$$

Here $E_0$ and $E_1$ are the earnings for $x = 0$ and $x = 1$ respectively, and in the last equality, we again used $\log(u) \approx u - 1$ for $u \approx 1$. Thus $100\beta_1$ represents the increment (in percent) in salary from year 0 to year 1. For example, $\beta_1 = 0.05$ means that the salary increases by $5\%$ from year 0 to year 1. What is the interpretation for $\beta_2$? It is easy to see that:

$$\log E_2 = \beta_0 + 2 \beta_1 + \beta_2, \qquad \log E_1 = \beta_0 + \beta_1, \qquad \log E_0 = \beta_0.$$

Thus

$$\beta_2 = \log E_2 - 2 \log E_1 + \log E_0 = \log \frac{E_2}{E_1} - \log \frac{E_1}{E_0} \approx \frac{E_2 - E_1}{E_1} - \frac{E_1 - E_0}{E_0}.$$

Thus $100 \beta_2$ represents the change in the percent increment between years 1 and 2 compared to the percent increment between years 0 and 1. For example, $\beta_2 = -0.0003$ means that the percent increment decreases by 0.03 after year 2. If $\beta_1 = 0.05$, we would have a $5\%$ increment after year 1 and a $5 - 0.03 = 4.97\%$ increment after year 2. More generally, the interpretation for $\beta_j$, $j \geq 2$, is as follows: $100\beta_j$ is the change in the percent increment between years $j-1$ and $j$ compared to the percent increment between years $j-2$ and $j-1$. For a concrete example, suppose

$$\beta_0 = 5.74 \quad \beta_1 = 0.05 \quad \beta_2 = -0.0003 \quad \beta_3 = -0.0008 \quad \beta_4 = -0.001 \quad \dots$$

then

  1. weekly earnings for someone just joining the workforce is $\exp(5.74) = \$311.06$,

  2. the increment after year 1 is $5\%$,

  3. the increment after year 2 is $(5 - 0.03) = 4.97\%$,

  4. the increment after year 3 is $(4.97 - 0.08) = 4.89\%$,

  5. the increment after year 4 is $(4.89 - 0.1) = 4.79\%$, and so on.

If all $\beta_j$, $j \geq 2$, are negative, then after a while the increments may become negative, which means that the salary actually starts decreasing after a certain number of years of experience.
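The increment arithmetic above can be checked with a few lines of NumPy. The coefficient values here are the illustrative ones from the example, not estimates from any data:

```python
import numpy as np

# Illustrative coefficients from the example: beta_0, beta_1, beta_2, ...
betas = np.array([5.74, 0.05, -0.0003, -0.0008, -0.001])

# weekly earnings at x = 0 (exp of the intercept): about 311.06
starting_earnings = np.exp(betas[0])
print(round(starting_earnings, 2))

# percent increment after year j is 100*(beta_1 + beta_2 + ... + beta_j),
# since the beta_j for j >= 2 accumulate as changes in the increment:
# 5, 4.97, 4.89, 4.79, ...
increments = 100 * np.cumsum(betas[1:])
print(np.round(increments, 2))
```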

The all-knots regression model (equation (1)) can also be fit by the OLS function in statsmodels. However, it is usually not a good idea to fit a linear regression model with many covariates using plain OLS: the fit will overfit the data, leading to poor interpretability and poor predictive power. We shall demonstrate this below.

#Here is how we can use OLS to fit the regression model with all possible knots 1, 2,.., 62.
X = sm.add_constant(x) #adding intercept
for knot in range(1, 63):
    X[f'knot_{knot}'] = np.maximum(x - knot, 0)

model_all_knots = sm.OLS(y, X).fit()
print(model_all_knots.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:         WeeklyEarnings   R-squared:                       0.139
Model:                            OLS   Adj. R-squared:                  0.137
Method:                 Least Squares   F-statistic:                     67.42
Date:                Mon, 09 Mar 2026   Prob (F-statistic):               0.00
Time:                        13:52:48   Log-Likelihood:                -22188.
No. Observations:               25437   AIC:                         4.450e+04
Df Residuals:                   25375   BIC:                         4.501e+04
Df Model:                          61                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.4711      0.030    185.217      0.000       5.413       5.529
Exper          0.1347      0.040      3.390      0.001       0.057       0.213
knot_1         0.0323      0.066      0.491      0.623      -0.097       0.161
knot_2        -0.0847      0.061     -1.386      0.166      -0.205       0.035
knot_3         0.0185      0.056      0.331      0.741      -0.091       0.128
knot_4        -0.0753      0.053     -1.425      0.154      -0.179       0.028
knot_5         0.0762      0.052      1.465      0.143      -0.026       0.178
knot_6        -0.0510      0.051     -1.005      0.315      -0.151       0.049
knot_7        -0.0773      0.049     -1.565      0.118      -0.174       0.020
knot_8         0.1382      0.049      2.804      0.005       0.042       0.235
knot_9        -0.0891      0.049     -1.832      0.067      -0.184       0.006
knot_10        0.0031      0.048      0.064      0.949      -0.091       0.097
knot_11       -0.0265      0.048     -0.556      0.578      -0.120       0.067
knot_12        0.0711      0.048      1.490      0.136      -0.022       0.165
knot_13       -0.0524      0.048     -1.098      0.272      -0.146       0.041
knot_14       -0.0099      0.049     -0.201      0.841      -0.107       0.087
knot_15        0.0158      0.049      0.319      0.750      -0.081       0.113
knot_16        0.0372      0.050      0.739      0.460      -0.061       0.136
knot_17       -0.0554      0.050     -1.097      0.272      -0.154       0.044
knot_18       -0.0034      0.051     -0.066      0.947      -0.103       0.096
knot_19       -0.0130      0.054     -0.243      0.808      -0.118       0.092
knot_20        0.0394      0.056      0.698      0.485      -0.071       0.150
knot_21       -0.0460      0.059     -0.783      0.434      -0.161       0.069
knot_22        0.0370      0.059      0.623      0.533      -0.079       0.153
knot_23       -0.0154      0.061     -0.250      0.802      -0.136       0.105
knot_24       -0.0035      0.064     -0.055      0.956      -0.128       0.121
knot_25        0.0113      0.065      0.174      0.862      -0.116       0.139
knot_26       -0.0121      0.066     -0.182      0.856      -0.142       0.118
knot_27       -0.0291      0.068     -0.426      0.670      -0.163       0.105
knot_28        0.0647      0.069      0.933      0.351      -0.071       0.201
knot_29       -0.0132      0.070     -0.189      0.850      -0.150       0.124
knot_30       -0.0305      0.071     -0.429      0.668      -0.170       0.109
knot_31       -0.0261      0.073     -0.360      0.719      -0.168       0.116
knot_32       -0.0034      0.074     -0.045      0.964      -0.149       0.142
knot_33        0.0766      0.076      1.002      0.317      -0.073       0.226
knot_34       -0.0943      0.076     -1.235      0.217      -0.244       0.055
knot_35        0.1039      0.078      1.337      0.181      -0.048       0.256
knot_36       -0.1690      0.081     -2.079      0.038      -0.328      -0.010
knot_37        0.1986      0.084      2.362      0.018       0.034       0.363
knot_38       -0.0801      0.085     -0.938      0.348      -0.247       0.087
knot_39       -0.0127      0.087     -0.146      0.884      -0.183       0.158
knot_40        0.0045      0.083      0.054      0.957      -0.157       0.166
knot_41       -0.0718      0.087     -0.824      0.410      -0.243       0.099
knot_42       -0.0043      0.091     -0.047      0.962      -0.183       0.175
knot_43        0.1796      0.096      1.879      0.060      -0.008       0.367
knot_44       -0.1397      0.103     -1.350      0.177      -0.342       0.063
knot_45       -0.0091      0.111     -0.082      0.934      -0.226       0.208
knot_46        0.0326      0.120      0.271      0.786      -0.203       0.268
knot_47       -0.0828      0.131     -0.630      0.529      -0.340       0.175
knot_48        0.0582      0.146      0.399      0.690      -0.228       0.344
knot_49       -0.0267      0.162     -0.165      0.869      -0.344       0.290
knot_50        0.1124      0.194      0.581      0.562      -0.267       0.492
knot_51       -0.1186      0.235     -0.504      0.614      -0.580       0.342
knot_52       -0.0133      0.268     -0.050      0.960      -0.539       0.512
knot_53        0.1658      0.294      0.564      0.573      -0.410       0.742
knot_54       -0.2045      0.335     -0.611      0.541      -0.861       0.452
knot_55        0.1716      0.429      0.400      0.689      -0.669       1.012
knot_56       -0.4076      0.465     -0.877      0.380      -1.318       0.503
knot_57        0.5338      0.627      0.851      0.395      -0.696       1.763
knot_58        0.1181      0.700      0.169      0.866      -1.255       1.491
knot_59       -0.3092      0.314     -0.984      0.325      -0.925       0.306
knot_60       -0.7365      0.952     -0.774      0.439      -2.603       1.130
knot_61        0.9303      0.867      1.072      0.284      -0.770       2.631
knot_62        0.4651      0.434      1.072      0.284      -0.385       1.315
==============================================================================
Omnibus:                      724.158   Durbin-Watson:                   2.015
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1054.343
Skew:                          -0.302   Prob(JB):                    1.13e-229
Kurtosis:                       3.794   Cond. No.                     1.76e+16
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 4.47e-25. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

The main problem with the above OLS output is that the estimated regression coefficients are not interpretable. To see this, note that the fitted coefficients are:

$$\beta_0 = 5.4711 \quad \beta_1 = 0.1347 \quad \beta_2 = 0.0323 \quad \beta_3 = -0.0847 \quad \beta_4 = 0.0185 \quad \dots$$

so that the interpretations become

  1. weekly earnings for someone just joining the workforce is $\exp(5.4711) = \$237.72$,

  2. the increment after year 1 is $13.47\%$,

  3. the increment after year 2 is $(13.47 + 3.23) = 16.7\%$,

  4. the increment after year 3 is $(16.7 - 8.47) = 8.23\%$,

  5. the increment after year 4 is $(8.23 + 1.85) = 10.08\%$, and so on.

Clearly the increments are quite large, and they also oscillate somewhat wildly from year to year. So the fitted model is somewhat nonsensical.

It is also interesting to compare this fitted model to the earlier quadratic model in terms of AIC and BIC. The AIC of this model is actually smaller than that of the quadratic model, but its BIC is larger. Even though the AIC is smaller, no one would want to use this fitted model because of its poor interpretability. One can also plot the fitted function on the scatter plot of the data, as shown below. The fitted function is quite wiggly.

#Plotting the fitted values of this model: 
# Generate x_range for plotting
x_range = np.linspace(x.min(), x.max(), 100)

# Prepare a DataFrame for x_range
X_range = pd.DataFrame({'const': 1, 'x': x_range})
for knot in range(1, 63):
    X_range[f'knot_{knot}'] = np.maximum(X_range['x'] - knot, 0)

fitted_model_all_knots = model_all_knots.predict(X_range)

plt.figure(figsize = (10, 6))
plt.scatter(x, y, s = 1)
plt.plot(x_range,fitted_model_all_knots, color = 'green', label = 'Fitted All Knots Regression')
plt.plot(x_range,fitted_model_2, color = 'red', label = 'Fitted Quadratic')
plt.xlabel('Years of Experience')
plt.ylabel('Log of Weekly Earnings')
plt.title('All knots regression of y on x')
plt.legend()
plt.show()

The reason this ‘all-knots’ regression model gives a poor fit is not that the model itself is inadequate, but rather that OLS is used to estimate its coefficients. The sample size here is quite large ($n = 25437$); the problem would be much worse with a smaller sample size. To illustrate this, consider fitting the model to a smaller dataset obtained by randomly sampling $n = 500$ observations from the full dataset. This is demonstrated below.

#Repeating the analysis on a smaller dataset: 
n = 500
random_seed = 42
dt_small = dt.sample(n = 500, random_state = random_seed)
print(dt_small.shape)
print(dt_small.head())
#dt_small = dt #this is to switch back to the full dataset
(500, 5)
          Region   MetropolitanStatus  Exper  Educ  WeeklyEarnings
8735        West     MetropolitanArea     19    14          735.99
13369  Northeast  NotMetropolitanArea     15    16          807.22
5512     Midwest  NotMetropolitanArea     18    16          341.88
24277      South  NotMetropolitanArea     32    12          240.38
24959      South     MetropolitanArea     14    16         1210.83
#Let us now repeat the exercise fitting the quadratic model as well as the all-knots regression model on this smaller dataset: 
y = np.log(dt_small['WeeklyEarnings'])
x = dt_small['Exper']

#Quadratic model: 
X = sm.add_constant(x)
X['Exper_Square'] = np.square(x)
model_quad = sm.OLS(y, X).fit()

print(model_quad.summary())

#Plotting the fitted quadratic on the scatter plot
b0, b1, b2 = model_quad.params
x_range = np.linspace(x.min(), x.max(), 100)
fitted_model_quad = b0 + b1*x_range + b2*np.square(x_range)

X_all_knots = sm.add_constant(x) #adding intercept
for knot in range(1, 63):
    X_all_knots[f'knot_{knot}'] = np.maximum(x - knot, 0)
model_all_knots = sm.OLS(y, X_all_knots).fit()

print(model_all_knots.summary())

# Prepare a DataFrame for x_range
X_range = pd.DataFrame({'const': 1, 'x': x_range})
for knot in range(1, 63):
    X_range[f'knot_{knot}'] = np.maximum(X_range['x'] - knot, 0)

fitted_model_all_knots = model_all_knots.predict(X_range)

plt.figure(figsize = (10, 6))
plt.scatter(x, y, s = 5)
plt.plot(x_range,fitted_model_all_knots, color = 'green', label = 'Fitted All Knots Regression by OLS')
plt.plot(x_range,fitted_model_quad, color = 'red', label = 'Fitted Quadratic')
plt.xlabel('Years of Experience')
plt.ylabel('Log of Weekly Earnings')
plt.title('All knots regression of y on x')
plt.legend()
plt.show()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:         WeeklyEarnings   R-squared:                       0.183
Model:                            OLS   Adj. R-squared:                  0.180
Method:                 Least Squares   F-statistic:                     55.83
Date:                Mon, 09 Mar 2026   Prob (F-statistic):           1.34e-22
Time:                        14:02:07   Log-Likelihood:                -405.85
No. Observations:                 500   AIC:                             817.7
Df Residuals:                     497   BIC:                             830.3
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const            5.6176      0.069     81.159      0.000       5.482       5.754
Exper            0.0666      0.007      9.404      0.000       0.053       0.080
Exper_Square    -0.0011      0.000     -7.593      0.000      -0.001      -0.001
==============================================================================
Omnibus:                        9.502   Durbin-Watson:                   2.014
Prob(Omnibus):                  0.009   Jarque-Bera (JB):               11.214
Skew:                          -0.226   Prob(JB):                      0.00367
Kurtosis:                       3.577   Cond. No.                     2.09e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.09e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:         WeeklyEarnings   R-squared:                       0.273
Model:                            OLS   Adj. R-squared:                  0.190
Method:                 Least Squares   F-statistic:                     3.298
Date:                Mon, 09 Mar 2026   Prob (F-statistic):           9.95e-12
Time:                        14:02:07   Log-Likelihood:                -376.83
No. Observations:                 500   AIC:                             857.7
Df Residuals:                     448   BIC:                             1077.
Df Model:                          51                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.1547      0.272     18.980      0.000       4.621       5.688
Exper          0.1953      0.351      0.557      0.578      -0.494       0.884
knot_1         0.1042      0.551      0.189      0.850      -0.978       1.186
knot_2        -0.0938      0.453     -0.207      0.836      -0.983       0.796
knot_3        -0.1610      0.384     -0.419      0.675      -0.916       0.594
knot_4        -0.2014      0.326     -0.617      0.538      -0.843       0.440
knot_5         0.6183      0.309      2.002      0.046       0.011       1.225
knot_6        -0.6134      0.296     -2.075      0.039      -1.194      -0.032
knot_7         0.3069      0.315      0.974      0.330      -0.312       0.926
knot_8        -0.1287      0.346     -0.372      0.710      -0.809       0.551
knot_9         0.0483      0.428      0.113      0.910      -0.794       0.890
knot_10       -0.0715      0.366     -0.195      0.845      -0.791       0.648
knot_11       -0.1556      0.334     -0.465      0.642      -0.813       0.502
knot_12        0.1899      0.331      0.574      0.566      -0.461       0.840
knot_13        0.2660      0.328      0.811      0.418      -0.378       0.910
knot_14       -0.5626      0.329     -1.709      0.088      -1.209       0.084
knot_15        0.3070      0.314      0.979      0.328      -0.309       0.923
knot_16        0.2728      0.389      0.701      0.484      -0.492       1.037
knot_17       -0.4239      0.364     -1.166      0.244      -1.139       0.291
knot_18        0.1405      0.329      0.427      0.669      -0.506       0.787
knot_19       -0.4726      0.305     -1.549      0.122      -1.072       0.127
knot_20        0.6368      0.350      1.821      0.069      -0.050       1.324
knot_21       -0.1141      0.373     -0.306      0.760      -0.847       0.619
knot_22        0.1367      0.361      0.379      0.705      -0.572       0.846
knot_23       -0.1656      0.387     -0.428      0.669      -0.926       0.595
knot_24       -0.1908      0.388     -0.492      0.623      -0.953       0.571
knot_25        0.1208      0.432      0.280      0.780      -0.728       0.970
knot_26       -0.2886      0.454     -0.636      0.525      -1.181       0.603
knot_27        0.6466      0.493      1.310      0.191      -0.323       1.616
knot_28       -0.1814      0.498     -0.365      0.716      -1.159       0.797
knot_29       -0.4298      0.476     -0.903      0.367      -1.365       0.506
knot_30        0.4724      0.462      1.021      0.308      -0.437       1.381
knot_31       -0.5066      0.438     -1.157      0.248      -1.367       0.354
knot_32        0.4254      0.462      0.920      0.358      -0.483       1.334
knot_33       -0.1973      0.476     -0.415      0.679      -1.133       0.738
knot_34        0.2300      0.505      0.456      0.649      -0.762       1.222
knot_35       -0.3746      0.530     -0.707      0.480      -1.416       0.667
knot_36        0.5390      0.562      0.959      0.338      -0.566       1.644
knot_37       -0.2597      0.703     -0.369      0.712      -1.642       1.122
knot_38       -0.5455      0.639     -0.854      0.394      -1.801       0.710
knot_39        0.7818      0.643      1.217      0.224      -0.481       2.045
knot_40       -0.0695      0.639     -0.109      0.913      -1.325       1.186
knot_41       -0.3461      0.725     -0.477      0.633      -1.772       1.079
knot_42       -0.4418      0.735     -0.601      0.548      -1.887       1.004
knot_43        0.1942      0.979      0.198      0.843      -1.730       2.119
knot_44        1.1360      1.194      0.951      0.342      -1.211       3.483
knot_45       -0.9202      0.914     -1.006      0.315      -2.717       0.877
knot_46       -0.5969      0.873     -0.684      0.494      -2.313       1.119
knot_47        0.4265      0.859      0.497      0.620      -1.261       2.114
knot_48        1.5575      1.152      1.352      0.177      -0.707       3.822
knot_49       -2.1921      0.941     -2.330      0.020      -4.041      -0.343
knot_50        0.6258      0.504      1.242      0.215      -0.365       1.616
knot_51        0.4172      0.336      1.242      0.215      -0.243       1.078
knot_52        0.2086      0.168      1.242      0.215      -0.122       0.539
knot_53             0          0        nan        nan           0           0
knot_54             0          0        nan        nan           0           0
knot_55             0          0        nan        nan           0           0
knot_56             0          0        nan        nan           0           0
knot_57             0          0        nan        nan           0           0
knot_58             0          0        nan        nan           0           0
knot_59             0          0        nan        nan           0           0
knot_60             0          0        nan        nan           0           0
knot_61             0          0        nan        nan           0           0
knot_62             0          0        nan        nan           0           0
==============================================================================
Omnibus:                       12.537   Durbin-Watson:                   2.023
Prob(Omnibus):                  0.002   Jarque-Bera (JB):               15.793
Skew:                          -0.262   Prob(JB):                     0.000372
Kurtosis:                       3.696   Cond. No.                     1.30e+18
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.49e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Now the coefficients oscillate even more wildly, making the interpretation fully nonsensical. Indeed, now

$$\beta_0 = 5.1547 \quad \beta_1 = 0.1953 \quad \beta_2 = 0.1042 \quad \beta_3 = -0.0938 \quad \beta_4 = -0.1610 \quad \dots$$

so that the interpretations become

  1. weekly earnings for someone just joining the workforce is $\exp(5.1547) = \$173.24$,

  2. the increment after year 1 is $19.53\%$,

  3. the increment after year 2 is $(19.53 + 10.42) = 29.95\%$,

  4. the increment after year 3 is $(29.95 - 9.38) = 20.57\%$,

  5. the increment after year 4 is $(20.57 - 16.10) = 4.47\%$, and so on.

Clearly these numbers are meaningless. Moreover, the AIC and BIC are now both larger than those of the quadratic model. In situations such as this, where OLS gives meaningless results, one uses regularization.

Regularization via Bayesian Inference

We work with the following prior:

$$\beta_0, \beta_1 \overset{\text{i.i.d}}{\sim} \text{unif}(-C, C) \qquad \beta_2, \dots, \beta_m \overset{\text{i.i.d}}{\sim} N(0, \gamma^2 \sigma^2)$$

We treat $\gamma$ and $\sigma$ also as unknown parameters and assign the prior:

$$\log \gamma, \log \sigma \overset{\text{i.i.d}}{\sim} \text{unif}(-C, C).$$

In the last lecture, we discussed the prior $\log \tau, \log \sigma \overset{\text{i.i.d}}{\sim} \text{unif}(-C, C)$.

The only difference between this prior and the prior used in the last lecture is that we are now parametrizing in terms of $(\gamma, \sigma)$ as opposed to $(\tau, \sigma)$ (the connection is $\tau = \sigma \gamma$). This is, however, strictly a different prior: under the assumption $\log \gamma, \log \sigma \overset{\text{i.i.d}}{\sim} \text{unif}(-C, C)$, the parameters $\tau = \gamma \sigma$ and $\sigma$ become dependent (under the prior), whereas the previous model assumes prior independence of $\tau$ and $\sigma$.
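This dependence can be seen directly: $\log \tau = \log \gamma + \log \sigma$, so $\log \tau$ and $\log \sigma$ have correlation $1/\sqrt{2} \approx 0.707$ under the new prior rather than 0. A quick Monte Carlo sketch confirms this ($C = 5$ is an arbitrary choice for the uniform range here):

```python
import numpy as np

# Monte Carlo check of prior dependence between tau = gamma * sigma and sigma.
rng = np.random.default_rng(0)
C = 5.0
n_draws = 200_000
log_gamma = rng.uniform(-C, C, size=n_draws)
log_sigma = rng.uniform(-C, C, size=n_draws)
log_tau = log_gamma + log_sigma  # since tau = gamma * sigma

# cov(log tau, log sigma) = Var(log sigma) and Var(log tau) = 2 Var(log sigma),
# so the correlation is 1/sqrt(2) ~ 0.707, not 0
print(np.corrcoef(log_tau, log_sigma)[0, 1])
```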

The posterior of $\gamma, \sigma, \beta$ can be described as follows.

The posterior of β\beta conditional on σ\sigma and γ\gamma is given by:

βdata,σ,γN((XTXσ2+Q1)1XTyσ2,(XTXσ2+Q1)1).\begin{align*} \beta \mid \text{data}, \sigma, \gamma \sim N \left(\left(\frac{X^T X}{\sigma^2} + Q^{-1} \right)^{-1} \frac{X^T y}{\sigma^2}, \left(\frac{X^T X}{\sigma^2} + Q^{-1} \right)^{-1}\right). \end{align*}

where QQ is the diagonal matrix with diagonal entries C,C,γ2σ2,,γ2σ2C, C, \gamma^2\sigma^2, \dots, \gamma^2\sigma^2. We use the approximation:

Q1=diag(1/C,1/C,1/(γ2σ2),,1/(γ2σ2))diag(0,0,1/(γ2σ2),,1/(γ2σ2))=Jγ2σ2\begin{align*} Q^{-1} = \text{diag}(1/C, 1/C, 1/(\gamma^2\sigma^2), \dots, 1/(\gamma^2\sigma^2)) \approx \text{diag}(0, 0, 1/(\gamma^2\sigma^2), \dots, 1/(\gamma^2\sigma^2)) = \frac{J}{\gamma^2 \sigma^2} \end{align*}

where JJ is the diagonal matrix with diagonals 0,0,1,,10, 0, 1, \dots, 1.

The posterior mean of β\beta given σ,γ\sigma, \gamma is

E(βdata,σ,γ)=(XTXσ2+Jσ2γ2)1XTyσ2=(XTX+Jγ2)1XTy\begin{align*} \mathbb{E}(\beta \mid \text{data}, \sigma, \gamma) = \left(\frac{X^T X}{\sigma^2} + \frac{J}{\sigma^2 \gamma^2} \right)^{-1} \frac{X^T y} {\sigma^2} = \left(X^T X + \frac{J}{\gamma^2} \right)^{-1} X^T y \end{align*}

This only depends on γ\gamma. The above can be viewed as ridge regularization with penalty λ=1/γ2\lambda = 1/\gamma^2.
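This ridge connection can be sanity-checked on synthetic data (all names below are made up for the sketch): as γ\gamma \to \infty the penalty vanishes and the posterior mean reduces to OLS, while as γ0\gamma \to 0 the penalized coefficients are shrunk to zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n_, p_ = 50, 5
X_ = rng.normal(size=(n_, p_))
y_ = X_ @ np.array([1.0, -2.0, 0.5, 0.5, 0.5]) + rng.normal(size=n_)

# J has zeros for the two unpenalized coefficients, ones for the rest
J_ = np.diag([0.0, 0.0] + [1.0] * (p_ - 2))

def postmean_beta(gamma):
    # posterior mean of beta given gamma: ridge with penalty lambda = 1/gamma^2
    return np.linalg.solve(X_.T @ X_ + J_ / gamma**2, X_.T @ y_)

ols = np.linalg.solve(X_.T @ X_, X_.T @ y_)
print(np.allclose(postmean_beta(1e6), ols))          # True: penalty vanishes
print(np.abs(postmean_beta(1e-5)[2:]).max() < 1e-6)  # True: penalized coefficients shrunk to ~0
```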

The XX matrix in this high-dimensional linear regression is the following.

X = X_all_knots.to_numpy()
print(X.shape)
m = 63 #the parameters are b_0, \dots, b_m
(500, 64)

Given a value of γ\gamma, the posterior mean estimate of β\beta is computed below. It is identical to ridge regression with λ=1/γ2\lambda = 1/\gamma^2.

gamma_value = 0.0223 #close to the posterior mean of gamma computed later in this lecture
J_by_gammasq = np.diag(np.concatenate([[0, 0], np.repeat(gamma_value**(-2), m-1)])) 
Mat =  X.T @ X + J_by_gammasq
Matinv = np.linalg.inv(Mat)
betahat_gamma = Matinv @ X.T @ y.to_numpy().reshape(-1, 1)

#calculating the fitted values for a grid of x-values:
x_range = np.linspace(x.min(), x.max(), 100)

X_range = pd.DataFrame({'const': 1, 'x': x_range})
kmax = m-1
for knot in range(1, (1 + kmax)):
    X_range[f'knot_{knot}'] = np.maximum(x_range - knot, 0)

fitted_model_postmean_gamma = X_range.values @ betahat_gamma

plt.figure(figsize = (10, 6))
plt.scatter(x, y, s = 5)
plt.plot(x_range,fitted_model_all_knots, color = 'green', label = 'Fitted All Knots Regression by OLS')
plt.plot(x_range,fitted_model_quad, color = 'red', label = 'Fitted Quadratic')
plt.plot(x_range,fitted_model_postmean_gamma, color = 'black', label = 'Fitted Regularization')
plt.xlabel('Years of Experience')
plt.ylabel('Log of Weekly Earnings')
plt.title('All knots regression of y on x')
plt.legend()
plt.show()

We next describe the posterior of γ\gamma and σ\sigma. Unlike the case of the previous prior, the posterior of σ\sigma conditional on γ\gamma can be described in closed form as:

1σ2data,γGamma(n21,yTyyTX(XTX+γ2J)1XTy2)\frac{1}{\sigma^2} \mid \text{data}, \gamma \sim \text{Gamma} \left(\frac{n}{2} - 1, \frac{y^T y - y^T X (X^T X + \gamma^{-2} J)^{-1} X^T y}{2} \right)

where Gamma(a,b)\text{Gamma}(a, b) denotes the Gamma distribution with shape aa and rate bb.

Finally, the posterior density of γ\gamma is proportional to:

p(γdata)γmdet(XTX+γ2J)1(1yTyyTX(XTX+γ2J)1XTy)(n/2)1\begin{align*} p(\gamma \mid \text{data}) \propto \gamma^{-m} \sqrt{\det (X^T X + \gamma^{-2} J)^{-1}} \left(\frac{1}{y^T y - y^T X (X^T X + \gamma^{-2} J)^{-1} X^T y} \right)^{(n/2) - 1} \end{align*}

We can compute this posterior on a grid of γ\gamma values. Now we only need a 1D grid for γ\gamma (unlike the previous model, where we needed two grids for τ\tau and σ\sigma).

gamma_gr = np.logspace(np.log10(1e-6), np.log10(1e4), 2000)
logpost_gamma = np.zeros(len(gamma_gr))

for i in range(len(gamma_gr)):
    gamma = gamma_gr[i]
    J_by_gammasq = np.diag(np.concatenate([[0, 0], np.repeat(gamma**(-2), m-1)]))
    Mat = X.T @ X + J_by_gammasq
    Matinv = np.linalg.inv(Mat)
    sgn, logcovdet = np.linalg.slogdet(Matinv)
    #log of the unnormalized posterior density of gamma derived above
    logpost_gamma[i] = (-m)*np.log(gamma) + 0.5*logcovdet - (n/2 - 1)*np.log(y.T @ y - y.T @ X @ Matinv @ X.T @ y)

Below we compute the posterior of γ\gamma (correctly normalized) as well as the posterior mean.

post_gamma  = np.exp(logpost_gamma - np.max(logpost_gamma))
post_gamma = post_gamma/np.sum(post_gamma)
postmean_gamma = np.sum(gamma_gr * post_gamma)
print(postmean_gamma)
print(1/(postmean_gamma ** 2)) #this is lambda = 1/gamma^2
0.022375927631204152
1997.2751679918663

Below we plot the posterior of γ\gamma as a function of γ\gamma. The x-axis is on the log-scale.

#Plotting post_gamma as a function of gamma:
plt.figure(figsize = (10, 6))
plt.plot(gamma_gr, post_gamma, color = 'blue')
plt.xscale('log')
plt.xlabel('Gamma (log scale)')
plt.ylabel('Posterior Probability')
plt.title('Posterior Distribution of Gamma')
plt.show()

Clearly, there is a single well-defined peak in this posterior distribution for γ\gamma: the data favor a value of γ\gamma that is neither too big nor too small. We can create samples from this posterior of γ\gamma by simply sampling from the discretized distribution.

N = 2000
gamma_samples = np.random.choice(gamma_gr, size=N, p=post_gamma, replace=True)
plt.hist(gamma_samples, bins=30, color='lightgreen', edgecolor='black')
plt.title('Histogram of gamma samples')
plt.xlabel('gamma')
plt.ylabel('Frequency')
plt.show()

Using the samples of γ\gamma, samples from β\beta and σ\sigma can be constructed as follows.

sig_samples = np.zeros(N)
betahats = np.zeros((m+1, N))
XTX = np.dot(X.T, X)
XTy = np.dot(X.T, y)
for i in range(N):
    gamma = gamma_samples[i]
    J_by_gammasq = np.diag(np.concatenate([[0, 0], np.repeat(gamma**(-2), m-1)]))
    Matinv = np.linalg.inv(XTX + J_by_gammasq)
    #draw 1/sigma^2 from Gamma(n/2 - 1, S/2) in shape-rate form; numpy's
    #gamma generator takes shape and scale, so we pass scale = 1/rate
    gamma_dist_lambda_parameter = (y.T @ y - y.T @ X @ Matinv @ X.T @ y)/2
    gamma_dist_alpha_parameter = n/2 - 1
    sig = np.sqrt(1/np.random.gamma(gamma_dist_alpha_parameter, 1/gamma_dist_lambda_parameter))
    sig_samples[i] = sig
    #draw beta from its conditional normal posterior given sigma and gamma
    norm_cov = np.linalg.inv((J_by_gammasq + XTX)/(sig ** 2))
    norm_mean = np.dot(norm_cov, XTy/(sig ** 2))
    betahats[:, i] = np.random.multivariate_normal(norm_mean, norm_cov)

The overall estimate of β\beta can be obtained by taking the mean of the individual β\beta samples. This is the posterior mean estimate of β\beta in the full Bayesian model.

beta_est = np.mean(betahats, axis = 1)

Below we plot the fitted values corresponding to the posterior mean estimate of β\beta.

#calculating the fitted values for a grid of x-values:
x_range = np.linspace(x.min(), x.max(), 100)

X_range = pd.DataFrame({'const': 1, 'x': x_range})
for knot in range(1, (1 + kmax)):
    X_range[f'knot_{knot}'] = np.maximum(x_range - knot, 0)

fitted_model_postmean_fullmodel = X_range.values @ beta_est

plt.figure(figsize = (10, 6))
plt.scatter(x, y, s = 5)
plt.plot(x_range,fitted_model_all_knots, color = 'green', label = 'Fitted All Knots Regression by OLS')
plt.plot(x_range,fitted_model_quad, color = 'red', label = 'Fitted Quadratic')
plt.plot(x_range,fitted_model_postmean_fullmodel, color = 'black', label = 'Fitted Regularization')
plt.xlabel('Years of Experience')
plt.ylabel('Log of Weekly Earnings')
plt.title('All knots regression of y on x')
plt.legend()
plt.show()

It is interesting that the posterior mean estimate of β\beta corresponds to a smooth estimate of the underlying regression curve.

Below we plot the individual curves coming from posterior samples of β\beta. This gives an idea of the uncertainty associated with the fitted regression function.

#Plotting all the posterior fitted values: 
plt.figure(figsize = (10, 6))
plt.scatter(x, y, s = 5)
for i in range(N):
    fitted_model_postdraw = X_range.values @ betahats[:,i]
    plt.plot(x_range, fitted_model_postdraw, color = 'lightsalmon', alpha = 0.25)
plt.plot(x_range, fitted_model_postmean_fullmodel, color = 'black', label = 'Posterior Mean')
plt.legend()
plt.show()
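An alternative to overplotting all the posterior curves is a pointwise credible band. A sketch with synthetic stand-in draws (in the notebook, `fits` would be `X_range.values @ betahats`, with one column per posterior draw):

```python
import numpy as np

rng = np.random.default_rng(3)
# Stand-in for X_range.values @ betahats: 100 grid points by 2000 posterior draws
fits = rng.normal(loc=6.0, scale=0.1, size=(100, 2000))

# Pointwise 95% posterior credible band and posterior mean curve
lo = np.quantile(fits, 0.025, axis=1)
hi = np.quantile(fits, 0.975, axis=1)
mean_curve = fits.mean(axis=1)

print(lo.shape, hi.shape)                             # (100,) (100,)
print(np.all((lo < mean_curve) & (mean_curve < hi)))  # True
```

The band can then be drawn with `plt.fill_between(x_range, lo, hi, alpha=0.3)`.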

It is interesting that the uncertainty bands get wider near the right edge of the experience range, where the data are sparse. Go back and repeat the analysis on the whole dataset (as opposed to the subsample of size 500 used here).