
Today we shall look at the idea of regularization in the context of regression with many covariates (also known as ‘high-dimensional’ regression). We work with a very simple dataset to illustrate the key ideas. This dataset, which comes from the introductory statistics textbook “The Statistical Sleuth” by Ramsey and Schafer, contains weekly wages in 1987 for a sample of 25437 males between the ages of 18 and 70 who worked full-time, along with their years of education, years of experience, an indicator for whether they worked near a city, and a code for the US region where they worked.

#Load the dataset wagedata.csv
import pandas as pd
dt = pd.read_csv("wagedata.csv")
print(dt.head(10))
      Region   MetropolitanStatus  Exper  Educ  WeeklyEarnings
0      South  NotMetropolitanArea      8    12          859.71
1    Midwest     MetropolitanArea     30    12          786.73
2       West     MetropolitanArea     31    14         1424.50
3       West     MetropolitanArea     17    16          959.16
4       West     MetropolitanArea      6    12          154.32
5    Midwest     MetropolitanArea     24    17          282.19
6  Northeast     MetropolitanArea     14    12          444.44
7  Northeast     MetropolitanArea     28    12          451.09
8      South     MetropolitanArea     40    12          490.27
9      South     MetropolitanArea      7    16          493.83

Two Simple Regressions (linear and quadratic)

Many regressions can be fit to this dataset. We shall use the logarithm of weekly earnings as the response variable:

\begin{align*} Y = \log \left( \text{weekly earnings} \right). \end{align*}

Taking the logarithm leads to regressions where coefficients are more interpretable. Several regressions can be fit to this dataset with $Y$ as the response variable. We shall only use years of experience as the covariate:

\begin{align*} X = \text{Exper} = \text{Years of Experience}. \end{align*}

Here is the scatter plot of $Y$ against $X$.

import numpy as np
y = np.log(dt['WeeklyEarnings'])
x = dt['Exper']
import matplotlib.pyplot as plt
plt.figure(figsize = (10, 6))
plt.scatter(x, y, s = 1)
plt.xlabel('Years of Experience')
plt.ylabel('Log of Weekly Earnings')
plt.title('Scatter Plot of Years of Experience vs Log(Weekly Earnings)')
plt.show()

The simplest model is linear regression: $Y = \beta_0 + \beta_1 X + \epsilon$. Let us fit this model using statsmodels, and plot the fitted regression line on the observed scatter plot.

#The simplest model to fit is linear regression: y = b0 + b1 x
import statsmodels.api as sm

X = sm.add_constant(x) #adding intercept
model_1 = sm.OLS(y, X).fit()

print(model_1.summary())

#Plotting the fitted line on the scatter plot
plt.figure(figsize = (10, 6))
plt.scatter(x, y, s = 1)
plt.plot(x, model_1.predict(X), color = 'red', label = 'Fitted line')
plt.xlabel('Years of Experience')
plt.ylabel('Log of Weekly Earnings')
plt.title('Linear Regression of Log(Weekly Earnings) on Years of Experience')
plt.legend()
plt.show()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:         WeeklyEarnings   R-squared:                       0.049
Model:                            OLS   Adj. R-squared:                  0.049
Method:                 Least Squares   F-statistic:                     1311.
Date:                Fri, 06 Mar 2026   Prob (F-statistic):          4.67e-280
Time:                        20:49:01   Log-Likelihood:                -23459.
No. Observations:               25437   AIC:                         4.692e+04
Df Residuals:                   25435   BIC:                         4.694e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          6.0704      0.007    875.641      0.000       6.057       6.084
Exper          0.0112      0.000     36.214      0.000       0.011       0.012
==============================================================================
Omnibus:                      587.040   Durbin-Watson:                   2.001
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              706.522
Skew:                          -0.317   Prob(JB):                    3.81e-154
Kurtosis:                       3.515   Cond. No.                         40.8
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The fitted regression equation is $Y = 6.0704 + 0.0112 X$. The interpretation of the coefficient 0.0112 is as follows. It represents the change in $Y$ for a unit increase in $X$. As $Y = \log E$ where $E$ denotes weekly earnings, we have

\begin{align*} \log E_{\text{new}} - \log E_{\text{old}} = 0.0112 \end{align*}

where $E_{\text{new}}$ denotes weekly earnings for $X = x + 1$ and $E_{\text{old}}$ denotes weekly earnings for $X = x$ (the exact value of $x$ is immaterial for this argument). Because $E_{\text{new}}$ and $E_{\text{old}}$ are expected to be close to each other, we can use $\log u \approx u - 1$ for $u \approx 1$ to write

\begin{align*} 0.0112 = \log E_{\text{new}} - \log E_{\text{old}} = \log \frac{E_{\text{new}}}{E_{\text{old}}} \approx \frac{E_{\text{new}}}{E_{\text{old}}} - 1 = \frac{E_{\text{new}} - E_{\text{old}}}{E_{\text{old}}} \end{align*}

Therefore

\begin{align*} \frac{E_{\text{new}} - E_{\text{old}}}{E_{\text{old}}} \times 100 \approx 1.12 \end{align*}

which says that weekly earnings increase by 1.12% for an additional year of experience. A look at the scatter plot makes it clear that earnings seem to decrease when experience is large. This suggests that fitting a single line to this dataset is not the best idea. The common fix is to include a quadratic term $\text{Exper}^2$ in the model. This leads to the following model:

\begin{align*} Y = \beta_0 + \beta_1 X + \beta_2 X^2 \end{align*}

Even though the right-hand side is a quadratic function of $X$, this is still considered a linear model (it is linear in the coefficients) and can be fit by least squares (using the OLS function in statsmodels) as follows:

X = sm.add_constant(x)
X['Exper_Square'] = np.square(x)
model_2 = sm.OLS(y, X).fit()

print(model_2.summary())

#Plotting the fitted quadratic on the scatter plot

b0, b1, b2 = model_2.params
x_range = np.linspace(x.min(), x.max(), 100)
fitted_model_2 = b0 + b1*x_range + b2*np.square(x_range)

plt.figure(figsize = (10, 6))
plt.scatter(x, y, s = 1)
plt.plot(x_range,fitted_model_2, color = 'red', label = 'Fitted Quadratic')
plt.xlabel('Years of Experience')
plt.ylabel('Log of Weekly Earnings')
plt.title('Quadratic Regression of Log(Weekly Earnings) on Experience')
plt.legend()
plt.show()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:         WeeklyEarnings   R-squared:                       0.132
Model:                            OLS   Adj. R-squared:                  0.132
Method:                 Least Squares   F-statistic:                     1939.
Date:                Fri, 06 Mar 2026   Prob (F-statistic):               0.00
Time:                        20:49:03   Log-Likelihood:                -22294.
No. Observations:               25437   AIC:                         4.459e+04
Df Residuals:                   25434   BIC:                         4.462e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const            5.6996      0.010    569.429      0.000       5.680       5.719
Exper            0.0601      0.001     58.167      0.000       0.058       0.062
Exper_Square    -0.0011    2.2e-05    -49.400      0.000      -0.001      -0.001
==============================================================================
Omnibus:                      706.284   Durbin-Watson:                   2.013
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1008.724
Skew:                          -0.302   Prob(JB):                    9.09e-220
Kurtosis:                       3.766   Cond. No.                     2.12e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.12e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Visually the quadratic model gives a better fit than before. Also, the AIC of the quadratic model is much smaller than that of the simpler linear model. However, interpretation of the coefficients is now trickier because of the presence of both $X$ and $X^2$ in the model. The increase in $Y$ when experience changes from $x$ to $x+1$ is now given by:

\begin{align*} \left(\beta_0 + \beta_1 (x+1) + \beta_2 (x+1)^2 \right) - \left(\beta_0 + \beta_1 x + \beta_2 x^2 \right) = \beta_1 + \beta_2 (2x + 1) \end{align*}

Thus the percentage increase in weekly earnings for an additional year of experience, when experience equals $x$, is $100 \left( \beta_1 + \beta_2 (2x + 1)\right)$.

#Coefficient interpretation for the quadratic model:
#Interpretation of the coefficients is trickier because of the square term.
#The percentage increase for an additional year of experience at experience x is:
#100*(b1 + 2*x*b2 + b2). For example, if Exper = 0, this is 100*(b1 + b2), which is approximately 6%.
#Below we evaluate it at Exper = 50:
exper_value = 50
perc_increase = 100*(b1 + 2*exper_value*b2 + b2)
print(perc_increase)
-4.940972151572134
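The percent-change reading of these numbers leans on the approximation $\log u \approx u - 1$. As a quick sanity check (using the rounded coefficients 0.0601 and −0.0011 printed in the summary above, so the numbers are approximate), we can compare the approximate and exact percent changes at a few experience levels:

```python
import numpy as np

# Rounded coefficients from the quadratic fit above (Exper and Exper_Square)
b1, b2 = 0.0601, -0.0011

for x0 in [0, 25, 50]:
    slope = b1 + b2 * (2 * x0 + 1)     # change in log-earnings for one more year
    approx = 100 * slope               # log-approximation to the percent change
    exact = 100 * (np.exp(slope) - 1)  # exact percent change in earnings
    print(f"Exper = {x0:2d}: approx {approx:6.2f}%, exact {exact:6.2f}%")
```

The two agree closely when the slope is small in absolute value; the gap widens somewhat as the slope grows.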

From the equation of the fitted quadratic function, it is easy to calculate the point where the quadratic changes from increasing to decreasing. This is the peak of the quadratic, at $x = -\beta_1/(2\beta_2)$, and represents the experience level beyond which weekly wages go down with further increases in experience.

#The point where the quadratic goes from increasing to decreasing is given by: 
peak_quadratic = -b1/(2*b2)
print(peak_quadratic)
27.72060595169568

Broken Stick Regression

The quadratic term is not the only way of fitting a nonlinear curve to this dataset. Another option is to use broken stick regression. Here one adds an additional term of the form $(x - s)_+ := \max(x-s, 0)$ to the simple equation $Y = \beta_0 + \beta_1 X$ to get the new equation:

\begin{align*} Y = \beta_0 + \beta_1 X + \beta_2 (X - s)_+ \end{align*}

Here $s$ is known as a ‘break-point’ or ‘knot’. The coefficients of this equation are easier to interpret compared to the quadratic equation $Y = \beta_0 + \beta_1 X + \beta_2 X^2$. Specifically, $\beta_1$ is the slope for $x \leq s$ and $\beta_1 + \beta_2$ is the slope for $x \geq s$. In this example, $100\beta_1$ denotes the percent increase in weekly earnings for every additional year of experience when experience is less than $s$, and $100(\beta_1 + \beta_2)$ denotes the percent increase in weekly earnings for every additional year of experience when experience is more than $s$.
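To see concretely why the fit has two slopes, note that the hinge term $(x - s)_+$ is identically zero up to the knot and then grows with slope one, so adding $\beta_2 (x - s)_+$ leaves the slope at $\beta_1$ before the knot and makes it $\beta_1 + \beta_2$ after it. A minimal numerical illustration with the knot $s = 27$ used below:

```python
import numpy as np

s = 27  # the knot
x_demo = np.array([0, 10, 27, 30, 40])
hinge = np.maximum(x_demo - s, 0)  # (x - s)_+ = max(x - s, 0)
print(hinge)  # zero at and before the knot, x - 27 beyond it
```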

The following code illustrates how this model can be fit using the OLS function in statsmodels. We are using $s = 27$ for the code below; this choice is motivated by the fact that the quadratic fit has its peak close to 27.

#Suppose we go back to the linear model, and instead of
#adding the quadratic term, we add the term (x - 27)_+ leading to the model:
#y = b0 + b1 x + b2 (x - 27)_+
#This model keeps the slope b1 for x smaller than 27, but after 27 it modifies the slope to b1 + b2
#We would expect b2 to be negative. 
#This model can also be fit by usual linear regression

X = sm.add_constant(x) #adding intercept
X['pos_part_x_minus_27'] = np.maximum(x - 27, 0)
model_3 = sm.OLS(y, X).fit()

print(model_3.summary())

#Plotting the fitted equation on the scatter plot:

b0, b1, b2 = model_3.params
x_range = np.linspace(x.min(), x.max(), 100)
fitted_model_3 = b0 + b1*x_range + b2*np.maximum(x_range-27,0)

plt.figure(figsize = (10, 6))
plt.scatter(x, y, s = 1)
plt.plot(x_range,fitted_model_3, color = 'green', label = 'Fitted Broken Stick')
plt.plot(x_range, fitted_model_2, color = 'red', label = 'Fitted Quadratic')
plt.xlabel('Years of Experience')
plt.ylabel('Log of Weekly Earnings')
plt.title('Broken Stick Regression of Log(Weekly Earnings) on Experience')
plt.legend()
plt.show()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:         WeeklyEarnings   R-squared:                       0.115
Model:                            OLS   Adj. R-squared:                  0.115
Method:                 Least Squares   F-statistic:                     1653.
Date:                Fri, 06 Mar 2026   Prob (F-statistic):               0.00
Time:                        20:49:07   Log-Likelihood:                -22545.
No. Observations:               25437   AIC:                         4.510e+04
Df Residuals:                   25434   BIC:                         4.512e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                   5.8595      0.008    709.655      0.000       5.843       5.876
Exper                   0.0290      0.001     57.298      0.000       0.028       0.030
pos_part_x_minus_27    -0.0524      0.001    -43.544      0.000      -0.055      -0.050
==============================================================================
Omnibus:                      650.115   Durbin-Watson:                   2.010
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              879.115
Skew:                          -0.300   Prob(JB):                    1.27e-191
Kurtosis:                       3.684   Cond. No.                         51.6
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
#This model has an AIC that is much smaller than that of the first model (simple linear regression).
#But its AIC is still larger than that of the quadratic model (indicating the quadratic
#model is a better fit). However, the coefficients in this model are easier to interpret
#compared to the quadratic model.
#For x <= 27, wages increase by 100*b1 = 2.9% for every additional year of experience
#For x >= 27, wages change by 100*(b1 + b2) = -2.34% (i.e., decrease) for every additional year of experience
print(100*(b1 + b2))
-2.342262114823935

Because the fitted function looks like a broken stick, this is known as Broken Stick Regression. The choice of the knot $s$ (which we took as $s = 27$) is crucial to the quality of the fit to data. A natural data-driven strategy for selecting $s$ is to go over all possible values of $s$, and then choose the knot which gives the smallest AIC (or, equivalently, the highest log-likelihood). This can be done using the code shown below.

#Below we search over all possible knots and then 
#choose the knot with the highest log-likelihood:
best_knot = None
highest_log_likelihood = -np.inf

for knot in range(1, 63):
    X = sm.add_constant(x)
    X['knot_variable'] = np.maximum(x - knot, 0)
    model_knot = sm.OLS(y, X).fit()
    if model_knot.llf > highest_log_likelihood:
        highest_log_likelihood = model_knot.llf
        best_knot = knot

print(f"The knot with highest log-likelihood is: {best_knot}")
print(f"Highest Log-Likelihood: {highest_log_likelihood}")
The knot with highest log-likelihood is: 17
Highest Log-Likelihood: -22398.051362075297

The best knot (i.e., the knot with the highest log-likelihood or smallest AIC) turned out to be 17, which is much smaller than our initial choice $s = 27$. Below we fit the broken stick regression model with the knot chosen to be 17.

#We can now fit the model with the best_knot
X = sm.add_constant(x) #adding intercept
X['knot_variable'] = np.maximum(x - best_knot, 0)
model_4 = sm.OLS(y, X).fit()

print(model_4.summary())

#Plotting the fitted equation on the scatter plot

b0, b1, b2 = model_4.params
x_range = np.linspace(x.min(), x.max(), 100)
fitted_model_4 = b0 + b1*x_range + b2*np.maximum(x_range-best_knot,0)

plt.figure(figsize = (10, 6))
plt.scatter(x, y, s = 1)
plt.plot(x_range,fitted_model_4, color = 'green', label = 'Fitted Broken Stick')
plt.plot(x_range,fitted_model_2, color = 'red', label = 'Fitted Quadratic')
plt.xlabel('Years of Experience')
plt.ylabel('Log of Weekly Earnings')
plt.title('Broken Stick Regression of y on x with best knot')
plt.legend()
plt.show()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:         WeeklyEarnings   R-squared:                       0.125
Model:                            OLS   Adj. R-squared:                  0.125
Method:                 Least Squares   F-statistic:                     1819.
Date:                Fri, 06 Mar 2026   Prob (F-statistic):               0.00
Time:                        20:49:11   Log-Likelihood:                -22398.
No. Observations:               25437   AIC:                         4.480e+04
Df Residuals:                   25434   BIC:                         4.483e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const             5.7143      0.010    567.213      0.000       5.695       5.734
Exper             0.0474      0.001     57.468      0.000       0.046       0.049
knot_variable    -0.0548      0.001    -47.043      0.000      -0.057      -0.053
==============================================================================
Omnibus:                      688.082   Durbin-Watson:                   2.011
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              971.466
Skew:                          -0.300   Prob(JB):                    1.12e-211
Kurtosis:                       3.747   Cond. No.                         67.6
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Regression Models with more knots

The model from the previous section with the best knot still had a larger AIC than the quadratic fit. We can further improve the broken stick regression model by adding more knot variables. For example, adding one more knot variable leads to the equation:

\begin{align*} Y = \beta_0 + \beta_1 X + \beta_2 (X - s_1)_+ + \beta_3 (X - s_2)_+ \end{align*}

The code below demonstrates fitting this model with $s_1 = 17$ and $s_2 = 35$ (the choice $s_2 = 35$ is arbitrary).

#We can further improve the broken stick regression model by adding more knot variables. 
#Suppose we add another knot variable at 35:
new_knot = 35
X = sm.add_constant(x) #adding intercept
X['knot_variable'] = np.maximum(x - best_knot, 0)
X['another_knot'] = np.maximum(x - new_knot, 0)

model_5 = sm.OLS(y, X).fit()

print(model_5.summary())

#Plotting the fitted broken stick on the scatter plot

b0, b1, b2, b3 = model_5.params
x_range = np.linspace(x.min(), x.max(), 100)
fitted_model_5 = b0 + b1*x_range + b2*np.maximum(x_range-best_knot,0) + b3 * np.maximum(x_range - new_knot, 0)

plt.figure(figsize = (10, 6))
plt.scatter(x, y, s = 1)
plt.plot(x_range,fitted_model_5, color = 'green', label = 'Fitted Broken Stick')
plt.plot(x_range,fitted_model_2, color = 'red', label = 'Fitted Quadratic')
plt.xlabel('Years of Experience')
plt.ylabel('Log of Weekly Earnings')
plt.title('Broken Stick Regression of y on x with two knots')
plt.legend()
plt.show()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:         WeeklyEarnings   R-squared:                       0.130
Model:                            OLS   Adj. R-squared:                  0.130
Method:                 Least Squares   F-statistic:                     1267.
Date:                Fri, 06 Mar 2026   Prob (F-statistic):               0.00
Time:                        20:49:12   Log-Likelihood:                -22327.
No. Observations:               25437   AIC:                         4.466e+04
Df Residuals:                   25433   BIC:                         4.470e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const             5.7340      0.010    563.180      0.000       5.714       5.754
Exper             0.0442      0.001     51.187      0.000       0.043       0.046
knot_variable    -0.0443      0.001    -30.437      0.000      -0.047      -0.041
another_knot     -0.0261      0.002    -11.916      0.000      -0.030      -0.022
==============================================================================
Omnibus:                      701.022   Durbin-Watson:                   2.012
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              999.068
Skew:                          -0.301   Prob(JB):                    1.14e-217
Kurtosis:                       3.762   Cond. No.                         68.9
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

With the two additional terms $(x - s_1)_+$ and $(x - s_2)_+$, the percentage increase in wages for an additional year of experience depends on whether the current experience level is less than $s_1$, between $s_1$ and $s_2$, or more than $s_2$:

#Interpretation of the coefficients:
#for x < 17, the slope is b1. So wages increase by 100*b1 = 4.42% for every one year increase in experience
#for x between 17 and 35, the slope is b1 + b2. So wages change by 100*(b1 + b2) percent for every one year increase in experience:
print(100*(b1 + b2))
#for x larger than 35, the slope is b1 + b2 + b3. So wages change by 100*(b1 + b2 + b3) percent:
print(100*(b1 + b2 + b3))
-0.011413982590524618
-2.6233912674073783

One can add more knot variables to the equation to obtain models of the form:

\begin{align*} Y = \beta_0 + \beta_1 X + \beta_2 (X - s_1)_+ + \beta_3 (X - s_2)_+ + \dots + \beta_{k+1} (X - s_k)_+ \end{align*}

While this model is conceptually interesting, one has to deal with the following annoying issues:

  1. It is difficult to select a suitable value of $k$.

  2. For a given choice of $k$, it is quite difficult to select the knots $s_1, \dots, s_k$ in a principled manner. If $k$ is very small, one can go over all possible values for these knots and then select the best set of knots using AIC or log-likelihood. But this process is computationally intractable if $k$ is even moderately large.

One way to circumvent these problems is to introduce a change-of-slope term $(x - s)_+$ at every possible value of $s$.

All knots: high-dimensional linear model

In this dataset, the variable $X$ takes the values $0, 1, \dots, 63$. Considering all these possible knots (with the exception of the last value 63, because $(X - 63)_+$ is always zero) leads us to the model equation:

\begin{align*} Y = \beta_0 + \beta_1 X + \beta_2 (X - 1)_+ + \beta_3 (X - 2)_+ + \dots + \beta_{63} (X - 62)_+ \tag{1} \end{align*}

This is a linear regression model with 64 coefficients. The interpretation of the regression coefficients is as follows. $\beta_0$ is simply the value of $Y = \log(\text{Earnings})$ when $X = 0$ (i.e., for someone just joining the workforce). To obtain the interpretation for $\beta_1$, first plug in $X = 1$ in the model equation, then $X = 0$, and subtract the second equation from the first. This gives

\begin{align*} \beta_1 = Y_1 - Y_0 = \log(E_1) - \log(E_0) = \log\left(\frac{E_1}{E_0} \right) \approx \frac{E_1 - E_0}{E_0}. \end{align*}

Here $E_0$ and $E_1$ are earnings for $X = 0$ and $X = 1$ respectively, and in the last step, we again used $\log(u) \approx u - 1$ for $u \approx 1$. Thus $100\beta_1$ represents the increment (in percent) in salary from year 0 to year 1. For example, $\beta_1 = 0.05$ means that the salary increases by $5\%$ from year 0 to year 1. What is the interpretation of $\beta_2$? It is easy to see that:

\begin{align*} \log E_2 = \beta_0 + 2 \beta_1 + \beta_2, \qquad \log E_1 = \beta_0 + \beta_1, \qquad \log E_0 = \beta_0. \end{align*}

Thus

\begin{align*} \beta_2 = \log E_2 - 2 \log E_1 + \log E_0 = \log \frac{E_2}{E_1} - \log \frac{E_1}{E_0} \approx \frac{E_2 - E_1}{E_1} - \frac{E_1 - E_0}{E_0}. \end{align*}

Thus $100 \beta_2$ represents the change in the percent increment between years 1 and 2 compared to the percent increment between years 0 and 1. For example, $\beta_2 = -0.0003$ means that the percent increment decreases by 0.03 after year 2. If $\beta_1 = 0.05$, we would have a $5\%$ increment after year 1 and a $5 - 0.03 = 4.97\%$ increment after year 2. More generally, the interpretation of $\beta_j$ for $j \geq 2$ is as follows: $100\beta_j$ is the change in the percent increment between years $j-1$ and $j$ compared to the percent increment between years $j-2$ and $j-1$. For a concrete example, suppose

\begin{align*} \beta_0 = 5.74, \quad \beta_1 = 0.05, \quad \beta_2 = -0.0003, \quad \beta_3 = -0.0008, \quad \beta_4 = -0.001, \quad \dots \end{align*}

then

  1. weekly earnings for someone just joining the workforce are $\exp(5.74) = \$311.06$,

  2. the increment after year 1 is $5\%$,

  3. the increment after year 2 is $(5 - 0.03) = 4.97\%$,

  4. the increment after year 3 is $(4.97 - 0.08) = 4.89\%$,

  5. the increment after year 4 is $(4.89 - 0.1) = 4.79\%$, and so on.

If all $\beta_j$, $j \geq 2$, are negative, then, after a while, the increments may become negative, which means that the salary actually starts decreasing after a certain number of years of experience.
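The bookkeeping in the numbered example above can be reproduced in a few lines. Using the hypothetical coefficients from the display (made-up values for illustration, not fitted ones), each coefficient $\beta_j$ with $j \geq 2$ shifts the previous year's percent increment by $100\beta_j$:

```python
import numpy as np

# Hypothetical coefficients from the example above (not fitted values)
b0, b1 = 5.74, 0.05
later_betas = [-0.0003, -0.0008, -0.001]  # beta_2, beta_3, beta_4

print(f"Weekly earnings at X = 0: ${np.exp(b0):.2f}")  # exp(5.74), about $311

increment = 100 * b1  # percent increment after year 1
print(f"Increment after year 1: {increment:.2f}%")
for j, b in enumerate(later_betas, start=2):
    increment += 100 * b  # each beta_j shifts the previous increment
    print(f"Increment after year {j}: {increment:.2f}%")
```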

The all-knots regression model (equation (1)) can also be fit by the OLS function in statsmodels. However, it is usually not a good idea to fit a linear regression model with many covariates using plain OLS: the model will overfit, leading to poor interpretability and poor predictive power. We shall demonstrate this below.

#Here is how we can use OLS to fit the regression model with all possible knots 1, 2,.., 62.
X = sm.add_constant(x) #adding intercept
for knot in range(1, 63):
    X[f'knot_{knot}'] = np.maximum(x - knot, 0)

model_all_knots = sm.OLS(y, X).fit()
print(model_all_knots.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:         WeeklyEarnings   R-squared:                       0.139
Model:                            OLS   Adj. R-squared:                  0.137
Method:                 Least Squares   F-statistic:                     67.42
Date:                Fri, 06 Mar 2026   Prob (F-statistic):               0.00
Time:                        20:49:16   Log-Likelihood:                -22188.
No. Observations:               25437   AIC:                         4.450e+04
Df Residuals:                   25375   BIC:                         4.501e+04
Df Model:                          61                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.4711      0.030    185.217      0.000       5.413       5.529
Exper          0.1347      0.040      3.390      0.001       0.057       0.213
knot_1         0.0323      0.066      0.491      0.623      -0.097       0.161
knot_2        -0.0847      0.061     -1.386      0.166      -0.205       0.035
knot_3         0.0185      0.056      0.331      0.741      -0.091       0.128
knot_4        -0.0753      0.053     -1.425      0.154      -0.179       0.028
knot_5         0.0762      0.052      1.465      0.143      -0.026       0.178
knot_6        -0.0510      0.051     -1.005      0.315      -0.151       0.049
knot_7        -0.0773      0.049     -1.565      0.118      -0.174       0.020
knot_8         0.1382      0.049      2.804      0.005       0.042       0.235
knot_9        -0.0891      0.049     -1.832      0.067      -0.184       0.006
knot_10        0.0031      0.048      0.064      0.949      -0.091       0.097
knot_11       -0.0265      0.048     -0.556      0.578      -0.120       0.067
knot_12        0.0711      0.048      1.490      0.136      -0.022       0.165
knot_13       -0.0524      0.048     -1.098      0.272      -0.146       0.041
knot_14       -0.0099      0.049     -0.201      0.841      -0.107       0.087
knot_15        0.0158      0.049      0.319      0.750      -0.081       0.113
knot_16        0.0372      0.050      0.739      0.460      -0.061       0.136
knot_17       -0.0554      0.050     -1.097      0.272      -0.154       0.044
knot_18       -0.0034      0.051     -0.066      0.947      -0.103       0.096
knot_19       -0.0130      0.054     -0.243      0.808      -0.118       0.092
knot_20        0.0394      0.056      0.698      0.485      -0.071       0.150
knot_21       -0.0460      0.059     -0.783      0.434      -0.161       0.069
knot_22        0.0370      0.059      0.623      0.533      -0.079       0.153
knot_23       -0.0154      0.061     -0.250      0.802      -0.136       0.105
knot_24       -0.0035      0.064     -0.055      0.956      -0.128       0.121
knot_25        0.0113      0.065      0.174      0.862      -0.116       0.139
knot_26       -0.0121      0.066     -0.182      0.856      -0.142       0.118
knot_27       -0.0291      0.068     -0.426      0.670      -0.163       0.105
knot_28        0.0647      0.069      0.933      0.351      -0.071       0.201
knot_29       -0.0132      0.070     -0.189      0.850      -0.150       0.124
knot_30       -0.0305      0.071     -0.429      0.668      -0.170       0.109
knot_31       -0.0261      0.073     -0.360      0.719      -0.168       0.116
knot_32       -0.0034      0.074     -0.045      0.964      -0.149       0.142
knot_33        0.0766      0.076      1.002      0.317      -0.073       0.226
knot_34       -0.0943      0.076     -1.235      0.217      -0.244       0.055
knot_35        0.1039      0.078      1.337      0.181      -0.048       0.256
knot_36       -0.1690      0.081     -2.079      0.038      -0.328      -0.010
knot_37        0.1986      0.084      2.362      0.018       0.034       0.363
knot_38       -0.0801      0.085     -0.938      0.348      -0.247       0.087
knot_39       -0.0127      0.087     -0.146      0.884      -0.183       0.158
knot_40        0.0045      0.083      0.054      0.957      -0.157       0.166
knot_41       -0.0718      0.087     -0.824      0.410      -0.243       0.099
knot_42       -0.0043      0.091     -0.047      0.962      -0.183       0.175
knot_43        0.1796      0.096      1.879      0.060      -0.008       0.367
knot_44       -0.1397      0.103     -1.350      0.177      -0.342       0.063
knot_45       -0.0091      0.111     -0.082      0.934      -0.226       0.208
knot_46        0.0326      0.120      0.271      0.786      -0.203       0.268
knot_47       -0.0828      0.131     -0.630      0.529      -0.340       0.175
knot_48        0.0582      0.146      0.399      0.690      -0.228       0.344
knot_49       -0.0267      0.162     -0.165      0.869      -0.344       0.290
knot_50        0.1124      0.194      0.581      0.562      -0.267       0.492
knot_51       -0.1186      0.235     -0.504      0.614      -0.580       0.342
knot_52       -0.0133      0.268     -0.050      0.960      -0.539       0.512
knot_53        0.1658      0.294      0.564      0.573      -0.410       0.742
knot_54       -0.2045      0.335     -0.611      0.541      -0.861       0.452
knot_55        0.1716      0.429      0.400      0.689      -0.669       1.012
knot_56       -0.4076      0.465     -0.877      0.380      -1.318       0.503
knot_57        0.5338      0.627      0.851      0.395      -0.696       1.763
knot_58        0.1181      0.700      0.169      0.866      -1.255       1.491
knot_59       -0.3092      0.314     -0.984      0.325      -0.925       0.306
knot_60       -0.7365      0.952     -0.774      0.439      -2.603       1.130
knot_61        0.9303      0.867      1.072      0.284      -0.770       2.631
knot_62        0.4651      0.434      1.072      0.284      -0.385       1.315
==============================================================================
Omnibus:                      724.158   Durbin-Watson:                   2.015
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1054.343
Skew:                          -0.302   Prob(JB):                    1.13e-229
Kurtosis:                       3.794   Cond. No.                     1.75e+16
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 4.52e-25. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

The main problem with the above OLS output is that the estimated regression coefficients are not interpretable. To see this, note that the fitted coefficients are:

$$\beta_0 = 5.4711 ~~~ \beta_1 = 0.1347 ~~~ \beta_2 = 0.0323 ~~~ \beta_3 = -0.0847 ~~~ \beta_4 = 0.0185 ~~~ \dots$$

so that the interpretations become

  1. weekly earnings for someone just joining the workforce is $\exp(5.4711) = \$237.72$,

  2. increment after year 1 is $13.47\%$,

  3. increment after year 2 is $(13.47 + 3.23) = 16.7\%$,

  4. increment after year 3 is $(16.7 - 8.47) = 8.23\%$,

  5. increment after year 4 is $(8.23 + 1.85) = 10.08\%$, and so on.

Clearly the increments are quite large, and they oscillate somewhat wildly from year to year. So the fitted model is rather nonsensical.
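These year-over-year increments can be computed directly from the fitted coefficients with a cumulative sum. The sketch below (just an illustration; the coefficient values are copied from the OLS output above) recovers the percentages quoted in the list:

```python
import numpy as np

# Coefficients of Exper and the first few knot terms, copied from the OLS output above
coefs = np.array([0.1347, 0.0323, -0.0847, 0.0185])

# In the piecewise-linear (all-knots) model, the slope after year k is the
# cumulative sum of the coefficients up to knot_k; on the log scale this is
# approximately the percentage increment in earnings for that year
increments = 100 * np.cumsum(coefs)
print(np.round(increments, 2))  # 13.47, 16.70, 8.23, 10.08 (percent)
```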

It is also interesting to compare this fitted model to the earlier quadratic model in terms of AIC and BIC. The AIC of this model is actually smaller than that of the quadratic model, but its BIC is larger. Even though the AIC is smaller, no one would want to use this fitted model because of its poor interpretability. One can also plot the fitted function on the scatter plot of the data, as shown below. The fitted function is quite wiggly.

#Plotting the fitted values of this model: 
# Generate x_range for plotting
x_range = np.linspace(x.min(), x.max(), 100)

# Prepare a DataFrame for x_range
X_range = pd.DataFrame({'const': 1, 'x': x_range})
for knot in range(1, 63):
    X_range[f'knot_{knot}'] = np.maximum(X_range['x'] - knot, 0)

fitted_model_all_knots = model_all_knots.predict(X_range)

plt.figure(figsize = (10, 6))
plt.scatter(x, y, s = 1)
plt.plot(x_range,fitted_model_all_knots, color = 'green', label = 'Fitted All Knots Regression')
plt.plot(x_range,fitted_model_2, color = 'red', label = 'Fitted Quadratic')
plt.xlabel('Years of Experience')
plt.ylabel('Log of Weekly Earnings')
plt.title('All knots regression of y on x')
plt.legend()
plt.show()
<Figure size 1000x600 with 1 Axes>
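The AIC/BIC comparison above can also be done by hand. The sketch below (on synthetic data, since it only illustrates the formulas) computes the Gaussian log-likelihood of two nested OLS fits and the corresponding $\text{AIC} = 2k - 2\log \hat{L}$ and $\text{BIC} = k \log n - 2 \log \hat{L}$. The bigger model always attains the higher log-likelihood, but BIC penalizes its extra parameters more heavily than AIC:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, n)
y = 1.0 + 0.5 * x - 0.03 * x**2 + rng.normal(0, 0.4, n)

def gaussian_ols_ic(X, y):
    """Fit OLS and return (loglik, AIC, BIC) under the Gaussian model."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sig2 = np.mean(resid**2)                     # MLE of the error variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sig2) + 1)
    k = k + 1                                    # count sigma^2 as a parameter
    return loglik, 2 * k - 2 * loglik, k * np.log(n) - 2 * loglik

X_small = np.column_stack([np.ones(n), x, x**2])   # quadratic model
X_big = np.column_stack([X_small] + [np.maximum(x - t, 0) for t in range(1, 10)])  # adds knot terms

ll_s, aic_s, bic_s = gaussian_ols_ic(X_small, y)
ll_b, aic_b, bic_b = gaussian_ols_ic(X_big, y)
print(aic_s, bic_s)
print(aic_b, bic_b)
```

Since the quadratic model's columns are a subset of the bigger model's, the bigger model's log-likelihood can never be smaller; the information criteria differ only through the penalty terms.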

The reason why this ‘all-knots’ regression model gives a poor fit is not that the model itself is inadequate, but rather that OLS is used to estimate its coefficients. The sample size here is quite large ($n = 25437$). The problem would be much worse if the sample size were smaller. To illustrate this, consider fitting the model to a smaller dataset obtained by randomly sampling $n = 500$ observations from the full dataset. This is demonstrated below.

#Repeating the analysis on a smaller dataset: 
n = 500
random_seed = 42
dt_small = dt.sample(n = 500, random_state = random_seed)
print(dt_small.shape)
print(dt_small.head())
(500, 5)
          Region   MetropolitanStatus  Exper  Educ  WeeklyEarnings
8735        West     MetropolitanArea     19    14          735.99
13369  Northeast  NotMetropolitanArea     15    16          807.22
5512     Midwest  NotMetropolitanArea     18    16          341.88
24277      South  NotMetropolitanArea     32    12          240.38
24959      South     MetropolitanArea     14    16         1210.83
#Let us now repeat the exercise fitting the quadratic model as well as the all-knots regression model on this smaller dataset: 
y = np.log(dt_small['WeeklyEarnings'])
x = dt_small['Exper']

#Quadratic model: 
X = sm.add_constant(x)
X['Exper_Square'] = np.square(x)
model_quad = sm.OLS(y, X).fit()

print(model_quad.summary())

#Plotting the fitted quadratic on the scatter plot
b0, b1, b2 = model_quad.params
x_range = np.linspace(x.min(), x.max(), 100)
fitted_model_quad = b0 + b1*x_range + b2*np.square(x_range)

X_all_knots = sm.add_constant(x) #adding intercept
for knot in range(1, 63):
    X_all_knots[f'knot_{knot}'] = np.maximum(x - knot, 0)
model_all_knots = sm.OLS(y, X_all_knots).fit()

print(model_all_knots.summary())

# Prepare a DataFrame for x_range
X_range = pd.DataFrame({'const': 1, 'x': x_range})
for knot in range(1, 63):
    X_range[f'knot_{knot}'] = np.maximum(X_range['x'] - knot, 0)

fitted_model_all_knots = model_all_knots.predict(X_range)

plt.figure(figsize = (10, 6))
plt.scatter(x, y, s = 5)
plt.plot(x_range,fitted_model_all_knots, color = 'green', label = 'Fitted All Knots Regression by OLS')
plt.plot(x_range,fitted_model_quad, color = 'red', label = 'Fitted Quadratic')
plt.xlabel('Years of Experience')
plt.ylabel('Log of Weekly Earnings')
plt.title('All knots regression of y on x')
plt.legend()
plt.show()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:         WeeklyEarnings   R-squared:                       0.183
Model:                            OLS   Adj. R-squared:                  0.180
Method:                 Least Squares   F-statistic:                     55.83
Date:                Fri, 06 Mar 2026   Prob (F-statistic):           1.34e-22
Time:                        20:49:19   Log-Likelihood:                -405.85
No. Observations:                 500   AIC:                             817.7
Df Residuals:                     497   BIC:                             830.3
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const            5.6176      0.069     81.159      0.000       5.482       5.754
Exper            0.0666      0.007      9.404      0.000       0.053       0.080
Exper_Square    -0.0011      0.000     -7.593      0.000      -0.001      -0.001
==============================================================================
Omnibus:                        9.502   Durbin-Watson:                   2.014
Prob(Omnibus):                  0.009   Jarque-Bera (JB):               11.214
Skew:                          -0.226   Prob(JB):                      0.00367
Kurtosis:                       3.577   Cond. No.                     2.09e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.09e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:         WeeklyEarnings   R-squared:                       0.273
Model:                            OLS   Adj. R-squared:                  0.190
Method:                 Least Squares   F-statistic:                     3.298
Date:                Fri, 06 Mar 2026   Prob (F-statistic):           9.95e-12
Time:                        20:49:19   Log-Likelihood:                -376.83
No. Observations:                 500   AIC:                             857.7
Df Residuals:                     448   BIC:                             1077.
Df Model:                          51                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.1547      0.272     18.980      0.000       4.621       5.688
Exper          0.1953      0.351      0.557      0.578      -0.494       0.884
knot_1         0.1042      0.551      0.189      0.850      -0.978       1.186
knot_2        -0.0938      0.453     -0.207      0.836      -0.983       0.796
knot_3        -0.1610      0.384     -0.419      0.675      -0.916       0.594
knot_4        -0.2014      0.326     -0.617      0.538      -0.843       0.440
knot_5         0.6183      0.309      2.002      0.046       0.011       1.225
knot_6        -0.6134      0.296     -2.075      0.039      -1.194      -0.032
knot_7         0.3069      0.315      0.974      0.330      -0.312       0.926
knot_8        -0.1287      0.346     -0.372      0.710      -0.809       0.551
knot_9         0.0483      0.428      0.113      0.910      -0.794       0.890
knot_10       -0.0715      0.366     -0.195      0.845      -0.791       0.648
knot_11       -0.1556      0.334     -0.465      0.642      -0.813       0.502
knot_12        0.1899      0.331      0.574      0.566      -0.461       0.840
knot_13        0.2660      0.328      0.811      0.418      -0.378       0.910
knot_14       -0.5626      0.329     -1.709      0.088      -1.209       0.084
knot_15        0.3070      0.314      0.979      0.328      -0.309       0.923
knot_16        0.2728      0.389      0.701      0.484      -0.492       1.037
knot_17       -0.4239      0.364     -1.166      0.244      -1.139       0.291
knot_18        0.1405      0.329      0.427      0.669      -0.506       0.787
knot_19       -0.4726      0.305     -1.549      0.122      -1.072       0.127
knot_20        0.6368      0.350      1.821      0.069      -0.050       1.324
knot_21       -0.1141      0.373     -0.306      0.760      -0.847       0.619
knot_22        0.1367      0.361      0.379      0.705      -0.572       0.846
knot_23       -0.1656      0.387     -0.428      0.669      -0.926       0.595
knot_24       -0.1908      0.388     -0.492      0.623      -0.953       0.571
knot_25        0.1208      0.432      0.280      0.780      -0.728       0.970
knot_26       -0.2886      0.454     -0.636      0.525      -1.181       0.603
knot_27        0.6466      0.493      1.310      0.191      -0.323       1.616
knot_28       -0.1814      0.498     -0.365      0.716      -1.159       0.797
knot_29       -0.4298      0.476     -0.903      0.367      -1.365       0.506
knot_30        0.4724      0.462      1.021      0.308      -0.437       1.381
knot_31       -0.5066      0.438     -1.157      0.248      -1.367       0.354
knot_32        0.4254      0.462      0.920      0.358      -0.483       1.334
knot_33       -0.1973      0.476     -0.415      0.679      -1.133       0.738
knot_34        0.2300      0.505      0.456      0.649      -0.762       1.222
knot_35       -0.3746      0.530     -0.707      0.480      -1.416       0.667
knot_36        0.5390      0.562      0.959      0.338      -0.566       1.644
knot_37       -0.2597      0.703     -0.369      0.712      -1.642       1.122
knot_38       -0.5455      0.639     -0.854      0.394      -1.801       0.710
knot_39        0.7818      0.643      1.217      0.224      -0.481       2.045
knot_40       -0.0695      0.639     -0.109      0.913      -1.325       1.186
knot_41       -0.3461      0.725     -0.477      0.633      -1.772       1.079
knot_42       -0.4418      0.735     -0.601      0.548      -1.887       1.004
knot_43        0.1942      0.979      0.198      0.843      -1.730       2.119
knot_44        1.1360      1.194      0.951      0.342      -1.211       3.483
knot_45       -0.9202      0.914     -1.006      0.315      -2.717       0.877
knot_46       -0.5969      0.873     -0.684      0.494      -2.313       1.119
knot_47        0.4265      0.859      0.497      0.620      -1.261       2.114
knot_48        1.5575      1.152      1.352      0.177      -0.707       3.822
knot_49       -2.1921      0.941     -2.330      0.020      -4.041      -0.343
knot_50        0.6258      0.504      1.242      0.215      -0.365       1.616
knot_51        0.4172      0.336      1.242      0.215      -0.243       1.078
knot_52        0.2086      0.168      1.242      0.215      -0.122       0.539
knot_53             0          0        nan        nan           0           0
knot_54             0          0        nan        nan           0           0
knot_55             0          0        nan        nan           0           0
knot_56             0          0        nan        nan           0           0
knot_57             0          0        nan        nan           0           0
knot_58             0          0        nan        nan           0           0
knot_59             0          0        nan        nan           0           0
knot_60             0          0        nan        nan           0           0
knot_61             0          0        nan        nan           0           0
knot_62             0          0        nan        nan           0           0
==============================================================================
Omnibus:                       12.537   Durbin-Watson:                   2.023
Prob(Omnibus):                  0.002   Jarque-Bera (JB):               15.793
Skew:                          -0.262   Prob(JB):                     0.000372
Kurtosis:                       3.696   Cond. No.                     1.09e+18
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.13e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
<Figure size 1000x600 with 1 Axes>

Now the coefficients oscillate even more wildly, making the interpretation fully nonsensical. Indeed, now

$$\beta_0 = 5.1547 ~~~ \beta_1 = 0.1953 ~~~ \beta_2 = 0.1042 ~~~ \beta_3 = -0.0938 ~~~ \beta_4 = -0.1610 ~~~ \dots$$

so that the interpretations become

  1. weekly earnings for someone just joining the workforce is $\exp(5.1547) = \$173.24$,

  2. increment after year 1 is $19.53\%$,

  3. increment after year 2 is $(19.53 + 10.42) = 29.95\%$,

  4. increment after year 3 is $(29.95 - 9.38) = 20.57\%$,

  5. increment after year 4 is $(20.57 - 16.10) = 4.47\%$, and so on.

Clearly these numbers are meaningless. Moreover, the AIC and BIC are now both larger than those of the quadratic model. In situations such as this, where OLS gives meaningless results, one uses regularization.

Regularization

Regularization is used when the OLS estimates are not working. It is easy to understand how regularization works from the Bayesian viewpoint. We have seen previously that the OLS estimates can be seen as Bayesian estimates under the flat prior, i.e., under the prior

\begin{align*} \beta_0, \beta_1, \dots, \beta_m \overset{\text{i.i.d}}{\sim} \text{Unif}(-C, C) \end{align*}

for a large $C$. The $\text{Unif}(-C, C)$ prior with very large $C$ is flat in the region of plausibility of the $\beta$'s. There are also other ways of making flat priors. The normal density $N(0, C)$ with mean 0 and variance $C$ is also an example of a flat prior when $C$ is very large. This is because the normal density is given by $\frac{1}{\sqrt{2\pi C}} \exp \left(-\beta_j^2/(2C) \right)$. The exponential term is essentially 1 when $C$ is very large, so that the prior density does not depend on the value $\beta_j$ and is therefore flat. Thus the OLS estimates in linear regression can also be seen as Bayesian estimates under the prior:

\begin{align*} \beta_0, \beta_1, \dots, \beta_m \overset{\text{i.i.d}}{\sim} N(0, C) \tag{Flat} \end{align*}

where $C$ is very large. When the OLS estimates give strange results, it means that the above prior is not a useful one. In such cases, one should look for alternative priors. In the specific example of the all-knots regression model, here is a reasonable alternative prior:

\begin{align*} \beta_0 \sim N(0, C) ~~ \beta_1 \sim N(0, C) ~~ \text{ and } ~~ \beta_2, \dots, \beta_{63} \overset{\text{i.i.d}}{\sim} N(0, \tau^2) \tag{Regularizing Prior} \end{align*}

for a $\tau^2$ that is not necessarily set to some large value. This prior is more flexible than the previous flat prior because we can get back the flat prior by taking $\tau = \sqrt{C}$. But here we have the flexibility of taking $\tau$ small (if it leads to a better fit to the data), which will lead to smooth function fits. The assumption $\beta_0, \beta_1 \overset{\text{i.i.d}}{\sim} N(0, C)$ just says that we do not enforce anything on $\beta_0$ and $\beta_1$ a priori. We can write the regularizing prior as

$$\beta \sim N_{64}(m_0, Q_0)$$

where $\beta$ is the $64 \times 1$ vector with components $\beta_0, \beta_1, \dots, \beta_{63}$, $m_0$ is the $64 \times 1$ vector of zeros, and $Q_0$ is the $64 \times 64$ diagonal matrix with diagonal entries $C, C, \tau^2, \dots, \tau^2$. To calculate the posterior distribution of $\beta$, we shall use the following fact (below $m_0, Q_0, X, \sigma$ should be treated as non-random):

$$\beta \sim N_p(m_0, Q_0) \text{ and } Y \mid \beta \sim N_n(X \beta, \sigma^2 I_n) \implies \beta \mid Y \sim N_p(m_1, Q_1)$$

where

$$m_1 = \left(Q_0^{-1} + \frac{1}{\sigma^2}X^T X \right)^{-1} \left(Q_0^{-1} m_0 + \frac{1}{\sigma^2}X^T Y \right) ~~~ \text{ and } ~~~ Q_1 = \left(Q_0^{-1} + \frac{1}{\sigma^2}X^T X \right)^{-1}.$$

Here $N_p$ and $N_n$ denote the multivariate normal distributions in $p$ dimensions and $n$ dimensions respectively. Using this result, the posterior of $\beta$ under the regularizing prior is

$$\beta \mid \text{data}, \sigma \sim N_{64} \left( \left(Q_0^{-1} + \frac{1}{\sigma^2} X^T X \right)^{-1} \frac{1}{\sigma^2} X^T Y, \left(Q_0^{-1} + \frac{1}{\sigma^2} X^T X \right)^{-1} \right).$$
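As a sanity check on the posterior-mean formula, the sketch below (toy data, not the wage dataset) computes the posterior mean under a nearly flat prior ($Q_0 = C I$ with huge $C$) and confirms that it essentially coincides with the OLS estimate, as the flat-prior argument earlier predicts:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, -2.0, 0.5])
sigma = 0.3
y = X @ beta_true + rng.normal(0, sigma, n)

# Posterior mean m1 = (Q0^{-1} + X^T X / sigma^2)^{-1} (X^T y / sigma^2), with m0 = 0
C = 1e10
Q0_inv = np.eye(p) / C                      # nearly flat prior
m1 = np.linalg.solve(Q0_inv + X.T @ X / sigma**2, X.T @ y / sigma**2)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.max(np.abs(m1 - beta_ols)))        # tiny: flat-prior posterior mean is OLS
```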

This posterior can be used for inference on $\beta$. Note that it depends on $\tau$ and $\sigma$. One can take a small value of $\tau$ if a smooth function fit is desired. The posterior mean is

$$\tilde{\beta}^{\tau} = \left(Q_0^{-1} + \frac{1}{\sigma^2} X^T X \right)^{-1} \frac{1}{\sigma^2} X^T Y$$

which can be used to get the function fit:

$$\tilde{f}^{\tau}(x) := \tilde{\beta}^{\tau}_0 + \tilde{\beta}_1^{\tau} x + \tilde{\beta}^{\tau}_2 (x - 1)_+ + \tilde{\beta}^{\tau}_3 (x - 2)_+ + \dots + \tilde{\beta}^{\tau}_{63}(x - 62)_+.$$

When $\tau$ is small, it can be checked that $\tilde{f}^{\tau}(x)$ will be a very smooth function of $x$. If $\tau$ is extremely small, then $\tilde{f}^{\tau}(x)$ will essentially be a line. If $\tau$ is large, then we revert to the OLS estimate.

In the code below, set $\tau$ to multiple values ranging from small to large and see how the estimate varies.

#In order to work with the regularizing prior, we need to set the values of C, tau, sigma:
C = 10 ** 6 #some large value
#tau = 0.001 #set tau = 0.001, tau = 0.01, tau = 0.1, tau = 1, tau = 2
tau = 0.01
sig = 0.6

kmax = 62

Q0 = np.diag([C, C] + [tau ** 2] * kmax) #this is the prior covariance matrix of b0, b1, b2, ...b(kmax + 1)
#Q0 is a diagonal matrix with C in the first two entries and the rest of entries equal tau^2

TempMat = np.linalg.inv((sig**2) * np.linalg.inv(Q0) + X_all_knots.T @ X_all_knots) #this is the matrix which
#multiplies X^T Y to give the posterior mean
post_mean = TempMat @ (X_all_knots.T @ y.to_numpy().reshape(-1, 1))

#calculating the fitted values for a grid of x-values:
x_range = np.linspace(x.min(), x.max(), 100)

X_range = pd.DataFrame({'const': 1, 'x': x_range})
for knot in range(1, (1 + kmax)):
    X_range[f'knot_{knot}'] = np.maximum(x_range - knot, 0)

fitted_model_regularized = X_range.values @ post_mean


plt.figure(figsize = (10, 6))
plt.scatter(x, y, s = 5)
plt.plot(x_range,fitted_model_all_knots, color = 'green', label = 'Fitted All Knots Regression by OLS')
plt.plot(x_range,fitted_model_quad, color = 'red', label = 'Fitted Quadratic')
plt.plot(x_range,fitted_model_regularized, color = 'black', label = 'Fitted Regularization')
plt.xlabel('Years of Experience')
plt.ylabel('Log of Weekly Earnings')
plt.title('All knots regression of y on x')
plt.legend()
plt.show()
<Figure size 1000x600 with 1 Axes>

In the above code, with $\tau$ large (say $\tau = 2$), there is very little difference between the regularized fit and the OLS fit. With a smaller $\tau = 0.01$, the regularized fit is very similar to the quadratic fit. With an even smaller $\tau = 0.001$, the regularized fit is essentially the same as simple linear regression. Thus, by varying $\tau$ from small to large, one gets a nice range of fits, from simple linear regression to quadratic to the full all-knots OLS.
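The shrinkage effect of $\tau$ can also be seen numerically in the size of the estimated knot coefficients. The sketch below (synthetic data standing in for the wage data, with fewer knots for brevity) computes the posterior mean for a very small and a large $\tau$ and checks that small $\tau$ pushes the knot coefficients toward zero, which is exactly what produces the smooth fits:

```python
import numpy as np

rng = np.random.default_rng(2)
n, kmax = 300, 20
x = rng.uniform(0, kmax + 1, n)
y = np.log(200) + 0.06 * x - 0.001 * x**2 + rng.normal(0, 0.5, n)  # curved truth plus noise

# Design matrix: intercept, x, and truncated-line terms (x - k)_+ for each knot
X = np.column_stack([np.ones(n), x] + [np.maximum(x - k, 0) for k in range(1, kmax + 1)])

C, sig = 1e6, 0.5

def knot_coef_norm(tau):
    """Euclidean norm of the posterior-mean knot coefficients for a given tau."""
    Q0_inv = np.diag([1 / C, 1 / C] + [1 / tau**2] * kmax)
    m1 = np.linalg.solve(sig**2 * Q0_inv + X.T @ X, X.T @ y)  # posterior mean
    return np.linalg.norm(m1[2:])                             # knot coefficients only

print(knot_coef_norm(1e-4), knot_coef_norm(10.0))  # small tau gives a much smaller norm
```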

In practice, one needs to choose the value of $\tau$. One way would be to try various values of $\tau$ and pick the one which visually seems to give a good fit. The Bayesian formulation gives a more principled way of choosing $\tau$ via the marginal likelihood.

Choosing $\tau$ and $\sigma$ via Marginal Likelihood

One way of selecting $\tau$ and $\sigma$ is by maximizing the likelihood of the data given only $\tau$ and $\sigma$. In usual linear regression, the likelihood is a function of $\beta_0, \beta_1, \dots, \beta_m$ and $\sigma$, and is given by the density of $N_n(X\beta, \sigma^2 I_n)$. Now, since we have a prior on $\beta_0, \dots, \beta_m$, we can integrate them out (with respect to the prior) to obtain a marginal likelihood which is a function of $\tau$ and $\sigma$ alone. Fortunately, this integral is available in closed form because of the following result:

$$\beta \sim N_p(m_0, Q_0) \text{ and } Y \mid \beta \sim N_n(X \beta, \sigma^2 I_n) \implies Y \sim N_n \left(X m_0, X Q_0 X^T + \sigma^2 I_n\right).$$

Therefore (note that for us $m_0$ is the zero vector) the marginal likelihood as a function of $\tau$ and $\sigma$ is given by:

$$f_{\text{data} \mid \tau, \sigma}(\text{data}) = \left(\frac{1}{\sqrt{2\pi}}\right)^n \left(\det \Sigma \right)^{-1/2} \exp \left(-\frac{1}{2} Y^T \Sigma^{-1} Y \right)$$

where $\Sigma := X Q_0 X^T + \sigma^2 I_n$. Recall that $Q_0$ is the diagonal matrix with diagonal entries $C, C, \tau^2, \dots, \tau^2$; throughout, $C$ is a large constant (e.g., $C = 10^6$). We can maximize the logarithm of the above marginal likelihood,
\begin{align*}
\log f_{\text{data} \mid \tau, \sigma}(\text{data}) = - n \log \sqrt{2 \pi} - \frac{1}{2} \log \det \Sigma - \frac{1}{2} Y^T \Sigma^{-1} Y,
\end{align*}
over $\tau$ and $\sigma$ to get their estimates (note that $\Sigma$ depends on both $\tau$ and $\sigma$). One way of doing this maximization is to take a grid of $\tau$ and $\sigma$ values, calculate the marginal likelihood at each point of the grid, and then take the maximizer. This is illustrated in the code below.
# Marginal Likelihood Maximization for tau and sigma:
# We first set grids for tau and sigma:
taugrid = np.logspace(np.log10(0.001), np.log10(0.1), 50)
siggrid = np.logspace(np.log10(0.1), np.log10(1), 50)
g = pd.DataFrame([(tau, sig) for tau in taugrid for sig in siggrid], columns=['tau', 'sig'])

C = 10 ** 6
for i in range(len(g)):
    tau_val = g.loc[i, 'tau']
    sig_val = g.loc[i, 'sig']
    Q0 = np.diag([C, C] + [tau_val ** 2] * kmax)
    Sigmat = X_all_knots.values @ Q0 @ X_all_knots.values.T + np.diag([sig_val ** 2] * n)
    cov_inv = np.linalg.inv(Sigmat)
    sign, log_cov_det = np.linalg.slogdet(Sigmat)
    norm_factor = n*np.log(2*np.pi) + log_cov_det
    g.loc[i, 'loglik'] = -0.5 * (norm_factor + y.T @ cov_inv @ y) #this is the log marginal density

ind_max = g['loglik'].idxmax()
print(g.loc[ind_max])
tau         0.013895
sig         0.542868
loglik   -432.732990
Name: 1436, dtype: float64
#The best values of tau and sig selected are
tau_best = g.loc[ind_max, 'tau']
sig_best = g.loc[ind_max, 'sig']
print(tau_best, sig_best)
0.013894954943731374 0.5428675439323859
#Now we calculate the posterior mean with the best values for tau and sig:
C = 10 ** 6 #some large value
tau = tau_best
sig = sig_best

kmax = 62

Q0 = np.diag([C, C] + [tau ** 2] * kmax) #this is the prior covariance matrix of b0, b1, b2, ...b(kmax + 1)
#Q0 is a diagonal matrix with C in the first two entries and the rest of entries equal tau^2

TempMat = np.linalg.inv((sig**2) * np.linalg.inv(Q0) + X_all_knots.T @ X_all_knots) #this is the matrix which
#multiplies X^T Y to give the posterior mean
post_mean = TempMat @ (X_all_knots.T @ y.to_numpy().reshape(-1, 1))

#calculating the fitted values for a grid of x-values:
x_range = np.linspace(x.min(), x.max(), 100)

X_range = pd.DataFrame({'const': 1, 'x': x_range})
for knot in range(1, 63):
    X_range[f'knot_{knot}'] = np.maximum(x_range - knot, 0)

fitted_model_regularized = X_range.values @ post_mean


plt.figure(figsize = (10, 6))
plt.scatter(x, y, s = 5)
plt.plot(x_range,fitted_model_all_knots, color = 'green', label = 'Fitted All Knots Regression by OLS')
plt.plot(x_range,fitted_model_quad, color = 'red', label = 'Fitted Quadratic')
plt.plot(x_range,fitted_model_regularized, color = 'black', label = 'Fitted Regularization')
plt.xlabel('Years of Experience')
plt.ylabel('Log of Weekly Earnings')
plt.title('All knots regression of y on x')
plt.legend()
plt.show()
<Figure size 1000x600 with 1 Axes>

Note that the best regularized fit (shown in black in the plot above) is similar to, but not exactly the same as, the quadratic fit. For example, for small $x$ the regularized fit is steeper, while for $x$ near the middle it is flatter than the quadratic fit.