
Linear Regression Details

In Lectures 15 and 16, we looked at Bayesian linear regression. Here we use the default prior:

$$\beta_0, \dots, \beta_m, \log \sigma \overset{\text{i.i.d.}}{\sim} \text{uniform}(-C, C)$$

for $C \to \infty$. Bayesian inference with this prior essentially coincides with frequentist inference. Today, we shall look at some standard regression terminology and at how some of the numbers shown in the regression output are actually computed. We shall use the statsmodels library to compute linear regression output.

To keep things simple, we shall illustrate multiple linear regression with a very simple dataset. This is the US GDP dataset from FRED (https://fred.stlouisfed.org/series/GDP) which gives quarterly data on the Gross Domestic Product (units are billions of dollars).

import pandas as pd
gdp = pd.read_csv('GDP_12Nov2025.csv')
display(gdp)

The key variable is GDP. Below we plot it against time.

import matplotlib.pyplot as plt
plt.plot(gdp['GDP'])
plt.xlabel("Time (quarterly)")
plt.ylabel("Billions of Dollars")
plt.title("Gross Domestic Product (GDP) of the United States")
plt.show()

We take the GDP data to be the response variable $y$. We shall fit a multiple linear regression to this data with three covariates $x_1, x_2, x_3$: $x_1$ is simply time $t = 1, \dots, n$, $x_2$ is the square of time, and $x_3$ is the cube of time. In other words, we fit a cubic model of time to the data.

Let us fit a cubic trend model to this dataset given by:

$$y_t = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 t^3 + \epsilon_t$$

for $t = 1, \dots, n$, where $n$ is the total number of observations in the dataset. In vector-matrix notation, this model can equivalently be represented as:

$$y = X \beta + \epsilon$$

where

$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \qquad X = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 2^2 & 2^3 \\ 1 & 3 & 3^2 & 3^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & n & n^2 & n^3 \end{pmatrix} \qquad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix} \qquad \epsilon = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}$$

Observe that $y$ is $n \times 1$, $X$ is $n \times 4$, $\beta$ is $4 \times 1$ and $\epsilon$ is $n \times 1$. The entries $\epsilon_1, \dots, \epsilon_n$ of the vector $\epsilon$ are known as errors, and we assume that $\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$. There are 5 parameters in this model, $\beta_0, \beta_1, \beta_2, \beta_3, \sigma$, which we need to estimate from the data.

To implement regression using statsmodels, we simply need to create $y$ and $X$ and give them as input to the sm.OLS function.

The following code creates $y$ and $X$.

import numpy as np
y = gdp['GDP']
n = len(y)
x = np.arange(1, n + 1)
x2 = x ** 2
x3 = x ** 3
X = np.column_stack([np.ones(n), x, x2, x3])
print(X)
[[1.0000000e+00 1.0000000e+00 1.0000000e+00 1.0000000e+00]
 [1.0000000e+00 2.0000000e+00 4.0000000e+00 8.0000000e+00]
 [1.0000000e+00 3.0000000e+00 9.0000000e+00 2.7000000e+01]
 ...
 [1.0000000e+00 3.1200000e+02 9.7344000e+04 3.0371328e+07]
 [1.0000000e+00 3.1300000e+02 9.7969000e+04 3.0664297e+07]
 [1.0000000e+00 3.1400000e+02 9.8596000e+04 3.0959144e+07]]
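As an aside, the same design matrix can be built in a single call with NumPy's `np.vander`. This is just an alternative construction sketched on a small illustrative $n$, not part of the lecture code:

```python
import numpy as np

n = 6  # small illustrative n
t = np.arange(1, n + 1)

# increasing=True orders the columns as [1, t, t^2, t^3],
# matching the column_stack construction above
X = np.vander(t, N=4, increasing=True)
print(X[2])  # third row corresponds to t = 3: powers 1, 3, 9, 27
```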

Now we simply use sm.OLS with $y$ and $X$:

import statsmodels.api as sm
md = sm.OLS(y, X).fit()
print(md.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    GDP   R-squared:                       0.994
Model:                            OLS   Adj. R-squared:                  0.994
Method:                 Least Squares   F-statistic:                 1.760e+04
Date:                Tue, 03 Mar 2026   Prob (F-statistic):               0.00
Time:                        08:10:30   Log-Likelihood:                -2458.4
No. Observations:                 314   AIC:                             4925.
Df Residuals:                     310   BIC:                             4940.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        180.6117    139.806      1.292      0.197     -94.477     455.700
x1             2.7137      3.838      0.707      0.480      -4.837      10.265
x2             0.0252      0.028      0.893      0.373      -0.030       0.081
x3             0.0008    5.9e-05     13.395      0.000       0.001       0.001
==============================================================================
Omnibus:                       50.074   Durbin-Watson:                   0.076
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              417.551
Skew:                           0.269   Prob(JB):                     2.14e-91
Kurtosis:                       8.624   Cond. No.                     4.76e+07
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.76e+07. This might indicate that there are
strong multicollinearity or other numerical problems.

Least Squares Estimates

The estimates $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3$, along with their associated uncertainties (standard errors), are given in the table near the middle of the output. Specifically, the estimates from the above table are $\hat{\beta}_0 = 180.6117$, $\hat{\beta}_1 = 2.7137$, $\hat{\beta}_2 = 0.0252$ and $\hat{\beta}_3 = 0.0008$. These numbers can also be obtained using md.params as follows:

print(md.params)
const    180.611745
x1         2.713731
x2         0.025245
x3         0.000791
dtype: float64

These estimates are known as the Least Squares Estimates (also the same as Maximum Likelihood Estimates) and they are computed via the formula:

$$\hat{\beta} = \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \hat{\beta}_2 \\ \hat{\beta}_3 \end{pmatrix} = (X^T X)^{-1} X^T y.$$

Let us verify that this formula indeed gives the reported estimates.

XTX = np.dot(X.T, X)
XTX_inverse = np.linalg.inv(XTX)
XTX_inverse_XT = np.dot(XTX_inverse, X.T)
betahat = np.dot(XTX_inverse_XT, y)
print(np.column_stack([betahat, md.params]))
[[1.80611745e+02 1.80611745e+02]
 [2.71373086e+00 2.71373086e+00]
 [2.52451671e-02 2.52451671e-02]
 [7.90735172e-04 7.90735172e-04]]

In reality, sm.OLS does not use the above formula directly (even though the formula is correct); instead it uses efficient and numerically stable linear algebra methods to solve the linear equations $X^T X \beta = X^T y$.
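For instance, one can solve the normal equations with `np.linalg.solve`, or avoid forming $X^T X$ altogether with the QR/SVD-based `np.linalg.lstsq`. The sketch below, on simulated data (the seed and coefficients are arbitrary), checks that the two routes give the same fit:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
t = np.arange(1, n + 1)
X = np.column_stack([np.ones(n), t, t ** 2, t ** 3])
y = 100 + 2 * t + rng.normal(0, 5, n)

# Solve X^T X beta = X^T y without explicitly inverting X^T X
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least squares via QR/SVD, which avoids forming X^T X at all
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

# The two fits agree (fitted values are numerically stable to compare)
print(np.allclose(X @ beta_solve, X @ beta_lstsq))
```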

Fitted Values

The next quantity to understand is the fitted values. These are given by md.fittedvalues and are obtained via the simple formula:

$$\hat{y} = X \hat{\beta}$$

The entries $\hat{y}_1, \dots, \hat{y}_n$ of the vector $\hat{y}$ are known as the fitted values. Let us check the correctness of this formula.

fvals = np.dot(X, betahat)
print(np.column_stack([fvals[:10], md.fittedvalues[:10]]))
[[183.35151151 183.35151151]
 [186.14651302 186.14651302]
 [189.00149368 189.00149368]
 [191.92119791 191.92119791]
 [194.91037012 194.91037012]
 [197.97375472 197.97375472]
 [201.11609611 201.11609611]
 [204.34213872 204.34213872]
 [207.65662695 207.65662695]
 [211.06430522 211.06430522]]

To assess how well the regression model fits the data, we plot the fitted values along with the original data.

plt.plot(y, label = "Data")
plt.plot(md.fittedvalues, label = "Fitted values")
plt.legend()
plt.xlabel("Time (quarterly)")
plt.ylabel("Billions of Dollars")
plt.title("Gross Domestic Product (GDP) of the United States")
plt.show()

Residuals

The residuals are simply the differences between the observed data and the fitted values. We use the notation $e$ for the vector of residuals:

$$e = y - \hat{y}$$

The $i^{th}$ residual is simply the $i^{th}$ entry of $e$.

residuals = y - fvals
print(np.column_stack([md.resid[:10], residuals[:10]]))
[[59.81248849 59.81248849]
 [59.82148698 59.82148698]
 [60.58350632 60.58350632]
 [67.82380209 67.82380209]
 [70.83162988 70.83162988]
 [74.59324528 74.59324528]
 [78.07990389 78.07990389]
 [76.02386128 76.02386128]
 [67.37737305 67.37737305]
 [60.28669478 60.28669478]]

A plot of the residuals against time is an important diagnostic tool which can tell us about the effectiveness of the fitted model.

plt.plot(md.resid)
plt.xlabel("Time (quarterly)")
plt.ylabel("Billions of Dollars")
plt.title("Residuals")
plt.show()

From this plot, we see that there are some very large residuals indicating the presence of time points where the model predictions are quite far off from the observed values. This could be because of outliers or some systematic features that the model is missing.
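One way to follow up on this observation is to pull out the time points whose residuals are largest in magnitude. The sketch below does this on a toy residual series with one planted outlier (in the notebook one would pass md.resid instead):

```python
import numpy as np

rng = np.random.default_rng(1)
resid = rng.normal(0, 100, 314)  # toy stand-in for md.resid
resid[300] = 2500.0              # plant one outsized residual

# indices of the 5 residuals largest in magnitude, worst first
worst = np.argsort(np.abs(resid))[::-1][:5]
print(worst[0])  # the planted outlier at index 300 comes out first
```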

# the value of the last residual
md.resid[n - 1]
np.float64(2483.4492304471496)

Residual Sum of Squares (RSS) and Residual df

Two other commonly used terms (in connection with residuals) are the Residual Sum of Squares (RSS) and the Residual Degrees of Freedom. The RSS is simply the sum of the squares of the residuals:

$$\text{RSS} = \sum_{i=1}^n e_i^2 = \sum_{t=1}^n \left(y_t - \hat{\beta}_0 - \hat{\beta}_1 t - \hat{\beta}_2 t^2 - \hat{\beta}_3 t^3 \right)^2$$

RSS is simply the smallest possible value of the sum of squares criterion, $S(\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3)$ (recall from lecture that $S(\beta_0, \beta_1, \beta_2, \beta_3) = \sum_{t=1}^n (y_t - \beta_0 - \beta_1 t - \beta_2 t^2 - \beta_3 t^3)^2$).

The vector of residuals has the following important property: $X^T e = 0$. This can be proved as follows:

$$X^T e = X^T (y - \hat{y}) = X^T (y - X \hat{\beta}) = X^T \left(y - X (X^T X)^{-1} X^T y\right) = X^T y - X^T X (X^T X)^{-1} X^T y = X^T y - X^T y = 0.$$

$X^T e = 0$ means that the dot product between every column of $X$ and $e$ equals 0. In our regression model for GDP, $X^T e = 0$ is equivalent to:

$$\sum_{t = 1}^n e_t = 0 \qquad \sum_{t = 1}^n t e_t = 0 \qquad \sum_{t = 1}^n t^2 e_t = 0 \qquad \sum_{t = 1}^n t^3 e_t = 0$$

Even though there are $n$ residuals $e_1, \dots, e_n$, they have to satisfy the above four equality constraints. Therefore, the effective number of 'free' residuals is $n - 4$. Thus the residual degrees of freedom equals $n - 4$.

More generally, the residual df equals the number of observations ($n$) minus the number of columns of the $X$ matrix.

Estimate of $\sigma$: the residual standard error

The estimate of $\sigma$ is given by:

$$\hat{\sigma} = \sqrt{\frac{\text{RSS}}{n-4}} = \sqrt{\frac{S(\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3)}{n-4}}$$

This quantity is sometimes called the Residual Standard Error. The value of the Residual Standard Error can be used to assess the size of the residuals (residuals much larger than say twice the Residual Standard Error indicate points where the model fits particularly poorly).

rss = np.sum(md.resid ** 2)
rse = np.sqrt(rss/(n - 4))
print(rse)
611.9599669724937

This intuitively means that residuals with magnitude above $2 \cdot \text{rse} \approx 1200$ correspond to points where the model fits particularly poorly. From the plot of the residuals against time, there are many residuals this large in magnitude.

Standard Errors of the coefficient estimates

The standard errors corresponding to $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3$ are the square roots of the diagonal entries of $\hat{\sigma}^2 (X^T X)^{-1}$. They are given by md.bse as verified below.

sebetahat_squared = (rse ** 2)*(np.diag(XTX_inverse))
sebetahat = np.sqrt(sebetahat_squared)
print(np.array([md.bse, sebetahat]))
[[1.39805910e+02 3.83755953e+00 2.82852061e-02 5.90308579e-05]
 [1.39805910e+02 3.83755953e+00 2.82852061e-02 5.90308579e-05]]

t-statistic and confidence intervals for the coefficients

The $t$-statistic for each coefficient is simply the estimate divided by its standard error. The confidence interval for $\beta_j$ is given by:

$$\left[\hat{\beta}_j - t_{\alpha/2, n-4} \cdot \text{standard error of } \hat{\beta}_j, \;\; \hat{\beta}_j + t_{\alpha/2, n-4} \cdot \text{standard error of } \hat{\beta}_j \right]$$

where $t_{\alpha/2, n-4}$ is the point beyond which the $t$-distribution with $n-4$ degrees of freedom has probability mass $\alpha/2$. The above formula is a consequence of:

$$\frac{\beta_j - \hat{\beta}_j}{\text{standard error of } \hat{\beta}_j} \,\Big|\, \text{data} \;\sim\; t\text{-distribution with } n-4 \text{ d.f.}$$

The above statement can also be interpreted in a frequentist sense (then the conditioning on the data is removed and the randomness is over the data, with $\beta_j$ fixed).

from scipy.stats import t
alpha = 0.05
qt = t.ppf(1 - (alpha/2), n-4) #area to the right equaling alpha/2 is the same as area to the left equaling 1-alpha/2
print(qt)
cilower = md.params - md.bse*qt
ciupper = md.params + md.bse*qt
print(np.column_stack([cilower, ciupper]))
print(md.summary())
1.967645928777568
[[-9.44767854e+01  4.55700275e+02]
 [-4.83722752e+00  1.02646892e+01]
 [-3.04101036e-02  8.09004378e-02]
 [ 6.74583344e-04  9.06886999e-04]]
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    GDP   R-squared:                       0.994
Model:                            OLS   Adj. R-squared:                  0.994
Method:                 Least Squares   F-statistic:                 1.760e+04
Date:                Tue, 03 Mar 2026   Prob (F-statistic):               0.00
Time:                        08:10:30   Log-Likelihood:                -2458.4
No. Observations:                 314   AIC:                             4925.
Df Residuals:                     310   BIC:                             4940.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        180.6117    139.806      1.292      0.197     -94.477     455.700
x1             2.7137      3.838      0.707      0.480      -4.837      10.265
x2             0.0252      0.028      0.893      0.373      -0.030       0.081
x3             0.0008    5.9e-05     13.395      0.000       0.001       0.001
==============================================================================
Omnibus:                       50.074   Durbin-Watson:                   0.076
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              417.551
Skew:                           0.269   Prob(JB):                     2.14e-91
Kurtosis:                       8.624   Cond. No.                     4.76e+07
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.76e+07. This might indicate that there are
strong multicollinearity or other numerical problems.

Visualizing Uncertainty

It was shown in Lecture 15 that the posterior distribution of $\beta$ (the vector of coefficients) is given by

$$\beta \mid \text{data} \sim t_{n-4, 4} \left(\hat{\beta}, \hat{\sigma}^2 (X^T X)^{-1} \right)$$

From the definition of the $t$-distribution above, we also know that this distribution coincides with the distribution of

$$\hat{\beta} + \frac{Z}{\sqrt{V/(n-4)}}$$

where $Z \sim N(0, \hat{\sigma}^2 (X^T X)^{-1})$ and $V \sim \chi^2_{n-4}$ are independent. Using this, we can generate vectors $\beta$ from their posterior distribution and then plot the regression curves corresponding to the different generated $\beta$ vectors. This gives us an idea of the uncertainty underlying the coefficients.

N = 200 # this is the number of posterior samples that we shall draw
Sigma_mat = (rse ** 2)*(XTX_inverse)
#print(Sigma_mat)

from scipy.stats import chi2, multivariate_normal

chi_samples = chi2(df = n-4).rvs(N)
norm_samples = multivariate_normal(cov = Sigma_mat, allow_singular = True).rvs(N)
post_samples = np.tile(betahat, (N, 1)) + norm_samples / np.sqrt(chi_samples / (n-4))[:, None]
#print(post_samples)

plt.figure(figsize = (10, 6))
plt.plot(y, linewidth = 1, color = 'black')
for k in range(N):
    fvalsnew = np.dot(X, post_samples[k])
    plt.plot(fvalsnew, color = 'red', linewidth = 1)
plt.xlabel('Time (quarterly)')
plt.ylabel('Billions of dollars')
plt.title('GDP data with Fitted values along with uncertainty')
plt.show()

Even though $N = 200$ regression curves have been plotted, they are still fairly tightly clustered. This suggests that the uncertainty in the coefficient estimates is fairly small.
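As an aside, the two-step construction used above (a multivariate normal draw divided by a scaled chi-square draw) can be replaced by sampling the multivariate $t$ directly via scipy.stats.multivariate_t (available in SciPy 1.6+). A self-contained sketch on simulated data with a simple linear trend (seed and coefficients arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_t

rng = np.random.default_rng(7)
n, p = 100, 2
t_var = np.arange(1, n + 1).astype(float)
X = np.column_stack([np.ones(n), t_var])
y = 3 + 0.2 * t_var + rng.normal(0, 1, n)

# Least squares fit and the posterior scale matrix sigma-hat^2 (X^T X)^{-1}
betahat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ betahat
sigma2hat = np.sum(resid ** 2) / (n - p)
Sigma = sigma2hat * np.linalg.inv(X.T @ X)

# Draw 200 posterior samples of beta in one call
post = multivariate_t(loc=betahat, shape=Sigma, df=n - p).rvs(size=200, random_state=0)
print(post.shape)  # (200, 2): one row per posterior draw
```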