
Linear Regression Details

In Lectures 15 and 16, we looked at Bayesian linear regression. Here we use the default prior:

$$\beta_0, \dots, \beta_m, \log \sigma \overset{\text{i.i.d.}}{\sim} \text{uniform}(-C, C)$$

for $C \to \infty$. Bayesian inference with this prior essentially coincides with frequentist inference. Today, we shall look at some standard regression terminology and at how some of the numbers shown in the regression output are actually computed. We shall use the statsmodels library to compute linear regression output.

To keep things simple, we shall illustrate multiple linear regression with a very simple dataset. This is the US GDP dataset from FRED (https://fred.stlouisfed.org/series/GDP) which gives quarterly data on the Gross Domestic Product (units are billions of dollars).

import pandas as pd
gdp = pd.read_csv('GDP_12Nov2025.csv')
display(gdp)

The key variable is GDP. Below we plot it against time.

import matplotlib.pyplot as plt
plt.plot(gdp['GDP'])
plt.xlabel("Time (quarterly)")
plt.ylabel("Billions of Dollars")
plt.title("Gross Domestic Product (GDP) of the United States")
plt.show()

We take the GDP data to be the response variable $y$. We shall fit a multiple linear regression to this data with three covariates $x_1, x_2, x_3$: $x_1$ is simply time $t = 1, \dots, n$, $x_2$ is the square of time, and $x_3$ is the cube of time. In other words, we fit a cubic model of time to the data.

Let us fit a cubic trend model to this dataset given by:

$$y_t = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 t^3 + \epsilon_t$$

for $t = 1, \dots, n$, where $n$ is the total number of observations in the dataset. In vector-matrix notation, this model can equivalently be represented as:

$$y = X \beta + \epsilon$$

where

$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \qquad X = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 2^2 & 2^3 \\ 1 & 3 & 3^2 & 3^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & n & n^2 & n^3 \end{pmatrix} \qquad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix} \qquad \epsilon = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}$$

Observe that $y$ is $n \times 1$, $X$ is $n \times 4$, $\beta$ is $4 \times 1$ and $\epsilon$ is $n \times 1$. The entries $\epsilon_1, \dots, \epsilon_n$ of the vector $\epsilon$ are known as errors, and we assume that $\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$. There are 5 parameters in this model, $\beta_0, \beta_1, \beta_2, \beta_3, \sigma$, which we need to estimate from the data.

To implement regression using statsmodels, we simply need to create $y$ and $X$ and give them as input to the sm.OLS function.

The following code creates $y$ and $X$.

import numpy as np
y = gdp['GDP']
n = len(y)
x = np.arange(1, n + 1)
x2 = x ** 2
x3 = x ** 3
X = np.column_stack([np.ones(n), x, x2, x3])
print(X)
[[1.0000000e+00 1.0000000e+00 1.0000000e+00 1.0000000e+00]
 [1.0000000e+00 2.0000000e+00 4.0000000e+00 8.0000000e+00]
 [1.0000000e+00 3.0000000e+00 9.0000000e+00 2.7000000e+01]
 ...
 [1.0000000e+00 3.1200000e+02 9.7344000e+04 3.0371328e+07]
 [1.0000000e+00 3.1300000e+02 9.7969000e+04 3.0664297e+07]
 [1.0000000e+00 3.1400000e+02 9.8596000e+04 3.0959144e+07]]
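As an aside, the same design matrix can be built in a single call with NumPy's `np.vander`. This is just an alternative construction sketched on a small illustrative $n$, not part of the lecture code:

```python
import numpy as np

n = 6  # small illustrative n
t = np.arange(1, n + 1)

# increasing=True orders the columns as [1, t, t^2, t^3],
# matching the column_stack construction above
X = np.vander(t, N=4, increasing=True)
print(X[2])  # third row corresponds to t = 3: powers 1, 3, 9, 27
```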

Now we simply use sm.OLS with $y$ and $X$:

import statsmodels.api as sm
md = sm.OLS(y, X).fit()
print(md.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    GDP   R-squared:                       0.994
Model:                            OLS   Adj. R-squared:                  0.994
Method:                 Least Squares   F-statistic:                 1.760e+04
Date:                Tue, 03 Mar 2026   Prob (F-statistic):               0.00
Time:                        08:10:30   Log-Likelihood:                -2458.4
No. Observations:                 314   AIC:                             4925.
Df Residuals:                     310   BIC:                             4940.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        180.6117    139.806      1.292      0.197     -94.477     455.700
x1             2.7137      3.838      0.707      0.480      -4.837      10.265
x2             0.0252      0.028      0.893      0.373      -0.030       0.081
x3             0.0008    5.9e-05     13.395      0.000       0.001       0.001
==============================================================================
Omnibus:                       50.074   Durbin-Watson:                   0.076
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              417.551
Skew:                           0.269   Prob(JB):                     2.14e-91
Kurtosis:                       8.624   Cond. No.                     4.76e+07
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.76e+07. This might indicate that there are
strong multicollinearity or other numerical problems.

Least Squares Estimates

The estimates $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3$, along with their associated uncertainties (standard errors), are given in the table near the middle of the output. Specifically, the estimates from the above table are $\hat{\beta}_0 = 180.6117$, $\hat{\beta}_1 = 2.7137$, $\hat{\beta}_2 = 0.0252$ and $\hat{\beta}_3 = 0.0008$. These numbers can also be obtained using md.params as follows:

print(md.params)
const    180.611745
x1         2.713731
x2         0.025245
x3         0.000791
dtype: float64

These estimates are known as the Least Squares Estimates (also the same as Maximum Likelihood Estimates) and they are computed via the formula:

$$\hat{\beta} = \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \hat{\beta}_2 \\ \hat{\beta}_3 \end{pmatrix} = (X^T X)^{-1} X^T y.$$

Let us verify that this formula indeed gives the reported estimates.

XTX = np.dot(X.T, X)
XTX_inverse = np.linalg.inv(XTX)
XTX_inverse_XT = np.dot(XTX_inverse, X.T)
betahat = np.dot(XTX_inverse_XT, y)
print(np.column_stack([betahat, md.params]))
[[1.80611745e+02 1.80611745e+02]
 [2.71373086e+00 2.71373086e+00]
 [2.52451671e-02 2.52451671e-02]
 [7.90735172e-04 7.90735172e-04]]

In reality, sm.OLS does not use the above formula directly (even though the formula is correct); instead it uses efficient and numerically stable linear algebra methods to solve the linear equations $X^T X \beta = X^T y$.
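For instance, one can solve the normal equations with `np.linalg.solve`, or avoid forming $X^T X$ altogether with the QR/SVD-based `np.linalg.lstsq`. The sketch below, on simulated data (the seed and coefficients are arbitrary), checks that the two routes give the same fit:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
t = np.arange(1, n + 1)
X = np.column_stack([np.ones(n), t, t ** 2, t ** 3])
y = 100 + 2 * t + rng.normal(0, 5, n)

# Solve X^T X beta = X^T y without explicitly inverting X^T X
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least squares via QR/SVD, which avoids forming X^T X at all
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

# The two fits agree (fitted values are numerically stable to compare)
print(np.allclose(X @ beta_solve, X @ beta_lstsq))
```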

Fitted Values

The next quantity to understand is the fitted values. These are given by md.fittedvalues and are obtained via the simple formula:

$$\hat{y} = X \hat{\beta}$$

The entries $\hat{y}_1, \dots, \hat{y}_n$ of the vector $\hat{y}$ are known as the fitted values. Let us check the correctness of this formula.

fvals = np.dot(X, betahat)
print(np.column_stack([fvals[:10], md.fittedvalues[:10]]))
[[183.35151151 183.35151151]
 [186.14651302 186.14651302]
 [189.00149368 189.00149368]
 [191.92119791 191.92119791]
 [194.91037012 194.91037012]
 [197.97375472 197.97375472]
 [201.11609611 201.11609611]
 [204.34213872 204.34213872]
 [207.65662695 207.65662695]
 [211.06430522 211.06430522]]

To assess how well the regression model fits the data, we plot the fitted values along with the original data.

plt.plot(y, label = "Data")
plt.plot(md.fittedvalues, label = "Fitted values")
plt.legend()
plt.xlabel("Time (quarterly)")
plt.ylabel("Billions of Dollars")
plt.title("Gross Domestic Product (GDP) of the United States")
plt.show()

Residuals

The residuals are simply the differences between the observed data and the fitted values. We use the notation $e$ for the vector of residuals:

$$e = y - \hat{y}$$

The $i^{th}$ residual is simply the $i^{th}$ entry of $e$.

residuals = y - fvals
print(np.column_stack([md.resid[:10], residuals[:10]]))
[[59.81248849 59.81248849]
 [59.82148698 59.82148698]
 [60.58350632 60.58350632]
 [67.82380209 67.82380209]
 [70.83162988 70.83162988]
 [74.59324528 74.59324528]
 [78.07990389 78.07990389]
 [76.02386128 76.02386128]
 [67.37737305 67.37737305]
 [60.28669478 60.28669478]]

A plot of the residuals against time is an important diagnostic tool which can tell us about the effectiveness of the fitted model.

plt.plot(md.resid)
plt.xlabel("Time (quarterly)")
plt.ylabel("Billions of Dollars")
plt.title("Residuals")
plt.show()

From this plot, we see that there are some very large residuals indicating the presence of time points where the model predictions are quite far off from the observed values. This could be because of outliers or some systematic features that the model is missing.
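One way to follow up on this observation is to pull out the time points whose residuals are largest in magnitude. The sketch below does this on a toy residual series with one planted outlier (in the notebook one would pass md.resid instead):

```python
import numpy as np

rng = np.random.default_rng(1)
resid = rng.normal(0, 100, 314)  # toy stand-in for md.resid
resid[300] = 2500.0              # plant one outsized residual

# indices of the 5 residuals largest in magnitude, worst first
worst = np.argsort(np.abs(resid))[::-1][:5]
print(worst[0])  # the planted outlier at index 300 comes out first
```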

# the value of the last residual
md.resid[n - 1]
np.float64(2483.4492304471496)

Residual Sum of Squares (RSS) and Residual df

Two other commonly used terms (in connection with residuals) are the Residual Sum of Squares (RSS) and the Residual Degrees of Freedom. The RSS is simply the sum of the squares of the residuals:

$$\text{RSS} = \sum_{i=1}^n e_i^2 = \sum_{t=1}^n \left(y_t - \hat{\beta}_0 - \hat{\beta}_1 t - \hat{\beta}_2 t^2 - \hat{\beta}_3 t^3 \right)^2$$

RSS is simply the smallest possible value of the sum of squares criterion, $S(\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3)$ (recall from lecture that $S(\beta_0, \beta_1, \beta_2, \beta_3) = \sum_{t=1}^n (y_t - \beta_0 - \beta_1 t - \beta_2 t^2 - \beta_3 t^3)^2$).

The vector of residuals has the following important property: $X^T e = 0$. This can be proved as follows:

$$X^T e = X^T (y - \hat{y}) = X^T (y - X \hat{\beta}) = X^T \left(y - X (X^T X)^{-1} X^T y\right) = X^T y - X^T X (X^T X)^{-1} X^T y = X^T y - X^T y = 0.$$

$X^T e = 0$ means that the dot product between every column of $X$ and $e$ equals 0. In our regression model for GDP, $X^T e = 0$ is equivalent to:

$$\sum_{t = 1}^n e_t = 0 \qquad \sum_{t = 1}^n t e_t = 0 \qquad \sum_{t = 1}^n t^2 e_t = 0 \qquad \sum_{t = 1}^n t^3 e_t = 0$$

Even though there are $n$ residuals $e_1, \dots, e_n$, they have to satisfy the above four equality constraints. Therefore, the effective number of 'free' residuals is $n - 4$. Thus the residual degrees of freedom equals $n - 4$.

More generally, the residual df equals the number of observations ($n$) minus the number of columns of the $X$ matrix.

Estimate of $\sigma$: the residual standard error

The estimate of $\sigma$ is given by:

$$\hat{\sigma} = \sqrt{\frac{\text{RSS}}{n-4}} = \sqrt{\frac{S(\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3)}{n-4}}$$

This quantity is sometimes called the Residual Standard Error. The value of the Residual Standard Error can be used to assess the size of the residuals (residuals much larger than say twice the Residual Standard Error indicate points where the model fits particularly poorly).

rss = np.sum(md.resid ** 2)
rse = np.sqrt(rss/(n - 4))
print(rse)
611.9599669724937

This intuitively means that residuals with magnitude above $2 \cdot \text{rse} \approx 1200$ correspond to points where the model fits particularly poorly. From the plot of the residuals against time, there are many residuals this large in magnitude.

Standard Errors of the coefficient estimates

The standard errors corresponding to $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3$ are the square roots of the diagonal entries of $\hat{\sigma}^2 (X^T X)^{-1}$. They are given by md.bse as verified below.

sebetahat_squared = (rse ** 2)*(np.diag(XTX_inverse))
sebetahat = np.sqrt(sebetahat_squared)
print(np.array([md.bse, sebetahat]))
[[1.39805910e+02 3.83755953e+00 2.82852061e-02 5.90308579e-05]
 [1.39805910e+02 3.83755953e+00 2.82852061e-02 5.90308579e-05]]

t-statistic and confidence intervals for the coefficients

The $t$-statistic for each coefficient is simply the estimate divided by its standard error. The confidence interval for $\beta_j$ is given by:

$$\left[\hat{\beta}_j - t_{\alpha/2, n-4} \cdot \text{standard error of } \hat{\beta}_j, \;\; \hat{\beta}_j + t_{\alpha/2, n-4} \cdot \text{standard error of } \hat{\beta}_j \right]$$

where $t_{\alpha/2, n-4}$ is the point beyond which the $t$-distribution with $n-4$ degrees of freedom has probability mass $\alpha/2$. The above formula is a consequence of:

$$\frac{\beta_j - \hat{\beta}_j}{\text{standard error of } \hat{\beta}_j} \,\Big|\, \text{data} \;\sim\; t\text{-distribution with } n-4 \text{ d.f.}$$

The above statement can also be interpreted in a frequentist sense (then the conditioning on the data is removed and the randomness is over the data, with $\beta_j$ fixed).

from scipy.stats import t
alpha = 0.05
qt = t.ppf(1 - (alpha/2), n-4) #area to the right equaling alpha/2 is the same as area to the left equaling 1-alpha/2
print(qt)
cilower = md.params - md.bse*qt
ciupper = md.params + md.bse*qt
print(np.column_stack([cilower, ciupper]))
print(md.summary())
1.967645928777568
[[-9.44767854e+01  4.55700275e+02]
 [-4.83722752e+00  1.02646892e+01]
 [-3.04101036e-02  8.09004378e-02]
 [ 6.74583344e-04  9.06886999e-04]]
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    GDP   R-squared:                       0.994
Model:                            OLS   Adj. R-squared:                  0.994
Method:                 Least Squares   F-statistic:                 1.760e+04
Date:                Tue, 03 Mar 2026   Prob (F-statistic):               0.00
Time:                        08:10:30   Log-Likelihood:                -2458.4
No. Observations:                 314   AIC:                             4925.
Df Residuals:                     310   BIC:                             4940.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        180.6117    139.806      1.292      0.197     -94.477     455.700
x1             2.7137      3.838      0.707      0.480      -4.837      10.265
x2             0.0252      0.028      0.893      0.373      -0.030       0.081
x3             0.0008    5.9e-05     13.395      0.000       0.001       0.001
==============================================================================
Omnibus:                       50.074   Durbin-Watson:                   0.076
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              417.551
Skew:                           0.269   Prob(JB):                     2.14e-91
Kurtosis:                       8.624   Cond. No.                     4.76e+07
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.76e+07. This might indicate that there are
strong multicollinearity or other numerical problems.

Visualizing Uncertainty

It was shown in Lecture 15 that the posterior distribution of $\beta$ (the vector of coefficients) is given by

$$\beta \mid \text{data} \sim t_{n-4, 4} \left(\hat{\beta}, \hat{\sigma}^2 (X^T X)^{-1} \right)$$

From the definition of the $t$-distribution above, we also know that this distribution coincides with the distribution of

$$\hat{\beta} + \frac{Z}{\sqrt{V/(n-4)}}$$

where $Z \sim N(0, \hat{\sigma}^2 (X^T X)^{-1})$ and $V \sim \chi^2_{n-4}$ are independent. Using this, we can generate vectors $\beta$ from their posterior distribution and then plot the regression curves corresponding to the different generated $\beta$ vectors. This gives us an idea of the uncertainty underlying the coefficients.

N = 200 # this is the number of posterior samples that we shall draw
Sigma_mat = (rse ** 2)*(XTX_inverse)
#print(Sigma_mat)

from scipy.stats import chi2, multivariate_normal

chi_samples = chi2(df = n-4).rvs(N)
norm_samples = multivariate_normal(cov = Sigma_mat, allow_singular = True).rvs(N)
post_samples = np.tile(betahat, (N, 1)) + norm_samples / np.sqrt(chi_samples / (n-4))[:, None]
#print(post_samples)

plt.figure(figsize = (10, 6))
plt.plot(y, linewidth = 1, color = 'black')
for k in range(N):
    fvalsnew = np.dot(X, post_samples[k])
    plt.plot(fvalsnew, color = 'red', linewidth = 1)
plt.xlabel('Time (quarterly)')
plt.ylabel('Billions of dollars')
plt.title('GDP data with Fitted values along with uncertainty')
plt.show()

Even though $N = 200$ regression curves have been plotted, they are still fairly tightly clustered. This suggests that the uncertainty in the coefficient estimates is fairly small.
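As an aside, the two-step construction used above (a multivariate normal draw divided by a scaled chi-square draw) can be replaced by sampling the multivariate $t$ directly via scipy.stats.multivariate_t (available in SciPy 1.6+). A self-contained sketch on simulated data with a simple linear trend (seed and coefficients arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_t

rng = np.random.default_rng(7)
n, p = 100, 2
t_var = np.arange(1, n + 1).astype(float)
X = np.column_stack([np.ones(n), t_var])
y = 3 + 0.2 * t_var + rng.normal(0, 1, n)

# Least squares fit and the posterior scale matrix sigma-hat^2 (X^T X)^{-1}
betahat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ betahat
sigma2hat = np.sum(resid ** 2) / (n - p)
Sigma = sigma2hat * np.linalg.inv(X.T @ X)

# Draw 200 posterior samples of beta in one call
post = multivariate_t(loc=betahat, shape=Sigma, df=n - p).rvs(size=200, random_state=0)
print(post.shape)  # (200, 2): one row per posterior draw
```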