import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
Pearson Heights Dataset¶
The following is a classical dataset for regression. It contains data on heights of father-adult son pairs.
height = pd.read_csv("PearsonHeightData.txt", sep = r"\s+")
print(height)
      Father   Son
0 65.0 59.8
1 63.3 63.2
2 65.0 63.3
3 65.8 62.8
4 61.1 64.3
... ... ...
1073 67.0 70.8
1074 71.3 68.3
1075 71.8 69.3
1076 70.7 69.3
1077 70.3 67.0
[1078 rows x 2 columns]
plt.figure(figsize = (10, 4))
plt.scatter(height["Father"], height["Son"])
plt.xlabel("Father's height")
plt.ylabel("Son's height")
plt.title("Scatter plot of heights data")
plt.show()
We will fit a linear regression model with the response y taken to be Son's height, and the covariate x taken to be Father's height. Here there is only one covariate (i.e., m = 1), so this is called simple linear regression. We shall use the statsmodels library to fit regressions.
n = height.shape[0]
X = np.column_stack([np.ones(n), height["Father"]]) #this is the X matrix (consisting of covariates and the column of ones)
y = height["Son"] #this is the y vector consisting of response values
linmod = sm.OLS(y, X).fit()
print(linmod.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Son R-squared: 0.251
Model: OLS Adj. R-squared: 0.250
Method: Least Squares F-statistic: 360.9
Date: Mon, 02 Mar 2026 Prob (F-statistic): 1.27e-69
Time: 13:24:33 Log-Likelihood: -2489.4
No. Observations: 1078 AIC: 4983.
Df Residuals: 1076 BIC: 4993.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 33.8928 1.833 18.491 0.000 30.296 37.489
x1 0.5140 0.027 18.997 0.000 0.461 0.567
==============================================================================
Omnibus: 17.527 Durbin-Watson: 0.765
Prob(Omnibus): 0.000 Jarque-Bera (JB): 30.642
Skew: -0.052 Prob(JB): 2.22e-07
Kurtosis: 3.819 Cond. No. 1.67e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.67e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
bhat = linmod.params.values
print(bhat)
Sbhat = np.sum(linmod.resid ** 2)
m = X.shape[1] - 1 #in this case, m = 1
sighat = np.sqrt(Sbhat/(n-m-1)) #this is also called the residual standard error
print(sighat)
[33.89280054  0.51400591]
2.438134379352427
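The estimates printed above can also be reproduced directly from the closed-form least squares solution β̂ = (XᵀX)⁻¹Xᵀy. A minimal self-contained sketch on synthetic data (the generating coefficients 34 and 0.5 and the noise scale are made up for illustration; the heights file is not needed):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
father = rng.normal(68, 3, size=n)                    # synthetic covariate
son = 34 + 0.5 * father + rng.normal(0, 2.4, size=n)  # synthetic response

X = np.column_stack([np.ones(n), father])
# Normal equations: bhat = (X^T X)^{-1} X^T y
bhat = np.linalg.solve(X.T @ X, X.T @ son)
resid = son - X @ bhat
sighat = np.sqrt(np.sum(resid ** 2) / (n - 2))        # residual standard error
print(bhat, sighat)
```

The recovered coefficients should be close to the generating values 34 and 0.5, and `sighat` close to the noise scale 2.4.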
Below we plot the data along with the fitted regression line.
plt.figure(figsize = (10, 7))
plt.scatter(height["Father"], height["Son"])
x = height["Father"].values
plt.plot(x, linmod.fittedvalues, 'r')
plt.xlabel("Father's height")
plt.ylabel("Son's height")
plt.title("Scatter plot of heights data")
plt.show()
To visualize the uncertainty associated with the regression line, we can draw from the posterior distribution of the coefficients, which (under the standard noninformative prior) is a multivariate t-distribution with n − m − 1 degrees of freedom, centered at β̂, with scale matrix Σ = σ̂² (XᵀX)⁻¹, where σ̂² = S(β̂)/(n − m − 1). The matrix Σ is actually calculated as part of the sm.OLS output, and can be obtained as follows:
m = X.shape[1] - 1 #in this case, m = 1
n = len(y)
Sigma = Sbhat / (n - m - 1) * np.linalg.inv(X.T @ X)
print(Sigma)
print(linmod.cov_params()) #this coincides with Sigma calculated above. Check out help(linmod.cov_params) for more details.
[[ 3.35950253e+00 -4.95515638e-02]
[-4.95515638e-02 7.32071005e-04]]
const x1
const 3.359503 -0.049552
x1 -0.049552 0.000732
To generate observations from the t-distribution mentioned above, we shall use a built-in function from scipy:
from scipy.stats import multivariate_t
N = 1000 #number of samples to generate
beta_samples = multivariate_t.rvs(loc=bhat, shape=Sigma, df=n - m - 1, size=N)
print(beta_samples)
[[33.04742173  0.52666897]
[32.94598986 0.52703271]
[32.54037851 0.53299359]
...
[32.23576553 0.54241367]
[32.3467417 0.53697393]
[32.04900966 0.54041803]]
plt.scatter(beta_samples[:,0], beta_samples[:,1], marker = '.')
plt.xlabel('Intercept')
plt.ylabel('Slope')
plt.title('Posterior Samples (drawn from t-distribution)')
plt.show()
plt.figure(figsize = (10, 7))
plt.scatter(height["Father"], height["Son"])
x = height["Father"].values
for r in range(N):
    fvalsnew = np.dot(X, beta_samples[r])
    plt.plot(x, fvalsnew, color = 'orange')
plt.plot(x, linmod.fittedvalues, 'r')
plt.xlabel("Father's height")
plt.ylabel("Son's height")
plt.title("Scatter plot of heights data")
plt.show()
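Rather than overplotting the sampled lines, the same uncertainty can be summarized by a pointwise credible band computed from the posterior draws. A sketch on synthetic heights-like data (the 2.5/97.5 percentiles are a standard 95% choice, not something from the original analysis):

```python
import numpy as np
from scipy.stats import multivariate_t

rng = np.random.default_rng(1)
n = 300
x = rng.uniform(60, 75, size=n)
y = 34 + 0.5 * x + rng.normal(0, 2.4, size=n)   # synthetic data
X = np.column_stack([np.ones(n), x])

bhat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ bhat
Sigma = np.sum(resid ** 2) / (n - 2) * np.linalg.inv(X.T @ X)

# Posterior draws of the coefficients, then of the fitted line at each x
beta_samples = multivariate_t.rvs(loc=bhat, shape=Sigma, df=n - 2,
                                  size=2000, random_state=rng)
lines = beta_samples @ X.T                       # shape (2000, n)
lo, hi = np.percentile(lines, [2.5, 97.5], axis=0)  # pointwise 95% band
```

Plotting `lo` and `hi` against the sorted x values (e.g., with `plt.fill_between`) gives a band that is narrowest near the mean of x and widens toward the extremes.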
A Simulated Dataset¶
Consider the following simple simulated dataset.
#A simulated dataset
n = 400
x = np.arange(1, n+1)
sig = 1000
rng = np.random.default_rng(12345)
#quadratic data with noise
y = 5 + 0.8 * ((x - (n/2)) ** 2) + rng.normal(loc = 0, scale = sig, size = n)
plt.plot(y)
plt.xlabel("Time")
plt.ylabel("Data")
plt.title("A simulated dataset")
plt.show()
Let us fit a line (not a quadratic) to this dataset.
X = np.column_stack([np.ones(n), x])
linmod = sm.OLS(y, X).fit()
plt.plot(y)
plt.plot(linmod.fittedvalues, color = "red")
plt.xlabel("Time")
plt.ylabel("Data")
plt.title("A simulated dataset")
plt.show()
N = 1000 #number of samples to generate
n = X.shape[0]
m = X.shape[1] - 1
beta_samples = multivariate_t.rvs(loc=linmod.params, shape=linmod.cov_params(), df=n - m - 1, size=N)
print(beta_samples)
plt.plot(y)
for r in range(N):
    fvalsnew = np.dot(X, beta_samples[r])
    plt.plot(fvalsnew, color = 'red')
plt.plot(y, color = 'blue')
plt.xlabel('Time')
plt.ylabel('Data')
plt.title('Simulated Dataset')
plt.show()
[[ 1.15441423e+04 -4.28385649e+00]
[ 8.81460405e+03 1.03071293e+01]
[ 1.05394896e+04 1.38910049e+00]
...
[ 1.08566840e+04 5.25273171e+00]
[ 1.11261055e+04 -5.63857139e+00]
[ 1.10185607e+04 -7.03639275e-01]]

The correct thing here is to fit a quadratic. This is done by multiple linear regression as shown below.
X = np.column_stack([np.ones(n), x, x ** 2])
quadmod = sm.OLS(y, X).fit()
print(quadmod.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.989
Model: OLS Adj. R-squared: 0.989
Method: Least Squares F-statistic: 1.849e+04
Date: Mon, 02 Mar 2026 Prob (F-statistic): 0.00
Time: 13:40:47 Log-Likelihood: -3325.9
No. Observations: 400 AIC: 6658.
Df Residuals: 397 BIC: 6670.
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 3.2e+04 149.517 214.033 0.000 3.17e+04 3.23e+04
x1 -319.9986 1.722 -185.840 0.000 -323.384 -316.613
x2 0.7997 0.004 192.309 0.000 0.792 0.808
==============================================================================
Omnibus: 0.637 Durbin-Watson: 1.911
Prob(Omnibus): 0.727 Jarque-Bera (JB): 0.732
Skew: -0.004 Prob(JB): 0.693
Kurtosis: 2.791 Cond. No. 2.16e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.16e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
N = 1000 #number of samples to generate
n = X.shape[0]
m = X.shape[1] - 1
beta_samples = multivariate_t.rvs(loc=quadmod.params, shape=quadmod.cov_params(), df=n - m - 1, size=N)
print(beta_samples)
plt.plot(y)
for r in range(N):
    fvalsnew = np.dot(X, beta_samples[r])
    plt.plot(fvalsnew, color = 'red')
#plt.plot(y, color = 'blue')
plt.xlabel('Time')
plt.ylabel('Data')
plt.title('Simulated Dataset')
plt.show()
[[ 3.21481039e+04 -3.22765079e+02  8.06663291e-01]
[ 3.20425001e+04 -3.19749731e+02 7.97965584e-01]
[ 3.21951920e+04 -3.20996981e+02 8.00088759e-01]
...
[ 3.18238777e+04 -3.17068504e+02 7.93182352e-01]
[ 3.19250922e+04 -3.18464984e+02 7.95402688e-01]
[ 3.22866030e+04 -3.25515790e+02 8.13643963e-01]]

A Time Series Dataset (Personal Consumption Expenditures from FRED)¶
The following data is on personal consumption expenditures from FRED (https://
pcec = pd.read_csv("PCEC_30Oct2025.csv")
print(pcec.head())
y = pcec['PCEC'].to_numpy()
plt.plot(y, color = 'black')
plt.xlabel('Quarter')
plt.ylabel('Billions of Dollars')
plt.title('Personal Consumption Expenditure')
plt.show()
  observation_date     PCEC
0 1947-01-01 156.161
1 1947-04-01 160.031
2 1947-07-01 163.543
3 1947-10-01 167.672
4 1948-01-01 170.372

Let us fit the AR(1) model to this dataset. This is just linear regression with y_t as the response and y_{t-1} as the covariate. We can fit this by manually creating y and X and using the sm.OLS function.
p = 1
n = len(y)
yreg = y[p:] #these are the response values in the autoregression
Xmat = np.ones((n-p, 1)) #this will be the design matrix (X) in the autoregression
for j in range(1, p+1):
    col = y[p-j : n-j].reshape(-1, 1)
    Xmat = np.column_stack([Xmat, col])
armod = sm.OLS(yreg, Xmat).fit()
print(armod.params)
print(armod.summary())
[5.06629868 1.01215697]
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.999
Model: OLS Adj. R-squared: 0.999
Method: Least Squares F-statistic: 6.086e+05
Date: Mon, 02 Mar 2026 Prob (F-statistic): 0.00
Time: 13:55:54 Log-Likelihood: -1949.1
No. Observations: 313 AIC: 3902.
Df Residuals: 311 BIC: 3910.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 5.0663 9.510 0.533 0.595 -13.646 23.779
x1 1.0122 0.001 780.151 0.000 1.010 1.015
==============================================================================
Omnibus: 306.260 Durbin-Watson: 2.153
Prob(Omnibus): 0.000 Jarque-Bera (JB): 110851.759
Skew: -3.084 Prob(JB): 0.00
Kurtosis: 94.988 Cond. No. 1.00e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
There is also an in-built function (called AutoReg) in statsmodels for fitting AR models. We can directly use it.
from statsmodels.tsa.ar_model import AutoReg
armod_sm = AutoReg(y, lags = p).fit()
print(armod_sm.summary())
AutoReg Model Results
==============================================================================
Dep. Variable: y No. Observations: 314
Model: AutoReg(1) Log Likelihood -1949.118
Method: Conditional MLE S.D. of innovations 122.520
Date: Mon, 02 Mar 2026 AIC 3904.236
Time: 13:56:50 BIC 3915.475
Sample: 1 HQIC 3908.728
314
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 5.0663 9.480 0.534 0.593 -13.514 23.647
y.L1 1.0122 0.001 782.656 0.000 1.010 1.015
Roots
=============================================================================
Real Imaginary Modulus Frequency
-----------------------------------------------------------------------------
AR.1 0.9880 +0.0000j 0.9880 0.0000
-----------------------------------------------------------------------------
Here are some observations comparing the above two outputs:
The parameter estimates are exactly the same.
The regression summary gives t-scores corresponding to each parameter estimate, while the AutoReg summary gives z-scores.
The standard errors are slightly different.
The standard errors given by the regression summary are the square roots of the diagonal entries of σ̂² (XᵀX)⁻¹, where σ̂² divides the residual sum of squares by the residual degrees of freedom (here 313 − 2 = 311), while the standard errors reported by the AutoReg summary are the square roots of the diagonal entries of σ̂²_MLE (XᵀX)⁻¹, where σ̂²_MLE is the maximum likelihood estimate that divides the residual sum of squares by the number of effective observations (here 313), without the degrees-of-freedom correction.