import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pymc as pm
import arviz as az
np.set_printoptions(suppress=True, formatter={'float_kind':'{:f}'.format})
#this option makes Jupyter print numbers in ordinary (as opposed to scientific) notationLinear Regression and Poisson Regression¶
We shall review linear regression in the context of an applied dataset. We will also use this example to motivate Poisson Regression.
Review of Linear Regression with an Economics Dataset¶
The dataset (called MROZ) comes from an Econometrica paper by Mroz in 1987 and contains data on a bunch of variables for married women in the year 1975. This dataset is analyzed in the Econometrics book by Wooldridge (this book contains many other interesting economic and political datasets). The specific variables included in the Mroz dataset are:
inlf: binary variable equaling 1 if the individual worked (i.e., they were ‘in the labor force’) in the year 1975 and 0 otherwise
hours: number of hours worked in 1975
kidslt6: number of kids < 6 years of age
kidsge6: number of kids 6-18 years of age
age: age in years
educ: years of schooling
wage: hourly wage in 1975
repwage: reported wage at interview in 1976
hushrs: hours worked by husband in 1975
husage: husband’s age
huseduc: husband’s years of schooling
huswage: husband’s hourly wage in 1975
faminc: family income in 1975
mtr: federal marginal tax rate facing woman
motheduc: mother’s years of schooling
fatheduc: father’s years of schooling
unem: unemployment rate in county of residence
city: =1 if live in Standard metropolitan statistical area
exper: actual labor market experience
nwifeinc: (faminc - wage*hours)/1000
lwage: log(wage)
expersq: (the square of the experience variable)
#Import the MROZ.csv dataset
mroz = pd.read_csv("MROZ.csv")
print(mroz.head(12)) inlf hours kidslt6 kidsge6 age educ wage repwage hushrs husage \
0 1 1610 1 0 32 12 3.3540 2.65 2708 34
1 1 1656 0 2 30 12 1.3889 2.65 2310 30
2 1 1980 1 3 35 12 4.5455 4.04 3072 40
3 1 456 0 3 34 12 1.0965 3.25 1920 53
4 1 1568 1 2 31 14 4.5918 3.60 2000 32
5 1 2032 0 0 54 12 4.7421 4.70 1040 57
6 1 1440 0 2 37 16 8.3333 5.95 2670 37
7 1 1020 0 0 54 12 7.8431 9.98 4120 53
8 1 1458 0 2 48 12 2.1262 0.00 1995 52
9 1 1600 0 2 39 12 4.6875 4.15 2100 43
10 1 1969 0 1 33 12 4.0630 4.30 2450 34
11 1 1960 0 1 42 11 4.5918 4.58 2375 47
... faminc mtr motheduc fatheduc unem city exper nwifeinc \
0 ... 16310 0.7215 12 7 5.0 0 14 10.910060
1 ... 21800 0.6615 7 7 11.0 1 5 19.499981
2 ... 21040 0.6915 12 7 5.0 0 15 12.039910
3 ... 7300 0.7815 7 7 5.0 0 6 6.799996
4 ... 27300 0.6215 12 14 9.5 1 7 20.100058
5 ... 19495 0.6915 14 7 7.5 1 33 9.859054
6 ... 21152 0.6915 14 7 5.0 0 11 9.152048
7 ... 18900 0.6915 3 3 5.0 0 35 10.900038
8 ... 20405 0.7515 7 7 3.0 0 24 17.305000
9 ... 20425 0.6915 7 7 5.0 0 21 12.925000
10 ... 32300 0.5815 12 3 5.0 0 15 24.299953
11 ... 28700 0.6215 14 7 5.0 0 14 19.700071
lwage expersq
0 1.210154 196
1 0.328512 25
2 1.514138 225
3 0.092123 36
4 1.524272 49
5 1.556480 1089
6 2.120260 121
7 2.059634 1225
8 0.754336 576
9 1.544899 441
10 1.401922 225
11 1.524272 196
[12 rows x 22 columns]
Several regressions can be fit on this dataset. Let us start by fitting a linear regression model with “hours” as the response variable, and “kidslt6”, “kidsge6”, “age”, “educ”, “exper”, “expersq”, “huswage”, “huseduc”, “hushrs”, “motheduc” and “fatheduc” as the covariates (or predictor variables). Note that we are using both “exper” and “expersq” (which is the square of the variable “exper”) as covariates in the regression. This is because we would expect “hours” to increase with “exper” for small values of “exper” (for women new to the workforce) but to perhaps decrease with “exper” for large values of “exper” (for example, people with very large “exper” might be closer to retirement and work less leading to smaller “hours”). Such a relationship cannot be captured by linear functions but can be captured by quadratic functions which is why both “exper” and “expersq” are included in the model.
Using the statsmodels package in Python, this regression can be carried out in the following way.
#Several regressions can be fit on this dataset. Let us fit one with
#hours as the response variable, and
#kidslt6, kidsge6, age, educ, exper, expersq, huswage, huseduc, hushrs, motheduc and fatheduc
#as covariates
import statsmodels.api as sm
#Define the response variable and covariates
Y = mroz['hours']
X = mroz[['kidslt6', 'kidsge6', 'age', 'educ',
'hushrs', 'huseduc', 'huswage', 'motheduc',
'fatheduc', 'exper', 'expersq']].copy()
#Add a constant (intercept) to the model
X = sm.add_constant(X)
#Fit the model:
model = sm.OLS(Y, X).fit()
print(model.summary()) OLS Regression Results
==============================================================================
Dep. Variable: hours R-squared: 0.273
Model: OLS Adj. R-squared: 0.262
Method: Least Squares F-statistic: 25.30
Date: Tue, 10 Mar 2026 Prob (F-statistic): 1.05e-44
Time: 14:19:30 Log-Likelihood: -6045.7
No. Observations: 753 AIC: 1.212e+04
Df Residuals: 741 BIC: 1.217e+04
Df Model: 11
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1488.5175 293.748 5.067 0.000 911.839 2065.196
kidslt6 -439.3761 58.748 -7.479 0.000 -554.709 -324.043
kidsge6 -32.6212 23.202 -1.406 0.160 -78.171 12.928
age -30.4462 4.382 -6.948 0.000 -39.049 -21.844
educ 41.2569 16.470 2.505 0.012 8.924 73.590
hushrs -0.0635 0.049 -1.292 0.197 -0.160 0.033
huseduc -16.1572 12.297 -1.314 0.189 -40.298 7.983
huswage -13.7469 7.556 -1.819 0.069 -28.580 1.086
motheduc 10.9852 10.368 1.059 0.290 -9.370 31.340
fatheduc -4.0224 9.771 -0.412 0.681 -23.205 15.161
exper 65.9440 9.984 6.605 0.000 46.344 85.544
expersq -0.7289 0.325 -2.241 0.025 -1.368 -0.090
==============================================================================
Omnibus: 82.696 Durbin-Watson: 1.389
Prob(Omnibus): 0.000 Jarque-Bera (JB): 120.379
Skew: 0.786 Prob(JB): 7.25e-27
Kurtosis: 4.167 Cond. No. 2.54e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.54e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
The table above gives estimates of the regression coefficients and their standard errors for each of the covariates. These estimates and standard errors are computed in the following way. The data is denoted by for (here is the number of covariates and is the number of observations). One often represents the response data () in a vector called , and the covariate data for in a matrix as follows:
Note the additional column of ones in the matrix. In this notation, the estimates of the regression coefficients are computed by solving equation:
The standard errors are computed in the following way. First one computes the matrix:
The standard errors are given by the square roots of the diagonal entries of the matrix .
Looking back at the statsmodels regression summary, it is common practice to judge the importance of each variable by the value of the estimated regression coefficient as well as the corresponding standard error. The third column (the column titled t) of the regression output gives the ratio of the estimate of each regression coefficient to the corresponding standard error. Variables for which the magnitude of this ratio is large are considered important. For example, in this example, the variable “kidslt6” has a t value of -7.479 i.e., 7.479 in magnitude which is the largest of all other variables so this can be considered the most important variable among all the covariates. On the other hand, the variable “fatheduc” has t value equal to -0.412 (i.e., 0.412 in absolute value) which is small so this can be considered the least important variable. The right thing to compare these t-values are the quantiles of the Student t probability distribution with degrees of freedom. This comparison is done by the OLS function and its values are given in the fourth column of the regression table. This column (titled ) goes by the same ‘p-values’ and values close to zero indicate that the variable is important. For example, for ‘kidslt6’, this -value is rounded to zero while for ‘fathedu’, this -value is quit large equalling 0.681.
Usually one which looks at the table above and drops variables for which the -value is large. In this problem, it might be reasonable to drop the variables “motheduc”, “fatheduc”, “hushrs”, “huseduc” and “kidsge6” deeming these to be unimportant for explaining the response “hours”. Below we carry out linear regression on “hours” with the other six variables “kidslt6”, “age”, “educ”, “huswage”, “exper”, “expersq”.
Y = mroz['hours']
X = mroz[['kidslt6', 'age', 'educ',
'huswage', 'exper', 'expersq']].copy()
X = sm.add_constant(X) #add a constant (intercept) to the model
#Fit the model:
linmodel = sm.OLS(Y, X).fit()
print(linmodel.summary()) OLS Regression Results
==============================================================================
Dep. Variable: hours R-squared: 0.266
Model: OLS Adj. R-squared: 0.260
Method: Least Squares F-statistic: 44.99
Date: Tue, 10 Mar 2026 Prob (F-statistic): 4.67e-47
Time: 14:19:30 Log-Likelihood: -6049.5
No. Observations: 753 AIC: 1.211e+04
Df Residuals: 746 BIC: 1.215e+04
Df Model: 6
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1166.8778 243.738 4.787 0.000 688.384 1645.372
kidslt6 -433.1229 58.416 -7.414 0.000 -547.803 -318.443
age -28.4308 4.067 -6.991 0.000 -36.414 -20.447
educ 32.6255 12.827 2.543 0.011 7.444 57.807
huswage -13.9353 6.857 -2.032 0.042 -27.397 -0.474
exper 67.7980 9.896 6.851 0.000 48.371 87.225
expersq -0.7375 0.325 -2.270 0.023 -1.375 -0.100
==============================================================================
Omnibus: 78.707 Durbin-Watson: 1.383
Prob(Omnibus): 0.000 Jarque-Bera (JB): 111.647
Skew: 0.768 Prob(JB): 5.70e-25
Kurtosis: 4.095 Cond. No. 2.76e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.76e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Instead of using the package StatsModels (specifically the function sm.OLS), we can also fit this linear regression model via the Bayesian approach (in PyMC) with flat priors on the regression coefficients (as well as ). This is done as follows. Note that with these priors, it can be mathematically proved (by calculating the posterior explicitly) that the Bayesian approach gives identical results as the frequentist approach. This means that implementation in PyMC should give essentially the same results as found in the sm.OLS summary (with perhaps slight differences due to Monte Carlo fluctuations).
#We can also take the Bayesian Approach and use PyMC:
import pymc as pm
mrozmod = pm.Model()
with mrozmod:
# Priors for unknown model parameters
b0 = pm.Flat("b0")
b1 = pm.Flat("b1")
b2 = pm.Flat("b2")
b3 = pm.Flat("b3")
b4 = pm.Flat("b4")
b5 = pm.Flat("b5")
b6 = pm.Flat("b6")
log_sigma = pm.Flat("log_sigma")
sigma = pm.Deterministic("sigma", pm.math.exp(log_sigma))
# Expected value of outcome
mu = b0 + b1 * mroz['kidslt6'] + b2 * mroz['age'] + b3 * mroz['educ'] + b4 * mroz['huswage'] + b5 * mroz['exper'] + b6 * mroz['expersq']
# Likelihood
Y_obs = pm.Normal("Y_obs", mu=mu, sigma=sigma, observed=mroz['hours'])
idata = pm.sample(2000, chains = 2, return_inferencedata = True) Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [b0, b1, b2, b3, b4, b5, b6, log_sigma]
Sampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 5 seconds.
We recommend running at least 4 chains for robust computation of convergence diagnostics
The above PyMC code specifies the linear regression model in the following way:
Let us check that the Bayesian analysis leads to basically the same answers as statsmodels OLS.
b0_samples = idata.posterior['b0'].values.flatten()
b1_samples = idata.posterior['b1'].values.flatten()
b2_samples = idata.posterior['b2'].values.flatten()
b3_samples = idata.posterior['b3'].values.flatten()
b4_samples = idata.posterior['b4'].values.flatten()
b5_samples = idata.posterior['b5'].values.flatten()
b6_samples = idata.posterior['b6'].values.flatten()
allsamples = [b0_samples, b1_samples, b2_samples, b3_samples, b4_samples, b5_samples, b6_samples]
names = ['b0', 'b1', 'b2', 'b3', 'b4', 'b5', 'b6']
print("Parameter | Mean | Std. Dev. | Least Squares | Std. Error")
print("------------|----------|----------")
for i, (name, arr) in enumerate(zip(names, allsamples)):
print(f"{name:10}| {np.mean(arr):.6f} | {np.std(arr):.6f} | {linmodel.params.values[i]:.6f} | {linmodel.bse.values[i]:.6f}")Parameter | Mean | Std. Dev. | Least Squares | Std. Error
------------|----------|----------
b0 | 1179.207105 | 243.674265 | 1166.877797 | 243.737876
b1 | -435.235164 | 58.923978 | -433.122873 | 58.416292
b2 | -28.594224 | 4.066794 | -28.430794 | 4.066762
b3 | 32.212160 | 12.860550 | 32.625466 | 12.827177
b4 | -13.798977 | 6.858220 | -13.935253 | 6.857217
b5 | 67.599595 | 9.941135 | 67.797967 | 9.895837
b6 | -0.729025 | 0.324472 | -0.737492 | 0.324827
The estimated regression coefficent for “kidslt6” is about -430 which can be interpreted as follows: having a small child reduces the mean hours worked by about 430. This result is clearly meaningless for women with small working hours (e.g., if someone only works for about 300 hours, what does a reduction of 430 mean?). It is also not very meaningful for women with lots of working hours (e.g., if someone works for more than 3000 hours, a reduction of 430 due to a small kid sounds too small). A much more meaningful and useful interpretation would report reductions in terms of percentages and not absolute hours. For example, a statement such as “having a small child reduces the mean hours worked by about 40%” would apply to all working women and be much more interpretable. Such an intepretation cannot be drawn based on this linear regression.
A percentage interpretation can be realized if we change our model by taking the response variable to be as opposed to “hours”. Indeed, in the linear regression model for :
the coefficient gives the reduction in due to a small child. This implies a percentage increase of in hours worked because:
It is much more preferable to use as the response variable in these problems compared to . However, one big problem is that, in this dataset, there are quite a few observations for which the Hours variable equals 0. So we cannot really work with the variable . In such case, Poisson regression provides a great alternative model.
Poisson Regression¶
In Poisson regression, one modifies the linear regression model:
(which cannot be used as the variable is meaningless when ) in the following way:
The main change is that the response variable is now modelled as . The Poisson distribution is often used to model counts. In this case, is the number of hours worked in 1975 which is a count variable. The linear relationship between the response and covariates is specified through which enables percentage interpretation for the regression coefficients.
As in the case of linear regression, there are two ways of implementing Poisson regression:
Through an in-built function in the library statsmodels. This function is called sm.GLM. GLM stands for Generalized Linear Models which is a general class of models which includes linear regression and Poisson regression (and also logistic regression). While using sm.GLM, we need to specify which GLM we are fitting using the family argument.
Using PyMC to do Bayesian inference with flat priors: .
Most practitioners use the first option above. Let us first see how it works before turning to the Bayesian PyMC approach.
#Poisson Regression through StatsModels
# Define the response variable and covariates
Y = mroz['hours']
X = mroz[['kidslt6', 'age', 'educ',
'huswage', 'exper', 'expersq']].copy()
X = sm.add_constant(X) # Add a constant (intercept) to the model
# Fit the Poisson regression model
poiregmodel = sm.GLM(Y, X, family=sm.families.Poisson()).fit()
print(poiregmodel.summary()) Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: hours No. Observations: 753
Model: GLM Df Residuals: 746
Model Family: Poisson Df Model: 6
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -3.1563e+05
Date: Tue, 10 Mar 2026 Deviance: 6.2754e+05
Time: 14:19:37 Pearson chi2: 6.60e+05
No. Iterations: 5 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 6.9365 0.012 562.281 0.000 6.912 6.961
kidslt6 -0.8075 0.004 -193.217 0.000 -0.816 -0.799
age -0.0427 0.000 -201.166 0.000 -0.043 -0.042
educ 0.0528 0.001 83.439 0.000 0.052 0.054
huswage -0.0207 0.000 -54.548 0.000 -0.021 -0.020
exper 0.1204 0.001 219.231 0.000 0.119 0.121
expersq -0.0018 1.63e-05 -112.090 0.000 -0.002 -0.002
==============================================================================
The output above looks very similar to the linear regression output. This model is much more interpretable compared to the linear regression model that we fit earlier. To see this, consider the interpretation of the coefficient -0.8075 for the “kidslt6” variable: having a small kid reduces mean hours worked by 55%. This is clearly a much more interpretable result compared to before.
#56% comes from:
print((np.exp(poiregmodel.params['kidslt6']) - 1)*100)-55.40391074218212
Maximum Likelihood Estimation in Poisson Regression¶
Let us now understand how sm.GLM is obtaining the estimates and standard errors for the regression coefficients. The estimates are obtained by Maximum Likelihood Estimation. The likelihood in this Poisson Regression Model is given by
It is much more easier to maximize the logarithm of the likelihood (as opposed to the likelihood directly). We can also ignore the terms in the denominator as they do not involve the parameters . Let us also replace the sum by the simpler expression where denotes the column vector with components and denotes the column vector with components . With these changes, we obtain the log-likelihood:
The goal is to maximize this log-likelihood over . We can do this by taking the gradient (i.e., derivatives of with respect to collected in a column vector), setting the gradient to zero and solving the resulting set of equations. The gradient of is
where is the vector with entries . Therefore setting the gradient equal to zero leads us to the equation:
This gives equations for the parameters . Note that the parameters appear in the above equation through . It is interesting to observe that in the case of linear regression (where we have the normal distribution instead of Poisson), if we attempt to calculate the MLE in the same way as above, we would obtain exactly the same equation: . However, in that case, leading to the equation which can be solved as . Unfortunately, in the present Poisson regression case, , which has components , depends on in a nonlinear fashion and, consequently, it is not possible to solve for in closed form. One needs to resort to solving this equation using numerical methods.
MLE via Newton’s Method¶
A very classical method for solving nonlinear equations is the Newton method (also known as the Newton-Raphson method): see https://en.wikipedia.org/wiki/Newton’s_method. We apply this method to solve . Newton’s method is iterative (starts with a rough guess for the solution and then successively modifies into better approximations for the solution eventually converging to the correct solution). Newton’s interation for changing the current solution approximation into a better approximation of the solution to is given by:
where is the Hessian Matrix of the function evaluated at the point . It may be recalled the Hessian matrix is the matrix of second derivatives (see Hessian matrix). To actually, implement Newton’s method, we would need to calculate the Hessian. We can do this by further differentiating the gradient with respect to . It can be checked that this leads to:
We can rewrite this in matrix form as
where is the diagonal matrix with diagonal entries . Thus Newton’s method for Poisson regression is:
where we wrote to emphasize that this vector is calculated as with . This method can be implemented as follows.
#Newton's Method for Calculating MLE in Poisson Regression
beta_hat = poiregmodel.params.values
print(beta_hat)
#this is the answer computed by statsmodels and we shall show that Newton's method leads to the same answer if initialized reasonably
#Initialization for Newton's Method
m = 6
p = 7
beta_initial = [5, 0, 0, 0, 0, 0, 0]
n = mroz.shape[0]
Xmat = X.values
Yvec = mroz['hours'].values
#Newton's method for 100 iterations
num_iterations = 100
for i in range(num_iterations):
log_muvec = np.dot(Xmat, beta_initial)
muvec = np.exp(log_muvec)
gradient = np.dot(Xmat.T, Yvec - muvec)
M = np.diag(muvec)
Hessian = -Xmat.T @ M @ Xmat
Hessian_inv = np.linalg.inv(Hessian)
beta_initial = beta_initial - Hessian_inv @ gradient
print(beta_initial)[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[11.862361 -2.918359 -0.191565 0.219829 -0.093895 0.456819 -0.004969]
[10.885550 -2.863001 -0.189547 0.218271 -0.093150 0.450808 -0.004889]
[9.946436 -2.723529 -0.184228 0.214137 -0.091177 0.435017 -0.004679]
[9.097029 -2.412093 -0.170935 0.203596 -0.086180 0.395908 -0.004161]
[8.414084 -1.884261 -0.142079 0.179445 -0.074896 0.313363 -0.003085]
[7.865945 -1.335615 -0.097770 0.136900 -0.055375 0.198124 -0.001679]
[7.313642 -0.990725 -0.061477 0.090536 -0.034424 0.127770 -0.001108]
[7.011093 -0.838093 -0.045790 0.061125 -0.022797 0.113816 -0.001399]
[6.944282 -0.808540 -0.042825 0.053242 -0.020705 0.118791 -0.001758]
[6.936650 -0.807526 -0.042682 0.052832 -0.020712 0.120342 -0.001827]
[6.936480 -0.807524 -0.042681 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
Because of the exponential calculation in , one needs to be careful with initialization. For example, go back and change the initialization to . In this case, has some large entries leading to blowing up. If this can be avoided, the method converges to the correct solution.
Bayesian Analysis with Flat Prior¶
Before seeing how sm.GLM is calculating the standard errors, let us first look at the Bayesian approach for fitting the Poisson regression model with flat priors. For the Bayesian approach with flat priors, we have
The posterior is therefore given by
This posterior in cannot be written in terms of standard distributions. One can use PyMC to generate samples from this posterior. However, PyMC can be somewhat unstable and might lead to (significantly different) answers for different runs. For example, in the following code, setting randomseed = 0 gives results similar to the frequentist output while setting randomseed = 4 gives quite different results.
#We can also take the Bayesian Approach and use PyMC:
import pymc as pm
mrozpoimod = pm.Model()
with mrozpoimod:
# Priors for unknown model parameters
b0 = pm.Flat("b0")
b1 = pm.Flat("b1")
b2 = pm.Flat("b2")
b3 = pm.Flat("b3")
b4 = pm.Flat("b4")
b5 = pm.Flat("b5")
b6 = pm.Flat("b6")
log_mu = b0 + b1 * mroz['kidslt6'] + b2 * mroz['age'] + b3 * mroz['educ'] + b4 * mroz['huswage'] + b5 * mroz['exper'] + b6 * mroz['expersq']
# Likelihood
Y_obs = pm.Poisson("Y_obs", mu=np.exp(log_mu), observed=mroz['hours'])
idata = pm.sample(2000, chains = 2, random_seed = 0, return_inferencedata = True) Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [b0, b1, b2, b3, b4, b5, b6]
Sampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 92 seconds.
Chain 1 reached the maximum tree depth. Increase `max_treedepth`, increase `target_accept` or reparameterize.
We recommend running at least 4 chains for robust computation of convergence diagnostics
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters. A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details
b0_samples = idata.posterior['b0'].values.flatten()
b1_samples = idata.posterior['b1'].values.flatten()
b2_samples = idata.posterior['b2'].values.flatten()
b3_samples = idata.posterior['b3'].values.flatten()
b4_samples = idata.posterior['b4'].values.flatten()
b5_samples = idata.posterior['b5'].values.flatten()
b6_samples = idata.posterior['b6'].values.flatten()
allsamples = [b0_samples, b1_samples, b2_samples, b3_samples, b4_samples, b5_samples, b6_samples]
names = ['b0', 'b1', 'b2', 'b3', 'b4', 'b5', 'b6']
print("Parameter | Estimate | Std. Dev. | Frequentist | Std. Error")
print("------------|----------|----------")
for i, (name, arr) in enumerate(zip(names, allsamples)):
print(f"{name:8}| {np.mean(arr):.6f} | {np.std(arr):.6f} | {poiregmodel.params.values[i]:.6f} | {poiregmodel.bse.values[i]:.6f}")
#These results seem different from the frequentist output.
#PyMC is also not very reliable here. Change the random seed from 0 to 4
#and look at the results. Parameter | Estimate | Std. Dev. | Frequentist | Std. Error
------------|----------|----------
b0 | 3.200661 | 3.735618 | 6.936480 | 0.012336
b1 | -0.298308 | 0.509199 | -0.807524 | 0.004179
b2 | 0.080414 | 0.123097 | -0.042680 | 0.000212
b3 | -0.034505 | 0.087354 | 0.052831 | 0.000633
b4 | 0.001426 | 0.022144 | -0.020714 | 0.000380
b5 | -0.385071 | 0.505446 | 0.120372 | 0.000549
b6 | 0.009150 | 0.010979 | -0.001829 | 0.000016
Posterior Normal Approximation and Standard Errors¶
When the posterior cannot be written in terms of standard distributions, and when MCMC seems unreliable, an alternative option is to approximate the posterior using simple standard distributions. This idea is at the heart of the method of Variational Inference which is quite popular in Bayesian inference. In this problem, one can approximate the posterior via a (multivariate) normal distribution. This works in the following way.
Our posterior is proportional to . The normal distribution, on the other hand, is proportional to . Thus, in order to obtain a normal approximation of the posterior, it makes sense to approximate by a quadratic function of . This can be done by Taylor approximation but we need to figure out around which point does the Taylor approximation need to be done. The function is the log-likelihood which is maximized at the MLE . It makes sense, therefore, to do the Taylor approximation around . The idea is that the approximation will be quite accurate for points near the MLE . For points far from the MLE, the posterior density is probably small anyway so bad approximation for these points might have an insignificant effect. Motivated by these considerations, we Taylor expand the log-likelihood around up to second order as follows:
Thus the posterior is approximated as:
In the last step above, we ignored in proportionality because it does not involve the parameters .
Now we can compare the posterior approximation to the density of a multivariate normal (with mean and covariance ) which is proportional to . It is then clear that and . Thus the posterior normal approximation is given by:
This gives a very simple and practical approximation to the complicated posterior, especially because we already know how to calculate the Hessian (it was used in the Newton algorithm for computing the MLE). If one wants, posterior samples can be easily generated from . This will be much faster than using PyMC. However, PyMC tries to obtain samples from the actual posterior while this normal distribution is an approximation to the actual posterior.
Standard Errors for Poisson Regression¶
The standard errors of Poisson Regression that were reported by sm.GLM are actually computed from the covariance of the posterior normal approximation. In other words, the standard errors are simply the square-roots of the diagonal entries of the matrix:
where is the MLE and is the diagonal matrix with diagonal entries . This can be readily verified in the context of our dataset as follows.
#Standard Error Calculation:
log_muvec = np.dot(Xmat, beta_hat)
muvec = np.exp(log_muvec)
M = np.diag(muvec)
Hessian = -Xmat.T @ M @ Xmat
Hessian_inv = np.linalg.inv(Hessian)
CovMat = -Hessian_inv
print(np.sqrt(np.diag(CovMat)))
#Compare with
print(poiregmodel.bse)[0.012336 0.004179 0.000212 0.000633 0.000380 0.000549 0.000016]
const 0.012336
kidslt6 0.004179
age 0.000212
educ 0.000633
huswage 0.000380
exper 0.000549
expersq 0.000016
dtype: float64
These standard errors can also be justified using frequentist arguments but that justification is much more complicated (it involves asymptotics and notions such as Fisher information). The Bayesian justification is much simpler and intuitive (based on normal approximation via Taylor expansion).
The important takeaways are:
Poisson Regression is an important alternative to linear regression which is applicable when the response variable involves counts. It gives more interpretable results than linear regression by allowing one to do log-linear modeling in situations where one cannot directly take logarithms of the response variable due to existence of zero counts.
In the case of linear regression, frequentist inference and Bayesian inference with flat priors coincide exactly as can be rigorously mathematically proved.
For Poisson regression, frequentist inference uses the MLE and provides standard errors by taking square roots of the diagonal entries of the matrix . Bayesian inference with flat priors can give different answers to frequentist inference in Poisson regression. However, if one implements Bayesian inference by not working with the full posterior and instead approximating the posterior by a multivariate normal distribution centered at the MLE, then Bayesian inference gives identical answers to the frequentist inference.