
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pymc as pm
import arviz as az
np.set_printoptions(suppress=True, formatter={'float_kind':'{:f}'.format}) 
#this option makes Jupyter print numbers in ordinary (as opposed to scientific) notation

Linear Regression and Poisson Regression

We shall review linear regression in the context of an applied dataset. We will also use this example to motivate Poisson Regression.

Review of Linear Regression with an Economics Dataset

The dataset (called MROZ) comes from an Econometrica paper by Mroz in 1987 and contains data on a bunch of variables for married women in the year 1975. This dataset is analyzed in the Econometrics book by Wooldridge (this book contains many other interesting economic and political datasets). The specific variables included in the Mroz dataset are:

  1. inlf: binary variable equaling 1 if the individual worked (i.e., they were ‘in the labor force’) in the year 1975 and 0 otherwise

  2. hours: number of hours worked in 1975

  3. kidslt6: number of kids < 6 years of age

  4. kidsge6: number of kids 6-18 years of age

  5. age: age in years

  6. educ: years of schooling

  7. wage: hourly wage in 1975

  8. repwage: reported wage at interview in 1976

  9. hushrs: hours worked by husband in 1975

  10. husage: husband’s age

  11. huseduc: husband’s years of schooling

  12. huswage: husband’s hourly wage in 1975

  13. faminc: family income in 1975

  14. mtr: federal marginal tax rate facing woman

  15. motheduc: mother’s years of schooling

  16. fatheduc: father’s years of schooling

  17. unem: unemployment rate in county of residence

  18. city: =1 if live in Standard metropolitan statistical area

  19. exper: actual labor market experience

  20. nwifeinc: (faminc - wage*hours)/1000

  21. lwage: log(wage)

  22. expersq: $\text{exper}^2$ (the square of the experience variable)

#Import the MROZ.csv dataset
mroz = pd.read_csv("MROZ.csv")
print(mroz.head(12))
    inlf  hours  kidslt6  kidsge6  age  educ    wage  repwage  hushrs  husage  \
0      1   1610        1        0   32    12  3.3540     2.65    2708      34   
1      1   1656        0        2   30    12  1.3889     2.65    2310      30   
2      1   1980        1        3   35    12  4.5455     4.04    3072      40   
3      1    456        0        3   34    12  1.0965     3.25    1920      53   
4      1   1568        1        2   31    14  4.5918     3.60    2000      32   
5      1   2032        0        0   54    12  4.7421     4.70    1040      57   
6      1   1440        0        2   37    16  8.3333     5.95    2670      37   
7      1   1020        0        0   54    12  7.8431     9.98    4120      53   
8      1   1458        0        2   48    12  2.1262     0.00    1995      52   
9      1   1600        0        2   39    12  4.6875     4.15    2100      43   
10     1   1969        0        1   33    12  4.0630     4.30    2450      34   
11     1   1960        0        1   42    11  4.5918     4.58    2375      47   

    ...  faminc     mtr  motheduc  fatheduc  unem  city  exper   nwifeinc  \
0   ...   16310  0.7215        12         7   5.0     0     14  10.910060   
1   ...   21800  0.6615         7         7  11.0     1      5  19.499981   
2   ...   21040  0.6915        12         7   5.0     0     15  12.039910   
3   ...    7300  0.7815         7         7   5.0     0      6   6.799996   
4   ...   27300  0.6215        12        14   9.5     1      7  20.100058   
5   ...   19495  0.6915        14         7   7.5     1     33   9.859054   
6   ...   21152  0.6915        14         7   5.0     0     11   9.152048   
7   ...   18900  0.6915         3         3   5.0     0     35  10.900038   
8   ...   20405  0.7515         7         7   3.0     0     24  17.305000   
9   ...   20425  0.6915         7         7   5.0     0     21  12.925000   
10  ...   32300  0.5815        12         3   5.0     0     15  24.299953   
11  ...   28700  0.6215        14         7   5.0     0     14  19.700071   

       lwage  expersq  
0   1.210154      196  
1   0.328512       25  
2   1.514138      225  
3   0.092123       36  
4   1.524272       49  
5   1.556480     1089  
6   2.120260      121  
7   2.059634     1225  
8   0.754336      576  
9   1.544899      441  
10  1.401922      225  
11  1.524272      196  

[12 rows x 22 columns]

Several regressions can be fit on this dataset. Let us start by fitting a linear regression model with “hours” as the response variable, and “kidslt6”, “kidsge6”, “age”, “educ”, “exper”, “expersq”, “huswage”, “huseduc”, “hushrs”, “motheduc” and “fatheduc” as the covariates (or predictor variables). Note that we are using both “exper” and “expersq” (which is the square of the variable “exper”) as covariates in the regression. This is because we would expect “hours” to increase with “exper” for small values of “exper” (for women new to the workforce) but to perhaps decrease with “exper” for large values of “exper” (for example, people with very large “exper” might be closer to retirement and work less leading to smaller “hours”). Such a relationship cannot be captured by linear functions but can be captured by quadratic functions which is why both “exper” and “expersq” are included in the model.

Using the statsmodels package in Python, this regression can be carried out in the following way.

#Several regressions can be fit on this dataset. Let us fit one with
#hours as the response variable, and
#kidslt6, kidsge6, age, educ, exper, expersq, huswage, huseduc, hushrs, motheduc and fatheduc
#as covariates
import statsmodels.api as sm
#Define the response variable and covariates
Y = mroz['hours']
X = mroz[['kidslt6', 'kidsge6', 'age', 'educ', 
       'hushrs',  'huseduc', 'huswage',  'motheduc',
       'fatheduc', 'exper', 'expersq']].copy()
#Add a constant (intercept) to the model
X = sm.add_constant(X)

#Fit the model: 
model = sm.OLS(Y, X).fit()
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  hours   R-squared:                       0.273
Model:                            OLS   Adj. R-squared:                  0.262
Method:                 Least Squares   F-statistic:                     25.30
Date:                Tue, 10 Mar 2026   Prob (F-statistic):           1.05e-44
Time:                        14:19:30   Log-Likelihood:                -6045.7
No. Observations:                 753   AIC:                         1.212e+04
Df Residuals:                     741   BIC:                         1.217e+04
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       1488.5175    293.748      5.067      0.000     911.839    2065.196
kidslt6     -439.3761     58.748     -7.479      0.000    -554.709    -324.043
kidsge6      -32.6212     23.202     -1.406      0.160     -78.171      12.928
age          -30.4462      4.382     -6.948      0.000     -39.049     -21.844
educ          41.2569     16.470      2.505      0.012       8.924      73.590
hushrs        -0.0635      0.049     -1.292      0.197      -0.160       0.033
huseduc      -16.1572     12.297     -1.314      0.189     -40.298       7.983
huswage      -13.7469      7.556     -1.819      0.069     -28.580       1.086
motheduc      10.9852     10.368      1.059      0.290      -9.370      31.340
fatheduc      -4.0224      9.771     -0.412      0.681     -23.205      15.161
exper         65.9440      9.984      6.605      0.000      46.344      85.544
expersq       -0.7289      0.325     -2.241      0.025      -1.368      -0.090
==============================================================================
Omnibus:                       82.696   Durbin-Watson:                   1.389
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              120.379
Skew:                           0.786   Prob(JB):                     7.25e-27
Kurtosis:                       4.167   Cond. No.                     2.54e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.54e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

The table above gives estimates of the regression coefficients and their standard errors for each of the covariates. These estimates and standard errors are computed in the following way. The data is denoted by $(y_i, x_{i1}, x_{i2}, \dots, x_{im})$ for $i = 1, \dots, n$ (here $m$ is the number of covariates and $n$ is the number of observations). One often collects the response data $y_1, \dots, y_n$ in an $n \times 1$ vector $Y$, and the covariate data $(x_{i1}, \dots, x_{im})$ for $i = 1, \dots, n$ in an $n \times (m+1)$ matrix $X$ as follows:

$$
\begin{align*}
Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}
~~\text{ and }~~
X = \begin{pmatrix}
1 & x_{11} & \cdots & x_{1m} \\
\vdots & \vdots &  & \vdots \\
1 & x_{i1} & \cdots & x_{im} \\
\vdots & \vdots &  & \vdots \\
1 & x_{n1} & \cdots & x_{nm}
\end{pmatrix}
\end{align*}
$$

Note the additional column of ones in the $X$ matrix. In this notation, the estimates of the regression coefficients are computed by solving the normal equations:

$$
\begin{align*}
X^T X \beta = X^T Y ~~~\text{ leading to }~~~ \hat{\beta} = (X^T X)^{-1} X^T Y.
\end{align*}
$$

The standard errors are computed in the following way. First one computes the matrix:

$$
\begin{align*}
\hat{\sigma}^2 (X^T X)^{-1} ~~~\text{ where }~~ \hat{\sigma} := \sqrt{\frac{\sum_{i=1}^n \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - \dots - \hat{\beta}_m x_{im} \right)^2}{n-m-1}}
\end{align*}
$$

The standard errors are given by the square roots of the diagonal entries of the $(m+1) \times (m+1)$ matrix $\hat{\sigma}^2 (X^T X)^{-1}$.
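As a quick sketch, the formulas above can be checked directly with NumPy on simulated data (the dimensions and coefficient values here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 3                                                # illustrative sizes
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])   # column of ones first
beta_true = np.array([2.0, 1.0, -0.5, 0.3])
Y = X @ beta_true + rng.normal(scale=1.5, size=n)

# beta_hat solves the normal equations X^T X beta = X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# sigma_hat uses n - m - 1 degrees of freedom
resid = Y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (n - m - 1))

# standard errors: square roots of the diagonal of sigma_hat^2 (X^T X)^{-1}
std_errors = np.sqrt(np.diag(sigma_hat**2 * np.linalg.inv(X.T @ X)))
print(beta_hat)
print(std_errors)
```

A useful sanity check: at $\hat{\beta}$ the residuals are orthogonal to every column of $X$, i.e. $X^T(Y - X\hat{\beta}) = 0$.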

Looking back at the statsmodels regression summary, it is common practice to judge the importance of each variable by the value of the estimated regression coefficient as well as the corresponding standard error. The third column (titled t) of the regression output gives the ratio of each estimated regression coefficient to its standard error. Variables for which the magnitude of this ratio is large are considered important. For example, the variable “kidslt6” has a t value of -7.479, i.e., 7.479 in magnitude, which is the largest among all the covariates, so it can be considered the most important variable. On the other hand, the variable “fatheduc” has a t value of -0.412 (i.e., 0.412 in absolute value), which is small, so it can be considered the least important variable. The right benchmark against which to compare these t-values is the set of quantiles of the Student t distribution with $n-m-1$ degrees of freedom. This comparison is done by the OLS function and its results are given in the fourth column of the regression table. This column (titled P>|t|) reports the familiar ‘p-values’; values close to zero indicate that the variable is important. For example, for ‘kidslt6’, this p-value is rounded to zero, while for ‘fatheduc’, this p-value is quite large, equaling 0.681.
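The p-values in the table can be reproduced from the t values as two-sided tail probabilities of the Student t distribution; a small sketch using scipy (the t values are those reported above):

```python
from scipy import stats

# n = 753 observations, m = 11 covariates in the first regression
n, m = 753, 11
df = n - m - 1                        # degrees of freedom of the reference t distribution
for t_value in (-7.479, -0.412):      # t values for kidslt6 and fatheduc from the table
    # two-sided p-value: P(|T| > |t|) for T ~ Student t with df degrees of freedom
    p_value = 2 * stats.t.sf(abs(t_value), df=df)
    print(round(p_value, 3))
```

The first p-value is essentially zero and the second is close to the 0.681 shown in the summary.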

Usually one looks at the table above and drops variables for which the p-value $P>|t|$ is large. In this problem, it might be reasonable to drop the variables “motheduc”, “fatheduc”, “hushrs”, “huseduc” and “kidsge6”, deeming these unimportant for explaining the response “hours”. Below we carry out linear regression of “hours” on the remaining six variables “kidslt6”, “age”, “educ”, “huswage”, “exper” and “expersq”.

Y = mroz['hours']
X = mroz[['kidslt6', 'age', 'educ', 
        'huswage', 'exper', 'expersq']].copy()
X = sm.add_constant(X) #add a constant (intercept) to the model
#Fit the model: 
linmodel = sm.OLS(Y, X).fit()
print(linmodel.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  hours   R-squared:                       0.266
Model:                            OLS   Adj. R-squared:                  0.260
Method:                 Least Squares   F-statistic:                     44.99
Date:                Tue, 10 Mar 2026   Prob (F-statistic):           4.67e-47
Time:                        14:19:30   Log-Likelihood:                -6049.5
No. Observations:                 753   AIC:                         1.211e+04
Df Residuals:                     746   BIC:                         1.215e+04
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       1166.8778    243.738      4.787      0.000     688.384    1645.372
kidslt6     -433.1229     58.416     -7.414      0.000    -547.803    -318.443
age          -28.4308      4.067     -6.991      0.000     -36.414     -20.447
educ          32.6255     12.827      2.543      0.011       7.444      57.807
huswage      -13.9353      6.857     -2.032      0.042     -27.397      -0.474
exper         67.7980      9.896      6.851      0.000      48.371      87.225
expersq       -0.7375      0.325     -2.270      0.023      -1.375      -0.100
==============================================================================
Omnibus:                       78.707   Durbin-Watson:                   1.383
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              111.647
Skew:                           0.768   Prob(JB):                     5.70e-25
Kurtosis:                       4.095   Cond. No.                     2.76e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.76e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
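The quadratic in “exper” discussed earlier is visible in these estimates: hours increase with experience at first and then decline. Using the fitted coefficients from the table above, the turning point of the quadratic $\beta_5\,\text{exper} + \beta_6\,\text{exper}^2$ is at $-\beta_5/(2\beta_6)$:

```python
# Turning point of the fitted quadratic in exper (coefficients from the table above)
b_exper = 67.7980
b_expersq = -0.7375
exper_star = -b_exper / (2 * b_expersq)   # vertex of b5*x + b6*x^2
print(exper_star)                          # roughly 46 years of experience
```

So, under this fitted model, predicted hours peak at around 46 years of labor market experience and decline thereafter.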

Instead of using the package statsmodels (specifically the function sm.OLS), we can also fit this linear regression model via the Bayesian approach (in PyMC) with flat priors on the regression coefficients (as well as on $\log \sigma$). This is done as follows. Note that with these priors, it can be proved mathematically (by calculating the posterior explicitly) that the Bayesian approach gives identical results to the frequentist approach. This means that the implementation in PyMC should give essentially the same results as the sm.OLS summary (with perhaps slight differences due to Monte Carlo fluctuations).

#We can also take the Bayesian Approach and use PyMC:
import pymc as pm
mrozmod = pm.Model()
with mrozmod:
    # Priors for unknown model parameters
    b0 = pm.Flat("b0")
    b1 = pm.Flat("b1")
    b2 = pm.Flat("b2")
    b3 = pm.Flat("b3")
    b4 = pm.Flat("b4")
    b5 = pm.Flat("b5")
    b6 = pm.Flat("b6")
    log_sigma = pm.Flat("log_sigma")             
    sigma = pm.Deterministic("sigma", pm.math.exp(log_sigma))
    # Expected value of outcome
    mu = b0 + b1 * mroz['kidslt6'] +   b2 * mroz['age'] + b3 * mroz['educ'] + b4 * mroz['huswage'] + b5 * mroz['exper'] + b6 * mroz['expersq']
    # Likelihood
    Y_obs = pm.Normal("Y_obs", mu=mu, sigma=sigma, observed=mroz['hours'])
    idata = pm.sample(2000, chains = 2, return_inferencedata = True)  
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [b0, b1, b2, b3, b4, b5, b6, log_sigma]
Sampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 5 seconds.
We recommend running at least 4 chains for robust computation of convergence diagnostics

The above PyMC code specifies the linear regression model in the following way:

$$
\begin{align*}
&Y_i \sim N(\mu_i, \sigma^2) ~~\text{ for } i = 1, \dots, n \\
&\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_m x_{im} ~~\text{ for } i = 1, \dots, n \\
&\beta_0, \beta_1, \dots, \beta_m, \log \sigma \overset{\text{i.i.d}}{\sim} \text{Flat} = \text{Uniform}[-C, C] \text{ for very large } C
\end{align*}
$$

Let us check that the Bayesian analysis leads to basically the same answers as statsmodels OLS.

b0_samples = idata.posterior['b0'].values.flatten()
b1_samples = idata.posterior['b1'].values.flatten()
b2_samples = idata.posterior['b2'].values.flatten()
b3_samples = idata.posterior['b3'].values.flatten()
b4_samples = idata.posterior['b4'].values.flatten()
b5_samples = idata.posterior['b5'].values.flatten()
b6_samples = idata.posterior['b6'].values.flatten()

allsamples = [b0_samples, b1_samples, b2_samples, b3_samples, b4_samples, b5_samples, b6_samples]
names = ['b0', 'b1', 'b2', 'b3', 'b4', 'b5', 'b6']
print("Parameter   | Mean     | Std. Dev. | Least Squares | Std. Error")
print("------------|----------|-----------|---------------|-----------")
for i, (name, arr) in enumerate(zip(names, allsamples)):
    print(f"{name:10}| {np.mean(arr):.6f} | {np.std(arr):.6f} | {linmodel.params.values[i]:.6f}  | {linmodel.bse.values[i]:.6f}")
Parameter   | Mean     | Std. Dev. | Least Squares | Std. Error
------------|----------|-----------|---------------|-----------
b0        | 1179.207105 | 243.674265 | 1166.877797  | 243.737876
b1        | -435.235164 | 58.923978 | -433.122873  | 58.416292
b2        | -28.594224 | 4.066794 | -28.430794  | 4.066762
b3        | 32.212160 | 12.860550 | 32.625466  | 12.827177
b4        | -13.798977 | 6.858220 | -13.935253  | 6.857217
b5        | 67.599595 | 9.941135 | 67.797967  | 9.895837
b6        | -0.729025 | 0.324472 | -0.737492  | 0.324827

The estimated regression coefficient $\hat{\beta}_1$ for “kidslt6” is about -430, which can be interpreted as follows: having a small child reduces the mean hours worked by about 430. This result is clearly meaningless for women with small working hours (e.g., if someone only works about 300 hours, what does a reduction of 430 mean?). It is also not very meaningful for women with lots of working hours (e.g., if someone works more than 3000 hours, a reduction of 430 due to a small kid sounds too small). A much more meaningful and useful interpretation would report reductions in terms of percentages and not absolute hours. For example, a statement such as “having a small child reduces the mean hours worked by about 40%” would apply to all working women and be much more interpretable. Such an interpretation cannot be drawn from this linear regression.

A percentage interpretation can be realized if we change our model by taking the response variable to be $\log(\text{Hours worked})$ as opposed to “hours”. Indeed, in the linear regression model for $\log(\text{Hours})$:

$$
\begin{align*}
\log(\text{Hours}) = \beta_0 + \beta_1 \,\text{kidslt6} + \beta_2 \,\text{age} + \dots,
\end{align*}
$$

the coefficient $\beta_1$ gives the change in $\log(\text{Hours})$ due to a small child. This implies a percentage change of $100 \times (\exp(\beta_1) - 1)$ in hours worked because:

$$
\begin{align*}
\frac{\text{Hours}_{\text{new}} - \text{Hours}_{\text{old}}}{\text{Hours}_{\text{old}}} \times 100 = \left[\exp \left(\log \text{Hours}_{\text{new}} - \log \text{Hours}_{\text{old}} \right) - 1\right] \times 100
\end{align*}
$$
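A quick numeric check of this identity (the hours values here are made up for illustration):

```python
import numpy as np

hours_old, hours_new = 1500.0, 900.0   # illustrative values
pct_direct = (hours_new - hours_old) / hours_old * 100
pct_via_logs = (np.exp(np.log(hours_new) - np.log(hours_old)) - 1) * 100
print(pct_direct, pct_via_logs)        # both approximately -40
```

Both sides give the same percentage change, which is exactly why a coefficient on the log scale translates into $100 \times (\exp(\beta_1) - 1)$ percent.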

It is preferable to use $\log(\text{Hours})$ as the response variable in these problems rather than $\text{Hours}$. However, one big problem is that, in this dataset, there are quite a few observations for which the hours variable equals 0, so we cannot work with $\log \text{Hours}$. In such cases, Poisson regression provides a great alternative model.

Poisson Regression

In Poisson regression, one modifies the linear regression model:

$$
\begin{align*}
&\log(Y_i) \sim N(\mu_i, \sigma^2) ~~\text{ for } i = 1, \dots, n \\
&\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_m x_{im} ~~\text{ for } i = 1, \dots, n
\end{align*}
$$

(which cannot be used as the variable $\log(Y_i)$ is meaningless when $Y_i = 0$) in the following way:

$$
\begin{align*}
&Y_i \sim \text{Poisson}(\mu_i) ~~\text{ for } i = 1, \dots, n \\
&\log \mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_m x_{im} ~~\text{ for } i = 1, \dots, n
\end{align*}
$$

The main change is that the response variable $Y_i$ is now modelled as $\text{Poisson}(\mu_i)$. The Poisson distribution is often used to model counts. In this case, $Y_i$ is the number of hours worked in 1975, which is a count variable. The linear relationship between the response and covariates is specified through $\log \mu_i$, which enables a percentage interpretation for the regression coefficients.

As in the case of linear regression, there are two ways of implementing Poisson regression:

  1. Through an in-built function in the library statsmodels. This function is called sm.GLM. GLM stands for Generalized Linear Models, a general class of models that includes linear regression and Poisson regression (and also logistic regression). When using sm.GLM, we need to specify which GLM we are fitting via the family argument.

  2. Using PyMC to do Bayesian inference with flat priors: $\beta_0, \dots, \beta_m \overset{\text{i.i.d}}{\sim} \text{Flat}$.

Most practitioners use the first option above. Let us first see how it works before turning to the Bayesian PyMC approach.

#Poisson Regression through StatsModels
# Define the response variable and covariates
Y = mroz['hours']
X = mroz[['kidslt6', 'age', 'educ', 
        'huswage', 'exper', 'expersq']].copy()
X = sm.add_constant(X) # Add a constant (intercept) to the model
# Fit the Poisson regression model
poiregmodel = sm.GLM(Y, X, family=sm.families.Poisson()).fit()
print(poiregmodel.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                  hours   No. Observations:                  753
Model:                            GLM   Df Residuals:                      746
Model Family:                 Poisson   Df Model:                            6
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:            -3.1563e+05
Date:                Tue, 10 Mar 2026   Deviance:                   6.2754e+05
Time:                        14:19:37   Pearson chi2:                 6.60e+05
No. Iterations:                     5   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          6.9365      0.012    562.281      0.000       6.912       6.961
kidslt6       -0.8075      0.004   -193.217      0.000      -0.816      -0.799
age           -0.0427      0.000   -201.166      0.000      -0.043      -0.042
educ           0.0528      0.001     83.439      0.000       0.052       0.054
huswage       -0.0207      0.000    -54.548      0.000      -0.021      -0.020
exper          0.1204      0.001    219.231      0.000       0.119       0.121
expersq       -0.0018   1.63e-05   -112.090      0.000      -0.002      -0.002
==============================================================================

The output above looks very similar to the linear regression output. This model is much more interpretable than the linear regression model that we fit earlier. To see this, consider the interpretation of the coefficient -0.8075 for the “kidslt6” variable: having a small kid reduces mean hours worked by about 55%, since $100 \times (\exp(-0.8075) - 1) \approx -55$. This is clearly a much more interpretable result than before.

#55% comes from:
print((np.exp(poiregmodel.params['kidslt6']) - 1)*100)
-55.40391074218212

Maximum Likelihood Estimation in Poisson Regression

Let us now understand how sm.GLM is obtaining the estimates and standard errors for the regression coefficients. The estimates are obtained by Maximum Likelihood Estimation. The likelihood in this Poisson Regression Model is given by

$$
\begin{align*}
\text{likelihood} &= \prod_{i=1}^n \frac{e^{-\mu_i} \mu_i^{y_i}}{y_i!} \\
&= \prod_{i=1}^n \frac{e^{-e^{\beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im}}} \, e^{y_i(\beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im})}}{y_i!}
\end{align*}
$$

It is much easier to maximize the logarithm of the likelihood (as opposed to the likelihood directly). We can also ignore the $y_i!$ terms in the denominator as they do not involve the parameters $\beta_0, \dots, \beta_m$. Let us also replace the sum $\beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im}$ by the simpler expression $x_i^T \beta$, where $\beta$ denotes the column vector with components $\beta_0, \dots, \beta_m$ and $x_i$ denotes the column vector with components $1, x_{i1}, x_{i2}, \dots, x_{im}$. With these changes, we obtain the log-likelihood:

$$
\begin{align*}
\ell(\beta) = \sum_{i=1}^n \left(-e^{x_i^T \beta} + y_i x_i^T \beta \right)
\end{align*}
$$

The goal is to maximize this log-likelihood over $\beta_0, \beta_1, \dots, \beta_m$. We can do this by taking the gradient (i.e., the derivatives of $\ell(\beta)$ with respect to $\beta_0, \dots, \beta_m$ collected in a column vector), setting the gradient to zero and solving the resulting set of equations. The gradient of $\ell(\beta)$ is

$$
\begin{align*}
\nabla \ell(\beta) = \sum_{i=1}^n \left(-e^{x_i^T \beta} x_i + y_i x_i \right) = \sum_{i=1}^n \left(y_i - e^{x_i^T \beta} \right) x_i = \sum_{i=1}^n \left(y_i - \mu_i \right) x_i = X^T(Y - \mu),
\end{align*}
$$

where $\mu$ is the $n \times 1$ vector with entries $\mu_1, \dots, \mu_n$. Therefore setting the gradient equal to zero leads us to the equation:

$$
\begin{align*}
X^T \mu = X^T Y.
\end{align*}
$$

This gives $m+1$ equations for the $m+1$ parameters $\beta_0, \dots, \beta_m$. Note that the parameters appear in the above equation through $\mu$. It is interesting to observe that in the case of linear regression (where we have the normal distribution instead of the Poisson), attempting to calculate the MLE in the same way yields exactly the same equation: $X^T \mu = X^T Y$. However, in that case, $\mu = X \beta$, leading to the equation $X^T X \beta = X^T Y$, which can be solved as $\hat{\beta} = (X^T X)^{-1} X^T Y$. Unfortunately, in the present Poisson regression case, $\mu$, which has components $e^{x_1^T \beta}, \dots, e^{x_n^T \beta}$, depends on $\beta$ in a nonlinear fashion and, consequently, it is not possible to solve $X^T \mu = X^T Y$ for $\beta$ in closed form. One needs to resort to numerical methods.

MLE via Newton’s Method

A very classical method for solving nonlinear equations is Newton’s method (also known as the Newton-Raphson method): see https://en.wikipedia.org/wiki/Newton%27s_method. We apply this method to solve $\nabla \ell(\beta) = X^T Y - X^T \mu = 0$. Newton’s method is iterative: it starts with a rough guess $\beta^{(0)}$ for the solution and then successively modifies it into better approximations, eventually converging to the correct solution. Newton’s iteration for changing the current approximation $\beta^{(m)}$ into a better approximation $\beta^{(m+1)}$ of the solution to $\nabla \ell(\beta) = 0$ is given by:

$$
\begin{align*}
\beta^{(m+1)} = \beta^{(m)} - \left(H\ell(\beta^{(m)})\right)^{-1} \nabla \ell(\beta^{(m)}),
\end{align*}
$$

where $H\ell(\beta^{(m)})$ is the Hessian matrix of the function $\ell(\beta)$ evaluated at the point $\beta = \beta^{(m)}$. Recall that the Hessian matrix is the matrix of second derivatives. To actually implement Newton’s method, we need to calculate the Hessian. We can do this by further differentiating the gradient $\nabla \ell(\beta) = \sum_{i=1}^n (y_i - e^{x_i^T \beta}) x_i = X^T(Y - \mu)$ with respect to $\beta$. It can be checked that this leads to:

$$
\begin{align*}
H \ell(\beta) = - \sum_{i=1}^n e^{x_i^T \beta} x_i x_i^T = - \sum_{i=1}^n \mu_i x_i x_i^T.
\end{align*}
$$

We can rewrite this in matrix form as

$$
\begin{align*}
H \ell(\beta) = - X^T M(\beta) X
\end{align*}
$$

where $M(\beta)$ is the diagonal matrix with diagonal entries $\mu_1 = e^{x_1^T \beta}, \dots, \mu_n = e^{x_n^T \beta}$. Thus Newton’s method for Poisson regression is:

$$
\begin{align*}
\beta^{(m+1)} = \beta^{(m)} + \left(X^T M(\beta^{(m)}) X \right)^{-1} \left( X^T Y - X^T \mu(\beta^{(m)}) \right),
\end{align*}
$$

where we wrote $\mu(\beta^{(m)})$ to emphasize that the vector $\mu$ is calculated with entries $e^{x_1^T \beta}, \dots, e^{x_n^T \beta}$ at $\beta = \beta^{(m)}$. This method can be implemented as follows.

#Newton's Method for Calculating MLE in Poisson Regression
beta_hat = poiregmodel.params.values 
print(beta_hat)
#this is the answer computed by statsmodels and we shall show that Newton's method leads to the same answer if initialized reasonably

#Initialization for Newton's Method
beta_current = np.array([5.0, 0, 0, 0, 0, 0, 0])
Xmat = X.values
Yvec = mroz['hours'].values

#Newton's method for 100 iterations
num_iterations = 100
for i in range(num_iterations):
    log_muvec = np.dot(Xmat, beta_current)    #log(mu) = X beta
    muvec = np.exp(log_muvec)                 #mu(beta)
    gradient = np.dot(Xmat.T, Yvec - muvec)   #gradient: X^T (Y - mu)
    M = np.diag(muvec)                        #diagonal matrix M(beta)
    Hessian = -Xmat.T @ M @ Xmat              #Hessian: -X^T M(beta) X
    Hessian_inv = np.linalg.inv(Hessian)
    beta_current = beta_current - Hessian_inv @ gradient
    print(beta_current)
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[11.862361 -2.918359 -0.191565 0.219829 -0.093895 0.456819 -0.004969]
[10.885550 -2.863001 -0.189547 0.218271 -0.093150 0.450808 -0.004889]
[9.946436 -2.723529 -0.184228 0.214137 -0.091177 0.435017 -0.004679]
[9.097029 -2.412093 -0.170935 0.203596 -0.086180 0.395908 -0.004161]
[8.414084 -1.884261 -0.142079 0.179445 -0.074896 0.313363 -0.003085]
[7.865945 -1.335615 -0.097770 0.136900 -0.055375 0.198124 -0.001679]
[7.313642 -0.990725 -0.061477 0.090536 -0.034424 0.127770 -0.001108]
[7.011093 -0.838093 -0.045790 0.061125 -0.022797 0.113816 -0.001399]
[6.944282 -0.808540 -0.042825 0.053242 -0.020705 0.118791 -0.001758]
[6.936650 -0.807526 -0.042682 0.052832 -0.020712 0.120342 -0.001827]
[6.936480 -0.807524 -0.042681 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]
[6.936480 -0.807524 -0.042680 0.052831 -0.020714 0.120372 -0.001829]

Because of the exponential calculation in μi=exiTβ\mu_i = e^{x_i^T \beta}, one needs to be careful with initialization. For example, go back and change the initialization to β(0)=(0,,0)\beta^{(0)} = (0, \dots, 0). In this case, β(1)\beta^{(1)} has some large entries leading to exiTβ(1)e^{x_i^T \beta^{(1)}} blowing up. If this can be avoided, the method converges to the correct solution.
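One standard safeguard against such blow-ups (not used in the code above) is step-halving: shrink the Newton step until the log-likelihood stops decreasing. Below is a minimal sketch of this idea on simulated data; the variable names and the simulated design are hypothetical and not tied to the mroz dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

#Simulated Poisson regression data (hypothetical; not the mroz dataset)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 0.5])
y = rng.poisson(np.exp(X @ beta_true))

def log_lik(beta):
    #log-likelihood up to the constant -sum(log(y_i!))
    eta = X @ beta
    return np.sum(y * eta - np.exp(eta))

beta = np.zeros(2)  #an initialization that plain Newton may handle poorly
for _ in range(50):
    mu = np.exp(X @ beta)
    grad = X.T @ (y - mu)
    H = -X.T @ (mu[:, None] * X)      #Hessian without forming diag(mu) explicitly
    step = -np.linalg.solve(H, grad)  #full Newton step
    t = 1.0
    #halve the step until the log-likelihood no longer decreases
    while log_lik(beta + t * step) < log_lik(beta) and t > 1e-8:
        t = t / 2
    beta = beta + t * step
print(beta)  #should be close to beta_true
```

Because the log-likelihood is concave and the Newton direction is an ascent direction, a small enough fraction of the step always improves the log-likelihood, so the damped iteration cannot diverge the way the undamped one can.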

Bayesian Analysis with Flat Prior

Before seeing how sm.GLM calculates the standard errors, let us first look at the Bayesian approach to fitting the Poisson regression model with flat priors. Here we have

Prior=1    and    Likelihood=i=1neeβ0+β1xi1++βmximeyi(β0+β1xi1++βmxim)yi!exp((β)).\begin{align*} \text{Prior} = 1 ~~~ \text{ and } ~~~ \text{Likelihood} = \prod_{i=1}^n \frac{e^{-e^{\beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im}}} e^{y_i(\beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im})}}{y_i!} \propto \exp(\ell(\beta)). \end{align*}

The posterior is therefore given by

Posterior(β)exp((β))=exp(i=1n(exiTβ+yixiTβ))\begin{align*} \text{Posterior}(\beta) \propto \exp \left(\ell(\beta) \right) = \exp \left(\sum_{i=1}^n \left(-e^{x_i^T \beta} + y_i x_i^T \beta \right) \right) \end{align*}
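Since the log of this unnormalized posterior is just the Poisson log-likelihood, it is easy to evaluate directly. The following sketch, on simulated data with hypothetical variable names, checks that it is larger near the data-generating coefficients than at an arbitrary point:

```python
import numpy as np

rng = np.random.default_rng(1)

#Simulated Poisson regression data (hypothetical; not the mroz dataset)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.poisson(np.exp(X @ np.array([0.5, 0.3])))

def log_posterior(beta):
    #log of the unnormalized flat-prior posterior:
    #sum_i (-exp(x_i^T beta) + y_i * x_i^T beta)
    eta = X @ beta
    return np.sum(-np.exp(eta) + y * eta)

print(log_posterior(np.array([0.5, 0.3])))  #near the maximizer
print(log_posterior(np.array([0.0, 0.0])))  #noticeably smaller
```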

This posterior in β0,,βm\beta_0, \dots, \beta_m cannot be written in terms of standard distributions. One can use PyMC to generate samples from this posterior. However, PyMC can be somewhat unstable here and can give significantly different answers on different runs. For example, in the following code, setting random_seed = 0 gives results similar to the frequentist output while setting random_seed = 4 gives quite different results.

#We can also take the Bayesian Approach and use PyMC:
import pymc as pm
mrozpoimod = pm.Model()
with mrozpoimod:
    # Priors for unknown model parameters
    b0 = pm.Flat("b0")
    b1 = pm.Flat("b1")
    b2 = pm.Flat("b2")
    b3 = pm.Flat("b3")
    b4 = pm.Flat("b4")
    b5 = pm.Flat("b5")
    b6 = pm.Flat("b6")
    log_mu = b0 + b1 * mroz['kidslt6'] +   b2 * mroz['age'] + b3 * mroz['educ'] + b4 * mroz['huswage'] + b5 * mroz['exper'] + b6 * mroz['expersq']
    # Likelihood
    Y_obs = pm.Poisson("Y_obs", mu=np.exp(log_mu), observed=mroz['hours'])
    idata = pm.sample(2000, chains = 2, random_seed = 0, return_inferencedata = True)  
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [b0, b1, b2, b3, b4, b5, b6]
Loading...
Loading...
Sampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 92 seconds.
Chain 1 reached the maximum tree depth. Increase `max_treedepth`, increase `target_accept` or reparameterize.
We recommend running at least 4 chains for robust computation of convergence diagnostics
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details
b0_samples = idata.posterior['b0'].values.flatten()
b1_samples = idata.posterior['b1'].values.flatten()
b2_samples = idata.posterior['b2'].values.flatten()
b3_samples = idata.posterior['b3'].values.flatten()
b4_samples = idata.posterior['b4'].values.flatten()
b5_samples = idata.posterior['b5'].values.flatten()
b6_samples = idata.posterior['b6'].values.flatten()

allsamples = [b0_samples, b1_samples, b2_samples, b3_samples, b4_samples, b5_samples, b6_samples]
names = ['b0', 'b1', 'b2', 'b3', 'b4', 'b5', 'b6']
print("Parameter | Estimate | Std. Dev. | Frequentist | Std. Error")
print("----------|----------|-----------|-------------|-----------")
for i, (name, arr) in enumerate(zip(names, allsamples)):
    print(f"{name:8}| {np.mean(arr):.6f} | {np.std(arr):.6f}  | {poiregmodel.params.values[i]:.6f}  | {poiregmodel.bse.values[i]:.6f}")
#These results seem different from the frequentist output.
#PyMC is also not very reliable here. Change the random seed from 0 to 4
#and look at the results. 
Parameter | Estimate | Std. Dev. | Frequentist | Std. Error
----------|----------|-----------|-------------|-----------
b0      | 3.200661 | 3.735618  | 6.936480  | 0.012336
b1      | -0.298308 | 0.509199  | -0.807524  | 0.004179
b2      | 0.080414 | 0.123097  | -0.042680  | 0.000212
b3      | -0.034505 | 0.087354  | 0.052831  | 0.000633
b4      | 0.001426 | 0.022144  | -0.020714  | 0.000380
b5      | -0.385071 | 0.505446  | 0.120372  | 0.000549
b6      | 0.009150 | 0.010979  | -0.001829  | 0.000016

Posterior Normal Approximation and Standard Errors

When the posterior cannot be written in terms of standard distributions, and when MCMC seems unreliable, an alternative option is to approximate the posterior using simple standard distributions. This idea is at the heart of the method of Variational Inference which is quite popular in Bayesian inference. In this problem, one can approximate the posterior via a (multivariate) normal distribution. This works in the following way.

Our posterior is proportional to exp((β))\exp(\ell(\beta)). The normal density, on the other hand, is proportional to exp(quadratic function of β)\exp \left(\text{quadratic function of } \beta \right). Thus, in order to obtain a normal approximation of the posterior, it makes sense to approximate (β)\ell(\beta) by a quadratic function of β\beta. This can be done by Taylor approximation, but we need to decide around which point to expand. The function (β)\ell(\beta) is the log-likelihood, which is maximized at the MLE β^\hat{\beta}. It makes sense, therefore, to do the Taylor approximation around β^\hat{\beta}. The idea is that the approximation will be quite accurate for points near the MLE β^\hat{\beta}. For points far from the MLE, the posterior density is probably small anyway, so a poor approximation at these points should have an insignificant effect. Motivated by these considerations, we Taylor expand the log-likelihood (β)\ell(\beta) around β^\hat{\beta} up to second order as follows:

(β)(β^)+<(β^),ββ^>+12(ββ^)TH(β^)(ββ^)=(β^)+12(ββ^)TH(β^)(ββ^)   because (β^)=0.\begin{align*} \ell(\beta) &\approx \ell(\hat{\beta}) + \left<\nabla \ell(\hat{\beta}), \beta - \hat{\beta} \right> + \frac{1}{2} \left(\beta - \hat{\beta} \right)^T H\ell(\hat{\beta}) \left(\beta - \hat{\beta} \right) \\ &= \ell(\hat{\beta}) + \frac{1}{2} \left(\beta - \hat{\beta} \right)^T H\ell(\hat{\beta}) \left(\beta - \hat{\beta} \right) ~~ \text{ because } \nabla \ell(\hat{\beta}) = 0. \end{align*}

Thus the posterior is approximated as:

posterior(β)exp((β))exp((β^)+12(ββ^)TH(β^)(ββ^))=exp((β^))exp(12(ββ^)TH(β^)(ββ^))exp(12(ββ^)TH(β^)(ββ^)).\begin{align*} \text{posterior}(\beta) & \propto \exp \left(\ell(\beta) \right) \\ & \approx \exp \left(\ell(\hat{\beta}) + \frac{1}{2} \left(\beta - \hat{\beta} \right)^T H\ell(\hat{\beta}) \left(\beta - \hat{\beta} \right) \right) \\ &= \exp \left(\ell(\hat{\beta}) \right) \exp \left(\frac{1}{2} \left(\beta - \hat{\beta} \right)^T H\ell(\hat{\beta}) \left(\beta - \hat{\beta} \right) \right) \\ &\propto \exp \left(\frac{1}{2} \left(\beta - \hat{\beta} \right)^T H\ell(\hat{\beta}) \left(\beta - \hat{\beta} \right) \right). \end{align*}

In the last step above, we ignored exp((β^))\exp \left(\ell(\hat{\beta}) \right) in proportionality because it does not involve the parameters β\beta.

Now we can compare the posterior approximation to the density of a multivariate normal (with mean μ\mu and covariance Σ\Sigma) which is proportional to exp(12(βμ)TΣ1(βμ))\exp \left(- \frac{1}{2} (\beta - \mu)^T \Sigma^{-1} (\beta - \mu) \right). It is then clear that μ=β^\mu = \hat{\beta} and Σ1=H(β^)\Sigma^{-1} = -H\ell(\hat{\beta}). Thus the posterior normal approximation is given by:

posteriorN(β^,(H(β^))1)\begin{align*} \text{posterior} \approx N \left(\hat{\beta}, \left( - H\ell(\hat{\beta}) \right)^{-1}\right) \end{align*}

This gives a very simple and practical approximation to the complicated posterior, especially because we already know how to calculate the Hessian (it was used in the Newton algorithm for computing the MLE). If one wants, posterior samples can be easily generated from N(β^,(H(β^))1)N \left(\hat{\beta}, \left( - H\ell(\hat{\beta}) \right)^{-1}\right). This will be much faster than using PyMC. However, PyMC tries to obtain samples from the actual posterior while this normal distribution is an approximation to the actual posterior.
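For instance, once the Hessian is available, generating draws from this normal approximation takes a single call to NumPy's multivariate normal sampler. The sketch below is self-contained on simulated data (all names hypothetical); the spread of the draws matches the square roots of the diagonal entries of the covariance matrix, as expected.

```python
import numpy as np

rng = np.random.default_rng(2)

#Simulated Poisson regression data and its MLE via Newton's method
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.poisson(np.exp(X @ np.array([1.0, 0.4])))

beta_hat = np.array([np.log(y.mean()), 0.0])  #safe initialization
for _ in range(25):
    mu = np.exp(X @ beta_hat)
    H = -X.T @ (mu[:, None] * X)
    beta_hat = beta_hat - np.linalg.solve(H, X.T @ (y - mu))

#Normal approximation N(beta_hat, (-H)^{-1}) with the Hessian at the MLE
mu = np.exp(X @ beta_hat)
cov = np.linalg.inv(X.T @ (mu[:, None] * X))
draws = rng.multivariate_normal(beta_hat, cov, size=4000)

print(draws.mean(axis=0))  #close to beta_hat
print(draws.std(axis=0))   #close to np.sqrt(np.diag(cov))
print(np.sqrt(np.diag(cov)))
```

Drawing thousands of such samples is essentially instantaneous, which is typically orders of magnitude faster than MCMC, at the cost of whatever error the normal approximation incurs.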

Standard Errors for Poisson Regression

The standard errors of Poisson Regression that were reported by sm.GLM are actually computed from the covariance of the posterior normal approximation. In other words, the standard errors are simply the square-roots of the diagonal entries of the matrix:

(H(β^))1=(XTM(β^)X)1\begin{align*} \left(-H\ell(\hat{\beta}) \right)^{-1} = \left(X^T M(\hat{\beta}) X \right)^{-1} \end{align*}

where β^\hat{\beta} is the MLE and M(β)M(\beta) is the diagonal matrix with diagonal entries μ1=ex1Tβ,,μn=exnTβ\mu_1 = e^{x_1^T \beta}, \dots, \mu_n = e^{x_n^T \beta}. This can be readily verified in the context of our dataset as follows.

#Standard Error Calculation:
log_muvec = np.dot(Xmat, beta_hat)
muvec = np.exp(log_muvec)
M = np.diag(muvec)
Hessian = -Xmat.T @ M @ Xmat
Hessian_inv = np.linalg.inv(Hessian)
CovMat = -Hessian_inv
print(np.sqrt(np.diag(CovMat)))
#Compare with
print(poiregmodel.bse)
[0.012336 0.004179 0.000212 0.000633 0.000380 0.000549 0.000016]
const      0.012336
kidslt6    0.004179
age        0.000212
educ       0.000633
huswage    0.000380
exper      0.000549
expersq    0.000016
dtype: float64

These standard errors can also be justified using frequentist arguments but that justification is much more complicated (it involves asymptotics and notions such as Fisher information). The Bayesian justification is much simpler and intuitive (based on normal approximation via Taylor expansion).

The important takeaways are:

  1. Poisson Regression is an important alternative to linear regression that is applicable when the response variable consists of counts. It gives more interpretable results than linear regression by allowing one to do log-linear modeling in situations where one cannot directly take logarithms of the response variable due to the presence of zero counts.

  2. In the case of linear regression, frequentist inference and Bayesian inference with flat priors coincide exactly as can be rigorously mathematically proved.

  3. For Poisson regression, frequentist inference uses the MLE β^\hat{\beta} and provides standard errors by taking square roots of the diagonal entries of the matrix (H(β^))1\left(-H\ell(\hat{\beta}) \right)^{-1}. Bayesian inference with flat priors can give answers different from frequentist inference in Poisson regression. However, if one implements Bayesian inference by approximating the posterior with a multivariate normal distribution centered at the MLE instead of working with the full posterior, then Bayesian inference gives answers identical to frequentist inference.