import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pymc as pm
import arviz as az
np.set_printoptions(suppress=True, formatter={'float_kind':'{:f}'.format})
import statsmodels.api as sm
#this option makes Jupyter print numbers in ordinary (as opposed to scientific) notation
Bayesian approaches to Poisson Regression
Let us revisit the following dataset that we previously analyzed in Lab 7. This dataset (called MROZ) comes from an Econometrica paper by Mroz in 1987 and contains data on a bunch of variables for married women in the year 1975. This dataset is analyzed in the Econometrics book by Wooldridge (this book contains many other interesting economic and political datasets). The specific variables included in the Mroz dataset are:
inlf: binary variable equaling 1 if the individual worked (i.e., they were ‘in the labor force’) in the year 1975 and 0 otherwise
hours: number of hours worked in 1975
kidslt6: number of kids < 6 years of age
kidsge6: number of kids 6-18 years of age
age: age in years
educ: years of schooling
wage: hourly wage in 1975
repwage: reported wage at interview in 1976
hushrs: hours worked by husband in 1975
husage: husband’s age
huseduc: husband’s years of schooling
huswage: husband’s hourly wage in 1975
faminc: family income in 1975
mtr: federal marginal tax rate facing woman
motheduc: mother’s years of schooling
fatheduc: father’s years of schooling
unem: unemployment rate in county of residence
city: =1 if live in Standard metropolitan statistical area
exper: actual labor market experience
nwifeinc: (faminc - wage*hours)/1000
lwage: log(wage)
expersq: (the square of the experience variable)
#Import the MROZ.csv dataset
mroz = pd.read_csv("MROZ.csv")
print(mroz.head(12))
inlf hours kidslt6 kidsge6 age educ wage repwage hushrs husage \
0 1 1610 1 0 32 12 3.3540 2.65 2708 34
1 1 1656 0 2 30 12 1.3889 2.65 2310 30
2 1 1980 1 3 35 12 4.5455 4.04 3072 40
3 1 456 0 3 34 12 1.0965 3.25 1920 53
4 1 1568 1 2 31 14 4.5918 3.60 2000 32
5 1 2032 0 0 54 12 4.7421 4.70 1040 57
6 1 1440 0 2 37 16 8.3333 5.95 2670 37
7 1 1020 0 0 54 12 7.8431 9.98 4120 53
8 1 1458 0 2 48 12 2.1262 0.00 1995 52
9 1 1600 0 2 39 12 4.6875 4.15 2100 43
10 1 1969 0 1 33 12 4.0630 4.30 2450 34
11 1 1960 0 1 42 11 4.5918 4.58 2375 47
... faminc mtr motheduc fatheduc unem city exper nwifeinc \
0 ... 16310 0.7215 12 7 5.0 0 14 10.910060
1 ... 21800 0.6615 7 7 11.0 1 5 19.499981
2 ... 21040 0.6915 12 7 5.0 0 15 12.039910
3 ... 7300 0.7815 7 7 5.0 0 6 6.799996
4 ... 27300 0.6215 12 14 9.5 1 7 20.100058
5 ... 19495 0.6915 14 7 7.5 1 33 9.859054
6 ... 21152 0.6915 14 7 5.0 0 11 9.152048
7 ... 18900 0.6915 3 3 5.0 0 35 10.900038
8 ... 20405 0.7515 7 7 3.0 0 24 17.305000
9 ... 20425 0.6915 7 7 5.0 0 21 12.925000
10 ... 32300 0.5815 12 3 5.0 0 15 24.299953
11 ... 28700 0.6215 14 7 5.0 0 14 19.700071
lwage expersq
0 1.210154 196
1 0.328512 25
2 1.514138 225
3 0.092123 36
4 1.524272 49
5 1.556480 1089
6 2.120260 121
7 2.059634 1225
8 0.754336 576
9 1.544899 441
10 1.401922 225
11 1.524272 196
[12 rows x 22 columns]
In Lab 7, we fit the Poisson regression model to this dataset, using “hours” as the response variable, and “kidslt6”, “kidsge6”, “age”, “educ”, “exper”, “expersq”, “huswage”, “huseduc”, “hushrs”, “motheduc” and “fatheduc” as the covariates (or predictor variables).
The Poisson regression model is given by:

$$Y_i \mid x_i \overset{\text{ind}}{\sim} \text{Poisson}(\mu_i), \qquad \log \mu_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} = x_i^T \beta.$$

Here $Y_i$ is the number of hours worked in 1975, which is a count variable, so Poisson regression is appropriate here. The linear relationship between the response and covariates is specified through $\log \mu_i = x_i^T \beta$, which enables a percentage interpretation for the regression coefficients.
The standard way of fitting the Poisson regression model is to use the inbuilt statsmodels function sm.GLM. GLM stands for Generalized Linear Models, a general class of models that includes linear regression, Poisson regression, and logistic regression. While using sm.GLM, we need to specify which GLM we are fitting using the family argument.
#Poisson Regression through StatsModels
# Define the response variable and covariates
Y = mroz['hours']
X = mroz[['kidslt6', 'age', 'educ',
'huswage', 'exper', 'expersq']].copy()
X = sm.add_constant(X) # Add a constant (intercept) to the model
# Fit the Poisson regression model
poiregmodel = sm.GLM(Y, X, family=sm.families.Poisson()).fit()
print(poiregmodel.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: hours No. Observations: 753
Model: GLM Df Residuals: 746
Model Family: Poisson Df Model: 6
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -3.1563e+05
Date: Sat, 04 Apr 2026 Deviance: 6.2754e+05
Time: 11:24:06 Pearson chi2: 6.60e+05
No. Iterations: 5 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 6.9365 0.012 562.281 0.000 6.912 6.961
kidslt6 -0.8075 0.004 -193.217 0.000 -0.816 -0.799
age -0.0427 0.000 -201.166 0.000 -0.043 -0.042
educ 0.0528 0.001 83.439 0.000 0.052 0.054
huswage -0.0207 0.000 -54.548 0.000 -0.021 -0.020
exper 0.1204 0.001 219.231 0.000 0.119 0.121
expersq -0.0018 1.63e-05 -112.090 0.000 -0.002 -0.002
==============================================================================
The output above looks very similar to a linear regression output. The fitted coefficients can be nicely interpreted. For example, the coefficient -0.8075 for the “kidslt6” variable means that having a small kid reduces mean hours worked by about 55%. This 55% comes from the following calculation:
#55% comes from:
print((np.exp(poiregmodel.params['kidslt6']) - 1)*100)
-55.40391074218092
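The same calculation applies to any of the coefficients. For instance, here is a quick sketch (not in the original lab code) for the “educ” coefficient: each additional year of schooling is associated with an increase in mean hours worked of roughly exp(0.0528) - 1, i.e., about 5.4%.
#Analogous percentage interpretation for the "educ" coefficient
print((np.exp(poiregmodel.params['educ']) - 1)*100)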
The regression coefficients above are obtained by the method of Maximum Likelihood. The likelihood in this Poisson Regression Model is given by

$$L(\beta) = \prod_{i=1}^n \frac{e^{-\mu_i} \mu_i^{Y_i}}{Y_i!} \qquad \text{with } \mu_i = \exp(x_i^T \beta),$$

so the log-likelihood is (ignoring constant terms):

$$\ell(\beta) = \sum_{i=1}^n \left( Y_i \, x_i^T \beta - \exp(x_i^T \beta) \right).$$

This is maximized by Newton’s method which uses the gradient and Hessian of the log-likelihood. The gradient is

$$\nabla \ell(\beta) = \sum_{i=1}^n \left( Y_i - \exp(x_i^T \beta) \right) x_i = X^T (Y - \mu),$$

where $\mu$ is the vector with entries $\mu_i = \exp(x_i^T \beta)$, and the Hessian is:

$$\nabla^2 \ell(\beta) = -\sum_{i=1}^n \exp(x_i^T \beta) \, x_i x_i^T.$$

We can rewrite this in matrix form as

$$\nabla^2 \ell(\beta) = -X^T M X,$$

where $M$ is the diagonal matrix with diagonal entries $\mu_i = \exp(x_i^T \beta)$. Thus Newton’s method for Poisson regression is:

$$\beta^{(m+1)} = \beta^{(m)} + \left( X^T M^{(m)} X \right)^{-1} X^T \left( Y - \mu^{(m)} \right),$$

where we wrote $\mu^{(m)}$ and $M^{(m)}$ to emphasize that these are calculated with $\mu_i = \exp(x_i^T \beta^{(m)})$.
sm.GLM uses this method to give estimates of the coefficients. It is very easy to code this up yourselves as we saw in Lab 7.
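To see that this is indeed easy to code up, here is a minimal sketch of the Newton iteration above (an illustration written for this note, not the Lab 7 code; it starts from a least-squares fit of log(hours + 1) and includes no step-size safeguards).
#Minimal Newton's method sketch for Poisson regression (illustration only)
Xmat = X.values.astype(float)   #design matrix, including the constant column
yvec = Y.values.astype(float)   #response: hours worked in 1975
#start from the least-squares fit of log(y + 1) on X, a common simple initializer
beta_nt = np.linalg.lstsq(Xmat, np.log(yvec + 1.0), rcond=None)[0]
for it in range(50):
    mu = np.exp(Xmat @ beta_nt)          #fitted means at the current beta
    grad = Xmat.T @ (yvec - mu)          #gradient of the log-likelihood
    hess = -(Xmat.T * mu) @ Xmat         #Hessian: -X^T M X with M = diag(mu)
    step = np.linalg.solve(hess, grad)   #Newton step solves (Hessian) step = (gradient)
    beta_nt = beta_nt - step
    if np.max(np.abs(step)) < 1e-10:     #stop once the updates are negligible
        break
print(beta_nt)                     #should essentially match poiregmodel.params
print(poiregmodel.params.values)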
The standard errors reported by sm.GLM are computed as the square-roots of the diagonal entries of the matrix:

$$\left( X^T \hat{M} X \right)^{-1},$$

where $\hat{\beta}$ is the MLE and $\hat{M}$ is the diagonal matrix with diagonal entries $\hat{\mu}_i = \exp(x_i^T \hat{\beta})$. The validity of these standard errors is commonly justified by frequentist asymptotics. However, they can also be justified from a Bayesian perspective using the Laplace posterior normal approximation. This was illustrated in Lab 7.
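As a quick sanity check (a sketch added here, not part of the original lab code), these standard errors can be recomputed directly from $(X^T \hat{M} X)^{-1}$ and compared with poiregmodel.bse.
#Recompute standard errors from (X^T M_hat X)^{-1} and compare with sm.GLM's .bse
Xmat = X.values.astype(float)
mu_hat = np.exp(Xmat @ poiregmodel.params.values)   #fitted means at the MLE
info = (Xmat.T * mu_hat) @ Xmat                     #X^T M_hat X
se_manual = np.sqrt(np.diag(np.linalg.inv(info)))   #square roots of diagonal entries
print(se_manual)
print(poiregmodel.bse.values)   #should agree closely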
Bayesian Analysis with Flat Prior
Let us again revisit Bayesian analysis for Poisson regression, and let us work with the uniform (flat) prior for $\beta$. This leads to the posterior being proportional to the likelihood:

$$\pi(\beta \mid \text{data}) \propto L(\beta).$$

The posterior is therefore given by

$$\pi(\beta \mid \text{data}) \propto \exp\left( \sum_{i=1}^n \left( Y_i \, x_i^T \beta - \exp(x_i^T \beta) \right) \right).$$

This posterior in $\beta$ cannot be written in terms of standard distributions.
One can attempt to use PyMC to generate samples from this posterior. PyMC displays some strange behaviour here. With random_seed = 0, it gives results that are basically the same as the frequentist analysis. But with random_seed = 4, things do not work.
initvals = dict(zip(["b0","b1","b2","b3","b4","b5","b6"], poiregmodel.params))
mrozpoimod = pm.Model()
with mrozpoimod:
# Priors for the unknown model parameters
b0 = pm.Flat("b0")
b1 = pm.Flat("b1")
b2 = pm.Flat("b2")
b3 = pm.Flat("b3")
b4 = pm.Flat("b4")
b5 = pm.Flat("b5")
b6 = pm.Flat("b6")
log_mu = b0 + b1 * mroz['kidslt6'] + b2 * mroz['age'] + b3 * mroz['educ'] + b4 * mroz['huswage'] + b5 * mroz['exper'] + b6 * mroz['expersq']
# Likelihood
Y_obs = pm.Poisson("Y_obs", mu=np.exp(log_mu), observed=mroz['hours'])
idata = pm.sample(4000, chains = 1, random_seed = 0, return_inferencedata = True, initvals = initvals)
#idata = pm.sample(4000, chains = 1, random_seed = 4, return_inferencedata = True, initvals = initvals) #this does not work
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (1 chains in 1 job)
NUTS: [b0, b1, b2, b3, b4, b5, b6]
Sampling 1 chain for 1_000 tune and 4_000 draw iterations (1_000 + 4_000 draws total) took 28 seconds.
Only one chain was sampled, this makes it impossible to run some convergence checks
b0_samples = idata.posterior['b0'].values.flatten()
b1_samples = idata.posterior['b1'].values.flatten()
b2_samples = idata.posterior['b2'].values.flatten()
b3_samples = idata.posterior['b3'].values.flatten()
b4_samples = idata.posterior['b4'].values.flatten()
b5_samples = idata.posterior['b5'].values.flatten()
b6_samples = idata.posterior['b6'].values.flatten()
allsamples = [b0_samples, b1_samples, b2_samples, b3_samples, b4_samples, b5_samples, b6_samples]
names = ['b0', 'b1', 'b2', 'b3', 'b4', 'b5', 'b6']
print("Parameter | Estimate | Std. Dev. | Frequentist | Std. Error")
print("------------|----------|----------")
for i, (name, arr) in enumerate(zip(names, allsamples)):
print(f"{name:8}| {np.mean(arr):.6f} | {np.std(arr):.6f} | {poiregmodel.params.values[i]:.6f} | {poiregmodel.bse.values[i]:.6f}")
#These results are essentially the same as the frequentist output.
#PyMC is, however, not very reliable here. Change the random seed from 0 to 4
#and look at the results.
Parameter | Estimate | Std. Dev. | Frequentist | Std. Error
----------|----------|-----------|-------------|----------
b0 | 6.936672 | 0.012513 | 6.936480 | 0.012336
b1 | -0.807607 | 0.004120 | -0.807524 | 0.004179
b2 | -0.042680 | 0.000211 | -0.042680 | 0.000212
b3 | 0.052822 | 0.000639 | 0.052831 | 0.000633
b4 | -0.020707 | 0.000371 | -0.020714 | 0.000380
b5 | 0.120350 | 0.000552 | 0.120372 | 0.000549
b6 | -0.001828 | 0.000016 | -0.001829 | 0.000016
Below we plot the trace plots, i.e., the plot of each coefficient against the iteration number: $b_j^{(t)}$ against $t$.
fig, axes = plt.subplots(len(allsamples), 1, figsize=(12, 2.2 * len(allsamples)), sharex=True)
for ax, arr, name in zip(axes, allsamples, names):
ax.plot(arr, lw=0.8)
ax.set_title(f"Trace plot for {name}")
plt.tight_layout()
plt.show()
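ArviZ (imported above as az) provides the same diagnostics with less code. For example, one could use something like the following sketch to get trace plots together with summary statistics such as the effective sample size and r_hat.
#Alternative: ArviZ trace plots and summary diagnostics for the PyMC draws
az.plot_trace(idata, var_names=names)
plt.tight_layout()
plt.show()
print(az.summary(idata, var_names=names))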
With random_seed = 4, the chain does not move at all for some reason.
Metropolis-Hastings
We now apply the Metropolis-Hastings algorithm. The log-likelihood, log-prior and log-posterior functions are given below.
#Log-likelihood:
Xm = X.values.astype(float)
y = Y.values.astype(float)
def loglik(th):
"""Poisson log-likelihood (dropping the constant log(y!) term)."""
eta = Xm @ th
return np.sum(y * eta - np.exp(eta))
def logprior(th):
"""Flat (uniform) prior => log-prior = 0."""
return 0.0
def logpost(th):
"""Unnormalized log-posterior = log-likelihood + log-prior."""
return loglik(th) + logprior(th)
Metropolis-Hastings uses a proposal and then an acceptance-rejection step. For the proposal, the most common choice is that of a random walk, i.e., we take:

$$\beta^{\text{prop}} = \beta^{\text{cur}} + \varepsilon,$$

where $\varepsilon \sim N(0, \Sigma)$ and $\beta^{\text{cur}}$ is the current state of the chain. The choice of $\Sigma$ is crucial for the working of the algorithm. The simplest thing is to take $\Sigma = \sigma^2 I$ for some $\sigma > 0$. But the different variables are on different scales, so using one common $\sigma$ might not work. A natural thing to do is to use the covariance matrix corresponding to the Laplace posterior normal approximation. This can actually be obtained from the sm.GLM output, as shown below.
beta_hat = poiregmodel.params.values
p_dim = len(beta_hat)
cov_mat = poiregmodel.cov_params().values
#cov_mat is the covariance matrix corresponding to the Laplace posterior normal
#approximation. It naturally gives us a sense of the scale of the posterior
#distribution, and is often a good choice for the proposal distribution covariance
#in a random walk Metropolis algorithm.
np.mean(np.sqrt(np.diag(cov_mat))) #average posterior standard deviation across the coefficients
0.002615160790187325
Because of the symmetry of the random walk proposal, the acceptance probability in Metropolis-Hastings is simply

$$\min\left(1, \; \frac{\pi(\beta^{\text{prop}})}{\pi(\beta^{\text{cur}})}\right),$$

where $\pi(\beta^{\text{prop}})$ is the target density at the proposed $\beta^{\text{prop}}$ and $\pi(\beta^{\text{cur}})$ is the target density at the current $\beta^{\text{cur}}$.
The code for implementing the Metropolis-Hastings algorithm is given below.
from scipy.stats import multivariate_normal
nit = 200000
path_rw = np.zeros((nit, p_dim))
#state = beta_hat.copy()
state = np.zeros(p_dim) #we are initializing at zero here.
#If, instead, we initialize at the MLE, we will get faster mixing.
path_rw[0, :] = state
rej_rw = 0
for it in range(1, nit):
# Symmetric proposal centred at the current state
cand = multivariate_normal.rvs(mean=state, cov=cov_mat)
#cand = multivariate_normal.rvs(mean=state, cov=np.diag(cov_mat))
#sig = 0.002 #using a single sigma for all parameters.
#cand = state + np.random.normal(0, sig, size=p_dim)
# Because the proposal is symmetric, acceptance ratio is just
# the posterior ratio — no Hastings correction needed
log_ratio = logpost(cand) - logpost(state)
rnum = np.random.uniform()
if rnum < np.exp(min(0.0, log_ratio)):
state = cand
else:
rej_rw += 1
path_rw[it, :] = state
if (it + 1) % 20000 == 0:
print(f"Iteration {it + 1}")
print(f"\nRandom walk rejection rate: {rej_rw / nit:.4f}")Iteration 20000
Iteration 40000
Iteration 60000
Iteration 80000
Iteration 100000
Iteration 120000
Iteration 140000
Iteration 160000
Iteration 180000
Iteration 200000
Random walk rejection rate: 0.7549
Below are the trace plots corresponding to each variable.
col_names = [f'b{j}' for j in range(p_dim)]
fig, axes = plt.subplots(2, 4, figsize=(16, 5))
for j in range(p_dim):
axes.flat[j].plot(path_rw[:, j], linewidth=0.3)
axes.flat[j].set_title(col_names[j])
axes.flat[7].set_visible(False)
plt.tight_layout()
plt.show()

Clearly the chain is taking some time to mix. We can use a burn-in to discard the initial samples generated by the Markov chain.
burnin = 50000
samples = path_rw[burnin:, :]
col_names = [f'b{j}' for j in range(p_dim)]
fig, axes = plt.subplots(2, 4, figsize=(16, 5))
for j in range(p_dim):
axes.flat[j].plot(samples[:, j], linewidth=0.3)
axes.flat[j].set_title(col_names[j])
axes.flat[7].set_visible(False)
plt.tight_layout()
plt.show()

print("Parameter | Estimate | Std. Dev. | Frequentist | Std. Error")
print("----------|----------|-----------|-------------|----------")
for j in range(p_dim):
print(f"{col_names[j]:8}| {samples[:, j].mean():.6f} | {samples[:, j].std():.6f} | {poiregmodel.params.values[j]:.6f} | {poiregmodel.bse.values[j]:.6f}")
Parameter | Estimate | Std. Dev. | Frequentist | Std. Error
----------|----------|-----------|-------------|----------
b0 | 6.936279 | 0.012251 | 6.936480 | 0.012336
b1 | -0.807442 | 0.004192 | -0.807524 | 0.004179
b2 | -0.042674 | 0.000213 | -0.042680 | 0.000212
b3 | 0.052835 | 0.000632 | 0.052831 | 0.000633
b4 | -0.020722 | 0.000374 | -0.020714 | 0.000380
b5 | 0.120364 | 0.000544 | 0.120372 | 0.000549
b6 | -0.001828 | 0.000016 | -0.001829 | 0.000016
The posterior means and standard deviations are very close to those given by the Laplace posterior normal approximation (which is the same as the frequentist output). Below, let us use some alternative choices of the proposal density in Metropolis-Hastings. First, let us try $\Sigma = \sigma^2 I$ with $\sigma = 0.002$ (i.e., we are using the same scale for all the variables).
from scipy.stats import multivariate_normal
nit = 200000
path_rw = np.zeros((nit, p_dim))
#state = beta_hat.copy()
state = np.zeros(p_dim) #we are initializing at zero here.
#If, instead, we initialize at the MLE, we will get faster mixing.
path_rw[0, :] = state
rej_rw = 0
for it in range(1, nit):
# Symmetric proposal centred at the current state
#cand = multivariate_normal.rvs(mean=state, cov=cov_mat)
sig = 0.002 #using a single sigma for all parameters.
#sig = 0.02
cand = state + np.random.normal(0, sig, size=p_dim)
log_ratio = logpost(cand) - logpost(state)
rnum = np.random.uniform()
if rnum < np.exp(min(0.0, log_ratio)):
state = cand
else:
rej_rw += 1
path_rw[it, :] = state
if (it + 1) % 20000 == 0:
print(f"Iteration {it + 1}")
print(f"\nRandom walk rejection rate: {rej_rw / nit:.4f}")Iteration 20000
Iteration 40000
Iteration 60000
Iteration 80000
Iteration 100000
Iteration 120000
Iteration 140000
Iteration 160000
Iteration 180000
Iteration 200000
Random walk rejection rate: 0.9968
Here are the trace plots.
col_names = [f'b{j}' for j in range(p_dim)]
fig, axes = plt.subplots(2, 4, figsize=(16, 5))
for j in range(p_dim):
axes.flat[j].plot(path_rw[:, j], linewidth=0.3)
axes.flat[j].set_title(col_names[j])
axes.flat[7].set_visible(False)
plt.tight_layout()
plt.show()

Clearly this chain has not mixed well. For example, for $b_0$, the chain still seems to be increasing after 200K iterations (this is not surprising, as the frequentist estimate of $b_0$ is around 7 and the chain started at 0). We used $\sigma = 0.002$ above; if we increase $\sigma$ to, say, 0.02, then the chain has a very high rejection rate.
Next, let us use $\Sigma = \text{diag}(\hat{\Sigma})$ as the proposal covariance, where $\hat{\Sigma}$ is the covariance matrix of the Laplace posterior normal approximation. This makes the proposal independent across the variables but uses the proper scale for each individual variable.
from scipy.stats import multivariate_normal
nit = 200000
path_rw = np.zeros((nit, p_dim))
#state = beta_hat.copy()
state = np.zeros(p_dim) #we are initializing at zero here.
#If, instead, we initialize at the MLE, we will get faster mixing.
path_rw[0, :] = state
rej_rw = 0
for it in range(1, nit):
# Symmetric proposal centred at the current state
cand = multivariate_normal.rvs(mean=state, cov=np.diag(cov_mat)) #note we are taking the diagonal part now, which corresponds to independent proposals for each parameter.
log_ratio = logpost(cand) - logpost(state)
rnum = np.random.uniform()
if rnum < np.exp(min(0.0, log_ratio)):
state = cand
else:
rej_rw += 1
path_rw[it, :] = state
if (it + 1) % 20000 == 0:
print(f"Iteration {it + 1}")
print(f"\nRandom walk rejection rate: {rej_rw / nit:.4f}")Iteration 20000
Iteration 40000
Iteration 60000
Iteration 80000
Iteration 100000
Iteration 120000
Iteration 140000
Iteration 160000
Iteration 180000
Iteration 200000
Random walk rejection rate: 0.9792
Below are the trace plots.
col_names = [f'b{j}' for j in range(p_dim)]
fig, axes = plt.subplots(2, 4, figsize=(16, 5))
for j in range(p_dim):
axes.flat[j].plot(path_rw[:, j], linewidth=0.3)
axes.flat[j].set_title(col_names[j])
axes.flat[7].set_visible(False)
plt.tight_layout()
plt.show()

It seems that the chain mixes well after a few initial iterations, which can be handled using a burn-in.
burnin = 50000
samples = path_rw[burnin:, :]
col_names = [f'b{j}' for j in range(p_dim)]
fig, axes = plt.subplots(2, 4, figsize=(16, 5))
for j in range(p_dim):
axes.flat[j].plot(samples[:, j], linewidth=0.3)
axes.flat[j].set_title(col_names[j])
axes.flat[7].set_visible(False)
plt.tight_layout()
plt.show()

The posterior means and standard deviations are given below.
print("Parameter | Estimate | Std. Dev. | Frequentist | Std. Error")
print("----------|----------|-----------|-------------|----------")
for j in range(p_dim):
print(f"{col_names[j]:8}| {samples[:, j].mean():.6f} | {samples[:, j].std():.6f} | {poiregmodel.params.values[j]:.6f} | {poiregmodel.bse.values[j]:.6f}")
Parameter | Estimate | Std. Dev. | Frequentist | Std. Error
----------|----------|-----------|-------------|----------
b0 | 6.938211 | 0.012637 | 6.936480 | 0.012336
b1 | -0.807783 | 0.004127 | -0.807524 | 0.004179
b2 | -0.042709 | 0.000207 | -0.042680 | 0.000212
b3 | 0.052826 | 0.000657 | 0.052831 | 0.000633
b4 | -0.020707 | 0.000395 | -0.020714 | 0.000380
b5 | 0.120289 | 0.000534 | 0.120372 | 0.000549
b6 | -0.001826 | 0.000016 | -0.001829 | 0.000016
The output again is very similar to the posterior normal approximation.
Finally, another choice of the proposal distribution is the so-called Independence Sampler, where we take

$$\beta^{\text{prop}} \sim N\left(\hat{\beta}, \; \rho \, \hat{\Sigma}\right),$$

where $\hat{\beta}$ is the frequentist MLE, $\hat{\Sigma}$ is the covariance matrix of the Laplace posterior normal approximation, and $\rho$ is a constant slightly larger than 1. This proposal density does not depend on $\beta^{\text{cur}}$ (the current state), which is why the resulting Metropolis-Hastings scheme is called the Independence sampler. Note that this proposal is not symmetric in $\beta^{\text{cur}}$ and $\beta^{\text{prop}}$, so the full acceptance probability

$$\min\left(1, \; \frac{\pi(\beta^{\text{prop}}) \, q(\beta^{\text{cur}})}{\pi(\beta^{\text{cur}}) \, q(\beta^{\text{prop}})}\right)$$

needs to be used, where $q$ denotes the proposal density.
nit = 100000
path_ind = np.zeros((nit, p_dim))
state = beta_hat.copy()
#state = np.zeros(p_dim) #with this initialization, the chain does not move at all!
path_ind[0, :] = state
rej_ind = 0
rho = 1.2 # scaling factor for the proposal covariance
for it in range(1, nit):
# Propose from N(beta_hat, rho * cov_mat) -- centred at the MLE
cand = multivariate_normal.rvs(mean=beta_hat, cov=rho * cov_mat)
# Log acceptance ratio = log[ p(cand)/p(state) * q(state)/q(cand) ]
log_ratio1 = logpost(cand) - logpost(state)
log_ratio2 = (multivariate_normal.logpdf(state, mean=beta_hat, cov=rho * cov_mat)
- multivariate_normal.logpdf(cand, mean=beta_hat, cov=rho * cov_mat))
log_ratio = log_ratio1 + log_ratio2
rnum = np.random.uniform()
if rnum < np.exp(min(0.0, log_ratio)):
state = cand
else:
rej_ind += 1
path_ind[it, :] = state
if (it + 1) % 20000 == 0:
print(f"Iteration {it + 1}")
print(f"\nIndependence sampler rejection rate: {rej_ind / nit:.4f}")
Iteration 20000
Iteration 40000
Iteration 60000
Iteration 80000
Iteration 100000
Independence sampler rejection rate: 0.1852
col_names = [f'b{j}' for j in range(p_dim)]
fig, axes = plt.subplots(2, 4, figsize=(16, 5))
for j in range(p_dim):
axes.flat[j].plot(path_ind[:, j], linewidth=0.3)
axes.flat[j].set_title(col_names[j])
axes.flat[7].set_visible(False)
plt.tight_layout()
plt.show()

This sampler is mixing very well. However, if the initial state is, say, zero, then the chain does not seem to move at all, so it is crucial to initialize it at the MLE.
print("Parameter | Estimate | Std. Dev. | Frequentist | Std. Error")
print("----------|----------|-----------|-------------|----------")
for j in range(p_dim):
print(f"{col_names[j]:8}| {path_ind[:, j].mean():.6f} | {path_ind[:, j].std():.6f} | {poiregmodel.params.values[j]:.6f} | {poiregmodel.bse.values[j]:.6f}")
Parameter | Estimate | Std. Dev. | Frequentist | Std. Error
----------|----------|-----------|-------------|----------
b0 | 6.936539 | 0.012378 | 6.936480 | 0.012336
b1 | -0.807521 | 0.004163 | -0.807524 | 0.004179
b2 | -0.042682 | 0.000213 | -0.042680 | 0.000212
b3 | 0.052829 | 0.000633 | 0.052831 | 0.000633
b4 | -0.020715 | 0.000381 | -0.020714 | 0.000380
b5 | 0.120375 | 0.000551 | 0.120372 | 0.000549
b6 | -0.001829 | 0.000016 | -0.001829 | 0.000016
The posterior means and standard deviations nearly coincide with the posterior normal approximation.
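To make the comparison of mixing more concrete, one could also compute effective sample sizes for the chains. A rough sketch using ArviZ is below (assuming, as at this point in the notebook, that path_rw holds the diagonal-proposal random walk chain and path_ind the independence-sampler chain; each coordinate is passed to az.ess as a single (chain, draw) array).
#Rough effective sample size comparison (sketch): random walk with diagonal
#proposal (after burn-in) versus the independence sampler
for j in range(p_dim):
    ess_rw = az.ess(path_rw[burnin:, j].reshape(1, -1))
    ess_ind = az.ess(path_ind[:, j].reshape(1, -1))
    print(f"b{j}: ESS (random walk) = {float(ess_rw):.0f}, ESS (independence) = {float(ess_ind):.0f}")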
The overall message is that in problems such as these, the posterior normal approximation is crucial: using it, it is possible to design simple Metropolis-Hastings schemes that work well. PyMC, on the other hand, seems to give strange results on some occasions.