In this lab, we shall go over a simple application of mixture modeling to regression problems.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import pymc as pm
import arviz as az
Consider the following simulated regression dataset.
rng = np.random.default_rng(seed = 123)
n = 400
sigma = 0.5
beta_vals = np.array([[3.0, -1.0], [1.0, 1.5], [-1.0, 0.5]])
print(beta_vals)
probs = np.array([0.3, 0.3, 0.4])
#sample x_i \sim uniform(-1, 3)
x = rng.uniform(-1, 3, size = n)
#sample component labels
z = rng.choice(3, size = n, p = probs)
#assign coefficients:
beta0 = beta_vals[z, 0]
beta1 = beta_vals[z, 1]
#sample noise:
eps = rng.normal(0, sigma, size = n)
#generate y_i
y = beta0 + beta1 * x + eps
#store as data:
data = pd.DataFrame({'x': x, 'y': y})
display(data.head())
[[ 3.  -1. ]
 [ 1.   1.5]
 [-1.   0.5]]
#Plot the data:
plt.figure(figsize = (8, 6))
plt.scatter(x, y, alpha = 0.7, s = 3, color = 'black')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Scatter plot of the data')
plt.grid(True)
plt.show()
This regression dataset was generated from a mixture of three linear regression models. For each observation $i$, the response $y_i$ is produced by taking a linear function of the covariate $x_i$ and then adding independent random noise. The important feature is that the linear relationship is not the same across all observations. Instead, each data point comes from one of three possible linear models.
The three underlying linear functions correspond to the coefficient pairs $(3.0, -1.0)$, $(1.0, 1.5)$ and $(-1.0, 0.5)$. Thus, for each observation $i$, one of these three pairs is selected at random, with probabilities $0.3$, $0.3$, and $0.4$, respectively. Conditional on this choice, the response is generated according to the corresponding linear model, with an additional independent noise term.
A natural interpretation is that the dataset is drawn from three hidden subpopulations. Each subpopulation has its own regression line, but the identity of the subpopulation is not observed directly. As a result, the full dataset is heterogeneous: rather than following a single linear trend, it consists of a blend of three different trends. This is precisely the setting of a mixture of regressions model, where the goal is often to recover both the underlying regression relationships and the latent population structure.
We will use the following model for this data:
$$y_i \mid x_i \overset{\text{ind}}{\sim} \sum_{j=1}^{3} w_j \, N\left(\gamma_{j0} + \gamma_{j1} x_i, \; \sigma^2\right)$$
with priors
$$\gamma_j := (\gamma_{j0}, \gamma_{j1})^T \overset{\text{iid}}{\sim} N(0, C I_2), \qquad (w_1, w_2, w_3) \sim \text{Dirichlet}(1, 1, 1),$$
where $\gamma_1, \gamma_2, \gamma_3$ are the three unknown coefficient vectors and $w_1, w_2, w_3$ denote the probability weights. Another unknown parameter is the standard deviation $\sigma$ of the noise, but we shall assume that this is known for simplicity (in the above simulation example, $\sigma$ is fixed to be 0.5).
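As an aside, this marginal mixture likelihood can also be written directly in PyMC (already imported above) and sampled with its default MCMC machinery. The following is a minimal sketch under our own naming choices, not the method used in this lab; below we instead implement a Gibbs sampler by hand.
# A minimal PyMC sketch of the mixture-of-regressions model above.
# Assumes C = 100, i.e. a N(0, 100) prior on each coefficient (sigma = 10).
with pm.Model() as mix_reg_model:
    w = pm.Dirichlet("w", a=np.ones(3))
    gamma0 = pm.Normal("gamma0", mu=0, sigma=10, shape=3)  # intercepts
    gamma1 = pm.Normal("gamma1", mu=0, sigma=10, shape=3)  # slopes
    mu = gamma0[None, :] + gamma1[None, :] * x[:, None]    # (n, 3) component means
    pm.NormalMixture("y_obs", w=w, mu=mu, sigma=0.5, observed=y)
    # idata = pm.sample()  # label switching may complicate posterior summaries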
Gibbs Sampler
To implement the Gibbs sampler in this model, we introduce latent allocation variables $z_1, \dots, z_n$ with each $z_i \in \{1, 2, 3\}$. With these variables, the model can be written as:
$$y_i \mid x_i, z_i = j \sim N\left(\gamma_{j0} + \gamma_{j1} x_i, \; \sigma^2\right), \qquad P(z_i = j) = w_j.$$
We will use the Gibbs sampler to simulate from the joint distribution of $z_1, \dots, z_n, w, \gamma_1, \gamma_2, \gamma_3$ given the data $(x_1, y_1), \dots, (x_n, y_n)$. The conditional distribution of $z_i$ given $w, \gamma_1, \gamma_2, \gamma_3$ and the data is:
$$P(z_i = j \mid w, \gamma, \text{data}) \propto w_j \exp\left(-\frac{(y_i - \gamma_{j0} - \gamma_{j1} x_i)^2}{2\sigma^2}\right).$$
In other words, $z_i$ (conditional on $w, \gamma$) is discrete taking the values $1, 2, 3$ with probabilities $p_{i1}, p_{i2}, p_{i3}$ with
$$p_{ij} = \frac{w_j \exp\left(-\frac{(y_i - \gamma_{j0} - \gamma_{j1} x_i)^2}{2\sigma^2}\right)}{\sum_{k=1}^{3} w_k \exp\left(-\frac{(y_i - \gamma_{k0} - \gamma_{k1} x_i)^2}{2\sigma^2}\right)}.$$
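To make this formula concrete, here is a small numerical illustration of $p_{i1}, p_{i2}, p_{i3}$ for a single observation; the values of $w$, $\gamma$, $x_i$, $y_i$ below are made up purely for the example. Working on the log scale and subtracting the maximum before exponentiating avoids numerical underflow, a trick reused in the sampler code below.
# Allocation probabilities p_ij for one observation (illustrative values only)
w_ex = np.array([0.3, 0.3, 0.4])
gamma_ex = np.array([[3.0, -1.0], [1.0, 1.5], [-1.0, 0.5]])
xi, yi, sig = 0.5, 2.0, 0.5
logp_ex = np.log(w_ex) - 0.5 * (yi - gamma_ex[:, 0] - gamma_ex[:, 1] * xi) ** 2 / sig**2
p_i = np.exp(logp_ex - logp_ex.max())  # stabilize before exponentiating
p_i /= p_i.sum()
print(p_i)  # probabilities that z_i = 1, 2, 3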
The conditional distribution of $w = (w_1, w_2, w_3)$ given $z_1, \dots, z_n$ is:
$$w \mid z, \text{data} \sim \text{Dirichlet}(1 + n_1, 1 + n_2, 1 + n_3),$$
where $n_j = \#\{i : z_i = j\}$ for $j = 1, 2, 3$.
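For instance, if the current allocations give counts $(n_1, n_2, n_3) = (120, 130, 150)$ (illustrative numbers, using the rng defined above), the weight vector is refreshed with a single Dirichlet draw:
# One Dirichlet update of w for illustrative counts (120, 130, 150)
w_draw = rng.dirichlet(1 + np.array([120, 130, 150]))
print(w_draw, w_draw.sum())  # a probability vector summing to 1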
For a fixed $j$, the conditional distribution of $\gamma_j$ given the data and $z$ is calculated as follows. First take the observations $i$ for which $z_i = j$. Let $X_j$ be the matrix whose rows are given by $(1, x_i)$ for $i$ with $z_i = j$. Also let $Y_j$ be the vector with entries $y_i$ for $i$ with $z_i = j$. Then
$$\gamma_j \mid z, \text{data} \sim N(m_j, V_j), \qquad V_j = \left(\frac{X_j^T X_j}{\sigma^2} + \frac{1}{C} I_2\right)^{-1}, \qquad m_j = V_j \frac{X_j^T Y_j}{\sigma^2}.$$
Algorithm for the Gibbs Sampler:
Here is the algorithm for the Gibbs sampler for the mixture of linear regressions. We start with initial values
$$z^{(0)}, \; w^{(0)}, \; \gamma_1^{(0)}, \gamma_2^{(0)}, \gamma_3^{(0)}$$
and repeat for $t = 0, 1, 2, \dots$ as follows.
1. For each $i = 1, \dots, n$, sample
$$z_i^{(t+1)} \sim \text{Discrete}\left(1, 2, 3; \; p_{i1}, p_{i2}, p_{i3}\right),$$
where
$$p_{ij} \propto w_j^{(t)} \exp\left(-\frac{(y_i - \gamma_{j0}^{(t)} - \gamma_{j1}^{(t)} x_i)^2}{2\sigma^2}\right).$$
2. Compute the counts
$$n_j = \#\{i : z_i^{(t+1)} = j\}$$
for $j = 1, 2, 3$, and sample
$$w^{(t+1)} \sim \text{Dirichlet}(1 + n_1, 1 + n_2, 1 + n_3).$$
3. For each $j = 1, 2, 3$, using the observations with $z_i^{(t+1)} = j$, sample
$$\gamma_j^{(t+1)} \sim N(m_j, V_j),$$
where
$$V_j = \left(\frac{X_j^T X_j}{\sigma^2} + \frac{1}{C} I_2\right)^{-1}$$
and
$$m_j = V_j \frac{X_j^T Y_j}{\sigma^2}.$$
Here $X_j$ is the matrix with rows $(1, x_i)$ for $i$ such that $z_i^{(t+1)} = j$, and $Y_j$ is the vector with entries $y_i$ for $i$ such that $z_i^{(t+1)} = j$.
Below is the code which implements the Gibbs sampler.
def gibbs_mix3_regression(x, y, sigma=0.5, C=100.0, n_iter=5000, burnin=1000, seed=None):
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    y = np.asarray(y)
    n = len(y)
    Xfull = np.column_stack([np.ones(n), x])
    # initialize
    z = rng.integers(0, 3, size=n)
    w = np.array([1/3, 1/3, 1/3], dtype=float)
    gamma = rng.normal(0, np.sqrt(C), size=(3, 2))
    w_samples = np.zeros((n_iter - burnin, 3))
    gamma_samples = np.zeros((n_iter - burnin, 3, 2))
    z_samples = np.zeros((n_iter - burnin, n), dtype=int)
    eye2 = np.eye(2)
    for it in range(n_iter):
        # update z: log-probabilities of each component for every observation
        logp = np.empty((n, 3))
        for j in range(3):
            mean_j = gamma[j, 0] + gamma[j, 1] * x
            logp[:, j] = np.log(w[j]) - 0.5 * (y - mean_j) ** 2 / sigma**2
        logp -= logp.max(axis=1, keepdims=True)  # stabilize before exponentiating
        p = np.exp(logp)
        p /= p.sum(axis=1, keepdims=True)
        # inverse-CDF sampling of each z_i from its categorical distribution
        u = rng.random(n)
        z = (u[:, None] > np.cumsum(p, axis=1)).sum(axis=1)
        # update w
        counts = np.bincount(z, minlength=3)
        w = rng.dirichlet(1 + counts)
        # update gamma_j
        for j in range(3):
            idx = (z == j)
            Xj = Xfull[idx]
            yj = y[idx]
            V = np.linalg.inv((Xj.T @ Xj) / sigma**2 + eye2 / C)
            m = V @ (Xj.T @ yj) / sigma**2
            gamma[j] = rng.multivariate_normal(m, V)
        if it >= burnin:
            k = it - burnin
            w_samples[k] = w
            gamma_samples[k] = gamma
            z_samples[k] = z
    return {
        "w_samples": w_samples,
        "gamma_samples": gamma_samples,
        "z_samples": z_samples,
    }
In the following code, we use the Gibbs sampler on the simulated data shown above.
out = gibbs_mix3_regression(x, y, sigma=0.5, C=100.0, n_iter=6000, burnin=1000, seed=1)
Below we compute the posterior means of the parameters $w_j$ and $\gamma_j$ for $j = 1, 2, 3$.
w_post_mean = out["w_samples"].mean(axis=0)
gamma_post_mean = out["gamma_samples"].mean(axis=0)
print("posterior mean of w:", w_post_mean)
print("posterior mean of gammas:")
print(gamma_post_mean)
posterior mean of w: [0.27366438 0.41034453 0.31599109]
posterior mean of gammas:
[[ 3.10019646 -1.00922764]
 [-0.96064476  0.47280818]
 [ 0.9931716   1.54916337]]
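Note that the fitted components need not appear in the same order as the true coefficient pairs: mixture components are only identified up to relabeling. As a quick check, the following sketch (our own helper, not part of the lab itself) reorders the posterior-mean lines to best match the true ones.
from itertools import permutations
# True coefficient pairs used in the simulation above.
true_gammas = np.array([[3.0, -1.0], [1.0, 1.5], [-1.0, 0.5]])
# Find the relabeling of the fitted components closest to the true lines.
best_perm = min(
    permutations(range(3)),
    key=lambda p: np.sum((gamma_post_mean[list(p)] - true_gammas) ** 2),
)
print("fitted components, reordered to match the true lines:")
print(gamma_post_mean[list(best_perm)])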
The Gibbs sampler produces samples of $w$, $\gamma_1, \gamma_2, \gamma_3$ and $z_1, \dots, z_n$. In the plot below, we look at the data points and then the lines corresponding to each posterior sample of $\gamma_1, \gamma_2, \gamma_3$.
plt.figure(figsize=(8, 6))
plt.scatter(x, y, s=3, alpha=0.7, color='black')
xgrid = np.linspace(x.min(), x.max(), 200)
gamma_samples = out["gamma_samples"]
#Instead of plotting all the posterior samples, let us plot the most recent 500 samples:
k_plot = 500
idx = np.arange(len(gamma_samples) - k_plot, len(gamma_samples))
for t in idx:
    for j in range(3):
        plt.plot(
            xgrid,
            gamma_samples[t, j, 0] + gamma_samples[t, j, 1] * xgrid,
            alpha=0.08, color='lightblue'
        )
for j in range(3):
    plt.plot(
        xgrid,
        gamma_post_mean[j, 0] + gamma_post_mean[j, 1] * xgrid,
        linewidth=2.5,
        label=f"posterior mean line {j+1}"
    )
plt.xlabel('x')
plt.ylabel('y')
plt.title('Scatter plot with posterior draws of regression lines')
plt.legend()
plt.grid(True)
plt.show()
In the plot below, we also show the true regression lines.
plt.figure(figsize=(8, 6))
plt.scatter(x, y, s=3, alpha=0.7, color='black')
xgrid = np.linspace(x.min(), x.max(), 200)
gamma_samples = out["gamma_samples"]
k_plot = 500
idx = np.arange(len(gamma_samples) - k_plot, len(gamma_samples))
for t in idx:
    for j in range(3):
        plt.plot(
            xgrid,
            gamma_samples[t, j, 0] + gamma_samples[t, j, 1] * xgrid,
            alpha=0.08, color='lightblue'
        )
for j in range(3):
    plt.plot(
        xgrid,
        gamma_post_mean[j, 0] + gamma_post_mean[j, 1] * xgrid,
        linewidth=2.5)
true_gammas = np.array([
    [3.0, -1.0],
    [1.0, 1.5],
    [-1.0, 0.5]
])
for j in range(3):
    plt.plot(
        xgrid,
        true_gammas[j, 0] + true_gammas[j, 1] * xgrid,
        linestyle='--',
        linewidth=1.5,
    )
plt.xlabel('x')
plt.ylabel('y')
plt.title('Scatter plot with posterior draws and true regression lines')
plt.grid(True)
plt.show()
Another advantage of the mixture of regressions model is that we can also separate the data points into three groups. Based on the posterior samples for $z_i$, we can calculate the posterior mode of $z_i$ for each $i$. Then the points can be separated into three groups based on whether the posterior mode equals $1$, $2$, or $3$. This is done below.
z_samples = out["z_samples"]  # shape: (n_saved, n)
z_mode = np.zeros(z_samples.shape[1], dtype=int)
for i in range(z_samples.shape[1]):
    z_mode[i] = np.bincount(z_samples[:, i], minlength=3).argmax()
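# A vectorized alternative to the loop above (our own equivalent sketch):
# count label occurrences per point, then take the argmax over labels.
counts_per_point = (z_samples[:, :, None] == np.arange(3)).sum(axis=0)  # (n, 3)
assert np.array_equal(counts_per_point.argmax(axis=1), z_mode)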
colors = np.array(['tab:blue', 'tab:orange', 'tab:green'])
point_colors = colors[z_mode]
plt.figure(figsize=(8, 6))
plt.scatter(x, y, s=12, c=point_colors, alpha=0.8)
xgrid = np.linspace(x.min(), x.max(), 200)
true_gammas = np.array([
[3.0, -1.0],
[1.0, 1.5],
[-1.0, 0.5]
])
for j in range(3):
    plt.plot(
        xgrid,
        true_gammas[j, 0] + true_gammas[j, 1] * xgrid,
        linestyle='--',
        linewidth=2,
        color='black'
    )
plt.xlabel('x')
plt.ylabel('y')
plt.title('Data Points colored by posterior mode of $z_i$')
plt.grid(True)
plt.show()
The mixture of regressions model assumes that the population is made up of three groups, each corresponding to a different regression line. Given data (without any group information), the model can be used to divide the data into three groups, as shown above.
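Since the data were simulated, the true labels are available, so we can check how well the posterior modes recover them. Labels are only identified up to a permutation of the components, so the sketch below (our own check, not part of the original lab) scores the best relabeling.
from itertools import permutations
# Agreement between posterior-mode labels and the true simulated labels z,
# maximized over the 3! possible relabelings of the components.
agreement = max(
    np.mean(np.array(p)[z_mode] == z) for p in permutations(range(3))
)
print(f"best-permutation agreement with the true labels: {agreement:.3f}")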