Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

STAT 238 - Bayesian Statistics Lecture Nineteen

Spring 2026, UC Berkeley

A High-Dimensional Linear Regression Model

We have a response variable yy and a single covariate xx. In our example, yy denotes weekly earnings and xx denotes years of experience. The covariate xx takes the values 0,1,,m0, 1, \dots, m for some fixed integer mm.

Our data is (xi,yi),i=1,,n(x_i, y_i), i = 1, \dots, n. The model is:

yi=β0+β1xi+β2ReLU(xi1)++βmReLU(xi(m1))+ϵi\begin{align} y_i = \beta_0 + \beta_1 x_i + \beta_2 \relu(x_i - 1) + \dots + \beta_{m}\relu(x_i - (m-1)) + \epsilon_i \end{align}

where ReLU(u):=max(u,0)\relu(u) := \max(u, 0), and ϵii.i.dN(0,σ2)\epsilon_i \overset{\text{i.i.d}}{\sim} N(0, \sigma^2). Note we did not include ReLU(xim)\relu(x_i - m) because it always equals 0.

This linear regression equation can be written in the usual notation as:

y=Xβ+ϵ\begin{align} y = X \beta + \epsilon \end{align}

where y=(y1,,yn)Ty = (y_1, \dots, y_n)^T and XX is the n×(m+1)n \times (m+1) matrix with columns 1, xx, ReLU(xj)\relu(x - j) for j=1,,m1j = 1, \dots, m-1.

Recap: Linear Regression with default priors

As we saw before, the default prior on β0,,βm\beta_0, \dots, \beta_m in the linear regression model (2) is:

β0,,βmi.i.duniform(C,C)\begin{align} \beta_0, \dots, \beta_{m} \overset{\text{i.i.d}}{\sim} \text{uniform}(-C, C) \end{align}

with CC \rightarrow \infty. With this prior, the posterior of β\beta given σ\sigma becomes (see e.g., Problem 7(a) in Homework Three)

βdata,σN((XTX)1XTy,σ2(XTX)1)\begin{align} \beta \mid \text{data}, \sigma \sim N\left((X^T X)^{-1} X^T y, \sigma^2 (X^T X)^{-1} \right) \end{align}

which means that the posterior mean of β\beta is just the least squares estimator (XTX)1XTy(X^T X)^{-1} X^T y (note that this posterior mean of β\beta does not depend on σ\sigma, so is both the conditional posterior mean given σ\sigma, as well as the unconditional posterior mean).

The fact that the posterior mean of β\beta coincides with the least squares estimator is shared by some priors other than (3) as well. For example, this is also true for the following prior:

β0,,βmi.i.dN(0,C)\begin{align} \beta_0, \dots, \beta_m \overset{\text{i.i.d}}{\sim} N(0, C) \end{align}

as CC \rightarrow \infty. This can be seen by the following general result (which follows, for example, from Problem 9 in Homework Two):

βσN(0,Q)    βdata,σN((XTXσ2+Q1)1XTyσ2,(XTXσ2+Q1)1).\begin{align} \beta \mid \sigma \sim N(0, Q) \implies \beta \mid \text{data}, \sigma \sim N \left(\left(\frac{X^T X}{\sigma^2} + Q^{-1} \right)^{-1} \frac{X^T y}{\sigma^2}, \left(\frac{X^T X}{\sigma^2} + Q^{-1}\right)^{-1} \right). \end{align}

From here, it is clear that if QQ is chosen so that Q10Q^{-1} \approx 0, then we get the same posterior as (4). The prior (5) corresponds to Q=CIQ = C I so that Q1=(1/C)IQ^{-1} = (1/C)I which goes to zero when CC \rightarrow \infty. So the prior (5) leads to the same posterior (4).

Regularization

In the earnings-experience regression example, we have seen in the last class that the least squares estimator for (2) is not interpretable. We want the fitted curve

xβ^0+β^1x+j=2mβ^jReLU(x(j1))\begin{align*} x \mapsto \hat{\beta}_0 + \hat{\beta}_1 x + \sum_{j=2}^m \hat{\beta}_j \relu(x - (j-1)) \end{align*}

to be smooth for it to be interpretable. This can be achieved in the Bayesian setting if we use the prior:

β0,β1i.i.dN(0,C)   and   β2,,βmi.i.dN(0,τ2).\begin{align} \beta_0, \beta_1 \overset{\text{i.i.d}}{\sim} N(0, C) ~~ \text{ and } ~~ \beta_2, \dots, \beta_{m} \overset{\text{i.i.d}}{\sim} N(0, \tau^2). \end{align}

for some small τ\tau. The above prior is equivalent to βN(0,Q)\beta \sim N(0, Q) where QQ is the (m+1)×(m+1)(m+1) \times (m+1) diagonal matrix with diagonal entries C,C,τ2,,τ2C, C, \tau^2, \dots, \tau^2. So using the formula (6), the posterior mean corresponding to this prior is:

E(βdata,σ,τ)=(XTXσ2+Q1)1XTyσ2.\begin{align*} \E \left(\beta \mid \text{data}, \sigma, \tau \right) = \left(\frac{X^T X}{\sigma^2} + Q^{-1} \right)^{-1} \frac{X^T y}{\sigma^2}. \end{align*}

Because QQ is diagonal with diagonal entries C,C,1/τ2,,1/τ2C, C, 1/\tau^2, \dots, 1/\tau^2, it is easy to see that Q1Q^{-1} is also diagonal with diagonal entries 1/C,1/C,τ2,,τ21/C, 1/C, \tau^{-2}, \dots, \tau^{-2}. Because CC is large, we can approximate Q1Q^{-1} by the matrix with diagonal entries 0,0,τ2,,τ20, 0, \tau^{-2}, \dots, \tau^{-2}. In other words:

Q11τ2J    where J=(0000000000001000001000001)\begin{align*} Q^{-1} \approx \frac{1}{\tau^2} J ~~~ \text{ where } J = \begin{pmatrix} 0 & 0 & 0 & 0 & \cdot & \cdot & \cdot & 0 \\ 0 & 0 & 0 & 0 & \cdot & \cdot & \cdot & 0 \\ 0 & 0 & 1 & 0 &\cdot & \cdot & \cdot & 0 \\ 0 & 0 & 0 & 1 & \cdot & \cdot & \cdot & 0 \\ \cdot & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\ 0 & 0 & 0 & 0 & \cdot & \cdot & \cdot & 1 \\ \end{pmatrix} \end{align*}

Note that JJ is an (m+1)×(m+1)(m+1) \times (m+1) diagonal matrix. Thus the posterior mean becomes:

E(βdata,σ,τ)=(XTXσ2+Jτ2)1XTyσ2=(XTX+Jτ2/σ2)1XTy.\begin{align*} \E \left(\beta \mid \text{data}, \sigma, \tau \right) = \left(\frac{X^T X}{\sigma^2} + \frac{J}{\tau^2} \right)^{-1} \frac{X^T y}{\sigma^2} = \left(X^T X + \frac{J}{\tau^2/\sigma^2} \right)^{-1} X^T y. \end{align*}

We use the notation τ2=σ2γ2\tau^2 = \sigma^2 \gamma^2 or equivalently γ2=τ2/σ2\gamma^2 = \tau^2/\sigma^2. This gives

E(βdata,σ,τ)=(XTX+Jγ2)1XTy.\begin{align} \E \left(\beta \mid \text{data}, \sigma, \tau \right) = \left(X^T X + \frac{J}{\gamma^2} \right)^{-1} X^T y. \end{align}

It can be verified that this quantity (XTX+J/γ2)1XTy(X^TX + J/\gamma^2)^{-1} X^T y coincides with a ridge regularized least squares estimator. This is shown in the next section.

Connection to Ridge Regularization

For the model (1), the ridge regression estimator β^ridge(λ)\ridge for β\beta is given by the minimizer of:

i=1n(yiβ0β1xiβ2ReLU(xi1)βmReLU(xi(m1)))2+λ(β22+β32++βm2).\begin{split} & \sum_{i=1}^n \left(y_i - \beta_0 - \beta_1 x_i - \beta_2 \relu(x_i-1) - \dots - \beta_{m} \relu(x_i-(m-1)) \right)^2 \\ &+ \lambda \left(\beta_2^2 + \beta_3^2 + \dots + \beta_{m}^2 \right). \end{split}

The objective function above can also be written as

yXβ2+λj=2mβj2.\|y - X \beta\|^2 + \lambda \sum_{j=2}^{m} \beta_j^2.

It turns out that β^ridge(λ)\ridge can be written in closed form using matrix notation. To see this, note first that the gradient of the above objective function with respect to β\beta is given by

(yXβ2+λj=2mβj2)=2XTy+2XTXβ+2λ(00β2βm)\nabla \left( \|y - X \beta\|^2 + \lambda \sum_{j=2}^{m} \beta_j^2 \right) = -2 X^T y + 2 X^T X \beta + 2 \lambda \begin{pmatrix} 0 \\ 0 \\ \beta_2 \\ \cdot \\ \cdot \\ \cdot \\ \beta_{m} \end{pmatrix}

Recalling that JJ is the (m+1)×(m+1)(m+1) \times (m+1) diagonal matrix with diagonal entries 0,0,1,,10, 0, 1, \dots, 1, we can write

(yXβ2+λt=2n1βj2)=2XTy+2XTXβ+2λ(00β2βn1)=2XTy+2XTXβ+2λJβ.\nabla \left( \|y - X \beta\|^2 + \lambda \sum_{t=2}^{n-1} \beta_j^2 \right) = -2 X^T y + 2 X^T X \beta + 2 \lambda \begin{pmatrix} 0 \\ 0 \\ \beta_2 \\ \cdot \\ \cdot \\ \cdot \\ \beta_{n-1} \end{pmatrix} = -2 X^T y + 2 X^T X \beta + 2 \lambda J \beta.

Setting this gradient equal to zero, we get

2XTy+2XTXβ+2λJβ=0    (XTX+λJ)β=XTy.-2 X^T y + 2 X^T X \beta + 2 \lambda J \beta = 0 \implies \left(X^T X + \lambda J \right) \beta = X^T y.

which gives

β^ridge(λ)=(XTX+λJ)1XTy.\ridge = (X^T X + \lambda J)^{-1} X^T y.

Note that λ=0\lambda = 0 corresponds to least squares, and λ\lambda \rightarrow \infty leads to linear regression. When λ\lambda is set to be neither very close to 0 nor very large, we should get a smooth estimate that still gives a good fit to the data.

Comparing (14) to (12), it is clear that they are identical when λ=1/γ2\lambda = 1/\gamma^2. In other words, when we use the prior (8), then the posterior mean (given τ\tau and σ\sigma) coincides with ridge regularized least squares with regularization parameter λ=1/γ2=σ2/τ2\lambda = 1/\gamma^2 = \sigma^2/\tau^2.

Bayesian inference for γ\gamma and σ\sigma

The big advantage of Bayesian inference is that it also allows inference on γ\gamma and σ\sigma. We shall use the prior:

logγ,logσi.i.duniform(,).\begin{align*} \log \gamma, \log \sigma \overset{\text{i.i.d}}{\sim} \text{uniform}(-\infty, \infty). \end{align*}

Recall again that τ=γσ\tau = \gamma \sigma. The joint prior on all the parameters (β,γ,σ)(\beta, \gamma, \sigma) is:

fβ,γ,σ(β,γ,σ)1γσ1detQexp(12βTQ1β).\begin{align*} f_{\beta, \gamma, \sigma}(\beta, \gamma, \sigma) \propto \frac{1}{\gamma \sigma} \frac{1}{\sqrt{\det Q}} \exp \left(-\frac{1}{2} \beta^T Q^{-1} \beta \right). \end{align*}

Here QQ is the (m+1)×(m+1)(m+1) \times (m+1) diagonal matrix with diagonal entries C,C,γ2σ2,,γ2σ2C, C, \gamma^2 \sigma^2, \dots, \gamma^2 \sigma^2.

With this prior, we shall argue in the next lecture that the posterior is given by:

βdata,σ,γN((XTX+γ2J)1XTy,σ2(XTX+γ2J)1)\begin{align} \beta \mid \text{data}, \sigma, \gamma \sim N \left(\left(X^T X + \gamma^{-2} J \right)^{-1} X^T y, \sigma^2 \left(X^T X + \gamma^{-2} J \right)^{-1} \right) \end{align}

and

1σ2data,γGamma(n21,yTyyTX(XTX+γ2J)1XTy2.)\begin{align} \frac{1}{\sigma^2} \mid \text{data}, \gamma \sim \text{Gamma}\left(\frac{n}{2} - 1, \frac{y^T y - y^T X \left(X^T X + \gamma^{-2} J \right)^{-1} X^T y}{2}. \right) \end{align}

and

fγdata(γ)γmXTX+γ2J1/2(yTyyTX(XTX+γ2J)1XTy)(n/2)1.\begin{align*} f_{\gamma \mid \text{data}}(\gamma) &\propto \frac{\gamma^{-m} \left|X^T X + \gamma^{-2} J \right|^{-1/2}}{\left(y^T y - y^T X \left(X^T X + \gamma^{-2} J \right)^{-1} X^T y \right)^{(n/2) - 1}}. \end{align*}

With this, inference can be carried out by first taking a grid of γ\gamma values and computing the above posterior (on the logarithmic scale) at the grid points. This posterior can be used to obtain posterior samples of γ\gamma. For each sample of γ\gamma, we can then sample σ\sigma using the distribution (18). Given samples from both γ\gamma and σ\sigma, we can then sample β\beta using (17).