
STAT 238 - Bayesian Statistics Lecture Twenty

Spring 2026, UC Berkeley

Bayesian Inference in a High-Dimensional Linear Regression Model

We studied the following model in the last lecture.

We have a response variable $y$ and a single covariate $x$. In our example, $y$ denotes weekly earnings and $x$ denotes years of experience. The covariate $x$ takes the values $0, 1, \dots, m$ for some fixed integer $m$.

Our data are $(x_i, y_i), i = 1, \dots, n$. The model is:

\begin{align} y_i = \beta_0 + \beta_1 x_i + \beta_2 \mathrm{ReLU}(x_i - 1) + \dots + \beta_{m}\mathrm{ReLU}(x_i - (m-1)) + \epsilon_i \end{align}

where $\mathrm{ReLU}(u) := \max(u, 0)$, and $\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$. Note that we did not include $\mathrm{ReLU}(x_i - m)$ because it always equals 0.

Model (1) can be rewritten in the usual regression form as:

\begin{align*} y = X \beta + \epsilon ~~\text{with $\epsilon \sim N(0, \sigma^2 I_n)$}. \end{align*}

Here $X$ is the $n \times (m+1)$ matrix with columns $1$, $x$, and $\mathrm{ReLU}(x-j)$ for $j = 1, \dots, m-1$, where $x$ denotes the vector of observed values of the experience variable (with $\mathrm{ReLU}$ applied componentwise).
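As a quick illustration of this construction, the design matrix can be assembled directly from the observed experience values. This is a minimal numpy sketch; `design_matrix` is a hypothetical helper name, not code from the course materials.

```python
import numpy as np

def design_matrix(x, m):
    """Build the n x (m+1) design matrix with columns
    1, x, ReLU(x-1), ..., ReLU(x-(m-1))."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x]
    for j in range(1, m):
        cols.append(np.maximum(x - j, 0.0))  # ReLU(x - j), componentwise
    return np.column_stack(cols)

# Example: experience values 0..5, so m = 5 and X is 6 x 6.
x = np.array([0, 1, 2, 3, 4, 5])
X = design_matrix(x, m=5)
print(X.shape)  # (6, 6)
```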

The usual least squares analysis coincides with Bayesian inference using the prior:

\begin{align} \beta_0, \dots, \beta_m \overset{\text{i.i.d.}}{\sim} N(0, C) \text{ and } \log \sigma \sim \text{uniform}(-C, C) \end{align}

for $C \rightarrow \infty$. In Problem 7 of Homework 3, it is shown that the posterior for the above prior is:

\begin{align*} \beta \mid \text{data}, \sigma \sim N_{m+1}(\hat{\beta}, \sigma^2 (X^T X)^{-1}) ~~\text{and}~~ f_{\sigma \mid \text{data}}(\sigma) \propto \sigma^{-n+m} \exp \left(-\frac{y^T y - y^T X (X^T X)^{-1} X^T y}{2 \sigma^2} \right) I\{\sigma > 0\} \end{align*}

where $\hat{\beta} = (X^T X)^{-1} X^T y$ is the least squares estimator. If we marginalize $\sigma$, the resulting posterior distribution of $\beta$ is a multivariate $t$-distribution. The posterior distribution of $\sigma$ above can be written in terms of the Gamma (equivalently, chi-squared) distribution (this is Problem 7(d) in Homework 3) as:

\begin{align*} \frac{1}{\sigma^2} \,\Big|\, \text{data} \sim \text{Gamma} \left(\frac{n-m-1}{2}, \frac{y^T y - y^T X (X^T X)^{-1} X^T y}{2} \right). \end{align*}
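The two closed-form distributions above make posterior simulation under the flat prior straightforward: draw $1/\sigma^2$ from the Gamma distribution, then draw $\beta$ from its conditional normal. A sketch assuming numpy; `sample_flat_prior_posterior` is a hypothetical name, and note that numpy's gamma sampler is parametrized by shape and scale, so rate $\mathrm{RSS}/2$ becomes scale $2/\mathrm{RSS}$.

```python
import numpy as np

def sample_flat_prior_posterior(X, y, n_draws=1000, rng=None):
    """Draw (beta, sigma) from the flat-prior posterior:
    1/sigma^2 | data ~ Gamma((n-m-1)/2, RSS/2), then
    beta | sigma, data ~ N(beta_hat, sigma^2 (X^T X)^{-1})."""
    rng = np.random.default_rng(rng)
    n, p = X.shape                          # p = m + 1
    XtX = X.T @ X
    beta_hat = np.linalg.solve(XtX, X.T @ y)
    rss = y @ y - y @ X @ beta_hat          # y'y - y'X (X'X)^{-1} X'y
    # numpy gamma takes (shape, scale): rate RSS/2 -> scale 2/RSS
    inv_sigma2 = rng.gamma((n - p) / 2, 2.0 / rss, size=n_draws)
    sigma = 1.0 / np.sqrt(inv_sigma2)
    L = np.linalg.cholesky(np.linalg.inv(XtX))
    betas = beta_hat + sigma[:, None] * (rng.standard_normal((n_draws, p)) @ L.T)
    return betas, sigma
```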

Because least squares does not give sensible results in this example, we change the prior in (2) to:

\begin{align*} \beta_0 \sim N(0, C), \quad \beta_1 \sim N(0, C), \quad \beta_2, \dots, \beta_m \overset{\text{i.i.d.}}{\sim} N(0, \tau^2), ~~ \text{ and } \log \sigma \sim \text{uniform}(-C, C). \end{align*}

We have therefore introduced a new parameter $\tau$ which controls the scale of $\beta_2, \dots, \beta_m$. For now, treat $\tau$ as fixed. The prior on $\beta$ and $\sigma$ is now:

\begin{align*} f_{\beta, \sigma}(\beta, \sigma) \propto \frac{1}{\sigma} \frac{1}{\sqrt{\det Q}} \exp \left(-\frac{1}{2} \beta^T Q^{-1} \beta \right) \end{align*}

where $Q$ is the $(m+1) \times (m+1)$ diagonal matrix with diagonal entries $C, C, \tau^2, \dots, \tau^2$.

The likelihood is unchanged:

\begin{align*} \left(\frac{1}{\sqrt{2 \pi}} \right)^n \sigma^{-n} \exp \left(-\frac{1}{2 \sigma^2} \|y - X \beta\|^2 \right). \end{align*}

So the posterior for $\beta, \sigma$ is:

\begin{align*} f_{\beta,\sigma \mid \text{data}}(\beta, \sigma) & \propto \frac{\sigma^{-n-1}}{\sqrt{\det Q}} \exp \left(-\frac{1}{2} \left(\frac{1}{\sigma^2} \|y - X \beta\|^2 + \beta^T Q^{-1} \beta \right) \right). \end{align*}

Completing the square in $\beta$:

\begin{align*} \frac{1}{\sigma^2} \|y - X \beta\|^2 + \beta^T Q^{-1} \beta &= \left( \beta - \hat{\beta}\right)^T \left(\frac{X^T X}{\sigma^2} + Q^{-1} \right) \left(\beta - \hat{\beta} \right) \\ &+ \frac{y^T y}{\sigma^2} - \left(\frac{y^T X}{\sigma^2} \right) \left(\frac{X^T X}{\sigma^2} + Q^{-1} \right)^{-1} \left(\frac{X^T y}{\sigma^2} \right), \end{align*}

where

\begin{align*} \hat{\beta} = \left(\frac{X^T X}{\sigma^2} + Q^{-1} \right)^{-1} \frac{X^T y}{\sigma^2} \end{align*}

is a ridge-type estimator (this overloads the notation $\hat{\beta}$, which earlier denoted the least squares estimator). One can then show that

\begin{align} \beta \mid \text{data}, \sigma, \tau \sim N\left( \left(\frac{X^T X}{\sigma^2} + Q^{-1} \right)^{-1} \frac{X^T y}{\sigma^2}, \left(\frac{X^T X}{\sigma^2} + Q^{-1} \right)^{-1} \right) \end{align}
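The completing-the-square identity behind this normal form is easy to check numerically. The sketch below, with arbitrary made-up dimensions and values, assumes numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 4
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta = rng.standard_normal(p)
sigma2 = 1.7
Q = np.diag([100.0, 100.0, 0.25, 0.25])   # diag(C, C, tau^2, tau^2)
Qinv = np.linalg.inv(Q)

M = X.T @ X / sigma2 + Qinv
beta_hat = np.linalg.solve(M, X.T @ y / sigma2)

# left side: (1/sigma^2)||y - X beta||^2 + beta' Q^{-1} beta
lhs = (y - X @ beta) @ (y - X @ beta) / sigma2 + beta @ Qinv @ beta
# right side of the completed square; M beta_hat = X'y / sigma^2
rhs = ((beta - beta_hat) @ M @ (beta - beta_hat)
       + y @ y / sigma2
       - (X.T @ y / sigma2) @ beta_hat)
assert np.isclose(lhs, rhs)
```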

Note that when $Q \rightarrow \infty$ (i.e., when $C \rightarrow \infty$ and $\tau \rightarrow \infty$), this reverts to $N((X^T X)^{-1} X^T y, \sigma^2 (X^T X)^{-1})$. To get the posterior of $\sigma$ given $\tau$, we integrate $\beta$ out of the joint posterior to obtain:


\begin{align*} & f_{\sigma \mid \text{data}, \tau}(\sigma) \\ &\propto \frac{\sigma^{-n-1}}{\sqrt{\det Q}} \sqrt{\det \left(\frac{X^T X}{\sigma^2} + Q^{-1} \right)^{-1}} \exp \left(-\frac{y^T y}{2 \sigma^2} \right)\exp \left(\frac{y^T X}{2\sigma^2} \left(\frac{X^T X}{\sigma^2} + Q^{-1} \right)^{-1} \frac{X^T y}{\sigma^2} \right). \end{align*}

Simplifying using $\det Q \propto (\tau^2)^{m-1}$ and $Q^{-1} \approx J/\tau^2$ (as $C \to \infty$), where $J$ is the $(m+1) \times (m+1)$ diagonal matrix with diagonal entries $0, 0, 1, \dots, 1$, we get

\begin{align*} & f_{\sigma \mid \text{data}, \tau}(\sigma) \\ &\propto \frac{\sigma^{-n+m}}{\tau^{m-1}} \left|X^T X + \frac{\sigma^2}{\tau^2} J \right|^{-1/2} \exp \left(-\frac{y^T y - y^T X \left(X^T X + \sigma^2\tau^{-2} J \right)^{-1} X^T y}{2\sigma^2} \right) \end{align*}

This distribution is not easy to handle because the ratio $\sigma^2/\tau^2$ appears both inside the determinant and inside the inverse in the exponent. One trick to simplify it is to reparametrize by setting $\tau = \sigma \gamma$. With this, the conditional density above becomes

\begin{align*} & f_{\sigma \mid \text{data}, \gamma}(\sigma) \\ &\propto \frac{1}{\gamma^{m-1}\sigma^{n-1}} \left|X^T X + \gamma^{-2} J \right|^{-1/2} \exp \left(-\frac{y^T y - y^T X \left(X^T X + \gamma^{-2} J \right)^{-1} X^T y}{2\sigma^2} \right) \end{align*}

Dropping factors above which do not depend on $\sigma$ (they are constants for this conditional density), we get

\begin{align*} f_{\sigma \mid \text{data}, \gamma}(\sigma) \propto \sigma^{-n+1} \exp \left(-\frac{y^T y - y^T X \left(X^T X + \gamma^{-2} J \right)^{-1} X^T y}{2\sigma^2} \right). \end{align*}

The right-hand side above is the density of an inverse gamma distribution (see Inverse-gamma distribution). This can be seen by converting it into the density of $1/\sigma^2$:

\begin{align*} f_{1/\sigma^2 \mid \text{data}, \gamma}(x) &\propto f_{\sigma \mid \text{data}, \gamma}(x^{-1/2})\, x^{-3/2} \\ &\propto \left(x^{-1/2} \right)^{-n+1} \exp \left(-x\, \frac{y^T y - y^T X \left(X^T X + \gamma^{-2} J \right)^{-1} X^T y}{2} \right) x^{-3/2} \\ &= x^{(n-4)/2} \exp \left(-x\, \frac{y^T y - y^T X \left(X^T X + \gamma^{-2} J \right)^{-1} X^T y}{2} \right). \end{align*}

Thus

\begin{align} \frac{1}{\sigma^2} \,\Big|\, \text{data}, \gamma \sim \text{Gamma}\left(\frac{n}{2} - 1, \frac{y^T y - y^T X \left(X^T X + \gamma^{-2} J \right)^{-1} X^T y}{2} \right). \end{align}

Thus when $\gamma$ is fixed, we can perform inference on $\beta, \sigma$ using the closed-form formulae above. Since $\gamma$ is also unknown, we place a prior $p(\gamma)$ on it and calculate its posterior.

Finally, we can marginalize $\sigma$ to obtain the posterior of $\gamma$ alone as follows:

\begin{align*} f_{\gamma \mid \text{data}}(\gamma) &\propto \int_0^{\infty} p(\gamma) \frac{1}{\gamma^{m-1} \sigma^{n-1}} \left|X^T X + \gamma^{-2} J \right|^{-1/2} \exp \left(-\frac{y^T y - y^T X \left(X^T X + \gamma^{-2} J \right)^{-1} X^T y}{2\sigma^2} \right) d\sigma \\ &= p(\gamma) \gamma^{-m+1} \left|X^T X + \gamma^{-2} J \right|^{-1/2} \int_0^{\infty} \sigma^{-n+1}\exp \left(-\frac{y^T y - y^T X \left(X^T X + \gamma^{-2} J \right)^{-1} X^T y}{2\sigma^2} \right) d\sigma. \end{align*}

Letting

\begin{align*} A := y^T y - y^T X \left(X^T X + \gamma^{-2} J \right)^{-1} X^T y, \end{align*}

we get

\begin{align*} f_{\gamma \mid \text{data}}(\gamma) &\propto p(\gamma) \gamma^{-m+1} \left|X^T X + \gamma^{-2} J \right|^{-1/2} \int_0^{\infty} \sigma^{-n+1} \exp \left(-\frac{A}{2 \sigma^2} \right) d\sigma. \end{align*}

By the change of variable $\sigma = s \sqrt{A}$, we obtain

\begin{align*} f_{\gamma \mid \text{data}}(\gamma) &\propto p(\gamma) \gamma^{-m+1} \left|X^T X + \gamma^{-2} J \right|^{-1/2} A^{-(n/2) + 1} \int_0^{\infty} s^{-n+1} \exp \left(-\frac{1}{2s^2} \right) ds \\ &\propto p(\gamma) \gamma^{-m+1} \left|X^T X + \gamma^{-2} J \right|^{-1/2} A^{-(n/2) + 1} \\ &= \frac{p(\gamma)\gamma^{-m+1} \left|X^T X + \gamma^{-2} J \right|^{-1/2}}{\left(y^T y - y^T X \left(X^T X + \gamma^{-2} J \right)^{-1} X^T y \right)^{(n/2) - 1}}. \end{align*}
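The final proportionality is best evaluated on the log scale for numerical stability. A sketch assuming numpy; `log_post_gamma` is an illustrative name, and `slogdet` avoids overflow in the determinant:

```python
import numpy as np

def log_post_gamma(gamma, X, y, m, log_prior=lambda g: -np.log(g)):
    """Unnormalized log posterior of gamma after marginalizing beta and
    sigma, following the final proportionality above."""
    n = X.shape[0]
    J = np.eye(m + 1)
    J[0, 0] = J[1, 1] = 0.0              # beta_0, beta_1 are not shrunk
    M = X.T @ X + J / gamma**2
    A = y @ y - y @ X @ np.linalg.solve(M, X.T @ y)   # the quantity A
    logdet = np.linalg.slogdet(M)[1]
    return (log_prior(gamma) - (m - 1) * np.log(gamma)
            - 0.5 * logdet - (n / 2 - 1) * np.log(A))
```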

We usually choose $p(\gamma)$ as:

\begin{align*} p(\gamma) \propto \frac{1}{\gamma} I\{\text{low} < \gamma < \text{high}\} \end{align*}

for two fixed values $\text{low}$ and $\text{high}$. These can be taken to be the values of $\gamma$ for which the posterior mean of $\beta$ (which coincides with a ridge regression estimate) underfits and overfits the data, respectively.

Inference can be carried out by first taking a grid of $\gamma$ values and computing the above posterior (on the logarithmic scale) at the grid points. This posterior can be used to obtain posterior samples of $\gamma$. For each sample of $\gamma$, we can then sample $\sigma$ using the distribution (4). Given samples of both $\gamma$ and $\sigma$, we can then sample $\beta$ using (3).
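Putting the pieces together, the whole procedure might be sketched as follows, assuming numpy; `posterior_samples`, the grid handling, and the prior $p(\gamma) \propto 1/\gamma$ baked in below are illustrative choices, not the course implementation:

```python
import numpy as np

def posterior_samples(X, y, m, grid, n_draws=500, rng=None):
    """Sample (gamma, sigma, beta) hierarchically: gamma from its grid
    posterior, then 1/sigma^2 | gamma from its Gamma distribution,
    then beta | sigma, gamma from its multivariate normal."""
    rng = np.random.default_rng(rng)
    n, p = X.shape                       # p = m + 1
    J = np.eye(p)
    J[0, 0] = J[1, 1] = 0.0              # no shrinkage on beta_0, beta_1

    # Unnormalized log posterior of gamma on the grid; the -m*log(gamma)
    # term combines the prior 1/gamma with the factor gamma^{-m+1}.
    logpost = np.empty(len(grid))
    A = np.empty(len(grid))
    for i, g in enumerate(grid):
        M = X.T @ X + J / g**2
        A[i] = y @ y - y @ X @ np.linalg.solve(M, X.T @ y)
        logpost[i] = (-m * np.log(g) - 0.5 * np.linalg.slogdet(M)[1]
                      - (n / 2 - 1) * np.log(A[i]))
    w = np.exp(logpost - logpost.max())
    w /= w.sum()

    draws = []
    for _ in range(n_draws):
        i = rng.choice(len(grid), p=w)             # sample gamma from the grid
        g = grid[i]
        inv_s2 = rng.gamma(n / 2 - 1, 2.0 / A[i])  # 1/sigma^2 | gamma
        s = 1.0 / np.sqrt(inv_s2)
        # beta | sigma, gamma, with Q^{-1} ~ J/tau^2 and tau = sigma * gamma
        M = X.T @ X / s**2 + J / (s * g)**2
        mean = np.linalg.solve(M, X.T @ y / s**2)
        beta = rng.multivariate_normal(mean, np.linalg.inv(M))
        draws.append((g, s, beta))
    return draws
```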

Induced Prior for the Regression Function

Our regression function is being modeled as:

\begin{align} f(x) = \beta_0 + \beta_1 x + \beta_2 (x - 1)_+ + \dots + \beta_m(x - (m-1))_+ \end{align}

where $\beta_0, \beta_1, \dots, \beta_m$ are all independent with

\begin{align*} \beta_0, \beta_1 \overset{\text{i.i.d.}}{\sim} N(0, C) ~~ \text{ and } ~~ \beta_2, \dots, \beta_m \overset{\text{i.i.d.}}{\sim} N(0, \tau^2). \end{align*}

This can also be seen more directly as a prior on the regression function $f$. For every $0 \leq u_1 < \dots < u_k < \infty$, the joint distribution of $(f(u_1), \dots, f(u_k))$ induced by (5) is multivariate normal. This implies that $(f(u), u \geq 0)$ is a Gaussian Process. The GP is described by its mean function $\mathbb{E} f(u)$ and its covariance kernel $\text{Cov}(f(u), f(v))$.

The mean function is clearly zero (because each $\mathbb{E} \beta_j = 0$). The covariance kernel is given by:

\begin{align*} &\text{cov}(f(u), f(v)) \\ &= \text{cov} \left(\beta_0 + \beta_1 u + \beta_2 (u - 1)_+ + \dots + \beta_m(u - (m-1))_+,\; \beta_0 + \beta_1 v + \beta_2 (v - 1)_+ + \dots + \beta_m(v - (m-1))_+ \right) \\ &= C(1 + uv) + \tau^2 \sum_{j=1}^{m-1} (u - j)_+ (v - j)_+ \\ &= C(1 + uv) + \tau^2 \sum_{j=1}^{k} (u - j)(v - j) \\ &= C(1 + uv) + \tau^2 \left(u v k + \frac{1}{6} k(k+1)(2k+1) - (u+v)\frac{k(k+1)}{2} \right) \end{align*}

where $k = \lfloor u \wedge v \rfloor$ and $u \wedge v := \min(u, v)$. (The third equality uses the facts that the terms with $j > k$ vanish and that both factors are nonnegative for $j \leq k$; it assumes $u \wedge v \leq m$, so that the sum over $j \leq k$ is contained in the sum over $j \leq m-1$.)
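The closed-form evaluation of the sum can be sanity-checked numerically; a small sketch assuming numpy, with `kernel_sum` and `kernel_closed_form` as hypothetical names:

```python
import numpy as np

def kernel_sum(u, v):
    # direct sum: terms with j > floor(min(u, v)) vanish
    k = int(np.floor(min(u, v)))
    return sum((u - j) * (v - j) for j in range(1, k + 1))

def kernel_closed_form(u, v):
    k = np.floor(min(u, v))
    return u * v * k + k * (k + 1) * (2 * k + 1) / 6 - (u + v) * k * (k + 1) / 2

for u, v in [(2.5, 7.3), (4.0, 4.0), (0.5, 9.9)]:
    assert np.isclose(kernel_sum(u, v), kernel_closed_form(u, v))
```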

Suppose now we use the approximation

\begin{align*} k \approx u \wedge v ~~ \text{ and } ~~ k(k+1)(2k+1) \approx 2k^3 \approx 2(u \wedge v)^3 ~~ \text{ and } ~~ k(k+1) \approx k^2 \approx (u \wedge v)^2. \end{align*}

Then the covariance kernel becomes:

\begin{align*} \text{cov}(f(u), f(v)) \approx C(1 + uv) + \tau^2 \left(u v (u \wedge v) + \frac{(u \wedge v)^3}{3} - \frac{(u \wedge v)^2}{2} (u + v) \right). \end{align*}

It turns out that the right hand side above is precisely the covariance kernel of:

\begin{align*} G_x = \beta_0 + \beta_1 x + \tau I_x \end{align*}

where $\beta_0, \beta_1 \overset{\text{i.i.d.}}{\sim} N(0, C)$ and $I_x$ is integrated Brownian motion on $[0, \infty)$. We will revisit this in the next lecture.
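The claimed match can be checked numerically using the standard covariance of integrated Brownian motion, $\text{Cov}(I_u, I_v) = (u \wedge v)^2 (u \vee v)/2 - (u \wedge v)^3/6$. A sketch assuming numpy, with illustrative function names:

```python
import numpy as np

def approx_kernel(u, v, C=1.0, tau=1.0):
    # right-hand side of the approximate covariance kernel above
    w = min(u, v)
    return C * (1 + u * v) + tau**2 * (u * v * w + w**3 / 3 - w**2 * (u + v) / 2)

def line_plus_ibm_kernel(u, v, C=1.0, tau=1.0):
    # Cov(G_u, G_v) for G_x = b0 + b1*x + tau*I_x, using the standard
    # integrated-Brownian-motion covariance min^2 * max / 2 - min^3 / 6
    w, M = min(u, v), max(u, v)
    return C * (1 + u * v) + tau**2 * (w**2 * M / 2 - w**3 / 6)

for u, v in [(1.2, 3.4), (5.0, 5.0), (0.3, 7.7)]:
    assert np.isclose(approx_kernel(u, v), line_plus_ibm_kernel(u, v))
```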