
STAT 238 - Bayesian Statistics Lecture Twenty

Spring 2026, UC Berkeley

Bayesian Inference in a High-Dimensional Linear Regression Model

We studied the following model in the last lecture.

We have a response variable $y$ and a single covariate $x$. In our example, $y$ denotes weekly earnings and $x$ denotes years of experience. The covariate $x$ takes the values $0, 1, \dots, m$ for some fixed integer $m$.

Our data are $(x_i, y_i), i = 1, \dots, n$. The model is:

\begin{align} y_i = \beta_0 + \beta_1 x_i + \beta_2 \mathrm{ReLU}(x_i - 1) + \dots + \beta_{m}\mathrm{ReLU}(x_i - (m-1)) + \epsilon_i \end{align}

where $\mathrm{ReLU}(u) := \max(u, 0)$, and $\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$. Note that we did not include $\mathrm{ReLU}(x_i - m)$ because it always equals 0.

Model (1) can be rewritten in the usual regression form as:

\begin{align*} y = X \beta + \epsilon ~~\text{with $\epsilon \sim N(0, \sigma^2 I_n)$}. \end{align*}

Here $X$ is the $n \times (m+1)$ matrix with columns $1$, $x$, and $\mathrm{ReLU}(x-j)$ for $j = 1, \dots, m-1$, where $x$ denotes the vector of observed values of the experience variable (with $\mathrm{ReLU}$ applied componentwise).
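As a quick illustration of this construction, the design matrix can be assembled directly from the observed experience values. This is a minimal numpy sketch; `design_matrix` is a hypothetical helper name, not code from the course materials.

```python
import numpy as np

def design_matrix(x, m):
    """Build the n x (m+1) design matrix with columns
    1, x, ReLU(x-1), ..., ReLU(x-(m-1))."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x]
    for j in range(1, m):
        cols.append(np.maximum(x - j, 0.0))  # ReLU(x - j), componentwise
    return np.column_stack(cols)

# Example: experience values 0..5, so m = 5 and X is 6 x 6.
x = np.array([0, 1, 2, 3, 4, 5])
X = design_matrix(x, m=5)
print(X.shape)  # (6, 6)
```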

The usual least squares analysis coincides with Bayesian inference using the prior:

\begin{align} \beta_0, \dots, \beta_m \overset{\text{i.i.d.}}{\sim} N(0, C) \text{ and } \log \sigma \sim \text{uniform}(-C, C) \end{align}

for $C \rightarrow \infty$. In Problem 7 of Homework 3, it is shown that the posterior for the above prior is:

\begin{align*} \beta \mid \text{data}, \sigma \sim N_{m+1}(\hat{\beta}, \sigma^2 (X^T X)^{-1}) ~~\text{and}~~ f_{\sigma \mid \text{data}}(\sigma) \propto \sigma^{-n+m} \exp \left(-\frac{y^T y - y^T X (X^T X)^{-1} X^T y}{2 \sigma^2} \right) I\{\sigma > 0\} \end{align*}

where $\hat{\beta} = (X^T X)^{-1} X^T y$ is the least squares estimator. If we marginalize $\sigma$, the resulting posterior distribution of $\beta$ is a multivariate $t$-distribution. The posterior distribution of $\sigma$ above can be written in terms of the Gamma (equivalently, chi-squared) distribution (this is Problem 7(d) in Homework 3) as:

\begin{align*} \frac{1}{\sigma^2} \,\Big|\, \text{data} \sim \text{Gamma} \left(\frac{n-m-1}{2}, \frac{y^T y - y^T X (X^T X)^{-1} X^T y}{2} \right). \end{align*}
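The two closed-form distributions above make posterior simulation under the flat prior straightforward: draw $1/\sigma^2$ from the Gamma distribution, then draw $\beta$ from its conditional normal. A sketch assuming numpy; `sample_flat_prior_posterior` is a hypothetical name, and note that numpy's gamma sampler is parametrized by shape and scale, so rate $\mathrm{RSS}/2$ becomes scale $2/\mathrm{RSS}$.

```python
import numpy as np

def sample_flat_prior_posterior(X, y, n_draws=1000, rng=None):
    """Draw (beta, sigma) from the flat-prior posterior:
    1/sigma^2 | data ~ Gamma((n-m-1)/2, RSS/2), then
    beta | sigma, data ~ N(beta_hat, sigma^2 (X^T X)^{-1})."""
    rng = np.random.default_rng(rng)
    n, p = X.shape                          # p = m + 1
    XtX = X.T @ X
    beta_hat = np.linalg.solve(XtX, X.T @ y)
    rss = y @ y - y @ X @ beta_hat          # y'y - y'X (X'X)^{-1} X'y
    # numpy gamma takes (shape, scale): rate RSS/2 -> scale 2/RSS
    inv_sigma2 = rng.gamma((n - p) / 2, 2.0 / rss, size=n_draws)
    sigma = 1.0 / np.sqrt(inv_sigma2)
    L = np.linalg.cholesky(np.linalg.inv(XtX))
    betas = beta_hat + sigma[:, None] * (rng.standard_normal((n_draws, p)) @ L.T)
    return betas, sigma
```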

Because least squares does not give sensible results in this example, we change the prior in (2) to:

\begin{align*} \beta_0 \sim N(0, C), \quad \beta_1 \sim N(0, C), \quad \beta_2, \dots, \beta_m \overset{\text{i.i.d.}}{\sim} N(0, \tau^2), ~~ \text{ and } \log \sigma \sim \text{uniform}(-C, C). \end{align*}

We have therefore introduced a new parameter $\tau$ which controls the scale of $\beta_2, \dots, \beta_m$. For now, treat $\tau$ as fixed. The prior on $\beta$ and $\sigma$ is now:

\begin{align*} f_{\beta, \sigma}(\beta, \sigma) \propto \frac{1}{\sigma} \frac{1}{\sqrt{\det Q}} \exp \left(-\frac{1}{2} \beta^T Q^{-1} \beta \right) \end{align*}

where $Q$ is the $(m+1) \times (m+1)$ diagonal matrix with diagonal entries $C, C, \tau^2, \dots, \tau^2$.

The likelihood is unchanged:

\begin{align*} \left(\frac{1}{\sqrt{2 \pi}} \right)^n \sigma^{-n} \exp \left(-\frac{1}{2 \sigma^2} \|y - X \beta\|^2 \right). \end{align*}

So the posterior for $\beta, \sigma$ is:

\begin{align*} f_{\beta,\sigma \mid \text{data}}(\beta, \sigma) & \propto \frac{\sigma^{-n-1}}{\sqrt{\det Q}} \exp \left(-\frac{1}{2} \left(\frac{1}{\sigma^2} \|y - X \beta\|^2 + \beta^T Q^{-1} \beta \right) \right). \end{align*}

Completing the square in $\beta$:

\begin{align*} \frac{1}{\sigma^2} \|y - X \beta\|^2 + \beta^T Q^{-1} \beta &= \left( \beta - \hat{\beta}\right)^T \left(\frac{X^T X}{\sigma^2} + Q^{-1} \right) \left(\beta - \hat{\beta} \right) \\ &+ \frac{y^T y}{\sigma^2} - \left(\frac{y^T X}{\sigma^2} \right) \left(\frac{X^T X}{\sigma^2} + Q^{-1} \right)^{-1} \left(\frac{X^T y}{\sigma^2} \right), \end{align*}

where

\begin{align*} \hat{\beta} = \left(\frac{X^T X}{\sigma^2} + Q^{-1} \right)^{-1} \frac{X^T y}{\sigma^2} \end{align*}

is a ridge-type estimator (this overloads the notation $\hat{\beta}$, which earlier denoted the least squares estimator). One can then show that

\begin{align} \beta \mid \text{data}, \sigma, \tau \sim N\left( \left(\frac{X^T X}{\sigma^2} + Q^{-1} \right)^{-1} \frac{X^T y}{\sigma^2}, \left(\frac{X^T X}{\sigma^2} + Q^{-1} \right)^{-1} \right) \end{align}
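The completing-the-square identity behind this normal form is easy to check numerically. The sketch below, with arbitrary made-up dimensions and values, assumes numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 4
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta = rng.standard_normal(p)
sigma2 = 1.7
Q = np.diag([100.0, 100.0, 0.25, 0.25])   # diag(C, C, tau^2, tau^2)
Qinv = np.linalg.inv(Q)

M = X.T @ X / sigma2 + Qinv
beta_hat = np.linalg.solve(M, X.T @ y / sigma2)

# left side: (1/sigma^2)||y - X beta||^2 + beta' Q^{-1} beta
lhs = (y - X @ beta) @ (y - X @ beta) / sigma2 + beta @ Qinv @ beta
# right side of the completed square; M beta_hat = X'y / sigma^2
rhs = ((beta - beta_hat) @ M @ (beta - beta_hat)
       + y @ y / sigma2
       - (X.T @ y / sigma2) @ beta_hat)
assert np.isclose(lhs, rhs)
```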

Note that when $Q \rightarrow \infty$ (i.e., when $C \rightarrow \infty$ and $\tau \rightarrow \infty$), this reverts to $N((X^T X)^{-1} X^T y, \sigma^2 (X^T X)^{-1})$. To get the posterior of $\sigma$ given $\tau$, we integrate $\beta$ out of the joint posterior to obtain:


\begin{align*} & f_{\sigma \mid \text{data}, \tau}(\sigma) \\ &\propto \frac{\sigma^{-n-1}}{\sqrt{\det Q}} \sqrt{\det \left(\frac{X^T X}{\sigma^2} + Q^{-1} \right)^{-1}} \exp \left(-\frac{y^T y}{2 \sigma^2} \right)\exp \left(\frac{y^T X}{2\sigma^2} \left(\frac{X^T X}{\sigma^2} + Q^{-1} \right)^{-1} \frac{X^T y}{\sigma^2} \right). \end{align*}

Simplifying using $\det Q \propto (\tau^2)^{m-1}$ and $Q^{-1} \approx J/\tau^2$ (as $C \to \infty$), where $J$ is the $(m+1) \times (m+1)$ diagonal matrix with diagonal entries $0, 0, 1, \dots, 1$, we get

\begin{align*} & f_{\sigma \mid \text{data}, \tau}(\sigma) \\ &\propto \frac{\sigma^{-n+m}}{\tau^{m-1}} \left|X^T X + \frac{\sigma^2}{\tau^2} J \right|^{-1/2} \exp \left(-\frac{y^T y - y^T X \left(X^T X + \sigma^2\tau^{-2} J \right)^{-1} X^T y}{2\sigma^2} \right) \end{align*}

This distribution is not easy to handle because the ratio $\sigma^2/\tau^2$ appears both inside the determinant and inside the inverse in the exponent. One trick to simplify it is to reparametrize by setting $\tau = \sigma \gamma$. With this, the conditional density above becomes

\begin{align*} & f_{\sigma \mid \text{data}, \gamma}(\sigma) \\ &\propto \frac{1}{\gamma^{m-1}\sigma^{n-1}} \left|X^T X + \gamma^{-2} J \right|^{-1/2} \exp \left(-\frac{y^T y - y^T X \left(X^T X + \gamma^{-2} J \right)^{-1} X^T y}{2\sigma^2} \right) \end{align*}

Dropping factors above which do not depend on $\sigma$ (they are constants for this conditional density), we get

\begin{align*} f_{\sigma \mid \text{data}, \gamma}(\sigma) \propto \sigma^{-n+1} \exp \left(-\frac{y^T y - y^T X \left(X^T X + \gamma^{-2} J \right)^{-1} X^T y}{2\sigma^2} \right). \end{align*}

The right-hand side above is the density of an inverse gamma distribution (see Inverse-gamma distribution). This can be seen by converting it into the density of $1/\sigma^2$:

\begin{align*} f_{1/\sigma^2 \mid \text{data}, \gamma}(x) &\propto f_{\sigma \mid \text{data}, \gamma}(x^{-1/2})\, x^{-3/2} \\ &\propto \left(x^{-1/2} \right)^{-n+1} \exp \left(-x\, \frac{y^T y - y^T X \left(X^T X + \gamma^{-2} J \right)^{-1} X^T y}{2} \right) x^{-3/2} \\ &= x^{(n-4)/2} \exp \left(-x\, \frac{y^T y - y^T X \left(X^T X + \gamma^{-2} J \right)^{-1} X^T y}{2} \right). \end{align*}

Thus

\begin{align} \frac{1}{\sigma^2} \,\Big|\, \text{data}, \gamma \sim \text{Gamma}\left(\frac{n}{2} - 1, \frac{y^T y - y^T X \left(X^T X + \gamma^{-2} J \right)^{-1} X^T y}{2} \right). \end{align}

Thus when $\gamma$ is fixed, we can perform inference on $\beta, \sigma$ using the closed-form formulae above. Since $\gamma$ is also unknown, we place a prior $p(\gamma)$ on it and calculate its posterior.

Finally, we can marginalize $\sigma$ to obtain the posterior of $\gamma$ alone as follows:

\begin{align*} f_{\gamma \mid \text{data}}(\gamma) &\propto \int_0^{\infty} p(\gamma) \frac{1}{\gamma^{m-1} \sigma^{n-1}} \left|X^T X + \gamma^{-2} J \right|^{-1/2} \exp \left(-\frac{y^T y - y^T X \left(X^T X + \gamma^{-2} J \right)^{-1} X^T y}{2\sigma^2} \right) d\sigma \\ &= p(\gamma) \gamma^{-m+1} \left|X^T X + \gamma^{-2} J \right|^{-1/2} \int_0^{\infty} \sigma^{-n+1}\exp \left(-\frac{y^T y - y^T X \left(X^T X + \gamma^{-2} J \right)^{-1} X^T y}{2\sigma^2} \right) d\sigma. \end{align*}

Letting

\begin{align*} A := y^T y - y^T X \left(X^T X + \gamma^{-2} J \right)^{-1} X^T y, \end{align*}

we get

\begin{align*} f_{\gamma \mid \text{data}}(\gamma) &\propto p(\gamma) \gamma^{-m+1} \left|X^T X + \gamma^{-2} J \right|^{-1/2} \int_0^{\infty} \sigma^{-n+1} \exp \left(-\frac{A}{2 \sigma^2} \right) d\sigma. \end{align*}

By the change of variable $\sigma = s \sqrt{A}$, we obtain

\begin{align*} f_{\gamma \mid \text{data}}(\gamma) &\propto p(\gamma) \gamma^{-m+1} \left|X^T X + \gamma^{-2} J \right|^{-1/2} A^{-(n/2) + 1} \int_0^{\infty} s^{-n+1} \exp \left(-\frac{1}{2s^2} \right) ds \\ &\propto p(\gamma) \gamma^{-m+1} \left|X^T X + \gamma^{-2} J \right|^{-1/2} A^{-(n/2) + 1} \\ &= \frac{p(\gamma)\gamma^{-m+1} \left|X^T X + \gamma^{-2} J \right|^{-1/2}}{\left(y^T y - y^T X \left(X^T X + \gamma^{-2} J \right)^{-1} X^T y \right)^{(n/2) - 1}}. \end{align*}
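The final proportionality is best evaluated on the log scale for numerical stability. A sketch assuming numpy; `log_post_gamma` is an illustrative name, and `slogdet` avoids overflow in the determinant:

```python
import numpy as np

def log_post_gamma(gamma, X, y, m, log_prior=lambda g: -np.log(g)):
    """Unnormalized log posterior of gamma after marginalizing beta and
    sigma, following the final proportionality above."""
    n = X.shape[0]
    J = np.eye(m + 1)
    J[0, 0] = J[1, 1] = 0.0              # beta_0, beta_1 are not shrunk
    M = X.T @ X + J / gamma**2
    A = y @ y - y @ X @ np.linalg.solve(M, X.T @ y)   # the quantity A
    logdet = np.linalg.slogdet(M)[1]
    return (log_prior(gamma) - (m - 1) * np.log(gamma)
            - 0.5 * logdet - (n / 2 - 1) * np.log(A))
```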

We usually choose $p(\gamma)$ as:

\begin{align*} p(\gamma) \propto \frac{1}{\gamma} I\{\text{low} < \gamma < \text{high}\} \end{align*}

for two fixed values $\text{low}$ and $\text{high}$. These can be taken to be the values of $\gamma$ for which the posterior mean of $\beta$ (which coincides with a ridge regression estimate) underfits and overfits the data, respectively.

Inference can be carried out by first taking a grid of $\gamma$ values and computing the above posterior (on the logarithmic scale) at the grid points. This posterior can be used to obtain posterior samples of $\gamma$. For each sample of $\gamma$, we can then sample $\sigma$ using the distribution (4). Given samples of both $\gamma$ and $\sigma$, we can then sample $\beta$ using (3).
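Putting the pieces together, the whole procedure might be sketched as follows, assuming numpy; `posterior_samples`, the grid handling, and the prior $p(\gamma) \propto 1/\gamma$ baked in below are illustrative choices, not the course implementation:

```python
import numpy as np

def posterior_samples(X, y, m, grid, n_draws=500, rng=None):
    """Sample (gamma, sigma, beta) hierarchically: gamma from its grid
    posterior, then 1/sigma^2 | gamma from its Gamma distribution,
    then beta | sigma, gamma from its multivariate normal."""
    rng = np.random.default_rng(rng)
    n, p = X.shape                       # p = m + 1
    J = np.eye(p)
    J[0, 0] = J[1, 1] = 0.0              # no shrinkage on beta_0, beta_1

    # Unnormalized log posterior of gamma on the grid; the -m*log(gamma)
    # term combines the prior 1/gamma with the factor gamma^{-m+1}.
    logpost = np.empty(len(grid))
    A = np.empty(len(grid))
    for i, g in enumerate(grid):
        M = X.T @ X + J / g**2
        A[i] = y @ y - y @ X @ np.linalg.solve(M, X.T @ y)
        logpost[i] = (-m * np.log(g) - 0.5 * np.linalg.slogdet(M)[1]
                      - (n / 2 - 1) * np.log(A[i]))
    w = np.exp(logpost - logpost.max())
    w /= w.sum()

    draws = []
    for _ in range(n_draws):
        i = rng.choice(len(grid), p=w)             # sample gamma from the grid
        g = grid[i]
        inv_s2 = rng.gamma(n / 2 - 1, 2.0 / A[i])  # 1/sigma^2 | gamma
        s = 1.0 / np.sqrt(inv_s2)
        # beta | sigma, gamma, with Q^{-1} ~ J/tau^2 and tau = sigma * gamma
        M = X.T @ X / s**2 + J / (s * g)**2
        mean = np.linalg.solve(M, X.T @ y / s**2)
        beta = rng.multivariate_normal(mean, np.linalg.inv(M))
        draws.append((g, s, beta))
    return draws
```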

Induced Prior for the Regression Function

Our regression function is being modeled as:

\begin{align} f(x) = \beta_0 + \beta_1 x + \beta_2 (x - 1)_+ + \dots + \beta_m(x - (m-1))_+ \end{align}

where $\beta_0, \beta_1, \dots, \beta_m$ are all independent with

\begin{align*} \beta_0, \beta_1 \overset{\text{i.i.d.}}{\sim} N(0, C) ~~ \text{ and } ~~ \beta_2, \dots, \beta_m \overset{\text{i.i.d.}}{\sim} N(0, \tau^2). \end{align*}

This can also be seen more directly as a prior on the regression function $f$. For every $0 \leq u_1 < \dots < u_k < \infty$, the joint distribution of $(f(u_1), \dots, f(u_k))$ induced by (5) is multivariate normal. This implies that $(f(u), u \geq 0)$ is a Gaussian Process. The GP is described by its mean function $\mathbb{E} f(u)$ and its covariance kernel $\text{Cov}(f(u), f(v))$.

The mean function is clearly zero (because each $\mathbb{E} \beta_j = 0$). The covariance kernel is given by:

\begin{align*} &\text{cov}(f(u), f(v)) \\ &= \text{cov} \left(\beta_0 + \beta_1 u + \beta_2 (u - 1)_+ + \dots + \beta_m(u - (m-1))_+,\; \beta_0 + \beta_1 v + \beta_2 (v - 1)_+ + \dots + \beta_m(v - (m-1))_+ \right) \\ &= C(1 + uv) + \tau^2 \sum_{j=1}^{m-1} (u - j)_+ (v - j)_+ \\ &= C(1 + uv) + \tau^2 \sum_{j=1}^{k} (u - j)(v - j) \\ &= C(1 + uv) + \tau^2 \left(u v k + \frac{1}{6} k(k+1)(2k+1) - (u+v)\frac{k(k+1)}{2} \right) \end{align*}

where $k = \lfloor u \wedge v \rfloor$ and $u \wedge v := \min(u, v)$. (The third equality uses the facts that the terms with $j > k$ vanish and that both factors are nonnegative for $j \leq k$; it assumes $u \wedge v \leq m$, so that the sum over $j \leq k$ is contained in the sum over $j \leq m-1$.)
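The closed-form evaluation of the sum can be sanity-checked numerically; a small sketch assuming numpy, with `kernel_sum` and `kernel_closed_form` as hypothetical names:

```python
import numpy as np

def kernel_sum(u, v):
    # direct sum: terms with j > floor(min(u, v)) vanish
    k = int(np.floor(min(u, v)))
    return sum((u - j) * (v - j) for j in range(1, k + 1))

def kernel_closed_form(u, v):
    k = np.floor(min(u, v))
    return u * v * k + k * (k + 1) * (2 * k + 1) / 6 - (u + v) * k * (k + 1) / 2

for u, v in [(2.5, 7.3), (4.0, 4.0), (0.5, 9.9)]:
    assert np.isclose(kernel_sum(u, v), kernel_closed_form(u, v))
```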

Suppose now we use the approximation

\begin{align*} k \approx u \wedge v ~~ \text{ and } ~~ k(k+1)(2k+1) \approx 2k^3 \approx 2(u \wedge v)^3 ~~ \text{ and } ~~ k(k+1) \approx k^2 \approx (u \wedge v)^2. \end{align*}

Then the covariance kernel becomes:

\begin{align*} \text{cov}(f(u), f(v)) \approx C(1 + uv) + \tau^2 \left(u v (u \wedge v) + \frac{(u \wedge v)^3}{3} - \frac{(u \wedge v)^2}{2} (u + v) \right). \end{align*}

It turns out that the right hand side above is precisely the covariance kernel of:

\begin{align*} G_x = \beta_0 + \beta_1 x + \tau I_x \end{align*}

where $\beta_0, \beta_1 \overset{\text{i.i.d.}}{\sim} N(0, C)$ and $I_x$ is integrated Brownian motion on $[0, \infty)$. We will revisit this in the next lecture.
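The claimed match can be checked numerically using the standard covariance of integrated Brownian motion, $\text{Cov}(I_u, I_v) = (u \wedge v)^2 (u \vee v)/2 - (u \wedge v)^3/6$. A sketch assuming numpy, with illustrative function names:

```python
import numpy as np

def approx_kernel(u, v, C=1.0, tau=1.0):
    # right-hand side of the approximate covariance kernel above
    w = min(u, v)
    return C * (1 + u * v) + tau**2 * (u * v * w + w**3 / 3 - w**2 * (u + v) / 2)

def line_plus_ibm_kernel(u, v, C=1.0, tau=1.0):
    # Cov(G_u, G_v) for G_x = b0 + b1*x + tau*I_x, using the standard
    # integrated-Brownian-motion covariance min^2 * max / 2 - min^3 / 6
    w, M = min(u, v), max(u, v)
    return C * (1 + u * v) + tau**2 * (w**2 * M / 2 - w**3 / 6)

for u, v in [(1.2, 3.4), (5.0, 5.0), (0.3, 7.7)]:
    assert np.isclose(approx_kernel(u, v), line_plus_ibm_kernel(u, v))
```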