STAT 238 - Bayesian Statistics Lecture Nineteen
Spring 2026, UC Berkeley
A High-Dimensional Linear Regression Model ¶ We have a response variable y y y and a single covariate x x x . In our example, y y y denotes weekly earnings and x x x denotes years of experience. The covariate x x x takes the values 0 , 1 , … , m 0, 1, \dots, m 0 , 1 , … , m for some fixed integer m m m .
Our data is ( x i , y i ) , i = 1 , … , n (x_i, y_i), i = 1, \dots, n ( x i , y i ) , i = 1 , … , n . The model is:
y i = β 0 + β 1 x i + β 2 ReLU ( x i − 1 ) + ⋯ + β m ReLU ( x i − ( m − 1 ) ) + ϵ i \begin{align}
y_i = \beta_0 + \beta_1 x_i + \beta_2 \relu(x_i - 1) + \dots +
\beta_{m}\relu(x_i - (m-1)) + \epsilon_i
\end{align} y i = β 0 + β 1 x i + β 2 ReLU ( x i − 1 ) + ⋯ + β m ReLU ( x i − ( m − 1 )) + ϵ i where ReLU ( u ) : = max ( u , 0 ) \relu(u) := \max(u, 0) ReLU ( u ) := max ( u , 0 ) , and ϵ i ∼ i.i.d N ( 0 , σ 2 ) \epsilon_i
\overset{\text{i.i.d}}{\sim} N(0, \sigma^2) ϵ i ∼ i.i.d N ( 0 , σ 2 ) . Note we did not include ReLU ( x i − m ) \relu(x_i - m) ReLU ( x i − m ) because it always equals 0.
This linear regression equation can be written in the usual notation as:
y = X β + ϵ \begin{align}
y = X \beta + \epsilon
\end{align} y = Xβ + ϵ where y = ( y 1 , … , y n ) T y = (y_1, \dots, y_n)^T y = ( y 1 , … , y n ) T and X X X is the n × ( m + 1 ) n \times (m+1) n × ( m + 1 ) matrix with columns 1, x x x , ReLU ( x − j ) \relu(x - j) ReLU ( x − j ) for j = 1 , … , m − 1 j = 1, \dots, m-1 j = 1 , … , m − 1 .
Recap: Linear Regression with default priors ¶ As we saw before, the default prior on β 0 , … , β m \beta_0, \dots, \beta_m β 0 , … , β m in the linear regression model (2) is:
with C → ∞ C \rightarrow \infty C → ∞ . With this prior, the posterior of β \beta β given σ \sigma σ becomes (see e.g., Problem 7(a) in Homework Three)
β ∣ data , σ ∼ N ( ( X T X ) − 1 X T y , σ 2 ( X T X ) − 1 ) \begin{align}
\beta \mid \text{data}, \sigma \sim N\left((X^T X)^{-1} X^T y,
\sigma^2 (X^T X)^{-1} \right)
\end{align} β ∣ data , σ ∼ N ( ( X T X ) − 1 X T y , σ 2 ( X T X ) − 1 ) which means that the posterior mean of β \beta β is just the least squares estimator ( X T X ) − 1 X T y (X^T X)^{-1} X^T y ( X T X ) − 1 X T y (note that this posterior mean of β \beta β does not depend on σ \sigma σ , so is both the conditional posterior mean given σ \sigma σ , as well as the unconditional posterior mean).
The fact that the posterior mean of β \beta β coincides with the least squares estimator is shared by some priors other than (3) as well. For example, this is also true for the following prior:
β 0 , … , β m ∼ i.i.d N ( 0 , C ) \begin{align}
\beta_0, \dots, \beta_m \overset{\text{i.i.d}}{\sim} N(0, C)
\end{align} β 0 , … , β m ∼ i.i.d N ( 0 , C ) as C → ∞ C \rightarrow \infty C → ∞ . This can be seen by the following general result (which follows, for example, from Problem 9 in Homework Two):
β ∣ σ ∼ N ( 0 , Q ) ⟹ β ∣ data , σ ∼ N ( ( X T X σ 2 + Q − 1 ) − 1 X T y σ 2 , ( X T X σ 2 + Q − 1 ) − 1 ) . \begin{align}
\beta \mid \sigma \sim N(0, Q) \implies
\beta \mid \text{data}, \sigma \sim N \left(\left(\frac{X^T
X}{\sigma^2} + Q^{-1} \right)^{-1} \frac{X^T y}{\sigma^2},
\left(\frac{X^T X}{\sigma^2} + Q^{-1}\right)^{-1} \right).
\end{align} β ∣ σ ∼ N ( 0 , Q ) ⟹ β ∣ data , σ ∼ N ( ( σ 2 X T X + Q − 1 ) − 1 σ 2 X T y , ( σ 2 X T X + Q − 1 ) − 1 ) . From here, it is clear that if Q Q Q is chosen so that Q − 1 ≈ 0 Q^{-1} \approx
0 Q − 1 ≈ 0 , then we get the same posterior as (4) . The prior (5) corresponds to Q = C I Q = C I Q = C I so that Q − 1 = ( 1 / C ) I Q^{-1} = (1/C)I Q − 1 = ( 1/ C ) I which goes to zero when C → ∞ C \rightarrow \infty C → ∞ . So the prior (5) leads to the same posterior (4) .
Regularization ¶ In the earnings-experience regression example, we have seen in the last class that the least squares estimator for (2) is not interpretable. We want the fitted curve
x ↦ β ^ 0 + β ^ 1 x + ∑ j = 2 m β ^ j ReLU ( x − ( j − 1 ) ) \begin{align*}
x \mapsto \hat{\beta}_0 + \hat{\beta}_1 x + \sum_{j=2}^m
\hat{\beta}_j \relu(x - (j-1))
\end{align*} x ↦ β ^ 0 + β ^ 1 x + j = 2 ∑ m β ^ j ReLU ( x − ( j − 1 )) to be smooth for it to be interpretable. This can be achieved in the Bayesian setting if we use the prior:
β 0 , β 1 ∼ i.i.d N ( 0 , C ) and β 2 , … , β m ∼ i.i.d N ( 0 , τ 2 ) . \begin{align}
\beta_0, \beta_1 \overset{\text{i.i.d}}{\sim} N(0, C) ~~ \text{ and
} ~~ \beta_2, \dots, \beta_{m} \overset{\text{i.i.d}}{\sim} N(0,
\tau^2).
\end{align} β 0 , β 1 ∼ i.i.d N ( 0 , C ) and β 2 , … , β m ∼ i.i.d N ( 0 , τ 2 ) . for some small τ \tau τ . The above prior is equivalent to β ∼ N ( 0 , Q ) \beta \sim N(0, Q) β ∼ N ( 0 , Q ) where Q Q Q is the ( m + 1 ) × ( m + 1 ) (m+1) \times
(m+1) ( m + 1 ) × ( m + 1 ) diagonal matrix with diagonal entries C , C , τ 2 , … , τ 2 C, C, \tau^2, \dots,
\tau^2 C , C , τ 2 , … , τ 2 . So using the formula (6) , the posterior mean corresponding to this prior is:
E ( β ∣ data , σ , τ ) = ( X T X σ 2 + Q − 1 ) − 1 X T y σ 2 . \begin{align*}
\E \left(\beta \mid \text{data}, \sigma, \tau \right) =
\left(\frac{X^T X}{\sigma^2} + Q^{-1} \right)^{-1} \frac{X^T
y}{\sigma^2}.
\end{align*} E ( β ∣ data , σ , τ ) = ( σ 2 X T X + Q − 1 ) − 1 σ 2 X T y . Because Q Q Q is diagonal with diagonal entries C , C , 1 / τ 2 , … , 1 / τ 2 C, C, 1/\tau^2, \dots,
1/\tau^2 C , C , 1/ τ 2 , … , 1/ τ 2 , it is easy to see that Q − 1 Q^{-1} Q − 1 is also diagonal with diagonal entries 1 / C , 1 / C , τ − 2 , … , τ − 2 1/C, 1/C, \tau^{-2}, \dots, \tau^{-2} 1/ C , 1/ C , τ − 2 , … , τ − 2 . Because C C C is large, we can approximate Q − 1 Q^{-1} Q − 1 by the matrix with diagonal entries 0 , 0 , τ − 2 , … , τ − 2 0, 0, \tau^{-2}, \dots, \tau^{-2} 0 , 0 , τ − 2 , … , τ − 2 . In other words:
Q − 1 ≈ 1 τ 2 J where J = ( 0 0 0 0 ⋅ ⋅ ⋅ 0 0 0 0 0 ⋅ ⋅ ⋅ 0 0 0 1 0 ⋅ ⋅ ⋅ 0 0 0 0 1 ⋅ ⋅ ⋅ 0 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 0 0 0 0 ⋅ ⋅ ⋅ 1 ) \begin{align*}
Q^{-1} \approx \frac{1}{\tau^2} J ~~~ \text{ where } J = \begin{pmatrix}
0 & 0 & 0 & 0 & \cdot & \cdot & \cdot & 0 \\
0 & 0 & 0 & 0 & \cdot & \cdot & \cdot & 0 \\
0 & 0 & 1 & 0 &\cdot & \cdot & \cdot & 0 \\
0 & 0 & 0 & 1 & \cdot & \cdot & \cdot & 0 \\
\cdot & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\
\cdot & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\
\cdot & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\
0 & 0 & 0 & 0 & \cdot & \cdot & \cdot & 1 \\
\end{pmatrix}
\end{align*} Q − 1 ≈ τ 2 1 J where J = ⎝ ⎛ 0 0 0 0 ⋅ ⋅ ⋅ 0 0 0 0 0 ⋅ ⋅ ⋅ 0 0 0 1 0 ⋅ ⋅ ⋅ 0 0 0 0 1 ⋅ ⋅ ⋅ 0 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 0 0 0 0 ⋅ ⋅ ⋅ 1 ⎠ ⎞ Note that J J J is an ( m + 1 ) × ( m + 1 ) (m+1) \times (m+1) ( m + 1 ) × ( m + 1 ) diagonal matrix. Thus the posterior mean becomes:
E ( β ∣ data , σ , τ ) = ( X T X σ 2 + J τ 2 ) − 1 X T y σ 2 = ( X T X + J τ 2 / σ 2 ) − 1 X T y . \begin{align*}
\E \left(\beta \mid \text{data}, \sigma, \tau \right) =
\left(\frac{X^T X}{\sigma^2} + \frac{J}{\tau^2} \right)^{-1}
\frac{X^T
y}{\sigma^2} = \left(X^T X + \frac{J}{\tau^2/\sigma^2} \right)^{-1}
X^T y.
\end{align*} E ( β ∣ data , σ , τ ) = ( σ 2 X T X + τ 2 J ) − 1 σ 2 X T y = ( X T X + τ 2 / σ 2 J ) − 1 X T y . We use the notation τ 2 = σ 2 γ 2 \tau^2 = \sigma^2 \gamma^2 τ 2 = σ 2 γ 2 or equivalently γ 2 = τ 2 / σ 2 \gamma^2 = \tau^2/\sigma^2 γ 2 = τ 2 / σ 2 . This gives
E ( β ∣ data , σ , τ ) = ( X T X + J γ 2 ) − 1 X T y . \begin{align}
\E \left(\beta \mid \text{data}, \sigma, \tau \right) = \left(X^T X
+ \frac{J}{\gamma^2} \right)^{-1}
X^T y.
\end{align} E ( β ∣ data , σ , τ ) = ( X T X + γ 2 J ) − 1 X T y . It can be verified that this quantity ( X T X + J / γ 2 ) − 1 X T y (X^TX + J/\gamma^2)^{-1} X^T y ( X T X + J / γ 2 ) − 1 X T y coincides with a ridge regularized least squares estimator. This is shown in the next section.
Connection to Ridge Regularization ¶ For the model (1) , the ridge regression estimator β ^ ridge ( λ ) \ridge β ^ ridge ( λ ) for β \beta β is given by the minimizer of:
∑ i = 1 n ( y i − β 0 − β 1 x i − β 2 ReLU ( x i − 1 ) − ⋯ − β m ReLU ( x i − ( m − 1 ) ) ) 2 + λ ( β 2 2 + β 3 2 + ⋯ + β m 2 ) . \begin{split}
& \sum_{i=1}^n \left(y_i - \beta_0 - \beta_1 x_i - \beta_2
\relu(x_i-1) - \dots - \beta_{m} \relu(x_i-(m-1)) \right)^2 \\ &+
\lambda \left(\beta_2^2 + \beta_3^2 + \dots + \beta_{m}^2
\right).
\end{split} i = 1 ∑ n ( y i − β 0 − β 1 x i − β 2 ReLU ( x i − 1 ) − ⋯ − β m ReLU ( x i − ( m − 1 )) ) 2 + λ ( β 2 2 + β 3 2 + ⋯ + β m 2 ) . The objective function above can also be written as
∥ y − X β ∥ 2 + λ ∑ j = 2 m β j 2 . \|y - X \beta\|^2 + \lambda \sum_{j=2}^{m} \beta_j^2. ∥ y − Xβ ∥ 2 + λ j = 2 ∑ m β j 2 . It turns out that β ^ ridge ( λ ) \ridge β ^ ridge ( λ ) can be written in closed form using matrix notation. To see this, note first that the gradient of the above objective function with respect to β \beta β is given by
∇ ( ∥ y − X β ∥ 2 + λ ∑ j = 2 m β j 2 ) = − 2 X T y + 2 X T X β + 2 λ ( 0 0 β 2 ⋅ ⋅ ⋅ β m ) \nabla \left( \|y - X \beta\|^2 + \lambda \sum_{j=2}^{m} \beta_j^2
\right) = -2 X^T y + 2 X^T X \beta + 2 \lambda
\begin{pmatrix}
0 \\ 0 \\ \beta_2 \\ \cdot \\ \cdot \\ \cdot \\ \beta_{m}
\end{pmatrix} ∇ ( ∥ y − Xβ ∥ 2 + λ j = 2 ∑ m β j 2 ) = − 2 X T y + 2 X T Xβ + 2 λ ⎝ ⎛ 0 0 β 2 ⋅ ⋅ ⋅ β m ⎠ ⎞ Recalling that J J J is the ( m + 1 ) × ( m + 1 ) (m+1) \times (m+1) ( m + 1 ) × ( m + 1 ) diagonal matrix with diagonal entries 0 , 0 , 1 , … , 1 0, 0, 1, \dots, 1 0 , 0 , 1 , … , 1 , we can write
∇ ( ∥ y − X β ∥ 2 + λ ∑ t = 2 n − 1 β j 2 ) = − 2 X T y + 2 X T X β + 2 λ ( 0 0 β 2 ⋅ ⋅ ⋅ β n − 1 ) = − 2 X T y + 2 X T X β + 2 λ J β . \nabla \left( \|y - X \beta\|^2 + \lambda \sum_{t=2}^{n-1} \beta_j^2
\right) = -2 X^T y + 2 X^T X \beta + 2 \lambda
\begin{pmatrix}
0 \\ 0 \\ \beta_2 \\ \cdot \\ \cdot \\ \cdot \\ \beta_{n-1}
\end{pmatrix} = -2 X^T y + 2 X^T X \beta + 2 \lambda J \beta. ∇ ( ∥ y − Xβ ∥ 2 + λ t = 2 ∑ n − 1 β j 2 ) = − 2 X T y + 2 X T Xβ + 2 λ ⎝ ⎛ 0 0 β 2 ⋅ ⋅ ⋅ β n − 1 ⎠ ⎞ = − 2 X T y + 2 X T Xβ + 2 λ J β . Setting this gradient equal to zero, we get
− 2 X T y + 2 X T X β + 2 λ J β = 0 ⟹ ( X T X + λ J ) β = X T y . -2 X^T y + 2 X^T X \beta + 2 \lambda J \beta = 0 \implies \left(X^T
X + \lambda J \right) \beta = X^T y. − 2 X T y + 2 X T Xβ + 2 λ J β = 0 ⟹ ( X T X + λ J ) β = X T y . which gives
Note that λ = 0 \lambda = 0 λ = 0 corresponds to least squares, and λ → ∞ \lambda
\rightarrow \infty λ → ∞ leads to linear regression. When λ \lambda λ is set to be neither very close to 0 nor very large, we should get a smooth estimate that still gives a good fit to the data.
Comparing (14) to (12) , it is clear that they are identical when λ = 1 / γ 2 \lambda = 1/\gamma^2 λ = 1/ γ 2 . In other words, when we use the prior (8) , then the posterior mean (given τ \tau τ and σ \sigma σ ) coincides with ridge regularized least squares with regularization parameter λ = 1 / γ 2 = σ 2 / τ 2 \lambda = 1/\gamma^2 =
\sigma^2/\tau^2 λ = 1/ γ 2 = σ 2 / τ 2 .
Bayesian inference for γ \gamma γ and σ \sigma σ ¶ The big advantage of Bayesian inference is that it also allows inference on γ \gamma γ and σ \sigma σ . We shall use the prior:
log γ , log σ ∼ i.i.d uniform ( − ∞ , ∞ ) . \begin{align*}
\log \gamma, \log \sigma \overset{\text{i.i.d}}{\sim}
\text{uniform}(-\infty, \infty).
\end{align*} log γ , log σ ∼ i.i.d uniform ( − ∞ , ∞ ) . Recall again that τ = γ σ \tau = \gamma \sigma τ = γσ . The joint prior on all the parameters ( β , γ , σ ) (\beta, \gamma, \sigma) ( β , γ , σ ) is:
f β , γ , σ ( β , γ , σ ) ∝ 1 γ σ 1 det Q exp ( − 1 2 β T Q − 1 β ) . \begin{align*}
f_{\beta, \gamma, \sigma}(\beta, \gamma, \sigma) \propto
\frac{1}{\gamma \sigma} \frac{1}{\sqrt{\det Q}} \exp
\left(-\frac{1}{2} \beta^T Q^{-1} \beta \right).
\end{align*} f β , γ , σ ( β , γ , σ ) ∝ γσ 1 det Q 1 exp ( − 2 1 β T Q − 1 β ) . Here Q Q Q is the ( m + 1 ) × ( m + 1 ) (m+1) \times (m+1) ( m + 1 ) × ( m + 1 ) diagonal matrix with diagonal entries C , C , γ 2 σ 2 , … , γ 2 σ 2 C, C, \gamma^2 \sigma^2, \dots, \gamma^2 \sigma^2 C , C , γ 2 σ 2 , … , γ 2 σ 2 .
With this prior, we shall argue in the next lecture that the posterior is given by:
β ∣ data , σ , γ ∼ N ( ( X T X + γ − 2 J ) − 1 X T y , σ 2 ( X T X + γ − 2 J ) − 1 ) \begin{align}
\beta \mid \text{data}, \sigma, \gamma \sim N \left(\left(X^T X +
\gamma^{-2} J
\right)^{-1} X^T y, \sigma^2 \left(X^T X + \gamma^{-2} J \right)^{-1} \right)
\end{align} β ∣ data , σ , γ ∼ N ( ( X T X + γ − 2 J ) − 1 X T y , σ 2 ( X T X + γ − 2 J ) − 1 ) and
1 σ 2 ∣ data , γ ∼ Gamma ( n 2 − 1 , y T y − y T X ( X T X + γ − 2 J ) − 1 X T y 2 . ) \begin{align}
\frac{1}{\sigma^2} \mid \text{data}, \gamma \sim
\text{Gamma}\left(\frac{n}{2} - 1, \frac{y^T y - y^T X \left(X^T X +
\gamma^{-2} J \right)^{-1} X^T y}{2}.
\right)
\end{align} σ 2 1 ∣ data , γ ∼ Gamma ( 2 n − 1 , 2 y T y − y T X ( X T X + γ − 2 J ) − 1 X T y . ) and
f γ ∣ data ( γ ) ∝ γ − m ∣ X T X + γ − 2 J ∣ − 1 / 2 ( y T y − y T X ( X T X + γ − 2 J ) − 1 X T y ) ( n / 2 ) − 1 . \begin{align*}
f_{\gamma \mid \text{data}}(\gamma) &\propto \frac{\gamma^{-m}
\left|X^T X + \gamma^{-2} J
\right|^{-1/2}}{\left(y^T y - y^T X \left(X^T X +
\gamma^{-2} J \right)^{-1} X^T y \right)^{(n/2) - 1}}.
\end{align*} f γ ∣ data ( γ ) ∝ ( y T y − y T X ( X T X + γ − 2 J ) − 1 X T y ) ( n /2 ) − 1 γ − m ∣ ∣ X T X + γ − 2 J ∣ ∣ − 1/2 . With this, inference can be carried out by first taking a grid of γ \gamma γ values and computing the above posterior (on the logarithmic scale) at the grid points. This posterior can be used to obtain posterior samples of γ \gamma γ . For each sample of γ \gamma γ , we can then sample σ \sigma σ using the distribution (18) . Given samples from both γ \gamma γ and σ \sigma σ , we can then sample β \beta β using (17) .