STAT 238 - Bayesian Statistics Lecture Twenty
Spring 2026, UC Berkeley
Bayesian Inference in a High-Dimensional Linear Regression Model

We studied the following model in the last lecture.
We have a response variable $y$ and a single covariate $x$. In our example, $y$ denotes weekly earnings and $x$ denotes years of experience. The covariate $x$ takes the values $0, 1, \dots, m$ for some fixed integer $m$.

Our data is $(x_i, y_i)$, $i = 1, \dots, n$. The model is:
\begin{align}
y_i = \beta_0 + \beta_1 x_i + \beta_2 \relu(x_i - 1) + \dots +
\beta_{m}\relu(x_i - (m-1)) + \epsilon_i
\end{align}

where $\relu(u) := \max(u, 0)$ and $\epsilon_i \overset{\text{i.i.d}}{\sim} N(0, \sigma^2)$. Note that we did not include $\relu(x_i - m)$ because it always equals 0.
The model (1) can be rewritten in the usual matrix regression form as:

\begin{align*}
y = X \beta + \epsilon ~~\text{with $\epsilon \sim N(0, \sigma^2 I_n)$}.
\end{align*}

Here $X$ is the $n \times (m+1)$ matrix with columns $1$, $x$, $\relu(x-j)$ for $j = 1, \dots, m-1$, where $x$ denotes the vector of observed values of the experience variable.
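To make the design matrix concrete, here is a minimal sketch in Python/NumPy of how $X$ could be assembled from the ReLU basis; the array `x` of experience values and the function names are illustrative, not part of the lecture.

```python
import numpy as np

def relu(u):
    """ReLU(u) = max(u, 0), applied elementwise."""
    return np.maximum(u, 0.0)

def design_matrix(x, m):
    """n x (m+1) design matrix with columns 1, x, ReLU(x - j) for j = 1, ..., m-1."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x] + [relu(x - j) for j in range(1, m)]
    return np.column_stack(cols)

# Hypothetical data: 3 observations at each experience level 0, 1, ..., m.
m = 5
x = np.repeat(np.arange(m + 1), 3)
X = design_matrix(x, m)
print(X.shape)   # (18, 6), i.e. (n, m+1)
```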
The usual least squares analysis coincides with Bayesian inference using the prior:
\begin{align}
\beta_0, \dots, \beta_m \overset{\text{i.i.d}}{\sim} N(0, C)
~~ \text{ and } ~~ \log \sigma \sim \text{uniform}(-C, C)
\end{align}

for $C \rightarrow \infty$. In Problem 7 of Homework 3, it is proved that the posterior for the above prior is:
\begin{align*}
\beta \mid \text{data}, \sigma &\sim N_{m+1}\left(\hat{\beta}, \sigma^2 (X^T X)^{-1}\right) \\
f_{\sigma \mid \text{data}}(\sigma) &\propto \sigma^{-n+m} \exp \left(-\frac{y^T y - y^T X (X^T X)^{-1} X^T y}{2 \sigma^2} \right) I\{\sigma > 0\}
\end{align*}

where $\hat{\beta} = (X^T X)^{-1} X^T y$ is the least squares estimator. If we marginalize $\sigma$ and write the posterior distribution of $\beta$, we will get a $t$-distribution. The posterior distribution for $\sigma$ above can be written in terms of the Gamma (or chi-squared) distribution (this is Problem 7(d) in Homework 3) as:
\begin{align*}
\frac{1}{\sigma^2} \Big| \text{data} \sim
\text{Gamma} \left(\frac{n-m-1}{2}, ~\frac{y^T y - y^T X (X^T X)^{-1} X^T y}{2} \right).
\end{align*}
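As a quick aside, these closed forms make posterior sampling under the flat prior straightforward: draw $1/\sigma^2$ from the Gamma distribution above, then $\beta$ from its conditional normal. A minimal sketch (assuming NumPy arrays `X` and `y` are already in hand; the function name is illustrative):

```python
import numpy as np

def sample_flat_prior_posterior(X, y, n_draws=1000, seed=0):
    """Posterior draws under the flat prior:
    1/sigma^2 | data ~ Gamma((n-m-1)/2, A/2) with A = y'y - y'X (X'X)^{-1} X'y,
    beta | data, sigma ~ N(beta_hat, sigma^2 (X'X)^{-1})."""
    rng = np.random.default_rng(seed)
    n, p = X.shape                        # p = m + 1
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y          # least squares estimate
    A = y @ y - y @ X @ XtX_inv @ X.T @ y
    prec = rng.gamma((n - p) / 2, scale=2.0 / A, size=n_draws)   # draws of 1/sigma^2
    sigmas = 1.0 / np.sqrt(prec)
    betas = np.array([rng.multivariate_normal(beta_hat, s**2 * XtX_inv) for s in sigmas])
    return betas, sigmas
```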
Because least squares does not give sensible results in this example, we change the prior in (3) to:

\begin{align*}
\beta_0 \sim N(0, C), ~~ \beta_1 \sim N(0, C), ~~ \beta_2, \dots, \beta_m
\overset{\text{i.i.d}}{\sim} N(0, \tau^2) ~~ \text{ and } ~~
\log \sigma \sim \text{uniform}(-C, C).
\end{align*}

We are therefore bringing in a new parameter $\tau$ which controls the scale of $\beta_2, \dots, \beta_m$. For now, treat $\tau$ as fixed. So the prior on $\beta$ and $\sigma$ is now:
\begin{align*}
f_{\beta, \sigma}(\beta, \sigma) \propto \frac{1}{\sigma}
\frac{1}{\sqrt{\det Q}} \exp \left(-\frac{1}{2} \beta^T Q^{-1} \beta \right)
\end{align*}

where $Q$ is the $(m+1) \times (m+1)$ diagonal matrix with diagonal entries $C, C, \tau^2, \dots, \tau^2$.
The likelihood is unchanged:
\begin{align*}
\left(\frac{1}{\sqrt{2 \pi}} \right)^n \sigma^{-n} \exp
\left(-\frac{1}{2 \sigma^2} \|y - X \beta\|^2 \right).
\end{align*}

So the posterior for $\beta, \sigma$ is:
\begin{align*}
f_{\beta,\sigma \mid \text{data}}(\beta, \sigma)
& \propto \frac{\sigma^{-n-1}}{\sqrt{\det Q}} \exp
\left(-\frac{1}{2} \left(\frac{1}{\sigma^2} \|y - X \beta\|^2 +
\beta^T Q^{-1} \beta \right) \right).
\end{align*}

Using the completing-the-square formula:
\begin{align*}
\frac{1}{\sigma^2} \|y - X \beta\|^2 + \beta^T Q^{-1} \beta
&= \left( \beta - \hat{\beta}\right)^T
\left(\frac{X^T X}{\sigma^2} + Q^{-1} \right)
\left(\beta - \hat{\beta} \right) \\
&\quad + \frac{y^T y}{\sigma^2} - \left(\frac{y^T X}{\sigma^2} \right)
\left(\frac{X^T X}{\sigma^2} + Q^{-1} \right)^{-1}
\left(\frac{X^T y}{\sigma^2} \right),
\end{align*}

where
\begin{align*}
\hat{\beta} = \left(\frac{X^T X}{\sigma^2} + Q^{-1} \right)^{-1}
\frac{X^T y}{\sigma^2},
\end{align*}

one can then show that
\begin{align}
\beta \mid \text{data}, \sigma, \tau \sim N\left( \left(\frac{X^T X}{\sigma^2} + Q^{-1} \right)^{-1} \frac{X^T y}{\sigma^2}, ~
\left(\frac{X^T X}{\sigma^2} + Q^{-1} \right)^{-1} \right).
\end{align}

Note that when $Q \rightarrow \infty$ (i.e., when $C \rightarrow \infty$ and $\tau \rightarrow \infty$), this reverts to $N((X^T X)^{-1} X^T y, \sigma^2 (X^T X)^{-1})$. To get the posterior of $\sigma$ given $\tau$, we simply integrate $\beta$ out of the joint posterior to obtain:
\begin{align*}
f_{\sigma \mid \text{data}, \tau}(\sigma) &\propto
\frac{\sigma^{-n-1}}{\sqrt{\det Q}} \sqrt{\det
\left(\frac{X^T X}{\sigma^2} + Q^{-1} \right)^{-1}} \exp \left(-\frac{y^T y}{2 \sigma^2} \right) \exp
\left(\frac{y^T X}{2\sigma^2} \left(\frac{X^T X}{\sigma^2} +
Q^{-1} \right)^{-1} \frac{X^T y}{\sigma^2} \right).
\end{align*}

Simplifying by using $\det Q \propto (\tau^2)^{m-1}$ and $Q^{-1} \approx J/\tau^2$ (the latter valid since $C \rightarrow \infty$), where $J$ is the $(m+1) \times (m+1)$ diagonal matrix with diagonal entries $0, 0, 1, \dots, 1$, we get
\begin{align*}
f_{\sigma \mid \text{data}, \tau}(\sigma) &\propto
\frac{\sigma^{-n+m}}{\tau^{m-1}} \left|X^T X + \frac{\sigma^2}{\tau^2} J
\right|^{-1/2} \exp
\left(-\frac{y^T y - y^T X \left(X^T X +
\sigma^2\tau^{-2} J \right)^{-1} X^T y}{2\sigma^2}
\right).
\end{align*}

This distribution is not easy to handle because of the presence of the term $\sigma^2/\tau^2$ inside the determinant as well as inside the inverse (in the exponent). One trick to simplify this is to reparametrize and assume $\tau = \sigma \gamma$. With this, the above conditional density becomes
\begin{align*}
f_{\sigma \mid \text{data}, \gamma}(\sigma) &\propto
\frac{1}{\gamma^{m-1}\sigma^{n-1}} \left|X^T X + \gamma^{-2} J
\right|^{-1/2} \exp
\left(-\frac{y^T y - y^T X \left(X^T X +
\gamma^{-2} J \right)^{-1} X^T y}{2\sigma^2}
\right).
\end{align*}

Ignoring the factors above which do not depend on $\sigma$, we get
\begin{align*}
f_{\sigma \mid \text{data}, \gamma}(\sigma) \propto \sigma^{-n+1} \exp
\left(-\frac{y^T y - y^T X \left(X^T X +
\gamma^{-2} J \right)^{-1} X^T y}{2\sigma^2}
\right).
\end{align*}

The right hand side above says that $\sigma^2$ has an inverse-gamma distribution (see Inverse-gamma distribution), or equivalently that $1/\sigma^2$ has a Gamma distribution. This can be seen by converting it into the density of $1/\sigma^2$:
\begin{align*}
f_{1/\sigma^2 \mid \text{data}, \gamma}(x) &\propto f_{\sigma \mid
\text{data}, \gamma}(x^{-1/2}) \, x^{-3/2} \\
&\propto \left(x^{-1/2} \right)^{-n+1} \exp
\left(-x \, \frac{y^T y - y^T X \left(X^T X +
\gamma^{-2} J \right)^{-1} X^T y}{2}
\right) x^{-3/2} \\
&= x^{(n-4)/2} \exp
\left(-x \, \frac{y^T y - y^T X \left(X^T X +
\gamma^{-2} J \right)^{-1} X^T y}{2}
\right).
\end{align*}

Thus
\begin{align}
\frac{1}{\sigma^2} ~\Big|~ \text{data}, \gamma \sim
\text{Gamma}\left(\frac{n}{2} - 1, ~ \frac{y^T y - y^T X \left(X^T X +
\gamma^{-2} J \right)^{-1} X^T y}{2}
\right).
\end{align}

Thus when $\gamma$ is fixed, we can perform inference on $\beta, \sigma$ using the above closed-form formulae. Since $\gamma$ is also unknown, we can place a prior $p(\gamma)$ on it, and then calculate its posterior.
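For a fixed $\gamma$, these conditionals give a direct sampler: substituting $\tau = \sigma\gamma$ and $Q^{-1} \approx J/\tau^2$ into the normal distribution for $\beta$ above gives $\beta \mid \text{data}, \sigma, \gamma \sim N\big((X^T X + \gamma^{-2} J)^{-1} X^T y, \ \sigma^2 (X^T X + \gamma^{-2} J)^{-1}\big)$. A rough sketch (assuming NumPy arrays `X` and `y`; the function name is illustrative):

```python
import numpy as np

def sample_sigma_beta_given_gamma(X, y, gamma, n_draws=1000, seed=0):
    """For fixed gamma (with tau = sigma * gamma and C -> infinity, so Q^{-1} ~ J / tau^2):
    1/sigma^2 | data, gamma   ~ Gamma(n/2 - 1, A_gamma / 2),
    beta | data, sigma, gamma ~ N((X'X + J/gamma^2)^{-1} X'y, sigma^2 (X'X + J/gamma^2)^{-1}),
    where A_gamma = y'y - y'X (X'X + J/gamma^2)^{-1} X'y."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    J = np.diag([0.0, 0.0] + [1.0] * (p - 2))      # shrinks beta_2, ..., beta_m only
    M_inv = np.linalg.inv(X.T @ X + J / gamma**2)
    A_gamma = y @ y - y @ X @ M_inv @ X.T @ y
    post_mean = M_inv @ X.T @ y                    # ridge-type posterior mean of beta
    prec = rng.gamma(n / 2 - 1, scale=2.0 / A_gamma, size=n_draws)   # 1/sigma^2 draws
    sigmas = 1.0 / np.sqrt(prec)
    betas = np.array([rng.multivariate_normal(post_mean, s**2 * M_inv) for s in sigmas])
    return betas, sigmas
```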
Finally, we can marginalize $\sigma$ to obtain the posterior of $\gamma$ alone as follows:
\begin{align*}
f_{\gamma \mid \text{data}}(\gamma) &\propto \int_0^{\infty}
p(\gamma)
\frac{1}{\gamma^{m-1} \sigma^{n-1}}
\left|X^T X + \gamma^{-2} J \right|^{-1/2} \exp
\left(-\frac{y^T y - y^T X \left(X^T X +
\gamma^{-2} J \right)^{-1} X^T y}{2\sigma^2}
\right) d\sigma \\
&= p(\gamma) \gamma^{-m+1} \left|X^T X + \gamma^{-2} J
\right|^{-1/2} \int_0^{\infty} \sigma^{-n+1}\exp
\left(-\frac{y^T y - y^T X \left(X^T X +
\gamma^{-2} J \right)^{-1} X^T y}{2\sigma^2}
\right) d\sigma.
\end{align*}

Letting
\begin{align*}
A := y^T y - y^T X \left(X^T X + \gamma^{-2} J \right)^{-1} X^T y,
\end{align*}

we get
\begin{align*}
f_{\gamma \mid \text{data}}(\gamma) &\propto p(\gamma) \gamma^{-m+1} \left|X^T X + \gamma^{-2} J
\right|^{-1/2} \int_0^{\infty} \sigma^{-n+1} \exp \left(-\frac{A}{2
\sigma^2} \right) d\sigma.
\end{align*}

By the change of variable $\sigma = s \sqrt{A}$, we obtain
\begin{align*}
f_{\gamma \mid \text{data}}(\gamma) &\propto p(\gamma) \gamma^{-m+1} \left|X^T X + \gamma^{-2} J
\right|^{-1/2} A^{-(n/2) + 1} \int_0^{\infty} s^{-n+1} \exp \left(-\frac{1}{2s^2}
\right) ds \\
&\propto p(\gamma) \gamma^{-m+1} \left|X^T X + \gamma^{-2} J
\right|^{-1/2} A^{-(n/2) + 1} \\
&= \frac{p(\gamma)\gamma^{-m+1} \left|X^T X + \gamma^{-2} J
\right|^{-1/2}}{\left(y^T y - y^T X \left(X^T X +
\gamma^{-2} J \right)^{-1} X^T y \right)^{(n/2) - 1}}.
\end{align*}

We usually choose $p(\gamma)$ as:
\begin{align*}
p(\gamma) \propto \frac{1}{\gamma} I\{\text{low} < \gamma < \text{high}\}
\end{align*}

for two fixed values $\text{low}$ and $\text{high}$. These can be chosen as values of $\gamma$ for which the posterior mean of $\beta$ (which coincides with a ridge regression estimate) underfits and overfits the data, respectively.
Inference can be carried out by first taking a grid of $\gamma$ values and computing the above posterior (on the logarithmic scale) at the grid points. This posterior can be used to obtain posterior samples of $\gamma$. For each sample of $\gamma$, we can then sample $\sigma$ using the distribution (18). Given samples of both $\gamma$ and $\sigma$, we can then sample $\beta$ using (17).
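A minimal sketch of the grid step (Python/NumPy; the function names, grid, and prior endpoints `low`, `high` are illustrative assumptions) computes the unnormalized log posterior of $\gamma$ above and resamples from the grid:

```python
import numpy as np

def log_post_gamma(gamma_grid, X, y, low, high):
    """Unnormalized log posterior of gamma on a grid, with p(gamma) ∝ 1/gamma on (low, high)."""
    n, p = X.shape
    m = p - 1
    J = np.diag([0.0, 0.0] + [1.0] * (p - 2))
    lp = np.full(len(gamma_grid), -np.inf)
    for i, g in enumerate(gamma_grid):
        if not (low < g < high):
            continue                                   # zero prior mass outside (low, high)
        M = X.T @ X + J / g**2
        _, logdet = np.linalg.slogdet(M)
        A = y @ y - y @ X @ np.linalg.solve(M, X.T @ y)
        lp[i] = (-np.log(g)                            # prior 1/gamma
                 + (1 - m) * np.log(g)                 # gamma^{-m+1}
                 - 0.5 * logdet                        # |X'X + J/gamma^2|^{-1/2}
                 - (n / 2 - 1) * np.log(A))            # A^{-(n/2 - 1)}
    return lp

def sample_gamma(gamma_grid, X, y, low, high, n_draws=1000, seed=0):
    """Draw gamma values from the grid approximation to its posterior."""
    rng = np.random.default_rng(seed)
    lp = log_post_gamma(gamma_grid, X, y, low, high)
    w = np.exp(lp - lp.max())                          # stabilize before exponentiating
    return rng.choice(gamma_grid, size=n_draws, p=w / w.sum())
```

Each sampled $\gamma$ can then be fed to the conditional samplers sketched earlier to produce the corresponding draws of $\sigma$ and $\beta$.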
Induced Prior for the Regression Function

Our regression function is being modeled as:
\begin{align}
f(x) = \beta_0 + \beta_1 x + \beta_2 (x - 1)_+ + \dots + \beta_m(x - (m-1))_+
\end{align}

where $\beta_0, \beta_1, \dots, \beta_m$ are all independent with
\begin{align*}
\beta_0, \beta_1 \overset{\text{i.i.d}}{\sim} N(0, C) ~~ \text{ and } ~~
\beta_2, \dots, \beta_m \overset{\text{i.i.d}}{\sim} N(0, \tau^2).
\end{align*}

This can also be seen as a prior on the regression function $f$ more directly. For every $0 \leq u_1 < \dots < u_k < \infty$, the joint distribution of $(f(u_1), \dots, f(u_k))$ induced by (22) will be multivariate normal. This implies that $(f(u), u \geq 0)$ is a Gaussian Process. The description of the GP can be done in terms of the mean function $\E f(u)$ and the covariance kernel $\text{Cov}(f(u), f(v))$.
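To see this prior on $f$ directly, one can draw coefficient vectors from the prior and evaluate the resulting functions on a grid; a small illustrative sketch (the values of $C$, $\tau$, $m$ below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
m, C, tau = 20, 100.0, 1.0                      # illustrative values only
u = np.linspace(0, m, 401)                      # grid of points at which to evaluate f

# Basis at each grid point: [1, u, (u-1)_+, ..., (u-(m-1))_+].
B = np.column_stack([np.ones_like(u), u] + [np.maximum(u - j, 0.0) for j in range(1, m)])

# Prior: beta_0, beta_1 ~ N(0, C); beta_2, ..., beta_m ~ N(0, tau^2).
sd = np.array([np.sqrt(C), np.sqrt(C)] + [tau] * (m - 1))
for _ in range(5):
    beta = rng.normal(0.0, sd)                   # one prior draw of the coefficients
    f = B @ beta                                 # the corresponding regression function on the grid
    # f can now be plotted against u to visualize one prior sample path
```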
The mean function is clearly zero (because each $\E \beta_j = 0$). The covariance kernel is given by:
\begin{align*}
&\text{cov}(f(u), f(v)) \\
&= \text{cov} \left(\beta_0 + \beta_1 u + \beta_2 (u - 1)_+ + \dots + \beta_m(u - (m-1))_+, ~
\beta_0 + \beta_1 v + \beta_2 (v - 1)_+ + \dots + \beta_m(v - (m-1))_+ \right) \\
&= C(1 + uv) + \tau^2 \sum_{j=1}^{m-1} (u - j)_+ (v - j)_+ \\
&= C(1 + uv) + \tau^2 \sum_{j=1}^{k} (u - j)(v - j) \\
&= C(1 + uv) + \tau^2 \left(u v k + \frac{1}{6} k(k+1)(2k+1) - (u+v)\frac{k(k+1)}{2} \right)
\end{align*}

where $k = \lfloor u \wedge v \rfloor$ (and $u \wedge v := \min(u, v)$); since $(u-j)_+(v-j)_+ = 0$ whenever $j \geq u \wedge v$, only the first $k$ terms contribute to the sum.
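As a quick numerical sanity check, the closed form can be compared against the direct sum over the ReLU basis (the values of $C$, $\tau^2$, $m$ below are arbitrary):

```python
import numpy as np

def kernel_direct(u, v, C, tau2, m):
    """cov(f(u), f(v)) computed directly from the ReLU basis expansion."""
    s = sum(max(u - j, 0.0) * max(v - j, 0.0) for j in range(1, m))
    return C * (1 + u * v) + tau2 * s

def kernel_closed_form(u, v, C, tau2):
    """Closed-form expression with k = floor(min(u, v))."""
    k = np.floor(min(u, v))
    return C * (1 + u * v) + tau2 * (u * v * k + k * (k + 1) * (2 * k + 1) / 6
                                     - (u + v) * k * (k + 1) / 2)

# Both versions agree on a few test points (taking u, v <= m so no basis terms are missed).
C, tau2, m = 10.0, 2.0, 40
for u, v in [(3.7, 8.2), (5.0, 5.0), (0.4, 12.9)]:
    print(kernel_direct(u, v, C, tau2, m), kernel_closed_form(u, v, C, tau2))
```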
Suppose now we use the approximation
\begin{align*}
k \approx u \wedge v ~~ \text{ and } ~~ k(k+1)(2k+1) \approx 2k^3
\approx 2(u \wedge v)^3 ~~ \text{ and } ~~ k(k+1) \approx k^2
\approx (u \wedge v)^2.
\end{align*}

Then the covariance kernel becomes:
\begin{align*}
\text{cov}(f(u), f(v)) \approx C(1 + uv) + \tau^2 \left(u v (u \wedge v) + \frac{(u \wedge v)^3}{3} - \frac{(u \wedge v)^2}{2} (u + v) \right).
\end{align*}

It turns out that the right hand side above is precisely the covariance kernel of:
\begin{align*}
G_x = \beta_0 + \beta_1 x + \tau I_x
\end{align*}

where $\beta_0, \beta_1 \overset{\text{i.i.d}}{\sim} N(0, C)$ and $I_x$ is Integrated Brownian Motion on $[0, \infty)$. We will revisit this in the next lecture.
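For reference, here is a quick check of this claim, using the standard fact that integrated Brownian motion has covariance $\text{Cov}(I_u, I_v) = \frac{(u \wedge v)^2 (u \vee v)}{2} - \frac{(u \wedge v)^3}{6}$. Writing $w := u \wedge v$ and $M := u \vee v$ (so that $uv = wM$), the $\tau^2$ term in the approximate kernel is

\begin{align*}
uv\,w + \frac{w^3}{3} - \frac{w^2}{2}(u + v)
= w^2 M + \frac{w^3}{3} - \frac{w^2}{2}(w + M)
= \frac{w^2 M}{2} - \frac{w^3}{6} = \text{Cov}(I_u, I_v),
\end{align*}

and the $C(1 + uv)$ term is the covariance contributed by $\beta_0 + \beta_1 x$, so $\text{Cov}(G_u, G_v)$ matches the approximate kernel above.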