STAT 238 - Bayesian Statistics Lecture Twenty Two
Spring 2026, UC Berkeley
Gaussian Processes

The goal is to infer an unknown function $f : \Omega \rightarrow \R$ defined on a known domain $\Omega \subseteq \R^p$. We assume that $\{f(x), x \in \Omega\}$ forms a Gaussian process with mean zero and covariance function (or kernel) $K(x, x')$, i.e.,
\begin{align*}
\text{Cov}(f(x), f(x')) = K(x, x') \quad \text{for all } x, x' \in \Omega.
\end{align*}
The kernel must be positive semi-definite, i.e., for every $N \geq 1$ and distinct points $u_1, \dots, u_N \in \Omega$, the $N \times N$ matrix with $(i, j)^{th}$ entry $K(u_i, u_j)$ is positive semi-definite. Often this matrix will be positive definite, and hence invertible.
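As a quick numerical illustration (a sketch of ours, not part of the lecture), one can form the kernel matrix at a few distinct points and inspect its eigenvalues to check positive semi-definiteness. Here we use the Brownian motion kernel $K(s, t) = \min(s, t)$ from the examples below; the points are arbitrary.

```python
import numpy as np

# Illustration: check positive semi-definiteness of the kernel matrix
# with (i, j) entry K(u_i, u_j) = min(u_i, u_j) at distinct points.
u = np.linspace(0.1, 1.0, 8)        # distinct points in Omega = [0, infinity)
K = np.minimum.outer(u, u)          # (i, j) entry is min(u_i, u_j)

eigvals = np.linalg.eigvalsh(K)     # eigenvalues of the symmetric matrix K
print(eigvals.min())                # nonnegative (PSD); here strictly positive (PD)
```

For the min kernel at distinct positive points the matrix is in fact positive definite, which is the invertibility used later in the interpolation formulas.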
Here are some examples of Gaussian processes and kernels.
Brownian Motion: Here $\Omega = [0, \infty)$ and $K(s, t) = \min(s, t)$.
(scaled) Brownian Motion plus constant: Here again $\Omega = [0, \infty)$. In the Brownian motion model, $f(0) = 0$, which is generally an unrealistic assumption. To fix this, we assume that $f(t) = \beta_0 + \tau W_t$ where $W_t \sim BM$, $\beta_0 \sim N(0, C)$, $\tau > 0$, and $\beta_0$ is independent of $\{W_t\}$. $C$ will be taken to be large (potentially $C \rightarrow \infty$). Now
\begin{align*}
K(s, t) = C + \tau^2 \min(s, t).
\end{align*}

Integrated Brownian Motion: $\Omega = [0, \infty)$ and
\begin{align*}
f(t) = \int_0^t W_s \, ds \quad \text{where } W_s \sim BM.
\end{align*}
The kernel is given by
\begin{align*}
K(s, t) &= \text{Cov} \left(\int_0^s W_u \, du, \int_0^t W_v \, dv \right) \\
&= \int_0^s \int_0^t \text{Cov}(W_u, W_v) \, dv \, du = \int_0^s \int_0^t \min(u, v) \, dv \, du.
\end{align*}
To simplify further, assume that $s \leq t$ so that
\begin{align*}
K(s, t) &= \int_0^s \left(\int_0^u v \, dv + \int_u^t u \, dv \right) du \\
&= \int_0^s \left(\frac{u^2}{2} + u(t - u) \right) du = \int_0^s \left(ut - \frac{u^2}{2} \right) du = t \frac{s^2}{2} - \frac{s^3}{6}.
\end{align*}
For general $s$ and $t$, we have
\begin{align*}
K(s, t) = \frac{1}{2} \max(s, t) \left(\min(s, t)\right)^2 - \frac{1}{6} \left(\min(s, t) \right)^3.
\end{align*}

(scaled) Integrated Brownian Motion Plus a Linear Term: In the IBM model, we have $f(0) = 0$ and $f'(0) = 0$. This might be an unrealistic assumption to make when $f$ is completely unknown. In this case, a better model is
\begin{align*}
f(t) = \beta_0 + \beta_1 t + \tau \int_0^t W_s \, ds
\end{align*}
where $\beta_0, \beta_1, \{W_s\}$ are independent with $\{W_s\}$ being Brownian motion and $\beta_0, \beta_1 \stackrel{i.i.d.}{\sim} N(0, C)$. The kernel now becomes
\begin{align*}
K(s, t) &= \text{Cov} \left(\beta_0 + \beta_1 s + \tau \int_0^s W_u \, du, \beta_0 + \beta_1 t + \tau \int_0^t W_v \, dv \right) \\
&= C (1 + st) + \frac{\tau^2}{2} \max(s, t) \left(\min(s, t)\right)^2 - \frac{\tau^2}{6} \left(\min(s, t) \right)^3.
\end{align*}
We will see some more examples of kernels later. Next we see some basic applications of Gaussian processes.
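The closed form for the integrated Brownian motion kernel can be sanity-checked numerically (our own check, not from the lecture): evaluate the double integral $\int_0^s \int_0^t \min(u, v) \, dv \, du$ by quadrature and compare with $t s^2/2 - s^3/6$ for $s \leq t$.

```python
from scipy.integrate import dblquad

def ibm_kernel(s, t):
    """Closed-form IBM kernel: (1/2) max * min^2 - (1/6) min^3."""
    m, M = min(s, t), max(s, t)
    return 0.5 * M * m**2 - m**3 / 6.0

# Numerical double integral of Cov(W_u, W_v) = min(u, v) over [0, s] x [0, t].
s, t = 0.5, 1.0
num, _ = dblquad(lambda v, u: min(u, v), 0.0, s, 0.0, t)
print(num, ibm_kernel(s, t))  # both approximately 0.104166...
```

The two values agree, and for $s = 0.5 \leq t = 1$ the closed form reduces to $t s^2/2 - s^3/6 = 0.125 - 0.0208\overline{3}$.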
Interpolation

Suppose we are given the values of $f$ at $0 \leq x_1 < \dots < x_n \leq 1$: $f(x_1), \dots, f(x_n)$. Let $x \in [0, 1]$ be a new point (i.e., distinct from $x_1, \dots, x_n$). What can we say about $f(x)$?

We will solve this problem assuming a Gaussian process prior for $f$ with mean zero and covariance kernel $K(\cdot, \cdot)$. The answer is then given by the conditional distribution of $f(x)$ given $f(x_1), \dots, f(x_n)$. To compute this, we note that
\begin{align*}
(f(x_1), \dots, f(x_n), f(x))^T \sim N \left(0, \begin{pmatrix} (K(x_i, x_j))_{n \times n} & (K(x_i, x))_{n \times 1} \\ (K(x, x_i))_{1 \times n} & K(x, x) \end{pmatrix} \right).
\end{align*}
Using the notation $K = (K(x_i, x_j))_{n \times n}$ and $\mathbf{k} = (K(x_i, x))_{n \times 1}$, we can write the conditional distribution of $f(x)$ given $f(x_1), \dots, f(x_n)$ as
\begin{align*}
f(x) \mid f(x_1), \dots, f(x_n) \sim N\left(\mathbf{k}^T K^{-1} (f(x_1), \dots, f(x_n))^T, \; K(x, x) - \mathbf{k}^T K^{-1} \mathbf{k} \right).
\end{align*}
Thus the posterior mean (or mode) estimate of $f(x)$ is given by
\begin{align}
\widehat{f(x)} = \mathbf{k}^T K^{-1} (f(x_1), \dots, f(x_n))^T.
\end{align}
For different kernels $K$, the above expression depends on $f(x_1), \dots, f(x_n)$ in different ways. The simplest example is that of Brownian motion.
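The posterior mean and variance formulas above are a few lines of linear algebra. The following is a minimal sketch (the function name and data are ours, purely for illustration), using the Brownian motion kernel as a concrete choice:

```python
import numpy as np

def gp_posterior(x_new, x_obs, f_obs, kernel):
    """Posterior mean and variance of f(x_new) given exact observations
    f(x_obs), under a mean-zero GP prior with covariance function `kernel`."""
    K = kernel(x_obs[:, None], x_obs[None, :])    # n x n matrix (K(x_i, x_j))
    k = kernel(x_obs, x_new)                      # n-vector (K(x_i, x))
    alpha = np.linalg.solve(K, f_obs)             # K^{-1} f
    mean = k @ alpha                              # k^T K^{-1} f
    var = kernel(x_new, x_new) - k @ np.linalg.solve(K, k)
    return mean, var

bm = np.minimum                                   # Brownian motion kernel min(s, t)
x_obs = np.array([0.2, 0.5, 0.9])
f_obs = np.array([1.0, -0.3, 0.7])

mean, var = gp_posterior(0.2, x_obs, f_obs, bm)
print(mean, var)   # at an observed point: mean = f(0.2) = 1.0, variance = 0
```

At an observed point $x = x_i$, the vector $\mathbf{k}$ is the $i$-th column of $K$, so the posterior mean reproduces $f(x_i)$ exactly and the posterior variance vanishes, as the formula predicts.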
Suppose $f(x) = \beta_0 + \tau B(x)$ where $B(x)$ is Brownian motion, $\beta_0 \sim N(0, C)$ with $C \rightarrow \infty$, and $\tau$ is a fixed positive constant. Then the kernel is
\begin{align}
K(x_1, x_2) = C + \tau^2 \min(x_1, x_2).
\end{align}
In this case, (6) can be explicitly evaluated (in the limit $C \rightarrow \infty$) as
\begin{align}
\widehat{f(x)} =
\begin{cases}
f(x_1), & x \le x_1, \\[6pt]
\dfrac{x_{i+1}-x}{x_{i+1}-x_i} f(x_i) + \dfrac{x-x_i}{x_{i+1}-x_i} f(x_{i+1}), & x_i \le x \le x_{i+1}, \\[10pt]
f(x_n), & x \ge x_n.
\end{cases}
\end{align}
The details behind why (6) with the kernel (7) leads to (8) will be seen later.
The formula (8) for $\widehat{f(x)}$ performs linear interpolation between the observed points, and constant extrapolation outside the observed range.

Also note that $\widehat{f(x)}$ in (8) does not depend on $\tau$. The hyperparameter $\tau$ only affects the posterior variance of $f(x)$, not the posterior mean.
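One can check (8) numerically (our own check, with arbitrary data): plug the kernel (7) with a large finite $C$ into the posterior mean formula (6) and compare against linear interpolation with constant extrapolation, which is exactly what `np.interp` computes.

```python
import numpy as np

# Kernel (7) with a large but finite C standing in for C -> infinity.
C, tau = 1e6, 0.7
kernel = lambda s, t: C + tau**2 * np.minimum(s, t)

x_obs = np.array([0.2, 0.4, 0.7])
f_obs = np.array([1.0, -0.5, 2.0])
K = kernel(x_obs[:, None], x_obs[None, :])
alpha = np.linalg.solve(K, f_obs)                  # K^{-1} f

# Posterior mean (6) on a grid covering [0, 1], including points
# outside the observed range [x_1, x_n].
x_grid = np.array([0.0, 0.1, 0.3, 0.55, 0.7, 0.9])
post_mean = np.array([kernel(x_obs, x) @ alpha for x in x_grid])

# np.interp: linear interpolation inside [x_1, x_n], constant outside,
# matching (8).
lin = np.interp(x_grid, x_obs, f_obs)
print(np.max(np.abs(post_mean - lin)))   # small; tends to 0 as C -> infinity
```

Note that changing $\tau$ leaves `post_mean` essentially unchanged, consistent with the remark above that $\tau$ only affects the posterior variance.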
The next example is that of the integrated Brownian motion prior. Suppose
\begin{align}
f(x) = \beta_0 + \beta_1 x + \tau I(x),
\end{align}
where $I(x)$ is integrated Brownian motion and $\beta_0, \beta_1 \stackrel{i.i.d.}{\sim} N(0, C)$ with $C \to \infty$. In this case, the posterior mean in (6) can be evaluated explicitly and is the natural cubic spline interpolant of the observations. We will see the details behind this later.
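This claim can also be checked numerically (our own sketch, with a large finite $C$ approximating $C \to \infty$ and arbitrary data): compute the posterior mean (6) under the scaled IBM-plus-linear kernel and compare with scipy's natural cubic spline on the observed range.

```python
import numpy as np
from scipy.interpolate import CubicSpline

C, tau = 1e6, 1.0

def kernel(s, t):
    """Kernel of beta0 + beta1 t + tau * IBM, finite C approximating C -> infinity."""
    m, M = np.minimum(s, t), np.maximum(s, t)
    return C * (1.0 + s * t) + tau**2 * (0.5 * M * m**2 - m**3 / 6.0)

x_obs = np.array([0.1, 0.35, 0.6, 0.9])
f_obs = np.array([0.5, -1.0, 0.8, 0.2])
K = kernel(x_obs[:, None], x_obs[None, :])
alpha = np.linalg.solve(K, f_obs)

x_grid = np.linspace(0.1, 0.9, 9)        # stay inside [x_1, x_n]
post_mean = np.array([kernel(x_obs, x) @ alpha for x in x_grid])

# Natural cubic spline: zero second derivative at the boundary data points.
spline = CubicSpline(x_obs, f_obs, bc_type='natural')
print(np.max(np.abs(post_mean - spline(x_grid))))   # small
```

We restrict the comparison to $[x_1, x_n]$ since outside the observed range the spline and the GP posterior mean extrapolate differently.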
Integration

Suppose we are given the values of $f$ at $0 \leq x_1 < \dots < x_n \leq 1$: $f(x_1), \dots, f(x_n)$. What can we say about $\int_0^1 f(x) \, dx$?
Again we place a Gaussian process prior on $f$. By linearity, we can write
\begin{align*}
\E \left[ \int_0^1 f(x) \, dx \mid f(x_1), \dots, f(x_n) \right] &= \int_0^1 \E \left[ f(x) \mid f(x_1), \dots, f(x_n) \right] dx \\
&= \int_0^1 \mathbf{k}^T K^{-1} (f(x_1), \dots, f(x_n))^T \, dx \\
&= \left(\int_0^1 K(x_1, x) \, dx, \dots, \int_0^1 K(x_n, x) \, dx \right) K^{-1} (f(x_1), \dots, f(x_n))^T.
\end{align*}
For the Brownian motion prior, we already have the formula (8) for $\E[f(x) \mid f(x_1), \dots, f(x_n)]$. So the posterior mean for $\int_0^1 f(x) \, dx$ is obtained by simply integrating the right hand side of (8) from 0 to 1. This leads to:
\begin{align*}
\E \left(\int_0^1 f(x) \, dx \mid f(x_1), \dots, f(x_n) \right) = x_1 f(x_1) + \sum_{i=1}^{n-1} (x_{i+1} - x_i) \frac{f(x_i) + f(x_{i+1})}{2} + (1 - x_n) f(x_n).
\end{align*}
This is the trapezoidal integration rule (the two end terms come from the constant extrapolation of (8) beyond $x_1$ and $x_n$). Therefore, with the Brownian motion prior, the posterior mean estimate of $\int_0^1 f(x) \, dx$ coincides with the trapezoidal rule.
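As a quick check (ours, with arbitrary data): integrate the piecewise-linear posterior mean (8) numerically on a fine grid and compare with the closed-form expression above.

```python
import numpy as np

x_obs = np.array([0.2, 0.5, 0.7])
f_obs = np.array([1.0, 3.0, 2.0])

# Closed-form posterior mean of the integral under the BM prior.
closed = (x_obs[0] * f_obs[0]
          + np.sum(np.diff(x_obs) * (f_obs[:-1] + f_obs[1:]) / 2)
          + (1 - x_obs[-1]) * f_obs[-1])

# Direct numerical integration of (8): np.interp is linear inside
# [x_1, x_n] and constant outside, which is exactly the BM posterior mean.
grid = np.linspace(0.0, 1.0, 20001)
fhat = np.interp(grid, x_obs, f_obs)
numeric = np.sum((fhat[:-1] + fhat[1:]) / 2) * (grid[1] - grid[0])

print(closed, numeric)  # agree up to discretization error; here both = 1.9
```

With these particular points and values, $x_1 f(x_1) = 0.2$, the two interior trapezoids contribute $0.6 + 0.5$, and the right end term is $0.6$, for a total of $1.9$.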
For more on the relation between classical integration (quadrature) rules and Gaussian processes, see the book Probabilistic Numerics by Hennig, Osborne and Kersting (2022) or the paper "Bayesian numerical analysis" by Diaconis (1988).
In the next lecture, we shall see how to use Gaussian processes for the regression problem.