STAT 238 - Bayesian Statistics Lecture Twenty Two
Spring 2026, UC Berkeley
Gaussian Processes

The goal is to infer an unknown function $f : \Omega \rightarrow \R$ defined on a known domain $\Omega \subseteq \R^p$. We assume that $\{f(x), x \in \Omega\}$ forms a Gaussian process with mean zero and covariance function (or kernel) $K(x, x')$, i.e.,
\begin{align*}
\text{Cov}(f(x), f(x')) = K(x, x') \quad \text{for all } x, x' \in \Omega.
\end{align*}
The kernel must be positive semi-definite, i.e., for every $N \geq 1$ and distinct points $u_1, \dots, u_N \in \Omega$, the $N \times N$ matrix with $(i, j)^{th}$ entry $K(u_i, u_j)$ is positive semi-definite. Often this matrix will be positive definite, and hence invertible.
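As a quick numerical illustration (a sketch of ours, not part of the lecture), one can form the kernel matrix at a few distinct points and inspect its eigenvalues to check positive semi-definiteness. Here we use the Brownian motion kernel $K(s, t) = \min(s, t)$ from the examples below; the points are arbitrary.

```python
import numpy as np

# Illustration: check positive semi-definiteness of the kernel matrix
# with (i, j) entry K(u_i, u_j) = min(u_i, u_j) at distinct points.
u = np.linspace(0.1, 1.0, 8)        # distinct points in Omega = [0, infinity)
K = np.minimum.outer(u, u)          # (i, j) entry is min(u_i, u_j)

eigvals = np.linalg.eigvalsh(K)     # eigenvalues of the symmetric matrix K
print(eigvals.min())                # nonnegative (PSD); here strictly positive (PD)
```

For the min kernel at distinct positive points the matrix is in fact positive definite, which is the invertibility used later in the interpolation formulas.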
Here are some examples of Gaussian processes and kernels.
Brownian Motion: Here $\Omega = [0, \infty)$ and $K(s, t) = \min(s, t)$.
(scaled) Brownian Motion plus constant: Here again $\Omega = [0, \infty)$. In the Brownian motion model, $f(0) = 0$, which is generally an unrealistic assumption. To fix this, we assume that $f(t) = \beta_0 + \tau W_t$ where $W_t \sim BM$, $\beta_0 \sim N(0, C)$, $\tau > 0$, and $\beta_0$ is independent of $\{W_t\}$. $C$ will be taken to be large (potentially $C \rightarrow \infty$). Now
\begin{align*}
K(s, t) = C + \tau^2 \min(s, t).
\end{align*}

Integrated Brownian Motion: $\Omega = [0, \infty)$ and
\begin{align*}
f(t) = \int_0^t W_s \, ds \quad \text{where } W_s \sim BM.
\end{align*}
The kernel is given by
\begin{align*}
K(s, t) &= \text{Cov} \left(\int_0^s W_u \, du, \int_0^t W_v \, dv \right) \\
&= \int_0^s \int_0^t \text{Cov}(W_u, W_v) \, dv \, du = \int_0^s \int_0^t \min(u, v) \, dv \, du.
\end{align*}
To simplify further, assume that $s \leq t$ so that
\begin{align*}
K(s, t) &= \int_0^s \left(\int_0^u v \, dv + \int_u^t u \, dv \right) du \\
&= \int_0^s \left(\frac{u^2}{2} + u(t - u) \right) du = \int_0^s \left(ut - \frac{u^2}{2} \right) du = t \frac{s^2}{2} - \frac{s^3}{6}.
\end{align*}
For general $s$ and $t$, we have
\begin{align*}
K(s, t) = \frac{1}{2} \max(s, t) \left(\min(s, t)\right)^2 - \frac{1}{6} \left(\min(s, t) \right)^3.
\end{align*}

(scaled) Integrated Brownian Motion Plus a Linear Term: In the IBM model, we have $f(0) = 0$ and $f'(0) = 0$. This might be an unrealistic assumption to make when $f$ is completely unknown. In this case, a better model is
\begin{align*}
f(t) = \beta_0 + \beta_1 t + \tau \int_0^t W_s \, ds
\end{align*}
where $\beta_0, \beta_1, \{W_s\}$ are independent with $\{W_s\}$ being Brownian motion and $\beta_0, \beta_1 \stackrel{i.i.d.}{\sim} N(0, C)$. The kernel now becomes
\begin{align*}
K(s, t) &= \text{Cov} \left(\beta_0 + \beta_1 s + \tau \int_0^s W_u \, du, \beta_0 + \beta_1 t + \tau \int_0^t W_v \, dv \right) \\
&= C (1 + st) + \frac{\tau^2}{2} \max(s, t) \left(\min(s, t)\right)^2 - \frac{\tau^2}{6} \left(\min(s, t) \right)^3.
\end{align*}
We will see some more examples of kernels later. Next we see some basic applications of Gaussian processes.
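The closed form for the integrated Brownian motion kernel can be sanity-checked numerically (our own check, not from the lecture): evaluate the double integral $\int_0^s \int_0^t \min(u, v) \, dv \, du$ by quadrature and compare with $t s^2/2 - s^3/6$ for $s \leq t$.

```python
from scipy.integrate import dblquad

def ibm_kernel(s, t):
    """Closed-form IBM kernel: (1/2) max * min^2 - (1/6) min^3."""
    m, M = min(s, t), max(s, t)
    return 0.5 * M * m**2 - m**3 / 6.0

# Numerical double integral of Cov(W_u, W_v) = min(u, v) over [0, s] x [0, t].
s, t = 0.5, 1.0
num, _ = dblquad(lambda v, u: min(u, v), 0.0, s, 0.0, t)
print(num, ibm_kernel(s, t))  # both approximately 0.104166...
```

The two values agree, and for $s = 0.5 \leq t = 1$ the closed form reduces to $t s^2/2 - s^3/6 = 0.125 - 0.0208\overline{3}$.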
Interpolation

Suppose we are given the values of $f$ at $0 \leq x_1 < \dots < x_n \leq 1$: $f(x_1), \dots, f(x_n)$. Let $x \in [0, 1]$ be a new point (i.e., distinct from $x_1, \dots, x_n$). What can we say about $f(x)$?

We will solve this problem assuming a Gaussian process prior for $f$ with mean zero and covariance kernel $K(\cdot, \cdot)$. The answer is then given by the conditional distribution of $f(x)$ given $f(x_1), \dots, f(x_n)$. To compute this, we note that
\begin{align*}
(f(x_1), \dots, f(x_n), f(x))^T \sim N \left(0, \begin{pmatrix} (K(x_i, x_j))_{n \times n} & (K(x_i, x))_{n \times 1} \\ (K(x, x_i))_{1 \times n} & K(x, x) \end{pmatrix} \right).
\end{align*}
Using the notation $K = (K(x_i, x_j))_{n \times n}$ and $\mathbf{k} = (K(x_i, x))_{n \times 1}$, we can write the conditional distribution of $f(x)$ given $f(x_1), \dots, f(x_n)$ as
\begin{align*}
f(x) \mid f(x_1), \dots, f(x_n) \sim N\left(\mathbf{k}^T K^{-1} (f(x_1), \dots, f(x_n))^T, \; K(x, x) - \mathbf{k}^T K^{-1} \mathbf{k} \right).
\end{align*}
Thus the posterior mean (or mode) estimate of $f(x)$ is given by
\begin{align}
\widehat{f(x)} = \mathbf{k}^T K^{-1} (f(x_1), \dots, f(x_n))^T.
\end{align}
For different kernels $K$, the above expression depends on $f(x_1), \dots, f(x_n)$ in different ways. The simplest example is that of Brownian motion.
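The posterior mean and variance formulas above are a few lines of linear algebra. The following is a minimal sketch (the function name and data are ours, purely for illustration), using the Brownian motion kernel as a concrete choice:

```python
import numpy as np

def gp_posterior(x_new, x_obs, f_obs, kernel):
    """Posterior mean and variance of f(x_new) given exact observations
    f(x_obs), under a mean-zero GP prior with covariance function `kernel`."""
    K = kernel(x_obs[:, None], x_obs[None, :])    # n x n matrix (K(x_i, x_j))
    k = kernel(x_obs, x_new)                      # n-vector (K(x_i, x))
    alpha = np.linalg.solve(K, f_obs)             # K^{-1} f
    mean = k @ alpha                              # k^T K^{-1} f
    var = kernel(x_new, x_new) - k @ np.linalg.solve(K, k)
    return mean, var

bm = np.minimum                                   # Brownian motion kernel min(s, t)
x_obs = np.array([0.2, 0.5, 0.9])
f_obs = np.array([1.0, -0.3, 0.7])

mean, var = gp_posterior(0.2, x_obs, f_obs, bm)
print(mean, var)   # at an observed point: mean = f(0.2) = 1.0, variance = 0
```

At an observed point $x = x_i$, the vector $\mathbf{k}$ is the $i$-th column of $K$, so the posterior mean reproduces $f(x_i)$ exactly and the posterior variance vanishes, as the formula predicts.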
Suppose $f(x) = \beta_0 + \tau B(x)$ where $B(x)$ is Brownian motion, $\beta_0 \sim N(0, C)$ with $C \rightarrow \infty$, and $\tau$ is a fixed positive constant. Then the kernel is
\begin{align}
K(x_1, x_2) = C + \tau^2 \min(x_1, x_2).
\end{align}
In this case, (6) can be explicitly evaluated (in the limit $C \rightarrow \infty$) as
\begin{align}
\widehat{f(x)} =
\begin{cases}
f(x_1), & x \le x_1, \\[6pt]
\dfrac{x_{i+1}-x}{x_{i+1}-x_i} f(x_i) + \dfrac{x-x_i}{x_{i+1}-x_i} f(x_{i+1}), & x_i \le x \le x_{i+1}, \\[10pt]
f(x_n), & x \ge x_n.
\end{cases}
\end{align}
The details behind why (6) with the kernel (7) leads to (8) will be seen later.
The formula (8) for $\widehat{f(x)}$ performs linear interpolation between the observed points, and constant extrapolation outside the observed range.

Also note that $\widehat{f(x)}$ in (8) does not depend on $\tau$. The hyperparameter $\tau$ only affects the posterior variance of $f(x)$, not the posterior mean.
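One can check (8) numerically (our own check, with arbitrary data): plug the kernel (7) with a large finite $C$ into the posterior mean formula (6) and compare against linear interpolation with constant extrapolation, which is exactly what `np.interp` computes.

```python
import numpy as np

# Kernel (7) with a large but finite C standing in for C -> infinity.
C, tau = 1e6, 0.7
kernel = lambda s, t: C + tau**2 * np.minimum(s, t)

x_obs = np.array([0.2, 0.4, 0.7])
f_obs = np.array([1.0, -0.5, 2.0])
K = kernel(x_obs[:, None], x_obs[None, :])
alpha = np.linalg.solve(K, f_obs)                  # K^{-1} f

# Posterior mean (6) on a grid covering [0, 1], including points
# outside the observed range [x_1, x_n].
x_grid = np.array([0.0, 0.1, 0.3, 0.55, 0.7, 0.9])
post_mean = np.array([kernel(x_obs, x) @ alpha for x in x_grid])

# np.interp: linear interpolation inside [x_1, x_n], constant outside,
# matching (8).
lin = np.interp(x_grid, x_obs, f_obs)
print(np.max(np.abs(post_mean - lin)))   # small; tends to 0 as C -> infinity
```

Note that changing $\tau$ leaves `post_mean` essentially unchanged, consistent with the remark above that $\tau$ only affects the posterior variance.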
The next example is that of the integrated Brownian motion prior. Suppose
\begin{align}
f(x) = \beta_0 + \beta_1 x + \tau I(x),
\end{align}
where $I(x)$ is integrated Brownian motion and $\beta_0, \beta_1 \stackrel{i.i.d.}{\sim} N(0, C)$ with $C \to \infty$. In this case, the posterior mean in (6) can be evaluated explicitly and is the natural cubic spline interpolant of the observations. We will see the details behind this later.
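This claim can also be checked numerically (our own sketch, with a large finite $C$ approximating $C \to \infty$ and arbitrary data): compute the posterior mean (6) under the scaled IBM-plus-linear kernel and compare with scipy's natural cubic spline on the observed range.

```python
import numpy as np
from scipy.interpolate import CubicSpline

C, tau = 1e6, 1.0

def kernel(s, t):
    """Kernel of beta0 + beta1 t + tau * IBM, finite C approximating C -> infinity."""
    m, M = np.minimum(s, t), np.maximum(s, t)
    return C * (1.0 + s * t) + tau**2 * (0.5 * M * m**2 - m**3 / 6.0)

x_obs = np.array([0.1, 0.35, 0.6, 0.9])
f_obs = np.array([0.5, -1.0, 0.8, 0.2])
K = kernel(x_obs[:, None], x_obs[None, :])
alpha = np.linalg.solve(K, f_obs)

x_grid = np.linspace(0.1, 0.9, 9)        # stay inside [x_1, x_n]
post_mean = np.array([kernel(x_obs, x) @ alpha for x in x_grid])

# Natural cubic spline: zero second derivative at the boundary data points.
spline = CubicSpline(x_obs, f_obs, bc_type='natural')
print(np.max(np.abs(post_mean - spline(x_grid))))   # small
```

We restrict the comparison to $[x_1, x_n]$ since outside the observed range the spline and the GP posterior mean extrapolate differently.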
Integration

Suppose we are given the values of $f$ at $0 \leq x_1 < \dots < x_n \leq 1$: $f(x_1), \dots, f(x_n)$. What can we say about $\int_0^1 f(x) \, dx$?
Again we place a Gaussian process prior on $f$. By linearity, we can write
\begin{align*}
\E \left[ \int_0^1 f(x) \, dx \mid f(x_1), \dots, f(x_n) \right] &= \int_0^1 \E \left[ f(x) \mid f(x_1), \dots, f(x_n) \right] dx \\
&= \int_0^1 \mathbf{k}^T K^{-1} (f(x_1), \dots, f(x_n))^T \, dx \\
&= \left(\int_0^1 K(x_1, x) \, dx, \dots, \int_0^1 K(x_n, x) \, dx \right) K^{-1} (f(x_1), \dots, f(x_n))^T.
\end{align*}
For the Brownian motion prior, we already have the formula (8) for $\E[f(x) \mid f(x_1), \dots, f(x_n)]$. So the posterior mean for $\int_0^1 f(x) \, dx$ is obtained by simply integrating the right hand side of (8) from 0 to 1. This leads to:
\begin{align*}
\E \left(\int_0^1 f(x) \, dx \mid f(x_1), \dots, f(x_n) \right) = x_1 f(x_1) + \sum_{i=1}^{n-1} (x_{i+1} - x_i) \frac{f(x_i) + f(x_{i+1})}{2} + (1 - x_n) f(x_n).
\end{align*}
This is the trapezoidal integration rule (the two end terms come from the constant extrapolation of (8) beyond $x_1$ and $x_n$). Therefore, with the Brownian motion prior, the posterior mean estimate of $\int_0^1 f(x) \, dx$ coincides with the trapezoidal rule.
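As a quick check (ours, with arbitrary data): integrate the piecewise-linear posterior mean (8) numerically on a fine grid and compare with the closed-form expression above.

```python
import numpy as np

x_obs = np.array([0.2, 0.5, 0.7])
f_obs = np.array([1.0, 3.0, 2.0])

# Closed-form posterior mean of the integral under the BM prior.
closed = (x_obs[0] * f_obs[0]
          + np.sum(np.diff(x_obs) * (f_obs[:-1] + f_obs[1:]) / 2)
          + (1 - x_obs[-1]) * f_obs[-1])

# Direct numerical integration of (8): np.interp is linear inside
# [x_1, x_n] and constant outside, which is exactly the BM posterior mean.
grid = np.linspace(0.0, 1.0, 20001)
fhat = np.interp(grid, x_obs, f_obs)
numeric = np.sum((fhat[:-1] + fhat[1:]) / 2) * (grid[1] - grid[0])

print(closed, numeric)  # agree up to discretization error; here both = 1.9
```

With these particular points and values, $x_1 f(x_1) = 0.2$, the two interior trapezoids contribute $0.6 + 0.5$, and the right end term is $0.6$, for a total of $1.9$.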
For more on the relation between classical integration (quadrature) rules and Gaussian processes, see the book Probabilistic Numerics by Hennig, Osborne and Kersting (2022) or the paper "Bayesian numerical analysis" by Diaconis (1988).
In the next lecture, we shall see how to use Gaussian processes for the regression problem.