
STAT 238 - Bayesian Statistics Lecture Twenty Three

Spring 2026, UC Berkeley

Last lecture: interpolation with Gaussian processes

The interpolation problem is: suppose we are given the values of $f$ at points $x_1, \dots, x_n$ in the domain $\Omega$ of $f$. Let $x \in \Omega$ be a new point (i.e., distinct from $x_1, \dots, x_n$). What can we say about $f(x)$?

In the last lecture, we saw how to use Gaussian processes to solve this problem. We model $\{f(x), x \in \Omega\}$ as a Gaussian process with mean zero and covariance function (or kernel) $K(x, x')$, i.e.,

\text{Cov}(f(x), f(x')) = K(x, x') ~~\text{for all $x, x' \in \Omega$}.

Under this modeling assumption, the answer to the interpolation question is given by the conditional distribution of $f(x)$ given $f(x_1), \dots, f(x_n)$, which is calculated as follows. Note that:

\begin{align*} (f(x_1), \dots, f(x_n), f(x)) \sim N \left(0, \begin{pmatrix} (K(x_i, x_j))_{n \times n} & (K(x_i, x))_{n \times 1} \\ (K(x, x_i))_{1 \times n} & K(x, x) \end{pmatrix} \right). \end{align*}

Using the notation $K = (K(x_i, x_j))_{n \times n}$ and $\mathbf{k} = (K(x_i, x))_{n \times 1}$, we can write the conditional distribution of $f(x)$ given $f(x_1), \dots, f(x_n)$ as

\begin{align*} f(x) \mid f(x_1), \dots, f(x_n) \sim N\left(\mathbf{k}^T K^{-1} (f(x_1), \dots, f(x_n))^T, K(x, x) - \mathbf{k}^T K^{-1} \mathbf{k} \right). \end{align*}

Thus the posterior mean (or mode) estimate of $f(x)$ is given by

\begin{align} \widehat{f(x)} = \mathbf{k}^T K^{-1} (f(x_1), \dots, f(x_n))^T. \end{align}
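To make the interpolation formulas concrete, here is a minimal numpy sketch. The squared-exponential kernel, the design points, and the underlying function (a sine) are all illustrative assumptions; the lecture leaves $K$ generic.

```python
import numpy as np

def rbf_kernel(u, v, length_scale=1.0):
    # Squared-exponential kernel: an illustrative choice of K(x, x')
    return np.exp(-0.5 * ((u[:, None] - v[None, :]) / length_scale) ** 2)

# Hypothetical data: exact values of f at n design points (here f = sin)
x_obs = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
f_obs = np.sin(x_obs)

x_new = np.array([0.75])                  # test point x

K = rbf_kernel(x_obs, x_obs)              # (K(x_i, x_j))_{n x n}
k = rbf_kernel(x_obs, x_new)              # (K(x_i, x))_{n x 1}

# Posterior mean k^T K^{-1} f and variance K(x, x) - k^T K^{-1} k;
# np.linalg.solve avoids forming K^{-1} explicitly
f_hat = k.T @ np.linalg.solve(K, f_obs)
post_var = rbf_kernel(x_new, x_new) - k.T @ np.linalg.solve(K, k)
```

At a design point $x = x_i$, the vector $\mathbf{k}$ is the $i$-th column of $K$, so $K^{-1}\mathbf{k} = e_i$ and the posterior mean reproduces $f(x_i)$ exactly: the GP genuinely interpolates the data.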

Regression with Gaussian Processes

Today, we will see how to perform regression using Gaussian processes. The key difference between regression and interpolation is that, in regression, the values $f(x_1), \dots, f(x_n)$ are not observed exactly; instead, they are observed with noise.

More precisely, we are given observations $y_1, \dots, y_n$ modeled as

y_i = f(x_i) + \epsilon_i, \qquad \text{where } \epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2).

The parameter $\sigma$ controls the level of noise, i.e., how much each observation $y_i$ deviates from the true value $f(x_i)$.

As in the interpolation setting, our goal is to estimate $f(x)$ at a test point $x$. This test point may be different from the observed inputs $x_1, \dots, x_n$, or it may coincide with one of them.

Importantly, because the observations are noisy, it is meaningful to estimate $f(x_i)$ even at observed points. In fact, by combining information from all observations, we can often obtain a better estimate of $f(x_i)$ than the raw observation $y_i$.

Another key difference from interpolation is that the input points $x_1, \dots, x_n$ need not be distinct. Since each observation contains noise, repeated measurements at the same input can help improve the overall estimate.

The solution to the regression problem is very similar to that of the interpolation problem, with only one difference. The goal is to estimate $f(x)$ at the test point $x$ given the data $(x_1, y_1), \dots, (x_n, y_n)$. We will use the conditional distribution of $f(x)$ given $y_1, \dots, y_n$ (we will assume that $x_1, \dots, x_n, x$ are deterministic). To calculate this conditional distribution, first note that the marginal distribution of $(y_1, \dots, y_n, f(x))$ is given by

\begin{align*} (y_1, \dots, y_n, f(x)) \sim N \left(0, \begin{pmatrix} (K(x_i, x_j))_{n \times n} + \sigma^2 I_n & (K(x_i, x))_{n \times 1} \\ (K(x, x_i))_{1 \times n} & K(x, x) \end{pmatrix} \right). \end{align*}

Using the notation $K = (K(x_i, x_j))_{n \times n}$ and $\mathbf{k} = (K(x_i, x))_{n \times 1}$, we can write the conditional distribution of $f(x)$ given $y_1, \dots, y_n$ as

\begin{align*} f(x) \mid \text{data} \sim N\left(\mathbf{k}^T \left(K + \sigma^2 I_n \right)^{-1} y, K(x, x) - \mathbf{k}^T \left(K + \sigma^2 I_n \right)^{-1} \mathbf{k} \right). \end{align*}

Thus the posterior mean (or mode) estimate of $f(x)$ is given by

\begin{align} \widehat{f(x)} = \mathbf{k}^T \left(K + \sigma^2 I_n \right)^{-1} y, \end{align}

where $y$ is the $n \times 1$ vector with entries $y_1, \dots, y_n$.

So the only difference between (6) and (3) is the presence of the additional $\sigma^2 I_n$ term in the case of regression.

Often, we need to estimate $\sigma$ from the observed data (along with any additional hyperparameters present in the kernel $K$). For this, the marginal likelihood of $y_1, \dots, y_n$ is important. It is simply the multivariate normal distribution with mean vector $0$ and covariance $K + \sigma^2 I_n$.

To illustrate the calculations, let us take the special case of the Integrated Brownian Motion prior.

Calculations for the IBM prior

We take the prior

\begin{align*} f(x) = \beta_0 + \beta_1 x + \tau I(x) \end{align*}

where $\beta_0, \beta_1$ are i.i.d. $N(0, C)$ and $I(x)$ is integrated Brownian motion, independent of $\beta_0, \beta_1$.

This is a Gaussian process prior with mean zero and covariance kernel:

\begin{align*} K(u, v) = C(1 + uv) + \tau^2 K_I(u, v) \end{align*}

where $K_I(u, v)$ is the kernel corresponding to IBM, which is:

\begin{align*} K_I(u, v) &= \frac{1}{2} \left(\min(u, v) \right)^2 \max(u, v) - \frac{1}{6} \left(\min(u, v)\right)^3 \\ &= \frac{1}{6} \left(\min(u, v) \right)^2 \left(3 \max(u, v) - \min(u, v) \right) \\ &= u v (\min(u, v)) - \frac{u+v}{2} \left(\min(u, v) \right)^2 + \frac{1}{3} \left(\min(u, v) \right)^3. \end{align*}
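The IBM kernel is easy to vectorize; the helper below is a sketch (the name `ibm_kernel` is ours), and the algebraically equivalent forms above can be checked against each other numerically.

```python
import numpy as np

def ibm_kernel(u, v):
    # K_I(u, v) = (1/6) min(u, v)^2 (3 max(u, v) - min(u, v)),
    # evaluated entrywise on arrays via broadcasting
    m = np.minimum(u[:, None], v[None, :])
    M = np.maximum(u[:, None], v[None, :])
    return m**2 * (3.0 * M - m) / 6.0

def ibm_kernel_alt(u, v):
    # Third form in the display: uv*min - (u+v)/2*min^2 + min^3/3
    m = np.minimum(u[:, None], v[None, :])
    return (u[:, None] * v[None, :] * m
            - (u[:, None] + v[None, :]) / 2.0 * m**2
            + m**3 / 3.0)
```

For example, $K_I(1, 2) = \tfrac{1}{2}\cdot 1 \cdot 2 - \tfrac{1}{6} = \tfrac{5}{6}$, and the resulting Gram matrix is symmetric positive semidefinite, as a covariance matrix must be.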

Note that the kernel has the unknown parameter $\tau$ (which we write as $\tau = \gamma \sigma$). The constant $C$ is assumed to be large and is not estimated (ideally we want to take $C = +\infty$).

Given data $(x_1, y_1), \dots, (x_n, y_n)$, what is the posterior of $f(x)$? First let us assume that $\tau, \sigma$ are given. The estimate is simply (6). We simplify this expression below. Let $X$ denote the $n \times 2$ matrix whose first column is all ones and whose second column has entries $x_i$ (this is the usual $X$ matrix in the simple linear regression of $y$ on $x$ based on the data $(x_1, y_1), \dots, (x_n, y_n)$).

Note that

\begin{align*} \mathbf{k}^T &= (K(x, x_1), \dots, K(x, x_n)) \\ &= \left(C(1 + x x_i) + \tau^2 K_I(x, x_i), i = 1, \dots, n \right) \\ &= C (1, x) X^T + \tau^2 (K_I(x, x_1), \dots, K_I(x, x_n)) \\ &= C (1, x) X^T + \tau^2 \mathbf{k}_I^T. \end{align*}

where

\begin{align*} \mathbf{k}_I^T := (K_I(x, x_1), \dots, K_I(x, x_n)), \end{align*}

and $(1, x)$ denotes the row vector with entries $1$ and $x$.

Further, the $n \times n$ matrix $K$ has $(i, j)$-th entry:

\begin{align*} C(1 + x_i x_j) + \gamma^2 \sigma^2 K_I(x_i, x_j) \end{align*}

so that

\begin{align} K = C X X^T + \tau^2 K_I \end{align}

where $K_I$ is the $n \times n$ matrix with $(i, j)$-th entry $K_I(x_i, x_j)$.

As a result,

\begin{align*} \widehat{f(x)} = \mathbf{k}^T \left(K + \sigma^2 I_n \right)^{-1} y = \left(C (1, x) X^T + \tau^2 \mathbf{k}_I^T \right) \left(C X X^T + \tau^2 K_I + \sigma^2 I_n \right)^{-1} y. \end{align*}

This expression depends on the large constant $C$. Direct computation with a large $C$ may be numerically unstable. It is therefore natural to compute the limit as $C \rightarrow \infty$. Using the Sherman-Morrison-Woodbury identity,

\begin{align} (K + \sigma^2 I_n)^{-1} &= (C X X^T + \tau^2 K_I + \sigma^2 I_n)^{-1} \nonumber \\ &= \left(C X X^T + \Sigma \right)^{-1} \nonumber \\ &= \Sigma^{-1} - \Sigma^{-1} X \left(C^{-1} I_2 + X^T \Sigma^{-1} X \right)^{-1} X^T \Sigma^{-1} \end{align}

where

\begin{align} \Sigma = \tau^2 K_I + \sigma^2 I_n. \end{align}
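The Woodbury step can be sanity-checked numerically for a moderate $C$; the sizes and matrices below are arbitrary test inputs, with a generic SPD matrix standing in for $\tau^2 K_I + \sigma^2 I_n$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, C = 6, 1e4
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # n x 2
Z = rng.normal(size=(n, n))
Sigma = np.eye(n) + Z @ Z.T                            # SPD stand-in for tau^2 K_I + sigma^2 I_n

# Left side: direct inversion of C X X^T + Sigma
lhs = np.linalg.inv(C * X @ X.T + Sigma)
# Right side: Sherman-Morrison-Woodbury with the 2 x 2 inner matrix
Sinv = np.linalg.inv(Sigma)
rhs = Sinv - Sinv @ X @ np.linalg.inv(np.eye(2) / C + X.T @ Sinv @ X) @ X.T @ Sinv
```

The identity trades an $n \times n$ inversion of an ill-conditioned matrix for a well-conditioned $2 \times 2$ one, which is exactly what makes the $C \rightarrow \infty$ limit computable.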

Further

\begin{align*} \mathbf{k}^T = C (1, x) X^T + \tau^2 \mathbf{k}_I^T. \end{align*}

We thus get

\begin{align*} \widehat{f(x)} &= \left(C (1, x) X^T + \tau^2 \mathbf{k}_I^T \right) \left(\Sigma^{-1} - \Sigma^{-1} X \left(C^{-1} I_2 + X^T \Sigma^{-1} X \right)^{-1} X^T \Sigma^{-1} \right) y. \end{align*}

Write $\tau = \gamma \sigma$. In Fact 1, it is proved that, as $C \rightarrow \infty$, the above converges to:

\begin{align*} \widehat{f(x)} := (1, x) \left(X^T A_{\gamma}^{-1} X \right)^{-1} X^T A_{\gamma}^{-1} y + \gamma^2 \mathbf{k}_I^T \left(A_{\gamma}^{-1} - A_{\gamma}^{-1} X (X^T A_{\gamma}^{-1} X)^{-1} X^T A_{\gamma}^{-1} \right) y \end{align*}

where

\begin{align*} A_{\gamma} = I_n + \gamma^2 K_I. \end{align*}

This expression for $\widehat{f(x)}$, which depends only on $\gamma = \tau/\sigma$ and not on $\tau$ and $\sigma$ individually, can be used for computation.
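The limiting estimator can be sketched as follows (the function names are ours, and the IBM helper matches the formula for $K_I$ above). The first term is the generalized least squares line fit under working covariance $A_\gamma$; the second adds the kernel correction.

```python
import numpy as np

def ibm_kernel(u, v):
    # K_I(u, v) = (1/6) min^2 (3 max - min)
    m = np.minimum(u[:, None], v[None, :])
    M = np.maximum(u[:, None], v[None, :])
    return m**2 * (3.0 * M - m) / 6.0

def ibm_posterior_mean(x_new, x_obs, y, gamma):
    # C -> infinity posterior mean; depends only on gamma = tau / sigma
    n = x_obs.size
    X = np.column_stack([np.ones(n), x_obs])               # the n x 2 design matrix
    A = np.eye(n) + gamma**2 * ibm_kernel(x_obs, x_obs)    # A_gamma
    Ainv_X = np.linalg.solve(A, X)
    Ainv_y = np.linalg.solve(A, y)
    G = X.T @ Ainv_X                                       # X^T A_gamma^{-1} X
    beta = np.linalg.solve(G, X.T @ Ainv_y)                # GLS line coefficients
    resid = Ainv_y - Ainv_X @ beta                         # [A^{-1} - A^{-1}X G^{-1} X^T A^{-1}] y
    kI = ibm_kernel(x_obs, x_new)                          # entries K_I(x_i, x)
    H = np.column_stack([np.ones(x_new.size), x_new])      # rows (1, x)
    return H @ beta + gamma**2 * kI.T @ resid
```

A useful check: when the data lie exactly on a line, the second term vanishes (the GLS residual is zero), so $\widehat{f(x)}$ reproduces that line for every $\gamma$; and $\gamma = 0$ reduces to ordinary least squares.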

Estimation of $\gamma$ and $\sigma$

The hyperparameters $\gamma$ and $\sigma$ also need to be estimated from the observed data $(x_1, y_1), \dots, (x_n, y_n)$ (recall that $\tau = \gamma \sigma$). For this, the marginal likelihood of the data given $\sigma, \gamma$ is important. This is calculated using:

\begin{align*} y \mid \sigma, \gamma \sim N(0, K + \sigma^2 I_n) \end{align*}

where $K$ is given by (14). In other words,

\begin{align*} f_{y \mid \sigma, \gamma}(y) \propto \frac{1}{\sqrt{\det(K + \sigma^2 I_n)}} \exp \left(-\frac{1}{2} y^T (K + \sigma^2 I_n)^{-1} y \right). \end{align*}

For $(K + \sigma^2 I_n)^{-1}$, we use (16) and let $C \rightarrow \infty$ to get

\begin{align*} (K + \sigma^2 I_n)^{-1} \rightarrow \Sigma^{-1} - \Sigma^{-1} X \left(X^T \Sigma^{-1} X \right)^{-1} X^T \Sigma^{-1}. \end{align*}

Further, using the matrix determinant lemma, we get

\begin{align*} |K + \sigma^2 I_n| &= |\Sigma + C X X^T| \\ &= |\Sigma| |I + C X^T \Sigma^{-1} X| \\ &\approx |\Sigma| |C X^T \Sigma^{-1} X| \text{ when $C$ is large} \\ &= |\Sigma| C^2 |X^T \Sigma^{-1} X| \propto |\Sigma| |X^T \Sigma^{-1} X|. \end{align*}

So the marginal likelihood of $y$ given $\tau, \sigma$ becomes:

\begin{align*} f_{y \mid \tau, \sigma}(y) \propto |\Sigma|^{-1/2} |X^T \Sigma^{-1} X|^{-1/2} \exp \left(-\frac{1}{2} y^T \left[\Sigma^{-1} - \Sigma^{-1} X \left(X^T \Sigma^{-1} X \right)^{-1} X^T \Sigma^{-1}\right] y \right) \end{align*}

with $\Sigma$ defined in (17).

We now take $\tau = \gamma \sigma$ so that

\begin{align*} \Sigma = \sigma^2 \left(I_n + \gamma^2 K_I \right) = \sigma^2 A_{\gamma} ~~ \text{ where $A_{\gamma} := I_n + \gamma^2 K_I$}. \end{align*}

This gives

\begin{align*} f_{y \mid \gamma, \sigma}(y) \propto \sigma^{-(n-2)} |A_{\gamma}|^{-1/2} |X^T A_{\gamma}^{-1} X|^{-1/2} \exp \left(-\frac{y^T \left[A_{\gamma}^{-1} - A_{\gamma}^{-1} X (X^T A_{\gamma}^{-1} X)^{-1} X^T A_{\gamma}^{-1} \right] y}{2 \sigma^2} \right). \end{align*}

Combining this with the prior $\log \sigma \mid \gamma \sim \text{uniform}(-\infty, \infty)$ (equivalently, $p(\sigma) \propto 1/\sigma$) gives

\begin{align*} \frac{1}{\sigma^2} \mid y, \gamma \sim \text{Gamma} \left(\frac{n}{2} - 1, \frac{y^T \left[A_{\gamma}^{-1} - A_{\gamma}^{-1} X (X^T A_{\gamma}^{-1} X)^{-1} X^T A_{\gamma}^{-1} \right] y}{2} \right). \end{align*}
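Given $\gamma$, posterior draws of $\sigma^2$ then follow from a single Gamma sampler for $1/\sigma^2$. This sketch (function names ours) writes the quadratic form as $Q = y^T A_\gamma^{-1} y - b^T (X^T A_\gamma^{-1} X)^{-1} b$ with $b = X^T A_\gamma^{-1} y$; note the Gamma rate parameter $Q/2$ corresponds to numpy's `scale = 2/Q`.

```python
import numpy as np

def ibm_kernel(u, v):
    m = np.minimum(u[:, None], v[None, :])
    M = np.maximum(u[:, None], v[None, :])
    return m**2 * (3.0 * M - m) / 6.0

def sigma2_draws(x_obs, y, gamma, n_draws=2000, seed=0):
    # 1/sigma^2 | y, gamma ~ Gamma(shape = n/2 - 1, rate = Q/2)
    n = x_obs.size
    X = np.column_stack([np.ones(n), x_obs])
    A = np.eye(n) + gamma**2 * ibm_kernel(x_obs, x_obs)
    Ainv_y = np.linalg.solve(A, y)
    Ainv_X = np.linalg.solve(A, X)
    G = X.T @ Ainv_X
    b = X.T @ Ainv_y
    Q = y @ Ainv_y - b @ np.linalg.solve(G, b)     # the quadratic form in the Gamma rate
    rng = np.random.default_rng(seed)
    precision = rng.gamma(n / 2.0 - 1.0, scale=2.0 / Q, size=n_draws)
    return 1.0 / precision                         # draws of sigma^2
```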

Finally, if we take a prior $p(\gamma)$ for $\gamma$, then the posterior of $\gamma$ becomes:

\begin{align*} p(\gamma \mid y) \propto \frac{p(\gamma) |A_{\gamma}|^{-1/2} |X^T A_{\gamma}^{-1} X|^{-1/2}}{\left(y^T \left[A_{\gamma}^{-1} - A_{\gamma}^{-1} X (X^T A_{\gamma}^{-1} X)^{-1} X^T A_{\gamma}^{-1} \right] y \right)^{(n/2) - 1}}. \end{align*}
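Since $\gamma$ is one-dimensional, its posterior is conveniently evaluated on a grid in log scale. The sketch below (names ours, flat prior as an illustrative default) computes the unnormalized log posterior, using `slogdet` for the determinants.

```python
import numpy as np

def ibm_kernel(u, v):
    m = np.minimum(u[:, None], v[None, :])
    M = np.maximum(u[:, None], v[None, :])
    return m**2 * (3.0 * M - m) / 6.0

def log_post_gamma(gamma, x_obs, y, log_prior=lambda g: 0.0):
    # log of p(gamma) |A_gamma|^{-1/2} |X^T A_gamma^{-1} X|^{-1/2} Q^{-(n/2 - 1)}
    n = x_obs.size
    X = np.column_stack([np.ones(n), x_obs])
    A = np.eye(n) + gamma**2 * ibm_kernel(x_obs, x_obs)
    _, logdet_A = np.linalg.slogdet(A)
    Ainv_y = np.linalg.solve(A, y)
    Ainv_X = np.linalg.solve(A, X)
    G = X.T @ Ainv_X
    _, logdet_G = np.linalg.slogdet(G)
    b = X.T @ Ainv_y
    Q = y @ Ainv_y - b @ np.linalg.solve(G, b)
    return (log_prior(gamma) - 0.5 * logdet_A - 0.5 * logdet_G
            - (n / 2.0 - 1.0) * np.log(Q))
```

In practice one subtracts the maximum of the grid values before exponentiating and normalizing, to avoid overflow.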

Proof of the $C \rightarrow \infty$ fact