
STAT 238 - Bayesian Statistics Lecture Twenty Three

Spring 2026, UC Berkeley

Last lecture: interpolation with Gaussian processes

The interpolation problem is: suppose we are given the values of $f$ at points $x_1, \dots, x_n$ in the domain $\Omega$ of $f$. Let $x \in \Omega$ be a new point (i.e., distinct from $x_1, \dots, x_n$). What can we say about $f(x)$?

In the last lecture, we saw how to use Gaussian processes to solve this problem. We model $\{f(x), x \in \Omega\}$ as a Gaussian process with mean zero and covariance function (or kernel) $K(x, x')$, i.e.,

\text{Cov}(f(x), f(x')) = K(x, x') ~~\text{for all $x, x' \in \Omega$}.

Under this modeling assumption, the answer to the interpolation question is given by the conditional distribution of $f(x)$ given $f(x_1), \dots, f(x_n)$, which is calculated as follows. Note that:

\begin{align*} (f(x_1), \dots, f(x_n), f(x)) \sim N \left(0, \begin{pmatrix} (K(x_i, x_j))_{n \times n} & (K(x_i, x))_{n \times 1} \\ (K(x, x_i))_{1 \times n} & K(x, x) \end{pmatrix} \right). \end{align*}

Using the notation $K = (K(x_i, x_j))_{n \times n}$ and $\mathbf{k} = (K(x_i, x))_{n \times 1}$, we can write the conditional distribution of $f(x)$ given $f(x_1), \dots, f(x_n)$ as

\begin{align*} f(x) \mid f(x_1), \dots, f(x_n) \sim N\left(\mathbf{k}^T K^{-1} (f(x_1), \dots, f(x_n))^T, K(x, x) - \mathbf{k}^T K^{-1} \mathbf{k} \right). \end{align*}

Thus the posterior mean (or mode) estimate of $f(x)$ is given by

\begin{align} \widehat{f(x)} = \mathbf{k}^T K^{-1} (f(x_1), \dots, f(x_n))^T. \end{align}
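To make the interpolation formulas concrete, here is a minimal numpy sketch. The squared-exponential kernel, the design points, and the underlying function (a sine) are all illustrative assumptions; the lecture leaves $K$ generic.

```python
import numpy as np

def rbf_kernel(u, v, length_scale=1.0):
    # Squared-exponential kernel: an illustrative choice of K(x, x')
    return np.exp(-0.5 * ((u[:, None] - v[None, :]) / length_scale) ** 2)

# Hypothetical data: exact values of f at n design points (here f = sin)
x_obs = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
f_obs = np.sin(x_obs)

x_new = np.array([0.75])                  # test point x

K = rbf_kernel(x_obs, x_obs)              # (K(x_i, x_j))_{n x n}
k = rbf_kernel(x_obs, x_new)              # (K(x_i, x))_{n x 1}

# Posterior mean k^T K^{-1} f and variance K(x, x) - k^T K^{-1} k;
# np.linalg.solve avoids forming K^{-1} explicitly
f_hat = k.T @ np.linalg.solve(K, f_obs)
post_var = rbf_kernel(x_new, x_new) - k.T @ np.linalg.solve(K, k)
```

At a design point $x = x_i$, the vector $\mathbf{k}$ is the $i$-th column of $K$, so $K^{-1}\mathbf{k} = e_i$ and the posterior mean reproduces $f(x_i)$ exactly: the GP genuinely interpolates the data.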

Regression with Gaussian Processes

Today, we will see how to perform regression using Gaussian processes. The key difference between regression and interpolation is that, in regression, the values $f(x_1), \dots, f(x_n)$ are not observed exactly; instead, they are observed with noise.

More precisely, we are given observations $y_1, \dots, y_n$ modeled as

y_i = f(x_i) + \epsilon_i, \qquad \text{where } \epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2).

The parameter $\sigma$ controls the level of noise, i.e., how much each observation $y_i$ deviates from the true value $f(x_i)$.

As in the interpolation setting, our goal is to estimate $f(x)$ at a test point $x$. This test point may be different from the observed inputs $x_1, \dots, x_n$, or it may coincide with one of them.

Importantly, because the observations are noisy, it is meaningful to estimate $f(x_i)$ even at observed points. In fact, by combining information from all observations, we can often obtain a better estimate of $f(x_i)$ than the raw observation $y_i$.

Another key difference from interpolation is that the input points $x_1, \dots, x_n$ need not be distinct. Since each observation contains noise, repeated measurements at the same input can help improve the overall estimate.

The solution to the regression problem is very similar to that of the interpolation problem, with only one difference. The goal is to estimate $f(x)$ at the test point $x$ given the data $(x_1, y_1), \dots, (x_n, y_n)$. We will use the conditional distribution of $f(x)$ given $y_1, \dots, y_n$ (we will assume that $x_1, \dots, x_n, x$ are deterministic). To calculate this conditional distribution, first note that the marginal distribution of $(y_1, \dots, y_n, f(x))$ is given by

\begin{align*} (y_1, \dots, y_n, f(x)) \sim N \left(0, \begin{pmatrix} (K(x_i, x_j))_{n \times n} + \sigma^2 I_n & (K(x_i, x))_{n \times 1} \\ (K(x, x_i))_{1 \times n} & K(x, x) \end{pmatrix} \right). \end{align*}

Using the notation $K = (K(x_i, x_j))_{n \times n}$ and $\mathbf{k} = (K(x_i, x))_{n \times 1}$, we can write the conditional distribution of $f(x)$ given $y_1, \dots, y_n$ as

\begin{align*} f(x) \mid \text{data} \sim N\left(\mathbf{k}^T \left(K + \sigma^2 I_n \right)^{-1} y, K(x, x) - \mathbf{k}^T \left(K + \sigma^2 I_n \right)^{-1} \mathbf{k} \right). \end{align*}

Thus the posterior mean (or mode) estimate of $f(x)$ is given by

\begin{align} \widehat{f(x)} = \mathbf{k}^T \left(K + \sigma^2 I_n \right)^{-1} y, \end{align}

where $y$ is the $n \times 1$ vector with entries $y_1, \dots, y_n$.

So the only difference between (6) and (3) is the presence of the additional $\sigma^2 I_n$ term in the case of regression.

Often, we need to estimate $\sigma$ from the observed data (along with any additional hyperparameters present in the kernel $K$). For this, the marginal likelihood of $y_1, \dots, y_n$ is important. It is simply the multivariate normal distribution with mean vector $0$ and covariance $K + \sigma^2 I_n$.

To illustrate the calculations, let us take the special case of the Integrated Brownian Motion prior.

Calculations for the IBM prior

We take the prior

\begin{align*} f(x) = \beta_0 + \beta_1 x + \tau I(x) \end{align*}

where $\beta_0, \beta_1$ are i.i.d. $N(0, C)$ and $I(x)$ is integrated Brownian motion, independent of $\beta_0, \beta_1$.

This is a Gaussian process prior with mean zero and covariance kernel:

\begin{align*} K(u, v) = C(1 + uv) + \tau^2 K_I(u, v) \end{align*}

where $K_I(u, v)$ is the kernel corresponding to IBM, which is:

\begin{align*} K_I(u, v) &= \frac{1}{2} \left(\min(u, v) \right)^2 \max(u, v) - \frac{1}{6} \left(\min(u, v)\right)^3 \\ &= \frac{1}{6} \left(\min(u, v) \right)^2 \left(3 \max(u, v) - \min(u, v) \right) \\ &= u v (\min(u, v)) - \frac{u+v}{2} \left(\min(u, v) \right)^2 + \frac{1}{3} \left(\min(u, v) \right)^3. \end{align*}
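The IBM kernel is easy to vectorize; the helper below is a sketch (the name `ibm_kernel` is ours), and the algebraically equivalent forms above can be checked against each other numerically.

```python
import numpy as np

def ibm_kernel(u, v):
    # K_I(u, v) = (1/6) min(u, v)^2 (3 max(u, v) - min(u, v)),
    # evaluated entrywise on arrays via broadcasting
    m = np.minimum(u[:, None], v[None, :])
    M = np.maximum(u[:, None], v[None, :])
    return m**2 * (3.0 * M - m) / 6.0

def ibm_kernel_alt(u, v):
    # Third form in the display: uv*min - (u+v)/2*min^2 + min^3/3
    m = np.minimum(u[:, None], v[None, :])
    return (u[:, None] * v[None, :] * m
            - (u[:, None] + v[None, :]) / 2.0 * m**2
            + m**3 / 3.0)
```

For example, $K_I(1, 2) = \tfrac{1}{2}\cdot 1 \cdot 2 - \tfrac{1}{6} = \tfrac{5}{6}$, and the resulting Gram matrix is symmetric positive semidefinite, as a covariance matrix must be.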

Note that the kernel has the unknown parameter $\tau$ (which we write as $\tau = \gamma \sigma$). The constant $C$ is assumed to be large and is not estimated (ideally we want to take $C = +\infty$).

Given data $(x_1, y_1), \dots, (x_n, y_n)$, what is the posterior of $f(x)$? First let us assume that $\tau, \sigma$ are given. The estimate is simply (6). We simplify this expression below. Let $X$ denote the $n \times 2$ matrix whose first column is all ones and whose second column has entries $x_i$ (this is the usual $X$ matrix in the simple linear regression of $y$ on $x$ based on the data $(x_1, y_1), \dots, (x_n, y_n)$).

Note that

\begin{align*} \mathbf{k}^T &= (K(x, x_1), \dots, K(x, x_n)) \\ &= \left(C(1 + x x_i) + \tau^2 K_I(x, x_i), i = 1, \dots, n \right) \\ &= C (1, x) X^T + \tau^2 (K_I(x, x_1), \dots, K_I(x, x_n)) \\ &= C (1, x) X^T + \tau^2 \mathbf{k}_I^T. \end{align*}

where

\begin{align*} \mathbf{k}_I^T := (K_I(x, x_1), \dots, K_I(x, x_n)), \end{align*}

and $(1, x)$ denotes the row vector with entries $1$ and $x$.

Further, the $n \times n$ matrix $K$ has $(i, j)$-th entry:

\begin{align*} C(1 + x_i x_j) + \gamma^2 \sigma^2 K_I(x_i, x_j) \end{align*}

so that

\begin{align} K = C X X^T + \tau^2 K_I \end{align}

where $K_I$ is the $n \times n$ matrix with $(i, j)$-th entry $K_I(x_i, x_j)$.

As a result,

\begin{align*} \widehat{f(x)} = \mathbf{k}^T \left(K + \sigma^2 I_n \right)^{-1} y = \left(C (1, x) X^T + \tau^2 \mathbf{k}_I^T \right) \left(C X X^T + \tau^2 K_I + \sigma^2 I_n \right)^{-1} y. \end{align*}

This expression depends on the large constant $C$. Direct computation with a large $C$ may be numerically unstable. It is therefore natural to compute the limit as $C \rightarrow \infty$. Using the Sherman-Morrison-Woodbury identity,

\begin{align} (K + \sigma^2 I_n)^{-1} &= (C X X^T + \tau^2 K_I + \sigma^2 I_n)^{-1} \nonumber \\ &= \left(C X X^T + \Sigma \right)^{-1} \nonumber \\ &= \Sigma^{-1} - \Sigma^{-1} X \left(C^{-1} I_2 + X^T \Sigma^{-1} X \right)^{-1} X^T \Sigma^{-1} \end{align}

where

\begin{align} \Sigma = \tau^2 K_I + \sigma^2 I_n. \end{align}
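The Woodbury step can be sanity-checked numerically for a moderate $C$; the sizes and matrices below are arbitrary test inputs, with a generic SPD matrix standing in for $\tau^2 K_I + \sigma^2 I_n$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, C = 6, 1e4
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # n x 2
Z = rng.normal(size=(n, n))
Sigma = np.eye(n) + Z @ Z.T                            # SPD stand-in for tau^2 K_I + sigma^2 I_n

# Left side: direct inversion of C X X^T + Sigma
lhs = np.linalg.inv(C * X @ X.T + Sigma)
# Right side: Sherman-Morrison-Woodbury with the 2 x 2 inner matrix
Sinv = np.linalg.inv(Sigma)
rhs = Sinv - Sinv @ X @ np.linalg.inv(np.eye(2) / C + X.T @ Sinv @ X) @ X.T @ Sinv
```

The identity trades an $n \times n$ inversion of an ill-conditioned matrix for a well-conditioned $2 \times 2$ one, which is exactly what makes the $C \rightarrow \infty$ limit computable.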

Further

\begin{align*} \mathbf{k}^T = C (1, x) X^T + \tau^2 \mathbf{k}_I^T. \end{align*}

We thus get

\begin{align*} \widehat{f(x)} &= \left(C (1, x) X^T + \tau^2 \mathbf{k}_I^T \right) \left(\Sigma^{-1} - \Sigma^{-1} X \left(C^{-1} I_2 + X^T \Sigma^{-1} X \right)^{-1} X^T \Sigma^{-1} \right) y. \end{align*}

Write $\tau = \gamma \sigma$. In Fact 1, it is proved that, as $C \rightarrow \infty$, the above converges to:

\begin{align*} \widehat{f(x)} := (1, x) \left(X^T A_{\gamma}^{-1} X \right)^{-1} X^T A_{\gamma}^{-1} y + \gamma^2 \mathbf{k}_I^T \left(A_{\gamma}^{-1} - A_{\gamma}^{-1} X (X^T A_{\gamma}^{-1} X)^{-1} X^T A_{\gamma}^{-1} \right) y \end{align*}

where

\begin{align*} A_{\gamma} = I_n + \gamma^2 K_I. \end{align*}

This expression for $\widehat{f(x)}$, which depends only on $\gamma = \tau/\sigma$ and not on $\tau$ and $\sigma$ individually, can be used for computation.
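The limiting estimator can be sketched as follows (the function names are ours, and the IBM helper matches the formula for $K_I$ above). The first term is the generalized least squares line fit under working covariance $A_\gamma$; the second adds the kernel correction.

```python
import numpy as np

def ibm_kernel(u, v):
    # K_I(u, v) = (1/6) min^2 (3 max - min)
    m = np.minimum(u[:, None], v[None, :])
    M = np.maximum(u[:, None], v[None, :])
    return m**2 * (3.0 * M - m) / 6.0

def ibm_posterior_mean(x_new, x_obs, y, gamma):
    # C -> infinity posterior mean; depends only on gamma = tau / sigma
    n = x_obs.size
    X = np.column_stack([np.ones(n), x_obs])               # the n x 2 design matrix
    A = np.eye(n) + gamma**2 * ibm_kernel(x_obs, x_obs)    # A_gamma
    Ainv_X = np.linalg.solve(A, X)
    Ainv_y = np.linalg.solve(A, y)
    G = X.T @ Ainv_X                                       # X^T A_gamma^{-1} X
    beta = np.linalg.solve(G, X.T @ Ainv_y)                # GLS line coefficients
    resid = Ainv_y - Ainv_X @ beta                         # [A^{-1} - A^{-1}X G^{-1} X^T A^{-1}] y
    kI = ibm_kernel(x_obs, x_new)                          # entries K_I(x_i, x)
    H = np.column_stack([np.ones(x_new.size), x_new])      # rows (1, x)
    return H @ beta + gamma**2 * kI.T @ resid
```

A useful check: when the data lie exactly on a line, the second term vanishes (the GLS residual is zero), so $\widehat{f(x)}$ reproduces that line for every $\gamma$; and $\gamma = 0$ reduces to ordinary least squares.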

Estimation of $\gamma$ and $\sigma$

The hyperparameters $\gamma$ and $\sigma$ also need to be estimated from the observed data $(x_1, y_1), \dots, (x_n, y_n)$ (recall that $\tau = \gamma \sigma$). For this, the marginal likelihood of the data given $\sigma, \gamma$ is important. This is calculated using:

\begin{align*} y \mid \sigma, \gamma \sim N(0, K + \sigma^2 I_n) \end{align*}

where $K$ is given by (14). In other words,

\begin{align*} f_{y \mid \sigma, \gamma}(y) \propto \frac{1}{\sqrt{\det(K + \sigma^2 I_n)}} \exp \left(-\frac{1}{2} y^T (K + \sigma^2 I_n)^{-1} y \right). \end{align*}

For $(K + \sigma^2 I_n)^{-1}$, we use (16) and let $C \rightarrow \infty$ to get

\begin{align*} (K + \sigma^2 I_n)^{-1} \rightarrow \Sigma^{-1} - \Sigma^{-1} X \left(X^T \Sigma^{-1} X \right)^{-1} X^T \Sigma^{-1}. \end{align*}

Further, using the matrix determinant lemma, we get

\begin{align*} |K + \sigma^2 I_n| &= |\Sigma + C X X^T| \\ &= |\Sigma| |I + C X^T \Sigma^{-1} X| \\ &\approx |\Sigma| |C X^T \Sigma^{-1} X| \text{ when $C$ is large} \\ &= |\Sigma| C^2 |X^T \Sigma^{-1} X| \propto |\Sigma| |X^T \Sigma^{-1} X|. \end{align*}

So the marginal likelihood of $y$ given $\tau, \sigma$ becomes:

\begin{align*} f_{y \mid \tau, \sigma}(y) \propto |\Sigma|^{-1/2} |X^T \Sigma^{-1} X|^{-1/2} \exp \left(-\frac{1}{2} y^T \left[\Sigma^{-1} - \Sigma^{-1} X \left(X^T \Sigma^{-1} X \right)^{-1} X^T \Sigma^{-1}\right] y \right) \end{align*}

with $\Sigma$ defined in (17).

We now take $\tau = \gamma \sigma$ so that

\begin{align*} \Sigma = \sigma^2 \left(I_n + \gamma^2 K_I \right) = \sigma^2 A_{\gamma} ~~ \text{ where $A_{\gamma} := I_n + \gamma^2 K_I$}. \end{align*}

This gives

\begin{align*} f_{y \mid \gamma, \sigma}(y) \propto \sigma^{-(n-2)} |A_{\gamma}|^{-1/2} |X^T A_{\gamma}^{-1} X|^{-1/2} \exp \left(-\frac{y^T \left[A_{\gamma}^{-1} - A_{\gamma}^{-1} X (X^T A_{\gamma}^{-1} X)^{-1} X^T A_{\gamma}^{-1} \right] y}{2 \sigma^2} \right). \end{align*}

Combining this with the prior $\log \sigma \mid \gamma \sim \text{uniform}(-\infty, \infty)$ (equivalently, $p(\sigma) \propto 1/\sigma$) gives

\begin{align*} \frac{1}{\sigma^2} \mid y, \gamma \sim \text{Gamma} \left(\frac{n}{2} - 1, \frac{y^T \left[A_{\gamma}^{-1} - A_{\gamma}^{-1} X (X^T A_{\gamma}^{-1} X)^{-1} X^T A_{\gamma}^{-1} \right] y}{2} \right). \end{align*}
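Given $\gamma$, posterior draws of $\sigma^2$ then follow from a single Gamma sampler for $1/\sigma^2$. This sketch (function names ours) writes the quadratic form as $Q = y^T A_\gamma^{-1} y - b^T (X^T A_\gamma^{-1} X)^{-1} b$ with $b = X^T A_\gamma^{-1} y$; note the Gamma rate parameter $Q/2$ corresponds to numpy's `scale = 2/Q`.

```python
import numpy as np

def ibm_kernel(u, v):
    m = np.minimum(u[:, None], v[None, :])
    M = np.maximum(u[:, None], v[None, :])
    return m**2 * (3.0 * M - m) / 6.0

def sigma2_draws(x_obs, y, gamma, n_draws=2000, seed=0):
    # 1/sigma^2 | y, gamma ~ Gamma(shape = n/2 - 1, rate = Q/2)
    n = x_obs.size
    X = np.column_stack([np.ones(n), x_obs])
    A = np.eye(n) + gamma**2 * ibm_kernel(x_obs, x_obs)
    Ainv_y = np.linalg.solve(A, y)
    Ainv_X = np.linalg.solve(A, X)
    G = X.T @ Ainv_X
    b = X.T @ Ainv_y
    Q = y @ Ainv_y - b @ np.linalg.solve(G, b)     # the quadratic form in the Gamma rate
    rng = np.random.default_rng(seed)
    precision = rng.gamma(n / 2.0 - 1.0, scale=2.0 / Q, size=n_draws)
    return 1.0 / precision                         # draws of sigma^2
```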

Finally, if we take a prior $p(\gamma)$ for $\gamma$, then the posterior of $\gamma$ becomes:

\begin{align*} p(\gamma \mid y) \propto \frac{p(\gamma) |A_{\gamma}|^{-1/2} |X^T A_{\gamma}^{-1} X|^{-1/2}}{\left(y^T \left[A_{\gamma}^{-1} - A_{\gamma}^{-1} X (X^T A_{\gamma}^{-1} X)^{-1} X^T A_{\gamma}^{-1} \right] y \right)^{(n/2) - 1}}. \end{align*}
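Since $\gamma$ is one-dimensional, its posterior is conveniently evaluated on a grid in log scale. The sketch below (names ours, flat prior as an illustrative default) computes the unnormalized log posterior, using `slogdet` for the determinants.

```python
import numpy as np

def ibm_kernel(u, v):
    m = np.minimum(u[:, None], v[None, :])
    M = np.maximum(u[:, None], v[None, :])
    return m**2 * (3.0 * M - m) / 6.0

def log_post_gamma(gamma, x_obs, y, log_prior=lambda g: 0.0):
    # log of p(gamma) |A_gamma|^{-1/2} |X^T A_gamma^{-1} X|^{-1/2} Q^{-(n/2 - 1)}
    n = x_obs.size
    X = np.column_stack([np.ones(n), x_obs])
    A = np.eye(n) + gamma**2 * ibm_kernel(x_obs, x_obs)
    _, logdet_A = np.linalg.slogdet(A)
    Ainv_y = np.linalg.solve(A, y)
    Ainv_X = np.linalg.solve(A, X)
    G = X.T @ Ainv_X
    _, logdet_G = np.linalg.slogdet(G)
    b = X.T @ Ainv_y
    Q = y @ Ainv_y - b @ np.linalg.solve(G, b)
    return (log_prior(gamma) - 0.5 * logdet_A - 0.5 * logdet_G
            - (n / 2.0 - 1.0) * np.log(Q))
```

In practice one subtracts the maximum of the grid values before exponentiating and normalizing, to avoid overflow.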

Proof of the $C \rightarrow \infty$ fact