Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

STAT 238 - Bayesian Statistics Lecture Twenty One

Spring 2026, UC Berkeley

Recap: Model from past few lectures

We studied the following model in the past few lectures.

We have a response variable yy and a single covariate xx. In our example, yy denotes weekly earnings and xx denotes years of experience. The covariate xx takes the values 0,1,,m0, 1, \dots, m for some fixed integer mm.

Our data is (xi,yi),i=1,,n(x_i, y_i), i = 1, \dots, n. The model is:

yi=β0+β1xi+β2ReLU(xi1)++βmReLU(xi(m1))+ϵi\begin{align} y_i = \beta_0 + \beta_1 x_i + \beta_2 \relu(x_i - 1) + \dots + \beta_{m}\relu(x_i - (m-1)) + \epsilon_i \end{align}

where ReLU(u):=max(u,0)\relu(u) := \max(u, 0), and ϵii.i.dN(0,σ2)\epsilon_i \overset{\text{i.i.d}}{\sim} N(0, \sigma^2). Note we did not include ReLU(xim)\relu(x_i - m) because it always equals 0.

We discussed Bayesian inference in this model using the prior:

β0N(0,C),β1N(0,C),β2,,βmi.i.dN(0,τ2)\begin{align} \beta_0 \sim N(0, C), \beta_1 \sim N(0, C), \beta_2, \dots, \beta_m \overset{\text{i.i.d}}{\sim} N(0, \tau^2) \end{align}

Towards the end of last lecture, we remarked that the resulting model on the regression function:

f(x)=β0+β1x+β2ReLU(x1)++βmReLU(x(m1))\begin{align*} f(x) = \beta_0 + \beta_1 x + \beta_2 \relu(x - 1) + \dots + \beta_{m}\relu(x - (m-1)) \end{align*}

with prior (2) is very similar to:

f(x)=β0+β1x+τI(x)\begin{align} f(x) = \beta_0 + \beta_1 x + \tau I(x) \end{align}

where I(x)I(x) is integrated Brownian motion (and β0,β1i.i.dN(0,C)\beta_0, \beta_1 \overset{\text{i.i.d}}{\sim} N(0, C)). We shall some intuition today as to why the Integrated Brownian Motion is appearing here.

Brownian Motion and Integrated Brownian Motion

Before defining Brownian motion, let us recall the definition of Brownian motion. Brownian motion on [0,M][0, M] (for some fixed M>0M > 0) is a stochastic process {B(t),0tM}\{B(t), 0 \le t \le M\} characterized by the following two properties:

  1. Every realization is a continuous function on [0,M][0,M], and B(0)=0B(0) = 0.

  2. For every fixed t1,,tk[0,M]t_1,\dots,t_k \in [0,M], the random vector (B(t1),,B(tk))(B(t_1),\dots,B(t_k)) has a multivariate normal distribution with mean zero and covariance matrix

    Cov(B(ti),B(tj))=min(ti,tj).\text{Cov}(B(t_i), B(t_j)) = \min(t_i,t_j).

This definition suggests the following method for simulating B(t)B(t). Construct the dense grid

0=t0<t1<<tN=M0 = t_0 < t_1 < \dots < t_N = M

on [0,M][0,M] with ti=iM/Nt_i = iM/N. One can then simulate (B(t1),,B(tN))(B(t_1),\dots,B(t_N)) from the multivariate normal distribution with mean zero and covariance matrix min(ti,tj)\min(t_i,t_j).

However, this approach is computationally inefficient for large NN. Sampling directly from an N×NN \times N covariance matrix typically requires a matrix factorization (such as a Cholesky decomposition), which has computational cost on the order of O(N3)O(N^3).

For simulations in particular, the following alternative definition is more appealing:

  1. Every realization is a continuous function on [0,M][0, M] with B(0)=0B(0) = 0.

  2. The process has independent increments which means that B(t1),B(t2)B(t1),,B(tk)B(tk1)B(t_1), B(t_2) - B(t_1), \dots, B(t_k) - B(t_{k-1}) are independent whenever 0t1<<tkM0 \leq t_1 < \dots < t_k \leq M.

  3. B(t)B(s)N(0,ts)B(t) - B(s) \sim N(0, t-s) whenever 0st0 \leq s \leq t.

With this definition, we can simulate B(t1),,B(tN)B(t_1), \dots, B(t_N) for ti=iM/Nt_i = iM/N in the following way:

Note that in this way αi=B(ti)B(ti1)\alpha_i = B(t_i) - B(t_{i-1}). The equation B(ti)=α1++αiB(t_i) = \alpha_1 + \dots + \alpha_i can also be written as

B(t)=i=1NαiI{tti}   for t{t1,,tN}.\begin{align*} B(t) = \sum_{i=1}^N \alpha_i I\{t \geq t_i\} ~~ \text{ for $t \in \{t_1, \dots, t_N\}$}. \end{align*}

Because of the denseness of the grid (when NN is large), we are claiming that

B(t)i=1NαiI{tti}   for all t[0,M].\begin{align} B(t) \approx \sum_{i=1}^N \alpha_i I\{t \geq t_i\} ~~ \text{ for all $t \in [0, M]$}. \end{align}

This means that Brownian motion can be well-approximated (when NN is large) by a piecewise constant process which jumps at each grid point ti=iM/Nt_i = iM/N by a small amount given by N(0,M/N)N(0, M/N).

We are now ready to define the Integrated Brownian Motion. This is given by:

I(t)=0tB(s)ds.\begin{align} I(t) = \int_0^t B(s) ds. \end{align}

Using the approximation (8), we can write

I(t)0t(i=1NαiI{tti})ds=i=1Nαi0tI{sti}ds=i=1Nαi(tti)+.\begin{align*} I(t) \approx \int_0^t \left(\sum_{i=1}^N \alpha_i I\{t \geq t_i\} \right) ds = \sum_{i=1}^N \alpha_i \int_0^t I\{s \geq t_i\} ds = \sum_{i=1}^N \alpha_i (t - t_i)_+. \end{align*}

Therefore Integrated Brownian Motion can be approximated as:

I(t)i=1Nαi(tti)+\begin{align} I(t) \approx \sum_{i=1}^N \alpha_i (t - t_i)_+ \end{align}

when NN is large, ti=iM/Nt_i = iM/N and αii.i.dN(0,M/N)\alpha_i \overset{\text{i.i.d}}{\sim} N(0, M/N).

The Model (1)

Using (11), the (prior) model (1) can therefore be written as:

f(x)β0+β1x+τi=1Nαi(xiMN)+\begin{align*} f(x) \approx \beta_0 + \beta_1 x + \tau \sum_{i=1}^N \alpha_i \left(x - \frac{iM}{N}\right)_+ \end{align*}

Here NN is a large number. If we do a further coarse approximation with N=MN = M, we get back the model (1) if we make the identification βi=ταi1\beta_i = \tau \alpha_{i-1} for i=2,,Ni = 2, \dots, N and M=mM = m. Therefore, the model (1) can be understood as a coarse discretization of the model (1) which stipulates that ff is an Integrated Brownian Motion (scaled by τ\tau) and added to a linear function β0+β1x\beta_0 + \beta_1 x with the uniformative N(0,C)N(0, C) prior on the coefficients β0,β1\beta_0, \beta_1.

The prior (1) is an example of a Gaussian process prior.

Gaussian Process Regression

Consider the usual nonparametric regression problem where the goal is to estimate an unknown function f:ΩRf : \Omega \rightarrow \R from observations (x1,y1),,(xn,yn)(x_1, y_1), \dots, (x_n, y_n) under the model:

yi=f(xi)+ϵi  where ϵii.i.dN(0,σ2).y_i = f(x_i) + \epsilon_i ~~\text{where $\epsilon_i \overset{\text{i.i.d}}{\sim} N(0, \sigma^2)$}.

Here Ω\Omega denotes the domain which is a subset of Rd\R^d for some d1d \geq 1.

We assume that {f(x),xΩ}\{f(x), x \in \Omega\} forms a Gaussian process with mean function μ(x),xΩ\mu(x), x \in \Omega and covariance function or kernel given by K(x,x)K(x, x') i.e.,

Cov(f(x),f(x))=K(x,x)  for all x,xΩ.\text{Cov}(f(x), f(x')) = K(x, x') ~~\text{for all $x, x' \in \Omega$}.

The kernel is positive semi-definite i.e., for every N1N \geq 1, distinct points u1,,uNΩu_1, \dots, u_N \in \Omega, the N×NN \times N matrix with (i,j)th(i, j)^{th} entry K(ui,uj)K(u_i, u_j) is positive semi-definite. Often, this matrix will be positive definite, and hence invertible.

The mean function is usually taken to be zero. Here are some standard special cases. Here are some examples of Gaussian processes and kernels:

  1. Brownian Motion: Here Ω=[0,)\Omega = [0, \infty) and K(s,t)=min(s,t)K(s, t) = \min(s, t).

  2. (scaled) Brownian Motion plus constant: Here Ω=[0,)\Omega = [0, \infty) and we assume that f(t)=β0+τWtf(t) = \beta_0 + \tau W_t where WtBMW_t \sim BM, β0N(0,C)\beta_0 \sim N(0, C) and τ>0\tau > 0 (and independence between β0\beta_0 and {Wt}\{W_t\}). CC will be taken to be large. Now

    K(s,t)=C+τ2min(s,t).K(s, t) = C + \tau^2 \min(s, t).
  3. Integrated Brownian Motion: Ω=[0,)\Omega = [0, \infty) and

    f(t)=0tWsds  where WsBM.f(t) = \int_0^t W_s ds ~~ \text{where $W_s \sim BM$}.

    The kernel is given by

    K(s,t)=Cov(0sWudu,0tWvdv)=0s0tCov(Wu,Wv)dvdu=0s0tmin(u,v)dvdu\begin{align*} K(s, t) &= \text{Cov} \left(\int_0^s W_u du, \int_0^t W_v dv \right) \\ &= \int_0^s \int_0^t \text{Cov}(W_u, W_v) dv du = \int_0^s \int_0^t \min(u, v) dv du \end{align*}

    To simplify further, assume that sts \leq t so that

    K(s,t)=0s(0uvdv+utudv)du=0s(u22+u(tu))du=0s(utu22)du=ts22s36.\begin{align*} K(s, t) &= \int_0^s \left(\int_0^u v dv + \int_u^t u dv \right) du \\ &= \int_0^s \left(\frac{u^2}{2} + u(t - u) \right) du = \int_0^s \left(ut - \frac{u^2}{2} \right) du = t \frac{s^2}{2} - \frac{s^3}{6}. \end{align*}

    For general ss and tt, we have

    K(s,t)=12max(s,t)(min(s,t))216(min(s,t))3.K(s, t) = \frac{1}{2} \max(s, t) \left(\min(s, t)\right)^2 - \frac{1}{6} \left(\min(s, t) \right)^3.
  4. (scaled) Integrated Brownian Motion Plus a Linear Term: In the IBM model, we have f(0)=0f(0) = 0 and f(0)=0f'(0) = 0. This might be an unrealistic assumption to make when ff is completely unknown. In this case, a better model might be

    f(t)=β0+β1t+τ0tWsdsf(t) = \beta_0 + \beta_1 t + \tau \int_0^t W_s ds

    where β0,β1,{Ws}\beta_0, \beta_1, \{W_s\} are independent with WsW_s being Brownian motion and β0,β1i.i.dN(0,C)\beta_0, \beta_1 \overset{\text{i.i.d}}{\sim} N(0, C). The kernel now becomes

    K(s,t)=Cov(β0+β1s+τ0sWudu,β0+β1t+τ0tWvdv)=C(1+st)+τ22max(s,t)(min(s,t))2τ26(min(s,t))3.\begin{align*} K(s, t) &= \text{Cov} \left(\beta_0 + \beta_1 s + \tau \int_0^s W_u du, \beta_0 + \beta_1 t + \tau \int_0^t W_v dv \right) \\ &= C (1 + st) + \frac{\tau^2}{2} \max(s, t) \left(\min(s, t)\right)^2 - \frac{\tau^2}{6} \left(\min(s, t) \right)^3. \end{align*}