Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

STAT 238 - Bayesian Statistics Lecture Fifteen

Spring 2026, UC Berkeley

Linear Regression

Our next topic is (multiple) linear regression. Here, we have one response variable yy and mm covariates x1,,xmx_1, \dots, x_m (m=1m = 1 corresponds to simple linear regression). We observe data on nn instances or subjects for all these variables: (yi,xi1,,xim)(y_i, x_{i1}, \dots, x_{im}) for i=1,,ni = 1, \dots, n. The multiple linear regression model (with normal errors) is given by:

yi=β0+β1xi1++βmxim+ϵi  with ϵii.i.dN(0,σ2).y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im} + \epsilon_i ~~ \text{with $\epsilon_i \overset{\text{i.i.d}}{\sim} N(0, \sigma^2)$}.

The default prior for linear regression is:

β0,β1,,βm,logσi.i.dunif(C,C)\beta_0, \beta_1, \dots, \beta_m, \log \sigma \overset{\text{i.i.d}}{\sim} \text{unif}(-C, C)

for a very large positive CC. The joint posterior density of β0,,βm,σ\beta_0, \dots, \beta_m, \sigma is then given by

fβ0,β1,,βm,σdata(β0,β1,,βm,σ)σn1exp(S(β0,,βm)2σ2)I{C<β0,β1,,βm,logσ<C}.\begin{align*} & f_{\beta_0, \beta_1, \dots, \beta_m, \sigma \mid \text{data}}(\beta_0, \beta_1, \dots, \beta_m, \sigma) \\ &\propto \sigma^{-n-1} \exp \left(-\frac{S(\beta_0, \dots, \beta_m)}{2 \sigma^2} \right) I \left\{-C < \beta_0, \beta_1, \dots, \beta_m, \log \sigma < C \right\}. \end{align*}

where we use the notation

S(β0,β1,,βm):=i=1n(yiβ0β1xi1βmxim)2S(\beta_0, \beta_1, \dots, \beta_m) := \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_m x_{im})^2

for the sum of squares.

The posterior over only the coefficient parameters β0,,βm\beta_0, \dots, \beta_m can be obtained by integrating (or marginalizing) the parameter σ\sigma.

fβ0,β1,,βmdata(β0,β1,,βm)=fβ0,β1,,βm,σdata(β0,β1,,βm,σ)dσI{C<β0,β1,,βm<C}eCeCσn1exp(S(β0,,βm)2σ2)dσI{C<β0,β1,,βm<C}0σn1exp(S(β0,,βm)2σ2)dσ=I{C<β0,β1,,βm<C}(1S(β0,,βm))n/20sn1exp(12s2)dsI{C<β0,β1,,βm<C}(1S(β0,,βm))n/2(1S(β0,,βm))n/2(S(β^0,β^1,,β^m)S(β0,β1,,βm))n/2\begin{align*} & f_{\beta_0, \beta_1, \dots, \beta_m \mid \text{data}} (\beta_0, \beta_1, \dots, \beta_m) \\ &= \int f_{\beta_0, \beta_1, \dots, \beta_m, \sigma \mid \text{data}}(\beta_0, \beta_1, \dots, \beta_m, \sigma) d\sigma \\ &\propto I\{-C < \beta_0, \beta_1, \dots, \beta_m < C\} \int_{e^{-C}}^{e^C} \sigma^{-n-1} \exp \left(-\frac{S(\beta_0, \dots, \beta_m)}{2 \sigma^2} \right) d\sigma \\ &\approx I\{-C < \beta_0, \beta_1, \dots, \beta_m < C\} \int_{0}^{\infty} \sigma^{-n-1} \exp \left(-\frac{S(\beta_0,\dots, \beta_m)}{2 \sigma^2} \right) d\sigma \\ &= I\{-C < \beta_0, \beta_1, \dots, \beta_m < C\} \left(\frac{1}{S(\beta_0, \dots, \beta_m)}\right)^{n/2} \int_{0}^{\infty} s^{-n-1} \exp \left(-\frac{1}{2s^2} \right) ds \\ &\propto I\{-C < \beta_0, \beta_1, \dots, \beta_m < C\} \left(\frac{1}{S(\beta_0, \dots, \beta_m)}\right)^{n/2} \\ &\approx \left(\frac{1}{S(\beta_0, \dots, \beta_m)}\right)^{n/2} \propto \left(\frac{S(\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_m)}{S(\beta_0, \beta_1, \dots, \beta_m)} \right)^{n/2} \end{align*}

where β^0,,β^m\hat{\beta}_0, \dots, \hat{\beta}_m denote the least squares estimators of β0,,βm\beta_0, \dots, \beta_m (i.e., (β^0,,β^m)(\hat{\beta}_0, \dots, \hat{\beta}_m) minimizes S(β0,,βm)S(\beta_0, \dots, \beta_m) over all values of β0,,βm\beta_0, \dots, \beta_m).

Our posterior density for β0,,βm\beta_0, \dots, \beta_m is thus:

fβ0,β1,,βmdata(β0,β1,,βm)(S(β^0,β^1,,β^m)S(β0,β1,,βm))n/2.f_{\beta_0, \beta_1, \dots, \beta_m \mid \text{data}} (\beta_0, \beta_1, \dots, \beta_m) \propto \left(\frac{S(\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_m)}{S(\beta_0, \beta_1, \dots, \beta_m)} \right)^{n/2}.

It turns out that this is a multivariate tt-density. Here is some basic background on multivariate tt-densities.

tt-density

The formula for the density corresponding to the tt-distribution tp(μ,Σ,ν)t_{p}(\mu, \Sigma, \nu) is (see (Multivariate t-distribution):

f(x):=Γ((ν+p)/2)Γ(ν/2)νp/2πp/2det Σ[11+1ν(xμ)TΣ1(xμ)](ν+p)/2[11+1ν(xμ)TΣ1(xμ)](ν+p)/2.\begin{align} f(x) &:= \frac{\Gamma((\nu + p)/2)}{\Gamma(\nu/2) \nu^{p/2} \pi^{p/2} \sqrt{\text{det}~\Sigma}} \left[\frac{1}{1 + \frac{1}{\nu} (x - \mu)^T \Sigma^{-1} (x - \mu)} \right]^{(\nu + p)/2} \nonumber \\ &\propto \left[\frac{1}{1 + \frac{1}{\nu} (x - \mu)^T \Sigma^{-1} (x - \mu)} \right]^{(\nu + p)/2}. \end{align}

Here:

  1. pp denotes dimension of the vector xx (this is a pp-variate joint density)

  2. μ\mu is a p×1p \times 1 vector called the location

  3. Σ\Sigma is a p×pp \times p matrix called the scale matrix

  4. ν>0\nu > 0 denotes the degrees of freedom.

Here is some more information about the tt-density (5):

  1. Connection to the Multivariate Normal Density: The most important term in the formula (5) is (xμ)TΣ1(xμ)(x - \mu)^T\Sigma^{-1}(x - \mu). This exact term also appears in the multivariate normal density. If XN(μ,Σ)X \sim N(\mu, \Sigma), then the density of XX is given by:

    1(2π)p/2detΣexp(12(xμ)TΣ1(xμ)).\frac{1}{(2\pi)^{p/2}\sqrt{\det \Sigma}} \exp \left(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right).

    This suggests that the tt-density is closely related to the multivariate normal density. Here is the connection. Suppose XNp(μ,Σ)X \sim N_p(\mu, \Sigma) and Vχν2V \sim \chi^2_{\nu} (this is the chi-squared distribution with ν\nu degrees of freedom) are independent. Then

    T:=μ+XμV/νtp(μ,Σ,ν).T := \mu + \frac{X - \mu}{\sqrt{V/\nu}} \sim t_p(\mu, \Sigma, \nu).

    Thus, in the notation tp(μ,Σ,ν)t_{p}(\mu, \Sigma, \nu), ν\nu denotes degrees of freedom, pp denotes dimension, μ\mu and Σ\Sigma denote the mean vector and covariance matrix of the corresponding normal random vector XX. For completeness, we include a proof of (6) in Section ??.

  2. Individual Components as well as Linear Combinations of Components of TT are also tt-distributed: Suppose Ttp(μ,Σ,ν)T \sim t_p(\mu, \Sigma, \nu) and the components of TT are T1,,TpT_1, \dots, T_p. Then each individual component TjT_j is also tt-distributed. Also every linear combination a0+a1T1+a2T2++apTpa_0 + a_1 T_1 + a_2 T_2 + \dots + a_p T_p is also tt-distributed. To see this, first write

    a0+a1T1++apTp=a0+aTTa_0 + a_1 T_1 + \dots + a_p T_p = a_0 + a^T T

    where aa is the p×1p \times 1 vector with components a1,,apa_1, \dots, a_p. Using the formula (6), we can write

    a0+aTT=(a0+aTμ)+(a0+aTX)(a0+aTμ)V/νa_0 + a^T T = (a_0 + a^T \mu) + \frac{(a_0 + a^TX) - (a_0 + a^T\mu)}{\sqrt{V/\nu}}

    Because a0+aTXN(a0+aTμ,aTΣa)a_0 + a^T X \sim N(a_0 + a^T \mu, a^T \Sigma a), the same fact (6) applied to this case gives:

    a0+aTTt1(a0+aTμ,aTΣa,ν).a_0 + a^T T \sim t_1(a_0 + a^T \mu, a^T \Sigma a, \nu).

    In particular, this implies that for each j=1,,pj = 1, \dots, p,

    Tjt1(μj,Σ(j,j),ν)T_j \sim t_1(\mu_j, \Sigma(j, j), \nu)

    where μj\mu_j is the jjth component of μ\mu and Σ(j,j)\Sigma(j, j) is the (j,j)(j, j)th entry of Σ\Sigma.

  3. When ν\nu is large, tt is very close to normal: This can intuitively be seen by noting that when ν\nu is large, the term (xμ)TΣ1(xμ)/ν(x - \mu)^T \Sigma^{-1} (x - \mu)/\nu is small so that

    1+1ν(xμ)TΣ1(xμ)exp(1ν(xμ)TΣ1(xμ)),1 + \frac{1}{\nu} (x - \mu)^T \Sigma^{-1} (x - \mu) \approx \exp \left(\frac{1}{\nu} (x - \mu)^T \Sigma^{-1} (x - \mu) \right),

    where we used the observation that 1+zez1 + z \approx e^z when zz is small. Thus the tt-density (5) for large ν\nu becomes approximately:

    exp(ν+p2ν(xμ)TΣ1(xμ))exp(12(xμ)TΣ1(xμ))\exp \left(-\frac{\nu + p}{2 \nu} (x - \mu)^T \Sigma^{-1} (x - \mu) \right) \approx \exp \left(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

    because ν+pν1\frac{\nu + p}{\nu} \approx 1 when ν\nu is large. This gets us the normal density:

    tp(μ,Σ,ν)Np(μ,Σ)  if ν is large.t_{p}(\mu, \Sigma, \nu) \approx N_p(\mu, \Sigma) ~~\text{if $\nu$ is large}.

It turns out that (4) is a special case of (5) for some p,μ,Σ,νp, \mu, \Sigma, \nu. To see this, we need to first rewrite (4) using matrix notation which we do in the next section.

Matrix Notation for Multiple Linear Regression

y=(y1yn)   X=(1x11x1m1x21x2m1xn1xnm)   β=(β0β1βm)   β^=(β^0β^1β^m)y = \begin{pmatrix} y_1 \\ \cdot \\ \cdot \\ \cdot \\ y_n \end{pmatrix} ~~~ X = \begin{pmatrix} 1 & x_{11} & \dots & x_{1m} \\ 1 & x_{21} & \dots & x_{2m} \\ \cdot & \cdot & \dots & \cdot \\ \cdot & \cdot & \dots & \cdot \\ \cdot & \cdot & \dots & \cdot \\ 1 & x_{n1} & \dots & x_{nm} \end{pmatrix} ~~~ \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \cdot \\ \cdot \\ \cdot \\ \beta_m \end{pmatrix} ~~~ \hat{\beta} = \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \cdot \\ \cdot \\ \cdot \\ \hat{\beta}_m \end{pmatrix}

This notation is used not just to write formulae for linear regression, but also in code. For example, the OLS function in statsmodels uses the syntax sm.OLS(y, X).fit() to fit the linear regression model, where yy (n×1n \times 1 vector) and XX (n×(m+1)n \times (m+1) matrix) are defined above.

With this notation, one can write the sum of squares S(β0,,βm)S(\beta_0, \dots, \beta_m) as:

S(β)=S(β0,,βm)=yXβ2.S(\beta) = S(\beta_0, \dots, \beta_m) = \|y - X \beta\|^2.

There are two important facts about S(β)S(\beta):

  1. Fact 1: the least squares estimator β^\hat{\beta} is given by the formula:

    β^=(XTX)1XTy.\hat{\beta} = (X^T X)^{-1} X^T y.

    The proof of (7) is as follows. The gradient of S(β)S(\beta) is given by

    S(β)=[yXβ2]=[(yXβ)T(yXβ)]=[yTyβTXTyyTXβ+βTXTXβ]=2XTy2XTXβ.\begin{align*} \nabla S(\beta) &= \nabla \left[ \|y - X \beta\|^2 \right] \\ &= \nabla \left[ (y - X\beta)^T (y - X \beta) \right] \\ &= \nabla \left[ y^T y - \beta^T X^T y - y^T X \beta + \beta^T X^T X \beta\right] = 2 X^T y - 2 X^T X \beta. \end{align*}

    Because β^\hat{\beta} minimizes S(β)S(\beta), the gradient should equal zero when β=β^\beta = \hat{\beta}, and this leads to

    XT(yXβ^)=0    XTXβ^=XTy    β^=(XTX)1XTy.\begin{align} X^T (y - X \hat{\beta}) = 0 \implies X^T X \hat{\beta} = X^T y \implies \hat{\beta} = (X^T X)^{-1} X^T y. \end{align}
  2. Fact 2: The following Pythagorean identity holds:

    S(β)=S(β^)+XβXβ^2=S(β^)+(ββ^)TXTX(ββ^).\begin{align} S(\beta) = S(\hat{\beta}) + \|X \beta - X \hat{\beta}\|^2 = S(\hat{\beta}) + (\beta - \hat{\beta})^T X^T X (\beta - \hat{\beta}). \end{align}

    To prove (10), write

    S(β)=yXβ2=yXβ^+Xβ^Xβ2=yXβ^2+Xβ^Xβ2+2<yXβ^,Xβ^Xβ>.\begin{align*} S(\beta) &= \|y - X \beta\|^2 \\ &= \|y - X\hat{\beta} + X \hat{\beta} - X \beta\|^2 \\ &= \|y - X \hat{\beta}\|^2 + \|X \hat{\beta} - X \beta\|^2 + 2 \left<y - X \hat{\beta}, X \hat{\beta} - X \beta \right>. \end{align*}

    The cross product is zero (leading to (10)) because:

    <yXβ^,Xβ^Xβ>=(Xβ^Xβ)T(yXβ^)=(β^β)TXT(yXβ^)=(β^β)T(XTyXTXβ^)=0\begin{align*} \left<y - X \hat{\beta}, X \hat{\beta} - X \beta \right> &= \left(X \hat{\beta} - X \beta \right)^T \left(y - X \hat{\beta} \right) \\ &= \left( \hat{\beta} - \beta \right)^T X^T (y - X \hat{\beta}) = \left( \hat{\beta} - \beta \right)^T \left(X^T y - X^T X \hat{\beta} \right) = 0 \end{align*}

    where we used (9).

Using (10), we can write the posterior density (4) as

fβdata(β)(S(β^)S(β))n/2=(S(β^)S(β^)+(ββ^)TXTX(ββ^))n/2=(11+(ββ^)TXTXS(β^)(ββ^))n/2.\begin{align} f_{\beta \mid \text{data}}(\beta) &\propto \left( \frac{S(\hat{\beta})}{S(\beta)} \right)^{n/2} \nonumber \\ &= \left(\frac{S(\hat{\beta})}{S(\hat{\beta}) + \left(\beta - \hat{\beta} \right)^T X^T X \left(\beta - \hat{\beta} \right)} \right)^{n/2} \nonumber \\ &= \left(\frac{1}{1 + \left(\beta - \hat{\beta} \right)^T \frac{X^T X}{S(\hat{\beta})} \left(\beta - \hat{\beta} \right)} \right)^{n/2}. \end{align}

The above formula is a special case of (5) with

x=β,    p=m+1,    μ=β^,    ν+p=n,    Σ1ν=XTXS(β^)x = \beta, ~~~~ p = m+1, ~~~~ \mu = \hat{\beta}, ~~~~ \nu + p = n, ~~~~ \frac{\Sigma^{-1}}{\nu} = \frac{X^T X}{S(\hat{\beta})}

or equivalently

x=β,    p=m+1,    μ=β^,    ν=nm1,    Σ=S(β^)nm1(XTX)1.x = \beta, ~~~~ p = m+1, ~~~~ \mu = \hat{\beta}, ~~~~ \nu = n - m - 1, ~~~~ \Sigma = \frac{S(\hat{\beta})}{n - m - 1} \left(X^T X \right)^{-1}.

We thus have

β0,,βmdatatm+1(β^,S(β^)nm1(XTX)1,nm1).\beta_0, \dots, \beta_m \mid \text{data} \sim t_{m+1}\left(\hat{\beta},\frac{S(\hat{\beta})}{n - m - 1} \left(X^T X \right)^{-1}, n - m - 1 \right).

With the posterior density (14), one can do uncertainty quantification about the parameters β0,β1,,βm\beta_0, \beta_1, \dots, \beta_m. One can generate multiple samples from tm+1(β^,(S(β^)/(nm1))(XTX)1,nm1)t_{m+1}(\hat{\beta}, (S(\hat{\beta})/(n-m-1)) (X^T X)^{-1}, n-m-1) and plot the resulting fitted values to visualize the uncertainty in the coefficients.

In (14), the quantity S(β^)/(nm1)S(\hat{\beta})/(n-m-1) is the frequentist unbiased estimator for σ2\sigma^2, so we denote it by σ^2\hat{\sigma}^2:

σ^:=S(β^)nm1.\hat{\sigma} := \sqrt{\frac{S(\hat{\beta})}{n-m-1}}.

σ^\hat{\sigma} can also be justified as a Bayesian estimator of σ\sigma (this will be a question in Homework three). The terminology Residual Standard Error is sometimes used for σ^\hat{\sigma}.

With the notation for σ^\hat{\sigma}, the posterior (14) becomes:

β0,,βmdatatm+1(β^,σ^2(XTX)1,nm1).\beta_0, \dots, \beta_m \mid \text{data} \sim t_{m+1}\left(\hat{\beta},\hat{\sigma}^2 \left(X^T X \right)^{-1}, n - m - 1 \right).

By one of the facts mentioned about the tt-distribution, the posterior of each individual βj\beta_j is also tt:

βjdatat1(β^j,σ^2(XTX)j+1,j+1,nm1)\beta_j \mid \text{data} \sim t_{1}\left(\hat{\beta}_j, \hat{\sigma}^2 (X^T X)^{j+1, j+1}, n-m-1 \right)

where (XTX)j+1,j+1(X^T X)^{j+1, j+1} is the (j+1)th(j+1)^{th} diagonal entry of (XTX)1(X^T X)^{-1} (note that we are using the (j+1)(j+1)th diagonal entry of XTXX^TX because βj\beta_j is the (j+1)(j+1)th component of β\beta). Writing this density out, we have

fβjdata(βj)(11+1nm1(βjβ^j)2σ^2(XTX)j+1,j+1)n/2f_{\beta_j \mid \text{data}}(\beta_j) \propto \left(\frac{1}{1 + \frac{1}{n-m-1} \frac{(\beta_j - \hat{\beta}_j)^2}{\hat{\sigma}^2 (X^TX)^{j+1, j+1}}} \right)^{n/2}

which implies that

βjβ^jσ^(XTX)j+1,j+1univariate standard t with nm1 d.f.\frac{\beta_j - \hat{\beta}_j}{\hat{\sigma} \sqrt{(X^T X)^{j+1, j+1}}} \sim \text{univariate standard } t \text{ with } n-m-1 \text{ d.f.}

This can be used to obtain uncertainty intervals for βj\beta_j. If tnm1,α/2t_{n-m-1, \alpha/2} is the point beyond which the tt-distribution (with nm1n-m-1 degrees of freedom) assigns probability α/2\alpha/2, then

P{tnm1,α/2βjβ^jσ^(XTX)j+1,j+1tnm1,α/2data}=1α\P \left\{-t_{n-m-1, \alpha/2} \leq \frac{\beta_j - \hat{\beta}_j}{\hat{\sigma} \sqrt{(X^T X)^{j+1, j+1}}} \leq t_{n-m-1, \alpha/2} \mid \text{data} \right\} = 1 - \alpha

which is same as:

P{β^jσ^(XTX)j+1,j+1tnm1,α/2βjβ^jσ^(XTX)j+1,j+1tnm1,α/2data}=1α\P \left\{\hat{\beta}_j - \hat{\sigma} \sqrt{(X^T X)^{j+1, j+1}} t_{n-m-1, \alpha/2} \leq \beta_j \leq \hat{\beta}_j - \hat{\sigma} \sqrt{(X^T X)^{j+1, j+1}} t_{n-m-1, \alpha/2} \mid \text{data} \right\} = 1 - \alpha

This interval:

[β^jσ^(XTX)j+1,j+1tnm1,α/2,β^jσ^(XTX)j+1,j+1tnm1,α/2]\left[\hat{\beta}_j - \hat{\sigma} \sqrt{(X^T X)^{j+1, j+1}} t_{n-m-1, \alpha/2}, \hat{\beta}_j - \hat{\sigma} \sqrt{(X^T X)^{j+1, j+1}} t_{n-m-1, \alpha/2} \right]

is called the 100(1α)%100(1-\alpha)\% Bayesian Credible interval for βj\beta_j. It exactly coincides with the frequentist 100(1α)%100(1 - \alpha)\% confidence interval for βj\beta_j.

The degrees of freedom corresponding to the tt-density in (14) is nm1n - m - 1 where nn is the number of observations, and mm is the number of covariates. Thus if nm1n-m-1 is large, then the posterior distribution (which is actually tt) is approximately normal:

tm+1(β^,S(β^)nm1(XTX)1,nm1)Nm+1(β^,S(β^)nm1(XTX)1).t_{m+1}\left(\hat{\beta},\frac{S(\hat{\beta})}{n - m - 1} \left(X^T X \right)^{-1}, n - m - 1 \right) \approx N_{m+1}(\hat{\beta}, \frac{S(\hat{\beta})}{n - m - 1} \left(X^T X \right)^{-1}).

In other words, when nm1n-m-1 is large, the tt-density (14) is approximately equal to the Nm+1(β^,σ^2(XTX)1)N_{m+1}(\hat{\beta}, \hat{\sigma}^2 (X^T X)^{-1}). Further, when nm1n-m-1 is large, the distribution (16) will be close to the normal distribution N(β^j,σ^2(XTX)j+1,j+1)N(\hat{\beta}_j, \hat{\sigma}^2 (X^T X)^{j+1, j+1}). The quantity σ^(XTX)j+1,j+1\hat{\sigma} \sqrt{(X^T X)^{j+1, j+1}} is known as the standard error corresponding to βj\beta_j.

Proof of (6)