Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

STAT 238 - Bayesian Statistics Lecture Sixteen

Spring 2026, UC Berkeley

Linear Regression

In the last lecture, we discussed Bayesian inference for multiple linear regression. The setting is as follows: we have one response variable yy and mm covariates x1,,xmx_1, \dots, x_m (m=1m = 1 corresponds to simple linear regression). We observe data on nn instances or subjects for all these variables: (yi,xi1,,xim)(y_i, x_{i1}, \dots, x_{im}) for i=1,,ni = 1, \dots, n. The multiple linear regression model (with normal errors) is given by:

yi=β0+β1xi1++βmxim+ϵi  with ϵii.i.dN(0,σ2).y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im} + \epsilon_i ~~ \text{with $\epsilon_i \overset{\text{i.i.d}}{\sim} N(0, \sigma^2)$}.

Using the matrix vector notation:

y=(y1yn)   X=(1x11x1m1x21x2m1xn1xnm)   β=(β0β1βm)   β^=(β^0β^1β^m)y = \begin{pmatrix} y_1 \\ \cdot \\ \cdot \\ \cdot \\ y_n \end{pmatrix} ~~~ X = \begin{pmatrix} 1 & x_{11} & \dots & x_{1m} \\ 1 & x_{21} & \dots & x_{2m} \\ \cdot & \cdot & \dots & \cdot \\ \cdot & \cdot & \dots & \cdot \\ \cdot & \cdot & \dots & \cdot \\ 1 & x_{n1} & \dots & x_{nm} \end{pmatrix} ~~~ \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \cdot \\ \cdot \\ \cdot \\ \beta_m \end{pmatrix} ~~~ \hat{\beta} = \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \cdot \\ \cdot \\ \cdot \\ \hat{\beta}_m \end{pmatrix}

the model can be written as:

y=Xβ+ϵ   where ϵN(0,σ2In).\begin{align*} y = X \beta + \epsilon ~~ \text{ where } \epsilon \sim N(0, \sigma^2 I_n). \end{align*}

Below we review standard frequentist and Bayesian approaches to linear regression. Both lead to identical answers.

Frequentist Approach

Here the focus is on the least squares estimate which is given by:

β^=(XTX)1XTy.\begin{align*} \hat{\beta} = (X^T X)^{-1} X^T y. \end{align*}

To perform inference, the key is to compute the distribution of β^\hat{\beta}. The dependence of the above on the XX matrix is quite complicated. On the other hand, the dependence on yy is linear. If we assume that XX is fixed (non-random), then it is easy to write the distribution of β^\hat{\beta}. Indeed because yN(Xβ,σ2In)y \sim N(X \beta, \sigma^2 I_n), it is easy to check that

β^N(β,σ2(XTX)1).\begin{align*} \hat{\beta} \sim N \left( \beta, \sigma^2 (X^T X)^{-1} \right). \end{align*}

This implies that

β^jN(βj,σ2(XTX)j+1,j+1)   for each j=0,1,,m,\begin{align*} \hat{\beta}_j \sim N(\beta_j, \sigma^2 (X^T X)^{j+1, j+1}) ~~ \text{ for each } j = 0, 1, \dots, m, \end{align*}

where (XTX)j+1,j+1(X^T X)^{j+1, j+1} is the (j+1,j+1)(j+1, j+1)-th diagonal entry of (XTX)1(X^T X)^{-1}. The above is equivalent to:

β^jβjσ(XTX)j+1,j+1N(0,1).\begin{align} \frac{\hat{\beta}_j - \beta_j}{\sigma \sqrt{(X^T X)^{j+1, j+1}}} \sim N(0, 1). \end{align}

Because σ\sigma is unknown, it is natural to replace it with a suitable estimate σ\sigma. The standard estimate used is the residual standard error:

σ^:=S(β^)nm1.\hat{\sigma} := \sqrt{\frac{S(\hat{\beta})}{n-m-1}}.

where S(β)=yXβ2S(\beta) = \|y - X \beta\|^2 is the sum of squares and S(β^)S(\hat{\beta}) is the smallest possible value of the sum of squares. Replacing σ\sigma by σ^\hat{\sigma} in (6) will change the distribution. In fact, it can be shown that N(0,1)N(0, 1) will be replaced by the standard tt distribution with nm1n-m-1 degrees of freedom:

β^jβjσ^(XTX)j+1,j+1tnm1.\begin{align} \frac{\hat{\beta}_j - \beta_j}{\hat{\sigma} \sqrt{(X^T X)^{j+1, j+1}}} \sim t_{n-m-1}. \end{align}

This can be used to derive confidence intervals for βj\beta_j. If tnm1,α/2t_{n-m-1, \alpha/2} is the point beyond which the tt-distribution (with nm1n-m-1 degrees of freedom) assigns probability α/2\alpha/2, then

P{tnm1,α/2βjβ^jσ^(XTX)j+1,j+1tnm1,α/2}=1α\P \left\{-t_{n-m-1, \alpha/2} \leq \frac{\beta_j - \hat{\beta}_j}{\hat{\sigma} \sqrt{(X^T X)^{j+1, j+1}}} \leq t_{n-m-1, \alpha/2} \right\} = 1 - \alpha

which is same as:

P{β^jσ^(XTX)j+1,j+1tnm1,α/2βjβ^jσ^(XTX)j+1,j+1tnm1,α/2}=1α\P \left\{\hat{\beta}_j - \hat{\sigma} \sqrt{(X^T X)^{j+1, j+1}} t_{n-m-1, \alpha/2} \leq \beta_j \leq \hat{\beta}_j - \hat{\sigma} \sqrt{(X^T X)^{j+1, j+1}} t_{n-m-1, \alpha/2} \right\} = 1 - \alpha

This interval:

[β^jσ^(XTX)j+1,j+1tnm1,α/2,β^jσ^(XTX)j+1,j+1tnm1,α/2]\left[\hat{\beta}_j - \hat{\sigma} \sqrt{(X^T X)^{j+1, j+1}} t_{n-m-1, \alpha/2}, \hat{\beta}_j - \hat{\sigma} \sqrt{(X^T X)^{j+1, j+1}} t_{n-m-1, \alpha/2} \right]

is the 100(1α)%100(1-\alpha)\% confidence interval for βj\beta_j.

The quantity σ^(XTX)j+1,j+1\hat{\sigma} \sqrt{(X^T X)^{j+1, j+1}} is known as the standard error corresponding to βj\beta_j.

Bayesian Approach

The default prior for linear regression is:

β0,β1,,βm,logσi.i.dunif(,).\beta_0, \beta_1, \dots, \beta_m, \log \sigma \overset{\text{i.i.d}}{\sim} \text{unif}(-\infty, \infty).

The likelihood is given by:

σnexp(12σ2yXβ2).\begin{align} \propto \sigma^{-n} \exp \left(-\frac{1}{2 \sigma^2} \|y - X \beta\|^2 \right). \end{align}

To obtain this likelihood, we can assume that XX is fixed (and that ϵN(0,σ2Id)\epsilon \sim N(0, \sigma^2 I_d)). Or we can assume that ϵ\epsilon is independent of XX and that the distribution of XX does not depend on the parameters β,σ\beta, \sigma. There could be other assumptions on ϵ,X\epsilon, X which can lead to the same (or very similar) likelihood (see Section ??).

Under this prior and likelihood, we calculated (see last lecture notes) that the posterior density of β\beta is given by:

β0,,βmdatatm+1(β^,S(β^)nm1(XTX)1,nm1).\beta_0, \dots, \beta_m \mid \text{data} \sim t_{m+1}\left(\hat{\beta},\frac{S(\hat{\beta})}{n - m - 1} \left(X^T X \right)^{-1}, n - m - 1 \right).

Note that, in (14), the quantity S(β^)/(nm1)S(\hat{\beta})/(n-m-1) is the frequentist unbiased estimator for σ2\sigma^2, which is denoted by σ^2\hat{\sigma}^2. σ^\hat{\sigma} can also be justified as a Bayesian estimator of σ\sigma (this will be a question in Homework three).

With the notation for σ^\hat{\sigma}, the posterior (14) becomes:

β0,,βmdatatm+1(β^,σ^2(XTX)1,nm1).\beta_0, \dots, \beta_m \mid \text{data} \sim t_{m+1}\left(\hat{\beta},\hat{\sigma}^2 \left(X^T X \right)^{-1}, n - m - 1 \right).

By one of the facts mentioned about the multivariate tt-distribution, the posterior of each individual βj\beta_j is also tt:

βjdatat1(β^j,σ^2(XTX)j+1,j+1,nm1)\beta_j \mid \text{data} \sim t_{1}\left(\hat{\beta}_j, \hat{\sigma}^2 (X^T X)^{j+1, j+1}, n-m-1 \right)

where (XTX)j+1,j+1(X^T X)^{j+1, j+1} is the (j+1)th(j+1)^{th} diagonal entry of (XTX)1(X^T X)^{-1} (note that we are using the (j+1)(j+1)th diagonal entry of XTXX^TX because βj\beta_j is the (j+1)(j+1)th component of β\beta). This implies that

βjβ^jσ^(XTX)j+1,j+1dataunivariate standard t with nm1 d.f.\frac{\beta_j - \hat{\beta}_j}{\hat{\sigma} \sqrt{(X^T X)^{j+1, j+1}}} \mid \text{data} \sim \text{univariate standard } t \text{ with } n-m-1 \text{ d.f.}

This can be used to obtain uncertainty intervals for βj\beta_j. If tnm1,α/2t_{n-m-1, \alpha/2} is the point beyond which the tt-distribution (with nm1n-m-1 degrees of freedom) assigns probability α/2\alpha/2, then

P{tnm1,α/2βjβ^jσ^(XTX)j+1,j+1tnm1,α/2data}=1α\P \left\{-t_{n-m-1, \alpha/2} \leq \frac{\beta_j - \hat{\beta}_j}{\hat{\sigma} \sqrt{(X^T X)^{j+1, j+1}}} \leq t_{n-m-1, \alpha/2} \mid \text{data} \right\} = 1 - \alpha

which is same as:

P{β^jσ^(XTX)j+1,j+1tnm1,α/2βjβ^jσ^(XTX)j+1,j+1tnm1,α/2data}=1α\P \left\{\hat{\beta}_j - \hat{\sigma} \sqrt{(X^T X)^{j+1, j+1}} t_{n-m-1, \alpha/2} \leq \beta_j \leq \hat{\beta}_j - \hat{\sigma} \sqrt{(X^T X)^{j+1, j+1}} t_{n-m-1, \alpha/2} \mid \text{data} \right\} = 1 - \alpha

This interval:

[β^jσ^(XTX)j+1,j+1tnm1,α/2,β^jσ^(XTX)j+1,j+1tnm1,α/2]\left[\hat{\beta}_j - \hat{\sigma} \sqrt{(X^T X)^{j+1, j+1}} t_{n-m-1, \alpha/2}, \hat{\beta}_j - \hat{\sigma} \sqrt{(X^T X)^{j+1, j+1}} t_{n-m-1, \alpha/2} \right]

is the 100(1α)%100(1-\alpha)\% Bayesian Credible interval for βj\beta_j.

From (8) and (14), it is clear that the Bayesian and frequentist 100(1α)%100(1 - \alpha)\% intervals for βj\beta_j coincide.

When nn is large

The degrees of freedom corresponding to the tt-density in (14) is nm1n - m - 1 where nn is the number of observations, and mm is the number of covariates. Thus if nm1n-m-1 is large, then the posterior distribution (which is actually tt) is approximately normal:

tm+1(β^,S(β^)nm1(XTX)1,nm1)Nm+1(β^,S(β^)nm1(XTX)1).t_{m+1}\left(\hat{\beta},\frac{S(\hat{\beta})}{n - m - 1} \left(X^T X \right)^{-1}, n - m - 1 \right) \approx N_{m+1}(\hat{\beta}, \frac{S(\hat{\beta})}{n - m - 1} \left(X^T X \right)^{-1}).

In other words, when nm1n-m-1 is large, the tt-density (14) is approximately equal to the Nm+1(β^,σ^2(XTX)1)N_{m+1}(\hat{\beta}, \hat{\sigma}^2 (X^T X)^{-1}). Further, when nm1n-m-1 is large, the distribution (16) will be close to the normal distribution N(β^j,σ^2(XTX)j+1,j+1)N(\hat{\beta}_j, \hat{\sigma}^2 (X^T X)^{j+1, j+1}). In such cases, tnm1,α/2t_{n-m-1, \alpha/2} can be replaced by zα/2z_{\alpha/2} in the intervals.

Bayesian vs Frequentist

We mentioned above that the Bayesian uncertainty interval for βj\beta_j exactly coincides with the standard frequentist confidence interval. But the Bayesian reasoning that led to the interval differs fundamentally from the frequentist reasoning. The Bayesian reasoning is based on the fact (13) while frequentist reasoning is based on (7).

The only difference between (7) and (13) is in the conditioning. In (13), the random variable is βj\beta_j and we are computing its distribution conditional on the observed data. In contrast, in the frequentist statement (7), the random variable is β^j\hat{\beta}_j (which is a function of y1,,yny_1, \dots, y_n) and we are computing its distribution the parameters β,σ\beta, \sigma are fixed. Note that XX is assumed to be fixed here.

The fact that both Bayesian and frequentist reasoning lead to the same interval is a matter of coincidence. By changing the setting slightly, this coincidence can be broken. One such example involves AR models and is discussed next.

AutoRegression

We observe a time series {yt,t=1,,n}\{y_t, t = 1, \dots, n\}. The AutoRegressive Model of order 1 (in short, AR(1)) is given by:

yt=β0+β1yt1+ϵt   where t=2,,n.\begin{align} y_t = \beta_0 + \beta_1 y_{t-1} + \epsilon_t ~~\text{ where $t = 2, \dots, n$}. \end{align}

In the above equation, y1y_1 does not appear on the right hand side, so we can treat y1y_1 as a constant. The equation (15) is simply a linear regression model with covariate given by xt=yt1x_t = y_{t-1}. Frequentist and Bayesian inference now give different answers (although in practice the answers will be quite similar) and with very different justifications. This is explained below.

Frequentist Inference

We start with the least squares estimate of β^\hat{\beta}:

β^=(XTX)1XTy\begin{align*} \hat{\beta} = (X^T X)^{-1} X^T y \end{align*}

where

y=(y2yn)   X=(1y11y21yn1)   β=(β0β1)   β^=(β^0β^1)y = \begin{pmatrix} y_2 \\ \cdot \\ \cdot \\ \cdot \\ y_n \end{pmatrix} ~~~ X = \begin{pmatrix} 1 & y_1\\ 1 & y_2 \\ \cdot & \cdot \\ \cdot & \cdot \\ \cdot & \cdot \\ 1 & y_{n-1} \end{pmatrix} ~~~ \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} ~~~ \hat{\beta} = \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix}

Unlike in the previous case, we cannot compute the distribution of β^\hat{\beta} assuming that XX is fixed. This is because XX involves depends on y1,,yn1y_1, \dots, y_{n-1}. Instead, we need to use other arguments. For example, note that

β^1=t=2n(ytyˉ)(xtxˉ)t=2n(xtxˉ)2=t=2n(ytyˉ2:n)(yt1yˉ1:(n1))t=2n(yt1yˉ1:(n1))2\begin{align*} \hat{\beta}_1 = \frac{\sum_{t=2}^n (y_t - \bar{y})(x_t - \bar{x})}{\sum_{t=2}^n (x_t - \bar{x})^2} = \frac{\sum_{t=2}^n (y_t - \bar{y}_{2:n})(y_{t-1} - \bar{y}_{1:(n-1)})}{\sum_{t=2}^n (y_{t-1} - \bar{y}_{1:(n-1)})^2} \end{align*}

where yˉ=yˉ2:n:=(y2++yn)/(n1)\bar{y} = \bar{y}_{2:n} := (y_2 + \dots + y_n)/(n-1) and xˉ=yˉ1:(n1):=(y1++yn1)/(n1)\bar{x} = \bar{y}_{1:(n-1)} := (y_1 + \dots + y_{n-1})/(n-1). The distribution of β^1\hat{\beta}_1 is actually quite complicated. Asymptotic arguments can be used to derive a limiting normal distribution. For the full argument, see, for example, ShumwayStoffer which uses a dependent Central Limit Theorem.

Bayesian Inference

For Bayesian inference, we need to write the prior and likelihood. The prior will obviously be the same as in usual linear regression:

β0,β1,logσi.i.dunif(,).\begin{align*} \beta_0, \beta_1, \log \sigma \overset{\text{i.i.d}}{\sim} \text{unif}(-\infty, \infty). \end{align*}

For the likelihood, we write:

Likelihood=fy1,,ynβ,σ(y1,,yn)=fy1β,σ(y1)t=2nfyty1,,yt1,β,σ(yt)=fy1β,σ(y1)t=2nfβ0+β1yt1+ϵty1,,yt1,β,σ(yt)=fy1β,σ(y1)t=2nfϵty1,,yt1,β,σ(ytβ0β1yt1).\begin{align*} \text{Likelihood} &= f_{y_1, \dots, y_n \mid \beta, \sigma}(y_1, \dots, y_n) \\ &= f_{y_1 \mid \beta, \sigma}(y_1) \prod_{t=2}^{n} f_{y_t \mid y_1, \dots, y_{t-1}, \beta, \sigma}(y_t) \\ &= f_{y_1 \mid \beta, \sigma}(y_1) \prod_{t=2}^n f_{\beta_0 + \beta_1 y_{t-1} + \epsilon_t \mid y_1, \dots, y_{t-1}, \beta, \sigma} (y_t) \\ &= f_{y_1 \mid \beta, \sigma}(y_1) \prod_{t=2}^n f_{\epsilon_t \mid y_1, \dots, y_{t-1}, \beta, \sigma} (y_t - \beta_0 - \beta_1 y_{t-1}). \end{align*}

We can now assume that ϵt\epsilon_t is independent of y1,,yt1y_1, \dots, y_{t-1} for each t=2,,nt = 2, \dots, n. This gives

Likelihood=fy1β,σ(y1)t=2nfϵt(ytβ0β1yt1)=fy1β,σ(y1)t=2n1σ2πexp((ytβ0β1yt1)22σ2).\begin{align*} \text{Likelihood} &= f_{y_1 \mid \beta, \sigma}(y_1) \prod_{t=2}^n f_{\epsilon_t}(y_t - \beta_0 - \beta_1 y_{t-1}) \\ &= f_{y_1 \mid \beta, \sigma}(y_1) \prod_{t=2}^n \frac{1}{\sigma\sqrt{2 \pi}} \exp \left(-\frac{(y_t - \beta_0 - \beta_1 y_{t-1})^2}{2 \sigma^2} \right). \end{align*}

For fy1β,σ(y1)f_{y_1 \mid \beta, \sigma}(y_1), the simplest thing is to assume that it does not depend on the parameters β,σ\beta, \sigma so that this term can be ignored. This assumption leads to

Likelihood=fy1β,σ(y1)t=2n1σ2πexp((ytβ0β1yt1)22σ2)σ(n1)exp(12σ2yXβ2).\begin{align} \text{Likelihood} &= f_{y_1 \mid \beta, \sigma}(y_1) \prod_{t=2}^n \frac{1}{\sigma\sqrt{2 \pi}} \exp \left(-\frac{(y_t - \beta_0 - \beta_1 y_{t-1})^2}{2 \sigma^2} \right) \nonumber \\ &\propto \sigma^{-(n-1)} \exp \left(-\frac{1}{2 \sigma^2} \|y - X \beta\|^2 \right). \end{align}

This is exactly the same likelihood as in usual linear regression (see (9)) except that nn in (9) is now replaced by n1n-1 which is the number of observations in the regression model (15). Because the likelihood is of the same form as in (9), the posterior of β\beta follows the same calculation as in usual linear regression and we obtain:

β0,βdatat2(β^,S(β^)n3(XTX)1,n3)\begin{align*} \beta_0, \beta \mid \text{data} \sim t_{2} \left(\hat{\beta}, \frac{S(\hat{\beta})}{n-3} (X^T X)^{-1}, n-3 \right) \end{align*}

This is the same as (14) with n1n-1 replacing nn and m=2m = 2.

Therefore, Bayesian inference gives the same posterior tt-distribution in AutoRegression as in usual linear regression. Unlike frequentist AutoRegression, there is no need for asymptotics assuming nn \rightarrow \infty. This example is useful for highlighting the differences between Bayesian and frequentist inference.