In the last lecture, we discussed Bayesian inference for multiple linear regression. The setting is as follows: we have one response variable y and m covariates x1,…,xm (m=1 corresponds to simple linear regression). We observe data on n instances or subjects for all these variables: (yi,xi1,…,xim) for i=1,…,n. The multiple linear regression model (with normal errors) is given by:
To perform inference, the key is to compute the distribution of β^. The dependence of the above on the X matrix is quite complicated. On the other hand, the dependence on y is linear. If we assume that X is fixed (non-random), then it is easy to write the distribution of β^. Indeed because y∼N(Xβ,σ2In), it is easy to check that
Because σ is unknown, it is natural to replace it with a suitable estimate σ. The standard estimate used is the residual standard error:
σ^:=n−m−1S(β^).
where S(β)=∥y−Xβ∥2 is the sum of squares and S(β^) is the smallest possible value of the sum of squares. Replacing σ by σ^ in (6) will change the distribution. In fact, it can be shown that N(0,1) will be replaced by the standard t distribution with n−m−1 degrees of freedom:
This can be used to derive confidence intervals for βj. If tn−m−1,α/2 is the point beyond which the t-distribution (with n−m−1 degrees of freedom) assigns probability α/2, then
To obtain this likelihood, we can assume that X is fixed (and that ϵ∼N(0,σ2Id)). Or we can assume that ϵ is independent of X and that the distribution of X does not depend on the parameters β,σ. There could be other assumptions on ϵ,X which can lead to the same (or very similar) likelihood (see Section ??).
Under this prior and likelihood, we calculated (see last lecture notes) that the posterior density of β is given by:
Note that, in (14), the quantity S(β^)/(n−m−1) is the frequentist unbiased estimator for σ2, which is denoted by σ^2. σ^ can also be justified as a Bayesian estimator of σ (this will be a question in Homework three).
With the notation for σ^, the posterior (14) becomes:
where (XTX)j+1,j+1 is the (j+1)th diagonal entry of (XTX)−1 (note that we are using the (j+1)th diagonal entry of XTX because βj is the (j+1)th component of β). This implies that
σ^(XTX)j+1,j+1βj−β^j∣data∼univariate standard t with n−m−1 d.f.
This can be used to obtain uncertainty intervals for βj. If tn−m−1,α/2 is the point beyond which the t-distribution (with n−m−1 degrees of freedom) assigns probability α/2, then
The degrees of freedom corresponding to the t-density in (14) is n−m−1 where n is the number of observations, and m is the number of covariates. Thus if n−m−1 is large, then the posterior distribution (which is actually t) is approximately normal:
In other words, when n−m−1 is large, the t-density (14) is approximately equal to the Nm+1(β^,σ^2(XTX)−1). Further, when n−m−1 is large, the distribution (16) will be close to the normal distribution N(β^j,σ^2(XTX)j+1,j+1). In such cases, tn−m−1,α/2 can be replaced by zα/2 in the intervals.
We mentioned above that the Bayesian uncertainty interval for βj exactly coincides with the standard frequentist confidence interval. But the Bayesian reasoning that led to the interval differs fundamentally from the frequentist reasoning. The Bayesian reasoning is based on the fact (13) while frequentist reasoning is based on (7).
The only difference between (7) and (13) is in the conditioning. In (13), the random variable is βj and we are computing its distribution conditional on the observed data. In contrast, in the frequentist statement (7), the random variable is β^j (which is a function of y1,…,yn) and we are computing its distribution the parameters β,σ are fixed. Note that X is assumed to be fixed here.
The fact that both Bayesian and frequentist reasoning lead to the same interval is a matter of coincidence. By changing the setting slightly, this coincidence can be broken. One such example involves AR models and is discussed next.
In the above equation, y1 does not appear on the right hand side, so we can treat y1 as a constant. The equation (15) is simply a linear regression model with covariate given by xt=yt−1. Frequentist and Bayesian inference now give different answers (although in practice the answers will be quite similar) and with very different justifications. This is explained below.
Unlike in the previous case, we cannot compute the distribution of β^ assuming that X is fixed. This is because X involves depends on y1,…,yn−1. Instead, we need to use other arguments. For example, note that
where yˉ=yˉ2:n:=(y2+⋯+yn)/(n−1) and xˉ=yˉ1:(n−1):=(y1+⋯+yn−1)/(n−1). The distribution of β^1 is actually quite complicated. Asymptotic arguments can be used to derive a limiting normal distribution. For the full argument, see, for example, ShumwayStoffer which uses a dependent Central Limit Theorem.
For fy1∣β,σ(y1), the simplest thing is to assume that it does not depend on the parameters β,σ so that this term can be ignored. This assumption leads to
This is exactly the same likelihood as in usual linear regression (see (9)) except that n in (9) is now replaced by n−1 which is the number of observations in the regression model (15). Because the likelihood is of the same form as in (9), the posterior of β follows the same calculation as in usual linear regression and we obtain:
This is the same as (14) with n−1 replacing n and m=2.
Therefore, Bayesian inference gives the same posterior t-distribution in AutoRegression as in usual linear regression. Unlike frequentist AutoRegression, there is no need for asymptotics assuming n→∞. This example is useful for highlighting the differences between Bayesian and frequentist inference.