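Our next topic is (multiple) linear regression. Here, we have one response variable $y$ and $m$ covariates $x_1, \dots, x_m$ ($m = 1$ corresponds to simple linear regression). We observe data on $n$ instances or subjects for all these variables: $(y_i, x_{i1}, \dots, x_{im})$ for $i = 1, \dots, n$. The multiple linear regression model (with normal errors) is given by:
$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im} + \epsilon_i, \qquad \epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2), \qquad i = 1, \dots, n.$$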
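In the formula (5) for the multivariate t-density, written $t_p(\mu, \Sigma, \nu)$:
$p$ denotes the dimension of the vector $x$ (this is a $p$-variate joint density)
$\mu$ is a $p \times 1$ vector called the location
$\Sigma$ is a $p \times p$ matrix called the scale matrix
$\nu > 0$ denotes the degrees of freedom.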
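As a concrete reference point, here is a minimal Python sketch of a log-density for $t_p(\mu, \Sigma, \nu)$. It assumes the standard parameterization of the multivariate t-density, in which the density is proportional to $\left(1 + (x-\mu)^T\Sigma^{-1}(x-\mu)/\nu\right)^{-(\nu+p)/2}$. The function name `multivariate_t_logpdf` is illustrative and not taken from these notes.

```python
import math
import numpy as np

def multivariate_t_logpdf(x, mu, Sigma, nu):
    """Log-density of t_p(mu, Sigma, nu) at x (standard parameterization, assumed here)."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    p = mu.shape[0]
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), without forming the inverse explicitly.
    quad = float(diff @ np.linalg.solve(Sigma, diff))
    _, logdet = np.linalg.slogdet(Sigma)
    # Log of the normalizing constant.
    log_norm = (math.lgamma((nu + p) / 2) - math.lgamma(nu / 2)
                - 0.5 * p * math.log(nu * math.pi) - 0.5 * logdet)
    return log_norm - 0.5 * (nu + p) * math.log1p(quad / nu)

# With p = 1, mu = 0, Sigma = 1 and nu = 1 this is the standard Cauchy density.
cauchy_at_zero = math.exp(multivariate_t_logpdf([0.0], [0.0], [[1.0]], 1.0))
```

Working on the log scale avoids overflow in the gamma functions when $\nu$ or $p$ is large.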
Here is some more information about the t-density (5):
Connection to the Multivariate Normal Density: The most important term in the formula (5) is $(x-\mu)^T \Sigma^{-1} (x-\mu)$. This exact term also appears in the multivariate normal density. If $X \sim N_p(\mu, \Sigma)$, then the density of $X$ is given by:
$$\frac{1}{(2\pi)^{p/2} \sqrt{\det \Sigma}} \exp\left(-\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)\right).$$
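This suggests that the t-density is closely related to the multivariate normal density. Here is the connection. Suppose $X \sim N_p(\mu, \Sigma)$ and $V \sim \chi^2_\nu$ (this is the chi-squared distribution with $\nu$ degrees of freedom) are independent. Then
$$T := \mu + \frac{X - \mu}{\sqrt{V/\nu}} \sim t_p(\mu, \Sigma, \nu). \tag{6}$$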
Thus, in the notation tp(μ,Σ,ν), ν denotes degrees of freedom, p denotes dimension, μ and Σ denote the mean vector and covariance matrix of the corresponding normal random vector X. For completeness, we include a proof of (6) in Section ??.
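The normal and chi-squared construction also gives a direct way to simulate from $t_p(\mu, \Sigma, \nu)$. The sketch below reads (6) as $T = \mu + (X - \mu)/\sqrt{V/\nu}$ with $X$ and $V$ drawn independently. The helper name `sample_multivariate_t` and the numerical values of $\mu$, $\Sigma$, $\nu$ are assumptions made for illustration.

```python
import numpy as np

def sample_multivariate_t(mu, Sigma, nu, size, rng):
    """Draw `size` samples from t_p(mu, Sigma, nu) via the normal / chi-squared construction."""
    mu = np.asarray(mu, dtype=float)
    X = rng.multivariate_normal(mu, Sigma, size=size)  # rows are draws of X ~ N_p(mu, Sigma)
    V = rng.chisquare(nu, size=size)                   # independent chi-squared draws
    # Divide each centered normal draw by sqrt(V / nu), then shift back by mu.
    return mu + (X - mu) / np.sqrt(V / nu)[:, None]

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
draws = sample_multivariate_t(mu, Sigma, nu=10, size=20000, rng=rng)
```

Each row of `draws` shares a single value of $V$ across its $p$ coordinates, which is what makes the coordinates dependent even when $\Sigma$ is diagonal.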
Individual Components as well as Linear Combinations of Components of T are also t-distributed: Suppose $T \sim t_p(\mu, \Sigma, \nu)$ and the components of $T$ are $T_1, \dots, T_p$. Then each individual component $T_j$ is also t-distributed, and so is every linear combination $a_0 + a_1 T_1 + a_2 T_2 + \dots + a_p T_p$. To see this, first write
$$a_0 + a_1 T_1 + \dots + a_p T_p = a_0 + a^T T$$
where $a$ is the $p \times 1$ vector with components $a_1, \dots, a_p$. Using the formula (6), we can write
$$a_0 + a^T T = (a_0 + a^T \mu) + \frac{(a_0 + a^T X) - (a_0 + a^T \mu)}{\sqrt{V/\nu}}.$$
Because $a_0 + a^T X \sim N(a_0 + a^T \mu, a^T \Sigma a)$, the same fact (6) applied to this case gives:
$$a_0 + a^T T \sim t_1(a_0 + a^T \mu, a^T \Sigma a, \nu).$$
In particular (taking $a_0 = 0$ and $a$ to be the vector with 1 in position $j$ and 0 elsewhere), this implies that for each $j = 1, \dots, p$,
$$T_j \sim t_1(\mu_j, \Sigma(j,j), \nu)$$
where $\mu_j$ is the $j$th component of $\mu$ and $\Sigma(j,j)$ is the $(j,j)$th entry of $\Sigma$.
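The parameters of a linear combination are simple to compute. The following sketch evaluates the location $a_0 + a^T\mu$ and scale $a^T \Sigma a$ of $a_0 + a^T T$, and recovers the component result by choosing $a$ to be a coordinate vector. The function name and the numbers are hypothetical.

```python
import numpy as np

def linear_combination_t_params(a0, a, mu, Sigma, nu):
    """Return (location, scale, degrees of freedom) of a0 + a^T T when T ~ t_p(mu, Sigma, nu)."""
    a = np.asarray(a, dtype=float)
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    location = a0 + a @ mu
    scale = a @ Sigma @ a
    return float(location), float(scale), nu

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])

# Distribution of 3 + T_1 + T_2.
combo = linear_combination_t_params(3.0, [1.0, 1.0], mu, Sigma, nu=10)

# Distribution of T_2 alone: a0 = 0 and a = (0, 1).
second_component = linear_combination_t_params(0.0, [0.0, 1.0], mu, Sigma, nu=10)
```

The degrees of freedom are unchanged by the linear map; only the location and scale are transformed.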
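When $\nu$ is large, t is very close to normal: This can intuitively be seen by noting that when $\nu$ is large, the term $(x-\mu)^T \Sigma^{-1} (x-\mu)/\nu$ is small so that
$$1 + \frac{1}{\nu}(x-\mu)^T \Sigma^{-1} (x-\mu) \approx \exp\left(\frac{1}{\nu}(x-\mu)^T \Sigma^{-1} (x-\mu)\right),$$
where we used the observation that $1 + z \approx e^z$ when $z$ is small. Thus the t-density (5) for large $\nu$ becomes approximately (as a function of $x$, up to the normalizing constant):
$$\left(1 + \frac{1}{\nu}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)^{-(\nu+p)/2} \approx \exp\left(-\frac{\nu+p}{2\nu}(x-\mu)^T \Sigma^{-1} (x-\mu)\right) \approx \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$
because $\frac{\nu+p}{\nu} \approx 1$ when $\nu$ is large. This gets us the normal density:
$$t_p(\mu, \Sigma, \nu) \approx N_p(\mu, \Sigma) \quad \text{if } \nu \text{ is large}.$$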
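A quick numerical check of this approximation in the univariate case ($p = 1$): the sketch below compares the log-density of $t_1(\mu, \sigma^2, \nu)$ with that of $N(\mu, \sigma^2)$ at a fixed point for increasing $\nu$. It assumes the standard form of the univariate t-density. The evaluation point and the values of $\nu$ are arbitrary choices.

```python
import math

def t1_logpdf(x, mu, s2, nu):
    """Log-density of the univariate t with location mu, scale s2 and nu degrees of freedom."""
    z2 = (x - mu) ** 2 / s2
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi * s2)
            - 0.5 * (nu + 1) * math.log1p(z2 / nu))

def normal_logpdf(x, mu, s2):
    """Log-density of N(mu, s2)."""
    return -0.5 * math.log(2 * math.pi * s2) - 0.5 * (x - mu) ** 2 / s2

# Absolute gap between the two log-densities at x = 2 for growing degrees of freedom.
gaps = [abs(t1_logpdf(2.0, 0.0, 1.0, nu) - normal_logpdf(2.0, 0.0, 1.0))
        for nu in (5, 50, 5000)]
```

The gap shrinks as $\nu$ grows, in line with the heuristic argument; for small $\nu$ the t-density has visibly heavier tails.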
It turns out that (4) is a special case of (5) for some p,μ,Σ,ν. To see this, we need to first rewrite (4) using matrix notation which we do in the next section.
This notation is used not just to write formulae for linear regression, but also in code. For example, the OLS function in statsmodels uses the syntax sm.OLS(y, X).fit() to fit the linear regression model, where y (n×1 vector) and X (n×(m+1) matrix) are defined above.
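To make the shapes concrete, here is a small NumPy sketch that builds $y$ and $X$ from simulated covariates. It assumes the first column of $X$ is a column of ones (so that $\beta_0$ is the first component of $\beta$); the data, the seed, and the coefficient values are invented for illustration. If statsmodels is available, `sm.add_constant(covariates)` builds the same matrix, and these are the `y` and `X` that would be passed to `sm.OLS(y, X).fit()`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 2

# Hypothetical covariates: one row per subject, one column per covariate.
covariates = rng.normal(size=(n, m))

# Design matrix: a leading column of ones for the intercept, then the m covariates.
X = np.column_stack([np.ones(n), covariates])   # shape (n, m + 1)

# Simulated response from made-up coefficients (beta_0, beta_1, beta_2).
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)   # shape (n,)
```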
With this notation, one can write the sum of squares $S(\beta_0, \dots, \beta_m)$ as:
$$S(\beta) = S(\beta_0, \dots, \beta_m) = \|y - X\beta\|^2.$$
There are two important facts about S(β):
Fact 1: the least squares estimator $\hat{\beta}$ (the minimizer of $S(\beta)$) is given by the formula:
$$\hat{\beta} = (X^T X)^{-1} X^T y.$$
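A minimal numerical sketch of the least squares estimator, assuming $X^T X$ is invertible so that $\hat{\beta}$ solves the normal equations $X^T X \beta = X^T y$. The simulated data and the helper name `sum_of_squares` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(scale=0.2, size=n)

def sum_of_squares(beta, y, X):
    """S(beta) = ||y - X beta||^2."""
    residual = y - X @ beta
    return float(residual @ residual)

# Solve the normal equations X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The same minimizer from NumPy's least squares routine, as a cross-check.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Solving the linear system is generally preferable to forming $(X^T X)^{-1}$ explicitly when only $\hat{\beta}$ is needed.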
With the posterior density (14), one can do uncertainty quantification about the parameters $\beta_0, \beta_1, \dots, \beta_m$. One can generate multiple samples from $t_{m+1}\left(\hat{\beta}, \frac{S(\hat{\beta})}{n-m-1}(X^T X)^{-1}, n-m-1\right)$ and plot the resulting fitted values to visualize the uncertainty in the coefficients.
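The sampling step can be sketched as follows for simple linear regression ($m = 1$). The draws use the normal and chi-squared representation of the multivariate t-distribution. The data are simulated, and the number of draws is an arbitrary choice. Plotting is omitted; each row of `fitted_draws` is one line that could be overlaid on a scatter plot of the data.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 30, 1
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)
S_hat = float(np.sum((y - X @ beta_hat) ** 2))   # S(beta_hat)
df = n - m - 1
scale = (S_hat / df) * np.linalg.inv(XtX)        # scale matrix of the posterior t

# Draw from t_{m+1}(beta_hat, scale, df): normal draws divided by sqrt(chi-squared / df).
num_draws = 200
Z = rng.multivariate_normal(np.zeros(m + 1), scale, size=num_draws)
V = rng.chisquare(df, size=num_draws)
beta_draws = beta_hat + Z / np.sqrt(V / df)[:, None]

# Fitted values for every posterior draw: one row per draw, one column per observation.
fitted_draws = beta_draws @ X.T
```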
In (14), the quantity $S(\hat{\beta})/(n-m-1)$ is the frequentist unbiased estimator for $\sigma^2$, so we denote it by $\hat{\sigma}^2$:
$$\hat{\sigma} := \sqrt{\frac{S(\hat{\beta})}{n-m-1}}.$$
$\hat{\sigma}$ can also be justified as a Bayesian estimator of $\sigma$ (this will be a question in Homework three). The terminology Residual Standard Error is sometimes used for $\hat{\sigma}$.
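For reference, the residual standard error can be computed in a few lines. The sketch below uses a four-point toy dataset invented for illustration, and the function name is hypothetical.

```python
import numpy as np

def residual_standard_error(y, X):
    """sqrt(S(beta_hat) / (n - m - 1)), where X has m + 1 columns including the intercept."""
    n, k = X.shape                      # k = m + 1
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta_hat
    return float(np.sqrt(residuals @ residuals / (n - k)))

# Toy data: intercept column plus one covariate.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 1.0, 2.0, 4.0])
sigma_hat = residual_standard_error(y, X)
```

Here the fitted line is $-0.2 + 1.3x$, the residual sum of squares is $0.3$, and $n - m - 1 = 2$, so $\hat{\sigma} = \sqrt{0.15}$.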
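With the notation for $\hat{\sigma}$, the posterior (14) becomes:
$$\beta \mid \text{data} \sim t_{m+1}\left(\hat{\beta}, \hat{\sigma}^2 (X^T X)^{-1}, n-m-1\right).$$
Since individual components of a t-distributed vector are again t-distributed, this gives, for each $j = 0, 1, \dots, m$,
$$\beta_j \mid \text{data} \sim t_1\left(\hat{\beta}_j, \hat{\sigma}^2 (X^T X)^{j+1,j+1}, n-m-1\right)$$
where $(X^T X)^{j+1,j+1}$ is the $(j+1)$th diagonal entry of $(X^T X)^{-1}$ (note that we are using the $(j+1)$th diagonal entry of $(X^T X)^{-1}$ because $\beta_j$ is the $(j+1)$th component of $\beta$). Equivalently, after centering and scaling, we have
$$\frac{\beta_j - \hat{\beta}_j}{\hat{\sigma} \sqrt{(X^T X)^{j+1,j+1}}} \sim \text{univariate standard t with } n-m-1 \text{ d.f.}$$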
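This can be used to obtain uncertainty intervals for $\beta_j$. If $t_{n-m-1,\alpha/2}$ is the point beyond which the t-distribution (with $n-m-1$ degrees of freedom) assigns probability $\alpha/2$, then
$$\left[\hat{\beta}_j - t_{n-m-1,\alpha/2}\, \hat{\sigma} \sqrt{(X^T X)^{j+1,j+1}},\ \ \hat{\beta}_j + t_{n-m-1,\alpha/2}\, \hat{\sigma} \sqrt{(X^T X)^{j+1,j+1}}\right]$$
is called the $100(1-\alpha)\%$ Bayesian credible interval for $\beta_j$. It exactly coincides with the frequentist $100(1-\alpha)\%$ confidence interval for $\beta_j$.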
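A rough sketch of the interval computation follows. To keep the example dependent on NumPy only, the t critical value is approximated by simulation from `rng.standard_t`; this is a choice made for this sketch, and an exact quantile function such as `scipy.stats.t.ppf` would normally be used instead. The function name, toy data, and number of Monte Carlo draws are assumptions.

```python
import numpy as np

def credible_interval(y, X, j, alpha, rng, num_draws=200_000):
    """Approximate 100(1 - alpha)% interval for beta_j (j = 0 is the intercept)."""
    n, k = X.shape                                   # k = m + 1
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    df = n - k
    sigma_hat = np.sqrt(np.sum((y - X @ beta_hat) ** 2) / df)
    # beta_j is the (j+1)th component, which is index j in a 0-based array.
    half_scale = sigma_hat * np.sqrt(XtX_inv[j, j])
    # Monte Carlo approximation of the point with upper-tail probability alpha / 2.
    t_crit = np.quantile(rng.standard_t(df, size=num_draws), 1 - alpha / 2)
    return beta_hat[j] - t_crit * half_scale, beta_hat[j] + t_crit * half_scale

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 1.0, 2.0, 4.0])
lower, upper = credible_interval(y, X, j=1, alpha=0.05, rng=np.random.default_rng(0))
```

The interval is centered at $\hat{\beta}_j$ by construction; only its half-width carries Monte Carlo error from the simulated critical value.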
The degrees of freedom corresponding to the t-density in (14) is $n-m-1$ where $n$ is the number of observations, and $m$ is the number of covariates. Thus if $n-m-1$ is large, then the posterior distribution (which is actually t) is approximately normal:
$$\beta \mid \text{data} \overset{\text{approx.}}{\sim} N_{m+1}\left(\hat{\beta}, \hat{\sigma}^2 (X^T X)^{-1}\right).$$
In other words, when $n-m-1$ is large, the t-density (14) is approximately equal to the $N_{m+1}(\hat{\beta}, \hat{\sigma}^2 (X^T X)^{-1})$ density. Further, when $n-m-1$ is large, the distribution (16) of $\beta_j$ will be close to the normal distribution $N(\hat{\beta}_j, \hat{\sigma}^2 (X^T X)^{j+1,j+1})$. The quantity $\hat{\sigma} \sqrt{(X^T X)^{j+1,j+1}}$ is known as the standard error corresponding to $\beta_j$.
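The standard errors for all coefficients can be read off the diagonal of $\hat{\sigma}^2 (X^T X)^{-1}$. The sketch below computes them on simulated data with $n - m - 1 = 197$, where the normal approximation is reasonable, and forms approximate 95% intervals using the standard normal quantile 1.96. The data and the function name are invented for illustration.

```python
import numpy as np

def standard_errors(y, X):
    """Return (beta_hat, se), where se[j] = sigma_hat * sqrt of the (j+1)th diagonal entry of (X^T X)^{-1}."""
    n, k = X.shape                                   # k = m + 1
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    sigma_hat = np.sqrt(np.sum((y - X @ beta_hat) ** 2) / (n - k))
    return beta_hat, sigma_hat * np.sqrt(np.diag(XtX_inv))

rng = np.random.default_rng(3)
n, m = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=n)

beta_hat, se = standard_errors(y, X)

# Normal-approximation 95% intervals, one row per coefficient: [lower, upper].
approx_intervals = np.column_stack([beta_hat - 1.96 * se, beta_hat + 1.96 * se])
```

With few degrees of freedom these normal-based intervals would be too narrow, and the t critical value should be used instead.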