We are again in the usual regression setting where we observe data (yi,xi1,xi2,…,xim) for i=1,…,n. There are m explanatory variables x1,…,xm and one response variable. xij denotes the value of the jth explanatory variable for the ith individual and yi is the value of the response variable for the ith individual. Suppose now that the response variable is binary i.e., y1,…,yn take values in {0,1}. In this case, the logistic regression model assumes that:
Suppose x1T,…,xnT form the rows of the n×p design matrix X (where p=m+1). The unknown parameters in the logistic regression model are β0,…,βm (note that, in contrast to the linear regression model, there is no σ parameter in logistic regression). As in linear regression, we use the prior:
Note that ℓ(β) is simply the log-likelihood in this problem. This posterior density is not in standard form involving some well-known densities. If p=1 or p=2, then this can be plotted. One can use various MCMC techniques to obtain samples from this posterior.
Instead of MCMC, we shall derive a closed form multivariate normal approximation that works quite well in practice. This method is called Laplace Approximation or Posterior Normal Approximation. It turns out that Bayesian inference from this normal approximation to the posterior coincides with usual frequentist inference for logistic regression.
To get the normal approximation, first let us drop the indicator which will be irrelevant when C is large to get
The normal approximation will be obtained by a second-order Taylor expansion of ℓ(β). We shall do the Taylor expansion around the MLE β^ because the posterior is peaked at β^ and the high regions of the posterior are most likely very close to β^. Recall that the MLE β^ is defined as the maximizer of the likelihood (or log-likelihood):
It is obtained by taking the gradient of the log-likelihood and solving the equation obtained by setting the gradient to zero. It is easy to check that the gradient of the log-likelihood is
To get the maximum likelihood estimator θ^, we need to set the gradient above to zero and solve the resulting equation for θ. This cannot be done in closed form and the usual method is to use an iterative scheme such as Newton’s algorithm. The answer can be obtained from inbuilt functions in R or Python. More details behind the Newton algorithm will be provided a bit later.
Coming back to the posterior exp(ℓ(β)), Taylor expansion of ℓ(β) around the MLE β^ gives
We have switched above to −Hℓ(β^) because this matrix is positive semi-definite as β^ maximizes ℓ(β). The above term is simply the unnormalized density of the multivariate normal distribution with mean β^ and covariance matrix (−Hℓ(β^))−1. Observe that
The standard errors corresponding to β0,…,βm can then be obtained by the square roots of the diagonal entries of (X′WX)−1.
It turns out that Bayesian inference done with the above normal posterior approximation (10) coincides with the frequentist inference in the logistic regression model. It is easy to check this, say, in Python using statsmodels (construct a 95% credible interval for, say, one of the components of β and then compare it with the frequentist interval). Thus the usual frequentist inference for the logistic regression model can be viewed from a Bayesian perspective.
Note that the above analysis relies on two assumptions: (a) the prior for β is assumed to be uniform on the large cube (−C,C)p, and (b) the posterior is approximated by a normal distribution. These assumptions may be of course not reasonable in a particular application. In such a situation, it is conceptually very clear as to how one would proceed: if the normal approximation to the posterior is not accurate, one needs to work with the actual posterior. If the uniform prior is not reasonable, one can do the full posterior analysis (or by taking a normal approximation to the posterior) for a more appropriate prior.
Details behind the Newton Algorithm for computing the MLE¶
The MLE β^ of β is the maximizer of ℓ(β). The maximizer of ℓ(β) cannot be computed in closed form. Newton’s method is commonly used for maximizing ℓ(β). Newton’s method uses the iterative scheme
Letting π be the n×1 vector with entries π1,…,πn, we can write ∇ℓ(β) in matrix notation as (note that X has rows x1T,…,xnT or, equivalently, XT has columns x1,…,xn):
∇ℓ(β)=XT(y−π)
where, as usual in regression, y denotes the n×1 vector of response values. Also from (9), we can write
Hℓ(β)=−XTWX
where W is the n×n diagonal matrix whose ith diagonal entry is πi(1−πi). Newton’s iterative scheme (11) therefore becomes
The method of obtaining the MLE β^ therefore proceeds iteratively as follows. First have an initial estimate of β. Call this initial estimator β^(0). Use this estimator to calculate πi via
Use these values of πi to create the response variable values Zi via (13) and also use values of πi to construct the matrix W. With Z and W, we can estimate β via
β^(1)=(XTWX)−1XTWZ.
Now replace the initial estimator β^(0) by β^(1) and repeat this process. Keep repeating this until two successive estimates β^(m) and β^(m+1) do not change much. At that point, stop and report the estimate of β in the logistic regression model as β^(m).
The expression (XTWX)−1XTWZ is reminiscent of the usual (XTX)−1XTy which is the usual estimate of β in the linear model. In fact, this is the least squares estimate in a weighted least squares model.