Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

STAT 238 - Bayesian Statistics Lecture Seventeen

Spring 2026, UC Berkeley

Logistic Regression

We are again in the usual regression setting where we observe data (yi,xi1,xi2,,xim)(y_i, x_{i1},x_{i2}, \dots, x_{im}) for i=1,,ni = 1, \dots, n. There are mm explanatory variables x1,,xmx_1, \dots, x_m and one response variable. xijx_{ij} denotes the value of the jthj^{th} explanatory variable for the ithi^{th} individual and yiy_i is the value of the response variable for the ithi^{th} individual. Suppose now that the response variable is binary i.e., y1,,yny_1, \dots, y_n take values in {0,1}\{0, 1\}. In this case, the logistic regression model assumes that:

yiindependentBernoulli(exp(β0+β1xi1++βmxim)1+exp(β0+β1xi1++βmxim))  for i=1,,n.y_i \overset{\text{independent}}{\sim} \text{Bernoulli} \left(\frac{\exp \left(\beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im} \right)}{1 + \exp \left(\beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im} \right)} \right) ~~\text{for $i = 1, \dots, n$}.

Letting xi=(1,xi1,,xim)Tx_i = (1, x_{i1}, \dots, x_{im})^T and β=(β0,β1,,βm)T\beta = (\beta_0, \beta_1, \dots, \beta_m)^T, we can write the model also as

yiindependentBernoulli(exp(xiTβ)1+exp(xiTβ))  for i=1,,ny_i \overset{\text{independent}}{\sim} \text{Bernoulli} \left(\frac{\exp \left(x_i^T \beta\right)}{1 + \exp \left(x_i^T \beta \right)} \right) ~~\text{for $i = 1, \dots, n$}

where β\beta is the (m+1)×1(m+1) \times 1 vector with components β0,β1,,βm\beta_0, \beta_1, \dots, \beta_m.

This gives the likelihood:

Likelihood=i=1n(exp(xiTβ)1+exp(xiTβ))yi(1exp(xiTβ)1+exp(xiTβ))1yi\begin{align*} \text{Likelihood} =\prod_{i=1}^n \left(\frac{\exp \left(x_i^T\beta \right)}{1 + \exp \left(x_i^T\beta \right)} \right)^{y_i} \left(1 - \frac{\exp \left(x_i^T\beta \right)}{1 + \exp \left(x_i^T\beta \right)} \right)^{1-y_i} \end{align*}

Suppose x1T,,xnTx_1^T, \dots, x_n^T form the rows of the n×pn \times p design matrix XX (where p=m+1p = m+1). The unknown parameters in the logistic regression model are β0,,βm\beta_0, \dots, \beta_m (note that, in contrast to the linear regression model, there is no σ\sigma parameter in logistic regression). As in linear regression, we use the prior:

β0,β1,,βmi.i.dUnif(C,C)\beta_0, \beta_1, \dots, \beta_m \overset{\text{i.i.d}}{\sim} \text{Unif} (-C, C)

for a large CC. The posterior of β\beta is then

fβdata(β)Likelihood×prior[i=1n(exp(xiTβ)1+exp(xiTβ))yi(1exp(xiTβ)1+exp(xiTβ))1yi]I{β0,β1,,βp(C,C)}=[i=1nexp(yixiTβ)1+exp(xiTβ)]I{β0,β1,,βp(C,C)}=[exp((β))]I{β0,β1,,βp(C,C)}\begin{align*} f_{\beta \mid \text{data}}(\beta) &\propto \text{Likelihood} \times \text{prior} \\ &\propto \left[\prod_{i=1}^n \left(\frac{\exp \left(x_i^T\beta \right)}{1 + \exp \left(x_i^T\beta \right)} \right)^{y_i} \left(1 - \frac{\exp \left(x_i^T\beta \right)}{1 + \exp \left(x_i^T\beta \right)} \right)^{1-y_i} \right] I\{\beta_0, \beta_1, \dots, \beta_p \in (-C, C)\} \\ &= \left[\prod_{i=1}^n \frac{\exp(y_ix_i^T\beta) }{1 + \exp(x_i^T\beta)} \right]I\{\beta_0, \beta_1, \dots, \beta_p \in (-C, C)\} \\ &= \left[\exp \left(\ell(\beta) \right) \right]I\{\beta_0, \beta_1, \dots, \beta_p \in (-C, C)\} \end{align*}

where

(β):=i=1n[yi(xiTβ)log(1+exp(xiTβ))].\ell(\beta) := \sum_{i=1}^n \left[ y_i (x_i^T\beta) - \log \left(1 + \exp(x_i^T \beta) \right) \right].

Note that (β)\ell(\beta) is simply the log-likelihood in this problem. This posterior density is not in standard form involving some well-known densities. If p=1p = 1 or p=2p = 2, then this can be plotted. One can use various MCMC techniques to obtain samples from this posterior.

Instead of MCMC, we shall derive a closed form multivariate normal approximation that works quite well in practice. This method is called Laplace Approximation or Posterior Normal Approximation. It turns out that Bayesian inference from this normal approximation to the posterior coincides with usual frequentist inference for logistic regression.

To get the normal approximation, first let us drop the indicator which will be irrelevant when CC is large to get

fβY=y(β)exp((β)).\begin{align*} f_{\beta | Y = y}(\beta) \propto \exp(\ell(\beta)). \end{align*}

The normal approximation will be obtained by a second-order Taylor expansion of (β)\ell(\beta). We shall do the Taylor expansion around the MLE β^\hat{\beta} because the posterior is peaked at β^\hat{\beta} and the high regions of the posterior are most likely very close to β^\hat{\beta}. Recall that the MLE β^\hat{\beta} is defined as the maximizer of the likelihood (or log-likelihood):

β^:=argmaxβRp (β).\begin{align*} \hat{\beta} := \underset{\beta \in \R^p}{\text{argmax}} ~ \ell(\beta). \end{align*}

It is obtained by taking the gradient of the log-likelihood and solving the equation obtained by setting the gradient to zero. It is easy to check that the gradient of the log-likelihood is

(β)=i=1n(yiexp(xiβ)1+exp(xiβ))xi.\nabla \ell(\beta) = \sum_{i=1}^n \left(y_i - \frac{\exp \left(x_i'\beta \right)}{1 + \exp \left(x_i'\beta \right)} \right) x_i.

To get the maximum likelihood estimator θ^\hat{\theta}, we need to set the gradient above to zero and solve the resulting equation for θ\theta. This cannot be done in closed form and the usual method is to use an iterative scheme such as Newton’s algorithm. The answer can be obtained from inbuilt functions in R or Python. More details behind the Newton algorithm will be provided a bit later.

Coming back to the posterior exp((β))\exp(\ell(\beta)), Taylor expansion of (β)\ell(\beta) around the MLE β^\hat{\beta} gives

fβY=y(β)exp((β))exp((β^)+<(β^),ββ^>+12(ββ^)TH(β^)(ββ^))\begin{align*} f_{\beta | Y = y}(\beta) \propto \exp(\ell(\beta)) \approx \exp \left(\ell(\hat{\beta}) + \left<\nabla \ell(\hat{\beta}), \beta - \hat{\beta}\right> + \frac{1}{2} (\beta - \hat{\beta})^T H\ell(\hat{\beta}) (\beta - \hat{\beta}) \right) \end{align*}

where H(β)H\ell(\beta) denotes the Hessian of (β)\ell(\beta):

H(β)=i=1nexp(xiβ)(1+exp(xiβ))2xixi.H \ell(\beta) = - \sum_{i=1}^n \frac{\exp(x_i' \beta)}{\left(1 + \exp(x_i' \beta)\right)^2} x_i x_i'.

Because the (β^)\ell(\hat \beta) term is a constant, it can be ignored in proportionality. Also (β^)\nabla \ell(\hat{\beta}) equals zero. We thus have

fβY=y(β)exp(12(ββ^)TH(β^)(ββ^))=exp(12(ββ^)T(H(β^))(ββ^))\begin{align*} f_{\beta \mid Y = y} (\beta) \propto \exp \left(\frac{1}{2} (\beta - \hat{\beta})^T H\ell(\hat{\beta}) (\beta - \hat{\beta}) \right) = \exp \left(-\frac{1}{2} (\beta - \hat{\beta})^T \left(-H\ell(\hat{\beta}) \right) (\beta - \hat{\beta}) \right) \end{align*}

We have switched above to H(β^)-H\ell(\hat{\beta}) because this matrix is positive semi-definite as β^\hat{\beta} maximizes (β)\ell(\beta). The above term is simply the unnormalized density of the multivariate normal distribution with mean β^\hat{\beta} and covariance matrix (H(β^))1(-H \ell(\hat{\beta}))^{-1}. Observe that

H(β^)=i=1nexp(xiβ^)(1+exp(xiβ^))2xixi.- H \ell(\hat \beta) = \sum_{i=1}^n \frac{\exp(x_i' \hat{\beta})}{\left(1 + \exp(x_i' \hat{\beta})\right)^2} x_i x_i'.

Now let WW denote the n×nn \times n diagonal matrix whose ithi^{th} diagonal entry is

exp(xiβ^)(1+exp(xiβ^))2\frac{\exp(x_i' \hat{\beta})}{\left(1 + \exp(x_i' \hat{\beta})\right)^2}

and also recall again that XX is the n×pn \times p matrix with rows x1,,xnx_1', \dots, x_n'. It is then easy to check that

H(β^)=XWX.- H \ell(\hat \beta) = X' W X.

The posterior normal approximation is thus

N(β^,(XWX)1).N(\hat{\beta}, (X' W X)^{-1}).

The standard errors corresponding to β0,,βm\beta_0, \dots, \beta_m can then be obtained by the square roots of the diagonal entries of (XWX)1(X' W X)^{-1}.

It turns out that Bayesian inference done with the above normal posterior approximation (10) coincides with the frequentist inference in the logistic regression model. It is easy to check this, say, in Python using statsmodels (construct a 95% credible interval for, say, one of the components of β\beta and then compare it with the frequentist interval). Thus the usual frequentist inference for the logistic regression model can be viewed from a Bayesian perspective.

Note that the above analysis relies on two assumptions: (a) the prior for β\beta is assumed to be uniform on the large cube (C,C)p(-C, C)^p, and (b) the posterior is approximated by a normal distribution. These assumptions may be of course not reasonable in a particular application. In such a situation, it is conceptually very clear as to how one would proceed: if the normal approximation to the posterior is not accurate, one needs to work with the actual posterior. If the uniform prior is not reasonable, one can do the full posterior analysis (or by taking a normal approximation to the posterior) for a more appropriate prior.

Details behind the Newton Algorithm for computing the MLE

The MLE β^\hat{\beta} of β\beta is the maximizer of (β)\ell(\beta). The maximizer of (β)\ell(\beta) cannot be computed in closed form. Newton’s method is commonly used for maximizing (β)\ell(\beta). Newton’s method uses the iterative scheme

β(m+1)=β(m)(H(β(m)))1(β(m))\beta^{(m+1)} = \beta^{(m)} - \left(H\ell(\beta^{(m)})\right)^{-1} \nabla \ell(\beta^{(m)})

As we saw in (6),

(β)=i=1n(yiπi)xi\nabla \ell(\beta) = \sum_{i=1}^n \left(y_i - \pi_i \right) x_i

where πi\pi_i is given by

πi=exp(xiβ)1+exp(xiβ)\pi_i = \frac{\exp(x_i' \beta)}{1 + \exp(x_i' \beta)}

Letting π\pi be the n×1n \times 1 vector with entries π1,,πn\pi_1, \dots, \pi_n, we can write (β)\nabla \ell(\beta) in matrix notation as (note that XX has rows x1T,,xnTx_1^T, \dots, x_n^T or, equivalently, XTX^T has columns x1,,xnx_1, \dots, x_n):

(β)=XT(yπ)\nabla \ell(\beta) = X^T \left(y - \pi \right)

where, as usual in regression, yy denotes the n×1n \times 1 vector of response values. Also from (9), we can write

H(β)=XTWXH\ell(\beta) = - X^T W X

where WW is the n×nn \times n diagonal matrix whose ithi^{th} diagonal entry is πi(1πi)\pi_i(1 - \pi_i). Newton’s iterative scheme (11) therefore becomes

β(m+1)=β(m)+(XTWX)1XT(yπ).\beta^{(m+1)} = \beta^{(m)} + (X^T W X)^{-1} X^T (y - \pi).

This can be rewritten as

β(m+1)=(XTWX)1XTWZ\beta^{(m+1)} = (X^T W X)^{-1} X^T W Z

where

Z=Xβ(m)+W1(yπ).Z = X \beta^{(m)} + W^{-1}(y - \pi).

The method of obtaining the MLE β^\hat{\beta} therefore proceeds iteratively as follows. First have an initial estimate of β\beta. Call this initial estimator β^(0)\hat{\beta}^{(0)}. Use this estimator to calculate πi\pi_i via

πi=exp(β^0(0)+β^1(0)xi1++β^p(0)xip)1+exp(β^0(0)+β^1(0)xi1+β^p(0)xip).\pi_i = \frac{\exp(\hat{\beta}^{(0)}_0 + \hat{\beta}^{(0)}_1 x_{i1} + \dots + \hat{\beta}^{(0)}_p x_{ip})}{1 + \exp(\hat{\beta}^{(0)}_0 + \hat{\beta}^{(0)}_1 x_{i1} \dots + \hat{\beta}^{(0)}_p x_{ip})}.

Use these values of πi\pi_i to create the response variable values ZiZ_i via (13) and also use values of πi\pi_i to construct the matrix WW. With ZZ and WW, we can estimate β\beta via

β^(1)=(XTWX)1XTWZ.\hat{\beta}^{(1)} = (X^T W X)^{-1} X^T W Z.

Now replace the initial estimator β^(0)\hat{\beta}^{(0)} by β^(1)\hat{\beta}^{(1)} and repeat this process. Keep repeating this until two successive estimates β^(m)\hat{\beta}^{(m)} and β^(m+1)\hat{\beta}^{(m+1)} do not change much. At that point, stop and report the estimate of β\beta in the logistic regression model as β^(m)\hat{\beta}^{(m)}.

The expression (XTWX)1XTWZ(X^T W X)^{-1} X^T W Z is reminiscent of the usual (XTX)1XTy(X^T X)^{-1} X^T y which is the usual estimate of β\beta in the linear model. In fact, this is the least squares estimate in a weighted least squares model.