STAT 238 - Bayesian Statistics Lecture Twenty Seven
Spring 2026, UC Berkeley
Gibbs Sampler for Probit Regression ¶ Our next application of the Gibbs sampler is to Probit Regression which was introduced in Homework Four. The model is given by the following. We observe data ( y 1 , x 1 ) , … , ( y n , x n ) (y_1, x_1), \dots, (y_n, x_n) ( y 1 , x 1 ) , … , ( y n , x n ) with y 1 , … , y n y_1, \dots, y_n y 1 , … , y n binary and x 1 , … , x n ∈ R p x_1, \dots, x_n \in \R^p x 1 , … , x n ∈ R p . x 1 , … , x n x_1,
\dots, x_n x 1 , … , x n are treated non-random and y 1 , … , y n y_1, \dots, y_n y 1 , … , y n random. The parameter is β ∈ R p \beta \in \R^p β ∈ R p . The probit regression model assumes that y 1 … , y n y_1 \dots, y_n y 1 … , y n are independent given β \beta β with
y i ∣ β ∼ Bernoulli ( Φ ( x i ′ β ) ) y_i | \beta \sim \text{Bernoulli} \left(\Phi(x_i' \beta) \right) y i ∣ β ∼ Bernoulli ( Φ ( x i ′ β ) ) where Φ ( ⋅ ) \Phi(\cdot) Φ ( ⋅ ) is the standard normal cdf. Let π ( ⋅ ) \pi(\cdot) π ( ⋅ ) be our prior for β \beta β . The posterior is then given by:
π ( β ∣ data ) ∝ π ( β ) ∏ i = 1 n ( Φ ( x i ′ β ) ) y i ( 1 − Φ ( x i ′ β ) ) 1 − y i . \pi(\beta | \text{data}) \propto \pi(\beta) \prod_{i=1}^n
\left(\Phi(x_i' \beta) \right)^{y_i} \left(1 - \Phi(x_i' \beta)
\right)^{1-y_i}. π ( β ∣ data ) ∝ π ( β ) i = 1 ∏ n ( Φ ( x i ′ β ) ) y i ( 1 − Φ ( x i ′ β ) ) 1 − y i . Suppose we take the (improper) uniform prior for β \beta β so that the posterior becomes
π ( β ∣ data ) ∝ ∏ i = 1 n ( Φ ( x i ′ β ) ) y i ( 1 − Φ ( x i ′ β ) ) 1 − y i . \pi(\beta | \text{data}) \propto \prod_{i=1}^n
\left(\Phi(x_i' \beta) \right)^{y_i} \left(1 - \Phi(x_i' \beta)
\right)^{1-y_i}. π ( β ∣ data ) ∝ i = 1 ∏ n ( Φ ( x i ′ β ) ) y i ( 1 − Φ ( x i ′ β ) ) 1 − y i . It is not actually clear how to use the Gibbs sampler on the above posterior. We can try decomposing β \beta β as ( β 1 , … , β p ) (\beta_1, \dots,
\beta_p) ( β 1 , … , β p ) and then attempt to obtain samples from the conditional:
π ( β j ∣ β − j , data ) ∝ ∏ i = 1 n ( Φ ( x i ′ β ) ) y i ( 1 − Φ ( x i ′ β ) ) 1 − y i . \pi(\beta_j | \beta_{-j}, \text{data}) \propto \prod_{i=1}^n
\left(\Phi(x_i' \beta) \right)^{y_i} \left(1 - \Phi(x_i' \beta)
\right)^{1-y_i}. π ( β j ∣ β − j , data ) ∝ i = 1 ∏ n ( Φ ( x i ′ β ) ) y i ( 1 − Φ ( x i ′ β ) ) 1 − y i . I am not sure how to obtain samples from the above conditional.
There is a trick that people use to run the Gibbs sampler for probit regression. Let us introduce random variables w 1 , … , w n w_1, \dots, w_n w 1 , … , w n which, conditional on β \beta β , are independent with
w i ∣ β ∼ N ( x i ′ β , 1 ) . w_i | \beta \sim N(x_i' \beta, 1). w i ∣ β ∼ N ( x i ′ β , 1 ) . We then let
y i = I { w i > 0 } . y_i = I\{w_i > 0\}. y i = I { w i > 0 } . It is then clear that y 1 , … , y n y_1, \dots, y_n y 1 , … , y n defined thus have the same likelihood as in the probit regression model. Let w = ( w 1 , … , w n ) w = (w_1, \dots,
w_n) w = ( w 1 , … , w n ) . The posterior of ( w , β ) (w, \beta) ( w , β ) given the data (data = y = ( y 1 , … , y n ) \text{data} =
y = (y_1, \dots, y_n) data = y = ( y 1 , … , y n ) ) is
π ( w , β ∣ data ) ∝ π ( β ) π ( w ∣ β ) π ( y ∣ w , β ) ∝ ∏ i = 1 n exp ( − 1 2 ( w i − x i ′ β ) 2 ) ∏ i = 1 n ( I { w i > 0 , y i = 1 } + I { w i ≤ 0 , y i = 0 } ) ∝ exp ( − 1 2 ∥ w − X β ∥ 2 ) ∏ i = 1 n ( I { w i > 0 , y i = 1 } + I { w i ≤ 0 , y i = 0 } ) . \begin{align*}
\pi(w, \beta | \text{data}) &\propto \pi(\beta) \pi(w | \beta)
\pi(y | w, \beta) \\
&\propto \prod_{i=1}^n \exp \left(\frac{-1}{2} (w_i - x_i'
\beta)^2\right) \prod_{i=1}^n \left(I\{w_i > 0, y_i = 1\} +
I\{w_i \leq 0, y_i = 0\} \right) \\
&\propto \exp \left(\frac{-1}{2} \|w - X \beta\|^2 \right) \prod_{i=1}^n \left(I\{w_i > 0, y_i = 1\} +
I\{w_i \leq 0, y_i = 0\} \right).
\end{align*} π ( w , β ∣ data ) ∝ π ( β ) π ( w ∣ β ) π ( y ∣ w , β ) ∝ i = 1 ∏ n exp ( 2 − 1 ( w i − x i ′ β ) 2 ) i = 1 ∏ n ( I { w i > 0 , y i = 1 } + I { w i ≤ 0 , y i = 0 } ) ∝ exp ( 2 − 1 ∥ w − Xβ ∥ 2 ) i = 1 ∏ n ( I { w i > 0 , y i = 1 } + I { w i ≤ 0 , y i = 0 } ) . where X X X is, as usual, the design matrix with rows x 1 ′ , … , x n ′ x_1', \dots,
x_n' x 1 ′ , … , x n ′ .
We shall now run the Gibbs sampler with the decomposition θ = ( w , β ) \theta =
(w, \beta) θ = ( w , β ) . For this, we need to sample from the conditionals β ∣ w , data \beta|w, \text{data} β ∣ w , data and w ∣ β , data w | \beta, \text{data} w ∣ β , data .
π ( β ∣ w , data ) ∝ π ( w , β ∣ data ) ∝ exp ( − 1 2 ∥ w − X β ∥ 2 ) ∏ i = 1 n ( I { w i > 0 , y i = 1 } + I { w i ≤ 0 , y i = 0 } ) ∝ exp ( − 1 2 ∥ w − X β ∥ 2 ) \begin{align*}
\pi(\beta | w, \text{data}) &\propto \pi(w, \beta | \text{data})
\\
&\propto \exp \left(\frac{-1}{2} \|w - X \beta\|^2 \right) \prod_{i=1}^n \left(I\{w_i > 0, y_i = 1\} +
I\{w_i \leq 0, y_i = 0\} \right) \\
&\propto \exp \left(\frac{-1}{2} \|w - X \beta\|^2 \right)
\end{align*} π ( β ∣ w , data ) ∝ π ( w , β ∣ data ) ∝ exp ( 2 − 1 ∥ w − Xβ ∥ 2 ) i = 1 ∏ n ( I { w i > 0 , y i = 1 } + I { w i ≤ 0 , y i = 0 } ) ∝ exp ( 2 − 1 ∥ w − Xβ ∥ 2 ) For a fixed w ∈ R n w \in \R^n w ∈ R n , let
β ^ w : = argmin β ∈ R p ∥ w − X β ∥ 2 = ( X ′ X ) − 1 X ′ w \hat{\beta}_w := \underset{\beta \in \R^p}{\text{argmin}} \|w - X
\beta\|^2 = (X' X)^{-1} X' w β ^ w := β ∈ R p argmin ∥ w − Xβ ∥ 2 = ( X ′ X ) − 1 X ′ w so that
∥ w − X β ∥ 2 = ∥ w − X β ^ w ∥ 2 + ∥ X β ^ w − X β ∥ 2 = ∥ w − X β ^ w ∥ 2 + ( β − β ^ w ) ′ ( X ′ X ) ( β − β ^ w ) . \|w - X \beta\|^2 = \|w - X \hat{\beta}_w\|^2 + \|X \hat{\beta}_w -
X \beta\|^2 = \|w - X \hat{\beta}_w\|^2 + \left(\beta -
\hat{\beta}_w \right)' (X' X) \left(\beta -
\hat{\beta}_w \right). ∥ w − Xβ ∥ 2 = ∥ w − X β ^ w ∥ 2 + ∥ X β ^ w − Xβ ∥ 2 = ∥ w − X β ^ w ∥ 2 + ( β − β ^ w ) ′ ( X ′ X ) ( β − β ^ w ) . Thus
π ( β ∣ w , data ) ∝ exp ( − 1 2 ( β − β ^ w ) ′ ( X ′ X ) ( β − β ^ w ) ) . \pi(\beta | w, \text{data}) \propto \exp \left(\frac{-1}{2} \left(\beta -
\hat{\beta}_w \right)' (X' X) \left(\beta -
\hat{\beta}_w \right) \right). π ( β ∣ w , data ) ∝ exp ( 2 − 1 ( β − β ^ w ) ′ ( X ′ X ) ( β − β ^ w ) ) . In other words,
β ∣ w , data ∼ N ( β ^ w , ( X ′ X ) − 1 ) . \beta|w, \text{data} \sim N \left(\hat{\beta}_w, (X' X)^{-1}
\right). β ∣ w , data ∼ N ( β ^ w , ( X ′ X ) − 1 ) . The next task is to figure out the conditional w ∣ β , data w | \beta, \data w ∣ β , data :
π ( w ∣ β , data ) ∝ π ( w , β ∣ data ) ∝ exp ( − 1 2 ∥ w − X β ∥ 2 ) ∏ i = 1 n ( I { w i > 0 , y i = 1 } + I { w i ≤ 0 , y i = 0 } ) ∝ ∏ i = 1 n exp ( − 1 2 ( w i − x i ′ β ) 2 ) ( I { w i > 0 , y i = 1 } + I { w i ≤ 0 , y i = 0 } ) \begin{align*}
\pi(w | \beta, \text{data}) &\propto \pi(w, \beta | \text{data})
\\
&\propto \exp \left(\frac{-1}{2} \|w - X \beta\|^2 \right) \prod_{i=1}^n \left(I\{w_i > 0, y_i = 1\} +
I\{w_i \leq 0, y_i = 0\} \right) \\
& \propto \prod_{i=1}^n \exp \left(\frac{-1}{2} (w_i - x_i' \beta)^2
\right) \left(I\{w_i > 0, y_i = 1\} +
I\{w_i \leq 0, y_i = 0\} \right)
\end{align*} π ( w ∣ β , data ) ∝ π ( w , β ∣ data ) ∝ exp ( 2 − 1 ∥ w − Xβ ∥ 2 ) i = 1 ∏ n ( I { w i > 0 , y i = 1 } + I { w i ≤ 0 , y i = 0 } ) ∝ i = 1 ∏ n exp ( 2 − 1 ( w i − x i ′ β ) 2 ) ( I { w i > 0 , y i = 1 } + I { w i ≤ 0 , y i = 0 } ) Thus w 1 , … , w n w_1, \dots, w_n w 1 , … , w n are independent conditional on β \beta β and data \data data . Further
π ( w i ∣ β , data ) = π ( w i ∣ β , y i ) ∝ exp ( − 1 2 ( w i − x i ′ β ) 2 ) ( I { w i > 0 , y i = 1 } + I { w i ≤ 0 , y i = 0 } ) . \pi(w_i|\beta, \data) = \pi(w_i | \beta, y_i) \propto \exp \left(\frac{-1}{2} (w_i - x_i' \beta)^2
\right) \left(I\{w_i > 0, y_i = 1\} +
I\{w_i \leq 0, y_i = 0\} \right). π ( w i ∣ β , data ) = π ( w i ∣ β , y i ) ∝ exp ( 2 − 1 ( w i − x i ′ β ) 2 ) ( I { w i > 0 , y i = 1 } + I { w i ≤ 0 , y i = 0 } ) . It makes sense now to separate the two cases y i = 1 y_i = 1 y i = 1 and y i = 0 y_i =
0 y i = 0 . For y i = 1 y_i = 1 y i = 1 , we get
π ( w i ∣ β , y i = 1 ) ∝ exp ( − 1 2 ( w i − x i ′ β ) 2 ) I { w i > 0 } . \pi(w_i | \beta, y_i = 1) \propto \exp \left(\frac{-1}{2} (w_i - x_i' \beta)^2
\right) I\{w_i > 0\}. π ( w i ∣ β , y i = 1 ) ∝ exp ( 2 − 1 ( w i − x i ′ β ) 2 ) I { w i > 0 } . Thus, conditional on β \beta β and y i = 1 y_i = 1 y i = 1 , the random variable w i w_i w i has the N ( x i ′ β , 1 ) N(x_i' \beta, 1) N ( x i ′ β , 1 ) distribution trucated to the positive half-line.
Here is a fact (from the wikipedia article for truncated normal distributions: Truncated normal distribution#Generating values from the truncated normal distribution ) that is useful for simulating from truncated normal distributions.
Consider a random variable X trunc X_{\text{trunc}} X trunc having the truncated N ( μ , σ 2 ) N(\mu, \sigma^2) N ( μ , σ 2 ) truncated to the interval ( a , b ) (a, b) ( a , b ) (here a < b a < b a < b ; a a a is allowed to be − ∞ -\infty − ∞ and b b b is allowed to be + ∞ +\infty + ∞ ). The density of X trunc X_{\text{trunc}} X trunc is equal to the conditional density of X X X conditioned on a < X < b a < X < b a < X < b where X ∼ N ( μ , σ 2 ) X \sim
N(\mu, \sigma^2) X ∼ N ( μ , σ 2 ) .
X trunc X_{\text{trunc}} X trunc can be simulated using the equation:
X trunc = μ + σ Φ − 1 ( ( 1 − U ) Φ ( a 0 ) + U Φ ( b 0 ) ) where a 0 : = a − μ σ and b 0 : = b − μ σ X_{\text{trunc}} = \mu + \sigma \Phi^{-1} \left((1 - U)\Phi(a_0) + U
\Phi(b_0) \right) ~~\text{where $a_0 := \frac{a - \mu}{\sigma}$
and $b_0 := \frac{b - \mu}{\sigma}$} X trunc = μ + σ Φ − 1 ( ( 1 − U ) Φ ( a 0 ) + U Φ ( b 0 ) ) where a 0 := σ a − μ and b 0 := σ b − μ and U U U has the uniform distribution on ( 0 , 1 ) (0, 1) ( 0 , 1 ) . I will leave as an exercise for you to verify that the right hand side above indeed has the required truncated normal distribution.
Using this fact, it is straightforward to verify that w i ∣ { β , y i = 1 } w_i|\{\beta,
y_i = 1\} w i ∣ { β , y i = 1 } can be simulated by
x i ′ β + Φ − 1 ( ( 1 − U ) Φ ( − x i ′ β ) + U ) . x_i' \beta + \Phi^{-1} \left((1 - U) \Phi(-x_i' \beta) + U \right). x i ′ β + Φ − 1 ( ( 1 − U ) Φ ( − x i ′ β ) + U ) . This follows directly from the fact above by taking μ = x i ′ β \mu = x_i'
\beta μ = x i ′ β , σ = 1 \sigma = 1 σ = 1 , a = 0 a = 0 a = 0 and b = ∞ b = \infty b = ∞ (this implies that a 0 = − x i ′ β a_0
= -x_i' \beta a 0 = − x i ′ β and b 0 = ∞ b_0 = \infty b 0 = ∞ ).
One can argue analogously to deduce that w i ∣ { β , y i = 0 } w_i | \{\beta, y_i = 0\} w i ∣ { β , y i = 0 } can be simulated by
x i ′ β + Φ − 1 ( U Φ ( − x i ′ β ) ) . x_i' \beta + \Phi^{-1} \left(U \Phi(-x_i' \beta) \right). x i ′ β + Φ − 1 ( U Φ ( − x i ′ β ) ) . The Gibbs sampler for the probit model can be therefore implemented by the following model:
Initialize at β ( 0 ) \beta^{(0)} β ( 0 ) and w ( 0 ) w^{(0)} w ( 0 ) .
Repeat the following for t = 1 , … , N t = 1, \dots, N t = 1 , … , N
Pick J J J to be either 1 or 2 with equal probability.
If J = 1 J = 1 J = 1 , take β ( t + 1 ) ∼ N ( ( X ′ X ) − 1 X ′ w ( t ) , ( X ′ X ) − 1 ) \beta^{(t+1)} \sim N((X'X)^{-1} X'
w^{(t)}, (X' X)^{-1}) β ( t + 1 ) ∼ N (( X ′ X ) − 1 X ′ w ( t ) , ( X ′ X ) − 1 ) and w ( t + 1 ) = w ( t ) w^{(t+1)} = w^{(t)} w ( t + 1 ) = w ( t ) .
If J = 2 J = 2 J = 2 , take β ( t + 1 ) = β ( t ) \beta^{(t+1)} = \beta^{(t)} β ( t + 1 ) = β ( t ) and
w i = { x i ′ β ( t ) + Φ − 1 ( ( 1 − U i ) Φ ( − x i ′ β ( t ) ) + U i ) if y i = 1 x i ′ β ( t ) + Φ − 1 ( U i Φ ( − x i ′ β ( t ) ) ) if y i = 0 w_i =
\begin{cases}
x_i' \beta^{(t)} + \Phi^{-1} \left((1 - U_i) \Phi(-x_i' \beta^{(t)}) + U_i \right) &
\text{if } y_i = 1 \\
x_i' \beta^{(t)} + \Phi^{-1} \left(U_i \Phi(-x_i' \beta^{(t)}) \right) &
\text{if } y_i = 0
\end{cases} w i = { x i ′ β ( t ) + Φ − 1 ( ( 1 − U i ) Φ ( − x i ′ β ( t ) ) + U i ) x i ′ β ( t ) + Φ − 1 ( U i Φ ( − x i ′ β ( t ) ) ) if y i = 1 if y i = 0 where U i ∼ U n i f ( 0 , 1 ) U_i \sim Unif(0, 1) U i ∼ U ni f ( 0 , 1 ) .