In statistics, a probit model is a type of regression where the dependent variable can take only two values, for example married or not married. The word is a portmanteau, coming from probability + unit.[1] The purpose of the model is to estimate the probability that an observation with particular characteristics will fall into a specific one of the categories; moreover, classifying observations based on their predicted probabilities makes the probit model a type of binary classification model.
A probit model is a popular specification for a binary response model. As such it treats the same set of problems as does logistic regression using similar techniques. When viewed in the generalized linear model framework, the probit model employs a probit link function.[2] It is most often estimated using the maximum likelihood procedure,[3] such an estimation being called a probit regression.
Suppose a response variable Y is binary, that is, it can have only two possible outcomes, which we will denote as 1 and 0. For example, Y may represent presence/absence of a certain condition, success/failure of some device, answer yes/no on a survey, etc. We also have a vector of regressors X, which are assumed to influence the outcome Y. Specifically, we assume that the model takes the form
$$P(Y=1 \mid X) = \Phi(X^{\operatorname{T}}\beta),$$
where $\Phi$ is the cumulative distribution function of the standard normal distribution.
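As a concrete illustration, the following minimal Python sketch evaluates this probability for a hypothetical coefficient vector; the numbers are invented for the example, and SciPy's standard normal CDF plays the role of $\Phi$.

```python
# Minimal sketch: evaluating P(Y = 1 | X) = Phi(X'beta) for hypothetical values.
import numpy as np
from scipy.stats import norm

beta = np.array([-0.5, 1.2])     # hypothetical coefficients (intercept, slope)
x = np.array([1.0, 0.8])         # one observation, with a leading 1 for the intercept
p = norm.cdf(x @ beta)           # Phi(x'beta) = Phi(0.46), roughly 0.68
print(p)
```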
It is possible to motivate the probit model as a latent variable model. Suppose there exists an auxiliary random variable
$$Y^{\ast} = X^{\operatorname{T}}\beta + \varepsilon,$$
where $\varepsilon \sim N(0,1)$. Then $Y$ can be viewed as an indicator for whether this latent variable is positive:
$$Y = \begin{cases} 1 & \text{if } Y^{\ast} > 0 \\ 0 & \text{otherwise} \end{cases}
= \begin{cases} 1 & \text{if } X^{\operatorname{T}}\beta + \varepsilon > 0 \\ 0 & \text{otherwise} \end{cases}$$
The use of the standard normal distribution causes no loss of generality compared with the use of a normal distribution with an arbitrary mean and standard deviation, because adding a fixed amount to the mean can be compensated by subtracting the same amount from the intercept, and multiplying the standard deviation by a fixed amount can be compensated by multiplying the weights by the same amount.
To see that the two models are equivalent, note that
$$\begin{align}
P(Y=1\mid X) &= P(Y^{\ast} > 0) \\
&= P(X^{\operatorname{T}}\beta + \varepsilon > 0) \\
&= P(\varepsilon > -X^{\operatorname{T}}\beta) \\
&= P(\varepsilon < X^{\operatorname{T}}\beta) && \text{by symmetry of the normal distribution} \\
&= \Phi(X^{\operatorname{T}}\beta)
\end{align}$$
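A quick way to see this equivalence numerically is to simulate the latent-variable mechanism and compare the empirical frequency of $Y=1$ with $\Phi(X^{\operatorname{T}}\beta)$. The sketch below does so for a single fixed $x$, with made-up parameter values.

```python
# Sketch: simulate Y* = x'beta + eps with eps ~ N(0,1), set Y = 1[Y* > 0],
# and check that the share of Y = 1 matches Phi(x'beta).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
beta = np.array([-0.5, 1.2])         # hypothetical coefficients
x = np.array([1.0, 0.8])             # one fixed observation

eps = rng.standard_normal(100_000)
y_star = x @ beta + eps              # latent variable
y = (y_star > 0).astype(int)         # observed binary outcome

print(y.mean(), norm.cdf(x @ beta))  # both close to 0.68
```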
Suppose a data set $\{y_i, x_i\}_{i=1}^{n}$ contains $n$ independent statistical units corresponding to the model above.
For a single observation, conditional on the vector of inputs of that observation, we have
$$P(y_i = 1 \mid x_i) = \Phi(x_i^{\operatorname{T}}\beta),$$
$$P(y_i = 0 \mid x_i) = 1 - \Phi(x_i^{\operatorname{T}}\beta),$$
where $x_i$ is a $K \times 1$ vector of regressors and $\beta$ is a $K \times 1$ vector of coefficients.
The likelihood of a single observation $(y_i, x_i)$ is then
$$\mathcal{L}(\beta; y_i, x_i) = \Phi(x_i^{\operatorname{T}}\beta)^{y_i}\,\bigl[1 - \Phi(x_i^{\operatorname{T}}\beta)\bigr]^{(1-y_i)}.$$
In fact, if $y_i = 1$, then $\mathcal{L}(\beta; y_i, x_i) = \Phi(x_i^{\operatorname{T}}\beta)$, and if $y_i = 0$, then $\mathcal{L}(\beta; y_i, x_i) = 1 - \Phi(x_i^{\operatorname{T}}\beta)$.
Since the observations are independent and identically distributed, the likelihood of the entire sample, or joint likelihood, equals the product of the likelihoods of the single observations:
$$\mathcal{L}(\beta; Y, X) = \prod_{i=1}^{n} \left( \Phi(x_i^{\operatorname{T}}\beta)^{y_i}\,\bigl[1 - \Phi(x_i^{\operatorname{T}}\beta)\bigr]^{(1-y_i)} \right)$$
The joint log-likelihood function is thus
$$\ln \mathcal{L}(\beta; Y, X) = \sum_{i=1}^{n} \Bigl( y_i \ln \Phi(x_i^{\operatorname{T}}\beta) + (1 - y_i)\ln\bigl(1 - \Phi(x_i^{\operatorname{T}}\beta)\bigr) \Bigr)$$
The estimator $\hat\beta$ that maximizes this function will be consistent, asymptotically normal, and efficient provided that $\operatorname{E}[XX^{\operatorname{T}}]$ exists and is not singular. It can be shown that this log-likelihood function is globally concave in $\beta$, so standard numerical algorithms for optimization converge rapidly to the unique maximum.
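As an illustration, here is a minimal sketch of this maximum likelihood estimation in Python, assuming NumPy and SciPy are available; the helper names (`probit_neg_log_likelihood`, `fit_probit`) are invented for the example. Because the log-likelihood is globally concave, a standard quasi-Newton optimizer converges to the unique maximum.

```python
# Sketch of probit maximum likelihood: minimize the negative joint log-likelihood.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def probit_neg_log_likelihood(beta, X, y):
    """Negative of sum_i [ y_i ln Phi(x_i'beta) + (1 - y_i) ln(1 - Phi(x_i'beta)) ]."""
    p = np.clip(norm.cdf(X @ beta), 1e-12, 1 - 1e-12)   # clip to avoid log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_probit(X, y):
    """Return the maximizer beta_hat of the probit log-likelihood."""
    beta_start = np.zeros(X.shape[1])
    res = minimize(probit_neg_log_likelihood, beta_start, args=(X, y), method="BFGS")
    return res.x
```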
The asymptotic distribution of $\hat\beta$ is given by
$$\sqrt{n}\,(\hat\beta - \beta) \ \xrightarrow{d}\ \mathcal{N}(0,\, \Omega^{-1}),$$
where
$$\Omega = \operatorname{E}\!\left[\frac{\varphi^{2}(X^{\operatorname{T}}\beta)}{\Phi(X^{\operatorname{T}}\beta)\bigl(1 - \Phi(X^{\operatorname{T}}\beta)\bigr)}\, XX^{\operatorname{T}}\right], \qquad
\hat\Omega = \frac{1}{n} \sum_{i=1}^{n} \frac{\varphi^{2}(x_i^{\operatorname{T}}\hat\beta)}{\Phi(x_i^{\operatorname{T}}\hat\beta)\bigl(1 - \Phi(x_i^{\operatorname{T}}\hat\beta)\bigr)}\, x_i x_i^{\operatorname{T}},$$
and $\varphi = \Phi'$ is the probability density function of the standard normal distribution.
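A plug-in estimate of $\Omega$ then yields approximate standard errors via $\widehat{\operatorname{Var}}(\hat\beta) \approx \hat\Omega^{-1}/n$. The sketch below assumes $\hat\beta$ comes from a fit such as the `fit_probit` example above.

```python
# Sketch: estimate Omega by its sample analogue and derive asymptotic standard errors.
import numpy as np
from scipy.stats import norm

def probit_standard_errors(X, beta_hat):
    n = X.shape[0]
    xb = X @ beta_hat
    phi = norm.pdf(xb)                              # varphi(x_i'beta_hat)
    Phi = np.clip(norm.cdf(xb), 1e-12, 1 - 1e-12)   # Phi(x_i'beta_hat)
    w = phi ** 2 / (Phi * (1 - Phi))                # per-observation weight
    Omega_hat = (X * w[:, None]).T @ X / n          # (1/n) sum_i w_i x_i x_i'
    cov = np.linalg.inv(Omega_hat) / n              # Var(beta_hat) ~ Omega^{-1} / n
    return np.sqrt(np.diag(cov))
```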
Semi-parametric and non-parametric maximum likelihood methods for probit-type and other related models are also available.[4]
See main article: Minimum chi-square estimation.
This method can be applied only when there are many observations of the response variable $y_i$ having the same value of the vector of regressors $x_i$ (such a situation may be referred to as "many observations per cell"). More specifically, suppose that among the $n$ observations $\{y_i, x_i\}_{i=1}^{n}$ there are only $T$ distinct values of the regressors, denoted $\{x_{(1)}, \ldots, x_{(T)}\}$. Let $n_t$ be the number of observations with $x_i = x_{(t)}$, and $r_t$ the number of such observations with $y_i = 1$. We assume that there are indeed "many" observations per cell: for each $t$, $\lim_{n\to\infty} n_t/n = c_t > 0$.
Denote
$$\hat{p}_t = r_t / n_t, \qquad \hat\sigma_t^{2} = \frac{1}{n_t}\,\frac{\hat{p}_t (1 - \hat{p}_t)}{\varphi^{2}\bigl(\Phi^{-1}(\hat{p}_t)\bigr)}.$$
Then Berkson's minimum chi-square estimator is a generalized least squares estimator in a regression of $\Phi^{-1}(\hat{p}_t)$ on $x_{(t)}$ with weights $\hat\sigma_t^{-2}$:
$$\hat\beta = \left(\sum_{t=1}^{T} \hat\sigma_t^{-2}\, x_{(t)} x_{(t)}^{\operatorname{T}}\right)^{-1} \sum_{t=1}^{T} \hat\sigma_t^{-2}\, x_{(t)}\, \Phi^{-1}(\hat{p}_t)$$
It can be shown that this estimator is consistent (as n→∞ and T fixed), asymptotically normal and efficient. Its advantage is the presence of a closed-form formula for the estimator. However, it is only meaningful to carry out this analysis when individual observations are not available, only their aggregated counts $r_t$, $n_t$, and $x_{(t)}$ (for example, in the analysis of voting behavior).
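Because the estimator has a closed form, it amounts to a few lines of weighted least squares. The sketch below assumes cell-level inputs (the distinct regressor vectors as rows of `X_cells`, with counts `r` and `n_cells`) and that every $\hat{p}_t$ lies strictly between 0 and 1; the function name is illustrative only.

```python
# Sketch of Berkson's minimum chi-square estimator from aggregated cell counts.
import numpy as np
from scipy.stats import norm

def berkson_min_chi_square(X_cells, r, n_cells):
    p_hat = r / n_cells                              # hat p_t = r_t / n_t, must lie in (0, 1)
    z = norm.ppf(p_hat)                              # Phi^{-1}(hat p_t)
    sigma2 = p_hat * (1 - p_hat) / (n_cells * norm.pdf(z) ** 2)   # hat sigma_t^2
    w = 1.0 / sigma2                                 # GLS weights hat sigma_t^{-2}
    XtWX = (X_cells * w[:, None]).T @ X_cells
    XtWz = (X_cells * w[:, None]).T @ z
    return np.linalg.solve(XtWX, XtWz)               # hat beta
```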
Gibbs sampling of a probit model is possible because regression models typically use normal prior distributions over the weights, and this distribution is conjugate with the normal distribution of the errors (and hence of the latent variables Y*). The model can be described as
$$\begin{align}
\boldsymbol\beta &\sim \mathcal{N}(b_0, B_0) \\[3pt]
y_i^{\ast} \mid x_i, \boldsymbol\beta &\sim \mathcal{N}(x_i^{\operatorname{T}}\boldsymbol\beta,\, 1) \\[3pt]
y_i &= \begin{cases} 1 & \text{if } y_i^{\ast} > 0 \\ 0 & \text{otherwise} \end{cases}
\end{align}$$
From this, we can determine the full conditional densities needed:
$$\begin{align}
B &= \bigl(B_0^{-1} + X^{\operatorname{T}}X\bigr)^{-1} \\[3pt]
\boldsymbol\beta \mid \mathbf{y}^{\ast} &\sim \mathcal{N}\bigl(B(B_0^{-1} b_0 + X^{\operatorname{T}}\mathbf{y}^{\ast}),\, B\bigr) \\[3pt]
y_i^{\ast} \mid y_i = 0, x_i, \boldsymbol\beta &\sim \mathcal{N}(x_i^{\operatorname{T}}\boldsymbol\beta,\, 1)\,[y_i^{\ast} < 0] \\[3pt]
y_i^{\ast} \mid y_i = 1, x_i, \boldsymbol\beta &\sim \mathcal{N}(x_i^{\operatorname{T}}\boldsymbol\beta,\, 1)\,[y_i^{\ast} \ge 0]
\end{align}$$
The result for $\boldsymbol\beta$ is the standard posterior of a Bayesian linear regression with known error variance, since conditional on the latent variables $\mathbf{y}^{\ast}$ the model reduces to an ordinary normal linear model.
The only trickiness is in the last two equations. The notation $[y_i^{\ast} < 0]$ is the Iverson bracket, sometimes written $\mathcal{I}(y_i^{\ast} < 0)$ or similar; it indicates that the normal distribution is truncated to the indicated range and rescaled appropriately, i.e. it is a truncated normal distribution. Rejection sampling from the untruncated normal works well when most of the mass survives the truncation, but becomes inefficient when the retained region lies far in a tail (for example, when $x_i^{\operatorname{T}}\boldsymbol\beta$ is large in magnitude and a sample of the opposite sign is required); in that case a dedicated truncated-normal sampler should be used, such as the R function rtnorm
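Putting the pieces together, the following is a minimal sketch of the sampler in Python, using `scipy.stats.truncnorm` for the truncated-normal draws; the function name and prior arguments are illustrative only.

```python
# Sketch of Gibbs sampling for the probit model with a N(b0, B0) prior on beta.
import numpy as np
from scipy.stats import truncnorm

def probit_gibbs(X, y, b0, B0, n_iter=2000, seed=0):
    rng = np.random.default_rng(seed)
    n, k = X.shape
    B0_inv = np.linalg.inv(B0)
    B = np.linalg.inv(B0_inv + X.T @ X)      # posterior covariance of beta (does not change)
    beta = np.zeros(k)
    draws = np.empty((n_iter, k))
    for s in range(n_iter):
        # Draw latent y*_i from N(x_i'beta, 1) truncated to (0, inf) if y_i = 1, (-inf, 0) if y_i = 0.
        mu = X @ beta
        lower = np.where(y == 1, 0.0, -np.inf)
        upper = np.where(y == 1, np.inf, 0.0)
        y_star = truncnorm.rvs(lower - mu, upper - mu, loc=mu, scale=1.0, random_state=rng)
        # Draw beta from its conditional normal given the latent variables.
        mean = B @ (B0_inv @ b0 + X.T @ y_star)
        beta = rng.multivariate_normal(mean, B)
        draws[s] = beta
    return draws
```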
for generating truncated-normal samples.
The suitability of an estimated binary model can be evaluated by counting the number of observations equal to 1 and the number equal to 0 for which the model assigns the correct predicted classification, treating any estimated probability above 1/2 as a prediction of 1 (and any below 1/2 as a prediction of 0).
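For instance, a hypothetical helper along the following lines tallies the correctly classified ones and zeros from predicted probabilities:

```python
# Sketch: count correctly classified observations using a 1/2 probability threshold.
import numpy as np

def classification_counts(y, p_hat):
    pred = (p_hat > 0.5).astype(int)                      # predicted class
    correct_ones = int(np.sum((y == 1) & (pred == 1)))    # observed 1s predicted as 1
    correct_zeros = int(np.sum((y == 0) & (pred == 0)))   # observed 0s predicted as 0
    return correct_ones, correct_zeros
```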
Consider the latent variable model formulation of the probit model. When the variance of $\varepsilon$ conditional on $x$ is not constant but depends on $x$, a heteroskedasticity issue arises. For example, suppose $y^{\ast} = \beta_0 + \beta_1 x_1 + \varepsilon$ and $\varepsilon \mid x \sim N(0, x_1^{2})$, where $x_1$ is a positive continuous variable. Under this heteroskedasticity, the probit estimator for $\beta$ is usually inconsistent, and most tests about the coefficients are invalid. More importantly, the estimator for $P(y=1 \mid x)$ becomes inconsistent as well. To deal with this problem, the original model must be transformed to be homoskedastic. For instance, in the same example, $1[\beta_0 + \beta_1 x_1 + \varepsilon > 0]$ can be rewritten as $1[\beta_0/x_1 + \beta_1 + \varepsilon/x_1 > 0]$, where $\varepsilon/x_1 \mid x \sim N(0,1)$. Therefore, $P(y=1 \mid x) = \Phi(\beta_1 + \beta_0/x_1)$, and running probit on the regressors $(1, 1/x_1)$ generates a consistent estimator of the conditional probability $P(y=1 \mid x)$.
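To illustrate, the following simulation sketch (with invented parameter values) generates data with $\operatorname{Var}(\varepsilon \mid x) = x_1^{2}$ and fits a probit on the transformed regressors $(1, 1/x_1)$, here using the statsmodels package as an assumed dependency; the constant's coefficient then estimates $\beta_1$ and the coefficient on $1/x_1$ estimates $\beta_0$.

```python
# Sketch: heteroskedastic latent-variable model, estimated consistently after transforming.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
beta0, beta1 = -1.0, 2.0                        # invented true parameters
x1 = rng.uniform(0.5, 3.0, size=5000)           # positive continuous regressor
eps = rng.normal(0.0, x1)                       # sd(eps | x) = x1, i.e. Var = x1^2
y = (beta0 + beta1 * x1 + eps > 0).astype(int)

X_transformed = np.column_stack([np.ones_like(x1), 1.0 / x1])   # regressors (1, 1/x1)
fit = sm.Probit(y, X_transformed).fit(disp=0)
print(fit.params)                               # approximately [beta1, beta0]
```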
When the assumption that $\varepsilon$ is normally distributed fails to hold, the model is distributionally misspecified: if it is nonetheless estimated as a probit model, the estimators of the coefficients $\beta$ are inconsistent. For instance, if $\varepsilon$ follows a logistic distribution in the true model but the model is estimated by probit, the estimated coefficients differ systematically from the true values. Even so, this inconsistency of the coefficient estimates is often practically unimportant, because the implied estimates of the partial effects, $\partial P(y=1 \mid x)/\partial x_{i'}$, are typically close to those given by the true model.
To avoid the issue of distributional misspecification, one may adopt a general distributional assumption for the error term, such that many different types of distribution can be included in the model. The cost is heavier computation and lower accuracy as the number of parameters increases.[6] In most practical cases where the distributional form is misspecified, the estimators of the coefficients are inconsistent, but estimators of the conditional probability and of the partial effects remain quite good.
One can also take semi-parametric or non-parametric approaches, e.g., via local-likelihood or nonparametric quasi-likelihood methods, which avoid assumptions on a parametric form for the index function and are robust to the choice of the link function (e.g., probit or logit).[7]
The probit model is usually credited to Chester Bliss, who coined the term "probit" in 1934,[8] and to John Gaddum (1933), who systematized earlier work. However, the basic model dates to the Weber–Fechner law by Gustav Fechner, published in 1860, and was repeatedly rediscovered until the 1930s.
A fast method for computing maximum likelihood estimates for the probit model was proposed by Ronald Fisher as an appendix to Bliss' work in 1935.[9]