Conditional logistic regression is an extension of logistic regression that allows one to account for stratification and matching. Its main field of application is observational studies and in particular epidemiology. It was devised in 1978 by Norman Breslow, Nicholas Day, Katherine Halvorsen, Ross L. Prentice and C. Sabai.[1] It is the most flexible and general procedure for matched data.
Observational studies use stratification or matching as a way to control for confounding.
Logistic regression can account for stratification by having a different constant term for each stratum. Let us denote by $Y_{i\ell}\in\{0,1\}$ the label (e.g. case or control status) of the $\ell$th observation of the $i$th stratum, and by $X_{i\ell}\in\mathbb{R}^p$ the values of the corresponding predictors. Then the likelihood of one observation is
\[
P(Y_{i\ell}=1\mid X_{i\ell})=\frac{\exp(\alpha_i+\boldsymbol\beta^\top X_{i\ell})}{1+\exp(\alpha_i+\boldsymbol\beta^\top X_{i\ell})},
\]
where $\alpha_i$ is the constant term of the $i$th stratum.
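As a purely illustrative sketch (not part of the original treatment), the stratified model above can be fitted as an ordinary logistic regression in which each stratum contributes its own dummy intercept. The simulated data and the variable names (y, x, stratum) below are hypothetical:

```r
## Minimal sketch: stratified logistic regression with one intercept per stratum.
## Data are simulated; variable names are illustrative only.
set.seed(42)
n_strata <- 20                        # number of strata i
n_per    <- 30                        # observations l per stratum
stratum  <- factor(rep(seq_len(n_strata), each = n_per))
alpha    <- rnorm(n_strata)           # stratum-specific constants alpha_i
x        <- rnorm(n_strata * n_per)   # a single predictor X_il (p = 1)
beta     <- 0.8
p1       <- plogis(alpha[stratum] + beta * x)   # P(Y_il = 1 | X_il)
y        <- rbinom(length(x), 1, p1)

## One dummy intercept per stratum plus the common slope beta:
fit <- glm(y ~ stratum + x - 1, family = binomial)
coef(fit)["x"]   # estimate of beta; reasonable here because the strata are large
```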
For example, consider estimating the impact of exercise on the risk of cardiovascular disease. If people who exercise more are younger, have better access to healthcare, or have other differences that improve their health, then a logistic regression of cardiovascular disease incidence on minutes spent exercising may overestimate the impact of exercise on health. To address this, we can group people based on demographic characteristics like age and zip code of their home residence. Each stratum then consists of people who share these characteristics, indexed within the stratum by $\ell$. The predictor $X_{i\ell}$ records the minutes spent exercising by the $\ell$th person of stratum $i$, the constant term $\alpha_i$ absorbs the effect of the shared demographic confounders on $Y_{i\ell}$, and $\boldsymbol\beta$ measures the within-stratum effect of exercise $X_{i\ell}$ on cardiovascular disease.
Logistic regression as described above works satisfactorily when the number of strata is small relative to the amount of data. If we hold the number of strata fixed and increase the amount of data, estimates of the model parameters ($\alpha_i$ for each stratum and the common vector $\boldsymbol\beta$) converge to their true values.

Pathological behavior, however, occurs when we have many small strata, because the number of parameters then grows with the amount of data. For example, if each stratum contains two datapoints, a model fitted to $N$ datapoints has $N/2+p$ parameters: one constant term for each of the $N/2$ strata plus the $p$ components of $\boldsymbol\beta$. Since the number of nuisance parameters $\alpha_i$ grows at the same rate as the sample size, maximum likelihood estimation is no longer consistent; in the case of 1:1 matched pairs, the estimate of $\boldsymbol\beta$ converges to twice its true value.
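To see the problem concretely, the following sketch (simulated data, not from the source) fits an ordinary logistic regression with one dummy intercept per matched pair and shows the inflated slope estimate:

```r
## Sketch: incidental-parameter bias with many small strata (simulated data).
## With one intercept per matched pair, unconditional maximum likelihood
## over-estimates the slope; for 1:1 pairs it tends toward roughly 2 * beta.
set.seed(1)
n_pairs <- 500
beta    <- 1
alpha   <- rnorm(n_pairs)                                # nuisance intercept alpha_i per pair
x       <- matrix(rnorm(2 * n_pairs), nrow = n_pairs)    # rows = pairs, cols = pair members
p1      <- plogis(alpha + beta * x)                      # alpha recycles down both columns
y       <- matrix(rbinom(2 * n_pairs, 1, p1), nrow = n_pairs)

dat <- data.frame(
  y    = as.vector(t(y)),
  x    = as.vector(t(x)),
  pair = factor(rep(seq_len(n_pairs), each = 2))
)

## Separation warnings from concordant pairs are expected here.
fit <- glm(y ~ x + pair, family = binomial, data = dat,
           control = glm.control(maxit = 50))
coef(fit)["x"]    # typically close to 2 rather than the true value 1
```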
In addition to tests based on logistic regression, several other tests for matched data existed before conditional logistic regression was introduced (see related tests). However, they did not allow for the analysis of continuous predictors with arbitrary stratum sizes, and they lack the flexibility of conditional logistic regression, in particular the possibility of controlling for covariates.
Conditional logistic regression uses a conditional likelihood approach that deals with the above pathological behavior by conditioning on the number of cases in each stratum. This eliminates the need to estimate the stratum parameters $\alpha_i$.
When the strata are pairs, where the first observation is a case and the second is a control, this can be seen as follows:
\begin{align}
& P(Y_{i1}=1,\,Y_{i2}=0 \mid X_{i1},X_{i2},\,Y_{i1}+Y_{i2}=1)\\[6pt]
&= \frac{P(Y_{i1}=1\mid X_{i1})\,P(Y_{i2}=0\mid X_{i2})}{P(Y_{i1}=1\mid X_{i1})\,P(Y_{i2}=0\mid X_{i2})+P(Y_{i1}=0\mid X_{i1})\,P(Y_{i2}=1\mid X_{i2})}\\[6pt]
&= \frac{\dfrac{\exp(\alpha_i+\boldsymbol\beta^\top X_{i1})}{1+\exp(\alpha_i+\boldsymbol\beta^\top X_{i1})}\times\dfrac{1}{1+\exp(\alpha_i+\boldsymbol\beta^\top X_{i2})}}{\dfrac{\exp(\alpha_i+\boldsymbol\beta^\top X_{i1})}{1+\exp(\alpha_i+\boldsymbol\beta^\top X_{i1})}\times\dfrac{1}{1+\exp(\alpha_i+\boldsymbol\beta^\top X_{i2})}+\dfrac{1}{1+\exp(\alpha_i+\boldsymbol\beta^\top X_{i1})}\times\dfrac{\exp(\alpha_i+\boldsymbol\beta^\top X_{i2})}{1+\exp(\alpha_i+\boldsymbol\beta^\top X_{i2})}}\\[6pt]
&= \frac{\exp(\boldsymbol\beta^\top X_{i1})}{\exp(\boldsymbol\beta^\top X_{i1})+\exp(\boldsymbol\beta^\top X_{i2})}.
\end{align}
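The last equality shows that the stratum constant $\alpha_i$ has dropped out of the conditional probability. A small numerical check of this cancellation (a sketch with arbitrary made-up values, not from the source):

```r
## Numerical check that the conditional probability for a matched pair
## does not depend on the stratum constant alpha_i. Values are arbitrary.
expit <- function(z) exp(z) / (1 + exp(z))

beta <- c(0.5, -1.2)     # arbitrary coefficient vector
x1   <- c(1.0,  0.3)     # predictors of the case,    X_i1
x2   <- c(0.2,  0.8)     # predictors of the control, X_i2

cond_prob <- function(alpha_i) {
  p1 <- expit(alpha_i + sum(beta * x1))   # P(Y_i1 = 1 | X_i1)
  p2 <- expit(alpha_i + sum(beta * x2))   # P(Y_i2 = 1 | X_i2)
  p1 * (1 - p2) / (p1 * (1 - p2) + (1 - p1) * p2)
}

sapply(c(-3, 0, 4), cond_prob)            # identical values for any alpha_i
exp(sum(beta * x1)) / (exp(sum(beta * x1)) + exp(sum(beta * x2)))   # same value
```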
With similar computations, the conditional likelihood of a stratum of size $m$ in which the first $k$ observations are the cases is
\[
P\Bigl(Y_{ij}=1 \text{ for } j\leq k,\; Y_{ij}=0 \text{ for } k<j\leq m \,\Bigm|\, X_{i1},\ldots,X_{im},\ \sum_{j=1}^{m}Y_{ij}=k\Bigr)
= \frac{\prod_{j=1}^{k}\exp(\boldsymbol\beta^\top X_{ij})}{\sum_{J\in\mathcal{C}_k^m}\prod_{j\in J}\exp(\boldsymbol\beta^\top X_{ij})},
\]
where $\mathcal{C}_k^m$ is the set of all subsets of size $k$ of the set $\{1,\ldots,m\}$.
The full conditional log likelihood is then simply the sum of the log likelihoods for each stratum. The estimator is defined as the $\boldsymbol\beta$ that maximizes the conditional log likelihood.
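As a purely illustrative sketch (not from the source), for small strata the conditional log likelihood above can be written down and maximized directly; the function names and the simulated matched pairs below are hypothetical:

```r
## Sketch: conditional log likelihood for one stratum and its maximization.
## strata is a list; each element holds a predictor matrix X (one row per
## observation) and a 0/1 outcome vector y. All names and data are illustrative.
stratum_cond_loglik <- function(beta, X, y) {
  eta <- as.vector(X %*% beta)                   # beta' X_ij for each observation j
  k   <- sum(y)
  if (k == 0 || k == length(y)) return(0)        # concordant strata carry no information
  num   <- sum(eta[y == 1])                      # log numerator: product over observed cases
  sets  <- combn(length(y), k)                   # all subsets J of size k
  denom <- log(sum(apply(sets, 2, function(J) exp(sum(eta[J])))))
  num - denom
}

cond_loglik <- function(beta, strata) {
  sum(vapply(strata, function(s) stratum_cond_loglik(beta, s$X, s$y), numeric(1)))
}

## Simulated 1:1 matched pairs with a single predictor and true beta = 1.
set.seed(7)
strata <- lapply(1:200, function(i) {
  X <- matrix(rnorm(2), ncol = 1)
  a <- rnorm(1)                                  # alpha_i, never used by the estimator
  y <- rbinom(2, 1, plogis(a + X[, 1]))
  list(X = X, y = y)
})

## Maximize the conditional log likelihood over beta (one-dimensional here);
## the result should be near the true value 1, up to sampling noise.
optimize(function(b) cond_loglik(b, strata), interval = c(-5, 5), maximum = TRUE)$maximum
```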
Conditional logistic regression is available in R as the function clogit in the survival package. It is in the survival package because the log likelihood of a conditional logistic model is the same as the log likelihood of a Cox model with a particular data structure.[3]
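A short usage sketch follows; it uses R's built-in infert data (a 1:2 matched case-control study) in the pattern of the clogit documentation, and the particular covariates chosen are only illustrative:

```r
## Sketch of a clogit call on the built-in infert matched case-control data.
library(survival)

fit <- clogit(case ~ spontaneous + induced + strata(stratum), data = infert)
summary(fit)      # coefficients are conditional log odds ratios
exp(coef(fit))    # odds ratios
```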
It is also available in Python through the statsmodels package starting with version 0.14.[4]