In machine learning, local case-control sampling [1] is an algorithm used to reduce the complexity of training a logistic regression classifier. The algorithm reduces the training complexity by selecting a small subsample of the original dataset for training. It assumes the availability of a (possibly unreliable) pilot estimate of the parameters. It then performs a single pass over the entire dataset using the pilot estimate to identify the most "surprising" samples. In practice, the pilot may come from prior knowledge or from training on a subsample of the dataset. The algorithm is most effective when the underlying dataset is imbalanced. It exploits the structure of conditionally imbalanced datasets more efficiently than alternative methods, such as case-control sampling and weighted case-control sampling.
In classification, a dataset is a set of $N$ data points $(x_i, y_i)_{i=1}^N$, where $x_i \in \mathbb{R}^d$ is a feature vector and $y_i \in \{0,1\}$ is a binary label.
Formally, an imbalanced dataset exhibits one or more of the following properties:

- Marginal imbalance: one class is rare compared to the other, i.e. $P(Y=1) \approx 0$.
- Conditional imbalance: the label is nearly deterministic given the features for most feature values. For example, with a binary feature $X \in \{0,1\}$, a dataset is conditionally imbalanced if $P(Y=1 \mid X=0) \approx 0$ and $P(Y=1 \mid X=1) \approx 1$.
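As a toy illustration (not from the source), the following Python sketch builds a dataset that is conditionally imbalanced in the sense above; the feature distribution and the probabilities 0.001/0.999 are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Single binary feature; the label almost always agrees with the feature.
x = rng.integers(0, 2, size=N)          # X in {0, 1}
p = np.where(x == 0, 0.001, 0.999)      # P(Y=1|X=0) ~ 0, P(Y=1|X=1) ~ 1
y = rng.binomial(1, p)

# Marginally the classes are roughly balanced, so the imbalance is purely conditional.
print(y.mean())                         # ~ 0.5
```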
In logistic regression, given the model $\theta = (\alpha, \beta)$, the prediction is made according to $P(Y=1 \mid X=x; \theta) = \tilde{p}_\theta(x) = \dfrac{\exp(\alpha + \beta^T x)}{1 + \exp(\alpha + \beta^T x)}$.
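A minimal sketch of this prediction rule in Python (the function name and array conventions are ours, not from the source):

```python
import numpy as np

def predict_proba(alpha, beta, X):
    """P(Y=1 | X; theta) for theta = (alpha, beta), with rows of X as inputs."""
    z = alpha + X @ beta
    # exp(z) / (1 + exp(z)); scipy.special.expit(z) is a numerically
    # stabler equivalent for large |z|.
    return np.exp(z) / (1.0 + np.exp(z))
```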
The local case-control sampling algorithm assumes the availability of a pilot model $\tilde{\theta} = (\tilde{\alpha}, \tilde{\beta})$. Given the pilot, each data point $(x, y)$ is accepted with probability $a(x, y) = |y - \tilde{p}_{\tilde{\theta}}(x)|$, and the algorithm proceeds as follows:

1. Generate independent $z_i \sim \mathrm{Bernoulli}(a(x_i, y_i))$ for $i \in \{1, \ldots, N\}$.
2. Fit a logistic regression model to the subsample $S = \{(x_i, y_i) : z_i = 1\}$, obtaining the unadjusted estimates $\hat{\theta}_S = (\hat{\alpha}_S, \hat{\beta}_S)$.
3. Output the adjusted model $\hat{\theta} = (\hat{\alpha}, \hat{\beta})$, where $\hat{\alpha} \leftarrow \hat{\alpha}_S + \tilde{\alpha}$ and $\hat{\beta} \leftarrow \hat{\beta}_S + \tilde{\beta}$.
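A sketch of these three steps in Python, assuming a generic `fit_logistic(X, y)` routine that returns `(alpha, beta)`; all names here are ours, not from the source:

```python
import numpy as np

def local_case_control_fit(X, y, pilot, fit_logistic, rng=None):
    """Single-pass local case-control sampling (sketch)."""
    rng = rng or np.random.default_rng()
    alpha_t, beta_t = pilot

    # Acceptance probability a(x, y) = |y - p_pilot(x)|.
    p_pilot = 1.0 / (1.0 + np.exp(-(alpha_t + X @ beta_t)))
    a = np.abs(y - p_pilot)

    # Step 1: z_i ~ Bernoulli(a(x_i, y_i)), independently for each point.
    keep = rng.random(len(y)) < a

    # Step 2: fit on the accepted subsample S.
    alpha_s, beta_s = fit_logistic(X[keep], y[keep])

    # Step 3: add the pilot back to obtain the adjusted estimates.
    return alpha_s + alpha_t, beta_s + beta_t
```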
The algorithm can be understood as selecting samples that surprise the pilot model. Intuitively, these samples are closer to the decision boundary of the classifier and are thus more informative.
In practice, for cases where a pilot model is naturally available, the algorithm can be applied directly to reduce the complexity of training. In cases where a natural pilot is nonexistent, an estimate obtained from a subsample selected through another sampling technique can be used instead. In the original paper describing the algorithm, the authors propose to use weighted case-control sampling with half the assigned sampling budget: for example, if the objective is to use a subsample of size $N = 1000$, first estimate the pilot model $\tilde{\theta}$ from $N_h = 500$ samples drawn by weighted case-control sampling, then collect the remaining $N_h = 500$ samples using local case-control sampling with that pilot.
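A hedged sketch of this two-stage pipeline, reusing `local_case_control_fit` from above; `weighted_case_control_fit` is a hypothetical helper standing in for any weighted case-control implementation:

```python
def two_stage_fit(X, y, budget, fit_logistic, weighted_case_control_fit):
    # Stage 1: spend half the budget estimating the pilot (hypothetical helper).
    pilot = weighted_case_control_fit(X, y, n_samples=budget // 2)
    # Stage 2: one local case-control pass driven by that pilot; the expected
    # subsample size can be tuned with the constant c described below.
    return local_case_control_fit(X, y, pilot, fit_logistic)
```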
It is possible to control the sample size by multiplying the acceptance probability by a constant $c$. For a larger sample size, pick $c > 1$ and adjust the acceptance probability to $\min(c \cdot a(x_i, y_i), 1)$.
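For example (our helper, not from the source):

```python
import numpy as np

def scaled_acceptance(a, c):
    """min(c * a(x_i, y_i), 1): scales acceptance probabilities, capped at 1."""
    return np.minimum(c * a, 1.0)

# c > 1 raises the expected subsample size (the sum of acceptance probabilities).
print(scaled_acceptance(np.array([0.05, 0.3, 0.7]), 2.0))   # [0.1 0.6 1. ]
```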
The algorithm has the following properties. When the pilot is consistent, the estimates obtained from local case-control samples are consistent even under model misspecification. If the model is correct, then the algorithm has exactly twice the asymptotic variance of logistic regression on the full dataset. For a larger sample size with $c > 1$, the factor 2 improves to $1 + \frac{1}{c}$.