The set balancing problem in mathematics is the problem of dividing a set to two subsets that have roughly the same characteristics. It arises naturally in design of experiments.[1]
There is a group of subjects. Each subject has several features, which are considered binary. For example: each subject can be either young or old; either black or white; either tall or short; etc. The goal is to divide the subjects to two sub-groups: treatment group (T) and control group (C), such that for each feature, the number of subjects that have this feature in T is roughly equal to the number of subjects that have this feature in C. E.g., both groups should have roughly the same number of young people, the same number of black people, the same number of tall people, etc.
Formally, the set balancing problem can be described as follows.
m
n
The subjects are described by
A
n x m
{0,1}
ai,j=1
j
i
ai,j=0
j
i
The partition to groups is described by
b
m x 1
{-1,1}
bj=1
j
bj=-1
j
The balance of features is described by
c=A ⋅ b
n x 1
ci
i
ci>0
i
ci<0
i
The imbalance of a given partition is defined as:
I(b)=||A ⋅ b||infty=maxi\in|ci|
The set balancing problem is to find a vector
b
I(b)
An approximate solution can be found with the following very simple randomized algorithm:
Send each subject to the treatment group with probability 1/2.In matrix formulation:
Choose the elements of
b
Surprisingly, although this algorithm completely ignores the matrix
A
b
Prob\left[I(b)\geq\sqrt{4mlnn}\right]\leq
2 | |
n |
PROOF:
Let
ki
i
i
A
Easy case:
ki\leq\sqrt{4mlnn}
i
ci
\sqrt{4mlnn}
Hard case:
ki>\sqrt{4mlnn}
j
Xj=ai,jbj
Xj
i
ci=
m{X | |
\sum | |
j} |
Xj
a>0
Prob\left[|ci|\geqa\right]\leq2\exp(-a2/2m)
a=\sqrt{4mlnn}
Prob\left[|ci|\geq\sqrt{4mlnn}\right]\leq
2 | |
n2 |
By the union bound,
Prob\left[\existsi:|ci|\geq\sqrt{4mlnn}\right]\leq
2 | |
n |