The iterative proportional fitting procedure (IPF or IPFP, also known as biproportional fitting or biproportion in statistics or economics (input-output analysis, etc.), RAS algorithm[1] in economics, raking in survey statistics, and matrix scaling in computer science) is the operation of finding the fitted matrix
X
Z
Y
Y
X=PZQ
P
Q
X
Y
IPF has been "re-invented" many times, the earliest by Kruithof in 1937 [6] in relation to telephone traffic ("Kruithof’s double factor method"), Deming and Stephan in 1940[7] for adjusting census crosstabulations, and G.V. Sheleikhovskii for traffic as reported by Bregman.[8] (Deming and Stephan proposed IPFP as an algorithm leading to a minimizer of the Pearson X-squared statistic, which Stephan later reported it does not).[9] Early proofs of uniqueness and convergence came from Sinkhorn (1964),[10] Bacharach (1965),[11] Bishop (1967),[12] and Fienberg (1970).[13] Bishop's proof that IPFP finds the maximum likelihood estimator for any number of dimensions extended a 1959 proof by Brown for 2x2x2... cases. Fienberg's proof by differential geometry exploits the method's constant crossproduct ratios, for strictly positive tables. Csiszár (1975).[14] found necessary and sufficient conditions for general tables having zero entries. Pukelsheim and Simeone (2009)[15] give further results on convergence and error behavior.
An exhaustive treatment of the algorithm and its mathematical foundations can be found in the book of Bishop et al. (1975).[16] Idel (2016)[17] gives a more recent survey.
Other general algorithms can be modified to yield the same limit as the IPFP, for instance the Newton–Raphson method andthe EM algorithm. In most cases, IPFP is preferred due to its computational speed, low storage requirements, numerical stability and algebraic simplicity.
Applications of IPFP have grown to include trip distribution models, Fratar or Furness and other applications in transportation planning (Lamond and Stewart), survey weighting, synthesis of cross-classified demographic data, adjusting input–output models in economics, estimating expected quasi-independent contingency tables, biproportional apportionment systems of political representation, and for a preconditioner in linear algebra.[18]
Biproportion, whatever the algorithm used to solve it, is the following concept:
Z
Y
X
n,m
Y
X
X
Y
Xs=Ys
s'X=s'Y
s
X
Z
X=K(Z,Y)=PZQ
P
Q
min\sumi\sumjxijlog(xij/zij)
\sumjxij=yi.
i
\sumixij=y.j
j
L=\sumi\sumjxijlog(xij/zij)-\sumipi(yi.-\sumjxij)-\sumjqj(y.j-\sumixij)
Thus
xij=zij\exp-(1+pi+qj)
i,j
which, after posing
Pi=\exp-(1+pi)
Qj=\exp-qj
xij=PizijQj
i,j
X=PZQ
with
Pi=zi.(\sumjzij
-1 | |
Q | |
j) |
i
Qj=z.j(\sumizij
-1 | |
P | |
i) |
j
Pi
Qj
Pi=z
(t+1) | |
i. |
(\sumjzij
(t) | |
Q | |
j |
)-1
i
(t+1) | |
Q | |
j |
=z.j(\sumizij
(t+1) | |
P | |
i |
)-1
j
The solution
X
(0) | |
q | |
j |
=1
j
(0) | |
p | |
i |
=1
i
Z
Some properties (see de Mesnard (1994)):
Lack of information: if
Z
zij=z
i,j
X=PQ
Idempotency:
X=K(Z,Y)=Z
Y
Z
Composition of biproportions:
K(K(Z,Y1),Y2=K(Z,Y2)
K(...K(Z,Y1),Y2)...ZN)=K(Z,YN)
Zeros: a zero in
Z
X
Theorem of separable modifications: if
Z
Theorem of "unicity": If
Kq
\hat{X}=Kq(Z,Y)=UZV
U
V
U
V
P
Q
Given a two-way (I × J)-table
xij
\hat{m}ij=aibjxij
\sumj\hat{m}ij =ui,
\sumi\hat{m}ij =vj
Choose initial values
(0) | |
\hat{m} | |
ij |
:=xij
η\geq1
(2η-1) | |
\hat{m} | |
ij |
=
\hat{m | |
ij |
(2η-2)ui
(2η) | |
\hat{m} | |
ij |
=
\hat{m | |
ij |
(2η-1)vj
Repeat these steps until row and column totals are sufficiently close to u and v.
Notes:
diag:Rk\longrightarrowRk x
Rη=diag(
ui | ||||||||||||
|
)
M2η=RηM2η-2
Sη=diag(
vi | ||||||||||||
|
)
M2η=M2η-1Sη
Assume the same setting as in the classical IPFP.Alternatively, we can estimate the row and column factors separately: Choose initial values
(0) | |
\hat{b} | |
j |
:=1
η\geq1
(η) | |
\hat{a} | |
i |
=
ui | |
\sumj xij\hat{b |
(η-1) | |
j |
(η) | |
\hat{b} | |
j |
=
vj | |
\sumi xij\hat{a |
(η) | |
i |
Repeat these steps until successive changes of a and b are sufficiently negligible (indicating the resulting row- and column-sums are close to u and v).
Finally, the result matrix is
\hat{m}ij=
(η) | |
\hat{a} | |
i |
(η) | |
\hat{b} | |
j |
xij
Notes:
(η) | |
\hat{m} | |
ij |
mij=aibjxij=(\gamma
a | ||||
|
bj)xij
\gamma>0
The vaguely demanded 'similarity' between M and X can be explained as follows: IPFP (and thus RAS)maintains the crossproduct ratios, i.e.
| ||||||||||||||||
|
=
xijxhk | |
xikxhj |
\forall η\geq0andi ≠ h, j ≠ k
since
(η) | |
m | |
ij |
=
(η) | |
a | |
i |
(η) | |
b | |
j |
xij.
This property is sometimes called structure conservation and directly leads to the geometrical interpretation of contingency tables and the proof of convergence in the seminal paper of Fienberg (1970).
Direct factor estimation (algorithm 2) is generally the more efficient way to solve IPF: Whereas a form of the classical IPFP needs
IJ(2+J)+IJ(2+I)=I2J+IJ2+4IJ
elementary operations in each iteration step (including a row and a column fitting step), factor estimation needs only
I(1+J)+J(1+I)=2IJ+I+J
operations being at least one order in magnitude faster than classical IPFP.
IPFP can be used to estimate expected quasi-independent (incomplete) contingency tables, with
ui=xi+,vj=x+j
0 | |
m | |
ij |
=1
0 | |
m | |
ij |
=0
Similar to the IPF, the NM-method is also an operation of finding a matrix
X
Z
Z\inNn
Y
(Y\inNn)
However, there are differences between the NM-method and the IPF. For instance, the NM-method defines closeness of matrices of the same size differently from the IPF.[19] Also, the NM-method was developed to solve for matrix
X
\boldsymbol{Z}
Y
\boldsymbol{Z}
Macdonald (2023)[20] is at ease with the conclusion by Naszodi (2023)[21] that the IPF is suitable for sampling correction tasks, but not for generation of counterfactuals. Similarly to Naszodi, Macdonald also questions whether the row and column proportional transformations of the IPF preserve the structure of association within a contingency table that allows us to study social mobility.
Necessary and sufficient conditions for the existence and uniqueness of MLEs are complicated in the general case (see[22]), but sufficient conditions for 2-dimensional tables are simple:
xi+>0, x+j>0
If unique MLEs exist, IPFP exhibits linear convergence in the worst case (Fienberg 1970), but exponential convergence has also been observed (Pukelsheim and Simeone 2009). If a direct estimator (i.e. a closed form of
(\hat{m}ij)
If all observed values are strictly positive, existence and uniqueness of MLEs and therefore convergence is ensured.
Consider the following table, given with the row- and column-sums and targets.
1 | 2 | 3 | 4 | TOTAL | TARGET | |
1 | 40 | 30 | 20 | 10 | 100 | 150 |
2 | 35 | 50 | 100 | 75 | 260 | 300 |
3 | 30 | 80 | 70 | 120 | 300 | 400 |
4 | 20 | 30 | 40 | 50 | 140 | 150 |
TOTAL | 125 | 190 | 230 | 255 | 800 | |
TARGET | 200 | 300 | 400 | 100 | 1000 | |
For executing the classical IPFP, we first adjust the rows:
1 | 2 | 3 | 4 | TOTAL | TARGET | |
1 | 60.00 | 45.00 | 30.00 | 15.00 | 150.00 | 150 |
2 | 40.38 | 57.69 | 115.38 | 86.54 | 300.00 | 300 |
3 | 40.00 | 106.67 | 93.33 | 160.00 | 400.00 | 400 |
4 | 21.43 | 32.14 | 42.86 | 53.57 | 150.00 | 150 |
TOTAL | 161.81 | 241.50 | 281.58 | 315.11 | 1000.00 | |
TARGET | 200 | 300 | 400 | 100 | 1000 | |
The first step exactly matched row sums, but not the column sums. Next we adjust the columns:
1 | 2 | 3 | 4 | TOTAL | TARGET | |
1 | 74.16 | 55.90 | 42.62 | 4.76 | 177.44 | 150 |
2 | 49.92 | 71.67 | 163.91 | 27.46 | 312.96 | 300 |
3 | 49.44 | 132.50 | 132.59 | 50.78 | 365.31 | 400 |
4 | 26.49 | 39.93 | 60.88 | 17.00 | 144.30 | 150 |
TOTAL | 200.00 | 300.00 | 400.00 | 100.00 | 1000.00 | |
TARGET | 200 | 300 | 400 | 100 | 1000 | |
Now the column sums exactly match their targets, but the row sums no longer match theirs. After completing three cycles, each with a row adjustment and a column adjustment, we get a closer approximation:
1 | 2 | 3 | 4 | TOTAL | TARGET | |
1 | 64.61 | 46.28 | 35.42 | 3.83 | 150.13 | 150 |
2 | 49.95 | 68.15 | 156.49 | 25.37 | 299.96 | 300 |
3 | 56.70 | 144.40 | 145.06 | 53.76 | 399.92 | 400 |
4 | 28.74 | 41.18 | 63.03 | 17.03 | 149.99 | 150 |
TOTAL | 200.00 | 300.00 | 400.00 | 100.00 | 1000.00 | |
TARGET | 200 | 300 | 400 | 100 | 1000 | |
The R package mipfp (currently in version 3.2) provides a multi-dimensional implementation of the traditional iterative proportional fitting procedure.[23] The package allows the updating of a N-dimensional array with respect to given target marginal distributions (which, in turn can be multi-dimensional).
Python has an equivalent package, ipfn[24] [25] that can be installed via pip. The package supports numpy and pandas input objects.