Cramér–von Mises criterion explained

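In statistics, the Cramér–von Mises criterion is used to judge the goodness of fit of a cumulative distribution function $F^*$ compared to a given empirical distribution function $F_n$, or to compare two empirical distributions. It is also used as a part of other algorithms, such as minimum distance estimation. It is defined as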

$$\omega^2 = \int_{-\infty}^{\infty} \left[F_n(x) - F^*(x)\right]^2 \, dF^*(x)$$

In one-sample applications, $F^*$ is the theoretical distribution and $F_n$ is the empirically observed distribution. Alternatively, both distributions can be empirically estimated; this is called the two-sample case.
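As a minimal sketch of the empirical side of this comparison, the snippet below builds $F_n$ under the usual definition, the fraction of observations less than or equal to $x$. The helper name `make_ecdf` and the sample values are illustrative assumptions, not part of the original text.

```python
from bisect import bisect_right


def make_ecdf(data):
    """Return the empirical distribution function F_n of `data`."""
    xs = sorted(data)
    n = len(xs)

    def ecdf(x):
        # bisect_right counts how many observations are <= x
        return bisect_right(xs, x) / n

    return ecdf


# Illustrative sample (assumed values)
f_n = make_ecdf([0.1, 0.4, 0.7])
below = f_n(0.0)   # no observations are <= 0.0
middle = f_n(0.4)  # two of three observations are <= 0.4
above = f_n(1.0)   # all observations are <= 1.0
```

The function is a step function that jumps by $1/n$ at each observation, which is why the integral in the definition reduces to a finite sum in practice.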

The criterion is named after Harald Cramér and Richard Edler von Mises, who first proposed it in 1928–1930.^[1] ^[2] The generalization to two samples is due to Anderson.^[3]

The Cramér–von Mises test is an alternative to the Kolmogorov–Smirnov test (1933).^[4]

Cramér–von Mises test (one sample)

Let $x_1, x_2, \ldots, x_n$ be the observed values, in increasing order. Then the statistic is^[3] ^[5]

$$T = n\omega^2 = \frac{1}{12n} + \sum_{i=1}^{n} \left[\frac{2i-1}{2n} - F(x_i)\right]^2.$$

If this value is larger than the tabulated value, then the hypothesis that the data came from the distribution $F$ can be rejected.
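The one-sample statistic can be sketched in a few lines of Python. The function name `cvm_one_sample`, the choice of a uniform distribution on [0, 1] as the hypothesised $F$, and the sample values are assumptions made for illustration only; critical values still have to come from tables.

```python
def cvm_one_sample(data, cdf):
    """Return T = n * omega^2 for `data` against the hypothesised CDF `cdf`."""
    xs = sorted(data)  # the formula assumes increasing order
    n = len(xs)
    total = 1.0 / (12 * n)
    for i, x in enumerate(xs, start=1):
        total += ((2 * i - 1) / (2 * n) - cdf(x)) ** 2
    return total


def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1] (illustrative choice of F)."""
    return min(max(x, 0.0), 1.0)


# Illustrative sample (assumed values)
t_stat = cvm_one_sample([0.1, 0.4, 0.7], uniform_cdf)

# A sample sitting exactly at the points (2i - 1) / (2n) makes every squared
# term vanish, leaving only the constant 1 / (12n).
t_min = cvm_one_sample([1 / 6, 3 / 6, 5 / 6], uniform_cdf)
```

For the first sample the terms are 1/36 + 1/225 + 1/100 + 4/225, so `t_stat` is 0.06 up to floating-point rounding.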

Watson test

A modified version of the Cramér–von Mises test is the Watson test,^[6] which uses the statistic $U^2$, where^[5]

$$U^2 = T - n\left(\bar{F} - \tfrac{1}{2}\right)^2,$$

where

$$\bar{F} = \frac{1}{n} \sum_{i=1}^{n} F(x_i).$$
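A minimal sketch of the Watson statistic follows, computing $T$ and $\bar{F}$ in one pass. The name `watson_u2`, the uniform hypothesised distribution, and the sample values are assumptions for illustration.

```python
def watson_u2(data, cdf):
    """Return Watson's U^2 = T - n * (F_bar - 1/2)^2 for `data` against `cdf`."""
    xs = sorted(data)
    n = len(xs)
    f_values = [cdf(x) for x in xs]
    # One-sample Cramér–von Mises statistic T
    t_stat = 1.0 / (12 * n) + sum(
        ((2 * i - 1) / (2 * n) - f) ** 2
        for i, f in enumerate(f_values, start=1)
    )
    # Mean of the hypothesised CDF evaluated at the observations
    f_bar = sum(f_values) / n
    return t_stat - n * (f_bar - 0.5) ** 2


def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1] (illustrative choice of F)."""
    return min(max(x, 0.0), 1.0)


# Illustrative sample (assumed values)
u2 = watson_u2([0.1, 0.4, 0.7], uniform_cdf)
```

Here $T = 0.06$ and $\bar{F} = 0.4$, so the correction term is $3 \times 0.01 = 0.03$ and `u2` is 0.03 up to floating-point rounding.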

Cramér–von Mises test (two samples)

Let $x_1, x_2, \ldots, x_N$ and $y_1, y_2, \ldots, y_M$ be the observed values in the first and second sample respectively, in increasing order. Let $r_1, r_2, \ldots, r_N$ be the ranks of the $x$s in the combined sample, and let $s_1, s_2, \ldots, s_M$ be the ranks of the $y$s in the combined sample. Anderson^[3] shows that

$$T = \frac{NM}{N+M}\,\omega^2 = \frac{U}{NM(N+M)} - \frac{4MN-1}{6(M+N)}$$

where U is defined as

$$U = N \sum_{i=1}^{N} (r_i - i)^2 + M \sum_{j=1}^{M} (s_j - j)^2$$

If the value of T is larger than the tabulated values,^[3] the hypothesis that the two samples come from the same distribution can be rejected. (Some books give critical values for U, which is more convenient, as it avoids the need to compute T via the expression above. The conclusion will be the same.)
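The rank-based computation of $U$ and $T$ can be sketched as below, assuming no ties within or between the samples. The function name `cvm_two_sample` and the sample values are illustrative assumptions; the sketch raises an error on duplicates rather than guessing how to rank them.

```python
def cvm_two_sample(xs, ys):
    """Return (U, T) for two samples with no duplicated values."""
    x_sorted = sorted(xs)
    y_sorted = sorted(ys)
    n, m = len(x_sorted), len(y_sorted)

    # Rank of each value in the combined, sorted sample (1-based)
    combined = sorted(x_sorted + y_sorted)
    rank = {value: k for k, value in enumerate(combined, start=1)}
    if len(rank) != n + m:
        raise ValueError("duplicate values found; this sketch assumes no ties")

    u = n * sum((rank[x] - i) ** 2 for i, x in enumerate(x_sorted, start=1))
    u += m * sum((rank[y] - j) ** 2 for j, y in enumerate(y_sorted, start=1))
    t = u / (n * m * (n + m)) - (4 * m * n - 1) / (6 * (m + n))
    return u, t


# Illustrative samples (assumed values), perfectly interleaved
u_stat, t_stat = cvm_two_sample([1, 3, 5], [2, 4, 6])
```

For these samples the $x$ ranks are 1, 3, 5 and the $y$ ranks are 2, 4, 6, giving $U = 3 \times 5 + 3 \times 14 = 57$ and $T = 57/54 - 35/36 = 1/12$.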

The above assumes there are no duplicates in the $x$, $y$, and $r$ sequences. So $x_i$ is unique, and its rank is $i$ in the sorted list $x_1, \ldots, x_N$. If there are duplicates, and $x_i$ through $x_j$ are a run of identical values in the sorted list, then one common approach is the midrank^[7] method: assign each duplicate a "rank" of $(i+j)/2$. In the above equations, in the expressions $(r_i - i)^2$ and $(s_j - j)^2$, duplicates can modify all four variables $r_i$, $i$, $s_j$, and $j$.
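The midrank assignment itself can be sketched as follows. The function name `midranks` and the input values are assumptions for illustration; the sketch covers only the rank assignment for a single sorted list, not the full tie-adjusted statistic.

```python
def midranks(sorted_values):
    """Assign midranks to an already sorted list, averaging over runs of ties."""
    ranks = []
    n = len(sorted_values)
    start = 0
    while start < n:
        # Extend `end` to the last element of the run of identical values
        end = start
        while end + 1 < n and sorted_values[end + 1] == sorted_values[start]:
            end += 1
        # 1-based positions i = start + 1 through j = end + 1 share (i + j) / 2
        mid = ((start + 1) + (end + 1)) / 2
        ranks.extend([mid] * (end - start + 1))
        start = end + 1
    return ranks


# Illustrative input (assumed values): the two 2s occupy positions 2 and 3
example = midranks([1, 2, 2, 3])
```

In the example the tied values at positions 2 and 3 both receive the rank 2.5, while the untied values keep their ordinary positions as ranks.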