In population genetics, Ewens's sampling formula describes the probabilities associated with counts of how many different alleles are observed a given number of times in the sample.
Ewens's sampling formula, introduced by Warren Ewens, states that under certain conditions (specified below), if a random sample of n gametes is taken from a population and classified according to the gene at a particular locus, then the probability that there are a_1 alleles represented once in the sample, a_2 alleles represented twice, and so on, is
\operatorname{Pr}(a_1,\dots,a_n;\theta)={n! \over \theta(\theta+1)\cdots(\theta+n-1)}\prod_{j=1}^n{\theta^{a_j} \over j^{a_j}a_j!},
for some positive number θ representing the population mutation rate, whenever a_1, \ldots, a_n is a sequence of nonnegative integers such that
a_1+2a_2+3a_3+\cdots+na_n=\sum_{i=1}^n ia_i=n.
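The formula can be evaluated directly. The following is a minimal sketch in Python using exact rational arithmetic; the function name ewens_probability and the tuple encoding of (a_1, ..., a_n) are illustrative choices, not part of the original formulation.

```python
from fractions import Fraction
from math import factorial


def ewens_probability(a, theta):
    """Evaluate Ewens's sampling formula for a = (a_1, ..., a_n).

    a[j-1] is the number of alleles observed exactly j times, so the
    sample size is n = sum of j * a_j, and a must have length n.
    """
    n = sum(j * aj for j, aj in enumerate(a, start=1))
    if len(a) != n:
        raise ValueError("a must have length n = sum of j * a_j")
    theta = Fraction(theta)
    # Rising factorial theta (theta + 1) ... (theta + n - 1).
    rising = Fraction(1)
    for k in range(n):
        rising *= theta + k
    prob = Fraction(factorial(n)) / rising
    # Product over j of theta^{a_j} / (j^{a_j} a_j!).
    for j, aj in enumerate(a, start=1):
        prob *= theta**aj / (Fraction(j) ** aj * factorial(aj))
    return prob


# The three partitions of n = 3, with theta = 2.
for a in [(3, 0, 0), (1, 1, 0), (0, 0, 1)]:
    print(a, ewens_probability(a, 2))
```

With θ = 2 the three values are 1/3, 1/2 and 1/6, which sum to 1 as required of a probability distribution on the partitions of 3.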
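The phrase "under certain conditions" used above is made precise by the following assumptions: the size of the sample n is small by comparison with the size of the whole population; the population is in statistical equilibrium under mutation and genetic drift, and the role of selection at the locus in question is negligible; and every mutant allele is novel.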
See also: Infinite-alleles model.
This is a probability distribution on the set of all partitions of the integer n. Among probabilists and statisticians it is often called the multivariate Ewens distribution.
In the limiting case θ = 0, the probability is 1 that all n genes are the same. When θ = 1, the distribution is precisely that of the integer partition induced by the cycle structure of a uniformly distributed random permutation. As θ → ∞, the probability that no two of the n genes are the same approaches 1.
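As a small worked case (obtained here by substituting n = 3 into the formula, not taken from the source), the three partitions of 3 have the following probabilities:

```latex
\begin{aligned}
\operatorname{Pr}(3,0,0;\theta) &= \frac{\theta^2}{(\theta+1)(\theta+2)}, \\
\operatorname{Pr}(1,1,0;\theta) &= \frac{3\theta}{(\theta+1)(\theta+2)}, \\
\operatorname{Pr}(0,0,1;\theta) &= \frac{2}{(\theta+1)(\theta+2)}.
\end{aligned}
```

These sum to 1 because θ^2 + 3θ + 2 = (θ+1)(θ+2). Letting θ → 0 puts all the mass on (0,0,1), a single allele seen three times. Setting θ = 1 gives 1/6, 1/2 and 1/3, the proportions of permutations of three elements with cycle types 1+1+1, 1+2 and 3. As θ → ∞ the mass concentrates on (3,0,0), three distinct alleles.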
This family of probability distributions enjoys the property that if after the sample of n is taken, m of the n gametes are chosen without replacement, then the resulting probability distribution on the set of all partitions of the smaller integer m is just what the formula above would give if m were put in place of n.
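This consistency property can be checked numerically for small samples. The sketch below, with illustrative helper names that are not from the source, removes one uniformly chosen gamete from a sample of size n and compares the resulting distribution with the formula for n - 1; choosing m gametes without replacement is equivalent to repeating this single removal n - m times.

```python
from collections import defaultdict
from fractions import Fraction
from math import factorial


def ewens_probability(a, theta):
    """Ewens's sampling formula for a = (a_1, ..., a_n), exact arithmetic."""
    n = sum(j * aj for j, aj in enumerate(a, start=1))
    theta = Fraction(theta)
    prob = Fraction(factorial(n))
    for k in range(n):
        prob /= theta + k
    for j, aj in enumerate(a, start=1):
        prob *= theta**aj / (Fraction(j) ** aj * factorial(aj))
    return prob


def partitions_as_counts(n):
    """Yield every vector (a_1, ..., a_n) with a_1 + 2 a_2 + ... + n a_n = n."""
    def parts(remaining, largest):
        if remaining == 0:
            yield []
            return
        for part in range(min(remaining, largest), 0, -1):
            for rest in parts(remaining - part, part):
                yield [part] + rest
    for p in parts(n, n):
        yield tuple(p.count(j) for j in range(1, n + 1))


def drop_one_gamete(n, theta):
    """Distribution on partitions of n - 1 after deleting one random gamete."""
    out = defaultdict(Fraction)
    for a in partitions_as_counts(n):
        p = ewens_probability(a, theta)
        for j, aj in enumerate(a, start=1):
            if aj == 0:
                continue
            # The deleted gamete lies in a class of size j with prob. j a_j / n.
            b = list(a)
            b[j - 1] -= 1
            if j > 1:
                b[j - 2] += 1  # that class now has size j - 1
            out[tuple(b[: n - 1])] += p * Fraction(j * aj, n)
    return out


def check_drop_one(n, theta):
    """True if the thinned distribution equals the formula with n - 1."""
    thinned = drop_one_gamete(n, theta)
    return all(thinned[b] == ewens_probability(b, theta)
               for b in partitions_as_counts(n - 1))


print(check_drop_one(6, Fraction(3, 2)))
```

Exact fractions are used so that the comparison is an equality test rather than a floating-point tolerance check.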
The Ewens distribution arises naturally from the Chinese restaurant process.
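A minimal sketch of that process, assuming the standard one-parameter form in which a customer arriving when k customers are already seated starts a new table with probability θ/(θ + k) and otherwise joins an existing table with probability proportional to its size. The function names are hypothetical. Table sizes play the role of allele multiplicities, and the vector of counts of tables of each size corresponds to (a_1, ..., a_n).

```python
import random


def sample_crp_table_sizes(n, theta, rng):
    """Seat n customers one at a time and return the list of table sizes."""
    sizes = []
    for seated in range(n):
        # Total weight: theta for a new table plus one unit per seated customer.
        u = rng.random() * (theta + seated)
        if u < theta:
            sizes.append(1)  # open a new table (a new allele)
            continue
        u -= theta
        for i, size in enumerate(sizes):
            if u < size:
                sizes[i] += 1  # join table i, chosen proportionally to its size
                break
            u -= size
        else:
            sizes[-1] += 1  # guard against floating-point round-off
    return sizes


def allele_counts(sizes, n):
    """Convert table sizes to (a_1, ..., a_n): a_j = number of tables of size j."""
    return tuple(sizes.count(j) for j in range(1, n + 1))


rng = random.Random(0)  # fixed seed, chosen only for reproducibility
sizes = sample_crp_table_sizes(10, 1.5, rng)
a = allele_counts(sizes, 10)
print(sizes, a)
```

Repeating the draw many times and tabulating the resulting vectors gives empirical frequencies that can be compared against the sampling formula.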