Lindley's paradox explained

Lindley's paradox is a counterintuitive situation in statistics in which the Bayesian and frequentist approaches to a hypothesis testing problem give different results for certain choices of the prior distribution. The problem of the disagreement between the two approaches was discussed in Harold Jeffreys' 1939 textbook;^[1] it became known as Lindley's paradox after Dennis Lindley called the disagreement a paradox in a 1957 paper.^[2]

Although referred to as a paradox, the differing results can be explained by the fact that the Bayesian and frequentist approaches answer fundamentally different questions, rather than by an actual disagreement between the two methods.

Nevertheless, for a large class of priors the differences between the frequentist and Bayesian approaches are caused by keeping the significance level fixed: as even Lindley recognized, "the theory does not justify the practice of keeping the significance level fixed" and even "some computations by Prof. Pearson in the discussion to that paper emphasized how the significance level would have to change with the sample size, if the losses and prior probabilities were kept fixed". In fact, if the critical value increases with the sample size suitably fast, then the disagreement between the frequentist and Bayesian approaches becomes negligible as the sample size increases.^[3]

The paradox continues to be a source of active discussion.^[4] ^[5] ^[6]

Description of the paradox

The result $x$ of some experiment has two possible explanations, hypotheses $H_0$ and $H_1$, and some prior distribution $\pi$ representing uncertainty as to which hypothesis is more accurate before taking $x$ into account.

Lindley's paradox occurs when

The result $x$ is "significant" by a frequentist test of $H_0$, indicating sufficient evidence to reject $H_0$, say, at the 5% level, and

The posterior probability of $H_0$ given $x$ is high, indicating strong evidence that $H_0$ is in better agreement with $x$ than $H_1$.

These results can occur at the same time when $H_0$ is very specific, $H_1$ more diffuse, and the prior distribution does not strongly favor one or the other, as seen below.
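The way the two conditions can hold together may be easier to see with a small sweep over sample sizes. The sketch below is only illustrative and rests on assumptions not stated in this section: a binomial experiment with $H_0: \theta = 0.5$, a uniform prior on $\theta$ under $H_1$, equal prior weight on the two hypotheses, and an observed count held about 1.96 standard deviations above $n/2$. The function names `posterior_h0` and `two_sided_p` are hypothetical helpers.

```python
import math


def posterior_h0(k: int, n: int) -> float:
    """P(H0 | k) for H0: theta = 0.5 vs H1: theta ~ Uniform(0, 1), equal prior weights."""
    # log C(n, k) via log-gamma, so large n does not overflow
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    lik_h0 = math.exp(log_choose + n * math.log(0.5))
    lik_h1 = 1.0 / (n + 1)  # binomial pmf integrated over a uniform prior
    return lik_h0 / (lik_h0 + lik_h1)


def two_sided_p(k: int, n: int) -> float:
    """Two-sided p-value for H0 from the normal approximation to the binomial."""
    z = abs(k - n / 2) / (math.sqrt(n) / 2)
    return math.erfc(z / math.sqrt(2))


results = []
for n in (100, 10_000, 1_000_000):
    # Assumed data: a count roughly 1.96 standard deviations above n / 2
    k = round(n / 2 + 0.98 * math.sqrt(n))
    results.append((n, two_sided_p(k, n), posterior_h0(k, n)))

for n, p_value, post in results:
    print(f"n = {n:>9,}  p-value = {p_value:.4f}  P(H0 | k) = {post:.4f}")
```

Under these assumptions the p-value stays close to 0.05 at every sample size, while the posterior probability of $H_0$ climbs toward 1 as $n$ grows.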

Numerical example

The following numerical example illustrates Lindley's paradox. In a certain city 49,581 boys and 48,870 girls have been born over a certain time period. The observed proportion $x$ of male births is thus 49581/98451 ≈ 0.5036. We assume the number of male births is a binomial variable with parameter $\theta$. We are interested in testing whether $\theta$ is 0.5 or some other value. That is, our null hypothesis is $H_0: \theta = 0.5$, and the alternative is $H_1: \theta \neq 0.5$.

Frequentist approach

The frequentist approach to testing $H_0$ is to compute a p-value, the probability of observing a fraction of boys at least as large as $x$ assuming $H_0$ is true. Because the number of births is very large, we can use a normal approximation for the number of male births, $X \sim N(\mu, \sigma^2)$, with $\mu = n\theta = 98451 \times 0.5 = 49225.5$ and $\sigma^2 = n\theta(1-\theta) = 98451 \times 0.5 \times 0.5 = 24612.75$, to compute

\begin{align}
P(X \geq 49581 \mid \mu = 49225.5) &= \int_{49581}^{98451} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(u-\mu)^2}{2\sigma^2}} \, du \\
&= \int_{49581}^{98451} \frac{1}{\sqrt{2\pi(24612.75)}} e^{-\frac{(u-49225.5)^2}{49225.5}} \, du \approx 0.0117.
\end{align}

We would have been equally surprised if we had seen 49,581 female births, i.e. $x \approx 0.4964$, so a frequentist would usually perform a two-sided test, for which the p-value would be $p \approx 2 \times 0.0117 \approx 0.0235$. In both cases, the p-value is lower than the significance level α = 5%, so the frequentist approach rejects $H_0$, as it disagrees with the observed data.
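The normal-approximation p-values can be checked with a few lines of Python. This is a minimal sketch using only the standard library. It takes the upper tail to infinity rather than stopping at 98,451, which makes no visible difference at this precision. The variable names are illustrative.

```python
import math

boys, girls = 49_581, 48_870
n = boys + girls                      # 98,451 births in total
theta0 = 0.5                          # value of theta under H0

mu = n * theta0                       # mean number of boys under H0
sigma = math.sqrt(n * theta0 * (1 - theta0))

z = (boys - mu) / sigma               # standardized deviation of the observed count

# Upper tail of a standard normal: P(Z >= z) = erfc(z / sqrt(2)) / 2
p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))
p_two_sided = 2 * p_one_sided

print(f"z = {z:.3f}")
print(f"one-sided p = {p_one_sided:.4f}")
print(f"two-sided p = {p_two_sided:.4f}")
```

The printed values agree with the one-sided and two-sided p-values quoted above, up to rounding.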

Bayesian approach

Assuming no reason to favor one hypothesis over the other, the Bayesian approach would be to assign prior probabilities $\pi(H_0) = \pi(H_1) = 0.5$ and a uniform distribution to $\theta$ under $H_1$, and then to compute the posterior probability of $H_0$ using Bayes' theorem:

\[
P(H_0 \mid k) = \frac{P(k \mid H_0)\,\pi(H_0)}{P(k \mid H_0)\,\pi(H_0) + P(k \mid H_1)\,\pi(H_1)}.
\]

After observing $k = 49581$ boys out of $n = 98451$ births, we can compute the posterior probability of each hypothesis using the probability mass function for a binomial variable:

\begin{align}
P(k \mid H_0) &= \binom{n}{k} (0.5)^k (1-0.5)^{n-k} \approx 1.95 \times 10^{-4}, \\
P(k \mid H_1) &= \int_0^1 \binom{n}{k} \theta^k (1-\theta)^{n-k} \, d\theta = \binom{n}{k} \operatorname{B}(k+1, n-k+1) = \frac{1}{n+1} \approx 1.02 \times 10^{-5},
\end{align}

where $\operatorname{B}(a, b)$ is the Beta function.

From these values, we find the posterior probability $P(H_0 \mid k) \approx 0.95$, which strongly favors $H_0$ over $H_1$.

The two approaches—the Bayesian and the frequentist—appear to be in conflict, and this is the "paradox".
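The Bayesian figures can be reproduced in the same way. The sketch below assumes the setup described in this section (equal prior weights and a uniform prior on $\theta$ under $H_1$) and evaluates the binomial coefficient through log-gamma, since the coefficient itself is far too large for a floating-point number. The variable names are illustrative.

```python
import math

k, n = 49_581, 98_451
prior_h0 = prior_h1 = 0.5

# log C(n, k), computed with log-gamma to avoid overflow
log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

# P(k | H0): binomial pmf at theta = 0.5
lik_h0 = math.exp(log_choose + n * math.log(0.5))

# P(k | H1): the binomial pmf averaged over a uniform prior equals 1 / (n + 1)
lik_h1 = 1.0 / (n + 1)

# Bayes' theorem for the posterior probability of H0
posterior_h0 = lik_h0 * prior_h0 / (lik_h0 * prior_h0 + lik_h1 * prior_h1)

print(f"P(k | H0) = {lik_h0:.3e}")
print(f"P(k | H1) = {lik_h1:.3e}")
print(f"P(H0 | k) = {posterior_h0:.3f}")
```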

Reconciling the Bayesian and frequentist approaches

Naaman^[3] proposed adapting the significance level to the sample size in order to control false positives: $\alpha_n$, such that $\alpha_n = n^{-r}$ with $r > 1/2$. At least in the numerical example, taking $r = 1/2$ results in a significance level of 0.00318, so the frequentist would not reject the null hypothesis, which is in agreement with the Bayesian approach.
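As a rough check of the threshold quoted above, the sketch below assumes that the sample-size-dependent level has the form $n^{-1/2}$. That form is an assumption made here because it reproduces the quoted 0.00318 up to rounding for $n = 98451$. The two-sided p-value is taken from the frequentist section.

```python
n = 98_451
p_two_sided = 0.0235          # two-sided p-value from the frequentist section

alpha_fixed = 0.05            # conventional fixed significance level
alpha_n = n ** -0.5           # assumed sample-size-dependent level, n^(-1/2)

reject_fixed = p_two_sided < alpha_fixed
reject_scaled = p_two_sided < alpha_n

print(f"alpha_n = {alpha_n:.5f}")
print(f"reject H0 at fixed 5% level:   {reject_fixed}")
print(f"reject H0 at level n^(-1/2):   {reject_scaled}")
```

With the fixed level the null hypothesis is rejected, while with the shrinking level it is not, which matches the Bayesian conclusion in this example.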

Uninformative priors

If we use an uninformative prior and test a hypothesis more similar to that in the frequentist approach, the paradox disappears.

For example, if we calculate the posterior distribution $P(\theta \mid k, n)$ using a uniform prior distribution on $\theta$ (i.e. $\pi(\theta \in [0,1]) = 1$), we find that it is a beta distribution:

\[
P(\theta \mid k, n) = \operatorname{Beta}(k+1, n-k+1).
\]

If we use this to check the probability that a newborn is more likely to be a boy than a girl, i.e. $P(\theta > 0.5 \mid k, n)$, we find

\[
\int_{0.5}^{1} \operatorname{Beta}(\theta;\, 49582, 48871) \, d\theta \approx 0.988.
\]

In other words, it is very likely that the proportion of male births is above 0.5.
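The tail probability $P(\theta > 0.5 \mid k, n)$ can be checked numerically. The sketch below integrates the Beta(49582, 48871) density over [0.5, 1] with a simple midpoint rule, using only the standard library. The midpoint rule and the step count are illustrative choices made here. A statistics library's regularized incomplete beta function would normally be used instead.

```python
import math

k, n = 49_581, 98_451
a, b = k + 1, n - k + 1               # Beta(49582, 48871) posterior under a uniform prior

# log of the Beta function B(a, b), via log-gamma to avoid overflow
log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def posterior_pdf(theta: float) -> float:
    """Density of Beta(a, b) at theta, evaluated in log space."""
    log_pdf = (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta) - log_beta
    return math.exp(log_pdf)


# Midpoint rule on [0.5, 1]; midpoints never reach theta = 1, where log(1 - theta) diverges
steps = 100_000
width = 0.5 / steps
prob_above_half = width * sum(
    posterior_pdf(0.5 + (i + 0.5) * width) for i in range(steps)
)

print(f"P(theta > 0.5 | k, n) = {prob_above_half:.4f}")
```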

Neither analysis directly gives an estimate of the effect size, but both could be used to determine, for instance, whether the fraction of boy births is likely to be above some particular threshold.

The lack of an actual paradox

The apparent disagreement between the two approaches is caused by a combination of factors. First, the frequentist approach above tests $H_0$ without reference to $H_1$. The Bayesian approach evaluates $H_0$ as an alternative to $H_1$ and finds the first to be in better agreement with the observations. This is because the latter hypothesis is much more diffuse, as $\theta$ can be anywhere in $[0,1]$, which results in it having a very low posterior probability. To understand why, it is helpful to consider the two hypotheses as generators of the observations:

Under $H_0$, we choose $\theta \approx 0.500$ and ask how likely it is to see 49,581 boys in 98,451 births.

Under $H_1$, we choose $\theta$ randomly from anywhere within 0 to 1 and ask the same question.

Most of the possible values for $\theta$ under $H_1$ are very poorly supported by the observations. In essence, the apparent disagreement between the methods is not a disagreement at all, but rather two different statements about how the hypotheses relate to the data:

The frequentist finds that $H_0$ is a poor explanation for the observation.

The Bayesian finds that $H_0$ is a far better explanation for the observation than $H_1$.

The ratio of the sex of newborns is improbably 50/50 male/female, according to the frequentist test. Yet 50/50 is a better approximation than most, but not all, other ratios. The hypothesis $\theta \approx 0.504$ would have fit the observation much better than almost all other ratios, including $\theta \approx 0.500$.

For example, this choice of hypotheses and prior probabilities implies the statement "if $\theta > 0.49$ and $\theta < 0.51$, then the prior probability of $\theta$ being exactly 0.5 is 0.50/0.51 ≈ 98%". Given such a strong preference for $\theta = 0.5$, it is easy to see why the Bayesian approach favors $H_0$ in the face of $x \approx 0.5036$, even though the observed value of $x$ lies about $2.27\sigma$ away from 0.5. The deviation of over 2σ from $H_0$ is considered significant in the frequentist approach, but its significance is overruled by the prior in the Bayesian approach.
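The 98% figure follows from conditioning the prior on the interval (0.49, 0.51). The short sketch below spells out that arithmetic, assuming the prior used in the Bayesian analysis: half of the mass sits on the point $\theta = 0.5$ and the other half is spread uniformly over [0, 1]. The variable names are illustrative.

```python
prior_h0 = 0.5                # point mass on theta = 0.5
prior_h1 = 0.5                # spread uniformly over [0, 1]

lo, hi = 0.49, 0.51

# Prior mass that the uniform component places inside (lo, hi)
mass_h1_in_interval = prior_h1 * (hi - lo)

# Prior probability that theta is exactly 0.5, given that theta lies in (lo, hi)
prior_exact_half_given_interval = prior_h0 / (prior_h0 + mass_h1_in_interval)

print(f"{prior_exact_half_given_interval:.3f}")
```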

Looking at it another way, we can see that the prior distribution is essentially flat with a delta function at $\theta = 0.5$. Clearly, this is dubious. In fact, picturing real numbers as being continuous, it would be more logical to assume that it would be impossible for any given number to be exactly the parameter value, i.e., we should assume $P(\theta = 0.5) = 0$.

A more realistic distribution for $\theta$ in the alternative hypothesis produces a less surprising result for the posterior of $H_0$. For example, if we replace $H_1$ with $H_2: \theta = x$, i.e., the maximum likelihood estimate for $\theta$, the posterior probability of $H_0$ would be only 0.07 compared to 0.93 for $H_2$ (of course, one cannot actually use the MLE as part of a prior distribution).
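The comparison between $H_0$ and the point hypothesis $H_2$ can be sketched as follows, assuming equal prior weight on the two hypotheses. Because both are point hypotheses, the binomial coefficient cancels and only the likelihood ratio matters. The helper `log_lik` is a hypothetical name introduced for this illustration.

```python
import math

k, n = 49_581, 98_451
theta_hat = k / n                     # maximum likelihood estimate, about 0.5036


def log_lik(theta: float) -> float:
    """Binomial log-likelihood up to the constant log C(n, k), which cancels in the ratio."""
    return k * math.log(theta) + (n - k) * math.log(1 - theta)


# Likelihood ratio P(k | H0) / P(k | H2), computed in log space
ratio = math.exp(log_lik(0.5) - log_lik(theta_hat))

# With equal prior weights, the posterior odds equal the likelihood ratio
posterior_h0 = ratio / (1 + ratio)
posterior_h2 = 1 - posterior_h0

print(f"P(H0 | k) = {posterior_h0:.2f}, P(H2 | k) = {posterior_h2:.2f}")
```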

Notes and References

1. Jeffreys, Harold (1939). Theory of Probability. Oxford University Press. 924.
2. Lindley, D. V. (1957). "A statistical paradox". Biometrika. 44 (1–2): 187–192. doi:10.1093/biomet/44.1-2.187. 2333251.
3. Naaman, Michael (2016-01-01). "Almost sure hypothesis testing and a resolution of the Jeffreys–Lindley paradox". 10 (1): 1526–1550. doi:10.1214/16-EJS1146. ISSN 1935-7524.
4. Spanos, Aris (2013). "Who should be afraid of the Jeffreys-Lindley paradox?". 80 (1): 73–93. doi:10.1086/668875. 85558267.
5. Sprenger, Jan (2013). "Testing a precise null hypothesis: The case of Lindley's paradox". 80 (5): 733–744. doi:10.1086/673730. 27444939. hdl:2318/1657960.
6. Robert, Christian P. (2014). "On the Jeffreys-Lindley paradox". 81 (2): 216–232. arXiv:1303.5973. doi:10.1086/675729. 120002033.
