In statistics, the likelihood principle is the proposition that, given a statistical model, all the evidence in a sample relevant to model parameters is contained in the likelihood function.
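A likelihood function arises from a probability density function considered as a function of its distributional parameterization argument. For example, consider a model which gives the probability density function $f_X(x \mid \theta)$ of an observable random variable $X$ as a function of a parameter $\theta$. Then, for a specific value $x$ of $X$, the function $\mathcal{L}(\theta \mid x) = f_X(x \mid \theta)$ is a likelihood function of $\theta$: it gives a measure of how "likely" any particular value of $\theta$ is, given that $X$ takes the value $x$.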
Two likelihood functions are equivalent if one is a scalar multiple of the other. The likelihood principle is this: All information from the data that is relevant to inferences about the value of the model parameters is in the equivalence class to which the likelihood function belongs. The strong likelihood principle applies this same criterion to cases such as sequential experiments where the sample of data that is available results from applying a stopping rule to the observations earlier in the experiment.[1]
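Suppose $X$ is the number of successes in twelve independent Bernoulli trials, each with probability $\theta$ of success, and $Y$ is the number of independent Bernoulli trials needed to obtain a total of three successes, again each with probability $\theta$ of success (for a fair coin, $\theta = \tfrac{1}{2}$).
Then the observation that $X = 3$ induces the likelihood function
$$\mathcal{L}(\theta \mid X = 3) = \binom{12}{3}\,\theta^3 (1-\theta)^9 = 220\,\theta^3 (1-\theta)^9,$$
while the observation that $Y = 12$ induces the likelihood function
$$\mathcal{L}(\theta \mid Y = 12) = \binom{11}{2}\,\theta^3 (1-\theta)^9 = 55\,\theta^3 (1-\theta)^9.$$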
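The likelihood principle says that, as the data are the same in both cases, the inferences drawn about the value of $\theta$ should also be the same. In addition, all the inferential content in the data about the value of $\theta$ is contained in the two likelihoods, and is the same if they are proportional to one another. This is the case in the above example: the difference between observing $X = 3$ and observing $Y = 12$ lies not in the actual data collected but only in the design of the experiment.
Specifically, in one case, the decision in advance was to try twelve times, regardless of the outcome; in the other case, the advance decision was to keep trying until three successes were observed. If you support the likelihood principle, then inference about $\theta$ should be the same in both cases, because the two likelihood functions are proportional to each other.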
This equivalence is not always the case, however. The use of frequentist methods involving p-values leads to different inferences for the two cases above,[2] showing that the outcome of frequentist methods depends on the experimental procedure, and thus violates the likelihood principle.
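The proportionality of the two likelihood functions in the example can be checked numerically. The following is a minimal sketch, assuming the binomial and negative-binomial forms given above; the function names and the grid of parameter values are illustrative choices, not part of the source.

```python
from math import comb


def likelihood_fixed_trials(theta: float) -> float:
    """L(theta | X = 3): three successes in a fixed twelve trials."""
    return comb(12, 3) * theta**3 * (1 - theta) ** 9


def likelihood_fixed_successes(theta: float) -> float:
    """L(theta | Y = 12): the third success arrives on the twelfth trial."""
    return comb(11, 2) * theta**3 * (1 - theta) ** 9


# Interior grid of candidate parameter values (the endpoints give likelihood zero).
thetas = [i / 100 for i in range(1, 100)]

# Ratio of the two likelihoods at each theta; proportionality means it is constant.
ratios = [likelihood_fixed_trials(t) / likelihood_fixed_successes(t) for t in thetas]

print(f"smallest ratio: {min(ratios):.6f}, largest ratio: {max(ratios):.6f}")
```

The ratio equals 220/55 = 4 at every grid point, so the two functions belong to the same equivalence class in the sense used by the likelihood principle.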
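The law of likelihood is the notion that the extent to which the evidence supports one parameter value or hypothesis against another is indicated by the ratio of their likelihoods, the likelihood ratio
$$\Lambda = \frac{\mathcal{L}(a \mid X = x)}{\mathcal{L}(b \mid X = x)} = \frac{P(X = x \mid a)}{P(X = x \mid b)},$$
which is the degree to which the observation $x$ supports parameter value or hypothesis $a$ against $b$.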
In Bayesian statistics, this ratio is known as the Bayes factor, and Bayes' rule can be seen as the application of the law of likelihood to inference.
In frequentist inference, the likelihood ratio is used in the likelihood-ratio test, but other non-likelihood tests are used as well. The Neyman–Pearson lemma states that the likelihood-ratio test is the most powerful test for comparing two simple hypotheses at a given significance level, which gives a frequentist justification for the law of likelihood.
Combining the likelihood principle with the law of likelihood yields the consequence that the parameter value which maximizes the likelihood function is the value which is most strongly supported by the evidence. This is the basis for the widely used method of maximum likelihood.
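As a rough illustration of the likelihood ratio and of maximum likelihood, the sketch below reuses the binomial data from the earlier example (3 successes in 12 trials). The helper names, the two candidate values 0.25 and 0.5, and the grid search are assumptions made for illustration; a grid search is only one simple way to locate the maximum.

```python
from math import comb


def likelihood(theta: float, successes: int = 3, trials: int = 12) -> float:
    """Binomial likelihood of theta for the observed number of successes."""
    return comb(trials, successes) * theta**successes * (1 - theta) ** (trials - successes)


def likelihood_ratio(a: float, b: float) -> float:
    """Lambda: how strongly the data support value a against value b."""
    return likelihood(a) / likelihood(b)


# Compare theta = 0.25 against theta = 0.5; a ratio above 1 favours the first value.
lam = likelihood_ratio(0.25, 0.5)

# Crude maximum-likelihood estimate: the grid point with the largest likelihood.
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=likelihood)

print(f"likelihood ratio: {lam:.4f}, grid MLE: {theta_hat}")
```

The grid maximum falls at the observed success fraction 3/12, and the ratio exceeds 1, so under the law of likelihood these data support 0.25 over 0.5.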
The likelihood principle was first identified by that name in print in 1962 (Barnard et al., Birnbaum, and Savage et al.), but arguments for the same principle, unnamed, and the use of the principle in applications go back to the works of R.A. Fisher in the 1920s. The law of likelihood was identified by that name by I. Hacking (1965). More recently the likelihood principle as a general principle of inference has been championed by A.W.F. Edwards. The likelihood principle has been applied to the philosophy of science by R. Royall.[3]
Birnbaum (1962) initially argued that the likelihood principle follows from two more primitive and seemingly reasonable principles, the conditionality principle and the sufficiency principle:
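The conditionality principle says that if an experiment is chosen by a random process independent of the states of nature $\theta$, then only the experiment actually performed is relevant to inferences about $\theta$.
The sufficiency principle says that if $T(X)$ is a sufficient statistic for $\theta$, and if in two experiments with data $x_1$ and $x_2$ we have $T(x_1) = T(x_2)$, then the evidence about $\theta$ given by the two experiments is the same.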
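The sufficiency principle can be illustrated with a small sketch, assuming independent Bernoulli trials with success probability $\theta$, for which the number of successes is a sufficient statistic. The two data sets and the function name below are hypothetical, chosen only so that the statistic agrees.

```python
from math import isclose


def sequence_likelihood(theta: float, outcomes: list[int]) -> float:
    """Likelihood of theta for an observed 0/1 sequence of independent trials."""
    value = 1.0
    for outcome in outcomes:
        value *= theta if outcome == 1 else 1 - theta
    return value


# Two hypothetical data sets with the same sufficient statistic T(x) = sum(x) = 3.
x1 = [1, 1, 1, 0, 0, 0, 0, 0]
x2 = [0, 0, 1, 0, 1, 0, 0, 1]

# Evaluate both likelihoods at a few parameter values and compare them.
thetas = [0.1, 0.25, 0.5, 0.9]
same_everywhere = all(
    isclose(sequence_likelihood(t, x1), sequence_likelihood(t, x2))
    for t in thetas
)

print(same_everywhere)
```

Because the likelihood depends on the data only through the number of successes, the two sequences yield the same likelihood at each value tried, which is the situation the sufficiency principle describes.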
However, upon further consideration Birnbaum rejected both his conditionality principle and the likelihood principle.[4] The adequacy of Birnbaum's original argument has also been contested by others (see below for details).
Some widely used methods of conventional statistics, for example many significance tests, are not consistent with the likelihood principle.
Let us briefly consider some of the arguments for and against the likelihood principle.
According to Giere (1977),[5] Birnbaum rejected both his own conditionality principle and the likelihood principle because they were both incompatible with what he called the “confidence concept of statistical evidence”, which Birnbaum (1970) describes as taking “from the Neyman-Pearson approach techniques for systematically appraising and bounding the probabilities (under respective hypotheses) of seriously misleading interpretations of data” (p. 1033). The confidence concept incorporates only limited aspects of the likelihood concept and only some applications of the conditionality concept. Birnbaum later notes that it was the unqualified equivalence formulation of his 1962 version of the conditionality principle that led “to the monster of the likelihood axiom” ([6] p. 263).
Birnbaum's original argument for the likelihood principle has also been disputed by other statisticians including Akaike,[7] Evans[8] and philosophers of science, including Deborah Mayo.[9] [10] Dawid points out fundamental differences between Mayo's and Birnbaum's definitions of the conditionality principle, arguing Birnbaum's argument cannot be so readily dismissed.[11] A new proof of the likelihood principle has been provided by Gandenberger that addresses some of the counterarguments to the original proof.[12]
Unrealized events play a role in some common statistical methods. For example, the result of a significance test depends on the p-value, the probability of a result as extreme or more extreme than the observation, and that probability may depend on the design of the experiment. To the extent that the likelihood principle is accepted, such methods are therefore denied.
Some classical significance tests are not based on the likelihood. The following are a simple and more complicated example of those, using a commonly cited example called the optional stopping problem.
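Suppose I tell you that I tossed a coin 12 times and in the process observed 3 heads. You might make some inference about the probability of heads and whether the coin was fair.
Suppose now I tell you that I tossed the coin until I observed 3 heads, and I tossed it 12 times. Will you now make some different inference?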
The likelihood function is the same in both cases: it is proportional to $p^3 (1-p)^9$.
So according to the likelihood principle, in either case the inference should be the same.
Adam, a scientist, conducted 12 trials and obtained 3 successes and 9 failures; the third success was the 12th and last observation.
Bill, a colleague in the same lab, continued Adam's work and published Adam's results, along with a significance test. He tested the null hypothesis $H_0$ that $p$, the success probability, is equal to a half, versus the alternative $p < 0.5$. If we ignore the information that the third success was the 12th and last observation, the probability of the observed result, that out of 12 trials 3 or fewer (i.e. more extreme) were successes, if $H_0$ is true, is
$$\left[\binom{12}{3} + \binom{12}{2} + \binom{12}{1} + \binom{12}{0}\right]\left(\frac{1}{2}\right)^{12},$$
which is $\tfrac{299}{4096} \approx 7.3\%$. Thus the null hypothesis is not rejected at the 5% significance level if we ignore the knowledge that the third success was the 12th result.
However, observe that this first calculation also includes sequences of length 12 that end in tails, contrary to the problem statement.
If we redo this calculation, the probability under the null hypothesis must be the probability of a fair coin landing 2 or fewer heads in 11 trials multiplied by the probability of the fair coin landing heads on the 12th trial:
$$\left[\binom{11}{2} + \binom{11}{1} + \binom{11}{0}\right]\left(\frac{1}{2}\right)^{11} \cdot \frac{1}{2},$$
which is $\tfrac{67}{2048} \cdot \tfrac{1}{2} = \tfrac{67}{4096} \approx 1.64\%$. Now the result is statistically significant at the 5% level.
Charlotte, another scientist, reads Bill's paper and writes a letter, saying that it is possible that Adam kept trying until he obtained 3 successes, in which case the probability of needing to conduct 12 or more experiments is the probability of 2 or fewer successes in the first 11 trials,
$$\left[\binom{11}{2} + \binom{11}{1} + \binom{11}{0}\right]\left(\frac{1}{2}\right)^{11},$$
which is $\tfrac{67}{2048} = \tfrac{134}{4096} \approx 3.27\%$. Now the result is statistically significant at the 5% level. Note that there is no contradiction between the latter two analyses: each is a correct computation of a different tail probability, and both fall below the 5% level, whereas Bill's first calculation does not.
To these scientists, whether a result is significant or not depends on the design of the experiment, and not only on the likelihood (in the sense of the likelihood function) of the parameter value being $\tfrac{1}{2}$.
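The tail probabilities in this example can be checked with exact rational arithmetic. The sketch below assumes a success probability of one half under the null hypothesis and computes three quantities: the probability of 3 or fewer successes in a fixed 12 trials, the probability of 2 or fewer successes in 11 trials followed by a success on the 12th, and the probability of needing 12 or more trials to reach 3 successes. The variable names are illustrative and not from the source.

```python
from fractions import Fraction
from math import comb

HALF = Fraction(1, 2)

# Fixed design of 12 trials: probability of 3 or fewer successes.
p_fixed_trials = sum(comb(12, k) for k in range(4)) * HALF**12

# At most 2 successes in the first 11 trials AND a success on trial 12.
p_ends_in_success = sum(comb(11, k) for k in range(3)) * HALF**11 * HALF

# Stop at the third success: needing 12 or more trials means that
# the first 11 trials contained at most 2 successes.
p_twelve_or_more = sum(comb(11, k) for k in range(3)) * HALF**11

for label, p in [
    ("fixed 12 trials", p_fixed_trials),
    ("<=2 in 11, then success", p_ends_in_success),
    ("12 or more trials needed", p_twelve_or_more),
]:
    print(f"{label}: {p} = {float(p):.4f}")
```

Using `Fraction` keeps the results exact, so the values can be compared directly with the fractions quoted in the text and with the 5% threshold.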
Similar themes appear when comparing Fisher's exact test with Pearson's chi-squared test.
An argument in favor of the likelihood principle is given by Edwards in his book Likelihood. He cites the following story from J.W. Pratt, slightly condensed here. Note that the likelihood function depends only on what actually happened, and not on what could have happened.
An engineer draws a random sample of electron tubes and measures their voltages. The measurements range from 75 to 99 Volts. A statistician computes the sample mean and a confidence interval for the true mean. Later the statistician discovers that the voltmeter reads only as far as 100 Volts, so technically, the population appears to be “censored”. If the statistician is orthodox this necessitates a new analysis.
However, the engineer says he has another meter reading to 1000 Volts, which he would have used if any voltage had been over 100. This is a relief to the statistician, because it means the population was effectively uncensored after all. But later, the statistician discovers that the second meter had not been working when the measurements were taken. The engineer informs the statistician that he would not have held up the original measurements until the second meter was fixed, and the statistician informs him that new measurements are required. The engineer is astounded. “Next you'll be asking about my oscilloscope!”