In statistics, the score (or informant) is the gradient of the log-likelihood function with respect to the parameter vector. Evaluated at a particular value of the parameter vector, the score indicates the steepness of the log-likelihood function and thereby the sensitivity to infinitesimal changes to the parameter values. If the log-likelihood function is continuously differentiable over the parameter space, the score will vanish at a local maximum or minimum in the interior of that space; this fact is used in maximum likelihood estimation to find the parameter values that maximize the likelihood function.
Since the score is a function of the observations, which are subject to sampling error, it lends itself to a test statistic known as the score test, in which the parameter is held at a particular value. Further, the logarithm of the ratio of two likelihood functions evaluated at two distinct parameter values can be understood as a definite integral of the score function.
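The score is the gradient (the vector of partial derivatives) of $\log \mathcal{L}(\theta; x)$, the natural logarithm of the likelihood function, with respect to an $m$-dimensional parameter vector $\theta$:
$$s(\theta; x) \equiv \frac{\partial \log \mathcal{L}(\theta; x)}{\partial \theta}$$
This differentiation yields a $(1 \times m)$ row vector at each value of $\theta$ and $x$, and indicates the sensitivity of the likelihood (its derivative normalized by its value).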
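As a worked illustration of the definition, consider a model that is assumed here purely for the sake of example: $T$ independent observations $x_1, \ldots, x_T$ from a normal distribution with unknown mean $\mu$ and known variance $\sigma^2$. The symbols $\mu$, $\sigma^2$ and $\bar{x}$ are introduced only for this sketch. The log-likelihood and its derivative with respect to $\mu$ are then
\begin{align} \log \mathcal{L}(\mu; x) &= -\frac{T}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{t=1}^{T} (x_t - \mu)^2, \\ s(\mu; x) &= \frac{\partial \log \mathcal{L}(\mu; x)}{\partial \mu} = \frac{1}{\sigma^2} \sum_{t=1}^{T} (x_t - \mu). \end{align}
In this example the score is positive when $\mu$ lies below the sample mean $\bar{x}$, negative when it lies above, and vanishes at $\mu = \bar{x}$, which is the maximum likelihood estimate of the mean.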
In older literature, "linear score" may refer to the score with respect to infinitesimal translation of a given density. This convention arises from a time when the primary parameter of interest was the mean or median of a distribution. In this case, the likelihood of an observation is given by a density of the form
$$\mathcal{L}(\theta; X) = f(X + \theta).$$
The "linear score" is then defined as
$$s_{\rm linear} = \frac{\partial}{\partial X} \log f(X).$$
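While the score is a function of $\theta$, it also depends on the observations $x = (x_1, x_2, \ldots, x_T)$ at which the likelihood function is evaluated, and in view of the random character of sampling one may take its expected value over the sample space. Under certain regularity conditions on the density functions of the random variables, the expected value of the score, evaluated at the true parameter value $\theta$, is zero. To see this, rewrite the likelihood function $\mathcal{L}$ as a probability density function $\mathcal{L}(\theta; x) = f(x; \theta)$, and denote the sample space $\mathcal{X}$. Then:
\begin{align} \operatorname{E}(s \mid \theta) &= \int_{\mathcal{X}} f(x; \theta) \frac{\partial}{\partial \theta} \log \mathcal{L}(\theta; x) \, dx \\[6pt] &= \int_{\mathcal{X}} f(x; \theta) \frac{1}{f(x; \theta)} \frac{\partial f(x; \theta)}{\partial \theta} \, dx = \int_{\mathcal{X}} \frac{\partial f(x; \theta)}{\partial \theta} \, dx \end{align}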
The assumed regularity conditions allow the interchange of derivative and integral (see Leibniz integral rule), hence the above expression may be rewritten as
$$\frac{\partial}{\partial \theta} \int_{\mathcal{X}} f(x; \theta) \, dx = \frac{\partial}{\partial \theta} 1 = 0.$$
It is worth restating the above result in words: the expected value of the score, at the true parameter value $\theta$, is zero.
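See main article: Fisher information. The variance of the score,
$$\operatorname{Var}(s(\theta)) = \operatorname{E}(s(\theta) s(\theta)^{T}),$$
can be derived from the above expression for the expected value:
\begin{align} 0 &= \frac{\partial}{\partial \theta^{T}} \operatorname{E}(s \mid \theta) \\[6pt] &= \frac{\partial}{\partial \theta^{T}} \int_{\mathcal{X}} \frac{\partial \log \mathcal{L}(\theta; x)}{\partial \theta} f(x; \theta) \, dx \\[6pt] &= \int_{\mathcal{X}} \left\{ \frac{\partial^{2} \log \mathcal{L}(\theta; x)}{\partial \theta \, \partial \theta^{T}} f(x; \theta) + \frac{\partial \log \mathcal{L}(\theta; x)}{\partial \theta} \frac{\partial f(x; \theta)}{\partial \theta^{T}} \right\} dx \\[6pt] &= \int_{\mathcal{X}} \frac{\partial^{2} \log \mathcal{L}(\theta; x)}{\partial \theta \, \partial \theta^{T}} f(x; \theta) \, dx + \int_{\mathcal{X}} \frac{\partial \log \mathcal{L}(\theta; x)}{\partial \theta} \frac{\partial \log \mathcal{L}(\theta; x)}{\partial \theta^{T}} f(x; \theta) \, dx \\[6pt] &= \operatorname{E}\left( \frac{\partial^{2} \log \mathcal{L}(\theta; x)}{\partial \theta \, \partial \theta^{T}} \right) + \operatorname{E}\left( \frac{\partial \log \mathcal{L}(\theta; x)}{\partial \theta} \frac{\partial \log \mathcal{L}(\theta; x)}{\partial \theta^{T}} \right) \end{align}
Hence the variance of the score is equal to the negative expected value of the Hessian matrix of the log-likelihood:
$$\operatorname{E}(s(\theta) s(\theta)^{T}) = -\operatorname{E}\left( \frac{\partial^{2} \log \mathcal{L}}{\partial \theta \, \partial \theta^{T}} \right).$$
The latter is known as the Fisher information and is written $\mathcal{I}(\theta)$. Note that the Fisher information is not a function of any particular observation, as the random variable $X$ has been averaged out.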
Consider observing the first n trials of a Bernoulli process, and seeing that A of them are successes and the remaining B are failures, where the probability of success is θ.
Then the likelihood $\mathcal{L}$ is
$$\mathcal{L}(\theta; A, B) = \frac{(A+B)!}{A! \, B!} \, \theta^{A} (1 - \theta)^{B},$$
so the score $s$ is
$$s = \frac{\partial \log \mathcal{L}}{\partial \theta} = \frac{1}{\mathcal{L}} \frac{\partial \mathcal{L}}{\partial \theta} = \frac{A}{\theta} - \frac{B}{1 - \theta}.$$
We can now verify that the expectation of the score is zero. Noting that the expectation of A is nθ and the expectation of B is n(1 - θ) (recall that A and B are random variables), we can see that the expectation of s is
$$\operatorname{E}(s) = \frac{n\theta}{\theta} - \frac{n(1 - \theta)}{1 - \theta} = n - n = 0.$$
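We can also check the variance of $s$. We know that A + B = n (so B = n - A) and that the variance of A is nθ(1 - θ), so the variance of $s$ is
\begin{align} \operatorname{var}(s) &= \operatorname{var}\left( \frac{A}{\theta} - \frac{n - A}{1 - \theta} \right) = \operatorname{var}\left( A \left( \frac{1}{\theta} + \frac{1}{1 - \theta} \right) \right) \\ &= \left( \frac{1}{\theta} + \frac{1}{1 - \theta} \right)^{2} n \theta (1 - \theta) = \frac{n}{\theta (1 - \theta)}. \end{align}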
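The two checks above can also be carried out numerically. The following is a minimal sketch, assuming that the number of successes A follows a binomial distribution with n trials and success probability θ, so that the mean and variance of the score can be computed exactly by summing over all possible values of A. The function names and the values n = 20 and θ = 0.3 are chosen here for illustration only.

```python
from math import comb


def bernoulli_score(theta, a, b):
    """Score A/theta - B/(1 - theta) for a successes and b failures."""
    return a / theta - b / (1.0 - theta)


def score_moments(n, theta):
    """Exact mean and variance of the score when A ~ Binomial(n, theta)."""
    mean = 0.0
    second_moment = 0.0
    for a in range(n + 1):
        # Probability of observing exactly a successes in n trials.
        prob = comb(n, a) * theta**a * (1.0 - theta) ** (n - a)
        s = bernoulli_score(theta, a, n - a)
        mean += prob * s
        second_moment += prob * s * s
    return mean, second_moment - mean**2


n, theta = 20, 0.3  # illustrative values, not taken from the text
mean, variance = score_moments(n, theta)
print(mean)  # numerically close to zero
print(variance, n / (theta * (1.0 - theta)))  # the two values should agree
```

Up to floating-point rounding, the computed mean should be zero and the computed variance should match n/(θ(1 - θ)).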
For models with binary outcomes (Y = 1 or 0), the model can be scored with the logarithm of predictions
$$S = Y \log(p) + (1 - Y) \log(1 - p)$$
where p is the probability in the model to be estimated and S is the score.[4]
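A minimal sketch of this logarithmic score in code is given below. The function name `log_score` and the example outcomes and predicted probabilities are hypothetical; they only show how the formula is evaluated and summed over several predictions.

```python
import math


def log_score(y, p):
    """Logarithmic score S = Y log(p) + (1 - Y) log(1 - p) for a binary outcome."""
    if y not in (0, 1):
        raise ValueError("y must be 0 or 1")
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return y * math.log(p) + (1 - y) * math.log(1.0 - p)


# Hypothetical outcomes and predicted probabilities of Y = 1.
outcomes = [1, 0, 1, 1]
predictions = [0.9, 0.2, 0.6, 0.3]

total = sum(log_score(y, p) for y, p in zip(outcomes, predictions))
print(total)  # always non-positive; values closer to zero indicate better predictions
```

Because each term is the logarithm of the probability assigned to the outcome that actually occurred, the score is never positive, and confident wrong predictions are penalized heavily.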
See main article: Scoring algorithm. The scoring algorithm is an iterative method for numerically determining the maximum likelihood estimator.
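As a rough illustration, the scoring update adds the inverse Fisher information times the score to the current parameter value and repeats until the step is negligible. The sketch below applies this to the Bernoulli example with A successes and B failures. It makes one assumption that is not in the text: the model is reparameterized by the log-odds η = log(θ/(1 - θ)), for which the score is A - nθ and the Fisher information is nθ(1 - θ). In the original parameterization a single update would already land on A/n, so the log-odds form is used only to make the iteration visible. The function name and the counts 7 and 13 are hypothetical.

```python
import math


def fisher_scoring_log_odds(a, b, eta=0.0, tol=1e-12, max_iter=50):
    """Estimate theta from a successes and b failures by Fisher scoring.

    Works on the log-odds eta (an assumption of this sketch) and requires
    0 < a < a + b so that the maximum lies in the interior.
    """
    n = a + b
    for _ in range(max_iter):
        theta = 1.0 / (1.0 + math.exp(-eta))
        score = a - n * theta  # derivative of the log-likelihood w.r.t. eta
        information = n * theta * (1.0 - theta)  # Fisher information w.r.t. eta
        step = score / information
        eta += step  # scoring update: eta + information^{-1} * score
        if abs(step) < tol:
            break
    return 1.0 / (1.0 + math.exp(-eta))


theta_hat = fisher_scoring_log_odds(a=7, b=13)
print(theta_hat)  # should be close to 7 / 20
```

The iteration stops where the score vanishes, which for this model is the sample proportion of successes.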
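See main article: Score test. Note that $s$ is a function of $\theta$ and the observation $x = (x_1, x_2, \ldots, x_T)$, so that, in general, it is not a statistic. However, in certain applications, such as the score test, the score is evaluated at a specific value of $\theta$ (such as a null-hypothesis value), in which case the result is a statistic.
Further note that the likelihood-ratio test is given by
$$-2 \left[ \log \mathcal{L}(\theta_0) - \log \mathcal{L}(\hat{\theta}) \right] = 2 \int_{\theta_0}^{\hat{\theta}} \frac{d \log \mathcal{L}(\theta)}{d\theta} \, d\theta,$$
which means that the likelihood-ratio test can be understood as the area under the score function between $\theta_0$ and $\hat{\theta}$.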
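The relation between the likelihood-ratio statistic and the integral of the score can be checked numerically. The sketch below reuses the Bernoulli model with A successes and B failures and compares the statistic computed from the two log-likelihood values with twice the integral of the score from the null value to the maximum likelihood estimate. The counts, the null value 0.5, the helper names and the use of a simple midpoint rule are all assumptions made for illustration.

```python
import math


def log_likelihood(theta, a, b):
    """Bernoulli log-likelihood, omitting the constant binomial coefficient."""
    return a * math.log(theta) + b * math.log(1.0 - theta)


def score(theta, a, b):
    """Derivative of the log-likelihood with respect to theta."""
    return a / theta - b / (1.0 - theta)


def integrate_midpoint(func, lower, upper, steps=10_000):
    """Midpoint-rule approximation of the signed integral from lower to upper."""
    width = (upper - lower) / steps
    return width * sum(func(lower + (i + 0.5) * width) for i in range(steps))


a, b = 7, 13  # hypothetical counts of successes and failures
theta_0 = 0.5  # hypothetical null-hypothesis value
theta_hat = a / (a + b)  # maximum likelihood estimate

lr_from_likelihoods = -2.0 * (
    log_likelihood(theta_0, a, b) - log_likelihood(theta_hat, a, b)
)
lr_from_score = 2.0 * integrate_midpoint(
    lambda t: score(t, a, b), theta_0, theta_hat
)
print(lr_from_likelihoods, lr_from_score)  # the two values should agree closely
```

The constant binomial coefficient is omitted because it cancels in the difference of the two log-likelihoods and does not affect the score.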
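See also: Diffusion model. Score matching describes the process of applying machine learning algorithms (commonly neural networks) to approximate the score function $s_\theta \approx \nabla_x \log p(x)$ of an unknown distribution $\pi(x)$ from finite samples. The learned function $s_\theta$ can then be used in generative modeling to draw new samples from $\pi(x)$.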
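It might seem confusing that the word score has been used for $\nabla_x \log p(x)$, because this quantity is not the derivative of a log-likelihood with respect to the parameters; the gradient is instead taken with respect to the data $x$.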
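The difference between the two usages can be seen in a simple case. As an assumed example, take $p(x)$ to be a one-dimensional Gaussian density with mean $\mu$ and variance $\sigma^2$ (symbols introduced only for this illustration). Then
$$\nabla_x \log p(x) = -\frac{x - \mu}{\sigma^2}, \qquad \frac{\partial}{\partial \mu} \log p(x) = \frac{x - \mu}{\sigma^2}.$$
The first expression is the score in the score-matching sense, a function of the data point $x$ that points toward regions of higher density. The second is the score in the likelihood sense for the parameter $\mu$. For this location model the two differ only in sign, but in general they are different objects defined on different spaces.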
The term "score function" may initially seem unrelated to its contemporary meaning, which centers around the derivative of the log-likelihood function in statistical models. This apparent discrepancy can be traced back to the term's historical origins. The concept of the "score function" was first introduced by British statistician Ronald Fisher in his 1935 paper titled "The Detection of Linkage with 'Dominant' Abnormalities."[9] Fisher employed the term in the context of genetic analysis, specifically for families where a parent had a dominant genetic abnormality. Over time, the application and meaning of the "score function" have evolved, diverging from its original context but retaining its foundational principles.[10] [11]
Fisher's initial use of the term was in the context of analyzing genetic attributes in families with a parent possessing a genetic abnormality. He categorized the children of such parents into four classes based on two binary traits: whether they had inherited the abnormality or not, and their zygosity status as either homozygous or heterozygous. Fisher devised a method to assign each family a "score," calculated based on the number of children falling into each of the four categories. This score was used to estimate what he referred to as the "linkage parameter," which described the probability of the genetic abnormality being inherited. Fisher evaluated the efficacy of his scoring rule by comparing it with an alternative rule and against what he termed the "ideal score." The ideal score was defined as the derivative of the logarithm of the sampling density, as mentioned on page 193 of his work.[9]
The term "score" later evolved through subsequent research, notably expanding beyond the specific application in genetics that Fisher had initially addressed. Various authors adapted Fisher's original methodology to more generalized statistical contexts. In these broader applications, the term "score" or "efficient score" started to refer more commonly to the derivative of the log-likelihood function of the statistical model in question. This conceptual expansion was significantly influenced by a 1948 paper by C. R. Rao, which introduced "efficient score tests" that employed the derivative of the log-likelihood function.[12]
Thus, what began as a specialized term in the realm of genetic statistics has evolved to become a fundamental concept in broader statistical theory, often associated with the derivative of the log-likelihood function.