The Brier Score is a strictly proper score function or strictly proper scoring rule that measures the accuracy of probabilistic predictions. For unidimensional predictions, it is strictly equivalent to the mean squared error as applied to predicted probabilities.
The Brier score is applicable to tasks in which predictions must assign probabilities to a set of mutually exclusive discrete outcomes or classes. The set of possible outcomes can be either binary or categorical in nature, and the probabilities assigned to this set of outcomes must sum to one (where each individual probability is in the range of 0 to 1). It was proposed by Glenn W. Brier in 1950.[1]
The Brier score can be thought of as a cost function. More precisely, across all items $i \in \{1, \dots, N\}$ in a set of $N$ predictions, the Brier score measures the mean squared difference between the predicted probability assigned to the possible outcomes for item $i$ and the actual outcome $o_i$.
Therefore, the lower the Brier score is for a set of predictions, the better the predictions are calibrated. Note that the Brier score, in its most common formulation, takes on a value between zero and one, since this is the square of the largest possible difference between a predicted probability (which must be between zero and one) and the actual outcome (which can take on values of only 0 or 1). In the original (1950) formulation of the Brier score, the range is double, from zero to two.
The Brier score is appropriate for binary and categorical outcomes that can be structured as true or false, but it is inappropriate for ordinal variables which can take on three or more values.
The most common formulation of the Brier score is

$$BS = \frac{1}{N}\sum_{t=1}^{N}\left(f_t - o_t\right)^2$$

in which $f_t$ is the probability that was forecast at instance $t$, $o_t$ is the actual outcome of the event at instance $t$ ($0$ if it does not happen and $1$ if it does happen), and $N$ is the number of forecasting instances.
Suppose that one is forecasting the probability $P$ that it will rain on a given day. Then the Brier score is calculated as follows:

- If the forecast is 100% ($P = 1$) and it rains, the Brier score is $(1 - 1)^2 = 0$, the best score achievable.
- If the forecast is 100% ($P = 1$) and it does not rain, the Brier score is $(1 - 0)^2 = 1$, the worst score achievable.
- If the forecast is 70% ($P = 0.70$) and it rains, the Brier score is $(0.70 - 1)^2 = 0.09$.
- If the forecast is 30% ($P = 0.30$) and it rains, the Brier score is $(0.30 - 1)^2 = 0.49$.
- If the forecast is 50% ($P = 0.50$), the Brier score is $(0.50 - 1)^2 = (0.50 - 0)^2 = 0.25$, regardless of whether it rains.
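The arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the function name `brier_score` is ours, not a standard library routine:

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and
    binary outcomes (each outcome is 0 or 1)."""
    n = len(forecasts)
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / n

# Single-forecast cases from the rain example above:
score_70_rain = brier_score([0.70], [1])  # (0.70 - 1)^2, approx 0.09
score_30_rain = brier_score([0.30], [1])  # (0.30 - 1)^2, approx 0.49
score_50_any  = brier_score([0.50], [0])  # 0.25 whether or not it rains
```

Averaging over many forecast instances works the same way: pass equal-length lists of probabilities and 0/1 outcomes.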
Although the above formulation is the most widely used, the original definition by Brier applies to multi-category forecasts as well, and it remains a proper scoring rule there, while the binary form (as used in the examples above) is proper only for binary events. For binary forecasts, the original formulation of Brier's "probability score" has twice the value of the score currently known as the Brier score.
$$BS = \frac{1}{N}\sum_{t=1}^{N}\sum_{i=1}^{R}\left(f_{ti} - o_{ti}\right)^2$$

in which $R$ is the number of possible classes in which the event can fall, and $N$ the overall number of instances of all classes. $f_{ti}$ is the predicted probability for class $i$ at instance $t$, and $o_{ti}$ is $1$ if the outcome at instance $t$ is class $i$, and $0$ otherwise. For a binary event such as rain / no rain, $R = 2$, while for a three-category forecast such as cold / normal / warm, $R = 3$.
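The multi-category form can be sketched as follows (the function name is ours); note that on a binary event it yields twice the value of the binary score, as stated above:

```python
def multiclass_brier_score(forecasts, outcomes):
    """Original (1950) multi-category Brier score.

    forecasts: list of per-instance probability vectors (summing to 1)
    outcomes:  list of per-instance one-hot vectors (a single 1 per instance)
    """
    n = len(forecasts)
    return sum(
        sum((f_i - o_i) ** 2 for f_i, o_i in zip(f, o))
        for f, o in zip(forecasts, outcomes)
    ) / n

# Binary rain / no-rain case: a 70% rain forecast followed by rain.
# The two-class score is twice the binary score: 2 * 0.09 = 0.18.
score = multiclass_brier_score([[0.70, 0.30]], [[1, 0]])
```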
There are several decompositions of the Brier score which provide deeper insight into the behavior of a binary classifier.
The Brier score can be decomposed into three additive components: uncertainty, reliability, and resolution (Murphy 1973).[2]
$$BS = REL - RES + UNC$$
Each of these components can be decomposed further according to the number of possible classes in which the event can fall. Abusing the equality sign:

$$BS = \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k \left(f_k - \bar{o}_k\right)^2}_{REL} - \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k \left(\bar{o}_k - \bar{o}\right)^2}_{RES} + \underbrace{\bar{o}\left(1 - \bar{o}\right)}_{UNC}$$

with $N$ the total number of forecasts issued, $K$ the number of unique forecasts issued, $\bar{o} = \frac{1}{N}\sum_{t=1}^{N} o_t$ the observed climatological base rate for the event to occur, $n_k$ the number of forecasts with the same probability category $f_k$, and $\bar{o}_k$ the observed frequency of the event given forecasts of probability $f_k$.
The reliability term measures how close the forecast probabilities are to the true probabilities, given that forecast. Note that reliability here runs in the opposite direction to the everyday English sense of the word: a reliability of 0 means the forecast is perfectly reliable. For example, if we group all forecast instances where an 80% chance of rain was forecast, we get perfect reliability only if it rained 4 out of 5 times after such a forecast was issued.
The resolution term measures how much the conditional probabilities given by the different forecasts differ from the climatic average. The higher this term is, the better. In the worst case, when the climatic probability is always forecast, the resolution is zero. In the best case, when the conditional probabilities are zero and one, the resolution is equal to the uncertainty.
The uncertainty term measures the inherent uncertainty in the outcomes of the event. For binary events, it is at a maximum when each outcome occurs 50% of the time, and is minimal (zero) if an outcome always occurs or never occurs.
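The three-component decomposition can be sketched in Python (function and variable names are ours). Grouping instances by their issued forecast probability and computing the observed frequency within each group recovers BS = REL − RES + UNC exactly:

```python
from collections import defaultdict

def murphy_decomposition(forecasts, outcomes):
    """Return (REL, RES, UNC) for binary outcomes, grouping
    instances that share the same forecast probability."""
    n = len(forecasts)
    o_bar = sum(outcomes) / n  # climatological base rate

    groups = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)

    rel = sum(len(os) * (f - sum(os) / len(os)) ** 2
              for f, os in groups.items()) / n
    res = sum(len(os) * (sum(os) / len(os) - o_bar) ** 2
              for f, os in groups.items()) / n
    unc = o_bar * (1 - o_bar)
    return rel, res, unc

# Toy sample: five 80% forecasts (rain 4/5 times), three 20% forecasts.
forecasts = [0.8, 0.8, 0.8, 0.8, 0.8, 0.2, 0.2, 0.2]
outcomes  = [1,   1,   1,   1,   0,   0,   0,   1]
rel, res, unc = murphy_decomposition(forecasts, outcomes)
bs = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)
# The decomposition is exact: bs == rel - res + unc (up to rounding).
```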
An alternative (and related) decomposition generates two terms instead of three.
$$BS = CAL + REF$$
$$BS = \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k \left(f_k - \bar{o}_k\right)^2}_{CAL} + \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k\, \bar{o}_k \left(1 - \bar{o}_k\right)}_{REF}$$
The first term is known as calibration (and can be used as a measure of calibration, see statistical calibration), and is equal to reliability. The second term is known as refinement, and it is an aggregation of resolution and uncertainty, and is related to the area under the ROC Curve.
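Under the same grouping by forecast probability, the two-term decomposition can be checked directly; CAL equals the reliability term, and REF equals uncertainty minus resolution (names in this sketch are ours):

```python
from collections import defaultdict

def two_term_decomposition(forecasts, outcomes):
    """Return (CAL, REF) for binary outcomes: CAL is the reliability
    term; REF aggregates the per-group terms n_k * o_bar_k * (1 - o_bar_k)."""
    n = len(forecasts)
    groups = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)
    cal = sum(len(os) * (f - sum(os) / len(os)) ** 2
              for f, os in groups.items()) / n
    ref = sum(len(os) * (sum(os) / len(os)) * (1 - sum(os) / len(os))
              for os in groups.values()) / n
    return cal, ref

# Same toy sample as before: five 80% forecasts, three 20% forecasts.
forecasts = [0.8, 0.8, 0.8, 0.8, 0.8, 0.2, 0.2, 0.2]
outcomes  = [1,   1,   1,   1,   0,   0,   0,   1]
cal, ref = two_term_decomposition(forecasts, outcomes)
bs = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)
# bs == cal + ref (up to rounding).
```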
The Brier Score, and the CAL + REF decomposition, can be represented graphically through the so-called Brier Curves,[3] where the expected loss is shown for each operating condition. This makes the Brier Score a measure of aggregated performance under a uniform distribution of class asymmetries.[4]
A skill score for a given underlying score is an offset and (negatively) scaled variant of the underlying score, defined so that a skill score of zero means that the predictions are merely as good as a set of baseline (reference, or default) predictions, while a skill score of one (100%) represents the best possible score. A skill score below zero means that the performance is even worse than that of the baseline predictions. When the underlying score is the Brier score (BS), the Brier skill score (BSS) is calculated as
$$BSS = 1 - \frac{BS}{BS_{ref}}$$

where $BS_{ref}$ is the Brier score of the reference predictions,

$$BS_{ref} = \frac{1}{N}\sum_{t=1}^{N}\left(\bar{o} - o_t\right)^2$$

where $\bar{o}$ is the overall average of the observed outcomes,

$$\bar{o} = \frac{1}{N}\sum_{t=1}^{N} o_t.$$
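A sketch of the skill score against the climatological reference forecast (the function name is ours). Note that when the reference is the in-sample base rate $\bar{o}$, $BS_{ref}$ coincides with the uncertainty term of the Murphy decomposition:

```python
def brier_skill_score(forecasts, outcomes):
    """Brier skill score against the constant base-rate forecast
    o_bar (the mean of the observed binary outcomes)."""
    n = len(forecasts)
    o_bar = sum(outcomes) / n
    bs = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / n
    bs_ref = sum((o_bar - o) ** 2 for o in outcomes) / n
    return 1 - bs / bs_ref

# Same toy sample as in the decomposition examples:
forecasts = [0.8, 0.8, 0.8, 0.8, 0.8, 0.2, 0.2, 0.2]
outcomes  = [1,   1,   1,   1,   0,   0,   0,   1]
bss = brier_skill_score(forecasts, outcomes)  # approx 0.1893
```

A BSS of about 0.19 here means the forecasts reduce the Brier score by roughly 19% relative to always forecasting the base rate.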
With a Brier score, lower is better (it is a loss function) with 0 being the best possible score. But with a Brier skill score, higher is better with 1 (100%) being the best possible score.
The Brier skill score can be more interpretable than the Brier score because the BSS is simply the percentage improvement in the BS compared to the reference model, and a negative BSS means you are doing even worse than the reference model, which may not be obvious from looking at the Brier score itself. However, a BSS near 100% should not typically be expected because this would require that every probability prediction was nearly 0 or 1 (and was correct of course).
Although the Brier score is a strictly proper scoring rule, the BSS is not strictly proper: skill scores are generally non-proper even when the underlying scoring rule is proper.[7] Still, Murphy (1973)[8] proved that the BSS is asymptotically proper for a large number of samples.
Note that classification's (probability estimation's) BSS is to its BS as regression's coefficient of determination ($R^2$) is to its mean squared error (MSE).
The Brier score becomes inadequate for very rare (or very frequent) events, because it does not sufficiently discriminate between small changes in forecast that are significant for rare events.[9] Wilks (2010) has found that "[Q]uite large sample sizes, i.e. n > 1000, are required for higher-skill forecasts of relatively rare events, whereas only quite modest sample sizes are needed for low-skill forecasts of common events."[10]