BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is" - this is the central idea behind BLEU. Invented at IBM in 2001, BLEU was one of the first metrics to claim a high correlation with human judgements of quality, and remains one of the most popular automated and inexpensive metrics.
Scores are calculated for individual translated segments—generally sentences—by comparing them with a set of good quality reference translations. Those scores are then averaged over the whole corpus to reach an estimate of the translation's overall quality. Neither intelligibility nor grammatical correctness is taken into account.
BLEU's output is always a number between 0 and 1. This value indicates how similar the candidate text is to the reference texts, with values closer to 1 representing more similar texts. Few human translations will attain a score of 1, since this would indicate that the candidate is identical to one of the reference translations. For this reason, it is not necessary to attain a score of 1. Because there are more opportunities to match, adding additional reference translations will increase the BLEU score.
A basic, first attempt at defining the BLEU score would take two arguments: a candidate string $\hat{y}$ and a list of reference strings $(y^{(1)}, \dots, y^{(N)})$. The idea is that $\operatorname{BLEU}(\hat{y};\, y^{(1)}, \dots, y^{(N)})$ should be close to 1 when $\hat{y}$ is similar to the references, and close to 0 when it is not.

As an analogy, the BLEU score is like a language teacher trying to score the quality of a student translation $\hat{y}$ by checking how closely it follows the reference answers $y^{(1)}, \dots, y^{(N)}$.
Since in natural language processing one should evaluate a large set of candidate strings, the BLEU score must be generalized to the case where one has a list of $M$ candidate strings (called a "corpus") $(\hat{y}^{(1)}, \dots, \hat{y}^{(M)})$ and, for each candidate string $\hat{y}^{(i)}$, a list of reference strings $S_i := (y^{(i,1)}, \dots, y^{(i,N_i)})$.
Given any string $y = y_1 y_2 \cdots y_K$ and any integer $n \geq 1$, define the set of its $n$-grams to be
$$G_n(y) = \{\, y_1 \cdots y_n,\; y_2 \cdots y_{n+1},\; \dots,\; y_{K-n+1} \cdots y_K \,\}.$$
Note that this is a set of unique elements, not a multiset allowing redundant elements, so that, for example, $G_2(abab) = \{ab, ba\}$.
Given any two strings $s$ and $y$, define the substring count $C(s, y)$ to be the number of appearances of $s$ as a substring of $y$. For example, $C(ab, abcbab) = 2$.
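As a concrete illustration, here is a minimal Python sketch of the two definitions above. The function names `ngram_set` and `substring_count` are ours, chosen to mirror the notation $G_n$ and $C$; strings of tokens can be passed as plain strings (character tokens) or lists of words.

```python
# Illustrative sketch of G_n(y) and C(s, y); not from any particular library.

def ngram_set(y, n):
    """G_n(y): the *set* of n-grams of y (duplicates collapsed)."""
    return {tuple(y[i:i + n]) for i in range(len(y) - n + 1)}

def substring_count(s, y):
    """C(s, y): number of (possibly overlapping) occurrences of s in y."""
    n = len(s)
    return sum(1 for i in range(len(y) - n + 1) if tuple(y[i:i + n]) == tuple(s))

# Reproduces the examples in the text:
assert ngram_set("abab", 2) == {("a", "b"), ("b", "a")}   # G_2(abab) = {ab, ba}
assert substring_count("ab", "abcbab") == 2               # C(ab, abcbab) = 2
```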
Now, fix a candidate corpus $\hat{S} := (\hat{y}^{(1)}, \dots, \hat{y}^{(M)})$ and a reference corpus $S = (S_1, \dots, S_M)$, where each $S_i := (y^{(i,1)}, \dots, y^{(i,N_i)})$.
Define the modified n-gram precision function to be
$$p_n(\hat{S}; S) := \frac{\displaystyle\sum_{i=1}^{M} \sum_{s \in G_n(\hat{y}^{(i)})} \min\!\Big(C(s, \hat{y}^{(i)}),\; \max_{y \in S_i} C(s, y)\Big)}{\displaystyle\sum_{i=1}^{M} \sum_{s \in G_n(\hat{y}^{(i)})} C(s, \hat{y}^{(i)})}.$$
The modified n-gram precision, which looks complicated, is merely a straightforward generalization of the prototypical case: one candidate sentence and one reference sentence. In this case, it is
$$p_n(\hat{y}; y) = \frac{\displaystyle\sum_{s \in G_n(\hat{y})} \min\!\big(C(s, \hat{y}),\, C(s, y)\big)}{\displaystyle\sum_{s \in G_n(\hat{y})} C(s, \hat{y})}.$$
To work up to this expression, we start with the most obvious n-gram count summation:
$$\sum_{s \in G_n(\hat{y})} C(s, y).$$
This quantity measures how many n-grams in the reference sentence are reproduced by the candidate sentence. Note that we count the n-substrings, not n-grams. For example, when $\hat{y} = aba$, $y = abababa$, $n = 2$, each 2-gram of $\hat{y}$ ($ab$ and $ba$) appears 3 times in $y$, so the sum is 6, even though $\hat{y}$ itself contains each of them only once.
In the above situation, however, the candidate string is too short. Instead of 3 appearances of $ab$ it contains only one, so we add a minimum to correct for that:
$$\sum_{s \in G_n(\hat{y})} \min\!\big(C(s, \hat{y}),\, C(s, y)\big).$$
This count summation cannot be used to compare sentences, since it is not normalized: if both the reference and the candidate sentences are long, the count could be large even when the candidate is of very poor quality. Dividing by the total n-substring count of the candidate gives the modified n-gram precision above, which always lies in $[0, 1]$ and therefore allows meaningful comparisons.
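The following is a minimal, illustrative Python sketch of the modified n-gram precision for a single candidate and its list of references; the corpus-level $p_n$ simply sums numerators and denominators over all candidates. The helper names are assumptions of this sketch, not from any library.

```python
# Sketch of the modified n-gram precision p_n for one candidate sentence.
from collections import Counter

def ngram_counts(tokens, n):
    """Multiset of n-grams of a token sequence (i.e. C(s, y) for every s)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """p_n(candidate; references): clipped n-gram matches / candidate n-grams."""
    cand = ngram_counts(candidate, n)
    if not cand:
        return 0.0
    clipped = 0
    for gram, count in cand.items():
        max_ref = max(ngram_counts(ref, n)[gram] for ref in references)
        clipped += min(count, max_ref)      # clip C(s, y_hat) at max_y C(s, y)
    return clipped / sum(cand.values())

# The example from the text: candidate "aba", reference "abababa", n = 2.
print(modified_precision(list("aba"), [list("abababa")], 2))   # -> 1.0
```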
The modified n-gram precision unduly gives a high score to candidate strings that are "telegraphic", that is, strings that contain all the n-grams of the reference strings, but each as few times as possible.
In order to punish candidate strings that are too short, define the brevity penalty to be
$$BP(\hat{S}; S) := e^{-(r/c - 1)^{+}},$$
where $(r/c - 1)^{+} = \max(0,\, r/c - 1)$ is the positive part of $r/c - 1$.

When $r \leq c$, the brevity penalty is $BP = 1$, so long candidates are not punished; only short candidates are. When $r > c$, it is $BP = e^{1 - r/c}$. Here, $c$ is the length of the candidate corpus, that is, $c := \sum_{i=1}^{M} |\hat{y}^{(i)}|$, where $|y|$ denotes the length of $y$, and $r$ is the effective reference corpus length, that is, $r := \sum_{i=1}^{M} |y^{(i,j_i)}|$, where
$$y^{(i,j_i)} = \arg\min_{y \in S_i} \big|\, |y| - |\hat{y}^{(i)}|\, \big|,$$
i.e. the sentence in $S_i$ whose length is as close to $|\hat{y}^{(i)}|$ as possible.
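A small sketch of this brevity penalty, working directly on sentence lengths; the function and argument names (`brevity_penalty`, `candidate_lengths`, `reference_length_lists`) are illustrative choices of this sketch.

```python
# Sketch of the brevity penalty BP defined above.
import math

def brevity_penalty(candidate_lengths, reference_length_lists):
    c = sum(candidate_lengths)
    # effective reference length r: for each i, pick the reference length
    # closest to the candidate's length (argmin over S_i of ||y| - |y_hat^(i)||)
    r = sum(
        min(ref_lens, key=lambda L: abs(L - c_len))
        for c_len, ref_lens in zip(candidate_lengths, reference_length_lists)
    )
    return 1.0 if c >= r else math.exp(1 - r / c)

print(brevity_penalty([7], [[6, 7]]))   # -> 1.0 (candidate long enough)
print(brevity_penalty([2], [[6, 7]]))   # -> e^(1 - 6/2) ≈ 0.135
```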
There is not a single definition of BLEU, but a whole family of them, parametrized by a weighting vector $w := (w_1, w_2, \dots)$. It is a probability distribution on $\{1, 2, 3, \dots\}$, that is, $\sum_{i=1}^{\infty} w_i = 1$ and $w_i \in [0, 1]$ for every $i$.

With a choice of $w$, the BLEU score is
$$BLEU_w(\hat{S}; S) := BP(\hat{S}; S) \cdot \exp\!\left(\sum_{n=1}^{\infty} w_n \ln p_n(\hat{S}; S)\right).$$
In words, it is a weighted geometric mean of all the modified n-gram precisions, multiplied by the brevity penalty.

The most typical choice, the one recommended in the original paper, is $w_1 = \cdots = w_4 = \tfrac{1}{4}$, with $w_n = 0$ for $n > 4$.
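Putting the pieces together, the following is a compact, illustrative sketch of $BLEU_w$ with the typical weights $w_1 = \dots = w_4 = 1/4$; it mirrors the formula above but is not the reference implementation from the original paper, and the function names are assumptions of this sketch.

```python
# Sketch of BLEU_w: clipped n-gram precisions, weighted geometric mean, brevity penalty.
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidates, reference_lists, max_n=4):
    weights = [1.0 / max_n] * max_n
    log_precisions = []
    for n in range(1, max_n + 1):
        clipped, total = 0, 0
        for cand, refs in zip(candidates, reference_lists):
            counts = ngram_counts(cand, n)
            for gram, count in counts.items():
                clipped += min(count, max(ngram_counts(r, n)[gram] for r in refs))
            total += sum(counts.values())
        if clipped == 0:          # log(0) undefined; the whole score collapses to 0
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # brevity penalty: effective reference length vs. candidate corpus length
    c = sum(len(cand) for cand in candidates)
    r = sum(min((len(ref) for ref in refs), key=lambda L: abs(L - len(cand)))
            for cand, refs in zip(candidates, reference_lists))
    bp = 1.0 if c >= r else math.exp(1 - r / c)
    return bp * math.exp(sum(w * lp for w, lp in zip(weights, log_precisions)))

refs = [["the", "cat", "is", "on", "the", "mat"],
        ["there", "is", "a", "cat", "on", "the", "mat"]]
print(bleu([["the", "cat", "is", "on", "the", "mat"]], [refs]))   # -> 1.0
```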
This is illustrated in the following example from Papineni et al. (2002):
Candidate | the | the | the | the | the | the | the |
---|---|---|---|---|---|---|---|
Reference 1 | the | cat | is | on | the | mat | |
Reference 2 | there | is | a | cat | on | the | mat |
Of the seven words in the candidate translation, all of them appear in the reference translations. Thus the candidate text is given a unigram precision of
$$P = \frac{m}{w_t} = \frac{7}{7} = 1,$$
where $m$ is the number of words from the candidate that are found in the references and $w_t$ is the total number of words in the candidate. This is a perfect score, despite the fact that the candidate translation above retains little of the content of either reference.
The modification that BLEU makes is fairly straightforward. For each word in the candidate translation, the algorithm takes its maximum total count, $m_{max}$, in any of the reference translations. In the example above, the word "the" appears twice in reference 1 and once in reference 2, so $m_{max} = 2$.
For the candidate translation, the count $m_w$ of each word is clipped to a maximum of $m_{max}$ for that word. In this case, "the" appears seven times in the candidate, so $m_w = 7$ is clipped to $m_{max} = 2$. These clipped counts $m_w$ are then summed over all distinct words in the candidate, and the sum is divided by the total number of words in the candidate translation. The modified unigram precision of the example above is therefore
$$P = \frac{2}{7}.$$
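The clipping step can be checked with a few lines of Python; this is an illustrative calculation using the example sentences above, not library code.

```python
# Verifying the modified (clipped) unigram precision of 2/7.
from collections import Counter

candidate = "the the the the the the the".split()
ref1 = "the cat is on the mat".split()
ref2 = "there is a cat on the mat".split()

cand_counts = Counter(candidate)                          # {'the': 7}
clipped = sum(
    min(count, max(Counter(ref1)[w], Counter(ref2)[w]))   # m_max('the') = 2
    for w, count in cand_counts.items()
)
print(clipped, "/", len(candidate))                       # -> 2 / 7
```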
In practice, however, using individual words as the unit of comparison is not optimal. Instead, BLEU computes the same modified precision metric using n-grams. The length which has the "highest correlation with monolingual human judgements" was found to be four. The unigram scores are found to account for the adequacy of the translation, that is, how much information is retained. The longer n-gram scores account for the fluency of the translation, or the extent to which it reads like "good English".
| Model | Set of grams | Score |
|---|---|---|
| Unigram | "the", "the", "cat" | (1 + 1 + 1) / 3 = 1 |
| Grouped Unigram | "the"*2, "cat"*1 | (1 + 1) / (2 + 1) = 2/3 |
| Bigram | "the the", "the cat" | (0 + 1) / 2 = 1/2 |
An example of a candidate translation for the same references as above might be:
the cat
In this example, the modified unigram precision would be,
$$P = \frac{1}{2} + \frac{1}{2} = \frac{2}{2}$$
as the word 'the' and the word 'cat' appear once each in the candidate, and the total number of words is two. The modified bigram precision would be $1/1$, as the bigram "the cat" appears once in the candidate. It has been pointed out that precision is usually twinned with recall to overcome this problem, as the unigram recall of this example would be $3/6$ or $2/7$, depending on which reference is used. The problem is that with multiple reference translations, a bad translation could easily have an inflated recall, such as a translation consisting of all the words in each of the references.
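For the short candidate above, the modified precisions alone cannot signal the problem; it is the brevity penalty defined earlier that punishes it. A rough illustrative calculation, taking the closest reference length $r = 6$ (Reference 1 has six words) and candidate length $c = 2$:

```python
# Brevity penalty for the two-word candidate "the cat" (illustrative arithmetic).
import math

c = 2                                        # length of the candidate
r = 6                                        # closest reference length
bp = 1.0 if c >= r else math.exp(1 - r / c)
print(bp)                                    # -> e^(1 - 3) ≈ 0.135
```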
To produce a score for the whole corpus, the modified precision scores for the segments are combined using the geometric mean, multiplied by a brevity penalty that prevents very short candidates from receiving too high a score. Let $r$ be the total length of the reference corpus and $c$ the total length of the translation corpus. If $c \leq r$, the brevity penalty applies and is defined to be $e^{(1 - r/c)}$. (In the case of multiple reference sentences, $r$ is taken to be the sum of the lengths of the references whose lengths are closest to those of the candidate sentences.)
iBLEU is an interactive version of BLEU that allows a user to visually examine the BLEU scores obtained by the candidate translations. It also allows comparing two different systems in a visual and interactive manner which is useful for system development.
BLEU has frequently been reported as correlating well with human judgement, and remains a benchmark for the assessment of any new evaluation metric. There are, however, a number of criticisms that have been voiced. It has been noted that, although in principle capable of evaluating translations of any language, BLEU cannot, in its present form, deal with languages lacking word boundaries. Although designed to be used with several reference translations, in practice it is often used with only a single one.[2] BLEU is also notoriously dependent on the tokenization technique: scores obtained with different tokenizers are not comparable, which is often overlooked. To improve reproducibility and comparability, the SacreBLEU variant was designed.
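For instance, assuming the sacrebleu Python package is installed (`pip install sacrebleu`), a corpus-level score with its standardized tokenization might be computed roughly as follows; this is a usage sketch, not a recommendation of any particular configuration.

```python
# Usage sketch with the SacreBLEU package mentioned above.
import sacrebleu

hypotheses = ["the cat is on the mat"]          # one system output per sentence
references = [["the cat is on the mat"],        # first reference for each sentence
              ["there is a cat on the mat"]]    # second reference for each sentence
result = sacrebleu.corpus_bleu(hypotheses, references)
print(result.score)                             # BLEU reported on a 0-100 scale
```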
It has been argued that although BLEU has significant advantages, there is no guarantee that an increase in BLEU score is an indicator of improved translation quality.