1. Introduction
In this tutorial, we explain the BLEU score in Natural Language Processing (NLP).
2. Why Do We Need the BLEU Score?
BLEU (Bilingual Evaluation Understudy) is a quantitative metric for measuring the quality of an output text by comparing it with one or more reference texts.
We need it in NLP tasks for estimating the performance of systems with textual output, e.g., tools for image summarization, question-answering systems, and chatbots. We train them using datasets in which inputs (such as questions) are paired with the reference texts we expect at the output (such as the correct answers to those questions). So, a metric for estimating output quality is necessary for training and evaluating such models.
3. The BLEU Score
To calculate the BLEU score, we need to be familiar with N-grams, precision, and clipped precision.
3.1. Precision
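Precision is the fraction of words in the output text that also occur in the reference text:
$$\text{Precision} = \frac{\text{number of output words that occur in the reference text}}{\text{total number of words in the output text}}$$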
For example:
- Output = She drinks the milk.
- Reference = She drank the milk.
The words “She”, “the”, and “milk” in the output text occur in the reference text. Since the output text has four words, the precision is $\frac{3}{4}$.
However, this precision measure has two downsides. First, it doesn’t catch repetition. For example, if the output text were “milk milk milk”, the precision would be $\frac{3}{3} = 1$.
Second, it doesn’t take multiple reference texts into account. To address these issues, we use clipped precision.
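The plain precision above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names `tokenize` and `precision` are ours, and we assume a simple tokenization (lowercasing, dropping ASCII punctuation, splitting on whitespace) and a non-empty output text.

```python
import string


def tokenize(text):
    # Assumption: lowercase, drop ASCII punctuation, split on whitespace.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def precision(output, reference):
    # Fraction of output words that occur anywhere in the reference.
    out_tokens = tokenize(output)
    ref_vocab = set(tokenize(reference))
    matched = sum(1 for token in out_tokens if token in ref_vocab)
    return matched / len(out_tokens)


print(precision("She drinks the milk.", "She drank the milk."))  # 0.75
print(precision("milk milk milk", "She drank the milk."))        # 1.0
```

The second call shows the repetition problem: a meaningless output still gets a perfect score.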
3.2. Clipped Precision
Let’s take a look at an example of an output text with repetition that also has multiple reference texts:
- Output text: She She She eats a sour cherry.
- Reference text 1: She is eating a blueberry as she loves it.
- Reference text 2: She eats a fruit of her favorite.
In this example, the words “She”, “eats”, and “a” in the output text occur in at least one reference text. However, the word “She” is repeated three times in the output.
In clipped precision, we cap the count of each word in the output text at the maximum count of that word in any single reference text:
$$\text{Clipped Precision} = \frac{\text{clipped number of correct output words}}{\text{total number of words in the output text}}$$
In our example, the maximum count of “She” in the reference texts is 2 (in reference text 1, ignoring case). Therefore, the clipped count of “She” becomes 2. If a reference text had contained more than three occurrences of “She”, the clipped count would have stayed at 3, the number of occurrences in the output. Similarly, the output words “eats” and “a” occur at most once in any single reference text. So, their clipped count is 1.
Since there are seven words in the output text, the clipped precision is:
$$\text{Clipped Precision} = \frac{2 + 1 + 1}{7} = \frac{4}{7} \approx 0.571$$
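To make the clipping step concrete, here is a short Python sketch of clipped precision with multiple references. The function names are hypothetical, and we again assume case-insensitive matching with punctuation removed, which is what lets “She” and “she” count as the same word.

```python
import string
from collections import Counter


def tokenize(text):
    # Assumption: lowercase, drop ASCII punctuation, split on whitespace.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def clipped_precision(output, references):
    out_counts = Counter(tokenize(output))
    ref_counts = [Counter(tokenize(ref)) for ref in references]
    clipped = 0
    for word, count in out_counts.items():
        # Cap the output count at the largest count in any single reference.
        max_in_refs = max(counts[word] for counts in ref_counts)
        clipped += min(count, max_in_refs)
    return clipped / sum(out_counts.values())


references = [
    "She is eating a blueberry as she loves it.",
    "She eats a fruit of her favorite.",
]
score = clipped_precision("She She She eats a sour cherry.", references)
print(round(score, 3))  # 0.571
```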
3.3. BLEU Score Calculation
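The BLEU score is always defined with a particular $N$ in mind. It uses $n$-grams with $n = 1, 2, \ldots, N$, and we denote it as $\text{BLEU}_N$:
$$\text{BLEU}_N = \text{BP} \times \text{GMP}_N$$
The geometric mean precision score $\text{GMP}_N$ is a weighted geometric mean of the clipped $n$-gram precisions for $n = 1, 2, \ldots, N$, and the brevity penalty $\text{BP}$ penalizes output texts that are shorter than the reference.
The formula for the clipped precision of $n$-grams is:
$$p_n = \frac{\text{clipped number of correct } n\text{-grams in the output text}}{\text{total number of } n\text{-grams in the output text}}$$
So, the weighted geometric mean is:
$$\text{GMP}_N = \prod_{n=1}^{N} p_n^{w_n} = \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$
The values $w_n$ are the weights we give to each clipped precision. Typically, we use uniform weights. So, for $\text{BLEU}_4$ with 4 clipped precisions, we’ll use $w_n = \frac{1}{4}$ ($n = 1, 2, 3, 4$).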
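The clipped n-gram precisions and their weighted geometric mean could be computed along the following lines. This is only a sketch: the function names are ours, the inputs are assumed to be already tokenized, the two short sentences are made up for illustration, and returning 0 when any precision is 0 is our choice for handling the undefined logarithm.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    # All contiguous n-token windows, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_ngram_precision(out_tokens, ref_token_lists, n):
    out_counts = Counter(ngrams(out_tokens, n))
    total = sum(out_counts.values())
    if total == 0:
        return 0.0  # Assumption: no n-grams of this order means precision 0.
    max_ref = Counter()
    for ref_tokens in ref_token_lists:
        for gram, count in Counter(ngrams(ref_tokens, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in out_counts.items())
    return clipped / total


def geometric_mean_precision(precisions, weights=None):
    if weights is None:
        weights = [1 / len(precisions)] * len(precisions)  # uniform weights
    if any(p == 0 for p in precisions):
        return 0.0  # Assumption: the log is undefined at 0, so we return 0.
    return math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))


# Hypothetical, pre-tokenized example sentences.
out = "she eats a sour cherry".split()
ref = "she eats a red cherry".split()
p1 = clipped_ngram_precision(out, [ref], 1)
p2 = clipped_ngram_precision(out, [ref], 2)
gmp = geometric_mean_precision([p1, p2])
print(p1, p2)          # 0.8 0.5
print(round(gmp, 3))   # 0.632
```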
3.4. Brevity Penalty
We penalize short output texts to avoid high scores for outputs that don’t make sense. For example, if the output text consists of a single word that also occurs in the reference text, we’ll end up with a clipped precision of 1. To solve this issue, we need an additional factor that penalizes short output texts. That’s what the brevity penalty does:
$$\text{BP} = \begin{cases} 1, & c > r \\ e^{1 - r/c}, & c \leq r \end{cases}$$
In the formula, $c$ is the number of words in the output text, and $r$ is the number of words in the reference text.
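As a quick illustration, the brevity penalty can be written as a small Python function. The name `brevity_penalty` and its parameters `c` (output length in words) and `r` (reference length in words) are our own labels, and returning 0 for an empty output is an assumption we add to avoid dividing by zero.

```python
import math


def brevity_penalty(c, r):
    # c: number of words in the output, r: number of words in the reference.
    if c == 0:
        return 0.0  # Assumption: an empty output gets the maximum penalty.
    if c > r:
        return 1.0  # Outputs longer than the reference are not penalized.
    return math.exp(1 - r / c)


print(brevity_penalty(8, 8))            # 1.0
print(round(brevity_penalty(1, 8), 4))  # 0.0009
```

The second call shows how strongly a one-word output is penalized against an eight-word reference.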
4. Example
Let’s calculate $\text{BLEU}_4$ for the following example:
- Output Text: The match was postponed because of the snow.
- Reference Text: The match was postponed because it was snowing.
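First, we calculate the clipped precisions. The output text contains 8 unigrams, 7 bigrams, 6 trigrams, and 5 four-grams. For unigrams, “the” occurs twice in the output but only once in the reference, so its count is clipped to 1:
$$p_1 = \frac{5}{8}, \quad p_2 = \frac{4}{7}, \quad p_3 = \frac{3}{6}, \quad p_4 = \frac{2}{5}$$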
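Then, we compute the weighted geometric mean precision with uniform weights and get $\left(\frac{5}{8} \cdot \frac{4}{7} \cdot \frac{3}{6} \cdot \frac{2}{5}\right)^{1/4} \approx 0.517$. Next, we compute the brevity penalty with $c = 8$ output words and $r = 8$ reference words, which results in 1. Finally, $\text{BLEU}_4 = 1 \times 0.517 \approx 0.517$.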
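We can check this calculation with a short, self-contained Python sketch. It is a minimal single-reference version under our own assumptions (lowercasing, punctuation removed, uniform weights, a score of 0 if any precision is 0); the function names are hypothetical and not taken from any library.

```python
import math
import string
from collections import Counter


def tokenize(text):
    # Assumption: lowercase, drop ASCII punctuation, split on whitespace.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(output, reference, max_n=4):
    out, ref = tokenize(output), tokenize(reference)
    if not out:
        return [0.0] * max_n, 0.0, 0.0
    precisions = []
    for n in range(1, max_n + 1):
        out_counts, ref_counts = ngram_counts(out, n), ngram_counts(ref, n)
        total = sum(out_counts.values())
        clipped = sum(min(c, ref_counts[g]) for g, c in out_counts.items())
        precisions.append(clipped / total if total else 0.0)
    if min(precisions) == 0:
        gmp = 0.0  # Assumption: any zero precision gives a zero score.
    else:
        gmp = math.exp(sum(math.log(p) for p in precisions) / max_n)
    c, r = len(out), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return precisions, bp, bp * gmp


precisions, bp, score = bleu(
    "The match was postponed because of the snow.",
    "The match was postponed because it was snowing.",
)
print([round(p, 3) for p in precisions])  # [0.625, 0.571, 0.5, 0.4]
print(bp)                                 # 1.0
print(round(score, 3))                    # 0.517
```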
5. Pros and Cons of BLEU Score
Because it is simple, explainable, and language-independent, the BLEU score has been widely used in NLP.
However, this metric neither considers the meaning of words nor understands how significant each word is in the text. For example, prepositions usually carry the least importance, but BLEU treats them as being just as important as noun and verb keywords.
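In addition, BLEU doesn’t recognize word variants: in our first example, “drinks” and “drank” count as two unrelated words. It also captures word order only locally, through matching $n$-grams with $n > 1$, so it can’t account for word order over longer spans.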
6. Conclusion
In this article, we examined the BLEU score as a widely used metric for evaluating text outputs in NLP. It’s simple to compute but falls short when it comes to text semantics.