BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text that has been machine-translated from one language to another. It scores a generated sentence against one or more reference sentences. The BLEU score is commonly used in natural language processing to assess the performance of machine translation models.

How BLEU works

A higher BLEU score represents a closer match to the reference translation. BLEU considers precision, but not recall: each word in the generated sentence is credited at most as many times as it appears in any single reference sentence, but the precision itself does not penalize the generated sentence for missing words from the reference.
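The precision count described above can be sketched in a few lines of Python. This is a minimal illustration for single words only, assuming whitespace-tokenized input; the function name `clipped_unigram_precision` and the example sentences are hypothetical and not part of any library.

```python
from collections import Counter

def clipped_unigram_precision(candidate, references):
    """Share of candidate words found in the references, with capped counts."""
    cand_counts = Counter(candidate)
    if not cand_counts:
        return 0.0
    # For each word, the most times it appears in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    # A candidate word is credited at most max_ref_counts[word] times.
    clipped = sum(min(count, max_ref_counts[word])
                  for word, count in cand_counts.items())
    return clipped / len(candidate)

references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]

# Repeating a common word does not help: "the" is credited only twice.
candidate = "the the the the the the the".split()
print(clipped_unigram_precision(candidate, references))  # 2/7, about 0.286

# No recall: a two-word candidate that omits most of the reference
# still gets a perfect precision.
print(clipped_unigram_precision("the cat".split(), references))  # 1.0
```

The second call shows why precision alone is not enough: very short output can match perfectly while leaving out most of the reference.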

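BLEU's precision is a modified form, in which word counts are capped, so that a translation cannot score well simply by repeating common words. In addition, because precision alone favors very short output, BLEU applies a brevity penalty: the score is multiplied by a factor that shrinks as the generated text gets shorter than the reference, and that equals 1 when the generated text is at least as long as the reference.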
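The length penalty can be sketched as follows. This assumes the standard brevity penalty formulation, exp(1 - r/c) for a candidate of length c that is no longer than a reference of length r, and 1 otherwise; the function name `brevity_penalty` is chosen here for illustration.

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """Multiplier in (0, 1] that lowers the score of short candidates."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0  # longer candidates are not penalized here
    return math.exp(1.0 - reference_len / candidate_len)

print(brevity_penalty(6, 6))   # 1.0, same length as the reference
print(brevity_penalty(10, 6))  # 1.0, longer than the reference
print(brevity_penalty(3, 6))   # exp(-1), about 0.368
```

Overly long candidates need no separate penalty, since their extra words already lower the precision.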

Further, BLEU takes into account sequences of words, or n-grams (groups of n consecutive words). It compares the generated and reference sentences by matching individual words (1-grams), pairs of consecutive words (2-grams), triplets of consecutive words (3-grams), and so on, typically up to 4-grams, and combines the resulting precisions with a geometric mean. Matching longer n-grams captures more of the word order and local grammatical structure.
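Putting the pieces together, a sentence-level score can be sketched as below. This is a simplified illustration, not a reference implementation: it assumes whitespace tokenization, uniform weights over 1-grams to 4-grams, no smoothing, and the closest reference length for the brevity penalty. The helper names (`ngrams`, `modified_precision`, `sentence_bleu`) are hypothetical; established toolkits differ in tokenization and smoothing details.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All runs of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of the candidate against the references."""
    cand_counts = Counter(ngrams(candidate, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
    return clipped / total

def sentence_bleu(candidate, references, max_n=4):
    """Brevity penalty times the geometric mean of 1..max_n precisions."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # unsmoothed: one empty n-gram level zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c = len(candidate)
    # Reference length closest to the candidate (ties go to the shorter one).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(log_mean)

reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()

print(ngrams(candidate, 2)[:2])                       # [('the', 'cat'), ('cat', 'is')]
print(sentence_bleu(reference, [reference]))          # 1.0
print(sentence_bleu(candidate, [reference]))          # 0.0, no 4-gram matches
print(sentence_bleu(candidate, [reference], max_n=2)) # about 0.707
```

The zero in the third call comes from the unsmoothed geometric mean: one changed word in a six-word sentence breaks every 4-gram.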

Despite its simplicity and popularity, BLEU has limitations. It does not consider semantics or meaning: a translation can distort or lose the intended meaning and still receive a high BLEU score if enough of its n-grams overlap with the reference. It was also designed as a corpus-level metric, so scores for individual sentences can be unreliable; for instance, a sentence with no matching 4-grams scores zero unless smoothing is applied. Finally, it doesn't handle synonyms or paraphrasing well, as it compares exact words and not their meanings.
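The meaning and synonym problems are easy to reproduce with a toy example. The sketch below uses plain exact-match word precision, which is only one ingredient of BLEU, and made-up sentences; `unigram_precision` is a hypothetical helper, so treat the numbers as illustrative rather than as real BLEU scores.

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Exact-match word precision of a candidate against one reference."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[word]) for word, count in cand.items())
    return matched / total

reference = "he bought a car yesterday"
paraphrase = "he purchased an automobile yesterday"  # same meaning
wrong = "he sold a car yesterday"                    # opposite meaning

print(unigram_precision(paraphrase, reference))  # 0.4
print(unigram_precision(wrong, reference))       # 0.8
```

The faithful paraphrase scores half as much as the sentence that reverses the meaning, because only surface word overlap is counted.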

Regardless, BLEU remains a popular choice for quick automated evaluation, especially in machine translation tasks, due to its simplicity and speed. For a comprehensive assessment of translation quality, it should be complemented with other metrics or human evaluation.
