LLM Summarization Metrics

June 4, 2024
Authored by
Franklin Cardenoso Fernandez
Researcher at Holistic AI

Text summarization is one of the many practical applications of large language models (LLMs). The task takes a long text as input and returns a summary that condenses its most important information into a shorter passage. Although LLMs handle this task with apparent ease, some questions arise: How can we trust the information we receive? Can the quality of the summarized text be measured? These questions lead to a topic widely studied in natural language processing (NLP): the development and use of metrics to assess generated text.

This blog presents some of the metrics used in text summarization and shows how to use them in code. Fortunately, most of these metrics are implemented within the Hugging Face ecosystem, so using them is straightforward. You can also try them out through this Kaggle notebook.


In this blog we will review:

  • Summarization metrics
  • Traditional or statistical metrics such as BLEU, ROUGE and METEOR, and
  • LLM-based metrics like BERTScore, Harim+ and TrueTeacher

Text summarization metrics

Text summarization is one of the most common tasks studied in NLP. It condenses large amounts of information into shorter texts that contain the main ideas of the input. With the recent explosion of LLMs, the task has become increasingly popular and accessible to the general public. However, although LLMs have proven to be among the most powerful tools in NLP, researchers and developers still face a concern: how to evaluate the generated text. This has driven the development of metrics that give a general view of the quality, consistency and coherence of the summaries.

The development of metrics is not a recent topic; ever since text generation models appeared, quantitative metrics have been developed to assess their outputs. Among the most popular are traditional metrics like BLEU, ROUGE, and METEOR. With the spread of LLMs, more recent metrics categorized as model-based have also appeared, such as the popular BERTScore, among others.

Next, we will describe some of the metrics that can help us assess the outputs of our summarization models. For now, we will focus on those available in the Hugging Face library, because they are easy to use in code and ready to run.

Traditional or statistical metrics

BLEU (BiLingual Evaluation Understudy)

BLEU was introduced by Papineni et al. in 2002 as a method for the quantitative evaluation of generated text. Although this metric was initially proposed for machine translation, it can also be used to evaluate text summaries or even text completion.

Its basic functioning relies on calculating the precision of the n-grams generated by the model against the reference text, assigning a score based on the overlap of the n-grams, and penalizing shorter generations compared to the reference text to ensure accurate quality measurement.

Among its limitations, this metric doesn't consider the semantic meaning or coherence of the given text, since it primarily focuses on n-gram overlap; furthermore, it may penalize synonyms or paraphrases. A basic interpretation is that good generations will likely share more n-grams with the reference text, resulting in a higher BLEU score. The score ranges between 0 and 1, where 1 indicates, in our case, a summary that matches the reference perfectly.

Here is a look at its implementation in a Python script:

import evaluate

bleu = evaluate.load("bleu")

predictions = ["hello there general kenobi", "foo bar foobar"]

# Each prediction can have one or more reference texts
references = [
    ["hello there general kenobi", "hello there !"],
    ["foo bar foobar"],
]

results = bleu.compute(predictions=predictions, references=references)
print(results)
# {'bleu': 1.0, 'precisions': [1.0, 1.0, 1.0, 1.0], 'brevity_penalty': 1.0, 'length_ratio': 1.1666666666666667, 'translation_length': 7, 'reference_length': 6}
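
To make the mechanics behind the score more concrete, here is a minimal pure-Python sketch of the two ingredients described earlier: clipped n-gram precision and the brevity penalty. It is a simplified illustration for one candidate and one reference, not the implementation used by the evaluate library. The function names (ngrams, clipped_precision, brevity_penalty, simple_bleu) are our own, and tokenization is assumed to be plain whitespace splitting.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference.

    Each n-gram is credited at most as many times as it occurs in the reference.
    """
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def brevity_penalty(candidate, reference):
    """Return 1.0 when the candidate is at least as long as the reference, less otherwise."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)


def simple_bleu(candidate, reference, max_n=4):
    """Geometric mean of the 1..max_n clipped precisions, scaled by the brevity penalty."""
    precisions = [clipped_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch: one empty order zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, reference) * math.exp(log_mean)


reference = "hello there general kenobi".split()
print(simple_bleu(reference, reference))  # 1.0: identical texts
# Every n-gram of the shorter candidate matches, so only the brevity penalty lowers the score
print(simple_bleu("hello there general".split(), reference, max_n=2))
```

The second call shows why the length penalty matters: without it, a very short candidate that only copies a few reference words would still obtain a perfect precision.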


ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE is another metric for evaluating generated text, and it was designed specifically for automatic summarization. Unlike BLEU, which measures precision based on n-gram overlap, ROUGE measures recall by checking how many n-grams from the reference text also appear in the generation. Like BLEU, ROUGE calculates the overlap of n-grams, and it also evaluates the overlap of individual words (unigrams), capturing word-level similarity.

Although this metric does not capture the semantic similarity of the generated text either, it provides a powerful tool for the automatic quality evaluation of text summaries through its different variants:

  • ROUGE-N: Measures overlap of n-grams.  
  • ROUGE-L: Measures longest common subsequences.  
  • ROUGE-W: Measures weighted longest common subsequences, giving higher importance to contiguous matches.  
  • ROUGE-S: Measures skip-bigram co-occurrences, capturing word order information.

Although they work similarly, their use will depend on the length of the evaluated summary.

Here, we can observe a basic implementation of the metric:

import evaluate

rouge = evaluate.load('rouge')
predictions = ["hello there", "general kenobi"]
references = ["hello there", "general kenobi"]
results = rouge.compute(predictions=predictions, references=references)
print(results)
# {'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0, 'rougeLsum': 1.0}
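
The recall-oriented idea behind ROUGE-N and ROUGE-L can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions (whitespace tokens, a single reference, recall only); library implementations typically also report precision and an F-measure, and the function names below are our own.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Share of reference n-grams that also appear in the candidate."""
    ref = ngram_counts(reference, n)
    cand = ngram_counts(candidate, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """Longest common subsequence length divided by the reference length."""
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0


reference = "the cat sat on the mat".split()
candidate = "the cat lay on a mat".split()
print(rouge_n_recall(candidate, reference, n=1))  # 4 of the 6 reference unigrams are recovered
print(rouge_n_recall(candidate, reference, n=2))  # only ("the", "cat") matches out of 5 bigrams
print(rouge_l_recall(candidate, reference))       # LCS "the cat on mat" has length 4
```

Note how ROUGE-L rewards words that appear in the same order even when they are not adjacent, which the bigram variant misses.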


METEOR (Metric for Evaluation of Translation with Explicit ORdering)

METEOR was initially proposed for measuring the quality of generated text on translation tasks; however, because of its versatility, it can also be used for summarization evaluation. Unlike previous metrics, METEOR considers not only exact word matches but also semantic similarity and lexical variation between the generated text and its reference.

To assign a score, METEOR calculates the precision and recall of exact word matches, using a word-to-word alignment mechanism to find correspondences between words and measure the lexical overlap between the two texts. It also incorporates synonyms and morphological variants to assess semantic equivalence. It then computes individual scores for precision, recall, and alignment and combines them using a formula whose parameters balance their contributions. The final METEOR score reflects the overall quality of the generated text, considering both exact matches and semantic similarity.

Its performance is language-dependent and may vary across different languages and domains. Here, we can observe a basic implementation of the metric:

import evaluate  

meteor = evaluate.load('meteor') 
predictions = ["It is a guide to action which ensures that the military always obeys the commands of the party"] 
references = ["It is a guide to action that ensures that the military will forever heed Party commands"] 
results = meteor.compute(predictions=predictions, references=references) 
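
The way METEOR combines precision, recall and word order can be illustrated with a simplified sketch. It assumes exact word matches only (no stemming or synonyms), a greedy left-to-right alignment rather than an optimal one, and the weights of the original METEOR formulation (a recall-weighted harmonic mean and a fragmentation penalty of 0.5 * (chunks / matches)^3). Library implementations may use different parameters, and the function names here are our own.

```python
def align_exact(candidate, reference):
    """Greedily pair each candidate word with the first unused identical reference word."""
    used = set()
    pairs = []
    for i, word in enumerate(candidate):
        for j, ref_word in enumerate(reference):
            if j not in used and word == ref_word:
                used.add(j)
                pairs.append((i, j))
                break
    return pairs


def count_chunks(pairs):
    """Count runs of matches that are adjacent in both the candidate and the reference."""
    chunks = 0
    previous = None
    for i, j in pairs:
        if previous is None or i != previous[0] + 1 or j != previous[1] + 1:
            chunks += 1
        previous = (i, j)
    return chunks


def simple_meteor(candidate, reference):
    """Recall-weighted F-mean reduced by a penalty for fragmented matches."""
    pairs = align_exact(candidate, reference)
    matches = len(pairs)
    if matches == 0:
        return 0.0
    precision = matches / len(candidate)
    recall = matches / len(reference)
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    penalty = 0.5 * (count_chunks(pairs) / matches) ** 3
    return f_mean * (1 - penalty)


reference = "general kenobi".split()
print(simple_meteor("general kenobi".split(), reference))  # same words, same order
print(simple_meteor("kenobi general".split(), reference))  # same words, order swapped
```

Both candidates contain exactly the reference words, so precision and recall are perfect in both cases; the swapped order splits the alignment into two chunks and is penalized more heavily.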




LLM-based metrics

BERTScore

Unlike the previously described metrics, which primarily rely on exact word matches or overlap, BERTScore uses contextual embeddings from pre-trained language models, such as BERT, to calculate the semantic similarity between the generated text and the reference.

The measurement starts by computing contextual embeddings for the tokens of the generated text and of the reference with the pre-trained language model. Cosine similarity between these embeddings is then used to match tokens across the two texts and to calculate precision and recall. These values are combined into the F1 BERTScore measure, which makes the metric more robust to word variations and differences in syntactic structure.

Although it is a robust measurement that captures contextual understanding, it can be computationally expensive, since it requires computing embeddings with a pre-trained language model.

Here, we can observe a basic implementation of the metric:

from evaluate import load 
bertscore = load("bertscore") 
predictions = ["hello there", "general kenobi"] 
references = ["hello there", "general kenobi"] 
results = bertscore.compute(predictions=predictions, references=references, lang="en") 
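
The matching step that turns embeddings into precision, recall and F1 can be sketched without any model. In the illustration below, the two-dimensional vectors are hypothetical stand-ins for the contextual token embeddings that a model such as BERT would produce, and the function names are our own; the real metric also supports options such as importance weighting that are left out here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_bertscore(cand_vecs, ref_vecs):
    """Match each token to its most similar counterpart and average the similarities."""
    # Recall: how well each reference token is covered by the candidate
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    # Precision: how well each candidate token is supported by the reference
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy embeddings, one vector per token
ref_vecs = [[1.0, 0.0], [0.0, 1.0]]
cand_vecs = [[1.0, 0.0], [1.0, 1.0]]

precision, recall, f1 = greedy_bertscore(cand_vecs, ref_vecs)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Because the similarity is computed between vectors rather than strings, a candidate token can receive partial credit for being close in meaning to a reference token, which is what overlap-based metrics cannot do.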



Harim+

Harim+ is another LLM-based metric. It uses a modified summarization model to read the original text and estimate the quality of a generated summary. It builds on a previous approach that introduced a regularization term, interpreted as hallucination risk, into the objective function. Harim+ slightly modifies that approach by replacing the auxiliary language model in the encoder-decoder structure with an empty-source encoder-decoder, which makes it possible to calculate the hallucination risk term.

We can consider Harim+ a reference-free metric because it estimates overall generation quality from the token likelihoods that the summarization model assigns to the given summary, conditioned on the input text. A higher score indicates a better summary and, because the score accounts for hallucination risk, a less hallucinated one.

An implementation of this metric is the following:

import evaluate
from pprint import pprint

art = """Spain's 2-0 defeat by Holland on Tuesday brought back bitter memories of their disastrous 2014 World Cup, but coach Vicente del Bosque will not be too worried about a third straight friendly defeat, insists Gerard Pique. Holland, whose ....'"""

summaries = [
    "holland beat spain 2-0 at the amsterdam arena on tuesday night . ....",
    "holland beat spain 2-0 in the group stage in brazil on tuesday night . del bosque ....",
    "del bosque beat spain 2-0 at the amsterdam arena on tuesday night . stefan de vrij and ....",
    "holland could not beat spain 2-0 at the amsterdam arena on tuesday night .....",
]
# Harim+ is reference-free: every summary is scored against the source article
articles = [art] * len(summaries)

scorer = evaluate.load('NCSOFT/harim_plus')
scores = scorer.compute(predictions=summaries, references=articles)  # use_aggregator=False, bsz=32, return_details=False, tokenwise_score=False)
pprint([round(s, 4) for s in scores])
# Scores reported for the complete example (its summary list is longer than the excerpt above):
# [2.7096, 3.7338, 2.669, 2.4039, 2.3759]

You can find the complete code of this example here: Reference


TrueTeacher

TrueTeacher is another model-based metric, focused on evaluating factual consistency. It relies on a model trained on summaries that were generated by multiple summarization models and annotated for factual consistency by an LLM. The resulting model can then be used to evaluate the consistency of a summary with respect to its source document.

Although the original paper mainly presents a synthetic data generation approach, the authors released both the synthetically generated dataset and the consistency evaluation student model trained on this data.

To create the dataset and the released model, the authors first trained a set of summarization models of varying capacity and used them to summarize a collection of documents. Next, the resulting document-summary pairs were labeled by another LLM, which predicted a factual consistency label for each pair. Finally, the generated dataset was used to train a student model that can measure the consistency of a given document-summary pair.

Among its advantages, thanks to the large-scale dataset used to train the model, TrueTeacher is able to generalize well in multilingual scenarios. On the other hand, the student model is heavy and requires considerable computational resources to run, although the authors encourage researchers to use the released dataset and model checkpoint to mitigate this problem.

An extra consideration is that the model and the dataset were released for research and non-commercial purposes only.

To use the scoring metric, you can use the following code:

from transformers import T5ForConditionalGeneration
from transformers import T5Tokenizer
import torch

model_path = 'google/t5_11b_trueteacher_and_anli'
tokenizer = T5Tokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path)

premise = 'the sun is shining'
for hypothesis, expected in [('the sun is out in the sky', '>> 0.5'),
                             ('the cat is shiny', '<< 0.5')]:
    # Return PyTorch tensors so the ids can be passed directly to the model
    input_ids = tokenizer(
        f'premise: {premise} hypothesis: {hypothesis}',
        return_tensors='pt').input_ids
    # A single decoding step, started from the pad token
    decoder_input_ids = torch.tensor([[tokenizer.pad_token_id]])
    outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
    logits = outputs.logits
    probs = torch.softmax(logits[0], dim=-1)
    # The probability assigned to the token '1' is read as the consistency score
    one_token_id = tokenizer('1').input_ids[0]
    entailment_prob = probs[0, one_token_id].item()
    print(f'premise: {premise}')
    print(f'hypothesis: {hypothesis}')
    print(f'score: {entailment_prob:.3f} (expected: {expected})\n')



As we can see, there are different metrics that can be used for summarization tasks, both traditional and model-based, each with its own particularities. Traditional metrics such as BLEU, ROUGE and METEOR are easy to implement and ready to use. Model-based metrics such as BERTScore, Harim+ and TrueTeacher use pre-trained language models at their core, which can make them computationally expensive and slow to run. However, their main advantage is that they take the contextual meaning of the text into account, which traditional metrics do not, since those mainly rely on words or pieces of text to calculate their scores.

In conclusion, the choice of metric will depend on your needs and on the computational resources available to you. As stated at the beginning, the objective of this blog is to briefly describe the different metrics and present their basic functioning, so that you can choose according to your use case and what you want to evaluate. For a better understanding of how each metric works, we highly recommend reading the original publications. In a future blog, we will review other metrics proposed in the literature.

DISCLAIMER: This blog article is for informational purposes only. This blog article is not intended to, and does not, provide legal advice or a legal opinion. It is not a do-it-yourself guide to resolving legal issues or handling litigation. This blog article is not a substitute for experienced legal counsel and does not provide legal advice regarding any situation or employer.
