Research:Develop a model for text simplification to improve readability of Wikipedia articles/FY24-25 WE.3.1.3 content simplification
This page captures the work on hypothesis WE.3.1.3 as part of Product & Tech’s Annual Plan for fiscal year 2024–25:
If we develop models for remixing content, such as content simplification or summarization, that can be hosted and served via our infrastructure (e.g. LiftWing), we will establish the technical direction for work focused on increasing reader retention through new content discovery features.
Summary
The hypothesis was confirmed.
- We implemented an LLM in our infrastructure to generate simple summaries of sections of Wikipedia articles
Main deliverables:
- We identified a suitable model based on 5 criteria (multilingual, openness, resources, use-case, quality): Aya-expanse-32b
- Example code to run model in ML-Lab: https://gitlab.wikimedia.org/repos/research/simple-summaries/-/blob/main/simple-summary_aya_example.ipynb
- Test-deployment in LiftWing: phab:T379052
- Defined and implemented a set of guardrail metrics to ensure quality of simple summaries: https://gitlab.wikimedia.org/repos/research/simple-summaries/-/blob/main/simple-summary_eval-guardrail_example.ipynb
- Generated simple summaries and calculated guardrail metrics for 8148 articles (lead sections) in English Wikipedia for the Web Team’s experiments, using two different prompts: simple-summaries-experiment01_aya-expanse
Major lessons
- Our new ML-Lab servers support state-of-the-art multilingual models for text generation (see the Simple Summaries section below).
- Evaluation of the model output is challenging. The lack of simple metrics to judge the quality of the simple summaries (or of any generated text) makes it difficult to iteratively improve the model via offline experiments (i.e. without asking human raters). To address this, we developed a set of 5 interpretable metrics to evaluate and monitor the quality of the simple summaries (see the Evaluation section below).
- However, additional work is needed to optimize these models in order to reduce latency and memory footprint (see Open problems and next steps below).
Next steps
- The crucial next step is to optimize the model latency (memory footprint and inference time) via quantization or other suitable approaches.
Current status
2024-07-04: set up page
2024-07-15: Identified model requirements for respective tasks
2024-08-05: Testing candidate models
2024-09: Decision for use-case to generate simple summaries based on feedback from Web Team's experiments
2024-10: Identification of suitable metrics for evaluating simple summaries via a set of guard rail metrics
2024-11: Implementing model for simple summaries in ML-Lab servers and LiftWing
2024-12: Documentation
2025-07: Updating and expanding evaluation of simple summaries
Background
One of the objectives in the Annual Plan concerns the Reader experience (WE3): A new generation of consumers arrives at Wikipedia to discover a preferred destination for discovering, engaging, and building a lasting connection with encyclopedic content. The goals are to:
- Retain existing and new generations of consumers and donors.
- Increase relevance to existing and new generations of consumers by making our content easier to discover and interact with.
- Work across platforms to adapt our experiences and existing content, so that encyclopedic content can be explored and curated by and to a new generation of consumers and donors.
As part of the Key Result WE.3.1 towards this goal, we want to explore opportunities for readers to more easily discover and learn from content they are interested in. In this project, we focus on models for simplifying the existing content on Wikipedia.
The Readability Gap:
- We have shown in previous work that content on Wikipedia is generally very difficult to read. [1] This means that much of the existing content might not be very accessible to the larger population in terms of readability (the ease with which a reader can understand a written text), considering average reading ability (even among adults).
- There are some Wikipedias with articles using decidedly simpler language, such as Simple English Wikipedia or children’s encyclopedias (Vikidia, Txikipedia, Klexikon, Wikikids). However, they exist in only a few languages (compared to the more than 300 languages in Wikipedia) and cover a much smaller number of articles (for example, as of July 2024, Simple English Wikipedia contains around 250K articles vs 6.8M in English Wikipedia).
Automatic Simplification:
- In order to improve the readability of an article, we would thus like to develop a tool that can automatically generate a simplified version of the text that could be surfaced to the reader. With the recent improvements in performance and availability of Large Language Models (LLMs), it seems feasible to develop a model for this task.
- In previous exploratory work, we showed that it is possible to automatically generate simplified versions of text with some success, even in languages beyond English
Goals
- [Done] Identify requirements (infrastructure, performance, quality, languages, context, etc.)
- [Done] Review candidate models compatible with requirements
- [Done] Implement one or more candidate models
Defining model requirements
In order to decide which model to use for the corresponding tasks, I identified the following requirements:
- Multilingual: The model should support at least some languages other than English; ideally, as many languages as possible from the more than 300 languages in Wikipedia.
- Openness: The model should be open so we can deploy it as a production service in our own infrastructure (LiftWing). Which definition of open needs to be determined.
- Resources: We need to be able to host the model in our infrastructure in LiftWing. This sets a limit on the model size (e.g. number of parameters). Additional constraints come from performance, e.g., the time to return results should be limited.
- Use-case: Does the model have a chance to be effective for the respective task (based on Research and what we know)? Has the model been used for this task or similar tasks before?
- Quality: The output of the model needs to be useful, e.g. the quality should pass some threshold. This requires some evaluation of the model output (automated and/or manual etc.).
Candidate models
In the first step, we identified two potential candidates: text simplification and section gists.
Simplification
Text simplification aims to rephrase the text to make it easier to read and easier to understand while retaining the content (and meaning) of the original text.
Motivation. We have shown in previous work that content on Wikipedia is generally very difficult to read. This means that much of the existing content might not be very accessible to the larger population in terms of readability (the ease with which a reader can understand a written text), considering average reading ability (even among adults). In order to improve the readability of an article, we would thus like to develop a tool that can automatically generate a simplified version of the text (i.e. the same text but using simpler language, such as simple English) that could be surfaced to readers. With the recent improvements in performance and availability of Large Language Models (LLMs), it seems feasible to develop a model for this task.
Implementation. We train a sequence-to-sequence language model following recent approaches in document-level text simplification[2].
As training data, we use an annotated reference dataset (WikiReaD, see below) which contains pairs of articles (original and a simplified version), obtained by matching Wikipedia with a simplified or children’s encyclopedia across 14 languages. We then fine-tune a pre-trained language model using the pairs of articles as samples for the model’s input (original) and output (simplified). Specifically, we fine-tune two recent models: Flan-T5 (large) and mt0 (base). We chose these models based on an evaluation of the requirements defined above:
- Multilingual: Both models are multilingual, supporting many languages besides English, according to the documentation in the respective model cards.
- Openness: The models are available under an open license (Apache 2.0).
- Resources: We are able to train (i.e. fine-tune) the models inside our own infrastructure, specifically on the analytics clients (stat-boxes), which have a (single) GPU. It is possible to host the trained models in the current LiftWing infrastructure. If, in the future, our infrastructure improves to allow for training/hosting larger models, we can easily adapt this approach using larger variants of the same model families.
- Use-case: The model families Flan-T5 and mt0 (based on the mT5 family) have been previously used for training multilingual text simplification models[3].
- Quality: Previous works have reported good performance of these models in text simplification (though mostly on sentences). An exact evaluation still needs to be done.
Code for implementing the model can be found in this repository: https://gitlab.wikimedia.org/repos/research/text-simplification/-/tree/main
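As a rough illustration, the fine-tuning setup can be sketched with the transformers library as below. This is a minimal sketch, not the exact configuration used in the repository: the dataset fields ("original", "simplified"), sequence lengths, and training hyperparameters are illustrative assumptions.

# Minimal sketch of fine-tuning a seq2seq model on (original, simplified) article pairs.
# Dataset fields and hyperparameters are illustrative assumptions, not the exact setup.
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_id = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

def preprocess(example):
    # original article text is the model input, the simplified version is the target
    model_inputs = tokenizer(example["original"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=example["simplified"], max_length=1024, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# `pairs` is assumed to be a datasets.Dataset with "original"/"simplified" columns,
# e.g. built from the WikiReaD article pairs.
tokenized = pairs.map(preprocess, remove_columns=pairs.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="flan-t5-simplification",
        per_device_train_batch_size=2,
        num_train_epochs=3,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()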
Example. As an example, let’s take the first paragraph of the lead section of the article on Hypatia:
Hypatia (born c. 350–370; died 415 AD) was a Neoplatonist philosopher, astronomer, and mathematician who lived in Alexandria, Egypt, then part of the Eastern Roman Empire. She was a prominent thinker in Alexandria where she taught philosophy and astronomy. Although preceded by Pandrosion, another Alexandrian female mathematician, she is the first female mathematician whose life is reasonably well recorded. Hypatia was renowned in her own lifetime as a great teacher and a wise counselor. She wrote a commentary on Diophantus's thirteen-volume Arithmetica, which may survive in part, having been interpolated into Diophantus's original text, and another commentary on Apollonius of Perga's treatise on conic sections, which has not survived. Many modern scholars also believe that Hypatia may have edited the surviving text of Ptolemy's Almagest, based on the title of her father Theon's commentary on Book III of the Almagest.
The resulting simplified version from the model is:
Hypatia (c. 350–370; died 415 AD) was a Neoplatonist philosopher, astronomer, and mathematician. She lived in Alexandria, Egypt, then part of the Eastern Roman Empire. She was a prominent thinker in Alexandria where she taught philosophy and astronomy. She is the first female mathematician whose life is fairly well recorded. Hypatia was known in her own lifetime as a great teacher and a wise counselor. She wrote a commentary on Diophantus' thirteen-volume Arithmetica, which may survive in part, having been interpolated into Diophantus' original text. She also wrote a commentary on Apollonius of Perga's treatise on conic sections, which has not survived. Many modern scholars also believe that Hypatia may have edited the surviving text of Ptolemy's Almagest
The Flesch-Kincaid grade level of the latter is 10.2 (vs 12.3 for the original), showing that it is easier to read according to the readability score.
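Readability scores like these can be reproduced with, e.g., the textstat package; a minimal sketch follows (exact values may differ slightly depending on the library version):

# Compare the Flesch-Kincaid grade levels of the original and simplified paragraphs.
import textstat

original = "Hypatia (born c. 350-370; died 415 AD) was a Neoplatonist philosopher, ..."  # full paragraph above
simplified = "Hypatia (c. 350-370; died 415 AD) was a Neoplatonist philosopher, ..."    # model output above

print(textstat.flesch_kincaid_grade(original))    # higher grade level: harder to read
print(textstat.flesch_kincaid_grade(simplified))  # lower grade level: easier to read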
Section gists
Section gists are plain language summaries of sections of articles. They thus combine simplification with summarization of content.
Motivation. The idea of section gists is taken from the paper Paper Plain: Making Medical Research Papers Approachable to Healthcare Consumers with Natural Language Processing[4]. It aims to improve access to medical papers for readers (outside of Wikipedia). Based on interviews with readers about barriers to interacting with content, and on usability testing, the authors identify section gists as a valuable and the most frequently used feature among non-expert readers. Specifically, they generate section gists automatically by prompting an LLM to create "a summary for a 5th-grader" (i.e. combining summarization and simplification).
Here, we adapt the same framework to Wikipedia articles. Based on initial discussions with folks in the Web Team, section gists could align very well with some of the ideas the team is considering exploring as experiments with readers on Wikipedia.
Implementation. We use the Aya 23 model to generate section gists. This model is a good candidate for the following reasons:
- Multilingual: The main advantage is that it supports 23 languages, reportedly covering half the world's population in terms of speakers (more than any comparable LLM that I am aware of).
- Openness: It is an open-weight model with a CC-BY-NC license. It can be used via Hugging Face.
- Resources: We will likely be able to host the model in our own infrastructure based on recent experiments with similarly-sized models (T369055).
- Use-case: Previous works (such as the Paper Plain paper mentioned above) generated section gists using similar LLMs, prompting for a summary at a certain grade level. Thus, the Aya model seems suitable for the task at hand.
- Quality: The technical report shows that the model outperforms previous massively multilingual models like Aya 101 for the languages it covers, as well as widely used models like Gemma, Mistral, and Mixtral, on an extensive range of discriminative and generative tasks. Specifically, it is shown to perform well on summarization tasks (Sec. 5.3). With our task operationalized as a variant of summarization, we can expect that the model can, in principle, yield good results. In practice, though, this is difficult to evaluate automatically.
We can run the model on the text of individual sections by prompting the model in the following way:
## Instructions
Summarize the text below for a 7-th grader in {LANGUAGE}. Just return the summary.
## Input text
{TEXT}
There are different options to adapt the section gist in terms of
- Length (specify the maximum number of tokens)
- Readability level (e.g. specify a different grade level for the summary)
- Etc.
Example. As an example, let’s take the lead section of the article on Hypatia:
Hypatia (born c. 350–370; died 415 AD) was a Neoplatonist philosopher, astronomer, and mathematician who lived in Alexandria, Egypt, then part of the Eastern Roman Empire. She was a prominent thinker in Alexandria where she taught philosophy and astronomy. Although preceded by Pandrosion, another Alexandrian female mathematician, she is the first female mathematician whose life is reasonably well recorded. Hypatia was renowned in her own lifetime as a great teacher and a wise counselor. She wrote a commentary on Diophantus's thirteen-volume Arithmetica, which may survive in part, having been interpolated into Diophantus's original text, and another commentary on Apollonius of Perga's treatise on conic sections, which has not survived. Many modern scholars also believe that Hypatia may have edited the surviving text of Ptolemy's Almagest, based on the title of her father Theon's commentary on Book III of the Almagest.
Hypatia constructed astrolabes and hydrometers, but did not invent either of these, which were both in use long before she was born. She was tolerant toward Christians and taught many Christian students, including Synesius, the future bishop of Ptolemais. Ancient sources record that Hypatia was widely beloved by pagans and Christians alike and that she established great influence with the political elite in Alexandria. Toward the end of her life, Hypatia advised Orestes, the Roman prefect of Alexandria, who was in the midst of a political feud with Cyril, the bishop of Alexandria. Rumors spread accusing her of preventing Orestes from reconciling with Cyril and, in March 415 AD, she was murdered by a mob of Christians led by a lector named Peter.
Hypatia's murder shocked the empire and transformed her into a "martyr for philosophy", leading future Neoplatonists such as the historian Damascius (c. 458 – c. 538) to become increasingly fervent in their opposition to Christianity. During the Middle Ages, Hypatia was co-opted as a symbol of Christian virtue and scholars believe she was part of the basis for the legend of Saint Catherine of Alexandria. During the Age of Enlightenment, she became a symbol of opposition to Catholicism. In the nineteenth century, European literature, especially Charles Kingsley's 1853 novel Hypatia, romanticized her as "the last of the Hellenes". In the twentieth century, Hypatia became seen as an icon for women's rights and a precursor to the feminist movement. Since the late twentieth century, some portrayals have associated Hypatia's death with the destruction of the Library of Alexandria, despite the historical fact that the library no longer existed during Hypatia's lifetime.
We would get the following section gist:
Hypatia was a smart and brave woman who lived a long time ago in a place called Alexandria, Egypt. She was a teacher and a philosopher, and she knew a lot about math and the stars. She was liked by many people, both Christians and non-Christians. She taught Christian students and was friends with a man named Orestes, who was in charge of Alexandria. Some people were jealous of her, and they spread rumors about her. One day, a group of people attacked her, and she was killed. Her death made a lot of people very sad, and she became famous for standing up for what she believed in. People still remember her today as a brave woman who fought for her beliefs and helped others learn and understand the world.
You can use the tutorial notebook to run the model yourself on PAWS. Currently, the model is accessed via Cohere’s API, for which you need to create a free trial key. The notebook contains some examples; you can adapt it to get section gists for different articles and in different languages.
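For reference, a minimal sketch of calling the model via Cohere’s Python SDK might look like the following; the model identifier is an assumption, so check the tutorial notebook for the exact setup:

# Minimal sketch of generating a section gist via Cohere's API.
# The model identifier "c4ai-aya-23" is an assumption; see the tutorial notebook.
import cohere

co = cohere.Client("YOUR_TRIAL_API_KEY")
section_text = "..."  # placeholder: plain text of the article section

prompt = (
    "## Instructions\n"
    "Summarize the text below for a 7-th grader in English. Just return the summary.\n"
    "## Input text\n"
    f"{section_text}"
)
response = co.chat(model="c4ai-aya-23", message=prompt)
print(response.text)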
Simple Summaries
Our main goal for the hypothesis is to implement a model that can generate simple summaries of sections of articles. This is very similar to the concept of section gists discussed above. The main reason to focus on this model is that the Web Team has identified it as a relevant use case as part of their experiments.
Task
Given the text of a section of an article, a simple summary has the following features:
- Summary: It is substantially shorter than the original section while still capturing the main information.
- Simplicity: It is substantially easier to read (e.g. it improves the readability score).
- Meaning preservation: Its content is factually consistent with the information contained in the text of the article.
Implementation
We use the Aya-expanse model, specifically Aya-expanse-32b. The model is an improvement over the Aya-23 model that we considered in earlier exploratory research (see above). It is an open-weights model and the state of the art for multilingual AI, supporting 23 languages. The comparably moderate size of the model (32B parameters) allows us to implement and host it in our own infrastructure.
ML-lab
We have implemented the model on the ML-Lab servers using the transformers library. Note that we use a smaller datatype (float16 instead of the default float32) in order to reduce the memory footprint.
# Loading the model
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') # check if GPU is available
model_id = "CohereForAI/aya-expanse-32b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to(device)  # load weights in float16 to halve the memory footprint
Using a prompt (and optionally a preamble), we can generate summaries in the following way:
def generate_aya(
    model,
    tokenizer,
    data_in,
    temperature=0.3,
    top_p=1.0,
    top_k=0,
    max_new_tokens=256,
    do_sample=True,
):
    # format the input as a chat with a system preamble and a user prompt
    preamble = data_in["preamble"]
    prompt = data_in["prompt"]
    messages = [
        {"role": "system", "content": preamble},
        {"role": "user", "content": prompt},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    # generate the summary
    gen_tokens = model.generate(
        input_ids,
        top_p=top_p,
        top_k=top_k,
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        do_sample=do_sample,
    )
    # format the output (note: the decoded text includes the prompt and special tokens)
    gen_text = tokenizer.decode(gen_tokens[0], skip_special_tokens=False)
    return gen_text
An example preamble and prompt could look like this (where language is the language in which the article is written, and input_text is the text of the original article section):
preamble = """You are writing encyclopedic articles in the style of Wikipedia using simplified language and a neutral tone."""
prompt = """## Instructions
Summarize the text below at a 7-th grade reading level, in {language}, using 100 words or less. Return only the summary.
## Input text
{input_text}"""
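Putting these pieces together, a hypothetical end-to-end call for a single section could look like this (section_text is a placeholder for the plain text of the section):

# Hypothetical usage of generate_aya with the preamble and prompt template above.
section_text = "..."  # placeholder: plain text of the article section

data_in = {
    "preamble": preamble,
    "prompt": prompt.format(language="English", input_text=section_text),
}
summary = generate_aya(model, tokenizer, data_in)
print(summary)  # raw output; includes the chat template and special tokens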
With this setup, the model’s memory footprint is 60GB (roughly 32B parameters × 2 bytes per float16 weight) and thus fits into the memory of a single GPU. Generating a simple summary for a single section of an article takes around 10s. These numbers can be further reduced through additional optimization (see Open problems and next steps below)
Example notebook: https://gitlab.wikimedia.org/repos/research/simple-summaries/-/blob/main/simple-summary_aya_example.ipynb
LiftWing
We successfully built a test deployment of the model in a staging environment (only accessible internally, e.g. from the stat-machines).
Example query to the (smaller) Aya-expanse-8b model:
$ curl "https://inference-staging.svc.codfw.wmnet:30443/openai/v1/completions" -H "Host: aya.experimental.wikimedia.org" -H "Content-Type: application/json" -X POST -d '{"model": "aya-expanse-8B", "prompt": ".", "max_tokens": 100}'
Implementation details:
This is deployed using the huggingface runtime available in kserve which has an OpenAI API integrated ("openai/v1/completions").
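The same query can also be issued from Python, e.g. with the requests library; a minimal sketch mirroring the curl command above (certificate/SNI handling may need adjustment depending on the environment):

# Python equivalent of the curl query above (only reachable from internal hosts).
import requests

response = requests.post(
    "https://inference-staging.svc.codfw.wmnet:30443/openai/v1/completions",
    headers={
        "Host": "aya.experimental.wikimedia.org",
        "Content-Type": "application/json",
    },
    json={"model": "aya-expanse-8B", "prompt": ".", "max_tokens": 100},
)
print(response.json())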
Evaluation
In order to assess whether the model works in practice, and to iteratively improve it, it is crucial to evaluate its performance using some evaluation metric.
With recent Large Language Models, the evaluation of natural language generation (NLG) or text generation is a difficult and unsolved task[5]. For example, a recent paper Toward an Evaluation Science for Generative AI Systems[6] argues that "There is an increasing imperative to anticipate and understand the performance and safety of generative AI systems in real-world deployment contexts. However, the current evaluation ecosystem is insufficient: Commonly used static benchmarks face validity challenges, and ad hoc case-by-case audits rarely scale. In this piece, we advocate for maturing an evaluation science for generative AI systems."
Specifically, in the case of summarization/simplification, commonly-used automatic benchmark metrics are Rouge, BLEU, SARI, etc. However, these metrics suffer from many drawbacks:[7]
- The metrics typically require a ground-truth or reference dataset. In practice, for many tasks such data is not easily available, especially when looking at languages beyond English. For example, we do not have readily available and verified simple summaries of Wikipedia articles.
- The metrics have been shown to correlate poorly with human judgement. That is, while they are convenient to calculate, they do not necessarily align with how humans would rate the quality of the generated text. For example, for some text simplification metrics it has been shown that “low scores” indicate “bad quality”; whereas, in contrast, “high scores” do not necessarily imply “good quality” of the simplification.
- They are often not easily interpretable. For example, the SARI score is an average of 2 F1-scores (for add and keep operations) and a precision score (delete operations). As a result, it is not clear what value of SARI should be considered acceptable or good enough.
Overall, this renders the common summarization/simplification metrics not very useful when making decisions about whether to deploy a model in practice, or which one.
While this work focuses on evaluating the simplification/summarization output of the simple-summaries model, the insights will also be informative for other tasks where we use the generative output of LLMs (in contrast to, e.g., classification).
Metrics
As an alternative approach, we can use a set of simpler, easy-to-interpret guardrail metrics to assess specific aspects of the quality of the generated simple summaries.
First, we focus on three aspects that are typically considered when asking human judges to rate simplifications[8], and define an automatic metric as a proxy for each:
- Simplicity captures the readability of the generated text (i.e. how easy it is to read). We calculate a readability score, such as the Flesch-Kincaid grade level (for English) or the multilingual readability score (beyond English)[1]. Ideally, the grade level of the simple summary is lower than that of the original article. For this, we consider the change in readability score (i.e. a negative value means the readability score of the simple summary is lower/better than that of the original).
- Fluency captures the degree to which the generated text is grammatical. We calculate the number of grammar and spelling errors, e.g., using LanguageTool. Ideally, the simple summary does not have any grammatical or spelling errors.
- Meaning preservation captures whether the generated text is factually consistent with the original text. We calculate the score (probability) that the generated text is entailed by the original text using the SummaC model [9]. This score is between 0 (no entailment/inconsistent) and 1 (high entailment/consistent). Ideally, the score is above some threshold (say 0.4) to make sure that information in the simple summary is consistent with the original article.
In addition, we consider the following aspects:
- Language confusion captures whether the output of the model is in the correct language. Qualitatively, we observed that the output of the simple-summary model was sometimes in a different language than the input. Recent work [10] has identified language confusion, a model’s inability to consistently generate text in a user’s desired language, as a general limitation of multilingual LLMs (including the Aya model). In related work[11], it was shown that quantization can significantly affect performance in this respect. We use a language identification model which supports 201 languages[12] (code). The model is hosted on LiftWing so it can be used off-the-shelf. For each simple summary, we identify the language and check whether it matches the expected language (1) or not (0).
- Tone captures whether the simple summary is written in an encyclopedic tone. Qualitatively, we observed that some simple summaries contained non-encyclopedic language (example: “super tall buildings”). We use the peacock detection model for detecting policy violations, developed to support Edit Check (T368274). It is based on the peacock template, which indicates that an article "contains wording that promotes the subject in a subjective manner without imparting real information". We can use the model to detect similar issues in the simple summaries generated by the model. For each simple summary, we obtain a score between 0 (tone is good) and 1 (tone is not encyclopedic).
The advantages of these guardrail metrics are:
- they are reference-free, i.e. they do not require ground truth for evaluation
- they are interpretable, i.e. they provide information about specific aspects of the quality of the generated simple summaries.
- they can thus help identify potential issues with individual simple summaries, so these can be checked and filtered (if needed) during post-processing. For example, simple summaries with low scores on meaning preservation (say, below 0.25) are likely to contain information that is not contained in the original text (e.g. hallucinations). Naturally, there are other aspects of quality, such as adherence to the Neutral point of view policy, for which we do not have readily available metrics (see Open problems below).
Summary of metrics
Metric | Explanation | Score Range | What is better | Implementation |
Simplicity | Change in readability score (e.g. the Flesch-Kincaid grade level) between simple summary and original as readability_summary - readability_original. Thus, a negative score indicates that the simple summary is easier to read. In contrast, a neutral or positive score will indicate that the simple summary is not easier to read than the original. | -10…10 | ↓ Lower scores (negative) | Multilingual readability model |
Fluency | Number of grammatical errors in the simple summary. The fewer, the better. | 0...integer | ↓ Lower scores (0) | LanguageTool |
Meaning preservation | Confidence score that content in the simple summary is supported (i.e. entailed) by the text of the original. Low scores might indicate hallucinations. | 0...1 | ↑ Higher scores (1) | SummaC |
Language confusion | Confidence score that the simple summary is written in the expected language using a language detection model. A low score indicates that simple summary is not in the correct language. | 0...1 | ↑ Higher scores (1) | Language detection model |
Tone | Confidence score that the tone is not encyclopedic using the peacock detection model. A high score indicates that the tone is not encyclopedic. | 0...1 | ↓ Lower scores (0) | Tone check model |
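A minimal sketch of computing three of these metrics for a single (original, summary) pair is shown below, assuming the textstat, language_tool_python, and summac packages; the actual pipeline (see the evaluation notebook linked above) differs in details, e.g. it uses the multilingual readability model instead of textstat:

# Sketch of three guardrail metrics for one (original, summary) pair.
# The package choices here are assumptions; see the evaluation notebook for the real pipeline.
import textstat
import language_tool_python
from summac.model_summac import SummaCZS

def guardrail_metrics(original: str, summary: str) -> dict:
    # Simplicity: change in readability (negative = summary is easier to read)
    simplicity = textstat.flesch_kincaid_grade(summary) - textstat.flesch_kincaid_grade(original)

    # Fluency: number of grammar/spelling issues flagged by LanguageTool
    tool = language_tool_python.LanguageTool("en-US")
    fluency = len(tool.check(summary))

    # Meaning preservation: entailment score of the summary given the original
    summac_model = SummaCZS(granularity="sentence", model_name="vitc", device="cpu")
    meaning = summac_model.score([original], [summary])["scores"][0]

    return {"simplicity": simplicity, "fluency": fluency, "meaning_preservation": meaning}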
Examples:
1) Tone: Detecting non-encyclopedic tone in the simple summary (Tone score = 0.91)
- Model output: São Paulo is Brazil's biggest city and an important place in the world for business, art, and fun. It's named after Paul the Apostle and has people from many countries living there. The city started with Jesuit priests in 1554 and grew strong during the coffee trade. Now, it's a huge economic center with lots of big companies and a cool cultural scene. São Paulo hosts big events like the World Cup and has awesome museums, parks, and festivals. It's also home to super tall buildings!
2) Meaning preservation: Detecting hallucinations in the simple summary (Meaning preservation score = 0.17)
- Original article: 2 (two) is a number, numeral and digit. It is the natural number following 1 and preceding 3. It is the smallest and the only even prime number. Because it forms the basis of a duality, it has religious and spiritual significance in many cultures.
- Model output: Two is a number that comes between 1 and 3. It's special because it's the only even prime number, meaning it can only be divided by 1 and itself. In many cultures, the number two represents duality and has religious importance. Think of ideas like "good and evil" or "light and dark" - these are examples of duality. So, the number two is pretty significant and not just a simple number!
Which languages are supported for the evaluation metrics:
- Fluency is implemented using LanguageTool, which supports 31 languages: ar ast be br ca crh da de el en eo es fa fr ga gl it ja km nl pl pt ro ru sk sl sv ta tl uk zh. If additional languages are required, one relatively straightforward option would be to use spellcheckers for that specific language using libraries such as pyenchant. Open spellcheckers can be found via, e.g., LibreOffice Language Support (bn, cs, etc.)
- Language confusion is implemented using a language identification model which supports 188 languages (those that are explicitly matched with a Wikipedia project), so this should capture most relevant cases for now (a sketch follows this list).
- Simplicity, Meaning preservation, and Tone are implemented using different fine-tuned smaller multilingual language models. They do not have a well-defined language coverage. Their backbone models support ~100 languages. They have been explicitly validated for 10-20 languages in zero-shot settings with good results. Therefore, they likely also generalize well for other languages but it often depends on how well that language is captured by the model. A good compromise here is to work with the top-20 languages (ar,cs,de,en,es,fa,fr,he,id,it,ja,nl,no,ro,ru,pl,pt,tr,uk,zh)
- Simplicity. This uses the readability model from Trokhymovych et al. 2024. It has been explicitly validated to work in a zero-shot scenario (i.e. without being fine-tuned on those languages explicitly) for: ca,de,el,es,eu,fr,hy,it,nl,oc,pt,ru,scn (it was trained on English).
- Meaning preservation. This uses the SummaC model from Laban et al. 2022. In the original paper, it was only validated on English data. A recent paper by Kang et al. 2024 evaluated SummaC (among other methods) in a multilingual setting to detect hallucinations in text generation. They conclude that i) “[summaC] effectively detect sentence-level hallucinations in high-resource languages when compared to human evaluations” and ii) “[summaC] outperform supervised approaches at detecting hallucinations that can be verified or refuted by the reference text”. They mention “NLI metrics” but use SummaC for the implementation (“we adopt the NLI-based zero-shot sentence-level SUMMAC (SummaCzs) scoring system (Laban et al., 2021) to evaluate hallucinations.”). The high-resource languages considered are: en, es, fr, id, vi, zh. In addition, I qualitatively checked the examples in German from the multi-core-01 benchmark (generated with prompt_id=01) and found that high scores (>0.5) corresponded to simple summaries with preserved meaning, while low scores (<0.5) showed some form of hallucination.
- Tone. This uses the peacock detection model developed in T368274: Detecting Peacock behavior with LLMs. The multilingual version of the model has been validated for 10 languages: ar, de, en, es, fr, ja, nl, pt, ru, zh. It is suspected that the model works well for other languages as well (especially those among the top-20 or so language versions of Wikipedia), but evaluation is currently ongoing in T387925: Determine language support for Peacock Check (v1).
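For the language-confusion check, a sketch using the fasttext bindings with an OpenLID-style model might look like this; the model file name is an assumption, and in production the model is served on LiftWing instead:

# Sketch of the language-confusion check with a fasttext language-identification model.
# The model file name is an assumption; in production the model is hosted on LiftWing.
import fasttext

lid_model = fasttext.load_model("lid201-model.bin")

def matches_expected_language(summary: str, expected: str) -> int:
    # fasttext labels look like "__label__eng_Latn"; expected would be e.g. "eng_Latn"
    labels, _ = lid_model.predict(summary.replace("\n", " "))
    return 1 if expected in labels[0] else 0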
Benchmark datasets
We put together a set of benchmark datasets for consistent evaluation of different models. We considered the following aspects:
- size: the dataset should be large enough to give good statistics on the metrics, but not so large that the running time becomes unreasonable (probably somewhere between 100 and 10,000 articles)
- representativeness: should the dataset be a random subsample or contain known edge cases (e.g. in terms of readability)?
- languages: it should contain articles from other languages for multilingual evaluation
From these criteria, we generated the following benchmark datasets:
- en-random: 100 articles randomly selected from the first round of experiments in English Wikipedia
- en-edgecases: the 100 articles with the lowest scores from the first round of experiments in English Wikipedia (with 5 metrics, choosing the 20 lowest-scoring cases for each dimension)
- en-cutoff: 100 random articles from English Wikipedia that were created after the model was released
- multi-core: random sample of articles from Wikipedia language versions that are explicitly supported by the Aya model (23 languages). Specifically, we select the same set of articles as for “en-random”, but for each article we randomly select one of the language versions in which it is available.
- multi-ext: random sample of articles from Wikipedia language versions that are not explicitly supported by the Aya model. Specifically, we select the same set of articles as for “en-random”, but for each article we randomly select one of the language versions in which it is available.
- <lang>-random: corresponding articles from en-random in <lang> if they exist (i.e. the dataset might contain fewer than 100 articles). <lang> = "ar","cs","de","el","es","fa","fr","he","hi","id","it","ja","ko","nl","ro","ru","pl","pt","tr","uk","vi","zh"
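As an illustration, a random sample of article titles (e.g. for en-random) can be drawn via the MediaWiki API; this is a minimal sketch, the actual code to generate the data is referenced under Resources below:

# Sketch: draw a random sample of main-namespace article titles via the MediaWiki API.
import requests

def random_articles(lang: str = "en", n: int = 100) -> list:
    response = requests.get(
        f"https://{lang}.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "list": "random",
            "rnnamespace": 0,  # main (article) namespace
            "rnlimit": n,
            "format": "json",
        },
        headers={"User-Agent": "simple-summaries-benchmark-sketch"},
    )
    return [page["title"] for page in response.json()["query"]["random"]]

print(random_articles("en", 5))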
Resources:
- Code to generate data:
- Data as csv files in repo: https://gitlab.wikimedia.org/repos/research/simple-summaries/-/tree/main/benchmarks
Application: Improving the prompt
I tested the quality metrics for systematically improving the prompt that we use for generating the simple summaries. Using the benchmark data and evaluation metrics, we can quantitatively compare different prompts instead of manually spot-checking individual samples.
I tested 11 different prompts (the three earlier prompts plus eight new variants, 05a–05h) on one of the English benchmark datasets, calculating all 5 quality metrics for evaluation.
Prompt_id | Prompt | Preamble (optional) |
1 | ## Instructions
Summarize the text below at a 7-th grade reading level, in {language}, using 100 words or less. Return only the summary. ## Input text {input_text} |
You are writing encyclopedic articles in the style of Wikipedia using simplified language and a neutral tone. |
3 | ## Task and Context
Summarize the text below into a clear and simple paragraph that is easy to understand for a general audience. You will always generate the text in the source language of the input text. Keep the summary concise, with a target of 100 words or less, capturing only the essential points of the content. Focus on readability, accessibility, and neutrality. ## Tone Write the summary in a style and tone appropriate for Wikipedia. Avoid any editorializing, opinions, or expressive language about the subject. Ensure the tone remains neutral, professional and factual. ## Summarization Guidelines - The source language of the input text is in {language}. Please always respond ONLY in the source language. - Provide an accessible, straightforward summary suitable for readers seeking a quick, understandable overview. - Refrain from using humorous or imaginative titles; provide only the summary itself as output. - Summarize the text above in simple language, using 100 words or less. Do not expand content beyond its original length if it is already brief. - Return only the summary. Avoid any in-text citations, footnotes or reference links in the summary. ## Input Text {input_text} |
- |
4 | ## Core Requirement
YOU MUST GENERATE THE SUMMARY IN THE EXACT SAME LANGUAGE AS THE INPUT TEXT. This is a strict requirement that cannot be violated under any circumstances. ## Task and Context Summarize the text below into a clear and simple paragraph that is easy to understand for a general audience. You will always generate the text in the source language of the input text. Keep the summary concise, with a target of 100 words or less, capturing only the essential points of the content. Focus on readability, accessibility, and neutrality. ## Language Control 1. First, identify the input text's language: {language} 2. You must generate the summary in {language} 3. If you detect yourself starting to switch languages, stop and restart in {language} 4. The entire response must be in {language} only ## Summarization Guidelines - Write the summary in a style and tone appropriate for Wikipedia. - Avoid any editorializing, opinions, or expressive language about the subject. Ensure the tone remains neutral, professional and factual. - Provide an accessible, straightforward summary suitable for readers seeking a quick, understandable overview. - Refrain from using humorous or imaginative titles; provide only the summary itself as output. - Summarize the text above in simple language, using 100 words or less. Do not expand content beyond its original length if it is already brief. - Return only the summary. Avoid any in-text citations, footnotes or reference links in the summary. ## Input Text {input_text} |
- |
05a | ## Core Requirement
YOU MUST GENERATE THE SUMMARY IN THE EXACT SAME LANGUAGE AS THE INPUT TEXT. This is a strict requirement that cannot be violated under any circumstances. ## Task and Context Summarize the text below into a clear and simple paragraph that is easy to understand at a 7-th grade reading level. You will always generate the text in the source language of the input text. Keep the summary concise, with a target of 100 words or less, capturing only the essential points of the content. Focus on readability, accessibility, and neutrality. ## Language Control 1. First, identify the input text's language: {language} 2. You must generate the summary in {language} 3. If you detect yourself starting to switch languages, stop and restart in {language} 4. The entire response must be in {language} only ## Summarization Guidelines - Write the summary in a style and tone appropriate for Wikipedia. - Avoid any editorializing, opinions, or expressive language about the subject. Ensure the tone remains neutral, professional and factual. - Provide an accessible, straightforward summary suitable for readers seeking a quick, understandable overview. - Refrain from using humorous or imaginative titles; provide only the summary itself as output. - Summarize the text above in simple language at a 7-th grade reading level, using 100 words or less. Do not expand content beyond its original length if it is already brief. - Return only the summary. Avoid any in-text citations, footnotes or reference links in the summary. ## Input Text {input_text} |
- |
05b | ## Core Requirement
YOU MUST GENERATE THE SUMMARY IN THE EXACT SAME LANGUAGE AS THE INPUT TEXT. This is a strict requirement that cannot be violated under any circumstances. ## Task and Context Summarize the text below into a clear and simple paragraph that is easy to understand at a 7-th grade reading level. You will always generate the text in the source language of the input text. Keep the summary concise, with a target of 100 words or less, capturing only the essential points of the content. Focus on readability, meaning preservation, as well as neutrality and an encyclopedic tone. ## Language Control 1. First, identify the input text's language: {language} 2. You must generate the summary in {language} 3. If you detect yourself starting to switch languages, stop and restart in {language} 4. The entire response must be in {language} only ## Summarization Guidelines - Write the summary in a style and tone appropriate for Wikipedia. - Avoid any editorializing, opinions, or expressive language about the subject. Ensure the tone remains neutral, professional and factual. - Provide an accessible, straightforward summary suitable for readers seeking a quick, understandable overview. - Refrain from using humorous or imaginative titles; provide only the summary itself as output. - Summarize the text below in simple language at a 7-th grade reading level, using 100 words or less. Do not expand content beyond its original length if it is already brief. - Return only the summary. Avoid any in-text citations, footnotes or reference links in the summary. ## Input Text {input_text} |
- |
05c | ## Core Requirement
YOU MUST GENERATE THE SUMMARY IN THE EXACT SAME LANGUAGE AS THE INPUT TEXT. This is a strict requirement that cannot be violated under any circumstances. ## Task and Context Summarize the text below into a clear and simple paragraph that is easy to understand at a 7-th grade reading level. You will always generate the text in the source language of the input text. Keep the summary concise, with a target of 100 words or less, capturing only the essential points of the content. Focus on readability, accessibility, and neutrality. ## Language Control 1. First, identify the input text's language: {language} 2. You must generate the summary in {language} 3. If you detect yourself starting to switch languages, stop and restart in {language} 4. The entire response must be in {language} only ## Summarization Guidelines - Write the summary in a style and tone appropriate for Wikipedia. - Avoid any editorializing, opinions, or expressive language about the subject. Ensure the tone remains neutral, professional and factual. - Provide an accessible, straightforward summary suitable for readers seeking a quick, understandable overview. - Refrain from using humorous or imaginative titles; provide only the summary itself as output. - Return only the summary. Avoid any in-text citations, footnotes or reference links in the summary. - Summarize the text below using 100 words or less. Do not expand content beyond its original length if it is already brief. - Most importantly, the summary should be written using much simpler language aimed towards a 7-th grade reading level. ## Input Text {input_text} |
- |
05d | ## Task
Summarize the text below at a 7-th grade reading level in {language}. Keep the summary concise, with a target of 100 words or less, capturing only the essential points of the content. Write the summary in a style and tone appropriate for Wikipedia. ## Detailed summarization guidelines - YOU MUST GENERATE THE SUMMARY IN THE EXACT SAME LANGUAGE AS THE INPUT TEXT. This is a strict requirement that cannot be violated under any circumstances. The entire response must be in {language} only. If you detect yourself starting to switch languages, stop and restart in {language} - Avoid any editorializing, opinions, or expressive language about the subject. Ensure the tone remains neutral, professional and factual. - Refrain from using humorous or imaginative titles; provide only the summary itself as output. - Summarize the text below using 100 words or less. Do not expand content beyond its original length if it is already brief. - Most importantly, the summary should be written using much simpler language aimed towards a 7-th grade reading level. ## Input Text {input_text} |
You are writing encyclopedic articles in the style of Wikipedia using simplified language and a neutral tone. |
05e | ## Task
Summarize the text below at a 7-th grade reading level, in {language}, using 100 words or less. ## Guidelines * Write the summary in a style and tone appropriate for Wikipedia. Avoid any editorializing, opinions, or expressive language about the subject and ensure the tone remains neutral, professional and factual; * Write the summary in {language}. If you detect yourself starting to switch to another, stop and restart in {language}; * Provide only the summary itself as output and refrain from using humorous or imaginative titles; * Do not expand content beyond its original length if it is already brief. * Keep the summary concise capturing only the essential points of the content. Do not expand content beyond its original length if it is already brief. ## Input text {input_text} |
You are writing encyclopedic articles in the style of Wikipedia using simplified language and a neutral tone. |
05f | ## Task
Summarize the text below at a 7-th grade reading level, in {language}, using 100 words or less. ## Guidelines When writing the summary it is crucial to stick to ALL of the following guidelines: * Simplicity: The summary should be easy to read and understand. Aim for a 7-th grade reading level; and * Fluency: The summary should be grammatically correct. Provide only the summary itself as output and refrain from using humorous or imaginative titles; and * Meaning preservation: The summary should be factually consistent with the input text capturing its essential points. Do not expand content beyond its original length if it is already brief; and * Language: The summary should be written in {language}. If you detect yourself starting to switch to another language, stop and restart in {language}; and * Tone: The summary should be written in a style and tone appropriate for Wikipedia. Avoid any editorializing, opinions, or expressive language about the subject and ensure the tone remains encyclopedic, neutral, and professional. ## Input text {input_text} |
- |
05g | ## Task
Summarize the text below at a 7-th grade reading level, in {language}, using 100 words or less. ## Guidelines When writing the summary it is crucial to stick to ALL of the following guidelines: * Simplicity: The summary should be easy to read and understand. This means it has to be substantially simpler than the input text. If that is not the case, restart and use simpler language; and * Fluency: The summary should be grammatically correct. Provide only the summary itself as output and refrain from using humorous or imaginative titles; and * Meaning preservation: The summary should be factually consistent with the input text capturing its essential points. Do not expand content beyond its original length if it is already brief; and * Language: The summary should be written in {language}. If you detect yourself starting to switch to another language, stop and restart in {language}; and * Tone: The summary should be written in a style and tone appropriate for Wikipedia. Avoid any editorializing, opinions, or expressive language about the subject and ensure the tone remains encyclopedic, neutral, and professional. If you detect using any words that have a non-neutral tone, stop and restart with more neutral language. ## Input text {input_text} |
- |
05h | ## Task
Summarize the text below at a 7-th grade reading level, in {language}, using 100 words or less. ## Guidelines When writing the summary it is crucial to stick to ALL of the following guidelines: * Formatting: Provide only the summary itself as output and refrain from using humorous or imaginative titles. Avoid any in-text citations, footnotes or reference links in the summary. * Simplicity: The summary should be easy to read and understand. Aim for a 7-th grade reading level; and * Meaning preservation: The summary should be factually consistent with the input text capturing its essential points. Do not expand content beyond its original length if it is already brief; and * Language: The summary should be written in {language}. If you detect yourself starting to switch to another language, stop and restart in {language}; and * Tone: The summary should be written in a style and tone appropriate for a Wikipedia article. Avoid any editorializing, opinions, exaggerations, or expressive language about the subject. Make sure that the tone remains encyclopedic, neutral, and professional. ## Input text {input_text} |
- |
Results:
Prompt-id | simplicity ↓ | fluency ↓ | meaning preservation ↑ | language confusion ↑ | tone ↓ |
1 | -4.08 | 0.13 | 0.73 | 0.96 | 0.44 |
3 | -0.53 | 0.19 | 0.77 | 0.79 | 0.28 |
4 | -0.35 | 0.15 | 0.82 | 0.93 | 0.31 |
05a | -2.63 | 0.11 | 0.8 | 0.99 | 0.33 |
05b | -2.33 | 0.08 | 0.8 | 0.99 | 0.32 |
05c | -3.28 | 0.11 | 0.79 | 0.99 | 0.36 |
05d | -3.74 | 0.12 | 0.73 | 0.96 | 0.39 |
05e | -2.27 | 0.08 | 0.76 | 0.9 | 0.31 |
05f | -3.44 | 0.09 | 0.8 | 0.98 | 0.35 |
05g | -3.59 | 0.11 | 0.8 | 0.93 | 0.39 |
05h | -3.13 | 0.08 | 0.81 | 0.98 | 0.34 |
From this experiment, candidate prompt 05f seems to yield the best results.
In comparison to prompt_id 04, it creates simple summaries that are substantially easier to read: the readability score decreases by ~3.5 grade levels (compared to 0.35). At the same time, the tone score (checking for peacock language) for prompt_id 05f (0.35) is almost the same as for prompt_id 04 (0.31), and still substantially better than for the initial prompt_id 01 (0.44), where we detected these issues. The other metrics are also similar (meaning preservation) or even better (fluency, language confusion). So prompt_id 05f seems like a good compromise between our first prompt (prompt_id 01), with good simplification, and prompt_id 04, with good tone, without giving up much on the other metrics. Interestingly, looking through the results for the other prompts, it seemed hard to improve both simplicity AND tone: when one improved, the other would typically decrease. My rationale for the new prompt_id 05f was to keep the prompt more concise and provide explicit guidelines for each of the dimensions of the quality metrics we use to evaluate the simple summaries.
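The per-prompt averages in the table above can be computed by aggregating the per-article metric scores; a sketch with pandas follows (file and column names are assumptions based on the metric names):

# Sketch: aggregate per-article guardrail metrics into the per-prompt comparison table.
# File and column names are assumptions based on the metric names used above.
import pandas as pd

metrics = ["simplicity", "fluency", "meaning_preservation", "language_confusion", "tone"]

# one row per (prompt_id, article) with the five metric scores
df = pd.read_csv("benchmark_results.csv")
table = df.groupby("prompt_id")[metrics].mean().round(2)
print(table)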
Open problems and next steps
In this hypothesis, we demonstrated the feasibility of running/hosting a model to generate simple summaries in our own infrastructure. If the model is considered useful, the next step would be to scale the model for deployment. Specifically, we would need to optimize how we run the model in order to reduce the memory footprint and inference time. This work is beyond the scope of the current task and deserves a dedicated task. It includes (but is not limited to):
- inference optimization on GPU: This requires a systematic investigation of the different available options and if/how they work in our infrastructure (e.g. many approaches are not supported on ROCm GPUs). In turn, we also need to better understand the trade-off with model quality in order to make sure that the output is still acceptable for the task at hand. Although using the prebuilt huggingface runtime provides a simpler way to deploy models, it doesn't facilitate the level of customization we want to have at this time in order to explore. This involves:
- quantization
- flash attention
- inference optimization frameworks (e.g. vllm)
- The above need to be tested on ML-Lab and then deployed to Lift Wing. Deployment on Lift Wing involves building the packages from source for the required GPU architecture in a way that is reproducible, so that we can iterate and update them when needed.
- Batch inference. The current implementation generates one sample at a time. However, it is possible to run the model in batches, which can reduce the time per sample (though at an additional memory cost). For example, see here.
It is important to note that improving inference via, e.g., quantization often comes at the cost of reduced quality of the model output. The evaluation metrics are thus crucial to find a good balance between improving latency while still making sure that the model output meets certain quality criteria.
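As an illustration of the quantization option, a 4-bit load via bitsandbytes could look like the sketch below; whether bitsandbytes works on our ROCm GPUs, and how much output quality is lost, are exactly the open questions to evaluate with the guardrail metrics:

# Sketch: loading the model with 4-bit quantization via bitsandbytes.
# ROCm support for bitsandbytes and the impact on output quality are open questions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "CohereForAI/aya-expanse-32b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place the quantized weights on the available GPU(s)
)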
Resources
- Repository for running the model to generate simple summaries: https://gitlab.wikimedia.org/repos/research/simple-summaries
References
- ↑ a b Trokhymovych, Mykola, Indira Sen, and Martin Gerlach. “An Open Multilingual System for Scoring Readability of Wikipedia.” In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), edited by Lun-Wei Ku, Andre Martins, and Vivek Srikumar, 6296–6311. Bangkok, Thailand: Association for Computational Linguistics, 2024. https://doi.org/10.18653/v1/2024.acl-long.342.
- ↑ Sun, Renliang; Jin, Hanqi; Wan, Xiaojun (2021). "Document-Level Text Simplification: Dataset, Criteria and Baseline". Association for Computational Linguistics. pp. 7997–8013. doi:10.18653/v1/2021.emnlp-main.630.
- ↑ Joseph, Sebastian; Kazanas, Kathryn; Reina, Keziah; Ramanathan, Vishnesh; Xu, Wei; Wallace, Byron; Li, Junyi (2023). "Multilingual Simplification of Medical Texts". Association for Computational Linguistics. pp. 16662–16692. doi:10.18653/v1/2023.emnlp-main.1037.
- ↑ August, Tal; Wang, Lucy Lu; Bragg, Jonathan; Hearst, Marti A.; Head, Andrew; Lo, Kyle (2023-10-31). "Paper Plain : Making Medical Research Papers Approachable to Healthcare Consumers with Natural Language Processing". ACM Transactions on Computer-Human Interaction 30 (5): 1–38. ISSN 1073-0516. doi:10.1145/3589955.
- ↑ Gehrmann, Sebastian, Elizabeth Clark, and Thibault Sellam. “Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text.” Journal of Artificial Intelligence Research 77 (May 29, 2023): 103–66. https://doi.org/10.1613/jair.1.13715.
- ↑ Weidinger, Laura; Raji, Inioluwa Deborah; Wallach, Hanna; Mitchell, Margaret; Wang, Angelina; Salaudeen, Olawale; Bommasani, Rishi; Ganguli, Deep; Koyejo, Sanmi (2025-03-13), Toward an Evaluation Science for Generative AI Systems, arXiv, doi:10.48550/arXiv.2503.05336, retrieved 2025-07-10
- ↑ Alva-Manchego, Fernando, Carolina Scarton, and Lucia Specia. “The (Un)Suitability of Automatic Evaluation Metrics for Text Simplification.” Comput. Linguist. Assoc. Comput. Linguist. 47, no. 4 (December 23, 2021): 861–89. https://doi.org/10.1162/coli_a_00418.
- ↑ Alva-Manchego, Fernando, Carolina Scarton, and Lucia Specia. “Data-Driven Sentence Simplification: Survey and Benchmark.” Comput. Linguist. Assoc. Comput. Linguist. 46, no. 1 (March 2020): 135–87. https://doi.org/10.1162/coli_a_00370.
- ↑ Laban, Philippe, Tobias Schnabel, Paul N. Bennett, and Marti A. Hearst. “SummaC: Re-Visiting NLI-Based Models for Inconsistency Detection in Summarization.” Edited by Brian Roark and Ani Nenkova. Transactions of the Association for Computational Linguistics 10 (2022): 163–77. https://doi.org/10.1162/tacl_a_00453.
- ↑ Marchisio, Kelly; Ko, Wei-Yin; Berard, Alexandre; Dehaze, Théo; Ruder, Sebastian (November 2024). Al-Onaizan, Yaser; Bansal, Mohit; Chen, Yun-Nung, eds. "Understanding and Mitigating Language Confusion in LLMs". Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (Miami, Florida, USA: Association for Computational Linguistics): 6653–6677. doi:10.18653/v1/2024.emnlp-main.380.
- ↑ Marchisio, Kelly; Dash, Saurabh; Chen, Hongyu; Aumiller, Dennis; Üstün, Ahmet; Hooker, Sara; Ruder, Sebastian (November 2024). Al-Onaizan, Yaser; Bansal, Mohit; Chen, Yun-Nung, eds. "How Does Quantization Affect Multilingual LLMs?". Findings of the Association for Computational Linguistics: EMNLP 2024 (Miami, Florida, USA: Association for Computational Linguistics): 15928–15947. doi:10.18653/v1/2024.findings-emnlp.935.
- ↑ Burchell, Laurie; Birch, Alexandra; Bogoychev, Nikolay; Heafield, Kenneth (July 2023). Rogers, Anna; Boyd-Graber, Jordan; Okazaki, Naoaki, eds. "An Open Dataset and Model for Language Identification". Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (Toronto, Canada: Association for Computational Linguistics): 865–879. doi:10.18653/v1/2023.acl-short.75.