Research:Develop a model for text simplification to improve readability of Wikipedia articles

Tracked in Phabricator:
Task T342614

Contact

Wikimedia Foundation

Duration: 2023-07 – ??

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

In this project, we start exploratory work to develop a model for automatic text simplification of Wikipedia articles using large language models.

The model aims to improve the readability of articles and is a follow-up to our work on measuring readability of articles as part of the Research Team’s program to address knowledge gaps.

Text simplification can be considered a special case of text summarization, so the lessons from this project have implications for a wide range of potential use cases.

Motivation

As part of the program to address knowledge gaps, the Research team at the Wikimedia Foundation has developed a taxonomy of knowledge gaps. As one of the gaps, we identified the readability of content on Wikimedia projects. Roughly, readability aims to capture how easy it is to read and understand a written text. In the past year, we have successfully developed a multilingual model to measure the readability of articles in Wikipedia across languages: Research:Multilingual_Readability_Research.

As a next step, we would like to go beyond measuring the readability of articles. To improve the readability of an article, we would like to develop a tool that automatically generates a simplified version of the text that could be surfaced to the reader. With the recent improvements in the performance and availability of large language models (LLMs), it seems feasible to develop a model for this task.

More generally, the problem of text simplification is very similar to the problem of text summarization [1]. The latter could serve many use cases that have been voiced in the context of Wikimedia (foundation, movement, etc.). A non-exhaustive list includes summarizing sections of articles, summarizing discussions on talk pages, or even summarizing Phabricator tickets (see, e.g., this project at the 2023 Wikimedia Hackathon). Thus, learnings from the project described here will likely be useful when approaching one or more of these use cases.

Results

Reviewing literature

A concise overview of automatic text simplification can be found here: Research:Develop a model for text simplification to improve readability of Wikipedia articles/Background literature review

Some of the most noteworthy points were:

Only a few very recent works have started to approach document-level simplification; existing work has only considered English models and data
Good performance on document-level simplification from fine-tuned BART models
From related tasks, the recently published mLongT5 model family seems a promising way to develop a domain-specific model via fine-tuning because i) the mT5 model has been successfully used for multilingual sentence-level simplification tasks; and ii) the LongT5 model has been successfully used for summarization tasks
Prompt-based general LLMs such as ChatGPT don't seem to be better than the domain-specific models in automatic evaluation (though not much research has been done yet)
Our data in 10+ languages from children/simple encyclopedias provides a new multilingual dataset to approach document-level simplification beyond English

First exploratory experiments

We performed an exploratory experiment to develop a model for automatic text simplification. We fine-tuned a "small" LLM (Flan-T5) on pairs of articles from English Wikipedia (original) and Simple English Wikipedia (simplified).
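
As a rough illustration of what "pairs of articles" could look like as fine-tuning data, the sketch below turns one original/simplified pair into an input/target record for a sequence-to-sequence model. This is a minimal sketch under stated assumptions and not the project's actual pipeline: the `ArticlePair` class, the `PROMPT_PREFIX` text, the record field names, and the JSONL output format are all hypothetical and chosen only for illustration.

```python
import json
from dataclasses import dataclass

# Hypothetical instruction prefix; the prompt used in the project is not
# stated on this page.
PROMPT_PREFIX = "Simplify the following text: "


@dataclass
class ArticlePair:
    """One aligned pair of articles on the same topic (illustrative type)."""
    title: str
    original: str    # text from English Wikipedia
    simplified: str  # text from Simple English Wikipedia


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces and newlines into single spaces."""
    return " ".join(text.split())


def to_seq2seq_example(pair: ArticlePair) -> dict:
    """Build an input/target record: the original text is the model input,
    the simplified text is the target the model should learn to produce."""
    return {
        "input": PROMPT_PREFIX + normalize_whitespace(pair.original),
        "target": normalize_whitespace(pair.simplified),
    }


def to_jsonl(pairs) -> str:
    """Serialize records one per line, keeping non-ASCII characters as-is."""
    return "\n".join(
        json.dumps(to_seq2seq_example(p), ensure_ascii=False) for p in pairs
    )
```

Keeping the original text as the input and the simplified text as the target mirrors the setup described above; any tokenization, truncation, or model-specific formatting would happen in a later step that is not shown here.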

Summary:

We were able to train a model that yields results comparable to the state-of-the-art results reported in the literature for English.
We realized that a crucial part of model training is preparing a clean, high-quality dataset for fine-tuning: not only proper pre-processing of the content but, most importantly, filtering the examples that are provided as model input. This means making sure that each pair of articles (original/simplified) actually contains meaningful transformations that correspond to text simplification.
We were able to train a 3B parameter model using external GPUs which could then be run in inference mode in our internal infrastructure.
We tested the model in several languages beyond English, with varying performance: for a few languages, scores were similar to English (German, Italian, Catalan); for many languages, scores were slightly lower (Spanish, Basque, French, Portuguese, Dutch); and for some languages, performance was substantially worse (Greek, Armenian, and Russian).
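
The filtering step mentioned in the summary above can be sketched as follows. This is only an illustration of the general idea (discard pairs where the simplified text is nearly a copy of the original, is largely unrelated to it, or is much longer), not the filter used in the project: the function names and all threshold values are assumptions made for this example, and word-level overlap from Python's `difflib` stands in for whatever similarity measure was actually used.

```python
from difflib import SequenceMatcher

# Illustrative thresholds only; these are assumptions, not project values.
MIN_WORDS = 5           # skip texts too short to judge
MIN_SIMILARITY = 0.1    # below this, the two texts share almost no content
MAX_SIMILARITY = 0.9    # above this, the "simplified" text is nearly a copy
MAX_LENGTH_RATIO = 1.2  # simplified text should not be much longer


def word_similarity(original: str, simplified: str) -> float:
    """Word-level similarity in [0, 1] based on matching word sequences."""
    a = original.lower().split()
    b = simplified.lower().split()
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def is_meaningful_pair(original: str, simplified: str) -> bool:
    """Heuristic check that a pair contains an actual rewrite."""
    n_orig = len(original.split())
    n_simp = len(simplified.split())
    if n_orig < MIN_WORDS or n_simp < MIN_WORDS:
        return False
    if n_simp / n_orig > MAX_LENGTH_RATIO:
        return False
    similarity = word_similarity(original, simplified)
    return MIN_SIMILARITY <= similarity <= MAX_SIMILARITY


def filter_pairs(pairs):
    """Keep only (original, simplified) tuples that pass the heuristic."""
    return [(o, s) for o, s in pairs if is_meaningful_pair(o, s)]
```

A surface heuristic like this cannot tell whether a rewrite is actually simpler; it only removes pairs that clearly carry no useful training signal, so in practice it would likely be combined with other checks.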

Details: Research:Develop a model for text simplification to improve readability of Wikipedia articles/First round of experiments

FY24-25 Hypothesis WE.3.1.3: Content Simplification

After the successful exploratory analysis, we identified content simplification as one potential way to support the objective WE3 in the FY24-25 annual plan towards improving the reader experience: "A new generation of consumers arrives at Wikipedia to discover a preferred destination for discovering, engaging, and building a lasting connection with encyclopedic content."

Specifically, we aim to develop models for remixing content, such as content simplification or summarization, that can be hosted and served via our infrastructure (e.g., LiftWing). This will establish the technical direction for work focused on increasing reader retention through new content discovery features.

Details: Research:Develop_a_model_for_text_simplification_to_improve_readability_of_Wikipedia_articles/FY24-25_WE.3.1.3_content_simplification

Resources

Code repository: https://gitlab.wikimedia.org/repos/research/text-simplification

Datasets: https://analytics.wikimedia.org/published/datasets/one-off/mgerlach/simplification

References

[1] Sun, R., Jin, H., & Wan, X. (2021). Document-Level Text Simplification: Dataset, Criteria and Baseline. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 7997–8013. https://doi.org/10.18653/v1/2021.emnlp-main.630