Research:Develop a model for text simplification to improve readability of Wikipedia articles
In this project, we start exploratory work to develop a model for automatic text simplification of Wikipedia articles using large language models.
The model aims to improve the readability of articles and is a follow-up to our work on measuring readability of articles as part of the Research Team’s program to address knowledge gaps.
The problem of text simplification can be considered a special case of text summarization, so the learnings from this project have broad implications for a wide range of potential use-cases.
As part of the program to address knowledge gaps, the Research team at the Wikimedia Foundation has developed a taxonomy of knowledge gaps. As one of the gaps, we identified the readability of content on Wikimedia projects. Roughly, readability aims to capture how easy it is to read and understand a written text. In the past year, we have successfully developed a multilingual model to measure the readability of articles in Wikipedia across languages: Research:Multilingual_Readability_Research.
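As a rough illustration of what "measuring readability" can mean in practice, the sketch below computes the classical Flesch Reading Ease score for two versions of a sentence. This is not the multilingual model referred to above: the formula is specific to English, the syllable counter is a crude vowel-group heuristic, and the function names `count_syllables` and `flesch_reading_ease` as well as the two example texts are assumptions made purely for illustration.

```python
import re


def count_syllables(word: str) -> int:
    """Rough English syllable estimate: count groups of consecutive vowels.

    This heuristic is an assumption for illustration; it over- or
    under-counts for many words (e.g. silent "e").
    """
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Classical Flesch Reading Ease: higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical example texts, not taken from any Wikipedia article.
original = "The mitochondrion is an organelle responsible for cellular energy production."
simplified = "The cell has a small part. It makes energy for the cell."

if __name__ == "__main__":
    print("original:  ", round(flesch_reading_ease(original), 1))
    print("simplified:", round(flesch_reading_ease(simplified), 1))
```

A score of this kind only captures surface features (sentence length and word length), which is one reason a learned, multilingual model is attractive for Wikipedia.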
As a next step, we would like to go beyond measuring the readability of articles. To improve the readability of an article, we would like to develop a tool that can automatically generate a simplified version of the text that could be surfaced to the reader. With the recent improvements in performance and availability of Large Language Models (LLMs), it seems feasible to develop a model for this task.
More generally, the problem of text simplification is very similar to the problem of text summarization. The latter could serve many use-cases that have been voiced in the context of Wikimedia (Foundation, movement, etc.). Examples include (this list is not exhaustive) summarizing sections of articles, summarizing discussions on talk pages, or even summarizing Phabricator tickets (see, e.g., this project at the 2023 Wikimedia Hackathon). Thus, learnings from the project described here will likely be useful when approaching one or more of the many use-cases around text summarization.
Tentative timeline
- Review of literature and existing models, refinement of the scope of the task (3 months)
- Exploratory implementation (3 months)
- Evaluation of model(s) (3 months)
Policy, Ethics and Human Subjects Research
The work described here is an exploratory project about the feasibility of such models. The current work is not aiming to deploy these models on any Wikimedia project. We acknowledge that there are many caveats for using LLMs in practice which would need to be addressed. As an example, we co-hosted a discussion session with developers at the 2023 Wikimedia Hackathon about the opportunities and threats of large language models (T333127).
Reviewing literature
A concise overview of automatic text simplification can be found here: Research:Develop a model for text simplification to improve readability of Wikipedia articles/Background literature review
Some of the most noteworthy points were:
- Only a few, very recent works have started to approach document-level simplification; existing works consider only English models and data
- Good performance on document-level simplification from fine-tuned BART models
- Judging from related tasks, the recently published mLongT5 model family seems a promising way to develop a domain-specific model via fine-tuning because i) the mT5 model has been successfully used for multilingual sentence-level simplification tasks; ii) the LongT5 model has been successfully used for summarization tasks
- Prompt-based general LLMs such as ChatGPT do not seem to be better than the domain-specific models in automatic evaluation (though not much research has been done on this yet)
- Our data in 10+ languages from children/simple encyclopedias provides a new multilingual dataset to approach document-level simplification beyond English
- Sun, R., Jin, H., & Wan, X. (2021). Document-Level Text Simplification: Dataset, Criteria and Baseline. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 7997–8013. https://doi.org/10.18653/v1/2021.emnlp-main.630