Research:Develop a model for text simplification to improve readability of Wikipedia articles/FY24-25 WE.3.1.3 content simplification
This page captures the work on hypothesis WE.3.1.3 as part of Product & Tech’s Annual Plan for fiscal year 2024–25:
If we develop models for remixing content, such as content simplification or summarization, that can be hosted and served via our infrastructure (e.g. LiftWing), we will establish the technical direction for work focused on increasing reader retention through new content discovery features.
Summary
The hypothesis was confirmed.
- We implemented an LLM in our infrastructure to generate simple summaries of sections of Wikipedia articles
Main deliverables:
- We identified a suitable model based on 5 criteria (multilingual, openness, resources, use-case, quality): Aya-expanse-32b
- Example code to run model in ML-Lab: https://gitlab.wikimedia.org/repos/research/simple-summaries/-/blob/main/simple-summary_aya_example.ipynb
- Test-deployment in LiftWing: phab:T379052
- Defined and implemented a set of guardrail metrics to ensure quality of simple summaries: https://gitlab.wikimedia.org/repos/research/simple-summaries/-/blob/main/simple-summary_eval-guardrail_example.ipynb
- Generated simple summaries and calculated guardrail metrics for 8148 articles (lead sections) in English Wikipedia for the Web Team’s experiments, using two different prompts: simple-summaries-experiment01_aya-expanse
Major lessons
- Our new ML-Lab servers support state-of-the-art multilingual models for text generation (see the Simple Summaries section below).
- Evaluation of the model output is challenging. The lack of simple metrics to judge the quality of the simple summaries (or of any generated text) makes it difficult to iteratively improve the model via offline experiments (i.e. without asking human raters). To address this, we developed a set of 5 interpretable metrics to evaluate and monitor the quality of the simple summaries (see the Evaluation section below).
- However, additional work is needed to optimize these models in order to reduce latency and memory footprint (see Open problems and next steps below).
Next steps
- The crucial next step is to optimize the model latency (memory footprint and inference time) via quantization or other suitable approaches.
Current status
2024-07-04: set up page
2024-07-15: Identified model requirements for respective tasks
2024-08-05: Testing candidate models
2024-09: Decision for use-case to generate simple summaries based on feedback from Web Team's experiments
2024-10: Identification of suitable metrics for evaluating simple summaries via a set of guard rail metrics
2024-11: Implementing model for simple summaries in ML-Lab servers and LiftWing
2024-12: Documentation
2025-07: Updating and expanding evaluation of simple summaries
Background
One of the objectives in the Annual Plan concerns the Reader experience (WE3): A new generation of consumers arrives at Wikipedia to discover a preferred destination for discovering, engaging, and building a lasting connection with encyclopedic content. The goals are to:
- Retain existing and new generations of consumers and donors.
- Increase relevance to existing and new generations of consumers by making our content easier to discover and interact with.
- Work across platforms to adapt our experiences and existing content, so that encyclopedic content can be explored and curated by and to a new generation of consumers and donors.
As part of the Key Result WE.3.1 towards this goal, we want to explore opportunities for readers to more easily discover and learn from content they are interested in. In this project, we focus on models for simplifying the existing content on Wikipedia.
The Readability Gap:
- We have shown in previous work that content on Wikipedia is generally very difficult to read. [1] This means that much of the existing content might not be very accessible to the larger population in terms of readability (the ease with which a reader can understand a written text), considering average reading ability (even among adults).
- There are some Wikipedias with articles using decidedly simpler language, such as Simple English Wikipedia or children’s encyclopedias (Vikidia, Txikipedia, Klexikon, Wikikids). However, they exist in only a few languages (compared to the more than 300 languages in Wikipedia) and cover a much smaller number of articles (for example, as of July 2024, Simple English Wikipedia contains around 250K articles vs 6.8M in English Wikipedia).
Automatic Simplification:
- In order to improve the readability of an article, we would thus like to develop a tool that can automatically generate a simplified version of the text that could be surfaced to the reader. With the recent improvements in performance and availability of Large Language Models (LLMs), it seems feasible to develop a model for this task.
- In previous exploratory work, we showed that it is possible to automatically generate simplified versions of text with some success, even in languages beyond English
Goals
- [Done] Identify requirements (infrastructure, performance, quality, languages, context, etc.)
- [Done] Review candidate models compatible with requirements
- [Done] Implement one or more candidate models
Defining model requirements
In order to decide which model to use for the corresponding tasks, I identified the following requirements:
- Multilingual: The model should support at least some languages other than English; ideally, as many languages as possible from the more than 300 languages in Wikipedia.
- Openness: The model should be open so we can deploy it as a production service in our own infrastructure (LiftWing). Which definition of open needs to be determined.
- Resources: We need to be able to host the model in our infrastructure in LiftWing. This sets a limit on the model size (e.g. number of parameters). Additional constraints come from performance, e.g., the time to return results should be limited.
- Use-case: Does the model have a chance to be effective for the respective task (based on Research and what we know)? Has the model been used for this task or similar tasks before?
- Quality: The output of the model needs to be useful, e.g. the quality should pass some threshold. This requires some evaluation of the model output (automated and/or manual etc.).
Candidate models
In the first step, we identified two potential candidates: text simplification and section gists.
Simplification
Text simplification aims to rephrase the text to make it easier to read and easier to understand while retaining the content (and meaning) of the original text.
Motivation. We have shown in previous work that content on Wikipedia is generally very difficult to read. This means that much of the existing content might not be very accessible to the larger population in terms of readability (the ease with which a reader can understand a written text), considering average reading ability (even among adults). In order to improve the readability of an article, we would thus like to develop a tool that can automatically generate a simplified version of the text (i.e. the same text but using simpler language, such as simple English) that could be surfaced to readers. With the recent improvements in performance and availability of Large Language Models (LLMs), it seems feasible to develop a model for this task.
Implementation. We train a sequence-to-sequence language model following recent approaches in document-level text simplification[2].
As training data, we use an annotated reference dataset (WikiReaD, see below) which contains pairs of articles (original and a simplified version), obtained by matching Wikipedia with a simplified or children’s encyclopedia across 14 languages. We then fine-tune a pre-trained language model using the pairs of articles as samples for the model’s input (original) and output (simplified). Specifically, we fine-tune two recent models: Flan-T5 (large) and mt0 (base). We chose these models based on an evaluation of the requirements defined above:
- Multilingual: Both models are multilingual, supporting many languages besides English, according to the documentation in the respective model cards.
- Openness: The models are available under an open license (Apache 2.0).
- Resources: We are able to train (i.e. fine-tune) the models inside our own infrastructure, specifically on the analytics clients (stat-boxes), which have a (single) GPU. It is possible to host the trained models in the current LiftWing infrastructure. If, in the future, our infrastructure improves to allow for training/hosting larger models, we can easily adapt this approach using larger variants of the same model families.
- Use-case: The model families Flan-T5 and mt0 (based on the mT5 family) have been previously used for training multilingual text simplification models[3].
- Quality: Previous works have reported good performance of these models in text simplification (though mostly on sentences). An exact evaluation still needs to be done.
Code for implementing the model can be found in this repository: https://gitlab.wikimedia.org/repos/research/text-simplification/-/tree/main
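As a rough illustration, the fine-tuning setup can be sketched with the transformers library as below. This is a minimal sketch, not the exact configuration used in the repository: the dataset fields ("original", "simplified"), sequence lengths, and training hyperparameters are illustrative assumptions.

# Minimal sketch of fine-tuning a seq2seq model on (original, simplified) article pairs.
# Dataset fields and hyperparameters are illustrative assumptions, not the exact setup.
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_id = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

def preprocess(example):
    # original article text is the model input, the simplified version is the target
    model_inputs = tokenizer(example["original"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=example["simplified"], max_length=1024, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# `pairs` is assumed to be a datasets.Dataset with "original"/"simplified" columns,
# e.g. built from the WikiReaD article pairs.
tokenized = pairs.map(preprocess, remove_columns=pairs.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="flan-t5-simplification",
        per_device_train_batch_size=2,
        num_train_epochs=3,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()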
Example. As an example, let’s take the first paragraph of the lead section of the article on Hypatia:
Hypatia (born c. 350–370; died 415 AD) was a Neoplatonist philosopher, astronomer, and mathematician who lived in Alexandria, Egypt, then part of the Eastern Roman Empire. She was a prominent thinker in Alexandria where she taught philosophy and astronomy. Although preceded by Pandrosion, another Alexandrian female mathematician, she is the first female mathematician whose life is reasonably well recorded. Hypatia was renowned in her own lifetime as a great teacher and a wise counselor. She wrote a commentary on Diophantus's thirteen-volume Arithmetica, which may survive in part, having been interpolated into Diophantus's original text, and another commentary on Apollonius of Perga's treatise on conic sections, which has not survived. Many modern scholars also believe that Hypatia may have edited the surviving text of Ptolemy's Almagest, based on the title of her father Theon's commentary on Book III of the Almagest.
The resulting simplified version from the model is:
Hypatia (c. 350–370; died 415 AD) was a Neoplatonist philosopher, astronomer, and mathematician. She lived in Alexandria, Egypt, then part of the Eastern Roman Empire. She was a prominent thinker in Alexandria where she taught philosophy and astronomy. She is the first female mathematician whose life is fairly well recorded. Hypatia was known in her own lifetime as a great teacher and a wise counselor. She wrote a commentary on Diophantus' thirteen-volume Arithmetica, which may survive in part, having been interpolated into Diophantus' original text. She also wrote a commentary on Apollonius of Perga's treatise on conic sections, which has not survived. Many modern scholars also believe that Hypatia may have edited the surviving text of Ptolemy's Almagest
The Flesch-Kincaid grade level of the latter is 10.2 (vs 12.3 for the original), showing that it is easier to read according to the readability score.
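Readability scores like these can be reproduced with, e.g., the textstat package; a minimal sketch follows (exact values may differ slightly depending on the library version):

# Compare the Flesch-Kincaid grade levels of the original and simplified paragraphs.
import textstat

original = "Hypatia (born c. 350-370; died 415 AD) was a Neoplatonist philosopher, ..."  # full paragraph above
simplified = "Hypatia (c. 350-370; died 415 AD) was a Neoplatonist philosopher, ..."    # model output above

print(textstat.flesch_kincaid_grade(original))    # higher grade level: harder to read
print(textstat.flesch_kincaid_grade(simplified))  # lower grade level: easier to read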
Section gists
Section gists are plain language summaries of sections of articles. They thus combine simplification with summarization of content.
Motivation. The idea of section gists is taken from the paper Paper Plain: Making Medical Research Papers Approachable to Healthcare Consumers with Natural Language Processing[4]. It aims to improve access to medical papers for readers (outside of Wikipedia). Based on interviews with readers about barriers to interacting with content, and on usability testing, the authors identify section gists as a valuable and the most frequently used feature among non-expert readers. Specifically, they generate section gists automatically by prompting an LLM to create "a summary for a 5th-grader" (i.e. combining summarization and simplification).
Here, we adapt the same framework to Wikipedia articles. Based on initial discussions with folks in the Web Team, section gists could align very well with some of the ideas the team is considering exploring as experiments with readers on Wikipedia.
Implementation. We use the Aya 23 model to generate section gists. This model is a good candidate for the following reasons:
- Multilingual: The main advantage is that it supports 23 languages, reportedly covering half the world's population in terms of speakers (more than any comparable LLM that I am aware of).
- Openness: It is an open-weight model with a CC-BY-NC license. It can be used via Hugging Face.
- Resources: We will likely be able to host the model in our own infrastructure based on recent experiments with similarly-sized models (T369055).
- Use-case: Previous works (such as the Paper Plain paper mentioned above) generated section gists using similar LLMs, prompting for a summary at a certain grade level. Thus, the Aya model seems suitable for the task at hand.
- Quality: The technical report shows that the model outperforms previous massively multilingual models like Aya 101 for the languages it covers, as well as widely used models like Gemma, Mistral, and Mixtral, on an extensive range of discriminative and generative tasks. Specifically, it is shown to perform well on summarization tasks (Sec. 5.3). With our task operationalized as a variant of summarization, we can expect that the model can, in principle, yield good results. In practice, though, this is difficult to evaluate automatically.
We can run the model on the text of individual sections by prompting the model in the following way:
## Instructions
Summarize the text below for a 7-th grader in {LANGUAGE}. Just return the summary.
## Input text
{TEXT}
There are different options to adapt the section gist in terms of
- Length (specify the maximum number of tokens)
- Readability level (e.g. specify a different grade level for the summary)
- Etc.
Example. As an example, let’s take the lead section of the article on Hypatia:
Hypatia (born c. 350–370; died 415 AD) was a Neoplatonist philosopher, astronomer, and mathematician who lived in Alexandria, Egypt, then part of the Eastern Roman Empire. She was a prominent thinker in Alexandria where she taught philosophy and astronomy. Although preceded by Pandrosion, another Alexandrian female mathematician, she is the first female mathematician whose life is reasonably well recorded. Hypatia was renowned in her own lifetime as a great teacher and a wise counselor. She wrote a commentary on Diophantus's thirteen-volume Arithmetica, which may survive in part, having been interpolated into Diophantus's original text, and another commentary on Apollonius of Perga's treatise on conic sections, which has not survived. Many modern scholars also believe that Hypatia may have edited the surviving text of Ptolemy's Almagest, based on the title of her father Theon's commentary on Book III of the Almagest.
Hypatia constructed astrolabes and hydrometers, but did not invent either of these, which were both in use long before she was born. She was tolerant toward Christians and taught many Christian students, including Synesius, the future bishop of Ptolemais. Ancient sources record that Hypatia was widely beloved by pagans and Christians alike and that she established great influence with the political elite in Alexandria. Toward the end of her life, Hypatia advised Orestes, the Roman prefect of Alexandria, who was in the midst of a political feud with Cyril, the bishop of Alexandria. Rumors spread accusing her of preventing Orestes from reconciling with Cyril and, in March 415 AD, she was murdered by a mob of Christians led by a lector named Peter.
Hypatia's murder shocked the empire and transformed her into a "martyr for philosophy", leading future Neoplatonists such as the historian Damascius (c. 458 – c. 538) to become increasingly fervent in their opposition to Christianity. During the Middle Ages, Hypatia was co-opted as a symbol of Christian virtue and scholars believe she was part of the basis for the legend of Saint Catherine of Alexandria. During the Age of Enlightenment, she became a symbol of opposition to Catholicism. In the nineteenth century, European literature, especially Charles Kingsley's 1853 novel Hypatia, romanticized her as "the last of the Hellenes". In the twentieth century, Hypatia became seen as an icon for women's rights and a precursor to the feminist movement. Since the late twentieth century, some portrayals have associated Hypatia's death with the destruction of the Library of Alexandria, despite the historical fact that the library no longer existed during Hypatia's lifetime.
We would get the following section gist:
Hypatia was a smart and brave woman who lived a long time ago in a place called Alexandria, Egypt. She was a teacher and a philosopher, and she knew a lot about math and the stars. She was liked by many people, both Christians and non-Christians. She taught Christian students and was friends with a man named Orestes, who was in charge of Alexandria. Some people were jealous of her, and they spread rumors about her. One day, a group of people attacked her, and she was killed. Her death made a lot of people very sad, and she became famous for standing up for what she believed in. People still remember her today as a brave woman who fought for her beliefs and helped others learn and understand the world.
You can use the tutorial notebook to run the model yourself on PAWS. Currently, the model is accessed via Cohere’s API, for which you need to create a free trial key. The notebook contains some examples; you can adapt it to get section gists for different articles and in different languages.
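For reference, a minimal sketch of calling the model via Cohere’s Python SDK might look like the following; the model identifier is an assumption, so check the tutorial notebook for the exact setup:

# Minimal sketch of generating a section gist via Cohere's API.
# The model identifier "c4ai-aya-23" is an assumption; see the tutorial notebook.
import cohere

co = cohere.Client("YOUR_TRIAL_API_KEY")
section_text = "..."  # placeholder: plain text of the article section

prompt = (
    "## Instructions\n"
    "Summarize the text below for a 7-th grader in English. Just return the summary.\n"
    "## Input text\n"
    f"{section_text}"
)
response = co.chat(model="c4ai-aya-23", message=prompt)
print(response.text)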
Simple Summaries
Our main goal for the hypothesis is to implement a model that can generate simple summaries of sections of articles. This is very similar to the concept of section gists discussed above. The main reason to focus on this model is that the Web Team has identified it as a relevant use case as part of their experiments.
Task
Given the text of a section of an article, a simple summary has the following features:
- Summary: It is substantially shorter than the original section while still capturing the main information.
- Simplicity: It is substantially easier to read (e.g. it improves the readability score).
- Meaning preservation: Its content is factually consistent with the information contained in the text of the article.
Implementation
We use the Aya-expanse model, specifically Aya-expanse-32b. The model is an improvement over the Aya-23 model that we considered in earlier exploratory research (see above). It is an open-weights model and the state of the art for multilingual AI, supporting 23 languages. The comparably moderate size of the model (32B parameters) allows us to implement and host it in our own infrastructure.
ML-lab
We have implemented the model on the ML-Lab servers using the transformers library. Note that we use a smaller datatype (float16 instead of the default float32) in order to reduce the memory footprint.
# Loading the model
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') # check if GPU is available
model_id = "CohereForAI/aya-expanse-32b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to(device)  # load weights in float16 to halve the memory footprint
Using a prompt (and optionally a preamble), we can generate summaries in the following way:
def generate_aya(
    model,
    tokenizer,
    data_in,
    temperature=0.3,
    top_p=1.0,
    top_k=0,
    max_new_tokens=256,
    do_sample=True,
):
    # format the input as a chat with a system preamble and a user prompt
    preamble = data_in["preamble"]
    prompt = data_in["prompt"]
    messages = [
        {"role": "system", "content": preamble},
        {"role": "user", "content": prompt},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    # generate the summary
    gen_tokens = model.generate(
        input_ids,
        top_p=top_p,
        top_k=top_k,
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        do_sample=do_sample,
    )
    # format the output (note: the decoded text includes the prompt and special tokens)
    gen_text = tokenizer.decode(gen_tokens[0], skip_special_tokens=False)
    return gen_text
An example preamble and prompt could look like this (where language is the language in which the article is written, and input_text is the text of the original article section):
preamble = """You are writing encyclopedic articles in the style of Wikipedia using simplified language and a neutral tone."""
prompt = """## Instructions
Summarize the text below at a 7-th grade reading level, in {language}, using 100 words or less. Return only the summary.
## Input text
{input_text}"""
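Putting these pieces together, a hypothetical end-to-end call for a single section could look like this (section_text is a placeholder for the plain text of the section):

# Hypothetical usage of generate_aya with the preamble and prompt template above.
section_text = "..."  # placeholder: plain text of the article section

data_in = {
    "preamble": preamble,
    "prompt": prompt.format(language="English", input_text=section_text),
}
summary = generate_aya(model, tokenizer, data_in)
print(summary)  # raw output; includes the chat template and special tokens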
With this setup, the model’s memory footprint is 60GB (roughly 32B parameters × 2 bytes per float16 weight) and thus fits into the memory of a single GPU. Generating a simple summary for a single section of an article takes around 10s. These numbers can be further reduced through additional optimization (see Open problems and next steps below)
Example notebook: https://gitlab.wikimedia.org/repos/research/simple-summaries/-/blob/main/simple-summary_aya_example.ipynb
LiftWing
We successfully built a test deployment of the model in a staging environment (only accessible internally, e.g. from the stat-machines).
Example query to the (smaller) Aya-expanse-8b model:
$ curl "https://inference-staging.svc.codfw.wmnet:30443/openai/v1/completions" -H "Host: aya.experimental.wikimedia.org" -H "Content-Type: application/json" -X POST -d '{"model": "aya-expanse-8B", "prompt": ".", "max_tokens": 100}'
Implementation details:
This is deployed using the huggingface runtime available in kserve which has an OpenAI API integrated ("openai/v1/completions").
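The same query can also be issued from Python, e.g. with the requests library; a minimal sketch mirroring the curl command above (certificate/SNI handling may need adjustment depending on the environment):

# Python equivalent of the curl query above (only reachable from internal hosts).
import requests

response = requests.post(
    "https://inference-staging.svc.codfw.wmnet:30443/openai/v1/completions",
    headers={
        "Host": "aya.experimental.wikimedia.org",
        "Content-Type": "application/json",
    },
    json={"model": "aya-expanse-8B", "prompt": ".", "max_tokens": 100},
)
print(response.json())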
Evaluation
In order to assess whether the model works in practice, and to iteratively improve it, it is crucial to evaluate its performance using some evaluation metric.
With recent Large Language Models, the evaluation of natural language generation (NLG) or text generation is a difficult and unsolved task[5]. For example, a recent paper Toward an Evaluation Science for Generative AI Systems[6] argues that "There is an increasing imperative to anticipate and understand the performance and safety of generative AI systems in real-world deployment contexts. However, the current evaluation ecosystem is insufficient: Commonly used static benchmarks face validity challenges, and ad hoc case-by-case audits rarely scale. In this piece, we advocate for maturing an evaluation science for generative AI systems."
Specifically, in the case of summarization/simplification, commonly-used automatic benchmark metrics are Rouge, BLEU, SARI, etc. However, these metrics suffer from many drawbacks:[7]
- The metrics typically require a ground-truth or reference dataset. In practice, for many tasks such data is not easily available, especially when looking at languages beyond English. For example, we do not have readily available and verified simple summaries of Wikipedia articles.
- The metrics have been shown to correlate poorly with human judgement. That is, while they are convenient to calculate, they do not necessarily align with how humans would rate the quality of the generated text. For example, for some text simplification metrics it has been shown that “low scores” indicate “bad quality”; whereas, in contrast, “high scores” do not necessarily imply “good quality” of the simplification.
- They are often not easily interpretable. For example, the SARI score is an average of 2 F1-scores (for add and keep operations) and a precision score (delete operations). As a result, it is not clear what value of SARI should be considered acceptable or good enough.
Overall, this renders the common summarization/simplification metrics not very useful when making decisions about whether to deploy a model in practice, or which one.
While this work focuses on evaluating the simplification/summarization output of the simple-summaries model, the insights will also be informative for other tasks where we use the generative output of LLMs (in contrast to, e.g., classification).
Metrics
As an alternative approach, we can use a set of simpler, easy-to-interpret guardrail metrics to assess specific aspects of the quality of the generated simple summaries.
First, we focus on three aspects that are typically considered when asking human judges to rate simplifications[8], and define an automatic metric as a proxy for each:
- Simplicity captures the readability of the generated text (i.e. how easy it is to read). We calculate a readability score, such as the Flesch-Kincaid grade level (for English) or the multilingual readability score (beyond English)[1]. Ideally, the grade level of the simple summary is lower than that of the original article. For this, we consider the change in readability score (i.e. a negative value means the readability score of the simple summary is lower/better than that of the original).
- Fluency captures the degree to which the generated text is grammatical. We calculate the number of grammar and spelling errors, e.g., using LanguageTool. Ideally, the simple summary does not have any grammatical or spelling errors.
- Meaning preservation captures whether the generated text is factually consistent with the original text. We calculate the score (probability) that the generated text is entailed by the original text using the SummaC model [9]. This score is between 0 (no entailment/inconsistent) and 1 (high entailment/consistent). Ideally, the score is above some threshold (say 0.4) to make sure that information in the simple summary is consistent with the original article.
In addition, we consider the following aspects:
- Language confusion captures whether the output of the model is in the correct language. Qualitatively, we observed that the output of the simple-summary model was sometimes in a different language than the input. Recent work [10] has identified language confusion, a model’s inability to consistently generate text in a user’s desired language, as a general limitation of multilingual LLMs (including the Aya model). In related work[11], it was shown that quantization can significantly affect performance in this respect. We use a language identification model which supports 201 languages[12] (code). The model is hosted on LiftWing so it can be used off-the-shelf. For each simple summary, we identify the language and check whether it matches the expected language (1) or not (0).
- Tone captures whether the simple summary is written in an encyclopedic tone. Qualitatively, we observed that some simple summaries contained non-encyclopedic language (example: “super tall buildings”). We use the peacock detection model for detecting policy violations, developed to support Edit Check (T368274). It is based on the peacock template, which indicates that an article "contains wording that promotes the subject in a subjective manner without imparting real information". We can use the model to detect similar issues in the simple summaries generated by the model. For each simple summary, we obtain a score between 0 (tone is good) and 1 (tone is not encyclopedic).
The advantages of these guardrail metrics are:
- they are reference-free, i.e. they do not require ground truth for evaluation
- they are interpretable, i.e. they provide information about specific aspects of the quality of the generated simple summaries.
- they can thus help identify potential issues with individual simple summaries, so these can be checked and filtered (if needed) during post-processing. For example, simple summaries with low scores on meaning preservation (say, below 0.25) are likely to contain information that is not contained in the original text (e.g. hallucinations). Naturally, there are other aspects of quality, such as adherence to the Neutral point of view policy, for which we do not have readily available metrics (see Open problems below).
Summary of metrics
Metric | Explanation | Score Range | What is better | Implementation |
Simplicity | Change in readability score (e.g. the Flesch-Kincaid grade level) between simple summary and original as readability_summary - readability_original. Thus, a negative score indicates that the simple summary is easier to read. In contrast, a neutral or positive score will indicate that the simple summary is not easier to read than the original. | -10…10 | ↓ Lower scores (negative) | Multilingual readability model |
Fluency | Number of grammatical errors in the simple summary. The fewer, the better. | 0...integer | ↓ Lower scores (0) | LanguageTool |
Meaning preservation | Confidence score that content in the simple summary is supported (i.e. entailed) by the text of the original. Low scores might indicate hallucinations. | 0...1 | ↑ Higher scores (1) | SummaC |
Language confusion | Confidence score that the simple summary is written in the expected language using a language detection model. A low score indicates that simple summary is not in the correct language. | 0...1 | ↑ Higher scores (1) | Language detection model |
Tone | Confidence score that the tone is not encyclopedic using the peacock detection model. A high score indicates that the tone is not encyclopedic. | 0...1 | ↓ Lower scores (0) | Tone check model |
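A minimal sketch of computing three of these metrics for a single (original, summary) pair is shown below, assuming the textstat, language_tool_python, and summac packages; the actual pipeline (see the evaluation notebook linked above) differs in details, e.g. it uses the multilingual readability model instead of textstat:

# Sketch of three guardrail metrics for one (original, summary) pair.
# The package choices here are assumptions; see the evaluation notebook for the real pipeline.
import textstat
import language_tool_python
from summac.model_summac import SummaCZS

def guardrail_metrics(original: str, summary: str) -> dict:
    # Simplicity: change in readability (negative = summary is easier to read)
    simplicity = textstat.flesch_kincaid_grade(summary) - textstat.flesch_kincaid_grade(original)

    # Fluency: number of grammar/spelling issues flagged by LanguageTool
    tool = language_tool_python.LanguageTool("en-US")
    fluency = len(tool.check(summary))

    # Meaning preservation: entailment score of the summary given the original
    summac_model = SummaCZS(granularity="sentence", model_name="vitc", device="cpu")
    meaning = summac_model.score([original], [summary])["scores"][0]

    return {"simplicity": simplicity, "fluency": fluency, "meaning_preservation": meaning}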
Examples:
1) Tone: Detecting non-encyclopedic tone in the simple summary (Tone score = 0.91)
- Model output: São Paulo is Brazil's biggest city and an important place in the world for business, art, and fun. It's named after Paul the Apostle and has people from many countries living there. The city started with Jesuit priests in 1554 and grew strong during the coffee trade. Now, it's a huge economic center with lots of big companies and a cool cultural scene. São Paulo hosts big events like the World Cup and has awesome museums, parks, and festivals. It's also home to super tall buildings!
2) Meaning preservation: Detecting hallucinations in the simple summary (Meaning preservation score = 0.17)
- Original article: 2 (two) is a number, numeral and digit. It is the natural number following 1 and preceding 3. It is the smallest and the only even prime number. Because it forms the basis of a duality, it has religious and spiritual significance in many cultures.
- Model output: Two is a number that comes between 1 and 3. It's special because it's the only even prime number, meaning it can only be divided by 1 and itself. In many cultures, the number two represents duality and has religious importance. Think of ideas like "good and evil" or "light and dark" - these are examples of duality. So, the number two is pretty significant and not just a simple number!
Which languages are supported for the evaluation metrics:
- Fluency is implemented using LanguageTool, which supports 31 languages: ar ast be br ca crh da de el en eo es fa fr ga gl it ja km nl pl pt ro ru sk sl sv ta tl uk zh. If additional languages are required, one relatively straightforward option would be to use spellcheckers for that specific language using libraries such as pyenchant. Open spellcheckers can be found via, e.g., LibreOffice Language Support (bn, cs, etc.)
- Language confusion is implemented using a language identification model which supports 188 languages (those that are explicitly matched with a Wikipedia project), so this should capture most relevant cases for now (a sketch follows this list).
- Simplicity, Meaning preservation, and Tone are implemented using different fine-tuned smaller multilingual language models. They do not have a well-defined language coverage. Their backbone models support ~100 languages. They have been explicitly validated for 10-20 languages in zero-shot settings with good results. Therefore, they likely also generalize well for other languages but it often depends on how well that language is captured by the model. A good compromise here is to work with the top-20 languages (ar,cs,de,en,es,fa,fr,he,id,it,ja,nl,no,ro,ru,pl,pt,tr,uk,zh)
- Simplicity. This uses the readability model from Trokhymovych et al. 2024. It has been explicitly validated to work in a zero-shot scenario (i.e. without being fine-tuned on those languages explicitly) for: ca,de,el,es,eu,fr,hy,it,nl,oc,pt,ru,scn (it was trained on English).
- Meaning preservation. This uses the SummaC model from Laban et al. 2022. In the original paper, it was only validated on English data. A recent paper by Kang et al. 2024 evaluated SummaC (among other methods) in a multilingual setting to detect hallucinations in text generation. They conclude that i) “[summaC] effectively detect sentence-level hallucinations in high-resource languages when compared to human evaluations” and ii) “[summaC] outperform supervised approaches at detecting hallucinations that can be verified or refuted by the reference text”. They mention “NLI metrics” but use SummaC for the implementation (“we adopt the NLI-based zero-shot sentence-level SUMMAC (SummaCzs) scoring system (Laban et al., 2021) to evaluate hallucinations.”). The high-resource languages considered are: en, es, fr, id, vi, zh. In addition, I qualitatively checked the examples in German from the multi-core-01 benchmark (generated with prompt_id=01) and found that high scores (>0.5) corresponded to simple summaries with preserved meaning, while low scores (<0.5) showed some form of hallucination.
- Tone. This uses the peacock detection model developed in T368274: Detecting Peacock behavior with LLMs. The multilingual version of the model has been validated for 10 languages: ar, de, en, es, fr, ja, nl, pt, ru, zh. It is suspected that the model works well for other languages as well (especially those among the top-20 or so language versions of Wikipedia), but evaluation is currently ongoing in T387925: Determine language support for Peacock Check (v1).
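For the language-confusion check, a sketch using the fasttext bindings with an OpenLID-style model might look like this; the model file name is an assumption, and in production the model is served on LiftWing instead:

# Sketch of the language-confusion check with a fasttext language-identification model.
# The model file name is an assumption; in production the model is hosted on LiftWing.
import fasttext

lid_model = fasttext.load_model("lid201-model.bin")

def matches_expected_language(summary: str, expected: str) -> int:
    # fasttext labels look like "__label__eng_Latn"; expected would be e.g. "eng_Latn"
    labels, _ = lid_model.predict(summary.replace("\n", " "))
    return 1 if expected in labels[0] else 0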
Benchmark datasets
We put together a set of benchmark datasets for consistent evaluation of different models. We considered the following aspects:
- size: the dataset should be large enough to give good statistics on the metrics, but not so large that the running time becomes unreasonable (probably somewhere between 100 and 10,000 articles)
- representativeness: should the dataset be a random subsample or contain known edge cases (e.g. in terms of readability)?
- languages: it should contain articles from other languages for multilingual evaluation
From these criteria, we generated the following benchmark datasets:
- en-random: 100 articles randomly selected from the first round of experiments in English Wikipedia
- en-edgecases: the 100 articles with the lowest scores from the first round of experiments in English Wikipedia (with 5 metrics, choosing the 20 lowest-scoring cases for each dimension)
- en-cutoff: 100 random articles from English Wikipedia that were created after the model was released
- multi-core: random sample of articles from Wikipedia language versions that are explicitly supported by the Aya model (23 languages). Specifically, we select the same set of articles as for “en-random”, but for each article we randomly select one of the language versions in which it is available.
- multi-ext: random sample of articles from Wikipedia language versions that are not explicitly supported by the Aya model. Specifically, we select the same set of articles as for “en-random”, but for each article we randomly select one of the language versions in which it is available.
- <lang>-random: corresponding articles from en-random in <lang> if they exist (i.e. the dataset might contain fewer than 100 articles). <lang> = "ar","cs","de","el","es","fa","fr","he","hi","id","it","ja","ko","nl","ro","ru","pl","pt","tr","uk","vi","zh"
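As an illustration, a random sample of article titles (e.g. for en-random) can be drawn via the MediaWiki API; this is a minimal sketch, the actual code to generate the data is referenced under Resources below:

# Sketch: draw a random sample of main-namespace article titles via the MediaWiki API.
import requests

def random_articles(lang: str = "en", n: int = 100) -> list:
    response = requests.get(
        f"https://{lang}.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "list": "random",
            "rnnamespace": 0,  # main (article) namespace
            "rnlimit": n,
            "format": "json",
        },
        headers={"User-Agent": "simple-summaries-benchmark-sketch"},
    )
    return [page["title"] for page in response.json()["query"]["random"]]

print(random_articles("en", 5))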
Resources:
- Code to generate data:
- Data as csv files in repo: https://gitlab.wikimedia.org/repos/research/simple-summaries/-/tree/main/benchmarks
Application: Improving the prompt
I tested the quality metrics for systematically improving the prompt that we use for generating the simple summaries. Using the benchmark data and evaluation metrics, we can quantitatively compare different prompts instead of manually spot-checking individual samples.
I tested 11 different prompts (the three earlier prompts plus eight new variants, 05a–05h) on one of the English benchmark datasets, calculating all 5 quality metrics for evaluation.
Prompt_id | Prompt | Preamble (optional) |
1 | ## Instructions
Summarize the text below at a 7-th grade reading level, in {language}, using 100 words or less. Return only the summary. ## Input text {input_text} |
You are writing encyclopedic articles in the style of Wikipedia using simplified language and a neutral tone. |
3 | ## Task and Context
Summarize the text below into a clear and simple paragraph that is easy to understand for a general audience. You will always generate the text in the source language of the input text. Keep the summary concise, with a target of 100 words or less, capturing only the essential points of the content. Focus on readability, accessibility, and neutrality. ## Tone Write the summary in a style and tone appropriate for Wikipedia. Avoid any editorializing, opinions, or expressive language about the subject. Ensure the tone remains neutral, professional and factual. ## Summarization Guidelines - The source language of the input text is in {language}. Please always respond ONLY in the source language. - Provide an accessible, straightforward summary suitable for readers seeking a quick, understandable overview. - Refrain from using humorous or imaginative titles; provide only the summary itself as output. - Summarize the text above in simple language, using 100 words or less. Do not expand content beyond its original length if it is already brief. - Return only the summary. Avoid any in-text citations, footnotes or reference links in the summary. ## Input Text {input_text} |
- |
4 | ## Core Requirement
YOU MUST GENERATE THE SUMMARY IN THE EXACT SAME LANGUAGE AS THE INPUT TEXT. This is a strict requirement that cannot be violated under any circumstances. ## Task and Context Summarize the text below into a clear and simple paragraph that is easy to understand for a general audience. You will always generate the text in the source language of the input text. Keep the summary concise, with a target of 100 words or less, capturing only the essential points of the content. Focus on readability, accessibility, and neutrality. ## Language Control 1. First, identify the input text's language: {language} 2. You must generate the summary in {language} 3. If you detect yourself starting to switch languages, stop and restart in {language} 4. The entire response must be in {language} only ## Summarization Guidelines - Write the summary in a style and tone appropriate for Wikipedia. - Avoid any editorializing, opinions, or expressive language about the subject. Ensure the tone remains neutral, professional and factual. - Provide an accessible, straightforward summary suitable for readers seeking a quick, understandable overview. - Refrain from using humorous or imaginative titles; provide only the summary itself as output. - Summarize the text above in simple language, using 100 words or less. Do not expand content beyond its original length if it is already brief. - Return only the summary. Avoid any in-text citations, footnotes or reference links in the summary. ## Input Text {input_text} |
- |
05a | ## Core Requirement
YOU MUST GENERATE THE SUMMARY IN THE EXACT SAME LANGUAGE AS THE INPUT TEXT. This is a strict requirement that cannot be violated under any circumstances. ## Task and Context Summarize the text below into a clear and simple paragraph that is easy to understand at a 7-th grade reading level. You will always generate the text in the source language of the input text. Keep the summary concise, with a target of 100 words or less, capturing only the essential points of the content. Focus on readability, accessibility, and neutrality. ## Language Control 1. First, identify the input text's language: {language} 2. You must generate the summary in {language} 3. If you detect yourself starting to switch languages, stop and restart in {language} 4. The entire response must be in {language} only ## Summarization Guidelines - Write the summary in a style and tone appropriate for Wikipedia. - Avoid any editorializing, opinions, or expressive language about the subject. Ensure the tone remains neutral, professional and factual. - Provide an accessible, straightforward summary suitable for readers seeking a quick, understandable overview. - Refrain from using humorous or imaginative titles; provide only the summary itself as output. - Summarize the text above in simple language at a 7-th grade reading level, using 100 words or less. Do not expand content beyond its original length if it is already brief. - Return only the summary. Avoid any in-text citations, footnotes or reference links in the summary. ## Input Text {input_text} |
- |
05b | ## Core Requirement
YOU MUST GENERATE THE SUMMARY IN THE EXACT SAME LANGUAGE AS THE INPUT TEXT. This is a strict requirement that cannot be violated under any circumstances. ## Task and Context Summarize the text below into a clear and simple paragraph that is easy to understand at a 7-th grade reading level. You will always generate the text in the source language of the input text. Keep the summary concise, with a target of 100 words or less, capturing only the essential points of the content. Focus on readability, meaning preservation, as well as neutrality and an encyclopedic tone. ## Language Control 1. First, identify the input text's language: {language} 2. You must generate the summary in {language} 3. If you detect yourself starting to switch languages, stop and restart in {language} 4. The entire response must be in {language} only ## Summarization Guidelines - Write the summary in a style and tone appropriate for Wikipedia. - Avoid any editorializing, opinions, or expressive language about the subject. Ensure the tone remains neutral, professional and factual. - Provide an accessible, straightforward summary suitable for readers seeking a quick, understandable overview. - Refrain from using humorous or imaginative titles; provide only the summary itself as output. - Summarize the text below in simple language at a 7-th grade reading level, using 100 words or less. Do not expand content beyond its original length if it is already brief. - Return only the summary. Avoid any in-text citations, footnotes or reference links in the summary. ## Input Text {input_text} |
- |
05c | ## Core Requirement
YOU MUST GENERATE THE SUMMARY IN THE EXACT SAME LANGUAGE AS THE INPUT TEXT. This is a strict requirement that cannot be violated under any circumstances. ## Task and Context Summarize the text below into a clear and simple paragraph that is easy to understand at a 7-th grade reading level. You will always generate the text in the source language of the input text. Keep the summary concise, with a target of 100 words or less, capturing only the essential points of the content. Focus on readability, accessibility, and neutrality. ## Language Control 1. First, identify the input text's language: {language} 2. You must generate the summary in {language} 3. If you detect yourself starting to switch languages, stop and restart in {language} 4. The entire response must be in {language} only ## Summarization Guidelines - Write the summary in a style and tone appropriate for Wikipedia. - Avoid any editorializing, opinions, or expressive language about the subject. Ensure the tone remains neutral, professional and factual. - Provide an accessible, straightforward summary suitable for readers seeking a quick, understandable overview. - Refrain from using humorous or imaginative titles; provide only the summary itself as output. - Return only the summary. Avoid any in-text citations, footnotes or reference links in the summary. - Summarize the text below using 100 words or less. Do not expand content beyond its original length if it is already brief. - Most importantly, the summary should be written using much simpler language aimed towards a 7-th grade reading level. ## Input Text {input_text} |
- |
05d | ## Task
Summarize the text below at a 7-th grade reading level in {language}. Keep the summary concise, with a target of 100 words or less, capturing only the essential points of the content. Write the summary in a style and tone appropriate for Wikipedia. ## Detailed summarization guidelines - YOU MUST GENERATE THE SUMMARY IN THE EXACT SAME LANGUAGE AS THE INPUT TEXT. This is a strict requirement that cannot be violated under any circumstances. The entire response must be in {language} only. If you detect yourself starting to switch languages, stop and restart in {language} - Avoid any editorializing, opinions, or expressive language about the subject. Ensure the tone remains neutral, professional and factual. - Refrain from using humorous or imaginative titles; provide only the summary itself as output. - Summarize the text below using 100 words or less. Do not expand content beyond its original length if it is already brief. - Most importantly, the summary should be written using much simpler language aimed towards a 7-th grade reading level. ## Input Text {input_text} |
You are writing encyclopedic articles in the style of Wikipedia using simplified language and a neutral tone. |
05e | ## Task
Summarize the text below at a 7-th grade reading level, in {language}, using 100 words or less. ## Guidelines * Write the summary in a style and tone appropriate for Wikipedia. Avoid any editorializing, opinions, or expressive language about the subject and ensure the tone remains neutral, professional and factual; * Write the summary in {language}. If you detect yourself starting to switch to another, stop and restart in {language}; * Provide only the summary itself as output and refrain from using humorous or imaginative titles; * Do not expand content beyond its original length if it is already brief. * Keep the summary concise capturing only the essential points of the content. Do not expand content beyond its original length if it is already brief. ## Input text {input_text} |
You are writing encyclopedic articles in the style of Wikipedia using simplified language and a neutral tone. |
05f | ## Task
Summarize the text below at a 7-th grade reading level, in {language}, using 100 words or less. ## Guidelines When writing the summary it is crucial to stick to ALL of the following guidelines: * Simplicity: The summary should be easy to read and understand. Aim for a 7-th grade reading level; and * Fluency: The summary should be grammatically correct. Provide only the summary itself as output and refrain from using humorous or imaginative titles; and * Meaning preservation: The summary should be factually consistent with the input text capturing its essential points. Do not expand content beyond its original length if it is already brief; and * Language: The summary should be written in {language}. If you detect yourself starting to switch to another language, stop and restart in {language}; and * Tone: The summary should be written in a style and tone appropriate for Wikipedia. Avoid any editorializing, opinions, or expressive language about the subject and ensure the tone remains encyclopedic, neutral, and professional. ## Input text {input_text} |
- |
05g | ## Task
Summarize the text below at a 7-th grade reading level, in {language}, using 100 words or less. ## Guidelines When writing the summary it is crucial to stick to ALL of the following guidelines: * Simplicity: The summary should be easy to read and understand. This means it has to be substantially simpler than the input text. If that is not the case, restart and use simpler language; and * Fluency: The summary should be grammatically correct. Provide only the summary itself as output and refrain from using humorous or imaginative titles; and * Meaning preservation: The summary should be factually consistent with the input text capturing its essential points. Do not expand content beyond its original length if it is already brief; and * Language: The summary should be written in {language}. If you detect yourself starting to switch to another language, stop and restart in {language}; and * Tone: The summary should be written in a style and tone appropriate for Wikipedia. Avoid any editorializing, opinions, or expressive language about the subject and ensure the tone remains encyclopedic, neutral, and professional. If you detect using any words that have a non-neutral tone, stop and restart with more neutral language. ## Input text {input_text} |
- |
05h | ## Task
Summarize the text below at a 7-th grade reading level, in {language}, using 100 words or less. ## Guidelines When writing the summary it is crucial to stick to ALL of the following guidelines: * Formatting: Provide only the summary itself as output and refrain from using humorous or imaginative titles. Avoid any in-text citations, footnotes or reference links in the summary. * Simplicity: The summary should be easy to read and understand. Aim for a 7-th grade reading level; and * Meaning preservation: The summary should be factually consistent with the input text capturing its essential points. Do not expand content beyond its original length if it is already brief; and * Language: The summary should be written in {language}. If you detect yourself starting to switch to another language, stop and restart in {language}; and * Tone: The summary should be written in a style and tone appropriate for a Wikipedia article. Avoid any editorializing, opinions, exaggerations, or expressive language about the subject. Make sure that the tone remains encyclopedic, neutral, and professional. ## Input text {input_text} |
- |
Results:
Prompt-id | simplicity ↓ | fluency ↓ | meaning preservation ↑ | language confusion ↑ | tone ↓ |
1 | -4.08 | 0.13 | 0.73 | 0.96 | 0.44 |
3 | -0.53 | 0.19 | 0.77 | 0.79 | 0.28 |
4 | -0.35 | 0.15 | 0.82 | 0.93 | 0.31 |
05a | -2.63 | 0.11 | 0.8 | 0.99 | 0.33 |
05b | -2.33 | 0.08 | 0.8 | 0.99 | 0.32 |
05c | -3.28 | 0.11 | 0.79 | 0.99 | 0.36 |
05d | -3.74 | 0.12 | 0.73 | 0.96 | 0.39 |
05e | -2.27 | 0.08 | 0.76 | 0.9 | 0.31 |
05f | -3.44 | 0.09 | 0.8 | 0.98 | 0.35 |
05g | -3.59 | 0.11 | 0.8 | 0.93 | 0.39 |
05h | -3.13 | 0.08 | 0.81 | 0.98 | 0.34 |
From this experiment, candidate prompt 05f seems to yield the best results.
In comparison to prompt_id 04, it creates simple summaries that are substantially easier to read: the readability score decreases by ~3.5 grade levels (compared to 0.35). At the same time, the tone score (checking for peacock language) for prompt_id 05f (0.35) is almost the same as for prompt_id 04 (0.31), and still substantially better than for the initial prompt_id 01 (0.44), where we detected these issues. The other metrics are also similar (meaning preservation) or even better (fluency, language confusion). So prompt_id 05f seems like a good compromise between our first prompt (prompt_id 01), with good simplification, and prompt_id 04, with good tone, without giving up much on the other metrics. Interestingly, looking through the results for the other prompts, it seemed hard to improve both simplicity AND tone: when one improved, the other would typically decrease. My rationale for the new prompt_id 05f was to keep the prompt more concise and provide explicit guidelines for each of the dimensions of the quality metrics we use to evaluate the simple summaries.
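The per-prompt averages in the table above can be computed by aggregating the per-article metric scores; a sketch with pandas follows (file and column names are assumptions based on the metric names):

# Sketch: aggregate per-article guardrail metrics into the per-prompt comparison table.
# File and column names are assumptions based on the metric names used above.
import pandas as pd

metrics = ["simplicity", "fluency", "meaning_preservation", "language_confusion", "tone"]

# one row per (prompt_id, article) with the five metric scores
df = pd.read_csv("benchmark_results.csv")
table = df.groupby("prompt_id")[metrics].mean().round(2)
print(table)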
Open problems and next steps
In this hypothesis, we demonstrated the feasibility of running/hosting a model to generate simple summaries in our own infrastructure. If the model is considered useful, the next step would be to scale the model for deployment. Specifically, we would need to optimize how we run the model in order to reduce the memory footprint and inference time. This work is beyond the scope of the current task and deserves a dedicated task. It includes (but is not limited to):
- inference optimization on GPU: This requires a systematic investigation of the different available options and if/how they work in our infrastructure (e.g. many approaches are not supported on ROCm GPUs). In turn, we also need to better understand the trade-off with model quality in order to make sure that the output is still acceptable for the task at hand. Although using the prebuilt huggingface runtime provides a simpler way to deploy models, it doesn't facilitate the level of customization we want to have at this time in order to explore. This involves:
- quantization
- flash attention
- inference optimization frameworks (e.g. vllm)
- The above need to be tested on ML-Lab and then deployed to Lift Wing. Deployment on Lift Wing involves building the packages from source for the required GPU architecture in a way that is reproducible, so that we can iterate and update them when needed.
- Batch inference. The current implementation generates one sample at a time. However, it is possible to run the model in batches, which can reduce the time per sample (though at an additional memory cost). For example, see here.
It is important to note that improving inference via, e.g., quantization often comes at the cost of reduced quality of the model output. The evaluation metrics are thus crucial to find a good balance between improving latency while still making sure that the model output meets certain quality criteria.
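As an illustration of the quantization option, a 4-bit load via bitsandbytes could look like the sketch below; whether bitsandbytes works on our ROCm GPUs, and how much output quality is lost, are exactly the open questions to evaluate with the guardrail metrics:

# Sketch: loading the model with 4-bit quantization via bitsandbytes.
# ROCm support for bitsandbytes and the impact on output quality are open questions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "CohereForAI/aya-expanse-32b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place the quantized weights on the available GPU(s)
)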
Resources
- Repository for running the model to generate simple summaries: https://gitlab.wikimedia.org/repos/research/simple-summaries
References
- ↑ a b Trokhymovych, Mykola, Indira Sen, and Martin Gerlach. “An Open Multilingual System for Scoring Readability of Wikipedia.” In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), edited by Lun-Wei Ku, Andre Martins, and Vivek Srikumar, 6296–6311. Bangkok, Thailand: Association for Computational Linguistics, 2024. https://doi.org/10.18653/v1/2024.acl-long.342.
- ↑ Sun, Renliang; Jin, Hanqi; Wan, Xiaojun (2021). "Document-Level Text Simplification: Dataset, Criteria and Baseline". Association for Computational Linguistics. pp. 7997–8013. doi:10.18653/v1/2021.emnlp-main.630.
- ↑ Joseph, Sebastian; Kazanas, Kathryn; Reina, Keziah; Ramanathan, Vishnesh; Xu, Wei; Wallace, Byron; Li, Junyi (2023). "Multilingual Simplification of Medical Texts". Association for Computational Linguistics. pp. 16662–16692. doi:10.18653/v1/2023.emnlp-main.1037.
- ↑ August, Tal; Wang, Lucy Lu; Bragg, Jonathan; Hearst, Marti A.; Head, Andrew; Lo, Kyle (2023-10-31). "Paper Plain : Making Medical Research Papers Approachable to Healthcare Consumers with Natural Language Processing". ACM Transactions on Computer-Human Interaction 30 (5): 1–38. ISSN 1073-0516. doi:10.1145/3589955.
- ↑ Gehrmann, Sebastian, Elizabeth Clark, and Thibault Sellam. “Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text.” Journal of Artificial Intelligence Research 77 (May 29, 2023): 103–66. https://doi.org/10.1613/jair.1.13715.
- ↑ Weidinger, Laura; Raji, Inioluwa Deborah; Wallach, Hanna; Mitchell, Margaret; Wang, Angelina; Salaudeen, Olawale; Bommasani, Rishi; Ganguli, Deep; Koyejo, Sanmi (2025-03-13), Toward an Evaluation Science for Generative AI Systems, arXiv, doi:10.48550/arXiv.2503.05336, retrieved 2025-07-10
- ↑ Alva-Manchego, Fernando, Carolina Scarton, and Lucia Specia. “The (Un)Suitability of Automatic Evaluation Metrics for Text Simplification.” Comput. Linguist. Assoc. Comput. Linguist. 47, no. 4 (December 23, 2021): 861–89. https://doi.org/10.1162/coli_a_00418.
- ↑ Alva-Manchego, Fernando, Carolina Scarton, and Lucia Specia. “Data-Driven Sentence Simplification: Survey and Benchmark.” Comput. Linguist. Assoc. Comput. Linguist. 46, no. 1 (March 2020): 135–87. https://doi.org/10.1162/coli_a_00370.
- ↑ Laban, Philippe, Tobias Schnabel, Paul N. Bennett, and Marti A. Hearst. “SummaC: Re-Visiting NLI-Based Models for Inconsistency Detection in Summarization.” Edited by Brian Roark and Ani Nenkova. Transactions of the Association for Computational Linguistics 10 (2022): 163–77. https://doi.org/10.1162/tacl_a_00453.
- ↑ Marchisio, Kelly; Ko, Wei-Yin; Berard, Alexandre; Dehaze, Théo; Ruder, Sebastian (November 2024). Al-Onaizan, Yaser; Bansal, Mohit; Chen, Yun-Nung, eds. "Understanding and Mitigating Language Confusion in LLMs". Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (Miami, Florida, USA: Association for Computational Linguistics): 6653–6677. doi:10.18653/v1/2024.emnlp-main.380.
- ↑ Marchisio, Kelly; Dash, Saurabh; Chen, Hongyu; Aumiller, Dennis; Üstün, Ahmet; Hooker, Sara; Ruder, Sebastian (November 2024). Al-Onaizan, Yaser; Bansal, Mohit; Chen, Yun-Nung, eds. "How Does Quantization Affect Multilingual LLMs?". Findings of the Association for Computational Linguistics: EMNLP 2024 (Miami, Florida, USA: Association for Computational Linguistics): 15928–15947. doi:10.18653/v1/2024.findings-emnlp.935.
- ↑ Burchell, Laurie; Birch, Alexandra; Bogoychev, Nikolay; Heafield, Kenneth (July 2023). Rogers, Anna; Boyd-Graber, Jordan; Okazaki, Naoaki, eds. "An Open Dataset and Model for Language Identification". Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (Toronto, Canada: Association for Computational Linguistics): 865–879. doi:10.18653/v1/2023.acl-short.75.