Research:Understanding perception of readability in Wikipedia

Tracked in Phabricator:
Task T325815
13:30, 22 December 2022 (UTC)
Indira Sen
Mareike Wieland
Katrin Weller
Duration:  2023-January – ??

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

In this project, we aim to better understand readers’ perception of readability of articles in Wikipedia. For this, we plan to conduct surveys asking participants to rate the readability of articles. This will help us evaluate the validity of a recently developed language-agnostic model to generate automated scores for readability.



As part of our research program to Address Knowledge Gaps, we have been working to develop a multilingual model to assess the readability of Wikipedia articles (Research:Multilingual Readability Research). We have successfully tested a language-agnostic model which yields an automated readability score across languages. We evaluated the model on a set of articles annotated with different levels of readability, testing whether it can distinguish an article from Wikipedia (e.g. English Wikipedia) from its “simpler” counterpart (the corresponding article from Simple English Wikipedia or Vikidia).

However, this approach has (at least) two main limitations. First, ground-truth data of articles with annotations of their readability levels is not easily available for most languages in Wikipedia. Second, automated scores of readability do not necessarily capture the way readers perceive readability of articles[1].

Therefore, we want to better understand how readers perceive the readability of Wikipedia articles. We aim to conduct surveys to obtain quantitative measures of perceived readability. This is a crucial step in evaluating whether our automated readability scores match readability as perceived by readers. In addition, our methodology provides a framework to evaluate readability in languages for which we currently do not have ground-truth datasets of articles annotated with their readability level. Overall, we hope this study will improve our confidence in the validity of automated multilingual measures of readability.





Planned timeline

  • Background research (1 month): literature review on how to measure perception of readability via surveys; review of technical infrastructure to recruit participants and host surveys
  • Pilot survey (2 months): running a pilot to test our setup; evaluating possible improvements
  • Full survey (3 months): implementing improvements; running the survey in different languages; data analysis of results

Policy, Ethics and Human Subjects Research


We describe how we collect, use, share, store, and delete the information we receive from participants in the survey privacy statement.

Some additional details: We recruit participants for the survey via Prolific where participants are paid a standard hourly rate according to the platform. Participants take part in the survey with consent acknowledging the survey privacy statement. The survey does not contain any deceptive research design. The survey does not collect any "special category data" (Art. 9 GDPR).



Pilot survey


In order to test our setup, we first conduct a small pilot survey in a single language (English).

Summary of the setup:

  • We assess the readability of snippets from 21 articles in English Wikipedia. A snippet contains only the plain text of the first 5 sentences of each article.
  • We ask participants to rate pairs of snippets, i.e. to indicate which of the two snippets is easier to read. From a large set of pairwise ratings across all snippets, we infer an absolute ranking of readability scores for all snippets using the Bradley-Terry model.
  • The survey is hosted on WMF's instance of LimeSurvey. Survey participants are recruited via Prolific.
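
To illustrate how pairwise ratings can be turned into a ranking, here is a minimal sketch of fitting a Bradley-Terry model with the standard iterative (minorization-maximization) update. It is not the analysis code used in the pilot: the function name, the snippet identifiers, and the example ratings are hypothetical. The sketch assumes every snippet wins at least one comparison; otherwise its estimated score collapses to zero.

```python
from collections import defaultdict


def bradley_terry(comparisons, n_iter=500, tol=1e-10):
    """Estimate Bradley-Terry scores from (winner, loser) pairs.

    Here "winner" means the snippet rated as easier to read.
    Returns a dict mapping each snippet to a score; scores sum to 1.
    """
    items = sorted({x for pair in comparisons for x in pair})
    wins = defaultdict(float)    # total wins per snippet
    n_comp = defaultdict(float)  # number of comparisons per unordered pair
    for winner, loser in comparisons:
        wins[winner] += 1.0
        n_comp[frozenset((winner, loser))] += 1.0

    scores = {i: 1.0 / len(items) for i in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            denom = sum(
                n_comp.get(frozenset((i, j)), 0.0) / (scores[i] + scores[j])
                for j in items
                if j != i
            )
            # Keep the old score if the snippet was never compared.
            updated[i] = wins[i] / denom if denom > 0 else scores[i]
        total = sum(updated.values())
        updated = {i: v / total for i, v in updated.items()}
        change = max(abs(updated[i] - scores[i]) for i in items)
        scores = updated
        if change < tol:
            break
    return scores


# Hypothetical ratings: each tuple is (easier snippet, harder snippet).
example_ratings = (
    [("A", "B")] * 3 + [("B", "A")]
    + [("B", "C")] * 3 + [("C", "B")]
    + [("A", "C")] * 3 + [("C", "A")]
)
example_scores = bradley_terry(example_ratings)
ranking = sorted(example_scores, key=example_scores.get, reverse=True)
print(ranking)
```

A higher score means the snippet was more often judged easier to read than the snippets it was compared with. The estimate is only well defined when the comparison graph is connected, which is one reason a large set of ratings is needed.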

Summary of the result:

  • We find that this setup is not a reliable way to assess perception of readability of Wikipedia articles.
  • Before expanding the pilot, we should revise the conceptual approach to get ratings from readers on the readability of articles.

More details on the subpage: Research:Understanding perception of readability in Wikipedia/Pilot survey

Pilot survey: version 2


After assessing potential confounds in the previous pilot, we changed the following aspects of the survey:

  • We showed pairs of snippets that came from the same article at different readability levels, i.e., one snippet from Simple Wikipedia (easy) and the other from English Wikipedia (difficult). We assume that one drawback of the previous pilot was that the articles were unpaired, which led to subjective judgments of readability based on differences in topic. The order of the snippets within each pair was randomized to prevent ordering effects.
  • To explicitly surface ambiguous cases, we added a third rating option: “both are equally easy to read”.
  • We added an optional free-text question at the end of the survey for the participants to describe what type of strategies they rely on when judging the ease of reading.

In this paired article setup, we can no longer use the Bradley-Terry model to infer absolute readability scores from the ratings of snippets. However, it is still possible to obtain a comparative assessment of readability by investigating how many times the participants pick the Simple Wikipedia article. We can also calculate agreement among participants as well as assess if their collective rating correlates with automated readability scores, such as Flesch Reading Ease (FRE).
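
As a rough illustration of the two quantities mentioned above, the following sketch computes a Flesch Reading Ease score and the share of ratings that pick the Simple Wikipedia snippet. The FRE formula is the standard one (206.835 − 1.015 × words per sentence − 84.6 × syllables per word), but the syllable counter is a crude vowel-group heuristic chosen for brevity, and the rating labels "simple", "english", and "equal" are hypothetical names, not the coding used in the survey.

```python
import re
from collections import Counter


def count_syllables(word):
    # Crude assumption: one syllable per group of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Standard FRE formula; higher values indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


def share_choosing_simple(ratings):
    """Fraction of ratings that picked the Simple Wikipedia snippet."""
    return Counter(ratings)["simple"] / len(ratings)


# Hypothetical ratings for one pair of snippets.
pair_ratings = ["simple", "simple", "english", "equal"]
print(share_choosing_simple(pair_ratings))
print(flesch_reading_ease("The cat sat. The dog ran."))
```

For a pair of snippets, the difference between the two FRE scores can then be compared with the share of participants who picked the Simple Wikipedia version.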

We selected 40 pairs of snippets taken from the same article in two different versions (Simple Wikipedia and English Wikipedia). We recruited 15 participants and showed each of them 10 randomly selected pairs. We aimed to obtain at least three ratings per pair.
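
With these numbers, 15 participants × 10 pairs gives 150 ratings spread over 40 pairs, i.e. 3.75 ratings per pair on average, so purely random selection does not by itself guarantee three ratings for every pair. The sketch below simulates one such random assignment to check the coverage. How the selection was actually implemented in the survey tool is not described here, so the sampling scheme, function names, and seed are assumptions.

```python
import random
from collections import Counter


def assign_pairs(n_participants=15, n_pairs=40, per_participant=10, seed=0):
    """Give each participant a random set of distinct pair indices."""
    rng = random.Random(seed)
    return [
        rng.sample(range(n_pairs), per_participant)
        for _ in range(n_participants)
    ]


def ratings_per_pair(assignments, n_pairs=40):
    """Count how many participants were shown each pair."""
    counts = Counter(p for shown in assignments for p in shown)
    return [counts.get(p, 0) for p in range(n_pairs)]


assignments = assign_pairs()
counts = ratings_per_pair(assignments)
print("total ratings:", sum(counts))
print("mean ratings per pair:", sum(counts) / len(counts))
print("pairs with fewer than three ratings:", sum(c < 3 for c in counts))
```

If too many pairs fall below the target, one option is to assign pairs in a balanced way (e.g. cycling through a shuffled list of pairs) instead of sampling independently for each participant.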

Summary of results:

  • Interrater agreement is only slightly higher than in the first pilot and still indicates little to weak agreement.
  • We find no correlation between agreement and the difference in FRE scores of the two snippets in a pair.
  • Qualitatively analysing the free-text question about strategies for assessing readability, we find that raters have variable preferences (some prefer shorter sentences while others do not).
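
The agreement statistic used for these results is not specified on this page. As one possible way to quantify agreement when each pair is rated by a different subset of participants, here is a minimal sketch of Krippendorff's alpha for nominal ratings. The function name and the rating labels are hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list with one list of labels per rated item (here: per
    pair of snippets). Items with fewer than two ratings are ignored.
    Returns NaN if fewer than two distinct labels occur overall.
    """
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        # Every ordered pair of ratings within an item counts 1 / (m - 1).
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _), value in coincidences.items():
        totals[a] += value
    n = sum(totals.values())

    observed = sum(v for (a, b), v in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    )
    if n <= 1 or expected == 0:
        return float("nan")
    return 1.0 - observed * (n - 1) / expected


# Hypothetical ratings: one inner list per pair of snippets.
example_units = [
    ["simple", "simple", "simple"],
    ["simple", "english", "equal"],
    ["english", "english", "simple", "simple"],
]
print(krippendorff_alpha_nominal(example_units))
```

A value of 1 indicates perfect agreement, values near 0 indicate agreement no better than chance, and negative values indicate systematic disagreement.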

In conclusion, our preliminary results from Pilot 2 indicate that even while controlling for the topic of the articles, participants do not come to a consensus on which version seems easier to read. What is ‘easy to read’ seems quite subjective and may depend on other factors that we have yet to control for, such as the characteristics of the raters (education level), of the content (short vs. long sentences, complex words, punctuation), and their combination (raters' familiarity with the topic of the article).

Cognitive Pre-testing


In order to better understand the low interrater agreement among participants and rule out factors stemming from the participant recruitment platform (Prolific), we conducted cognitive pretesting via interviews with three volunteers. The interviews followed a think-aloud protocol to understand how participants approach the task of rating the readability of articles.

Summary of results:

  • We find that the interviewees, like the Prolific participants, also struggle to pick the seemingly simpler snippet, i.e., the one from Simple Wikipedia, as the easier one.
  • The interviewees indicate difficulty in judging pairs where the snippets do not fully align in terms of content, even when they are analogous in topic.
  • The task is detached from a concrete application, which leads to arbitrariness.

Therefore, we rule out issues due to recruiting survey participants from Prolific, e.g., that they are not attentive or fluent enough. Instead, the interviews indicate that the construct of readability, and the instructions to judge it, need to be better defined. The interviews also point to further confounders in our easy-hard paired setup, i.e., content mismatches that preclude a direct comparison.

Pilot survey: version 3






  1. Alva-Manchego, F., Scarton, C., & Specia, L. (2021). The (Un)suitability of automatic evaluation metrics for Text Simplification. Computational Linguistics (Association for Computational Linguistics), 47(4), 861–889.