Research:Understanding perception of readability in Wikipedia

Tracked in Phabricator:
Task T325815
13:30, 22 December 2022 (UTC)
Indira Sen
Mareike Wieland
Katrin Weller
Duration:  2023-January – ??

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

In this project, we aim to better understand readers’ perception of readability of articles in Wikipedia. For this, we plan to conduct surveys asking participants to rate the readability of articles. This will help us evaluate the validity of a recently developed language-agnostic model to generate automated scores for readability.

Background

As part of our research program to Address Knowledge Gaps, we have been working to develop a multilingual model to assess the readability of Wikipedia articles (Research:Multilingual Readability Research). We have successfully tested a language-agnostic model which yields an automated readability score across languages. We evaluated the model on a set of articles annotated with different levels of readability, testing whether it can distinguish an article from Wikipedia (e.g. English Wikipedia) from its “simpler” counterpart (the corresponding article from Simple English Wikipedia or Vikidia).

However, this approach has (at least) two main limitations. First, ground-truth data of articles with annotations of their readability levels is not easily available for most languages in Wikipedia. Second, automated scores of readability do not necessarily capture the way readers perceive readability of articles[1].

Therefore, we want to better understand how readers perceive the readability of Wikipedia articles. We aim to conduct surveys to obtain quantitative measures of perceived readability. This is a crucial step in evaluating whether our automated readability scores match readability as perceived by readers. In addition, our methodology provides a framework to evaluate readability in languages for which we currently do not have ground-truth datasets of articles annotated with their readability level. Overall, we hope this study will improve our confidence in the validity of automated multilingual measures of readability.

Methods

Timeline

Planned timeline

  • Background research (1 month): literature review on how to measure perception of readability via surveys; review of technical infrastructure to recruit participants and host surveys
  • Pilot survey (2 months): running a pilot to test our setup; evaluating it for improvements
  • Full survey (3 months): implementing improvements; running the survey in different languages; data analysis of results

Policy, Ethics and Human Subjects Research

We describe how we collect, use, share, store, and delete the information we receive from participants in the survey privacy statement.

Some additional details: We recruit participants for the survey via Prolific, where participants are paid a standard hourly rate according to the platform. Participants give their consent and acknowledge the survey privacy statement before taking part. The survey does not use any deceptive research design. The survey does not collect any "special category data" (Art. 9 GDPR).

Results

Pilot survey

In order to test our setup, we first conduct a small pilot survey in a single language (English).

Summary of the setup:

  • We assess the readability of snippets from 21 articles in English Wikipedia. Each snippet contains only the plain text of the first 5 sentences of an article.
  • We show participants pairs of snippets and ask which of the two is easier to read. From a large set of such pairwise ratings across all snippets, we infer a readability score for each snippet, and thus an overall ranking, using the Bradley-Terry model.
  • The survey is hosted on WMF's instance of LimeSurvey. Survey participants are recruited via Prolific.
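
To illustrate how pairwise ratings can be turned into scores, the following is a minimal sketch of fitting a Bradley-Terry model with the standard iterative (minorization-maximization) update. It is not the code used in the pilot. The function name bradley_terry, the snippet labels and the example ratings are hypothetical, and we assume each rating is recorded as a pair (easier snippet, harder snippet) with no ties.

```python
from collections import defaultdict


def bradley_terry(comparisons, n_iter=200, tol=1e-9):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Here "winner" is assumed to mean the snippet rated easier to read.
    Returns a dict mapping each item to a strength; strengths sum to 1.
    """
    wins = defaultdict(float)         # number of comparisons won per item
    pair_counts = defaultdict(float)  # number of comparisons per unordered pair
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1

    items = sorted({item for pair in comparisons for item in pair})
    strength = {i: 1.0 / len(items) for i in items}

    for _ in range(n_iter):
        updated = {}
        for i in items:
            # Sum over opponents that item i was actually compared with.
            denom = 0.0
            for j in items:
                n_ij = pair_counts.get(frozenset((i, j)), 0.0)
                if j != i and n_ij > 0:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        total = sum(updated.values())
        updated = {i: v / total for i, v in updated.items()}
        delta = max(abs(updated[i] - strength[i]) for i in items)
        strength = updated
        if delta < tol:
            break
    return strength


# Hypothetical ratings for three snippets A, B and C.
ratings = (
    [("A", "B")] * 3 + [("B", "A")]
    + [("B", "C")] * 3 + [("C", "B")]
    + [("A", "C")] * 3 + [("C", "A")]
)
strength = bradley_terry(ratings)
ranking = sorted(strength, key=strength.get, reverse=True)
print(ranking)
```

A higher strength means the snippet was more often judged easier to read. In this simple sketch a snippet that never wins a comparison ends up with strength 0, so a real analysis would likely need regularization or more ratings per pair.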

Summary of the result:

  • We find that this setup is not a reliable way to assess perception of readability of Wikipedia articles.
  • Before expanding the pilot, we should revise the conceptual approach to get ratings from readers on the readability of articles.

More details on the subpage: Research:Understanding perception of readability in Wikipedia/Pilot survey

Pilot survey: version 2

Improving the first version of the pilot survey. Work in progress.

Resources

References

  1. Alva-Manchego, F., Scarton, C., & Specia, L. (2021). The (Un)suitability of automatic evaluation metrics for Text Simplification. Computational Linguistics (Association for Computational Linguistics), 47(4), 861–889.