Research:Understanding perception of readability in Wikipedia
In this project, we aim to better understand readers' perception of the readability of Wikipedia articles. To do this, we plan to conduct surveys asking participants to rate the readability of articles. This will help us evaluate the validity of a recently developed language-agnostic model that generates automated readability scores.
As part of our research program to Address Knowledge Gaps, we have been working to develop a multilingual model to assess the readability of Wikipedia articles (Research:Multilingual Readability Research). We have successfully tested a language-agnostic model which yields an automated readability score across languages. The model was evaluated on a set of articles annotated with different readability levels: specifically, we tested whether it can distinguish an article from Wikipedia (e.g. English Wikipedia) from its simpler counterpart (the corresponding article in Simple English Wikipedia or Vikidia).
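At its core, this evaluation reduces to checking, for each matched article pair, whether the model scores the simpler version as more readable. Below is a minimal sketch in Python, assuming a hypothetical `score` function standing in for the language-agnostic model (higher = easier to read); details of the actual evaluation are on the Multilingual Readability Research page.

```python
def pairwise_accuracy(pairs, score):
    """Fraction of matched article pairs where the simpler version
    receives the higher readability score.

    pairs -- list of (standard_text, simple_text) tuples, e.g. an English
             Wikipedia article and its Simple English Wikipedia counterpart
    score -- hypothetical readability model: text -> float, higher = easier
    """
    correct = sum(score(simple) > score(standard) for standard, simple in pairs)
    return correct / len(pairs)
```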
However, this approach has (at least) two main limitations. First, ground-truth data, i.e. articles annotated with readability levels, is not readily available for most Wikipedia language editions. Second, automated readability scores do not necessarily capture how readers actually perceive the readability of articles.
Therefore, we want to better understand how readers perceive the readability of Wikipedia articles. We aim to conduct surveys to obtain quantitative measures of perceived readability. This is a crucial step in evaluating whether our automated readability scores match readers' perceived readability. In addition, our methodology provides a framework to evaluate readability in languages for which we currently lack ground-truth datasets of articles annotated with readability levels. Overall, we hope this study will improve our confidence in the validity of automated multilingual measures of readability.
- Background research (1 month): literature review on measuring perception of readability via surveys; review of technical infrastructure for recruiting participants and hosting surveys.
- Pilot survey (2 months): running a pilot to test our setup; evaluating it for improvements.
- Full survey (3 months): implementing improvements; running the survey in different languages; analyzing the results.
Policy, Ethics and Human Subjects Research
We describe how we collect, use, share, store, and delete the information we receive from participants in the survey privacy statement.
Some additional details: we recruit participants for the survey via Prolific, where they are paid a standard hourly rate set by the platform. Participants provide consent acknowledging the survey privacy statement before taking part. The survey does not use any deceptive research design and does not collect any "special category data" (Art. 9 GDPR).
Pilot survey
In order to test our setup, we first conduct a small pilot survey in a single language (English).
Summary of the setup:
- We assess the readability of snippets from 21 articles in English Wikipedia. A snippet contains only the plain text of the first 5 sentences of an article (see the extraction sketch after this list).
- We ask participants to compare pairs of snippets and judge which of the two is easier to read. From a large set of pairwise judgments among all snippets, we infer an absolute ranking of readability scores using the Bradley-Terry model (see the fitting sketch after this list).
- The survey is hosted on WMF's instance of LimeSurvey. Participants are recruited via Prolific.
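One way to produce such snippets is the MediaWiki TextExtracts API, which can return the plain text of the first few sentences of an article. The following is a minimal sketch, assuming the public English Wikipedia API endpoint; our actual extraction pipeline may differ.

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def fetch_snippet(title: str, sentences: int = 5) -> str:
    """Return the plain text of the first `sentences` sentences of an article."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "explaintext": 1,          # strip wiki markup, return plain text
        "exsentences": sentences,  # TextExtracts supports 1-10 sentences
        "titles": title,
    }
    pages = requests.get(API, params=params).json()["query"]["pages"]
    # The response is keyed by page id; we requested a single title.
    return next(iter(pages.values()))["extract"]

print(fetch_snippet("Readability"))
```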
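For the inference step, Bradley-Terry strengths can be fit with the classic MM (Zermelo) iteration from a matrix of pairwise "easier to read" counts. Below is a minimal sketch, assuming every snippet wins at least one comparison; in practice one might add regularization or use an off-the-shelf package such as choix. It illustrates the general technique, not our exact analysis code.

```python
import numpy as np

def bradley_terry(wins: np.ndarray, n_iter: int = 200, tol: float = 1e-8) -> np.ndarray:
    """Fit Bradley-Terry strengths from pairwise win counts.

    wins[i, j] = number of times snippet i was judged easier to read
    than snippet j. Returns strengths p (summing to 1); higher = easier.
    """
    n = wins.shape[0]
    comparisons = wins + wins.T              # n_ij: total comparisons of i vs j
    p = np.ones(n) / n
    for _ in range(n_iter):
        # MM update: p_i <- (total wins of i) / sum_{j != i} n_ij / (p_i + p_j)
        denom = comparisons / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)
        p_new = wins.sum(axis=1) / denom.sum(axis=1)
        p_new /= p_new.sum()
        if np.abs(p_new - p).max() < tol:
            return p_new
        p = p_new
    return p
```

The fitted strength vector induces the absolute ranking: `np.argsort(-p)` orders the snippets from easiest to hardest to read.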
Summary of the result:
- We find that this setup is not a reliable way to assess the perceived readability of Wikipedia articles.
- Before expanding the pilot, we should revise the conceptual approach to get ratings from readers on the readability of articles.
More details on the subpage: Research:Understanding perception of readability in Wikipedia/Pilot survey
Pilot survey: version 2
We are improving the first version of the pilot survey. Work in progress.