Research:Understanding perception of readability in Wikipedia

Tracked in Phabricator: Task T325815
Created: 13:30, 22 December 2022 (UTC)
Collaborators: Indira Sen, Mareike Wieland, Katrin Weller
Duration: January 2023 – ??

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


In this project, we aim to better understand readers’ perception of the readability of Wikipedia articles. To this end, we plan to conduct surveys asking participants to rate the readability of articles. The results will help us evaluate the validity of a recently developed language-agnostic model that generates automated readability scores.

Background

As part of our research program to Address Knowledge Gaps, we have been working to develop a multilingual model to assess the readability of Wikipedia articles (Research:Multilingual Readability Research). We have successfully tested a language-agnostic model which yields an automated readability score across languages. The model was evaluated on a set of articles annotated with different readability levels, i.e. by testing whether it can distinguish between an article from a given Wikipedia (e.g. English Wikipedia) and its “simpler” counterpart (the corresponding article from Simple English Wikipedia or Vikidia).

However, this approach has (at least) two main limitations. First, ground-truth data, i.e. articles annotated with readability levels, is not readily available for most languages in Wikipedia. Second, automated readability scores do not necessarily capture the way readers perceive the readability of articles.[1]

Therefore, we want to better understand how readers perceive the readability of Wikipedia articles. We aim to conduct surveys to obtain quantitative measures of perceived readability. This is a crucial step in evaluating whether our automated readability scores match readers’ perceived readability. In addition, our methodology provides a framework to evaluate readability in languages for which we currently lack ground-truth datasets of articles with annotated readability levels. Overall, this study will hopefully improve our confidence in the validity of automated multilingual measures of readability.

Methods

Timeline

Planned timeline

  • Background research (1 month): literature review on how to measure perceived readability via surveys; review of technical infrastructure to recruit participants and host surveys.
  • Pilot survey (2 months): running a pilot to test our setup; evaluating it to identify improvements.
  • Full survey (3 months): implementing improvements; running the survey in different languages; analysing the results.

Policy, Ethics and Human Subjects Research

We describe how we collect, use, share, store, and delete the information we receive from participants in the survey privacy statement.

Some additional details: We recruit participants for the survey via Prolific, where participants are paid the platform's standard hourly rate. Participants give consent acknowledging the survey privacy statement before taking part. The survey does not contain any deceptive research design, and it does not collect any "special category data" (Art. 9 GDPR).

Results

Pilot survey

In order to test our setup, we first conduct a small pilot survey in a single language (English).

Summary of the setup:

  • We assess the readability of snippets from 21 articles in English Wikipedia. A snippet contains only the plain text of the first 5 sentences of each article.
  • We ask participants to rate pairs of snippets, i.e. to indicate which of the two snippets is easier to read. From a large set of such pairwise ratings across all snippets, we infer an absolute ranking of readability scores for all snippets using the Bradley-Terry model (see the sketch after this list).
  • The survey is hosted on WMF's instance of LimeSurvey. Participants are recruited via Prolific.
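
For illustration, the ranking step can be sketched as follows. This is a minimal example, not the pilot's actual analysis code: it assumes the pairwise judgments are available as (winner, loser) tuples of snippet IDs and fits the Bradley-Terry model with the standard minorization-maximization updates.

```python
# A minimal Bradley-Terry fit using minorization-maximization updates.
# Hypothetical input: `judgments` is a list of (winner_id, loser_id) tuples,
# one per pairwise rating collected in the survey.
from collections import defaultdict

def bradley_terry(judgments, n_iter=1000, tol=1e-8):
    items = sorted({i for pair in judgments for i in pair})
    wins = defaultdict(int)        # wins[i]: number of comparisons snippet i won
    n_compared = defaultdict(int)  # n_compared[(i, j)]: times i and j were compared
    for winner, loser in judgments:
        wins[winner] += 1
        n_compared[tuple(sorted((winner, loser)))] += 1

    scores = {i: 1.0 for i in items}
    for _ in range(n_iter):
        new_scores = {}
        for i in items:
            denom = sum(
                n_compared[tuple(sorted((i, j)))] / (scores[i] + scores[j])
                for j in items if j != i
            )
            new_scores[i] = wins[i] / denom if denom > 0 else scores[i]
        total = sum(new_scores.values())
        new_scores = {i: s / total for i, s in new_scores.items()}  # scale is arbitrary
        if max(abs(new_scores[i] - scores[i]) for i in items) < tol:
            return new_scores
        scores = new_scores
    return scores  # higher score = judged easier to read more often

# Toy example: snippet 0 beats 1 twice, 1 beats 2 once, 0 beats 2 once.
print(bradley_terry([(0, 1), (0, 1), (1, 2), (0, 2)]))
```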

Summary of the result:

  • We find that this setup is not a reliable way to assess perception of readability of Wikipedia articles.
  • Before expanding the pilot, we should revise the conceptual approach for collecting readability ratings from readers.

More details on the subpage: Research:Understanding perception of readability in Wikipedia/Pilot survey

Pilot survey: version 2

edit

After assessing potential confounds in the previous pilot, we changed the following aspects of the survey:

  • We showed pairs of snippets taken from the same article at different readability levels, i.e. pairs where one snippet was from Simple English Wikipedia (easy) and the other from English Wikipedia (difficult). We assume that one drawback of the previous pilot was that the articles were unpaired, which led to subjective judgments of readability driven by differences in topic. The order of the snippets in each pair was randomized to prevent ordering effects.
  • To explicitly surface ambiguous cases, we also added a third rating option: “both are equally easy to read”.
  • We added an optional free-text question at the end of the survey asking participants to describe what strategies they rely on when judging ease of reading.

In this paired-article setup, we can no longer use the Bradley-Terry model to infer absolute readability scores from the snippet ratings. However, it is still possible to obtain a comparative assessment of readability by counting how often participants pick the Simple Wikipedia snippet. We can also calculate agreement among participants and assess whether their collective rating correlates with automated readability scores such as Flesch Reading Ease (FRE); a sketch of this analysis follows below.
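
As an illustration of this comparative analysis, the sketch below computes the share of raters picking the Simple Wikipedia snippet, a simple per-pair agreement measure, and the Spearman correlation with precomputed FRE differences. The variable names (`ratings`, `fre_diff`) and the data layout are hypothetical; they do not describe the pilot's actual pipeline.

```python
# Hypothetical data layout: `ratings[pair_id]` is the list of choices collected for
# that pair, each one of "simple", "english", or "equal"; `fre_diff[pair_id]` is the
# precomputed Flesch Reading Ease difference (Simple minus English) for that pair.
from collections import Counter
from scipy.stats import spearmanr

def summarize(ratings, fre_diff):
    pair_ids = sorted(ratings)
    simple_share = []  # fraction of raters picking the Simple Wikipedia snippet
    agreement = []     # fraction of raters agreeing with the modal choice per pair
    for pid in pair_ids:
        counts = Counter(ratings[pid])
        n = len(ratings[pid])
        simple_share.append(counts["simple"] / n)
        agreement.append(counts.most_common(1)[0][1] / n)

    overall_simple = sum(simple_share) / len(simple_share)
    mean_agreement = sum(agreement) / len(agreement)
    # Does the collective rating track the automated score difference?
    rho, p_value = spearmanr(simple_share, [fre_diff[pid] for pid in pair_ids])
    return overall_simple, mean_agreement, rho, p_value
```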

We selected 40 pairs of snippets taken from the same article in two different versions (Simple Wikipedia and English Wikipedia). We recruited 15 participants and showed each of them 10 randomly selected pairs, aiming to obtain at least three ratings per pair (15 participants × 10 pairs = 150 ratings over 40 pairs, i.e. 3.75 ratings per pair on average); one possible assignment scheme is sketched below.
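
For illustration, one way to implement such an assignment so that every pair reaches at least three ratings is to give each new participant the pairs that have received the fewest ratings so far. This is only a sketch under that assumption; the randomization actually used via LimeSurvey/Prolific is not documented on this page.

```python
# Greedy assignment: each new participant gets the pairs with the fewest ratings so
# far (ties broken randomly). With 15 participants x 10 pairs = 150 ratings over
# 40 pairs, every pair ends up with 3 or 4 ratings.
import random

def assign_pairs(n_pairs=40, n_participants=15, pairs_per_participant=10, seed=0):
    rng = random.Random(seed)
    counts = {p: 0 for p in range(n_pairs)}
    assignments = []
    for _ in range(n_participants):
        order = sorted(counts, key=lambda p: (counts[p], rng.random()))
        chosen = order[:pairs_per_participant]
        for p in chosen:
            counts[p] += 1
        assignments.append(chosen)
    return assignments, counts

assignments, counts = assign_pairs()
assert min(counts.values()) >= 3  # the target of at least three ratings per pair
```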

Summary of results:

  • Inter-rater agreement is only slightly higher than in the first pilot and still indicates little agreement.
  • We find no correlation between agreement and the difference in FRE scores within a pair of articles.
  • Qualitatively analysing the free-text answers about strategies for assessing readability, we find that raters have variable preferences (some prefer shorter sentences while others do not).

In conclusion, our preliminary results from Pilot 2 indicate that even while controlling for the topic of the articles, participants do not come to a consensus on which version seems easier to read. What is ‘easy to read’ seems quite subjective and may depend on other factors that we have yet to control for, such as the characteristics of the raters (education level), of the content (short vs. long sentences, complex words, punctuation), and their combination (raters' familiarity with the topic of the article).

Cognitive Pre-testing

In order to better understand the low inter-rater agreement and to rule out factors stemming from the participant recruitment platform (Prolific), we conducted cognitive pretesting via interviews with three volunteers. The interviews followed a think-aloud protocol to understand how participants approach the task of rating the readability of articles.

Summary of results:

  • We find that the interviewees, like the Prolific participants, also struggle to consistently pick the snippet from Simple Wikipedia as the easier one.
  • The interviewees report difficulty in judging pairs whose snippets do not fully align in content, even when they cover the same topic.
  • The task is detached from a concrete application, which leads to arbitrary judgments.

We therefore rule out issues caused by recruiting survey participants via Prolific, e.g. that they are not attentive or fluent enough. Instead, the interviews indicate that the construct of readability, and the instructions for judging it, need to be better defined. They also point to further confounders in our easy-hard paired setup, namely content mismatches that preclude a direct comparison.

Pilot survey: version 3

Based on the findings from Pilots 1 and 2, we attempt to gain better control over the snippets shown to participants, gather more background details about the participants, and quantify uncertainty better. Therefore, we implement the following changes:

  • Group English-Simple Wikipedia pairs into **treatment** and **control** groups, where the difference in quantitative readability (measured with Flesch Reading Ease) between the snippets is small for control (between -5 and 5) and large for treatment (greater than 30); see the sketch after this list. This allows us to better quantify the effect of the readability difference. In a survey, participants are shown 5 control pairs and 5 treatment pairs (ordered randomly).
  • Run two versions of the same pilot, but with reversed ordering of the two snippets within each pair. This allows us to tease apart ordering effects such as recency bias.
  • Instead of the three options we had in pilot 2 ("snippet 1 is easier to read", "snippet 2 is easier to read", and "both are equally easy to read"), we now use a four-point scale and rephrase the options to be closer to the mission statement of Simple Wikipedia (https://simple.wikipedia.org/wiki/Simple_English_Wikipedia):
     - "Snippet 1 is clearly easier to understand"
     - "Snippet 1 is slightly easier to understand"  
     - "Snippet 2 is slightly easier to understand"   
     - "Snippet 2 is clearly easier to understand"   

By distinguishing between 'slightly' and 'clearly' options, we allow the raters to better signal their certainty.

  • To better unpack the effect of education we saw in the previous pilot, we include more questions in the survey related to the rater's cognitive style, information processing, and need for cognition. The survey questions were taken from the following sources.[2][3]
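
For illustration, the treatment/control split described in the first item above could be computed as in the sketch below, which uses the textstat package to score Flesch Reading Ease. The thresholds follow the description in the list, but the function and data layout are hypothetical rather than the project's actual pipeline.

```python
# Hypothetical sketch: score each snippet with the `textstat` package and split
# pairs into control (small FRE difference) and treatment (large FRE difference).
import textstat

def split_pairs(pairs, control_band=5, treatment_threshold=30):
    """`pairs` is assumed to be a list of (simple_snippet, english_snippet) text tuples."""
    control, treatment = [], []
    for simple_text, english_text in pairs:
        # Positive delta means the Simple English Wikipedia snippet scores as easier.
        delta = (textstat.flesch_reading_ease(simple_text)
                 - textstat.flesch_reading_ease(english_text))
        if abs(delta) < control_band:
            control.append((simple_text, english_text, delta))
        elif delta > treatment_threshold:
            treatment.append((simple_text, english_text, delta))
        # Pairs with intermediate differences are not used in this sketch.
    return control, treatment
```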

After making these changes to the survey and running two versions of it (identical in all ways except the ordering of the snippets within a pair), we obtained responses from 33 raters, each rating 10 snippet pairs.

Resources

References

  1. Alva-Manchego, F., Scarton, C., & Specia, L. (2021). The (un)suitability of automatic evaluation metrics for text simplification. Computational Linguistics, 47(4), 861–889. https://doi.org/10.1162/coli_a_00418
  2. Schemer, C., Matthes, J., & Wirth, W. (2008). Toward improving the validity and reliability of media information processing measures in surveys. Communication Methods and Measures, 2(3), 193–225. https://doi.org/10.1080/19312450802310474
  3. Beißert, H., et al. (2014). Eine deutschsprachige Kurzskala zur Messung des Konstrukts Need for Cognition: Die Need for Cognition Kurzskala (NfC-K) [A German-language short scale measuring the construct need for cognition: the Need for Cognition short scale (NfC-K)], 28. https://nbn-resolving.org/urn:nbn:de:0168-ssoar-403157