Research:Multilingual Readability Research
As part of the program to address knowledge gaps, the Research team at the Wikimedia Foundation has started to develop a taxonomy of knowledge gaps. The next step consists of identifying metrics to quantify the size of these gaps. For some of the gaps in the content dimension we have readily available metrics (especially around representation gaps such as gender, geography, etc.). However, for some of the gaps we are still lacking metrics. This project focuses on identifying possible metrics for the readability gap in Wikimedia projects (with a focus on supporting multiple languages).
Roughly, readability aims to capture how hard it is for a reader to understand a written text. While there are off-the-shelf readability scores for English (and other languages), it is not clear how these approaches can be used to assess readability across the more than 300 language versions of Wikipedia.
The aim of this project is to assess whether it is feasible to automatically evaluate readability of Wikipedia articles across the many languages covered in Wikimedia projects.
- Conduct background research on existing approaches to measuring readability (see Research:Multilingual_Readability_Research/Background_Research)
- Identify candidate approaches that support multiple languages. Conduct exploratory research with corresponding models and datasets
- Specify the task, implement the models, and evaluate performance
- Make a decision whether an automatic evaluation of readability across projects is feasible.
Identifying a set of candidate approaches
In Research:Multilingual_Readability_Research/Background_Research I tried to get an overview of different approaches to measuring readability with a focus on multilingual approaches. The following two approaches emerge as the most promising candidates:
- Language-dependent approach using pre-trained multilingual language models. Several studies have shown that standard language models such as BERT can be used to derive features from sentences or texts that capture readability as well as or better than hand-crafted linguistic features. These models do not cover every language, but they support on the order of 100 languages.
- Language-agnostic approach using an entity linker. This approach yields a language-agnostic representation of text as a sequence of entities (instead of words, syllables, etc.). From this we can derive shallow features (e.g. average number of entities per sentence) without language-specific parsing; this has been shown to capture some aspects of readability. The approach relies on the availability of an entity linker. One promising candidate is dbpedia-spotlight, which is open, exists for several languages, and can be expanded to new languages.
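To make the language-agnostic idea more concrete, here is a minimal sketch of how shallow features could be derived once a text has been reduced to entities. It assumes the entity linker's output has already been grouped into one list of entity identifiers per sentence. The function name `entity_features`, the identifiers `E1`, `E2`, ... and the `distinct_entity_ratio` feature are illustrative assumptions, not part of the project; only the average number of entities per sentence is mentioned above.

```python
from statistics import mean


def entity_features(sentences):
    """Compute shallow, language-agnostic features from linked entities.

    `sentences` is a list of lists: one inner list of entity identifiers
    per sentence (hypothetical, already-parsed entity-linker output).
    """
    if not sentences:
        raise ValueError("need at least one sentence")
    counts = [len(sentence) for sentence in sentences]
    all_entities = [entity for sentence in sentences for entity in sentence]
    return {
        "n_sentences": len(sentences),
        # Feature named in the text: average number of entities per sentence.
        "avg_entities_per_sentence": mean(counts),
        # Illustrative extra feature (an assumption): share of distinct entities.
        "distinct_entity_ratio": (
            len(set(all_entities)) / len(all_entities) if all_entities else 0.0
        ),
    }


# Made-up example: three sentences with placeholder entity identifiers.
doc = [["E1", "E2"], ["E2"], ["E3", "E4", "E1"]]
features = entity_features(doc)
print(features)
```

Nothing in this computation depends on the language of the original text, since it only sees entity identifiers. Splitting the text into sentences and calling the linker are left out of the sketch.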
We can apply these two models to different datasets and evaluate them as a classification task:
- Simple vs English Wikipedia corpus. Texts with 2 reading levels (simple, normal). While this only covers English texts, the main advantage is that it provides on the order of 65k articles.
- Vikidia vs Wikipedia corpus. Texts with 2 reading levels (simple, normal). While it is a much smaller corpus, it contains texts in several languages.
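As a rough illustration of the two-level classification setup, the sketch below measures how well a single readability score separates "simple" from "normal" texts. It assumes that a higher score means a harder text, and it uses the pairwise AUC (the fraction of simple/normal pairs ranked correctly) as the measure. Both the metric and the name `pairwise_auc` are assumptions made for this example, not the evaluation protocol of the project, and the scores are made up.

```python
def pairwise_auc(simple_scores, normal_scores):
    """Fraction of (simple, normal) pairs where the normal text scores higher.

    Ties count as half a correct pair. A value of 0.5 means the score does
    not separate the two reading levels; 1.0 means perfect separation.
    """
    if not simple_scores or not normal_scores:
        raise ValueError("both groups must be non-empty")
    correct = 0.0
    for s in simple_scores:
        for n in normal_scores:
            if n > s:
                correct += 1.0
            elif n == s:
                correct += 0.5
    return correct / (len(simple_scores) * len(normal_scores))


# Made-up scores for three simple and three normal articles.
simple = [1.0, 1.5, 2.0]
normal = [1.8, 2.5, 3.0]
auc = pairwise_auc(simple, normal)
print(round(auc, 3))
```

The pairwise loop is quadratic in the number of texts, which is fine for a sketch. For a corpus on the order of 65k articles, a rank-based computation would be the more practical choice.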