Machine learning models/Production/Multilingual readability model card

Model card
This page is an on-wiki machine learning model card.
	A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub
Model creator(s)	Mykola Trokhymovych and MGerlach_(WMF)
Model owner(s)	MGerlach_(WMF)
Code	training and inference
Uses PII	No
In production?	No
	This model uses article text to predict how hard it is for a reader to understand it.

This model generates scores to assess the readability of Wikipedia articles. The readability score is a rough proxy for how difficult it is for a reader to understand the text of the article.

Specifically, we propose a multilingual model based on the pre-trained xlm-roberta-longformer^[1]. It supports about 100 languages, not every language edition of Wikipedia.

We fine-tune the model using annotated data of articles available in different readability levels. One of the main challenges is that for most languages there is no ground-truth data on the reading level of articles, so fine-tuning or re-training in each language is not a scalable option. Therefore, we train the model only in English on a large corpus of Wikipedia articles with two readability levels (Simple English Wikipedia and English Wikipedia). We evaluate the model's performance on small annotated datasets available in a few languages using different children's encyclopedias (such as Vikidia).

Motivation

As part of the program to address knowledge gaps, the Research team at the Wikimedia Foundation has started to develop a taxonomy of knowledge gaps. One of the goals is to identify metrics to quantify the size of these gaps. This model attempts to provide a metric for the readability of articles in Wikimedia projects, with a specific focus on multilingual support.

While there are readily available formulas to calculate readability of articles (such as the Flesch-Kincaid score), these formulas are often developed for a specific language (most commonly English). Usually, these formulas cannot be applied out of the box to other languages. As a result, it is not clear how these approaches can be used to assess readability across the more than 300 language versions of Wikipedia.
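To make the language dependence concrete, here is a minimal sketch of the standard Flesch–Kincaid grade-level formula. The constants are the published ones, calibrated on English text, and counting syllables and sentences is itself language-specific, which is why the formula does not transfer directly. The function name and the counts in the example are illustrative and not part of this model.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level computed from raw counts of an English text."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    # Constants were calibrated on English; they are not valid for other languages.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Made-up counts for illustration: 100 words, 5 sentences, 150 syllables.
grade = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
print(round(grade, 2))
```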

You can find more details about the project here: Research:Multilingual Readability Research

Users and uses

Use this model for

Estimate the readability score of a Wikipedia article revision
Estimate a Flesch–Kincaid score for an article in a multilingual setup
Compare the readability of different revisions of the same article

Don't use this model for

Making predictions on language editions of Wikipedia that are not in the listed languages or other Wiki projects (Wiktionary, Wikinews, Wikidata, etc.)
Making predictions on pages outside namespace 0, on disambiguation pages, or on redirects

Current uses

This model is publicly available via LiftWing as of October 2023 but not currently incorporated in any products.

API endpoint on LiftWing: Get readability prediction
Functional user interface on toolforge: Wiki-readability

Current supported languages

['af', 'sq', 'am', 'ar', 'hy', 'as', 'az', 'eu', 'be', 'bn', 'bs', 'br', 'bg', 'my', 'ca', 'zh-yue', 'zh', 'zh-classical', 'hr', 'cs', 'da', 'nl', 'en', 'eo', 'et', 'tl', 'fi', 'fr', 'gl', 'ka', 'de', 'el', 'gu', 'ha', 'he', 'hi', 'hu', 'is', 'id', 'ga', 'it', 'ja', 'jv', 'kn', 'kk', 'km', 'ko', 'ku', 'ky', 'lo', 'la', 'lv', 'lt', 'mk', 'mg', 'ms', 'ml', 'mr', 'mn', 'ne', 'no', 'or', 'om', 'ps', 'fa', 'pl', 'pt', 'pa', 'ro', 'ru', 'sa', 'gd', 'sr', 'sd', 'si', 'sk', 'sl', 'so', 'es', 'su', 'sw', 'sv', 'ta', 'te', 'th', 'tr', 'uk', 'ur', 'ug', 'uz', 'vi', 'cy', 'fy', 'xh', 'yi', 'simple']

Ethical considerations, caveats, and recommendations

The model only uses publicly available data of the content (i.e. plain text) extracted from the articles.

Nevertheless, there are certain caveats:

Multilingual support: The model has only been trained on English data annotated with different readability levels. Our evaluation shows that the resulting model also works for other languages. However, performance varies across languages (see below). While this is a known issue for multilingual transformer models more generally^[2], in the context of readability we are unable to systematically evaluate the model for many supported languages due to the lack of ground-truth data. To address this issue, we have started a research project to manually evaluate the model based on readers' perception of readability through surveys (ongoing).

Model

The presented system is based on a fine-tuned language model, xlm-roberta-longformer^[3], trained with a ranking loss, together with a linear regression model^[4] that transforms the model's ranking score to the Flesch–Kincaid scoring scale. It is built on the paradigm of having one generalized model for all covered languages. The system includes the following steps:

1. Text features preparation:

Process wikitext and extract the revision text

2. Masked Language Model (MLM) output extraction:

Pass the text to the pre-trained ranking model (to extract ranking score)

3. Transform the ranking score to Flesch–Kincaid scale

Apply a linear transformation to the ranking score. The result is a predicted Flesch–Kincaid grade level, i.e. a U.S. grade level capturing roughly "the number of years of education generally required to understand this text", and it can be computed for other languages as well. The motivation is to provide a more interpretable score as an alternative to the ranking score obtained from the model.
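Step 3 can be sketched as a one-feature least-squares fit followed by a linear map. This is a minimal illustration in plain Python rather than the production code. The helper names and the calibration data are made up, and the sketch assumes that a higher ranking score corresponds to a higher grade level.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def to_fk_scale(ranking_score, slope, intercept):
    """Map a ranking score onto the Flesch-Kincaid grade scale."""
    return slope * ranking_score + intercept


# Hypothetical calibration data: ranking scores and known grade levels.
ranking_scores = [-1.0, 0.0, 1.0, 2.0]
fk_grades = [5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(ranking_scores, fk_grades)
fk_score_proxy = to_fk_scale(0.5, slope, intercept)
print(slope, intercept, fk_score_proxy)
```

Because the map is linear with a positive slope in this sketch, it preserves the ordering of the ranking scores and only changes the scale on which they are reported.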

Sketch of the model architecture consisting of two joint readability scoring models trained using a Margin Ranking Loss.

Performance

We evaluate the model using the Ranking Accuracy metric, which shows how well the model can differentiate the easy and hard text versions.

The testing data consist of pairs of texts that correspond to the simple (easy) and difficult (hard) versions of one article (for example, the same article from English Wikipedia and Simple English Wikipedia). Even though we train the model only on English texts, we evaluate performance in other languages.

Ranking Accuracy (RA) is the fraction of pairs that the model ranks correctly. We also report confidence intervals (CI).
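A minimal sketch of how Ranking Accuracy could be computed from paired scores. It assumes that the harder text should receive the higher score. The interval helper uses a normal approximation to the binomial, which is only one common choice: this page does not state how the reported intervals were computed. The function names and scores are illustrative.

```python
import math


def ranking_accuracy(pairs):
    """Fraction of (easy_score, hard_score) pairs where hard scores higher."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one pair is required")
    correct = sum(1 for easy, hard in pairs if hard > easy)
    return correct / len(pairs)


def normal_approx_half_width(p, n, z=1.96):
    """Half-width of an approximate 95% interval (assumed method, see lead-in)."""
    return z * math.sqrt(p * (1.0 - p) / n)


# Made-up scores: three pairs ranked correctly, one ranked incorrectly.
scores = [(0.1, 0.9), (0.5, 0.4), (0.2, 0.3), (1.0, 2.0)]
ra = ranking_accuracy(scores)
ci = normal_approx_half_width(ra, len(scores))
print(ra, ci)
```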

Model performance metric and confidence interval
Dataset	RA	±CI
simplewiki-en	0.976	0.002
vikidia-en	0.991	0.004
vikidia-ca	0.962	0.025
vikidia-de	0.938	0.03
vikidia-el	0.923	0.086
vikidia-es	0.911	0.013
vikidia-eu	0.818	0.032
vikidia-fr	0.923	0.005
vikidia-hy	0.802	0.036
vikidia-it	0.958	0.01
vikidia-oc	1.0	0.0
vikidia-pt	0.960	0.014
vikidia-ru	0.880	0.058
vikidia-scn	0.9	0.191
klexikon-de	0.999	0.002
txikipedia-eu	0.81	0.023
wikikids-nl	0.897	0.006

Implementation

Model architecture

xlm-roberta-longformer (readability scoring) model tuning:

Learning rate: 1e-5
Weight Decay: 1e-7
Epochs: 3
Loss: Margin Ranking Loss
Margin: 0.5
Maximum input length: 1500 tokens
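The settings above can be collected in a small configuration object, and the loss can be written out for a single pair. The sketch below uses the usual definition of a margin ranking loss, max(0, margin - (s_hard - s_easy)), which matches PyTorch's MarginRankingLoss when the target is 1. It assumes the hard text should score higher; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as listed on this page; field names are illustrative."""
    learning_rate: float = 1e-5
    weight_decay: float = 1e-7
    epochs: int = 3
    margin: float = 0.5
    max_input_tokens: int = 1500


def margin_ranking_loss(score_hard, score_easy, margin):
    """Zero once the hard text outscores the easy text by at least the margin."""
    return max(0.0, margin - (score_hard - score_easy))


config = FineTuneConfig()
well_separated = margin_ranking_loss(2.0, 1.0, config.margin)  # gap exceeds margin
too_close = margin_ranking_loss(1.2, 1.0, config.margin)       # right order, small gap
wrong_order = margin_ranking_loss(1.0, 2.0, config.margin)     # easy text scored higher
print(well_separated, too_close, wrong_order)
```

The loss penalizes both wrongly ordered pairs and correctly ordered pairs whose scores are closer than the margin.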

Output schema

{
  lang: <language code string>,
  rev_id: <revision_id string>,
  score: {
     score: <Readability ranking score>
     fk_score_proxy: <Flesch–Kincaid score approximation>
  }
}

Example input and output

See LiftWing API for more details.

Example input:

curl https://api.wikimedia.org/service/lw/inference/v1/models/readability:predict -X POST -d '{"rev_id": 123456, "lang": "en"}' -H "Content-type: application/json"

Example output:

{
"model_name":"readability",
"model_version":"2",
"wiki_db":"enwiki",
"revision_id":1161100049,
"output":{
    "score":1.1111111111,
    "fk_score_proxy":11.1111
    }
}
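The same call can be sketched in Python with the standard library only. The snippet builds the POST request from the curl example and parses a response shaped like the example output above. It does not send anything: the network call is left as a comment. The helper names are hypothetical.

```python
import json
import urllib.request

ENDPOINT = (
    "https://api.wikimedia.org/service/lw/inference/v1/"
    "models/readability:predict"
)


def build_request(rev_id: int, lang: str) -> urllib.request.Request:
    """Build the POST request shown in the curl example (does not send it)."""
    body = json.dumps({"rev_id": rev_id, "lang": lang}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={"Content-type": "application/json"},
        method="POST",
    )


def parse_response(raw: str) -> dict:
    """Extract the two scores from a response shaped like the example output."""
    payload = json.loads(raw)
    output = payload["output"]
    return {
        "revision_id": payload["revision_id"],
        "score": output["score"],
        "fk_score_proxy": output["fk_score_proxy"],
    }


request = build_request(123456, "en")
# To send it: raw = urllib.request.urlopen(request).read().decode("utf-8")

# Response text copied from the example output above.
example = (
    '{"model_name":"readability","model_version":"2","wiki_db":"enwiki",'
    '"revision_id":1161100049,'
    '"output":{"score":1.1111111111,"fk_score_proxy":11.1111}}'
)
parsed = parse_response(example)
print(parsed)
```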

Data

Training data consist of pairs of texts that correspond to the same article in English Wikipedia and Simple English Wikipedia. We treat one text in each pair as simple (easy) and the other as difficult (hard). We split the data into two parts: train (80%) and validation (20%).
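A minimal sketch of an 80/20 split over article pairs. Shuffling with a fixed seed is an assumption: this page does not say how the split was drawn. The function name and the placeholder pairs are illustrative.

```python
import random


def split_pairs(pairs, train_fraction=0.8, seed=0):
    """Shuffle pairs reproducibly and split them into train and validation."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder (easy, hard) text pairs standing in for real article pairs.
pairs = [(f"simple text {i}", f"standard text {i}") for i in range(10)]
train, validation = split_pairs(pairs)
print(len(train), len(validation))
```

Each pair is kept whole, so the easy and hard versions of one article always end up in the same part of the split.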

Apart from the holdout dataset, we evaluate model performance in other languages. In particular, we use Vikidia pairs for ca, de, el, en, es, eu, fr, hy, it, oc, pt, ru, and scn; Klexikon for de; Wikikids for nl; and Txikipedia for eu. These data are used only for model testing.

Data pipeline

Training data

Number of samples (pairs): 112342
Languages: en

Testing data

Number of samples (pairs): 58309
Languages: en, de, ca, el, es, eu, fr, hy, it, oc, pt, ru, scn, nl

Licenses

Code: GNU General Public License v2.0
Model: Apache 2.0 License

Citation

Preprint:

@misc{trokhymovych2024openmultilingualscoringreadability,
      title={An Open Multilingual System for Scoring Readability of Wikipedia}, 
      author={Mykola Trokhymovych and Indira Sen and Martin Gerlach},
      year={2024},
      eprint={2406.01835},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.01835}, 
}

[1] https://huggingface.co/Peltarion/xlm-roberta-longformer-base-4096
[2] Wu, S., & Dredze, M. (2020). Are All Languages Created Equal in Multilingual BERT? Proceedings of the 5th Workshop on Representation Learning for NLP, 120–130. https://doi.org/10.18653/v1/2020.repl4nlp-1.16
[3] https://huggingface.co/Peltarion/xlm-roberta-longformer-base-4096
[4] https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
