Research:Multilingual Readability Research/Evaluation multilingual model

An improved multilingual model to measure readability of Wikipedia articles.

Background

The aim of this project is to build a model to predict the readability scores of Wikipedia articles in different languages. Specifically, we take advantage of multilingual large language models, which support around 100 languages. One of the main challenges is that for most languages there is no ground-truth data available about the reading level of an article, so fine-tuning or re-training in each language is not a scalable option. Therefore, we train the model only in English on a large corpus of Wikipedia articles with two readability levels (Simple English Wikipedia and English Wikipedia). We then evaluate how well the model generalizes to other corpora (in English) as well as to other languages, using a smaller multilingual corpus of encyclopedic articles with two readability levels, consisting of Wikipedia articles and their corresponding articles from different children's encyclopedias.

Method

Data

We use the multilingual dataset of encyclopedic articles available in two different readability levels (see Research:Multilingual Readability Research#Generating a multilingual dataset):

  • difficult: Wikipedia article in a specific language
  • simple: matched article in the same language from a children's encyclopedia
Language Dataset # pairs
en Simplewiki 109,152
en Vikidia 1,994
ca Vikidia 244
de Klexikon 2,259
de Vikidia 273
el Vikidia 41
es Vikidia 2,450
eu Txikipedia 2,649
eu Vikidia 1,059
fr Vikidia 12,675
hy Vikidia 550
it Vikidia 1,704
nl Wikikids 11,319
oc Vikidia 7
pt Vikidia 598
ru Vikidia 104
scn Vikidia 11

Model

We represent our task as a binary classification problem: the simple version of a text is labeled 0 (negative), the difficult version 1 (positive). The model is trained to classify whether an article is simple or difficult, and the output probability can be interpreted as a readability score.

 
Distributions of text length for easy and hard texts (limited to 3000 characters)

For our task we use the bert-base-multilingual-cased model. It is pretrained on the 104 languages with the largest Wikipedias using a masked language modeling (MLM) objective. We fine-tune this model for the binary classification task using the training part of the dataset described above. It is important to note that fine-tuning of the multilingual model is performed using only English texts.
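
As an illustration, the model and its classification head can be set up with the Hugging Face transformers library. This is a minimal sketch, not necessarily the exact training code used here; only the checkpoint name and the two labels come from the text above.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pretrained multilingual BERT checkpoint with a binary classification head:
# label 0 = simple, label 1 = difficult.
model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
```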

We fine-tune the masked language model (MLM) for the binary classification task. Instead of taking the whole text as input, we work at the sentence level: each sentence is treated as an independent sample with the label of the corresponding text. During inference, we obtain a prediction for each sentence independently and use mean pooling to get the prediction for the whole text. There are a few reasons for this decision: (i) the full text can be long and may not fit into memory during fine-tuning; (ii) text length represents unwanted leakage to the model, as hard and easy texts have different length distributions.
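
A minimal sketch of sentence-level inference with mean pooling, assuming the model and tokenizer from the previous snippet; the function name and max_length value are illustrative, not taken from the report.

```python
import torch

def readability_score(sentences, model, tokenizer):
    """Score a text (given as a list of sentences) as the mean of the
    per-sentence probabilities of the 'difficult' class (label 1)."""
    model.eval()
    sentence_probs = []
    with torch.no_grad():
        for sentence in sentences:
            inputs = tokenizer(sentence, truncation=True, max_length=128,
                               return_tensors="pt")
            logits = model(**inputs).logits
            # probability of class 1 ("difficult") for this sentence
            sentence_probs.append(torch.softmax(logits, dim=-1)[0, 1].item())
    # mean pooling over sentence-level predictions gives the text-level score
    return sum(sentence_probs) / len(sentence_probs)
```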


Even though we train the model on sentence-level samples, we evaluate on the text level, as this reflects how the model is used in practice. We use accuracy as a simple evaluation metric: it is easy to interpret, and it is appropriate here because the data is perfectly balanced, with each hard text having a corresponding easy text.

Training and test split

 
Example pair for the same article with different readability levels (easy, hard)

The training data consists of pairs of texts corresponding to articles in Simple English Wikipedia and English Wikipedia. We treat one text in each pair as simple (easy) and the other as difficult (hard). Each text is represented as a list of sentences.

We split the data into three parts: train (80%), validation (10%), and test (10%). An important detail is that both versions of an article are assigned to the same part (train, validation, or test), so the model never sees the counterpart of an evaluation article during training.
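
A minimal sketch of such a pair-aware split using scikit-learn's GroupShuffleSplit; the toy data and variable names are illustrative, not taken from the project.

```python
from sklearn.model_selection import GroupShuffleSplit

# Toy example: each pair holds the easy and the hard version of one article.
pairs = [
    ("Cats are small pets.", "The domestic cat is a small carnivorous mammal."),
    ("The sun is a star.", "The Sun is the star at the centre of the Solar System."),
    ("Water is a liquid.", "Water is an inorganic compound with the formula H2O."),
    ("Dogs can be trained.", "The dog is a domesticated descendant of the wolf."),
    ("Paris is in France.", "Paris is the capital and largest city of France."),
]

# Flatten into labeled samples, keeping the pair index as the group id so
# both versions of an article always land in the same split.
samples = [(text, label, pair_id)
           for pair_id, (easy, hard) in enumerate(pairs)
           for text, label in ((easy, 0), (hard, 1))]
groups = [pair_id for _, _, pair_id in samples]

# 80% train / 20% held out; the held-out part can be split once more in the
# same way to obtain the 10% validation and 10% test sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, heldout_idx = next(splitter.split(samples, groups=groups))
```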

Apart from the held-out test set, we evaluate model performance on the other languages (see above). However, we train only on English text, as there are not enough samples in the other languages to use them for training.

 
Train and test split for the readability model, keeping pairs of sentences either in the training or in the test set

Evaluation metric

Since the dataset is balanced (the same number of hard and easy samples, as they come in pairs), we use accuracy as the main evaluation metric. We also use AUC as an additional metric.
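
Both metrics can be computed directly from the text-level readability scores, for example with scikit-learn; the variable names and values below are illustrative.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# scores: per-text readability scores in [0, 1] from mean pooling
# labels: 0 for easy texts, 1 for hard texts
scores = [0.12, 0.81, 0.35, 0.67]
labels = [0, 1, 0, 1]

accuracy = accuracy_score(labels, [score >= 0.5 for score in scores])
auc = roc_auc_score(labels, scores)
```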

Results

Model evaluation

We evaluate the model using accuracy and AUC for each dataset in the different languages (the model was trained only on the 80% training split of the English Simplewiki data).

The model performs best on English, the language used for training. However, thanks to the multilingual base model, the knowledge generalizes: performance on other languages is clearly better than random. For example, accuracy is 0.73 for fr, 0.72 for nl, and 0.76 for it. The worst performance is for eu (Basque) and el (Greek).

Language Dataset accuracy AUC
en Simplewiki (test set) 0.891352 0.955451
en Vikidia 0.921013 0.982656
ca Vikidia 0.860656 0.914270
de Klexikon 0.757636 0.948942
de Vikidia 0.690476 0.872446
el Vikidia 0.524390 0.761154
es Vikidia 0.702041 0.822553
eu Txikipedia 0.425975 0.386073
eu Vikidia 0.579792 0.611134
fr Vikidia 0.731558 0.826539
hy Vikidia 0.535455 0.695755
it Vikidia 0.763791 0.856777
nl Wikikids 0.715346 0.788743
oc Vikidia 0.571429 0.795918
pt Vikidia 0.811037 0.908483
ru Vikidia 0.701923 0.837555
scn Vikidia 0.636364 0.752066


We also observe that model performance differs for texts of different lengths (number of sentences): the longer the text, the more accurate the predictions.

Accuracy of the model for texts of different lengths (number of sentences)

We also observe that the readability scores from the multilingual BERT model separate articles of the simple (0) and difficult (1) classes. In fact, the separation is more pronounced than with the standard Flesch-Kincaid readability formula for English.

 
Distribution of Flesch-Kincaid scores for the articles with different readability levels: simple (0), difficult (1).
 
Distribution of multilingual BERT scores for the articles with different readability levels: simple (0), difficult (1).
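
For reference, the Flesch-Kincaid baseline can be computed with off-the-shelf tooling. The snippet below uses the textstat package as an assumption; the report does not state which implementation was used.

```python
import textstat

# Flesch-Kincaid grade level for an English text; higher values indicate
# harder text, roughly corresponding to US school grade levels.
grade = textstat.flesch_kincaid_grade(
    "The cat sat on the mat. It was a sunny day."
)
```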

Comparison to language-agnostic model

We compare against the language-agnostic model, which we re-ran with the same train-test splits and evaluation data:

  • The multilingual BERT model substantially improves the accuracy over the language-agnostic model; the only exceptions are the two datasets in de (German).
  • The multilingual BERT model supports more languages. Some languages (el: Greek, eu: Basque, hy: Armenian, oc: Occitan, scn: Sicilian) are currently not supported by the language-agnostic model because we do not have a working entity-linking model generating the language-agnostic representations.
Language Dataset acc multilingual BERT acc Language-agnostic
en Simplewiki (test set) 0.891 0.763
en Vikidia 0.921 0.863
ca Vikidia 0.860 0.809
de Klexikon 0.757 0.823
de Vikidia 0.690 0.793
el Vikidia 0.524 -
es Vikidia 0.702 0.680
eu Txikipedia 0.425 -
eu Vikidia 0.579 -
fr Vikidia 0.731 0.637
hy Vikidia 0.535 -
it Vikidia 0.763 0.741
nl Wikikids 0.715 0.562
oc Vikidia 0.571 -
pt Vikidia 0.811 0.809
ru Vikidia 0.701 0.610
scn Vikidia 0.636 -

Additional experiments

We also performed various experiments aiming to improve the performance of the model; we summarize them here but do not show detailed results.

  • We also tried fine-tuning an MLM with the Longformer architecture, which allows passing the whole text as input instead of predicting per sentence. It showed performance comparable to the sentence-based approach. However, we decided not to proceed with it, as the model is more difficult to maintain, needs more memory, and does not show a performance boost.
  • Another experiment attempted to extend the training dataset with translated texts. As we do not have enough data for most languages other than English, the idea was to artificially create such a dataset via translation, fine-tune the model on it, and evaluate on the test set of real texts. The results showed that adding translated texts decreases performance on the original (English) texts, with no reliable increase in performance on the languages of the translated texts. We used facebook/m2m100_418M for translation (see the sketch after this list).
  • One more experiment added a small amount of original texts from other languages to the training set. It showed a slight increase in performance for the added languages. However, we decided not to proceed with it, as we do not have enough data in those languages for such a training procedure.
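
A minimal sketch of how English texts could be translated with facebook/m2m100_418M via the transformers library, following the model's documented usage; the exact translation pipeline used in the experiment may have differed.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# Translate an English sentence into French.
tokenizer.src_lang = "en"
encoded = tokenizer("Paris is the capital of France.", return_tensors="pt")
generated = model.generate(**encoded,
                           forced_bos_token_id=tokenizer.get_lang_id("fr"))
translation = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```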

Resources