Research:Multilingual Readability Research/Evaluation language agnostic

We develop and evaluate a language-agnostic model for assessing the readability of Wikipedia articles:

  • While there are many different ways to operationalize readability, we start from standard readability formulas, which are known to capture at least some aspects of readability (such as Flesch reading ease for articles in English Wikipedia[1]). Our focus is to adapt these approaches to more languages by using language-agnostic features.
  • We build a binary classification model that predicts the annotated level of readability of an article (easy or difficult). The model’s prediction score (between 0 and 1) can be interpreted as a normalized readability score.
  • We train the model only on English data, where we have sufficient ground truth from Simple Wikipedia and English Wikipedia. We test the model on other languages using annotated ground truth data from Wikipedia and corresponding children’s encyclopedias. This reveals how well the model generalizes to other languages without re-training or fine-tuning, i.e., in the absence of ground truth data (which is the case for the majority of languages in Wikipedia).

In summary, we find that the language-agnostic model constitutes a promising approach for obtaining readability scores of articles in different languages without language-specific fine-tuning or customization.

  • The language-agnostic approach is less precise than the standard readability formulas for English.
  • The language-agnostic approach generalizes better to other languages than the (non-customized) standard readability formulas.
  • The language-agnostic approach performs similarly to, or almost as well as, the customized versions of the standard readability formulas for most languages (noting that such customizations exist only for very few languages).

Datasets

We compile a dataset of documents which are annotated with different readability levels in different languages. Specifically, each dataset consists of pairs of aligned articles with two assigned readability levels which we denote “easy” and “difficult”.

Simple English Wikipedia (SEW)

This dataset consists of articles from Simple Wikipedia (easy) and English Wikipedia (difficult) following similar previous approaches[2]. While this dataset is monolingual (a “simple” Wikipedia only exists in English), this is a relatively large aligned corpus with over 200k articles in simplewiki (overlapping considerably with the more than 6M articles in enwiki).

Using the 2021-11 snapshot, we match pairs of common articles from simplewiki and enwiki via their Wikidata item (main namespace, no redirects). We remove pairs where either article is a disambiguation page or a list page. We extract the text of an article from its wikitext, removing any markup and non-text elements. We keep only the lead section of each article so that the two articles in a pair do not differ too much in length. We split the text into sentences, keeping only pairs where both articles contain at least 3 sentences. The final dataset consists of 100,454 pairs of articles.
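A minimal sketch of the filtering step (the Wikidata matching and markup removal are omitted); `lead_simple` and `lead_en` are hypothetical dicts mapping a Wikidata QID to the plain-text lead section of the simplewiki and enwiki article, respectively:

```python
import re

def split_sentences(text):
    # crude, illustration-only sentence splitter; the actual pipeline may use
    # a proper multilingual sentence tokenizer
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def build_pairs(lead_simple, lead_en, min_sentences=3):
    """Keep only Wikidata items present in both wikis whose lead sections
    both contain at least `min_sentences` sentences."""
    pairs = {}
    for qid in lead_simple.keys() & lead_en.keys():
        easy = split_sentences(lead_simple[qid])
        difficult = split_sentences(lead_en[qid])
        if len(easy) >= min_sentences and len(difficult) >= min_sentences:
            pairs[qid] = (easy, difficult)
    return pairs
```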

Vikidia Wikipedia (VW)

This dataset consists of articles from Vikidia (easy), a children’s encyclopedia, and Wikipedia (difficult), following similar previous approaches[3]. Vikidia exists in 12 languages, with the number of articles varying from 35k in French to fewer than 100 in Greek: French (fr), Italian (it), Spanish (es), English (en), Basque (eu), Armenian (hy), Portuguese (pt), German (de), Catalan (ca), Sicilian (scn), Russian (ru), and Greek (el).

For each language, we identify the common articles between Vikidia and Wikipedia in the following way. We first extract all article titles from Vikidia via the public API (e.g., for English Vikidia). Second, we match the article titles from Vikidia with article titles from the 2021-11 snapshot of Wikipedia. We also take into account matches between titles of any redirect pages and the corresponding articles, since the titles of corresponding articles sometimes differ slightly, e.g., UEFA_Euro_2016 (enwiki) vs. Euro_2016 (Vikidia). We only keep a pair if the two articles can be matched unambiguously (the title and its redirects cannot be matched to any other article with its corresponding redirects).

We extract the text of an article from its wikitext, removing any markup and non-text elements. We keep only the lead section of each article so that the two articles in a pair do not differ too much in length. We split the text into sentences, keeping only pairs where both articles contain at least 3 sentences.

The final datasets have the following number of pairs of articles:

  • French (fr): 12,153
  • Italian (it): 1,650
  • Spanish (es): 2,337
  • English (en): 1,789
  • Basque (eu): 1,045
  • Armenian (hy): 10
  • German (de): 263
  • Catalan (ca): 236
  • Sicilian (scn): 12
  • Russian (ru): 98
  • Portuguese (pt): 41
  • Greek (el): 40

Klexikon Wikipedia (KW)

This dataset consists of articles from Klexikon (easy), an encyclopedia for children aged 6 to 12 years in German, and German Wikipedia (difficult). We use an existing publicly available dataset[4]. This dataset contains 2,898 aligned documents in German and is thus substantially larger than the corresponding Vikidia data. We only keep the lead section of each article.

Readability features

Standard readability formulas

There are several formulas for calculating the readability or reading level of a text. We consider the following readability formulas[5], where all counts (words, sentences, syllables, characters) refer to totals over the text:

  1. Flesch reading ease (FRE) is one of the most widely used formulas, with higher scores indicating text that is easier to read:

     \mathrm{FRE} = 206.835 - 1.015 \cdot \frac{\text{words}}{\text{sentences}} - 84.6 \cdot \frac{\text{syllables}}{\text{words}}

  2. Flesch-Kincaid grade represents the number of years of education generally required to understand a given text:

     \mathrm{FKG} = 0.39 \cdot \frac{\text{words}}{\text{sentences}} + 11.8 \cdot \frac{\text{syllables}}{\text{words}} - 15.59

  3. Dale-Chall readability score is based on a manually curated list of difficult words:

     \mathrm{DC} = 0.1579 \cdot \left( 100 \cdot \frac{\text{difficult words}}{\text{words}} \right) + 0.0496 \cdot \frac{\text{words}}{\text{sentences}}

  4. Gunning-Fog index leverages the notion of complex words, defined as words with at least three syllables:

     \mathrm{GF} = 0.4 \cdot \left( \frac{\text{words}}{\text{sentences}} + 100 \cdot \frac{\text{complex words}}{\text{words}} \right)

  5. SMOG index is a modification of the Gunning-Fog index and also uses complex (polysyllabic) words in its formula:

     \mathrm{SMOG} = 1.043 \cdot \sqrt{ 30 \cdot \frac{\text{polysyllables}}{\text{sentences}} } + 3.1291

  6. Automated readability index is the only formula that takes characters into account rather than syllables:

     \mathrm{ARI} = 4.71 \cdot \frac{\text{characters}}{\text{words}} + 0.5 \cdot \frac{\text{words}}{\text{sentences}} - 21.43

For all metrics except FRE, higher scores indicate text that is harder to read.

The formulas were designed specifically for English texts. There have been attempts to adapt these formulas to other languages, but such customizations exist only for a few cases. We use the customizations of FRE for 5 languages described here:

Language | Name | Base | Sentence length | Syllables per word
EN | Flesch Reading Ease | 206.835 | 1.015 | 84.6
FR | | 207 | 1.015 | 73.6
DE | Toni Amstad | 180 | 1 | 58.5
ES | Fernandez Huerta Readability Formula | 206.84 | 1.02 | 0.6
IT | Flesch-Vacca | 217 | 1.3 | 0.6
RU | | 206.835 | 1.3 | 60.1

We calculate all formulas with the Python package textstat.
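For illustration, computing all six scores with textstat might look as follows; `textstat.set_lang` switches `flesch_reading_ease` to the customized variant where one exists (the exact set of supported language codes depends on the textstat version):

```python
import textstat

text = "ADHD is most common in children: fewer adults have ADHD."

scores = {
    "flesch_reading_ease": textstat.flesch_reading_ease(text),
    "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
    "dale_chall": textstat.dale_chall_readability_score(text),
    "gunning_fog": textstat.gunning_fog(text),
    "smog_index": textstat.smog_index(text),
    "automated_readability_index": textstat.automated_readability_index(text),
}

# customized FRE variants, e.g. Fernandez Huerta for Spanish
textstat.set_lang("es")
fre_es = textstat.flesch_reading_ease("El TDAH es más común en los niños.")
```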

Language-agnostic features

In order to extract language-agnostic features, we first represent each sentence as a sequence of entities. For this, we need an entity-linker to annotate each sentence with its corresponding entities. Each entity consists of an entity-mention (the text) and an entity-id (e.g., its Wikidata item). In practice, we use DBpedia Spotlight (via the Python library spacy-dbpedia-spotlight), which provides entity-linking models for 17 languages. In addition, the model provides a confidence score for each annotated entity. In order to increase the recall of the entities, we choose the lowest possible threshold for the confidence score.
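A minimal sketch of the annotation step with spacy-dbpedia-spotlight; the confidence value in the config is an illustrative stand-in for "lowest possible threshold", and the exact config keys may vary across library versions:

```python
import spacy
import spacy_dbpedia_spotlight  # noqa: F401 -- registers the 'dbpedia_spotlight' pipe

# a blank pipeline suffices; the linking is done by the DBpedia Spotlight service
nlp = spacy.blank("en")
# very low confidence threshold to maximize entity recall (value is illustrative)
nlp.add_pipe("dbpedia_spotlight", config={"confidence": 0.1})

doc = nlp("ADHD is most common in children: fewer adults have ADHD.")
for ent in doc.ents:
    # entity-mention (surface text) and entity-id (DBpedia URI)
    print(ent.text, ent.kb_id_)
```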

Given the representation of a sentence as a sequence of entities, we define the following quantities:

  • Number of tokens: the total number of entities in a sentence
  • Number of types: the number of unique entities in a sentence, in terms of their entity-mentions (types_str) or their entity-ids (types_ids)

As an example, consider the following sentences (adapted from [6]) with annotated entities underlined:

  1. Attention-deficit hyperactivity disorder (ADHD), or attention deficit disorder (ADD), is a neurodevelopmental disorder.
  2. People with ADHD may be too active.
  3. ADHD is called a neurological developmental disorder because it affects how people's nervous systems develop.
  4. ADHD is most common in children: fewer adults have ADHD.

sentence | total # of entities (tokens) | # of different entities (types_ids) | # of different mentions (types_str) | type-token ratio (ids) | type-token ratio (str)
1 | 5 | 2 (ADHD, neuro…) | 5 (ADHD, ADD, attention-deficit hyperactivity…, attention…, neuro…) | 2/5 | 5/5
2 | 1 | 1 | 1 | 1/1 | 1/1
3 | 3 | 3 | 3 | 3/3 | 3/3
4 | 2 | 1 | 1 | 1/2 | 1/2

From this, we extract the following language-agnostic features:

  • Average sentence length in tokens
  • Average sentence length in types (types_ids)
  • Average sentence length in types (types_str)
  • Average sentence type-token ratio (types_ids)
  • Average sentence type-token ratio (types_str)
  • Document type-token ratio (types_ids)
  • Document type-token ratio (types_str)
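Under the assumption that each document arrives as per-sentence lists of (mention, entity-id) pairs, these seven features reduce to simple counts; a sketch:

```python
def language_agnostic_features(doc):
    """`doc` is a list of sentences, each a list of (mention, entity_id)
    pairs produced by the entity linker; sentences without entities are
    assumed to have been dropped beforehand."""
    n = len(doc)
    tokens = [len(sent) for sent in doc]                          # entities per sentence
    types_ids = [len({eid for _, eid in sent}) for sent in doc]   # unique entity-ids
    types_str = [len({m for m, _ in sent}) for sent in doc]       # unique mentions
    all_ids = [eid for sent in doc for _, eid in sent]
    all_str = [m for sent in doc for m, _ in sent]
    return {
        "avg_sent_length_tokens": sum(tokens) / n,
        "avg_sent_length_types_ids": sum(types_ids) / n,
        "avg_sent_length_types_str": sum(types_str) / n,
        # sentence-level type-token ratios, averaged over sentences
        "avg_sent_ttr_ids": sum(t / k for t, k in zip(types_ids, tokens)) / n,
        "avg_sent_ttr_str": sum(t / k for t, k in zip(types_str, tokens)) / n,
        # document-level type-token ratios
        "doc_ttr_ids": len(set(all_ids)) / len(all_ids),
        "doc_ttr_str": len(set(all_str)) / len(all_str),
    }

# sentence 4 of the example above: two tokens, one type, type-token ratio 1/2
# (the entity-id is illustrative, not the actual Wikidata QID)
example = [[("ADHD", "Q_ADHD"), ("ADHD", "Q_ADHD")]]
features = language_agnostic_features(example)
```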

Results

Exploratory Analysis

Before we develop the model, we want to gain some intuition about the different readability features across the different languages.

What are the scores of the different features for easy and difficult texts?

We plot the distributions of the individual features comparing easy and difficult articles.

Observations:

  • For the standard readability features, the distributions for easy (simplewiki) and difficult (enwiki) texts are systematically shifted (the differences are statistically significant).
  • For the language-agnostic features, we also observe systematic differences between easy and difficult texts. Most notably, the average sentence length is higher for enwiki than for simplewiki.
  • Similar observations hold across all considered languages.
 
Figure: The distribution of document and readability metrics for SEW. Document length refers to the length of an article in characters, while average sentence length is reported as the average number of word-tokens per sentence. Note that for all metrics except FRE, the lower the score, the better the readability. In terms of overall distribution, Simple Wikipedia scores better for all document and readability metrics.

Figure: The distribution of language-agnostic metrics for SEW. The average length metrics in terms of language-agnostic tokens, entity types, and mention types have lower scores for Simple Wikipedia articles, while the opposite is true for the ratio metrics. However, compared to the readability metrics’ score distributions above, there is a higher overlap between Simple and English articles.

Figure: The distribution of FRE for the VW and KW datasets across languages. In terms of overall distribution, Vikidia and Klexikon articles score better than Wikipedia articles, but the overlap is higher compared to SEW, especially for the non-English articles.

Figure: The distribution of average sentence length in language-agnostic tokens for the VW and KW datasets across languages. In terms of overall distribution, Vikidia and Klexikon articles score better than Wikipedia articles.

Can the individual features distinguish between the easy and difficult version of an individual pair of articles?

This is different from plotting the distributions: here we check whether a feature can distinguish the easy and the difficult version of an individual pair of articles. This corresponds to an “unsupervised classification”. Specifically, for each pair of corresponding articles across the different languages and datasets, and for each readability and language-agnostic metric, we assign the ‘correct’, i.e., the more readable, label to the article whose score indicates higher readability on that metric. Ties are broken randomly. In the following table, we report the average rate of correct assignments across all articles for the VW dataset, disaggregated by language (a code sketch of this procedure follows the table). The higher the score, the better that metric is at differentiating readability for paired articles. Observations:

  • The customized Flesch reading ease formula has the highest score in correctly assigning the articles of a pair to the easy and difficult class, respectively.
  • The non-customized version of FRE distinguishes the two classes with substantially lower accuracy in some languages.
  • The language-agnostic features are consistent across languages. They perform worse than the customized FRE; for some languages they are comparable to or better than the non-customized FRE.
Table: The scores of the different features in the paired evaluation across languages for VW (and KW for German). The higher the score, the better the metric differentiates between a Vikidia article and its Wikipedia counterpart. The FRE metric (customized for each language) performs best, but the non-customized metric also works well for most languages except Italian.

Feature | ru | es | fr | en | de | it | de (KW)
flesch_reading_ease | 0.755 | 0.842 | 0.839 | 0.921 | 0.829 | 0.785 | 0.987
flesch_reading_ease [non-custom] | 0.724 | 0.757 | 0.835 | 0.92 | 0.817 | 0.588 | 0.987
avg_sent_length_tokens_lang_agn | 0.633 | 0.746 | 0.721 | 0.791 | 0.734 | 0.819 | 0.948
avg_sent_length_entity_types | 0.643 | 0.734 | 0.72 | 0.781 | 0.707 | 0.817 | 0.939
avg_sent_length_mention_types | 0.643 | 0.746 | 0.721 | 0.789 | 0.703 | 0.816 | 0.948
sent_type_token_ratio_entity | 0.531 | 0.64 | 0.567 | 0.663 | 0.726 | 0.628 | 0.744
sent_type_token_ratio_mention | 0.551 | 0.62 | 0.565 | 0.641 | 0.586 | 0.604 | 0.578
doc_type_token_ratio_entity | 0.622 | 0.645 | 0.545 | 0.679 | 0.51 | 0.679 | 0.807
doc_type_token_ratio_mention | 0.592 | 0.641 | 0.539 | 0.696 | 0.589 | 0.659 | 0.89
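The paired evaluation can be sketched as follows; `pairs` and `score_fn` are placeholders for the real data and metric, and the `higher_is_easier` flag makes the comparison direction explicit (e.g. FRE assigns higher scores to easier text, while most other metrics do the opposite):

```python
import random

def paired_accuracy(pairs, score_fn, higher_is_easier=True):
    """Fraction of pairs whose easy version is ranked as more readable.

    `pairs`: list of (easy_text, difficult_text); `score_fn`: text -> score.
    """
    correct = 0
    for easy, difficult in pairs:
        s_easy, s_diff = score_fn(easy), score_fn(difficult)
        if s_easy == s_diff:
            correct += random.random() < 0.5  # break ties randomly
        elif higher_is_easier:
            correct += s_easy > s_diff
        else:
            correct += s_easy < s_diff
    return correct / len(pairs)
```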

How similar are the different features?

We calculate the Spearman rank correlation between the scores of the different features across all articles in a dataset.
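With scipy this is a one-liner; the per-article scores below are made up for illustration:

```python
from scipy.stats import spearmanr

# hypothetical per-article scores of two features over the same five articles
fre = [72.1, 55.3, 64.0, 48.9, 80.2]
avg_sentence_length = [9.5, 14.2, 11.0, 16.8, 7.9]

# here rho = -1.0: the two rankings are perfectly inverted
rho, pval = spearmanr(fre, avg_sentence_length)
```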

Observations:

  • The standard readability formulas are all strongly correlated with each other (0.53…0.74).
  • The standard readability formulas are moderately correlated with the language-agnostic features (0.24…0.52). The reason is that the standard readability formulas correlate with simple features such as the average sentence length in word-tokens (0.32…0.52), which in turn correlates with the average sentence length in entity-tokens (0.4…0.42).
 
Figure: Heatmap of the correlations between the paired-evaluation scores of the document, readability, and language-agnostic metrics for SEW. The readability metrics are positively correlated with each other, except Dale-Chall, which has a low score. The readability formulas are not well correlated with document length or the number of tokens, but are moderately correlated with average sentence length. Among the language-agnostic features, the sentence-length metrics are correlated with each other, while the ratio features are correlated among each other.

Figure: Heatmap of the correlations between the paired-evaluation scores of the document, readability, and language-agnostic metrics for the VW Spanish dataset.

Classification model

The aim is to develop a binary classifier that predicts whether an individual text belongs to the easy or the difficult class. The advantage over the individual formulas described above is that we can combine different features to distinguish between easy and difficult texts. In fact, the output of the model is a score between 0 (easy) and 1 (difficult) that can be interpreted as a normalized readability score.

We train the model using the English SEW dataset and split it into a 70% training and 30% test set (specifically, we split by pairs of articles such that the easy and the difficult version of an article both appear in the training or in the test set). Despite being available only in English, the SEW dataset provides by far the largest number of samples for training the model.
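One way to implement such a pair-preserving split is scikit-learn's GroupShuffleSplit, using the pair id as the group key; a sketch with placeholder data, not necessarily the authors' implementation:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# placeholder data: two rows per article pair (easy + difficult version),
# 7 language-agnostic features per row
X = np.random.rand(200, 7)
y = np.tile([0, 1], 100)               # 0 = easy, 1 = difficult
groups = np.repeat(np.arange(100), 2)  # pair id shared by both versions

splitter = GroupShuffleSplit(n_splits=1, train_size=0.7, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups))
# both versions of a pair always land on the same side of the split
```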

We evaluate the model on the other languages (VW and KW) without further training or fine-tuning for the specific language. The reason is that there is no ground truth data of articles annotated with readability levels for the vast majority of the 300+ languages in Wikipedia; in practice, it is thus not feasible to fine-tune the model for each specific language. Evaluating on the multilingual datasets (VW and KW) therefore indicates how well the trained model generalizes to readability prediction in languages for which there is no training data (which is the case for almost every other language). In addition, the availability of the VW dataset in English shows how well the trained model generalizes to another corpus. As the evaluation metric we use accuracy (fraction of correctly predicted samples), since the two classes are perfectly balanced. Results report the mean and standard error over 5 predictions of each sample in the test set.

We train three different models: Logistic Regression (LR), Linear Support Vector Machine (SVM), and Random Forest (RF). We select hyperparameters via grid search, picking the model that performs best on the SEW test data averaged over 5 different 70-30 splits.
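A sketch of the model selection with scikit-learn; the hyperparameter grids and the synthetic data are illustrative (the actual grids and the averaging over 5 splits are not reproduced here):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X = np.random.rand(200, 7)  # placeholder features, as in the split sketch above
y = np.tile([0, 1], 100)

# the three candidate models with illustrative hyperparameter grids
candidates = {
    "LR": (make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
           {"logisticregression__C": [0.01, 0.1, 1, 10]}),
    "SVM": (make_pipeline(StandardScaler(), LinearSVC()),
            {"linearsvc__C": [0.01, 0.1, 1, 10]}),
    "RF": (RandomForestClassifier(random_state=0),
           {"n_estimators": [100, 300], "max_depth": [None, 10]}),
}

best = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, scoring="accuracy", cv=5)
    search.fit(X, y)
    best[name] = search.best_estimator_

# the positive-class probability can serve as a normalized readability score in [0, 1]
readability_scores = best["RF"].predict_proba(X[:5])[:, 1]
```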

Below we report results for all datasets (SEW, VW, KW) comparing different sets of features:

  • Language-agnostic features: only entity-based features, without language-specific parsing.
  • Readability formula (FRE for English): the standard readability formula, which we can easily apply to other languages too. Thus, this constitutes a good comparison to the language-agnostic model because it does not require any language-specific tuning.
  • Customized readability formula (FRE for the specific language): readability formulas whose coefficients are fine-tuned to the corresponding language. This is, in general, not applicable for most languages, as fine-tuned formulas do not exist.

Summary:

  • Language-agnostic works: it provides a slightly less precise measure than the standard readability formula used for English, but generalizes much better to other languages without any fine-tuning.
    • For the English datasets (both SEW and VW), the language-agnostic model performs worse than the standard readability formula.
    • For the non-English datasets, the language-agnostic model performs better than the non-customized readability formula (except French, where the latter is slightly better). This is the typical situation in the context of Wikipedia, because for the majority of Wikipedia’s languages no such customization exists.
    • For some of the non-English datasets, the language-agnostic model performs similarly to or even better than the customized readability formulas. However, given the lack of annotated training data for most of Wikipedia’s languages, we cannot expect such customizations to become available.
    • For the language-agnostic model, the Random Forest is the best model in most cases. Its accuracy ranges between 61% (VW-ru) and 76% (VW-en).
Table: Summary of results for the supervised model trained on 70% of the SEW data and tested on the held-out 30% of SEW, on VW, and on KW. For each dataset we use three feature sets: all 7 language-agnostic features, the non-customized readability metric (only FRE, to remain constant across all languages), and the customized FRE. For each feature set, we train three types of supervised ML models: Logistic Regression (LR), Support Vector Machines (SVM, with a linear kernel), and Random Forests (RF). In the VW datasets, for German, Russian, and Spanish, the language-agnostic features perform best, especially with the RF models. For English, the readability features are best in SEW. Notably, for languages other than French, the language-agnostic features always outperform the non-customized readability features.
dataset | lang | features | LR | Linear SVM | RF
SEW | en | language agnostic | 0.66 ± 0.0 | 0.66 ± 0.0 | 0.661 ± 0.0
SEW | en | readability | 0.712 ± 0.0 | 0.712 ± 0.0 | 0.763 ± 0.0
VW | en | language agnostic | 0.678 ± 0.0 | 0.678 ± 0.0 | 0.765 ± 0.002
VW | en | readability | 0.779 ± 0.0 | 0.75 ± 0.057 | 0.756 ± 0.1
VW | es | language agnostic | 0.654 ± 0.0 | 0.654 ± 0.0 | 0.685 ± 0.001
VW | es | readability | 0.5 ± 0.0 | 0.5 ± 0.0 | 0.5 ± 0.0
VW | es | readability_custom | 0.665 ± 0.0 | 0.667 ± 0.004 | 0.639 ± 0.005
VW | fr | language agnostic | 0.62 ± 0.0 | 0.618 ± 0.0 | 0.63 ± 0.0
VW | fr | readability | 0.67 ± 0.0 | 0.679 ± 0.017 | 0.61 ± 0.06
VW | fr | readability_custom | 0.649 ± 0.0 | 0.651 ± 0.004 | 0.641 ± 0.03
VW | it | language agnostic | 0.687 ± 0.0 | 0.683 ± 0.0 | 0.684 ± 0.001
VW | it | readability | 0.5 ± 0.0 | 0.5 ± 0.0 | 0.5 ± 0.0
VW | it | readability_custom | 0.698 ± 0.0 | 0.697 ± 0.001 | 0.632 ± 0.006
VW | ru | language agnostic | 0.536 ± 0.0 | 0.536 ± 0.0 | 0.606 ± 0.011
VW | ru | readability | 0.515 ± 0.0 | 0.515 ± 0.0 | 0.431 ± 0.014
VW | ru | readability_custom | 0.582 ± 0.0 | 0.584 ± 0.003 | 0.547 ± 0.015
VW | de | language agnostic | 0.593 ± 0.0 | 0.587 ± 0.0 | 0.678 ± 0.003
VW | de | readability | 0.538 ± 0.0 | 0.546 ± 0.016 | 0.56 ± 0.066
VW | de | readability_custom | 0.676 ± 0.002 | 0.671 ± 0.008 | 0.593 ± 0.031
KW | de | language agnostic | 0.747 ± 0.0 | 0.745 ± 0.0 | 0.713 ± 0.002
KW | de | readability | 0.512 ± 0.0 | 0.532 ± 0.04 | 0.607 ± 0.131
KW | de | readability_custom | 0.91 ± 0.0 | 0.907 ± 0.006 | 0.646 ± 0.015


The same results are shown more compactly as a bar plot:

Figure: Bar-plot version of the results table above (same feature sets, models, and datasets).

References

  1. Lucassen, T., Dijkstra, R., & Schraagen, J. M. (2012). Readability of Wikipedia. First Monday, 17(9). doi:10.5210/fm.v0i0.3916
  2. Napoles, C., & Dredze, M. (2010). Learning Simple Wikipedia: A cogitation in ascertaining abecedarian language. Proceedings of the NAACL HLT 2010 Workshop on Computational Linguistics and Writing: Writing Processes and Authoring Aids (CL&W '10), 42–50. doi:10.5555/1860657.1860663
  3. Madrazo Azpiazu, I., & Pera, M. S. (2020). Is cross-lingual readability assessment possible? Journal of the Association for Information Science and Technology, 71(6), 644–656. doi:10.1002/asi.24293
  4. Aumiller, D., & Gertz, M. (2022). Klexikon: A German dataset for joint summarization and simplification. arXiv:2201.07198. http://arxiv.org/abs/2201.07198
  5. Martinc, M., Pollak, S., & Robnik-Šikonja, M. (2021). Supervised and unsupervised neural approaches to text readability. Computational Linguistics, 47(1), 141–179. doi:10.1162/coli_a_00398
  6. Štajner, S., & Hulpuș, I. (2020). When shallow is good enough: Automatic assessment of conceptual text complexity using shallow semantic features. Proceedings of the 12th Language Resources and Evaluation Conference, 1414–1422. https://www.aclweb.org/anthology/2020.lrec-1.177/