Machine learning models/Production/Language-agnostic Wikipedia article quality

This model card describes a model for predicting the quality of Wikipedia articles. It uses structural features extracted from the article, a simple set of weights, and wiki-specific normalization criteria to label Wikipedia articles in any language with a score between 0 and 1 (which can then be mapped to more recognized article quality classes such as Stub). These scores are relative to a given language edition (not directly comparable across languages). The weights and feature selection were trained on editor assessments from Arabic, English, and French Wikipedia. This model is a prototype and may still be substantially updated.

Model card
This page is an on-wiki machine learning model card.
A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub
Model creator(s): Isaac Johnson
Model owner(s): Isaac Johnson
Model interface: English Wikipedia example
Past performance: documentation
Code: Gitlab
Uses PII: No
This model uses the structure and size of an article to predict quality scores for Wikipedia articles.

Motivation

Wikipedia articles range in quality from rich, well-illustrated, fully-referenced articles that thoroughly cover their topic and are easy to read, to single-sentence stubs that define the topic of the article but do not offer much more information. It is very useful to be able to reliably distinguish between these extremes and the various stages of quality along this spectrum. Wikipedia editors have developed rich rubrics for how to evaluate the quality of Wikipedia articles and are constantly assessing article quality to assist in coordinating work on the wikis (English Wikipedia example). Editors use these quality scores to evaluate and prioritize their work. Researchers use these quality scores to understand content dynamics. Developers use these quality scores as filters when building recommender systems or other tools.

Wikipedia is ever-changing, though, which makes it time-consuming (and largely impossible) for editors to keep these quality assessments complete and up-to-date. An automatic quality model can help fill these gaps by evaluating the quality of articles that are unassessed or have changed substantially since they were last assessed. In doing so, it can provide researchers and tool developers with more consistent data and even potentially help editors identify articles that would benefit from a human assessment. Initial models were language-specific, which allowed them to be finely tuned to the dynamics and existing quality classes of a particular language edition. This model, by contrast, is language-agnostic (it works for all Wikipedia language editions). The model may require further fine-tuning for a given community to better align its scores with existing quality classes, but this approach ensures that all language editions, even those lacking their own quality assessment schema, can benefit from these labels.

Users and uses

Use this model for
  • high-level analyses of article quality trends (example)
  • filtering / ranking articles in tools – e.g., only show low-quality articles in a recommender system
  • identifying potential ways to improve articles – e.g., using the lowest-value feature from the model as a recommendation
Don't use this model for
  • projects outside of Wikipedia — e.g. Wiktionary, Wikinews, etc.
  • pages outside of namespace 0 (articles), disambiguation pages, or redirects
  • directly comparing article quality across language editions – the scores are relative to a given Wikipedia, so, for example, an article that receives a score of 0.5 on English Wikipedia would receive a much higher score if it were on Simple English Wikipedia instead (because high-quality articles on English Wikipedia generally have more content than high-quality articles on Simple English Wikipedia)

Ethical considerations, caveats, and recommendations

  • The weights used in this quality model were derived from a groundtruth dataset of quality assessments made by editors on Arabic, English, and French Wikipedia (using the PageAssessments Extension). The model therefore reflects how editors in these communities weigh the value of different aspects of an article, which may or may not extend to other language editions.
  • The model does not currently take into account the quality of the specific writing, so a long article with many fake words would register as high quality. It does take into account structure though, so a long article would be penalized if it did not have many sections or was poorly referenced.
  • The scores are relative to a given wiki – i.e., the feature scores needed to achieve a high quality prediction vary by wiki. For instance, a high-quality article on English Wikipedia is expected to have at least 3 images while only 2 are required on Swedish Wikipedia (in fact, fewer than 5% of articles on Swedish Wikipedia have more than 2 images).
  • The predicted scores in many wikis skew higher than the groundtruth assessments provided by Wikipedians. Some of this can be tempered by calibrating the thresholds used for mapping the predictions to classes (see the model evaluation for recommended thresholds), but even so, the model tends to be more optimistic than Wikipedians, likely because it rewards the many articles that have a lot of content but would still benefit from improved readability, background coverage, etc.

Model

Performance

Implementation

Model architecture
A linear regression model without an intercept, whose output is mapped to the range 0 to 1. It can equivalently be thought of as a weighted average of normalized features, with the weights derived from the linear regression.
Output schema
{
  lang: <language-code string>,
  title: <title string>,
  quality: <float [0-1]>,
  features: {
    normalized: {
      <feature 1>: <float [0-1]>,
      ...
      <feature n>: <float [0-1]>
    },
    raw: {
      <feature 1>: <int [0-]>,
      ...
      <feature n>: <int [0-]>
    }
  }
}
Example input and output

Input

GET /api/v1/quality-article-features?lang=en&title=Frida_Kahlo

Output

{
  "lang": "en",
  "title": "Frida_Kahlo",
  "quality": 0.9019721559418881,
  "features": {
    "normalized": {
      "categories": 1,
      "headings": 0.4688479565973076,
      "length (bytes)": 1,
      "media": 1,
      "references": 0.8304080512730345,
      "wikilinks": 0.4578991720468065
    },
    "raw": {
      "categories": 28,
      "headings": 19,
      "length (bytes)": 123748,
      "media": 20,
      "references": 86,
      "wikilinks": 351
    }
  }
}
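
For illustration only, here is a minimal Python sketch of how a client might call this endpoint. The base URL is a placeholder (the hosting service is not specified in this card), and the response is assumed to follow the output schema above.

import requests

# Placeholder host -- substitute the actual service URL that exposes this API.
BASE_URL = "https://example-quality-service.invalid"

def get_article_quality(lang: str, title: str) -> dict:
    """Fetch the predicted quality score and feature values for one article."""
    resp = requests.get(
        f"{BASE_URL}/api/v1/quality-article-features",
        params={"lang": lang, "title": title},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

result = get_article_quality("en", "Frida_Kahlo")
print(result["quality"])                 # e.g., ~0.90 for a high-quality article
print(result["features"]["normalized"])  # per-feature scores in [0, 1]

The table below lists the features the model uses, their weights, the pre-processing applied to each, and the minimum threshold for top quality.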
Feature | Weight | Pre-processing | Minimum threshold for top quality[1]
Page length | 0.395 | Square root of number of bytes in wikitext | 10,000 characters
References | 0.181 | # ref tags / normalized-page-length | 0.15 (roughly equivalent to 2 refs / section)
Sections | 0.123 | Number of level 2 and 3 headings / normalized-page-length | 0.1 (1 heading at 100 chars, 2 headings at 400 chars, etc.)
Wikilinks | 0.115 | Square root of # of wikilinks (ns=0) / normalized-page-length | 0.1 (~1 link per sentence)
Media | 0.114 | Raw count of media files – e.g., image, video, audio – in wikitext | 2
Categories | 0.070 | Raw count of categories in wikitext | 5
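
As a rough illustration of the weighted-average architecture, the sketch below (Python, illustrative only) combines already-normalized feature values – as returned in the features.normalized field of the API output – with the rounded weights from the table. The deployed model derives the normalized values itself from the wiki-specific thresholds and pre-processing above, and its exact score can differ: with these rounded weights the Frida Kahlo example works out to roughly 0.84, whereas the service reports about 0.90.

# Rounded feature weights from the table above.
WEIGHTS = {
    "length (bytes)": 0.395,
    "references": 0.181,
    "headings": 0.123,
    "wikilinks": 0.115,
    "media": 0.114,
    "categories": 0.070,
}

def quality_score(normalized_features: dict) -> float:
    """Weighted sum of normalized feature values (each in [0, 1]), clipped to [0, 1]."""
    score = sum(
        weight * normalized_features.get(name, 0.0)
        for name, weight in WEIGHTS.items()
    )
    return min(1.0, score)

# Normalized values from the Frida Kahlo example output above.
example = {
    "categories": 1, "headings": 0.4688, "length (bytes)": 1,
    "media": 1, "references": 0.8304, "wikilinks": 0.4579,
}
print(round(quality_score(example), 2))  # ~0.84 with rounded weights (service reports ~0.90)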

Data

The model weights are based on 19,173 articles whose quality was assessed by editors in December 2021 across English (7,937), French (7,554), and Arabic (3,682) Wikipedia. The breakdown by quality class and language edition is as follows:

Groundtruth data distribution
Quality class (based on English Wikipedia) | Language | Number of articles
Stub | Arabic | 3,448
Stub | English | 1,726
Stub | French | 3,811
Start | Arabic | 166
Start | English | 2,056
Start | French | 2,965
C | Arabic | 17
C | English | 2,809
C | French | 601
B | Arabic | 15
B | English | 867
B | French | 76
GA | Arabic | 19
GA | English | 415
GA | French | 55
FA | Arabic | 17
FA | English | 64
FA | French | 46
Data pipeline
The pipeline has two stages: 1) learning feature weights, and 2) deriving pre-processing thresholds. In the first stage, a small sample of data is used to learn the relative weight of each of the model features – e.g., categories, text, etc. This stage is also used for testing different feature transformations such as log-normalization. In the second stage, features are computed for every Wikipedia article, and the value attained by the top 5% of articles for each wiki and feature is used to determine what a "high-quality" article should attain in that wiki, and therefore how to normalize the feature values – e.g., if the top 5% of articles in English Wikipedia have 14 categories, then an article with 5 categories receives a score of 0.36 (min(1, 5/14)) for that feature while an article with 20 categories receives a score of 1 (min(1, 20/14)). Certain global minimum thresholds are also set at this stage based on manual inspection of the data.
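The normalization in the second stage can be sketched as follows (Python, illustrative only; the exact percentile handling and data access in the production pipeline are simplified here):

import numpy as np

def derive_threshold(raw_values, percentile: float = 95.0) -> float:
    """Per-wiki, per-feature threshold: the value reached by the top 5% of articles."""
    return float(np.percentile(raw_values, percentile))

def normalize(raw_value: float, threshold: float) -> float:
    """Map a raw feature count to [0, 1]; values at or above the threshold score 1."""
    return min(1.0, raw_value / threshold)

# Matching the example above: if the top 5% of English Wikipedia articles
# have 14 categories, the threshold is 14.
threshold = 14
print(round(normalize(5, threshold), 2))  # 0.36
print(normalize(20, threshold))           # 1.0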
Training data
A small sample of recently-assessed Wikipedia articles from the wikis described above was used to derive the model weights.
Test data
See this PAWS notebook for a detailed model evaluation. Testing the model on a sample of data from Arabic, French, and English Wikipedia gathered several months after training shows a high correlation between model predictions and Wikipedian assessments. Note: this evaluation does not yet include any language editions beyond those used in training.


Licenses

Citation

Cite this model as:

@misc{johnson2022quality,
   title={Language-agnostic Wikipedia article quality model card},
   author={Johnson, Isaac},
   year={2022},
   url = {https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Language-agnostic_Wikipedia_article_quality_model_card},
}

References

  1. Most wikis have thresholds that are higher than this minimum.