Machine learning models/Proposed/Multilingual reference need


How can we assist editors in identifying articles with statements missing citations?

Model card
This page is an on-wiki machine learning model card.
A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub
Model creator(s): Aitolkyn Baigutanova, Muniza Aslam, and Diego Saez-Trumper
Model owner(s): Diego Saez-Trumper
Code: Inference
Uses PII: No
In production?: TBA
This model uses revision content to predict the reference need score of that revision.

The goal of this model is to detect sentences that lack proper citations. One of the key policies of Wikipedia is verifiability, which ensures that users of the encyclopedia can check the sources of the provided information. Unfortunately, this policy is often not adhered to, resulting in numerous uncited statements. To deal with this issue, Wikipedia provides the citation needed tag, which allows anyone to request a source for an uncited claim. However, given the sheer volume of daily edits on Wikipedia, manually labeling such claims can be inefficient and prone to delays. This model therefore aims to automate the detection of uncited claims and to support editors in maintaining Wikipedia's standards of verifiability.

This model is to be deployed on LiftWing.

Motivation


The model was created to support editors and to give readers further insight into the referencing quality of an article.

The model is recommended for use in the top 10 language editions of Wikipedia by active user count (as of July 2024), namely ['fa', 'it', 'zh', 'ru', 'pt', 'es', 'ja', 'de', 'fr', 'en']. For other language editions of Wikipedia, the model should be used with special precautions.


Users and uses

Use this model for
  • computing the reference need score of a revision
Don't use this model for
  • making predictions on language editions of Wikipedia other than the ones listed above without special precautions
  • making predictions on other Wiki projects (Wiktionary, Wikinews, Wikidata, etc.)
Current uses

Ethical considerations, caveats, and recommendations


Model


The model is based on content features extracted from the wikitext of a revision. We used mwtokenizer to extract sentences from the main text of the article. To give the model more context, we supplemented each sentence with its section heading and the paragraph in which it appears, and passed these to the fine-tuned multilingual BERT model. The system includes the following steps:

1. Prepare text features:

  • Process wikitext and locate the inline citations used
  • Tokenize the revision into sentences
  • Supplement each sentence with a section heading and paragraph

2. Classification

  • Pass the features to the model
  • Obtain a prediction for each sentence in a revision

3. Compute reference need score

  • Using the model's prediction for each sentence and the citations present in the revision, compute the reference need score as the proportion of uncited sentences among all sentences that require a citation.
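
The score in step 3 can be illustrated with a minimal sketch. The names SentencePrediction and reference_need_score are hypothetical and are not taken from the model's code, and returning 0.0 when no sentence requires a citation is an assumption, since that case is not specified here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class SentencePrediction:
    """One sentence of a revision (hypothetical structure for illustration)."""
    needs_citation: bool  # model prediction for the sentence
    has_citation: bool    # whether an inline citation was found in the wikitext


def reference_need_score(sentences: Iterable[SentencePrediction]) -> float:
    """Proportion of uncited sentences among sentences that require a citation."""
    needing = [s for s in sentences if s.needs_citation]
    if not needing:
        # Assumption: with nothing to cite, the score is defined as 0.0.
        return 0.0
    uncited = sum(1 for s in needing if not s.has_citation)
    return uncited / len(needing)


revision = [
    SentencePrediction(needs_citation=True, has_citation=True),
    SentencePrediction(needs_citation=True, has_citation=False),
    SentencePrediction(needs_citation=False, has_citation=False),
    SentencePrediction(needs_citation=True, has_citation=False),
]
score = reference_need_score(revision)  # 2 uncited out of 3 that need a citation
```

Under this reading, a score close to 1 means that most sentences predicted to need a citation do not have one, while a score of 0 means that every such sentence is cited.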

Performance


Implementation

Model architecture

mBERT model fine-tuning:

  • Learning rate: 2e-5
  • Weight Decay: 0.01
  • Epochs: 3
  • Maximum input length: 512
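
For readers who want to reproduce a comparable setup, the values above could be written down as a small configuration. This is only a sketch: the key names follow the keyword arguments of Hugging Face transformers (TrainingArguments and the tokenizer call), which is an assumption about the tooling, and truncation=True is likewise assumed rather than stated on this card.

```python
# Values copied from the list above; the grouping and key names are assumptions.
TRAINING_HPARAMS = {
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
    "num_train_epochs": 3,
}

TOKENIZER_HPARAMS = {
    "max_length": 512,   # maximum input length in tokens
    "truncation": True,  # assumption: longer inputs are cut to max_length
}

# With transformers installed, these could be used roughly as follows
# (output_dir is a hypothetical path):
#   args = TrainingArguments(output_dir="out", **TRAINING_HPARAMS)
#   batch = tokenizer(texts, **TOKENIZER_HPARAMS)
```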
Output schema
Example input and output

Data


The model was trained on a set of featured articles, which editors have judged to be the highest-quality articles on Wikipedia. We used the mediawiki_wikitext_current table to extract the latest available revision of each featured article. The snapshot used was 2024-02.

Data pipeline
The data was collected using the Wikimedia Data Lake and the Wikimedia Analytics cluster. For each article, we tokenized its latest revision into sentences to be passed to the model.
Training data
  • Number of languages: 5 ('ru', 'es', 'de', 'fr', 'en')
  • Number of sentences: 100,000
  • Random sample of 20,000 sentences from each language balanced on the ground-truth label.
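
The sampling described above can be sketched as follows. The function balanced_sample and the (language, label, sentence) row layout are hypothetical, and the sketch assumes that "balanced on the ground-truth label" means an equal number of positive and negative sentences per language.

```python
import random
from collections import defaultdict


def balanced_sample(rows, per_language, seed=0):
    """Draw per_language rows for each language, half from each label.

    rows: iterable of (language, label, sentence) tuples with label in {0, 1}.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for language, label, sentence in rows:
        buckets[(language, label)].append((language, label, sentence))

    per_label = per_language // 2  # assumption: a 50/50 split between labels
    sample = []
    for key in sorted(buckets):
        bucket = buckets[key]
        if len(bucket) < per_label:
            raise ValueError(f"not enough rows for {key}")
        sample.extend(rng.sample(bucket, per_label))
    return sample


# Toy data: 2 languages, 10 sentences per label each.
toy_rows = [
    (lang, label, f"{lang}-{label}-{i}")
    for lang in ("en", "de")
    for label in (0, 1)
    for i in range(10)
]
toy_sample = balanced_sample(toy_rows, per_language=6)
```

With per_language=20000 and the five training languages, the same procedure would yield the 100,000 sentences listed above.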
Test data

Licenses

  • Code:
  • Model:

Citation


Cite this model as:

@misc{name_year_modeltype,
   title={Model card title},
   author={Lastname, Firstname (and Lastname, Firstname and...)},
   year={year},
   url={this URL}
}