Machine learning models/Proposed/Multilingual reference need


How can we assist editors in identifying articles with statements missing citations?

Model card
This page is an on-wiki machine learning model card.
A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub
Model creator(s): Aitolkyn Baigutanova, Muniza Aslam, and Diego Saez-Trumper
Model owner(s): Diego Saez-Trumper
Code: Inference
Uses PII: No
In production? TBA
This model uses revision content to predict the reference need score of that revision.

The goal of this model is to detect sentences that lack proper citations. One of the key policies of Wikipedia is verifiability, which ensures that users of the encyclopedia can check the sources of the provided information. Unfortunately, this policy is often not adhered to, resulting in numerous uncited statements. To deal with this issue, Wikipedia provides the citation needed tag, which allows anyone to request a source for an uncited claim. However, due to the sheer volume of daily edits on Wikipedia, manually labeling uncited claims is inefficient and prone to delays. This model therefore aims to automate the detection of uncited claims and to support editors in maintaining Wikipedia's standards of verifiability.

This model is to be deployed on LiftWing.

Motivation


The model was created to support editors and to give readers further insight into the referencing quality of an article.

The model is recommended for use in the top 10 language editions of Wikipedia by active user count (as of July 2024), namely ['fa', 'it', 'zh', 'ru', 'pt', 'es', 'ja', 'de', 'fr', 'en']. For other editions of Wikipedia, the model should be used with special precautions.


Users and uses

Use this model for
  • computing the reference need score of a revision
Don't use this model for
  • making predictions on language editions of Wikipedia other than the ones listed above, unless special precautions are taken
  • making predictions on other Wiki projects (Wiktionary, Wikinews, Wikidata, etc.)
Current uses

Ethical considerations, caveats, and recommendations

  • This model relies on multilingual BERT, a large language model that might contain certain biases.
  • The processing time for this model grows proportionally with the number of sentences, so longer articles take longer to process. Keep in mind that an article's popularity (pageviews) may correlate with its length.
  • References the current model catches include those within a <ref> tag and templates whose names start with sfn, snf, or harv. Although these are the major referencing methods used in the largest wikis included in the experiments for this model, some citation templates may still be missed. One reason for this is folksonomy, where a single template is referred to by multiple names. (See the detection sketch below.)
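
As an illustration, the sketch below shows one way such references could be located in wikitext using the mwparserfromhell library. This is not the model's actual extraction code, and the prefix list simply mirrors the templates named above.

import mwparserfromhell

CITATION_TEMPLATE_PREFIXES = ("sfn", "snf", "harv")

def find_citations(wikitext: str):
    """Locate <ref> tags and shortened-footnote templates in wikitext."""
    parsed = mwparserfromhell.parse(wikitext)
    ref_tags = [t for t in parsed.filter_tags()
                if str(t.tag).strip().lower() == "ref"]
    templates = [t for t in parsed.filter_templates()
                 if str(t.name).strip().lower().startswith(CITATION_TEMPLATE_PREFIXES)]
    return ref_tags, templates

refs, tpls = find_citations("A claim.<ref>Smith 2020</ref> Another.{{sfn|Doe|2019}}")
print(len(refs), len(tpls))  # -> 1 1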

Model


The presented model is based on content features extracted from the wikitext of a revision. We used mwtokenizer to extract sentences from the main text of the article. To provide additional context to the model, we included the section heading along with the sentences immediately preceding and following the target sentence within its paragraph, and fed this enriched context into a fine-tuned distilled version of the multilingual BERT model. For an additional speed-up in the production setting, we applied dynamic quantization to the model, a technique performed after training to reduce the size of the model's weights. In particular, we represented the 32-bit floating-point weights as more compact 8-bit integers.
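
A minimal sketch of the context enrichment and quantization steps, assuming a PyTorch/Hugging Face setup. The base checkpoint shown is the standard distilled multilingual BERT; the fine-tuned weights are not shown, and the context ordering and separator are illustrative rather than the production configuration.

import torch
from transformers import AutoModelForSequenceClassification

def build_context(heading: str, prev_sent: str, sentence: str, next_sent: str) -> str:
    """Enrich the target sentence with its section heading and the
    sentences around it; ordering and separator are illustrative."""
    return " ".join([heading, prev_sent, sentence, next_sent])

# Base checkpoint only; the deployed model is a fine-tuned version of it.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-multilingual-cased", num_labels=2
)

# Dynamic quantization: applied after training, it stores the 32-bit
# floating-point weights of linear layers as 8-bit integers, shrinking
# the model and speeding up CPU inference.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)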

1. Prepare text features:

  • Process wikitext and locate the inline citations used
  • Tokenize the revision into sentences
  • Supplement each sentence with a section heading, a preceding sentence, and a following sentence

2. Classification

  • Pass the features to the model
  • Obtain a prediction for each uncited sentence in the revision

3. Compute reference need score

  • Based on the model's prediction for each sentence and the citations already present in the revision, compute the reference need score as the proportion of uncited sentences that are predicted to need a citation, out of all uncited sentences in the revision (see the sketch below).
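
Under this reading, the score computation reduces to a few lines; the following sketch uses illustrative names:

def reference_need_score(needs_citation: list[bool], is_cited: list[bool]) -> float:
    """needs_citation[i]: the model predicts sentence i needs a citation.
    is_cited[i]: sentence i already carries an inline citation."""
    uncited_flags = [needs for needs, cited in zip(needs_citation, is_cited)
                     if not cited]
    if not uncited_flags:
        return 0.0
    # Proportion of uncited sentences flagged as needing a citation.
    return sum(uncited_flags) / len(uncited_flags)

print(reference_need_score([True, False, True, True], [False, False, True, False]))
# -> 0.666..., i.e. 2 of the 3 uncited sentences need a citation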

Performance


Model performance was tested on the five languages in the dataset. We evaluated the model using the F1-score, PR-AUC, and ROC-AUC metrics.
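
For reference, these metrics map onto standard scikit-learn calls; the sketch below uses toy labels, and PR-AUC is computed here as average precision:

from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

# Illustrative labels and scores; in the actual evaluation these come from
# the held-out test set described under "Data".
y_true = [1, 0, 1, 1, 0]
y_score = [0.9, 0.3, 0.6, 0.8, 0.4]   # model probabilities
y_pred = [s >= 0.5 for s in y_score]  # thresholded predictions

f1 = f1_score(y_true, y_pred)
pr_auc = average_precision_score(y_true, y_score)  # PR-AUC via average precision
roc_auc = roc_auc_score(y_true, y_score)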

Model performance metrics
wiki_db   F1-score   PR-AUC   ROC-AUC
enwiki    0.742      0.835    0.833
eswiki    0.697      0.749    0.755
frwiki    0.677      0.729    0.740
dewiki    0.693      0.765    0.765
ruwiki    0.725      0.817    0.807
All       0.705      0.778    0.774

We further tested the time performance on a random sample of 1,000 revisions. The plot below presents the cumulative distribution function (CDF) of the time taken per revision. Note that performance was tested on CPU with a batch_size of 1; the time could be significantly reduced in a GPU setting.

 
[Figure: Time performance with CPU (CDF of processing time per revision)]

Implementation

Model architecture

DistilBERT multilingual model fine-tuning:

  • Learning rate: 2e-5
  • Weight Decay: 0.01
  • Epochs: 3
  • Maximum input length: 128
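
In a Hugging Face Trainer setup, these hyperparameters would be expressed roughly as follows (a sketch; the output path is illustrative and the surrounding training code is omitted):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="rn-distilbert",  # illustrative output path
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=3,
)

# The maximum input length is enforced at tokenization time, e.g.:
# tokenizer(texts, truncation=True, max_length=128)
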
Output schema
{
  "rn_score": <float in [0, 1]>
}
Example input and output

Input

{"rev_id": 1221892202, "lang": "en"}

Output

{
  "rn_score": 0.22
}
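
Since the model is not yet deployed, there is no public endpoint; if it follows the usual Lift Wing pattern, a query might look like the sketch below, where the model name in the URL is a placeholder:

import requests

# Hypothetical endpoint: "reference-need" is a placeholder model name,
# following the general Lift Wing inference URL pattern.
url = "https://api.wikimedia.org/service/lw/inference/v1/models/reference-need:predict"
response = requests.post(url, json={"rev_id": 1221892202, "lang": "en"})
print(response.json())  # e.g. {"rn_score": 0.22}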

Data


The model was trained on a set of featured articles, which editors have designated as the highest-quality articles on Wikipedia. We used the mediawiki_wikitext_current table to extract the latest available revision of each featured article. The snapshot used was 2024-02.

Data pipeline
The data was collected using the Wikimedia Data Lake and the Wikimedia Analytics cluster. For each article, we tokenized its latest revision into sentences to be passed to the model.
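
A minimal sketch of this tokenization step, assuming mwtokenizer's Tokenizer interface (the exact API may differ):

from mwtokenizer.tokenizer import Tokenizer

# Assumed interface: a language-aware tokenizer that splits plain text
# (already stripped of markup) into sentences.
tokenizer = Tokenizer(language_code="en")
text = "Paris is the capital of France. It hosted the 2024 Summer Olympics."
sentences = list(tokenizer.sentence_tokenize(text, use_abbreviation=True))
print(sentences)
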
Training data
  • Number of languages: 5 ('ru', 'es', 'de', 'fr', 'en')
  • Number of sentences: 100,000
  • Random sample of 20,000 sentences from each language balanced on the ground-truth label.
Test data
  • Number of languages: 5 ('ru', 'es', 'de', 'fr', 'en')
  • Number of sentences: 15,000
  • Random sample of 3,000 sentences from each language balanced on the ground-truth label.
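
The sampling described above could be expressed as the following pandas sketch; the DataFrame and column names are assumptions:

import pandas as pd

def balanced_sample(df: pd.DataFrame, per_language: int, seed: int = 42) -> pd.DataFrame:
    """Sample per_language sentences from each wiki, split evenly across
    the binary ground-truth label."""
    per_cell = per_language // 2
    return (
        df.groupby(["wiki_db", "label"], group_keys=False)
          .apply(lambda g: g.sample(n=per_cell, random_state=seed))
    )

# Usage, assuming sentence-level DataFrames with "wiki_db" and "label" columns:
# train = balanced_sample(train_pool, per_language=20_000)  # 100,000 rows
# test  = balanced_sample(test_pool, per_language=3_000)    # 15,000 rows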

Licenses

  • Code:
  • Model:

Citation


Cite this model as:

@misc{name_year_modeltype,
   title={Model card title},
   author={Lastname, Firstname (and Lastname, Firstname and...)},
   year={year},
   url={this URL}
}
