Multilingual reference need
This model card currently has draft status: it is a piece of model documentation that is still being written. Once the model card is complete, this template should be removed.
How can we assist editors in identifying articles with statements missing citations?
This page is an on-wiki machine learning model card.

| Model Information Hub | |
|---|---|
| Model creator(s) | Aitolkyn Baigutanova, Muniza Aslam, and Diego Saez-Trumper |
| Model owner(s) | Diego Saez-Trumper |
| Code | Inference |
| Uses PII | No |
| In production? | TBA |

This model uses revision content to predict the reference need score of that revision.
The goal of this model is to detect sentences that lack proper citations. One of Wikipedia's key policies is verifiability, which ensures that readers of the encyclopedia can check the sources of the information provided. Unfortunately, this policy is often not adhered to, resulting in numerous uncited statements. To address this issue, Wikipedia provides the citation needed tag, which allows anyone to request a source for an uncited claim. However, given the sheer volume of daily edits on Wikipedia, manually labeling such uncited claims is inefficient and prone to delays. This model therefore aims to automate the detection of uncited claims and support editors in maintaining Wikipedia's standards of verifiability.
This model is to be deployed on LiftWing.
Motivation
The model was created to support editors and to give readers further insight into the referencing quality of an article.
The model is recommended for use in the top 10 language editions of Wikipedia by active user count (as of July 2024), namely ['fa', 'it', 'zh', 'ru', 'pt', 'es', 'ja', 'de', 'fr', 'en']. For other editions of Wikipedia, the model should be used with special precautions.
Users and uses
- Compute the reference need score of a revision.
- Make predictions on language editions of Wikipedia other than those listed above (with precautions).
- Make predictions on other Wiki projects (Wiktionary, Wikinews, Wikidata, etc.).
Ethical considerations, caveats, and recommendations
- This model relies on multilingual BERT, a large language model that may contain certain biases.
- The processing time of this model grows proportionally with the number of sentences, so longer articles take longer to process. Keep in mind that an article's popularity (pageviews) may correlate with its length.
- The references the current model catches are those within a <ref> tag and templates whose names start with sfn, snf, or harv. Although these are the major referencing methods used in the largest wikis included in the experiments for this model, some citation templates may still be missed. One reason for this is folksonomy, where a single template is referred to by multiple names.
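As an illustration of the caveat above, here is a minimal sketch of how such citations can be located in wikitext using the mwparserfromhell library. The function and prefix handling are illustrative assumptions, not the model's actual parsing code:

```python
import mwparserfromhell

# Template-name prefixes treated as citations, per the caveat above.
CITATION_TEMPLATE_PREFIXES = ("sfn", "snf", "harv")

def fragment_has_citation(wikitext: str) -> bool:
    """Return True if the fragment contains a <ref> tag or a template whose
    name starts with a known citation prefix. Templates with other
    (folksonomic) names would be missed, as noted above."""
    code = mwparserfromhell.parse(wikitext)
    if code.filter_tags(matches=lambda tag: str(tag.tag).lower() == "ref"):
        return True
    return any(
        str(template.name).strip().lower().startswith(CITATION_TEMPLATE_PREFIXES)
        for template in code.filter_templates()
    )
```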
Model
The presented model is based on content features extracted from the wikitext of a revision. We used mwtokenizer to extract sentences from the main text of the article. To provide additional context to the model, we included the section heading along with the sentences immediately preceding and following the target sentence within its paragraph, and fed this enriched context into a fine-tuned distilled version of the multilingual BERT model. For an additional speed-up in the production setting, we applied dynamic quantization to our model, a technique performed after training to reduce the size of the model's weights. Specifically, we represented 32-bit floating-point weights as more compact 8-bit integers.
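A minimal sketch of this quantization step using the standard PyTorch post-training API; the stand-in model is an assumption, not the exact production code:

```python
import torch

# Stand-in for the fine-tuned classifier; in practice this would be the
# DistilBERT-based model described above.
model = torch.nn.Sequential(
    torch.nn.Linear(768, 256), torch.nn.ReLU(), torch.nn.Linear(256, 2)
)

# Post-training dynamic quantization: the 32-bit floating-point weights of
# the linear layers are stored as 8-bit integers; activations stay in float.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```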
1. Prepare text features:
- Process the wikitext and locate the inline citations used
- Tokenize the revision into sentences
- Supplement each sentence with its section heading and the immediately preceding and following sentences
2. Classification
- Pass the features to the model
- Obtain a prediction for each uncited sentence in the revision
3. Compute the reference need score
- Based on the model's per-sentence predictions and the citations already present in the revision, compute the reference need score as the proportion of uncited sentences that the model predicts require a citation, out of the total number of uncited sentences in the revision (see the sketch after this list)
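A minimal sketch of steps 1-3, under the following assumptions: `sentences` is the tokenized main text, `is_cited[i]` says whether sentence `i` already carries an inline citation, and `predict_needs_citation` is a stand-in for the fine-tuned classifier. For simplicity, it does not restrict the neighbouring-sentence context to the sentence's own paragraph, as the real pipeline does:

```python
from typing import Callable, Sequence

def build_context(sentences: Sequence[str], i: int, heading: str) -> str:
    # Step 1: enrich the target sentence with its section heading and the
    # immediately preceding and following sentences.
    prev_s = sentences[i - 1] if i > 0 else ""
    next_s = sentences[i + 1] if i + 1 < len(sentences) else ""
    return " ".join(p for p in (heading, prev_s, sentences[i], next_s) if p)

def reference_need_score(
    sentences: Sequence[str],
    is_cited: Sequence[bool],
    heading: str,
    predict_needs_citation: Callable[[str], bool],
) -> float:
    # Step 2: classify every uncited sentence.
    uncited = [i for i, cited in enumerate(is_cited) if not cited]
    if not uncited:
        return 0.0
    flagged = sum(
        predict_needs_citation(build_context(sentences, i, heading))
        for i in uncited
    )
    # Step 3: proportion of uncited sentences predicted to need a citation.
    return flagged / len(uncited)
```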
Performance
The model's performance was tested on the five languages in the dataset. We evaluated the model using the F1-score, PR-AUC, and ROC-AUC metrics.
| wiki_db | F1-score | PR-AUC | ROC-AUC |
|---|---|---|---|
| enwiki | 0.742 | 0.835 | 0.833 |
| eswiki | 0.697 | 0.749 | 0.755 |
| frwiki | 0.677 | 0.729 | 0.740 |
| dewiki | 0.693 | 0.765 | 0.765 |
| ruwiki | 0.725 | 0.817 | 0.807 |
| All | 0.705 | 0.778 | 0.774 |
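For reference, metrics of this kind can be computed with scikit-learn as in the sketch below, where `y_true` are ground-truth labels and `y_score` are predicted probabilities for the "needs citation" class; the 0.5 threshold for the F1-score is an assumption:

```python
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

def evaluate(y_true, y_score, threshold=0.5):
    # F1 needs hard labels; PR-AUC and ROC-AUC use the raw scores.
    y_pred = [int(score >= threshold) for score in y_score]
    return {
        "F1-score": f1_score(y_true, y_pred),
        "PR-AUC": average_precision_score(y_true, y_score),
        "ROC-AUC": roc_auc_score(y_true, y_score),
    }
```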
We further tested the time performance on a random sample of 1,000 revisions. The plots below present the cumulative distribution function (CDF) of the time taken per revision. Note that performance was measured on a CPU with a batch size of 1; the time could be significantly reduced in a GPU setting.
Implementation
Fine-tuning of the distilled multilingual BERT model:
- Learning rate: 2e-5
- Weight Decay: 0.01
- Epochs: 3
- Maximum input length: 128
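A minimal sketch of this fine-tuning setup with Hugging Face Transformers; the checkpoint name and dataset handling are assumptions, and only the hyperparameters listed above come from this card:

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Assumed checkpoint; the card only says "distilled multilingual BERT".
CHECKPOINT = "distilbert-base-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)

def tokenize(batch):
    # Enforce the maximum input length of 128 tokens.
    return tokenizer(batch["text"], truncation=True, max_length=128)

args = TrainingArguments(
    output_dir="reference-need-model",
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=3,
)

# trainer = Trainer(model=model, args=args, train_dataset=..., eval_dataset=...)
# trainer.train()
```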
```
{
  rn_score: <float [0-1]>
}
```

Input

```
{"rev_id": 1221892202, "lang": "en"}
```

Output

```
{
  "rn_score": 0.22
}
```
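Once deployed, the model could be queried roughly as below; the LiftWing endpoint path and model name here are assumptions and are not confirmed by this card:

```python
import requests

# Hypothetical LiftWing inference endpoint for this model.
URL = "https://api.wikimedia.org/service/lw/inference/v1/models/reference-need:predict"

response = requests.post(URL, json={"rev_id": 1221892202, "lang": "en"})
print(response.json())  # e.g. {"rn_score": 0.22}
```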
Data
The model was trained on a set of featured articles, which editors have identified as the highest-quality articles on Wikipedia. We used the mediawiki_wikitext_current table to extract the latest available revision of each featured article. The snapshot used was 2024-02.
Training data:
- Number of languages: 5 ('ru', 'es', 'de', 'fr', 'en')
- Number of sentences: 100,000
- Random sample of 20,000 sentences from each language, balanced on the ground-truth label

Test data:
- Number of languages: 5 ('ru', 'es', 'de', 'fr', 'en')
- Number of sentences: 15,000
- Random sample of 3,000 sentences from each language, balanced on the ground-truth label
Licenses
- Code:
- Model:
Citation
Cite this model as:
@misc{baigutanova_aslam_saeztrumper_2024_reference_need,
  title={Multilingual reference need model card},
  author={Baigutanova, Aitolkyn and Aslam, Muniza and Saez-Trumper, Diego},
  year={2024},
  url={this URL}
}