Multilingual reference need
This model card currently has draft status: it is a piece of model documentation that is still being written. Once the model card is complete, this template should be removed.
How can we assist editors in identifying articles with statements missing citations?
This page is an on-wiki machine learning model card.

| Model Information Hub | |
|---|---|
| Model creator(s) | Aitolkyn Baigutanova, Muniza Aslam, and Diego Saez-Trumper |
| Model owner(s) | Diego Saez-Trumper |
| Code | Inference |
| Uses PII | No |
| In production? | TBA |

This model uses revision content to predict the reference need score of that revision.
The goal of this model is to detect sentences that lack proper citations. One of Wikipedia's key policies is verifiability, which ensures that readers of the encyclopedia can check the sources of the information provided. Unfortunately, this policy is often not adhered to, resulting in numerous uncited statements. To address this issue, Wikipedia provides the citation needed tag, which allows anyone to request a source for an uncited claim. However, given the sheer volume of daily edits on Wikipedia, manually labeling such uncited claims is inefficient and prone to delays. This model therefore aims to automate the detection of uncited claims and support editors in maintaining Wikipedia's standards of verifiability.
This model is to be deployed on LiftWing.
Motivation
The model was created to support editors and to give readers further insight into the referencing quality of an article.
The model is recommended for use in the top 10 language editions of Wikipedia by active user count (as of July 2024), namely ['fa', 'it', 'zh', 'ru', 'pt', 'es', 'ja', 'de', 'fr', 'en']. For other editions of Wikipedia, the model should be used with special precautions.
Users and uses
- Compute the reference need score of a revision.
- Make predictions on language editions of Wikipedia other than those listed above (with precautions).
- Make predictions on other Wiki projects (Wiktionary, Wikinews, Wikidata, etc.).
Ethical considerations, caveats, and recommendations
- This model relies on multilingual BERT, a large language model that may contain certain biases.
- The processing time of this model grows proportionally with the number of sentences, so longer articles take longer to process. Keep in mind that an article's popularity (pageviews) may correlate with its length.
- The references the current model catches are those within a <ref> tag and templates whose names start with sfn, snf, or harv. Although these are the major referencing methods used in the largest wikis included in the experiments for this model, some citation templates may still be missed. One reason for this is folksonomy, where a single template is referred to by multiple names.
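As an illustration of the caveat above, here is a minimal sketch of how such citations can be located in wikitext using the mwparserfromhell library. The function and prefix handling are illustrative assumptions, not the model's actual parsing code:

```python
import mwparserfromhell

# Template-name prefixes treated as citations, per the caveat above.
CITATION_TEMPLATE_PREFIXES = ("sfn", "snf", "harv")

def fragment_has_citation(wikitext: str) -> bool:
    """Return True if the fragment contains a <ref> tag or a template whose
    name starts with a known citation prefix. Templates with other
    (folksonomic) names would be missed, as noted above."""
    code = mwparserfromhell.parse(wikitext)
    if code.filter_tags(matches=lambda tag: str(tag.tag).lower() == "ref"):
        return True
    return any(
        str(template.name).strip().lower().startswith(CITATION_TEMPLATE_PREFIXES)
        for template in code.filter_templates()
    )
```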
Model
The presented model is based on content features extracted from the wikitext of a revision. We used mwtokenizer to extract sentences from the main text of the article. To provide additional context to the model, we included the section heading along with the sentences immediately preceding and following the target sentence within its paragraph, and fed this enriched context into a fine-tuned distilled version of the multilingual BERT model. For an additional speed-up in the production setting, we applied dynamic quantization to our model, a technique performed after training to reduce the size of the model's weights. Specifically, we represented 32-bit floating-point weights as more compact 8-bit integers.
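A minimal sketch of this quantization step using the standard PyTorch post-training API; the stand-in model is an assumption, not the exact production code:

```python
import torch

# Stand-in for the fine-tuned classifier; in practice this would be the
# DistilBERT-based model described above.
model = torch.nn.Sequential(
    torch.nn.Linear(768, 256), torch.nn.ReLU(), torch.nn.Linear(256, 2)
)

# Post-training dynamic quantization: the 32-bit floating-point weights of
# the linear layers are stored as 8-bit integers; activations stay in float.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```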
1. Prepare text features:
- Process the wikitext and locate the inline citations used
- Tokenize the revision into sentences
- Supplement each sentence with its section heading and the immediately preceding and following sentences
2. Classification
- Pass the features to the model
- Obtain a prediction for each uncited sentence in the revision
3. Compute the reference need score
- Based on the model's per-sentence predictions and the citations already present in the revision, compute the reference need score as the proportion of uncited sentences that the model predicts require a citation, out of the total number of uncited sentences in the revision (see the sketch after this list)
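A minimal sketch of steps 1-3, under the following assumptions: `sentences` is the tokenized main text, `is_cited[i]` says whether sentence `i` already carries an inline citation, and `predict_needs_citation` is a stand-in for the fine-tuned classifier. For simplicity, it does not restrict the neighbouring-sentence context to the sentence's own paragraph, as the real pipeline does:

```python
from typing import Callable, Sequence

def build_context(sentences: Sequence[str], i: int, heading: str) -> str:
    # Step 1: enrich the target sentence with its section heading and the
    # immediately preceding and following sentences.
    prev_s = sentences[i - 1] if i > 0 else ""
    next_s = sentences[i + 1] if i + 1 < len(sentences) else ""
    return " ".join(p for p in (heading, prev_s, sentences[i], next_s) if p)

def reference_need_score(
    sentences: Sequence[str],
    is_cited: Sequence[bool],
    heading: str,
    predict_needs_citation: Callable[[str], bool],
) -> float:
    # Step 2: classify every uncited sentence.
    uncited = [i for i, cited in enumerate(is_cited) if not cited]
    if not uncited:
        return 0.0
    flagged = sum(
        predict_needs_citation(build_context(sentences, i, heading))
        for i in uncited
    )
    # Step 3: proportion of uncited sentences predicted to need a citation.
    return flagged / len(uncited)
```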
Performance
The model's performance was tested on the five languages in the dataset. We evaluated the model using the F1-score, PR-AUC, and ROC-AUC metrics.
| wiki_db | F1-score | PR-AUC | ROC-AUC |
|---|---|---|---|
| enwiki | 0.742 | 0.835 | 0.833 |
| eswiki | 0.697 | 0.749 | 0.755 |
| frwiki | 0.677 | 0.729 | 0.740 |
| dewiki | 0.693 | 0.765 | 0.765 |
| ruwiki | 0.725 | 0.817 | 0.807 |
| All | 0.705 | 0.778 | 0.774 |
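For reference, metrics of this kind can be computed with scikit-learn as in the sketch below, where `y_true` are ground-truth labels and `y_score` are predicted probabilities for the "needs citation" class; the 0.5 threshold for the F1-score is an assumption:

```python
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

def evaluate(y_true, y_score, threshold=0.5):
    # F1 needs hard labels; PR-AUC and ROC-AUC use the raw scores.
    y_pred = [int(score >= threshold) for score in y_score]
    return {
        "F1-score": f1_score(y_true, y_pred),
        "PR-AUC": average_precision_score(y_true, y_score),
        "ROC-AUC": roc_auc_score(y_true, y_score),
    }
```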
We further tested the time performance on a random sample of 1,000 revisions. The plots below present the cumulative distribution function (CDF) of the time taken per revision. Note that performance was measured on a CPU with a batch size of 1; the time could be significantly reduced in a GPU setting.
Implementation
Fine-tuning of the distilled multilingual BERT model:
- Learning rate: 2e-5
- Weight Decay: 0.01
- Epochs: 3
- Maximum input length: 128
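A minimal sketch of this fine-tuning setup with Hugging Face Transformers; the checkpoint name and dataset handling are assumptions, and only the hyperparameters listed above come from this card:

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Assumed checkpoint; the card only says "distilled multilingual BERT".
CHECKPOINT = "distilbert-base-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)

def tokenize(batch):
    # Enforce the maximum input length of 128 tokens.
    return tokenizer(batch["text"], truncation=True, max_length=128)

args = TrainingArguments(
    output_dir="reference-need-model",
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=3,
)

# trainer = Trainer(model=model, args=args, train_dataset=..., eval_dataset=...)
# trainer.train()
```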
```
{
  rn_score: <float [0-1]>
}
```

Input

```
{"rev_id": 1221892202, "lang": "en"}
```

Output

```
{
  "rn_score": 0.22
}
```
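Once deployed, the model could be queried roughly as below; the LiftWing endpoint path and model name here are assumptions and are not confirmed by this card:

```python
import requests

# Hypothetical LiftWing inference endpoint for this model.
URL = "https://api.wikimedia.org/service/lw/inference/v1/models/reference-need:predict"

response = requests.post(URL, json={"rev_id": 1221892202, "lang": "en"})
print(response.json())  # e.g. {"rn_score": 0.22}
```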
Data
The model was trained on a set of featured articles, which editors have identified as the highest-quality articles on Wikipedia. We used the mediawiki_wikitext_current table to extract the latest available revision of each featured article. The snapshot used was 2024-02.
Training data:
- Number of languages: 5 ('ru', 'es', 'de', 'fr', 'en')
- Number of sentences: 100,000
- Random sample of 20,000 sentences from each language, balanced on the ground-truth label

Test data:
- Number of languages: 5 ('ru', 'es', 'de', 'fr', 'en')
- Number of sentences: 15,000
- Random sample of 3,000 sentences from each language, balanced on the ground-truth label
Licenses
- Code:
- Model:
Citation
Cite this model as:
@misc{baigutanova_aslam_saeztrumper_2024_reference_need,
  title={Multilingual reference need model card},
  author={Baigutanova, Aitolkyn and Aslam, Muniza and Saez-Trumper, Diego},
  year={2024},
  url={this URL}
}