Machine learning models/Production/Turkish Wikipedia article quality


Model card
This page is an on-wiki machine learning model card.
A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub
Model creator(s): Aaron Halfaker (User:EpochFail) and Amir Sarabadani
Model owner(s): WMF Machine Learning Team (ml@wikimediafoundation.org)
Model interface: ORES homepage
Code: ORES GitHub, ORES training data, and ORES model binaries
Uses PII: No
In production?: Yes
Which projects?: Turkish Wikipedia
This model uses data about a revision to predict the likelihood that the article is of a certain quality level.


Motivation

This model card describes a model for predicting the quality of Wikipedia articles. It uses structural features extracted from the article to label Wikipedia articles with a probability score for each article quality class.

Wikipedia articles range in quality from rich, well-illustrated, fully referenced articles that cover their topic thoroughly and are easy to read, to single-sentence stubs that define the topic but offer little more. It is very useful to be able to reliably distinguish between these extremes and the stages of quality between them. Wikipedia editors have developed rich rubrics for evaluating article quality and are constantly assessing articles to help coordinate work on the wikis. Editors use these quality scores to evaluate and prioritize their work. Researchers use them to understand content dynamics. Developers use them as filters when building recommender systems or other tools.

Wikipedia is always changing, which makes it time-consuming (and largely impossible) for editors to keep these quality assessments complete and up to date. An automatic quality model can help fill these gaps by evaluating the quality of articles that are unassessed or have changed substantially since they were last assessed. In doing so, it can provide researchers and tool developers with more consistent data and may help editors identify articles that would benefit from a human assessment.

Users and uses

Use this model for
  • high-level analyses of article quality trends
  • filtering / ranking articles in tools – e.g. only show low-quality articles in a recommender system
  • identifying potential ways to improve articles – e.g. using the lowest-value feature from the model as a recommendation
Don't use this model for
  • projects outside of Turkish Wikipedia
  • pages outside namespace 0 (the article namespace), disambiguation pages, or redirects
  • directly comparing article quality across language editions – scores are relative to a given project. For example, an article that received a 0.5 score on English Wikipedia would get a much higher score if it had been on Simple English Wikipedia instead, because high-quality articles on English Wikipedia generally have more content than high-quality articles on Simple English Wikipedia
Current uses

This model is part of ORES and is generally accessible via the ORES API. It is used for high-level analysis of Wikipedia, platform research, and other on-wiki tasks.

Example API call:
https://ores.wikimedia.org/v3/scores/trwiki/1234/articlequality
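A minimal sketch of how the example call above could be issued from Python with the standard library only. The helper names `build_score_url` and `fetch_score` and the User-Agent string are hypothetical, and the sketch assumes the endpoint shown above is reachable and returns JSON.

```python
import json
import urllib.request

# Base of the example URL shown above.
ORES_BASE = "https://ores.wikimedia.org/v3/scores"


def build_score_url(wiki: str, rev_id: int, model: str = "articlequality") -> str:
    """Build a scoring URL with the same shape as the example call."""
    return f"{ORES_BASE}/{wiki}/{rev_id}/{model}"


def fetch_score(wiki: str, rev_id: int, model: str = "articlequality",
                timeout: float = 10.0) -> dict:
    """Request a score and decode the JSON body (this performs a network call)."""
    request = urllib.request.Request(
        build_score_url(wiki, rev_id, model),
        headers={"User-Agent": "model-card-example/0.1"},  # hypothetical agent string
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


print(build_score_url("trwiki", 1234))
```

Calling `fetch_score("trwiki", 1234)` would then return the decoded response as a dictionary, provided the service answers.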

Ethical considerations, caveats, and recommendations

  • The source data for this model is several years old, so data drift may skew current outputs relative to the training data.
  • The model does not currently take into account the quality of the writing itself, so a long article with many fake words would register as high quality. It does take structure into account, though, so a long article would be penalized if it had few sections or was poorly referenced.
  • Different wikis have different labeling schemes, so do not use this model in conjunction with other models to conduct an interwiki analysis.

Model

Performance

Test data confusion matrix:

Label      n     ~taslak  ~baslagıç  ~c    ~b    ~km   ~sm
taslak     267   192      72         3     0     0     0
baslagıç   271   64       158        43    2     4     0
c          269   13       50         131   39    29    7
b          270   5        17         38    127   38    45
km         267   1        6          20    14    184   42
sm         270   2        3          6     26    41    192
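One way to read the matrix is that each row is a true label and each `~` column is a predicted label; this reading is an interpretation rather than something the card states. Under it, the diagonal entry divided by the row total gives per-label recall, and the values obtained this way appear to agree with the recall row under "Test data performance". A short sketch, with the helper name `recall_per_label` being hypothetical:

```python
LABELS = ["taslak", "baslagıç", "c", "b", "km", "sm"]

# Rows: true label. Values: predicted counts in LABELS order (assumed reading).
CONFUSION = {
    "taslak":   [192, 72, 3, 0, 0, 0],
    "baslagıç": [64, 158, 43, 2, 4, 0],
    "c":        [13, 50, 131, 39, 29, 7],
    "b":        [5, 17, 38, 127, 38, 45],
    "km":       [1, 6, 20, 14, 184, 42],
    "sm":       [2, 3, 6, 26, 41, 192],
}


def recall_per_label(confusion, labels):
    """Recall = correctly predicted observations / all observations of that true label."""
    recalls = {}
    for index, label in enumerate(labels):
        row = confusion[label]
        recalls[label] = row[index] / sum(row)
    return recalls


RECALL = recall_per_label(CONFUSION, LABELS)
for label in LABELS:
    print(f"{label}: {RECALL[label]:.3f}")
```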

Test data sample rates:

            taslak  baslagıç  c      b      km     sm
sample      0.165   0.168     0.167  0.167  0.165  0.167
population  0.58    0.248     0.086  0.053  0.017  0.016
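The sample row is consistent with each label's test count (the `n` column of the confusion matrix) divided by the total number of test observations. The sketch below recomputes it and also shows how far each class's share on the wiki departs from its share in the roughly balanced test sample. The `REPRESENTATION` ratio is only an illustration and is not claimed to be how the statistics in this card were weighted.

```python
LABELS = ["taslak", "baslagıç", "c", "b", "km", "sm"]

# n column of the confusion matrix and the population row of the table above.
TEST_COUNTS = dict(zip(LABELS, [267, 271, 269, 270, 267, 270]))
POPULATION_RATES = dict(zip(LABELS, [0.58, 0.248, 0.086, 0.053, 0.017, 0.016]))

total = sum(TEST_COUNTS.values())
SAMPLE_RATES = {label: count / total for label, count in TEST_COUNTS.items()}

# Ratio above 1: the class is more common on the wiki than in the test sample.
REPRESENTATION = {
    label: POPULATION_RATES[label] / SAMPLE_RATES[label] for label in LABELS
}

for label in LABELS:
    print(f"{label}: sample={SAMPLE_RATES[label]:.3f} "
          f"population/sample={REPRESENTATION[label]:.2f}")
```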

Test data performance:

Statistic    taslak  baslagıç  c      b      km     sm
match_rate   0.444   0.227     0.117  0.082  0.093  0.08
filter_rate  0.556   0.773     0.883  0.918  0.907  0.92
recall       0.719   0.583     0.487  0.47   0.689  0.711
precision    0.94    0.635     0.359  0.305  0.125  0.14
f1           0.815   0.608     0.413  0.37   0.212  0.234
accuracy     0.81    0.814     0.881  0.915  0.913  0.927
fpr          0.063   0.11      0.082  0.06   0.083  0.07
roc_auc      0.947   0.88      0.851  0.856  0.92   0.935
pr_auc       0.95    0.641     0.493  0.372  0.236  0.322
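Two rows of this table can be cross-checked against the others: the filter rate is one minus the match rate, and F1 is the harmonic mean of precision and recall. The sketch below recomputes both from the tabulated values, which agree to three decimals. It makes no claim about how precision or the other statistics were derived.

```python
LABELS = ["taslak", "baslagıç", "c", "b", "km", "sm"]

# Rows copied from the performance table above.
MATCH_RATE = [0.444, 0.227, 0.117, 0.082, 0.093, 0.08]
FILTER_RATE = [0.556, 0.773, 0.883, 0.918, 0.907, 0.92]
RECALL = [0.719, 0.583, 0.487, 0.47, 0.689, 0.711]
PRECISION = [0.94, 0.635, 0.359, 0.305, 0.125, 0.14]
F1 = [0.815, 0.608, 0.413, 0.37, 0.212, 0.234]


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


F1_RECOMPUTED = [round(f1_score(p, r), 3) for p, r in zip(PRECISION, RECALL)]
FILTER_RECOMPUTED = [round(1 - m, 3) for m in MATCH_RATE]

for label, f1, filter_rate in zip(LABELS, F1_RECOMPUTED, FILTER_RECOMPUTED):
    print(f"{label}: f1={f1} filter_rate={filter_rate}")
```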

Implementation

Model architecture
{
    "type": "GradientBoosting",
    "params": {
        "warm_start": false,
        "center": true,
        "min_weight_fraction_leaf": 0.0,
        "loss": "deviance",
        "min_impurity_decrease": 0.0,
        "min_samples_split": 2,
        "max_features": "log2",
        "min_samples_leaf": 1,
        "max_depth": 5,
        "n_iter_no_change": null,
        "multilabel": false,
        "scale": true,
        "max_leaf_nodes": null,
        "label_weights": null,
        "presort": "deprecated",
        "n_estimators": 300,
        "criterion": "friedman_mse",
        "verbose": 0,
        "population_rates": null,
        "random_state": null,
        "validation_fraction": 0.1,
        "subsample": 1.0,
        "labels": [
            "taslak",
            "baslag\u0131\u00e7",
            "c",
            "b",
            "km",
            "sm"
        ],
        "ccp_alpha": 0.0,
        "learning_rate": 0.01,
        "min_impurity_split": null,
        "init": null,
        "tol": 0.0001
    }
}
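The parameter block appears to mix two kinds of settings. Most keys match constructor arguments of scikit-learn's `GradientBoostingClassifier`, while `labels`, `multilabel`, `label_weights`, `population_rates`, `scale` and `center` do not, and are presumably consumed by the revscoring wrapper around the estimator. The sketch below separates the two groups using only the standard library. The `WRAPPER_KEYS` set and the `split_params` helper are assumptions made for illustration, and only a subset of the parameters is repeated. Note also that `presort`, `min_impurity_split` and the loss name `deviance` reflect an older scikit-learn release and may not be accepted by current versions.

```python
# A subset of the "params" object shown above, repeated for brevity.
PARAMS = {
    "center": True,
    "scale": True,
    "multilabel": False,
    "label_weights": None,
    "population_rates": None,
    "labels": ["taslak", "baslagıç", "c", "b", "km", "sm"],
    "max_depth": 5,
    "max_features": "log2",
    "n_estimators": 300,
    "learning_rate": 0.01,
    "subsample": 1.0,
}

# Assumption: keys handled by the wrapper rather than by the estimator.
WRAPPER_KEYS = {
    "labels", "multilabel", "label_weights",
    "population_rates", "scale", "center",
}


def split_params(params):
    """Return (wrapper_params, estimator_params) according to WRAPPER_KEYS."""
    wrapper = {k: v for k, v in params.items() if k in WRAPPER_KEYS}
    estimator = {k: v for k, v in params.items() if k not in WRAPPER_KEYS}
    return wrapper, estimator


wrapper_params, estimator_params = split_params(PARAMS)
print(sorted(estimator_params))
```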
Output schema
{
    "type": "object",
    "title": "Scikit learn-based classifier score with probability",
    "properties": {
        "probability": {
            "type": "object",
            "description": "A mapping of probabilities onto each of the potential output labels",
            "properties": {
                "taslak": {
                    "type": "number"
                },
                "baslag\u0131\u00e7": {
                    "type": "number"
                },
                "c": {
                    "type": "number"
                },
                "b": {
                    "type": "number"
                },
                "km": {
                    "type": "number"
                },
                "sm": {
                    "type": "number"
                }
            }
        },
        "prediction": {
            "type": "string",
            "description": "The most likely label predicted by the estimator"
        }
    }
}
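A consumer could check a score against this schema before using it. The sketch below is a hand-written check rather than a JSON Schema validator, and the function name `check_score` is hypothetical. Two of its checks go beyond the schema and are assumptions: that every label must be present, and that the probabilities sum to 1. The check that `prediction` is the highest-probability label follows the schema's description of that field.

```python
import math

LABELS = ["taslak", "baslagıç", "c", "b", "km", "sm"]


def check_score(score, labels=LABELS, tol=1e-6):
    """Return a list of problems; an empty list means the score looks well-formed."""
    probability = score.get("probability")
    prediction = score.get("prediction")
    if not isinstance(probability, dict):
        return ["'probability' must be an object"]

    problems = []
    if not isinstance(prediction, str):
        problems.append("'prediction' must be a string")
    for label in labels:
        value = probability.get(label)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"probability[{label!r}] must be a number")
    if problems:
        return problems

    # Assumption (not in the schema): class probabilities sum to 1.
    if not math.isclose(sum(probability[l] for l in labels), 1.0, abs_tol=tol):
        problems.append("probabilities do not sum to 1")
    # The schema describes 'prediction' as the most likely label.
    if prediction != max(labels, key=lambda l: probability[l]):
        problems.append("'prediction' is not the most probable label")
    return problems
```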
Example input and output
Input:
https://ores.wikimedia.org/v3/scores/trwiki/1234/articlequality

Output:

{
    "trwiki": {
        "models": {
            "articlequality": {
                "version": "0.8.0"
            }
        },
        "scores": {
            "1234": {
                "articlequality": {
                    "score": {
                        "prediction": "b",
                        "probability": {
                            "b": 0.5189410705089456,
                            "baslag\u0131\u00e7": 0.17445084872938216,
                            "c": 0.16595000423606762,
                            "km": 0.026004385330073404,
                            "sm": 0.08557972032207807,
                            "taslak": 0.029073970873453137
                        }
                    }
                }
            }
        }
    }
}
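The score sits several levels deep in the response, keyed by wiki, revision ID (as a string) and model name. A minimal sketch of pulling it out and ranking the labels by probability, using the example output above as data; the helper names `extract_score` and `ranked_labels` are hypothetical.

```python
EXAMPLE_RESPONSE = {
    "trwiki": {
        "models": {"articlequality": {"version": "0.8.0"}},
        "scores": {
            "1234": {
                "articlequality": {
                    "score": {
                        "prediction": "b",
                        "probability": {
                            "b": 0.5189410705089456,
                            "baslagıç": 0.17445084872938216,
                            "c": 0.16595000423606762,
                            "km": 0.026004385330073404,
                            "sm": 0.08557972032207807,
                            "taslak": 0.029073970873453137,
                        },
                    }
                }
            }
        },
    }
}


def extract_score(response, wiki, rev_id, model="articlequality"):
    """Return the score object for one revision from a response shaped like the example."""
    return response[wiki]["scores"][str(rev_id)][model]["score"]


def ranked_labels(score):
    """Return (label, probability) pairs, most probable first."""
    return sorted(score["probability"].items(), key=lambda item: item[1], reverse=True)


score = extract_score(EXAMPLE_RESPONSE, "trwiki", 1234)
print(score["prediction"])
for label, probability in ranked_labels(score):
    print(f"{label}: {probability:.3f}")
```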

Data

Data pipeline
Labels were collected from on-wiki judgements of article quality, and then joined with revision features to create a source dataset.
Training data
Training data was automatically split off from test data using functionality from the revscoring repository.
Test data
Test data was automatically and randomly split off from train data using functionality from the revscoring repository and held out during the training process. The model then makes a prediction on that data, which is compared to the underlying ground truth to calculate performance statistics.
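The idea of a random held-out test set can be sketched with the standard library. This is a generic illustration and not the revscoring implementation; the function name `holdout_split`, the `test_fraction` of 0.2 and the `seed` are hypothetical, since the card does not state the split ratio.

```python
import random


def holdout_split(observations, test_fraction=0.2, seed=0):
    """Shuffle a copy of the observations and split it into (train, test)."""
    shuffled = list(observations)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train, test = holdout_split(range(100), test_fraction=0.2, seed=1)
print(len(train), len(test))
```

The test portion is never shown to the model during training, so predictions on it can be compared with the ground-truth labels to produce statistics like those in the Performance section.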

Licenses

Citation

Cite this model card as:

@misc{
  Triedman_Bazira_2023_Turkish_Wikipedia_article_quality,
  title={ Turkish Wikipedia article quality model card },
  author={ Triedman, Harold and Bazira, Kevin },
  year={ 2023 },
  url={ https://meta.wikimedia.org/wiki/Machine_learning_models/Production/Turkish_Wikipedia_article_quality }
}