Machine learning models/Production/English Wikipedia article topic


Model card
This page is an on-wiki machine learning model card.
A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub
Model creator(s): Aaron Halfaker (User:EpochFail) and Amir Sarabadani
Model owner(s): WMF Machine Learning Team (ml@wikimediafoundation.org)
Model interface: ORES homepage
Code: drafttopic GitHub, ORES training data, and ORES model binaries
Uses PII: No
In production? Yes
Which projects? English Wikipedia
This model uses article text to predict the likelihood that the article belongs to a set of topics.


Motivation

How can we predict what general topic an article is in? Answering this question is useful for various analyses of Wikipedia dynamics. However, it is difficult to group a very diverse range of Wikipedia articles into coherent, consistent topics manually.

This model, part of the ORES suite of models, analyzes an article to predict its likelihood of belonging to a set of topics. Similar models (though not necessarily with the same performance level or topics) are deployed across about a dozen other projects. There is also a language-agnostic article topic model.

This model may be useful for high-level analyses of Wikipedia dynamics (pageviews, article quality, edit trends) and filtering articles.

Users and uses

Use this model for
  • high-level analyses of Wikipedia dynamics such as pageview, article quality, or edit trends — e.g. How are pageview dynamics different between the physics and biology categories?
  • filtering to relevant articles — e.g. filter articles only to those in the music category.
Don't use this model for
  • definitively establishing what topic an article pertains to
  • automated editing of articles or topics without a human in the loop
Current uses

This model is a part of ORES, and generally accessible via API. It is used for high-level analysis of Wikipedia, platform research, and other on-wiki tasks.

Example API call:
https://ores.wikimedia.org/v3/scores/enwiki/1234/articletopic
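A minimal Python sketch of building this request URL and fetching it with the standard library. The helper names `build_score_url` and `fetch_score` are illustrative and not part of any ORES client, and the fetch assumes the endpoint is reachable from where the code runs.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

ORES_BASE = "https://ores.wikimedia.org/v3/scores"  # base URL taken from the example call


def build_score_url(rev_id: int, wiki: str = "enwiki", model: str = "articletopic") -> str:
    """Return the scoring URL for one revision, following the pattern of the example call."""
    return f"{ORES_BASE}/{quote(wiki)}/{int(rev_id)}/{quote(model)}"


def fetch_score(rev_id: int, timeout: float = 10.0) -> dict:
    """Fetch and decode the JSON response (requires network access)."""
    with urlopen(build_score_url(rev_id), timeout=timeout) as response:
        return json.load(response)
```

Calling `build_score_url(1234)` returns the example URL shown above; `fetch_score(1234)` would return the decoded JSON document for that revision.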

Ethical considerations, caveats, and recommendations

  • This model was trained on data that is now several years old (from mid-2020). Underlying data drift may skew model outputs.
  • This model uses word2vec as a training feature. Word2vec, like other natural language embeddings, encodes the linguistic biases of underlying datasets — along the lines of gender, race, ethnicity, religion etc. Since Wikipedia has known biases in its text, this model may encode and at times reproduce those biases.
  • This model has highly variable performance across different topics — consult the test statistics below to get a sense of inter-topic performance.

Model

Performance

Test data confusion matrix:

Label n True positive False negative False positive True negative
Culture.Biography.Biography* 16539 14663 1876 1218 46123
Culture.Biography.Women 4162 3180 982 812 58906
Culture.Food and drink 1301 916 385 69 62510
Culture.Internet culture 2969 2307 662 200 60711
Culture.Linguistics 1367 1027 340 104 62409
Culture.Literature 5298 3939 1359 433 58149
Culture.Media.Books 1906 1355 551 198 61776
Culture.Media.Entertainment 1825 917 908 173 61882
Culture.Media.Films 2347 1940 407 145 61388
Culture.Media.Media* 14457 12411 2046 1445 47978
Culture.Media.Music 2665 2162 503 292 60923
Culture.Media.Radio 1191 963 228 28 62661
Culture.Media.Software 1784 1028 756 349 61747
Culture.Media.Television 2224 1694 530 179 61477
Culture.Media.Video games 2114 1904 210 36 61730
Culture.Performing arts 1324 854 470 101 62455
Culture.Philosophy and religion 2712 1579 1133 339 60829
Culture.Sports 5901 5302 599 306 57673
Culture.Visual arts.Architecture 2571 1906 665 250 61059
Culture.Visual arts.Comics and Anime 1509 1231 278 91 62280
Culture.Visual arts.Fashion 1173 805 368 46 62661
Culture.Visual arts.Visual arts* 5989 4575 1414 521 57370
Geography.Geographical 3518 2284 1234 278 60084
Geography.Regions.Africa.Africa* 6484 5678 806 359 57037
Geography.Regions.Africa.Central Africa 1155 887 268 35 62690
Geography.Regions.Africa.Eastern Africa 1100 901 199 36 62744
Geography.Regions.Africa.Northern Africa 1309 999 310 94 62477
Geography.Regions.Africa.Southern Africa 1260 1011 249 43 62577
Geography.Regions.Africa.Western Africa 1151 958 193 65 62664
Geography.Regions.Americas.Central America 1302 931 371 88 62490
Geography.Regions.Americas.North America 7482 5303 2179 1251 55147
Geography.Regions.Americas.South America 1575 1156 419 134 62171
Geography.Regions.Asia.Asia* 11206 9648 1558 824 51850
Geography.Regions.Asia.Central Asia 1133 889 244 45 62702
Geography.Regions.Asia.East Asia 2749 2089 660 256 60875
Geography.Regions.Asia.North Asia 1369 929 440 183 62328
Geography.Regions.Asia.South Asia 2428 2104 324 118 61334
Geography.Regions.Asia.Southeast Asia 1726 1358 368 103 62051
Geography.Regions.Asia.West Asia 2301 1897 404 124 61455
Geography.Regions.Europe.Eastern Europe 3088 2460 628 292 60500
Geography.Regions.Europe.Europe* 12265 9544 2721 1743 49872
Geography.Regions.Europe.Northern Europe 4099 2867 1232 637 59144
Geography.Regions.Europe.Southern Europe 2397 1720 677 297 61186
Geography.Regions.Europe.Western Europe 3062 2119 943 444 60374
Geography.Regions.Oceania 2535 2103 432 146 61199
History and Society.Business and economics 3458 1651 1807 561 59861
History and Society.Education 2204 1073 1131 244 61432
History and Society.History 3307 1380 1927 513 60060
History and Society.Military and warfare 4048 2928 1120 380 59452
History and Society.Politics and government 4604 2919 1685 468 58808
History and Society.Society 4009 1667 2342 407 59464
History and Society.Transportation 3601 3115 486 196 60083
STEM.Biology 2951 2421 530 146 60783
STEM.Chemistry 1319 933 386 152 62409
STEM.Computing 2102 1394 708 427 61351
STEM.Earth and environment 1619 1156 463 123 62138
STEM.Engineering 2361 1704 657 184 61335
STEM.Libraries & Information 1165 691 474 80 62635
STEM.Mathematics 1136 766 370 69 62675
STEM.Medicine & Health 1784 1200 584 167 61929
STEM.Physics 1173 766 407 135 62572
STEM.STEM* 16613 14497 2116 1062 46205
STEM.Space 1412 1219 193 50 62418
STEM.Technology 3691 2310 1381 611 59578
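As a quick consistency check, the counts in a row can be combined by hand. The sketch below uses the first row (Culture.Biography.Biography*) and assumes, based on the arithmetic, that the count directly after the true positives is the number of labeled articles the model missed, since those two counts sum to n. The variable names are illustrative.

```python
# Counts copied from the Culture.Biography.Biography* row of the table above.
n = 16539                # test articles carrying the label
true_positive = 14663
missed = 1876            # labeled articles the model did not flag (n - true_positive)
wrongly_flagged = 1218   # unlabeled articles the model flagged
true_negative = 46123

# The two "labeled" cells add up to n; all four cells add up to the test set size.
assert true_positive + missed == n
test_set_size = true_positive + missed + wrongly_flagged + true_negative

# Recall on the test sample: share of labeled articles that the model found.
sample_recall = true_positive / n

print(test_set_size, round(sample_recall, 3))  # 63880 0.887
```

The same total of 63,880 test articles results from every row, and the recall of 0.887 matches the value reported for this label in the performance table.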

Test data sample rates:

Label Sample rate Population rate
Culture.Biography.Biography* 0.259 0.123
Culture.Biography.Women 0.065 0.015
Culture.Food and drink 0.02 0.002
Culture.Internet culture 0.046 0.003
Culture.Linguistics 0.021 0.007
Culture.Literature 0.083 0.015
Culture.Media.Books 0.03 0.004
Culture.Media.Entertainment 0.029 0.004
Culture.Media.Films 0.037 0.011
Culture.Media.Media* 0.226 0.058
Culture.Media.Music 0.042 0.024
Culture.Media.Radio 0.019 0.002
Culture.Media.Software 0.028 0.001
Culture.Media.Television 0.035 0.009
Culture.Media.Video games 0.033 0.003
Culture.Performing arts 0.021 0.003
Culture.Philosophy and religion 0.042 0.011
Culture.Sports 0.092 0.071
Culture.Visual arts.Architecture 0.04 0.011
Culture.Visual arts.Comics and Anime 0.024 0.002
Culture.Visual arts.Fashion 0.018 0.001
Culture.Visual arts.Visual arts* 0.094 0.018
Geography.Geographical 0.055 0.024
Geography.Regions.Africa.Africa* 0.102 0.008
Geography.Regions.Africa.Central Africa 0.018 0.001
Geography.Regions.Africa.Eastern Africa 0.017 0
Geography.Regions.Africa.Northern Africa 0.02 0.001
Geography.Regions.Africa.Southern Africa 0.02 0.001
Geography.Regions.Africa.Western Africa 0.018 0.001
Geography.Regions.Americas.Central America 0.02 0.003
Geography.Regions.Americas.North America 0.117 0.064
Geography.Regions.Americas.South America 0.025 0.006
Geography.Regions.Asia.Asia* 0.175 0.045
Geography.Regions.Asia.Central Asia 0.018 0.001
Geography.Regions.Asia.East Asia 0.043 0.011
Geography.Regions.Asia.North Asia 0.021 0.001
Geography.Regions.Asia.South Asia 0.038 0.015
Geography.Regions.Asia.Southeast Asia 0.027 0.006
Geography.Regions.Asia.West Asia 0.036 0.011
Geography.Regions.Europe.Eastern Europe 0.048 0.013
Geography.Regions.Europe.Europe* 0.192 0.076
Geography.Regions.Europe.Northern Europe 0.064 0.031
Geography.Regions.Europe.Southern Europe 0.038 0.013
Geography.Regions.Europe.Western Europe 0.048 0.019
Geography.Regions.Oceania 0.04 0.015
History and Society.Business and economics 0.054 0.01
History and Society.Education 0.035 0.007
History and Society.History 0.052 0.011
History and Society.Military and warfare 0.063 0.014
History and Society.Politics and government 0.072 0.028
History and Society.Society 0.063 0.013
History and Society.Transportation 0.056 0.015
STEM.Biology 0.046 0.034
STEM.Chemistry 0.021 0.002
STEM.Computing 0.033 0.003
STEM.Earth and environment 0.025 0.005
STEM.Engineering 0.037 0.005
STEM.Libraries & Information 0.018 0.001
STEM.Mathematics 0.018 0
STEM.Medicine & Health 0.028 0.006
STEM.Physics 0.018 0.001
STEM.STEM* 0.26 0.069
STEM.Space 0.022 0.006
STEM.Technology 0.058 0.005

Test data performance:

Label Match rate Filter rate Recall Precision F1 Accuracy ROC AUC PR AUC
Culture.Biography.Biography* 0.132 0.868 0.887 0.829 0.857 0.963 0.978 0.909
Culture.Biography.Women 0.025 0.975 0.764 0.454 0.569 0.983 0.981 0.567
Culture.Food and drink 0.003 0.997 0.704 0.612 0.655 0.998 0.982 0.625
Culture.Internet culture 0.006 0.994 0.777 0.454 0.573 0.996 0.986 0.739
Culture.Linguistics 0.007 0.993 0.751 0.769 0.76 0.997 0.978 0.785
Culture.Literature 0.019 0.981 0.743 0.613 0.672 0.989 0.977 0.742
Culture.Media.Books 0.006 0.994 0.711 0.474 0.568 0.996 0.981 0.597
Culture.Media.Entertainment 0.005 0.995 0.502 0.393 0.441 0.995 0.968 0.415
Culture.Media.Films 0.011 0.989 0.827 0.788 0.807 0.996 0.983 0.844
Culture.Media.Media* 0.078 0.922 0.858 0.646 0.737 0.964 0.978 0.847
Culture.Media.Music 0.024 0.976 0.811 0.806 0.809 0.991 0.985 0.853
Culture.Media.Radio 0.002 0.998 0.809 0.796 0.802 0.999 0.987 0.85
Culture.Media.Software 0.006 0.994 0.576 0.12 0.199 0.994 0.98 0.187
Culture.Media.Television 0.01 0.99 0.762 0.7 0.73 0.995 0.981 0.77
Culture.Media.Video games 0.003 0.997 0.901 0.802 0.848 0.999 0.993 0.902
Culture.Performing arts 0.003 0.997 0.645 0.536 0.586 0.997 0.981 0.593
Culture.Philosophy and religion 0.012 0.988 0.582 0.531 0.555 0.99 0.964 0.538
Culture.Sports 0.069 0.931 0.898 0.929 0.913 0.988 0.984 0.947
Culture.Visual arts.Architecture 0.012 0.988 0.741 0.66 0.698 0.993 0.983 0.749
Culture.Visual arts.Comics and Anime 0.003 0.997 0.816 0.552 0.658 0.998 0.986 0.74
Culture.Visual arts.Fashion 0.001 0.999 0.686 0.431 0.529 0.999 0.983 0.513
Culture.Visual arts.Visual arts* 0.023 0.977 0.764 0.613 0.68 0.987 0.976 0.756
Geography.Geographical 0.02 0.98 0.649 0.773 0.706 0.987 0.97 0.757
Geography.Regions.Africa.Africa* 0.013 0.987 0.876 0.524 0.656 0.993 0.985 0.718
Geography.Regions.Africa.Central Africa 0.001 0.999 0.768 0.465 0.579 0.999 0.988 0.627
Geography.Regions.Africa.Eastern Africa 0.001 0.999 0.819 0.394 0.532 0.999 0.984 0.459
Geography.Regions.Africa.Northern Africa 0.002 0.998 0.763 0.384 0.511 0.998 0.981 0.422
Geography.Regions.Africa.Southern Africa 0.002 0.998 0.802 0.579 0.673 0.999 0.985 0.654
Geography.Regions.Africa.Western Africa 0.002 0.998 0.832 0.355 0.497 0.999 0.983 0.446
Geography.Regions.Americas.Central America 0.004 0.996 0.715 0.627 0.668 0.998 0.982 0.672
Geography.Regions.Americas.North America 0.066 0.934 0.709 0.687 0.698 0.961 0.966 0.767
Geography.Regions.Americas.South America 0.007 0.993 0.734 0.684 0.708 0.996 0.983 0.745
Geography.Regions.Asia.Asia* 0.054 0.946 0.861 0.724 0.787 0.979 0.98 0.841
Geography.Regions.Asia.Central Asia 0.001 0.999 0.785 0.487 0.601 0.999 0.987 0.714
Geography.Regions.Asia.East Asia 0.013 0.987 0.76 0.677 0.716 0.993 0.981 0.739
Geography.Regions.Asia.North Asia 0.004 0.996 0.679 0.176 0.28 0.997 0.985 0.223
Geography.Regions.Asia.South Asia 0.015 0.985 0.867 0.874 0.87 0.996 0.985 0.896
Geography.Regions.Asia.Southeast Asia 0.006 0.994 0.787 0.742 0.763 0.997 0.982 0.74
Geography.Regions.Asia.West Asia 0.011 0.989 0.824 0.819 0.822 0.996 0.986 0.875
Geography.Regions.Europe.Eastern Europe 0.015 0.985 0.797 0.683 0.735 0.993 0.984 0.775
Geography.Regions.Europe.Europe* 0.09 0.91 0.778 0.655 0.711 0.952 0.964 0.768
Geography.Regions.Europe.Northern Europe 0.032 0.968 0.699 0.674 0.687 0.98 0.973 0.713
Geography.Regions.Europe.Southern Europe 0.014 0.986 0.718 0.662 0.689 0.992 0.977 0.711
Geography.Regions.Europe.Western Europe 0.02 0.98 0.692 0.649 0.67 0.987 0.976 0.684
Geography.Regions.Oceania 0.015 0.985 0.83 0.843 0.836 0.995 0.985 0.86
History and Society.Business and economics 0.014 0.986 0.477 0.344 0.4 0.986 0.958 0.345
History and Society.Education 0.008 0.992 0.487 0.477 0.482 0.992 0.96 0.436
History and Society.History 0.013 0.987 0.417 0.351 0.381 0.985 0.944 0.319
History and Society.Military and warfare 0.016 0.984 0.723 0.619 0.667 0.99 0.978 0.706
History and Society.Politics and government 0.026 0.974 0.634 0.7 0.665 0.982 0.965 0.709
History and Society.Society 0.012 0.988 0.416 0.439 0.427 0.986 0.929 0.398
History and Society.Transportation 0.016 0.984 0.865 0.803 0.833 0.995 0.986 0.877
STEM.Biology 0.03 0.97 0.82 0.923 0.868 0.992 0.981 0.915
STEM.Chemistry 0.004 0.996 0.707 0.312 0.433 0.997 0.984 0.459
STEM.Computing 0.009 0.991 0.663 0.206 0.314 0.992 0.981 0.301
STEM.Earth and environment 0.005 0.995 0.714 0.622 0.665 0.997 0.978 0.691
STEM.Engineering 0.007 0.993 0.722 0.559 0.63 0.996 0.98 0.652
STEM.Libraries & Information 0.002 0.998 0.593 0.224 0.325 0.998 0.975 0.356
STEM.Mathematics 0.001 0.999 0.674 0.204 0.313 0.999 0.981 0.385
STEM.Medicine & Health 0.007 0.993 0.673 0.617 0.644 0.995 0.978 0.655
STEM.Physics 0.003 0.997 0.653 0.205 0.312 0.998 0.981 0.325
STEM.STEM* 0.081 0.919 0.873 0.743 0.802 0.97 0.977 0.897
STEM.Space 0.006 0.994 0.863 0.867 0.865 0.998 0.987 0.904
STEM.Technology 0.013 0.987 0.626 0.241 0.348 0.988 0.969 0.365
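The precision values in this table are lower than the raw confusion-matrix counts alone would give. That is consistent with the counts being re-weighted from each label's sample rate to its population rate before the statistics are computed. The sketch below is an assumption about that procedure, not a description taken from the ORES code. It also assumes that the 1,876 count in the Culture.Biography.Biography* row is the number of missed labeled articles (n minus true positives). The function name `scaled_metrics` is hypothetical.

```python
def scaled_metrics(tp, fn, fp, tn, sample_rate, population_rate):
    """Re-weight confusion-matrix counts to population label rates, then compute metrics."""
    # Labeled articles are over-sampled, so they are down-weighted;
    # unlabeled articles are up-weighted to compensate.
    pos_weight = population_rate / sample_rate
    neg_weight = (1 - population_rate) / (1 - sample_rate)
    tp, fn = tp * pos_weight, fn * pos_weight
    fp, tn = fp * neg_weight, tn * neg_weight
    total = tp + fn + fp + tn

    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    match_rate = (tp + fp) / total
    return {
        "match_rate": match_rate,
        "filter_rate": 1 - match_rate,
        "recall": recall,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / total,
    }


# Counts and rates for Culture.Biography.Biography* from the tables above.
metrics = scaled_metrics(tp=14663, fn=1876, fp=1218, tn=46123,
                         sample_rate=0.259, population_rate=0.123)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the results agree with the reported row (match rate 0.132, recall 0.887, precision 0.829, F1 0.857, accuracy 0.963) to within about 0.001. The small differences are expected because the published rates are rounded to three decimals.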

Implementation

Model architecture
{
    "type": "GradientBoosting",
    "params": {
        "min_samples_split": 2,
        "random_state": null,
        "presort": "deprecated",
        "criterion": "friedman_mse",
        "init": null,
        "n_estimators": 150,
        "label_weights": {},
        "min_weight_fraction_leaf": 0.0,
        "min_impurity_split": null,
        "scale": false,
        "max_leaf_nodes": null,
        "tol": 0.0001,
        "min_impurity_decrease": 0.0,
        "learning_rate": 0.1,
        "ccp_alpha": 0.0,
        "loss": "deviance",
        "warm_start": false,
        "labels": [
            "Culture.Biography.Biography*",
            "Culture.Biography.Women",
            "Culture.Food and drink",
            "Culture.Internet culture",
            "Culture.Linguistics",
            "Culture.Literature",
            "Culture.Media.Books",
            "Culture.Media.Entertainment",
            "Culture.Media.Films",
            "Culture.Media.Media*",
            "Culture.Media.Music",
            "Culture.Media.Radio",
            "Culture.Media.Software",
            "Culture.Media.Television",
            "Culture.Media.Video games",
            "Culture.Performing arts",
            "Culture.Philosophy and religion",
            "Culture.Sports",
            "Culture.Visual arts.Architecture",
            "Culture.Visual arts.Comics and Anime",
            "Culture.Visual arts.Fashion",
            "Culture.Visual arts.Visual arts*",
            "Geography.Geographical",
            "Geography.Regions.Africa.Africa*",
            "Geography.Regions.Africa.Central Africa",
            "Geography.Regions.Africa.Eastern Africa",
            "Geography.Regions.Africa.Northern Africa",
            "Geography.Regions.Africa.Southern Africa",
            "Geography.Regions.Africa.Western Africa",
            "Geography.Regions.Americas.Central America",
            "Geography.Regions.Americas.North America",
            "Geography.Regions.Americas.South America",
            "Geography.Regions.Asia.Asia*",
            "Geography.Regions.Asia.Central Asia",
            "Geography.Regions.Asia.East Asia",
            "Geography.Regions.Asia.North Asia",
            "Geography.Regions.Asia.South Asia",
            "Geography.Regions.Asia.Southeast Asia",
            "Geography.Regions.Asia.West Asia",
            "Geography.Regions.Europe.Eastern Europe",
            "Geography.Regions.Europe.Europe*",
            "Geography.Regions.Europe.Northern Europe",
            "Geography.Regions.Europe.Southern Europe",
            "Geography.Regions.Europe.Western Europe",
            "Geography.Regions.Oceania",
            "History and Society.Business and economics",
            "History and Society.Education",
            "History and Society.History",
            "History and Society.Military and warfare",
            "History and Society.Politics and government",
            "History and Society.Society",
            "History and Society.Transportation",
            "STEM.Biology",
            "STEM.Chemistry",
            "STEM.Computing",
            "STEM.Earth and environment",
            "STEM.Engineering",
            "STEM.Libraries & Information",
            "STEM.Mathematics",
            "STEM.Medicine & Health",
            "STEM.Physics",
            "STEM.STEM*",
            "STEM.Space",
            "STEM.Technology"
        ],
        "n_iter_no_change": null,
        "min_samples_leaf": 1,
        "subsample": 1.0,
        "population_rates": null,
        "validation_fraction": 0.1,
        "max_features": "log2",
        "max_depth": 5,
        "multilabel": true,
        "center": false,
        "verbose": 0
    }
}
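The parameter block mixes two kinds of settings. Most keys match constructor arguments of scikit-learn's `GradientBoostingClassifier` (some, such as `presort` and `min_impurity_split`, exist only in older scikit-learn releases), while `labels`, `multilabel`, `population_rates`, `label_weights`, `scale`, and `center` appear to belong to the ORES model wrapper. The split below is an assumption based on the key names, and `WRAPPER_KEYS` is an illustrative name.

```python
import json

# Abbreviated copy of the "params" object above (the 64-entry "labels" list is shortened).
params = json.loads("""
{
    "n_estimators": 150,
    "learning_rate": 0.1,
    "max_depth": 5,
    "max_features": "log2",
    "loss": "deviance",
    "multilabel": true,
    "scale": false,
    "center": false,
    "label_weights": {},
    "population_rates": null,
    "labels": ["Culture.Sports", "STEM.Biology"]
}
""")

# Assumption: these keys are consumed by the wrapper rather than by the estimator.
WRAPPER_KEYS = {"labels", "multilabel", "population_rates", "label_weights", "scale", "center"}

estimator_params = {k: v for k, v in params.items() if k not in WRAPPER_KEYS}
wrapper_params = {k: v for k, v in params.items() if k in WRAPPER_KEYS}

print(sorted(estimator_params))
print(sorted(wrapper_params))
```

Read this way, the estimator is a 150-tree gradient-boosted ensemble with depth-5 trees and a learning rate of 0.1, and `"multilabel": true` indicates that an article can receive several topic labels at once. In recent scikit-learn releases the `deviance` loss is named `log_loss`.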
Output schema
{
    "type": "object",
    "title": "Scikit learn-based classifier score with probability",
    "properties": {
        "probability": {
            "type": "object",
            "properties": {
                "History and Society.Education": {
                    "type": "number"
                },
                "History and Society.Business and economics": {
                    "type": "number"
                },
                "Culture.Food and drink": {
                    "type": "number"
                },
                "History and Society.Military and warfare": {
                    "type": "number"
                },
                "STEM.Chemistry": {
                    "type": "number"
                },
                "Culture.Linguistics": {
                    "type": "number"
                },
                "Geography.Regions.Americas.North America": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Northern Africa": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Southeast Asia": {
                    "type": "number"
                },
                "Culture.Biography.Biography*": {
                    "type": "number"
                },
                "Geography.Geographical": {
                    "type": "number"
                },
                "Culture.Media.Radio": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Asia*": {
                    "type": "number"
                },
                "STEM.STEM*": {
                    "type": "number"
                },
                "Culture.Media.Media*": {
                    "type": "number"
                },
                "Geography.Regions.Asia.South Asia": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Western Europe": {
                    "type": "number"
                },
                "Culture.Biography.Women": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Eastern Europe": {
                    "type": "number"
                },
                "STEM.Earth and environment": {
                    "type": "number"
                },
                "Culture.Philosophy and religion": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Central Asia": {
                    "type": "number"
                },
                "Culture.Media.Films": {
                    "type": "number"
                },
                "Culture.Sports": {
                    "type": "number"
                },
                "Geography.Regions.Asia.North Asia": {
                    "type": "number"
                },
                "History and Society.Society": {
                    "type": "number"
                },
                "STEM.Physics": {
                    "type": "number"
                },
                "Culture.Performing arts": {
                    "type": "number"
                },
                "STEM.Mathematics": {
                    "type": "number"
                },
                "Culture.Media.Music": {
                    "type": "number"
                },
                "STEM.Biology": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Africa*": {
                    "type": "number"
                },
                "Culture.Media.Video games": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Central Africa": {
                    "type": "number"
                },
                "History and Society.History": {
                    "type": "number"
                },
                "Geography.Regions.Americas.Central America": {
                    "type": "number"
                },
                "STEM.Libraries & Information": {
                    "type": "number"
                },
                "Culture.Visual arts.Comics and Anime": {
                    "type": "number"
                },
                "Geography.Regions.Americas.South America": {
                    "type": "number"
                },
                "STEM.Engineering": {
                    "type": "number"
                },
                "STEM.Medicine & Health": {
                    "type": "number"
                },
                "Culture.Media.Books": {
                    "type": "number"
                },
                "Culture.Literature": {
                    "type": "number"
                },
                "Geography.Regions.Asia.West Asia": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Western Africa": {
                    "type": "number"
                },
                "History and Society.Politics and government": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Southern Africa": {
                    "type": "number"
                },
                "Culture.Visual arts.Fashion": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Eastern Africa": {
                    "type": "number"
                },
                "Geography.Regions.Oceania": {
                    "type": "number"
                },
                "Culture.Internet culture": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Southern Europe": {
                    "type": "number"
                },
                "Geography.Regions.Asia.East Asia": {
                    "type": "number"
                },
                "STEM.Space": {
                    "type": "number"
                },
                "STEM.Technology": {
                    "type": "number"
                },
                "Culture.Visual arts.Architecture": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Northern Europe": {
                    "type": "number"
                },
                "Culture.Media.Television": {
                    "type": "number"
                },
                "STEM.Computing": {
                    "type": "number"
                },
                "Culture.Media.Software": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Europe*": {
                    "type": "number"
                },
                "History and Society.Transportation": {
                    "type": "number"
                },
                "Culture.Media.Entertainment": {
                    "type": "number"
                },
                "Culture.Visual arts.Visual arts*": {
                    "type": "number"
                }
            },
            "description": "A mapping of probabilities onto each of the potential output labels"
        },
        "prediction": {
            "items": {
                "type": "string"
            },
            "description": "The most likely labels predicted by the estimator",
            "type": "array"
        }
    }
}
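A small sketch of checking a score object against the shape this schema describes, using only the Python standard library. The function name `check_score` is illustrative, and treating both properties as required is an assumption, since the schema itself does not list required fields.

```python
from numbers import Real


def check_score(score: dict) -> None:
    """Raise ValueError if `score` does not have the shape described by the schema above."""
    probability = score.get("probability")
    prediction = score.get("prediction")
    if not isinstance(probability, dict):
        raise ValueError("'probability' must be an object mapping labels to numbers")
    for label, value in probability.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, Real):
            raise ValueError(f"probability for {label!r} is not a number")
    if not isinstance(prediction, list) or not all(isinstance(p, str) for p in prediction):
        raise ValueError("'prediction' must be an array of strings")


# Abbreviated score with two of the labels; passes without raising.
check_score({"prediction": ["STEM.STEM*"],
             "probability": {"STEM.STEM*": 0.77, "Culture.Sports": 0.003}})
```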
Example input and output
Input:
https://ores.wikimedia.org/v3/scores/enwiki/1234/articletopic

Output:

{
    "enwiki": {
        "models": {
            "articletopic": {
                "version": "1.3.0"
            }
        },
        "scores": {
            "1234": {
                "articletopic": {
                    "score": {
                        "prediction": [
                            "STEM.STEM*"
                        ],
                        "probability": {
                            "Culture.Biography.Biography*": 0.007324216272400619,
                            "Culture.Biography.Women": 0.0012902608239981099,
                            "Culture.Food and drink": 0.012590959877795965,
                            "Culture.Internet culture": 0.0005287782922645555,
                            "Culture.Linguistics": 0.0023009688839875403,
                            "Culture.Literature": 0.003750485897413451,
                            "Culture.Media.Books": 0.0006460359265654311,
                            "Culture.Media.Entertainment": 0.001019549745209333,
                            "Culture.Media.Films": 0.0008073367081496864,
                            "Culture.Media.Media*": 0.012011986684036512,
                            "Culture.Media.Music": 0.0008537381008416275,
                            "Culture.Media.Radio": 0.0002869665819413362,
                            "Culture.Media.Software": 0.001089289634412607,
                            "Culture.Media.Television": 0.0008907233279284757,
                            "Culture.Media.Video games": 0.0001200262230068454,
                            "Culture.Performing arts": 0.0019581135216171887,
                            "Culture.Philosophy and religion": 0.006759709347651477,
                            "Culture.Sports": 0.002708191532469151,
                            "Culture.Visual arts.Architecture": 0.006677867414092196,
                            "Culture.Visual arts.Comics and Anime": 0.00012020261708547758,
                            "Culture.Visual arts.Fashion": 0.0008955508074615747,
                            "Culture.Visual arts.Visual arts*": 0.01634491597811793,
                            "Geography.Geographical": 0.021850841219839115,
                            "Geography.Regions.Africa.Africa*": 0.01403461063487551,
                            "Geography.Regions.Africa.Central Africa": 0.0001865675770278535,
                            "Geography.Regions.Africa.Eastern Africa": 0.0299178497643664,
                            "Geography.Regions.Africa.Northern Africa": 0.0002661363607017024,
                            "Geography.Regions.Africa.Southern Africa": 0.0002798146633784188,
                            "Geography.Regions.Africa.Western Africa": 0.00034452672014203926,
                            "Geography.Regions.Americas.Central America": 0.0008260548714261792,
                            "Geography.Regions.Americas.North America": 0.264295864207147,
                            "Geography.Regions.Americas.South America": 0.00015273918706570618,
                            "Geography.Regions.Asia.Asia*": 0.005740602435312139,
                            "Geography.Regions.Asia.Central Asia": 9.711333852867366e-05,
                            "Geography.Regions.Asia.East Asia": 0.0008691503787446779,
                            "Geography.Regions.Asia.North Asia": 0.0001781509472634297,
                            "Geography.Regions.Asia.South Asia": 0.0002519598342558709,
                            "Geography.Regions.Asia.Southeast Asia": 0.0007149941847729443,
                            "Geography.Regions.Asia.West Asia": 0.0007647733764823995,
                            "Geography.Regions.Europe.Eastern Europe": 0.0021314421964480305,
                            "Geography.Regions.Europe.Europe*": 0.04348112583641243,
                            "Geography.Regions.Europe.Northern Europe": 0.03873409183096865,
                            "Geography.Regions.Europe.Southern Europe": 0.0017978097009254826,
                            "Geography.Regions.Europe.Western Europe": 0.001196308436445433,
                            "Geography.Regions.Oceania": 0.0010145084950130057,
                            "History and Society.Business and economics": 0.014571914809621094,
                            "History and Society.Education": 0.022647190342216877,
                            "History and Society.History": 0.006181778255649838,
                            "History and Society.Military and warfare": 0.00431305074559507,
                            "History and Society.Politics and government": 0.017814417716100546,
                            "History and Society.Society": 0.022076257425149896,
                            "History and Society.Transportation": 0.0011766622074844352,
                            "STEM.Biology": 0.009271860405065028,
                            "STEM.Chemistry": 0.0011815942896098593,
                            "STEM.Computing": 0.0007807185930526152,
                            "STEM.Earth and environment": 0.0036473015518475203,
                            "STEM.Engineering": 0.003285589480897292,
                            "STEM.Libraries & Information": 0.0037889231177539767,
                            "STEM.Mathematics": 0.010840473137394606,
                            "STEM.Medicine & Health": 0.004612666090909239,
                            "STEM.Physics": 0.0007788391574850387,
                            "STEM.STEM*": 0.7655342414491919,
                            "STEM.Space": 8.220588424577983e-05,
                            "STEM.Technology": 0.01811368106322167
                        }
                    }
                }
            }
        }
    }
}
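The nesting of the response (wiki, then `scores`, then revision ID, then model name, then `score`) can be walked as in the sketch below. The helper name `top_topics` is illustrative, and the sample dictionary is an abbreviated, rounded copy of the example output above.

```python
def top_topics(response: dict, wiki: str, rev_id: int, k: int = 3):
    """Return the predicted labels and the k most probable (label, probability) pairs."""
    score = response[wiki]["scores"][str(rev_id)]["articletopic"]["score"]
    ranked = sorted(score["probability"].items(), key=lambda item: item[1], reverse=True)
    return score["prediction"], ranked[:k]


# Abbreviated copy of the example output (three of the probabilities, rounded).
example = {
    "enwiki": {
        "models": {"articletopic": {"version": "1.3.0"}},
        "scores": {"1234": {"articletopic": {"score": {
            "prediction": ["STEM.STEM*"],
            "probability": {
                "Geography.Regions.Americas.North America": 0.264,
                "STEM.STEM*": 0.766,
                "STEM.Space": 0.00008,
            },
        }}}},
    }
}

prediction, ranked = top_topics(example, "enwiki", 1234, k=2)
print(prediction)  # ['STEM.STEM*']
print(ranked)      # [('STEM.STEM*', 0.766), ('Geography.Regions.Americas.North America', 0.264)]
```

Note that the revision ID is a string key in the JSON, and that the probabilities are per-label scores that do not need to sum to 1.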

Data

Data pipeline
The training data was fetched for a set of revision IDs. Various pieces of information about each revision were then extracted using automated processes, and the revision text was fed into word2vec to produce an article embedding. Finally, labels were derived from the mid-level WikiProject categories that the article is associated with.
Training data
Training data was automatically and randomly separated from test data during training using the drafttopic git repository (which trains both drafttopic and articletopic models).
Test data
Test data was automatically and randomly split off from train data using the drafttopic git repository (which trains both drafttopic and articletopic models). The model then makes a prediction on that data, which is compared to the underlying ground truth to calculate performance statistics.
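A random train/test split of this kind can be sketched as follows. This is a generic illustration, not the code from the drafttopic repository; the function name `random_split`, the 20% test fraction, and the revision IDs are all assumptions made for the example.

```python
import random


def random_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of `items` and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical revision IDs standing in for the labeled observations.
train, test = random_split(range(1000, 1010), test_fraction=0.2)
print(len(train), len(test))  # 8 2
```

Every observation lands in exactly one of the two sets, so the performance statistics are computed on articles the model did not see during training.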

Licenses

Citation

Cite this model card as:

@misc{
  Triedman_Bazira_2023_English_Wikipedia_article_topic,
  title={ English Wikipedia article topic model card },
  author={ Triedman, Harold and Bazira, Kevin },
  year={ 2023 },
  url={ https://meta.wikimedia.org/wiki/Machine_learning_models/Production/English_Wikipedia_article_topic }
}