Machine learning models/Production/Basque Wikipedia article topic


Model card
This page is an on-wiki machine learning model card.
A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub
Model creator(s): Aaron Halfaker (User:EpochFail) and Amir Sarabadani
Model owner(s): WMF Machine Learning Team (ml@wikimediafoundation.org)
Model interface: ORES homepage
Code: drafttopic GitHub, ORES training data, and ORES model binaries
Uses PII: No
In production? Yes
Which projects? Basque Wikipedia
This model uses article text to predict the likelihood that the article belongs to a set of topics.


Motivation

How can we predict what general topic an article is in? Answering this question is useful for various analyses of Wikipedia dynamics. However, it is difficult to group a very diverse range of Wikipedia articles into coherent, consistent topics manually.

This model, part of the ORES suite of models, analyzes an article to predict its likelihood of belonging to each of a set of topics. Similar models (though not necessarily with the same performance level or topics) are deployed across about a dozen other projects. There is also a language-agnostic article topic model.

This model may be useful for high-level analyses of Wikipedia dynamics (pageviews, article quality, edit trends) and filtering articles.

Users and uses

Use this model for
  • high-level analyses of Wikipedia dynamics such as pageviews, article quality, or edit trends, e.g. how do pageview dynamics differ between the physics and biology categories?
  • filtering to relevant articles, e.g. keeping only the articles in the music category.
Don't use this model for
  • definitively establishing what topic an article pertains to
  • automated editing of articles or topics without a human in the loop
Current uses

This model is part of ORES and is generally accessible via the ORES API. It is used for high-level analysis of Wikipedia, platform research, and other on-wiki tasks.

Example API call:
https://ores.wikimedia.org/v3/scores/euwiki/1234/articletopic
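
A minimal sketch of how the call above could be assembled and issued from Python with only the standard library. The helper names `build_score_url` and `fetch_score` are illustrative and not part of ORES. The URL pattern is copied from the example call, the numeric path segment is assumed to be a revision ID, and whether the endpoint is reachable from a given environment is also an assumption.

```python
import json
import urllib.request

# Host and path prefix copied from the example call above.
ORES_BASE = "https://ores.wikimedia.org/v3/scores"


def build_score_url(wiki, rev_id, model="articletopic"):
    """Assemble a scoring URL in the same shape as the example call."""
    return f"{ORES_BASE}/{wiki}/{rev_id}/{model}"


def fetch_score(wiki, rev_id, model="articletopic", timeout=10.0):
    """Request a score and decode the JSON body (performs a network call)."""
    url = build_score_url(wiki, rev_id, model)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Calling `build_score_url("euwiki", 1234)` reproduces the example URL. `fetch_score` is kept separate so that the URL logic can be checked without network access.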

Ethical considerations, caveats, and recommendations

  • This model was trained on data that is now several years old (from mid-2020). Underlying data drift may skew model outputs.
  • This model uses word2vec as a training feature. Word2vec, like other natural language embeddings, encodes the linguistic biases of its underlying datasets, for example along the lines of gender, race, ethnicity, and religion. Since Wikipedia has known biases in its text, this model may encode and at times reproduce those biases.
  • This model's performance varies widely across topics: in the test statistics below, F1 ranges from 0.308 (Geography.Regions.Africa.Central Africa) to 0.92 (STEM.Space). Consult those statistics for per-topic performance before relying on any single topic.

Model

Performance

Test data confusion matrix:

Label n True positive False negative False positive True negative
Culture.Biography.Biography* 14111 12792 1319 690 41643
Culture.Biography.Women 3558 2491 1067 324 52562
Culture.Food and drink 1464 1064 400 111 54869
Culture.Internet culture 2157 1885 272 144 54143
Culture.Linguistics 2171 1623 548 142 54131
Culture.Literature 4875 3743 1132 407 51162
Culture.Media.Books 1305 1083 222 73 55066
Culture.Media.Entertainment 2110 1216 894 252 54082
Culture.Media.Films 3102 2853 249 83 53259
Culture.Media.Media* 11523 10170 1353 956 43965
Culture.Media.Music 2715 2334 381 133 53596
Culture.Media.Radio 253 189 64 28 56163
Culture.Media.Software 2005 1867 138 178 54261
Culture.Media.Television 1733 1403 330 86 54625
Culture.Media.Video games 651 615 36 20 55773
Culture.Performing arts 1551 977 574 110 54783
Culture.Philosophy and religion 4242 2436 1806 370 51832
Culture.Sports 3639 2935 704 137 52668
Culture.Visual arts.Architecture 2502 1953 549 243 53699
Culture.Visual arts.Comics and Anime 1161 1008 153 39 55244
Culture.Visual arts.Fashion 618 428 190 40 55786
Culture.Visual arts.Visual arts* 4969 3733 1236 384 51091
Geography.Geographical 4627 3220 1407 643 51174
Geography.Regions.Africa.Africa* 3906 2632 1274 322 52216
Geography.Regions.Africa.Central Africa 808 503 305 85 55551
Geography.Regions.Africa.Eastern Africa 360 234 126 22 56062
Geography.Regions.Africa.Northern Africa 1527 1045 482 115 54802
Geography.Regions.Africa.Southern Africa 590 387 203 41 55813
Geography.Regions.Africa.Western Africa 77 49 28 24 56343
Geography.Regions.Americas.Central America 1406 814 592 138 54900
Geography.Regions.Americas.North America 6544 4963 1581 763 49137
Geography.Regions.Americas.South America 1784 1373 411 122 54538
Geography.Regions.Asia.Asia* 9866 7962 1904 829 45749
Geography.Regions.Asia.Central Asia 928 688 240 74 55442
Geography.Regions.Asia.East Asia 3315 2750 565 180 52949
Geography.Regions.Asia.North Asia 1410 1202 208 120 54914
Geography.Regions.Asia.South Asia 1800 1294 506 104 54540
Geography.Regions.Asia.Southeast Asia 1750 968 782 191 54503
Geography.Regions.Asia.West Asia 2383 1862 521 135 53926
Geography.Regions.Europe.Eastern Europe 3057 2564 493 162 53225
Geography.Regions.Europe.Europe* 16976 14458 2518 1901 37567
Geography.Regions.Europe.Northern Europe 4149 3271 878 310 51985
Geography.Regions.Europe.Southern Europe 5477 4245 1232 644 50323
Geography.Regions.Europe.Western Europe 5201 4151 1050 393 50850
Geography.Regions.Oceania 1895 1318 577 149 54400
History and Society.Business and economics 3125 2030 1095 230 53089
History and Society.Education 1777 921 856 116 54551
History and Society.History 6150 4002 2148 725 49569
History and Society.Military and warfare 4199 2821 1378 408 51837
History and Society.Politics and government 4713 2715 1998 504 51227
History and Society.Society 6946 3529 3417 657 48841
History and Society.Transportation 2533 2132 401 73 53838
STEM.Biology 6821 6232 589 159 49464
STEM.Chemistry 1541 1273 268 107 54796
STEM.Computing 2464 2198 266 142 53838
STEM.Earth and environment 1875 1366 509 103 54466
STEM.Engineering 2557 1862 695 201 53686
STEM.Libraries & Information 476 371 105 31 55937
STEM.Mathematics 1069 950 119 33 55342
STEM.Medicine & Health 1942 1393 549 132 54370
STEM.Physics 1574 1210 364 138 54732
STEM.STEM* 19988 18321 1667 805 35651
STEM.Space 1879 1736 143 28 54537
STEM.Technology 4389 3453 936 530 51525

Test data sample rates:

Label Sample Population
Culture.Biography.Biography* 0.25 0.123
Culture.Biography.Women 0.063 0.015
Culture.Food and drink 0.026 0.002
Culture.Internet culture 0.038 0.003
Culture.Linguistics 0.038 0.007
Culture.Literature 0.086 0.015
Culture.Media.Books 0.023 0.004
Culture.Media.Entertainment 0.037 0.004
Culture.Media.Films 0.055 0.011
Culture.Media.Media* 0.204 0.058
Culture.Media.Music 0.048 0.024
Culture.Media.Radio 0.004 0.002
Culture.Media.Software 0.036 0.001
Culture.Media.Television 0.031 0.009
Culture.Media.Video games 0.012 0.003
Culture.Performing arts 0.027 0.003
Culture.Philosophy and religion 0.075 0.011
Culture.Sports 0.064 0.071
Culture.Visual arts.Architecture 0.044 0.011
Culture.Visual arts.Comics and Anime 0.021 0.002
Culture.Visual arts.Fashion 0.011 0.001
Culture.Visual arts.Visual arts* 0.088 0.018
Geography.Geographical 0.082 0.024
Geography.Regions.Africa.Africa* 0.069 0.008
Geography.Regions.Africa.Central Africa 0.014 0.001
Geography.Regions.Africa.Eastern Africa 0.006 0
Geography.Regions.Africa.Northern Africa 0.027 0.001
Geography.Regions.Africa.Southern Africa 0.01 0.001
Geography.Regions.Africa.Western Africa 0.001 0.001
Geography.Regions.Americas.Central America 0.025 0.003
Geography.Regions.Americas.North America 0.116 0.064
Geography.Regions.Americas.South America 0.032 0.006
Geography.Regions.Asia.Asia* 0.175 0.045
Geography.Regions.Asia.Central Asia 0.016 0.001
Geography.Regions.Asia.East Asia 0.059 0.011
Geography.Regions.Asia.North Asia 0.025 0.001
Geography.Regions.Asia.South Asia 0.032 0.015
Geography.Regions.Asia.Southeast Asia 0.031 0.006
Geography.Regions.Asia.West Asia 0.042 0.011
Geography.Regions.Europe.Eastern Europe 0.054 0.013
Geography.Regions.Europe.Europe* 0.301 0.076
Geography.Regions.Europe.Northern Europe 0.074 0.031
Geography.Regions.Europe.Southern Europe 0.097 0.013
Geography.Regions.Europe.Western Europe 0.092 0.019
Geography.Regions.Oceania 0.034 0.015
History and Society.Business and economics 0.055 0.01
History and Society.Education 0.031 0.007
History and Society.History 0.109 0.011
History and Society.Military and warfare 0.074 0.014
History and Society.Politics and government 0.083 0.028
History and Society.Society 0.123 0.013
History and Society.Transportation 0.045 0.015
STEM.Biology 0.121 0.034
STEM.Chemistry 0.027 0.002
STEM.Computing 0.044 0.003
STEM.Earth and environment 0.033 0.005
STEM.Engineering 0.045 0.005
STEM.Libraries & Information 0.008 0.001
STEM.Mathematics 0.019 0
STEM.Medicine & Health 0.034 0.006
STEM.Physics 0.028 0.001
STEM.STEM* 0.354 0.069
STEM.Space 0.033 0.006
STEM.Technology 0.078 0.005

Test data performance:

Label Match rate Filter rate Recall Precision F1 Accuracy ROC AUC PR AUC
Culture.Biography.Biography* 0.126 0.874 0.907 0.886 0.896 0.974 0.983 0.955
Culture.Biography.Women 0.016 0.984 0.7 0.628 0.662 0.99 0.979 0.684
Culture.Food and drink 0.004 0.996 0.727 0.471 0.571 0.997 0.98 0.623
Culture.Internet culture 0.006 0.994 0.874 0.536 0.665 0.997 0.987 0.76
Culture.Linguistics 0.008 0.992 0.748 0.678 0.711 0.996 0.983 0.773
Culture.Literature 0.02 0.98 0.768 0.605 0.677 0.989 0.98 0.743
Culture.Media.Books 0.005 0.995 0.83 0.717 0.769 0.998 0.984 0.789
Culture.Media.Entertainment 0.007 0.993 0.576 0.309 0.402 0.994 0.973 0.342
Culture.Media.Films 0.011 0.989 0.92 0.863 0.89 0.998 0.988 0.929
Culture.Media.Media* 0.072 0.928 0.883 0.72 0.793 0.973 0.982 0.892
Culture.Media.Music 0.023 0.977 0.86 0.895 0.877 0.994 0.985 0.903
Culture.Media.Radio 0.002 0.998 0.747 0.764 0.755 0.999 0.948 0.58
Culture.Media.Software 0.005 0.995 0.931 0.275 0.424 0.997 0.989 0.569
Culture.Media.Television 0.009 0.991 0.81 0.821 0.815 0.997 0.985 0.844
Culture.Media.Video games 0.003 0.997 0.945 0.873 0.908 0.999 0.983 0.936
Culture.Performing arts 0.004 0.996 0.63 0.476 0.543 0.997 0.975 0.528
Culture.Philosophy and religion 0.013 0.987 0.574 0.466 0.514 0.988 0.959 0.52
Culture.Sports 0.06 0.94 0.807 0.96 0.876 0.984 0.978 0.934
Culture.Visual arts.Architecture 0.013 0.987 0.781 0.649 0.709 0.993 0.984 0.763
Culture.Visual arts.Comics and Anime 0.003 0.997 0.868 0.73 0.793 0.999 0.988 0.837
Culture.Visual arts.Fashion 0.001 0.999 0.693 0.439 0.537 0.999 0.978 0.503
Culture.Visual arts.Visual arts* 0.021 0.979 0.751 0.652 0.698 0.988 0.977 0.739
Geography.Geographical 0.029 0.971 0.696 0.575 0.63 0.981 0.972 0.661
Geography.Regions.Africa.Africa* 0.011 0.989 0.674 0.464 0.549 0.991 0.974 0.584
Geography.Regions.Africa.Central Africa 0.002 0.998 0.623 0.205 0.308 0.998 0.981 0.233
Geography.Regions.Africa.Eastern Africa 0.001 0.999 0.65 0.43 0.517 0.999 0.964 0.324
Geography.Regions.Africa.Northern Africa 0.003 0.997 0.684 0.286 0.404 0.998 0.978 0.364
Geography.Regions.Africa.Southern Africa 0.002 0.998 0.656 0.512 0.575 0.999 0.968 0.423
Geography.Regions.Africa.Western Africa 0.001 0.999 0.636 0.505 0.563 0.999 0.832 0.313
Geography.Regions.Americas.Central America 0.004 0.996 0.579 0.433 0.496 0.996 0.971 0.467
Geography.Regions.Americas.North America 0.063 0.937 0.758 0.773 0.766 0.97 0.974 0.842
Geography.Regions.Americas.South America 0.007 0.993 0.77 0.686 0.726 0.996 0.983 0.806
Geography.Regions.Asia.Asia* 0.054 0.946 0.807 0.684 0.74 0.974 0.975 0.825
Geography.Regions.Asia.Central Asia 0.002 0.998 0.741 0.325 0.452 0.998 0.981 0.355
Geography.Regions.Asia.East Asia 0.013 0.987 0.83 0.739 0.781 0.995 0.983 0.81
Geography.Regions.Asia.North Asia 0.003 0.997 0.852 0.266 0.405 0.998 0.988 0.362
Geography.Regions.Asia.South Asia 0.013 0.987 0.719 0.853 0.78 0.994 0.976 0.801
Geography.Regions.Asia.Southeast Asia 0.007 0.993 0.553 0.489 0.519 0.994 0.975 0.538
Geography.Regions.Asia.West Asia 0.011 0.989 0.781 0.775 0.778 0.995 0.979 0.778
Geography.Regions.Europe.Eastern Europe 0.014 0.986 0.839 0.782 0.809 0.995 0.982 0.838
Geography.Regions.Europe.Europe* 0.109 0.891 0.852 0.593 0.699 0.944 0.968 0.804
Geography.Regions.Europe.Northern Europe 0.03 0.97 0.788 0.807 0.798 0.988 0.981 0.841
Geography.Regions.Europe.Southern Europe 0.023 0.977 0.775 0.448 0.567 0.985 0.98 0.639
Geography.Regions.Europe.Western Europe 0.023 0.977 0.798 0.67 0.729 0.989 0.982 0.806
Geography.Regions.Oceania 0.013 0.987 0.696 0.796 0.742 0.993 0.979 0.782
History and Society.Business and economics 0.011 0.989 0.65 0.605 0.627 0.992 0.972 0.64
History and Society.Education 0.006 0.994 0.518 0.644 0.574 0.994 0.97 0.544
History and Society.History 0.021 0.979 0.651 0.331 0.439 0.982 0.963 0.463
History and Society.Military and warfare 0.017 0.983 0.672 0.551 0.605 0.988 0.974 0.637
History and Society.Politics and government 0.026 0.974 0.576 0.632 0.603 0.979 0.956 0.636
History and Society.Society 0.02 0.98 0.508 0.328 0.399 0.981 0.938 0.381
History and Society.Transportation 0.014 0.986 0.842 0.905 0.872 0.996 0.985 0.918
STEM.Biology 0.034 0.966 0.914 0.908 0.911 0.994 0.986 0.954
STEM.Chemistry 0.003 0.997 0.826 0.398 0.537 0.998 0.987 0.58
STEM.Computing 0.005 0.995 0.892 0.478 0.622 0.997 0.99 0.702
STEM.Earth and environment 0.005 0.995 0.729 0.637 0.68 0.997 0.981 0.705
STEM.Engineering 0.008 0.992 0.728 0.507 0.598 0.995 0.978 0.637
STEM.Libraries & Information 0.001 0.999 0.779 0.467 0.584 0.999 0.951 0.605
STEM.Mathematics 0.001 0.999 0.889 0.383 0.536 0.999 0.99 0.538
STEM.Medicine & Health 0.007 0.993 0.717 0.656 0.685 0.996 0.978 0.719
STEM.Physics 0.003 0.997 0.769 0.206 0.325 0.997 0.985 0.341
STEM.STEM* 0.084 0.916 0.917 0.755 0.828 0.974 0.98 0.922
STEM.Space 0.006 0.994 0.924 0.916 0.92 0.999 0.992 0.947
STEM.Technology 0.014 0.986 0.787 0.285 0.418 0.989 0.98 0.474
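
The three tables can be tied together with a short worked calculation. The sketch below takes the Culture.Biography.Biography* row: 14111 labeled articles, 12792 of them predicted positive, 690 articles outside the label that the model still flagged, 41643 it correctly rejected, and a population rate of 0.123. It assumes that the reported metrics are re-weighted from the test sample to the population rate. That re-weighting is an inference from the numbers rather than something this card states, and the function and variable names are illustrative.

```python
def population_scaled_metrics(n_pos, true_pos, flagged_neg, true_neg, population_rate):
    """Recompute headline metrics for one label from raw test counts.

    n_pos:        test articles carrying the label (the "n" column)
    true_pos:     labeled articles the model predicted positive
    flagged_neg:  unlabeled articles the model predicted positive
    true_neg:     unlabeled articles the model predicted negative
    """
    recall = true_pos / n_pos
    false_pos_rate = flagged_neg / (flagged_neg + true_neg)
    # Re-weight both rates by how common the label is in the population,
    # because labeled articles are over-represented in the test sample.
    hit_mass = population_rate * recall
    false_alarm_mass = (1 - population_rate) * false_pos_rate
    match_rate = hit_mass + false_alarm_mass
    precision = hit_mass / match_rate
    return {
        "match_rate": match_rate,
        "filter_rate": 1 - match_rate,
        "recall": recall,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": hit_mass + (1 - population_rate) * (1 - false_pos_rate),
    }


biography = population_scaled_metrics(14111, 12792, 690, 41643, 0.123)
print({name: round(value, 3) for name, value in biography.items()})
```

Rounded to three decimals, this gives a match rate of 0.126, filter rate of 0.874, recall of 0.907, precision of 0.886, F1 of 0.896, and accuracy of 0.974, which agrees with the performance row for that label.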

Implementation

Model architecture
{
    "type": "GradientBoosting",
    "params": {
        "scale": false,
        "center": false,
        "labels": [
            "Culture.Biography.Biography*",
            "Culture.Biography.Women",
            "Culture.Food and drink",
            "Culture.Internet culture",
            "Culture.Linguistics",
            "Culture.Literature",
            "Culture.Media.Books",
            "Culture.Media.Entertainment",
            "Culture.Media.Films",
            "Culture.Media.Media*",
            "Culture.Media.Music",
            "Culture.Media.Radio",
            "Culture.Media.Software",
            "Culture.Media.Television",
            "Culture.Media.Video games",
            "Culture.Performing arts",
            "Culture.Philosophy and religion",
            "Culture.Sports",
            "Culture.Visual arts.Architecture",
            "Culture.Visual arts.Comics and Anime",
            "Culture.Visual arts.Fashion",
            "Culture.Visual arts.Visual arts*",
            "Geography.Geographical",
            "Geography.Regions.Africa.Africa*",
            "Geography.Regions.Africa.Central Africa",
            "Geography.Regions.Africa.Eastern Africa",
            "Geography.Regions.Africa.Northern Africa",
            "Geography.Regions.Africa.Southern Africa",
            "Geography.Regions.Africa.Western Africa",
            "Geography.Regions.Americas.Central America",
            "Geography.Regions.Americas.North America",
            "Geography.Regions.Americas.South America",
            "Geography.Regions.Asia.Asia*",
            "Geography.Regions.Asia.Central Asia",
            "Geography.Regions.Asia.East Asia",
            "Geography.Regions.Asia.North Asia",
            "Geography.Regions.Asia.South Asia",
            "Geography.Regions.Asia.Southeast Asia",
            "Geography.Regions.Asia.West Asia",
            "Geography.Regions.Europe.Eastern Europe",
            "Geography.Regions.Europe.Europe*",
            "Geography.Regions.Europe.Northern Europe",
            "Geography.Regions.Europe.Southern Europe",
            "Geography.Regions.Europe.Western Europe",
            "Geography.Regions.Oceania",
            "History and Society.Business and economics",
            "History and Society.Education",
            "History and Society.History",
            "History and Society.Military and warfare",
            "History and Society.Politics and government",
            "History and Society.Society",
            "History and Society.Transportation",
            "STEM.Biology",
            "STEM.Chemistry",
            "STEM.Computing",
            "STEM.Earth and environment",
            "STEM.Engineering",
            "STEM.Libraries & Information",
            "STEM.Mathematics",
            "STEM.Medicine & Health",
            "STEM.Physics",
            "STEM.STEM*",
            "STEM.Space",
            "STEM.Technology"
        ],
        "multilabel": true,
        "population_rates": null,
        "ccp_alpha": 0.0,
        "criterion": "friedman_mse",
        "init": null,
        "learning_rate": 0.1,
        "loss": "deviance",
        "max_depth": 5,
        "max_features": "log2",
        "max_leaf_nodes": null,
        "min_impurity_decrease": 0.0,
        "min_impurity_split": null,
        "min_samples_leaf": 1,
        "min_samples_split": 2,
        "min_weight_fraction_leaf": 0.0,
        "n_estimators": 150,
        "n_iter_no_change": null,
        "presort": "deprecated",
        "random_state": null,
        "subsample": 1.0,
        "tol": 0.0001,
        "validation_fraction": 0.1,
        "verbose": 0,
        "warm_start": false,
        "label_weights": {}
    }
}
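
When reading the parameter object above, it can help to separate the options that describe the multilabel wrapper from the gradient boosting hyperparameters. The sketch below does this on an abbreviated copy of the object. The grouping in `WRAPPER_KEYS` is an assumption about which keys belong to the wrapping library rather than to the underlying estimator, and `split_params` is a hypothetical helper.

```python
# Keys assumed to configure the wrapper around the estimator (an assumption,
# not something this card states).
WRAPPER_KEYS = {
    "scale", "center", "labels", "multilabel",
    "population_rates", "label_weights",
}


def split_params(params):
    """Split a params object into (wrapper options, estimator hyperparameters)."""
    wrapper = {k: v for k, v in params.items() if k in WRAPPER_KEYS}
    estimator = {k: v for k, v in params.items() if k not in WRAPPER_KEYS}
    return wrapper, estimator


# Abbreviated copy of the "params" object above; the label list is truncated
# to two entries for brevity.
params = {
    "scale": False,
    "center": False,
    "labels": ["Culture.Biography.Biography*", "STEM.STEM*"],
    "multilabel": True,
    "population_rates": None,
    "learning_rate": 0.1,
    "max_depth": 5,
    "max_features": "log2",
    "n_estimators": 150,
    "subsample": 1.0,
    "label_weights": {},
}

wrapper, estimator = split_params(params)
print(sorted(estimator))
```

Read this way, the configuration describes 150 boosting stages of depth-5 trees with a learning rate of 0.1, applied to a multilabel problem.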
Output schema
{
    "title": "Scikit learn-based classifier score with probability",
    "type": "object",
    "properties": {
        "prediction": {
            "description": "The most likely labels predicted by the estimator",
            "type": "array",
            "items": {
                "type": "string"
            }
        },
        "probability": {
            "description": "A mapping of probabilities onto each of the potential output labels",
            "type": "object",
            "properties": {
                "Culture.Biography.Biography*": {
                    "type": "number"
                },
                "Culture.Biography.Women": {
                    "type": "number"
                },
                "Culture.Food and drink": {
                    "type": "number"
                },
                "Culture.Internet culture": {
                    "type": "number"
                },
                "Culture.Linguistics": {
                    "type": "number"
                },
                "Culture.Literature": {
                    "type": "number"
                },
                "Culture.Media.Books": {
                    "type": "number"
                },
                "Culture.Media.Entertainment": {
                    "type": "number"
                },
                "Culture.Media.Films": {
                    "type": "number"
                },
                "Culture.Media.Media*": {
                    "type": "number"
                },
                "Culture.Media.Music": {
                    "type": "number"
                },
                "Culture.Media.Radio": {
                    "type": "number"
                },
                "Culture.Media.Software": {
                    "type": "number"
                },
                "Culture.Media.Television": {
                    "type": "number"
                },
                "Culture.Media.Video games": {
                    "type": "number"
                },
                "Culture.Performing arts": {
                    "type": "number"
                },
                "Culture.Philosophy and religion": {
                    "type": "number"
                },
                "Culture.Sports": {
                    "type": "number"
                },
                "Culture.Visual arts.Architecture": {
                    "type": "number"
                },
                "Culture.Visual arts.Comics and Anime": {
                    "type": "number"
                },
                "Culture.Visual arts.Fashion": {
                    "type": "number"
                },
                "Culture.Visual arts.Visual arts*": {
                    "type": "number"
                },
                "Geography.Geographical": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Africa*": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Central Africa": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Eastern Africa": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Northern Africa": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Southern Africa": {
                    "type": "number"
                },
                "Geography.Regions.Africa.Western Africa": {
                    "type": "number"
                },
                "Geography.Regions.Americas.Central America": {
                    "type": "number"
                },
                "Geography.Regions.Americas.North America": {
                    "type": "number"
                },
                "Geography.Regions.Americas.South America": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Asia*": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Central Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.East Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.North Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.South Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.Southeast Asia": {
                    "type": "number"
                },
                "Geography.Regions.Asia.West Asia": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Eastern Europe": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Europe*": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Northern Europe": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Southern Europe": {
                    "type": "number"
                },
                "Geography.Regions.Europe.Western Europe": {
                    "type": "number"
                },
                "Geography.Regions.Oceania": {
                    "type": "number"
                },
                "History and Society.Business and economics": {
                    "type": "number"
                },
                "History and Society.Education": {
                    "type": "number"
                },
                "History and Society.History": {
                    "type": "number"
                },
                "History and Society.Military and warfare": {
                    "type": "number"
                },
                "History and Society.Politics and government": {
                    "type": "number"
                },
                "History and Society.Society": {
                    "type": "number"
                },
                "History and Society.Transportation": {
                    "type": "number"
                },
                "STEM.Biology": {
                    "type": "number"
                },
                "STEM.Chemistry": {
                    "type": "number"
                },
                "STEM.Computing": {
                    "type": "number"
                },
                "STEM.Earth and environment": {
                    "type": "number"
                },
                "STEM.Engineering": {
                    "type": "number"
                },
                "STEM.Libraries & Information": {
                    "type": "number"
                },
                "STEM.Mathematics": {
                    "type": "number"
                },
                "STEM.Medicine & Health": {
                    "type": "number"
                },
                "STEM.Physics": {
                    "type": "number"
                },
                "STEM.STEM*": {
                    "type": "number"
                },
                "STEM.Space": {
                    "type": "number"
                },
                "STEM.Technology": {
                    "type": "number"
                }
            }
        }
    }
}
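
A consumer may want to check a score against this shape before using it. The following is a small standard-library sketch, not a full JSON Schema validator. `validate_score` is a hypothetical helper, and it treats every label passed in as required, which is stricter than the schema above.

```python
from numbers import Real


def validate_score(score, labels):
    """Return a list of problems; an empty list means the score fits the shape."""
    problems = []

    prediction = score.get("prediction")
    if not isinstance(prediction, list) or not all(isinstance(p, str) for p in prediction):
        problems.append("prediction must be an array of strings")

    probability = score.get("probability")
    if not isinstance(probability, dict):
        problems.append("probability must be an object")
        return problems

    for label in labels:
        value = probability.get(label)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, Real):
            problems.append(f"probability[{label!r}] must be a number")
    return problems
```

For example, `validate_score({"prediction": ["STEM.STEM*"], "probability": {"STEM.STEM*": 0.84}}, ["STEM.STEM*"])` returns an empty list.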
Example input and output
Input:
https://ores.wikimedia.org/v3/scores/euwiki/1234/articletopic

Output:

{
    "euwiki": {
        "models": {
            "articletopic": {
                "version": "1.4.0"
            }
        },
        "scores": {
            "1234": {
                "articletopic": {
                    "score": {
                        "prediction": [
                            "STEM.STEM*"
                        ],
                        "probability": {
                            "Culture.Biography.Biography*": 0.04474557328405812,
                            "Culture.Biography.Women": 0.01497907119604387,
                            "Culture.Food and drink": 0.004752578712389338,
                            "Culture.Internet culture": 0.0023449403303050457,
                            "Culture.Linguistics": 0.0022740280576232182,
                            "Culture.Literature": 0.1182325750318887,
                            "Culture.Media.Books": 0.019549951107536098,
                            "Culture.Media.Entertainment": 0.01794452542211024,
                            "Culture.Media.Films": 0.002490622258081927,
                            "Culture.Media.Media*": 0.09069096640892323,
                            "Culture.Media.Music": 0.001933147122339207,
                            "Culture.Media.Radio": 0.015838931867138636,
                            "Culture.Media.Software": 0.0024631575903988523,
                            "Culture.Media.Television": 0.002503439185117112,
                            "Culture.Media.Video games": 2.942126295513376e-05,
                            "Culture.Performing arts": 0.004104411115017456,
                            "Culture.Philosophy and religion": 0.08753806275640869,
                            "Culture.Sports": 0.01539663101982379,
                            "Culture.Visual arts.Architecture": 0.07475321770031441,
                            "Culture.Visual arts.Comics and Anime": 0.0106637883730939,
                            "Culture.Visual arts.Fashion": 0.008487458202515499,
                            "Culture.Visual arts.Visual arts*": 0.07794234539600262,
                            "Geography.Geographical": 0.07125542627710914,
                            "Geography.Regions.Africa.Africa*": 0.05225989682090054,
                            "Geography.Regions.Africa.Central Africa": 0.001331958383514617,
                            "Geography.Regions.Africa.Eastern Africa": 0.0010447758438333376,
                            "Geography.Regions.Africa.Northern Africa": 0.06864989790471651,
                            "Geography.Regions.Africa.Southern Africa": 0.020527802857465454,
                            "Geography.Regions.Africa.Western Africa": 0.0002347637092073111,
                            "Geography.Regions.Americas.Central America": 0.0037047140715219433,
                            "Geography.Regions.Americas.North America": 0.09146335046453571,
                            "Geography.Regions.Americas.South America": 0.013138976351002755,
                            "Geography.Regions.Asia.Asia*": 0.11804834121648503,
                            "Geography.Regions.Asia.Central Asia": 0.0003669441969964404,
                            "Geography.Regions.Asia.East Asia": 0.01393943345886899,
                            "Geography.Regions.Asia.North Asia": 0.007152463634187921,
                            "Geography.Regions.Asia.South Asia": 0.054313198796259836,
                            "Geography.Regions.Asia.Southeast Asia": 0.04614969513772421,
                            "Geography.Regions.Asia.West Asia": 0.01663646391547243,
                            "Geography.Regions.Europe.Eastern Europe": 0.03664229764058916,
                            "Geography.Regions.Europe.Europe*": 0.31386936084546685,
                            "Geography.Regions.Europe.Northern Europe": 0.010818009917968689,
                            "Geography.Regions.Europe.Southern Europe": 0.09413262097348489,
                            "Geography.Regions.Europe.Western Europe": 0.02790924075522509,
                            "Geography.Regions.Oceania": 0.004733876078865698,
                            "History and Society.Business and economics": 0.047713624709890816,
                            "History and Society.Education": 0.024040146797416194,
                            "History and Society.History": 0.25198383619251563,
                            "History and Society.Military and warfare": 0.03931141868901636,
                            "History and Society.Politics and government": 0.04428292344027469,
                            "History and Society.Society": 0.1284071527319192,
                            "History and Society.Transportation": 0.020875200987459503,
                            "STEM.Biology": 0.05144344976440123,
                            "STEM.Chemistry": 0.002200494070686233,
                            "STEM.Computing": 0.002307932122133117,
                            "STEM.Earth and environment": 0.0028832527239357223,
                            "STEM.Engineering": 0.06476645687968684,
                            "STEM.Libraries & Information": 0.0006815428356950466,
                            "STEM.Mathematics": 0.0025962242677785587,
                            "STEM.Medicine & Health": 0.02667075270370134,
                            "STEM.Physics": 0.009673325411912001,
                            "STEM.STEM*": 0.8435791704707369,
                            "STEM.Space": 0.0010474124653407004,
                            "STEM.Technology": 0.04173308618717429
                        }
                    }
                }
            }
        }
    }
}
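
The nesting of the response above can be unpacked as follows. The sketch uses an abbreviated copy of the example output with four of its probabilities. The helper names are illustrative, and the 0.5 cut-off in `labels_above` is an assumption that happens to reproduce the `prediction` list in this example; the card does not state the threshold the model actually uses.

```python
def extract_score(response, wiki, rev_id, model="articletopic"):
    """Walk the nested response down to the score object for one revision."""
    return response[wiki]["scores"][str(rev_id)][model]["score"]


def top_topics(score, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(score["probability"].items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


def labels_above(score, threshold=0.5):
    """Return labels whose probability meets the (assumed) threshold."""
    return sorted(label for label, p in score["probability"].items() if p >= threshold)


# Abbreviated copy of the example output above.
response = {
    "euwiki": {
        "models": {"articletopic": {"version": "1.4.0"}},
        "scores": {
            "1234": {
                "articletopic": {
                    "score": {
                        "prediction": ["STEM.STEM*"],
                        "probability": {
                            "Culture.Media.Video games": 2.942126295513376e-05,
                            "Geography.Regions.Europe.Europe*": 0.31386936084546685,
                            "History and Society.History": 0.25198383619251563,
                            "STEM.STEM*": 0.8435791704707369,
                        },
                    }
                }
            }
        },
    }
}

score = extract_score(response, "euwiki", 1234)
print(top_topics(score, k=2))
print(labels_above(score))
```

Ranking by probability is the safer basis for filtering: a topic such as Geography.Regions.Europe.Europe* does not appear in `prediction` here but still carries a probability of about 0.31.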

Data

Data pipeline
Training data was fetched for a set of revision IDs. Various pieces of information about each revision were then extracted automatically, and the revision text was fed into word2vec to produce an article embedding. Finally, labels were derived from the mid-level WikiProject categories that the article is associated with.
Training data
Training data was automatically and randomly separated from test data during training using the drafttopic git repository (which trains both drafttopic and articletopic models).
Test data
Test data was automatically and randomly split off from train data using the drafttopic git repository (which trains both drafttopic and articletopic models). The model then makes a prediction on that data, which is compared to the underlying ground truth to calculate performance statistics.
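
To make the embedding step of the data pipeline more concrete, the sketch below shows one common way to turn per-word vectors into a single article vector: averaging the vectors of the words found in the vocabulary. Whether this model averages, sums, or combines word vectors in some other way is not stated in this card, so the averaging is an assumption. The two-dimensional vectors are made up for illustration, and `mean_embedding` is a hypothetical helper.

```python
def mean_embedding(tokens, vectors, dim):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    total = [0.0] * dim
    count = 0
    for token in tokens:
        vec = vectors.get(token)
        if vec is None:
            continue  # token is not in the embedding vocabulary
        total = [t + v for t, v in zip(total, vec)]
        count += 1
    if count == 0:
        return total  # all zeros when no token is known
    return [t / count for t in total]


# Hypothetical 2-dimensional vectors; real word2vec vectors have far more dimensions.
toy_vectors = {"izar": [1.0, 0.0], "planeta": [0.0, 1.0]}

article_vector = mean_embedding(["izar", "planeta", "xyz"], toy_vectors, dim=2)
print(article_vector)
```

The resulting fixed-length vector is the kind of feature a gradient boosting classifier can consume regardless of article length.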

Licenses

Citation

Cite this model card as:

@misc{
  Triedman_Bazira_2023_Basque_Wikipedia_article_topic,
  title={ Basque Wikipedia article topic model card },
  author={ Triedman, Harold and Bazira, Kevin },
  year={ 2023 },
  url={ https://meta.wikimedia.org/wiki/Machine_learning_models/Production/Basque_Wikipedia_article_topic }
}