Template:Model card wikidata item topic
Model card | |
---|---|
This page is an on-wiki machine learning model card. | |
Model Information Hub | |
Model creator(s) | Aaron Halfaker (User:EpochFail) and Amir Sarabadani |
Model owner(s) | WMF Machine Learning Team (ml@wikimediafoundation.org) |
Model interface | ORES homepage |
Code | drafttopic GitHub, ORES training data, and ORES model binaries |
Uses PII | No |
In production? | Yes |
Which projects? | Wikidata |
This model uses item features to predict the likelihood that the item belongs to a set of topics. | |
Motivation
How can we predict what general topic an item is in? Answering this question is useful for various analyses of Wikidata dynamics. However, it is difficult to group a very diverse range of Wikidata items into coherent, consistent topics manually.
This model, part of the ORES suite of models, analyzes an item to predict its likelihood of belonging to each of a set of topics. Similar models (though not necessarily with the same performance level or topics) are deployed across about a dozen other projects. There is also a language-agnostic article topic model.
This model may be useful for high-level analyses of Wikidata dynamics (pageviews, item quality, edit trends) and filtering items.
Users and uses
Use this model for:
- high-level analyses of Wikidata dynamics such as pageviews, item quality, or edit trends (e.g. how do pageview dynamics differ between the physics and biology categories?)
- filtering to relevant items (e.g. restricting a set of items to those in the music category)
Don't use this model for:
- definitively establishing what topic an item pertains to
- automated editing of items or topics without a human in the loop
This model is part of ORES and is generally accessible via the ORES API. It is used for high-level analysis of Wikidata, platform research, and other on-wiki tasks.
Example API call: {{{model_input}}}
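Because the example call above is an unfilled template parameter, the following is a minimal sketch of how a request and its response might be handled. It assumes the ORES v3 URL layout (`/v3/scores/{context}/{revid}/{model}`), the context name `wikidatawiki`, the model name `itemtopic`, and a response that nests a `probability` map under each revision's score; none of these are stated on this page, so check them against the live service before relying on them. The helper names and the sample payload are hypothetical.

```python
import json
import urllib.request
from typing import Dict, List, Tuple

# Assumed ORES v3 endpoint layout; the host and path are not given on this page.
ORES_BASE = "https://ores.wikimedia.org/v3/scores"


def build_score_url(rev_id: int, context: str = "wikidatawiki",
                    model: str = "itemtopic") -> str:
    """Build a scoring URL for one revision (context and model names are assumptions)."""
    return f"{ORES_BASE}/{context}/{rev_id}/{model}"


def fetch_scores(rev_id: int) -> Dict:
    """Fetch and decode the JSON response for one revision (needs network access)."""
    with urllib.request.urlopen(build_score_url(rev_id)) as resp:
        return json.load(resp)


def topics_above(response: Dict, rev_id: int, threshold: float = 0.5,
                 context: str = "wikidatawiki",
                 model: str = "itemtopic") -> List[Tuple[str, float]]:
    """Return (topic, probability) pairs at or above threshold, highest first."""
    score = response[context]["scores"][str(rev_id)][model]["score"]
    hits = [(t, p) for t, p in score["probability"].items() if p >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hand-written payload for illustration only; labels and numbers are made up
# and the shape follows the assumed response format described above.
sample = {
    "wikidatawiki": {
        "scores": {
            "123": {
                "itemtopic": {
                    "score": {
                        "prediction": ["Culture.Media.Music"],
                        "probability": {
                            "Culture.Media.Music": 0.91,
                            "STEM.Physics": 0.04,
                            "Geography.Regions.Europe.Europe*": 0.62,
                        },
                    }
                }
            }
        }
    }
}

if __name__ == "__main__":
    print(build_score_url(123))
    print(topics_above(sample, 123, threshold=0.5))
```

The threshold here is an arbitrary illustration. Since per-topic performance varies, a single cutoff for all topics is unlikely to be appropriate, and the output should be treated as a likelihood to review rather than a definitive label.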
Ethical considerations, caveats, and recommendations
- This model was trained on data that is now several years old (from mid-2020). Underlying data drift may skew model outputs.
- This model uses word2vec as a training feature. Word2vec, like other natural language embeddings, encodes the linguistic biases of its underlying datasets along lines such as gender, race, ethnicity, and religion. Since Wikidata has known biases in its text, this model may encode and at times reproduce those biases.
- This model has highly variable performance across different topics. Consult the test statistics below to get a sense of inter-topic performance.
Model
Performance
Test data confusion matrix: {{{confusion_matrix}}}
Test data sample rates: {{{sample_rates}}}
Test data performance: {{{performance}}}
Implementation
{{{model_input}}}
Output:
{{{model_output}}}
Data
Licenses
- Code: MIT license
- Model: MIT license
Citation
Cite this model card as:
@misc{
Triedman_Bazira_2023_Wikidata_item_topic,
title={ Wikidata item topic model card },
author={ Triedman, Harold and Bazira, Kevin },
year={ 2023 },
url={ https://meta.wikimedia.org/wiki/Model_card_wikidata_item_topic }
}