Machine learning models/Proposed/Article country
This model card page currently has a draft status. It is a piece of model documentation that is in the process of being written. Once the model card is completed, this template should be removed.
Taking a Wikipedia article and determining which countries are relevant to it is not a straightforward task. As explored in the Knowledge Gaps Taxonomy, there are different ways in which article subjects are connected to countries. Some are very direct -- e.g., places being physically located in a country or someone being born in a country. Others are far more hazy -- e.g., a species being native to a region or a cuisine or type of music being culturally connected with a country. This model seeks to operationalize these different definitions of relevance and combine these signals into a complete set of predictions for which countries are relevant to any given Wikipedia article.
Model card | |
---|---|
This page is an on-wiki machine learning model card. | |
Model Information Hub | |
Model creator(s) | Isaac Johnson |
Model owner(s) | Isaac Johnson |
Model interface | UI; API |
Past performance | Research meta page |
Code | API and Data pipeline |
Uses PII | no |
In production? | no |
Which projects? | all Wikipedia languages |
This model uses Wikidata properties, wikilinks, and categories to predict which countries are relevant to Wikipedia articles. | |
Specifically, the model takes the union of three signals:
- Wikidata properties: this includes coordinates for places, citizenship and other information for most biographies, and the generic country property (P17) that covers a wide range of relationships. These are used directly by the model.
- Categories: many categories on Wikidata are explicitly modeled as being related to a particular country -- e.g., Flora of France -- and usage of this category in a Wikipedia article can signify its connection to that country. These are used directly by the model.
- Wikilinks: for country relationships that either are difficult to model via Wikidata/categories or have not yet been modeled, we fall back on the links in an article to indicate other relevant countries. If enough links in an article point to a given country, that is treated as a valid prediction as well.
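The union of the three signals can be sketched as follows. This is a minimal illustration rather than the model's actual code: the function name, the input shapes, and the `link_threshold` value are assumptions made for the example, and the real cutoff for wikilinks comes from the calibration of that component.

```python
def predict_countries(wikidata_countries, category_countries, link_scores,
                      link_threshold=0.25):
    """Combine the three signals into one sorted list of countries.

    wikidata_countries, category_countries: iterables of country names,
        used directly without any thresholding.
    link_scores: dict mapping country name -> wikilink score.
    link_threshold: HYPOTHETICAL cutoff chosen only for this sketch.
    """
    # Wikidata properties and categories are trusted as-is.
    predicted = set(wikidata_countries) | set(category_countries)
    # Wikilinks only contribute countries with enough supporting evidence.
    predicted |= {country for country, score in link_scores.items()
                  if score >= link_threshold}
    return sorted(predicted)


# Illustrative call: no Wikidata or category signal, two link candidates.
print(predict_countries([], [], {"Japan": 0.43, "United Kingdom": 0.19}))
```

Because the result is a union, a country found by any one signal is predicted; the threshold only limits what the wikilinks signal is allowed to add.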
Motivation
Editors generally want to edit content with which they have some familiarity -- i.e. they know something about the topic, its context, the types of sources that might be relevant and reliable, etc. An important aspect of this familiarity is some amount of geographic/cultural proximity to the topic. Countries are one very important scale that often determines this geographic/cultural proximity. More fine-grained scales -- e.g., the topic is relevant to your particular state or town -- are also often important, but in practice they can be difficult to operationalize, whereas countries are a relatively constrained vocabulary that is already used in many places on the Wikimedia projects. Earlier article-region models used even larger regions such as Western Europe or South America, which were reported as too coarse to be useful as filters for many editors. This is especially true for organizers who receive grant funding that expects them to focus on content within their particular country. They currently cannot use much of the recommender system tooling (Content Translation, Newcomer Tasks, etc.) to directly address their needs.
Users and uses
Use this model for
- high-level analyses of Wikipedia dynamics
- filtering / ranking articles in tools -- e.g., only showing articles about Ecuador in a recommender system
Don't use this model for
- projects outside of Wikipedia -- e.g., Wiktionary, Wikinews, etc.
- namespaces outside of 0, disambiguation pages, and redirects
- editing to add country predictions as, e.g., properties on Wikidata
Current uses
- Under development but intended as a filter for recommender systems similar to articletopic
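As a rough sketch of the filtering use case (showing only articles about Ecuador, say), a tool could keep the candidates whose predicted countries include the one requested. The function name and the `predictions` lookup are hypothetical, and the sample predictions are illustrative values rather than real model output.

```python
def filter_by_country(candidates, predictions, country):
    """Keep candidate titles whose predicted countries include `country`.

    predictions: HYPOTHETICAL lookup of title -> list of predicted countries,
        e.g. built from the `countries` field of the API response.
    """
    return [title for title in candidates
            if country in predictions.get(title, [])]


# Illustrative (made-up) predictions for three candidate articles.
predictions = {
    "Quito": ["Ecuador"],
    "Andes": ["Ecuador", "Peru"],
    "Japanese iris": ["Japan"],
}
print(filter_by_country(["Quito", "Andes", "Japanese iris"], predictions, "Ecuador"))
```

Articles with no predictions are simply dropped by a filter like this, so it will miss some relevant content whenever the model has no signal for an article.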
Ethical considerations, caveats, and recommendations
- The model is oriented towards precision as opposed to recall -- i.e. do not assume that a country is irrelevant just because it is not listed as a prediction.
- Where editors are not seeing countries appear that they would expect to be relevant to a given article, they can consider making edits (where appropriate) to that article's Wikidata item, its categories, or the Wikidata items for the categories themselves. Wikilinks that point to articles tied to that country may also be added, but care should be taken not to over-link just to trigger a specific model prediction.
- No list of countries can be perfect, but details about the list used here may be found on the Research page for the model.
Model
Performance
Implementation
{
  countries: [
    <country name>,
    ... (up to 250 countries)
  ]
}
$ curl "https://wiki-region.wmcloud.org/regions?lang=en&title=Japanese%20iris"
Output
{
  "qid": "Q16753983",
  "countries": [
    "Japan"
  ],
  "wikidata": [],
  "links": [
    {
      "country": "Japan",
      "count": 4,
      "prop-tfidf": 0.4264280798348245
    },
    {
      "country": "United Kingdom",
      "count": 2,
      "prop-tfidf": 0.1924294562973159
    }
  ],
  "categories": []
}
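The same request can be made from Python. The sketch below assumes only the endpoint and the response fields shown in the example above; the helper names (`build_url`, `fetch`, `summarize`) are made up for illustration, and the API is a prototype whose fields may change.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

API_URL = "https://wiki-region.wmcloud.org/regions"  # endpoint from the curl example


def build_url(lang, title):
    """Build the request URL, percent-encoding spaces and non-ASCII characters."""
    return f"{API_URL}?{urlencode({'lang': lang, 'title': title}, quote_via=quote)}"


def fetch(lang, title, timeout=10):
    """Call the API and return the decoded JSON response (requires network access)."""
    with urlopen(build_url(lang, title), timeout=timeout) as resp:
        return json.load(resp)


def summarize(response):
    """Split a response into predicted countries and link-only candidates.

    Link-only candidates are countries that appear under "links" but were
    not elevated to the final "countries" list.
    """
    predicted = response.get("countries", [])
    weak = [link["country"] for link in response.get("links", [])
            if link["country"] not in predicted]
    return predicted, weak


# The response from the example above, trimmed to the fields used here.
sample = {
    "countries": ["Japan"],
    "links": [{"country": "Japan"}, {"country": "United Kingdom"}],
}
print(build_url("en", "Japanese iris"))
print(summarize(sample))
```

In the example response, Japan appears in `countries` even though `wikidata` and `categories` are empty, so it was predicted from wikilinks alone, while the United Kingdom had some links but not enough to be predicted.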
Data
Two components of the model (Wikidata properties and Wikipedia categories) do not require any calibration/learning as they are used directly. The wikilinks portion of the model does require calibration, however, to know when a set of links is providing sufficient evidence to elevate a given country to a full prediction. That is what is described below.
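One common way to calibrate a cutoff of this kind, shown here only as a sketch, is to take a labeled set of (score, relevant?) pairs and choose the lowest score threshold whose precision meets a target. This is not necessarily the procedure used for this model: the function name, the target precision, and the toy data are all assumptions made for illustration.

```python
def calibrate_threshold(examples, target_precision=0.9):
    """Return the lowest score cutoff whose precision meets the target.

    examples: list of (score, is_relevant) pairs, one per candidate
        (article, country) link signal. HYPOTHETICAL input format.
    Returns None if no cutoff reaches the target precision.
    """
    for cutoff in sorted({score for score, _ in examples}):
        # Labels of the candidates that would be predicted at this cutoff.
        kept = [relevant for score, relevant in examples if score >= cutoff]
        precision = sum(kept) / len(kept)
        if precision >= target_precision:
            return cutoff
    return None


# Toy, made-up labeled data (not real evaluation data).
toy = [
    (0.05, False), (0.10, False), (0.19, False), (0.25, False),
    (0.30, True), (0.35, True), (0.43, True), (0.60, True),
]
print(calibrate_threshold(toy))
```

Picking the lowest cutoff that still meets the precision target keeps as much recall as possible while favoring precision.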
Licenses
- Code: MIT License
- Model: CC0 License
Citation
Cite this model as:
@misc{johnson_2024_articlecountry,
title={Article country model},
author={Johnson, Isaac},
year={2024},
url={https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Article_country}
}