Research:Language-Agnostic Topic Classification/Countries

Tracked in Phabricator:
Task T366273

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


This project is an effort to build a model that takes any given Wikipedia article and infers which countries (zero to many) are relevant to it. For example, given the article for en:Akwaeke Emezi, the model should infer Nigeria and the United States of America. The intent is for the model's inference to be a two-stage process: 1) extraction of high-confidence countries based on a curated set of Wikidata properties -- e.g., country of citizenship; 2) inference of additional, lower-confidence countries based on which articles (and their associated countries) the article itself links to. This second stage is designed to cover cases where the Wikidata item is under-developed or where the relationship between the article's topic and a country is not currently modeled well on Wikidata -- e.g., the countries to which a given species is endemic.
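
The two-stage flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `combine_stages`, the source labels, and the example inputs are all assumptions made for this sketch, and both stages are assumed to have already produced plain sets of country names.

```python
def combine_stages(wikidata_countries, link_countries):
    """Merge the two stages into one {country: source} mapping.

    wikidata_countries: countries from curated Wikidata properties (stage 1).
    link_countries: countries inferred from the article's outlinks (stage 2).
    """
    predictions = {}
    # Stage 1: high-confidence countries are recorded first.
    for country in sorted(wikidata_countries):
        predictions[country] = "wikidata"
    # Stage 2: only adds countries that stage 1 did not already find.
    for country in sorted(link_countries):
        predictions.setdefault(country, "wikilinks")
    return predictions


# Hypothetical stage outputs for an article such as en:Akwaeke Emezi.
example = combine_stages({"Nigeria"}, {"Nigeria", "United States of America"})
print(example)
```

Keeping the source of each country alongside the prediction lets a consumer treat the link-based countries as lower confidence, or drop them entirely, without re-running the model.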

Current status

  • Working on: deploying the model.
  • 2024-09-20 with the final model evaluated, the next major step is hosting the model on LiftWing (and making any changes that hosting requires): task T371897
  • 2024-09-13 did an evaluation of the model outputs that shows 0.975 precision and 0.772 recall across all of Wikipedia (more details). This confirms the initial evaluation and also points to some other potential areas for improvement.
  • 2024-08-30 did an initial round of evaluation on a random sample of 50 Wikipedia articles, which had perfect agreement between the human evaluator (me) and model output for 44 of them (88%). The six disagreements were all cases where I thought a country label could have been applied that was not (false negative). This is a much better error than a false positive in my opinion because it just means an article is missing from a query as opposed to falsely appearing in one. Of the six errors, three were related to species and endemic areas, two were people who spent a few years in a country, and one was a movie created in one country but set in another (more details).
  • 2024-08-02 produced draft model card as pre-requisite towards starting conversations about hosting this model on LiftWing. Also uploaded notebooks that cover various aspects of the data pipeline.
  • 2024-07-26 produced a dump (data; README) of all Wikipedia articles along with their topics and country predictions from each of the three signals. This extends the small-scale analysis from the week before and allows us to identify areas where the model predictions are largely coming from wikilinks (which would be an indication that we should dig further to see why). An initial exploration of the results identified a few areas -- e.g., biographies on Chinese Wikipedia, biology articles (mostly species) on many language editions, music articles (mostly albums/songs) on many language editions. One issue raised is that hidden tracking categories sometimes map musical articles to all the countries where they were on the charts (example). Luckily these can be easily removed if we want to (example query).
  • 2024-07-19 did a small-scale analysis (notebook) of the overlap between the three signals under consideration (Wikidata properties; Wikilinks; Wikipedia categories) that shows good agreement amongst them but unique information from each.
  • 2024-07-12 added a sub-page describing work on how to improve the modeling for articles about plant, animal, etc. species (a known challenge for the model given their low coverage of country-related properties on Wikidata).
  • 2024-05-30 updated page to reflect thinking for FY24-25 hypothesis (WE 2.1.1).
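
For readers who want to reproduce the kind of precision and recall figures mentioned in the updates above, a minimal sketch follows. It assumes micro-averaging over (article, country) pairs, which this page does not confirm as the method actually used. The function name and the toy labels are hypothetical.

```python
def micro_precision_recall(truth, predicted):
    """truth and predicted each map an article title to a set of countries."""
    tp = fp = fn = 0
    for article in truth.keys() | predicted.keys():
        t = truth.get(article, set())
        p = predicted.get(article, set())
        tp += len(t & p)  # countries correctly predicted
        fp += len(p - t)  # countries predicted but not in the groundtruth
        fn += len(t - p)  # groundtruth countries the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels (not real evaluation data): one missed country, no wrong ones.
truth = {"A": {"Nigeria", "United States of America"}, "B": {"Kenya"}, "C": set()}
predicted = {"A": {"Nigeria"}, "B": {"Kenya"}, "C": set()}
precision, recall = micro_precision_recall(truth, predicted)
print(precision, recall)
```

In this toy case every error is a false negative, so precision stays at 1.0 while recall drops, which is the same shape of error described in the 2024-08-30 update.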


Test it out


Background


The ORES topic taxonomy was an initiative to establish a set of high-level topics (~60) that captured many of the ways that the Wikipedia community delineated content and could be used to train a machine-learning model that could infer these topics for any given article. It forms the backbone of the topic infrastructure that has been built to understand content dynamics and enable filters that editors can use to more quickly identify content that they might want to edit via recommender systems.

One facet of the taxonomy that was too coarse to be useful for most editors was the regional topics -- e.g., Western Europe, Central Asia, etc. While these topics are useful for high-level statistics about article or pageview distributions, they do not support other use-cases such as helping editors find content relevant to their region (which is generally a country or an even smaller subdivision) or more fine-grained analyses of content gaps. The goal of this project is to develop a new model that can replace this aspect of the original taxonomy with country-level predictions; aggregation to larger regions can still be easily applied afterwards.
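
As a rough illustration of how country-level predictions can be rolled back up to larger regions, consider the sketch below. The lookup table and function name are hypothetical. A real mapping would need to cover every country in the model's list and follow whichever regional scheme the consumer needs.

```python
# Hypothetical country -> region lookup, for illustration only.
COUNTRY_TO_REGION = {
    "France": "Western Europe",
    "Belgium": "Western Europe",
    "Kazakhstan": "Central Asia",
}


def aggregate_to_regions(countries, lookup=COUNTRY_TO_REGION):
    """Return the set of regions covered by a set of predicted countries."""
    # Countries missing from the lookup are skipped rather than guessed.
    return {lookup[c] for c in countries if c in lookup}


regions = aggregate_to_regions({"France", "Belgium", "Kazakhstan"})
print(sorted(regions))
```

The reverse direction is not possible: a regional label alone cannot be split back into countries, which is why the finer-grained prediction is the more useful primitive.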

What is a country?


While many regions are clearly countries and have widely-recognized borders and sovereignty, other regions that we might think of as countries are disputed or actually officially part of a larger region. Choosing a set of "countries" is an inherently political act. The point of this classifier is to support editors who wish to find and edit content relevant to their region as well as analyses of geographic trends while still mapping well to other geographic data such as pageviews or grant regions. The model currently uses this list, which is derived primarily from regions with ISO 3166-1 codes.


Content Gap Metrics


The Knowledge Gaps Index is an attempt to quantify certain important gaps on the Wikimedia projects so as to provide insight into movement trends and ways in which the diversity of the projects could be strengthened. It considers countries an important aspect of content gaps and splits this between a "geographic" gap that covers content with a set of latitude-longitude coordinates on Wikidata (primarily places and events) and a "cultural" gap that covers other content with a less physical relationship to a country (people, political movements, sports teams, etc.) via a separate set of Wikidata properties. This model combines these two approaches (coordinate-based geolocation and cultural Wikidata properties) to form a first-pass, high-confidence prediction of countries for a given Wikipedia article.
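
The coordinate-based half of this approach amounts to asking which country's borders contain a given point. A minimal sketch of that containment check using ray casting is shown below. The border polygons are made-up squares rather than real geometry, the function names are assumptions, and a production pipeline would more likely rely on a geospatial library and real boundary data.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def geolocate(lon, lat, borders):
    """Return the names of all regions whose polygon contains the point."""
    return sorted(
        name for name, poly in borders.items() if point_in_polygon(lon, lat, poly)
    )


# Made-up square "borders", not real country geometry.
BORDERS = {
    "Country A": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "Country B": [(10, 0), (20, 0), (20, 10), (10, 10)],
}

print(geolocate(5, 5, BORDERS))
print(geolocate(25, 5, BORDERS))
```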

Earlier attempts


Given the success of the language-agnostic topic classification model based on an article's outlinks, that same approach was tried first, with the 64 topics replaced by one or more of 193 countries (based on entities identified as sovereign states in Wikidata, with a small amount of manual cleaning). Groundtruth data was based purely on an article's associated Wikidata item and was a union of coordinate location (geolocated to the same set of 193 countries by checking simple containment in each country's borders), place of birth, country, country of citizenship, and country of origin. Only direct matches were used, so, e.g., a place of birth property that references a city (such as Cambridge for Douglas Adams) would not be mapped to the United Kingdom (though in Douglas Adams' case, the country of citizenship property still served to include the United Kingdom).
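
The direct-match rule can be sketched as below. This is an illustration under assumptions, not the original pipeline: the claims are represented as a simplified `{property: [QIDs]}` dictionary rather than the full Wikidata JSON, the function name is hypothetical, the country list is cut down to a single entry, and coordinate-derived countries are assumed to be computed separately and passed in.

```python
# Wikidata property IDs for the non-coordinate properties named above:
# P19 place of birth, P17 country, P27 country of citizenship, P495 country of origin.
COUNTRY_PROPERTIES = ("P19", "P17", "P27", "P495")


def groundtruth_countries(claims, country_qids, geolocated=()):
    """Union of direct country matches across the selected properties.

    claims: simplified {property_id: [value QIDs]} for one Wikidata item.
    country_qids: QIDs of the entities treated as countries.
    geolocated: countries already derived from coordinate location (P625).
    """
    found = set(geolocated)
    for prop in COUNTRY_PROPERTIES:
        for value in claims.get(prop, []):
            # Direct matches only: a city is not resolved to its country.
            if value in country_qids:
                found.add(value)
    return found


COUNTRY_QIDS = {"Q145"}  # United Kingdom only; the list described above had 193 entries
# Simplified claims for Douglas Adams: born in Cambridge (Q350), UK citizen (Q145).
douglas_adams = {"P19": ["Q350"], "P27": ["Q145"]}
print(groundtruth_countries(douglas_adams, COUNTRY_QIDS))
```

With only the place of birth claim, the result would be empty, which is the gap that the country of citizenship property happens to fill in this example.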

This model performed well statistically -- i.e. relatively high precision and recall for most countries -- but had a number of drawbacks that suggested it could still be improved substantially. Most notably, the model struggled to handle articles that relate to many geographies -- e.g., World War II -- and would predict many countries, each with very low confidence, which in practice generally meant no confident predictions at all. This could potentially be handled by lowering the prediction threshold but, in practice, I think this would result in other issues related to false positives.
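
The threshold trade-off can be made concrete with a small sketch. The scores below are invented purely to illustrate the shape of the problem and are not outputs of the actual model. The function name and the 0.5 default are likewise assumptions.

```python
def confident_countries(scores, threshold=0.5):
    """Keep only the countries whose score clears the threshold."""
    return sorted(c for c, s in scores.items() if s >= threshold)


# Invented scores: confidence spread thinly across many countries...
many_geographies = {"Germany": 0.21, "Japan": 0.18, "United Kingdom": 0.17, "France": 0.12}
# ...versus concentrated on one country with a little noise.
single_geography = {"Nigeria": 0.93, "Ghana": 0.04}

print(confident_countries(many_geographies))        # nothing clears 0.5
print(confident_countries(many_geographies, 0.15))  # a lower threshold recovers some
print(confident_countries(single_geography, 0.03))  # but also admits noise elsewhere
```

A single global threshold low enough to help the first article also lets the low-scoring country through for the second, which is the false-positive concern raised above.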

This led me to think that graph-based approaches might show more promise here than I would have expected for the broader topic taxonomy. For example, many topics do not show simple homophily-type relationships -- e.g., an article that links to many articles about people is not itself clearly an article about a person. I would expect clearer relationships with geography though -- i.e. an article that links repeatedly to content about a particular region is almost certainly relevant to that region. Thus, a careful aggregation of all of the countries (by way of articles) that a given article links to should provide a high-confidence classifier for identifying additional relevant countries for an article.
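
One simple form such an aggregation could take is sketched below: count how many outlinks point to articles associated with each country, then keep countries that account for a large enough share of the links. The function name, the `min_share` and `min_links` parameters and their defaults, and the example data are all assumptions for illustration. The page does not specify the actual aggregation rule.

```python
from collections import Counter


def countries_from_links(outlinks, article_countries, min_share=0.25, min_links=3):
    """Infer countries for an article from the countries of its outlinks.

    outlinks: titles of articles linked from the article.
    article_countries: {title: set of countries} for articles with known countries.
    Returns {country: share of outlinks associated with that country}.
    """
    counts = Counter()
    for title in outlinks:
        for country in article_countries.get(title, ()):
            counts[country] += 1
    total = len(outlinks)
    return {
        country: n / total
        for country, n in counts.items()
        if n >= min_links and n / total >= min_share
    }


# Hypothetical outlinks for a species article with no country on its Wikidata item.
outlinks = ["Madagascar", "Antananarivo", "Toamasina", "Lemur",
            "Paris", "Genus", "Taxonomy", "IUCN Red List"]
article_countries = {
    "Madagascar": {"Madagascar"},
    "Antananarivo": {"Madagascar"},
    "Toamasina": {"Madagascar"},
    "Lemur": {"Madagascar"},
    "Paris": {"France"},
}
inferred = countries_from_links(outlinks, article_countries)
print(inferred)
```

Requiring both a minimum count and a minimum share is one way to keep a single passing mention (here, one link associated with France) from producing a label.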

After further experimentation, I am less certain that this will actually solve the many-geography problem (as with World War II) but it is still a far simpler and easier-to-maintain approach than a trained model. Namely, a learned model would have a limited, static vocabulary and would need to be retrained regularly, whereas an approach that just relies on knowing which countries are associated with each linked article can be easily updated as new articles are created and content is edited. Further, the gaps in the Wikidata-based groundtruth are highly topically biased -- i.e. the countries are not missing at random but are highly concentrated in certain topical areas like flora/fauna. This would make it difficult for a model to learn to fill in these gaps because the right patterns would not necessarily be present in the training data.