Research:Long Tail Topic Ontology

Duration:  2019-03 – ??

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

The Long Tail Topic Ontology explores methodologies for programmatically identifying "knowledge gaps", or systematically underrepresented areas of Wikipedia. This project consists of three distinct parts: 1) developing a scalable algorithm that generates sufficiently specific categories, 2) mapping metrics to these categories to illustrate the "completeness" of topical areas, and 3) developing visualizations to effectively explore and navigate the resulting data set. A presentation of the project and presenter's notes are available here.



At present, Wikipedia does not have a comprehensive category structure or a visualization strategy that can be leveraged to understand the balance between overrepresented and underrepresented topics, or what information exists within the encyclopedia and what information does not. Existing approaches to categorizing the encyclopedia either do not provide sufficient depth or likely reflect and perpetuate biases towards existing topics. For instance, the ORES Draft Topic model can identify 64 different broad topics, but these topics are not specific enough to identify missing content. In this case we may know that women are an underrepresented category, but we cannot tell whether these women are predominantly scientists or athletes. Conversely, while the WikiProject taxonomy provides sufficient depth in some areas, it likely suffers from the same production biases that plague the encyclopedia as a whole. Branches of the taxonomy that receive more attention are likely overspecified, while underproduced topics may fall under overly broad categories.



In this project we first aim to develop a comprehensive category structure that provides enough breadth and depth to understand the distribution of information within the encyclopedia, and how volunteer labor is distributed over those categories. In other words, we aim to understand 1) what topics are covered within the encyclopedia, 2) the quantity and diversity of articles that belong to topics, 3) the relationships between topics, and 4) who contributes to given topics. This taxonomy should be:

  • Specific: Lowest level categories are smaller and more precise than our existing models.
  • Cohesive: Each category should contain only articles about a particular topic.
  • Comprehensive: The taxonomy can be trivially applied to all Wikipedia articles in multiple language editions.

Metrics & Use Cases


We then join various completeness metrics to the topical clusters developed during the categorization step. Completeness metrics are based on use cases that focus on, but are not limited to, identifying systematic knowledge gaps. For instance, calculating the mean misalignment per cluster would allow us to see which topical areas have relatively high numbers of page views but low quality.
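As a minimal sketch of the "mean misalignment per cluster" idea, the snippet below groups per-article scores by cluster and averages them. The record layout, the cluster names, and the sign convention (positive means more page views than quality warrants) are assumptions made for illustration, not the project's actual schema.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (article title, cluster label, misalignment score).
# Assumption: a positive score means more page views than quality warrants.
articles = [
    ("Article A", "cluster-1", 2.0),
    ("Article B", "cluster-1", 1.0),
    ("Article C", "cluster-2", -1.0),
    ("Article D", "cluster-2", 0.0),
]


def mean_misalignment_per_cluster(records):
    """Group misalignment scores by cluster and average each group."""
    scores = defaultdict(list)
    for _title, cluster, misalignment in records:
        scores[cluster].append(misalignment)
    return {cluster: mean(values) for cluster, values in scores.items()}


cluster_means = mean_misalignment_per_cluster(articles)

# Clusters ordered from most to least misaligned on average.
ranked = sorted(cluster_means.items(), key=lambda item: item[1], reverse=True)
```

Sorting by the per-cluster mean surfaces topical areas with many outliers rather than individual outlier articles.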

For each of these use cases, integrating EdCast’s clustering strategy allows us to identify topical areas of the encyclopedia that correlate with the associated completeness metric. Whereas the metric itself can pinpoint specific outlier articles, pairing a metric with EdCast’s clustering strategy allows us to explore topical areas that contain many outliers. For our purposes, these topical areas may represent different kinds of knowledge gaps.

Existing and more refined taxonomies could be leveraged for a similar purpose, but the depth, granularity, and scalability of EdCast’s clustering method provide a distinct advantage. While the ORES Draft Topic model could tell us at a high level which parts of the encyclopedia are systematically underrepresented, it does not provide the level of detail necessary to understand or begin to fix the issue.



In order to better understand the shape of the resultant dataset, we develop and test a series of visualizations designed to explore the clustered data. These visualizations highlight existing underrepresented areas of Wikipedia and allow the user to navigate through the hierarchical clusters. These visualizations are hosted on Toolforge at



  • All data collection, processing, clustering, and validation code is located at
  • A non-proprietary but untested version of EdCast’s processing and clustering methodology is available at
  • Visualization code is located at



As of July 19, 2020, we are using a dataset of 62,043 English Wikipedia articles. We started by collecting all articles contained in Wikipedia’s Vital 1000. Next, we expanded the dataset to contain all articles that are linked to from the Vital 1000 articles.

Using articles from the Vital 1000 allows us to validate our model against the Vital 1000 taxonomy, and this collection of articles is of relatively high quality. However, this small subset of articles does not contain enough data to test the viability of our clustering approach, so we expanded the dataset.
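The one-hop expansion described above can be sketched as follows. The function name `expand_by_links` and the toy link graph are hypothetical; in practice the outgoing links would come from a Wikipedia dump or API rather than a hand-written dictionary.

```python
def expand_by_links(seed_titles, outgoing_links):
    """Return the seed articles plus every article they link to (one hop only)."""
    expanded = set(seed_titles)
    for title in seed_titles:
        # Articles with no known outgoing links contribute nothing.
        expanded.update(outgoing_links.get(title, ()))
    return expanded


# Hypothetical toy link graph: article title -> titles it links to.
outgoing_links = {
    "Earth": ["Moon", "Sun"],
    "Sun": ["Star", "Earth"],
    "Moon": ["Apollo 11"],
}

seeds = ["Earth", "Sun"]  # stand-in for the Vital 1000 list
dataset = expand_by_links(seeds, outgoing_links)
```

Note that "Apollo 11" is not included: it is linked only from "Moon", which is not a seed, so it sits two hops away.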



In order to generate clusters, we use a basic k-means clustering algorithm. We use cosine similarity as our metric to determine the similarity of two documents. At present the optimal number of clusters is 1300, which we determined using silhouette analysis. We algorithmically generate labels for each cluster by computing the top 4 most common words in each cluster. We use the scikit-learn implementation of k-means and silhouette score.
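The steps above can be sketched with scikit-learn roughly as follows. This is an illustrative sketch, not the project's code: the toy corpus, the TF-IDF representation, the candidate range for k, and the helper names `pick_k` and `label_clusters` are all assumptions. scikit-learn's `KMeans` minimizes Euclidean distance, so the sketch relies on L2-normalized vectors to approximate cosine-based clustering.

```python
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical toy corpus standing in for article text.
documents = [
    "physics energy quantum particle theory",
    "quantum particle physics experiment energy",
    "energy theory physics quantum field",
    "football team league player match",
    "player match football goal team",
    "league team football player season",
]

# TfidfVectorizer L2-normalizes rows by default; on unit-length vectors,
# Euclidean distance is a monotone function of cosine similarity.
vectors = TfidfVectorizer().fit_transform(documents)


def pick_k(vectors, candidates, seed=0):
    """Fit k-means for each candidate k and keep the best silhouette score."""
    best = None
    for k in candidates:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(vectors)
        score = silhouette_score(vectors, labels, metric="cosine")
        if best is None or score > best[1]:
            best = (k, score, labels)
    return best


def label_clusters(documents, labels, top_n=4):
    """Label each cluster with its top_n most common words."""
    counts = {}
    for text, cluster in zip(documents, labels):
        counts.setdefault(int(cluster), Counter()).update(text.split())
    return {
        cluster: [word for word, _ in counter.most_common(top_n)]
        for cluster, counter in counts.items()
    }


best_k, best_score, labels = pick_k(vectors, candidates=range(2, 5))
cluster_labels = label_clusters(documents, labels)
```

On a real corpus the candidate range would be far larger, and stop words would need to be removed before counting words for the labels.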

In order to create a hierarchical taxonomy, we have experimented with recursive clustering. Using k-nearest neighbors, we recursively combine k clusters into a higher-level category until the number of clusters is less than or equal to k. We have experimented with values of k = [2, 8], although ultimately k should be determined dynamically for each cluster to improve coherence. Labels are generated as before using the top 4 unigrams of the combined clusters.
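One possible reading of this recursive merge is sketched below in plain Python. It is an assumption-laden illustration, not the project's method: the greedy grouping order, the use of mean centroids for parent nodes, and the node layout (`name`, `centroid`, `children`) are all hypothetical choices, and label generation is omitted.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def merge_level(nodes, k):
    """Greedily group each node with its k-1 nearest ungrouped neighbours."""
    remaining = list(nodes)
    parents = []
    while remaining:
        anchor = remaining.pop(0)
        # Most similar remaining nodes first.
        remaining.sort(
            key=lambda n: cosine_similarity(anchor["centroid"], n["centroid"]),
            reverse=True,
        )
        group = [anchor] + remaining[: k - 1]
        remaining = remaining[k - 1 :]
        parents.append(
            {
                "centroid": mean_vector([n["centroid"] for n in group]),
                "children": group,
            }
        )
    return parents


def build_hierarchy(leaves, k):
    """Merge levels until at most k top-level nodes remain."""
    if k < 2:
        raise ValueError("k must be at least 2 for the merge to terminate")
    level = leaves
    while len(level) > k:
        level = merge_level(level, k)
    return level


# Hypothetical leaf clusters with two-dimensional centroids.
leaves = [
    {"name": "physics", "centroid": [1.0, 0.0], "children": []},
    {"name": "chemistry", "centroid": [0.9, 0.1], "children": []},
    {"name": "football", "centroid": [0.0, 1.0], "children": []},
    {"name": "tennis", "centroid": [0.1, 0.9], "children": []},
]

top_level = build_hierarchy(leaves, k=2)
```

Because the grouping is greedy, the result depends on the order in which clusters are visited, which is one reason a dynamically chosen k per cluster may give more coherent categories.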




The current iteration of this project focuses on the following metrics:

  • pageview/quality misalignment
  • quality
  • pageviews
  • number of editors.

Most metrics are collected from public APIs or datasets, but we calculate misalignment using Warncke-Wang et al.’s methodology. Other possible metrics are outlined in this Google Doc.