Research:Content gaps on Wikipedia/Taxonomy

One of the original stated goals of this project was to develop a taxonomy of content gaps. This means a taxonomy in the biological sense: a hierarchical list of categories, sub-categories, and instances that describes a domain.

This has proven to be a challenging endeavor. It is difficult, if not impossible, to develop a comprehensive, hierarchically-structured, and canonical representation of Wikipedia content. Wikipedia's own category structure for content is heterarchical, not hierarchical, and it is rife with inconsistencies (and gaps!).

The task of developing a taxonomy of content gaps, that is, differences in coverage between concepts/topics/domains (all arbitrary units!), is even more difficult. Any definition of a ‘content gap’ necessarily requires a definition of scope (a unit of analysis, and a principle for grouping those units into one or more ‘topic’ buckets), a basis for comparison (a standpoint from which the gap will be assessed), and a measure of difference (a metric for assessing relative coverage).
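As a rough illustration, the three ingredients of a gap definition can be written down as a small data structure. This is only a sketch under stated assumptions: the class name `GapDefinition`, its field names, the choice of article count (`len`) as the measure, the reading of "basis for comparison" as a reference topic bucket, and the placeholder topics and units are all hypothetical and are not part of the project's own tooling.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class GapDefinition:
    # Scope: units of analysis (e.g. article titles) grouped into topic buckets.
    scope: Dict[str, List[str]]
    # Basis for comparison: here assumed to be the bucket used as the reference point.
    basis: str
    # Measure of difference: any function that scores a bucket's coverage.
    measure: Callable[[List[str]], float]

    def gaps(self) -> Dict[str, float]:
        """Return each non-reference bucket's coverage minus the reference coverage."""
        reference = self.measure(self.scope[self.basis])
        return {
            topic: self.measure(units) - reference
            for topic, units in self.scope.items()
            if topic != self.basis
        }


# Placeholder data only: these are not real topics or articles.
example = GapDefinition(
    scope={"topic_a": ["a1", "a2", "a3"], "topic_b": ["b1"]},
    basis="topic_a",
    measure=len,  # assumption: coverage measured as a simple article count
)
print(example.gaps())  # {'topic_b': -2}
```

Changing any one of the three fields (regrouping the units, picking a different reference bucket, or swapping the article count for another metric) changes the reported gap, even though the underlying content stays the same.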

All of these formalizations require subjective judgements. As someone who participated in a 2015 workshop aimed at defining Wikipedia Gaps noted, “We’ve got not just content gaps, but a bestiary of gaps.”[1] A bestiary here probably means “an unusual or whimsical collection”, similar to a cabinet of curiosities—a somewhat arbitrary collection of things where the organizing principle reflects how interesting or relevant they seem to the person who put the collection together, not necessarily their inherent similarity or comprehensiveness. In other words, defining gaps in Wikipedia content using a traditional taxonomic approach turns out to be more complicated than representing the content itself as a taxonomy.

That said, I did attempt to develop a taxonomy of Wikipedia content gaps, in which any topic (an article, or a group of conceptually-related articles and other media and metadata) can be classified. This page contains my work on the taxonomy so far.

Overview of the taxonomy

This taxonomy is based on many assumptions, some of which are described in the definitions of terms like ‘coverage’ and ‘topic’ outlined in "Key terms". The other big assumption that underlies this taxonomy is that you can divide human knowledge into three categories: content related to individual identity characteristics, content related to group identity characteristics, and content that is of common interest or universal relevance to everyone. The top-level categories (individual, group, and common interest gaps) are intended to be comprehensive, and as mutually exclusive as possible. All topics on Wikipedia should be classifiable according to one of these categories (although they may fit more than one), and therefore any content gap should be able to be characterized according to (at least) one category. The second-level categories (e.g. gender gaps, cultural context gaps) are not intended to be comprehensive or mutually exclusive: that is, a gap that is characterized as a gender gap may also reflect a cultural context gap, or another type of gap that is not described here at all.

The examples (nominally, a third level of the taxonomy) are intended to illustrate the kinds of specific gaps that can be classified under each second-level category. They are mutually-exclusive, but not comprehensive.

However imperfect this taxonomy is (and it is very imperfect), it may still be an effective tool for thinking about the underlying factors that influence observed gaps, and/or for hypothesizing about the characteristics of potential gaps that you have not yet observed.

Content gap taxonomy vs content gap matrix

The content gap taxonomy is intended to be orthogonal to the content gap matrix. That is, you should be able to take any well-described example in this taxonomy (third-level) and characterize it according to the dimension of coverage and basis for comparison. The taxonomy is a framework for organizing content; the matrix is a framework for organizing gaps.

A taxonomy of content gaps

Individual identity gaps

Gaps in content related to individual human qualities, life experiences, or identity characteristics.

Content can be related to qualities of individual people that are both personal and universal. Personal in the sense that they shape thought and behavior, universal in that they are shared by all people, across cultures and throughout history. Sometimes content on Wikipedia describes these personal qualities, or is classified according to them. Content may also be more or less relevant to people, depending on their personal qualities. Gaps occur when content on Wikipedia represents people differently or is differently relevant to people based on these personal qualities.

Gender gaps

Coverage differences about or relevant to people of different biological sexes or gender identities


  • Coverage differences between articles about political office holders of different genders
  • Coverage differences between topics of differential interest to men and women

Generation gaps

Coverage differences about or relevant to people of different ages or stages in life.


  • Coverage differences in topics of interest to people who are currently in their 60s vs. 20s
  • Coverage differences related to the lived experience of people who witnessed historical events vs. those who did not directly experience those events

Group identity gaps

Coverage differences related to current or historical group identities, culturally-mediated experiences of the world, and cultural properties.

Content can be related to characteristics that have developed among particular groups of people with shared experiences. These cultural characteristics shape thought and behavior, reflect shared understandings of the world, and are often used by the groups themselves (or by outsiders) to separate “us” from “them”.

Sometimes content on Wikipedia describes these cultural characteristics. Content also describes, or is classified according to, artifacts, events, ideas, and institutions that are shaped by them. Content may be more or less relevant to people from different cultures, depending on these characteristics. Gaps occur when content on Wikipedia represents groups differently or is differently relevant to group members based on these cultural characteristics.

Epistemic gaps

Coverage differences that represent or reflect culturally-mediated understandings of encyclopedic knowledge.


  • Coverage differences between topics related to so-called “Western” and “Eastern” medicine
  • Coverage differences between Aboriginal and non-Aboriginal accounts of Australian history before European colonization

Cultural context gaps

Coverage differences related to the history, heritage, and characteristics of a current or former cultural group


  • Gaps in the way in which religious movements (the Protestant Reformation vs the Ghost Dance movement) are represented on Wikipedia
  • Gaps in the way in which historical events related to encounters between cultural groups (the Crusades, the Vietnam War) are represented on Wikipedia
  • Gaps in coverage of labor movements in the United States vs. Poland in English Wikipedia vs. Polish Wikipedia

Common interest gaps

Coverage gaps related to content that can be considered to be universally relevant to all people and which is within the scope of an encyclopedia.

Content can be considered to be in the common interest if it is universally relevant to all humans, regardless of individual or group identities, because all humans occupy the same physical world and share many personal traits and lived experiences. What is considered to be ‘universally relevant’, in the context of Wikipedia content, also reflects implicit expectations of what Wikipedia is. Defining common interest content requires asking the question: what kind of content is both universally relevant and suitable for an encyclopedia?

Content related to the history and nature of the universe and the non-human world, and in some cases even human history and individual and group behavior, may be considered to be universally relevant and of common interest to all people. Gaps occur when content on Wikipedia represents different common interest topics differently.

Note: Accepting this category as valid (and distinct from the categories above) requires accepting a fundamental set of metaphysical, ontological, and epistemological assumptions rooted primarily in so-called Western culture (think science, logic, mathematics). Since encyclopedias in general, and Wikipedia in particular, are a product of this culture, these assumptions are inextricably bound up in the definition of an encyclopedia. They shape expectations of what information is or is not ‘encyclopedic’, and how that information should be organized. Therefore, what is considered to be ‘common interest’ knowledge in terms of Wikipedia cannot be considered truly objective or culturally-neutral. It will always be contested, and will change over time. At best, we can draw a hazy general consensus around a variety of topics that many people across many individual/cultural/historical axes have agreed upon.

Geographical gaps

Coverage differences in topics related to geographic regions or population distribution.


  • Coverage differences in topics related to the so-called global north vs. the global south
  • Coverage differences related to content about urban vs. rural locales

Genre gaps

Coverage differences among common interest topics implicitly considered equally within the scope of an encyclopedia


  • Coverage differences between content about 21st century natural disasters vs. 11th century natural disasters
  • Coverage differences between content about ecology vs. genetics
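The first two levels of the taxonomy above can be sketched as a simple mapping, which makes it easier to see that one gap can carry several second-level labels and therefore fall under more than one top-level category. The category names come from this page. The function names and the "Language gaps" label used in the example are hypothetical, and this is an illustrative sketch rather than a canonical encoding of the taxonomy.

```python
from typing import Dict, Iterable, List, Set

# Top-level category -> second-level categories, as listed on this page.
TAXONOMY: Dict[str, List[str]] = {
    "Individual identity gaps": ["Gender gaps", "Generation gaps"],
    "Group identity gaps": ["Epistemic gaps", "Cultural context gaps"],
    "Common interest gaps": ["Geographical gaps", "Genre gaps"],
}


def top_level_categories(second_level_labels: Iterable[str]) -> Set[str]:
    """Return every top-level category that contains at least one of the labels."""
    labels = set(second_level_labels)
    return {top for top, subs in TAXONOMY.items() if labels & set(subs)}


def undescribed_labels(second_level_labels: Iterable[str]) -> Set[str]:
    """Return labels that the second level does not list (it is not comprehensive)."""
    known = {sub for subs in TAXONOMY.values() for sub in subs}
    return set(second_level_labels) - known


# A gap described as both a gender gap and a cultural context gap.
print(sorted(top_level_categories(["Gender gaps", "Cultural context gaps"])))
# "Language gaps" is a made-up label used only to show the non-comprehensive case.
print(undescribed_labels(["Gender gaps", "Language gaps"]))
```

Returning a set rather than a single category reflects the stated design: the top level aims to be comprehensive, while the second level is neither comprehensive nor mutually exclusive.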