Research:Content gaps on Wikipedia

Tracked in Phabricator:
Task T235544

Contact

Jonathan Morgan

Wikimedia Foundation

Duration: November 2019 – December 2019

This page documents a completed research project.

Goals

Summarize findings from a body of relevant academic and industry research focused on content gaps related to the selection, extent, and framing of hypertextual Wikipedia content (e.g. text, links and citations, structured metadata, but not multimedia)
Identify the empirical methods used in these various studies, and their advantages and limitations with respect to their general applicability for large-scale analysis of content gaps across different languages of Wikipedia and for different forms of hypertext-based information
Identify the potential causes of content gaps described in these various studies, and the supporting evidence for each
Develop a taxonomy of content gaps
Provide recommendations for topic-, language-, and format-agnostic metrics and measurement techniques that can support the evaluation of both technological and programmatic interventions to close content gaps.

Guiding questions

What are the selection, extent, and framing gaps that have been identified in previous literature?
Which of the proposed causes for these gaps are best supported by currently available evidence?
What are the characteristics of previous programmatic and technological interventions that have shown some success at addressing these content gaps?
What metrics have been used to quantify extent or change over time in content gaps, and which of these metrics show most promise for general applicability—beyond a specific topic, language, or type of content?

Literature review

This project began with a literature review of previous work related to content gaps. The literature review focused on:

Identifying different categories of content gap, to inform the development of a taxonomy (or other classification system)
Identifying methods and metrics used to identify different kinds of gaps in previous research, and comparing the benefits and limitations of these methods
Identifying potential causes of content gaps, and evaluating the evidence provided for these proposed causes in previous research

This literature review focused on content gaps in information presented in Wikipedia. Gaps specific to Wikidata, Commons, and other Wikimedia projects (e.g. Wiktionary) are beyond the scope of this literature review, and may be addressed in a separate study. However, I hope that the content gap taxonomy and content gap matrix that are the primary outputs of this project will also be useful for characterizing content gaps on those projects.

For the full list of papers reviewed for this project, see Research:Content gaps on Wikipedia/Literature.

For an annotated bibliography of selected papers, see Research:Content gaps on Wikipedia/Assessment. These papers have been classified according to the taxonomy and the matrix, and the causes and metrics for the gaps investigated in each paper have been summarized (see also Gap assessments below).

Defining content gaps: key terms

In order to classify content gaps, we first need to define what we mean when we talk about a 'content gap'. The definitions below were developed to ensure that all content gaps discussed here are described in a precise and consistent way.

Topic: something that could be the focus of one or more Wikipedia articles
Content: information about a topic
Coverage: how well Wikimedia project content addresses a particular topic according to some dimensions of coverage:
- Selection: whether the content is present or not
- Extent: how much content the topic has overall, or how much of a particular kind of content there is about the topic
- Framing: whose priorities and perspectives are reflected in the content
Content gaps: Differences in coverage of one or more topics, relative to some basis for comparison.
- Internal comparison: coverage differences among one or more Wikimedia projects
- External comparison: coverage differences between Wikimedia projects and an external repository or corpus
- Reader needs: differences between Wikipedia content and the needs or expectations of Wikipedia’s direct readership

Notes on these key terms

Scope of a topic
1. For practical purposes, an extant or plausible Wikipedia article is the minimum unit of analysis for a topic. Basically, we are using Wikipedia’s own, albeit fuzzy, inclusion criteria to avoid making the term topic an infinitely divisible unit of information, or an arbitrary label that applies to any kind of information whatsoever. So “Kate Middleton’s Wedding Dress” is a topic, even though it is a very narrow one, and so is “Science”, even though it is very broad. But “Pre-wedding speculation about Kate Middleton’s Wedding Dress” (a section in the wedding dress article) is probably too small to be a topic, and “Jonathan Morgan’s five favorite scientists” isn’t sufficiently encyclopedic (yet?).
2. Topics can also contain an arbitrarily large number of articles, even "all the articles in Wikipedia".
3. Topics are nested. So “Science” is a topic, and so is “Biology”, and so is “Science of underwater diving” (a real article on English Wikipedia!)
Selection vs. extent coverage dimensions
1. To characterize any particular content gap, it is necessary to set boundaries between what the dimensions of “selection” and “extent” mean in that context. Selection criteria are used to define a comparison set, and extent criteria are used to measure differences. These terms, along with "framing", were introduced in the Knowledge Gaps white paper. I have tried to formally define them here in a way that aligns with how they are used in the white paper.
  - If the content gap is related to the number of articles about women scientists, then “selection” will be a binary value (whether or not a biography of a given woman scientist exists on Wikipedia), and “extent” can be used for any number of countable features of articles about women scientists: the number of citations, bytes, words, etc., or composite, second-order values like predicted ORES article quality scores.
  - If a gap is related to content that is potentially spread across multiple articles, then extent needs to be defined differently. For example, if the topic is “Indigenous perspectives on the colonization of the Americas” (which is not an article in itself), then the selection criterion could be whether any information at all about this topic exists on a particular Wikipedia (even if that information is spread across several articles), and the extent criterion would be a measurement of some feature of this content, for example the number of oral history recordings from tribal members cited in those articles on that Wikipedia.
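To make the distinction concrete, here is a minimal Python sketch of the first case (one article per topic). It is only an illustration: the `Article` record, its field names, and the placeholder titles are hypothetical and do not come from any of the reviewed studies. Selection is treated as a yes/no check against a comparison set, and extent as a countable feature of whatever article exists.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Article:
    """Hypothetical record for one article; field names are illustrative."""
    title: str
    word_count: int
    citation_count: int


def selection(topic: str, articles: dict[str, Article]) -> bool:
    """Selection: does an article for this topic exist at all?"""
    return topic in articles


def extent(topic: str, articles: dict[str, Article], feature: str) -> Optional[int]:
    """Extent: a countable feature of the topic's article, or None if absent."""
    article = articles.get(topic)
    return getattr(article, feature) if article is not None else None


def selection_rate(topics: list[str], articles: dict[str, Article]) -> float:
    """Share of topics in the comparison set that have an article."""
    if not topics:
        return 0.0
    return sum(selection(t, articles) for t in topics) / len(topics)


# Made-up comparison set and made-up article data, for illustration only.
comparison_set = ["Scientist A", "Scientist B", "Scientist C", "Scientist D"]
wiki = {
    "Scientist A": Article("Scientist A", word_count=1200, citation_count=15),
    "Scientist C": Article("Scientist C", word_count=300, citation_count=2),
}

rate = selection_rate(comparison_set, wiki)
cites = extent("Scientist C", wiki, "citation_count")
missing = extent("Scientist B", wiki, "word_count")
```

In this sketch the comparison set fixes what counts as "present or not", while the choice of `feature` fixes what is being measured. The second case (content spread across several articles) would need a different unit than a single article record.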
Internal vs. external comparison
1. Internal comparison means you are making a completely endogenous comparison, i.e. comparing (the selection, extent, or framing of) some content on a topic within a Wikipedia with some other content within that project or a sister project. All Wikimedia projects share a common mission, and all Wikimedia content is produced by volunteers, according to a shared set of policies and norms, using a common technological platform. Therefore all content within Wikimedia projects shares a set of inherent similarities that set it apart from curated content in any other knowledgebase or repository.
2. External comparison means comparing Wikimedia content with any repository that is not a Wikimedia project, such as an existing database or reference work, or a corpus collected according to different curation principles and/or produced by different means.
Specific focus on direct readership
1. We have decided to focus on the needs of Wikipedia’s direct readership in defining content gaps, as opposed to indirect readership. This distinction comes directly out of the Knowledge Gaps white paper. By direct readership we mean people who are consuming content on a Wikimedia project or platform, rather than a) companies that re-use or syndicate Wikipedia content, like Google with their knowledge panel or Amazon with Alexa, or b) people who consume Wikipedia content through those platforms. The needs of indirect readership are a potential future basis for comparison, but for now we know little about the needs of these users, and do not have any practical way of eliciting or inferring those needs.

Content gap matrix


comparison basis	coverage dimension
	Selection	Extent	Framing
Internal comparison	IC x S	IC x E	IC x F
External comparison	EC x S	EC x E	EC x F
Reader needs	RN x S	RN x E	RN x F

The content gap matrix is a tool for classifying any potential content gap, regardless of topic. Use this matrix to define, identify, or compare content gaps (whether they’re related to gender, geography, cultural context, or something else entirely) in terms of the dimension of coverage (selection, extent, or framing) and the basis for comparison (internal comparison, external comparison, or reader needs). See "Key Terms" above for definitions of these sub-types.
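One way to think about the matrix is as the cross product of two small enumerations. The Python sketch below is a hypothetical encoding, not part of the project's outputs: the class names `ComparisonBasis`, `CoverageDimension`, and `MatrixCell` are assumptions made for illustration, and the short codes follow the cell labels used in the matrix.

```python
from dataclasses import dataclass
from enum import Enum


class ComparisonBasis(Enum):
    INTERNAL_COMPARISON = "IC"
    EXTERNAL_COMPARISON = "EC"
    READER_NEEDS = "RN"


class CoverageDimension(Enum):
    SELECTION = "S"
    EXTENT = "E"
    FRAMING = "F"


@dataclass(frozen=True)
class MatrixCell:
    """One cell of the matrix: a basis for comparison plus a coverage dimension."""
    basis: ComparisonBasis
    dimension: CoverageDimension

    @property
    def label(self) -> str:
        return f"{self.basis.value} x {self.dimension.value}"


def all_cells() -> list[MatrixCell]:
    """Enumerate every cell, row by row (basis) and column by column (dimension)."""
    return [MatrixCell(b, d) for b in ComparisonBasis for d in CoverageDimension]


labels = [cell.label for cell in all_cells()]
```

Because the cells are frozen dataclasses, they can be used as dictionary keys, for example to group assessed papers by the cell they fall into.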

Different combinations of coverage dimension and basis for comparison for any given topic yield different content gaps. This matrix could be useful in several ways:

Defining identified content gaps more concretely
Inferring the existence or characteristics of unidentified gaps related to a topic, based on the characteristics of identified gaps.
Identifying metrics or measurement techniques that can quantify similarly-classified gaps on different topics.
Identifying common causes and consequences of similarly-classified content gaps on different topics.

For examples of completed content gap matrices for different topics, see Research:Content gaps on Wikipedia/Matrix.

Content gap taxonomy


Gap type	Gap subtype (example)
Individual identity gaps	Gender gap
Individual identity gaps	Generation gap
Group identity gaps	Epistemic gap
Group identity gaps	Cultural context gap
Common interest gaps	Geographical gap
Common interest gaps	Genre expectations gap

Creating a content gap taxonomy that is a) comprehensive, b) not arbitrary, and c) not simply a list of all possible topics in Wikipedia with the word "gap" after each name has proven to be challenging. Nevertheless, I have attempted to develop a taxonomy of Wikipedia content gaps, in which any topic (an article, or a group of conceptually-related articles or other media and metadata) can be classified.

For the current content gap taxonomy with definitions, examples, and lots of caveats see Research:Content gaps on Wikipedia/Taxonomy.

Gap assessments

The content gap matrix and taxonomy were tested by applying them to a selection of the reviewed research papers related to different kinds of content gaps. I have assessed the content gaps implicated in each of these papers by...

attempting to classify (at least one of) the content gaps evaluated in the paper according to both the content gap matrix and the content gap taxonomy.
calling out any potential causes of the gap (whether demonstrated in the analysis or postulated by the researchers)
describing the methodology the researchers used to measure the gap

Example: gap assessment of Menking et al. (2017)

Menking, A., McDonald, D. W., & Zachry, M. (2017). Who Wants to Read This?: A Method for Measuring Topical Representativeness in User Generated Content Systems. Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing - CSCW ’17, 2068–2081. https://doi.org/10.1145/2998181.2998254

Gap matrix ID: Reader needs x external comparison. The study uses Wikipedia’s internal search to test the results of keywords that are presumed to have a different interest valence for men and women. This simulates the search behavior of real Wikipedia readers (reader needs). The sourcing of the keywords from a non-Wikipedia reference corpus (articles from magazines marketed to men or women) makes the comparison external.

Gap taxonomy ID: Individual identity --> Gender gap. Coverage differences relevant to people of different biological sexes or gender identities.

Causes of the gap: (proposed) The gender gap in content of particular interest to women is mediated by lower participation by women in editing.

Measurement strategy: The researchers take a mixed-methods approach. They look for articles that match topic keywords sourced from the EBSCO database of magazine articles, for four magazines targeted at women and four magazines targeted at men. They assess whether searching on a topic keyword with a gendered valence led to a) a direct match with a Wikipedia article title; b) an automatic redirect to the article (indicating that the topic keyword directly matched a redirect page title); c) a direct match with a Wikipedia disambiguation page in which one of the options was directly equivalent to the topic keyword (based on human judgement); or d) a search engine results page (SERP) in which one of the listed articles was directly equivalent to the topic keyword (based on human judgement).
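The four outcomes above can be read as a simple coding scheme. The Python sketch below shows how hand-coded search observations might be tallied under that scheme. It is not the researchers' instrument: the `SearchObservation` record, its field values, the `NO_MATCH` fallback category, and the placeholder keywords are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    TITLE_MATCH = "a"      # direct match with an article title
    REDIRECT = "b"         # automatic redirect to an article
    DISAMBIGUATION = "c"   # disambiguation page with an equivalent option
    SEARCH_RESULTS = "d"   # SERP listing an equivalent article
    NO_MATCH = "none"      # assumed fallback; not one of the four listed outcomes


@dataclass
class SearchObservation:
    """Hypothetical hand-coded record of one keyword search."""
    keyword: str
    landed_on: str          # "article", "redirect", "disambiguation", or "serp"
    equivalent_found: bool  # human judgement; only used for "disambiguation" and "serp"


def classify(obs: SearchObservation) -> Outcome:
    """Map one observation to an outcome code."""
    if obs.landed_on == "article":
        return Outcome.TITLE_MATCH
    if obs.landed_on == "redirect":
        return Outcome.REDIRECT
    if obs.landed_on == "disambiguation" and obs.equivalent_found:
        return Outcome.DISAMBIGUATION
    if obs.landed_on == "serp" and obs.equivalent_found:
        return Outcome.SEARCH_RESULTS
    return Outcome.NO_MATCH


# Made-up observations with placeholder keywords, for illustration only.
observations = [
    SearchObservation("keyword 1", "article", False),
    SearchObservation("keyword 2", "redirect", False),
    SearchObservation("keyword 3", "disambiguation", True),
    SearchObservation("keyword 4", "serp", False),
]

tally = Counter(classify(obs) for obs in observations)
```

Tallies like this could then be compared between the keyword sets drawn from the two groups of magazines. Note that outcomes c) and d) depend on a human judgement that the code only records and does not make.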

For more examples of assessed research papers, see Research:Content gaps on Wikipedia/Assessment.