Research:Content gaps on Wikipedia/Assessment

Assessment of the content gap taxonomy and content gap matrix on a selection of reviewed papers.

Menking et al. 2017

Menking, A., McDonald, D. W., & Zachry, M. (2017). Who Wants to Read This?: A Method for Measuring Topical Representativeness in User Generated Content Systems. Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing - CSCW ’17, 2068–2081. https://doi.org/10.1145/2998181.2998254

Gap matrix ID
Reader needs x external comparison. The study uses Wikipedia’s internal search to test the results of keywords that are presumed to have a different interest valence for men and women. This simulates the search behavior of real Wikipedia readers (reader needs). The sourcing of the keywords from a non-Wikipedia reference corpus (articles from magazines marketed to men or women) makes the comparison external.
Gap taxonomy ID
Individual identity --> Gender gap. Coverage differences relevant to people of different biological sexes or gender identities.
Causes of the gap
(proposed) The gender gap in content of particular interest to women is mediated by lower participation by women in editing.
Measurement strategy
The researchers take a mixed-methods approach. They look for articles that match topic keywords sourced from the EBSCO database of magazine articles, drawn from four magazines targeted at women and four magazines targeted at men. They assess whether searching on a topic keyword with a gendered valence led to: a) a direct match with a Wikipedia article title; b) an automatic redirect to the article (indicating that the topic keyword directly matched a redirect page title); c) a direct match with a Wikipedia disambiguation page in which one of the options was directly equivalent to the topic keyword (based on human judgement); or d) a search engine results page (SERP) in which one of the articles listed was directly equivalent to the topic keyword (based on human judgement).
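The four outcomes above can be sketched as a small classifier. This is only an illustration, not the authors' code: the `landing` labels, the `human_equivalent` flag (standing in for the human judgement in cases c and d), and the `match_rates` helper are all hypothetical names introduced here.

```python
from enum import Enum


class MatchType(Enum):
    DIRECT_TITLE = "a"      # keyword equals an article title
    REDIRECT = "b"          # keyword equals a redirect page title
    DISAMBIGUATION = "c"    # disambiguation page with an equivalent option
    SERP = "d"              # results page listing an equivalent article
    NO_MATCH = "none"


def classify_outcome(landing, human_equivalent=False):
    """Map a search outcome to one of the four match types.

    landing: hypothetical label for where the search ended up
             ("article", "redirect", "disambiguation", or "serp").
    human_equivalent: a coder's judgement that an equivalent article
             was offered (only consulted for cases c and d).
    """
    if landing == "article":
        return MatchType.DIRECT_TITLE
    if landing == "redirect":
        return MatchType.REDIRECT
    if landing == "disambiguation" and human_equivalent:
        return MatchType.DISAMBIGUATION
    if landing == "serp" and human_equivalent:
        return MatchType.SERP
    return MatchType.NO_MATCH


def match_rates(records):
    """Share of keywords with any match, per magazine audience.

    records: iterable of (audience, landing, human_equivalent) tuples.
    """
    totals, hits = {}, {}
    for audience, landing, equivalent in records:
        totals[audience] = totals.get(audience, 0) + 1
        if classify_outcome(landing, equivalent) is not MatchType.NO_MATCH:
            hits[audience] = hits.get(audience, 0) + 1
    return {a: hits.get(a, 0) / totals[a] for a in totals}
```

Comparing the resulting rates for keywords from women's and men's magazines gives one simple view of the gap; the study's own analysis may differ in its details.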


Miquel-Ribé & Laniado 2018

Miquel-Ribé, M., & Laniado, D. (2018). Wikipedia Culture Gap: Quantifying Content Imbalances Across 40 Language Editions. Frontiers in Physics, 6(JUN). https://doi.org/10.3389/fphy.2018.00054

Gap matrix ID
Selection x internal comparison. Comparing the number and relative proportion of cultural context articles per Wikipedia language (selection) across multiple Wikipedia language editions (internal comparison).
Gap taxonomy ID
Group identity --> Cultural context gap. Coverage differences related to the history, heritage, and characteristics of a current or former cultural group.
Causes of the gap
(proposed) Self-focus bias and proximity bias: people tend to write articles that are related to their culturally-mediated experiences of the world and identity, and about places that are nearby.
Measurement strategy
The researchers compare sets of articles in each language edition of Wikipedia that reflect the unique cultural context (“Cultural context content”, or CCC) of geographic territories where that language is predominant. CCC articles include articles about concepts that a) originate in a language region or b) are located in the context and have considerable influence there. Multiple strategies were used to create the CCC article set for each language: 1) articles with geocoordinates that place the article subject within the boundaries of a region where that language is predominant; 2) article title keywords that match the language itself or the names of regions or places within a language-predominant territory; 3) article category keywords that match the language itself or the names of regions or places within a language-predominant territory (also including keywords from supercategories of the article’s categories). The resulting CCC datasets were used to make multiple cross-wiki comparisons: 1) the relative proportion of CCC articles; 2) the growth patterns of CCC articles; 3) the overlap between CCC article sets.
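The three selection strategies can be sketched as a single membership test. This is an illustration under loud assumptions: the article dictionaries, the rough bounding box (a stand-in for real territory boundaries), and the keyword list are all invented for the example, and the supercategory lookup is omitted.

```python
def in_bbox(lat, lon, bbox):
    """True if (lat, lon) falls inside bbox = (south, west, north, east)."""
    south, west, north, east = bbox
    return south <= lat <= north and west <= lon <= east


def is_ccc(article, bbox, keywords):
    """Apply the three strategies in turn: geocoordinates, title, categories."""
    coords = article.get("coords")
    if coords is not None and in_bbox(coords[0], coords[1], bbox):
        return True                      # strategy 1: geolocated in the territory
    lowered = [k.lower() for k in keywords]
    title = article["title"].lower()
    if any(k in title for k in lowered):
        return True                      # strategy 2: title keyword
    categories = " ".join(article.get("categories", [])).lower()
    return any(k in categories for k in lowered)  # strategy 3: category keyword


def ccc_share(articles, bbox, keywords):
    """Relative proportion of CCC articles in one language edition."""
    if not articles:
        return 0.0
    return sum(is_ccc(a, bbox, keywords) for a in articles) / len(articles)


# Illustrative data only: a rough box and a few hand-made article records.
EXAMPLE_BBOX = (40.5, 0.1, 42.9, 3.4)
EXAMPLE_KEYWORDS = ["catalan", "catalonia"]
EXAMPLE_ARTICLES = [
    {"title": "Sagrada Família", "coords": (41.40, 2.17), "categories": []},
    {"title": "Catalan language", "coords": None, "categories": []},
    {"title": "Pa amb tomàquet", "coords": None, "categories": ["Catalan cuisine"]},
    {"title": "Mount Fuji", "coords": (35.36, 138.73), "categories": ["Volcanoes of Japan"]},
]
```

Computing `ccc_share` per language edition gives the first of the cross-wiki comparisons; overlap between editions could then be taken as set intersections of the selected titles.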

Graham et al. 2015

Graham, M., Straumann, R. K., & Hogan, B. (2015). Digital Divisions of Labor and Informational Magnetism: Mapping Participation in Wikipedia. Annals of the Association of American Geographers, 105(6), 1158–1178. https://doi.org/10.1080/00045608.2015.1072791

Gap matrix ID
Framing x internal comparison. Assesses whether articles are being edited by people within the same country or region as the topic the article describes; location is a proxy for local knowledge and priorities.
Gap taxonomy ID
Common interest --> Geographical gap. Coverage differences in topics related to geographic regions or population distribution.
Causes of the gap
Broadband availability and other socioeconomic factors.
Measurement strategy
The researchers compared the locations associated with the IP addresses of unregistered editors (and the stated geographic location of registered editors, gleaned from their user pages) and the countries and regions associated with geotagged articles. They focused primarily on English Wikipedia.
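A minimal sketch of this comparison, assuming each edit has already been reduced to a pair of country codes (one for the geotagged article, one for the editor). The function name and input shape are hypothetical; the geolocation of IP addresses and parsing of user pages are not shown.

```python
from collections import defaultdict


def local_edit_share(edits):
    """Share of edits to each country's articles made from within that country.

    edits: iterable of (article_country, editor_country) pairs.
    """
    total = defaultdict(int)
    local = defaultdict(int)
    for article_country, editor_country in edits:
        total[article_country] += 1
        if article_country == editor_country:
            local[article_country] += 1
    return {country: local[country] / total[country] for country in total}
```

A low share for a country would suggest that content about it is written mostly from elsewhere, which is the pattern the location proxy is meant to surface.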

Halfaker 2017

Halfaker, A. (2017). Interpolating Quality Dynamics in Wikipedia and Demonstrating the Keilana Effect. Proceedings of the 13th International Symposium on Open Collaboration - OpenSym ’17, 1–9. https://doi.org/10.1145/3125433.3125475

Gap matrix ID
Extent x internal comparison. Compares the quality of English Wikipedia articles about women scientists with that of English Wikipedia as a whole.
Gap taxonomy ID
Individual identity --> Gender gap. Coverage differences about people of different biological sexes or gender identities.
Causes of the gap
(proposed) Gender disparity of Wikipedia editors, leading to less effort devoted to articles about women.
Measurement strategy
Compare ORES scores of articles claimed by WikiProject Women Scientists vs. the mean and weighted sum ORES scores for the rest of Wikipedia, over a period of months (during which WikiProject Women Scientists members targeted these articles for improvement, substantially lifting their scores).
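A weighted sum score can be sketched as follows. The class labels follow the English Wikipedia assessment scale; the integer weights 0 to 5 are an assumption made for this illustration, and the probability dictionaries are taken as already retrieved from the quality model.

```python
# Assumed weights: one integer step per assessment class, lowest to highest.
WEIGHTS = {"Stub": 0, "Start": 1, "C": 2, "B": 3, "GA": 4, "FA": 5}


def weighted_sum(probabilities):
    """Collapse per-class probabilities into a single quality score.

    probabilities: {class_label: probability}; missing classes count as 0.
    """
    return sum(WEIGHTS[label] * p for label, p in probabilities.items())


def mean_weighted_sum(predictions):
    """Average weighted sum over a set of articles (e.g. one WikiProject)."""
    scores = [weighted_sum(p) for p in predictions]
    return sum(scores) / len(scores)
```

Tracking `mean_weighted_sum` month by month for the project's articles and for the rest of the wiki would show whether, and when, the two series diverge.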

Graells-Garrido et al. 2015

Graells-Garrido, E., Lalmas, M., & Menczer, F. (2015). First Women, Second Sex: Gender bias in Wikipedia. Proceedings of the 26th ACM Conference on Hypertext & Social Media - HT ’15, 165–174. https://doi.org/10.1145/2700171.2791036

Gap matrix ID
Framing x internal comparison. How women are portrayed in their biographies vs. how men are portrayed. They also found differences in extent of coverage and selection differences, but these were not the primary focus of the study (and have already been well-documented in previous studies).
Gap taxonomy ID
Individual identity --> Gender gap. Coverage differences about people of different biological sexes or gender identities.
Causes of the gap
(proposed) Biases in the editing population (endogenous) and biases that reflect broader social and historical factors (exogenous).
Measurement strategy
They examined 1) whether there were differences in the classes of words (per the Linguistic Inquiry and Word Count (LIWC) dictionary) used to describe the notable characteristics of biographical subjects, divided by gender; 2) whether there were differences in gender-related metadata (e.g. 'spouse' property) related to male/female biographies in DBpedia; 3) differences in network centrality of male/female biographies (a measure of implied importance to the editing population) based on the wikilink graph.


Callahan & Herring 2011

Callahan, E. S., & Herring, S. C. (2011). Cultural bias in Wikipedia content on famous persons. Journal of the American Society for Information Science and Technology, 62(10), 1899–1915. https://doi.org/10.1002/asi.21577

Gap matrix ID
Framing x internal comparison. How famous Poles and Americans are portrayed in, respectively, the Polish and English Wikipedias: what information is considered notable or appropriate for inclusion, and how that information is presented. They also found differences in extent of coverage, but these were not the primary focus of the study.
Gap taxonomy ID
Group identity --> Cultural context gap. Coverage differences related to the history, heritage, and characteristics of a current or former cultural group.
Causes of the gap
(proposed) Culturally-mediated differences in what the editing population cares about or considers notable about famous people from their own culture, vs. famous people from other cultures.
Measurement strategy
They used qualitative content analysis to compare a matched dataset of biographies of famous people in the English and Polish Wikipedias. They focused on structural properties (e.g. presence of certain section headings, or images), what kinds of personal details are included (e.g. accomplishments, ethnicity, or sexual orientation), and the overall tone of the information presented (e.g. how positive the coverage was).

Jemielniak & Wilamowski 2017

Jemielniak, D., & Wilamowski, M. (2017). Cultural diversity of quality of information on Wikipedias. Journal of the Association for Information Science and Technology, 68(10), 2460–2470. https://doi.org/10.1002/asi.23901

Gap matrix ID
(Framing, extent) x internal comparison. Comparing various structural features of Wikipedia articles across 8 major language editions. The authors assert that some or all of these differences may reflect cultural preferences, but don't develop specific hypotheses or point to specific differences as clearly culturally-mediated. Therefore, this could be classified as framing or extent.
Gap taxonomy ID
Group identity --> Cultural context gap. Coverage differences related to the history, heritage, and characteristics of a current or former cultural group.
Causes of the gap
(proposed) Culturally-mediated differences in what the editing population cares about: "More research is needed to more precisely confirm and verify the preferences of different language cultures for different information formats and standards."
Measurement strategy
Two article sets were compared: the top 300 Good and Featured articles (by word count) in each of 8 languages, and all articles in those 8 languages. For each set, they computed the average number of words, characters, images, references, internal links, and external links (relative to the maximum value of each metric in the 300-article per-language set).
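One possible reading of the "relative to maximum" scaling is sketched below: each language's average for a metric is divided by the largest per-language average for that metric. This interpretation, the function name, and the input shape are assumptions; the paper may normalize differently.

```python
def relative_to_max(averages):
    """Scale per-language averages by the per-metric maximum across languages.

    averages: {language: {metric: average_value}}, with the same metrics
              present for every language.
    Returns the same structure with values in the range 0..1.
    """
    metrics = {m for values in averages.values() for m in values}
    maxima = {m: max(values[m] for values in averages.values()) for m in metrics}
    return {
        language: {
            m: (value / maxima[m] if maxima[m] else 0.0)
            for m, value in values.items()
        }
        for language, values in averages.items()
    }
```

Scaling this way puts words, images, and links on a common footing, so a language's profile across metrics can be compared at a glance.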

Lam et al. 2011

Lam, S. (Tony) K., Uduwage, A., Dong, Z., Sen, S., Musicant, D. R., Terveen, L., & Riedl, J. (2011). WP:clubhouse?: an exploration of Wikipedia’s gender imbalance. In Proceedings of the 7th International Symposium on Wikis and Open Collaboration (pp. 1–10). New York, NY, USA: ACM. https://doi.org/10.1145/2038558.2038560

Gap matrix ID
Extent x reader needs. They compared the length of Wikipedia articles about movies (extent) with the average ratings of those movies by men and women on the MovieLens platform. MovieLens rating by gender is used to reflect potential gender-mediated interest in information about the movie (reader needs).
Gap taxonomy ID
Individual identity --> Gender gap. Coverage differences relevant to people of different biological sexes or gender identities.
Causes of the gap
(proposed) The gender gap in content of particular interest to women is mediated by lower participation by women in editing.
Measurement strategy
Regression model comparing average MovieLens rating of movie (divided by gender of rater) to length of Wikipedia article about that movie—controlling for movie popularity, movie quality, and movie age.

Warncke-Wang et al. 2015

Warncke-Wang, M., Ranjan, V., Terveen, L., & Hecht, B. (2015). Misalignment between supply and demand of quality content in peer production communities. Proceedings of the 9th International Conference on Web and Social Media, ICWSM 2015, 493–502.

Gap matrix ID
Extent x reader needs. They compared the quality of all Wikipedia articles in four languages (extent) with their cumulative pageviews, a measure of how interesting those articles are to Wikipedia’s direct readership (reader needs).
Gap taxonomy ID
Multiple! They found that the disparity between extent of information and reader needs existed across a variety of topics. Some of those, like Comedy or Psychology, could qualify as Common interest. Others, like LGBT studies, relate to individual identity. Still other poorly covered topics, like Countries, could reflect group identity gaps.
Causes of the gap
General misalignment between the interests of Wikipedia editors and the interests of Wikipedia readers.
Measurement strategy
Multiple. For comparisons of the misalignment between quality and pageviews across specific topics, they computed a relative risk score: the likelihood that someone reading Wikipedia will encounter low-quality articles, based on the popularity of the topic.
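A relative risk score of this kind can be sketched as a ratio of two view-weighted probabilities. The definition used here (risk within a topic divided by risk across all other articles) is an assumption for illustration, as are the function names and the input shape.

```python
def view_weighted_risk(articles):
    """Probability that a pageview lands on a low-quality article.

    articles: iterable of (pageviews, is_low_quality) pairs.
    """
    articles = list(articles)
    total_views = sum(views for views, _ in articles)
    if total_views == 0:
        return 0.0
    low_views = sum(views for views, is_low in articles if is_low)
    return low_views / total_views


def relative_risk(topic_articles, other_articles):
    """Risk inside a topic relative to risk in the comparison set."""
    baseline = view_weighted_risk(other_articles)
    if baseline == 0:
        raise ValueError("comparison set has no views on low-quality articles")
    return view_weighted_risk(topic_articles) / baseline
```

A value above 1 would mean readers of that topic are more likely than readers elsewhere to land on a low-quality article; a value below 1 would mean the opposite.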