Research:Analyzing sources on Wikipedia


This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

Sources, particularly reliable sources, are key to Wikipedia. They are the primary mechanism for ensuring verifiability and therefore for maintaining knowledge integrity and removing misinformation.[1][2] Sourcing requirements also present a major barrier to expanding coverage of marginalized communities,[3] and many knowledge gaps on Wikipedia arise in part from a lack of reliable sources.[4] The specific sources that underpin an article can also determine whose point of view is represented[5][6] -- a particularly important question when considering the role that Wikipedia plays with respect to digital colonialism.[7][8]

Despite the important role of sources on Wikipedia and the many community discussions and concerns about them, sources are generally understudied. A large factor is likely that they are hard to access for large-scale data analysis. This project will work to devise methods to overcome some of these challenges:

  • There is no standard format for references: common approaches to adding references include bare ref tags, citation templates, and shortened footnote templates. While extracting ref tags from Wikipedia is relatively straightforward, the content within them can still be unstructured and difficult to parse. Citation templates and shortened footnote templates bring useful structure, but there are many template types, and both template names and parameter names vary across languages. Quantitative research on Wikipedia references therefore generally comes with high start-up costs to extract the desired information. The mwrefs Python library helps with some of this -- it extracts ref tags, any associated citation template, external links, and identifiers (DOI, ISBN, PubMed, arXiv).
  • The citation only tells you so much: while you can extract certain information (does this reference have a DOI? is there a URL suggesting that the source is digitized? etc.), many interesting aspects of a reference (source country, language, open-access or paywall, etc.) can only be determined from consulting external databases or scraping content from external websites.
  • References are not tracked by any external tables: analyzing references requires going directly to the wikitext (or parsed HTML) of an article and doing the above extraction. There are no logging tables or links tables that provide an easy entry point to analysis, as there are for studying images, links, categories, etc. on Wikipedia. The externallinks table may come closest, though external links can also appear outside of references.
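The extraction step described in the list above can be sketched in a few lines of Python. This is a minimal illustration that uses only the standard library and regular expressions; it is not the mwrefs API, and the function name `extract_refs`, the patterns, and the sample wikitext are all assumptions made for this example. Real wikitext has many edge cases (nested templates, shortened footnotes, localized template names) that this sketch does not handle.

```python
import re

# <ref ...>content</ref>; the lookbehind skips self-closing reuse tags such as <ref name="x" />
REF_RE = re.compile(r"<ref(?:\s[^>]*?)?(?<!/)>(.*?)</ref>", re.DOTALL | re.IGNORECASE)
# Name of the first template inside a reference, e.g. "cite journal"
TEMPLATE_RE = re.compile(r"\{\{\s*([^|{}]+?)\s*[|}]")
# Rough patterns for external links and DOIs (assumptions, not a full specification)
URL_RE = re.compile(r"https?://[^\s|}<>\]]+")
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s|}<>\]]+")


def extract_refs(wikitext):
    """Return one dict per <ref>...</ref> pair found in the wikitext."""
    refs = []
    for match in REF_RE.finditer(wikitext):
        content = match.group(1).strip()
        template = TEMPLATE_RE.search(content)
        doi = DOI_RE.search(content)
        refs.append({
            "content": content,
            "template": template.group(1).strip().lower() if template else None,
            "urls": URL_RE.findall(content),
            "doi": doi.group(0) if doi else None,
        })
    return refs


# Hypothetical wikitext: one templated reference, one bare reference, one reuse tag
sample = (
    'Claim one.<ref>{{cite journal |title=Example |doi=10.1000/example.123}}</ref> '
    'Claim two.<ref name="site">[https://example.org/page Example page]</ref> '
    'Claim two again.<ref name="site" />'
)

refs = extract_refs(sample)
for ref in refs:
    print(ref["template"], ref["doi"], ref["urls"])
```

Even in this small sample, the templated reference yields structured fields while the bare reference yields only a URL, which illustrates why the lack of a standard format raises the cost of analysis.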

Aspects of Source Quality

This is a non-exhaustive attempt at listing various ways in which we might think about the quality of sources:

  • Features of individual sources:
    • Level of source: primary, secondary, or tertiary. None are explicitly disallowed, but secondary sources are preferred.
    • Reliability: is the source considered reliable on that wiki? Examples of some of these discussions: en:Perennial sources.
  • Diversity of all sources in an article:
    • Geography: while the geography of any individual source does not tell you much about its quality, the diversity of regions represented in an article's sources is an interesting aspect of reliability. Greater regional diversity is generally a positive, with some priority given to including local sources.
    • Type of source: no source type is inherently worse -- e.g., books vs. newspapers -- but a diversity of different source types is generally better.
    • Accessibility: how easy is it to verify the sources? Just because a source is non-digital or behind a paywall does not make it a bad source, but ideally at least some of the sources are easily verifiable.
    • Recency: older sources are not worse, but for topics for which there may have been recent developments, a lack of recent sources could be a warning sign that the content is stale.

Source Geo-Provenance

The first stage of this project will examine the geo-provenance of sources across Wikipedia -- i.e. what countries are associated with the sources that are used on Wikipedia? This work will largely be a replication and extension of Sen et al.[9]
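As a rough illustration of what associating sources with countries can look like, the sketch below maps a source URL to a country using only its country-code top-level domain (ccTLD) and then tallies the result for a list of URLs. This is an assumption made for the example, not necessarily the method used by this project or by Sen et al.; the function names, the small ccTLD table, and the sample URLs are hypothetical. A ccTLD is a weak signal on its own: sources on generic domains such as .com remain unknown, and some ccTLDs are widely used outside their country.

```python
from collections import Counter
from urllib.parse import urlsplit

# Small illustrative subset of ccTLD -> ISO 3166-1 alpha-2 codes (assumption: a real table is far larger)
CCTLD_TO_COUNTRY = {"uk": "GB", "de": "DE", "fr": "FR", "in": "IN", "br": "BR", "ke": "KE"}


def country_from_url(url):
    """Guess a country from the URL's ccTLD; return None for generic or unlisted TLDs."""
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    return CCTLD_TO_COUNTRY.get(tld)


def country_distribution(urls):
    """Count inferred countries across an article's source URLs."""
    return Counter(country_from_url(url) or "unknown" for url in urls)


# Hypothetical source URLs for a single article
source_urls = [
    "https://www.bbc.co.uk/news/example",
    "https://www.example.de/artikel",
    "https://news.example.co.ke/story",
    "https://example.com/report",
]

counts = country_distribution(source_urls)
print(counts)
```

The size of the "unknown" bucket gives a quick sense of how much of an article's sourcing a ccTLD-only heuristic cannot place, and therefore how much additional evidence (external databases, page content) would be needed.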

Geo-Provenance Methods

An initial API for English Wikipedia that analyzes an article's sources and their geographic distribution can be found here:

See Also


  1. Cohen, Noam (2021-09-07). "One Woman’s Mission to Rewrite Nazi History on Wikipedia". Wired. Retrieved 2022-03-30. 
  2. Grabowski, Jan; Klein, Shira (2023-02-09). "Wikipedia’s Intentional Distortion of the History of the Holocaust". The Journal of Holocaust Research: 1–58. ISSN 2578-5648. doi:10.1080/25785648.2023.2168939. 
  3. Berson, Amber; Sengul-Jones, Monika; Tamani, Melissa (June 2021). "Unreliable Guidelines: Reliable Sources and Marginalized Communities in French, English and Spanish Wikipedias" (PDF). Art + Feminism. Retrieved 2022-03-30. 
  4. "Wikipedia is a mirror of the world’s gender biases". Wikimedia Foundation (in English). 2018-10-18. Retrieved 2022-03-30. 
  5. Luyt, Brendan; Tan, Daniel (2010). "Improving Wikipedia's credibility: References and citations in a sample of history articles". Journal of the American Society for Information Science and Technology. ISSN 1532-2882. doi:10.1002/asi.21304. 
  6. Ford, Heather; Sen, Shilad; Musicant, David R.; Miller, Nathaniel (2013-08-05). "Getting to the source: where does Wikipedia get its information from?". Proceedings of the 9th International Symposium on Open Collaboration. WikiSym '13 (New York, NY, USA: Association for Computing Machinery): 1–10. ISBN 978-1-4503-1852-5. doi:10.1145/2491055.2491064. 
  7. Duncan, Alexandra (2020). "Towards an activist research: is Wikipedia the problem or the solution?" (PDF). Art Libraries Journal. ISSN 0307-4722. Retrieved 2022-03-30. 
  8. "Decolonizing the Internet". Whose Knowledge (in English). Retrieved 2022-03-30. 
  9. Sen, Shilad W.; Ford, Heather; Musicant, David R.; Graham, Mark; Keyes, Os; Hecht, Brent (2015-04-18). "Barriers to the Localness of Volunteered Geographic Information". Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. CHI '15 (New York, NY, USA: Association for Computing Machinery): 197–206. ISBN 978-1-4503-3145-6. doi:10.1145/2702123.2702170.