Research:Analyzing sources on Wikipedia

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


Sources, particularly reliable sources, are key to Wikipedia. They are the primary mechanism for ensuring verifiability and therefore maintaining knowledge integrity and removing misinformation.[1][2] They also present a major barrier to expanding coverage of marginalized communities[3] and many knowledge gaps on Wikipedia arise in part due to a lack of reliable sources.[4] The specific sources that underpin an article can also determine whose point of view is represented[5][6] -- a particularly important question when considering the role that Wikipedia plays with respect to digital colonialism.[7][8]

Despite the important role of sources on Wikipedia and the many community discussions and concerns about them, sources are generally understudied. A large factor in this is likely their inaccessibility for large-scale data analyses. This project will work to devise methods to overcome some of the following challenges:

  • There is no standard format for references: common approaches to adding references include bare ref tags, citation templates, and shortened footnote templates. While extracting ref tags from Wikipedia is relatively straightforward, the content within them can still be unstructured and difficult to parse. Citation and shortened footnote templates bring useful structure, but there are many types, and template names and parameter names vary across languages. Any quantitative research on references on Wikipedia therefore generally comes with high start-up costs to extract the desired information. The mwrefs Python library helps with some of this -- it extracts ref tags, any associated citation template, external links, and identifiers (DOI, ISBN, PubMed, arXiv).
  • The citation only tells you so much: while you can extract certain information (does this reference have a DOI? is there a URL suggesting that the source is digitized? etc.), many interesting aspects of a reference (source country, language, open-access or paywall, etc.) can only be determined by consulting external databases or scraping content from external websites.
  • They are not tracked in any dedicated tables: analyzing references requires going directly to the wikitext (or parsed HTML) of an article and doing the above extraction. There are no logging tables or links tables that provide an easy entry point to analysis, as is the case for studying images, links, categories, etc. on Wikipedia. The externallinks table is a partial exception, though external links can also appear outside of references.
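
The extraction step described above can be sketched roughly as follows. This is a deliberately simplified, regex-based illustration and not how mwrefs is implemented; the function name extract_refs, the patterns, and the sample wikitext are all assumptions made for this example. It only handles plain <ref>...</ref> pairs on English-style wikitext and will miss many real-world cases (nested templates, shortened footnotes, localized template names).

```python
import re

# Matches <ref>...</ref> and <ref name="x">...</ref>.
# Self-closing reuses such as <ref name="x" /> are intentionally skipped.
REF_RE = re.compile(r"<ref(?:\s[^>/]*)?>(.*?)</ref>", re.IGNORECASE | re.DOTALL)
# Captures the template name at the start of {{name | ...}} or {{name}}.
TEMPLATE_RE = re.compile(r"\{\{\s*([^|{}]+?)\s*(?:\||\}\})")
URL_RE = re.compile(r"https?://[^\s|\]}<]+")
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s|}\]<]+")


def extract_refs(wikitext):
    """Return one dict per <ref> body with template names, URLs and DOIs."""
    refs = []
    for match in REF_RE.finditer(wikitext):
        body = match.group(1).strip()
        refs.append({
            "templates": [name.lower() for name in TEMPLATE_RE.findall(body)],
            "urls": URL_RE.findall(body),
            "dois": DOI_RE.findall(body),
        })
    return refs


# Hypothetical wikitext, for illustration only.
sample = (
    "Water is wet.<ref>{{cite journal |title=On water |doi=10.1000/xyz123 }}</ref> "
    'The sky is blue.<ref name="sky">[https://example.org/sky Sky report]</ref> '
    'Reused here.<ref name="sky" />'
)

for ref in extract_refs(sample):
    print(ref)
```

Even in this toy form, the output shows why the citation alone only goes so far: the first reference yields a template name and a DOI but no URL, while the second yields a URL and nothing else.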

Characterizing sources

This is a non-exhaustive approach to characterizing sources in ways that are useful to patrollers or readers looking to assess the verifiability/reliability of content. Editors have already devised ways to assess and label many of these characteristics via templates, user-scripts, and tools, but the lack of structure around citations makes it difficult to keep these assessments standardized and up-to-date. I have identified four main aspects to this, as described below.

Source-level metadata

There are many features specific to a source that are useful when evaluating it even ignoring the context in which it appears:

  • Medium: is the source a newspaper or book or website etc.?
  • Availability/accessibility: is the source available online or only as a physical artifact? If it is available online, is the URL still live, archived, and/or no longer available? What is the depth of the URL -- e.g., a link to a specific article or a general domain that is likely to change? Is the source behind a paywall? Is it transcribed into text and therefore easily searchable or just a scan of content? More accessible sources are not necessarily better sources, but they often make verifiability easier.
  • Recency: when was the source created? Newer sources are not necessarily better but older sources for content areas that are fast evolving can miss important details.
  • Level of source: is the source primary, secondary, or tertiary? None of these is explicitly disallowed, but secondary sources are preferred.
  • Reliability: is the source considered reliable on that wiki? Examples of some of these discussions: en:Perennial sources.
  • Geoprovenance: what region or culture is associated with a particular source? This is generally operationalized at the country level. On its own it cannot tell you much about an individual source, but it is much more useful for understanding the broader set of sources being considered (as noted below). A prototype API for English Wikipedia, largely a replication and extension of Sen et al.,[9] analyzes an article's sources and their geographic distribution: https://geo-provenance.wmcloud.org/api/v1/geo-provenance?lang=en&title=Climate_change
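
Some of the source-level features above can be approximated from the URL alone. The sketch below illustrates one possible way to measure the "depth of the URL" mentioned under availability, using only the Python standard library. The function names url_depth and describe_url are hypothetical, and counting path segments is an assumed heuristic rather than an established definition of depth.

```python
from urllib.parse import urlsplit


def url_depth(url):
    """Count non-empty path segments; 0 means the link points at a bare domain."""
    path = urlsplit(url).path
    return len([segment for segment in path.split("/") if segment])


def describe_url(url):
    """Summarize a source URL: normalized host, path depth, and query presence."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # treat www.example.org and example.org as the same host
    return {"host": host, "depth": url_depth(url), "has_query": bool(parts.query)}


# Hypothetical URLs, for illustration only.
print(describe_url("https://www.example.org/"))
print(describe_url("https://example.org/news/2021/story.html"))
```

A depth of 0 flags references that cite a general domain whose content is likely to change, whereas deeper paths more often point to a specific article. Whether the URL is still live, archived, or paywalled cannot be read off the URL and would need external lookups.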

Relationship between source and article content

Additional features about the appropriateness of a source only become clear when viewing it in context of the content that it is supposed to support:

  • Correctness: does the source in fact support the statement that it is supposed to verify? Tools such as verify or Side can help with this.
  • Language: is the source in the local language, or would it require translation to be accessible to the reader?

Relationship between source and article history

Understanding how a source came to be present in the current state of an article can help in assessing whether it warrants further evaluation:

  • Source provenance: who added the source? This is akin to what Who Wrote That? provides for article content. Understanding who added a given source and when can help moderators understand its context.
  • Stability/acceptance: how controversial is the source? Highly controversial sources might also be associated with previous discussions on talk pages or elsewhere about the appropriateness of the source.
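
Both questions above can be approached by walking an article's revision history. The following is a minimal sketch under strong assumptions: the revisions are already available as a list ordered oldest first, each a dict with the hypothetical keys rev_id, user, timestamp, and wikitext, and a source is identified by a plain substring match on its URL. A real analysis would need to fetch the history from the API or dumps and handle reformatted or archived URLs.

```python
def first_revision_with_source(revisions, source_url):
    """Return metadata of the earliest revision whose wikitext contains the URL."""
    for rev in revisions:  # assumed to be ordered oldest first
        if source_url in rev["wikitext"]:
            return {key: rev[key] for key in ("rev_id", "user", "timestamp")}
    return None


def count_removals(revisions, source_url):
    """Count how often the URL went from present to absent between revisions."""
    removals = 0
    previously_present = False
    for rev in revisions:
        present = source_url in rev["wikitext"]
        if previously_present and not present:
            removals += 1
        previously_present = present
    return removals


# Hypothetical revision history, for illustration only.
history = [
    {"rev_id": 1, "user": "A", "timestamp": "2020-01-01", "wikitext": "Stub."},
    {"rev_id": 2, "user": "B", "timestamp": "2020-02-01",
     "wikitext": "Claim.<ref>https://example.org/x</ref>"},
    {"rev_id": 3, "user": "C", "timestamp": "2020-03-01", "wikitext": "Claim."},
    {"rev_id": 4, "user": "B", "timestamp": "2020-04-01",
     "wikitext": "Claim.<ref>https://example.org/x</ref>"},
]

print(first_revision_with_source(history, "https://example.org/x"))
print(count_removals(history, "https://example.org/x"))
```

Repeated removal and re-addition is only a rough proxy for controversy; it says nothing about why the source was removed, which is where talk page discussions come in.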

Relationship between source and other sources

Viewing the source in relation to all other sources in an article or language edition or project can reveal even more information about the overall state of Wikipedia and gaps in what knowledge is represented:

  • Diversity: how similar is the source to others being used to support content? This can have strong implications for presenting a neutral point of view. Diversity has many aspects including the geoprovenance, medium, recency, level, publisher, and aspects related to the authors.
  • Frequency: how often and where else does this source appear on the Wikimedia projects? A single instance of a source does not make it bad -- many are hyperspecific to a topic -- but understanding usage can help both in assessing the reliability of a source and, conversely, in reassessing its usage if its reliability is called into question. Tools like citation finder or Global Search can help with this.
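
Once per-source attributes have been extracted, both measures above reduce to simple counting. The sketch below uses Shannon entropy as one possible diversity score over a single categorical attribute (for example, the country assigned to each source) and counts in how many articles each source domain appears. The choice of entropy, the function names, and the sample data are assumptions for illustration and not a metric defined by this project.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Entropy in bits of a list of labels; 0.0 means every source shares one label."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def source_frequency(article_sources):
    """Count the number of articles in which each source domain appears.

    article_sources maps an article title to the list of domains it cites.
    """
    counts = Counter()
    for domains in article_sources.values():
        counts.update(set(domains))  # count each article at most once per domain
    return counts


# Hypothetical data, for illustration only.
countries = ["US", "US", "GB", "IN"]
print(shannon_entropy(countries))

articles = {
    "Article A": ["example.org", "example.org", "example.com"],
    "Article B": ["example.org"],
}
print(source_frequency(articles).most_common())
```

The same entropy function can be applied to any of the other attributes listed under diversity (medium, recency bucket, publisher), which makes it easy to compare articles along several dimensions at once.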

See Also

References

  1. Cohen, Noam (2021-09-07). "One Woman’s Mission to Rewrite Nazi History on Wikipedia". Wired. Retrieved 2022-03-30.
  2. Grabowski, Jan; Klein, Shira (2023-02-09). "Wikipedia’s Intentional Distortion of the History of the Holocaust". The Journal of Holocaust Research 0 (0): 1–58. ISSN 2578-5648. doi:10.1080/25785648.2023.2168939. 
  3. Berson, Amber; Sengul-Jones, Monika; Tamani, Melissa (June 2021). "Unreliable Guidelines: Reliable Sources and Marginalized Communities in French, English and Spanish Wikipedias" (PDF). Art + Feminism. Retrieved 2022-03-30.
  4. "Wikipedia is a mirror of the world’s gender biases". Wikimedia Foundation (in English). 2018-10-18. Retrieved 2022-03-30.
  5. Luyt, Brendan; Tan, Daniel (2010). "Improving Wikipedia's credibility: References and citations in a sample of history articles". Journal of the American Society for Information Science and Technology: n/a–n/a. ISSN 1532-2882. doi:10.1002/asi.21304. 
  6. Ford, Heather; Sen, Shilad; Musicant, David R.; Miller, Nathaniel (2013-08-05). "Getting to the source: where does Wikipedia get its information from?". Proceedings of the 9th International Symposium on Open Collaboration. WikiSym '13 (New York, NY, USA: Association for Computing Machinery): 1–10. ISBN 978-1-4503-1852-5. doi:10.1145/2491055.2491064. 
  7. Duncan, Alexandra (2020). "Towards an activist research: is Wikipedia the problem or the solution?" (PDF). Art Libraries Journal. ISSN 0307-4722. Retrieved 2022-03-30. 
  8. "Decolonizing the Internet". Whose Knowledge (in English). Retrieved 2022-03-30.
  9. Sen, Shilad W.; Ford, Heather; Musicant, David R.; Graham, Mark; Keyes, Os; Hecht, Brent (2015-04-18). "Barriers to the Localness of Volunteered Geographic Information". Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. CHI '15 (New York, NY, USA: Association for Computing Machinery): 197–206. ISBN 978-1-4503-3145-6. doi:10.1145/2702123.2702170.