Research:Data introduction/Analyzing content

This page provides an introduction for researchers and data scientists who want to get started analyzing Wikimedia content. It helps you understand the basic concepts, potential pitfalls, and canonical data sources for accessing and analyzing the content of Wikimedia wikis.

Before you start

This page assumes you already understand the concepts in Research:Data_introduction.

What is wiki content?

In the context of Wikimedia data, "content" refers to the text, media, or data stored in the wiki project's MediaWiki database. Depending on the project and data source, "content" can mean raw Wikitext, parsed HTML of wiki pages, Wikibase JSON, images, or other types of content.

Content vs. metadata

The MediaWiki database structure has separate tables for text, revisions, and pages. When a user edits a page, the act of editing creates a revision. A revision record contains metadata about the page change, but not the changed content itself.

In technical terms, an edit creates a row in the revision table of that wiki's MediaWiki database. The text of that revision is the "content" of the page, but it is stored in the text table, which is separate from both the revision table and the page table. This structure means you can analyze contributions and user activity without handling large amounts of raw content. If you do need the raw content, you have to combine data from multiple tables, or use a data source that has already combined them.
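The split between metadata and content can be illustrated with a small in-memory database. This is a minimal sketch under stated assumptions: the schema below is a heavily simplified toy, the `rev_text_id` pointer and the sample rows are invented for illustration, and a production MediaWiki database connects revisions to stored text through additional tables. It only shows the general idea that metadata questions need one table while content questions need a join.

```python
import sqlite3

# Toy schema: a simplified stand-in for the page, revision, and text tables.
# Column names and the direct rev_text_id pointer are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
    CREATE TABLE revision (
        rev_id INTEGER PRIMARY KEY,
        rev_page INTEGER,       -- which page was edited
        rev_actor TEXT,         -- who made the edit (metadata)
        rev_timestamp TEXT,     -- when the edit was made (metadata)
        rev_text_id INTEGER     -- pointer to the stored text
    );
    CREATE TABLE text (old_id INTEGER PRIMARY KEY, old_text TEXT);

    INSERT INTO page VALUES (1, 'Example');
    INSERT INTO text VALUES (10, 'First draft.'), (11, 'First draft, expanded.');
    INSERT INTO revision VALUES
        (100, 1, 'Alice', '20240101000000', 10),
        (101, 1, 'Bob',   '20240102000000', 11);
""")

# Metadata-only analysis: count edits per user without touching any content.
edits_per_user = conn.execute(
    "SELECT rev_actor, COUNT(*) FROM revision "
    "GROUP BY rev_actor ORDER BY rev_actor"
).fetchall()

# Content analysis: join page -> revision -> text to recover each revision's text.
revisions_with_text = conn.execute("""
    SELECT p.page_title, r.rev_id, t.old_text
    FROM revision AS r
    JOIN page AS p ON p.page_id = r.rev_page
    JOIN text AS t ON t.old_id = r.rev_text_id
    ORDER BY r.rev_timestamp
""").fetchall()

print(edits_per_user)
print(revisions_with_text)
```

The first query never reads the text table, which is why activity analyses can stay lightweight. The second query is the kind of join that pre-combined data sources perform for you.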

✅ Examples of content ❌ Not content
  • Revision text (Wikitext)
  • Parsed HTML content of Wikimedia articles
  • Wikidata items as JSON objects
  • Images from Wikimedia Commons


Namespaces

A content namespace is a namespace that holds the content of a Wikimedia project. But what counts as content? In Wikipedia, "content" is traditionally limited to articles. In technical terms, that means pages in namespace zero, also called the main namespace or ns0 (after the numeric identifier that the software assigns to it).

However, wikis may designate different namespaces as content namespaces. Not all pages in the Main namespace contain content, and not all content that may be relevant for your analysis is in article pages. For example:

  • In English Wikipedia's data model, the "Portal" and "Category" namespaces are considered content namespaces, but the pages they contain are not "articles".
  • Wikisource has an Author content namespace (example)
  • Spanish Wikipedia has an Anexo (Appendix) content namespace (example)

For more details, and tips for identifying the content namespaces in a given wiki, see Research:Content_namespace.
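One way to check which namespaces a wiki flags as content is the MediaWiki API's siteinfo query (`action=query&meta=siteinfo&siprop=namespaces`). The following is a minimal sketch, not a canonical method: the function names are hypothetical, the User-Agent string is a placeholder you should replace, and the sample response in the usage note is hand-written for illustration rather than copied from a live wiki.

```python
import json
import urllib.parse
import urllib.request


def siteinfo_url(api_url):
    """Build a siteinfo request URL that asks for the wiki's namespaces."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "namespaces",
        "format": "json",
        "formatversion": "2",
    }
    return api_url + "?" + urllib.parse.urlencode(params)


def content_namespaces(siteinfo):
    """Return {namespace_id: name} for namespaces flagged as content.

    With formatversion=2 the flag is a boolean. In the older format the
    key is present (with an empty string value) only for content namespaces,
    so both shapes are handled here.
    """
    result = {}
    for ns_id, ns in siteinfo["query"]["namespaces"].items():
        if ns.get("content", False) is not False:
            result[int(ns_id)] = ns.get("name", ns.get("*", ""))
    return result


def fetch_content_namespaces(api_url, user_agent):
    """Query a live wiki. Requires network access and a descriptive User-Agent."""
    request = urllib.request.Request(
        siteinfo_url(api_url), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request) as response:
        return content_namespaces(json.load(response))
```

For example, given a hand-written response such as `{"query": {"namespaces": {"0": {"id": 0, "name": "", "content": True}, "1": {"id": 1, "name": "Talk", "content": False}}}}`, `content_namespaces` keeps only namespace 0. To query a real wiki, call `fetch_content_namespaces("https://es.wikipedia.org/w/api.php", "my-research-script/0.1 (contact: you@example.org)")`.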

Why does this matter for your analysis?

Different wikis use namespaces differently, so you should explicitly decide which namespaces in each wiki contain content relevant to your analysis.

  • For example, if you're analyzing contributions in a given topic area, will you only include text contributions to article pages? What about media uploads, Talk page discussions, or edits to Wikidata items in that topic area?

Historical vs. current content

The life of a piece of content

Wiki content formats, parsing, and rendering

Wikitext vs. HTML

Content models: from a technical perspective, pages are content objects that may instantiate different content models. See https://www.mediawiki.org/wiki/Content_handlers and https://www.mediawiki.org/wiki/Manual:Database_layout

Content reuse, templates, and transclusion

Some types of content transcluded from Wikidata may show up differently (or not at all) on different platforms. Source: https://meta.wikimedia.org/wiki/Research:External_Reuse_of_Wikimedia_Content/Wikidata_Transclusion#Approaches

(TIL) Changes to Wikidata items used in a transcluded template will appear in Recent Changes even if they have no impact on the article.

Sample MediaWiki API query: which pages use (transclude) the Template:Infobox?
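The question above maps onto the MediaWiki API's `list=embeddedin` module, which lists pages that transclude a given page. The sketch below is one possible way to call it from Python. The helper names are hypothetical, English Wikipedia is assumed as the target wiki, and the User-Agent value is a placeholder to replace with your own contact details.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"  # assumption: English Wikipedia


def embeddedin_params(template_title, namespace=0, limit=50):
    """Build parameters for a list=embeddedin query.

    einamespace restricts results to one namespace (0 = main namespace).
    """
    return {
        "action": "query",
        "list": "embeddedin",
        "eititle": template_title,
        "einamespace": str(namespace),
        "eilimit": str(limit),
        "format": "json",
        "formatversion": "2",
    }


def transcluding_titles(response):
    """Extract page titles from one batch of an embeddedin response."""
    return [page["title"] for page in response.get("query", {}).get("embeddedin", [])]


def fetch_transcluding_titles(template_title, user_agent):
    """Fetch the first batch of results. Requires network access.

    Results are paginated: when the response has a "continue" object, its
    values must be sent back as extra parameters to get the next batch.
    """
    url = API_URL + "?" + urllib.parse.urlencode(embeddedin_params(template_title))
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return transcluding_titles(json.load(response))
```

A call such as `fetch_transcluding_titles("Template:Infobox", "my-research-script/0.1 (contact: you@example.org)")` returns only the first batch. For widely used templates the full list can be very large, so a database dump may be a better fit than the API.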

Translated content

Several processes and large-scale systems exist for translating wiki content. "Even though one language version of a Wikipedia articles may be a translation of a Wikipedia article in another language, the wikitext is not necessarily sentence-aligned."

Anatomy of a wiki page

What elements might you encounter? TODO: see if you can find a nice diagram

Text of the page
Usually Wikitext or HTML

TIP: The mwparserfromhtml library extracts some properties of page elements that end users might care about, such as whether each element was originally included in the wikitext version or was transcluded from another page.

Links
  • Wikilinks (or internal links). TIP: The mwparserfromhtml library annotates the namespace of the target link and whether the link is a disambiguation page, redirect, red link, or interwiki link. NOTE: Inter-language links are not necessarily bi-directional.
  • External links. TIP: The mwparserfromhtml library distinguishes whether an external link is named, numbered, or autolinked.
Translate tags and translated content
See above: Translated content.
Templates and transcluded content
See above: Content reuse, templates, and transclusion.
Media
TODO: how to see Commons content used in Wikipedia. Different types of media may exist on a page (image, audio, or video), and media may be accompanied by caption and alt text.
Categories
https://github.com/blancadesal/wm-tutorials-and-blogposts/blob/main/mwsql-blogpost/mwsql-blogpost.ipynb
Structured data
Geotagged content: "Some Wikipedia articles (and Commons media) have markup with geographical coordinates... Wikidata structured content including geotagging can provide a plethora of information for enriching maps, e.g., one can use OpenStreetMap to combine Wikidata’s entities with geotags and images to render images on a map." Examples:

"The Wikidata-based service Wiki ShootMe! lists geographical items with missing images on Wikidata based on a query coordinate given by the user. A Wikipedian can use it to identify photo opportunities nearby. The MediaWiki software embeds similar functionality with the ‘Special:Nearby’ page that lists nearby pages and associated images." "Wikipedia lets users markup a page with geographical coordinates via the use of MediaWiki templates. Through the OpenStreetMap SimpleMap MediaWiki extension, articles with geographical coordinates can display maps." (TODO: find where these quotes come from in my original notes doc/draft)

Citations / references
TODO: maybe this is too detailed, but people do love to analyze citations

Quickstarts for common content analysis tasks

Comparing diffs
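As a starting point, two revisions of a page can be compared line by line with Python's standard `difflib` module. This is a minimal sketch, assuming you have already obtained the wikitext of both revisions as strings. It is a generic text diff, not the diff engine that MediaWiki itself uses, and the function name and sample strings are invented for illustration.

```python
import difflib


def wikitext_diff(old_text, new_text):
    """Return a unified diff (as a list of lines) between two revision texts."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="old revision",
            tofile="new revision",
            lineterm="",  # input lines carry no newline, so add none to headers
        )
    )


old_revision = "'''Example''' is a page.\nIt has two lines."
new_revision = "'''Example''' is a page.\nIt has two lines, one of them edited."

for line in wikitext_diff(old_revision, new_revision):
    print(line)
```

Lines starting with `-` exist only in the older revision and lines starting with `+` exist only in the newer one. A line-based diff treats any change within a line as a full replacement, so word-level or structure-aware comparison needs a different approach.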

Extracting references
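In wikitext, inline references are commonly written with `<ref>` tags, so a regular expression gives a rough first pass at extracting them. The sketch below is an approximation under stated assumptions: the function name and sample text are invented for illustration, nested or malformed tags are not handled, and citations produced by templates rather than literal `<ref>` tags would not be captured. Working from parsed HTML is likely more complete.

```python
import re

# Matches both <ref ...>body</ref> and self-closing <ref ... /> tags.
# The \b keeps the pattern from matching the <references /> list tag.
REF_PATTERN = re.compile(
    r"<ref\b([^>]*?)(?:/>|>(.*?)</ref\s*>)",
    re.IGNORECASE | re.DOTALL,
)


def extract_refs(wikitext):
    """Return a list of {"attrs": str, "body": str or None} for each <ref> tag.

    "body" is None for self-closing tags, which reuse a named reference.
    """
    refs = []
    for match in REF_PATTERN.finditer(wikitext):
        refs.append({"attrs": match.group(1).strip(), "body": match.group(2)})
    return refs


sample_wikitext = (
    'A claim.<ref name="a">{{cite web|url=https://example.org}}</ref> '
    'Another claim.<ref name="a" /> <references />'
)

for ref in extract_refs(sample_wikitext):
    print(ref)
```

Counting entries whose `body` is not None gives the number of distinct reference definitions in this sketch, while the full list length gives the number of citation uses.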

Research topics in wiki content analysis

TODO