Research:Data introduction/Analyzing content
This page is currently a draft. More information pertaining to this may be available on the talk page.
This page provides an introduction for researchers and data scientists who want to get started analyzing Wikimedia content. It helps you understand the basic concepts, potential pitfalls, and canonical data sources for accessing and analyzing the content of Wikimedia wikis.
Before you start
This page assumes you already understand the concepts in Research:Data_introduction.
What is wiki content?
In the context of Wikimedia data, "content" refers to the text, media, or data stored in the wiki project's MediaWiki database. Depending on the project and data source, "content" can mean raw Wikitext, parsed HTML of wiki pages, Wikibase JSON, images, or other types of content.
Content vs. metadata
The MediaWiki database structure has separate tables for text, revisions, and pages. When a user edits a page, the act of editing creates a revision. A revision record contains metadata about the page change, but not the changed content itself.
In technical terms, an edit creates a row in the revision table of that wiki's MediaWiki database. The text of that revision is the "content" of the page, but it is stored in the text table, which is separate from both the revision table and the page table. This data structure means you can analyze contributions and user activity without handling large amounts of stored raw content. If you do need the raw content, you have to combine data from multiple tables (or use a data source that has already combined them).
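The split between metadata and content described above can be sketched with an in-memory SQLite database. This is an illustration only: the schema is a heavily simplified stand-in for the real MediaWiki tables, the direct `rev_text_id` link is an assumption made for brevity (current MediaWiki versions route this relationship through additional tables), and all rows are made up.

```python
import sqlite3

# Toy, simplified schema (assumption): real MediaWiki tables have many more
# columns, and the revision-to-text link is less direct than shown here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER, page_title TEXT);
    CREATE TABLE text (old_id INTEGER PRIMARY KEY, old_text TEXT);
    CREATE TABLE revision (
        rev_id INTEGER PRIMARY KEY,
        rev_page INTEGER REFERENCES page(page_id),
        rev_timestamp TEXT,
        rev_text_id INTEGER REFERENCES text(old_id)  -- simplified direct link
    );
    INSERT INTO page VALUES (1, 0, 'Example');
    INSERT INTO text VALUES (10, 'First draft.'), (11, 'First draft, expanded.');
    INSERT INTO revision VALUES
        (100, 1, '20240101000000', 10),
        (101, 1, '20240102000000', 11);
""")

# Metadata-only analysis: count revisions per page without reading any text.
revision_counts = conn.execute(
    "SELECT rev_page, COUNT(*) FROM revision GROUP BY rev_page"
).fetchall()

# Content analysis: join all three tables to get the text of each revision.
revision_texts = conn.execute("""
    SELECT p.page_title, r.rev_id, t.old_text
    FROM revision AS r
    JOIN page AS p ON p.page_id = r.rev_page
    JOIN text AS t ON t.old_id = r.rev_text_id
    ORDER BY r.rev_timestamp
""").fetchall()

print(revision_counts)
print(revision_texts)
```

The first query never touches the text table, which is why activity analyses can stay lightweight. The second query shows the extra joins needed once the raw content matters.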
✅ Examples of content | ❌ Not content |
---|---|
Namespaces
A content namespace is a namespace that contains the content of a Wikimedia project. But what counts as content? In Wikipedia, "content" is traditionally limited to articles. In technical terms, that means pages in namespace 0, also called the main namespace or ns0 (based on the numeric identifier that the software assigns to it).
However, wikis may designate different namespaces as content namespaces. Not all pages in the Main namespace contain content, and not all content that may be relevant for your analysis is in article pages. For example:
- In English Wikipedia's data model, the "Portal" and "Category" namespaces are considered content namespaces, but the pages they contain are not "articles".
- Wikisource has an Author content namespace (example)
- Spanish Wikipedia has an Anexo (Appendix) content namespace (example)
For more details, and tips for identifying the content namespaces in a given wiki, see Research:Content_namespace.
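One possible way to check which namespaces a wiki flags as content is the MediaWiki Action API's `meta=siteinfo` module with `siprop=namespaces`. The sketch below builds the request URL and filters a response; it assumes `formatversion=2`, where the `content` flag is a boolean. The sample response is hand-written for illustration and was not fetched from a live wiki, and the helper names are hypothetical.

```python
from urllib.parse import urlencode


def siteinfo_namespaces_url(api_url):
    """Build an Action API URL that lists a wiki's namespaces."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "namespaces",
        "format": "json",
        "formatversion": "2",  # assumed here so that flags are true/false
    }
    return f"{api_url}?{urlencode(params)}"


def content_namespace_ids(siteinfo_response):
    """Return the IDs of namespaces flagged as content in the response."""
    namespaces = siteinfo_response["query"]["namespaces"]
    return sorted(ns["id"] for ns in namespaces.values() if ns.get("content"))


# Hand-written sample (assumption), trimmed to the fields used above.
sample_response = {
    "query": {
        "namespaces": {
            "0": {"id": 0, "name": "", "content": True},
            "1": {"id": 1, "name": "Talk", "content": False},
            "104": {"id": 104, "name": "Anexo", "content": True},
        }
    }
}

url = siteinfo_namespaces_url("https://es.wikipedia.org/w/api.php")
print(url)
print(content_namespace_ids(sample_response))
```

Fetching the URL and passing the decoded JSON to `content_namespace_ids` should give the per-wiki list, which can differ from the conceptual notion of "content" discussed above.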
Why does this matter for your analysis?
Understanding namespaces is important because different wikis use namespaces differently, so you should pay attention to, and explicitly decide, which namespaces in each wiki contain content relevant for your analysis.
- For example, if you're analyzing contributions in a given topic area, will you only include text contributions to article pages? What about media uploads, Talk page discussions, or edits to Wikidata items in that topic area?
Historical vs. current content
The life of a piece of content
Wiki content formats, parsing, and rendering
Wikitext vs. HTML
Content models: From a technical perspective, pages are content objects that may instantiate different content models. See https://www.mediawiki.org/wiki/Content_handlers and https://www.mediawiki.org/wiki/Manual:Database_layout.
Core MediaWiki functionality handles most aspects of article parsing, but wiki-specific extensions can alter article content in ways that are more difficult to track.
Content reuse, templates, and transclusion
Some types of transcluded Wikidata content may show up differently (or not at all) on different platforms (source: https://meta.wikimedia.org/wiki/Research:External_Reuse_of_Wikimedia_Content/Wikidata_Transclusion#Approaches).
Changes to Wikidata items in a transcluded template will show up in Recent Changes even if they have no impact on the article.
Sample MediaWiki API query: which pages use (transclude) the Template:Infobox?
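A minimal sketch of such a query, using the Action API's `list=embeddedin` module. It assumes English Wikipedia as the target wiki and restricts results to namespace 0; the helper names are hypothetical, and the sample response is hand-written with placeholder titles rather than real API output.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"  # assumption: English Wikipedia


def embeddedin_url(template_title, namespace=0, limit=50, continue_token=None):
    """Build a URL listing pages that transclude the given template."""
    params = {
        "action": "query",
        "list": "embeddedin",
        "eititle": template_title,
        "einamespace": namespace,
        "eilimit": limit,
        "format": "json",
        "formatversion": "2",
    }
    if continue_token is not None:
        # Results are paginated; pass the token from the previous response.
        params["eicontinue"] = continue_token
    return f"{API_URL}?{urlencode(params)}"


def page_titles(response):
    """Extract page titles from an embeddedin response."""
    return [page["title"] for page in response["query"]["embeddedin"]]


def next_continue_token(response):
    """Return the continuation token, or None when there are no more results."""
    return response.get("continue", {}).get("eicontinue")


# Hand-written sample (assumption) with placeholder values.
sample_response = {
    "continue": {"eicontinue": "placeholder-token", "continue": "-||"},
    "query": {
        "embeddedin": [
            {"pageid": 1, "ns": 0, "title": "Placeholder page A"},
            {"pageid": 2, "ns": 0, "title": "Placeholder page B"},
        ]
    },
}

first_url = embeddedin_url("Template:Infobox")
print(first_url)
print(page_titles(sample_response))
print(embeddedin_url("Template:Infobox", continue_token=next_continue_token(sample_response)))
```

Widely used templates can be transcluded on a very large number of pages, so a real script would loop until `next_continue_token` returns `None`.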
Translated content
Several processes and large-scale systems exist for translating wiki content. "Even though one language version of a Wikipedia article may be a translation of a Wikipedia article in another language, the wikitext is not necessarily sentence-aligned."
Anatomy of a wiki page
What elements might you encounter? TODO: see if you can find a nice diagram
- Text of the page
  - Usually Wikitext or HTML.
  - TIP: The mwparserfromhtml library extracts some properties of the elements that end users might care about, such as whether each element was originally included in the wikitext version or was transcluded from another page.
- Links
  - Wikilinks (or internal links). (TIP: The mwparserfromhtml library annotates the namespace of the target link and whether it is a disambiguation page, redirect, red link, or interwiki link.) NOTE: Inter-language links are not necessarily bidirectional.
  - External links. (TIP: The mwparserfromhtml library distinguishes whether an external link is named, numbered, or autolinked.)
- Translate tags and translated content
  - See above: Translated content.
- Templates and transcluded content
  - See above: Content reuse, templates, and transclusion.
- Media
  - TODO: how to see Commons content used in Wikipedia. Different types of media may exist on a page (image, audio, or video), and media may be accompanied by a caption and alt text.
- Structured data
  - Geotagged content: "Some Wikipedia articles (and Commons media) have markup with geographical coordinates...Wikidata structured content including geotagging can provide a plethora of information for enriching maps, e.g., one can use OpenStreetMap to combine Wikidata’s entities with geotags and images to render images on a map." Examples:
    - "The Wikidata-based service Wiki ShootMe! lists geographical items with missing images on Wikidata based on a query coordinate given by the user. A Wikipedian can use it to identify photo opportunities nearby. The MediaWiki software embeds similar functionality with the ‘Special:Nearby’ page that lists nearby pages and associated images."
    - "Wikipedia lets users markup a page with geographical coordinates via the use of MediaWiki templates. Through the OpenStreetMap SimpleMap MediaWiki extension, articles with geographical coordinates can display maps."
    - (TODO: find where these quotes come from in my original notes doc/draft)
- Citations / references
  - TODO: maybe this is too detailed, but people do love to analyze citations
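To make a few of the elements listed above concrete, here is a rough sketch that counts wikilinks, external links, and transclusion markers in Parsoid-style HTML using only the Python standard library. It is not the mwparserfromhtml API. It relies on the `rel="mw:WikiLink"`, `rel="mw:ExtLink"`, and `typeof="mw:Transclusion"` annotations that Parsoid HTML uses, and the sample HTML is a hand-written fragment, not real page output.

```python
from collections import Counter
from html.parser import HTMLParser


class ParsoidElementCounter(HTMLParser):
    """Count a few element types using Parsoid-style attribute annotations."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Attribute values can be None for valueless attributes.
        rel_values = (attributes.get("rel") or "").split()
        typeof_values = (attributes.get("typeof") or "").split()
        if tag == "a" and "mw:WikiLink" in rel_values:
            self.counts["wikilink"] += 1
        elif tag == "a" and "mw:ExtLink" in rel_values:
            self.counts["external_link"] += 1
        if "mw:Transclusion" in typeof_values:
            # Only the first node of a transclusion carries this marker, so
            # this counts transclusions rather than every transcluded node.
            self.counts["transclusion"] += 1


# Hand-written fragment (assumption) imitating Parsoid output.
sample_html = """
<p>See <a rel="mw:WikiLink" href="./Example">Example</a> and
<a rel="mw:ExtLink" href="https://example.org">a site</a>.</p>
<span typeof="mw:Transclusion" about="#mwt1">from a template</span>
"""

counter = ParsoidElementCounter()
counter.feed(sample_html)
print(dict(counter.counts))
```

For real analyses, a purpose-built library is likely a better choice, because it handles the many edge cases this sketch ignores (media, references, nested transclusions).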
Quickstarts for common content analysis tasks
Comparing diffs
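As a starting point, two revisions of a page's wikitext can be compared line by line with Python's standard `difflib` module. This is a generic text diff under simple assumptions, not MediaWiki's own diff engine, and the two sample revisions are made up.

```python
import difflib


def changed_lines(old_text, new_text):
    """Return (removed, added) lines between two versions of wikitext."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(old_lines[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new_lines[j1:j2])
    return removed, added


# Made-up revisions for illustration.
old_revision = "'''Example''' is a page.\nIt has two lines."
new_revision = "'''Example''' is a page.\nIt has three lines.\n[[Category:Examples]]"

removed, added = changed_lines(old_revision, new_revision)
print("Removed:", removed)
print("Added:", added)
```

A line-based diff treats a one-word change as a whole-line replacement, so analyses of fine-grained edits may need a word-level or token-level comparison instead.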
Extracting references
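A minimal sketch of pulling `<ref>` tags out of raw wikitext with regular expressions. It assumes references are written inline with `<ref>...</ref>` or reused with self-closing `<ref name="..." />` tags. It will miss references generated by templates and does not handle comments or `nowiki` sections. The sample wikitext is made up.

```python
import re

# An opening <ref ...> tag that is not self-closing, its body, and </ref>.
FULL_REF = re.compile(r"<ref\b([^>]*?)(?<!/)>(.*?)</ref\s*>", re.IGNORECASE | re.DOTALL)
# Self-closing tags such as <ref name="x" />, which reuse a named reference.
REUSED_REF = re.compile(r"<ref\b[^>]*?/>", re.IGNORECASE)


def extract_references(wikitext):
    """Return (list of reference bodies, number of reused-reference tags)."""
    bodies = [match.group(2).strip() for match in FULL_REF.finditer(wikitext)]
    reuse_count = len(REUSED_REF.findall(wikitext))
    return bodies, reuse_count


# Made-up wikitext for illustration.
sample_wikitext = (
    'First claim.<ref name="src">{{cite book |title=Example Book}}</ref> '
    "Second claim.<ref>Example source.</ref> "
    'Third claim.<ref name="src" />\n'
    "<references />"
)

bodies, reuse_count = extract_references(sample_wikitext)
print(bodies)
print(reuse_count)
```

The `\b` after `ref` keeps the patterns from matching the `<references />` tag. For anything beyond quick exploration, a dedicated wikitext or HTML parser is likely more reliable than regular expressions.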
Research topics in wiki content analysis
TODO