Research:Data introduction

Tracked in Phabricator: Task T343146

This page introduces the basic concepts researchers and data scientists should know to start working with Wikimedia data. The goals of this document are:

  • Start building your understanding of the Wikimedia data landscape.
  • Set your expectations about what data is available and the challenges involved in working with it.
  • Alert you to unique qualities of Wikimedia data and infrastructure, and help you avoid common analysis pitfalls and misconceptions.
  • Introduce you to major data domains, and direct you to more specific information about data sources, access methods, and analysis techniques within each domain.

Essential concepts

Wikimedia projects are some of the largest, most complex, multilingual knowledge bases ever created. That makes Wikimedia data interesting and useful, but it can also be confusing and complex. To avoid making false assumptions or analysis mistakes, and to save time identifying data sources for your work, you should understand these fundamental characteristics of Wikimedia projects.

Wiki projects are more than Wikipedia

Wikipedia may be the most well-known wiki project, but there are 18 official Wikimedia projects. The term "wiki projects" usually refers to all the open knowledge wikis supported by the Wikimedia Foundation, which includes Wikipedias, Commons, Wikidata, Wikisource, Wiktionary, and more.

In addition to those "core content projects", Wikimedia wikis include many other projects. Those projects may also include content, or they may focus on other types of activities and contributions that support the movement.

Why does this matter for your analysis?

Being aware of the full range of wiki projects can help you avoid omitting relevant data, or making incorrect assumptions. For example, if you're analyzing wiki content, you may want to consider how Wikipedia articles use images from Commons, and that both images and articles can have associated Wikidata statements. You may also want to investigate if communities are creating content related to your area of inquiry in sister wiki projects.

Some wikis have language editions, others are multilingual

MediaWiki is the software that powers the Wikimedia projects. Some wiki projects, like Wikipedia, have a unique MediaWiki instance for each language. Those different MediaWiki instances are called "language editions". Other projects, like Wikidata, are multilingual: they have only one MediaWiki instance, and content is translated into multiple languages within that instance. Each MediaWiki instance has its own database.
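
For example, here is a minimal sketch (in Python, assuming the requests library) of listing Wikipedia's language editions and their database names via the MediaWiki sitematrix API on Meta-Wiki; the User-Agent value is an illustrative placeholder you should replace with your own contact information:

    import requests

    # Query the sitematrix module of the MediaWiki Action API on Meta-Wiki.
    resp = requests.get(
        "https://meta.wikimedia.org/w/api.php",
        params={"action": "sitematrix", "format": "json"},
        headers={"User-Agent": "example-script/0.1 (your-email@example.com)"},  # placeholder
    )
    matrix = resp.json()["sitematrix"]

    for key, group in matrix.items():
        if key in ("count", "specials"):
            continue  # skip the total count and the special (multilingual) wikis
        for site in group.get("site", []):
            if site["code"] == "wiki":  # the "wiki" family is Wikipedia
                print(group["code"], site["dbname"], site["url"])

Each dbname printed above (enwiki, dewiki, and so on) corresponds to one of the per-language databases described in this section.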

The technologies and tools used by Wikimedians can vary substantially between wikis. Communities manage and customize many aspects of MediaWiki, which results in different user interfaces and functionality for different language editions. Different communities may have different governance practices, administrative rules, and collaboration models. Also, local and offline context can influence the makeup of the editor community, the availability of citation sources, and more.

Why does this matter for your analysis?

Always remember that user behavior and content patterns on large wikis, like English Wikipedia, are not universally applicable to all wiki projects. Making comparisons across language editions requires care and an understanding of the many variables involved, and in some cases it may not be feasible.

Understanding cross-lingual differences on Wikimedia projects is essential to making accurate comparisons. Especially if you're working on NLP or building multilingual language models, the quality of your multilingual dataset depends on your understanding of how and why content varies between language editions. Learn more about how to approach research using multilingual Wikimedia content.

User interfaces and software functionality vary

The functionality and appearance of MediaWiki don't just vary by language edition; they can also vary along one or more of the following dimensions:

  • User interface customization: Users and wiki admins can choose from multiple MediaWiki skins, and those can be further customized by user scripts and gadgets. For details, visit mw:Manual:Interface.
  • Device type: Wikis can be viewed and edited through mobile apps, desktop or mobile web browsers, or specialized devices.
  • Platforms: Wiki content can be consumed through search engines, voice assistants, specialized apps, and in many other contexts, without a user ever directly accessing a Wikimedia site.
  • Editing interfaces: Wiki content can be edited using VisualEditor or source editing, or via tools or bots that use MediaWiki APIs. See mw:Editor for a full list of editing interfaces.

Why does this matter for your analysis?

In some cases, it might not be possible to compare user behavior or content interactions across different device types or platforms, because the user interface and functionality vary so significantly. For example, some types of Wikidata content reused on Wikipedia, like categories and geographic coordinates, appear on desktop but not on the mobile site or in the apps[1]. If you're researching content consumption or contribution, you should consider and account for the many variables and environments in which those activities occur.

Not all wiki users and editors are humans

"Our communities are powered by generous humans and efficient robots" —  Research:Wikistats_metrics

Wiki content is both consumed and edited by humans and by automated agents or scripts called 'bots'. Bots range from major search engine crawlers that index wiki content to simple scripts written by individual users to scrape a few wiki pages. Bots can influence traffic data, like pageview metrics; vandalism can even take the form of automated bots that seek to influence which pages appear on most-read lists.

Bots perform large numbers of wiki edits and maintenance tasks. They're an essential part of the wiki ecosystem, and they enable human volunteers to focus their time and energy on tasks that require human attention. The availability of bots or other automated filters to do simple tasks on a wiki varies greatly between language editions (see above).

Why does this matter for your analysis?

For most datasets, you must filter out (or consciously decide to include) bot traffic or automated contributions. Most Wikimedia data sources provide separate files for automated traffic, or a field that flags individual users as bots or automated agents. For more details about how to identify and work with data related to bots, see the Traffic data and Contributing data overviews, or the documentation for specific datasets.
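
For example, the Wikimedia Pageviews REST API reports traffic separately by agent type (user, spider, automated), which makes this kind of filtering explicit. A minimal sketch in Python, assuming the requests library; the article title and date range are arbitrary examples, and the User-Agent value is a placeholder:

    import requests

    BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"
    HEADERS = {"User-Agent": "example-script/0.1 (your-email@example.com)"}  # placeholder

    # Compare human and automated traffic for one article over one month.
    for agent in ("user", "spider", "automated"):
        url = f"{BASE}/en.wikipedia.org/all-access/{agent}/Earth/monthly/20240101/20240201"
        items = requests.get(url, headers=HEADERS).json().get("items", [])
        print(agent, sum(item["views"] for item in items))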

Wiki content is more than articles, and contributions are more than edits

Not all wiki projects focus on articles or written text, and many projects include multiple types of content within pages. For example, Wikipedia articles often contain page elements like templates, images, and structured data. All of these elements include meaningful information that enriches the article text, but the work of creating or maintaining this non-text content may happen in another wiki, or in a different namespace.

Building and maintaining wiki projects requires many types of contributions beyond just editing articles. Wiki editing is supported and enabled by a wide range of contributions from administrators, patrollers, translators, developers, and many other community members.

Why does this matter for your analysis?

Depending on the wiki project and community, you could miss many types of interactions and contributions if you ignore the full range of content types, namespaces, and ways of contributing. When framing and scoping your research, consider all the different types of wiki content and community activities, and how they interact. For example, conversations on article Talk pages are an important part of the editing and maintenance process: should your analysis include text or data from those pages?

Namespaces are a complicated and important topic! For more details about namespaces and how they can impact your analysis, see the guide to Analyzing content.

Privacy is essential, even in public Wikimedia datasets

Protecting anonymity is not only an important value of the Wikimedia movement, it's essential to how the Wikimedia Foundation publishes and shares data. The following are examples of data that is private or unavailable:

  • Data about some countries and territories may not be published. See the Country and Territory Protection List.
  • IP addresses and User Agent strings
  • Search or browsing history of a particular user
  • The geographic location of readers on a per-page basis. You can see the total pageviews by country across all Wikimedia sites at stats.wikimedia.org.

Wikimedia public data dumps and datasets contain public information and aggregated or non-personal information from Wikimedia projects. The types of information that are considered "public" are more fully explained in the Wikimedia Foundation Privacy Policy. Since 2021, WMF has used differential privacy on some data releases; learn more about those datasets at Differential_privacy.

Why does this matter for your analysis?

You should understand which information isn't available. For example, there is no way to accurately identify individual readers; that is by design. Individual editor accounts are easy to identify, but their real-life identities are not.

You should understand how Wikimedia privacy practices influence the assumptions you can make about the data. For example, you can't assume that one user account represents one person, or that an individual user always logs in when making contributions.

Researchers who have a formal collaboration agreement with the Wikimedia Foundation may gain access to additional data sources, and must follow Data Access Guidelines, Data Publication Guidelines, and the Open Access Policy.

Wiki projects have knowledge gaps

As community-curated knowledge bases, Wikimedia projects reflect the knowledge, interests, and priorities of their contributors. There are many wiki communities and projects focused on understanding and reducing inequities in how wiki content is generated and maintained, how wiki communities function, and how larger issues like the digital divide and access to knowledge impact the Wikimedia movement.

Why does this matter for your analysis?

If you're researching wiki content, always keep in mind the possibility of knowledge gaps in areas like gender representation, geographical and linguistic coverage of topics, and more.

If you're analyzing traffic, contributions, or contributors, remember that complex and intersecting socio-technical variables impact individuals and wiki communities. Consider both the offline and online contexts in which people consume wiki content and contribute to the wiki projects. For example, Wikimedia Apps may provide offline access to content, or readers may access articles offline through Kiwix deployments.

Data domains

This section describes the major types of publicly available, openly licensed data about Wikimedia projects. You should understand the different types of data so you can narrow down your analysis to specific datasets and relevant tools.

Traffic and readership data

"Traffic data" usually refers to data that is purely about content consumption. Other terms for this type of data are analytics, pageviews, readership, or site usage. This data represents actions or user behaviors that don't modify the status of a wiki. In contrast, actions that change the state of the wiki, and are recorded by the MediaWiki software in its various database tables, are better represented by datasets in other domains, like Contributing or Content.

✅ Examples of events that are traffic:
  • A user views a file on Commons
  • A web scraper accesses an article on Wikipedia

❌ Examples of events that are not traffic:
  • A user creates an account
  • A bot adds a category to some pages

Traffic and readership data sources

  • APIs
  • Dumps
  • Specialized datasets
  • MediaWiki database tables: N/A
  • Dashboards

To learn more about available datasets, tools, and analysis methods in this domain, visit the Traffic data analysis overview (TODO: add link).

Content

If you're interested in analyzing editing patterns, but you don't need the full content of wiki pages, see the Contributions and contributors section.

In the context of Wikimedia data, "content" refers to the text, media, or data stored in the wiki project's MediaWiki database. Depending on the project and data source, "content" can mean raw Wikitext, parsed HTML of wiki pages, Wikibase JSON, images, or other types of content.

The MediaWiki database structure has separate tables for text, revisions, and pages. When a user edits a page, the act of editing creates a revision: a new row in the revision table of that wiki's MediaWiki database. The text of that revision is the "content" of the page, and it is stored in the text table, separate from both the revision table and the page table. This data structure means you can analyze contributions and user activities without dealing with large amounts of stored raw content, but if you do need the raw content, you have to work with different or combined data sources.
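
For example, you can fetch the current wikitext of a single page through the MediaWiki Action API without querying any tables yourself. A minimal sketch in Python, assuming the requests library; the page title is an arbitrary example and the User-Agent value is a placeholder:

    import requests

    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "titles": "Earth",       # arbitrary example page
            "prop": "revisions",
            "rvprop": "content",     # request the revision text itself
            "rvslots": "main",
            "format": "json",
            "formatversion": "2",
        },
        headers={"User-Agent": "example-script/0.1 (your-email@example.com)"},  # placeholder
    )
    page = resp.json()["query"]["pages"][0]
    wikitext = page["revisions"][0]["slots"]["main"]["content"]
    print(wikitext[:200])  # first 200 characters of the page's wikitext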

✅ Examples of content:
  • Revision text (Wikitext)
  • Parsed HTML content of Wikimedia articles
  • Wikidata items as JSON objects
  • Images from Wikimedia Commons

❌ Not content:
  • Metadata about pages, users, and revisions (see Contributions and contributors)

Content data sources

  • APIs
  • Dumps
  • Specialized datasets
  • MediaWiki database tables
  • Dashboards: N/A


To learn more about available datasets, tools, and analysis methods in this domain, visit the Content analysis overview.


Contributions and contributors

If you want to analyze the full content of revisions, and not just metadata about them, see the Content section above.

Contributions to the wiki projects take many forms, but editing the content of a wiki is one of the most frequently analyzed types of contribution. Other types of contributor activities include patrolling, account management, organizing events, and technical work like coding gadgets or extensions.

Analyzing edits or editor activities usually involves working only with metadata, as opposed to the full content of wiki pages. Metadata includes information about pages, users, and revisions, but does not include the content of the revision itself.
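
For example, the same Action API used for content can return revision metadata without any content at all. A minimal sketch in Python, assuming the requests library; the page title is an arbitrary example and the User-Agent value is a placeholder:

    import requests

    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "titles": "Earth",                            # arbitrary example page
            "prop": "revisions",
            "rvprop": "ids|timestamp|user|comment|size",  # metadata only, no content
            "rvlimit": "10",                              # ten most recent revisions
            "format": "json",
            "formatversion": "2",
        },
        headers={"User-Agent": "example-script/0.1 (your-email@example.com)"},  # placeholder
    )
    for rev in resp.json()["query"]["pages"][0]["revisions"]:
        print(rev["timestamp"], rev.get("user"), rev["size"])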

✅ Examples of contributions:
  • Editing a wiki page
  • Uploading an image to Commons
  • Patrolling and fighting vandalism
  • Curating categories on a wiki
  • Responding to questions at the Village Pump, on IRC, or in other forums
  • Creating a bot to automate wiki activities
  • Adding translations through translatewiki.net
  • Submitting a patch to a Wikimedia codebase
  • Organizing a wiki community event
  • ...so much more!

❌ Not contributions:
  • Reading or consuming content from a wiki (see Traffic)

Contributions and contributors data sources

The following source types cover both contributions/edits and contributors/editors:

  • APIs
  • Dumps
  • Specialized datasets
  • MediaWiki database tables
  • Dashboards

To learn more about available datasets, tools, and analysis methods in this domain, visit the Analyzing contributions and contributors overview.

Tools and data access methods

You should choose a data access method based on your analysis goals, your research domain, the datasets you need, and various other factors. Most research and analysis use cases require combining several data sources and parsing, formatting, and filtering the data before you can analyze it.

In general, Wikimedia datasets are very large, and different sources use different file and compression formats. Researchers and developers have created scripts, libraries, and tools to address many of these challenges.

Overview of data access methods

This is an overview of the general benefits and constraints of available data access methods. For details about specific APIs, dumps, databases, or dashboards, see the linked pages for each data domain.

🌈 TODO: add link to a full reference table of all data sources by domain

APIs

Benefits:
  • Relatively easy to use
  • Responses in standard formats that are easy to parse
  • Direct access to the data contained in MediaWiki databases through HTTP requests

Constraints:
  • Not optimized for bulk access (in some APIs, one page is a single response)
  • Often limited to current or recent data
  • Restricted to a small number of requests at a time (rate limited)

More info: APIs for analyzing each data domain (see the sections above).
Dumps

Benefits:
  • Query for patterns across a large corpus of data
  • Includes full content in addition to metadata
  • Historical data is available
  • You can use as much computing power as you can provide

Constraints:
  • Large file sizes make parsing and extraction cumbersome
  • Different dumps use different types of aggregation, which requires extra work to compare across datasets
  • Requires writing your own code, although libraries are available (see the sketch below)
  • You must provide the necessary computing power, which may be substantial
  • Most use less-common formats (XML or SQL statements)

More info: dumps for analyzing each data domain (see the sections above).
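
For example, here is a minimal sketch of streaming through a pages dump with the third-party mwxml Python library (pip install mwxml); the file name is illustrative, and real dump files from dumps.wikimedia.org include the wiki name and dump date:

    import bz2
    import mwxml

    # Stream the dump without decompressing it to disk or loading it into memory.
    with bz2.open("enwiki-latest-pages-articles.xml.bz2") as f:  # illustrative file name
        dump = mwxml.Dump.from_file(f)
        for page in dump:
            for revision in page:
                # Revision objects carry metadata; content dumps also include wikitext.
                print(page.title, revision.id)
                break  # first revision only, to keep the sketch cheap
            break  # first page only
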
MediaWiki database replicas

Benefits:
  • Connect to shared server resources and query a copy of the MediaWiki content databases
  • Use browser-based query tools like PAWS or Quarry, or command-line tools (see the sketch below)
  • Data is raw, not aggregated

Constraints:
  • Querying multiple wiki projects requires multiple queries
  • Queries may be too computationally expensive
  • Limited to current or recent data
  • May require a Wikimedia developer account

More info: MediaWiki tables for analyzing each data domain (see the sections above).
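
For example, a minimal sketch assuming a Toolforge account with replica credentials in ~/replica.my.cnf and the pymysql library; the host and database names follow the wiki's dbname (here enwiki becomes enwiki_p) and may differ in your environment, and the SQL itself can also be pasted directly into Quarry:

    import pymysql

    # Connect to the English Wikipedia replica (analytics cluster).
    conn = pymysql.connect(
        host="enwiki.analytics.db.svc.wikimedia.cloud",
        database="enwiki_p",
        read_default_file="~/replica.my.cnf",  # Toolforge-provided credentials
    )
    with conn.cursor() as cur:
        # Five most recently created pages in the article namespace.
        cur.execute(
            """
            SELECT page_id, page_title
            FROM page
            WHERE page_namespace = 0
            ORDER BY page_id DESC
            LIMIT 5
            """
        )
        for page_id, title in cur.fetchall():
            print(page_id, title.decode("utf-8"))  # titles are stored as binary strings
    conn.close()
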
Dashboards

Benefits:
  • Easy to use
  • Access via web browser
  • No need to learn the MediaWiki database structure

Constraints:
  • Limited to predefined datasets and filters; may not cover all wiki projects or languages
  • Often limited to current or recent data
  • Provide statistics but not raw data

More info: dashboards for analyzing each data domain (see the sections above).

Get started

References