User:Neil Shah-Quinn (WMF)/Data portal draft

There is a great deal of publicly available, open-licensed data about Wikimedia projects. This page is intended to help community members, developers, and researchers who are interested in analyzing raw data learn what data and infrastructure are available. If you have any questions, you might find the answer in the Frequently Asked Questions about Data.

If you wish to browse pre-computed metrics and dashboards, see statistics.

If this publicly available data isn't sufficient, you can look at the page on private data access to see what non-public data exists and how you can gain access.


If you wish to donate or document any additional data sources, you can use the Wikimedia organization on DataHub.

Quick glance

Data Dumps (details)

Homepage | Download

Dumps of all WMF projects for backup, offline use, research, etc.
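Dump files are XML documents containing nested `<page>` and `<revision>` elements. As a rough sketch, here is how one might pull page titles out of a dump-shaped fragment; the inline sample is hand-written and simplified (real dumps are large compressed files and use an XML namespace):

```python
# Sketch: iterating over pages in a simplified XML dump fragment.
# The inline sample only mirrors the basic <page>/<revision> nesting;
# real dump files are bzip2-compressed and carry an XML namespace.
import xml.etree.ElementTree as ET

SAMPLE = """<mediawiki>
  <page>
    <title>Example</title>
    <revision>
      <timestamp>2015-01-01T00:00:00Z</timestamp>
      <text>Example article text.</text>
    </revision>
  </page>
</mediawiki>"""

def page_titles(xml_text):
    """Yield the title of every <page> element in the fragment."""
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        yield page.findtext("title")

print(list(page_titles(SAMPLE)))  # ['Example']
```

For full dumps you would stream the file with `ET.iterparse` rather than loading it into memory.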

API (details)

Homepage

The API provides direct, high-level access to the data contained in MediaWiki databases through HTTP requests to the web service.

  • Meta info about the wiki and logged-in user, properties of pages (revisions, content, etc.) and lists of pages based on criteria
  • JSON, WDDX, XML, YAML, and PHP's native serialization format
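For example, the "meta info about the wiki" mentioned above can be requested through the `meta=siteinfo` module of the web API. This sketch only builds the request URL (no request is sent); the endpoint and parameters follow the standard `api.php` conventions:

```python
# Sketch: building a request URL for the web API's siteinfo module,
# which returns general information about the wiki.
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

params = {
    "action": "query",
    "format": "json",
    "meta": "siteinfo",               # general wiki information
    "siprop": "general|statistics",   # which siteinfo sections to return
}
url = API_ENDPOINT + "?" + urlencode(params)
print(url)
```

Fetching that URL with any HTTP client returns the wiki's metadata in the requested format.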

Toolforge (details)

Homepage

Toolforge allows you to connect to shared server resources and query a copy of the database (with some lag).

  • acts as a standard web server hosting web-based tools
  • command-line tools
  • account required
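Queries against the database copies are plain SQL over the standard MediaWiki schema. A minimal sketch of such a query (the `page` table and its columns are from the MediaWiki schema; how you submit it depends on your Toolforge setup):

```python
# Sketch: a query against the MediaWiki `page` table on a wiki's
# database copy. Column names come from the standard MediaWiki schema;
# run it from a Toolforge account against the replica of your wiki.
QUERY = """
SELECT page_title
FROM page
WHERE page_namespace = 0   -- main (article) namespace
ORDER BY page_touched DESC
LIMIT 10;
"""
print(QUERY)
```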

Recent changes stream (details)

Homepage

Wikimedia broadcasts every change to every Wikimedia wiki using the Socket.IO protocol.
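Each broadcast change arrives as a JSON object. The sketch below handles one such event; the sample message is a hand-written stand-in for a message received over the socket, and the exact field set of real events may differ:

```python
# Sketch: handling one change event from the stream. The sample below
# is a hand-written stand-in for a message received over the socket;
# the fields shown (wiki, type, title, user) appear in change events.
import json

sample_message = json.dumps({
    "wiki": "enwiki",
    "type": "edit",
    "title": "Example",
    "user": "ExampleUser",
})

def describe_change(message):
    """Return a one-line summary of a change event."""
    change = json.loads(message)
    return "{user} made an {type} to {title} on {wiki}".format(**change)

print(describe_change(sample_message))  # ExampleUser made an edit to Example on enwiki
```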

Analytics Dumps (details)

Homepage

Raw pageview counts, unique device estimates, mediacounts, etc.
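The hourly pageview files are plain text, one space-separated record per line. A parsing sketch, assuming the published field layout (domain code, page title, view count, transferred bytes):

```python
# Sketch: parsing one line of an hourly pageviews file. The field
# layout (domain code, page title, view count, transferred bytes) is
# assumed from the published pageviews file format.
from collections import namedtuple

PageviewLine = namedtuple("PageviewLine", "domain title views nbytes")

def parse_line(line):
    domain, title, views, nbytes = line.split(" ")
    return PageviewLine(domain, title, int(views), int(nbytes))

row = parse_line("en Main_Page 1234 0")
print(row.title, row.views)  # Main_Page 1234
```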

WikiStats (details)

Homepage | Download

Reports in 25+ languages based on data dumps and server log files.

  • Unique visits, page views, active editors, and more
  • Intermediate CSV files available
  • Graphical presentation
  • Monthly

DBpedia (details)

Homepage

DBpedia extracts structured data from Wikipedia, allowing users to run complex queries and link Wikipedia data to other data sets.

  • RDF (N-Triples), SPARQL endpoint, Linked Data
  • billions of triples in a consistent ontology
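Queries against the SPARQL endpoint are written in the SPARQL query language against the DBpedia ontology. An illustrative query string (the `dbo:` properties shown are examples; consult the DBpedia ontology for the properties you need):

```python
# Sketch: a SPARQL query one might send to the DBpedia endpoint.
# dbo:City and dbo:populationTotal are DBpedia ontology terms used
# here for illustration.
SPARQL_QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?city ?population WHERE {
  ?city a dbo:City ;
        dbo:populationTotal ?population .
}
LIMIT 10
"""
print(SPARQL_QUERY)
```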

DataHub and Figshare (details)

DataHub Homepage

A collection of various Wikimedia-related datasets.

  • smaller (usually one-time) surveys/studies
  • dbpedia lite, DBpedia-Live and others
  • EPIC/Oxford quality assessment

Figshare (datasets tagged 'wikipedia')

Readership data

Editing metadata

Editing metadata includes information such as the user, timestamp, and revision comment, but does not include the content of the revision itself.

This data is available from:
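For instance, revision metadata can be requested through the web API described above. This sketch only constructs the request URL; the `rvprop` parameter selects metadata fields (ids, user, timestamp, comment) and deliberately omits the revision text:

```python
# Sketch: requesting revision metadata (no page text) through the
# web API. rvprop selects which revision fields to return; omitting
# "content" keeps the response to metadata only.
from urllib.parse import urlencode

params = {
    "action": "query",
    "format": "json",
    "titles": "Main Page",
    "prop": "revisions",
    "rvprop": "ids|user|timestamp|comment",  # metadata only
    "rvlimit": "5",
}
url = "https://en.wikipedia.org/w/api.php?" + urlencode(params)
print(url)
```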

Raw content data

Data that includes the raw content of page revisions is available from:

Structured content data

Miscellaneous data

Analysis infrastructure

In addition to the raw data described above, there is a great deal of helpful infrastructure for research and analysis provided for people contributing to Wikimedia's mission.