User:Neil Shah-Quinn (WMF)/Data portal draft

There is a great deal of publicly available, openly licensed data about Wikimedia projects. This page is intended to help community members, developers, and researchers who are interested in analyzing raw data learn what data and infrastructure are available. If you have any questions, you might find the answer in the Frequently Asked Questions about Data.

If you wish to browse pre-computed metrics and dashboards, see statistics.

If this publicly available data isn't sufficient, you can look at the page on private data access to see what non-public data exists and how you can gain access.


If you wish to donate or document any additional data sources, you can use the Wikimedia organization on DataHub.

Quick glance

Data Dumps (details)

Homepage | Download

Dumps of all WMF projects for backup, offline use, research, etc.
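The dumps use the MediaWiki XML export format, in which each `<page>` element wraps one or more `<revision>` children. As a sketch of how you might extract page titles without loading a whole dump into memory (the fragment below is a simplified, made-up sample; real dump files are compressed and carry namespaces and many more fields):

```python
import xml.etree.ElementTree as ET
from io import StringIO

# Tiny fragment in the shape of the XML export format (simplified).
# For a real dump, open the file with bz2.open(...) instead of StringIO.
SAMPLE = """<mediawiki>
  <page>
    <title>Example</title>
    <revision>
      <id>12345</id>
      <text>Hello, world.</text>
    </revision>
  </page>
</mediawiki>"""

def page_titles(xml_stream):
    """Yield page titles one at a time, streaming through the XML."""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag == "page":
            yield elem.findtext("title")
            elem.clear()  # free memory as we go

print(list(page_titles(StringIO(SAMPLE))))  # → ['Example']
```

Streaming with `iterparse` and clearing each element is what makes multi-gigabyte dumps tractable on an ordinary machine.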

API (details)

Homepage

The API provides direct, high-level access to the data contained in MediaWiki databases through HTTP requests to the web service.

  • Meta info about the wiki and logged-in user, properties of pages (revisions, content, etc.) and lists of pages based on criteria
  • Output formats: JSON, WDDX, XML, YAML, and PHP's native serialization format
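Because the API is plain HTTP, a request is just a URL built from `api.php` query parameters. A minimal sketch (the specific parameters here are one common combination, not the only one):

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",      # module: fetch data about the wiki
    "titles": "Main Page",  # page(s) to look up
    "prop": "info",         # which page properties to return
    "format": "json",       # response serialization
}
url = API + "?" + urlencode(params)
print(url)

# Live call (needs network access):
#   import json, urllib.request
#   data = json.load(urllib.request.urlopen(url))
#   pages = data["query"]["pages"]
```

The same pattern works on any Wikimedia wiki by swapping the hostname.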

Toolforge (details)

Homepage

Toolforge allows you to connect to shared server resources and query a copy of the database (with some lag).

  • acts as a standard web server hosting web-based tools
  • command-line tools
  • account required
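By convention, each wiki's replica database is the wiki's database name with a `_p` suffix (the sanitized public view), served from a per-wiki hostname. The hostname pattern below reflects the current Wiki Replicas naming and is an assumption to verify against the Toolforge documentation:

```python
# Hypothetical helper: builds the conventional Wiki Replicas connection
# details used on Toolforge (verify hostnames against current docs).
def replica(wiki: str) -> dict:
    return {
        "host": f"{wiki}.analytics.db.svc.wikimedia.cloud",
        "database": f"{wiki}_p",  # "_p" marks the sanitized public view
    }

conn = replica("enwiki")
print(conn["host"], conn["database"])

# With a MySQL client (e.g. pymysql) and your Toolforge credentials,
# you could then run queries such as:
#   SELECT page_title FROM page WHERE page_namespace = 0 LIMIT 5;
```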

Recent changes stream (details)

Homepage

Wikimedia broadcasts every change to every Wikimedia wiki using the Socket.IO protocol.
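Each event on the stream is a JSON object describing one change. The field names below follow the recent-changes schema, but this simplified event is illustrative, not a complete record. A sketch of filtering for edits on a single wiki:

```python
import json

# Example event in the shape broadcast on the recent changes stream
# (field set simplified for illustration).
raw = '{"type": "edit", "wiki": "enwiki", "title": "Example", "user": "Alice"}'

def is_edit_on(event: dict, wiki: str) -> bool:
    """True for edit events on the given wiki."""
    return event.get("type") == "edit" and event.get("wiki") == wiki

event = json.loads(raw)
print(is_edit_on(event, "enwiki"))  # → True
```

In a live client you would run this predicate against each event as it arrives.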

Analytics Dumps (details)

Homepage

Raw pageview, unique device estimates, mediacounts, etc.
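The per-page pageview files are whitespace-delimited text with one line per (project, page) pair. The four-column layout sketched below (project code, page title, view count, response bytes) matches the traditional pagecounts format; check the dataset documentation for the exact layout of the file you download:

```python
def parse_line(line: str) -> dict:
    """Split one pageview record into labeled fields."""
    project, title, views, size = line.split()
    return {"project": project, "title": title,
            "views": int(views), "bytes": int(size)}

rec = parse_line("en Main_Page 12345 0")
print(rec["views"])  # → 12345
```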

WikiStats (details)

Homepage | Download

Reports in 25+ languages based on data dumps and server log files.

  • Unique visits, page views, active editors, and more
  • Intermediate CSV files available
  • Graphical presentation
  • Updated monthly

DBpedia (details)

Homepage

DBpedia extracts structured data from Wikipedia, allows users to run complex queries and link Wikipedia data to other data sets.

  • RDF, N-Triples, SPARQL endpoint, Linked Data
  • billions of triples of information in a consistent ontology
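Since the SPARQL endpoint speaks plain HTTP, a query is a URL with `query` and `format` parameters. A sketch (the endpoint URL and the example query are illustrative):

```python
from urllib.parse import urlencode

ENDPOINT = "https://dbpedia.org/sparql"
# Ask for a few resources typed as dbo:City in the DBpedia ontology.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?city WHERE { ?city a dbo:City } LIMIT 5
"""
url = ENDPOINT + "?" + urlencode(
    {"query": QUERY, "format": "application/sparql-results+json"}
)
print(url)

# Fetching this URL (e.g. with urllib.request.urlopen) returns results
# in the SPARQL JSON results format.
```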

DataHub and Figshare (details)

DataHub Homepage

A collection of various Wikimedia-related datasets.

  • smaller (usually one-time) surveys/studies
  • dbpedia lite, DBpedia-Live and others
  • EPIC/Oxford quality assessment

Figshare (datasets tagged 'wikipedia')

Readership data

Editing metadata

Editing metadata includes information about the user, timestamp, revision comment, and so on, but does not include the content of the revision itself.

This data is available from:

Raw content data

Data that includes the raw content of page revisions is available from:

Structured content data

Miscellaneous data

Analysis infrastructure

In addition to the raw data described above, there is a great deal of helpful infrastructure for research and analysis provided for people contributing to Wikimedia's mission.