Research:Data

There is a great deal of publicly available, open-licensed data about Wikimedia projects. This page is intended to help community members, developers, and researchers who are interested in analyzing raw data learn what data and infrastructure are available.

If you have any questions, you might find the answer in the Frequently Asked Questions about Data. If you still have questions, you can email your question to the Analytics mailing list (more information).

If you wish to browse pre-computed metrics and dashboards, see statistics.

If this publicly available data isn't sufficient, you can look at the page on private data access to see what non-public data exists and how you can gain access.

If you wish to donate or document any additional data sources, you can use the Wikimedia organization on DataHub.

See also inspirational example uses.

Also consider searching for datasets at Zenodo, Figshare, Dimensions.ai, Google Dataset Search or Academic Torrents.

Quick glance

Data Dumps (details)

Dumps of all WMF projects for backup, offline use, research, etc.

  • Wiki content, revisions, metadata, and page-to-page and outside links
  • XML and SQL format
  • Produced once or twice a month
  • large file sizes
  • The dumps.wikimedia.org domain also hosts other data

APIs (details)
  • The MediaWiki API provides direct, high-level access to the data contained in MediaWiki databases over the web.
    • Meta info about the wiki and logged-in user, properties of pages (revisions, content, etc.) and lists of pages based on criteria
    • JSON, XML, and PHP's native serialization format

Database access (Toolforge, PAWS, Quarry) (details)

The Toolforge hosting environment allows you to connect to shared server resources and query copies of the Wikimedia projects' content databases.

  • Acts as a standard web server hosting web-based tools
  • Command-line tools
  • Account required

PAWS is a Jupyter Notebook environment within Toolforge that lets you, for example, query database replicas and APIs for analysis.

Quarry is a public web interface for running SQL queries against the database replicas.

Recent changes stream (details)

Wikimedia broadcasts every change to every Wikimedia wiki using Server-Sent Events over HTTP.

Analytics Dumps (details)

Raw pageviews, unique device estimates, mediacounts, etc.

WikiStats (details)

Reports based on data dumps and server log files.

  • Unique visits, page views, active editors and more
  • Intermediate CSV files available
  • Graphical presentation

DBpedia (details)

DBpedia extracts structured data from Wikipedia. It allows users to run complex queries and link Wikipedia data to other data sets.

  • RDF, N-Triples, SPARQL endpoint, Linked Data
  • Billions of triples of information in a consistent ontology

DataHub and Figshare (details)

A collection of various Wikimedia-related datasets.

Data dumps

WMF regularly releases data dumps of Wikipedia, Wikidata, and all other WMF projects, as well as dumps of other Wikimedia-related data such as search indices and short URL mappings.

Content

XML/SQL dumps

  • Text of current and/or all revisions of all pages, in XML format (schema)
  • Metadata for current and/or all revisions of all pages, in XML format (schema)
  • Most database tables as SQL files
    • Page-to-page link lists (pagelinks, categorylinks, imagelinks, templatelinks tables)
    • Lists of pages with links outside of the project (externallinks, iwlinks, langlinks tables)
    • Media metadata (image, oldimage tables)
    • Info about each page (page, page_props, page_restrictions tables)
    • Titles of all pages in the main namespace, i.e. all articles (*-all-titles-in-ns0.gz)
    • List of all pages that are redirects and their targets (redirect table)
    • Log data, including blocks, protection, deletion, uploads (logging table)
    • Misc bits (interwiki, site_stats, user_groups tables)
  • Stub-prefixed dumps for some projects which only have header info for pages and revisions without actual content

Other dumps

See the full list of what is available for download.

Download

You can download the latest dumps, which cover roughly the last year, from each project's directory (dumps.wikimedia.org/enwiki/ for English Wikipedia, dumps.wikimedia.org/dewiki/ for German Wikipedia, etc.). Download mirrors offer an alternative to the download page.

Due to large file sizes, using a download tool is recommended.
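As a rough illustration of the per-project URL layout described above, the following Python sketch builds a dump URL and streams it to disk in chunks. The `latest/` directory and the `pages-articles.xml.bz2` file name are assumptions based on the usual layout of the dumps site, and the helper names and User-Agent string are hypothetical. For multi-gigabyte files, a dedicated download tool with resume support is still the better choice.

```python
import shutil
import urllib.request

DUMPS_BASE = "https://dumps.wikimedia.org"
# Hypothetical User-Agent; replace with one that identifies you or your tool.
USER_AGENT = "research-data-example/0.1 (hypothetical; add your contact)"


def dump_url(wiki: str, filename_suffix: str = "pages-articles.xml.bz2") -> str:
    """Build the URL of a file in a wiki's 'latest' directory (assumed layout)."""
    return f"{DUMPS_BASE}/{wiki}/latest/{wiki}-latest-{filename_suffix}"


def download(url: str, destination: str, chunk_size: int = 1024 * 1024) -> None:
    """Stream a remote file to disk without holding it all in memory."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
```

For instance, `download(dump_url("dewiki", "all-titles-in-ns0.gz"), "dewiki-titles.gz")` would fetch the comparatively small list of German Wikipedia article titles, provided that file exists under the assumed name.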

There are also archives. Many older dumps can also be found at the Internet Archive.

Data format

XML dumps are in the wrapper format described at Export format (schema). Files are compressed in gzip (.gz), bzip2/lbzip2 (.bz2) and .7z formats.
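Because the XML dumps are large and compressed, it usually helps to read them as a stream rather than loading them whole. The sketch below is one possible approach using only the Python standard library; it assumes the export format nests `<title>` and `<revision>/<text>` inside each `<page>` element, and the function names are illustrative rather than part of any Wikimedia tool.

```python
import bz2
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator, Tuple


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream: BinaryIO) -> Iterator[Tuple[str, str]]:
    """Yield (title, text of the last revision seen) for every <page>."""
    title, text = "", ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = "", ""
            elem.clear()  # release the finished page's subtree


def iter_dump_pages(path: str) -> Iterator[Tuple[str, str]]:
    """Open a .bz2 dump file and stream its pages."""
    with bz2.open(path, "rb") as stream:
        yield from iter_pages(stream)
```

Matching on the local tag name avoids hard-coding the schema version that appears in the XML namespace, which changes between export format releases.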

SQL dumps are provided as dumps of entire tables, using mysqldump.

Some older dumps exist in various formats.

How to and examples

See examples of importing dumps in a MySQL database with step-by-step instructions.

Existing tools

Some tools are listed on the following pages, but these tools are mostly outdated and non-functional:

License

All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL). Images and other files are available under different terms, as detailed on their description pages.

Support

MediaWiki API

The MediaWiki API provides direct, high-level access to the data contained in MediaWiki databases. Client programs can log in to a wiki, get data, and post changes automatically by making HTTP requests.

Content

Endpoint

To query the database, send an HTTP GET request to the desired endpoint (for example, https://en.wikipedia.org/w/api.php for English Wikipedia), setting the action parameter to query and defining the query details in the URL.
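A minimal Python sketch of such a request might look like the following. It builds a GET URL with `action=query` and asks for basic page information in JSON; the choice of `prop=info`, the helper names, and the User-Agent string are illustrative assumptions, not a recommended client.

```python
import json
import urllib.parse
import urllib.request
from typing import Iterable

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"
# Hypothetical User-Agent; replace with one that identifies you or your tool.
USER_AGENT = "research-data-example/0.1 (hypothetical; add your contact)"


def build_query_url(titles: Iterable[str], endpoint: str = API_ENDPOINT) -> str:
    """Encode an action=query request for basic info about the given pages."""
    params = {
        "action": "query",
        "prop": "info",
        "titles": "|".join(titles),  # multiple values are pipe-separated
        "format": "json",
    }
    return endpoint + "?" + urllib.parse.urlencode(params)


def fetch_json(url: str) -> dict:
    """Send the GET request and decode the JSON response."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `fetch_json(build_query_url(["Main Page"]))` would then return the decoded response as a dictionary.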

How to and examples

Existing tools

To try out the API interactively on English Wikipedia, use the API Sandbox.

Access

To use the API, your application or client might need to log in.

Before you start, learn about the API etiquette.

Researchers may be granted special access rights on a case-by-case basis.

License

All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL).

Support

Toolforge and PAWS

Toolforge hosts command-line or web-based tools, which can query copies of the database. The copies are generally close to real time, but replication lag sometimes occurs.

PAWS is a Jupyter Notebook environment within Toolforge that lets you, for example, query database replicas for analysis.

Content

Toolforge hosts copies of the databases of all Wikimedia projects including Commons. You can use the contents of the databases under the Toolforge rules.

Data format

Explore the database schema of the MediaWiki software.

How to

Using Toolforge requires familiarity with Unix/Linux command line, SSH keys, SQL/databases, and some programming.

To start using Toolforge, see the Quickstart guide.
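Once an account is set up, a query against a database copy from inside Toolforge could be sketched roughly as follows. Several details here are assumptions to verify against the Toolforge documentation: the replica host name pattern, the `_p` database suffix, the `~/replica.my.cnf` credentials file, and the availability of the third-party `pymysql` package. The function names are hypothetical.

```python
import os
from typing import Tuple


def replica_target(wiki: str, cluster: str = "analytics") -> Tuple[str, str]:
    """Return the (host, database) pair assumed for a wiki's replica."""
    host = f"{wiki}.{cluster}.db.svc.wikimedia.cloud"  # assumed naming pattern
    database = f"{wiki}_p"  # assumed suffix for the public replica views
    return host, database


def count_pages(wiki: str, namespace: int = 0) -> int:
    """Count rows in the MediaWiki 'page' table for one namespace."""
    import pymysql  # third-party package, assumed to be installed

    host, database = replica_target(wiki)
    connection = pymysql.connect(
        host=host,
        database=database,
        # Assumed location of the credentials issued to the account.
        read_default_file=os.path.expanduser("~/replica.my.cnf"),
    )
    try:
        with connection.cursor() as cursor:
            cursor.execute(
                "SELECT COUNT(*) FROM page WHERE page_namespace = %s",
                (namespace,),
            )
            (count,) = cursor.fetchone()
            return count
    finally:
        connection.close()
```

The query only works from inside the Toolforge environment; `count_pages("enwiki")` would return the number of main-namespace pages on English Wikipedia if the assumptions above hold.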

Existing tools

See https://admin.toolforge.org/

Support

See wikitech:Help:Cloud Services introduction#Communication and support

Recent changes stream

See EventStreams to subscribe to Recent changes on all Wikimedia wikis. This broadcasts edits and other changes as they happen.
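To give a feel for what subscribing involves, here is a small Python sketch that reads a Server-Sent Events stream and decodes each event's JSON payload. It assumes the recent changes stream lives at the URL shown and that each event carries one JSON document in its `data:` field. The parser handles only that subset of the SSE format, and the function names and User-Agent string are hypothetical.

```python
import json
import urllib.request
from typing import Iterable, Iterator

STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"
# Hypothetical User-Agent; replace with one that identifies you or your tool.
USER_AGENT = "research-data-example/0.1 (hypothetical; add your contact)"


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON payload of each event in a stream of SSE text lines."""
    data_lines = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            data_lines.append(line[len("data:"):].lstrip(" "))
        elif line == "" and data_lines:
            # A blank line ends the event; join multi-line data fields.
            yield json.loads("\n".join(data_lines))
            data_lines = []
        # Comment lines and other fields (event:, id:) are ignored here.


def stream_changes(url: str = STREAM_URL) -> Iterator[dict]:
    """Connect to the stream and yield change events as they arrive."""
    request = urllib.request.Request(
        url, headers={"Accept": "text/event-stream", "User-Agent": USER_AGENT}
    )
    with urllib.request.urlopen(request) as response:
        decoded = (line.decode("utf-8") for line in response)
        yield from iter_sse_events(decoded)
```

Iterating over `stream_changes()` runs until the connection closes, so a real consumer would add reconnection and filtering on top of this.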

Existing tools

See wikitech:Event Platform/EventStreams/Powered By

Analytics dumps

Analytics datasets offer data on pageviews, mediacounts, unique devices, revision history, data by country, and Wikidata QRanks.

Pageview statistics

Pageview statistics are one example. Each request for a page reaches one of Wikimedia's Varnish caching hosts. The project name and the title of the requested page are logged and aggregated hourly.

Files whose names start with "project" contain the total number of hits per project per hour.

Data format

See the README for details on the format.
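As an illustration only, the Python sketch below parses lines from an hourly pageviews file and totals the views per project. It assumes each line has four space-separated fields (project code, page title, view count, response size); the README remains the authoritative description, and the record and function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class PageviewRecord(NamedTuple):
    domain_code: str
    page_title: str
    count_views: int


def parse_line(line: str) -> PageviewRecord:
    """Split one line of the assumed 'code title views size' layout."""
    domain_code, rest = line.rstrip("\n").split(" ", 1)
    # Split from the right so the two numeric fields are taken last.
    page_title, count_views, _response_size = rest.rsplit(" ", 2)
    return PageviewRecord(domain_code, page_title, int(count_views))


def views_per_domain(lines: Iterable[str]) -> Counter:
    """Sum the hourly view counts for each project code."""
    totals: Counter = Counter()
    for line in lines:
        record = parse_line(line)
        totals[record.domain_code] += record.count_views
    return totals
```

Because the files are gzip-compressed, the lines would typically come from `gzip.open(path, "rt", encoding="utf-8")`.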

Existing tools

You can interactively browse the page view statistics at https://pageviews.toolforge.org. More documentation on the Pageviews Analysis tool is available.

WikiStats

Wikistats is an informal but widely recognized name for a set of reports which provide monthly trend information for all Wikimedia projects and wikis.

Content

Wikistats offers many dashboards that display trends in reading, contributing, and content, broken down by project, such as:

  • unique visitors
  • page views (overall and mobile only)
  • editor activity
  • article count

Data format

Data is presented as charts with the option to download the underlying data.

Support

For more details on Wikistats, see wikitech:Analytics/Systems/Wikistats_2.

DBpedia

DBpedia.org is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia and to link other datasets on the Web to Wikipedia data.

Content

The English version of the DBpedia knowledge base describes millions of things, and the majority of items are classified in a consistent ontology (persons, places, creative works like music albums, films and video games, organizations like companies and educational institutions, species, diseases, etc.). Localized versions of DBpedia in more than a hundred languages describe millions of things.

The data set also features:

  • about 2 billion pieces of information (RDF triples)
  • labels and abstracts for >10 million unique things in up to 111 different languages
  • millions of links to images, links to external web pages, data links into external RDF datasets, links to Wikipedia categories, YAGO categories
  • https://www.dbpedia.org/resources/ has download links for all the data sets, different formats and languages.

Data format

  • RDF/XML
  • Turtle
  • N-Triples
  • SPARQL endpoint
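A query against the SPARQL endpoint can be sketched in Python as below. The endpoint URL, the `format` parameter, and the example query are assumptions for illustration; only the `query` parameter comes from the standard SPARQL protocol, so check the DBpedia documentation before relying on the rest. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

SPARQL_ENDPOINT = "https://dbpedia.org/sparql"  # assumed public endpoint

# Example query: the English label of the DBpedia resource for Wikipedia.
EXAMPLE_QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Wikipedia> rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
"""


def sparql_url(query: str, endpoint: str = SPARQL_ENDPOINT) -> str:
    """Encode a SPARQL query as a GET request URL asking for JSON results."""
    params = {
        "query": query.strip(),
        "format": "application/sparql-results+json",  # assumed parameter
    }
    return endpoint + "?" + urllib.parse.urlencode(params)


def run_query(query: str) -> dict:
    """Send the query and decode the JSON result set."""
    with urllib.request.urlopen(sparql_url(query)) as response:
        return json.load(response)
```

With these assumptions, `run_query(EXAMPLE_QUERY)` would return a result set in the SPARQL JSON results format, where rows are listed under `results` and then `bindings`.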

Access

License

Support

DataHub

The Wikimedia organization on the Open Knowledge Foundation's DataHub is a collection of datasets about Wikipedia and other projects run by the Wikimedia Foundation.

The DataHub repository is meant to become the place where all Wikimedia-related data sources are documented. The collection is open to contributions and researchers are encouraged to donate relevant datasets.

Wikivoyage also maintains data on its own DataHub:

  • Hotels/restaurants/attractions data as CSV/OSM/OBF
  • Tourism guide for offline use