Data dumps

The Wikimedia Foundation is asking for help to ensure that as many copies as possible of all Wikimedia database dumps are available. Please volunteer to host a mirror if you have access to sufficient storage and bandwidth.

Summary

Description

WMF publishes data dumps of Wikipedia and all WMF projects on a regular basis. English Wikipedia is dumped once a month, while smaller projects are often dumped twice a month.

Content

  • Text and metadata of current or all revisions of all pages as XML files
  • Most database tables as SQL files
    • Page-to-page link lists (pagelinks, categorylinks, imagelinks, templatelinks tables)
    • Lists of pages with links outside of the project (externallinks, iwlinks, langlinks tables)
    • Media metadata (image, oldimage tables)
    • Info about each page (page, page_props, page_restrictions tables)
    • Titles of all pages in the main namespace, i.e. all articles (*-all-titles-in-ns0.gz)
    • List of all pages that are redirects and their targets (redirect table)
    • Log data, including blocks, protection, deletion, uploads (logging table)
    • Misc bits (interwiki, site_stats, user_groups tables)
  • Experimental add/change dumps (no moves or deletes, plus some other limitations): https://wikitech.wikimedia.org/wiki/Dumps/Adds-changes_dumps and http://dumps.wikimedia.org/other/incr/
  • Stub-prefixed dumps for some projects which only have header info for pages and revisions without actual content
  • Media bundles for each project, separated into files uploaded to the project and files from Commons (images: http://meta.wikimedia.org/wiki/Database_dump#Downloading_Images)
  • Static HTML dumps for 2007-2008: http://dumps.wikimedia.org/other/static_html_dumps/

Download

You can download the latest dumps (for the last year) at http://dumps.wikimedia.org/enwiki/ for English Wikipedia, http://dumps.wikimedia.org/dewiki/ for German Wikipedia, etc.

Archives: http://dumps.wikimedia.org/archive/

Current mirrors offer an alternative to the download page.

Due to large file sizes, using a download tool is recommended.
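
As a rough illustration of what a download tool does for large files, the Python sketch below resumes an interrupted transfer with an HTTP Range request instead of starting over. It is a minimal sketch, not an official tool: the helper names (wiki_dump_index, resume_request, download) are hypothetical, and it assumes the server honours Range requests.

```python
import os
import urllib.request

# Host taken from the download URLs quoted in this section.
DUMPS_BASE = "http://dumps.wikimedia.org"


def wiki_dump_index(wiki):
    """Return the per-wiki listing URL, e.g. 'enwiki' -> .../enwiki/."""
    return f"{DUMPS_BASE}/{wiki}/"


def resume_request(url, dest):
    """Build a request that asks only for the bytes not yet saved in dest."""
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    request = urllib.request.Request(url)
    if offset:
        # Ask the server to skip the part that is already on disk.
        request.add_header("Range", f"bytes={offset}-")
    return request, offset


def download(url, dest, chunk_size=1 << 20):
    """Stream url into dest in chunks, appending if the server resumes."""
    request, offset = resume_request(url, dest)
    # Note: if dest is already complete, the server may answer 416,
    # which urlopen raises as urllib.error.HTTPError.
    with urllib.request.urlopen(request) as response:
        # Status 206 means the Range header was honoured; otherwise restart.
        mode = "ab" if offset and response.status == 206 else "wb"
        with open(dest, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
```

To use it, pick a file URL from the listing page returned by wiki_dump_index("enwiki") and pass it to download() together with a local path; rerunning the same call after an interruption continues from the bytes already saved.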

Data format

XML dumps since 2010 are in the wrapper format described at Export format (schema). Files are compressed in bzip2 (.bz2) and 7-Zip (.7z) formats.

SQL dumps are provided as dumps of entire tables, using mysqldump.

Some older dumps exist in various formats.

https://meta.wikimedia.org/wiki/Data_dumps/Dump_format
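
Because the XML files are bzip2-compressed and can be very large, it is usually more practical to read them as a stream than to decompress them to disk first. The Python sketch below shows one way to do that with the standard library. It assumes the export format's page, title and text elements; the function names (local_name, iter_pages) are hypothetical, and the schema should be checked before relying on it.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(path):
    """Yield (title, text) for each page in a .bz2 XML dump, streaming.

    For dumps with full history a page holds several revisions; this
    sketch keeps only the text of the last revision in document order.
    """
    with bz2.open(path, "rb") as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if local_name(elem.tag) != "page":
                continue
            title, text = None, None
            for child in elem.iter():
                name = local_name(child.tag)
                if name == "title":
                    title = child.text
                elif name == "text":
                    text = child.text or ""
            yield title, text
            # Drop the page's children so memory use stays roughly flat.
            elem.clear()
```

A typical use is a loop such as `for title, text in iter_pages(path): ...`, where path points at a downloaded .bz2 file.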

How to and examples

See examples of importing dumps into a MySQL database with step-by-step instructions here.

Existing tools

Available tools are listed in the following locations, but information is not always up-to-date:

Access

All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL). Images and other files are available under different terms, as detailed on their description pages.

Support

Maintainer: Ariel Glenn

Mailing list: xmldatadumps-l

Research projects using data from this source


What is this all about?

Wikimedia provides public dumps of our wikis' content:

  • for archival/backup purposes
  • for offline use
  • for academic research
  • for bot use
  • for republishing (don't forget to follow the license terms)
  • for fun!

Please follow the XML Data Dumps mailing list, by reading the archives or subscribing, for up-to-date news about the dumps; you can also make inquiries about them there. If you cannot download the dump you want because it no longer exists, or if you have other issues with the files, you can ping the developers there.

Warning on Time and Size

Before attempting to download any of the wikis or their components, PLEASE READ CAREFULLY the time and space scale information below! Because some file collections run to terabytes, downloads can take days or even weeks. (See also our FAQ on the size of the English-language Wikipedia dumps.) Be sure you understand your storage capabilities before attempting downloads. Notice (below) that there are a number of versions that are "friendlier" in size and content, which you can tailor to your capacity by including or excluding images, talk pages, etc. A careful read of the info below will save a lot of headaches compared to jumping right into downloads.
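
To make the scale concrete, the following Python sketch estimates how long a download would take at a given link speed and checks whether it would fit on the local disk. The function names are hypothetical, and the 1 TB size and 10 Mbit/s speed in the usage note are purely illustrative figures, not actual dump sizes.

```python
import shutil


def estimate_download_seconds(size_bytes, megabits_per_second):
    """Return the transfer time in seconds, ignoring protocol overhead."""
    bytes_per_second = megabits_per_second * 1_000_000 / 8
    return size_bytes / bytes_per_second


def format_duration(seconds):
    """Render a duration in seconds as days, hours and minutes."""
    days, rem = divmod(int(seconds), 86400)
    hours, rem = divmod(rem, 3600)
    return f"{days}d {hours}h {rem // 60}m"


def fits_on_disk(size_bytes, path="."):
    """Check that the filesystem holding path has enough free space."""
    return shutil.disk_usage(path).free >= size_bytes
```

For example, format_duration(estimate_download_seconds(10**12, 10)) gives "9d 6h 13m": a hypothetical 1 TB collection over a steady 10 Mbit/s link takes more than nine days, before allowing for interruptions.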

What's available and where

It's all explained here: what's available and where you can download it.

How often dumps are produced

All databases are dumped via three groups of processes which run simultaneously. The largest database, enwiki, takes 8 or 9 days for a full run to complete, and is run once a month. A second set of 'large' wikis runs in a continuous loop with the aim of getting dumps for those out twice a month; for the rest we aim for three times a month, also on a rolling basis. Failures in the dump process are generally dealt with by rerunning the portion of the dump that failed. See the wikitech page for more information about the processes and the dump architecture.

Larger databases such as jawiki, dewiki, and frwiki can take a long time to run, especially when compressing the full edit history or creating split stub dumps. If you see a dump seemingly stuck on one of these for a few hours, or days, it's likely not dead, but simply processing a lot of data. You can check that file sizes are increasing or that more revisions are being processed, by reloading the web page for the dump.

The download site shows the status of each dump: if it's in progress, when it was last dumped, etc.

Format of the dump files

The format of the various files available for download is explained here.

Download Tools

You can download the XML/SQL files and the media bundles using a web client of your choice, but there are also tools for bulk downloading you may wish to use.

Tools for import

Here's your basic list of tools for importing.

Other tools

Check out and/or add to this partial list of other tools for working with the dumps, including parsers and offline readers.

Producing your own dumps

MediaWiki 1.5 and above include a command-line maintenance script, dumpBackup.php, which can be used to produce XML dumps directly, with or without page history.
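
As a hedged illustration, the Python sketch below wraps a dumpBackup.php run and writes the XML it prints on standard output to a file. It assumes the script lives at maintenance/dumpBackup.php under the wiki's installation directory and accepts --full (all revisions) and --current (latest revisions only); check the script's own help output for your MediaWiki version. The wrapper function names are hypothetical.

```python
import subprocess


def dump_backup_command(with_history, script="maintenance/dumpBackup.php"):
    """Build the command line for a dump with or without page history."""
    # Assumed options: --full for all revisions, --current for latest only.
    mode = "--full" if with_history else "--current"
    return ["php", script, mode]


def write_dump(out_path, with_history=False, wiki_root="."):
    """Run the script from wiki_root and save its XML output to out_path."""
    with open(out_path, "wb") as out:
        subprocess.run(
            dump_backup_command(with_history),
            cwd=wiki_root,
            stdout=out,
            check=True,  # raise CalledProcessError if the script fails
        )
```

Calling write_dump("dump.xml", with_history=True, wiki_root="/path/to/wiki") would then produce a full-history dump, where /path/to/wiki is a placeholder for your own installation.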

The programs which manage our multi-database dump process are available in our source repository but would need some tweaking to be used outside of Wikimedia.

You can generate dumps from public wikis using WikiTeam tools.

Step by step importing

We have documented the process of setting up a small non-English-language wiki without too many fancy extensions, using the standard MySQL database backend, on a Linux platform. Read the example or add your own.

See also the MediaWiki manual page on importing XML dumps.

Where to go for help

If you have trouble importing the files, or problems with the appearance of the pages after import, check our import issues list.

If you don't find the answer there or you have other problems with the dump files, you can:

  • Ask in #mediawiki on irc.freenode.net, although help is not always available
  • Ask on the xmldatadumps-l (quicker) or the wikitech-l mailing lists

Alternatively, if you have a specific bug to report:

For French-speaking people, see also fr:Wikipédia:Requêtes XML.

FAQ

Some questions come up often enough that we have a FAQ for you to check out.

See also

On the dumps:

On related projects:

Last modified on 23 April 2013, at 15:58