The Wikimedia Foundation (WMF) publishes data dumps of Wikipedia and all other WMF projects on a regular basis. English Wikipedia is dumped once a month, while smaller projects are often dumped twice a month.
- Text and metadata of current or all revisions of all pages as XML files
- Most database tables as sql files
- Page-to-page link lists (pagelinks, categorylinks, imagelinks, templatelinks tables)
- Lists of pages with links outside of the project (externallinks, iwlinks, langlinks tables)
- Media metadata (image, oldimage tables)
- Info about each page (page, page_props, page_restrictions tables)
- Titles of all pages in the main namespace, i.e. all articles (*-all-titles-in-ns0.gz)
- List of all pages that are redirects and their targets (redirect table)
- Log data, including blocks, protection, deletion, uploads (logging table)
- Misc bits (interwiki, site_stats, user_groups tables)
- Experimental adds/changes dumps (no moves or deletes, plus some other limitations): https://wikitech.wikimedia.org/wiki/Dumps/Adds-changes_dumps
- Stub-prefixed dumps for some projects which only have header info for pages and revisions without actual content
- Media bundles for each project, separated into files uploaded to the project and files from Commons
- Static HTML dumps for 2007-2008
Archives: dumps.wikimedia.org/archive/
Current mirrors offer an alternative to the download page.
Due to large file sizes, using a download tool is recommended.
SQL dumps are provided as dumps of entire tables, using mysqldump.
Some older dumps exist in various formats.
How to and examples
See examples of importing dumps in a MySQL database with step-by-step instructions here.
Available tools are listed in the following locations, but information is not always up-to-date:
All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL). Images and other files are available under different terms, as detailed on their description pages.
Maintainer: Ariel Glenn
Mailing list: xmldatadumps-l
Research projects using data from this source
- "A Breakdown of Quality Flaws in Wikipedia" examines cleanup tags on the English Wikipedia using a January 2011 dump
- "There is No Deadline – Time Evolution of Wikipedia Discussions" looks at the time evolution of Wikipedia discussions, and how it correlates to editing activity, based on 9.4 million comments from the March 12, 2010 dump
- "Understanding collaboration in Wikipedia" mines a complete dump of the English Wikipedia (225 million article edits) for insights into open collaboration
- "Dynamics of Conflicts in Wikipedia" takes the revision history from the dump to extract the reverts based on the text comparison to study the dynamics of editorial wars in multiple language editions
What is this all about?
Wikimedia provides public dumps of our wikis' content:
- for archival/backup purposes
- for offline use
- for academic research
- for bot use
- for republishing (don't forget to follow the license terms)
- for fun!
Please follow the XML Data Dumps mailing list, by reading the archives or subscribing, for up-to-date news about the dumps; you can also make inquiries about them there. If you cannot download the dump you want because it no longer exists, or if you have other issues with the files, you can ping the developers there.
Warning on time and size
Before attempting to download any of the wikis or their components, PLEASE READ CAREFULLY the time and space information below! Because some file collections run to terabytes, downloads can take days or even weeks. (See also our FAQ on the size of the English language Wikipedia dumps.) Be sure you know how much storage you have before attempting downloads. Note (below) that a number of versions are "friendlier" in size and content, and that you can tailor a download to your capacity by including or leaving out images, talk pages, etc. A careful read of the info below will save a lot of headaches compared to jumping right into downloads.
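As a small illustration of checking storage before a download, here is a minimal Python sketch. The function name `has_room` and the 20% safety margin are assumptions made for this example, not part of any dump tooling; the expected size has to come from the file listing of the dump you actually want.

```python
import shutil


def has_room(directory, needed_bytes, safety_factor=1.2):
    """Return True if `directory` has space for needed_bytes plus a margin.

    safety_factor is an arbitrary illustrative margin (20% by default).
    """
    free = shutil.disk_usage(directory).free
    return free >= needed_bytes * safety_factor
```

Remember that compressed dumps expand considerably when unpacked, so the compressed size alone understates the space you will need if you decompress the files.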
What's available and where
It's all explained here: what's available and where you can download it.
How often dumps are produced
All databases are dumped via three groups of processes which run simultaneously. The largest database, enwiki, takes 8 or 9 days for a full run to complete, and is run once a month. A second set of 'large' wikis runs in a continuous loop, with the aim of getting dumps for those out twice a month; for the rest, we aim for three times a month, also on a rolling basis. Failures in the dump process are generally dealt with by rerunning the portion of the dump that failed. See the wikitech page for more information about the processes and the dump architecture.
- Larger databases such as jawiki, dewiki, and frwiki can take a long time to run, especially when compressing the full edit history or creating split stub dumps. If you see a dump seemingly stuck on one of these for a few hours, or days, it's likely not dead, but simply processing a lot of data. You can check that file sizes are increasing or that more revisions are being processed, by reloading the web page for the dump.
Feeds for last dump produced
If you're interested in a file, you can subscribe to the RSS feed for it, so that you know when a new version is produced. No more time spent opening the web page, no more dumps missed and hungry bots without their XML ration.
The feed URL can be found in the latest/ directory for the wiki (database name) in question; for instance, that directory contains a feed for the last *-pages-meta-history.xml.bz2 dump produced.
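A script can poll such a feed instead of a person reloading the page. The sketch below is only an illustration: it assumes the feed is ordinary RSS 2.0 with `<item>` elements carrying `<title>` and `<pubDate>`, and the function names are made up for this example. Fetching the feed text is left to the caller.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def latest_item(feed_xml):
    """Return (title, published) for the newest <item> in an RSS 2.0 document.

    Returns None if the feed has no items. Assumes every item has a pubDate.
    """
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        published = parsedate_to_datetime(item.findtext("pubDate"))
        items.append((published, title))
    if not items:
        return None
    published, title = max(items)
    return title, published


def has_new_dump(feed_xml, last_seen):
    """True if the feed's newest item is more recent than `last_seen`."""
    newest = latest_item(feed_xml)
    return newest is not None and newest[1] > last_seen
```

Here `last_seen` would be a timezone-aware `datetime` that the caller stores between runs, for example the publication time of the last dump already downloaded.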
Format of the dump files
The format of the various files available for download is explained here.
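To give a rough feel for working with the XML files, the sketch below streams pages out of a bzip2-compressed dump without loading it all into memory. It assumes the usual MediaWiki export layout (`<page>` elements containing a `<title>` and one or more `<revision>` children, all in an XML namespace); the function names are invented for this example.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, revision_count) for each <page> in an XML dump stream."""
    title, revisions = None, 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            revisions += 1
            elem.clear()  # discard the revision's text and children once counted
        elif name == "page":
            yield title, revisions
            title, revisions = None, 0
            elem.clear()  # discard the finished page's children and text


def iter_pages_from_file(path):
    """Open a .bz2 dump file (path supplied by the caller) and stream its pages."""
    with bz2.open(path, "rb") as stream:
        yield from iter_pages(stream)
```

Counting revisions is just a stand-in for real processing; the point of the design is that each page is handled and released as soon as its closing tag is seen, which matters for files of this size.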
You can download the XML/SQL files and the media bundles using a web client of your choice, but there are also tools for bulk downloading you may wish to use.
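If you script the download yourself, being able to resume an interrupted transfer is the main thing to get right. The following is a minimal sketch using only the Python standard library; it assumes the server honours HTTP Range requests, and the function names are hypothetical.

```python
import os
import shutil
import urllib.request


def resume_offset(path):
    """Return how many bytes of the file are already on disk (0 if none)."""
    return os.path.getsize(path) if os.path.exists(path) else 0


def build_request(url, offset):
    """Build a GET request that asks the server to skip the first `offset` bytes."""
    request = urllib.request.Request(url)
    if offset > 0:
        request.add_header("Range", f"bytes={offset}-")
    return request


def download(url, path):
    """Download url to path, appending to a partial file if one exists.

    If the server ignores the Range header and answers 200 instead of 206,
    the file is rewritten from the beginning.
    """
    offset = resume_offset(path)
    with urllib.request.urlopen(build_request(url, offset)) as response:
        mode = "ab" if offset and response.status == 206 else "wb"
        with open(path, mode) as out:
            shutil.copyfileobj(response, out, length=1024 * 1024)
```

Note that this sketch does not handle the case where the local file is already complete (the server would typically reject the range request), nor does it verify checksums; a dedicated download tool covers those cases.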
Tools for import
Here's your basic list of tools for importing.
Check out and/or add to this partial list of other tools for working with the dumps, including parsers and offline readers.
Producing your own dumps
MediaWiki 1.5 and above includes a command-line maintenance script, dumpBackup.php, which can be used to produce XML dumps directly, with or without page history.
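As a hedged illustration, the script can be driven from Python as sketched below. This assumes `php` is on the PATH, that the script lives at `maintenance/dumpBackup.php` under the wiki's install directory, and that it writes the XML to standard output; `--full` selects all revisions and `--current` only the latest one. The helper names are made up for this example.

```python
import subprocess


def dump_backup_command(script="maintenance/dumpBackup.php", full_history=False):
    """Build the argv list for running dumpBackup.php."""
    mode = "--full" if full_history else "--current"
    return ["php", script, mode]


def write_dump(output_path, full_history=False, wiki_root="."):
    """Run the script from wiki_root and save its stdout (the XML) to a file."""
    with open(output_path, "wb") as out:
        subprocess.run(
            dump_backup_command(full_history=full_history),
            cwd=wiki_root,
            stdout=out,
            check=True,  # raise CalledProcessError if the script fails
        )
```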
The programs which manage our multi-database dump process are available in our source repository but would need some tweaking to be used outside of Wikimedia.
You can generate dumps from public wikis using WikiTeam tools.
Step by step importing
We documented the process to set up a small non-English-language wiki with not too many fancy extensions, using the standard MySQL database backend, on a Linux platform. Read the example or add your own.
See also the MediaWiki manual page on importing XML dumps.
Where to go for help
If you have trouble importing the files, or problems with the appearance of the pages after import, check our import issues list.
If you don't find the answer there or you have other problems with the dump files, you can:
- Ask in #mediawiki on irc.freenode.net, although help is not always available
- Ask on the xmldatadumps-l (quicker) or the wikitech-l mailing lists.
Alternatively, if you have a specific bug to report:
- File a bug at Bugzilla under the Product "Datasets"
For French-speaking people, see also fr:Wikipédia:Requêtes XML
Some questions come up often enough that we have a FAQ for you to check out.
On the dumps:
- mw:Manual:Importing XML dumps
- mw:Research Data Proposals#Dump
- mw:WMF Projects/Data Dumps
On related projects:
- Datasets - a list of different data sources related to the Wikimedia projects and tools for working with them
- en:User:Emijrp/Wikipedia Archive
- WikiTeam (website) - a group of people who develop software for making backups and archive wikis, also a repository of wikis
- Entropy-based analysis tool (Who Writes Wikipedia?)
- Wikimedia group on The Data Hub for many other data dumps
- mw:Backing_up_a_wiki - how to back up your wiki, with a database dumping tool for MySQL/PostgreSQL etc., or with dumpBackup.php