Data dumps/Other tools
There are three options for using the compressed data dumps: decompress them fully, which is time- and space-consuming; read the compressed files with a general-purpose library, e.g. Python's bz2 module; or use one of the custom Wikipedia readers/libraries.
WikiXRay Python parser
WikiXRay is a Python tool for automatically processing Wikipedia's XML dumps for research purposes.
It also includes a complete parser to extract metadata for all revisions and pages in a Wikipedia XML dump compressed with 7zip (among other formats). See the WikiXRay page on Meta for more information.
WikiPrep Perl script
Wikipedia preprocessor (wikiprep.pl) is a Perl script that preprocesses raw XML dumps: it builds link tables and category hierarchies, collects anchor text for each article, etc. Also of interest is the newer SourceForge page with more up-to-date branches: wikiprep.sf.net
The version described above works only on old dumps (several years old) and is no longer maintained; it WILL break on current dumps. However, the idea has not been abandoned: a fork maintained by Tomaz Solc under the GPL is available here. This version spawns multiple processes, if required, to speed up processing.
Wikipedia Dump Reader
This program provides a convenient user interface for reading the compressed text-only XML dumps.
No conversion is needed, only an initial index-construction step. It is written mostly in Python and Qt4, except for a small, very portable piece of bzip2-decompression C code, so it should run on any PyQt4-enabled platform, although it has only been tested on desktop Linux. Wikicode is reinterpreted rather than rendered by MediaWiki itself, so pages may sometimes display differently than with the official PHP interpreter.
MediaWiki XML Processing
This Python library is a collection of utilities for efficiently processing MediaWiki's XML database dumps. It is intended to address two important concerns: performance and the complexity of streaming XML parsing.
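The streaming concern can be illustrated with the standard library alone. This is not the library's own API, just a minimal sketch of the underlying technique: `xml.etree.ElementTree.iterparse` yields each element as it is completed, so a dump far larger than memory can be scanned. The tiny inline XML stands in for a real dump (which would also carry a namespace on every tag):

```python
import io
import xml.etree.ElementTree as ET

# A tiny stand-in for a dump file; a real dump is far too large to load whole.
xml_data = """<mediawiki>
  <page><title>Alpha</title><revision><id>1</id></revision></page>
  <page><title>Beta</title><revision><id>2</id></revision></page>
</mediawiki>"""

titles = []
# iterparse emits elements as their closing tags arrive, keeping memory bounded.
for event, elem in ET.iterparse(io.StringIO(xml_data), events=("end",)):
    if elem.tag == "page":
        titles.append(elem.findtext("title"))
        elem.clear()  # free the subtree we no longer need

print(titles)  # ['Alpha', 'Beta']
```

Calling `elem.clear()` after each page is what keeps the parse tree from growing without bound; libraries like the one above wrap this pattern behind a simpler iterator interface.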
MediaWiki SQL Processing
This Python library is a collection of utilities for efficiently processing MediaWiki's SQL database dumps. It is built to be very similar to mwxml, but for the SQL dumps.
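The SQL dumps are plain `INSERT` statements, so the core task is splitting the packed value tuples. The following is a deliberately naive sketch of that idea, not the library's API; the sample line is invented for illustration:

```python
import re

# A stand-in for one line of a MediaWiki SQL dump (e.g. a page table dump);
# real dumps pack thousands of value tuples into each INSERT statement.
line = "INSERT INTO `page` VALUES (10,0,'AccessibleComputing',1),(12,0,'Anarchism',0);"

# Naive tuple splitter; a robust parser must also handle commas and
# parentheses inside quoted strings, plus escape sequences, which is
# exactly the complexity a dedicated library hides.
tuples = re.findall(r"\(([^)]*)\)", line)
rows = [t.split(",") for t in tuples]
print(rows[1])  # ['12', '0', "'Anarchism'", '0']
```

Because each statement is independent, this kind of parsing also streams well: the file can be read line by line without loading the whole dump.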
BzReader (Windows offline reader)
This program allows Windows users to read Wikipedia offline using the compressed dumps.
It has fast built-in full-text search, and the wiki code is rendered as HTML. You can also navigate between articles just as on the online Wikipedia.
For the .bz2 files, use bzip2 to decompress. bzip2 comes standard with most Linux/Unix/Mac OS X systems these days. For Windows you may need to obtain it separately from the link below.
mwdumper can read the .bz2 files directly, but importDump.php requires piping like so:
bzip2 -dc pages_current.xml.bz2 | php importDump.php
For the .7z files, you can use 7-Zip or p7zip to decompress. These are available as free software:
7za e -so pages_current.xml.7z | php importDump.php
will expand the current pages and pipe them to the importDump.php PHP script.
Even more tools
- BigDump - A small PHP script for importing very large MySQL dumps (even through web servers with hard runtime limits or safe mode!)
- WikiFind - A small program for searching database dumps for user specified keywords (using regexes). Output is a wiki-formatted list (in a text file) of articles containing the keyword. (Program still under development)
- Wikihadoop, DiffDB and WikiPride
- w:Wikipedia:Computer help desk/ParseMediaWikiDump - perl module for parsing the XML dumps and finding articles in the file with certain properties, e.g. all pages in a given category
-  - a .NET library to parse MySQL dumps and make the resulting information available to calling programs
- Dictionary Builder - a Rust program for generating a list of words and definitions from the XML dumps for one of the Wiktionary projects
- parse-mediawiki-sql – a Rust library for quickly parsing the SQL dump files with minimal memory allocation
- Awk and Nim source code examples for processing Wikipedia XML. The Nim example uses an optimized C XML library and was almost 10x faster than Awk in comparison tests.
And still more...
A number of offline readers of Wikipedia have been developed.
A list of alternative parsers and related tools is available for perusal. Some of these are downloaders, some are parsers of the XML dumps and some are meant to convert wikitext for a single page into rendered HTML.
See also this list of data processing tools intended for use with the Wikimedia XML dumps.