Data dumps/Dump format
Format of the sql files
These are provided as dumps of entire tables, produced with mysqldump. They start with various commands that set up the character set correctly for the import; they also turn off certain index-related checks, for speed. More importantly, however, they contain a DROP TABLE IF EXISTS stanza before the inserts of the actual data. This means that if you import one of these files into an existing wiki, any data you had in that table will be lost.
Each INSERT statement contains several thousand rows of data for speed purposes.
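Before importing, it can be useful to see which tables a dump would drop. The following is a minimal sketch, not an official tool: the function name `summarize_sql_dump` is made up for this example, and it assumes the file is gzip-compressed and that each statement starts at the beginning of a line, as mysqldump normally writes them.

```python
import gzip
import re

# Matches the stanza that wipes an existing table before the import.
DROP_RE = re.compile(r"^DROP TABLE IF EXISTS `([^`]+)`")

def summarize_sql_dump(path):
    """Return (tables_dropped, insert_count) for a gzip-compressed SQL dump."""
    dropped = []
    inserts = 0
    # errors="replace" so that stray invalid bytes do not abort the scan.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            match = DROP_RE.match(line)
            if match:
                dropped.append(match.group(1))
            elif line.startswith("INSERT INTO"):
                # One line here usually carries several thousand rows.
                inserts += 1
    return dropped, inserts
```

Any table name returned in the first element would lose its existing data on import.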
These files contain metadata from Wikipedia describing its structure and organization. They do not contain any text from the pages. The precise description of each file can be found in the MediaWiki manual: Database layout; see the list of database tables for links to a general description and the list of fields (schema) for each one.
Format of the XML files
The main page data is provided in the same XML wrapper format that Special:Export produces for individual pages. It's fairly self-explanatory to look at, but there is some documentation at mw:Help:Export#Export_format.
Three sets of page data are produced for each dump, depending on what you need.
The XML content files contain the complete, raw text of some or all revisions, so the full history files in particular can be extremely large. Currently we compress the XML content files with bzip2 or lbzip2 (.bz2 files) and, for the full history dump, additionally with 7-Zip (.7z files).
Both stub and content files contain a header which includes a link to the XML schema, the name of the wiki project, the version of MediaWiki which produced the dump, and the number and name of all of the namespaces on the wiki. This last item is useful because each dumped page records its namespace only by number.
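Since pages carry only a namespace number, a reader usually wants the number-to-name table from the header first. Here is a minimal sketch, assuming a .bz2 dump whose header lists namespaces as `namespace` elements with a `key` attribute inside `siteinfo`, as in the Special:Export format; the function name `read_namespaces` is made up for this example.

```python
import bz2
import xml.etree.ElementTree as ET

def read_namespaces(path):
    """Map namespace number -> name using the header of a .bz2 XML dump."""
    namespaces = {}
    with bz2.open(path, "rb") as f:
        for _event, elem in ET.iterparse(f, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
            if tag == "namespace":
                # The main namespace (key 0) has no name, so store "".
                namespaces[int(elem.get("key"))] = elem.text or ""
            elif tag == "siteinfo":
                break  # the header is complete; do not read the pages
    return namespaces
```

Stopping at the end of `siteinfo` means only the first part of the file is decompressed, which matters for the larger dumps.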
The dumps may contain non-Unicode (UTF8) characters in older text revisions due to lenient charset validation in the earlier MediaWiki releases (2004 or so).
For example, zhwiki-20130102-langlinks.sql.gz contained some copied-and-pasted ISO 8859-1 "ö" characters; because the langlinks table is generated on parsing, a null edit or forcelinkupdate on the page was enough to fix it.
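Code that reads the dumps should therefore not assume that every byte sequence decodes cleanly. One possible approach, shown here as a sketch with a made-up helper name `decode_lenient`, is to substitute the Unicode replacement character for invalid bytes instead of raising an error.

```python
def decode_lenient(raw: bytes) -> str:
    """Decode bytes as UTF-8, replacing invalid sequences with U+FFFD."""
    return raw.decode("utf-8", errors="replace")

# A lone ISO 8859-1 "ö" byte (0xF6) is not valid UTF-8 on its own.
damaged = decode_lenient(b"K\xf6ln")
intact = decode_lenient("Köln".encode("utf-8"))
```

Whether to replace, skip, or re-decode such bytes as ISO 8859-1 depends on the use case; replacement simply keeps processing going.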
Some XML dump files have 'multistream' in the name. When uncompressed, these files contain the same content as the similarly named files without that string in the name.
These files consist of multiple bz2 files ("streams") concatenated together. There is one mediawiki/siteinfo header at the beginning of the multistream file, and one mediawiki close tag at the end of the file, just as in the single stream file. But there are multiple bz2 headers and footers, each marking the start and end of a bzip2 file. This means that if you want to, you can split up the multistream file into pieces, each of which will be a single complete bzip2 file that can be decompressed on its own. Each such stream should contain 100 pages, except for the last stream, which may have fewer. (It's possible that a few more streams may have fewer than 100 pages in them, due to the way we generate these in parallel.)
Be aware that older versions of bzip2-compatible utilities may decompress just the first stream and then stop.
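The difference between reading all streams and stopping after the first can be seen with Python's standard `bz2` module. This is only an illustration on tiny made-up data, not on a real dump: `bz2.decompress` reads every concatenated stream, while a single `BZ2Decompressor` object handles exactly one stream and leaves the rest untouched.

```python
import bz2

# Two independent bzip2 streams, concatenated as in a multistream file.
stream_a = bz2.compress(b"<page>A</page>\n")
stream_b = bz2.compress(b"<page>B</page>\n")
multistream = stream_a + stream_b

# bz2.decompress works through every stream in turn.
all_pages = bz2.decompress(multistream)

# One BZ2Decompressor stops at the end of the first stream; the bytes
# of the following stream are left in unused_data.
decomp = bz2.BZ2Decompressor()
first_only = decomp.decompress(multistream)
leftover = decomp.unused_data
```

A tool that behaves like the second case will silently yield only the first stream's pages.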
These files are accompanied by an index, which has the same name as the multistream file but with "-index.txt" added to the name. The index consists of lines of the format
file-offset:page-id:page-title
where the file offset is the position in the multistream file of the start of the stream containing the specific page id and page title.
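With the index, a single stream can be pulled out without decompressing the whole file. The sketch below assumes each index line has the colon-separated form offset:page-id:page-title; the function names `parse_index_line` and `read_stream` are made up for this example.

```python
import bz2

def parse_index_line(line):
    """Split 'offset:page_id:title' into (int, int, str)."""
    # Titles may contain colons themselves, so split at most twice.
    offset, page_id, title = line.rstrip("\n").split(":", 2)
    return int(offset), int(page_id), title

def read_stream(multistream_path, offset):
    """Decompress the single bzip2 stream that starts at `offset`."""
    decomp = bz2.BZ2Decompressor()
    chunks = []
    with open(multistream_path, "rb") as f:
        f.seek(offset)
        while not decomp.eof:
            block = f.read(64 * 1024)
            if not block:
                break  # file ended before the stream did
            chunks.append(decomp.decompress(block))
    return b"".join(chunks)
```

The returned bytes are a fragment of XML holding the pages of that one stream, without the mediawiki/siteinfo header or the closing tag, so they need a wrapper element before being fed to an XML parser.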
The incremental (adds-changes) dumps contain 5 files per wiki:
- status -- if the run completes successfully for that wiki, the contents should be 'done:all'
- maxrevid -- the latest revision id in the database as of a configured number of hours before the dump is run, so that we don't dump revisions so recent that no one has vetted them
- md5sums -- contains the md5sum of the two files with dump content
- stubs-meta-hist -- contains revision metadata for all new revisions since the previous dump (as computed from the previous run's maxrevid); format is like regular XML stubs files
- pages-meta-hist -- contains some revision metadata and all revision content of new revisions since the previous dump; format is like regular XML content files
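The md5sums file can be used to check the two content files after download. The following is a minimal sketch that assumes the file uses the usual output layout of the md5sum utility (hex digest, whitespace, file name) and that the content files sit in the same directory; the function names are made up for this example.

```python
import hashlib
import os

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file without loading it whole."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_md5sums(md5sums_path):
    """Return {filename: True/False} for each entry in an md5sums file."""
    base = os.path.dirname(md5sums_path)
    results = {}
    with open(md5sums_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            expected, name = line.split(None, 1)
            name = name.strip().lstrip("*")  # "*" marks binary mode
            actual = md5_of_file(os.path.join(base, name))
            results[name] = actual == expected.lower()
    return results
```

A False value for either file suggests a truncated or corrupted download.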
For more on the format of the stubs and the revision content files, see the above section 'format of the xml files'.