Data dumps/Dump format
Format of the SQL files
These are provided as dumps of entire tables, produced with mysqldump. They start with various commands to set up the character set correctly for the import; they also turn off certain index-related checks, for speed. More importantly, however, they contain a DROP TABLE IF EXISTS stanza before the inserts of the actual data. This means that if you import one of these files into an existing wiki, any data you had in that table will be lost.
Each INSERT statement contains several thousand rows of data for speed purposes.
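Because an import silently drops existing tables, it can help to inspect a file before loading it. The following is a minimal sketch, not part of the dump tooling: it assumes the file is gzip-compressed (as in the .sql.gz names used for these dumps) and that each DROP TABLE and INSERT statement starts at the beginning of a line, which is how mysqldump normally writes them. The function names are hypothetical.

```python
import gzip
import re

# Matches the table name in a mysqldump "DROP TABLE IF EXISTS `name`;" line.
DROP_RE = re.compile(r"^DROP TABLE IF EXISTS `([^`]+)`", re.IGNORECASE)


def summarize_sql_dump(lines):
    """Return (tables_dropped, insert_statement_count) for an iterable of lines."""
    dropped = []
    inserts = 0
    for line in lines:
        match = DROP_RE.match(line)
        if match:
            dropped.append(match.group(1))
        elif line.startswith("INSERT INTO"):
            # One statement carries many rows, so this counts statements, not rows.
            inserts += 1
    return dropped, inserts


def summarize_sql_dump_file(path):
    """Scan a gzip-compressed dump line by line without loading it whole."""
    # errors="replace" keeps the scan going if a row holds invalid UTF-8.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as stream:
        return summarize_sql_dump(stream)
```

Any table name returned in the first element would be dropped and recreated by the import.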
These files contain metadata from Wikipedia describing its structure and organization. They do not contain any text from the pages. The precise description of each file can be found in the MediaWiki manual: Database layout; see the list of database tables for links to a general description and the list of fields (schema) for each one.
Format of the XML files
The main page data is provided in the same XML wrapper format that Special:Export produces for individual pages. It's fairly self-explanatory to look at, but there is some documentation at Help:Export#Export_format.
Three sets of page data are produced for each dump, depending on what you need.
The XML content files contain the complete, raw text of some or all revisions, so the full history files in particular can be extremely large. Currently we compress the XML content files with bzip2 or lbzip2 (.bz2 files) and, for the full history dump, additionally with 7-Zip (.7z files).
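Given the size of these files, it is usually more practical to stream them than to decompress them to disk first. The sketch below is one possible approach using only the Python standard library; it assumes the `page`, `title` and `revision` element names of the Special:Export format, and the helper names are hypothetical. Python's bz2 module reads multi-stream .bz2 files, which is the kind of file parallel compressors such as lbzip2 produce.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace-uri}' prefix ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, revision_count) for each <page> in an export XML stream."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = None
        revisions = 0
        for child in elem:
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "revision":
                revisions += 1
        yield title, revisions
        elem.clear()  # drop the page's children before parsing the next page


def iter_pages_from_bz2(path):
    """Stream pages from a .bz2 dump file without decompressing it to disk."""
    with bz2.open(path, "rb") as stream:
        yield from iter_pages(stream)
```

The .7z files would need a separate tool or third-party library, since the Python standard library has no 7-Zip reader.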
Both stub and content files contain a header which includes a link to the XML schema, the name of the wiki project, the version of MediaWiki which produced the dump, and the number and name of each namespace on the wiki. This last item is useful because each dumped page records its namespace only by number.
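Since pages carry only a namespace number, a reader typically builds a number-to-name table from the header first. The following is a rough sketch of that step; it assumes the header lists namespaces as `namespace` elements with a `key` attribute inside `siteinfo`, as the Special:Export format does, and `read_namespaces` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def read_namespaces(stream):
    """Return {namespace number: name} from the <siteinfo> header of a dump."""
    namespaces = {}
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = elem.tag.rsplit("}", 1)[-1]  # drop the '{schema-uri}' prefix
        if name == "namespace":
            # The main namespace has no name text, hence the "" fallback.
            namespaces[int(elem.get("key"))] = elem.text or ""
        elif name == "siteinfo":
            break  # header finished; stop before handling any page data
    return namespaces
```

The stream can be any binary file object, for example one returned by `bz2.open(path, "rb")`.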
The dumps may contain characters that are not valid Unicode (UTF-8) in older text revisions, due to lenient charset validation in earlier MediaWiki releases (2004 or so).
For example, zhwiki-20130102-langlinks.sql.gz contained some copied-and-pasted ISO 8859-1 "ö" characters; as the langlinks table is generated on parsing, a null edit or forcelinkupdate on the page was enough to fix it.
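A consumer that decodes strictly as UTF-8 will fail on such bytes, so it can be useful to locate them first. This is a small sketch under the assumption that the input is read as raw bytes, one line at a time; `find_invalid_utf8_lines` is a hypothetical helper, not an existing tool.

```python
def find_invalid_utf8_lines(lines):
    """Yield (line_number, repaired_text) for byte lines that are not valid UTF-8."""
    for number, raw in enumerate(lines, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            # Substitute U+FFFD for the undecodable bytes so the line stays usable.
            yield number, raw.decode("utf-8", errors="replace")
```

An ISO 8859-1 "ö" is the single byte 0xF6, which is never valid in UTF-8, so a line containing it would be reported here.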
The incremental dumps (those covering only revisions added since the previous dump) contain 5 files per wiki:
- status -- if the run completes successfully for that wiki, the contents should be 'done:all'
- maxrevid -- the latest revision id in the database as of a configured number of hours before the dump is run, so that we don't dump revisions so recent that no one has vetted them
- md5sums -- contains the MD5 checksums of the two files with dump content
- stubs-meta-hist -- contains revision metadata for all new revisions since the previous dump (as computed from the previous run's maxrevid); format is like regular XML stubs files
- pages-meta-hist -- contains some revision metadata and all revision content of new revisions since the previous dump; format is like regular XML content files
For more on the format of the stubs and the revision content files, see the section 'Format of the XML files' above.
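The status and md5sums files in the list above lend themselves to a quick sanity check after download. The sketch below makes two assumptions that the text does not confirm: that the status file holds just the string 'done:all', possibly with trailing whitespace, and that md5sums uses the usual md5sum layout of a hash followed by a file name on each line. All function names are hypothetical.

```python
import hashlib
import os


def run_completed(status_text):
    """True if the status file reports a fully successful run."""
    return status_text.strip() == "done:all"


def parse_md5sums(md5sums_text):
    """Parse 'HASH  FILENAME' lines (md5sum-style layout, assumed)."""
    expected = {}
    for line in md5sums_text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            digest, name = parts
            # md5sum marks binary mode with a leading '*' on the file name.
            expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large dumps do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_dump_dir(directory, md5sums_text):
    """Return names of listed files that are missing or whose MD5 differs."""
    bad = []
    for name, digest in parse_md5sums(md5sums_text).items():
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or md5_of_file(path) != digest:
            bad.append(name)
    return bad
```

An empty list from `verify_dump_dir`, together with `run_completed` returning True, suggests the two content files arrived intact.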