Data dumps/Misc dumps format
The format of the other dumps produced by Wikimedia is described below.
The category dumps are produced in RDF format. For each category, the following information is provided:
- Category title
- Whether the category is hidden
- Number of pages in the category, excluding subcategories and files
- Number of subcategories
- URLs of each category to which this category belongs
Cirrus search index dumps
For more information, see the extension documentation on the search schema.
These files contain importable Cirrus search indexes in JSON format. For each entry, the following fields are provided:
|type||one of 'page' or 'namespace'|
|<wikiname>||name of specific wiki|
|<wikitype>||name of wiki type (wikipedia, wikivoyage, and so on)|
|auxiliary_text||thumbnail captions, tables and a few other things that are searchable but not part of the primary page content|
|category||list of categories to which this page belongs|
|content_model||whether the page content is wikitext, json and so on|
|coordinates||geographical coordinates provided via the parser function '#coordinates', if present|
|create_timestamp||date and time page was first created|
|defaultsort||the sort key for sorting the page in categories which contain it, if set|
|display_title||the value for the DISPLAYTITLE magic word, if set|
|external_link||list of links made in this page that lead outside of the wiki projects|
|heading||list of entries in the page content surrounded by == (so html h2 headers)|
|incoming_links||number of pages that link to this page|
|language||content language of the page|
|namespace||number of the namespace of this page|
|namespace_text||name of the namespace of this page|
|opening_text||text before the first heading (h2 through h6, i.e. == through ======)|
|outgoing_link||list of links made in this page that lead to other pages on wiki projects|
|redirect||namespace and title of pages which redirect to this page, if any|
|source_text||raw text of the revision|
|text||everything but the opening text and auxiliary text (after wikitext expansion)|
|text_bytes||length of revision content, in bytes|
|template||list of templates included by this page|
|timestamp||timestamp of current revision|
|title||title of page|
|version||current revision id|
|version_type||always 'external' when set|
|wiki||name of specific wiki|
|wikibase_item||the Q number of the page's item on Wikidata|
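As a rough illustration of how the fields above might be read, the following Python sketch iterates over entries and prints a few of them. It assumes that each non-empty line of the dump is one JSON object and that lines whose only top-level key is "index" are import metadata rather than page entries; neither assumption is stated above, so check them against a real file. The function name iter_cirrus_docs and the sample data are made up for this example.

```python
import json
from typing import Dict, Iterable, Iterator


def iter_cirrus_docs(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield document entries from the lines of a Cirrus dump.

    Assumption: one JSON object per line, and lines whose only
    top-level key is "index" are import metadata, not documents.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if set(entry) == {"index"}:
            continue  # skip the assumed metadata line
        yield entry


# Invented sample data, shaped after the field table above.
sample = [
    '{"index": {"_type": "page", "_id": "1"}}',
    '{"title": "Example", "namespace": 0, "text_bytes": 120, "category": ["Demo"]}',
]

docs = list(iter_cirrus_docs(sample))
for doc in docs:
    # Fields may be absent, so fall back to a default where that is likely.
    print(doc["title"], doc["namespace"], doc.get("incoming_links", 0))
```

For a real, gzip-compressed dump, the same function can be fed from gzip.open(path, "rt", encoding="utf-8"), where path is the location of the downloaded file.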
Content translation dumps
The content translation dumps are provided in three formats: JSON with HTML, JSON with text, and TMX with text. 'Text' in this context means that any HTML markup has been stripped out; see the file excerpts below for an example.
For each entry the following are included, with field names varying according to format:
- the language of the source text (the text to be translated)
- the target language
- the source text itself
- the machine translation of the source text, and the machine translation engine used
- the target (human translated text)
For more information, see the extension documentation on published translations.
Image info dumps
These files come in pairs. The -local- file contains the name and upload date and time of each file uploaded locally to the wiki. The -remote- file contains a list of the files uploaded to Commons that are used on the local wiki; this information is retrieved from the MediaWiki globalimagelinks table.
The first line of each file lists the field name(s); the -local- file lists img_name and img_timestamp while the -remote- file lists gil_to.
Timestamps are in YYYYMMDDHHMMSS format. File names are written as they are in the database, so, for example, spaces appear as underscores.
Sample excerpt from orwiki-20190519-local-wikiqueries.gz:
Berhampur-university_logo.png 20120222071753
MKCG_Medical_college_logo_1.svg 20120222111540
SCB_Medical_college_logo.svg 20120224093008
VSS_medical_college_logo.svg 20120227075907
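A minimal Python sketch of reading a -local- file follows. It assumes the two fields on each line are separated by whitespace (the exact separator is not specified above) and relies on file names containing underscores rather than spaces. The function name parse_local_image_info is hypothetical; the sample rows are taken from the excerpt above.

```python
from datetime import datetime
from typing import Iterable, List, Tuple


def parse_local_image_info(lines: Iterable[str]) -> List[Tuple[str, datetime]]:
    """Return (img_name, upload time) pairs from a -local- file.

    Assumption: fields are whitespace-separated; the first line is the
    header listing the field names and is skipped.
    """
    it = iter(lines)
    next(it, None)  # skip the header line (img_name, img_timestamp)
    rows = []
    for line in it:
        parts = line.split()
        if len(parts) != 2:
            continue  # ignore blank or unexpected lines
        name, stamp = parts
        # Timestamps are in YYYYMMDDHHMMSS format.
        rows.append((name, datetime.strptime(stamp, "%Y%m%d%H%M%S")))
    return rows


sample = [
    "img_name img_timestamp",
    "Berhampur-university_logo.png 20120222071753",
    "MKCG_Medical_college_logo_1.svg 20120222111540",
]

rows = parse_local_image_info(sample)
for name, uploaded in rows:
    print(name, uploaded.isoformat())
```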
Media and article title dumps
Each of these files consists of the line 'page_title' as the first line, followed by a list of titles of pages, in alphabetical order. The media titles dump lists titles of all pages in the File: namespace (6), and the page titles dump lists titles of all pages in the main (0) namespace. Titles are dumped as they are found in the database, so spaces have been converted to underscores.
Sample excerpt (from media titles dump):
page_title
!!!_(Chk_Chk_Chk)_-_One_Girl_One_Boy_cover_art.jpg
!!!_-_!!!_album_cover.jpg
!!e!VBQQ!mM_$(KGrHqEOKi8E03iU,-u!BNP3+G6Mqw_1.jpg
!0_Trombones_Like_2_Pianos.jpg
!Haunu.ogg
!Hero_(album).jpg
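The following Python sketch shows one way to read such a file: it checks for the 'page_title' header line and converts underscores back to spaces for display. The function name read_titles is hypothetical, and turning underscores into spaces is a display choice made for this example, not something the dump itself does.

```python
from typing import Iterable, List


def read_titles(lines: Iterable[str]) -> List[str]:
    """Return page titles from a titles dump, with underscores shown as spaces."""
    it = iter(lines)
    header = next(it, "").strip()
    if header != "page_title":
        raise ValueError("expected 'page_title' as the first line")
    # Titles are stored with underscores; restore spaces for display only.
    return [line.strip().replace("_", " ") for line in it if line.strip()]


sample = [
    "page_title",
    "!0_Trombones_Like_2_Pianos.jpg",
    "!Haunu.ogg",
    "!Hero_(album).jpg",
]

titles = read_titles(sample)
for title in titles:
    print(title)
```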
Short url dumps
These files contain a list of entries in the following format:
short-url|full-url
Here the short URL https://w.wiki/short-url-here redirects to the full URL on one of our wiki projects.
L|https://en.wikipedia.org/wiki/LGBT
M|https://en.wikipedia.org/wiki/MediaWiki
N|https://en.wikipedia.org/wiki/NetHack
P|https://en.wikipedia.org/wiki/Jean-Luc_Picard
Q|https://www.wikidata.org/wiki/Help:Items
R|https://en.wikipedia.org/wiki/Dennis_Ritchie
S|https://sv.wikipedia.org/wiki/Stockholm
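As a small sketch of how these entries can be used, the Python below splits each line at the first '|' and builds a mapping from the complete short URL to its target. It assumes one entry per line; the function name parse_short_urls is hypothetical, and the sample lines come from the excerpt above.

```python
from typing import Dict, Iterable

SHORT_URL_PREFIX = "https://w.wiki/"


def parse_short_urls(lines: Iterable[str]) -> Dict[str, str]:
    """Map each complete short URL to the full URL it redirects to."""
    mapping = {}
    for line in lines:
        line = line.strip()
        # Split only at the first '|' so the rest of the line stays intact.
        short, sep, full = line.partition("|")
        if not sep or not short or not full:
            continue  # skip blank or malformed lines
        mapping[SHORT_URL_PREFIX + short] = full
    return mapping


sample = [
    "M|https://en.wikipedia.org/wiki/MediaWiki",
    "Q|https://www.wikidata.org/wiki/Help:Items",
]

mapping = parse_short_urls(sample)
for short_url, full_url in mapping.items():
    print(short_url, "->", full_url)
```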