Data dumps/Monitoring

Monitoring for humans

The download site shows the status of each dump: whether it's in progress, when it was last run, and so on. Note that the ETAs provided are not always accurate, since they are generally for the entire dump step rather than for whatever small piece is currently being reported.

Automated monitoring

There are several status files available for each dump run on a per-wiki basis. If you are interested in what's going on with a particular wiki and run date, for example frwiki and the April 1, 2019 run, you can check for files in the directory https://dumps.wikimedia.org/frwiki/20190401/
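If you want to script against these files, the URLs follow the pattern shown above. Here is a minimal Python sketch (assuming the third-party requests library; the wiki name and run date are just example values) that checks which of the per-run status files described below are available:

import requests

# Hypothetical example values; use the wiki and run date you are interested in.
wiki = "frwiki"
run_date = "20190401"
run_url = f"https://dumps.wikimedia.org/{wiki}/{run_date}"

# Check which of the per-run status files are served for this run.
for name in ("dumpruninfo.json", "dumpspecialfiles.json", "report.json", "dumpstatus.json"):
    resp = requests.head(f"{run_url}/{name}", timeout=30)
    print(name, resp.status_code)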

The following json files are available for download for any given dump run of a specific wiki:

dumpruninfo.json

Contents: information about the dump jobs for that wiki's run. For each job, the following is provided:

  • job name ('xmlstubsdump', 'pagetable', and so on)
  • status ('done', 'in-progress', 'failed', 'waiting', 'skipped')
  • updated (UTC timestamp in format YYYY-MM-DD HH:MM:SS)

Excerpt of sample json:

{
    "jobs": {
        "namespaces": {
            "status": "done",
            "updated": "2019-04-25 14:54:33"
        },
        "pagetable": {
            "status": "done",
            "updated": "2019-04-21 18:14:02"
        },
        "xmlflowdump": {
            "status": "done",
            "updated": "2019-04-25 17:07:10"
        },
...
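As a rough illustration, here is how one might read the job statuses out of this file with Python; it assumes the requests library and uses elwiki and the 20190420 run purely as example values:

import requests

url = "https://dumps.wikimedia.org/elwiki/20190420/dumpruninfo.json"
run_info = requests.get(url, timeout=30).json()

# Print each job's name, status and last-updated timestamp.
for job_name, job in run_info["jobs"].items():
    print(f"{job_name}: {job['status']} (updated {job['updated']})")

# Simple example check: is every job in the run finished?
all_done = all(job["status"] == "done" for job in run_info["jobs"].values())
print("run complete:", all_done)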

dumpspecialfiles.json

This contains a list of files that don't contain dump content but may be useful for dump users. Examples of such files include the files containing md5 or sha1 sums for all dump content files, the index.html file, and the dumpruninfo.json file mentioned above. For each file, the following information is provided:

  • the filename
  • url for download, relative to the base download URL (https://dumps.wikimedia.org/)
  • status ('present', 'missing')
  • size in bytes, if the file is present

Excerpt of sample json:

{
    "files": {
        "elwiki-20190420-md5sums.txt": {
            "url": "/elwiki/20190420/elwiki-20190420-md5sums.txt",
            "status": "present",
            "size": 2322
        },
        "elwiki-20190420-sha1sums.json": {
            "url": "/elwiki/20190420/elwiki-20190420-sha1sums.json",
            "status": "present",
            "size": 2772
        },
        "elwiki-20190420-sha1sums.txt": {
            "url": "/elwiki/20190420/elwiki-20190420-sha1sums.txt",
            "status": "present",
            "size": 2586
        },
...
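A short Python sketch of how this file might be consumed, again assuming the requests library and the same example wiki and date; it lists the size of each special file and flags any marked missing:

import requests

url = "https://dumps.wikimedia.org/elwiki/20190420/dumpspecialfiles.json"
special = requests.get(url, timeout=30).json()

# Size is only reported when the file is present.
for filename, info in special["files"].items():
    if info["status"] == "present":
        print(f"{filename}: {info['size']} bytes")
    else:
        print(f"{filename}: MISSING")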

report.json

This contains a list of the files that contain dump content. For each file, the following information is provided:

  • filename
  • url for download, relative to the base download URL (https://dumps.wikimedia.org/), if file is present
  • size in bytes, if the file is present

Excerpt of sample json:

{
    "jobs": {
        "metacurrentdump": {
            "files": {
                "elwiki-20190420-pages-meta-current.xml.bz2": {
                    "url": "/elwiki/20190420/elwiki-20190420-pages-meta-current.xml.bz2",
                    "size": 408302576
                }
            }
        },
        "langlinkstable": {
            "files": {
                "elwiki-20190420-langlinks.sql.gz": {
                    "url": "/elwiki/20190420/elwiki-20190420-langlinks.sql.gz",
                    "size": 55762728
                }
            }
        },
        "pagetable": {
            "files": {
                "elwiki-20190420-page.sql.gz": {
                    "url": "/elwiki/20190420/elwiki-20190420-page.sql.gz",
                    "size": 18193060
                }
            }
        },
...
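Since the urls are relative to the base download site, a consumer needs to join them with https://dumps.wikimedia.org. A Python sketch, with the same assumptions as the examples above, that turns report.json into a list of absolute download URLs:

import requests

base = "https://dumps.wikimedia.org"
url = f"{base}/elwiki/20190420/report.json"
report = requests.get(url, timeout=30).json()

# Collect an absolute download URL and size for every content file in the run.
for job_name, job in report["jobs"].items():
    for filename, info in job.get("files", {}).items():
        if "url" in info:
            print(f"{base}{info['url']}  ({info.get('size', '?')} bytes)")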

dumpstatus.json

Contents: information on each job, including its status and info on any output files it has produced. For each job the following is provided:

  • job name ('xmlstubsdump', 'pagetable', etc)
  • files (info about each output file produced)
  • status ('done', 'failed', 'waiting', 'skipped', 'in-progress')
  • updated (UTC timestamp in format YYYY-MM-DD HH:MM:SS)

For each file the following information is provided:

  • filename
  • url for download, relative to the base download URL (https://dumps.wikimedia.org/), if file is present
  • size in bytes, if present
  • md5sum, if present
  • sha1sum, if present

Excerpt of sample json:

{
    "jobs": {
        "namespaces": {
            "files": {
                "elwiki-20190420-siteinfo-namespaces.json.gz": {
                    "url": "/elwiki/20190420/elwiki-20190420-siteinfo-namespaces.json.gz",
                    "size": 6693,
                    "md5": "9539f7d5d1f4f2c489c12f33e6b3ef1c",
                    "sha1": "c498a01d2de749c00bf2228ff4e7db3be478d82e"
                }
            },
            "status": "done",
            "updated": "2019-04-25 14:54:33"
        },
        "pagetable": {
            "files": {
                "elwiki-20190420-page.sql.gz": {
                    "url": "/elwiki/20190420/elwiki-20190420-page.sql.gz",
                    "size": 18193060,
                    "md5": "ec7cd8f01391a0558979023e0407ffe0",
                    "sha1": "a83b50748e8a48c67b48c9290b406926f5012841"
                }
            },
            "status": "done",
            "updated": "2019-04-21 18:14:02"
        },
...
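Because this file carries checksums as well as sizes, it can be used to verify a download. A minimal Python sketch, assuming the requests library, the example wiki and date used above, and a dump file already saved locally under its original name:

import hashlib
import requests

url = "https://dumps.wikimedia.org/elwiki/20190420/dumpstatus.json"
status = requests.get(url, timeout=30).json()

# Look up the expected sha1 for one output file of the 'pagetable' job.
filename = "elwiki-20190420-page.sql.gz"
expected_sha1 = status["jobs"]["pagetable"]["files"][filename]["sha1"]

# Compare against the sha1 of the local copy, read in 1 MB chunks.
sha1 = hashlib.sha1()
with open(filename, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha1.update(chunk)

print("checksum ok:", sha1.hexdigest() == expected_sha1)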

All wikis latest run

There is also a json file covering the latest run of all wikis, at https://dumps.wikimedia.org/index.json; this aggregates the per-wiki dumpstatus.json files. Note that this file is quite large (around 11 megabytes), so don't just load it up in your browser.
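Given the size, it is friendlier to fetch it once from a script and filter it down to the wikis you care about. A Python sketch with the same assumptions as the examples above; the exact layout assumed here (a top-level "wikis" key mapping each wiki name to its dumpstatus-style data) is an inference from the description above, so check the file itself before relying on it:

import requests

# index.json aggregates the per-wiki dumpstatus.json data for the latest run.
index = requests.get("https://dumps.wikimedia.org/index.json", timeout=120).json()

# Assumed layout: index["wikis"][wiki]["jobs"] mirrors the per-wiki dumpstatus.json "jobs" section.
for wiki in ("frwiki", "elwiki"):
    jobs = index["wikis"][wiki]["jobs"]
    unfinished = [name for name, job in jobs.items() if job["status"] != "done"]
    print(wiki, "unfinished jobs:", unfinished or "none")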