The revert dataset contains information about reverting edits and the revision that was reverted back to. Reverts are identified by looking for revisions with identical content in the history of a page.

Location edit

db42:halfak.revert_20110115

Fields edit

halfak@internproxy:~$ mysql -h db42 -e "EXPLAIN revert_20110115;SELECT * FROM revert_20110115 LIMIT 3" halfak
+---------------+------------+------+-----+---------+-------+
| Field         | Type       | Null | Key | Default | Extra |
+---------------+------------+------+-----+---------+-------+
| rev_id        | int(11)    | YES  | UNI | NULL    |       |
| rvtto_id      | int(11)    | YES  |     | NULL    |       |
| is_vandalism  | tinyint(1) | YES  | MUL | NULL    |       |
| revs_reverted | int(11)    | YES  |     | NULL    |       |
+---------------+------------+------+-----+---------+-------+
+--------+----------+--------------+---------------+
| rev_id | rvtto_id | is_vandalism | revs_reverted |
+--------+----------+--------------+---------------+
|     60 |       43 |            0 |             1 |
|    309 |      308 |            0 |             2 |
|    323 |      318 |            0 |             1 |
+--------+----------+--------------+---------------+

There is a row in this table for every reverting revision in a page that was included in the January, 2011 database dump of enwiki.

  • rev_id: The reverting revision's identifier
  • rvtto_id: The identifier of the revision that was reverted to
  • is_vandalism: True when the D_STRICT/D_LOOSE regexp matched the reverting revision's comment, False otherwise.
  • revs_reverted: The number of revisions between the reverting revision and the reverted to revision. This represents the number of revisions that were reverted. Note that this is limited to < 15 to remove noise.

Reproduction edit

To reproduce this dataset, install Wikimedia Utilities and run this script.

For example:

$ python reverts.py --output_prefix=test1/ enwiki.*.7z
Aug-29 16:51:36 INFO     Starting test1/ run...
Aug-29 16:51:36 INFO     Creating output file: test1/revert.tsv
Aug-29 16:51:36 INFO     Creating output file: test1/reverted.tsv
Aug-29 16:51:36 INFO     Prcoessing...
Aug-29 16:51:36 INFO     Processing dump file enwiki-20110829-pages-meta-history.sample.xml.
|.||||..|.|....|......|........|..........|...........|.|.|.|.|...|.|||...|.|.|

This will produce two output files that can be loaded into the database test1/revert.tsv and test1/reverted.tsv. Create the table to load into:

CREATE TABLE revert (
	rev_id        INT UNSIGNED,
	rvtto_id      INT UNSIGNED,
	is_vandalism  BOOL,
	revs_reverted INT(2)
);

Then load the output file:

$ mysqlimport --local --skip-opt halfak test1/revert.tsv

This gets the base dataset into the MySQL so it can be denormalized. To denormalize, run the following command:

CREATE TABLE revert_denorm
SELECT
        rvtg.rev_id AS reverting_id,
        rvtg.rev_user_text AS reverting_username,
        rvtg.rev_user AS reverting_user_id,
        rvtg.rev_comment AS reverting_comment,
        rvtto.rev_id AS revertto_id,
        rvtto.rev_user_text AS revertto_username,
        rvtto.rev_user AS revertto_user_id,
        rvtto.rev_comment AS revertto_comment,
        revert.revs_reverted
FROM  revert
LEFT JOIN revision rvtg
        ON revert.rev_id = rvtg.rev_id
LEFT JOIN revision rvtto
        ON revert.rvtto_id = rvtto.rev_id;

Finally, move the denormalized table over the original.

DROP TABLE revert;
ALTER TABLE revert_denorm RENAME TO revert;


Enjoy!

Notes edit

The current dataset is complete up to the January 2011 dump.