WSoR datasets/user year month namespace

The user_year_month_namespace dataset was generated by grouping the work that editors perform by the month and namespace in which they performed said work. This table is the foundation of the WikiPride(link needed) visualizations.

Location edit

db42:halfak.user_year_month_namespace


Fields edit

halfak@internproxy:~$ mysql -h db42 -e "EXPLAIN halfak.user_year_month_namespace;SELECT * FROM halfak.user_year_month_namespace LIMIT 3;"
+------------------+------------------+------+-----+---------+-------+
| Field            | Type             | Null | Key | Default | Extra |
+------------------+------------------+------+-----+---------+-------+
| user_id          | int(10) unsigned | YES  | MUL | NULL    |       |
| rev_year         | int(4)           | YES  |     | NULL    |       |
| rev_month        | int(2)           | YES  |     | NULL    |       |
| namespace        | int(11)          | YES  |     | NULL    |       |
| first_edit       | varbinary(14)    | YES  |     | NULL    |       |
| first_edit_year  | int(4)           | YES  |     | NULL    |       |
| first_edit_month | int(2)           | YES  |     | NULL    |       |
| reverting_edits  | int(10) unsigned | YES  |     | NULL    |       |
| noop_edits       | int(10) unsigned | YES  |     | NULL    |       |
| add_edits        | int(10) unsigned | YES  |     | NULL    |       |
| remove_edits     | int(10) unsigned | YES  |     | NULL    |       |
| len_added        | int(10) unsigned | YES  |     | NULL    |       |
| len_removed      | int(11)          | YES  |     | NULL    |       |
+------------------+------------------+------+-----+---------+-------+
+---------+----------+-----------+-----------+----------------+-----------------+------------------+-----------------+------------+-----------+--------------+-----------+-------------+
| user_id | rev_year | rev_month | namespace | first_edit     | first_edit_year | first_edit_month | reverting_edits | noop_edits | add_edits | remove_edits | len_added | len_removed |
+---------+----------+-----------+-----------+----------------+-----------------+------------------+-----------------+------------+-----------+--------------+-----------+-------------+
|       0 |        0 |         0 |         0 | first_edit     |               0 |                0 |               0 |          0 |         0 |            0 |         0 |           0 |
|       1 |     2001 |         9 |         0 | 20010929004320 |            2001 |                9 |               0 |          0 |         5 |           10 |      3557 |     -112275 |
|       1 |     2001 |         9 |         4 | 20010929004320 |            2001 |                9 |               0 |          0 |         3 |            0 |      5739 |           0 |
+---------+----------+-----------+-----------+----------------+-----------------+------------------+-----------------+------------+-----------+--------------+-----------+-------------+

Each row represents a revision and the change in article length from the edit that created it.

  • user_id: The identifier of a user. PRIMARY KEY PART
  • rev_year: The year of revisions. PRIMARY KEY PART
  • rev_month: The month of revisions. PRIMARY KEY PART
  • namespace: The namespace of revisions. PRIMARY KEY PART
  • first_edit: The date the user made their first edit.
  • first_edit_year: The year the user made their first edit.
  • first_edit_month: The month the user made their first edit.
  • reverting_edits: The number of reverting edits.
  • noop_edits: The number of edits with len_change of zero.
  • add_edits: The number of edits with a positive len_change.
  • remove_edits: The number of edits with a negative len_change.
  • len_added: The sum of all positive len_changes.
  • len_removed: The sum of all negative len_changes.

Reproduction edit

Depends on:

  • rev_len_changed - a list of revisions with len_change (think bytes changed) between revisions
  • user_cohort - a list of users with information about when the editor started editing
  • reverted - a list of reverted revisions with information about the reverting edit

Run this query. It will generate a temp table, but it finished in 5 hours on db1047 when run by itself.

CREATE TABLE halfak.user_namespace_year_month_day
SELECT
	rlc.user_id,
	rlc.namespace,
	rlc.rev_year,
	rlc.rev_month,
	rlc.rev_day,
	uc.first_edit,
	uc.first_edit_year,
	uc.first_edit_month,
	SUM(rvt.revision_id IS NOT NULL)                                   AS reverting_edits,
	SUM(rvt.revision_id IS NULL AND len_change = 0)                    AS noop_edits,
	SUM(rvt.revision_id IS NULL AND len_change > 0)                    AS add_edits,
	SUM(rvt.revision_id IS NULL AND len_change < 0)                    AS remove_edits,
	SUM(IF(rvt.revision_id IS NULL AND len_change > 0, len_change, 0)) AS len_added,
	SUM(IF(rvt.revision_id IS NULL AND len_change < 0, len_change, 0)) AS len_removed
FROM halfak.rev_len_changed rlc
INNER JOIN halfak.user_cohort uc USING(user_id)
LEFT JOIN halfak.revert_20110115 rvt ON rev_id = revision_id
WHERE rev_timestamp <= "20110115000000"
GROUP BY
	rlc.user_id,
	rlc.namespace,
	rlc.rev_year,
	rlc.rev_month,
	rlc.rev_day;

Notes edit

This table was generated from using revert information from the January 2011 dump of enwiki. Because reverts are limited at that point, the rest of the query must be as well.