WSoR datasets/user year month namespace
The user_year_month_namespace dataset was generated by grouping the work that editors perform by the month and namespace in which they performed said work. This table is the foundation of the WikiPride(link needed) visualizations.
Location
editdb42:halfak.user_year_month_namespace
Fields
edithalfak@internproxy:~$ mysql -h db42 -e "EXPLAIN halfak.user_year_month_namespace;SELECT * FROM halfak.user_year_month_namespace LIMIT 3;" +------------------+------------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +------------------+------------------+------+-----+---------+-------+ | user_id | int(10) unsigned | YES | MUL | NULL | | | rev_year | int(4) | YES | | NULL | | | rev_month | int(2) | YES | | NULL | | | namespace | int(11) | YES | | NULL | | | first_edit | varbinary(14) | YES | | NULL | | | first_edit_year | int(4) | YES | | NULL | | | first_edit_month | int(2) | YES | | NULL | | | reverting_edits | int(10) unsigned | YES | | NULL | | | noop_edits | int(10) unsigned | YES | | NULL | | | add_edits | int(10) unsigned | YES | | NULL | | | remove_edits | int(10) unsigned | YES | | NULL | | | len_added | int(10) unsigned | YES | | NULL | | | len_removed | int(11) | YES | | NULL | | +------------------+------------------+------+-----+---------+-------+ +---------+----------+-----------+-----------+----------------+-----------------+------------------+-----------------+------------+-----------+--------------+-----------+-------------+ | user_id | rev_year | rev_month | namespace | first_edit | first_edit_year | first_edit_month | reverting_edits | noop_edits | add_edits | remove_edits | len_added | len_removed | +---------+----------+-----------+-----------+----------------+-----------------+------------------+-----------------+------------+-----------+--------------+-----------+-------------+ | 0 | 0 | 0 | 0 | first_edit | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | 1 | 2001 | 9 | 0 | 20010929004320 | 2001 | 9 | 0 | 0 | 5 | 10 | 3557 | -112275 | | 1 | 2001 | 9 | 4 | 20010929004320 | 2001 | 9 | 0 | 0 | 3 | 0 | 5739 | 0 | +---------+----------+-----------+-----------+----------------+-----------------+------------------+-----------------+------------+-----------+--------------+-----------+-------------+
Each row represents a revision and the change in article length from the edit that created it.
user_id
: The identifier of a user. PRIMARY KEY PARTrev_year
: The year of revisions. PRIMARY KEY PARTrev_month
: The month of revisions. PRIMARY KEY PARTnamespace
: The namespace of revisions. PRIMARY KEY PARTfirst_edit
: The date the user made their first edit.first_edit_year
: The year the user made their first edit.first_edit_month
: The month the user made their first edit.reverting_edits
: The number of reverting edits.noop_edits
: The number of edits with len_change of zero.add_edits
: The number of edits with a positive len_change.remove_edits
: The number of edits with a negative len_change.len_added
: The sum of all positive len_changes.len_removed
: The sum of all negative len_changes.
Reproduction
editDepends on:
- rev_len_changed - a list of revisions with len_change (think bytes changed) between revisions
- user_cohort - a list of users with information about when the editor started editing
- reverted - a list of reverted revisions with information about the reverting edit
Run this query. It will generate a temp table, but it finished in 5 hours on db1047 when run by itself.
CREATE TABLE halfak.user_namespace_year_month_day
SELECT
rlc.user_id,
rlc.namespace,
rlc.rev_year,
rlc.rev_month,
rlc.rev_day,
uc.first_edit,
uc.first_edit_year,
uc.first_edit_month,
SUM(rvt.revision_id IS NOT NULL) AS reverting_edits,
SUM(rvt.revision_id IS NULL AND len_change = 0) AS noop_edits,
SUM(rvt.revision_id IS NULL AND len_change > 0) AS add_edits,
SUM(rvt.revision_id IS NULL AND len_change < 0) AS remove_edits,
SUM(IF(rvt.revision_id IS NULL AND len_change > 0, len_change, 0)) AS len_added,
SUM(IF(rvt.revision_id IS NULL AND len_change < 0, len_change, 0)) AS len_removed
FROM halfak.rev_len_changed rlc
INNER JOIN halfak.user_cohort uc USING(user_id)
LEFT JOIN halfak.revert_20110115 rvt ON rev_id = revision_id
WHERE rev_timestamp <= "20110115000000"
GROUP BY
rlc.user_id,
rlc.namespace,
rlc.rev_year,
rlc.rev_month,
rlc.rev_day;
Notes
editThis table was generated from using revert information from the January 2011 dump of enwiki. Because reverts are limited at that point, the rest of the query must be as well.