WSoR datasets/user cohort
The user_cohort dataset was generated to allow faster selection of editors by the year and month in which they started editing.
Location
editdb42:halfak.user_cohort
halfak@internproxy:~/Sandbox/wsor/scripts$ mysql -h db42 -e "EXPLAIN halfak.user_cohort;SELECT * FROM halfak.user_cohort LIMIT 3;" +------------------+-----------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +------------------+-----------------+------+-----+---------+-------+ | user_id | int(5) unsigned | NO | PRI | 0 | | | user_name | varbinary(255) | NO | | | | | first_edit | varbinary(14) | YES | MUL | NULL | | | first_edit_year | int(4) | YES | MUL | NULL | | | first_edit_month | int(2) | YES | | NULL | | | last_edit | varbinary(14) | YES | | NULL | | +------------------+-----------------+------+-----+---------+-------+ +---------+------------------+----------------+-----------------+------------------+----------------+ | user_id | user_name | first_edit | first_edit_year | first_edit_month | last_edit | +---------+------------------+----------------+-----------------+------------------+----------------+ | 1 | Damian Yerrick | 20010929004320 | 2001 | 9 | 20110715131605 | | 2 | AxelBoldt | 20010726145009 | 2001 | 7 | 20110715175901 | | 3 | Tobias Hoevekamp | 20010326202105 | 2001 | 3 | 20040329205621 | +---------+------------------+----------------+-----------------+------------------+----------------+
Fields
editEach row represents a user. There should be a row in this table for every user who made at least one edit (even if it was in a deleted page).
user_id
: The identifier of a row. The user identifier fromuser.user_id
.user_name
: The user's registered name.first_edit
: The date the user made their first edit.first_edit_year
: The year the user made their first edit.first_edit_month
: The month the user made their first edit.last_edit
: The date the user made their last recorded edit.
Reproduction
editRun this query. It should be very fast because it only has to look at the user_id_timestamp index on the revision and archive tables.
CREATE TABLE halfak.user_cohort
SELECT
user_id,
user_name,
MIN(first_edit) AS first_edit,
YEAR(MIN(first_edit)) AS first_edit_year,
MONTH(MIN(first_edit)) AS first_edit_month,
MAX(first_edit) AS last_edit
FROM
(
SELECT
user_id,
user_name,
MIN(rev_timestamp) AS first_edit,
YEAR(MIN(rev_timestamp)) AS first_edit_year,
MONTH(MIN(rev_timestamp)) AS first_edit_month,
MAX(rev_timestamp) AS last_edit
FROM revision r
INNER JOIN user u
ON u.user_id = r.rev_user
GROUP BY user_id
UNION
SELECT
user_id,
user_name,
MIN(ar_timestamp) AS first_edit,
YEAR(MIN(ar_timestamp)) AS first_edit_year,
MONTH(MIN(ar_timestamp)) AS first_edit_month,
MAX(ar_timestamp) AS last_edit
FROM archive a
INNER JOIN user u
ON u.user_id = a.ar_user
GROUP BY user_id
) AS whocares_doesntmatter
GROUP BY user_id, user_name;
Notes
editThis table was first generated in late July of 2011 and until someone remembers to update this line to say they regenerated it, you should assume that it hasn't been updated since.