Research talk:Wikipedia article creation/Work log/Monday, November 25th
Monday, November 25thEdit
Today, I'm going to try to replicate the change in the proportion of deleted pages for non-English wikis. The real problem is that the database servers for non-English Wikipedia do not allow for temporary table -- the kind of scratch space I need in order to move forward.
My plan is to extract exactly what I need to a local database server that I can use to perform the requisite joins. I've figured out how to extract page info as well as user stats, but I can't seem to figure out how to extract the revision that created the page so I can join with the user table and split newcomers from oldtimers.
Page status query
SELECT page_id, page_namespace, page_title, revisions, first_revision. last_revision, archived, ( archived AND UNIX_TIMESTAMP(last_revision) - UNIX_TIMESTAMP(first_revision) < 60*60*24*30 ) AS archived_quickly, creation.rev_id AS creating_rev_id, creation.rev_timestamp AS creating_rev_timestamp, user_id AS creator_id, IFNULL(creatore.user_name, creation.rev_user_text) AS creator_name, IFNULL(direct_create.log_action, indirect_create.log_action) AS creator_creation_action, user_registration AS creator_registration FROM ( SELECT ar_page_id AS page_id, ar_namespace AS page_namespace, ar_title AS page_title, COUNT(*) AS revisions, MIN(ar_timestamp) AS first_revision, MAX(ar_timestamp) AS last_revision, True AS archived, MIN(ar_rev_id) AS first_rev_id FROM archive WHERE ar_timestamp < "20131105000000" UNION SELECT page_id, page_namespace, page_title, COUNT(*) AS revisions, MIN(rev_timestamp) AS first_edit, MAX(rev_timestamp) AS last_edit, False AS archived, MIN(rev_id) AS first_rev_id FROM revision INNER JOIN page ON page_id = rev_page WHERE rev_timestamp < "20131105000000" GROUP BY 1,2,3 ) as page LEFT JOIN revision creation ON rev_id = first_rev_id LEFT JOIN user creator ON rev_user = user_id LEFT JOIN logging AS direct_create ON direct_create.log_type = "newusers" AND creator.user_id = direct_create.log_user AND direct_create.log_action IN ("create", "autocreate", "newusers") LEFT JOIN logging AS indirect_create ON indirect_create.log_type = "newusers" AND indirect_create.log_action IN ("byemail", "create2") AND REPLACE(creator.user_name, " ", "_") = indirect_create.log_title;
Alright. This one is a monster. Here's what it is supposed to get:
(one row per page)
- page_id: Page identifier (if not archived)
- page_namespace: "
- page_title: "
- revisions: Number of revisions saved
- first_revision: Date of the first revision
- last_revision: Date of the last revision
- archived: True if in archive table, false otherwise
- archived_quickly: True if archived within 30 days of creation (judging by last rev_timestamp)
- creating_rev_id: rev_id of the creating revision
- creating_rev_timestamp: "
- creator_id: user_id of the creator (or NULL for anons)
- creator_name: user_name of the creator (or IP for anons)
- creator_action: log_action for relevant account creations
- creator_registration: registration date of creator (or NULL for anons)
Time to see how quickly I can get in trouble for unleashing this.
19:39, 25 November 2013 (UTC)