Research talk:Wikipedia article creation/Work log/Friday, January 17th
Friday, January 17thEdit
Today I'm working on the last piece in the puzzle of newcomer page creations -- figuring out the scale at which all newcomers enter the page creation process.
So first things first, I need to generate a funnel for observing how many registered editors --> R:New editors --> New page creators --> New article creators --> New draft article creators.
SELECT DATE(CONCAT(LEFT(user_registration, 6), "01")) AS registration_month, COUNT(*) AS registered_users, /* registered users */ SUM(revision_stats.day_revisions > 0) AS new_editors, SUM(page_stats.pages > 0) AS new_page_creators, /* users who started a any page in first 30 days */ SUM(page_stats.article_pages > 0) AS new_article_creators, /* users who started an article page in first 30 days */ SUM(page_stats.draft_pages > 0) AS new_draft_creators /* users who started a draft article page in first 30 days */ FROM ( SELECT user_id, user_registration, SUM(day_revisions) AS day_revisions FROM ( ( SELECT user_id, user_registration, SUM(rev_id IS NOT NULL) AS day_revisions FROM user LEFT JOIN revision ON rev_user = user_id AND rev_timestamp < DATE_FORMAT(DATE_ADD(user_registration, INTERVAL 1 DAY), "%Y%m%d%H%i%S") WHERE user_registration > "2008" GROUP BY 1 ) UNION ( SELECT user_id, user_registration, SUM(ar_id IS NOT NULL) AS day_revisions FROM user LEFT JOIN archive ON ar_user_text = user_name AND ar_timestamp < DATE_FORMAT(DATE_ADD(user_registration, INTERVAL 1 DAY), "%Y%m%d%H%i%S") WHERE user_registration > "2008" GROUP BY 1,2 ) ) AS revision_archive_stats GROUP BY 1,2 ) AS revision_stats LEFT JOIN ( SELECT user_id, SUM(c.user_id IS NOT NULL) AS pages, SUM(ap.page_title IS NOT NULL) AS article_pages, SUM(ap.original_namespace != 0) AS draft_pages FROM user LEFT JOIN staging.nov13_dewiki_creation c USING (user_id) LEFT JOIN staging.nov13_dewiki_article_page ap USING (page_id) WHERE user_registration > "2008" AND c.rev_timestamp < DATE_FORMAT(DATE_ADD(user_registration, INTERVAL 30 DAY), "%Y%m%d%H%i%S") GROUP BY 1 ) AS page_stats USING (user_id) GROUP BY 1
I cheated a little bit. I got the above query wrong several times before I got it right.
Here's the results. First, the monthly proportion of users that make it to each part of the funnel. Since Enwiki is super big and takes forever to run, I'll start off with German.
~ 30% of newly registered users will make an edit within 24h. ~40% of these new editors will create some page. ~60% of them will have created an article (or a draft article). And depending on where you look, about %3 of those users will have created a draft (non ns=0 page that became an article). There's a concerning dip in the number of draft creators that has me thinking that there might be some anomaly in my data.
Check out the raw numbers:
The number of editors who create drafts each months drops from 30 to 0-2 in the middle of 2011. My best guess is that there was a substantial change in the way that page moves were recorded at that point. I'll have to pick through the data to see if I can figure out what's going on. --Halfak (WMF) (talk) 23:56, 17 January 2014 (UTC)