Research talk:Onboarding new Wikipedians/Rollout/Work log/2014-03-07
Friday, March 7th
editToday, I'm working on trying to reason about the impact of GettingStarted across the wikis where it was deployed. In Enwiki, we ran a series of control experiments to test and refine GettingStarted there, but for the rest of the wikis, we just deployed our best version of the system.
So, in order to reason about the impact that GettingStarted had on these other wikis, I'd like to assume that the deployment represents a en:natural experiment -- and that I can reason about the effect of GettingStarted by comparing the state of the wikis before and after deployment. First things first, I want to check on the deployments to make sure that users were not seeing GettingStarted before they were supposed to and that users were seeing GettingStarted after they were supposed to.
Figure #GettingStarted deployments (to the right) plots the proportion of users eligible to receive a GettingStarted impression that also had one recorded. The vertical line represents the deployment date. It looks clear from this figure that the deployment assumptions hold. However, there are a couple of wikis that don't have enough daily registrations to plot a proportion: astwiki and glwiki. I'll have to figure out what to do with those. We may not be able to get enough observations from them to reason about GS's effectiveness there. --Halfak (WMF) (talk) 16:42, 7 March 2014 (UTC)
Time to pull a sample of users registered before and after deployments. My scripts for generating R:Productive new editor and other metrics run kind of slow because they need to detect reverts, so I don't want to process that many. It would be nice if I could process more than 5k, but less than 10k. Since all but enwiki and jawiki had deployments on 2014-02-11 18:13:00, I'll just check around that since I'm ballparking anyway.
SELECT COUNT(*) FROM (
SELECT wiki, event_userId
FROM ServerSideAccountCreation_5487345
WHERE
timestamp BETWEEN
DATE_FORMAT(DATE_ADD("20140211181300", INTERVAL -14 DAY), "%Y%m%d%H%i%S") AND
DATE_FORMAT(DATE_ADD("20140211181300", INTERVAL 14 DAY), "%Y%m%d%H%i%S") AND
event_isSelfMade AND
NOT event_displayMobile AND
wiki IN (
"dewiki", "elwiki", "eswiki", "fawiki", "frwiki", "fowiki", "glwiki",
"hewiki", "huwiki", "iswiki", "itwiki", "kowiki", "lbwiki", "mkwiki",
"mlwiki", "nlwiki", "plwiki", "ptwiki", "ruwiki", "simplewiki",
"svwiki", "viwiki", "ukwiki", "zhwiki"
)
GROUP BY 1,2) AS sampled_users
+----------+ | COUNT(*) | +----------+ | 92245 | +----------+ 1 row in set (8.42 sec)
Well... that's too many. It will probably make sense for me to down-sample. I could get some solid results if I look at 500 users from each wiki before and after the deployment. I have 24 (non-en) wikis. There are a few that I know I'm not going to get 500 newly registered non-mobile users from, but assuming I did, I'd need to be able to process 24000 users. That's going to take a little while, but it might be necessary. Let's see how many I'll actually get.
SELECT wiki, COUNT(*)
FROM (
SELECT wiki, event_userId
FROM ServerSideAccountCreation_5487345
WHERE
timestamp BETWEEN
DATE_FORMAT(DATE_ADD("20140211181300", INTERVAL -14 DAY), "%Y%m%d%H%i%S") AND
DATE_FORMAT(DATE_ADD("20140211181300", INTERVAL 14 DAY), "%Y%m%d%H%i%S") AND
event_isSelfMade AND
NOT event_displayMobile AND
wiki IN (
"dewiki", "elwiki", "eswiki", "fawiki", "frwiki", "fowiki", "glwiki",
"hewiki", "huwiki", "iswiki", "itwiki", "kowiki", "lbwiki", "mkwiki",
"mlwiki", "nlwiki", "plwiki", "ptwiki", "ruwiki", "simplewiki",
"svwiki", "viwiki", "ukwiki", "zhwiki"
)
GROUP BY 1,2
) AS sampled_users
GROUP BY wiki
ORDER BY COUNT(*) DESC;
+------------+----------+ | wiki | COUNT(*) | +------------+----------+ | eswiki | 21871 | | frwiki | 11919 | | zhwiki | 10613 | | ruwiki | 10439 | | dewiki | 7599 | | ptwiki | 7333 | | itwiki | 4564 | | fawiki | 3392 | | nlwiki | 2307 | | plwiki | 2231 | | viwiki | 2159 | | svwiki | 1592 | | ukwiki | 1239 | | kowiki | 1196 | | huwiki | 1163 | | hewiki | 1126 | | elwiki | 627 | | simplewiki | 471 | | mlwiki | 228 | | mkwiki | 72 | | iswiki | 69 | | glwiki | 25 | | fowiki | 5 | | lbwiki | 5 | +------------+----------+ 24 rows in set (8.15 sec)
So, 8 wikis don't have enough users. That leaves 16. 16000 should be OK. Time to build some SQL to sample from each of the wikis. --Halfak (WMF) (talk) 21:53, 7 March 2014 (UTC)
So, I'll be running the following query for each Wiki:
SET @first_deploy = "20140211181300";
SET @second_deploy = "20140227001800";
(
SELECT DISTINCT wiki AS wiki, event_userId AS user_id, timestamp AS user_registration, "before deployment" AS bucket
FROM ServerSideAccountCreation_5487345
WHERE wiki = "dewiki" AND timestamp BETWEEN
DATE_FORMAT(DATE_ADD(@first_deploy, INTERVAL -14 DAY), "%Y%m%d%H%i%S") AND
DATE_FORMAT(DATE_ADD(@first_deploy, INTERVAL -1 SECOND), "%Y%m%d%H%i%S")
ORDER BY RAND()
LIMIT 500
)
UNION
(
SELECT DISTINCT wiki AS wiki, event_userId AS user_id, timestamp AS user_registration, "after deployment" AS bucket
FROM ServerSideAccountCreation_5487345
WHERE wiki = "dewiki" AND timestamp BETWEEN @first_deploy AND
DATE_FORMAT(DATE_ADD(@first_deploy, INTERVAL 14 DAY), "%Y%m%d%H%i%S")
ORDER BY RAND()
LIMIT 500
)
--Halfak (WMF) (talk) 22:45, 7 March 2014 (UTC)
Bah! The above queries don't filter out non-self-created users and mobile users. Just fixed that and started it again.
$ wc datasets/sample_users.tsv 16001 80004 750038 datasets/sample_users.tsv
Alright! Sample complete. I think that's it for today. Stay tuned for building statistics about these users and looking for trends that might invalidate the experiment. --Halfak (WMF) (talk) 23:07, 7 March 2014 (UTC)