Research talk:VisualEditor's effect on newly registered editors/May 2015 study/Work log/20150930
Wednesday, September 30, 2015
Today I want to look at the short-, mid-, and long-term survival trends of editors in the experimental conditions. I was recently working on a similar analysis for an experiment with the Teahouse, and I'm going to replicate the same methodology here. If you want to read the details, see Research_talk:Teahouse_long_term_new_editor_retention/Work_log/20150928.
The experiment ended on June 4th, 2015, so that means I have nearly 4 months to work with. I want to look at the following trial and survival periods:
 1 week, 1 month
 1 month, 1 month
 2 months, 1 month
So now the query:
SELECT
  user_id,
  SUM(revisions_1_to_2_weeks) AS revisions_1_to_2_weeks,
  SUM(revisions_1_to_2_months) AS revisions_1_to_2_months,
  SUM(revisions_2_to_3_months) AS revisions_2_to_3_months
FROM (
  -- Live revisions, counted per user in each survival window
  (SELECT
    user_id,
    SUM(
      rev_timestamp IS NOT NULL AND
      DATEDIFF(rev_timestamp, user_registration) BETWEEN 7 AND 14
    ) AS revisions_1_to_2_weeks,
    SUM(
      rev_timestamp IS NOT NULL AND
      DATEDIFF(rev_timestamp, user_registration) BETWEEN 30 AND 60
    ) AS revisions_1_to_2_months,
    SUM(
      rev_timestamp IS NOT NULL AND
      DATEDIFF(rev_timestamp, user_registration) BETWEEN 60 AND 90
    ) AS revisions_2_to_3_months
  FROM staging.ve2_experimental_users AS user
  INNER JOIN user USING (user_id)
  LEFT JOIN revision ON
    rev_user = user_id AND
    rev_timestamp >= DATE_FORMAT(
      DATE_ADD(user_registration, INTERVAL 7 DAY),
      "%Y%m%d%H%i%S"
    )
  GROUP BY 1)
  -- UNION drops duplicate rows, so a user whose live and archived
  -- counts are identical contributes only one row to the outer SUM
  UNION
  -- Deleted (archived) revisions, counted the same way
  (SELECT
    user_id,
    SUM(
      ar_timestamp IS NOT NULL AND
      DATEDIFF(ar_timestamp, user_registration) BETWEEN 7 AND 14
    ) AS revisions_1_to_2_weeks,
    SUM(
      ar_timestamp IS NOT NULL AND
      DATEDIFF(ar_timestamp, user_registration) BETWEEN 30 AND 60
    ) AS revisions_1_to_2_months,
    SUM(
      ar_timestamp IS NOT NULL AND
      DATEDIFF(ar_timestamp, user_registration) BETWEEN 60 AND 90
    ) AS revisions_2_to_3_months
  FROM staging.ve2_experimental_users AS user
  INNER JOIN user USING (user_id)
  LEFT JOIN archive ON
    ar_user = user_id AND
    -- Note: this lower bound is 21 days, not the 7 days used for
    -- live revisions above
    ar_timestamp >= DATE_FORMAT(
      DATE_ADD(user_registration, INTERVAL 21 DAY),
      "%Y%m%d%H%i%S"
    )
  GROUP BY 1)
) user_span_revisions
GROUP BY user_id;
Here's a sample of the output:
user_id     revisions_1_to_2_weeks  revisions_1_to_2_months  revisions_2_to_3_months
2532<snip>  6                       0                        0
2532<snip>  0                       0                        0
2532<snip>  0                       0                        0
2532<snip>  0                       0                        0
2532<snip>  4                       1                        0
2532<snip>  2                       0                        0
2532<snip>  0                       0                        0
2532<snip>  0                       0                        0
2532<snip>  0                       0                        0
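To make the windowing in the query concrete, here is a minimal sketch in Python (not the language used in this log) of how a single revision is assigned to the three windows. The day ranges are copied from the BETWEEN clauses; the function names and the example timestamps are hypothetical. It only mirrors the DATEDIFF logic and ignores the extra timestamp lower bound in the LEFT JOIN conditions.

```python
from datetime import datetime

# Inclusive day ranges copied from the BETWEEN clauses in the query.
WINDOWS = {
    "revisions_1_to_2_weeks": (7, 14),
    "revisions_1_to_2_months": (30, 60),
    "revisions_2_to_3_months": (60, 90),
}


def parse_mw_timestamp(ts):
    """Parse a MediaWiki-style YYYYMMDDHHMMSS timestamp string."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def windows_for_revision(registration, rev_timestamp):
    """Return the names of the windows in which one revision is counted.

    Like MySQL's DATEDIFF, only the calendar dates are compared;
    the time of day is ignored.
    """
    days = (
        parse_mw_timestamp(rev_timestamp).date()
        - parse_mw_timestamp(registration).date()
    ).days
    return [name for name, (lo, hi) in WINDOWS.items() if lo <= days <= hi]


# Hypothetical timestamps, not taken from the experiment.
print(windows_for_revision("20150528120000", "20150606090000"))  # day 9
print(windows_for_revision("20150528120000", "20150727000000"))  # day 60
```

Because BETWEEN is inclusive at both ends, a revision made exactly 60 days after registration falls into both the 1-to-2-month and the 2-to-3-month windows, which the second call illustrates.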
1+ edits survival
First, I'll consider an editor "surviving" if they make at least 1 edit in the survival period.
bucket        1 to 2 weeks.k  1 to 2 months.k  2 to 3 months.k  n      1 to 2 weeks.p  1 to 2 months.p  2 to 3 months.p
control       321             294              211              13464  0.02384135      0.02183601       0.01567142
experimental  311             327              230              13507  0.0230251       0.02420967       0.01702821
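The k, n and p columns can be derived from per-user revision counts roughly as follows. This is a small Python sketch, not the code actually used for this log: `survival_summary`, its `min_edits` parameter, and the example rows are all hypothetical.

```python
def survival_summary(rows, min_edits=1):
    """Summarize survival per bucket.

    rows: iterable of (bucket, revisions_in_period) pairs, one per user.
    Returns {bucket: {"k": survivors, "n": users, "p": k / n}}.
    """
    summary = {}
    for bucket, revisions in rows:
        stats = summary.setdefault(bucket, {"k": 0, "n": 0})
        stats["n"] += 1
        # A user "survives" if they reach the edit threshold in the period.
        if revisions >= min_edits:
            stats["k"] += 1
    for stats in summary.values():
        stats["p"] = stats["k"] / stats["n"]
    return summary


# Hypothetical per-user rows, not data from the experiment.
rows = [
    ("control", 6), ("control", 0), ("control", 0), ("control", 4),
    ("experimental", 2), ("experimental", 0),
    ("experimental", 0), ("experimental", 0),
]
print(survival_summary(rows, min_edits=1))
print(survival_summary(rows, min_edits=5))
```

Raising `min_edits` gives a stricter definition of survival without changing anything else in the calculation.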
Chi^2 tests

> prop.test(bucket.survival$revisions_1_to_2_weeks.k, bucket.survival$n)

    2-sample test for equality of proportions with continuity correction

data:  bucket.survival$revisions_1_to_2_weeks.k out of bucket.survival$n
X-squared = 0.1623, df = 1, p-value = 0.6871
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.002868681  0.004501194
sample estimates:
    prop 1     prop 2
0.02384135 0.02302510

> prop.test(bucket.survival$revisions_1_to_2_months.k, bucket.survival$n)

    2-sample test for equality of proportions with continuity correction

data:  bucket.survival$revisions_1_to_2_months.k out of bucket.survival$n
X-squared = 1.585, df = 1, p-value = 0.208
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.006027302  0.001279978
sample estimates:
    prop 1     prop 2
0.02183601 0.02420967

> prop.test(bucket.survival$revisions_2_to_3_months.k, bucket.survival$n)

    2-sample test for equality of proportions with continuity correction

data:  bucket.survival$revisions_2_to_3_months.k out of bucket.survival$n
X-squared = 0.6897, df = 1, p-value = 0.4063
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.004457761  0.001744186
sample estimates:
    prop 1     prop 2
0.01567142 0.01702821
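Looks like the VE (experimental) cohort is a bit ahead in the 1-to-2-month and 2-to-3-month periods and a bit behind in the 1-to-2-week period, but none of the differences is significant.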
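For readers who want to check the arithmetic behind these tests without R, here is a minimal Python sketch of a two-sample test for equality of proportions with continuity correction. It is a re-implementation under the usual textbook formulas, not the code that produced the output above; the function name `two_sample_prop_test` is hypothetical, and only the standard library is used.

```python
from math import erfc, sqrt
from statistics import NormalDist


def two_sample_prop_test(k1, n1, k2, n2, conf_level=0.95):
    """Yates-corrected chi-squared test for equality of two proportions.

    Returns (x_squared, p_value, (ci_low, ci_high)), where the interval
    is for the difference p1 - p2.
    """
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    # 2x2 table: survived / did not survive, per group.
    observed = [k1, n1 - k1, k2, n2 - k2]
    expected = [n1 * pooled, n1 * (1 - pooled), n2 * pooled, n2 * (1 - pooled)]
    # Continuity correction: shrink each |O - E| by 0.5, but not below zero.
    x_squared = sum(
        max(abs(o - e) - 0.5, 0.0) ** 2 / e for o, e in zip(observed, expected)
    )
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_value = erfc(sqrt(x_squared / 2))
    # Normal-approximation interval for p1 - p2, widened by a continuity term.
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    half_width = (
        z * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
        + 0.5 * (1 / n1 + 1 / n2)
    )
    diff = p1 - p2
    return x_squared, p_value, (diff - half_width, diff + half_width)


# Counts from the 1-to-2-week column of the 1+ edits table.
x2, p, ci = two_sample_prop_test(321, 13464, 311, 13507)
print(round(x2, 4), round(p, 4), ci)
```

With the 1-to-2-week counts this should agree with the first test to the precision shown (X-squared about 0.1623, p-value about 0.6871), and the interval straddles zero, consistent with a non-significant difference.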
5+ edits survival
Let's try considering survival to only be legitimate if the editor saves at least 5 edits in the survival period.
bucket        1 to 2 weeks.k  1 to 2 months.k  2 to 3 months.k  n      1 to 2 weeks.p  1 to 2 months.p  2 to 3 months.p
control       108             116              83               13464  0.00802139      0.008615567      0.006164587
experimental  102             129              79               13507  0.00755164      0.009550603      0.005848819
Chi^2 tests

> prop.test(bucket.survival5$revisions_1_to_2_weeks.k, bucket.survival5$n)

    2-sample test for equality of proportions with continuity correction

data:  bucket.survival5$revisions_1_to_2_weeks.k out of bucket.survival5$n
X-squared = 0.1366, df = 1, p-value = 0.7117
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.001702440  0.002641941
sample estimates:
    prop 1     prop 2
0.00802139 0.00755164

> prop.test(bucket.survival5$revisions_1_to_2_months.k, bucket.survival5$n)

    2-sample test for equality of proportions with continuity correction

data:  bucket.survival5$revisions_1_to_2_months.k out of bucket.survival5$n
X-squared = 0.5552, df = 1, p-value = 0.4562
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.003273534  0.001403462
sample estimates:
     prop 1      prop 2
0.008615567 0.009550603

> prop.test(bucket.survival5$revisions_2_to_3_months.k, bucket.survival5$n)

    2-sample test for equality of proportions with continuity correction

data:  bucket.survival5$revisions_2_to_3_months.k out of bucket.survival5$n
X-squared = 0.0659, df = 1, p-value = 0.7974
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.001602756  0.002234292
sample estimates:
     prop 1      prop 2
0.006164587 0.005848819
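Again, we don't see significance, but unlike above, we don't see a clear advantage for either group: control is slightly ahead at 1 to 2 weeks and 2 to 3 months, and experimental is slightly ahead at 1 to 2 months. Halfak (WMF) (talk) 18:32, 30 September 2015 (UTC)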