Template A/B testing/Huggle Analyses

Overview

The Huggle testing experiment descriptions can be found here.

The experiments themselves involved comparing template sets consisting of a control and a test template. The control templates are existing Huggle warning templates in use in the community, while the test templates are versions modified in measurable ways (length of warning, directives, personalization, etc.). Below are links to descriptions of what each experiment is meant to test:

Huggle short 2

Huggle short 1 & 2

Huggle short 1 & 2 (84, 85, 86 only)

Huggle 3 analysis (60,62,66,76 vs. 61,63,67,77 only)

This analysis groups together Huggle warnings issued in response to test, spam, delete, and unsourced revisions.


Whether the changes made to the test templates had an effect was assessed primarily using logistic regression analysis in R. Editor groups were pre-filtered based on:

  • registered / non-registered
  • minimum number of edits made before the posting
  • maximum number of edits made before the posting
  • minimum number of edits made after the posting (usually at least 1)

The results of these analyses, along with the conditions on the editor activity samples, can be seen in the experiment results in the next section.
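For reference, a minimal sketch of the regression step in R, assuming a data frame all_data with a binary template column (control vs. test) and a per-editor edits_decrease column, as in the R outputs quoted below; this sketches the approach, not the project's actual scripts:

# Does the normalized drop in edits predict which template an editor
# received? A coefficient on edits_decrease distinguishable from zero
# indicates the two templates differ in their effect on further editing.
fit <- glm(template ~ edits_decrease, family = binomial(link = "logit"),
           data = all_data)
summary(fit)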

In the plots accompanying the analyses, the mean ratio of edits before and after the posting, as well as the mean absolute number of edits made after the posting, is plotted against the minimum number of edits before the posting used to select the sample group. Companion plots show the sample sizes for each data point in the edit trend (both control and test).
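A plausible construction of these trend plots in R, using the normalized-decrease formula from the legend in the Summary section (the column names edits_before and edits_after are assumptions):

# For each minimum prior-edit threshold, compute the group's normalized
# mean edit decrease and its sample size, then plot the trend.
thresholds <- 1:20
trend <- sapply(thresholds, function(m) {
  s <- all_data[all_data$edits_before >= m, ]
  c(ratio = (mean(s$edits_before) - mean(s$edits_after)) / mean(s$edits_before),
    n = nrow(s))
})
plot(thresholds, trend["ratio", ], type = "b",
     xlab = "Minimum edits before posting",
     ylab = "Mean normalized decrease in edits")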

Analysis Results

Huggle test one (July 19-August 5)

We tested a variety of factors against the default control versions, all of which can be seen at uw-vandal-rand1/Experiment1. The primary tests were of the clarity of instructional language, personalization of the messages, and whether images enhanced the impact of the message.

Data and metrics

Detailed results and raw stats are currently up at Research:Warning Templates in Huggle. The total sample was 3241, of which 1750 clicked through to the new messages banner.

Key findings

A more human-readable summary was posted on the Village Pump, but the key findings were:

  1. Adding "personalization", e.g. including usernames, speaking in the first person rather than the passive voice, and generally making it obvious who the reverting/warning editor is, had a marginally significant impact on continued good faith editing. More testing was necessary to try to confirm this outcome.
  2. "Teaching messages", i.e. including no personalization but making the instructional content of a warning simpler and more direct, had no positive impact on retention of good faith editors, but was best at discouraging outright vandals.
  3. The new teaching-focused template had a significantly higher rate of retaliatory vandalism than the current versions used as control in our study. Simply improving the clarity of instruction is not effective and may in fact cause negative outcomes.
  4. Personalization of messages, especially including a more explicit invitation to ask questions of other editors, increased the amount of positive contact between the editors being warned and those reverting them.

Other findings
  1. Whether templates include an icon or not made no statistically significant difference in the further editing behavior of those being warned.
  2. The single biggest factor in whether an editor continues to participate after being warned is the amount of editing they did prior to getting the message. Previous experience is the best predictor of continued editing, so to get accurate results you need to perform regression analyses that account for this; a sketch follows below.
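As a minimal illustration of that point (hypothetical data frame and column names, not the project's actual scripts), prior activity can be entered as a covariate so the template effect is estimated net of experience:

# Sketch: estimate the template effect on continued editing while
# controlling for prior experience (the strongest predictor).
fit <- glm(continued_editing ~ template + edits_before,
           family = binomial(link = "logit"), data = warned_editors)
summary(fit)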

Huggle test two (September 25-October 10)

Using the templates uw-vandal-rand1/Experiment2, we compared:

  • the current default template
  • a template with "personalization" plus instructional language, such as a suggestion to use the sandbox and read the introduction to editing
  • a template with personalization but no instructional/teaching language or links to policy, plus a more explicit thank you for editing

We tested these variations because in the first test we discovered that personalization (adding usernames, speaking in the first person instead of the passive voice, and inviting people to ask questions on the vandalfighter's talk page) performed better, but we wanted to know exactly whether instructional-type language works or not. Our hypothesis was that attempting to assign tasks and pointing people to long, complicated policy to read not only fails to get them to improve, it actually discourages them from editing entirely.

Data and metrics

A grand total of 2,451 messages were delivered, most of them to IPs. We broke these down further with a couple of rounds of qualitative coding on a four-point scale:

  • 420 were 'vandals' whose edits obviously should have been reverted and may have merited an immediate block. Examples: 1, 2.
  • 982 were 'bad faith editors', people committing vandalism for whom the simple level 1 warning they received was appropriate. Examples: 1, 2.
  • 702 were 'test editors', people making test edits that should be reverted for quality but who aren't obvious vandals. Examples: 1, 2.
  • 347 were 'good faith editors', who were clearly trying to improve the encyclopedia in an unbiased, factual way. Examples: 1, 2.

We then analyzed each of these groups separately on several metrics. Raw statistical outputs from R are available here.

Key findings

  1. We didn’t see an improvement in the long-term retention rates of editors, which wasn’t a big surprise: we didn’t really expect one different template to make people stick around months later.
  2. We did see an improvement in whether new Wikipedians kept editing articles or not. For people who’d already cut their teeth as editors (~10 edits before getting warned), both new templates did a significantly better job of encouraging them to keep editing in the main namespace. Recipients of these templates made further edits equal to 20% of their prior contributions. Considering all warnings generally discourage further editing, this is a positive outcome. Note that this effect was only found in the groups we coded as test editors or those editing in good faith, not vandals.
  3. Another piece of encouraging news is that the ‘nodirectives’ template was the best overall for encouraging people of all kinds to communicate more. Statistically speaking, people who received that template performed one more user talk edit than others after receiving the message. Considering one user talk edit can mean a completed message to another editor, this is good news.
  4. We found only 1% of the user talk edits made after being warned were retaliation directed at other editors. That’s good, since we had some concerns that giving vandalfighters’ usernames more exposure might increase vandalism directed at them.

Huggle short 2

Data Munging / Filtering: 

Only tracking edits in the first three days after posting
5 <= edits before <= Inf (registered)
1 <= edits before <= Inf (non-registered)
Edits deleted before >= 0
Blocks after = 0 (no blocks after seeing template)
namespace = 0
first_warning = TRUE
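The execute.main(...) calls quoted alongside the outputs below presumably apply filters like these; the following is only a guess at that filtering step in R, with column names assumed:

# Hypothetical filter mirroring the criteria above; not the actual
# implementation behind execute.main().
filter_sample <- function(d, min_edits_before, max_edits_before = Inf,
                          registered = TRUE) {
  d[d$registered == registered &
      d$edits_before >= min_edits_before &
      d$edits_before <= max_edits_before &
      d$edits_deleted_before >= 0 &
      d$blocks_after == 0 &    # no blocks after seeing the template
      d$namespace == 0 &       # main namespace only
      d$first_warning, ]       # first warning received
}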


Findings:

For non-registered users the drop in edits from the control template was less than the test (71.38% vs. 79.91%) with a confidence level of 88.1%.
For registered users the drop in edits from the test template was less than the control (55.63% vs. 88.27%) with a confidence level of 92.25%.

Therefore there was an observable effect in favour of the test for registered users and in favour of the control for non-registered users.
It is also noteworthy that for non-registered users the difference was heavily skewed toward editors that made very few edits before the template posting. It should be further noted that much of the observed effect comes from editors that make no edits at all after the posting.
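The confidence levels quoted here and in the following sections appear to be 1 minus the p-value of the edits_decrease coefficient in the corresponding glm output (e.g. the 92.25% for registered users matches p = 0.0775 below). Extracting it from a fitted model:

# Confidence as quoted in the findings: one minus the p-value of the
# edits_decrease term ('fit' is a model from the glm call shown below).
p <- summary(fit)$coefficients["edits_decrease", "Pr(>|z|)"]
confidence <- 1 - p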


Modelling Analysis, Non-Registered Users - R Output

Call:
glm(formula = template ~ edits_decrease, family = binomial(link = "logit"), 
    data = all_data)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.157  -1.157  -1.023   1.198   1.518  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)  
(Intercept)     -0.2116     0.1073  -1.972   0.0486 *
edits_decrease   0.1635     0.1035   1.579   0.1143  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1079.8  on 779  degrees of freedom
Residual deviance: 1077.2  on 778  degrees of freedom
AIC: 1081.2

Number of Fisher Scoring iterations: 4



Reduction in edits Test:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-3.4290  1.0000  1.0000  0.7991  1.0000  1.0000 

Reduction in edits Control:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-6.0000  0.9250  1.0000  0.7138  1.0000  1.0000 
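The "Reduction in edits" blocks in this and later sections are R summary() output over the per-editor normalized decrease; a sketch of how they could be produced (column names assumed):

# Per-editor normalized decrease in edits; summary() prints the
# Min / 1st Qu. / Median / Mean / 3rd Qu. / Max line shown above.
reduction <- (d$edits_before - d$edits_after) / d$edits_before
summary(reduction)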



Modelling Analysis, Registered Users - R Output

Call:
glm(formula = template ~ edits_decrease, family = binomial(link = "logit"), 
    data = all_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.1224  -0.8574  -0.7639   1.1268   1.6578  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)  
(Intercept)      0.9267     0.9848   0.941   0.3467  
edits_decrease  -2.0091     1.1379  -1.766   0.0775 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 80.201  on 62  degrees of freedom
Residual deviance: 74.571  on 61  degrees of freedom
AIC: 78.57

Number of Fisher Scoring iterations: 5



Reduction in edits Test:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-3.4400  0.5500  0.9706  0.5563  1.0000  1.0000 

Reduction in edits Control:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.5263  0.8333  0.9359  0.8827  1.0000  1.0000 



Huggle short 1 & 2

Data Munging / Filtering: 

Only tracking edits in the first three days after posting
1 <= edits before <= Inf
Edits deleted before >= 0
Blocks after = 0 (no blocks after seeing template)
namespace = 0
first_warning = TRUE


Findings:

For non-registered users the drop in edits from the test template was less than the control (73.15% vs. 76.66%) with a confidence level of 88.1%.  
For registered users the drop in edits from the test template was less than the control (68.79% vs. 85.42%) with a confidence level of 91.05%.  There were 70 and 41 samples from test and control respectively.  

Therefore there was an observable effect of this template in favour of the test for non-registered users.


Modelling Analysis, Decrease in edits - Non-Registered Users - R Output

execute.main(min_edits_before = 1, max_edits_before = Inf, registered = FALSE)

glm(formula = template ~ edits_decrease, family = binomial(link = "logit"), 
    data = all_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7200  -1.3945   0.9663   0.9748   0.9748  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)     0.56299    0.04348  12.948   <2e-16 ***
edits_decrease -0.06576    0.04222  -1.558    0.119    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6482.6  on 4901  degrees of freedom
Residual deviance: 6480.1  on 4900  degrees of freedom
AIC: 6484

Number of Fisher Scoring iterations: 4



Reduction in edits Test:

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-20.0000   1.0000   1.0000   0.7315   1.0000   1.0000 

Reduction in edits Control:

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-10.0000   1.0000   1.0000   0.7666   1.0000   1.0000 



Modelling Analysis, Decrease in edits - Registered Users - R Output

execute.main(min_edits_before = 4, max_edits_before = Inf, registered = TRUE)

glm(formula = template ~ edits_decrease, family = binomial(link = "logit"), 
    data = all_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9876  -1.3166   0.8896   1.0443   1.0443  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)  
(Intercept)      1.3245     0.5276   2.510   0.0121 *
edits_decrease  -1.0031     0.5908  -1.698   0.0895 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 146.21  on 110  degrees of freedom
Residual deviance: 142.38  on 109  degrees of freedom
AIC: 146.38

Number of Fisher Scoring iterations: 4



Reduction in edits Test:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-1.2000  0.6667  0.9206  0.6879  1.0000  1.0000 

Reduction in edits Control:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.5000  0.8667  1.0000  0.8542  1.0000  1.0000 


Modelling Analysis, Edits 0-3 Days After - Registered Users - R Output

execute.main(min_edits_before = 3, max_edits_before = 50, registered = TRUE)

Call:
glm(formula = template ~ edits_decrease, family = binomial(link = "logit"), 
    data = all_data)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.816  -1.331   0.915   1.031   1.031  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)  
(Intercept)     0.35459    0.20200   1.755   0.0792 .
edits_decrease  0.12007    0.06823   1.760   0.0785 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 183.64  on 139  degrees of freedom
Residual deviance: 179.47  on 138  degrees of freedom
AIC: 183.47

Number of Fisher Scoring iterations: 4


Mean edits Test:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   1.000   2.427   2.000  26.000 

Mean edits Control:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.000   1.196   1.500   9.000 


Huggle short 1 & 2 (84, 85, 86 only)

Data Munging / Filtering: 

Only tracking edits in the first three days after posting
4 <= edits before <= Inf (z84 vs. z85)
5 <= edits before <= Inf (z84 vs. z86)
Blocks after = 0 (no blocks after seeing template)
namespace = 0
first_warning = TRUE

Findings:

z85 (35 samples) had a significant (80.00% confident) difference in the decrease in edits after posting over z84 (36 samples), 67.35% decrease vs. 83.40% decrease.
z86 had a semi-significant (86.50% confident) difference in the decrease in edits after posting over z84, 72.22% decrease vs. 85.08% decrease.

z85 (45 samples) had a significant (84.80% confident) difference in the mean edit count after posting over z84 (45 samples), 2.356 vs. 1.289.
z86 (40 samples) had a significant (84.20% confident) difference in the mean edit count after posting over z84 (45 samples), 2.500 vs. 1.289.


Modelling Analysis, Decrease in edits - Registered Users z84 vs. z85 - R Output

glm(formula = template ~ edits_decrease, family = binomial(link = "logit"), 
    data = all_data)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.546  -1.111  -1.099   1.258   1.258  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)      0.4934     0.4822   1.023    0.306
edits_decrease  -0.6812     0.5316  -1.281    0.200

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 98.413  on 70  degrees of freedom
Residual deviance: 96.552  on 69  degrees of freedom
AIC: 100.55

Number of Fisher Scoring iterations: 4



Percentage decrease in edits Test:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-1.2000  0.7936  1.0000  0.6735  1.0000  1.0000 

Percentage decrease in edits Control:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.5000  0.8399  0.9582  0.8340  1.0000  1.0000 


Modelling Analysis, Decrease in edits - Registered Users z84 vs. z86 - R Output

glm(formula = template ~ edits_decrease, family = binomial(link = "logit"), 
    data = all_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5624  -1.0258  -0.9575   1.2497   1.4145  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)      1.0279     0.8957   1.148    0.251
edits_decrease  -1.5699     1.0515  -1.493    0.135

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 74.192  on 53  degrees of freedom
Residual deviance: 71.616  on 52  degrees of freedom
AIC: 75.616

Number of Fisher Scoring iterations: 4



Percentage decrease in edits Test:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.3333  0.6667  0.8149  0.7222  0.9557  1.0000 

Percentage decrease in edits Control:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1000  0.8355  0.9132  0.8508  1.0000  1.0000 


Modelling Analysis, Edits 0-3 Days After - Registered Users z84 vs. z85 - R Output

execute.main(min_edits_before = 3, max_edits_before = 50, registered = TRUE)

Call:
glm(formula = template ~ edits_decrease, family = binomial(link = "logit"), 
    data = all_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5109  -1.1014  -0.2871   1.2554   1.2554  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)    -0.18145    0.24340  -0.745    0.456
edits_decrease  0.10422    0.07281   1.431    0.152

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 124.77  on 89  degrees of freedom
Residual deviance: 122.40  on 88  degrees of freedom
AIC: 126.40

Number of Fisher Scoring iterations: 4


Mean edits Test:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.000   2.356   2.000  20.000 

Mean edits Control:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.000   1.289   2.000   9.000 


Modelling Analysis, Edits 0-3 Days After - Registered Users z84 vs. z86 - R Output

execute.main(min_edits_before = 3, max_edits_before = 50, registered = TRUE)

Call:
glm(formula = template ~ edits_decrease, family = binomial(link = "logit"), 
    data = all_data)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.497  -1.044  -1.044   1.266   1.317  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)    -0.32157    0.25627  -1.255    0.210
edits_decrease  0.11638    0.08233   1.413    0.158

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 117.54  on 84  degrees of freedom
Residual deviance: 114.88  on 83  degrees of freedom
AIC: 118.88

Number of Fisher Scoring iterations: 4


Mean edits Test:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0     0.0     1.0     2.5     2.0    26.0  

Mean edits Control:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.000   1.289   2.000   9.000 


Huggle 3 analysis on specific template postings (60,62,66,76 vs. 61,63,67,77)

Data Munging / Filtering: 

Only tracking edits in the first three days after posting
Blocks after = 0 (no blocks after seeing template)
namespace = 0
first_warning = TRUE

Non-registered:

3 <= edits before <= Inf 
test datapoints = 214
control datapoints = 170

Registered:

5 <= edits before <= Inf 
test datapoints = 30
control datapoints = 30


Findings:

For non-registered users the mean decrease in edits for the test exceeded the control, 83.83% vs. 75.02%; the result is 94.59% confident.
For registered users the mean decrease in edits for the control exceeded the test, 83.20% vs. 70.58%; the result is 84.00% confident.

The direction of the effect is swapped between registered and non-registered users.


Modelling Analysis, Non-Registered Users - R Output

Call:
glm(formula = template ~ edits_decrease, family = binomial(link = "logit"), 
    data = all_data)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.319  -1.319   1.043   1.043   1.596  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)  
(Intercept)     -0.1510     0.2243  -0.673   0.5007  
edits_decrease   0.4769     0.2476   1.926   0.0541 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 527.28  on 383  degrees of freedom
Residual deviance: 523.36  on 382  degrees of freedom
AIC: 527.36

Number of Fisher Scoring iterations: 4




Percentage decrease in deleted edits Test:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-1.6670  0.8260  1.0000  0.8384  1.0000  1.0000 

Percentage decrease in deleted edits Control:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-1.6670  0.6818  1.0000  0.7502  1.0000  1.0000 


Modelling Analysis, Registered Users - R Output

Call:
glm(formula = template ~ edits_decrease, family = binomial(link = "logit"), 
    data = all_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5728  -1.0761  -0.1728   1.2366   1.2894  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)      0.8939     0.6968   1.283    0.200
edits_decrease  -1.1533     0.8205  -1.406    0.160

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 83.178  on 59  degrees of freedom
Residual deviance: 81.049  on 58  degrees of freedom
AIC: 85.049

Number of Fisher Scoring iterations: 4



Percentage decrease in deleted edits Test:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.2703  0.4309  0.9071  0.7058  1.0000  1.0000 

Percentage decrease in deleted edits Control:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.7770  0.9857  0.8320  1.0000  1.0000 


Plots

  • Ratio of decrease of edits after (0-3 days) posting to edits before
  • Sample counts for the edit-decrease-ratio plots
  • Number of edits after (0-3 days) posting against edits before (<= 50)
  • Number of edits after (0-30 days) posting against edits before (<= 50)
  • Sample counts for the number-of-edits plots


Summary


Legend of Terms:
================


Variables:

Mean_Diff_Edits_Normalized 
- Mean Difference of edits before and edits 0-3 days after posting normalized by edits before - def: (AVG(Edits before posting) - AVG(Edits 0-3 days after posting)) / AVG(Edits before posting)
!! Lower values are better !!

Diff_Edits_After_0_3 
- Mean Number of Edits 0-3 days after posting - def: AVG(Edits 0-3 days after posting)

Diff_Edits_After_0_30 
- Mean Number of Edits 0-30 days after posting - def: AVG(Edits 0-30 days after posting)
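Expressed in R, these definitions amount to the following (the data frame and column names are assumptions):

# ME1: normalized mean difference of edits (lower values are better).
me1 <- (mean(d$edits_before) - mean(d$edits_after_0_3)) / mean(d$edits_before)
# ME2: mean number of edits 0-3 days after posting.
me2 <- mean(d$edits_after_0_3)
# ME3: mean number of edits 0-30 days after posting.
me3 <- mean(d$edits_after_0_30)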


Main Namespace Only

| Experiment | Registered | Variable(s) | Control Result | Test Result | Winner | % Increase | Confidence | Sample Size | Params |
|---|---|---|---|---|---|---|---|---|---|
| Huggle 3 (60,62,66,76 vs. 61,63,67,77) | TRUE | Mean_Diff_Edits_Normalized | 0.8320 | 0.7058 | test | 15.2% (fewer) | 84.00% | 30 (test), 30 (control) | execute.main(min_edits_before = 5, max_edits_before = Inf, registered = TRUE) |
| Huggle 3 (60,62,66,76 vs. 61,63,67,77) | FALSE | Mean_Diff_Edits_Normalized | 0.7502 | 0.8383 | control | 10.5% (fewer) | 94.59% | 214 (test), 170 (control) | execute.main(min_edits_before = 3, max_edits_before = Inf, min_edits_after = 0, registered = FALSE) |
| Huggle Short 1 & 2 | TRUE | Mean_Diff_Edits_Normalized | 0.8542 | 0.6879 | test | 19.5% (fewer) | 91.05% | 70 (test), 41 (control) | execute.main(min_edits_before = 4, max_edits_before = Inf, registered = TRUE) |
| Huggle Short 1 & 2 | TRUE | Diff_Edits_After_0_3 | 1.196 | 2.427 | test | 50.7% | 92.15% | 89 (test), 51 (control) | execute.main(min_edits_before = 3, max_edits_before = 50, registered = TRUE) |
| Huggle Short 1 & 2 | FALSE | Mean_Diff_Edits_Normalized | 0.7666 | 0.7315 | test | 4.58% | 88.10% | hundreds (test), hundreds (control) | execute.main(min_edits_before = 1, max_edits_before = Inf, registered = FALSE) |
| Huggle Short 1 & 2 (84, 85) | TRUE | Mean_Diff_Edits_Normalized | 0.8340 | 0.6735 | test | 19.2% (fewer) | 80.00% | 35 (test), 36 (control) | execute.main(min_edits_before = 5, max_edits_before = Inf, registered = TRUE) |
| Huggle Short 1 & 2 (84, 86) | TRUE | Mean_Diff_Edits_Normalized | 0.8508 | 0.7222 | test | 15.1% (fewer) | 86.50% | 24 (test), 30 (control) | execute.main(min_edits_before = 4, max_edits_before = Inf, registered = TRUE) |
| Huggle Short 1 & 2 (84, 85) | TRUE | Diff_Edits_After_0_3 | 1.289 | 2.356 | test | 82.8% | 84.80% | 45 (test), 45 (control) | execute.main(min_edits_before = 3, max_edits_before = 50, registered = TRUE) |
| Huggle Short 1 & 2 (84, 86) | TRUE | Diff_Edits_After_0_3 | 1.289 | 2.500 | test | 93.9% | 84.20% | 40 (test), 45 (control) | execute.main(min_edits_before = 3, max_edits_before = 50, registered = TRUE) |
| Huggle Short 2 | TRUE | Mean_Diff_Edits_Normalized | 0.8827 | 0.5563 | test | 37.0% (fewer) | 92.25% | 21 (test), 42 (control) | execute.main(min_edits_before = 5, max_edits_before = Inf, registered = TRUE) |
| Huggle Short 2 | FALSE | Mean_Diff_Edits_Normalized | 0.7138 | 0.7991 | control | 10.7% (fewer) | 88.10% | 373 (test), 407 (control) | execute.main(min_edits_before = 1, max_edits_before = Inf, registered = FALSE) |
All Namespaces

| Experiment | Registered | Variable(s) | Control Result | Test Result | Winner | % Increase | Confidence | Sample Size | Params |
|---|---|---|---|---|---|---|---|---|---|
| Huggle 3 (60,62,66,76 vs. 61,63,67,77) | TRUE | Mean_Diff_Edits_Normalized | 0.8220 | 0.5631 | test | 31.5% (fewer) | 98.76% | 32 (test), 32 (control) | execute.main(min_edits_before = 5, max_edits_before = Inf, registered = TRUE) |
| Huggle 3 (60,62,66,76 vs. 61,63,67,77) | FALSE | Mean_Diff_Edits_Normalized | 0.7368 | 0.8321 | control | 11.5% (fewer) | 96.13% | 223 (test), 175 (control) | execute.main(min_edits_before = 3, max_edits_before = Inf, registered = FALSE) |
| Huggle Short 1 & 2 | TRUE | Mean_Diff_Edits_Normalized | 0.8023 | 0.4425 | test | 44.8% (fewer) | 92.74% | 82 (test), 48 (control) | execute.main(min_edits_before = 4, max_edits_before = Inf, registered = TRUE) |
| Huggle Short 1 & 2 | TRUE | Diff_Edits_After_0_3 | 1.932 | 4.181 | test | 53.8% | 89.60% | 94 (test), 59 (control) | execute.main(min_edits_before = 3, max_edits_before = 50, registered = TRUE) |
| Huggle Short 1 & 2 | FALSE | Mean_Diff_Edits_Normalized | 0.7896 | 0.8331 | control (flipped) | 5.2% (fewer) | 73.69% | 657 (test), 447 (control) | execute.main(min_edits_before = 1, max_edits_before = Inf, registered = FALSE) |
| Huggle Short 1 & 2 (84, 85) | TRUE | Mean_Diff_Edits_Normalized | 0.7945 | 0.1359 | test | 82.9% (fewer) | 86.5% | 29 (test), 35 (control) | execute.main(min_edits_before = 5, max_edits_before = Inf, registered = TRUE) |
| Huggle Short 1 & 2 (84, 86) | TRUE | Mean_Diff_Edits_Normalized | 0.7793 | 0.5293 | test | 32.1% (fewer) | 87.1% | 39 (test), 44 (control) | execute.main(min_edits_before = 4, max_edits_before = Inf, registered = TRUE) |
| Huggle Short 1 & 2 (84, 85) | TRUE | Diff_Edits_After_0_3 | 2.057 | 4.958 | test | 111.9% | 87.5% | 49 (test), 54 (control) | execute.main(min_edits_before = 3, max_edits_before = 50, registered = TRUE) |
| Huggle Short 1 & 2 (84, 86) | TRUE | Diff_Edits_After_0_3 | 2.057 | 3.452 | test | 67.8% | 77.2% | 43 (test), 54 (control) | execute.main(min_edits_before = 3, max_edits_before = 50, registered = TRUE) |
| Huggle Short 2 | TRUE | Mean_Diff_Edits_Normalized | 0.8456 | 0.4190 | test | 50.4% (fewer) | 96.85% | 26 (test), 44 (control) | execute.main(min_edits_before = 5, max_edits_before = Inf, registered = TRUE) |
| Huggle Short 2 | FALSE | Mean_Diff_Edits_Normalized | 0.7947 | 0.7094 | test (flipped) | 10.7% (fewer) | 78.4% | 204 (test), 207 (control) | execute.main(min_edits_before = 2, max_edits_before = Inf, registered = FALSE) |
Test Descriptions

| Experiment | Description |
|---|---|
| Huggle 3 (60,62,66,76 vs. 61,63,67,77) | Huggle 3 experiments: spam, error, unsourced, and delete warnings. Shortened templates. |
| Huggle Short 1 & 2 | spam, error, unsourced |
| Huggle Short 1 & 2 (84, 85) | vandal; control against "without directives" |
| Huggle Short 1 & 2 (84, 86) | vandal; control against "short" |
| Huggle Short 2 | neutral point of view, biographical information about living persons, attack, blank content, delete (all long and short) |


Discussion & Lessons

The plots above depict the change in edit activity metrics against the minimum number of edits all editors in each group must have made before seeing the template. The metrics shown are defined as follows:

  • Mean Difference of edits before and edits 0-3 days after posting normalized by edits before (ME1) - def: (AVG(Edits before posting) - AVG(Edits 0-3 days after posting)) / AVG(Edits before posting) => Lower values are better
  • Mean Number of Edits 0-3 days after posting (ME2) - def: AVG(Edits 0-3 days after posting)
  • Mean Number of Edits 0-30 days after posting (ME3) - def: AVG(Edits 0-30 days after posting)


Note the interesting trend in the effect the template has on newer editors (generally fewer than 5 edits). In each case the effect not only flattens among more experienced editors, but the templates themselves seem to make less of a difference (the curves converge), with the exception of the Huggle short 2 experiment, where the effect could be delayed to a higher threshold; this poses an interesting exception (WHY?). The test group outperformed the control in all but the "Huggle 3" and "Huggle Short 2" experiments for non-registered users (WHY?). However, in the Huggle 3 experiment there was a good deal of variance among the template types themselves (i.e. templates that warn users about different policy violations or behaviour), and so it warrants more attention to fully understand the effect the test templates could have had.




Huggle short 2

For non-registered users the drop in edits from the control template was less than the test (71.38% vs. 79.91%) with a confidence level of 88.1%. For registered users the drop in edits from the test template was less than the control (55.63% vs. 88.27%) with a confidence level of 92.25%.


Therefore there was an observable effect in favour of the test for registered users and in favour of the control for non-registered users. It is also noteworthy that for non-registered users the difference was heavily skewed toward editors that made very few edits before the template posting. It should be further noted that much of the observed effect comes from editors that make no edits at all after the posting.

Note: Some templates in this test are very rarely used and may have added noise to the data. In particular: z107 & z108 (Uw-npov), z109 & z110 (Uw-bio), z111 & z112 (Uw-attack). We should rerun the numbers without these to see if the results are clearer; a sketch of that rerun follows.
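A one-line sketch of that rerun (the template-ID column name is an assumption):

# Drop the rarely used template pairs flagged above before re-running.
rare <- c("z107", "z108", "z109", "z110", "z111", "z112")
clean <- all_data[!all_data$template_id %in% rare, ]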

Huggle short 1 & 2

For non-registered users the drop in edits from the test template was less than the control (73.15% vs. 76.66%) with a confidence level of 88.1%. For registered users the drop in edits from the test template was less than the control (20.22% vs. 25.32%); however, the p-value (23.5% confidence) indicates that this result is not significant.


Therefore there was an observable effect of this template in favour of the test for non-registered users.


Huggle short 1 & 2 (84 vs. 85, 86)

z85 (24 samples) had a significant (88.80% confident) difference in the decrease in edits after posting over z84 (30 samples), 68.43% decrease vs. 83.40% decrease.
z86 had a semi-significant (72.5% confident) difference in the decrease in edits after posting over z84, 64.91% decrease vs. 76.44% decrease.
z85 (45 samples) had a significant (84.80% confident) difference in the mean edit count after posting over z84 (45 samples), 2.356 vs. 1.289.
z86 (38 samples) had a significant (84.20% confident) difference in the mean edit count after posting over z84 (38 samples), 2.500 vs. 1.289.


There is a lower threshold before the effect of the template becomes significant. There is also an interesting swapping effect for these templates: the control produces a stronger mean edit count when editors with only a single edit before the posting are included, while the test becomes significantly dominant on mean edit count when filtering on larger prior-edit counts.


Huggle 3 (60,62,66,76 vs. 61,63,67,77)

For non-registered users the mean decrease in edits for the test exceeded the control, 83.83% vs. 75.02%; the result is 94.59% confident. For registered users the mean decrease in edits for the control exceeded the test, 83.20% vs. 70.58%; the result is 84.00% confident.


The direction of the effect is swapped between registered and non-registered users.