Tuesday, January 23, 2018

Today I aim to wrap up our analysis of deletions by looking at specific reasons in Draft, and doing a full analysis of the User namespace, then move on to writing code for our Draft/AfC quality prediction data gathering.

Draft namespace deletions

Yesterday I did a breakdown of deletions in the Main namespace, measuring changes in number of deletions per day for each reason as well as changes in proportions. I'd like to have a similar table for Draft and User namespaces, so we'll first generate one for Draft. It looks like this:

Reason | Deletions/day pre-ACTRIAL | Deletions/day ACTRIAL | Delta deletions/day | Delta deletions/day (%) | Proportion pre-ACTRIAL (%) | Proportion ACTRIAL (%) | Delta proportion (%)
G2 | 2.32 | 3.05 | 0.73 | 31.4 | 2.53 | 2.39 | −0.14
G3 | 2.48 | 4.07 | 1.59 | 63.7 | 2.71 | 3.18 | 0.47
G5 | 0.88 | 2.93 | 2.05 | 234.6 | 0.96 | 2.30 | 1.34
G7 | 2.84 | 4.18 | 1.34 | 47.0 | 3.11 | 3.27 | 0.16
G10 | 1.05 | 1.25 | 0.20 | 18.8 | 1.15 | 0.98 | −0.17
G11 | 11.29 | 19.59 | 8.30 | 73.6 | 12.33 | 15.34 | 3.01
G12 | 3.81 | 6.80 | 2.99 | 78.5 | 4.16 | 5.33 | 1.17
G13 | 54.14 | 62.79 | 8.65 | 16.0 | 59.13 | 49.17 | −9.96
AfD | 0.88 | 1.70 | 0.82 | 92.7 | 0.97 | 1.34 | 0.37
Other | 11.87 | 21.34 | 9.47 | 79.8 | 12.96 | 16.71 | 3.75
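To make the column definitions concrete, here is a minimal sketch of the arithmetic behind the table. It starts from the rounded deletions/day figures above and recomputes the delta and proportion columns. The names `RATES` and `delta_table` are made up for this illustration, and the per-period totals are obtained by summing the table's own columns, which is an assumption about how the proportions were derived. Because the inputs are already rounded, some recomputed percentages differ from the table in the last digit.

```python
# Deletions/day per reason, copied from the Draft table above:
# reason -> (pre-ACTRIAL, ACTRIAL). All other columns are derived from these.
RATES = {
    "G2": (2.32, 3.05), "G3": (2.48, 4.07), "G5": (0.88, 2.93),
    "G7": (2.84, 4.18), "G10": (1.05, 1.25), "G11": (11.29, 19.59),
    "G12": (3.81, 6.80), "G13": (54.14, 62.79), "AfD": (0.88, 1.70),
    "Other": (11.87, 21.34),
}


def delta_table(rates):
    """Recompute the delta and proportion columns from the two rate columns."""
    # Assumption: proportions are relative to the sum over all listed reasons.
    pre_total = sum(pre for pre, _ in rates.values())
    trial_total = sum(trial for _, trial in rates.values())
    rows = {}
    for reason, (pre, trial) in rates.items():
        prop_pre = 100 * pre / pre_total
        prop_trial = 100 * trial / trial_total
        rows[reason] = {
            "delta": trial - pre,                      # deletions/day
            "delta_pct": 100 * (trial - pre) / pre,    # relative change
            "prop_pre": prop_pre,                      # share of all deletions
            "prop_trial": prop_trial,
            "delta_prop": prop_trial - prop_pre,       # percentage points
        }
    return rows


ROWS = delta_table(RATES)
for reason, row in ROWS.items():
    print(reason, {name: round(value, 2) for name, value in row.items()})
```

Note that the last column is a difference between two percentages, so it is measured in percentage points. That is why G13 can gain 8.65 deletions/day while its share of all Draft deletions drops by almost ten points.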

Update, February 13, 2018: The criteria for G13 were expanded in late August 2017, per this RfC. Thank you to TonyBallioni for notifying us of this! The expansion affects our results in that G13 deletions during ACTRIAL might have increased because of it, and not necessarily as an effect of the trial. Determining this is currently outside the scope of the project, but could be the focus of future work. Note that even when G13 is excluded from the table, there is still an overall increase in deletions compared to previous years.

User namespace deletions

Per the data gathering and initial analysis done on January 17, our data on deletion reasons in the User namespace covers all General and User CSD criteria, with AfD as a separate category, and lastly "other" as everything else. Note that G6 and G8 are regarded as mainly redirect-related deletions and are therefore not part of the dataset.
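As an illustration of the categorisation described above, here is a rough sketch of mapping a deletion log comment to a category. This is not the code used for the January 17 data gathering. It assumes that comments mention the criterion as a standalone token such as "G11" or "U5", and that AfD-based deletions mention "Articles for deletion". The function name `classify` and both patterns are hypothetical.

```python
import re

# Criteria tracked in the dataset (General and User CSD).
TRACKED = {
    "G1", "G2", "G3", "G4", "G5", "G7", "G9", "G10", "G11", "G12", "G13",
    "U1", "U2", "U3", "U5",
}
# G6 and G8 are regarded as mainly redirect-related and left out entirely.
EXCLUDED = {"G6", "G8"}

# Assumption: the criterion shows up as a standalone token, e.g. "G11".
CSD_PATTERN = re.compile(r"\b([GU]\d{1,2})\b")
# Assumption: AfD deletions link to an "Articles for deletion" page.
AFD_PATTERN = re.compile(r"articles for deletion", re.IGNORECASE)


def classify(comment):
    """Return a category for a log comment, or None if it is excluded."""
    for match in CSD_PATTERN.finditer(comment):
        code = match.group(1)
        if code in EXCLUDED:
            return None
        if code in TRACKED:
            return code
    if AFD_PATTERN.search(comment):
        return "AfD"
    return "Other"


print(classify("[[WP:CSD#G11|G11]]: Unambiguous advertising or promotion"))
print(classify("[[WP:CSD#G8|G8]]: Redirect to a deleted page"))
print(classify("[[Wikipedia:Articles for deletion/Example]]"))
print(classify("housekeeping"))
```

A comment that cites several criteria is assigned to the first recognised one here. That is a simplification, and the real dataset may handle such cases differently.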

We first look at the overall number of pages deleted across time, as well as in the period around ACTRIAL. As we did for the Main and Draft namespaces, we plot this on a daily basis with a trend line added. First, historical data from Jan 1, 2009 onwards:

Based on the historical graph above, it appears that deletions in the User namespace are fairly stable across time at somewhere between 100–200 pages per day. There's an increase in 2014, 2015, and the first half of 2016, before the level settles down again in the second half of 2016. We can also see many deletion drives occurring, with peaks in the thousands of pages. Further analysis could give us more information about why exactly these deletions occur, but we'll leave that for now and instead look at 2017:

Deletion behaviour prior to and during ACTRIAL appears to be similar to previous years. There's a general level between 100 and 200 pages per day, and then some larger peaks here and there. We do see a noticeable increase in the second half of September, and I'll look into whether that's related to ACTRIAL.

There is mainly a single day in the last two weeks of September 2017 that has a large number of deletions in the User namespace, and that's the 19th with 2,215 deletions. In the graph above we can also see three other days with a fairly large number of deletions: the 25th with 400, the 28th with 647, and the 29th with 448. The September 19 deletions are mainly due to one MfD discussion. A lot of the deletions on the 25th are stale drafts, but there are also plenty of other reasons (e.g. advertisement pages). The same goes for the 28th and 29th. This suggests that on a typical day, there is a steady number of speedy deletions due to advertisements, copyright infringement, and such, and that atypical days tend to be driven by deletions of stale drafts.

First, let's look at overall usage of reasons for deleting pages in the User namespace:

Category | Reason | Number of deletions | %
G11 | Unambiguous advertisement or promotion | 163,968 | 30.9
Other | Not matching another category | 133,527 | 25.2
U1 | User request | 101,416 | 19.1
U5 | Blatant misuse of Wikipedia as a web host | 39,192 | 7.4
G13 | Abandoned draft or AfC | 20,194 | 3.8
G7 | Author requests deletion | 18,048 | 3.4
G3 | Pure vandalism and blatant hoaxes | 14,194 | 2.7
G12 | Unambiguous copyright infringement | 9,186 | 1.7
G10 | Attack pages | 8,251 | 1.6
G5 | Creations by banned or blocked users | 8,138 | 1.5
U2 | Nonexistent user | 6,115 | 1.2
G2 | Test pages | 3,720 | 0.7
AfD | Articles for Deletion | 2,148 | 0.4
G1 | Patent nonsense | 1,400 | 0.3
G4 | Recreation of a deleted page | 528 | 0.1
U3 | Non-free galleries | 37 | 0.0
G9 | Office actions | 0 | 0.0
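The percentages in the table above can be checked from the counts alone. Below is a minimal sketch that ranks the categories and computes the combined share of everything below the 12th most used one. The variable names are made up for this illustration; the counts are copied from the table.

```python
# Number of deletions per category, copied from the table above.
COUNTS = {
    "G11": 163968, "Other": 133527, "U1": 101416, "U5": 39192,
    "G13": 20194, "G7": 18048, "G3": 14194, "G12": 9186, "G10": 8251,
    "G5": 8138, "U2": 6115, "G2": 3720, "AfD": 2148, "G1": 1400,
    "G4": 528, "U3": 37, "G9": 0,
}

TOTAL = sum(COUNTS.values())

# Rank categories from most to least used.
RANKED = sorted(COUNTS.items(), key=lambda item: item[1], reverse=True)
TOP_12 = RANKED[:12]
REST = RANKED[12:]

# Combined share (%) of the categories below the 12th most used one.
REST_SHARE = 100 * sum(count for _, count in REST) / TOTAL

print("total deletions:", TOTAL)
print("12th most used:", TOP_12[-1][0])
print("share of the rest: %.1f%%" % REST_SHARE)
```

The categories below the cut-off (AfD, G1, G4, U3, and G9) add up to 4,113 of 530,062 deletions.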

G2 (test pages) is the 12th most used reason. The remaining reasons account for 0.8% of all deletions, meaning that it's fairly safe to combine them into "other" to enable us to plot a historical graph showing the 12 most common reasons. We get the following historical plot, using a 28-day moving median to smooth the time series:
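For reference, a minimal sketch of the smoothing step. It uses a trailing window, which is an assumption, since a centred window would work as well and the choice only shifts the curve in time. The function name `moving_median` and the short example series are made up for illustration; only the value 2,215 comes from the data (September 19, 2017).

```python
from statistics import median


def moving_median(values, window=28):
    """Trailing moving median over a list of daily counts.

    Positions with fewer than `window` observations available are None.
    """
    smoothed = []
    for i in range(len(values)):
        if i + 1 < window:
            smoothed.append(None)
        else:
            smoothed.append(median(values[i + 1 - window:i + 1]))
    return smoothed


# Illustrative daily counts with one deletion-drive spike, and a short
# window so that the effect is easy to follow by hand.
daily = [100, 120, 2215, 130, 110]
print(moving_median(daily, window=3))
```

The median is a good fit for this data because a single deletion drive with thousands of pages barely moves it, whereas a moving mean would be pulled up for the whole window.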

In the plot above, we can see how G11, "other", and U1 are fairly stable across time. We can also see the introduction of U5 (blatant misuse of Wikipedia as a web host) in 2014, and how G13 (stale drafts) deletions start in 2013 and are used to a large extent in 2014 and 2015.

Looking more specifically at 2017 to understand changes during ACTRIAL, we create this plot:

From the plot, it is not obvious that ACTRIAL has caused any changes in reasons for deletions in the User namespace. There is an uptick in deletions of stale drafts, but keep in mind that those must have gone unedited for six months and therefore cannot have been created during ACTRIAL. The uptick might therefore simply be related to increased activity on Wikipedia during the fall.

We run the same analysis as before in order to investigate whether deletions in the User namespace have changed significantly during ACTRIAL. A histogram suggests that the distribution is somewhat more right-skewed than it was for the Main and Draft namespaces. We again use a log transformation to combat the skewness, but notice that the resulting distribution is still right-skewed. This is some reason for concern in case we have a marginally significant result.

From the plots of the total number of deletions, it appears that the rate of deletions in the User namespace has been fairly consistent from 2012 onwards, so we reuse the five-year period prior to ACTRIAL as our baseline. As mentioned, the data is log-transformed prior to the t-test (log2(1 + x)).

We find that there is a slight increase in the deletion rate during ACTRIAL. The median was 137 pages/day pre-ACTRIAL and is 151 during ACTRIAL. The geometric means are 141.6 and 161.0, respectively. The t-test suggests this is marginally significant: t=-1.9968, df=74.975, p=0.049. Given the outliers and the skewed distribution, I am hesitant to declare this a significant increase. Compare these results to those for Draft and Main, where the effects have been stronger. As mentioned, we also see some outlier days with strong usage of specific reasons for deletions.
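A minimal sketch of this kind of test, for reference. The fractional degrees of freedom reported above suggest an unequal-variance (Welch) t-test, but that is an inference on my part, so treat the choice as an assumption. The function names and the daily counts in the example are invented for illustration, and the sketch stops at the t statistic and degrees of freedom since the standard library has no t distribution.

```python
import math
from statistics import mean, variance


def log2p1(values):
    """Apply the log2(1 + x) transformation to daily deletion counts."""
    return [math.log2(1 + x) for x in values]


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    var_a = variance(a) / len(a)  # squared standard error of mean(a)
    var_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(a) - 1) + var_b ** 2 / (len(b) - 1)
    )
    return t, df


def back_transformed_mean(log_values):
    """Undo log2(1 + x) on the mean, giving a geometric-style mean."""
    return 2 ** mean(log_values) - 1


# Invented daily counts, only to show the mechanics.
pre = log2p1([120, 137, 150, 131, 160, 142])
trial = log2p1([151, 170, 140, 165])

t, df = welch_t(pre, trial)
print("t = %.3f, df = %.3f" % (t, df))
print("means: %.1f vs %.1f" % (back_transformed_mean(pre), back_transformed_mean(trial)))
```

With the baseline as the first sample, a negative t means the rate during ACTRIAL is higher, matching the sign reported above. To get a p-value as well, `scipy.stats.ttest_ind(pre, trial, equal_var=False)` runs the same Welch test.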

Lastly we look at changes in usage of various reasons prior to and during ACTRIAL. In this case, because U5 was introduced in 2014, per our graph above, we restrict the pre-ACTRIAL period to 2014, 2015, and 2016. If we also use 2012 and 2013, we find a large increase in the usage of U5, as one would expect. Comparing the three years prior with ACTRIAL, we get the following table of changes:

Reason | Deletions/day pre-ACTRIAL | Deletions/day ACTRIAL | Delta deletions/day | Delta deletions/day (%) | Proportion pre-ACTRIAL (%) | Proportion ACTRIAL (%) | Delta proportion (%)
Other | 24.25 | 55.48 | 31.23 | 128.8 | 15.26 | 28.38 | 13.12
G2 | 1.24 | 1.43 | 0.19 | 15.0 | 0.78 | 0.73 | −0.05
G3 | 4.17 | 4.52 | 0.35 | 8.4 | 2.63 | 2.31 | −0.32
G5 | 2.08 | 1.89 | −0.20 | −9.5 | 1.31 | 0.96 | −0.35
G7 | 4.34 | 3.02 | −1.32 | −30.5 | 2.73 | 1.54 | −1.19
G10 | 2.14 | 2.69 | 0.55 | 25.5 | 1.35 | 1.38 | 0.03
G11 | 50.31 | 53.08 | 2.78 | 5.5 | 31.66 | 27.16 | −4.50
G12 | 3.25 | 2.92 | −0.33 | −10.1 | 2.04 | 1.49 | −0.55
G13 | 20.37 | 13.41 | −6.96 | −34.2 | 12.82 | 6.86 | −5.96
U1 | 24.77 | 16.59 | −8.18 | −33.0 | 15.59 | 8.49 | −7.10
U2 | 1.48 | 1.39 | −0.09 | −5.5 | 0.93 | 0.71 | −0.22
U5 | 20.48 | 39.05 | 18.57 | 90.7 | 12.89 | 19.98 | 7.09
