Research talk:Autoconfirmed article creation trial/Work log/2018-01-17

Wednesday, January 17, 2018

Today I'll continue analyzing the deletion dataset I gathered yesterday, catch the WMF Research Showcase, put together an improved analysis of expected survival in the Draft namespace during ACTRIAL, and start planning how to gather data on the quality of Drafts and AfC submissions.

Deletion statistics

Using the approach outlined yesterday, I wrote up a Python script that goes through deletions in the log table and figures out if the comment defines why the page was deleted. I then exported the dataset and started processing it.

The code gathers data about deletions in the Main, User, and Draft namespaces, and captures whether a page was deleted due to one of the General (Gxx), Article (Axx), or User (Uxx) reasons. In addition, we also capture Proposed deletions (PROD/BLPPROD) and Articles for Deletion (AfD). If a deletion does not match any of these, it is added to an "other" category.
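The classification step described above can be sketched roughly as follows. This is a minimal illustration, not the script I actually ran: the regular expressions, the function name `classify_deletion`, and the order in which the patterns are tried are all assumptions, and real log comments are messier than these patterns allow for.

```python
import re

# Assumed patterns for illustration only; real deletion comments vary widely.
CSD_PATTERN = re.compile(r"\b([GAU])(\d{1,2})\b")          # e.g. G13, A7, U5
BLPPROD_PATTERN = re.compile(r"BLP[ _]?PROD", re.IGNORECASE)
PROD_PATTERN = re.compile(r"\bPROD\b|proposed deletion", re.IGNORECASE)
AFD_PATTERN = re.compile(r"Articles for deletion/|\bAfD\b", re.IGNORECASE)


def classify_deletion(comment: str) -> str:
    """Return a coarse deletion category for a single log comment."""
    match = CSD_PATTERN.search(comment)
    if match:
        # Speedy deletion criterion, returned as letter plus number.
        return match.group(1) + match.group(2)
    if BLPPROD_PATTERN.search(comment):
        return "BLPPROD"
    if PROD_PATTERN.search(comment):
        return "PROD"
    if AFD_PATTERN.search(comment):
        return "AfD"
    # Anything that matches none of the patterns.
    return "other"
```

Trying the speedy deletion criteria first is a design choice in this sketch: a comment that cites a criterion and also links to a deletion discussion is then counted under the criterion.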

The first order of business was to understand the extent to which these reasons are used across the three namespaces we are concerned with. To understand that, I plotted their usage for each namespace. The graphs are too untidy to share, so I summarize what I found from inspecting them:

User reasons are only used rarely outside the User namespace. I suspect this is due to user error, e.g. someone happens to click the wrong reason in a popup. There is some recent usage in the Draft namespace, but I do not yet know why.
Article reasons are rarely used outside the Main namespace. There is some usage here and there, but it is rare enough that it is not meaningful to count and plot it.
(BLP)PROD is only used in the Main namespace.

After inspecting the graphs, I decided to combine categories for each namespace as follows:

Main: General, Article, AfD, and PROD, with "other" as anything not matching any of those (including all User-related reasons).
User: General, User, AfD, and "other" as anything not matching any other reason.
Draft: General, AfD, and "other" for everything else.
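The per-namespace grouping listed above amounts to a small lookup. The sketch below assumes fine-grained category labels such as "G13", "A7", "U5", "PROD", "BLPPROD", and "AfD" as input, and assumes PROD and BLPPROD are counted together; the names `KEPT_GROUPS`, `reason_group`, and `collapse` are hypothetical.

```python
# Groups kept per namespace, following the list above.
# Everything else is folded into "other".
KEPT_GROUPS = {
    "Main": {"General", "Article", "AfD", "PROD"},
    "User": {"General", "User", "AfD"},
    "Draft": {"General", "AfD"},
}

CSD_GROUPS = {"G": "General", "A": "Article", "U": "User"}


def reason_group(category: str) -> str:
    """Map a fine-grained category (e.g. "G13") to its broad group."""
    if category in ("PROD", "BLPPROD"):
        return "PROD"  # assumption: both kinds of PROD are counted together
    if category == "AfD":
        return "AfD"
    if len(category) >= 2 and category[1:].isdigit():
        return CSD_GROUPS.get(category[0], "other")
    return "other"


def collapse(namespace: str, category: str) -> str:
    """Return the group used for plotting in the given namespace."""
    group = reason_group(category)
    return group if group in KEPT_GROUPS[namespace] else "other"
```

For example, `collapse("Main", "U5")` gives "other", while `collapse("User", "U5")` gives "User".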

Draft namespace findings

Because the number of reasons for deletions is smallest for the Draft namespace, I decided to start analyzing that first. It quickly became clear that having 15 separate reasons was cumbersome, as it is difficult to find good color schemes for plotting that many separate categories. I therefore inspected the usage of each reason and found that G9 (office action) was never used. There were also several categories that were used sparingly, meaning on average less than once a day. One of them was G10 (attack pages), used a total of 1,187 times over about four years. Given that attack pages are a fairly serious concern, I decided to keep that as the smallest category and add any below it to the "other" category. This meant that AfD, G1 (patent nonsense), G4 (recreation of a deleted page), and G9 (office action) were removed and combined into the "other" category.

We can then count the total number of deletions and create a sorted table to show why pages tend to be deleted in the Draft namespace:

Category	Reason	Number of deletions	%
G13	Abandoned draft or AfC	64,378	41.1
Other	Not matching another category	51,340	32.8
G11	Unambiguous advertisement or promotion	12,345	7.9
G12	Unambiguous copyright infringement	6,600	4.2
G7	Author requests deletion	5,270	3.4
G8	Depends on nonexistent/deleted page	4,806	3.1
G3	Pure vandalism and blatant hoaxes	3,410	2.2
G2	Test pages	3,217	2.1
G6	Technical deletions	2,574	1.6
G5	Creations by banned or blocked users	1,529	1.0
G10	Attack pages	1,187	0.8
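The percentage column can be reproduced from the counts in the table. A minimal sketch, with the counts copied from the table above and the variable names chosen for illustration:

```python
# Deletion counts per category in the Draft namespace, copied from the table.
counts = {
    "G13": 64378,
    "Other": 51340,
    "G11": 12345,
    "G12": 6600,
    "G7": 5270,
    "G8": 4806,
    "G3": 3410,
    "G2": 3217,
    "G6": 2574,
    "G5": 1529,
    "G10": 1187,
}

total = sum(counts.values())

# Share of all Draft deletions per category, in percent with one decimal.
shares = {category: round(100 * n / total, 1) for category, n in counts.items()}

# Print the table sorted by number of deletions, largest first.
for category, n in sorted(counts.items(), key=lambda item: item[1], reverse=True):
    print(f"{category}\t{n}\t{shares[category]}")
```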

I then plotted them as a stacked line plot, finding that the noisy data made it rather unreadable. To smooth the plot, I decided to use a 28-day moving median. I chose the median because it is more resistant to outliers than a moving average, and used 28 days because a 7-day window was still rather noisy. In our case, there are some days with a huge number of deletions (over 10,000), likely bot-driven deletions of stale drafts. These can greatly affect a moving average calculation. I also chose to plot the larger categories at the bottom of the plot so that these are more clearly seen. The resulting plot is shown below:
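To illustrate why the median was preferred, here is a small sketch comparing a 28-day moving median with a 28-day moving average on synthetic data. The daily counts are made up (a steady 100 per day with one bulk-deletion day), and the trailing window is an assumption; the plots may well have used a different window alignment.

```python
from statistics import mean, median


def rolling(values, window, func):
    """Apply func over a trailing window of the given size.

    The first window - 1 positions use only the values available so far.
    """
    return [
        func(values[max(0, i - window + 1): i + 1])
        for i in range(len(values))
    ]


# Synthetic daily deletion counts, not taken from the dataset:
# 60 ordinary days with a single day of 11,000 deletions.
daily = [100] * 60
daily[30] = 11000

smooth_median = rolling(daily, 28, median)
smooth_mean = rolling(daily, 28, mean)

print(max(smooth_median), max(smooth_mean))
```

With these numbers the moving median never leaves 100, while the moving average stays inflated for every window that contains the spike.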

Do not pay attention to the Y-axis labels; Wikipedia does not have billions of Draft page deletions. They are an artifact of combining a stacked plot with a log scale. From the plot, it appears that deletions for reasons other than G13 are fairly stable across time. There are some exceptions, but we also know from the table that the five most used categories (including "other") cover about 89% of all deletions in this dataset.

Perhaps more important with regard to ACTRIAL is the extent to which the trial has affected the number of deletions in the Draft namespace. To examine this in more detail, we make plots of the total number of deletions over the whole dataset, as well as since 2017. First, let's look at the entire dataset:

There are several things to note in the graph above. First of all, we see that deletions in the Draft namespace first appear in late 2013 (our data gathering starts in 2009, but before then we found no deletions). Until late spring 2014, the number of deletions per day is usually below a dozen. At that point, it starts picking up and is fairly consistently between 10 and 100. Then in late 2014 we see it jump up again and stabilize around 100 per day, and it has generally stayed there since then. There are some exceptions to the general rule, as we see peaks of deletions in the thousands on some days. The highest peak is in April 2017, when over 11,000 pages were deleted in order to clean up problematic BLPs created by a specific user.

We also see some indications of a higher number of deletions per day in the second half of 2017, so let's plot only 2017 onwards to see more closely what's going on. This graph has a dotted line added to show when ACTRIAL started:

In this case, it is not obvious that the number of deletions in the Draft namespace has increased during ACTRIAL compared to what happened previously. One of the things that makes this difficult to ascertain is that deletions happen irregularly. For example, there might be a high number of deletions one day, and then few deletions the next day, instead of a fairly steady number of deletions every day. Overall, comparing the first two and a half months of ACTRIAL with the same time periods in 2015 and 2016 suggests a modest but significant increase in the number of deletions per day from about 90 to 120 (geometric mean or median). Note that at the same time, the number of created pages in the Draft namespace has increased by about 100–150.
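The comparison above is reported as a geometric mean or median rather than an arithmetic mean. A short sketch of why that matters when a few days have very large counts; the numbers below are synthetic and only meant to show the effect, not to reproduce the 90 and 120 figures.

```python
from statistics import geometric_mean, mean, median

# Synthetic daily deletion counts, not taken from the dataset:
# six ordinary days and one bulk-deletion day.
daily = [80, 95, 110, 90, 105, 85, 5000]

arithmetic = mean(daily)          # pulled far upward by the single large day
geometric = geometric_mean(daily) # dampens the influence of the large day
middle = median(daily)            # unaffected by how large the large day is

print(arithmetic, geometric, middle)
```

`statistics.geometric_mean` requires Python 3.8 or later and strictly positive inputs, so days with zero deletions would need separate handling.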

Survival of Draft creators

We looked at survival rates of users who created Drafts in the December 18 work log. In that analysis, a "surviving editor" is someone who makes at least one edit in their first and fifth week after registering, and a Draft creator is someone who made a page in the Draft namespace during the first week. We compared data on survival of these Draft creators prior to ACTRIAL with their survival rate during ACTRIAL, and found a significant decrease.
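The "surviving editor" definition can be written down as a small predicate. This is a sketch under one reading of the definition: week 1 is taken to be days 0 to 7 after registration and week 5 to be days 28 to 35. The function name `is_survivor` and that interpretation of the week boundaries are assumptions.

```python
from datetime import datetime, timedelta
from typing import Sequence


def is_survivor(registered: datetime, edits: Sequence[datetime]) -> bool:
    """True if the account edited in both its first and its fifth week."""

    def edited_in_week(n: int) -> bool:
        # Week n covers [registered + (n-1) weeks, registered + n weeks).
        start = registered + timedelta(weeks=n - 1)
        end = registered + timedelta(weeks=n)
        return any(start <= timestamp < end for timestamp in edits)

    return edited_in_week(1) and edited_in_week(5)
```

For example, an account registered on 1 October 2017 that edits on 2 October and 31 October counts as surviving, while one that edits on 2 October and 11 October does not.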

One potential issue with that analysis is that Draft creators prior to ACTRIAL would self-select: they were not forced to create a page there but could instead create their page in the Main namespace. This would arguably inflate the survival rate of Draft creators, as they made a conscious choice. Now that ACTRIAL is active, this is no longer a choice: newly registered accounts are required to create a draft in either the User or Draft namespaces. We would also suspect that some users come to Wikipedia to create an article, but are unlikely to return regardless of what happens to that article. We might hypothesize that there was a higher proportion of these in the Main namespace before ACTRIAL. In summary, this means we should adjust our prior expectation of the survival rate of Draft creators.

To mitigate this issue, we chose to sample users who created articles (pages in the Main namespace) during the relevant period in 2015 and 2016, which is our "pre-ACTRIAL" time period. We sampled enough users to make the total number of users in the pre-ACTRIAL condition about the same as the ACTRIAL condition. Then, we get the following contingency matrix:

	Non-survivor	%	Survivor	%	Row total	%
Pre-ACTRIAL w/Main sample	9,859	95.5%	465	4.5%	10,324	100.0%
ACTRIAL	9,816	95.4%	470	4.6%	10,286	100.0%
Total	19,675	95.5%	935	4.5%	20,610	100.0%

Here we can clearly see that the ACTRIAL survival rate is similar to what we would expect if we combined the survival rate of Draft and Main prior to ACTRIAL (X²=0.1, df=1, p=0.75).
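A test of this kind can be sketched in plain Python for a 2x2 table. This is a Pearson chi-squared test without continuity correction, using the counts from the table above; the function name `chi_square_2x2` is hypothetical, and whether the reported figures used a continuity correction is not stated, so the exact statistic and p-value need not match them to the last digit.

```python
from math import erfc, sqrt


def chi_square_2x2(table):
    """Pearson chi-squared test (no continuity correction) for a 2x2 table.

    Returns the test statistic and the p-value for one degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected

    # For df = 1 the chi-squared tail probability equals erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


# Rows: pre-ACTRIAL with Main sample, ACTRIAL.
# Columns: non-survivors, survivors.
stat, p = chi_square_2x2([[9859, 465], [9816, 470]])
print(stat, p)
```

The statistic comes out far below 3.84, the 5% critical value for one degree of freedom, which is consistent with the conclusion that the two survival rates do not differ significantly.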
