Research talk:Autoconfirmed article creation trial/Work log/2018-01-19

Friday, January 19, 2018

Today I'll continue the analysis of deletions in Main and User namespaces, formalize my data gathering of quality predictions in Draft and AfC, and write the Python code to gather Draft/AfC data.

Improving the data gathering

The initial analysis of deletions in the Draft namespace (see Wednesday's work log) was fairly straightforward, mainly because only a small number of deletion reasons are used in that namespace. Working on deletions in the Main namespace, I understood that there was room for improvement in our data gathering and analysis. The two areas of concern were how we handle redirects and whether our regular expression for capturing references to deletion criteria misses references.

A spot check of some log comments from early 2009 suggested that switching our regular expression from being anchored at the beginning of the comment to matching anywhere within the comment would capture more references. Secondly, we found some usage of "WP:Criteria for speedy deletion#" and not just "WP:CSD#". I therefore altered the regular expression for CSD so it would allow for both variants, and at the same time also allowed for using both "WP:" and "Wikipedia:". Note that we expect most deletions to be done through tools that leave standardized comments, meaning that our approach will capture the vast majority of these references. Diving further into usage of references to policy in deletion comments is outside the scope of this project.
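The change described above can be sketched in Python roughly as follows. This is only an illustration: the work log does not show the actual regular expression, so the pattern, the name CSD_RE, and the helper find_csd_codes are assumptions, as is the choice to match case-insensitively and to accept underscores in place of spaces.

```python
import re

# Hypothetical pattern (not the project's actual expression). It accepts both
# "WP:" and "Wikipedia:" prefixes and both the "CSD#" shortcut and the full
# "Criteria for speedy deletion#" page name, followed by a code such as G11.
CSD_RE = re.compile(
    r"(?:WP|Wikipedia):"
    r"(?:CSD|Criteria[ _]for[ _]speedy[ _]deletion)"
    r"#([A-Z]\d{1,2})",
    re.IGNORECASE,
)


def find_csd_codes(comment):
    """Return every CSD code referenced anywhere in a deletion log comment."""
    # finditer() scans the whole comment, so the reference does not have to
    # sit at the beginning of the comment (unlike an anchored match()).
    return [m.group(1).upper() for m in CSD_RE.finditer(comment)]
```

Using finditer rather than match is what moves the search from "anchored at the beginning" to "anywhere within the comment"; a comment that references several criteria yields several codes.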

In my initial analysis of Main namespace deletions yesterday, I noticed some spikes. These can come from deletion of redirects. It is not straightforward to identify whether a deleted page was a redirect (there is no boolean "is_redirect" flag in the archive table like there is in the page table). We can instead pick up reasons for deletion that refer to redirects and filter them out that way. Inspecting the criteria for speedy deletion we can see that R2 and R3 refer directly to redirects, and X1 does as well. Lastly, G6 and G8 also often refer to redirects (e.g. a redirect pointing to a deleted page, or a disambiguation page being deleted). I therefore propose that we capture these and remove them from consideration, as the other reasons are more likely to refer to deletion of articles.
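A minimal sketch of that filtering step, assuming each deletion has already been reduced to a page title and a list of extracted criteria codes (the record shape and the function names are hypothetical):

```python
# Criteria treated as redirect-related in the proposal above. R2, R3 and X1
# refer directly to redirects; G6 and G8 only often do, so dropping them is a
# deliberately conservative choice that may also discard some non-redirects.
REDIRECT_RELATED = frozenset({"R2", "R3", "X1", "G6", "G8"})


def is_redirect_related(codes):
    """True if any of the criteria codes is in the redirect-related set."""
    return any(code in REDIRECT_RELATED for code in codes)


def drop_redirect_deletions(deletions):
    """Keep only (title, codes) pairs with no redirect-related criterion."""
    return [(title, codes) for title, codes in deletions
            if not is_redirect_related(codes)]
```

Deletions with no recognized criterion are kept by this sketch, which matches the idea that they fall into the "Other" category rather than being discarded.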

Updated draft deletion analysis

Removing G6 and G8, we get the following sorted list of reasons and usage:

Category  Reason                                  Number of deletions      %
G13       Abandoned draft or AfC                               65,007   43.3
Other     Not matching another category                        47,420   31.6
G11       Unambiguous advertisement or promotion               14,086    9.4
G12       Unambiguous copyright infringement                    6,675    4.4
G7        Author requests deletion                              5,366    3.6
G3        Pure vandalism and blatant hoaxes                     3,527    2.4
G2        Test pages                                            3,308    2.2
G5        Creations by banned or blocked users                  1,605    1.1
G10       Attack pages                                          1,200    0.8
AfD       Articles for Deletion                                 1,074    0.7
G1        Patent nonsense                                         552    0.4
G4        Recreation of a deleted page                            259    0.2
G9        Office action                                             0    0.0
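The percentage column can be recomputed from the counts as a sanity check. The sketch below copies the counts from the table above; the function name percentage_shares and the rounding to one decimal are my own choices, not necessarily how the table was produced.

```python
def percentage_shares(counts):
    """Map each category to its share of all deletions, in percent."""
    total = sum(counts.values())
    if total == 0:
        # Avoid dividing by zero when there are no deletions at all.
        return {category: 0.0 for category in counts}
    return {category: round(100 * n / total, 1)
            for category, n in counts.items()}


# Counts copied from the Draft namespace table.
draft_counts = {
    "G13": 65007, "Other": 47420, "G11": 14086, "G12": 6675, "G7": 5366,
    "G3": 3527, "G2": 3308, "G5": 1605, "G10": 1200, "AfD": 1074,
    "G1": 552, "G4": 259, "G9": 0,
}
draft_shares = percentage_shares(draft_counts)
```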

As before, I combine G9, G4, and G1 into "other" and keep the remaining categories. That gives me a total of 11 categories, and I can plot these as before:


The two graphs of the total number of Draft deletions, over time and from Jan 1, 2017, were also updated:


Using the new dataset, a comparison of the first two and a half months of ACTRIAL against the same time period in 2015 and 2016 continues to find a significant increase (a median of 86.5 for 2015 and 2016, compared with 113 in 2017).
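For illustration, the median comparison could be computed along these lines. The function and its input shapes are hypothetical: it assumes a mapping from calendar day to number of deletions and pools several date windows under one label. It only computes medians; the statistical test behind "significant" is not named in this log and is not reproduced here.

```python
from datetime import date
from statistics import median
from typing import Dict, List, Optional, Tuple

Window = Tuple[date, date]  # start inclusive, end exclusive


def median_daily_deletions(
    daily_counts: Dict[date, int],
    periods: Dict[str, List[Window]],
) -> Dict[str, Optional[float]]:
    """Median of daily counts per labelled period, pooling its windows."""
    result: Dict[str, Optional[float]] = {}
    for label, windows in periods.items():
        values = [
            n for day, n in daily_counts.items()
            if any(start <= day < end for start, end in windows)
        ]
        # A period with no observations has no meaningful median.
        result[label] = median(values) if values else None
    return result
```

Pooling the 2015 and 2016 windows under a single label corresponds to reporting one median for both years together, as done above.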

Article deletions

We make a similar update of the graph of total deletions in the Main namespace as well:


Secondly, we update the table of usage:

Category  Reason                                  Number of deletions      %
A7        No indication of importance                         525,329   29.9
G11       Unambiguous advertisement or promotion              189,306   10.8
Other     All other reasons                                   175,028   10.0
AfD       Articles for Deletion                               171,028    9.7
PROD      Proposed deletion                                   138,715    7.9
G3        Pure vandalism and blatant hoaxes                   102,057    5.8
G12       Unambiguous copyright infringement                   78,966    4.5
G7        Author requests deletion                             66,256    3.8
A3        No content                                           53,356    3.0
A1        No context                                           51,208    2.9
G10       Attack pages                                         48,201    2.7
G5        Creations by banned or blocked users                 40,510    2.3
G2        Test pages                                           32,884    1.9
A10       Duplicates existing topic                            25,946    1.5
G1        Patent nonsense                                      23,849    1.4
G4        Recreated deleted page                               16,594    0.9
A11       Obviously invented                                    8,355    0.5
A9        No indication of importance (music)                   7,855    0.4
A2        Foreign language                                      3,014    0.2
A5        Transwikied article                                     523    0.0
G13       Abandoned draft/AfC                                      40    0.0
G9        Office action                                             0    0.0

We split all 22 categories up into two groups of eleven and make a plot of the activity in each category from January 1, 2017 as that allows us to focus on how ACTRIAL affected deletions (a longer history plot is forthcoming). First, the eleven most common reasons:


Then the least common reasons (note that G9 and G13 are not shown due to their low usage):


Based on the graph of the most common speedy deletion criteria, it appears that several have decreased noticeably during ACTRIAL: A7 (no indication of importance), G3 (pure vandalism and blatant hoaxes), G12 (unambiguous copyright infringement), A3 (no content), A1 (no context), and G10 (attack pages). For the least common criteria, we also see some indications of lower usage: G2 (test pages), A10 (duplicates existing topic), G1 (patent nonsense), and A11 (obviously invented).

To understand how the rates and proportions have changed, we compare ACTRIAL against the same period of time in each of the five preceding years. We choose five years because our earlier graph of page creations in the Main namespace shows that the creation rate is fairly stable across those five years. Going further back in time, the page creation rates appear to be higher. We use the first month and a half of ACTRIAL as that is the same period we use for our survival analysis.
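As a rough sketch, the comparison windows could be generated like this. The helper comparison_windows is hypothetical, and the inputs in the usage lines are assumptions not stated in this log entry: ACTRIAL is taken to start on September 14, 2017, and "a month and a half" is approximated as 45 days.

```python
from datetime import date, timedelta


def comparison_windows(start, days, n_years):
    """Return (start, end) windows for the n_years preceding `start`.

    Each window begins on the same calendar day as `start` in an earlier
    year and spans `days` days (end exclusive), oldest window first.
    Note: a February 29 start would raise ValueError in non-leap years.
    """
    windows = []
    for offset in range(n_years, 0, -1):
        window_start = start.replace(year=start.year - offset)
        windows.append((window_start, window_start + timedelta(days=days)))
    return windows


# Assumed trial start date and window length (see the note above).
trial_start = date(2017, 9, 14)
baseline = comparison_windows(trial_start, days=45, n_years=5)
trial = (trial_start, trial_start + timedelta(days=45))
```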
