Tuesday, January 16, 2018

Today I'll work on project documentation to ensure we know what we know and what we still need to know, start gathering deletion statistics, and work on gathering data about the quality of pages in the Draft namespace and Articles for Creation.

Deletion statistics

H18 and H19 are both about deletions, H18 about articles, H19 about other pages. In this case it should be useful to restrict H19 to two namespaces: User and Draft. We are particularly interested in what is happening in the Draft namespace given the increase in page creations there. We also know that users create article drafts in their user space, making that a "draft-like" namespace. In both cases, we seek to learn if there has been a significant change in deletion activity with ACTRIAL. It is less about the volume of deletions, and more about the reasons for deletion.

Based on the list of criteria for speedy deletion (CSD), we will mostly be interested in three categories: general, articles, and user. All other reasons can be lumped into an "other" category. We will also capture two other venues that lead to deletions: proposed deletions (PROD) and articles for deletion (AfD). One challenge with creating this dataset is that two CSD categories appear to be namespace-specific: Articles (ns=0) and User (ns=2). Some of the general criteria also apply only to specific namespaces (e.g. G13 applies to Draft and User).
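The grouping described above could be sketched roughly as follows. This is a minimal illustration, not the actual data-gathering code: the function name `classify_reason` and the regular expressions are hypothetical, and they assume that deletion log comments mention a criterion code such as "G13", the word "PROD", or a link to an "Articles for deletion/" subpage. Real log comments vary, so the patterns would need checking against actual data.

```python
import re

# Hypothetical patterns: these assume log comments contain a criterion code
# like "G13", the word "PROD", or an "Articles for deletion/" link.
CSD_RE = re.compile(r"\b([GAU])(\d{1,2})\b")
PROD_RE = re.compile(r"\bPROD\b", re.IGNORECASE)
AFD_RE = re.compile(r"Articles for deletion/", re.IGNORECASE)

# The three CSD categories of interest, keyed by the criterion's letter.
CSD_GROUPS = {"G": "general", "A": "articles", "U": "user"}


def classify_reason(comment):
    """Map a deletion log comment to one of the categories of interest."""
    if AFD_RE.search(comment):
        return "AfD"
    if PROD_RE.search(comment):
        return "PROD"
    match = CSD_RE.search(comment)
    if match:
        return CSD_GROUPS[match.group(1)]
    # Everything else (other CSD groups, free-text reasons) is lumped together.
    return "other"
```

Checking AfD and PROD before the CSD codes is a design choice for this sketch, so that a comment mentioning both a discussion and a criterion is attributed to the discussion venue.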

The most flexible solution would be to record counts per day for each namespace and each deletion reason, one row per combination. Alternatively, we could record 28 columns (all G/A/U CSD criteria, PROD, AfD, and "other") per day for each namespace, favoring simplicity over saving space. In this case we're looking at about 10,000 rows of data (365 days × 3 namespaces × 9 years = 9,855), suggesting that the latter approach would not be too problematic. So while it does not create a fully normalized database, it makes data export, import, and analysis easier.
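To make the wide layout concrete, here is a small sketch of how labelled deletions might be rolled up into one row per day and namespace. The names `REASON_COLUMNS` and `to_wide_rows` are hypothetical, the column list is an illustrative subset rather than the full set of 28, and the input is assumed to be (date, namespace, reason) tuples that have already been labelled. The example uses 118 for the Draft namespace, which is its number on English Wikipedia.

```python
from collections import Counter, defaultdict
from datetime import date

# Illustrative subset of reason columns (an assumption for this sketch); the
# real table would list every G/A/U criterion plus PROD, AfD, and "other".
REASON_COLUMNS = ["G11", "G13", "A7", "U5", "PROD", "AfD", "other"]


def to_wide_rows(deletions):
    """Aggregate (date, namespace, reason) tuples into one row per day and namespace."""
    counts = defaultdict(Counter)
    for day, namespace, reason in deletions:
        # Reasons without a dedicated column are counted as "other".
        column = reason if reason in REASON_COLUMNS else "other"
        counts[(day, namespace)][column] += 1

    rows = []
    for (day, namespace), counter in sorted(counts.items()):
        row = {"date": day, "namespace": namespace}
        for column in REASON_COLUMNS:
            row[column] = counter[column]  # a Counter yields 0 for unseen reasons
        rows.append(row)
    return rows


# Upper bound on rows if every day/namespace pair has at least one deletion.
EXPECTED_ROWS = 365 * 3 * 9

sample = [
    (date(2017, 9, 15), 118, "G13"),
    (date(2017, 9, 15), 118, "G13"),
    (date(2017, 9, 15), 118, "R2"),
    (date(2017, 9, 15), 2, "U5"),
]
wide = to_wide_rows(sample)
```

Each row carries a zero for every reason that did not occur that day, which is the space cost of the wide layout, but the result maps directly onto a flat file for export and analysis.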

