Research talk:Autoconfirmed article creation trial/Work log/2017-11-30
Thursday, November 30, 2017
Today I'll be gathering data to answer two questions:
- How does the survival rate of newly registered users who create drafts compare to those who create articles?
- Has there been a change in quality of drafts created?
Survival rate of draft creators
We have previously looked at survival rates of users who start out by creating an article (Sept 1 and Sept 3). That analysis only covered accounts that created an article in their first edit. Now we want to understand how survival rates compare between those who create articles and those who create drafts. Should we expand our dataset beyond the first edit in this case? If we do, what timeframe is appropriate? For example, we could use all creations in the first week and compare them with edits in the fifth week, mirroring our general definition of a surviving editor. Lastly, should we use the account's first creation, or any creation that happens in the given timeframe?
To understand more about this, we want to answer two related questions:
- If a newly registered account creates an article or a draft in its first 30 days, how old is the account when it creates the article/draft?
- To what extent do new accounts create multiple articles?
To answer these questions, I gathered historic data from Jan 1, 2009 to July 1, 2017. For the main namespace I used our dataset of non-autopatrolled article creations, while for the draft namespace I created a similar query restricted to namespace 118. This query uses regular expressions to identify edit comments that suggest the page was created as a redirect or by a move, although an inspection shows that the number of redirects created in the Draft namespace since late July is very low compared to the number of pages created. It also excludes all creations by non-registered accounts, although an inspection of the database indicates that no drafts were created by non-registered accounts. Lastly, it ignores creations by users with the autopatrol right. There are two reasons why these are not interesting: first, we'd expect users with the autopatrol right to create articles directly, as otherwise the right is not really needed; second, examining the data from Jan 1, 2016 onwards, we find that only 4.2% of the pages in the Draft namespace were created by them, so they account for only a small share of draft creations.
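The filtering rules described above can be sketched roughly as follows. This is a minimal illustration, not the query that was actually run: the two regular expressions, the function name `is_plain_creation`, and the convention that a user ID of 0 marks a non-registered editor are all assumptions made for the example.

```python
import re

# Hypothetical patterns: the expressions used in the real query are not
# reproduced in this log, so these are stand-ins for illustration only.
REDIRECT_RE = re.compile(r"redirect", re.IGNORECASE)
MOVE_RE = re.compile(r"\bmoved?\b.*\bto\b", re.IGNORECASE)


def is_plain_creation(comment, user_id, has_autopatrol):
    """Return True if a page creation should be kept in the dataset."""
    if user_id == 0:  # assumption: 0 marks a non-registered (IP) editor
        return False
    if has_autopatrol:  # creations by autopatrolled users are ignored
        return False
    comment = comment or ""  # treat a missing edit comment as empty
    # Drop creations whose comment suggests a redirect or a page move.
    if REDIRECT_RE.search(comment) or MOVE_RE.search(comment):
        return False
    return True
```

Comment-based matching like this can misclassify a page whose creation comment merely mentions a move or redirect, so the patterns would need checking against real edit comments.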
I then restricted the dataset to only articles/drafts created by accounts that were less than 30 days old and joined the two. This allowed me to create a histogram showing the distribution of how old an account is when a draft or article is created, split by the namespace in which it is created:
The top histogram is the Main namespace, while the bottom histogram is the Draft namespace. While there is an order of magnitude difference in the raw counts, the shapes of the two histograms are very similar. We see that most of the creations happen within a day after registration, and that draft creations generally take a little longer than article creations. 74.4% of all articles in this dataset were created within the first 24 hours after registration, whereas for drafts the proportion is 75%. The median account age at creation is 41.5 minutes for articles and 67.7 minutes for drafts.
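The per-namespace figures quoted above (share of creations within 24 hours, median account age at creation) could be computed along these lines. This is a sketch under assumptions: the input is taken to be tuples of (namespace, registration time, creation time), and the names `age_minutes` and `summarize` are hypothetical, not taken from the analysis code.

```python
from datetime import datetime, timedelta
from statistics import median


def age_minutes(registered, created):
    """Account age in minutes at the time of the page creation."""
    return (created - registered).total_seconds() / 60


def summarize(rows, max_age_days=30):
    """Summarize account age at creation per namespace.

    rows: iterable of (namespace, registration_time, creation_time).
    Creations by accounts max_age_days old or older are dropped.
    """
    limit = max_age_days * 24 * 60  # cutoff in minutes
    ages = {}
    for namespace, registered, created in rows:
        age = age_minutes(registered, created)
        if 0 <= age < limit:
            ages.setdefault(namespace, []).append(age)
    return {
        namespace: {
            "n": len(values),
            "pct_within_24h": 100 * sum(a < 24 * 60 for a in values) / len(values),
            "median_minutes": median(values),
        }
        for namespace, values in ages.items()
    }


# Toy data: namespace 0 is Main, namespace 118 is Draft.
reg = datetime(2017, 1, 1)
rows = [
    (0, reg, reg + timedelta(minutes=10)),
    (0, reg, reg + timedelta(minutes=30)),
    (0, reg, reg + timedelta(days=2)),
    (118, reg, reg + timedelta(minutes=60)),
    (118, reg, reg + timedelta(days=40)),  # older than 30 days, dropped
]
summary = summarize(rows)
```

Reporting the median alongside the 24-hour share is useful here because the distribution is heavily skewed toward the first hours after registration.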
Generally, newly registered accounts that create a draft or article during their first 30 days only do this once. There are 1,027,272 accounts in our dataset, and 81.4% of these have a single creation event. Among those that create multiple articles/drafts, the distribution is long-tailed with a maximum of 754 page creations. We can also see this in how the proportions decrease as the number of creations increases:
Number of creations | Number of accounts | Proportion (%) |
---|---|---|
1 | 835,750 | 81.36 |
2 | 125,508 | 12.22 |
3 | 35,044 | 3.41 |
4 | 13,424 | 1.31 |
5 | 6,286 | 0.61 |
6 | 3,341 | 0.33 |
7 | 1,996 | 0.19 |
8 | 1,345 | 0.13 |
9 | 868 | 0.08 |
10 | 578 | 0.06 |
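A table like the one above can be derived from a list with one creator ID per page creation. The sketch below assumes that input shape; the function name `creation_distribution` is hypothetical.

```python
from collections import Counter


def creation_distribution(creator_ids):
    """Return rows of (number of creations, number of accounts, percent).

    creator_ids: iterable with one account identifier per page creation.
    """
    per_account = Counter(creator_ids)      # creations per account
    dist = Counter(per_account.values())    # accounts per creation count
    total_accounts = len(per_account)
    return [
        (k, dist[k], round(100 * dist[k] / total_accounts, 2))
        for k in sorted(dist)
    ]


# Toy data: accounts "a" and "d" create once, "b" twice, "c" three times.
table = creation_distribution(["a", "b", "b", "c", "c", "c", "d"])
```

Percentages are taken over the number of distinct accounts rather than the number of creations, which matches how the proportions in the table sum over accounts.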