Research talk:Autoconfirmed article creation trial/Work log/2018-01-08

Monday, January 8, 2018

Today I'll wrap up writing about the article creator survival rate, and write up our findings on a historical analysis of the quality of articles created by newly registered accounts.

Historic article quality

We're interested in understanding how content quality has developed over time, something H20 in particular focuses on. In order to study this, we used our dataset of article creations to grab the initial revision of these articles, and then fed that revision to ORES' draft quality and article quality models. The draft quality model will flag articles that appear to be spam, vandalism, or an attack page. The article quality model predicts the quality of an article per the English Wikipedia's WP 1.0 assessment scale.

As we are most interested in understanding content quality for newly created accounts, we combine this data with our dataset on article creations by newly registered accounts (less than 30 days old) and focus on those.

Proportion of unretrievable articles

The first thing to note is that a lot of created articles are deleted in such a way that their initial revision is unretrievable. Our data was gathered with an account that has the "deletedhistory" user right, allowing it to get the content of deleted revisions (per API:Deletedrevisions).

We found that overall, 61.3% of the article creations (829,264 of 1,353,535) did not have a retrievable revision. This suggests that the majority of content created by newly registered accounts is not content that can be kept in any way. In other words, it's copyright infringement, attack pages, or something similar.
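As a quick check of the arithmetic, the proportion can be recomputed from the two counts given above. This is only a minimal sketch; the function name is ours and is not part of the analysis code.

```python
def unretrievable_share(unretrievable: int, total: int) -> float:
    """Return the share of creations without a retrievable revision, in percent."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * unretrievable / total


# Counts as reported in the paragraph above.
share = unretrievable_share(829_264, 1_353_535)
print(f"{share:.1f}% unretrievable, {100 - share:.1f}% retrievable")
```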

This proportion has been fairly stable across time. The graph below shows the proportion of article creation revisions that could be retrieved (the inverse of the proportion calculated above):

The graph shows the proportion calculated per day from Jan 1, 2009 to July 1, 2017. To make the overall trend easier to identify, we added a LOESS-smoothed line. As we see, the proportion fluctuates around 40%, meaning that around 60% were unretrievable.

Proportion of articles flagged by the draft quality model

Note: This analysis has been updated on 2018-01-14 based on using a threshold for flagging an article as "OK", ref the Jan 11 work log.

We are also interested in understanding the quality of articles with retrievable revisions. ORES' draft quality model is trained on revisions that were deleted using three specific speedy deletion criteria: vandalism (G3), attack (G10), and spam (G11) (ref: ORES). The quality of articles that appear to meet these speedy deletion criteria is less interesting to us as the draft quality flags indicate that the articles are unfit for an encyclopedia.

We therefore first calculate the proportion of articles that would not have been flagged by the draft quality model, which results in the following graph:

Similar to the previous graph, the proportion is calculated on a daily basis from Jan 1, 2009 to July 1, 2017, and features a LOESS-smoothed line to make it easier to see the overall trend. In the graph, we can see an increasing trend for the first three and a half years, stability for about a year and a half, then a slight decline until the end of 2016. From then on there appears to be an increasing trend.

Ballparking the graph, we could estimate that around 50% of retrievable creations do not get flagged by the draft quality model. Looking at the data we find that from Jan 1, 2013 to July 1, 2017, the proportion is 52.1%. Combined with the previous result that about 40% are retrievable, we find that overall roughly 21% of creations are both retrievable and not flagged. This is close to the previous finding in MusikAnimal's NPP analysis that 22% of articles created by non-autoconfirmed users were not deleted. Note that their analysis and ours are not identical, because we are looking at all creations done by accounts that are up to 30 days old, whereas they only looked at non-autoconfirmed creations. However, we do know from our previous analysis that a large proportion of creations are done within the first day after registering. This means these two analyses are roughly comparable, and it's worth noting that they find similar results.
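The overall figure is simply the product of the two proportions. A minimal sketch of that step, using the rounded values quoted in the paragraph above (the variable names are ours):

```python
# Approximate share of creations with a retrievable initial revision.
retrievable = 0.40
# Share of retrievable creations not flagged (Jan 1, 2013 to July 1, 2017).
not_flagged_given_retrievable = 0.521

# Share of all creations that are both retrievable and not flagged.
overall_not_flagged = retrievable * not_flagged_given_retrievable
print(f"{overall_not_flagged:.1%}")
```

Because the 40% input is itself a ballpark figure, the result should be read as "roughly a fifth" rather than as a precise estimate.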

Types of flags by the draft quality model

As mentioned above, the draft quality model aims to identify if a revision meets any of three specific criteria for speedy deletion (spam/attack/vandalism). For the roughly 40% of retrievable revisions that are flagged, we calculate the proportion (of total number of retrievable revisions) with a specific flag and plot it on a per-day basis:

From the plot, we can see that the most common flag is "spam", accounting for about 26%. Second is "vandalism" with 12–13%. "Attack" is comparatively low. We suspect that the low proportion of attack pages is due to those being deleted in such a way that they cannot be retrieved later. We also see that there is some variation in the more prominent flags, but that there is not a clear trend showing either an increase or a decrease over time. Lastly, we note that spam appears to be the main challenge when it comes to handling new article creations.

One thing to note about these flags is that they are not technically exclusive. The draft quality classifier calculates a probability for each of the four ways an article can be marked (ok/spam/vandalism/attack), and then chooses the most probable one. We could therefore do another analysis to take a more fine-grained look at whether articles get flagged or not, but leave that for future work at the moment.
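To illustrate the difference between picking the most probable label and applying a threshold on "OK" (as mentioned in the update notes), here is a minimal sketch. The function names, the example probabilities, and the threshold value of 0.6 are all hypothetical; the threshold actually used is described in the Jan 11 work log and is not reproduced here.

```python
def flag_by_argmax(probs: dict) -> str:
    """Pick the label with the highest predicted probability."""
    return max(probs, key=probs.get)


def flag_with_ok_threshold(probs: dict, ok_threshold: float) -> str:
    """Label as "OK" only if P(OK) reaches the threshold; otherwise
    pick the most probable of the remaining labels."""
    if probs["OK"] >= ok_threshold:
        return "OK"
    non_ok = {label: p for label, p in probs.items() if label != "OK"}
    return max(non_ok, key=non_ok.get)


# Hypothetical prediction for a single revision.
example = {"OK": 0.45, "spam": 0.35, "vandalism": 0.15, "attack": 0.05}

print(flag_by_argmax(example))               # most probable label
print(flag_with_ok_threshold(example, 0.6))  # hypothetical threshold of 0.6
```

With these made-up numbers, the first approach labels the revision as "OK" while the thresholded approach flags it as spam, which shows why the choice of rule changes the proportions reported.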

Quality of non-flagged articles

Note: This analysis has been updated on 2018-01-14 based on using a threshold for flagging an article as "OK", ref the Jan 11 work log.

Having identified what proportion of article creations are retrievable and not flagged by the draft quality model, we turn our attention to understanding the quality of these articles. In order to do so, we use ORES' article quality model. This model predicts the quality of an article based on the WP 1.0 assessment scale, and uses six quality classes: Stub, Start, C, B, Good Article, and Featured Article. Similar to the draft quality model, the article quality model will calculate the probability that an article falls into each of these six classes. We can use these probabilities to calculate a weighted sum to reflect the general quality of an article. This approach has been used in previous research by Halfaker, ref Demonstrating the Keilana Effect. A similar approach is also used by the Wiki Education Foundation in their dashboard (ref this blog post).

We calculate a weighted sum for each article creation and then calculate the average per day. We can then plot this average, resulting in the following graph:
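The calculation described above can be sketched as follows. We assume linear weights from 0 (Stub) to 5 (Featured Article), which is consistent with Start-class corresponding to a score of 1.0; the class abbreviations, function names, and example probabilities are ours and purely illustrative.

```python
from collections import defaultdict
from statistics import mean

# Assumed linear weights per quality class (Stub = 0 ... FA = 5).
WEIGHTS = {"Stub": 0, "Start": 1, "C": 2, "B": 3, "GA": 4, "FA": 5}


def weighted_sum(probs: dict) -> float:
    """Collapse the six class probabilities into a single quality score."""
    return sum(WEIGHTS[cls] * probs[cls] for cls in WEIGHTS)


def daily_average(scored: list) -> dict:
    """Average the weighted sums per day; `scored` holds (day, probs) pairs."""
    by_day = defaultdict(list)
    for day, probs in scored:
        by_day[day].append(weighted_sum(probs))
    return {day: mean(scores) for day, scores in sorted(by_day.items())}


# Made-up predictions for illustration only.
mostly_stub = {"Stub": 0.5, "Start": 0.3, "C": 0.15, "B": 0.05, "GA": 0.0, "FA": 0.0}
pure_stub = {"Stub": 1.0, "Start": 0.0, "C": 0.0, "B": 0.0, "GA": 0.0, "FA": 0.0}

print(f"{weighted_sum(mostly_stub):.2f}")
print(daily_average([
    ("2017-01-01", mostly_stub),
    ("2017-01-01", pure_stub),
    ("2017-01-02", mostly_stub),
]))
```

An article predicted to be a Stub with certainty scores 0, so any score above 0 reflects some probability mass on the higher classes.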

There are several things to note in this graph. First, it appears to have a slow upward trend. This suggests that over time the quality of articles that do not get deleted and are not flagged by the draft quality model is slowly increasing. The cause of this upward trend is not known. It might be that the quality of articles created improves over time, or it might come from the bar for deletion being slowly raised over time. We would have to do further studies in order to determine that.

Secondly, the average score is roughly 0.75, which suggests that the average surviving new article is well on its way to becoming a Start-class article (with a score of 1.0). We might suspect that these newly created articles were short stubs, but given the high quality score, that appears to not be the case.
