Research talk:Autoconfirmed article creation trial/Work log/2018-01-31

Wednesday, January 31, 2018

Today I'll write up our results for H1, then start analyzing the AfC data that we've gathered.

H1: Number of registered accounts

Our first hypothesis states that the number of registered accounts will not be affected by ACTRIAL. We did an initial analysis of historical data in our August 8 work log. In that analysis, we looked at the different ways that accounts can be created, and at how much variation there is in the data. Our analysis showed that the two types of creation that involve someone else creating the account are stable and fairly low in numbers, leading us to combine them with the regularly created accounts. This is why we have since focused on only two groups: autocreated accounts, which are created by the system when someone who has an account on another wiki visits the English Wikipedia, and all other (non-autocreated) accounts.

I updated the graphs of account registrations with new data using the account creation dataset listed on the dataset page. The plots of account creations per day, historically and in the two most recent years, are as follows:

In these graphs, we do not see any interruption in the time series in September 2017, suggesting that ACTRIAL has not affected the number of account registrations. We do seek a more thorough analysis of this, though, and therefore turn to forecasting methods in order to understand whether there has actually been a change.

We first investigate how moving from daily to monthly counts can provide us with the insight that we are interested in. One challenge with daily count data is that, in order to forecast well, the model needs to account for movable events (e.g. holidays like Easter, or major sports events). Secondly, a year always has exactly twelve months, making it straightforward to account for yearly cycles. At the same time, we need to be careful with missing or anomalous data. In the daily graph, we can see that data is missing for a fairly long period of time in 2011. In order to account for that, we projected registrations for February through May 2011 based on the data from 2009 and 2010. The graph for monthly number of account registrations then looks like this:

In the graph above, we can see that the number of registrations consistently drops during the summer months (the middle of each year), and in December as well (Wikipedia is quiet over the holidays, which is also visible in the daily count graphs we saw earlier). We can also see the account alignment project in 2014 and 2015 affecting the number of registered accounts. Apart from the alignment project, there do not appear to be other challenges with the data.
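
The move from daily to monthly counts, and the filling of the 2011 gap, can be sketched roughly as follows. This is a minimal illustration and not the code used for the analysis: the function names are hypothetical, and filling a missing month with the mean of the same calendar month in the reference years is an assumption, since the log only says that the projection was based on 2009 and 2010.

```python
from collections import defaultdict


def monthly_totals(daily_counts):
    """Sum a {datetime.date: count} mapping into {(year, month): total}."""
    totals = defaultdict(int)
    for day, count in daily_counts.items():
        totals[(day.year, day.month)] += count
    return dict(totals)


def fill_from_prior_years(monthly, gap_months, reference_years):
    """Fill each missing (year, month) with the mean of the same calendar
    month in the reference years.

    ASSUMPTION: a plain mean is used for illustration only; the actual
    projection method is not described in the work log.
    """
    filled = dict(monthly)
    for year, month in gap_months:
        reference = [monthly[(y, month)] for y in reference_years]
        filled[(year, month)] = sum(reference) / len(reference)
    return filled
```

For the gap described above, `gap_months` would be `[(2011, 2), (2011, 3), (2011, 4), (2011, 5)]` and `reference_years` would be `[2009, 2010]`.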

We first calculate the Autocorrelation Function (ACF) for non-autocreated and autocreated accounts, in order to understand whether the time series is stationary or not. The ACF plot for non-autocreated accounts:

The ACF decays rapidly to non-significant autocorrelation instead of staying significant, which suggests that the time series might not require differencing in order to be stationary. This is not the case for the number of autocreated accounts, which shows a very different ACF:

The number of autocreated accounts has a different ACF partly due to how it increases over time: in 2009 there were fewer than 50,000 accounts registered per month, while in 2017 it is closer to 75,000. As we'll see, we are then required to apply at least first-order differencing to get a stationary time series.
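
To make the stationarity check concrete, here is a small sketch of the sample ACF, the approximate 95% significance bound that ACF plots usually draw, and first-order differencing. It is a plain-Python illustration with hypothetical function names, not the R code behind the plots, and it assumes a series that is not constant (otherwise the denominator is zero).

```python
import math


def acf(series, max_lag):
    """Sample autocorrelation of `series` for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    deviations = [x - mean for x in series]
    denominator = sum(d * d for d in deviations)
    return [
        sum(deviations[t] * deviations[t + lag] for t in range(n - lag)) / denominator
        for lag in range(max_lag + 1)
    ]


def significance_bound(n):
    """Approximate 95% bound for the autocorrelation of white noise."""
    return 1.96 / math.sqrt(n)


def difference(series, lag=1):
    """Differencing: y[t] - y[t - lag]; lag=1 gives first-order differencing."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]
```

A series with a steady upward trend keeps autocorrelations above the bound for many lags, while its first difference does not, which is the pattern that motivates differencing the autocreated series.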

We use ARIMA models for the forecasting, and apply the auto.arima function in R to identify the best model to use. For both non-autocreated and autocreated accounts, we find two competing models with similar log-likelihoods. In both cases, using BIC as the selection criterion results in a simpler model with only a slightly lower log-likelihood, so we choose the BIC-selected model.
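
The reason BIC tends to favour the simpler of two models with similar log-likelihoods is its heavier penalty per parameter. The sketch below uses the standard definitions of AIC and BIC; the `select_by_bic` helper and its input format are hypothetical and only meant to illustrate the comparison, not to reproduce what auto.arima does internally.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


def select_by_bic(candidates, n_obs):
    """Return the name of the candidate with the lowest BIC.

    `candidates` maps a model name to (log_likelihood, n_params).
    """
    return min(
        candidates,
        key=lambda name: bic(candidates[name][0], candidates[name][1], n_obs),
    )
```

Because ln n exceeds 2 once there are more than about seven observations, each extra parameter costs more under BIC than under AIC, so a small gain in log-likelihood is less likely to justify a more complex model.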

For non-autocreated accounts, we use a first-order autoregressive model. In training it, we hold out the data for September, October, and November 2017. Then, we use it to forecast those three months, and compare it with the actual values. The forecast graph looks as follows:

The graph shows the forecasted values (black line), as well as the 80% and 95% confidence intervals. The true number of accounts created is shown in red. As we can see, the forecast closely aligns with what actually happened during the first two months of ACTRIAL, suggesting that the trial did not cause any changes.
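
The hold-out procedure for a first-order autoregressive model can be sketched as below. Note the substitutions: this fits the model by ordinary least squares rather than the maximum-likelihood fit R would perform, the intervals use a normal approximation that ignores parameter uncertainty, and the function names are hypothetical. It is meant to show the mechanics, not to reproduce the numbers in the graph.

```python
import math


def fit_ar1(series):
    """Fit y[t] = c + phi * y[t-1] + e[t] by ordinary least squares."""
    x = series[:-1]
    y = series[1:]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    phi = sxy / sxx
    c = mean_y - phi * mean_x
    residuals = [b - (c + phi * a) for a, b in zip(x, y)]
    sigma2 = sum(r * r for r in residuals) / (n - 2)
    return c, phi, sigma2


def forecast_ar1(last_value, c, phi, sigma2, steps, z=1.96):
    """Return (point, lower, upper) for 1..steps months ahead.

    z=1.96 gives an approximate 95% interval, z=1.28 an 80% interval.
    """
    forecasts = []
    point = last_value
    variance = 0.0
    for h in range(1, steps + 1):
        point = c + phi * point
        # h-step forecast variance: sigma2 * (1 + phi^2 + ... + phi^(2(h-1)))
        variance += sigma2 * phi ** (2 * (h - 1))
        half_width = z * math.sqrt(variance)
        forecasts.append((point, point - half_width, point + half_width))
    return forecasts
```

In the setup described above, the model would be fitted on the monthly series with the last three months removed, `forecast_ar1` would be called with `steps=3`, and the three point forecasts and intervals would then be compared against the withheld values.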

For autocreated accounts, we use a first-order integrated model with an additional seasonal component to account for changes over time. The seasonal component is second-order autoregressive with a 12-month period. We withhold data for September, October, and November as before and use the model to forecast for those three months. The resulting forecast graph looks like this:

We see in the graph that the forecast is lower than the actual number of account creations, but within the 80% confidence interval. The actual number of creations appears to be similar to that of 2016, which again suggests that ACTRIAL has had no effect on account creations.
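
For reference, the model described above for autocreated accounts can be written in backshift notation. This sketch assumes there are no non-seasonal autoregressive or moving-average terms and no drift term, which the description does not state explicitly; under that assumption it is an ARIMA(0,1,0)(2,0,0) model with period 12.

```latex
% B is the backshift operator (B y_t = y_{t-1}), y_t the monthly count of
% autocreated accounts, \Phi_1 and \Phi_2 the seasonal AR coefficients,
% and \varepsilon_t white noise.
(1 - \Phi_1 B^{12} - \Phi_2 B^{24})\,(1 - B)\, y_t = \varepsilon_t
```

The factor (1 - B) is the first-order differencing, and the seasonal factor relates each month's change to the change in the same month one and two years earlier.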

In summary, we find support for the hypothesis that ACTRIAL has not affected the number of accounts registered.

Drafts and Articles for Creation

H16, H17, and H22 relate to Articles for Creation and how ACTRIAL affects that process. In order to understand that, we gather a dataset of creations in the Draft namespace and go through them to identify submissions to AfC. This process does not capture every submission to AfC, because contributors can also create drafts in the User namespace and then submit them for review. We assume that most of the AfC submissions come through the Draft namespace, particularly during ACTRIAL. In addition, there is no dataset of historical category or template information available, meaning that identifying historical AfC submissions in the User namespace would require going through the history of all pages in that namespace. Doing so is outside the scope of this project.

To get a sense of the dataset, we start out by plotting the number of draft creations over time:

There are several things to note about the historical plot. First of all, we have page creations prior to when the namespace actually existed. Per T59569, the namespace came to be on Dec 17, 2013. The existence of creations in our dataset prior to that date comes from the challenges of recreating page histories across time, particularly as pages get deleted. For example, one of the 2009 creations was a draft created in the User namespace that was moved over to the Draft namespace in mid-2014, then quickly reviewed and declined, before finally getting deleted as a stale draft in September 2015.

Secondly, it appears that it took four months from when the namespace was created until it became widely used: the number of creations per day picks up in late April 2014. Thirdly, we see increased usage of the namespace, particularly in 2016 and 2017. There appears to be some increase in the fall of 2015, but we see considerably larger and sustained usage in 2016. Because this traffic appears to coincide with school terms, could it be related to the Education Program making more widespread use of the Draft namespace to create content?

In addition to the increase in usage from 2016 onwards, we also see how ACTRIAL has affected Draft creations. This jump is perhaps more easily seen on the plot below, which starts in mid-2014 and also has a dotted line showing when ACTRIAL started:

The plot above makes the differences in usage easier to spot. There's clearly a steady state from mid-2014 until the beginning of 2016. Then we see the increased usage during the spring and fall seasons, with drops during the summer and winter holidays. Lastly, we see that once ACTRIAL starts there's a significant increase, and unlike previous years the increase is consistently high.

Publication rate

To what extent do pages created in Draft turn into something that can be published? We determine this by searching the edit history of each page for an edit that moves the page into Main. Our dataset contains 126,957 pages created in the Draft namespace between July 1, 2014 and December 1, 2017. Of these, we find that only 1,550 (1.2%) appear to have been moved to the Main namespace at some point in their history. This number is much lower than the number of accepted AfC submissions listed in Category:Accepted AfC submissions, which contains almost 80,000 pages. We do not know to what extent accepted AfCs come through the User namespace rather than the Draft namespace, or whether accepted AfCs are redirects (there is a separate process for proposing redirects for creation).

AfC submission rate

We track the usage of AfC submission templates in the Draft pages over time, noting when they were added to the page, when they were reviewed, and whether the submission was declined or not. This dataset contains some duplicate submissions, in that some pages might contain a submission template only for the creator to then add a second one. We did notice that the default submission template does not submit the page for review, and we remove unsubmitted templates from our dataset. This leads us to conclude that our dataset is as good as we can get without spending undue amounts of time, which is not in the scope of what we are doing.
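
As a rough illustration of how submission templates might be picked out of a revision's wikitext, consider the sketch below. It rests on assumptions that should be checked against the template documentation: that the template is named "AfC submission" (case-insensitively), and that its first positional parameter encodes the status, with an empty value meaning pending review, `t` an unsubmitted draft, `r` under review, and `d` declined. The function names are hypothetical, and the simple pattern does not handle templates nested inside the parameters.

```python
import re

# Matches {{AfC submission|...}} in any letter case and captures the
# parameter string, including its leading pipe. Nested templates inside
# the parameters are deliberately not supported in this sketch.
TEMPLATE_RE = re.compile(
    r"\{\{\s*afc submission\s*((?:\|[^{}]*)?)\}\}", re.IGNORECASE
)

# ASSUMPTION: status codes as understood from the template's usage.
STATUS_BY_CODE = {
    "": "pending",
    "t": "unsubmitted",
    "r": "under review",
    "d": "declined",
}


def submission_statuses(wikitext):
    """Return the status of every AfC submission template in the text."""
    statuses = []
    for match in TEMPLATE_RE.finditer(wikitext):
        params = match.group(1).split("|")[1:]  # drop the piece before the first pipe
        code = params[0].strip().lower() if params else None
        statuses.append(STATUS_BY_CODE.get(code, "unknown"))
    return statuses


def submitted_only(statuses):
    """Drop templates that never submitted the page for review."""
    return [status for status in statuses if status != "unsubmitted"]
```

Running this over each revision of a page, and recording the first revision in which a given template appears or changes status, would give the kind of timeline described above, with `submitted_only` corresponding to the removal of unsubmitted templates.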

Calculating the number of submissions per day gives us the following graph:

One thing to note here is that we do not see the same spikes in AfC submissions that we previously saw for Draft creations. If drafts are created as part of the Education Program, they might not be submitted for review through AfC, but instead get reviewed by a community liaison, meaning they bypass the process. It is therefore not unexpected that we have less variation in the data.

We can also see how ACTRIAL has drastically increased the number of submissions to AfC. The graph suggests a two- to three-fold increase. A more in-depth analysis with clearer numbers will be done tomorrow.
