Research talk:Autoconfirmed article creation trial/Work log/2018-02-10

Saturday, February 10, 2018

Today I aim to complete H9, H10, and H11, write up H18 and H19 on the research page, and get around to responding to some comments on our talk page.

H9: Number of patrol actions will decrease.

We did a preliminary analysis of data related to H9 in our August 22 work log. After doing that analysis, we shared our preliminary findings with the New Page Patrol reviewers and got some great feedback that helped improve our data gathering process.

H9 states that the number of patrol actions will decrease as a result of the reduction in number of pages created. We therefore start by looking at the graph of number of patrol actions. This graph starts on October 1, 2012, as that is the beginning of the first full month after the introduction of the PageTriage extension for new page patrol.

In the historical graph, we can notice a slowly increasing trend in number of patrol actions per day up until some time in the first half of 2016. We then appear to have a fairly stable period, and finally a drop towards the end of the graph. It is also clear that on some days we have a very large number of reviews, perhaps in order to attempt to reduce the backlog of unreviewed articles. We know that the drop in reviews in early 2016 is due to a single, prolific reviewer being asked to stop reviewing. It also seems like we have a clear drop in number of patrol actions per day once ACTRIAL starts. Let's focus in on the last two years of data and add a vertical line for ACTRIAL:

In the more focused graph, we can again see the drop in May 2016, as well as increases in patroller activity in September 2016, February 2017, and during the summer of 2017. Lastly, we see a clear drop in the number of patrol actions after ACTRIAL starts, although there are some days shortly after ACTRIAL with a fair amount of activity.

Based on the historical plot, it appears that the level of patrolling activity has been fairly stable in the fall of each year. We therefore start our analysis of H9 by comparing the overall activity level during the first two months of ACTRIAL against the same period of 2013-2016. The data is somewhat skewed, but not strongly, meaning that we should be able to trust both a t-test and the Mann-Whitney U test. Both tests find a statistically significant change. The average daily number of reviews prior to ACTRIAL was 739.1, while during ACTRIAL it is 439.3, making the test results unsurprising (t-test: t=14.89, df=189.44, p << 0.001; Mann-Whitney: W=13577, p << 0.001). These results support H9, but we would also like to investigate forecasting models in case there might be interesting time-related trends in the data.
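The two test statistics reported above can be sketched in plain Python. This is only an illustration of the calculations, not the code used for the analysis: the fractional degrees of freedom suggest the unequal-variance (Welch) form of the t-test, which is what is assumed here, and the daily counts are made up. P-values are left out, since they require the t and normal distributions.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test of a vs. b."""
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def mann_whitney_u(a, b):
    """Return the U statistic for sample a (a tie counts as one half)."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


# Made-up daily patrol counts, for illustration only
before = [700, 760, 810, 690, 735, 720]
during = [430, 455, 410, 470, 440, 425]

t, df = welch_t(before, during)
u = mann_whitney_u(before, during)
print(f"t={t:.2f}, df={df:.2f}, U={u:.0f}")
```

R's wilcox.test reports this U statistic for the first sample under the name W.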

Our forecasting approach calculates the number of patrol actions bimonthly, meaning twice a month (once for each half of the month). That way we have a split that coincides with the start of ACTRIAL, and it allows for a fair number of data points across our dataset. The graph for this time series looks like this:

In the graph, we see a general increase in the number of patrol actions up until early 2016, at which point it drops down to a new stable level, before tapering off towards the end of 2017 when ACTRIAL starts. The drop in 2016 is, as discussed before, due to a single prolific reviewer no longer participating. Examining the stationarity and seasonality of the time series, we find that it is not evident that it has a seasonal component (i.e. a yearly cycle), and that it is non-stationary.
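The half-month aggregation behind this time series could look something like the sketch below. The cut between day 15 and day 16 is an assumption made for illustration, as are the function names and the sample data; the log does not state the exact split.

```python
from collections import defaultdict
from datetime import date, timedelta


def half_month_key(d):
    """Map a date to (year, month, half); half 1 is days 1-15 (assumed cut)."""
    return (d.year, d.month, 1 if d.day <= 15 else 2)


def aggregate_half_months(daily_counts):
    """Sum a {date: count} mapping into half-month bins, in time order."""
    totals = defaultdict(int)
    for d, n in daily_counts.items():
        totals[half_month_key(d)] += n
    return dict(sorted(totals.items()))


# Made-up data: 10 patrol actions on every day of September 2017
daily = {date(2017, 9, 1) + timedelta(days=i): 10 for i in range(30)}
bins = aggregate_half_months(daily)
print(bins)
```

With two bins per month, one yearly cycle spans 24 observations, which matches the seasonal period of 24 in the models discussed in this log.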

We both apply R's auto.arima function and test manually created models. The auto-generated model is fairly complex (an ARIMA(5,1,0) model). Through iterations of training and verifying the ACF and PACF graphs of the models, we find that a seasonal model is an improvement. We land on an ARIMA(3,1,0)(0,1,1)[24] model as our best approach. Using it to predict the first two months of ACTRIAL gives the following graph:

The true number of patrol actions is below the 95% confidence interval in all of October, and almost below in the second half of September and first half of November. This agrees with our previous finding that H9 is supported.
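For reference, the ARIMA(3,1,0)(0,1,1)[24] specification chosen for H9 can be written out in backshift notation. This is the general textbook form of such a model, not the fitted model itself: the coefficient symbols are placeholders, and sign conventions for the moving-average term differ between texts and software.

```latex
% B is the backshift operator: B y_t = y_{t-1}
% \phi_1, \phi_2, \phi_3: non-seasonal autoregressive coefficients
% \Theta_1: seasonal moving-average coefficient
% (1 - B) is the regular difference, (1 - B^{24}) the seasonal difference
(1 - \phi_1 B - \phi_2 B^2 - \phi_3 B^3)\,(1 - B)\,(1 - B^{24})\, y_t
  = (1 + \Theta_1 B^{24})\, \varepsilon_t
```

Here y_t is the number of patrol actions in half-month t and the error term is assumed to be white noise.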

Related measure: Patrol actions to created articles

Our related measure aims to understand to what extent a change in the number of articles created also means that new page patrollers adjust their efforts accordingly. We hypothesize that this ratio will stay stable, in other words effort is updated according to demand.

To answer this question, we need a dataset of article creations needing patrolling. We used the Data Lake to gather such a dataset from 2009 onwards, limiting it to non-redirect pages in the Main namespace created by users who do not have autopatrol rights. From late July 2017 onwards, we use data from the log database instead of the Data Lake, because the former has improved redirect filtering. As we are interested in aggregate statistics, we count the number of pages created per day.

In addition to pages created in the Main namespace, new page patrollers also have to deal with moves of drafts from other namespaces. These moves are likely to come from the User or Draft namespaces, and some of them might be done by contributors with autopatrol rights (meaning they do not need patrolling). The important question for us is whether these moves make up a substantial proportion of all pages needing patrolling. We keep counts of the daily number of moves from User and Draft into the Main namespace in our database on Toolforge, and combine a dataset of these counts with the counts of page creations. Note that this approach assumes that all moves require patrolling, which is not the case. We could gather data aiming to determine if the mover had autopatrol rights at the time of making the move. If we assume that most moves are done either by individuals or through AfC, and that neither requires particular rights, then it also means that most moves require patrolling. Future work could look more closely at this to determine to what extent moves require patrolling or not.

Calculating the proportion of moves out of all pages needing patrolling (using the definition above) gives us the following historical graph:

The graph shows that the proportion of moves to page creations increased up until 2012, at which point it stabilized around 5%. It remains there for most of the time until late 2016, although there are some periods such as summer 2013 with a higher proportion and spring 2014 with lower proportions. There's a marked increase in spring 2017 before settling back down over the summer, and another increase in the fall of 2017 around when ACTRIAL starts. To make the more recent years easier to see, we limit the graph to 2016 onwards:

Looking at the last two years shows some periods of increased activity, e.g. March and September 2016, and makes the increase in January through April of 2017 easier to see. The trend also suggests that the general level is higher in late 2016 as well as through the summer of 2017 compared to what we saw in the historical graph (about 7.5% compared to 5% historically). Lastly, we see a clear increase in the proportion during ACTRIAL.

In this case, our concern is whether moves make up a substantial proportion of pages needing patrolling, and the graph suggests that they sometimes do, and that ACTRIAL is one of those times. We will therefore add all page moves to the Main namespace creations when calculating our related measures.
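As a small sketch of the proportion discussed here, the share of moves among all pages needing patrolling can be computed per day as moves divided by creations plus moves. The function name and the counts below are made up, and the sketch follows the simplification used in this log that every move requires patrolling.

```python
def move_share(creations, moves):
    """Fraction of pages needing patrol that arrived via a move into Main."""
    total = creations + moves
    # A day with no incoming pages has no meaningful share; report 0.0
    return moves / total if total else 0.0


# Made-up (creations, moves) counts for three days
days = [(950, 50), (900, 100), (0, 0)]
shares = [move_share(c, m) for c, m in days]
print(shares)
```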

Plotting the ratio of patrol actions to creations and moves per day since the introduction of the PageCuration tool in late 2012 gives the following graph:

One thing to note in the graph is the prolonged periods where the proportion appears to stay below or above 100%. If it's below 100%, that indicates that the review backlog grows, while if it's above 100% it shrinks. Because we are only looking at data from the introduction of the PageTriage extension onwards, pages do not automatically expire. Instead, they only leave the queue by being reviewed or deleted. We can also see many days with clear peaks of activity, suggesting that doing reviews to help clear the backlog tends to happen in short bursts.
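The link between this ratio and the backlog can be illustrated with a short sketch. It simplifies on purpose: it assumes each patrol action clears exactly one page and ignores pages that leave the queue through deletion. The function names and numbers are invented for the example.

```python
def patrol_ratio(patrol_actions, creations, moves):
    """Patrol actions divided by pages needing patrol (None if no demand)."""
    demand = creations + moves
    return patrol_actions / demand if demand else None


def backlog_change(days):
    """Net queue growth over days given as (patrols, creations, moves).

    Simplification: one patrol action clears one page; deletions ignored.
    """
    return sum(c + m - p for p, c, m in days)


# Made-up days: below 100%, above 100%, and exactly 100%
days = [(800, 950, 50), (1200, 900, 100), (1000, 900, 100)]
ratios = [patrol_ratio(p, c, m) for p, c, m in days]
net = backlog_change(days)
print(ratios, net)
```

In this toy data the backlog grows by 200 pages on the first day and shrinks by 200 on the second, so the net change over the three days is zero.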

Focusing in on the two most recent years can make it easier to see what's happening during ACTRIAL:

In the graph above, we can see the higher activity in early 2016, which was driven by a single highly active reviewer. When they stop reviewing in May 2016 there is a clear drop in the proportion, and it mainly stays below 100% until September, with the exception of parts of July. We see high activity in fall of 2016 and parts of spring 2017. The low proportion in spring 2017 may be driven by the large number of moves in that period, and it might be that many of these do not require patrolling. The trend line indicates that reviewers keep up with demand over the summer months of 2017 and into ACTRIAL as well, before starting to taper off. While activity is high during ACTRIAL, we also see the proportion drop well below 100%, sometimes for several days in a row. It is therefore unclear whether the proportion has really changed during the trial.

In order to determine if it has changed, we first look at the first two months of ACTRIAL as a whole. For H9, we compared that period with the same period in 2013-2016. Based on the historical graph of the related measure, we don't see a clear reason to discard any of those years as activity appears to have been fairly stable, so we keep data for both months of all four years. Examining the distribution of the proportion indicates that it does have some skewness, leading us to put more weight on the non-parametric Mann-Whitney test if the tests disagree.

A t-test suggests that we have a significant increase in the proportion during ACTRIAL compared to the previous years. The mean proportion increases from 96.5% to 115.6% (t=-4.460, df=87.768, p << 0.001). The Mann-Whitney test also finds there to be a significant shift (W=4544, p << 0.001). These results indicate that the hypothesis for the related measure is not supported.

We also investigate this using forecasting models in case there might be trends in the time series that we ought to account for. As for H9, we calculate the proportion on a bimonthly basis. Plotting it gives the following graph:

This plot also shows an increase in activity until early 2016, at which point the proportion appears to drop below 100% until mid-2017. Just as we did for H9, we start by investigating stationarity and seasonality in the time series, finding that it appears to be non-stationary, and that it is not clear that it has a seasonal component. R's auto.arima function suggests an ARIMA(0,1,2)(1,0,0)[24] model. We investigate several candidate models, both with and without a seasonal component, but find no improvements. Using the mentioned model to predict the first two months of ACTRIAL gives the following graph:

In the forecast graph, we can see the higher proportion in the initial month and a half of ACTRIAL, after which the proportion drops down below 100%, arguably a regression towards the mean. We can also see that the forecast has a wide confidence interval, and the true proportion always falls within the confidence interval. Given the variation we see in the time series from 2015 onwards, this is not unexpected. This forecast conflicts with our previous finding of a significant increase during ACTRIAL. Given that the forecast takes seasonal and other variation into account, we choose to give it more weight and conclude that the related measure of H9 is supported.

H10: Number of active patrollers will decrease.

H10 hypothesizes that with the lower influx of articles, some patrollers will stop doing new page patrol, leading to a reduction in the number of active patrollers. We did a preliminary analysis of historical data for this in our August 22 work log, and a short follow-up analysis in our January 15 work log. For this analysis, we will focus on the first two months of ACTRIAL, as we have done for other hypotheses. First, let's revisit the graph of historical patroller activity:

The historical graph shows an initial stable period until mid-2013 when more patrollers appear to have been recruited. There's then a steady increase towards early 2015, at which point the number of active patrollers starts fluctuating similarly to Wikipedia's general activity (e.g. there's a reduction in mid-2015). We can also see a clear drop in early 2016, which does not appear to coincide with a particularly active reviewer leaving. If one were to study new page patrol, it might be worth looking into that. We then see high activity levels in late 2016 before the introduction of the reviewer right in November the same year, where the number of active reviewers drops significantly. Lastly, there's another drop which might be when ACTRIAL starts. Let's focus in on the last two years:

The focused graph shows that the reduction in early 2016 appears to have happened in March, while the prolific reviewer left in May, indicating that the two are not directly related. We can again see the higher activity levels in fall of 2016, and the introduction of the reviewer right in November with its large reduction. Number of active patrollers appears to have been quite stable during the first half of 2017. Shortly after ACTRIAL starts, the level makes another drop.

Analyzing this drop is not as straightforward as it has been for many of our other hypotheses due to how the introduction of the reviewer right in late 2016 introduced a completely different system. This means we cannot directly compare ACTRIAL to similar time periods of previous years as we know the levels would be off. We need a different approach, or a different time period, in order to understand if ACTRIAL has had an effect.

In the graph of the two most recent years, it appears that the activity levels have been fairly stable for the first six months of 2017. We therefore choose to use that as a comparison period with the first two months of ACTRIAL. Here we find that there's been a significant decrease in the average number of active patrollers per day during the first two months of ACTRIAL (t-test: t=16.625, df=120.29, p << 0.001; Mann-Whitney U test: W=10502, p << 0.001). Average per day during the first six months of 2017 was 76.7 patrollers, while during the first two months of ACTRIAL it's 57.1 patrollers. This suggests that H10 is supported.

We also investigate this using a forecasting model. As for H9, we measure the number of active patrollers on a bimonthly basis. Historically, the plot looks as follows:

The trends in this graph are similar to what we've seen previously, but arguably they are more pronounced, particularly the drop in the number of active patrollers when the reviewer right was introduced. We can see that prior to the introduction, there were typically at least 500 active reviewers per half-month, while after the reviewer right is introduced that drops to around 250. Future work could look into whether this reduction in reviewers has paid off.

We examine the time series for stationarity and seasonality and find that it's non-stationary, which is to be expected given the longer trends in the data, and that it's unclear whether it has a seasonal component. R's auto.arima function suggests an ARIMA(1,1,1)(0,0,1)[24] model. Because it is unclear whether the time series has a strong seasonal component, we investigate alternative models and find that those without the seasonal component do not perform as well. In that process, we do find that differencing the seasonal component appears to improve fitness, leading us to choose an ARIMA(1,1,1)(0,1,1)[24] model. Using it to forecast the first two months of ACTRIAL gives the following graph:

The graph shows that the true number of active patrollers is lower than forecasted, but well within the confidence interval. One thing to note here is the width of the confidence interval. The variance in the months leading up to ACTRIAL is low, but the confidence interval is still very wide. This might be driven by the shift in mean that happens with the introduction of the reviewer right. We therefore choose to put more weight on the previous analysis that looks at the first two months of ACTRIAL as a whole, indicating that H10 is supported. Future work can look into improving the forecasting models, perhaps with different techniques, in order to understand the shift in November 2016 and how that can be accounted for.

Related measure: Ratio of active patrollers to creations and moves

As for H9, we have a related measure for H10 that looks at the number of active patrollers in relation to the influx of articles to review. Just as we did for H9, we use our dataset of non-autopatrolled article creations combined with moves from User and Draft namespaces. We measure the ratio of creations and moves to active patrollers, as that measure is the average number of reviews each active patroller has to do in order to keep up with demand.

In our August 22 work log, we plotted this historically, so we start by updating our plot with data through the first two months of ACTRIAL:

In the plot, we see the increased average workload early on before the recruitment of more patrollers in mid-2013. From then on the ratio stays fairly constant at or below 7.5 until March 2016. The drop in active patrollers in March 2016 coincides with an increase in the ratio, and then it drops again in mid-2016 and stays low until the introduction of the reviewer right in November. Let's focus in on 2016 and 2017 to make trends in those years easier to see:

The increase in early 2016 is more easily noticed, and we can see how the ratio stays stable during the summer and fall of 2016. When the reviewer right is introduced in November 2016, the ratio just about doubles from five to ten, but the trend suggests a slower increase. We see the ratio staying above ten for much of the first half of 2017 due to the large number of moves happening in that time period. The ratio is fairly stable during the summer and early fall of 2017, then drops significantly when ACTRIAL starts.

As we saw for H10, the analysis of this is complicated by the introduction of the reviewer right. Due to the relative consistency in the first half of 2017, we use the same time period for comparison as we did for H10. During the first half of 2017, the average ratio was 10.4 articles per patroller, while during the first two months of ACTRIAL it dropped to 6.8. This decrease is found to be significant (t-test: t=22.03, df=146.97, p << 0.001; Mann-Whitney U test: W=10881, p << 0.001). This suggests that the related measure of H10 is not supported. In other words, we find that the number of active patrollers has dropped less than what we would expect based on the reduction in creations and moves.

We also investigate this using a forecasting model, again measuring the ratio on a bimonthly basis. The historical plot of this looks as follows:

The trends in this ratio echo those in the previous plot where we calculated the ratio on a daily basis. One thing to note here is that the ratio on a bimonthly basis is much higher than on a daily basis, and that reflects the unevenness of the work distribution among reviewers. Some reviewers do most of the reviews, and measuring across a larger timespan brings that out.
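A toy example can illustrate the mechanical part of this effect: when the same core of patrollers returns day after day, the number of distinct patrollers over a half-month grows more slowly than the number of pages, so the per-period ratio ends up higher than the average daily ratio. The data, names, and function below are all invented for illustration.

```python
def mean_daily_and_period_ratio(log, pages_per_day):
    """Compare pages per active patroller per day against the whole period.

    log: list of (day, patroller) pairs, one per patrol action.
    pages_per_day: {day: creations plus moves needing review that day}.
    """
    patrollers_by_day = {}
    for day, user in log:
        patrollers_by_day.setdefault(day, set()).add(user)
    daily = [pages_per_day[day] / len(users)
             for day, users in patrollers_by_day.items()]
    # Distinct patrollers across the whole period
    distinct = set().union(*patrollers_by_day.values())
    period = sum(pages_per_day.values()) / len(distinct)
    return sum(daily) / len(daily), period


# Patroller A is active every day; B and C only occasionally
log = [(1, "A"), (1, "B"), (2, "A"), (2, "C"), (3, "A"), (3, "B")]
pages = {1: 10, 2: 10, 3: 10}
mean_daily, period = mean_daily_and_period_ratio(log, pages)
print(mean_daily, period)
```

Each day has two active patrollers for 10 pages, a ratio of 5, while the three days together have only three distinct patrollers for 30 pages, a ratio of 10.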

We first investigate stationarity and seasonality in the time series, finding that it is clearly non-stationary and that a seasonal model greatly improves the resulting ACF and PACF. R's auto.arima function suggests an ARIMA(0,1,0)(0,1,0)[24] model. We investigate other candidate models based on what the ACF and PACF suggested, but fail to find any improvements over the auto-generated model. Using it to forecast the first two months of ACTRIAL gives the following graph:

The forecast graph shows how the true ratio is much lower than the forecasted value, and that the confidence interval of the forecast is very wide. We can also see how the ratio is low during the last half of September, and then increases in both October and November. The width of the confidence interval is again somewhat concerning, and is partly due to the shifts in the time series. We therefore choose to give more weight to the previous result that looked at the first two months of ACTRIAL as a whole, suggesting that the related measure is not supported. That being said, we also note that the ratio appears to be increasing during ACTRIAL, suggesting that as ACTRIAL goes on, the participation rates in new page patrol might be decreasing. Follow-up analysis would therefore be useful.

H11: The distribution of patrolling activity evens out.

Our preliminary analysis of historical data for H11 is found in our August 23 work log. During that analysis, we found that using the Gini coefficient to measure this is complicated by it being difficult to interpret the results, leading us to instead measure the proportion of all patrol actions done by the top quartile of active patrollers. If the work was evenly distributed, we would expect them to do about 25% of the work, and if the work is unevenly distributed, they will be doing most of the work. Here's an updated plot of this proportion from October 2012 onwards:

In the graph, we can see that there is quite a bit of fluctuation, but also that the proportion stays high throughout. This means that generally, the most active reviewers do most of the review work. It is not clear that this has changed significantly over time either. Let's focus in on the most recent two years and see if there are inflection points there:

During the recent years, the only place where there might be a clear change is when a very prolific reviewer stops reviewing in May 2016. There we can see the ratio drop from about 85% down to 75%. For the rest of the graph, there is some variation, perhaps particularly in the trend, but we can also see that the ratio tends to stay within the 70–80% band a lot of the time. Lastly, there might be a reduction during ACTRIAL down to an average of 75%, but it is not clear that ACTRIAL introduces a new paradigm as we can see days of similar levels shortly prior to ACTRIAL as well.
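The top-quartile measure used for H11 could be computed along the following lines. This is a sketch under assumptions: the log does not say how the quartile is rounded when the number of active patrollers is not divisible by four, so rounding up is assumed here, and the patroller names and counts are made up.

```python
from collections import Counter
from math import ceil


def top_quartile_share(actions):
    """Share of patrol actions done by the most active quarter of patrollers.

    actions: iterable with one patroller name per patrol action.
    """
    counts = sorted(Counter(actions).values(), reverse=True)
    if not counts:
        return None
    # Size of the top quartile, rounded up (an assumption), at least one
    k = ceil(len(counts) / 4)
    return sum(counts[:k]) / sum(counts)


# Eight made-up patrollers: two of them do 80 of 100 actions
per_user = {"A": 60, "B": 20, "C": 5, "D": 5, "E": 4, "F": 3, "G": 2, "H": 1}
uneven = [user for user, n in per_user.items() for _ in range(n)]
# Eight patrollers doing five actions each
even = [user for user in "ABCDEFGH" for _ in range(5)]

print(top_quartile_share(uneven), top_quartile_share(even))
```

The evenly distributed case gives 25%, the baseline mentioned for H11, while the uneven case gives 80%, close to the levels seen in the graphs.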

Because the proportion has been fairly stable across time, we decide to compare the first two months of ACTRIAL against similar periods of the years 2013-2016. As for other hypotheses, we will also build forecasting models to examine changes from a time series perspective.

We find that looking at the first two months of ACTRIAL as a whole and comparing those to the same months of 2013-2016, there's a significant drop in the proportion during ACTRIAL (t-test: t=6.023, df=99.4, p << 0.001; Mann-Whitney U test: W=10699, p << 0.001). Prior to ACTRIAL, the average proportion was 78.5%, while during ACTRIAL it's 74.1%.

As mentioned above, it is unclear whether there is a significant change during ACTRIAL or whether this reduction is largely expected due to the variation in the data. We therefore also approach this analysis from a time series perspective using a forecasting model. First, we switch to measuring the data on a bimonthly basis, resulting in the following graph:

We can see there is some variation in the proportion across time, to some extent following the general activity trend of Wikipedia (higher participation in the spring and fall, lower participation during the summer and the holidays). There is also the distinct shift in Q2 2016, after which the proportion appears to settle on a new mean around 76% or so. Lastly, we can see that there is a fair amount of variation in the data, which could indicate that the changes during ACTRIAL are within what we would expect.

Examining the stationarity and seasonality of the time series, we find that it is non-stationary, and that a seasonal model improves the ACF and PACF. The model suggested by R's auto.arima function is an ARIMA(1,0,0)(1,0,0)[24] with non-zero mean. Based on our interpretation of the ACF and PACF, we investigate alternative models both with and without integration, but find that none of them are a substantial improvement. We therefore choose the simpler model, resulting in the following forecast graph:

The forecast is higher than the true value, but within the 95% confidence interval. We can see that the proportion during the first two months of ACTRIAL is not much unlike what we saw in fall 2016, indicating that it is perhaps not unexpected. There is also a reasonable amount of variation in the data, but there are no large shifts in the mean as we saw for H10. These factors taken together lead us to conclude that H11 is not supported: the distribution of patrolling activity has not evened out significantly.
