Research talk:Autoconfirmed article creation trial/Work log/2017-08-08

Tuesday, August 8, 2017

Today I'll be working on visualizing the historical data on the number of accounts registered over time.

Number of registered accounts

Yesterday I wrote a Python script to gather data on the number of registered accounts from the logging table, and I have started looking into visualizing it. The first step is to visualize the entirety of the data, which gives us the following plot:

 
Historical data on the number of registered accounts per day on the English Wikipedia.
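The gathering script itself is not shown in this log. As a rough sketch of the aggregation step only, the snippet below counts log entries per day and per action. It assumes the rows have already been fetched from the logging table as (timestamp, action) pairs with MediaWiki-style 'YYYYMMDDHHMMSS' timestamps; the function name and the sample rows are made up for illustration.

```python
from collections import Counter, defaultdict


def daily_counts(rows):
    """Count account-creation log entries per day and per action.

    rows: iterable of (log_timestamp, log_action) pairs, where the
    timestamp is assumed to be a 'YYYYMMDDHHMMSS' string.
    """
    counts = defaultdict(Counter)
    for timestamp, action in rows:
        # Keep only the date part of the timestamp.
        day = f"{timestamp[0:4]}-{timestamp[4:6]}-{timestamp[6:8]}"
        counts[day][action] += 1
    return {day: dict(per_action) for day, per_action in counts.items()}


# Made-up rows for illustration only, not real log data.
sample_rows = [
    ("20170801101500", "create"),
    ("20170801113000", "autocreate"),
    ("20170801235959", "create"),
    ("20170802000001", "byemail"),
]
result = daily_counts(sample_rows)
```

Each day then maps to a small dictionary of counts per action, which is the shape the later sketches on this page assume.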

The above graph shows the number of accounts registered per day since registrations started being recorded in the logging table. Since then, five different codes have been used to log registrations:

newusers
A new account was created. This is shown as the total number of accounts before the introduction of "create" and "create2" in 2006 (see also the notes below regarding how the total is calculated). After that point, this code is not in use.
create
A new account was created (i.e. the "normal" way of creating an account). Introduced in 2006.
create2
Someone with account creation rights created an account for someone else (e.g. an instructor at an edit-a-thon). Introduced in 2006.
autocreate
A user with an account in a different Wikipedia edition visited the English Wikipedia and an account was automatically created for them. Introduced in 2008.
byemail
Someone requested an account and it was created for them, with the password being emailed to them. These appear to be typically handled through the Request an account page. Introduced in 2013.

We can recognize the introduction of the various logged actions in the plot, e.g. logging of "autocreate" accounts (accounts created when someone from another Wikipedia edition comes to the English one) starts in 2008. We can also see that data is missing for some periods. The total number of accounts is counted differently over time: at the beginning of the dataset it is the only action logged ("newusers", per above), whereas once the other types are introduced, we count the total as the sum of all those other types of account creation.
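The rule for the total can be sketched as below. This is only an illustration of the description above, not the actual script: the function name is hypothetical, and how a day is handled if it contains both "newusers" and the newer codes is an assumption (here the newer codes take precedence).

```python
# The four codes that replaced "newusers" as described above.
NEWER_ACTIONS = ("create", "create2", "autocreate", "byemail")


def total_accounts(day_counts):
    """Return the total number of accounts registered on one day.

    day_counts: dict mapping a log action to its count for that day.
    """
    newer_total = sum(day_counts.get(action, 0) for action in NEWER_ACTIONS)
    if newer_total > 0:
        # Assumption: once any newer code appears, the total is their sum.
        return newer_total
    # Early part of the dataset: "newusers" is the only action logged.
    return day_counts.get("newusers", 0)
```

For example, a day with only `{"newusers": 1200}` gives 1200, while a day with 5000 "create", 30 "create2", 2000 "autocreate" and 35 "byemail" entries gives 7065 (made-up numbers).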

The number of accounts created for others ("create2"), or created with the password sent by email ("byemail"), is fairly stable from 2014 onwards. The box plot below visualizes this stability using data from Jan 1, 2014 onwards:

 
Box plot of number of accounts created per day for accounts getting passwords through email ("byemail") and those created for others ("create2").

The maximum number of accounts created with each method differs somewhat. The maximum number of accounts that got their passwords through email ("byemail") on a single day is 249, and there are many days where the number was over 100. By comparison, the maximum for accounts created for others ("create2") is 167, and only four days have more than 100 such accounts.

The median number of accounts created per day with both methods is fairly similar: 33 accounts per day for "byemail", and 31 accounts per day for "create2".
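The figures quoted above (median, maximum, and number of days over 100) are simple summaries of a daily series. A minimal sketch of how they could be computed is shown here; the function name and the sample values are made up and are not the real data.

```python
from statistics import median


def summarize(daily_values, threshold=100):
    """Summarize a series of daily account counts."""
    return {
        "median": median(daily_values),
        "max": max(daily_values),
        # Number of days strictly above the threshold.
        "days_above": sum(1 for value in daily_values if value > threshold),
    }


# Made-up daily counts for illustration only.
sample = [28, 31, 33, 35, 40, 120, 167]
summary = summarize(sample)
```

Running this once per method ("byemail" and "create2") on data from Jan 1, 2014 onwards would give the kind of numbers discussed above.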

From the initial plot we can also see that there is an increase in the number of accounts created starting in the second half of 2014 through the first half of 2015. This is more easily seen if we only plot the auto-created accounts and the rest, and use a linear Y-axis:

 
Plot of the number of accounts created and auto-created.

In the plot above it is easy to see that the number of accounts created almost doubled in early August 2014, and a higher level of activity continues through the first half of 2015. This appears to be partly caused by the process of unifying accounts across projects, as described on SUL finalisation. It is probably also caused by the Wikipedia mobile app requesting that users create an account. Unfortunately, the information in the logging table does not distinguish between accounts registered through the app and those registered through other means. The disappearance of auto-creation data for several months in 2011 is also clear here. Based on the fairly stable level of activity of the other types of creations, we might be able to use that to extrapolate numbers if need be.

Lastly, we are interested in understanding to what extent new account registrations are caused by visits from users who already have an account on another wiki. We calculate the proportion (in percent) of newly registered accounts that are auto-created, and plot that from Jan 1, 2009 onwards:

 
Plot of the proportion of registered accounts per day that were auto-created.

While more analysis is needed to have definite answers, it appears that the proportion is slowly but surely increasing. Back in 2009–2010, the proportion rarely goes above 20%, while since 2016 it appears to consistently be in the 30–40% range. If we also look at the previous plot of number of accounts registered, we see higher activity in 2009 and 2010 compared to 2016 and so far in 2017. This could suggest that Wikipedia is seeing a lower influx of users that are not already participants in another project, but further study is needed to understand more about this.
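The proportion plotted above is the auto-created share of all accounts registered on a day, in percent. A small sketch of that calculation, under the assumption that the daily total is the sum of the four post-2006 codes (the function name is hypothetical):

```python
ACTIONS = ("create", "create2", "autocreate", "byemail")


def autocreate_percent(day_counts):
    """Percentage of a day's registered accounts that were auto-created.

    Returns None for days without any registrations, so that gaps in
    the data are not plotted as 0%.
    """
    total = sum(day_counts.get(action, 0) for action in ACTIONS)
    if total == 0:
        return None
    return 100.0 * day_counts.get("autocreate", 0) / total
```

With made-up counts of 60 "create", 5 "create2", 5 "byemail" and 30 "autocreate", the result is 30.0.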
