Research:Autoconfirmed article creation trial/Datasets

Counts of account creations

We have created a dataset with an overview of the number of accounts created per day from Sept 8, 2005 onwards. This date is the first full day containing log data on account creations, as the first timestamp of such a creation is 2005-09-07 22:16:49 UTC. The dataset is available from account_registration_counts.tsv and is in a tab-separated values format with six columns:

na_date
Date in YYYY-MM-DD format.
na_newusers
Total number of new accounts registered on the given date. Up until 2006-04-16, this is the number of log entries with log_type='newusers' in the log table. From 2006-04-17 onwards, it is the sum of the other four columns of account creations.
na_autocreate
Number of auto-created accounts.
na_byemail
Number of accounts created where the password was sent to the user through email.
na_create
Number of accounts created by the users themselves.
na_create2
Number of accounts created for others by someone with account creation rights.

For more information about the four different types of log events associated with account creation, see the MediaWiki manual on log actions. This dataset is updated daily around 03:00 UTC.
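The relationship between na_newusers and the four per-type columns can be checked mechanically. The following is a minimal sketch, assuming the file starts with a header row that uses the column names listed above and that the per-type columns may be empty before 2006-04-17; the function name find_inconsistent_rows is hypothetical and used only for illustration.

```python
import csv
from datetime import date

# Column names as documented above; the presence of a header row is an assumption.
TYPE_COLUMNS = ("na_autocreate", "na_byemail", "na_create", "na_create2")
SUM_START = date(2006, 4, 17)  # first date on which na_newusers is a sum


def find_inconsistent_rows(lines):
    """Return dates (from SUM_START on) where na_newusers differs from
    the sum of the four per-type columns."""
    mismatches = []
    for row in csv.DictReader(lines, delimiter="\t"):
        day = date.fromisoformat(row["na_date"])
        if day < SUM_START:
            continue  # earlier rows only carry the total
        total = sum(int(row[col]) for col in TYPE_COLUMNS)
        if int(row["na_newusers"]) != total:
            mismatches.append(row["na_date"])
    return mismatches


# Usage against a downloaded copy of the file:
# with open("account_registration_counts.tsv", newline="", encoding="utf-8") as f:
#     print(find_inconsistent_rows(f))
```

An empty list means every row from 2006-04-17 onwards is internally consistent.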

Size of New Page Patrol backlog

We have been checking the size of the New Page Patrol backlog periodically from Aug 29, 2017 onwards. The dataset is available from npp_backlog_size.tsv and is in a tab-separated values format with two columns:

npp_timestamp
Timestamp of when the snapshot of the backlog was taken, in YYYY-MM-DD HH:MM:SS format.
npp_num_articles
Number of articles in the backlog.

The published file is updated daily around 03:00 UTC. As the timestamps in the file show, the underlying snapshots are taken four times a day.
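Because there are several snapshots per day, analyses at daily resolution need to pick one value per date. A minimal sketch that keeps the latest snapshot of each day, assuming a header row with the two column names above; the function name daily_last_snapshot is hypothetical, and taking the last snapshot (rather than, say, the daily mean) is just one possible choice.

```python
import csv
from datetime import datetime


def daily_last_snapshot(lines):
    """Map each date (YYYY-MM-DD) to the backlog size of its latest snapshot."""
    latest = {}  # date string -> (snapshot timestamp, article count)
    for row in csv.DictReader(lines, delimiter="\t"):
        ts = datetime.strptime(row["npp_timestamp"], "%Y-%m-%d %H:%M:%S")
        day = ts.date().isoformat()
        if day not in latest or ts > latest[day][0]:
            latest[day] = (ts, int(row["npp_num_articles"]))
    return {day: count for day, (_, count) in latest.items()}


# Usage against a downloaded copy of the file:
# with open("npp_backlog_size.tsv", newline="", encoding="utf-8") as f:
#     per_day = daily_last_snapshot(f)
```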

Patroller activity

We have two datasets related to participants in New Page Patrol (NPP). The first contains the number of active patrollers and the number of patrol actions performed per day. This dataset is available from patrol_summaries.tsv and is in a tab-separated values format with three columns:

ps_date
Date in YYYY-MM-DD format.
ps_num_patrollers
Number of NPP patrollers who marked articles as patrolled on the given day.
ps_num_patrol_actions
Total number of articles marked as patrolled on the given day.

The second dataset contains information on the number of articles marked as patrolled by a given NPP patroller on a given day. This dataset is available from patrollers.tsv and is in a tab-separated values format with three columns:

pat_date
Date in YYYY-MM-DD format.
pat_userid
User ID of the user who patrolled articles.
pat_num_actions
Number of articles marked as patrolled by the given user on the given day.

Both datasets are updated daily around 03:00 UTC.
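The per-user file can be rolled up to daily totals in the shape of the summary file. A minimal sketch, assuming a header row with the column names above; the function name summarize_patrollers is hypothetical, and the page does not state that the two files are derived from each other, so the results may not match patrol_summaries.tsv exactly.

```python
import csv
from collections import defaultdict


def summarize_patrollers(lines):
    """Aggregate per-user rows into {date: (num_patrollers, num_patrol_actions)}."""
    users = defaultdict(set)    # date -> distinct user IDs seen that day
    actions = defaultdict(int)  # date -> total patrol actions that day
    for row in csv.DictReader(lines, delimiter="\t"):
        day = row["pat_date"]
        users[day].add(row["pat_userid"])
        actions[day] += int(row["pat_num_actions"])
    return {day: (len(users[day]), actions[day]) for day in actions}


# Usage against a downloaded copy of the file:
# with open("patrollers.tsv", newline="", encoding="utf-8") as f:
#     summary = summarize_patrollers(f)
```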

Page creation

We created a dashboard that displays graphs of article creation statistics; it is available from page-creation.wmflabs.org. The datasets behind the dashboard are also publicly available and are accessible from here. A number of datasets are available, some of them for all Wikimedia wikis. For the purposes of ACTRIAL, the datasets we are interested in are:

pagecreations/enwiki.tsv
Total number of pages created.
pagecreations_main/enwiki.tsv
Number of pages created in the Main namespace.
pagecreations_main_noredirects/enwiki.tsv
Number of non-redirecting pages created in the Main namespace.
pagecreations_main_bots/enwiki.tsv
Number of pages created in the Main namespace by user accounts identified as bots.
pagecreations_main_autopatrolled/enwiki.tsv
Number of pages created in the Main namespace by users in a user group that has autopatrolled rights.
pagecreations_main_autoconfirmed/enwiki.tsv
Number of pages created in the Main namespace by users with autoconfirmed status.
pagecreations_main_non-autoconfirmed/enwiki.tsv
Number of pages created in the Main namespace by users that do not have autoconfirmed status.
pagecreations_draft/enwiki.tsv
Number of pages created in the Draft namespace.

All these datasets count the number of pages created per day starting from 2017-07-21. They are tab-separated and contain two columns: the first column is the date, the second column is the number of pages. These datasets are updated daily, typically at 01:00 UTC.
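Since every file shares the same two-column layout, one reader can serve all of them, and two files can be combined into a daily proportion. A minimal sketch under stated assumptions: the page does not say whether the files have a header row or how the date is formatted, so the reader skips any row whose second column is not a whole number and treats the date as an opaque string. The function names read_daily_counts and daily_share are hypothetical.

```python
import csv


def read_daily_counts(lines):
    """Parse a two-column TSV (date, page count) into a {date: count} dict."""
    counts = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 2 or not row[1].strip().isdigit():
            continue  # skips blank lines and a header row, if one is present
        counts[row[0]] = int(row[1])
    return counts


def daily_share(part, whole):
    """Per-day fraction part/whole for dates in both, skipping zero totals."""
    return {day: part[day] / whole[day] for day in part if whole.get(day)}


# Usage against downloaded copies of two of the files listed above:
# with open("pagecreations_main/enwiki.tsv", newline="", encoding="utf-8") as f:
#     main = read_daily_counts(f)
# with open("pagecreations_main_non-autoconfirmed/enwiki.tsv", newline="", encoding="utf-8") as f:
#     non_autoconfirmed = read_daily_counts(f)
# share = daily_share(non_autoconfirmed, main)
```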