Research:Autoconfirmed article creation trial/Datasets
This page documents the datasets gathered and published as part of the autoconfirmed article creation trial.
Counts of account creations
We have created a dataset with an overview of the number of accounts created per day from Sept 8, 2005 onwards. This date is the first full day containing log data on account creations, as the first timestamp of such a creation is 2005-09-07 22:16:49 UTC. The dataset is available from account_registration_counts.tsv and is in tab-separated values format with six columns:
- na_date
- Date in YYYY-MM-DD format.
- na_newusers
- Total number of new accounts registered on the given date. Up until 2006-04-16, this is the number of log entries with log_type='newusers' in the log table. From 2006-04-17 onwards, it is the sum of the other four columns of account creations.
- na_autocreate
- Number of auto-created accounts.
- na_byemail
- Number of accounts created where the password was sent to the user through email.
- na_create
- Number of created accounts.
- na_create2
- Number of accounts created by someone with account creation rights for others.
For more information about the four different types of log events associated with account creation, see the MediaWiki manual on log actions. This dataset is updated daily around 03:00 UTC.
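The relationship between na_newusers and the four type columns can be checked with a short script. The following is a minimal sketch, assuming the file starts with a header row that uses the column names listed above; the function name and the sample rows are made up for illustration and are not real data.

```python
import csv
import io
from datetime import date

# First date on which na_newusers is described as the sum of the other columns.
SUM_START = date(2006, 4, 17)
TYPE_COLUMNS = ("na_autocreate", "na_byemail", "na_create", "na_create2")


def check_registration_totals(tsv_file):
    """Return the dates where na_newusers differs from the sum of the type columns."""
    mismatches = []
    # Assumption: the first row of the file holds the column names.
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        if date.fromisoformat(row["na_date"]) < SUM_START:
            continue  # earlier rows only carry the raw 'newusers' count
        total = sum(int(row[col]) for col in TYPE_COLUMNS)
        if total != int(row["na_newusers"]):
            mismatches.append(row["na_date"])
    return mismatches


# Made-up rows; the last one is deliberately inconsistent.
sample = io.StringIO(
    "na_date\tna_newusers\tna_autocreate\tna_byemail\tna_create\tna_create2\n"
    "2006-04-16\t10\t0\t0\t0\t0\n"
    "2006-04-17\t10\t1\t2\t6\t1\n"
    "2006-04-18\t9\t1\t2\t6\t1\n"
)
print(check_registration_totals(sample))  # ['2006-04-18']
```

To run the check on the published file, pass an open file handle instead of the sample, for example `open("account_registration_counts.tsv", newline="")`.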
Size of New Page Patrol backlog
We have been checking the size of the New Page Patrol backlog periodically from Aug 29, 2017 onwards. The dataset is available from npp_backlog_size.tsv and is in a tab-separated values format with two columns:
- npp_timestamp
- Timestamp of when the snapshot of the backlog was taken, in YYYY-MM-DD HH:MM:SS format.
- npp_num_articles
- Number of articles in the backlog.
This dataset is published daily around 03:00 UTC. As the timestamps in the file show, the underlying snapshots are taken four times a day.
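Because the file holds several snapshots per day, an analysis by day first has to reduce them to one value. The sketch below keeps the latest snapshot of each day; it assumes a header row with the two column names above, and the function name and sample rows are hypothetical.

```python
import csv
import io
from datetime import datetime


def daily_backlog(tsv_file):
    """Map each date to the backlog size at the latest snapshot taken that day."""
    latest = {}
    # Assumption: the first row of the file holds the column names.
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        taken = datetime.strptime(row["npp_timestamp"], "%Y-%m-%d %H:%M:%S")
        size = int(row["npp_num_articles"])
        day = taken.date()
        # Keep a snapshot only if it is the first or the newest seen for that day.
        if day not in latest or taken > latest[day][0]:
            latest[day] = (taken, size)
    return {day.isoformat(): size for day, (_, size) in sorted(latest.items())}


# Made-up snapshots, deliberately out of order.
sample = io.StringIO(
    "npp_timestamp\tnpp_num_articles\n"
    "2017-08-29 00:00:00\t100\n"
    "2017-08-29 18:00:00\t120\n"
    "2017-08-29 06:00:00\t110\n"
    "2017-08-30 00:00:00\t125\n"
)
print(daily_backlog(sample))  # {'2017-08-29': 120, '2017-08-30': 125}
```

Taking the latest snapshot is only one possible choice; a daily mean or maximum may suit other analyses better.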
Patroller activity
We have two datasets related to participants in New Page Patrol (NPP). The first contains the number of active patrollers and the number of patrol actions performed per day. This dataset is available from patrol_summaries.tsv and is in a tab-separated values format with three columns:
- ps_date
- Date in YYYY-MM-DD format.
- ps_num_patrollers
- Number of NPP patrollers that marked articles as patrolled on the given day.
- ps_num_patrol_actions
- Total number of articles marked as patrolled on the given day.
The second dataset contains information on the number of articles marked as patrolled by a given NPP patroller on a given day. This dataset is available from patrollers.tsv and is in a tab-separated values format with three columns:
- pat_date
- Date in YYYY-MM-DD format.
- pat_userid
- User ID of the user who patrolled articles.
- pat_num_actions
- Number of articles marked as patrolled by the given user on the given day.
Both datasets are updated daily around 03:00 UTC.
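Given the column descriptions, the per-patroller file can be aggregated into the same shape as the daily summaries, which is a convenient way to compare the two. This is a sketch under two assumptions: the file has a header row with the column names above, and the aggregated values are expected (not guaranteed) to match patrol_summaries.tsv. The function name and sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def summarize_patrollers(tsv_file):
    """Map each date to (number of distinct patrollers, total patrol actions)."""
    users = defaultdict(set)
    actions = defaultdict(int)
    # Assumption: the first row of the file holds the column names.
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        users[row["pat_date"]].add(row["pat_userid"])
        actions[row["pat_date"]] += int(row["pat_num_actions"])
    return {day: (len(users[day]), actions[day]) for day in sorted(users)}


# Made-up rows for illustration only.
sample = io.StringIO(
    "pat_date\tpat_userid\tpat_num_actions\n"
    "2017-09-01\t11\t5\n"
    "2017-09-01\t12\t3\n"
    "2017-09-02\t11\t2\n"
)
print(summarize_patrollers(sample))
# {'2017-09-01': (2, 8), '2017-09-02': (1, 2)}
```

Each tuple corresponds to the ps_num_patrollers and ps_num_patrol_actions columns for that date.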
Page creation
We created a dashboard that displays graphs of article creation statistics; it is available from page-creation.wmflabs.org. The dashboard displays data from datasets that are also publicly available; they are accessible from here. A number of datasets are available, and some of them cover all Wikimedia wikis. For the purposes of ACTRIAL, the datasets we are interested in are:
- pagecreations/enwiki.tsv
- Total number of pages created.
- pagecreations_main/enwiki.tsv
- Number of pages created in the Main namespace.
- pagecreations_main_noredirects/enwiki.tsv
- Number of non-redirecting pages created in the Main namespace.
- pagecreations_main_bots/enwiki.tsv
- Number of pages created in the Main namespace by user accounts identified as bots.
- pagecreations_main_autopatrolled/enwiki.tsv
- Number of pages created in the Main namespace by users in a user group that has autopatrolled rights.
- pagecreations_main_autoconfirmed/enwiki.tsv
- Number of pages created in the Main namespace by users with autoconfirmed status.
- pagecreations_main_non-autoconfirmed/enwiki.tsv
- Number of pages created in the Main namespace by users that do not have autoconfirmed status.
- pagecreations_draft/enwiki.tsv
- Number of pages created in the Draft namespace.
All these datasets count the number of pages created per day starting from 2017-07-21. They are tab-separated and contain two columns: the first column is the date, the second column is the number of pages. These datasets are updated daily, typically at 01:00 UTC.
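Since all of these files share the same two-column layout, one reader can serve for all of them. The sketch below reads two such files and computes the daily share of Main namespace creations made by non-autoconfirmed users. Whether the files carry a header row is not stated here, so the reader simply skips any row whose second field is not an integer; the function names and sample values are hypothetical.

```python
import csv
import io


def read_counts(tsv_file):
    """Read a two-column (date, count) TSV into a dict keyed by the date string."""
    counts = {}
    for row in csv.reader(tsv_file, delimiter="\t"):
        if len(row) < 2:
            continue
        try:
            counts[row[0]] = int(row[1])
        except ValueError:
            continue  # assumption: a non-numeric count means a header row
    return counts


def non_autoconfirmed_share(autoconfirmed, non_autoconfirmed):
    """Share of creations by non-autoconfirmed users, per date present in both."""
    shares = {}
    for day in sorted(autoconfirmed.keys() & non_autoconfirmed.keys()):
        total = autoconfirmed[day] + non_autoconfirmed[day]
        if total:  # skip days without any creations
            shares[day] = non_autoconfirmed[day] / total
    return shares


# Made-up values standing in for the autoconfirmed and non-autoconfirmed files.
ac = read_counts(io.StringIO("date\tcount\n2017-07-21\t300\n"))
nac = read_counts(io.StringIO("date\tcount\n2017-07-21\t100\n"))
print(non_autoconfirmed_share(ac, nac))  # {'2017-07-21': 0.25}
```

The dates are kept as plain strings because the exact date format of these files is not specified above.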