Research talk:Autoconfirmed article creation trial/Work log/2017-08-21
Monday, August 21, 2017
editToday I'll be working on instrumenting the page creation log data, documenting the various historic data sources and datasets we've gathered, and continue our data gathering process.
Page creation log data
editPage creation events are now logged, ref T150369. We want to use this data to gather datasets and visualize it in order to better understand page creation. It will also be useful for H9, H10, H14, and H15.
Our main concern is the main namespace, and it appears that the table only logs creations. We will therefore have to augment this data with data on page moves into the main namespace in order to better track "article publications". Since we're already doing that, it's not a problem.
The table allows us to do some useful segmentation:
page_is_redirect
column allows us to filter out all articles that are created as a redirect (e.g. when a page is moved)performer_user_groups
allows us to identify users who have "autoconfirmed" or "autopatrolled" status, and conversely identify users without those rights.performer_user_is_bot
allows us to remove any bot-created articlesrev_minor_edit
flags the edit as a minor edit (this flag appears to never be in use for page creations, though)
In our case, I think we are mainly interested in the following metrics (all regarding pages created in the main namespace (page_namespace=0
), where the user is not flagged as a bot (performer_user_is_bot=0
)):
- Total number of pages created
- Number of pages created that are not redirects
- Number of non-redirect pages created where the user has autoconfirmed status
- Number of non-redirect pages created where the user has autopatrolled status
- Number of non-redirect pages created where the user is neither autoconfirmed nor autopatrolled
I'll write a SQL query to get that.
SELECT DATE(rev_timestamp) AS log_date, COUNT(*) AS num_created_pages, SUM(IF(page_is_redirect = 0, 1, 0)) AS num_non_redirects, SUM(IF(page_is_redirect = 0 AND LOCATE('autoconfirmed', performer_user_groups) > 0, 1, 0)) AS num_autoconfirmed, SUM(IF(page_is_redirect = 0 AND LOCATE('autopatrol', performer_user_groups) > 0, 1, 0)) AS num_autopatrolled, SUM(IF(page_is_redirect = 0 AND LOCATE('autoconfirmed', performer_user_groups) = 0, 1, 0)) AS num_non_autoconfirmed FROM mediawiki_page_create_2 WHERE `database`='enwiki' AND page_namespace = 0 AND performer_user_is_bot = 0 GROUP BY log_date;