Research talk:Identifying bot accounts/Work log/2014-09-05
It's been a few months. I've finally had a chance to sit down with this again. TL;DR: The bot regex works very badly unless we filter to users that make at least one edit -- then it works OK, but we still catch more regular users than we do flagged bots.
mysql:research@analytics-store.eqiad.wmnet [enwiki]> SELECT -> LEFT(user_registration, 6) AS month, -> SUM(user_name RLIKE " ?(bot|Bot|BOt|BOT|bOt|bOT|boT)(\b|$| )") AS name_match, -> SUM(ug_group IS NOT NULL) AS group_match, -> SUM(user_name RLIKE " ?(bot|Bot|BOt|BOT|bOt|bOT|boT)(\b|$| )" AND -> ug_group IS NOT NULL) AS both_match -> FROM user -> LEFT JOIN user_groups ON -> user_id = ug_user AND -> ug_group = "bot" -> WHERE -> user_registration BETWEEN "20140101" AND "20140901" -> GROUP BY 1; +--------+------------+-------------+------------+ | month | name_match | group_match | both_match | +--------+------------+-------------+------------+ | 201401 | 100 | 5 | 4 | | 201402 | 88 | 0 | 0 | | 201403 | 82 | 0 | 0 | | 201404 | 73 | 0 | 0 | | 201405 | 92 | 3 | 3 | | 201406 | 70 | 1 | 1 | | 201407 | 73 | 1 | 1 | | 201408 | 121 | 0 | 0 | +--------+------------+-------------+------------+ 8 rows in set (20.54 sec)
So the regular expression above matches "bot" when it appears at the end of a username, before a space or before some other type of word boundary. This roughly corresponds to Erik's "sounds like a bot" regular expressions. You'll note that we see about two orders of magnitude more bot names than we see bot flags. A quick query suggests, qualitatively, that names captured by this regex do, in fact, sound like a bot.
mysql:research@analytics-store.eqiad.wmnet [enwiki]> SELECT -> user_id, -> user_name, -> ug_group -> FROM user -> LEFT JOIN user_groups ON -> user_id = ug_user AND -> ug_group = "bot" -> WHERE -> user_registration > "20140101" AND -> user_name RLIKE " ?(bot|Bot|BOt|BOT|bOt|bOT|boT)(\b|$| )" -> LIMIT 100; +----------+--------------------------------+----------+ | user_id | user_name | ug_group | +----------+--------------------------------+----------+ | 20483999 | Monkbot | bot | | 20499663 | BabyBruiserBot | NULL | | 20508333 | AnomieBOT III | bot | | 20611691 | Filedelinkerbot | bot | | 20648241 | OrginBot | NULL | | 20692668 | Stanfordpandabot | NULL | | 20760835 | ArteBot | NULL | | 20791540 | Guardianbot | NULL | | 20792760 | FragBot | NULL | | 20811948 | Trellbot | NULL | | 20827123 | FeltonBot | NULL | <snip>
So I was wondering if maybe there were a lot of bot-like names registered that never made any edits. So I filtered the query to those users who had saved at least 1 edit.
mysql:research@analytics-store.eqiad.wmnet [enwiki]> SELECT -> LEFT(user_registration, 6) AS month, -> SUM(user_name RLIKE " ?(bot|Bot|BOt|BOT|bOt|bOT|boT)(\b|$| )") AS name_match, -> SUM(ug_group IS NOT NULL) AS group_match, -> SUM(user_name RLIKE " ?(bot|Bot|BOt|BOT|bOt|bOT|boT)(\b|$| )" AND -> ug_group IS NOT NULL) AS both_match -> FROM user -> LEFT JOIN user_groups ON -> user_id = ug_user AND -> ug_group = "bot" -> WHERE -> user_registration BETWEEN "20140101" AND "20140901" AND -> user_editcount > 1 -> GROUP BY 1; +--------+------------+-------------+------------+ | month | name_match | group_match | both_match | +--------+------------+-------------+------------+ | 201401 | 17 | 4 | 3 | | 201402 | 19 | 0 | 0 | | 201403 | 18 | 0 | 0 | | 201404 | 14 | 0 | 0 | | 201405 | 28 | 3 | 3 | | 201406 | 16 | 1 | 1 | | 201407 | 18 | 1 | 1 | | 201408 | 21 | 0 | 0 | +--------+------------+-------------+------------+ 8 rows in set (10.99 sec)
Now that's much more reasonable. We're still looking at an order of magnitude difference. Let's look at a few qualitatively.
- User:EraserBot -- blocked
- User:Gloombot -- Regular user
- User:Coldwarrobot -- Regular user
- User:Nixonbot -- blocked
- User:BabyBruiserBot -- Regular user
- User:WhiteBOT II -- blocked
- User:S$annerbot -- Regular user
- User:GunBot -- blocked
So, it looks about 50/50 regular users and blocked users. --Halfak (WMF) (talk) 19:01, 5 September 2014 (UTC)
DAndreescu had a good suggestion. We can use the ipblocks table to filter out users who were blocked. He proposed the following query:
SELECT LEFT(user_registration, 6) AS month, SUM(user_name RLIKE " ?([bB][oO][tT])(\b|$| )") as regex, SUM(ug_group IS NOT NULL) AS group_match, SUM(user_name RLIKE " ?([bB][oO][tT])(\b|$| )" AND ug_group IS NOT NULL) AS group_and_regex, SUM(user_name RLIKE " ?([bB][oO][tT])(\b|$| )" AND ipb_user IS NULL) AS not_blocked_and_regex FROM user LEFT JOIN user_groups ON user_id = ug_user AND ug_group = "bot" LEFT JOIN ipblocks ON user_id = ipb_user WHERE user_registration BETWEEN "20140101" AND "20140901" AND user_editcount > 1 GROUP BY 1; month regex group_match group_and_regex not_blocked_and_regex 201401 17 4 3 8 201402 19 0 0 9 201403 18 0 0 12 201404 14 0 0 9 201405 28 3 3 19 201406 16 1 1 8 201407 18 1 1 3 201408 21 0 0 10
So, I took this query, ran it for ALL THE WIKIS for the last year (Sept 2013 - Sept 2014). #Bot counts by matching strategy plots the top 25 wikis by number of regex matched users.
We can see two things here. All of the top wikis saw user accounts registered that were given the bot flag -- so presumably the bot flag is being used. We can also see that there's often an order of magnitude more active, non-blocked user accounts that fit the regex criteria. --Halfak (WMF) (talk) 19:13, 5 September 2014 (UTC)