Research talk:Identifying bot accounts/Work log/2014-09-05

Latest comment: 9 years ago by Halfak (WMF)

It's been a few months. I've finally had a chance to sit down with this again. TL;DR: The bot regex works very badly unless we filter to users that make at least one edit -- then it works OK, but we still catch more regular users than we do flagged bots.

mysql:research@analytics-store.eqiad.wmnet [enwiki]> SELECT
    ->     LEFT(user_registration, 6) AS month,
    ->     SUM(user_name RLIKE " ?(bot|Bot|BOt|BOT|bOt|bOT|boT)(\b|$| )") AS name_match,
    ->     SUM(ug_group IS NOT NULL) AS group_match,
    ->     SUM(user_name RLIKE " ?(bot|Bot|BOt|BOT|bOt|bOT|boT)(\b|$| )" AND
    ->         ug_group IS NOT NULL) AS both_match
    -> FROM user
    -> LEFT JOIN user_groups ON
    ->     user_id = ug_user AND
    ->     ug_group = "bot"
    -> WHERE
    ->     user_registration BETWEEN "20140101" AND "20140901"
    -> GROUP BY 1;
+--------+------------+-------------+------------+
| month  | name_match | group_match | both_match |
+--------+------------+-------------+------------+
| 201401 |        100 |           5 |          4 |
| 201402 |         88 |           0 |          0 |
| 201403 |         82 |           0 |          0 |
| 201404 |         73 |           0 |          0 |
| 201405 |         92 |           3 |          3 |
| 201406 |         70 |           1 |          1 |
| 201407 |         73 |           1 |          1 |
| 201408 |        121 |           0 |          0 |
+--------+------------+-------------+------------+
8 rows in set (20.54 sec)

So the regular expression above matches "bot" when it appears at the end of a username, before a space or before some other type of word boundary. This roughly corresponds to Erik's "sounds like a bot" regular expressions. You'll note that we see about two orders of magnitude more bot names than we see bot flags. A quick query suggests, qualitatively, that names captured by this regex do, in fact, sound like a bot.

mysql:research@analytics-store.eqiad.wmnet [enwiki]> SELECT
    ->     user_id,
    ->     user_name,
    ->     ug_group
    -> FROM user
    -> LEFT JOIN user_groups ON
    ->     user_id = ug_user AND
    ->     ug_group = "bot"
    -> WHERE
    ->     user_registration > "20140101" AND
    ->     user_name RLIKE " ?(bot|Bot|BOt|BOT|bOt|bOT|boT)(\b|$| )"
    -> LIMIT 100;
+----------+--------------------------------+----------+
| user_id  | user_name                      | ug_group |
+----------+--------------------------------+----------+
| 20483999 | Monkbot                        | bot      |
| 20499663 | BabyBruiserBot                 | NULL     |
| 20508333 | AnomieBOT III                  | bot      |
| 20611691 | Filedelinkerbot                | bot      |
| 20648241 | OrginBot                       | NULL     |
| 20692668 | Stanfordpandabot               | NULL     |
| 20760835 | ArteBot                        | NULL     |
| 20791540 | Guardianbot                    | NULL     |
| 20792760 | FragBot                        | NULL     |
| 20811948 | Trellbot                       | NULL     |
| 20827123 | FeltonBot                      | NULL     |
<snip>

So I was wondering if maybe there were a lot of bot-like names registered that never made any edits. So I filtered the query to those users who had saved at least 1 edit.

mysql:research@analytics-store.eqiad.wmnet [enwiki]> SELECT
    ->     LEFT(user_registration, 6) AS month,
    ->     SUM(user_name RLIKE " ?(bot|Bot|BOt|BOT|bOt|bOT|boT)(\b|$| )") AS name_match,
    ->     SUM(ug_group IS NOT NULL) AS group_match,
    ->     SUM(user_name RLIKE " ?(bot|Bot|BOt|BOT|bOt|bOT|boT)(\b|$| )" AND
    ->         ug_group IS NOT NULL) AS both_match
    -> FROM user
    -> LEFT JOIN user_groups ON
    ->     user_id = ug_user AND
    ->     ug_group = "bot"
    -> WHERE
    ->     user_registration BETWEEN "20140101" AND "20140901" AND
    ->     user_editcount > 1
    -> GROUP BY 1;
+--------+------------+-------------+------------+
| month  | name_match | group_match | both_match |
+--------+------------+-------------+------------+
| 201401 |         17 |           4 |          3 |
| 201402 |         19 |           0 |          0 |
| 201403 |         18 |           0 |          0 |
| 201404 |         14 |           0 |          0 |
| 201405 |         28 |           3 |          3 |
| 201406 |         16 |           1 |          1 |
| 201407 |         18 |           1 |          1 |
| 201408 |         21 |           0 |          0 |
+--------+------------+-------------+------------+
8 rows in set (10.99 sec)

Now that's much more reasonable. We're still looking at an order of magnitude difference. Let's look at a few qualitatively.

  • User:EraserBot -- blocked
  • User:Gloombot -- Regular user
  • User:Coldwarrobot -- Regular user
  • User:Nixonbot -- blocked
  • User:BabyBruiserBot -- Regular user
  • User:WhiteBOT II -- blocked
  • User:S$annerbot -- Regular user
  • User:GunBot -- blocked

So, it looks about 50/50 regular users and blocked users. --Halfak (WMF) (talk) 19:01, 5 September 2014 (UTC)Reply


DAndreescu had a good suggestion. We can use the ipblocks table to filter out users who were blocked. He proposed the following query:

SELECT LEFT(user_registration, 6) AS month,
        SUM(user_name RLIKE " ?([bB][oO][tT])(\b|$| )") as regex,
        SUM(ug_group IS NOT NULL)  AS group_match,
        SUM(user_name RLIKE " ?([bB][oO][tT])(\b|$| )" AND
            ug_group IS NOT NULL)  AS group_and_regex,
        SUM(user_name RLIKE " ?([bB][oO][tT])(\b|$| )" AND
            ipb_user IS NULL)      AS not_blocked_and_regex
   FROM user
                LEFT JOIN
        user_groups     ON user_id = ug_user
                        AND ug_group = "bot"
                LEFT JOIN
        ipblocks        ON user_id = ipb_user
  WHERE user_registration BETWEEN "20140101" AND "20140901"
    AND user_editcount > 1
  GROUP BY 1;

month	regex	group_match	group_and_regex	not_blocked_and_regex
201401	17	4		3		8
201402	19	0		0		9
201403	18	0		0		12
201404	14	0		0		9
201405	28	3		3		19
201406	16	1		1		8
201407	18	1		1		3
201408	21	0		0		10

So, I took this query, ran it for ALL THE WIKIS for the last year (Sept 2013 - Sept 2014). #Bot counts by matching strategy plots the top 25 wikis by number of regex matched users.

Count of bots by matching strategy registered between Sept. 2013 and Sept. 2014 for the top 25 wikis by count of regex-matched bot users.
Bot counts by matching strategy. Count of bots by matching strategy registered between Sept. 2013 and Sept. 2014 for the top 25 wikis by count of regex-matched bot users.

We can see two things here. All of the top wikis saw user accounts registered that were given the bot flag -- so presumably the bot flag is being used. We can also see that there's often an order of magnitude more active, non-blocked user accounts that fit the regex criteria. --Halfak (WMF) (talk) 19:13, 5 September 2014 (UTC)Reply

Return to "Identifying bot accounts/Work log/2014-09-05" page.