Research:Identifying bot accounts

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

Robotic editing represents a major class of automated editing within Wikipedia projects. While English Wikipedia has a strong norm for flagging bot users via the user_groups table, it's unclear whether other language editions follow a similar pattern. For this reason, some alternative strategies have been explored to detect robots broadly. In this report, we'll analyze and discuss the differences between a few of the most common bot identification strategies.

Strategies

Bot flag

Many wikis make use of a user group to flag and track the activities of bot accounts, e.g. user_groups.ug_group = "bot". If Wikimedia communities use this flag consistently, then one could use the user_groups table to identify bot accounts efficiently.
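As a minimal sketch of the flag lookup, the query below joins a toy copy of the user and user_groups tables held in an in-memory SQLite database. The column names (user_id, user_name, ug_user) are assumed to follow the MediaWiki schema, and the rows are invented for illustration.

```python
import sqlite3

# Toy stand-in for the MediaWiki `user` and `user_groups` tables.
# All rows are invented; 'UnflaggedBot' deliberately carries no bot flag.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (user_id INTEGER PRIMARY KEY, user_name TEXT);
    CREATE TABLE user_groups (ug_user INTEGER, ug_group TEXT);
    INSERT INTO user VALUES
        (1, 'ExampleBot'), (2, 'ExampleHuman'), (3, 'UnflaggedBot');
    INSERT INTO user_groups VALUES (1, 'bot'), (2, 'sysop');
""")

def flagged_bots(conn):
    """Return the names of accounts that belong to the 'bot' user group."""
    rows = conn.execute("""
        SELECT user_name
        FROM user
        JOIN user_groups ON ug_user = user_id
        WHERE ug_group = 'bot'
        ORDER BY user_name
    """)
    return [name for (name,) in rows]

print(flagged_bots(conn))
```

Note that 'UnflaggedBot' is not returned: the lookup only finds accounts whose community has actually applied the flag.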

Username regex

Example
en:User:HBC_AIV_helperbot5
Counter-example
en:User:I_Jethrobot
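
To illustrate why name matching is imprecise, here is a small sketch using a deliberately loose, hypothetical pattern ("bot" anywhere in the name, ignoring case). The pattern is our own assumption and not one prescribed on this page; it matches both the example and the counter-example above.

```python
import re

# Hypothetical loose pattern: "bot" anywhere in the name, ignoring case.
LOOSE_BOT_RE = re.compile(r"bot", re.IGNORECASE)

def looks_like_bot(username):
    """True if the username contains 'bot' in any letter case."""
    return LOOSE_BOT_RE.search(username) is not None

for name in ["HBC_AIV_helperbot5", "I_Jethrobot", "ExampleHuman"]:
    print(name, looks_like_bot(name))
```

Both HBC_AIV_helperbot5 and I_Jethrobot match, so the counter-example would be a false positive under this pattern.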

Curated lists

For example: en:Wikipedia:Bots/Status and "bots.csv" from the Wikistats CSV files.

Hybrid strategies

Wikistats
  1. Is there a bot flag in user group table?
  2. Does the name sound like a bot? (On many wikis, such names are nowadays only allowed for bots.)
    • Perl: if (($user =~ /bot\b/i) || ($user =~ /_bot_/i))
  3. Is it known to be an unregistered bot? (Wikipedia has a list of false negatives at [1].)
  4. Is a name flagged as a bot on at least 10 wikis? Then treat it as a bot on any wiki within the project.
  5. Three names that sound like bots are hard-coded exceptions (people who wrote ErikZ to tell him they are human): Paucabot, Niabot, and Marbot.
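
The five checks above could be combined roughly as follows. This is a sketch and not the Wikistats code: we assume the checks are OR-ed together, that the exceptions in step 5 only override the name test in step 2, and that the flag sets and per-name wiki counts are supplied by the caller (the parameter names are hypothetical). The regex is a Python translation of the Perl test; Python's \b is Unicode-aware by default, which may differ slightly from the Perl original.

```python
import re

# Step 2: Python translation of the Perl name test quoted above.
SOUNDS_LIKE_BOT_RE = re.compile(r"bot\b|_bot_", re.IGNORECASE)

# Step 5: hard-coded human exceptions named in the list above.
HUMAN_EXCEPTIONS = {"Paucabot", "Niabot", "Marbot"}

def is_bot(user, flagged_here, known_unflagged, flagged_wiki_count):
    """Apply the five checks to one username.

    flagged_here:       names holding the bot flag on this wiki (step 1)
    known_unflagged:    names from a curated list of unflagged bots (step 3)
    flagged_wiki_count: name -> number of wikis where it is flagged (step 4)
    """
    if user in flagged_here:
        return True
    # Assumption: exceptions only cancel the name-based test.
    if SOUNDS_LIKE_BOT_RE.search(user) and user not in HUMAN_EXCEPTIONS:
        return True
    if user in known_unflagged:
        return True
    return flagged_wiki_count.get(user, 0) >= 10

print(is_bot("SomeBot", set(), set(), {}))
print(is_bot("Paucabot", set(), set(), {}))
```

One detail worth noticing: the pattern requires a word boundary after "bot", so a name such as HBC_AIV_helperbot5, where a digit follows, is not caught by the name test alone.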

Bots don't sleep

In some recent work [1] I found many users that appeared to be bots but whose edits did not have the bot flag set. My approach was to exclude users who didn't have a break of more than 6 hours between edits over the entire month I was studying. I was interested in users who had multiple edit sessions in the month, so I went with a straight threshold. A way to keep users with only one editing session would be to exclude users who have no break longer than X hours in an edit session lasting at least Y hours (e.g., a user who doesn't break for more than 6 hours in 5-6 days is probably not human).

[1] Multilinguals and Wikipedia Editing http://www.scotthale.net/pubs/?websci2014
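
A rough sketch of the X/Y variant described above, assuming each user's edit timestamps are already available as a list of datetime objects and treating the whole observed window as one session. The function names and default thresholds (6 hours, 5 days) are illustrative choices taken from the example in the text, not the code used in [1].

```python
from datetime import datetime, timedelta

def longest_break(timestamps):
    """Longest gap between consecutive edits; input need not be sorted."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return max(gaps, default=timedelta(0))

def never_sleeps(timestamps,
                 max_break=timedelta(hours=6),   # X in the text
                 min_span=timedelta(days=5)):    # Y in the text
    """True if edits span at least min_span with no break above max_break."""
    if len(timestamps) < 2:
        return False
    span = max(timestamps) - min(timestamps)
    return span >= min_span and longest_break(timestamps) <= max_break

start = datetime(2014, 9, 1)
# Invented data: one edit every 4 hours for 6 days.
steady = [start + timedelta(hours=4 * i) for i in range(37)]
# Invented data: edits at 09:00 and 21:00 each day for 6 days.
daily = [start + timedelta(days=d, hours=h) for d in range(6) for h in (9, 21)]

print(never_sleeps(steady), never_sleeps(daily))
```

The second series covers more than five days but contains 12-hour breaks, so it is not flagged.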

Analysis

Username matching vs. the bot flag

 
Figure: Bot counts by matching strategy. Count of bot accounts, by matching strategy, registered between Sept. 2013 and Sept. 2014 for the top 25 wikis ranked by count of regex-matched bot users.

The figure "Bot counts by matching strategy" plots the count of bots by detection method for the top 25 wikis, ranked by the most inclusive regex matching strategy. The plot makes two things salient: (1) All of the top wikis saw user accounts registered that were given the bot flag, so presumably the bot flag is being used. (2) There is often an order of magnitude more active, non-blocked user accounts that fit the regex criteria.


Questions

  • How are bot accounts registered?
    • Can we filter for non-bot accounts via the logging table if we look at log_action="create" and log_type="newusers"?
    • It seems more likely that bot accounts would be registered by proxy (e.g. newusers/create2).
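
One way to start answering these questions would be to tabulate how known bot accounts were registered. The sketch below does this against a toy logging table in SQLite. It assumes that log_title holds the name of the newly created account, and all rows are invented.

```python
import sqlite3

# Toy stand-in for the MediaWiki `logging` table; rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE logging (log_type TEXT, log_action TEXT, log_title TEXT);
    INSERT INTO logging VALUES
        ('newusers', 'create',  'ExampleHuman'),
        ('newusers', 'create2', 'ExampleBot'),
        ('newusers', 'create',  'SelfMadeBot'),
        ('block',    'block',   'ExampleVandal');
""")

def registration_actions(conn, usernames):
    """Count newusers log actions for the given account names."""
    placeholders = ",".join("?" for _ in usernames)
    query = (
        "SELECT log_action, COUNT(*) FROM logging "
        "WHERE log_type = 'newusers' "
        f"AND log_title IN ({placeholders}) "
        "GROUP BY log_action"
    )
    return dict(conn.execute(query, list(usernames)).fetchall())

print(registration_actions(conn, ["ExampleBot", "SelfMadeBot"]))
```

If most flagged bots turned up under create2 on real data, filtering on log_action="create" would be a plausible way to isolate non-bot accounts; if not, the filter would miss self-registered bots such as the invented SelfMadeBot here.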