Overview

This bot will post external link analyses, find probable spambot-created pages, and eventually tag them for speedy deletion. It will also generate a set of statistics that the community can use to determine whether some pages are being used as spam carriers.

Tasks

The bot itself is composed of a set of bash shell script files, each doing a single task:

  • review.sh: The "bot" itself. The script simply calls each of the following scripts in order and handles any problems they report (a sketch of this orchestration appears after this list).
  • download.sh: Checks download.wikimedia.org for new database dumps, comparing them against the last one it processed. If a new dump is found, it generates a list of URLs for page.sql.gz and externallinks.sql.gz, which are then downloaded via wget (see the sketch after this list).
  • process.sh: Imports page.sql.gz and externallinks.sql.gz into a local database, then runs several custom queries to gather statistics (a sketch of the import and trimming steps appears after this list):
    SELECT COUNT(el_from) AS total, el_from, page_title
    FROM externallinks, page
    WHERE externallinks.el_from = page_id AND page_is_redirect = 0 AND page_namespace = 0
    GROUP BY el_from
    ORDER BY total DESC;
    Generates a list of articles sorted by the number of external links each one has.
    SELECT COUNT(el_to) AS total, SUBSTRING_INDEX(el_to, '/', 3) AS search
    FROM externallinks, page
    WHERE page_id = el_from AND page_namespace = 0
    GROUP BY search
    ORDER BY total DESC;
    Generates a list of linked sites sorted by the number of links to each, in descending order.
    SELECT page_id, page_title, page_namespace
    FROM page
    WHERE page_title LIKE '%index.php%'
    OR page_title LIKE '%/wiki/%'
    OR page_title LIKE '%/w/%'
    OR page_title LIKE '%/';
    Generates a list of pages whose titles contain one of several patterns left behind by malfunctioning bots, such as index.php, /wiki/, or /w/, or whose titles end with a slash.
    After executing the queries, the script trims the resulting lists to a set size so that the generated pages do not become too large. If a resulting listing still has more than 500 items, the bot stops, because the dump results must be analyzed manually.
  • upload.sh: This script handles the communication between the bot and the Wikipedia project. It logs the bot in and uploads the generated listings to a fixed location, currently User:ReyBrujo/Dumps (see the sketch after this list). First, the script determines whether there is a current dump and, if so, archives it at User:ReyBrujo/Dumps/Archive. Then it uploads the listings and the dump page, using the following page names:
    User:ReyBrujo/Dumps/yyyymmdd where yyyymmdd is the database dump date (and not the processing date)
    User:ReyBrujo/Dumps/yyyymmdd/Sites linked more than xxx times where xxx is usually 500 in the case of the English Wikipedia
    User:ReyBrujo/Dumps/yyyymmdd/Sites linked between xxx and yyy times where xxx and yyy are delimiters when a single listing would have over 500 items.
    User:ReyBrujo/Dumps/yyyymmdd/Articles with more than xxx external links where xxx is usually 1000.
    User:ReyBrujo/Dumps/yyyymmdd/Articles with between xxx and yyy external links where xxx and yyy are delimiters when a single listing would have over 500 items.
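
The orchestration in review.sh can be pictured with a minimal sketch. The real script is not shown on this page, so the specific error handling below (aborting the run as soon as one stage fails) is an assumption:

    #!/bin/bash
    # review.sh (sketch) -- run each stage in order, aborting on the first failure
    set -e                # stop the whole run if any stage exits with a non-zero status

    ./download.sh         # fetch page.sql.gz / externallinks.sql.gz if a new dump exists
    ./process.sh          # import the dumps and run the statistics queries
    ./upload.sh           # log the bot in and publish the resulting listings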
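
A sketch of how download.sh could detect and fetch a new dump. Only the host download.wikimedia.org and the use of wget come from the description above; the dump directory layout, the enwiki file names, and the last-dump.txt marker file are assumptions used for illustration:

    #!/bin/bash
    # download.sh (sketch) -- download page.sql.gz and externallinks.sql.gz for a new dump
    WIKI=enwiki
    BASE="http://download.wikimedia.org/$WIKI"
    LAST=$(cat last-dump.txt 2>/dev/null)      # date (yyyymmdd) of the last processed dump

    # Find the newest dump directory listed on the server (directory names are yyyymmdd)
    LATEST=$(wget -q -O - "$BASE/" | grep -o '[0-9]\{8\}' | sort -n | tail -n 1)

    if [ -n "$LATEST" ] && [ "$LATEST" != "$LAST" ]; then
        # A new dump exists: build the URL list and download it with wget
        for FILE in page.sql.gz externallinks.sql.gz; do
            echo "$BASE/$LATEST/$WIKI-$LATEST-$FILE"
        done > urls.txt
        wget -c -i urls.txt                    # -c resumes interrupted downloads
        echo "$LATEST" > last-dump.txt         # remember which dump was fetched
    fi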
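
The import and trimming steps of process.sh might look like the following sketch. The database name, the query file names, and the output file names are placeholders; only the use of the two dump files and the 500-item cut-off come from the description above, and the real script also splits oversized listings into ranges rather than just stopping:

    #!/bin/bash
    # process.sh (sketch) -- import the dump tables, run the queries, check the results
    DB=dumps
    MAX=500

    # Import the two tables into the local MySQL database
    zcat page.sql.gz          | mysql "$DB"
    zcat externallinks.sql.gz | mysql "$DB"

    # Run each prepared query file and save the raw result as tab-separated text
    for QUERY in links-per-article sites-by-link-count suspicious-titles; do
        mysql --batch --skip-column-names "$DB" < "$QUERY.sql" > "$QUERY.txt"

        # Stop if a listing is too large: the dump results must be analysed manually
        if [ "$(wc -l < "$QUERY.txt")" -gt "$MAX" ]; then
            echo "$QUERY.txt has more than $MAX rows; manual analysis required" >&2
            exit 1
        fi
    done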
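
The page does not show how upload.sh talks to the wiki, so the sketch below assumes the standard MediaWiki api.php login and edit actions, driven with curl and jq; the credentials, cookie file, and example page title are placeholders, not the bot's actual code:

    #!/bin/bash
    # upload.sh (sketch) -- log in and publish one generated listing via the MediaWiki API
    API="https://en.wikipedia.org/w/api.php"
    COOKIES=cookies.txt

    # 1. Fetch a login token and log the bot in (credentials are placeholders)
    LOGIN_TOKEN=$(curl -s -c "$COOKIES" \
        "$API?action=query&meta=tokens&type=login&format=json" | jq -r '.query.tokens.logintoken')
    curl -s -b "$COOKIES" -c "$COOKIES" "$API" \
        -d "action=login" -d "format=json" \
        -d "lgname=BotUserName" -d "lgpassword=BotPassword" \
        --data-urlencode "lgtoken=$LOGIN_TOKEN" > /dev/null

    # 2. Fetch an edit (csrf) token for the logged-in session
    CSRF_TOKEN=$(curl -s -b "$COOKIES" \
        "$API?action=query&meta=tokens&format=json" | jq -r '.query.tokens.csrftoken')

    # 3. Upload one listing; the real script archives the previous dump page first,
    #    then creates the dump page and each listing subpage named as described above
    curl -s -b "$COOKIES" "$API" \
        -d "action=edit" -d "format=json" \
        --data-urlencode "title=User:ReyBrujo/Dumps/20070101/Sites linked more than 500 times" \
        --data-urlencode "text=$(cat sites-by-link-count.txt)" \
        --data-urlencode "token=$CSRF_TOKEN" > /dev/null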

Finally, the bot edits a global page, currently located at User:ReyBrujo/Dump statistics table, updating the statistics on that page. Permission for the bot to run there will be requested after the bot is approved on the English Wikipedia.