RBSpamAnalyzerBot
Joined 26 May 2007
Overview
This bot will post external link analyses, find probable spambot-created pages, and eventually tag them for speedy deletion. It will also generate a set of statistics that the community can use to determine whether some pages are being used as spam carriers.
Tasks
The bot is composed of a set of bash shell scripts, each performing a single task:
- review.sh: The "bot" itself. This script simply calls each of the following scripts in order and handles any problems they report (a sketch follows below).
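A minimal sketch of what such a driver could look like; the script names come from this page, but the log file and the "non-zero exit code means failure" convention are assumptions:
#!/bin/bash
# review.sh - run each stage in order and abort the run if one of them fails.
# The log file name and exit-code convention are assumptions for illustration.
LOG=review.log

for step in download.sh process.sh upload.sh; do
    echo "$(date -u '+%F %T') starting $step" >> "$LOG"
    if ! ./"$step" >> "$LOG" 2>&1; then
        echo "$(date -u '+%F %T') $step failed, aborting run" >> "$LOG"
        exit 1
    fi
done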
- download.sh: Checks download.wikimedia.org for new database dumps, comparing the available dumps against the last one processed. If a new dump is found, it generates the URLs for page.sql.gz and externallinks.sql.gz, which are then downloaded with wget (a sketch follows below).
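A rough sketch of this step, under a few assumptions: the dump index page lists dated subdirectories, the last processed date is kept in a local state file, and the wiki name and URL layout are placeholders:
#!/bin/bash
# download.sh - fetch page.sql.gz and externallinks.sql.gz when a new dump appears.
# Wiki name, URL layout and the last_dump.txt state file are assumptions for illustration.
WIKI=enwiki
BASE=http://download.wikimedia.org/$WIKI
LAST=$(cat last_dump.txt 2>/dev/null)

# Take the newest yyyymmdd directory listed on the dump index page.
LATEST=$(wget -q -O - "$BASE/" | grep -o '[0-9]\{8\}' | sort -n | tail -n 1)

if [ -n "$LATEST" ] && [ "$LATEST" != "$LAST" ]; then
    for file in page.sql.gz externallinks.sql.gz; do
        wget -c "$BASE/$LATEST/$WIKI-$LATEST-$file"
    done
    echo "$LATEST" > last_dump.txt
fi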
- process.sh: Loads the contents of page.sql.gz and externallinks.sql.gz into a local database, then runs several custom queries to gather statistics:
SELECT COUNT(el_from) AS total, el_from, page_title
FROM externallinks, page
WHERE externallinks.el_from = page_id AND page_is_redirect = 0 AND page_namespace = 0
GROUP BY el_from
ORDER BY total DESC;
Generates a list of articles sorted by the number of external links each contains.
SELECT COUNT(el_to) AS total, SUBSTRING_INDEX(el_to, '/', 3) AS search
FROM externallinks, page
WHERE page_id = el_from AND page_namespace = 0
GROUP BY search
ORDER BY total DESC;
Generates a list of linked sites, sorted in descending order by the number of links pointing to each.
SELECT page_id, page_title, page_namespace
FROM page
WHERE page_title LIKE '%index.php%'
OR page_title LIKE '%/wiki/%'
OR page_title LIKE '%/w/%' OR
page_title LIKE '%/';- Generates a list of pages with titles containing one of several patterns used by malfunctioning bots, like /wiki/, /w/, or ending with /.
- After executing the queries, the script trims the resulting lists to a set size, to avoid creating pages that are too large. If a resulting listing still has more than 500 items, the bot stops, since the dump results must then be analyzed manually (a sketch of this step follows below).
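A sketch of the processing step, assuming the dumps are loaded into a local MySQL database, the statistics queries live in separate .sql files, and an oversized listing aborts the run; the database name and file names are placeholders:
#!/bin/bash
# process.sh - load the dump tables, run the statistics queries and check listing sizes.
# Database name, query file and output file names are placeholders for illustration.
DB=dumpstats
LIMIT=500

gunzip -c page.sql.gz | mysql "$DB"
gunzip -c externallinks.sql.gz | mysql "$DB"

mysql --batch --skip-column-names "$DB" < most_linked_sites.sql > most_linked_sites.txt

# Stop when a listing is still too large to post; it then has to be analyzed manually.
if [ "$(wc -l < most_linked_sites.txt)" -gt "$LIMIT" ]; then
    echo "Listing exceeds $LIMIT items; manual analysis required." >&2
    exit 1
fi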
- upload.sh: This script handles the communication between the bot and the Wikipedia project. It logs the bot in and uploads the generated listings to a fixed location, currently User:ReyBrujo/Dumps. First, the script checks whether a current dump page exists and, if so, archives it at User:ReyBrujo/Dumps/Archive. It then uploads the listings and the dump page, with the following format (a sketch of the upload step follows this list):
- User:ReyBrujo/Dumps/yyyymmdd where yyyymmdd is the database dump date (and not the processing date)
- User:ReyBrujo/Dumps/yyyymmdd/Sites linked more than xxx times where xxx is usually 500 in the case of the English Wikipedia
- User:ReyBrujo/Dumps/yyyymmdd/Sites linked between xxx and yyy times where xxx and yyy are the range boundaries used when a single listing would otherwise have over 500 items.
- User:ReyBrujo/Dumps/yyyymmdd/Articles with more than xxx external links where xxx is usually 1000.
- User:ReyBrujo/Dumps/yyyymmdd/Articles with between xxx and yyy external links where xxx and yyy are the range boundaries used when a single listing would otherwise have over 500 items.
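The upload could be done through the MediaWiki api.php interface, as in the sketch below; the actual script may use a different mechanism, and the credentials, dump date, page text file and JSON parsing here are placeholders or simplifications:
#!/bin/bash
# upload.sh - log the bot in and create a dump page through the MediaWiki API.
# Credentials, dump date and input file are placeholders; token parsing is simplified.
API=https://en.wikipedia.org/w/api.php
COOKIES=cookies.txt
USER=RBSpamAnalyzerBot
PASS=secret
DUMPDATE=20070526   # placeholder dump date

# 1. Fetch a login token; cookies must be kept across all requests.
LOGINTOKEN=$(curl -s -c "$COOKIES" "$API?action=query&meta=tokens&type=login&format=json" \
    | sed 's/.*"logintoken":"\([^"]*\)".*/\1/')

# 2. Log in.
curl -s -b "$COOKIES" -c "$COOKIES" -d "action=login&lgname=$USER&lgpassword=$PASS" \
    --data-urlencode "lgtoken=$LOGINTOKEN" "$API" > /dev/null

# 3. Fetch an edit (CSRF) token.
EDITTOKEN=$(curl -s -b "$COOKIES" "$API?action=query&meta=tokens&format=json" \
    | sed 's/.*"csrftoken":"\([^"]*\)".*/\1/')

# 4. Create the dump page from a prepared text file.
curl -s -b "$COOKIES" -d "action=edit&format=json" \
    --data-urlencode "title=User:ReyBrujo/Dumps/$DUMPDATE" \
    --data-urlencode "text=$(cat dump_page.txt)" \
    --data-urlencode "token=$EDITTOKEN" "$API" > /dev/null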
Finally, the bot edits a global page, currently found at User:ReyBrujo/Dump statistics table, updating the statistics on that page. Permission for the bot to run there will be requested once the bot has been approved on the English Wikipedia.