Community Wishlist Survey 2021/Admins and patrollers/Improve anti spam mechanisms

Improve anti spam mechanisms

  • Problem: "Wikimedia's captchas are fundamentally broken: they keep users away but allow robots in" (T158909). This was sadly true in 2017 and it is still true in 2020 (T241921). While a proposal to enable better ones exists (T141490), its implementation is being delayed due to a lack of testing and metrics. Year after year, stewards and other volunteers spend most of their time blocking spambots and cleaning up after them. Every month, stewards and administrators have to manually block and clean up after thousands of spambots that should never have been allowed to register in the first place. Moreover, this abusive spambot registration occurs mostly on small and scarcely watched wikis. While the global SpamBlacklist and AbuseFilter are enormously helpful when it comes to preventing spam edits, we could do better than that and prevent spambots from registering at all. We need a long-term strategy that spares volunteers from this continuous hindrance. Existing proposals (in addition to those mentioned above): Revamp anti-spamming strategies and improve UX (2015), Automatically detect spambot registration using machine learning (2017) (#aicaptcha), enable MediaWiki extension StopForumSpam (Phabricator workboard · Beta Cluster deployment request (2017)). CheckUser shows that most spambots we detect register and edit using IPs or ranges blacklisted in one or more anti-spam sites such as StopForumSpam and analogous DNSBL sites. Filtering out traffic originating from those would also help address this.
  • Who would benefit: All users.
  • Proposed solution: I guess it depends on how Community Tech would like to address this issue. My informal proposal (which may not be the path that the developers have in mind) would be as follows: (a) short term: Deploy improved FancyCaptcha, (b) medium term: enable MediaWiki extension StopForumSpam (passive mode: do not send data about our users, just receive the data they have about toxic IPs/networks), and (c) long term: AICaptcha.
  • More comments: —
  • Phabricator tickets: See above, but T125132 contains an accurate summary of the issue. Of interest: T125132 (restricted), T212667, T230304 (restricted).
  • Proposer: —MarcoAurelio (talk) 19:05, 16 November 2020 (UTC)[reply]
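
For readers unfamiliar with how a DNSBL check works in general, the following is a minimal sketch of the usual lookup convention: reverse the IPv4 octets, append the blocklist zone, and treat a successful DNS answer as "listed". The zone name `dnsbl.example.org` and the function names are hypothetical placeholders, and this is not a description of how the StopForumSpam extension or any Wikimedia service works internally.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to query for an IPv4 address in a DNSBL zone."""
    addr = ipaddress.ip_address(ip)
    if addr.version != 4:
        # IPv6 uses a nibble-reversed format; left out of this sketch.
        raise ValueError("this sketch handles IPv4 only")
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str, resolver=socket.gethostbyname) -> bool:
    """Return True if the DNSBL zone answers for this IP.

    `resolver` is injectable so the logic can be exercised without
    network access; by default it performs a real DNS lookup.
    """
    try:
        resolver(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        # No answer (e.g. NXDOMAIN) is taken to mean "not listed".
        return False
    return True


if __name__ == "__main__":
    # Hypothetical zone; replace with a real DNSBL zone to try it out.
    print(dnsbl_query_name("192.0.2.1", "dnsbl.example.org"))
```

A real deployment would also need to interpret list-specific return codes, cache answers, and decide what to do on resolver timeouts; those choices are outside the scope of this sketch.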


  • I believe the Captcha work is already underway at Phab:T250227. Samwalton9 (talk) 00:01, 17 November 2020 (UTC)[reply]
    hCaptcha might be another possibility, but I am not sure how many people would agree to use a third-party system. In any case, if it is decided that hCaptcha is the way to go, Community Tech could still get involved. —MarcoAurelio (talk) 18:49, 17 November 2020 (UTC)[reply]
  • I would probably not use Google's reCAPTCHA and instead use a simple in-house developed "type the letters you see" captcha. If the network the user is on blocks Google but not Wikipedia, they may not be able to edit. Félix An (talk) 02:29, 17 November 2020 (UTC)[reply]
    Indeed. I am not proposing to use reCAPTCHA or any other third-party system. User privacy is important to me. —MarcoAurelio (talk) 18:49, 17 November 2020 (UTC)[reply]
  • Query: how will this be usable by those who do not have "latinized" keyboards? (Examples: users from many Asian and Middle Eastern countries) Not inherently opposed, just not sure how this will work when many of the projects that would benefit the most are languages for which "latin" letters are not standard. I can see how producing a captcha of some sort that uses the same alphabet as the project may reduce spam, which is often in a different language than the project. Risker (talk) 18:46, 20 November 2020 (UTC)[reply]
    I guess this needs to be analyzed so that we arrive at a solution that is as inclusive as possible regardless of cultural background. Instead of typing random words, as happens currently, users could be offered a mosaic of pictures from Commons and asked to click on the ones that are cats/dogs/cars/rivers/etc., for example? Or maybe solve an easy math question (e.g. How much is 3 + 7?). I feel that, ideally, the solution would be the AICaptcha work started some time ago, where the system is able to identify and exclude non-humans from registering without the need for captchas. That, I guess, can take some time; but maybe we can take this opportunity to restart that work and end this situation of cross-wiki volunteers having to deal with hundreds of spambots every day. I think that doing nothing about this is no longer an option. —MarcoAurelio (talk) 19:00, 28 November 2020 (UTC)[reply]
  • Please, be very careful. For me as a user, captchas are increasingly annoying. A normal user, which includes one not using Windows, should NEVER see a captcha. --Shoeper (talk) 15:02, 23 November 2020 (UTC)[reply]
    • I may be missing something here, Shoeper, but I don't see any way that you could never see a captcha unless you were being tracked across the internet. If you come to a new site which does not have any information about you, and they need to be sure you aren't a bot, a captcha is how they do it. Natural-language questions are also useful, but have to be changed regularly and are hard to make non-culturally-biased. I'd suggest that offering the user some simple editing tasks, suggested-edits style, might be an effective way to test. Pre-Google, reCAPTCHA asked users to digitize a couple of words from scanned public-domain books, using overlap between users to validate. I understand this is too easy for modern bots anyway, and Google now asks everyone to train their proprietary driverless car algorithms (using the same consensus-of-humans method to verify). But we have no shortage of bot-undoable tasks on the wikis. HLHJ (talk) 01:22, 24 November 2020 (UTC)[reply]
      • (People could be used to improve Wikidata, but it would probably require the user to research something, leading to a bad user experience.) The increasing use of captchas on the internet is worrying. Non-technical users and women are underrepresented on Wikipedia. I fear adding a captcha is going to worsen that situation further. But tbh, although I'd like to improve Wikipedia and try it from time to time, it never was a real pleasure. Looks like it is getting even worse.--Shoeper (talk) 18:13, 28 November 2020 (UTC)[reply]
        • Wikimedia already uses captchas, but they're broken, so bots easily get in and some people struggle with them. Real people being blocked by broken captchas is certainly a concern for me, and I'd like to find a solution that is both effective and inclusive. I think AICaptcha is the solution to this, as it'd use no captcha at all. In the meantime, if you are having problems with the Wikimedia captchas, you can ask for a global captcha-exemption permission. See details at this page. —MarcoAurelio (talk) 19:00, 28 November 2020 (UTC)[reply]
        • Basically the choice is between an unlimited/unrestricted flow of rubbish coming in (which is driving away regulars due to the amount of work), restricting everything that looks like rubbish (which stops spambots but also shuts out genuine new editors), or a 'click here' OK-box, which is a nuisance (an extra click, though most people won't mind too much) but basically still lets in an unlimited/unrestricted flow of rubbish. A captcha is a path in between: it is (when the captcha is a good one) rather restrictive for spambots (except for the really intelligent ones, which cost the spammer money), and a nuisance for genuine editors (I myself do not care about the occasional captcha, though it should be reasonable; I agree that some (new) editors will be genuinely annoyed, but far fewer than if you fully restrict or make people click away the OK-box every single time). --Dirk Beetstra T C (en: U, T) 10:47, 30 November 2020 (UTC)[reply]
  • It is important to note that this is, probably, not a matter of shifting a tradeoff between being better at keeping bots out and being better at allowing humans in - it is very likely that both can be improved at the same time. The core capability that we are missing here is some kind of analytics to evaluate captcha changes - there are easy options to tweak the captcha algorithm in a way that probably improves all the parameters, but we cannot actually measure those parameters currently, so we'd have to fly blind. That has kept those changes from being made for a long time. --Tgr (talk) 03:28, 13 December 2020 (UTC)[reply]
  • The only concern I have here is accessibility for blind users. As if Google's reCAPTCHA was blind-friendly in the first place anyway (their audio feature is broken)... But I wonder how the three suggested implementations would handle that. Pandakekok9 (talk) 03:04, 15 December 2020 (UTC)[reply]
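
To make the analytics point raised by Tgr above more concrete, here is a minimal sketch of the two numbers a captcha change would need to be judged on: how often humans pass and how often bots pass. Everything here is an assumption for illustration: the `Attempt` record, the `is_bot` label (which would have to come from some separate source), and the function name are hypothetical, and no such logging is claimed to exist on Wikimedia wikis.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Attempt:
    """One captcha attempt (hypothetical record layout)."""
    solved: bool   # did the attempt pass the captcha?
    is_bot: bool   # assumed label from some out-of-band source


def captcha_metrics(attempts: Iterable[Attempt]) -> Dict[str, Optional[float]]:
    """Return the pass rate for humans and for bots.

    A better captcha raises human_pass_rate and lowers bot_pass_rate;
    a rate is None when there are no attempts in that group.
    """
    attempts = list(attempts)

    def pass_rate(group):
        return sum(a.solved for a in group) / len(group) if group else None

    humans = [a for a in attempts if not a.is_bot]
    bots = [a for a in attempts if a.is_bot]
    return {
        "human_pass_rate": pass_rate(humans),
        "bot_pass_rate": pass_rate(bots),
    }
```

Comparing these two rates before and after a change is one way to check whether both improve at once rather than trading one for the other; the hard part, which this sketch does not address, is obtaining trustworthy labels.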