Community Wishlist Survey 2019/Admins and patrollers/Overhaul spam-blacklist

Overhaul spam-blacklist

  • Problem: The current blacklist system is archaic: it does not allow for levels of blacklisting and is confusing to editors. The main problems are that the spam blacklist is indiscriminate of namespace, user status, the material linked to, etc. The blacklist is a crude, black-and-white choice: blocking additions by non-autoconfirmed editors only, or allowing additions by admins only, is not possible, nor is it possible to allow links in certain namespaces. Giving warnings is not possible either (on en.wikipedia we implemented XLinkBot, which reverts and warns; warning IPs and 'new' editors that a certain link violates policies/guidelines would be a less bitey solution).
  • Who would benefit: Editors on all Wikipedias
  • Proposed solution: Basically, replace the current mw:Extension:SpamBlacklist with a new extension whose interface is similar to mw:Extension:AbuseFilter, where, instead of conditions, the field contains a set of regexes that are interpreted like the current spam blacklists, with options (similar to the current AbuseFilter) to decide what happens when an added external link matches the regexes in the field (see the more elaborate explanation in the collapsed box).

    Note: technically, the current AbuseFilter is capable of doing what is needed, except that in this form it is extremely heavyweight for the number of regexes on the blacklists; alternatively, one would need to write a large number of rather complex AbuseFilters. The suggested filter is basically a specialised form of the AbuseFilter that only matches regexes against added external links.

description of suggested implementation

  1. Take the current AbuseFilter, create a copy of the whole extension, name it ExternalLinkFilter, and take out all the code that interprets the rules ('conditions').
  2. Make two fields to replace the 'conditions' field (a minimal sketch of the resulting check follows this list):
    • one text field for regexes that block added external links (the blacklist). It can contain many rules (one per line, like the current spam blacklist).
    • one text field for regexes that override the block (a whitelist overriding this blacklist field; that is generally simpler and cleaner than writing one complex regex, and not everybody is a regex specialist).
  3. Leave all the other options:
    • Discussion field for evidence (or better, a talk-page-like function)
    • Enabled/disabled/deleted (disable a filter when it is not needed anymore, delete it when obsolete)
    • 'Flag the edit in the edit filter log' - it might be nice to be able to turn this off, to keep the real rubbish that doesn't need logging out of the log
    • Rate limiting - catch editors who start spamming an otherwise reasonably good link
    • Warn - could be a replacement for en:User:XLinkBot
    • Prevent the action - the current blacklist/whitelist behaviour
    • Revoke autoconfirmed - make sure that spammers are caught and checked
    • Tagging - for certain rules to be checked by RC patrollers.
    • I would consider adding a button to auto-block editors on certain typical spambot domains (a function currently performed by one of Anomie's bots on en.wikipedia).
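For illustration, here is a minimal sketch of the blacklist/whitelist check described in points 1 and 2 above, assuming each field holds one regex fragment per line; the function name and signature are invented for this sketch, not an existing MediaWiki API:

```php
<?php
// Minimal sketch: does an added external link trip a filter?
// $blacklist and $whitelist are arrays of regex fragments,
// one per line as on the current spam blacklist.
function isLinkBlocked( string $url, array $blacklist, array $whitelist ): bool {
    foreach ( $blacklist as $pattern ) {
        if ( preg_match( '!' . $pattern . '!i', $url ) ) {
            // A blacklist rule matched; whitelist entries override it.
            foreach ( $whitelist as $override ) {
                if ( preg_match( '!' . $override . '!i', $url ) ) {
                    return false; // explicitly whitelisted
                }
            }
            return true; // act per the filter's options (warn, block, tag, ...)
        }
    }
    return false;
}

// Example: example.com is blacklisted, but its docs subdomain is whitelisted.
var_dump( isLinkBlocked( 'http://spam.example.com/offer',
    [ 'example\.com' ], [ 'docs\.example\.com' ] ) ); // bool(true)
var_dump( isLinkBlocked( 'http://docs.example.com/manual',
    [ 'example\.com' ], [ 'docs\.example\.com' ] ) ); // bool(false)
```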

This should overall be much more lightweight than the current AbuseFilter: all it does is regex testing, as the spam blacklist does, whereas the AbuseFilter would have to cycle through maybe thousands of filters. One could consider expanding it so that rules can be blocked or enabled on only certain pages (for heavily abused links that actually should only be used on their own subject page). Another consideration would be a 'custom reply' field, telling the editor who gets blocked by the filter why it was blocked.

Possible expanded features (though highly desired)
  1. create a separate user right, akin to the AbuseFilter editor right, for being able to edit spam filters (on en.wikipedia, most admins do not touch (or do not dare to touch) the blacklist, while there are non-admin editors who often help on the blacklist).
  2. Add namespace choice (checkboxes like in search, so one can choose not to blacklist something in one particular namespace, with the addition of an 'all', a 'content namespaces only' and a 'talk namespaces only' option); see the sketch after this list.
    • some links are fine in discussions but should not be used in mainspace; others are a total no-no
    • some image links are fine in the file namespace to tell where an image came from, but are not needed in mainspace (e.g. flickr is currently on the revert list of en.wikipedia's XLinkBot)
  3. Add user status choice (checkboxes for the different roles, or like the page-protection levels)
    • disallow IPs and new users from using a certain link (e.g. to stop spammers from creating socks, while leaving the link free to most users).
    • warn IPs and new users when they use a certain link that the link often does not meet inclusion standards (e.g. twitter feeds are often discouraged as external links when other official sites of the subject exist); like the functionality of en:User:XLinkBot.
  4. block or whitelist links matching regexes on specific pages (disallow linking throughout except on the subject page) - coding akin to the title blacklist
  5. block links matching regexes when added by a specific user/IP/IP range (disallow specific users from using a domain) - coding akin to the title blacklist
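Taken together, one filter entry under these expanded features might conceptually carry fields like the following. This is a hypothetical data layout; every field name here is invented for illustration:

```php
<?php
// Hypothetical shape of a single ExternalLinkFilter entry combining the
// expanded features above; none of these field names exist in MediaWiki.
$filter = [
    'enabled'      => true,
    'blacklist'    => [ 'twitter\.com/\w+' ],  // regexes, one rule per entry
    'whitelist'    => [],                      // overrides for the blacklist
    'namespaces'   => [ 0 ],                   // feature 2: mainspace only
    'exemptGroups' => [ 'autoconfirmed' ],     // feature 3: hits IPs/new users only
    'exceptPages'  => [ 'Twitter' ],           // feature 4: allowed on the subject page
    'restrictTo'   => [],                      // feature 5: specific users/IPs/ranges
    'actions'      => [ 'warn', 'tag' ],       // warn and tag rather than hard-block
    'customReply'  => 'Twitter feeds are often discouraged as external links.',
];
```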
Downsides

We would lose a single full list of material that is blacklisted (the current blacklist is known to work as a deterrent against spamming). That list could, however, be independently recreated from the current rules (e.g. by bot), as sketched below.
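A sketch of such a bot's core step, assuming filters are available as arrays shaped like the hypothetical layout shown earlier:

```php
<?php
// Sketch: rebuild a single flat blacklist (for deterrence and transparency)
// from all enabled filters; assumes the hypothetical entry layout above.
function buildFlatBlacklist( array $filters ): string {
    $lines = [];
    foreach ( $filters as $filter ) {
        if ( !empty( $filter['enabled'] ) ) {
            foreach ( $filter['blacklist'] as $rule ) {
                $rule = trim( $rule );
                if ( $rule !== '' ) {
                    $lines[] = $rule;
                }
            }
        }
    }
    $lines = array_unique( $lines ); // drop duplicates across filters
    sort( $lines );
    return implode( "\n", $lines );
}
```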

  • More comments:
  • Phabricator tickets: task T6459 (where I proposed this earlier)

Discussion

I second this and would like to request an additional feature: when fighting spam one always has to look up both the SBL log and the Abuse log. It would have saved me thousands of clicks if the SBL log could be merged (for display purposes only) into the Abuse log (via a checkbox, for example, or even via a virtual abuse filter number) so that one view shows the spamming actions. --Achim (talk) 12:50, 30 October 2018 (UTC)

Just to set realistic expectations here, it is unlikely that the Community Tech team would have the time and resources to create something similar to AbuseFilter. It may be possible, however, for the team to make some improvements to the existing bare-bones implementation. Ryan Kaldari (WMF) (talk) 01:05, 14 November 2018 (UTC)

    • @Ryan Kaldari (WMF): my apologies if this comes through harsh: that is a matter of priorities. The WMF has spent an enormous amount of development time on code that is sometimes flat out rejected by the community, code that is then stuffed down our throats in the worst cases (up to superprotect, remember). The spam blacklist is now a crude, minimalistic piece of code that, despite some hacks, is in some cases more disruptive than the spammers/abuse, and maintainers sometimes need to jump through hoops to collect evidence. Requests for upgrades have existed for years now, utterly ignored. Three bots have been written to fill in some of the holes; it would be more efficient to replace one of them (which is now only protecting one wiki ..) and have a meaningful and useful extension that also replaces part of the others.
    Editor retention and attracting new editors scales linearly (if not exponentially) with attracting spammers and 'keeping' (i.e. not getting rid of) them ... it is time to finally realise that the admin corps needs more tools to protect. —Dirk Beetstra T C (en: U, T) 03:58, 14 November 2018 (UTC)
    • @Ryan Kaldari (WMF): (<- new ping, as I am not sure whether my previous ping worked). One of those hoops is currently discussed here: Talk:Spam_blacklist#more_than_3000_entries. One of our maintainers, User:Lustiger seth, is now running a very long-running script to collect all rules that have not been hit. From that, we have to go through other hoops to select the rules that, even without hits, should not be removed, and then we can clean up the current list of several thousand entries. Editors in some cases request de-listing because 'it has not been added' (d'oh), and we would have to show that addition was actually attempted (which we ignore because we simply can't). The current blacklist is too crude. It needs to be replaced by something that is far more modular and has far more options. It cannot be done with the current extension; it basically needs to be redesigned from scratch. --Dirk Beetstra T C (en: U, T) 05:35, 14 November 2018 (UTC)
      • @Beetstra: Thanks for providing more context. It sounds like the current system is pretty abysmal. I think it would be appropriate for us to consider an overhaul, I just don't think it would be realistic for our team to build something as complex as AbuseFilter within the constraints that we have (and considering our very limited capacity for maintenance work afterwards). I won't disagree with any of your arguments, just want to set realistic expectations. Ryan Kaldari (WMF) (talk) 17:43, 14 November 2018 (UTC)
        • How about others in the WMF? Although that would be out of the scope of the survey. C933103 (talk) 06:14, 15 November 2018 (UTC)
        • @Ryan Kaldari (WMF): in fact, I am taking the already existing AbuseFilter and ripping out all the rule interpretation code, replacing it with one (maybe two) simple regex checks. It is by far not as complex as the AbuseFilter, and the rest of the existing code is almost immediately usable without adaptation. —Dirk Beetstra T C (en: U, T) 07:10, 15 November 2018 (UTC)
          • @Beetstra: Unfortunately, the AbuseFilter extension has been mostly unmaintained for years and would need to be overhauled, even if we didn't reuse the rule interpretation code. The AbuseFilter codebase is extremely old and buggy at this point. Last time I checked it had 311 open bugs, including 23 open security bugs! The new MediaWiki Platform team has agreed to do some work on cleaning up AbuseFilter next year, but in the meantime I don't think it would be a viable option for us. I'm confident we could come up with something that works better than the current system, however. Ryan Kaldari (WMF) (talk) 21:58, 15 November 2018 (UTC)
            • @Ryan Kaldari (WMF): so .. a nice time for a joint effort? Or are we back at my initial point regarding (IMHO completely misplaced) WMF priorities? Not only have they let the spam blacklist become outdated, but also the AbuseFilter .. <sarcasm>at least now we have a nice image viewer, and a visual editor, and what not ... </sarcasm> —Dirk Beetstra T C (en: U, T) 05:24, 16 November 2018 (UTC)
  • I am going to +1 this; the blacklist and edit filters are a hugely undervalued part of protecting the project against abuse, and it really is long past time we put some effort into improving them. JzG (talk) 00:28, 16 November 2018 (UTC)
  • This has a hidden assumption. If the spam blacklist is viewed as a pure spam list, then it is a one-level list. If it is used to block questionable sites, then it is a multi-level list. Whether to give the user a warning is a separate question from what the list should be. A multi-level list for questionable sites is a pretty large discussion, as it must also cover what kind of external vetting should be done. — Jeblad 08:42, 18 November 2018 (UTC)
  • @Man77: I am not sure whether I understand what you mean. 'Complex regexes' are for filtering the added external links. That wikicode is too complex is a different story, independent of the mechanisms that make Wikipedia work. Is that what you mean? —Dirk Beetstra T C (en: U, T) 03:45, 21 November 2018 (UTC)
  • @Ryan Kaldari (WMF) and Beetstra: a simple hack would be to add an AbuseFilter function that takes a page name, loads the page, parses it into a regexp, and returns it (or matches it against the second argument). It would require some caching, but even so it's fairly simple, does not require any structural changes to the AF code, and would allow spam blacklists to be converted into filters, with all the UI and logic improvements that entails. --Tgr (talk) 05:01, 25 November 2018 (UTC)
    • @Tgr: True, that works, and that is what we sometimes do. But with thousands of blacklist rules, that entails an enormous number of filters (including global ones). The AbuseFilter is in itself a slow system, where the interpretation (especially with complex regexes) will be the problem, and it will only get slower if we use logic to select the rules (which, IMHO, could even be moved out of the AbuseFilter itself, into 'options', to significantly improve the speed - though then the options become so many that it defies the system). --Dirk Beetstra T C (en: U, T) 05:22, 25 November 2018 (UTC)
      • @Beetstra: Would you really need many filters though? You could have one for non-admins, one for non-autoconfirmed, a few for specific namespaces maybe... there are only so many useful combinations of conditions. And since the regexes would be on a separate page and not in the AF rule, and parsed the same way they are now, AF parsing speed would not be an issue (as long as this does not significantly inflate the number of filters). --Tgr (talk) 05:45, 25 November 2018 (UTC)
        • @Tgr: It would be a significant number of filters (user level (new editor/IP; non-admin only ..), namespace, page-specific (stop adding youtube and twitter feeds to <subject>), full block/warning only; with probably some cross-combinations), with huge blocks of regexes (and the regexes need to be split, as there is a maximum size to one regex, so you'd need the same blacklist regex parser for it).
    One of the other strengths of this adapted system (as suggested above, not grouped as you suggest), IMHO, is that we could tell for some links why they are blocked and what to do (shorteners -> use the expanded URL; majorly copyright-violating sites -> link to the original; petition sites -> find secondary sources and use those), and that we could more easily make per-page exceptions (you cannot link to this site anywhere, except on the subject page). Moreover, logging hits is easier (if you group, you still have to dig through all hits to see how often spammers attempted to add <baddomain>.com). --Dirk Beetstra T C (en: U, T) 10:03, 25 November 2018 (UTC)
    @Beetstra: To be clear, I'm talking about a syntax like this: !(user_groups contains autoconfirmed) && blacklist_match( 'MediaWiki:Blacklist-autoconfirmed', added_links ). There is no huge block of regexes here; the regex is parsed from a separate page with a blacklist-like format.
    As you say, this has some shortcomings (each type of blacklist needs to be on a separate page, which is maybe not always the most convenient choice; it's hard to tell exactly which blacklist item was triggered); on the other hand, it is actually possible to do with limited resources, so the question really is: is it preferable to nothing? (Or, more optimistically, to waiting two years for an AbuseFilter rewrite.) --Tgr (talk) 20:56, 25 November 2018 (UTC)
    I guess this would be better than nothing, but I still think this is going to result in a massive number of rules (and it lacks the possibility of page-specific whitelist rules ('everywhere except <subject> for <subject.com>')).
Waiting 2 years for a rewrite .. we have been waiting for 10 years already. The protecting part of Wikipedia is utterly neglected, and will remain neglected for years to come ... —Dirk Beetstra T C (en: U, T) 22:52, 25 November 2018 (UTC)
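For illustration, the blacklist_match() function Tgr sketches above could conceptually work along these lines. This is a rough sketch only: the function does not exist in AbuseFilter, it takes the page text directly rather than loading the page by name (to stay self-contained), and a real version would need caching and regex batching, since a single regex has a maximum size, as noted above:

```php
<?php
// Rough sketch of the hypothetical blacklist_match() helper: parse a
// blacklist-formatted page into one alternation regex and test each
// added external link against it.
function blacklist_match( string $pageText, array $addedLinks ): bool {
    $fragments = [];
    foreach ( explode( "\n", $pageText ) as $line ) {
        $line = trim( preg_replace( '/#.*$/', '', $line ) ); // strip comments
        if ( $line !== '' ) {
            $fragments[] = $line;
        }
    }
    if ( $fragments === [] ) {
        return false;
    }
    $regex = '!(?:' . implode( '|', $fragments ) . ')!i';
    foreach ( $addedLinks as $link ) {
        if ( preg_match( $regex, $link ) ) {
            return true;
        }
    }
    return false;
}

// Usage mirroring Tgr's example rule:
//   !(user_groups contains autoconfirmed) &&
//       blacklist_match( 'MediaWiki:Blacklist-autoconfirmed', added_links )
```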

Voting