Grants:IdeaLab/Similarity measure for AbuseFilter

status: DRAFT

Similarity measure for AbuseFilter

1RR, 2RR, and 3RR all have one thing in common: it must be possible to detect similarity between contributions. Without a similarity measure, all we can detect is something like equality. This makes it much too easy to add minor changes to avoid detection of a hidden revert.

target: Wikis using MediaWiki and AbuseFilter

start date: 1 November

end date: 30 November

budget (local currency): NOK 16600

budget (USD): USD 2000

grant type: Individual

non-profit status: No

grantee: Jeblad

contact(s): Jeblad

created on: 14:09, 19 June 2016 (UTC)

Project idea

What is the problem you're trying to solve?

When a contribution is reverted, it is far too easy to respond with a new revert. Such revert wars can spiral out of control very rapidly.

There are therefore several proposals for rules against such revert wars, but there is no tool available that prevents them from happening in the first place.

What is your solution?

It is possible to make a special so-called locality-sensitive hash digest, which can then be used to measure the similarity of a contribution against the corresponding hash digests of previous contributions or reverts. The reverts are most interesting for revert wars, while contributions can be used for detection of spam. If a similar revert is detected, the contribution can be assigned a similarity index. This number can then be used in AbuseFilter to decide whether the contribution or further reverts should be blocked or simply tagged.
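As a rough illustration of how a similarity index could feed a block/tag decision, the sketch below uses plain Jaccard similarity over character shingles as a stand-in for the digest comparison. The function names, the shingle size of 4, and the thresholds 0.9 and 0.6 are all assumptions made for this example; they are not part of the proposal or of AbuseFilter.

```python
def shingles(text: str, size: int = 4) -> set[str]:
    """Return the set of overlapping character n-grams in the text."""
    if len(text) < size:
        return {text} if text else set()
    return {text[i:i + size] for i in range(len(text) - size + 1)}


def similarity_index(a: str, b: str) -> float:
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


def decide(index: float, block_at: float = 0.9, tag_at: float = 0.6) -> str:
    """Map a similarity index to an action (thresholds are hypothetical)."""
    if index >= block_at:
        return "block"
    if index >= tag_at:
        return "tag"
    return "allow"


if __name__ == "__main__":
    earlier_revert = "The town was founded in 1850 by settlers."
    new_edit = "The town was founded in 1850 by early settlers."
    index = similarity_index(earlier_revert, new_edit)
    print(round(index, 2), decide(index))
```

A real filter would compare compact digests rather than full shingle sets, but the decision step on the resulting number would look much the same.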

New contributions that are similar, but with a lower similarity index, should be allowed. That makes it possible to adjust a rejected contribution and then save the updated version.

An editor should be allowed to make the opposite of a change (s)he has made previously. If this were not allowed, simple copy-paste editing would be disallowed. A kind of previous_change_similarity could scan the last ten revisions of a title and report either the largest change similarity in absolute value, or the previously accumulated sum for this user only. If the absolute value of the previous change similarity compared against the current change similarity is below a threshold, the edit is accepted as a copy edit.
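The two variants of previous_change_similarity mentioned above could be sketched roughly as follows. This assumes each revision already carries a signed change similarity; the `Change` record, the sign convention, and applying the ten-revision window to both variants are assumptions for illustration only.

```python
from typing import NamedTuple


class Change(NamedTuple):
    user: str
    similarity: float  # signed change similarity (sign convention assumed)


def largest_change_similarity(history: list[Change], limit: int = 10) -> float:
    """Return the value with the largest magnitude in the last `limit` revisions."""
    recent = history[-limit:]
    return max((c.similarity for c in recent), key=abs, default=0.0)


def accumulated_change_similarity(
    history: list[Change], user: str, limit: int = 10
) -> float:
    """Sum the change similarity of one user over the last `limit` revisions."""
    return sum(c.similarity for c in history[-limit:] if c.user == user)


if __name__ == "__main__":
    # History is ordered oldest first; user names are placeholders.
    history = [Change("A", 0.25), Change("B", -0.75), Change("A", 0.5)]
    print(largest_change_similarity(history))
    print(accumulated_change_similarity(history, "A"))
```

How the reported value is then compared against the current change similarity and the threshold is left open here, since the proposal does not pin it down.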

Note that a working solution must somehow take the user's access level into account. If not, a user with a low access level might revert edits made by a more renowned user without anyone being able to undo the revert. This is just a small part of the total problem, as a user with more access rights should be able to override a user with fewer access rights. This in turn is just a small part of an even larger problem, as a better solution would be temporal karma with a base capital given by the user rights.

Background

Text fragments can be compared using specially crafted hash digests. These are often built from hashed substrings of 3-5 characters. There is a large group of such algorithms, often called w:Locality-sensitive hashing, but the important point is that some of them run in $O(n)$ time, where n is the length of the text. By caching previous digests, the search against them takes $O(m)$, where m is the number of previously processed texts. The former is somewhat heavy (large constant) while the latter is somewhat lightweight, for a total of $O(n)+O(m)$. This is nevertheless a lot better than most methods for w:edit distance; the standard solution takes $O(mn)$ in comparison. It is also possible to simplify this to $O(n)+k$, but then with some limitations.
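To make the cost split concrete, here is a minimal sketch using SimHash, one member of the locality-sensitive hashing family. The choice of SimHash, the 4-character n-grams, the 64-bit digest, and all function names are assumptions for illustration; the technical description may specify a different scheme. Computing the digest is one pass over the text, $O(n)$ with a large constant, and comparing it against m cached digests is a cheap $O(m)$ scan.

```python
import hashlib


def simhash(text: str, size: int = 4, bits: int = 64) -> int:
    """Compute a SimHash digest from character n-grams in a single pass."""
    counts = [0] * bits
    for i in range(max(len(text) - size + 1, 1)):
        gram = text[i:i + size].encode("utf-8")
        raw = hashlib.blake2b(gram, digest_size=bits // 8).digest()
        h = int.from_bytes(raw, "big")
        # Each n-gram votes +1 or -1 on every bit position.
        for bit in range(bits):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(bits) if counts[bit] > 0)


def digest_similarity(a: int, b: int, bits: int = 64) -> float:
    """Fraction of matching bits between two digests, from 0.0 to 1.0."""
    return 1.0 - bin(a ^ b).count("1") / bits


def most_similar(digest: int, cache: list[int]) -> float:
    """Scan the cached digests and return the highest similarity found."""
    return max((digest_similarity(digest, old) for old in cache), default=0.0)


if __name__ == "__main__":
    cache = [simhash("An earlier revision of the paragraph in question.")]
    new = simhash("An earlier revision of the paragraph in question!")
    print(most_similar(new, cache))
```

Unrelated texts tend to agree on about half of the bits, so a useful threshold on this measure would sit well above 0.5.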

There is a proposal Grants:IdeaLab/1RR minimal delay which needs this functionality, or some functionality of similar type.

For a more in-depth description see Grants:IdeaLab/Similarity measure for AbuseFilter/Technical description.

Goals

To create a minimal and yet effective measure of similarity between edits, thereby making it possible to detect (and possibly stop) edit wars and revert chains.

Get Involved

About the idea creator

I've been a contributor on Wikimedia projects for more than ten years, and have a cand.sci. in math and computer sciences.

Participants

Endorsements

Project plan

Activities

Develop the code to implement a similarity measure for AbuseFilter according to the technical description. The final outcome will be an additional variable, change_similarity, that can be used in AbuseFilter filters. The variable should be documented on the page mw:Extension:AbuseFilter/Rules format.

Impact

It would make it possible to write filters that can tag edit wars, or even stop them completely.

Resources

A developer for approximately half a man-month.