Grants:IdeaLab/Similarity measure for AbuseFilter/Technical description

This is the technical description of a similarity measure for AbuseFilter, an idea to stop revert wars in their tracks by limiting the number of reverts and who can revert whom. It is based upon a w:Locality-sensitive hashing function that calculates a digest for the difference introduced by each contribution and caches it for reuse in later comparisons. When a later comparison finds that a new digest is close enough to a previous one, the edit can be rejected. The general idea is that the text will then have to be reformulated, hopefully for the better, and that this forces the contributors to evolve the text instead of spiralling down into an edit war.

Several years ago a similar detector was written as Fuzzy comparison of strings. That was for mw:Wikibase, and it calculated a distance metric by using a w:Locality-preserving hashing function, that is a hashing function mapping the text into a subspace. Depending on the number of stored contributions we want to track, it could be better to create a hash and calculate the distance with a bean counter on the diverging bits. Such a bean counter will be less accurate, but it could be good enough for our purpose, and it will be much faster.

The version in gerrit makes fuzzy comparisons by calculating a simple vector for each of two strings and then summing the differences. That is, the vector holds the number of occurrences in each bin of a set of bins. Our digest would instead be a vector of booleans over the set of bins, where the boolean for a bin is set if the number of occurrences in that bin is above the median of the complete set.
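
As a hedged sketch of that bin-count comparison (the bin count, n-gram length, hash function and all names below are illustrative choices, not taken from the gerrit change), it could look something like this in Python:

<syntaxhighlight lang="python">
# Sketch of the bin-count fuzzy comparison: count hashed n-grams per bin
# and sum the per-bin differences between two strings.
import hashlib

BINS = 64    # number of bins in the occurrence vector (illustrative)
NGRAM = 3    # length of the substrings that are hashed (illustrative)

def bin_vector(text, bins=BINS, n=NGRAM):
    """Count how many n-grams of `text` hash into each bin."""
    counts = [0] * bins
    for i in range(max(len(text) - n + 1, 0)):
        h = hashlib.blake2b(text[i:i + n].encode("utf-8"), digest_size=4)
        counts[int.from_bytes(h.digest(), "big") % bins] += 1
    return counts

def fuzzy_distance(a, b):
    """Sum of the per-bin differences between the two occurrence vectors."""
    return sum(abs(x - y) for x, y in zip(bin_vector(a), bin_vector(b)))

print(fuzzy_distance("the quick brown fox", "the quick brown fix"))   # small
print(fuzzy_distance("the quick brown fox", "lorem ipsum dolor sit")) # larger
</syntaxhighlight>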

Somehow we must limit the possible checks to something manageable. As we can't easily query a database for a digest with some variation, we need another search and storage mechanism. We can get the last N entries for an article, but we can also simply keep the M last entries from recent changes in memcached or some other mechanism.

Basic algorithm

Assume there are strings <math>A^1, A^2, \ldots, A^k</math>, where each revision <math>k</math> of the string <math>A</math> is the previous version plus some contributions. Assume also that there is a window into those strings, the substring <math>A^k_{i \ldots i+m}</math>, that is the string from <math>i</math> to <math>i+m</math> in the <math>k</math>th revision. That window can be convolved over the whole revision, and for each substring a hash is created. That hash identifies a specific position in a vector, which is then incremented on each iteration. When the convolution is completed we have a vector <math>V^k</math> of sums. We can then find the median value and set all values above it to boolean true and all values below it to boolean false. This vector can be turned into a bit vector, or really a long int, and used as a digest <math>D^k</math> for the revision. We can then use the LSH digest for comparisons instead of the full string.
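
A minimal sketch of the digest construction, assuming hashed substrings of length m as the convolution window and 128 bins; the function names and constants are illustrative:

<syntaxhighlight lang="python">
import hashlib
import statistics

DIGEST_BITS = 128   # number of bins, and bits in the digest (illustrative)
WINDOW = 5          # window length m (illustrative)

def lsh_counts(text, bits=DIGEST_BITS, m=WINDOW):
    """The vector V: convolve the window over the text and count, per bin,
    how many of the hashed substrings fall into it."""
    counts = [0] * bits
    for i in range(max(len(text) - m + 1, 0)):
        h = hashlib.blake2b(text[i:i + m].encode("utf-8"), digest_size=4)
        counts[int.from_bytes(h.digest(), "big") % bits] += 1
    return counts

def counts_to_digest(counts):
    """The digest D: set bit i when bin i is above the median of all bins."""
    median = statistics.median(counts)
    return sum(1 << i for i, c in enumerate(counts) if c > median)

revision = "an example revision of some article text"
print(f"{counts_to_digest(lsh_counts(revision)):032x}")
</syntaxhighlight>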

It is possible to create a new digest <math>D^{\Delta}</math> for the difference between <math>A^k</math> and <math>A^{k-1}</math> by calculating the difference between <math>V^k</math> and <math>V^{k-1}</math>, and then turning this into a digest. This digest can be used as a fingerprint for the change (difference) between two consecutive revisions. If we add the same string to any article it will still create the same digest.
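
Reusing lsh_counts() and counts_to_digest() from the previous sketch (repeated compactly here so the example stands alone), the difference digest could be built like this:

<syntaxhighlight lang="python">
import hashlib
import statistics

def lsh_counts(text, bits=128, m=5):
    counts = [0] * bits
    for i in range(max(len(text) - m + 1, 0)):
        h = hashlib.blake2b(text[i:i + m].encode("utf-8"), digest_size=4)
        counts[int.from_bytes(h.digest(), "big") % bits] += 1
    return counts

def counts_to_digest(counts):
    median = statistics.median(counts)
    return sum(1 << i for i, c in enumerate(counts) if c > median)

def difference_digest(old_text, new_text):
    """Fingerprint of the change: threshold the per-bin difference V_k - V_{k-1}."""
    diff = [a - b for a, b in zip(lsh_counts(new_text), lsh_counts(old_text))]
    return counts_to_digest(diff)

old = "first paragraph. second paragraph."
new = "first paragraph. the very same insertion. second paragraph."
# The same insertion into a different article gives (nearly) the same digest.
print(f"{difference_digest(old, new):032x}")
</syntaxhighlight>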

It is not quite obvious how we compare such digests. As long as we have exact identity we could use plain equality, but typically two digests are nearly similar without being exactly equal. The usual comparison is a bitwise XNOR (not exclusive or), followed by counting all set bits; this counting is often called a bean counter. On a 128-bit long int, a 90% similarity gives about 115 set bits after the XNOR operation.
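
A small sketch of that bean-counter comparison, assuming 128-bit digests stored as Python ints:

<syntaxhighlight lang="python">
DIGEST_BITS = 128
MASK = (1 << DIGEST_BITS) - 1

def similarity_bits(a, b):
    """Bitwise XNOR of two digests, then count the set bits.
    Identical digests give 128; about 90% similarity gives roughly 115."""
    agreeing = ~(a ^ b) & MASK   # XNOR: 1 where the digests agree
    return bin(agreeing).count("1")

print(similarity_bits(0x0f0f, 0x0f0f))   # 128, identical digests
print(similarity_bits(0x0f0f, 0x0f0e))   # 127, one differing bit
</syntaxhighlight>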

The obvious solution is to keep a bag of such previous digests, and compare the digest of each new change against the whole bag. That works reasonably well up to a pretty large bag of digests, but there are other and perhaps more attractive solutions.

Slightly more advanced algorithm

Assume that we add all previous digests together into a new vector. That is, the bit in the <math>i</math>th position is added, as if it were a number, to the <math>i</math>th element of the vector: <math>S_i \leftarrow S_i + D_i</math>. This vector is our new bag of digests. It is not a list of digests anymore; it is all the digests munged together into a sum.
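
A minimal sketch of that munged bag; one assumption made here is that the digest bits are mapped to +1/−1 before being added, so that the correlation of a digest with itself comes out as N, as described further down:

<syntaxhighlight lang="python">
DIGEST_BITS = 128

def digest_to_signs(digest, bits=DIGEST_BITS):
    """Map each digest bit to +1 or -1."""
    return [1 if (digest >> i) & 1 else -1 for i in range(bits)]

def add_to_bag(bag, digest):
    """S_i <- S_i + D_i for every bit position i."""
    for i, s in enumerate(digest_to_signs(digest)):
        bag[i] += s
    return bag

bag = [0] * DIGEST_BITS
add_to_bag(bag, 0x0123456789abcdef0123456789abcdef)
add_to_bag(bag, 0xfedcba9876543210fedcba9876543210)
</syntaxhighlight>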

Now assume we have a digest from a true rollback, and that we want to check if we have observed this digest previously. One way to do this is to w:Correlate the current digest against our bag of munged digests. If the absolute value of the result is above a threshold, then the digest exists among the previous digests in the bag. If it is below the threshold, the check fails and the digest has not been previously observed.
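
Continuing the same ±1 convention as above, the check against the bag could be a correlation like the following; the threshold value is purely illustrative and would have to be tuned:

<syntaxhighlight lang="python">
DIGEST_BITS = 128
THRESHOLD = 96   # illustrative; the maximum possible sum is DIGEST_BITS

def correlate(bag, digest, bits=DIGEST_BITS):
    """Sum the bag elements with the sign given by the digest bits.
    Large positive: the digest is (close to) one already in the bag.
    Large negative: its complement is. Near zero: not seen before."""
    return sum(bag[i] if (digest >> i) & 1 else -bag[i] for i in range(bits))

def seen_before(bag, digest):
    return abs(correlate(bag, digest)) >= THRESHOLD

bag = [0] * DIGEST_BITS
print(seen_before(bag, 0x0123456789abcdef0123456789abcdef))  # False, empty bag
</syntaxhighlight>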

Actually, a sum above a positive threshold indicates an action similar to a previously registered one, while a sum below a negative threshold indicates the opposite action. Because we want to stop edit wars we would like to use the absolute value.

After some preset time we should clean up the sum (bag) of digests. That can be done by checking the true rollbacks for a slice back in time, and then checking those rollbacks against the bag. If we find them, we remove them. It is important that we only do this for true rollbacks, otherwise we would remove the digest when the original entry was made. (This depends on whether we add all digests or only rollbacks.) A variation could be to maintain a queue of digests with timestamps and purge them as they are shifted out. Both methods have their merits. Removing a digest from the sum is simply subtracting the value that was previously added.
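
A sketch of the timestamped-queue variation, where purging subtracts exactly what was added; the class and its parameters are illustrative, not an existing API:

<syntaxhighlight lang="python">
import time
from collections import deque

DIGEST_BITS = 128

def signs(digest, bits=DIGEST_BITS):
    return [1 if (digest >> i) & 1 else -1 for i in range(bits)]

class DigestBag:
    """Sum of digests plus a timestamped queue so old entries can be purged."""
    def __init__(self, max_age=3600):
        self.sum = [0] * DIGEST_BITS
        self.queue = deque()      # (timestamp, digest) pairs
        self.max_age = max_age

    def add(self, digest, now=None):
        now = time.time() if now is None else now
        for i, s in enumerate(signs(digest)):
            self.sum[i] += s
        self.queue.append((now, digest))

    def purge(self, now=None):
        """Shift out digests older than max_age by subtracting them again."""
        now = time.time() if now is None else now
        while self.queue and now - self.queue[0][0] > self.max_age:
            _, old = self.queue.popleft()
            for i, s in enumerate(signs(old)):
                self.sum[i] -= s
</syntaxhighlight>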

Ideally a correlation of a digest with itself will be N, where N is the number of bits in the digest, while a correlation with any other pattern will be less than this. If noise is added the correlation will drop, and here it is the other digests that constitute the noise. As a border case we could add N−1 other digests and still be able to detect the correct one if they were linearly independent sequences, but that isn't the case here: our digests are constructed as random patterns.

To be able to handle more outcomes, one option is to increase the vector length. As it grows larger the process gain increases. This can also be used to increase the confidence in the detection of specific patterns. It has some drawbacks, as a string must have a minimum length to span the whole digest.

Because we add the digests to the vector, it is possible to make a shortcut and use the accumulated vector from the convolution directly. Just subtract the mean, normalize, and add it to the common bag vector. (Whether we use the median or the mean has consequences for what the observed mean will be when the correlation fails.)
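
A hedged sketch of that shortcut, assuming the raw bin counts from the convolution are available; the mean subtraction and normalization below are the obvious textbook versions, not a fixed choice:

<syntaxhighlight lang="python">
import statistics

def normalize_counts(counts):
    """Centre the raw bin counts on their mean and scale to unit length."""
    mean = statistics.fmean(counts)
    centred = [c - mean for c in counts]
    norm = sum(v * v for v in centred) ** 0.5 or 1.0
    return [v / norm for v in centred]

def add_counts_to_bag(bag, counts):
    """Add the normalized count vector directly to the common bag vector."""
    for i, v in enumerate(normalize_counts(counts)):
        bag[i] += v
    return bag

bag = add_counts_to_bag([0.0] * 8, [3, 0, 1, 0, 5, 2, 0, 1])
print([round(v, 2) for v in bag])
</syntaxhighlight>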

Effective correlation

A correlation is typically written as (note that the two vectors have equal length and that indexing starts at 1)

<math>r(x, y) = \sum_{i=1}^{N} x_i \, y_i</math>

But note that this can be rewritten as

<math>r(x, y) = \sum_{i=1}^{N} k(x_i, y_i)</math>

And the kernel operation <math>k(x_i, y_i) = x_i \, y_i</math> can be replaced with whatever we want, as long as it has some basic properties. One such operation is something like

<math>k(x_i, y_i) = \begin{cases} x_i & \text{if } y_i = 1 \\ -x_i & \text{if } y_i = 0 \end{cases}</math>

This is really simple to implement: it is a shift operator on the digest to get <math>y_i</math>, and then either identity or inversion on <math>x_i</math>.
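
In Python the shift-and-select version of that correlation could look like this (same ±1 convention as in the earlier sketches):

<syntaxhighlight lang="python">
DIGEST_BITS = 128

def correlate_by_shifting(bag, digest, bits=DIGEST_BITS):
    """Correlation with the sign kernel: shift the digest one bit at a time;
    each bit selects identity or inversion of the corresponding bag element."""
    total = 0
    for i in range(bits):
        if digest & 1:
            total += bag[i]   # y_i = 1: take x_i as it is
        else:
            total -= bag[i]   # y_i = 0: invert x_i
        digest >>= 1          # shift operator to get the next y_i
    return total

print(correlate_by_shifting([1] * DIGEST_BITS, (1 << DIGEST_BITS) - 1))  # 128
</syntaxhighlight>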

Notes

Some numbers

On enwiki there are about 25 hits of "revert" over 500 edits; on nowiki there are about 53 hits of "tilbakestilt" (reverted) over 5000 edits. Both were sampled on a random evening, that is while writing this. That is roughly 5% versus 1%, so in this respect the environment on enwiki can be said to be about five times more hostile than nowiki.

One source says about 400,000 edits per day on Wikipedia. With the enwiki baseline (5%) that is about 20,000 reverts per day, or with the nowiki baseline (roughly 1%) about 4,000 reverts per day.

Storage for the bag of digests

The bag of digests (or really the sum of digests) should be fetched from memcached, as it should include the last-minute updates. If there is no entry in memcached a separate log must be inspected, or if that fails, a part of (or the whole of) recent changes.

Rollbacks can happen in other server threads, or even on other machines, and memcached would then act as the common denominator.
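
A hedged sketch of the lookup order, with a stand-in cache object and hypothetical rebuild functions (none of these names exist in MediaWiki; in production this would go through the wiki's object cache):

<syntaxhighlight lang="python">
DIGEST_BITS = 128
BAG_KEY = "abusefilter:lsh:bag"   # illustrative cache key

class FakeCache(dict):
    """Stand-in for memcached; only get() and set() are assumed."""
    def set(self, key, value):
        self[key] = value

def rebuild_bag_from_log():
    """Hypothetical fallback: rebuild the sum from a separate rollback log."""
    return None

def rebuild_bag_from_recent_changes():
    """Hypothetical fallback: rebuild the sum from recent changes."""
    return [0] * DIGEST_BITS

def load_bag(cache):
    """Prefer memcached, then the log, then (part of) recent changes."""
    bag = cache.get(BAG_KEY)
    if bag is None:
        bag = rebuild_bag_from_log() or rebuild_bag_from_recent_changes()
        cache.set(BAG_KEY, bag)
    return bag

print(len(load_bag(FakeCache())))
</syntaxhighlight>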

Previous rollbacks

It is probably better to store the previous rollbacks (reverts) in a separate log and not try to keep them in memcached, even if the list is fairly short. At least we should fall back to the log if the entries in memcached are missing.

Bean counter

A fast bean counter, that is a fast way of counting the set bits in a word, is needed for every digest comparison.
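
As a sketch, in Python 3.10+ this is just int.bit_count(); the portable fallback below is the classic 64-bit divide-and-conquer trick, applied twice for a 128-bit digest:

<syntaxhighlight lang="python">
def popcount64(x):
    """Count the set bits of a 64-bit word with the classic bit trick."""
    x = x - ((x >> 1) & 0x5555555555555555)
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0f
    return ((x * 0x0101010101010101) & 0xffffffffffffffff) >> 56

def popcount128(x):
    """Bean counter for a 128-bit digest: count each 64-bit half."""
    return popcount64(x & 0xffffffffffffffff) + popcount64(x >> 64)

assert popcount64(0b101101) == 4
assert popcount128((1 << 128) - 1) == 128
</syntaxhighlight>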