Research talk:Revision scoring as a service/Work log/2016-02-03

Wednesday, February 3, 2016

Just dumping some notes from my last flight. No promises that they'll be coherent.

Let R be the set of revisions "needing review".

Let rR be the proportion of revisions "needing review" that are reverted.

To-do:

Check the expected proportion of ~Rr edits that are good-faith damaging
Think about an argument for how we're going to handle good-faith damaging vs. vandalism

Option 1: Simple, vandalism only:

Gather a sample of all edits and label the subset of rR. From this, draw both a training set and test set. Run tests.

Simple, straightforward. Easy to generate and work with. Results may suggest we aren't as good as we actually are at detecting vandalism.

Option 2: Re-weighted, damaging or vandalism:

Gather all labeled edits from Wikilabels as a test set. Sample available observations with replacement from the balanced subsets to re-scale to wikidata normal.

Complex and hard to understand. There's some concern that it is statistically fragile to accidental re-sampling of false-false-positives. We can probably control for this by generating several random re-samplings and averaging.

Note, we likely included bot edits in the wikilabels set. We'll want to exclude those, but that will reduce our ~rR-like observations more dramatically. This means that we will have to split very few observations to test with.
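A minimal sketch of the re-sampling idea in Option 2, assuming the labeled observations are already split into two balanced pools. The names `rescale_sample`, `averaged_statistic` and `target_pos_rate` are made up for illustration, and the "natural" rate we would re-scale to still has to be measured.

```python
import random
from statistics import mean


def rescale_sample(positives, negatives, target_pos_rate, n, rng):
    """Draw n observations with replacement so that about
    target_pos_rate of them come from `positives`."""
    n_pos = round(n * target_pos_rate)
    n_neg = n - n_pos
    sample = [rng.choice(positives) for _ in range(n_pos)]
    sample += [rng.choice(negatives) for _ in range(n_neg)]
    return sample


def averaged_statistic(positives, negatives, target_pos_rate, n,
                       statistic, repeats=10, seed=0):
    """Average `statistic` over several independent re-samplings to
    dampen the effect of any one unlucky draw."""
    rng = random.Random(seed)  # fixed seed so the run can be repeated
    return mean(
        statistic(rescale_sample(positives, negatives,
                                 target_pos_rate, n, rng))
        for _ in range(repeats)
    )
```

Here `statistic` would be whatever fitness measure we compute on a re-scaled test set. Averaging over `repeats` draws is the "several random re-samplings" control mentioned above.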

Option 3: Hybrid

Train model on rR and ~rR and test against re-scaled testing set.

I just had some more thoughts about the Wikidata Wiki labels campaign. So, here's the process we followed:

1. Obtain dataset of reverted/non-reverted edits balanced 10k/10k.
2. "Prelabel" dataset to filter edits by trusted users and bots.
3. Load the remaining edits into Wiki labels

The remaining data roughly corresponds to some mixture of rR and edits by non-trusted users that were not reverted. Most likely, this remaining set has more rR edits than otherwise. We can likely take this set of reviewed rR edits and combine it with a proportional random sample of ~rR edits to obtain a dataset of *vandalism* and *non-vandalism*.

I'm a little bit worried about our inclusion of user_group features in our feature set given that we're going to simply decide that no vandalism can be performed by "trusted" users. We're essentially pre-deciding what our model will learn and be tested against. That's weird and we could be fooling ourselves. However, our qualitative analysis suggests that, if we are, it's not by much.

I'm also a little worried about how much gain can be earned by simply dividing the edits into "trusted" and otherwise. This alone can filter out 95% (TODO: Confirm) of edits. We should probably discuss this and our ability to effectively classify the remaining edits (R).

When we release the dataset, we should explicitly provide folds for replicability. Then, we can test against the folds to reduce the effect of random false-false-positives in our test set.
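One way to make the folds explicit is to derive them from the revision ID, so anyone with the dataset gets the same split. This is only a sketch: `assign_fold`, `split_by_fold`, the salt string and the choice of 10 folds are all assumptions, not something we've agreed on.

```python
import hashlib


def assign_fold(rev_id, n_folds=10, salt="folds-v1"):
    """Deterministically map a revision ID to a fold in [0, n_folds)."""
    key = f"{salt}:{rev_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return int(digest, 16) % n_folds


def split_by_fold(rev_ids, test_fold, n_folds=10):
    """Return (train, test) lists of revision IDs for one held-out fold."""
    train, test = [], []
    for rev_id in rev_ids:
        if assign_fold(rev_id, n_folds) == test_fold:
            test.append(rev_id)
        else:
            train.append(rev_id)
    return train, test
```

Publishing the fold number as an extra column next to each revision would work just as well. The point is only that the assignment is fixed and shipped with the data.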

I'm starting to be convinced that we *can* do a relatively robust analysis of our ability to catch vandalism by using `label_reverted` to target R and ~R. I feel OK about our including user-groups in the labeling and in the test set if we build the model in parts. E.g. we can learn our ability to catch vandalism without including user_group features and then do it again with user_group features and note the difference in fitness.

The `test` utility might not be necessary. I want to plot and manually generate statistics against the test sets in R anyway, so what I really want is a scoring utility that produces a TSV that I can load into R/ipython for analysis.
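Roughly what I have in mind for the scoring utility, as a sketch. `write_scores` and its column names are hypothetical, and `score` stands in for whatever callable wraps the trained model.

```python
import csv
import io
import sys


def write_scores(observations, score, out=sys.stdout):
    """Write one TSV row per (rev_id, label) pair with the model's score."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["rev_id", "label", "score"])
    for rev_id, label in observations:
        writer.writerow([rev_id, label, score(rev_id)])


# Example with a dummy scorer that returns a constant (illustrative only).
buf = io.StringIO()
write_scores([(101, True), (102, False)], lambda rev_id: 0.5, out=buf)
```

The resulting file should load directly with `read.delim` in R or with `csv.DictReader` in ipython.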

lit. rev.

tan2014trust

Trust, but Verify: Predicting Contribution Quality for Knowledge Base Construction and Curation [1]

First, I like their tone.
Their approach is similar to ours: they set a time window for an edit in Freebase and consider the edit "good" if it survives the window and "bad" otherwise. There is one big difference, though: they look at the correctness of the data, not vandalism.
They mention that in Wikipedia, a user's credit is predictive of whether an edit is vandalism. They applied that method and found it is not very predictive of the "correctness" of edits in Freebase. They also note that modeling users' areas of expertise, which they derive with large-scale clustering, improves prediction.
They use a feature called "predicate difficulty", which is the fraction of statements using that property that stayed. This is novel and interesting, but I can barely see any application in Wikidata. For example, P18 (image) is highly vandalized, but because the property is used so heavily, the ratio is still low. Maybe it's predictive for us; worth looking at later.
To determine each user's area of expertise they used three approaches: "taxonomy", "topics" and "predicate" (each uses a complicated method). They then measured each user's correct and incorrect edits in each domain and from that determined the user's expertise per domain. It turns out the "topics" approach, which contains more than a million topics, is so sparse that it's basically useless (surprise!).
Test set: They labeled about 4,000 statements using human experts in their domains. Every statement is labeled by two different experts; in case of disagreement it is labeled again by a third party after hearing the arguments of the earlier labelers. Why this complicated procedure? I think mostly because they want the "correctness" of statements, unlike us: we care whether an edit is vandalism, "a deliberate attempt to compromise the integrity of Wikidata" (that's the definition). 3,414 edits were good and 561 were bad.
To build the training set, they considered everything that survives after K weeks "good" and everything that doesn't "bad". At first I thought it was their idea, but they mention they borrowed it from a Wikipedia-related research paper.
Out of 7,626,924 statements added by non-whitelisted users, 7,280,900 are good and 346,024 are bad (and some are unknown, probably because K weeks haven't passed yet, so we can't say whether they'll survive or not).
To determine K they used a test set (a completely different one from the one mentioned above). It turns out the best value of K is 4. Obviously a larger K is better, but the function flattens after 4.
95.46% of training set is good edits, and 85.9% of test set.
They admit this feature is a post-hoc feature and not suitable for a real-time classifier; they suggest people use it as a moderation parameter.
They used three methods of classifying: 1- Logistic Regression, 2- GradBoost, 3- Perceptron. Logistic Regression performed the best in their case.
As a measurement they used AUC, etc. but they also used RER, Relative Error Reduction:

$$\mathrm{RER} = \frac{\mathrm{error}_{\mathrm{baseline}} - \mathrm{error}_{\mathrm{model}}}{\mathrm{error}_{\mathrm{baseline}}}$$

The baseline is a model that chooses the majority class every time, so in this case it's a model that chooses "not vandalism" for all edits.
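A quick sketch of the RER arithmetic, using the test-set counts noted above (3,414 good, 561 bad) for the majority baseline. The function names are mine, and the model error of 0.10 is made up for illustration; it is not a number from the paper.

```python
def majority_baseline_error(n_good, n_bad):
    """Error rate of a model that always predicts the majority class."""
    return min(n_good, n_bad) / (n_good + n_bad)


def relative_error_reduction(error_baseline, error_model):
    """RER = (error_baseline - error_model) / error_baseline."""
    return (error_baseline - error_model) / error_baseline


baseline = majority_baseline_error(3414, 561)  # 561 / 3975, about 0.141
hypothetical_model_error = 0.10                # illustrative value only
rer = relative_error_reduction(baseline, hypothetical_model_error)
```

A model that is no better than the majority guess gets an RER of 0, and a perfect model gets 1, which makes the measure easier to read than raw accuracy on a set this imbalanced.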

AUC is 70%

It was very long but totally worth it :)

neis2012towards

Neis, Pascal, Marcus Goetz, and Alexander Zipf. "Towards automatic vandalism detection in OpenStreetMap." ISPRS International Journal of Geo-Information 1.3 (2012): 315-332.

OSM and Wikipedia are pretty much alike.
One big difference: They don't allow unregistered editing.
"When a vandalism event on one of the articles in Wikipedia is detected, it is usually reverted within a matter of minutes [39]. In our OSM analysis, 63% of the vandalism events were reverted within 24 h and 76.5% within 48 h."
It's a rule-based method not a machine learning thing.
