Finalize your proposal by September 30!
Hi とある白い猫. Thank you for drafting this proposal!
Hi. Which algorithms are used to compute these scores? I'm really interested in them.
And for completeness: I wrote a scoring script for the German Wikipedia nearly eight years ago. It can be found at http://tools.wmflabs.org/ipp/. A score is generated only for edits by IPs. The spam probability is computed with a very simple naive Bayes approach. The model is trained automatically on new articles created by IPs: if an article is deleted within seven days (speedy deletion), its words are learned as spam; if it still exists after seven days, its words are learned as ham. Over the years, nearly 790,000 articles created by IPs have been learned, with 78 million words (2.9 million different "words"). For example, the word "fuck" was used 12,388 times and has a spam probability of 98.6%. The word "und" (and) was used 1.7 million times and has a spam probability of "only" 60.4%. Maybe this word database is useful for adapting other tools to the German Wikipedia. --APPER (talk) 13:48, 3 October 2014 (UTC)
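To make the training rule described above concrete, here is a minimal sketch in Python. It is not APPER's actual script: the class name `WordSpamModel`, the regex tokenizer, the add-one smoothing, and the way per-word probabilities are combined into a text score are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())


class WordSpamModel:
    """Hypothetical word-count model: deleted articles are spam, survivors are ham."""

    def __init__(self):
        self.spam = Counter()  # word counts from articles deleted within 7 days
        self.ham = Counter()   # word counts from articles that survived 7 days

    def learn(self, text, deleted_within_7_days):
        """Add the article's words to the spam or ham counts."""
        target = self.spam if deleted_within_7_days else self.ham
        target.update(tokenize(text))

    def word_probability(self, word):
        """Share of a word's occurrences seen in spam, with add-one smoothing
        (an assumption) so the result is never exactly 0 or 1."""
        s, h = self.spam[word], self.ham[word]
        return (s + 1) / (s + h + 2)

    def text_probability(self, text):
        """Combine per-word probabilities naively (words treated as independent)
        by summing log-odds, then map back to a probability."""
        log_odds = 0.0
        for word in set(tokenize(text)):
            p = self.word_probability(word)
            log_odds += math.log(p) - math.log(1.0 - p)
        # Numerically stable logistic function.
        if log_odds >= 0:
            return 1.0 / (1.0 + math.exp(-log_odds))
        e = math.exp(log_odds)
        return e / (1.0 + e)


model = WordSpamModel()
model.learn("buy cheap pills", deleted_within_7_days=True)
model.learn("the history of the town", deleted_within_7_days=False)
print(model.word_probability("pills"))
print(model.text_probability("cheap pills"))
```

In this sketch an unseen word gets a probability of 0.5, so it does not move the text score in either direction.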
Eligibility confirmed, round 2 2014
This Individual Engagement Grant proposal is under review!
We've confirmed your proposal is eligible for round 2 2014 review. Please feel free to ask questions and make changes to this proposal as discussions continue during this community comments period.
The committee's formal review for round 2 2014 begins on 21 October 2014, and grants will be announced in December. See the schedule for more details.
Reusability of the datasets
How would you make sure the datasets produced in this project will be reusable? You might want to make sure the datasets can be released under CC0. To keep them reusable over the long term, you might want to include the text in the datasets, not just the IDs of revisions, which could be deleted or suppressed. In that case, the licensing of the datasets could be a little more complicated, though. whym (talk) 02:10, 18 October 2014 (UTC)
This may be too detailed to discuss at this phase, but I just wondered: is there any idea on how to implement (or reuse implementations of) tokenization for different languages? Some languages have word spacing while others (Chinese, Japanese, etc.) don't. Even when they have word spacing, you might want to split some long words into components (e.g. long nouns in German, composed of shorter nouns). I am sure there are ready-to-use tools for well-studied languages (such as en and de; I'm not too sure about az and tr), but if you consider only freely licensed ones, your choices may be limited. A character-level n-gram tokenization might work as a language-independent fallback.
Furthermore, assuming you keep a suitable abstraction at the level of tokenizer and make it pluggable, I wonder if the system can be extended to support non-text content (such as data items of Wikidata, or images on Commons) with a reasonable amount of adaptation. whym (talk) 09:08, 20 October 2014 (UTC)
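As a rough illustration of the two ideas above (a pluggable tokenizer and a character n-gram fallback), here is a minimal sketch in Python. The names `TOKENIZERS`, `get_tokenizer`, `whitespace_tokenizer`, and `char_ngram_tokenizer` are hypothetical and not part of the proposal's code; the registry contents and the default n-gram size are likewise assumptions.

```python
from typing import Callable, Dict, List

# A tokenizer is any function from text to a list of tokens, so new
# implementations can be plugged in without changing the caller.
Tokenizer = Callable[[str], List[str]]


def whitespace_tokenizer(text: str) -> List[str]:
    """Split on whitespace; adequate only for languages with word spacing."""
    return text.split()


def char_ngram_tokenizer(n: int = 3) -> Tokenizer:
    """Build a tokenizer that emits overlapping character n-grams."""
    def tokenize(text: str) -> List[str]:
        text = " ".join(text.split())  # normalize runs of whitespace
        if len(text) < n:
            return [text] if text else []
        return [text[i:i + n] for i in range(len(text) - n + 1)]
    return tokenize


# Hypothetical registry of language-specific tokenizers.
TOKENIZERS: Dict[str, Tokenizer] = {
    "en": whitespace_tokenizer,
    "de": whitespace_tokenizer,
}


def get_tokenizer(lang: str, n: int = 3) -> Tokenizer:
    """Return the registered tokenizer for lang, else the n-gram fallback."""
    return TOKENIZERS.get(lang, char_ngram_tokenizer(n))


print(get_tokenizer("en")("revision scoring as a service"))
print(get_tokenizer("ja", n=2)("東京都"))
```

Because the caller depends only on the `Tokenizer` signature, a German compound splitter or a tokenizer for non-text content could in principle be registered the same way.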
Could you please clarify how the technical work will be shared among the three people named in this proposal? GitHub commits seem to suggest that EpochFail has been the main contributor. Will this continue to be so, despite his volunteer position here? If the plan is for とある白い猫 and He7d3r to take on more of the technical work, some pointers to their previous work would help the IEG review. I can see that en:User:EpochFail nicely summarizes his (volunteer) work, but I couldn't easily find such information on the user pages of User:とある白い猫 and User:He7d3r. whym (talk) 03:21, 22 October 2014 (UTC)
"provide us with a random sample of hand-coded revisions (as damaging vs. not-damaging)" - Gesichtete Versionen?
If a new wiki-language community wants to have access to scores, we'd ask them to provide us with a random sample of hand-coded revisions (as damaging vs. not-damaging) from which we can train/test new models.
Isn't that what the Flagged Revisions extension has provided, for dozens of wikis, for years? The reviewing users decide whether to accept or undo each new revision. There are huge samples in Polish, Finnish, German, Russian, Arabic, Turkish, etc. Did I miss something, or shouldn't this be mentioned/explored in the proposal? --Atlasowa (talk) 12:59, 12 November 2014 (UTC)
Aggregated feedback from the committee for Revision scoring as a service
Thank you for submitting this proposal. The committee is now deliberating based on these scoring results, and WMF is proceeding with its due-diligence. You are welcome to continue making updates to your proposal pages during this period. Funding decisions will be announced by early December. — ΛΧΣ21 16:53, 13 November 2014 (UTC)
Round 2 2014 decision
Congratulations! Your proposal has been selected for an Individual Engagement Grant.
The committee has recommended this proposal and WMF has approved funding for the full amount of your request, $16,875.
Comments regarding this decision: