Research:Good Faith Newcomer Prediction

Duration: January 2017 – ??
This project's code is open-source

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

The goal of this project is to build and evaluate machine learning classifiers for identifying good-faith vs. bad-faith newcomers based on a newcomer's early edits.

Related Work

In "The Rise and Decline of an Open Collaboration System",[1] Halfaker et al. (2013) labeled 1000 newcomers as good-faith vs. bad-faith based on their first edit session. In subsequent work,[2] Halfaker et al. (2014) used this set of labels to train a Naive Bayes model for good-faith newcomer prediction using vandalism scores from STiki[3] for each revision in the newcomers' first edit sessions. This model was used for the Snuggle application. Due to limitations in the STiki API, only 152 examples could be used to train and test the Snuggle model. This project builds on that prior work by investigating other modeling strategies and making use of the full data set of 1000 labeled newcomers.

Methods

With the development of the ORES edit quality API, it is possible to get "damaging", "good-faith", or "reverted" scores for all 1000 editors. Having access to 1000 instead of 152 labeled training instances allows us to investigate slightly more complex modeling strategies and get more stable evaluation results. We will start by replicating the Snuggle model from Halfaker et al. (2014) using ORES scores instead of STiki scores, evaluating the model via cross-validation on all 1000 labeled newcomers. Next, we will see whether we can build a better model using other machine learning algorithms and more feature engineering.
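The Snuggle-style approach described above can be sketched as follows: aggregate each newcomer's per-revision ORES "damaging" probabilities into a small feature vector and fit a Gaussian Naive Bayes classifier. This is an illustrative sketch, not the project's actual code; the feature set (mean and maximum damaging probability, session length), the function names, and the synthetic data are all assumptions made for the example.

```python
# Hypothetical sketch of a Snuggle-style good-faith newcomer classifier.
# Each newcomer is summarized by statistics over the ORES "damaging"
# probabilities of their first-session revisions; a minimal Gaussian
# Naive Bayes model predicts "good" vs. "bad" faith.  Illustrative only.
import math


def newcomer_features(damaging_scores):
    """Aggregate per-revision damaging probabilities into one vector."""
    n = len(damaging_scores)
    return [
        sum(damaging_scores) / n,  # mean damaging probability
        max(damaging_scores),      # worst single revision
        float(n),                  # first-session length
    ]


class GaussianNB:
    """Minimal Gaussian Naive Bayes: one Gaussian per class and feature."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.stats, self.priors = {}, {}
        for c in self.classes:
            rows = [x for x, label in zip(X, y) if label == c]
            self.priors[c] = len(rows) / len(X)
            self.stats[c] = []
            for j in range(len(X[0])):
                col = [r[j] for r in rows]
                mu = sum(col) / len(col)
                var = sum((v - mu) ** 2 for v in col) / len(col) + 1e-9
                self.stats[c].append((mu, var))
        return self

    def _log_joint(self, x, c):
        lp = math.log(self.priors[c])
        for v, (mu, var) in zip(x, self.stats[c]):
            lp += -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
        return lp

    def predict(self, x):
        return max(self.classes, key=lambda c: self._log_joint(x, c))


# Synthetic stand-in data: good-faith sessions tend to have low damaging
# probabilities, bad-faith sessions high ones.
good_sessions = [[0.05, 0.10, 0.02], [0.20, 0.15], [0.01, 0.03, 0.08, 0.04]]
bad_sessions = [[0.85, 0.90], [0.70, 0.95, 0.80], [0.99]]
X = [newcomer_features(s) for s in good_sessions + bad_sessions]
y = ["good"] * len(good_sessions) + ["bad"] * len(bad_sessions)
clf = GaussianNB().fit(X, y)
```

In the real setting the per-revision probabilities would come from the ORES edit quality API rather than being hard-coded, and the model would be evaluated by cross-validation over the 1000 labeled newcomers.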

Results

As of now, all the modeling and evaluation work can be found in this IPython notebook. We experimented with a replicated version of the Snuggle model, using scores from each of the three ORES edit quality models and all 1000 instances. The "damaging" model performs best, giving an AUC of approximately 0.79. By contrast, Halfaker et al. (2014) report an AUC of 0.87. It is not clear where this discrepancy comes from. It could be due to an error in the replicated model, a difference between the STiki and ORES scores, or an artifact of the small training and test sets used by Halfaker et al. Using a Random Forest classifier to combine Snuggle models trained on all three edit quality models with a few other features does not meaningfully increase performance.
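For reference, the AUC figures above can be read as the probability that a randomly chosen bad-faith newcomer receives a higher predicted score than a randomly chosen good-faith one. A minimal pure-Python sketch of that rank-based computation (the function name and inputs are illustrative, not from the project's notebook):

```python
# AUC as the probability that a random positive (bad-faith) example
# outranks a random negative (good-faith) one; ties count as half a win.
def auc(scores, labels):
    """Area under the ROC curve for binary labels (1 = positive)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranking gives 1.0, a random one about 0.5, so the reported 0.79 vs. 0.87 gap corresponds to a meaningfully worse ranking of bad-faith newcomers.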

References