Research talk:Automated classification of edit quality/Work log/2017-08-03

Thursday, August 3, 2017

Continuing the work to feed Flagged Revs approvals to our damaging model, this was the second iteration of that experiment, with some refinement of the training data.

Tracked in Phabricator:
Task T166235

Experiment 1b: Refine data, omit multi-revision approvals, reverteds, and some bots

TODO in a future iteration:

  • Omit all bots.
  • Include approvals that are part of a multi-revision chain, if all changes are by the same author. Perhaps all revisions in the chain should be included in our data set.
  • If we can break out of scoring pure revisions, the diff between the start and end of a chain is a high-confidence good edit.

Methodology

Filter to single-revision approvals

Zache pointed out that Flagged Revs is often used to approve more than one edit at a time (about 1/3 of approvals).[1] We can't be confident that all, or any, of those individual revisions are good-faith or non-damaging, only that the end product is an improvement. For example, a bad edit and its rollback might both be included, and the reviewer would still approve the final article state.

I used a simple condition: the beginning revision must be the parent of the end revision. See the TODO above for how to correct some nuances that I missed; specifically, multiple edits by a single user probably stand a good chance of being desirable edits, and we should try harder to include them.
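
As a minimal SQL sketch of that condition, assuming an intermediate table u4974__ores_tmp_p.approvals with start_rev_id and end_rev_id columns parsed out of the log (the table and column names here are illustrative, not the exact ones used):

-- Keep only single-edit approvals: the approval's end revision must have
-- the start revision as its direct parent.
SELECT a.end_rev_id
FROM u4974__ores_tmp_p.approvals a
JOIN revision r
  ON r.rev_id = a.end_rev_id
WHERE r.rev_parent_id = a.start_rev_id;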

Filter out some bots

Any approvals by Zache or SeulojaBot are omitted from our set. I'm not totally clear on the reasoning, but I think these are bots reviewing other bots, and as such are edits we want to avoid.
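
In SQL this is a straightforward exclusion on the reviewing user. A sketch against the 2017-era logging table, where log_user_text held the performer's name:

-- Drop review log entries performed by the two reviewers above.
SELECT log_id
FROM logging
WHERE log_type = 'review'
  AND log_user_text NOT IN ('Zache', 'SeulojaBot');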

Filter out later reverted

We ran the "autolabel" script on our approved revisions and threw out anything with a "review reason" of "reverted edit". (TODO: link to an explanation of how that script works.)
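
For illustration only: if the autolabel results were loaded back into the user database as a table autolabeled(rev_id, review_reason), a hypothetical table for this sketch, the filter would be a left join that drops the reverted rows:

-- Keep approvals whose end revision was not autolabeled as reverted.
SELECT a.end_rev_id
FROM u4974__ores_tmp_p.approvals a
LEFT JOIN u4974__ores_tmp_p.autolabeled l
  ON l.rev_id = a.end_rev_id
WHERE l.review_reason IS NULL
   OR l.review_reason != 'reverted edit';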

Prepare for intermediate database tables

I split this query into pieces to make it easier to follow, and created a temporary table to store intermediate results. This is a bit annoying in Quarry and I ended up cheating, but the basic steps to replicate this approach are:

Create a user database to allow for intermediate tables.

# Log in to Tool Labs, then connect to the fiwiki replica:
ssh tools-login.wmflabs.org
mysql --defaults-file=replica.my.cnf -h fiwiki.labsdb
-- At the mysql prompt, create a user database to hold intermediate tables:
create database u4974__ores_tmp_p;

Building the results purely through Quarry might have been possible, but it would have required some extra work to allow write access to our temporary table, so I took a shortcut and ran the bulk of the queries from the console, using Quarry only to perform the fetch step.[2][3]
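
With the intermediate tables populated from the console, the Quarry side reduces to a read-only fetch along these lines (again with an illustrative table name):

-- Final fetch step, run in Quarry, which only needs read access.
SELECT end_rev_id
FROM u4974__ores_tmp_p.approvals;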

We discover a data iceberg

In experiment 1a, I had missed that we were only parsing the newest approvals, those created since December 2016. Older approvals used a legacy log_params format, which our query condition hadn't picked up. Once we relaxed the condition to include the legacy format, we gained 160,000 more approvals for our data set.
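
The fix was to relax the log_params condition so that both formats match. The condition was shaped roughly as below; the exact format strings here are assumptions, not the precise values used:

SELECT log_id, log_params
FROM logging
WHERE log_type = 'review'
  AND (log_params LIKE '%revid%'      -- newer serialized format (assumed)
    OR log_params RLIKE '^[0-9]+');   -- legacy newline-separated format (assumed)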

Results

Current champion damaging model

ScikitLearnClassifier
 - type: GradientBoosting
 - params: loss="deviance", warm_start=false, balanced_sample=false, subsample=1.0, max_leaf_nodes=null, min_samples_leaf=1, center=true, balanced_sample_weight=true, min_samples_split=2, learning_rate=0.01, verbose=0, min_weight_fraction_leaf=0.0, presort="auto", max_features="log2", scale=true, random_state=null, max_depth=5, init=null, n_estimators=700
 - version: 0.3.0
 - trained: 2017-06-26T03:59:29.167423

Table:
	         ~False    ~True
	-----  --------  -------
	False     16727     2231
	True        113      904

Accuracy: 0.883
Precision:
	-----  -----
	False  0.993
	True   0.289
	-----  -----

Recall:
	-----  -----
	False  0.882
	True   0.89
	-----  -----

PR-AUC:
	-----  -----
	False  0.993
	True   0.548
	-----  -----

ROC-AUC:
	-----  -----
	False  0.95
	True   0.954
	-----  -----

Model trained on approved Flagged Revisions (2nd iteration)

ScikitLearnClassifier
 - type: GradientBoosting
 - params: max_leaf_nodes=null, warm_start=false, subsample=1.0, verbose=0, max_features="log2", random_state=null, min_samples_split=2, loss="deviance", init=null, n_estimators=700, learning_rate=0.01, balanced_sample_weight=true, scale=true, max_depth=5, center=true, min_weight_fraction_leaf=0.0, min_samples_leaf=1, presort="auto", balanced_sample=false
 - version: 0.0.1
 - trained: 2017-08-02T04:43:42.045973

Table:
	         ~False    ~True
	-----  --------  -------
	False      4588      139
	True        138      120

Accuracy: 0.944
Precision:
	-----  -----
	False  0.971
	True   0.463
	-----  -----

Recall:
	-----  -----
	False  0.971
	True   0.465
	-----  -----

PR-AUC:
	-----  -----
	False  0.991
	True   0.401
	-----  -----

ROC-AUC:
	-----  -----
	False  0.878
	True   0.878
	-----  -----
  1. "T166235: Flagged revs approve model to fiwiki". phabricator.wikimedia.org. Retrieved 2017-08-03.
  2. "fiwiki_flaggedrevs_approvals.sql". 2017-08-01. Retrieved 2017-08-02.
  3. "Fiwiki good diffs - Quarry". quarry.wmflabs.org. Retrieved 2017-08-02.