Research talk:Automated classification of edit quality/Work log/2017-08-03
Thursday, August 3, 2017
Continuing the work to feed Flagged Revs approvals to our damaging model. This was the second iteration of the experiment, with some refinement of the training data.
Experiment 1b: Refine data, omit multi-revision approvals, reverteds, and some bots
TODO in a future iteration:
- Omit all bots.
- Include approvals that are part of a multi-revision chain, if all changes are by the same author. Perhaps all revisions in the chain should be included in our data set.
- If we can break out of scoring pure revisions, the diff between the start and end of an approved chain is a high-confidence good edit.
Methodology
Filter to single-revision approvals
Zache pointed out that Flagged Revs is often (in about 1/3 of approvals) used to approve more than one edit at a time.[1] We can't be confident that any of these individual revisions is good-faith or non-damaging, only that the end product is an improvement. For example, a bad edit and its rollback might both be included, and the reviewer would still approve the final article state.
I used a simple condition: the beginning revision must be the parent of the end revision. See the TODO above for nuances that I missed; specifically, multiple consecutive edits by a single user probably stand a good chance of being desirable edits, and we should try harder to include them.
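The parent condition can be sketched as follows. This is a minimal illustration, not the actual query: the field names and the revision→parent mapping are assumptions standing in for the revision table on the replica.

```python
# Sketch of the single-revision filter. An approval covers a revision range
# (start_rev, end_rev); we keep it only when start_rev is the direct parent
# of end_rev, i.e. exactly one edit was approved.

def single_revision_approvals(approvals, parent_of):
    """approvals: iterable of (start_rev, end_rev) pairs.
    parent_of: dict mapping a revision id to its parent revision id."""
    return [
        (start, end)
        for start, end in approvals
        if parent_of.get(end) == start
    ]

# Toy data: revision 102's parent is 101, and 103's parent is 102.
parents = {102: 101, 103: 102}
approvals = [(101, 102),  # single edit: kept
             (101, 103)]  # two-edit chain: dropped
print(single_revision_approvals(approvals, parents))  # → [(101, 102)]
```

In SQL terms this is simply a join condition requiring that the end revision's rev_parent_id equal the start revision's id.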
Filter out some bots
Any approvals by Zache and SeulojaBot are omitted from our set. I'm not totally clear on the reasoning, but I believe these are bots reviewing other bots, and as such are edits we want to avoid.
Filter out later-reverted edits
We ran the "autolabel" script on our approved revisions and threw out anything with a "review reason" of "reverted edit". (TODO: link to an explanation of how that script works.)
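The post-autolabel step reduces to dropping rows by reason. A hedged sketch, with the row layout and field name invented for illustration (the real autolabel output format may differ):

```python
# Drop rows whose autolabel review reason marks them as later reverted.
# The "review_reason" key is an assumption about the labeled output.
def drop_reverted(labeled_rows):
    return [row for row in labeled_rows
            if row.get("review_reason") != "reverted edit"]

rows = [{"rev": 10, "review_reason": "reverted edit"},
        {"rev": 11, "review_reason": "approved"}]
print(drop_reverted(rows))  # → [{'rev': 11, 'review_reason': 'approved'}]
```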
Prepare for intermediate database tables
I split this query into pieces to make it easier to follow, and created a temporary table to store intermediate results. This is a bit awkward in Quarry, and I ended up cheating; the basic steps to replicate this approach are:
Create a user database to allow for intermediate tables.
ssh tools-login.wmflabs.org
mysql --defaults-file=replica.my.cnf -h fiwiki.labsdb
CREATE DATABASE u4974__ores_tmp_p;
Building the results purely through Quarry might have been possible, but it would have required extra work to allow write access to our temporary table, so I took a shortcut and ran the bulk of the queries from the console, using Quarry only for the final fetch step.[2][3]
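The intermediate-table pattern itself is straightforward. Here is an illustration using Python's sqlite3 as a stand-in for the labsdb replica; the table names, columns, and data are invented for the sketch and do not reflect the actual queries:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Source data standing in for the logging table on the replica.
cur.execute("CREATE TABLE logging (log_id INTEGER, log_action TEXT)")
cur.executemany("INSERT INTO logging VALUES (?, ?)",
                [(1, "approve"), (2, "unapprove"), (3, "approve")])

# Step 1: materialize intermediate results into a scratch table, as one
# would inside the u4974__ores_tmp_p user database.
cur.execute("""CREATE TABLE approvals AS
               SELECT log_id FROM logging WHERE log_action = 'approve'""")

# Step 2: the final fetch (the part done in Quarry) runs against the
# small intermediate table instead of the full source.
cur.execute("SELECT COUNT(*) FROM approvals")
print(cur.fetchone()[0])  # → 2
```

The payoff is that the expensive filtering runs once from the console, and the interactive fetch step only touches the pre-filtered table.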
We discover a data iceberg
In experiment 1a, I had missed that we were only parsing the newest approvals, those created since December 2016. Older approvals used a legacy log_params format, which wasn't matched by our query condition. Once we relaxed the condition to include the legacy format, we gained 160,000 more approvals for our data set.
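As background on why one condition can silently drop older rows: newer MediaWiki log entries generally store log_params as a PHP-serialized array (which begins with "a:"), while legacy entries used newline-separated values. A minimal sketch of telling the two shapes apart, with invented sample values:

```python
# Classify a log_params value by shape. A PHP-serialized array starts
# with "a:"; anything else is treated as the legacy newline-separated
# format. The sample values below are made up for illustration.
def log_params_format(log_params):
    if log_params.startswith("a:"):
        return "serialized"
    return "legacy"

print(log_params_format('a:1:{s:4:"test";i:1;}'))  # → serialized
print(log_params_format("12345\n67890"))           # → legacy
```

A query condition written only against the serialized shape matches none of the legacy rows, which is how the pre-December-2016 approvals went missing.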
Results
Current champion damaging model:

ScikitLearnClassifier
- type: GradientBoosting
- params: loss="deviance", warm_start=false, balanced_sample=false, subsample=1.0, max_leaf_nodes=null, min_samples_leaf=1, center=true, balanced_sample_weight=true, min_samples_split=2, learning_rate=0.01, verbose=0, min_weight_fraction_leaf=0.0, presort="auto", max_features="log2", scale=true, random_state=null, max_depth=5, init=null, n_estimators=700
- version: 0.3.0
- trained: 2017-06-26T03:59:29.167423

Confusion table:
        ~False   ~True
False    16727    2231
True       113     904

Accuracy: 0.883
Precision: False 0.993, True 0.289
Recall:    False 0.882, True 0.89
PR-AUC:    False 0.993, True 0.548
ROC-AUC:   False 0.95,  True 0.954

Model trained on approved Flagged Revisions (2nd iteration):

ScikitLearnClassifier
- type: GradientBoosting
- params: max_leaf_nodes=null, warm_start=false, subsample=1.0, verbose=0, max_features="log2", random_state=null, min_samples_split=2, loss="deviance", init=null, n_estimators=700, learning_rate=0.01, balanced_sample_weight=true, scale=true, max_depth=5, center=true, min_weight_fraction_leaf=0.0, min_samples_leaf=1, presort="auto", balanced_sample=false
- version: 0.0.1
- trained: 2017-08-02T04:43:42.045973

Confusion table:
        ~False   ~True
False     4588     139
True       138     120

Accuracy: 0.944
Precision: False 0.971, True 0.463
Recall:    False 0.971, True 0.465
PR-AUC:    False 0.991, True 0.401
ROC-AUC:   False 0.878, True 0.878
- ↑ "T166235: Flagged revs approve model to fiwiki". phabricator.wikimedia.org. Retrieved 2017-08-03.
- ↑ "fiwiki_flaggedrevs_approvals.sql". 2017-08-01. Retrieved 2017-08-02.
- ↑ "Fiwiki good diffs - Quarry". quarry.wmflabs.org. Retrieved 2017-08-02.