Research talk:Automated classification of edit quality/Work log/2017-07-19

Wednesday, July 19, 2017

Today, I am going to tell you the story of how I decided to change the max_features parameter of the GradientBoostingClassifier (GBC) to improve the accuracy of the editquality model and what came out of it.

Problem

One of the issues with the current editquality model is its bias: it leans towards, or rather against, non-registered and new editors. To decrease this bias, it could be helpful to increase the model's variance by engaging as many features as is reasonable. See Bias-Variance Tradeoff for some details.

Hypothesis

So, I hypothesized that an additional potential source of bias could be max_features. What does the Scikit-learn library tell us about this parameter? Here you go: "choosing max_features < n_features leads to a reduction of variance and an increase in bias." We currently use max_features="log2", which is less than n_features: with ~10 features, log2 leaves us with ~3 randomly selected ones at each split. What if we set max_features to its default, None, so that all features are engaged in the calculation? It promises to be a safe experiment, because overfitting is unlikely to be a problem thanks to cross-validation (CV). Let's try this for ruwiki only.
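
For concreteness, here is a minimal sketch of the two settings being compared; this is not the editquality training code, and the feature count of 10 is illustrative:

  from math import log2

  from sklearn.ensemble import GradientBoostingClassifier

  n_features = 10  # illustrative; the real count depends on the wiki's feature set

  # Current setting: ~log2(n_features) features are considered at each split.
  gbc_log2 = GradientBoostingClassifier(max_features="log2")

  # Proposed setting: the default None considers all n_features at each split.
  gbc_all = GradientBoostingClassifier(max_features=None)

  print(int(log2(n_features)))  # log2(10) ≈ 3.3, i.e. ~3 of ~10 features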

Results

My hypothesis proved wrong (at least for ruwiki): the ROC-AUC score with max_features=null for the damaging model is 0.934, while the score for the model with max_features="log2" was higher: 0.936. Results are similar for the goodfaith model (0.932 vs. 0.935) and the reverted model (0.886 vs. 0.891).

Apparently, with all features engaged, the variance increases too much. A common practice with GBC is to consider up to 30-40% of the features at each split, which "log2" essentially achieves here, as does "sqrt", the most commonly recommended value for max_features in GBC.
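
For reference, the kind of comparison behind these numbers can be sketched with scikit-learn's cross-validated ROC-AUC scoring; the synthetic data below is a stand-in for the real labeled ruwiki edits:

  from sklearn.datasets import make_classification
  from sklearn.ensemble import GradientBoostingClassifier
  from sklearn.model_selection import cross_val_score

  # Synthetic stand-in data; the real experiment used labeled ruwiki edits.
  X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

  for mf in ("log2", None):
      model = GradientBoostingClassifier(max_features=mf, random_state=0)
      scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
      print(mf, round(scores.mean(), 3), round(scores.std(), 3))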

Below are excerpts from the ruwiki tuning reports: the max_features="log2" version vs. the max_features=None ("null") version.

1. DAMAGING

Top scoring configurations

model mean(scores) std(scores) params
GradientBoostingClassifier 0.936 0.006 max_depth=7, n_estimators=700, learning_rate=0.01, max_features="log2"
GradientBoostingClassifier 0.936 0.006 max_depth=3, n_estimators=300, learning_rate=0.1, max_features="log2"
GradientBoostingClassifier 0.935 0.007 max_depth=5, n_estimators=700, learning_rate=0.01, max_features="log2"

vs.

Top scoring configurations

model mean(scores) std(scores) params
GradientBoostingClassifier 0.934 0.006 n_estimators=700, learning_rate=0.1, max_depth=1, max_features=null
GradientBoostingClassifier 0.934 0.006 n_estimators=300, learning_rate=0.1, max_depth=3, max_features=null
GradientBoostingClassifier 0.934 0.006 n_estimators=500, learning_rate=0.1, max_depth=1, max_features=null

2. GOODFAITH

RandomForestClassifier (RFC) actually tops the list here with 0.935, but GBC with "log2" at least matches that score in its best configuration.

GradientBoostingClassifier

mean(scores) std(scores) params
0.935 0.008 max_features="log2", max_depth=7, n_estimators=700, learning_rate=0.01
0.934 0.006 max_features="log2", max_depth=7, n_estimators=500, learning_rate=0.01
0.934 0.007 max_features="log2", max_depth=5, n_estimators=700, learning_rate=0.01

vs.

GradientBoostingClassifier

mean(scores) std(scores) params
0.932 0.007 learning_rate=0.01, max_depth=5, max_features=null, n_estimators=500
0.932 0.006 learning_rate=0.01, max_depth=7, max_features=null, n_estimators=500
0.932 0.007 learning_rate=0.01, max_depth=5, max_features=null, n_estimators=300

3. REVERTED

Top scoring configurations

model mean(scores) std(scores) params
GradientBoostingClassifier 0.891 0.008 learning_rate=0.01, max_depth=7, n_estimators=500, max_features="log2"
GradientBoostingClassifier 0.891 0.007 learning_rate=0.01, max_depth=7, n_estimators=700, max_features="log2"
RandomForestClassifier 0.89 0.011 criterion="entropy", max_features="log2", n_estimators=320, min_samples_leaf=5

vs.

[GBC shows up way below RandomForestClassifier here, not even in the top 10.]

GradientBoostingClassifier

mean(scores) std(scores) params
0.886 0.005 learning_rate=0.01, n_estimators=700, max_depth=5, max_features=null
0.884 0.004 learning_rate=0.01, n_estimators=500, max_depth=5, max_features=null
0.884 0.007 learning_rate=0.01, n_estimators=500, max_depth=7, max_features=null
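
The tuning reports above come from the project's own tuning tooling; as a rough, hypothetical approximation in plain scikit-learn, a grid search over the same hyperparameters would look something like this (grid values copied from the reports, everything else illustrative):

  from sklearn.datasets import make_classification
  from sklearn.ensemble import GradientBoostingClassifier
  from sklearn.model_selection import GridSearchCV

  # Synthetic stand-in data, as above.
  X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

  # Grid values taken from the tuning reports above; the project's own
  # tuning utility, not GridSearchCV, produced those reports.
  param_grid = {
      "learning_rate": [0.01, 0.1],
      "max_depth": [1, 3, 5, 7],
      "n_estimators": [300, 500, 700],
      "max_features": ["log2", None],
  }

  search = GridSearchCV(
      GradientBoostingClassifier(random_state=0),
      param_grid,
      scoring="roc_auc",
      cv=5,
  )
  search.fit(X, y)
  print(search.best_params_, round(search.best_score_, 3))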


Sources of inspiration:

   * http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
   * https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/