Research talk:Automated classification of article importance/Work log/2017-06-29

Thursday, June 29, 2017

Today I'll work on a gap analysis for our WikiProjects, and look into whether I can do a global one as well.

A global dataset can be generated using wikiclass; see extract_scores.py. I'm not sure how long it'll take for enwiki, but it shouldn't be too long.

WikiProjects

WikiProject Africa

Rated \ Predicted     Top    High     Mid     Low
Top                 1,813     240     166      43
High                    4   1,245       0       6
Mid                   475     686   2,226     688
Low                 1,368   3,840   5,943  14,972

WikiProject China

Rated \ Predicted     Top    High     Mid     Low
Top                   411       1       0       0
High                   67   1,439      97      83
Mid                   301   2,036   4,638   2,649
Low                   204   2,001   3,759   7,686

WikiProject Judaism

Rated \ Predicted     Top    High     Mid     Low
Top                   233       0       0       1
High                    8     473       8       8
Mid                   101     285     901     311
Low                    96     311     942   2,647

WikiProject Medicine

Rated \ Predicted     Top    High     Mid     Low
Top                    90       2       0       0
High                   26     862      45      35
Mid                   106   1,917   4,114   2,620
Low                    21   1,075   3,810   8,348

Note that WikiProject Medicine defines many categories of articles as "Low-importance" (e.g. all individuals). We have identified 6,619 such articles; they are excluded from this table because their rating is never predicted.

WikiProject National Football League

Rated \ Predicted     Top    High     Mid     Low
Top                   345      11       1       2
High                    0     519       2       0
Mid                    46     262   2,444     283
Low                    12     229     450   3,789

Note that WikiProject National Football League specifies that some categories of articles should have specific importance ratings. Unlike what we did for WikiProject Medicine, these articles have not been excluded from the table. This is because the organization of the Wikidata entities related to these articles is inconsistent, making it impossible for us to correctly identify them.

WikiProject Politics

Rated \ Predicted     Top    High     Mid     Low
Top                   111       0       0       0
High                    2   1,100       6       7
Mid                   206     923   2,441     528
Low                   196   1,997   3,524  13,205

Trends

One clear trend in these confusion matrices is that WikiProjects consistently do not label articles as "Top-importance" even though they have characteristics similar to other articles that do carry this label. Our models very rarely misclassify articles that already have that rating (in WikiProject China, for example, 411 of the 412 Top-importance articles are predicted as such), which means that articles from other classes in the same column are prime candidates for having their rating re-examined. One might be concerned about overfitting on this class, given that our training data includes almost all of those articles. This should not be a problem, because we use oversampling in model training (except for WikiProject Africa, because of its larger size), and that oversampling is based on a k-nearest neighbors approach in which the neighbors can come from any of the classes.
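To make the "same column" reading concrete, here is a minimal sketch that pulls the Top-importance numbers out of the WikiProject China matrix. It assumes that rows hold the rating assigned by the WikiProject and columns hold the rating predicted by the model, which is how the paragraph above reads the tables. The function name `class_summary` is hypothetical and not part of wikiclass.

```python
# Rows: rating assigned by the WikiProject (assumed orientation).
# Columns: rating predicted by the model.
LABELS = ["Top", "High", "Mid", "Low"]

# WikiProject China confusion matrix, copied from the table above.
CHINA = [
    [411, 1, 0, 0],
    [67, 1439, 97, 83],
    [301, 2036, 4638, 2649],
    [204, 2001, 3759, 7686],
]


def class_summary(matrix, label):
    """Summarize one class: recall, precision, and re-rating candidates."""
    k = LABELS.index(label)
    correct = matrix[k][k]
    rated = sum(matrix[k])                     # row total: articles rated `label`
    predicted = sum(row[k] for row in matrix)  # column total: articles predicted `label`
    return {
        "recall": correct / rated,
        "precision": correct / predicted,
        # Articles predicted as `label` but currently rated something else.
        "candidates": predicted - correct,
    }


summary = class_summary(CHINA, "Top")
print(f"recall={summary['recall']:.3f}, "
      f"precision={summary['precision']:.3f}, "
      f"candidates={summary['candidates']}")
```

Under that reading, the `candidates` count is the number of articles a WikiProject might want to look at first, since the model groups them with articles that already carry the Top-importance label.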

The other trend in these matrices is that the boundaries between the High-, Mid-, and Low-importance classes are more difficult to determine. We have seen this previously as well; in some projects, for example, it is not clear whether an article should have a Mid- or Low-importance rating. Because we chose WikiProjects that tend to define importance through article views, and we have views as a predictor in our models, this suggests that these WikiProjects might want to re-examine their definitions and how they apply their ratings. That could lead to ratings that are both more clearly aligned with the definitions and more consistently applied.
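One rough way to quantify how blurry each boundary is: count the articles that are rated as one class but predicted as the neighboring class (in either direction), relative to all articles rated in those two classes. The sketch below does this for the WikiProject Medicine matrix, again assuming rows are the WikiProject's ratings and columns are the predictions. The helper `boundary_confusion` is a hypothetical name used only for illustration.

```python
# Rows: rating assigned by the WikiProject (assumed orientation).
# Columns: rating predicted by the model.
LABELS = ["Top", "High", "Mid", "Low"]

# WikiProject Medicine confusion matrix, copied from the table above.
MEDICINE = [
    [90, 2, 0, 0],
    [26, 862, 45, 35],
    [106, 1917, 4114, 2620],
    [21, 1075, 3810, 8348],
]


def boundary_confusion(matrix, first, second):
    """Return (swapped, rated, share) for the boundary between two classes.

    swapped: articles rated `first` but predicted `second`, or vice versa.
    rated:   all articles rated as either `first` or `second`.
    share:   swapped / rated.
    """
    a, b = LABELS.index(first), LABELS.index(second)
    swapped = matrix[a][b] + matrix[b][a]
    rated = sum(matrix[a]) + sum(matrix[b])
    return swapped, rated, swapped / rated


for first, second in [("Top", "High"), ("High", "Mid"), ("Mid", "Low")]:
    swapped, rated, share = boundary_confusion(MEDICINE, first, second)
    print(f"{first}/{second}: {swapped:,} of {rated:,} articles ({share:.1%})")
```

This is only one possible measure; it ignores disagreements that skip a class (for example Low predicted as High), which would need to be counted separately.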
