Research talk:Automated classification of article importance/Work log/2017-06-29

Thursday, June 29, 2017

Today I'll work on a gap analysis for our WikiProjects, and look into whether I can do a global one as well.

A global dataset can be generated using wikiclass; see extract_scores.py. I'm not sure how long it'll take for enwiki, but it shouldn't be too long.

WikiProjects

WikiProject Africa

Rated \ Predicted	Top	High	Mid	Low
Top	1,813	240	166	43
High	4	1,245	0	6
Mid	475	686	2,226	688
Low	1,368	3,840	5,943	14,972

WikiProject China

Rated \ Predicted	Top	High	Mid	Low
Top	411	1	0	0
High	67	1,439	97	83
Mid	301	2,036	4,638	2,649
Low	204	2,001	3,759	7,686

WikiProject Judaism

Rated \ Predicted	Top	High	Mid	Low
Top	233	0	0	1
High	8	473	8	8
Mid	101	285	901	311
Low	96	311	942	2,647

WikiProject Medicine

Rated \ Predicted	Top	High	Mid	Low
Top	90	2	0	0
High	26	862	45	35
Mid	106	1,917	4,114	2,620
Low	21	1,075	3,810	8,348

Note that WikiProject Medicine defines many categories of articles as "Low-importance" (e.g. all individuals). We have identified 6,619 such articles; they are excluded from this table because their rating is never predicted.

WikiProject National Football League

Rated \ Predicted	Top	High	Mid	Low
Top	345	11	1	2
High	0	519	2	0
Mid	46	262	2,444	283
Low	12	229	450	3,789

Note that WikiProject National Football League also specifies importance ratings for some categories of articles, but unlike for WikiProject Medicine, these articles have not been excluded from the table. The Wikidata entities related to these articles are organized inconsistently, which makes it impossible for us to identify them correctly.

WikiProject Politics

Rated \ Predicted	Top	High	Mid	Low
Top	111	0	0	0
High	2	1,100	6	7
Mid	206	923	2,441	528
Low	196	1,997	3,524	13,205

Trends

One clear trend in these confusion matrices is that WikiProjects consistently leave articles without a "Top-importance" rating even though those articles have characteristics similar to articles that do have it. Our models are highly accurate at labelling that class of articles, which means that articles from other classes in the same column are prime candidates for having their rating re-examined. One might be concerned about overfitting on this class, given that our training data includes almost all of those articles. This should not be a problem, because we use oversampling in model training (except for WikiProject Africa, because of its larger size), and that oversampling is based on a k-nearest neighbors approach in which the neighbors can come from any of the classes.
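As a rough illustration of how the re-rating candidates can be read off a matrix, here is a minimal sketch using the WikiProject China counts from above. It assumes that rows are the ratings assigned by the WikiProject and columns are the model's predictions; the function names are made up for this example and are not part of our pipeline.

```python
# Assumption: rows = rating assigned by the WikiProject, columns = model
# prediction. Counts are copied from the WikiProject China table above.
CLASSES = ["Top", "High", "Mid", "Low"]

china = {
    "Top":  [411, 1, 0, 0],
    "High": [67, 1439, 97, 83],
    "Mid":  [301, 2036, 4638, 2649],
    "Low":  [204, 2001, 3759, 7686],
}

def column_candidates(matrix, predicted):
    """Per rated class, count articles predicted as `predicted`
    although the WikiProject rated them differently."""
    col = CLASSES.index(predicted)
    return {rated: row[col] for rated, row in matrix.items()
            if rated != predicted}

def share_correct(matrix, label):
    """Share of articles rated `label` that the model also predicts
    as `label` (the diagonal cell divided by the row total)."""
    row = matrix[label]
    return row[CLASSES.index(label)] / sum(row)

candidates = column_candidates(china, "Top")
total_candidates = sum(candidates.values())
top_share = share_correct(china, "Top")

print(candidates)
print(total_candidates)
print(round(top_share, 3))
```

Under that reading, 411 of the 412 Top-rated articles in WikiProject China are predicted as Top, and 572 articles with a lower rating land in the Top column; those 572 are the candidates discussed above.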

The other trend in these matrices is that the boundaries between the High-, Mid-, and Low-importance classes are harder to determine. We have seen this previously as well; for example, in some projects it is not clear whether an article should have a Mid- or Low-importance rating. Because we chose WikiProjects that tend to define importance through article views, and we use those views as a predictor in our models, this suggests that these WikiProjects might want to re-examine their definitions and how they apply their ratings. That could lead to ratings that are both more clearly aligned with the definition and more consistently applied.
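To put a number on how blurry each boundary is, one option is to add up the disagreements in both directions for each pair of classes. The sketch below does this for the WikiProject Medicine counts from above, again assuming that rows are assigned ratings and columns are predictions; `pair_confusion` is a hypothetical helper written for this example.

```python
# Assumption: rows = rating assigned by the WikiProject, columns = model
# prediction. Counts are copied from the WikiProject Medicine table above.
CLASSES = ["Top", "High", "Mid", "Low"]

medicine = {
    "Top":  [90, 2, 0, 0],
    "High": [26, 862, 45, 35],
    "Mid":  [106, 1917, 4114, 2620],
    "Low":  [21, 1075, 3810, 8348],
}

def pair_confusion(matrix, a, b):
    """Articles rated `a` but predicted `b`, plus articles rated `b`
    but predicted `a`."""
    return (matrix[a][CLASSES.index(b)]
            + matrix[b][CLASSES.index(a)])

boundaries = {
    ("High", "Mid"): pair_confusion(medicine, "High", "Mid"),
    ("Mid", "Low"): pair_confusion(medicine, "Mid", "Low"),
    ("High", "Low"): pair_confusion(medicine, "High", "Low"),
}

for (a, b), count in boundaries.items():
    print(f"{a} <-> {b}: {count}")
```

For WikiProject Medicine this gives 1,962 disagreements across the High/Mid boundary, 6,430 across Mid/Low, and 1,110 between High and Low, so most of the confusion sits between adjacent classes.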
