Research talk:Automated classification of article importance/Work log/2017-03-14

Tuesday, March 14, 2017

Today I will continue expanding our page discussing sources of signal, but my main focus will be training and evaluating a WikiProject-specific classifier. We think WikiProject Medicine is a good place to start, as it has a reasonably clear definition of importance. I have previously used WPMED as a test project, so part of the work is already done.

Primary analysis

As we did for the dataset of unanimously rated articles last Wednesday, we start with a straightforward analysis of the distributions of inlinks and views. The difference is that in this case we also have a measure of indegree (number of incoming links) restricted to the WikiProject; in other words, it counts how many other articles in the WikiProject link to a given article.
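To make the two indegree measures concrete, here is a minimal sketch of how they could be counted from a list of links. The `(source, target)` input format, the function name `indegrees`, and the toy article titles are all assumptions made for illustration; this is not the pipeline that produced the numbers below. Skipping self-links and counting each linking article once are also assumptions.

```python
from collections import Counter


def indegrees(links, project_articles):
    """Return (global, project-restricted) inlink counts per target article.

    links: iterable of (source, target) title pairs (hypothetical format).
    project_articles: set of titles tagged by the WikiProject.
    """
    global_in = Counter()
    project_in = Counter()
    for source, target in set(links):  # count each linking article once
        if source == target:
            continue  # assumption: self-links are ignored
        global_in[target] += 1
        if source in project_articles:
            # The linking article is itself part of the WikiProject.
            project_in[target] += 1
    return global_in, project_in


# Toy data, invented for illustration only.
links = [
    ("Aspirin", "Pain"),
    ("Ibuprofen", "Pain"),
    ("Football", "Pain"),
    ("Pain", "Aspirin"),
]
project = {"Aspirin", "Ibuprofen", "Pain"}

global_in, project_in = indegrees(links, project)
print(global_in["Pain"], project_in["Pain"])  # 3 2
```

In the toy data, "Pain" has three inlinks overall, but only two of them come from articles within the project.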

Rating	Measure	Minimum	First quartile	Median	Mean	Third quartile	Maximum
Top	Num. Inlinks	54.0	674.0	1,166.0	1,850.0	2,384.0	8,345.0
	Num. Project Inlinks	4.0	246.8	363.0	471.5	653.5	1,852.0
	Num. views	218.0	1,441.0	2,504.0	3,132.0	4,023.0	10,890.0
High	Num. Inlinks	0.0	115.0	279.0	522.5	625.0	6,133.0
	Num. Project Inlinks	0.0	47.0	122.0	190.2	246.0	4,697.0
	Num. views	3.0	285.0	722.0	1,171.0	1,508.0	11,520.0
Mid	Num. Inlinks	0.0	16.0	74.0	157.1	177.0	19,700.0
	Num. Project Inlinks	0.0	6.0	42.0	71.9	101.0	3,914.0
	Num. views	1.0	29.0	110.0	326.5	356.0	14,420.0
Low	Num. Inlinks	0.0	3.0	11.0	70.1	68.0	21,320.0
	Num. Project Inlinks	0.0	1.0	3.0	23.6	21.0	4,092.0
	Num. views	0.0	7.0	18.0	95.2	70.0	16,970.0

There are two main patterns to be found in this table:

There is some ability to distinguish between the classes. We can see this by examining the first quartile, median, and third quartile of each class. Top-importance rated articles tend to have more inlinks and views than High-importance articles, which have more than Mid-importance articles, which have more than Low-importance articles. This might be easier to see once we visualize the data using density plots.
The mean is larger than the median for every class and measure, often by a wide margin. The maximum is also much larger than the third quartile, often by an order of magnitude or more. Together these strongly suggest that we are looking at right-skewed distributions. These measures will therefore most likely need some kind of transformation before we use them to train a classifier.
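The skew can be checked directly from the summary statistics. The sketch below copies the median, mean, third quartile, and maximum from the table above and computes two rough indicators: the mean/median ratio and the maximum/third-quartile ratio. The helper name `skew_indicators` and the choice of these two ratios are illustrative assumptions, not a formal skewness statistic.

```python
# (rating, measure): (median, mean, third quartile, maximum), copied from the table.
rows = {
    ("Top", "inlinks"): (1166.0, 1850.0, 2384.0, 8345.0),
    ("Top", "project inlinks"): (363.0, 471.5, 653.5, 1852.0),
    ("Top", "views"): (2504.0, 3132.0, 4023.0, 10890.0),
    ("High", "inlinks"): (279.0, 522.5, 625.0, 6133.0),
    ("High", "project inlinks"): (122.0, 190.2, 246.0, 4697.0),
    ("High", "views"): (722.0, 1171.0, 1508.0, 11520.0),
    ("Mid", "inlinks"): (74.0, 157.1, 177.0, 19700.0),
    ("Mid", "project inlinks"): (42.0, 71.9, 101.0, 3914.0),
    ("Mid", "views"): (110.0, 326.5, 356.0, 14420.0),
    ("Low", "inlinks"): (11.0, 70.1, 68.0, 21320.0),
    ("Low", "project inlinks"): (3.0, 23.6, 21.0, 4092.0),
    ("Low", "views"): (18.0, 95.2, 70.0, 16970.0),
}


def skew_indicators(median, mean, q3, maximum):
    """Return (mean/median, max/Q3); values well above 1 hint at right skew."""
    return mean / median, maximum / q3


indicators = {key: skew_indicators(*vals) for key, vals in rows.items()}
for (rating, measure), (mm, mq) in indicators.items():
    print(f"{rating:<5}{measure:<16} mean/median={mm:5.2f}  max/Q3={mq:7.1f}")
```

Both ratios exceed 1 in every row, and they are largest for the Mid- and Low-importance classes, where a few heavily linked or heavily viewed articles pull the mean and maximum far above the bulk of the distribution.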

Graphical analysis

Density plot of number of inlinks.

Density plot of number of inlinks within WikiProject Medicine.

Density plot of number of views.

We created three graphs based on the previous numerical analysis. Given the skewed distribution of both number of inlinks and number of article views as described previously, we apply a log-10 transformation (log10(1 + x)) to these in order to reduce the skewness.
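As a small illustration of the transformation, the sketch below applies log10(1 + x) to the five-number summary of global inlinks for Low-importance articles, taken from the table. The function name `log10_1p` is an assumption for this example; adding 1 before taking the logarithm keeps articles with zero inlinks or views defined, mapping them to 0.

```python
import math


def log10_1p(x):
    """log10(1 + x); defined at x = 0, where a plain log10 is not."""
    return math.log10(1 + x)


# Min, Q1, median, Q3, max of global inlinks for Low-importance articles (from the table).
low_inlinks = [0.0, 3.0, 11.0, 68.0, 21320.0]
transformed = [log10_1p(x) for x in low_inlinks]

for raw, t in zip(low_inlinks, transformed):
    print(f"{raw:>8.0f} -> {t:.2f}")
```

On this scale the maximum of 21,320 maps to roughly 4.33 and the median of 11 to roughly 1.08, so the extreme values no longer dominate the axis.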

The plots above visualize the data behind the table seen earlier and provide us with more detail. From the plots of number of inlinks, both globally and limited to WikiProject Medicine, it appears that Top-importance articles are different from the others. They typically have a much higher number of links pointing to them from other articles.

High-importance articles also have a large number of links pointing to them, but the distribution is more drawn out, particularly on the lower end, blending it into Mid-importance articles. Mid- and Low-importance articles both show a bimodal distribution, with a second peak occurring somewhere in the middle of the other class. This peak appears to be somewhat reduced in the counts of inlinks restricted to those within WikiProject Medicine. The boundaries between High-/Mid-/Low-importance articles might therefore not be clearly distinct, and it might be that we can benefit from using both global and project-specific inlink counts in our models.

For number of views we see a similar trend in that Top-importance articles tend to have more of them, and are rather distinct from at least Mid- and Low-importance articles. The High-importance class overlaps somewhat with its two neighboring classes. Low-importance articles are often of rather low popularity. Some of these results might come from the project's definition of importance, which seems oriented towards prioritizing higher-popularity topics.
