Research talk:Automated classification of article importance/Work log/2017-03-24

Friday, March 24, 2017

Today I'll continue processing the clickstream dataset and pick up my conversation with WPMED.

Clickstream metrics

I'd like to make sure we're extracting the right measurements from the clickstream dataset and have been working on getting that set of measurements ready. The list I currently have is:

Proportion of clicks from other articles
We use popularity as a measure of encyclopedic importance under the assumption that more popular articles are more important. The proportion of an article's traffic that arrives through clicks from other articles indicates to what extent its popularity results from its encyclopedic relevance or, conversely, from exogenous causes.
Proportion of inlinks with clicks
We use indegree as a measure of encyclopedic importance under the assumption that high indegree signifies topical relevance to a larger portion of the encyclopedia. The proportion of inlinks that actually generate traffic should tell us whether readers find a given article topically relevant, and thereby whether its inlinks are useful.

We also want to have these measurements restricted to "from other articles within a given WikiProject", provided we are inputting a dataset of project-specific articles.
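The two measurements, with the optional project restriction, can be sketched roughly as below. This is a minimal illustration and not the code used in this project. It assumes clickstream rows shaped as `(prev, curr, type, n)` with article-to-article clicks marked by the type `link`, and it assumes the inlink sets are obtained separately. The function name, the `inlinks` mapping, the `project` set, and the sample rows are all hypothetical. For the restricted variant it also assumes that the denominator for clicks stays the article's total traffic while the denominator for inlinks shrinks to inlinks from project articles, which is only one of several reasonable definitions.

```python
from collections import defaultdict


def clickstream_metrics(rows, inlinks, project=None):
    """Compute per-article clickstream proportions.

    rows: iterable of (prev, curr, link_type, n) tuples (assumed format).
    inlinks: dict mapping an article title to the set of titles linking to it.
    project: optional set of titles; if given, only project articles
        count as sources of clicks and inlinks.
    """
    total = defaultdict(int)          # all recorded traffic per target
    from_articles = defaultdict(int)  # traffic arriving via article links
    clicked = defaultdict(set)        # sources with at least one click

    for prev, curr, link_type, n in rows:
        total[curr] += n
        if link_type != "link":
            continue
        if project is not None and prev not in project:
            continue
        from_articles[curr] += n
        clicked[curr].add(prev)

    metrics = {}
    for article, views in total.items():
        sources = inlinks.get(article, set())
        if project is not None:
            sources = sources & project
        used = clicked[article] & sources
        metrics[article] = {
            "prop_clicks_from_articles":
                from_articles[article] / views if views else None,
            "prop_inlinks_with_clicks":
                len(used) / len(sources) if sources else None,
        }
    return metrics


# Made-up rows for illustration only; these are not real clickstream counts.
rows = [
    ("other-search", "Aspirin", "external", 60),
    ("Ibuprofen", "Aspirin", "link", 30),
    ("Bayer", "Aspirin", "link", 10),
]
inlinks = {"Aspirin": {"Ibuprofen", "Bayer", "Headache", "Felix Hoffmann"}}

overall = clickstream_metrics(rows, inlinks)
within_project = clickstream_metrics(
    rows, inlinks, project={"Ibuprofen", "Headache"}
)
```

Returning `None` when an article has no inlinks in scope keeps "no inlinks" distinguishable from "inlinks that nobody clicks", which would otherwise both show up as zero.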

Other metrics

One issue raised in the thread was overlap between projects. WPMED has several categories of articles that are Low-importance by default, and these are perhaps articles that are also covered by other projects. There are currently 19,672 articles in Category:Low-importance medicine articles. Of these, 13,648 appear to be connected to multiple projects, whereas 6,026 appear to be rated only by WPMED (ref these SQL queries). Some articles in the latter group are nonetheless connected to other projects (e.g. individuals often appear to be tagged by WikiProject Biography as well), so the proportion of Low-importance articles actually related to multiple projects is probably higher.

Since the discussion raised the point of thinking afresh, it might be worth asking whether the way we're tagging talk pages is the right approach. For example, say a biography is tagged by both WPBio and WPMED. Should WPBio have a medicine task force, or should WPMED have a "Society and medicine" task force for biographies? With the current approach, this overlap is most likely only captured on the article's talk page and in the category system. However, if we're looking to build something that encodes this, I suspect we would quickly move away from tagging talk pages and end up on Wikidata, which I'm not sure is the right way forward.

Overall clickthrough rate

I've been curious about the extent to which readers on Wikipedia follow links to other articles. Research I've pointed to previously suggested the proportion is low (around 1/3). I recently found Research:Wikipedia clickstream top referrers, which gives 26.57% under a strict definition (the referrer is known and is a linked article). If we discard all hits with no referrer, the proportion is 35.76%. The actual proportion is therefore somewhere in that range.
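The reasoning behind that range can be written out as a small calculation. This is a sketch under my own assumptions: the lower bound treats every hit without a referrer as "not a link click", while the upper bound drops those hits from the denominator entirely. The function name and the counts in the example are made up for illustration and are not taken from the clickstream dataset.

```python
def clickthrough_bounds(link_hits, total_hits, no_referrer_hits):
    """Return (lower, upper) bounds on the share of hits that are link clicks.

    lower: hits without a referrer are counted as non-clicks.
    upper: hits without a referrer are discarded before dividing.
    """
    known_referrer_hits = total_hits - no_referrer_hits
    if total_hits <= 0 or known_referrer_hits <= 0:
        raise ValueError("need a positive number of hits in both denominators")
    lower = link_hits / total_hits
    upper = link_hits / known_referrer_hits
    return lower, upper


# Hypothetical counts: 250 link clicks among 1000 hits, 200 with no referrer.
lower, upper = clickthrough_bounds(250, 1000, 200)
```

The true rate lies between the two values because some unknown share of the no-referrer hits may in fact have been link clicks whose referrer was not recorded.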
