Research talk:Automated classification of article importance/Work log/2017-05-01

Monday, May 1, 2017

Today I'll focus on wrapping up the candidate WikiProject selection and improving the pipeline for gathering project data.

WikiProject Candidates

There are some trends in the analysis of how the various WikiProjects define importance.

Nine projects define importance through the geographic reach of a topic's notability. In other words, a topic that is globally known is Top-importance, whereas something that is notable only on a local level is Low-importance.
Six projects (Europe, Chicago, Politics, Judaism, National Football League, and China) refer to what the "average Wikipedia reader" is likely to look up on Wikipedia when defining importance. Unlike the other five, WikiProject China does not mention this specifically in its general description, but the way it describes Top-importance, by referring to what non-Chinese readers are likely to look up, suggests a similar concept.
Three projects (Television Stations, Beauty Pageants, Rugby league) do not define their own set of importance ratings, but instead refer to the WP 1.0 importance scale.
Two projects (Iran and Buddhism) appear to define importance relative to a key topic article (e.g. Iran in the case of WikiProject Iran), mainly through what relationship the topic has with that article. For example, if an article describes a topic that has its own section in the article about Iran, it is Top-importance. WikiProject Politics of the United Kingdom has a related importance scheme based on key UK political parties and representatives of these parties, although that project also has a clear geographic component in its importance ratings (e.g. local politicians are Low-importance).
One project (Malaysia) specifically describes how certain categories generally get a specific importance rating. For example, articles about Malaysian companies are Low-importance, similar to what we saw in WPMED.
One project (Dungeons & Dragons) extends the importance scale with "bottom importance" below "low importance".

In summary, it seems quite common among these projects to define importance through how likely it is for a random person on the globe to know about the topic. Some projects specifically spell this out by referring to what the "average reader of Wikipedia" might look up. These projects might be more interested in our approach since we use article view statistics.

The number of Top-importance articles is generally the limiting factor for how large a dataset we can gather. When sorting the list of candidates by that column, we can see a clear separation between projects above and below 100 Top-importance articles. Based on our experience with WPMED, having more than 100 articles in the smallest class might give us solid performance. In other words, the projects above that threshold should be prime candidates for collaboration. There's also a split above/below 50 articles, with four projects having between 50 and 100 Top-importance articles. These might be secondary candidates for collaboration.
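
As a rough sketch of that split, the thresholds could be applied to the candidate list like this. The function name `tier_candidates` and the input mapping of project name to Top-importance count are made up for illustration, and treating a project with exactly 100 articles as secondary is an assumption, since the text above only says "more than 100" and "between 50 and 100".

```python
from typing import Dict, List

PRIMARY_MIN = 100    # primary candidates have more than this many Top-importance articles
SECONDARY_MIN = 50   # secondary candidates have at least this many (assumed inclusive)


def tier_candidates(top_counts: Dict[str, int]) -> Dict[str, List[str]]:
    """Group projects into tiers by their number of Top-importance articles.

    `top_counts` maps a project name to its Top-importance article count
    (hypothetical input; the real counts come from the candidate list).
    Within each tier, projects are ordered from largest to smallest.
    """
    tiers: Dict[str, List[str]] = {"primary": [], "secondary": [], "other": []}
    ordered = sorted(top_counts.items(), key=lambda kv: kv[1], reverse=True)
    for project, n_top in ordered:
        if n_top > PRIMARY_MIN:
            tiers["primary"].append(project)
        elif n_top >= SECONDARY_MIN:
            tiers["secondary"].append(project)
        else:
            tiers["other"].append(project)
    return tiers
```

Keeping the thresholds as named constants makes it easy to revisit them if experience with other projects suggests a different minimum class size than what worked for WPMED.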

Four of the projects with more than 100 Top-importance articles also refer to the "average reader of Wikipedia" in their importance definitions: Politics, Judaism, NFL, and China. Gathering data for those first might be a good priority. Given its size, WikiProject Africa might also be a good priority.

WikiProject pipeline

The pipeline for data gathering and processing for WikiProjects has so far been encoded in various scripts. Now that we're interested in approaching additional WikiProjects, formalizing and streamlining this pipeline would be very useful. Here's a proposed formalization:

Gather a snapshot of the given WikiProject's Top-, High-, Mid-, and Low-importance articles. We disregard pages with NA-importance, as those should be redirects and disambiguation pages, and articles rated Unknown-importance, as that is "not an importance rating". There is a key data cleaning opportunity here: we can check that the articles we gathered are not redirects (easily detected using the redirect table) or disambiguation pages. In English Wikipedia, disambiguation pages can be found through Category:All disambiguation pages, but we can also check whether the page's associated Wikidata item is an instance of Wikimedia disambiguation page (Q4167410).
Using the given snapshot, gather:
1. Global and project-internal counts of inlinks
2. Article view rates
3. Wikidata item associated with each article
Parse the clickstream dataset
Using the Wikidata items found previously, gather the Wikidata subclass/superclass network
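
The cleaning part of the first step could look roughly like the sketch below. It works on data that has already been fetched, so there are no database or API calls in it. The record layout (`page_id`, `importance`, `qid`), the function names, and the idea of passing in a set of redirect page IDs are all assumptions for illustration; the only parts taken from Wikidata itself are the "instance of" property (P31), the item Q4167410, and the layout of claims in the standard entity JSON.

```python
from typing import Any, Dict, Iterable, List, Set

RATED_CLASSES = {"Top", "High", "Mid", "Low"}   # NA and Unknown are disregarded
DISAMBIGUATION_QID = "Q4167410"                 # Wikimedia disambiguation page


def instance_of(entity: Dict[str, Any]) -> Set[str]:
    """Return the QIDs listed in an entity's P31 ("instance of") statements.

    `entity` is assumed to follow the Wikidata entity JSON layout:
    claims -> P31 -> [ { mainsnak: { datavalue: { value: { id } } } } ].
    Statements without a concrete value have no datavalue and are skipped.
    """
    qids: Set[str] = set()
    for statement in entity.get("claims", {}).get("P31", []):
        datavalue = statement.get("mainsnak", {}).get("datavalue")
        if not datavalue:
            continue
        value = datavalue.get("value")
        if isinstance(value, dict) and "id" in value:
            qids.add(value["id"])
    return qids


def clean_snapshot(
    articles: Iterable[Dict[str, Any]],
    redirect_ids: Set[int],
    entities: Dict[str, Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Keep rated articles that are neither redirects nor disambiguation pages.

    `redirect_ids` is assumed to hold the page IDs found in the redirect
    table, and `entities` maps a QID to its already-fetched entity JSON.
    Articles without a known Wikidata item are kept.
    """
    kept = []
    for article in articles:
        if article["importance"] not in RATED_CLASSES:
            continue
        if article["page_id"] in redirect_ids:
            continue
        entity = entities.get(article.get("qid"))
        if entity is not None and DISAMBIGUATION_QID in instance_of(entity):
            continue
        kept.append(article)
    return kept
```

Since the Wikidata item is gathered in a later step anyway, the disambiguation check could also be run after that step instead of during the snapshot; the sketch leaves that ordering open by taking the entities as an argument.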
