Research:Automated classification of article importance/Insights

See this post to wiki-research-l. A key point for our work is that the quotes below show that we are encoding community practices and ideas.

"writers are not a proxy for readers as the editor surveys suggest that Wikipedia writers are not typical of broader society on at least two variables: gender and level of education"
In addition to the editor surveys, there is also research such as West et al.[1] and Lam et al.[2], where a gender perspective is taken.
"And I think there are two kinds of inbound links to be considered, those coming from other articles within the same WikiProject and those coming from outside that Wikiproject."
This echoes how our model has a predictor for proportion of links from within the WikiProject.
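As an illustration of what such a predictor could look like, here is a minimal sketch. It assumes we already have the list of articles linking to a given article and the set of articles in the WikiProject's scope; the function name and the sample titles are hypothetical and not taken from our model's code.

```python
def project_inlink_proportion(inlink_sources, project_articles):
    """Fraction of an article's distinct inlinks that come from
    articles within the same WikiProject (hypothetical helper)."""
    sources = set(inlink_sources)
    if not sources:
        # No inlinks at all: define the proportion as zero.
        return 0.0
    within = sources & set(project_articles)
    return len(within) / len(sources)


# Made-up example: three of the four linking articles are in the project.
inlinks = ["Oslo", "Bergen", "Norway", "Fjord"]
project = {"Oslo", "Bergen", "Norway", "Trondheim"}
print(project_inlink_proportion(inlinks, project))
```

A value close to 1 would suggest the article is mainly referenced from inside the project, while a value close to 0 would suggest its inlinks mostly come from outside it.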
"This spike in pageviews occurs all the time when some topic is in the news (even peripherally as in this case where it is not the article about the terrorist attack but about the street in which it occurred). Did the street become more "important"?"
See our ICWSM 2015 paper[3]: about half of the most popular articles show this kind of transient behaviour.
"So I suspect the links in the lede paras are of greater relevance to the assessment of importance than links further down in the article which will be more likely relate to details of a topic and may include examples and counter-examples (this is a way in which high importance article may mention much lower importance articles)."
This echoes how we incorporate clickstream data in our model to understand to what extent an article's traffic comes from within Wikipedia, meaning it supports the content of other articles, or from elsewhere (e.g. search engines). We also use the clickstream data to distinguish between active (clicked) inlinks and inactive inlinks, meaning that if an article is linked from many other articles but those are rarely used, it should be predicted as less important.
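The two clickstream-based ideas above can be sketched as follows. This is only an illustration under stated assumptions: the input layout (a set of inlinking titles, a mapping from source title to click count, and a count of views arriving from outside Wikipedia), the `min_clicks` threshold, and all names are hypothetical and do not describe the actual clickstream dataset format or our model's feature code.

```python
def clickstream_features(inlinks, clicks, external_views, min_clicks=1):
    """Summarise how an article's inlinks and traffic are used.

    inlinks        -- titles of articles linking to the target (assumed input)
    clicks         -- dict mapping source title to clicks on the link (assumed input)
    external_views -- views arriving from outside Wikipedia, e.g. search engines
    min_clicks     -- hypothetical threshold for calling an inlink "active"
    """
    inlinks = set(inlinks)
    # An inlink is "active" if readers actually click it often enough.
    active = {src for src in inlinks if clicks.get(src, 0) >= min_clicks}
    internal_views = sum(clicks.get(src, 0) for src in inlinks)
    total_views = internal_views + external_views
    return {
        "n_active_inlinks": len(active),
        "n_inactive_inlinks": len(inlinks) - len(active),
        "prop_active_inlinks": len(active) / len(inlinks) if inlinks else 0.0,
        "prop_internal_traffic": internal_views / total_views if total_views else 0.0,
    }


# Made-up example: four inlinks, only two of which are ever clicked.
features = clickstream_features(
    inlinks=["A", "B", "C", "D"],
    clicks={"A": 30, "B": 10},
    external_views=60,
)
print(features)
```

Under this sketch, an article with many inlinks but a low proportion of active ones would be a candidate for a lower predicted importance, in line with the reasoning above.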

Key graphs/illustrations

 

The graph above plots WikiProjects based on the activity on their pages and on articles within the scope of the project over the 180 days prior to 2017-04-24, with a color gradient showing the proportion of articles they have rated as "unknown" importance. This graph motivates our work because it shows recent activity and, at the same time, indicates how tools built around classifiers can help get work done by reducing the cost of rating the "unknown" articles.

References

  1. West, R.; Weber, I.; and Castillo, C. 2012. Drawing a Data-driven Portrait of Wikipedia Editors. In Proc. of OpenSym/WikiSym, 3:1–3:10.
  2. Lam, S. T. K.; Uduwage, A.; Dong, Z.; Sen, S.; Musicant, D. R.; Terveen, L.; and Riedl, J. 2011. WP:Clubhouse?: An Exploration of Wikipedia's Gender Imbalance. In Proc. of WikiSym, 1–10.
  3. Warncke-Wang, M.; Ranjan, V.; Terveen, L.; and Hecht, B. 2015. Misalignment Between Supply and Demand of Quality Content in Peer Production Communities. In Proc. of ICWSM.