Research talk:Automated classification of article importance/Work log/2017-04-28

Friday, April 28, 2017

Today I'll focus on the WikiProject candidates I've selected, to understand more about them.

WikiProject candidates

I created a list of candidates. Before we start processing their articles and building models, I want to look at how many articles they have in the different importance classes. Here's a table with info on that:

proj_name | n_top | n_high | n_mid | n_low | n_na | n_unknown | n_unassessed | n_articles
WikiProject Africa | 2,289 | 1,261 | 4,097 | 25,952 | 2,220 | 34,945 | 10,173 | 80,937
WikiProject Albums | 138 | 963 | 4,133 | 59,107 | 11,507 | 71,281 | 24,896 | 172,025
WikiProject Beauty Pageants | 12 | 36 | 131 | 742 | 349 | 3,192 | 1,522 | 5,984
WikiProject Buddhism | 115 | 167 | 396 | 1,347 | 149 | 2,514 | 1 | 4,689
WikiProject Chicago | 63 | 142 | 1,447 | 12,113 | 647 | 21,918 | 7,274 | 43,604
WikiProject China | 418 | 1,728 | 9,824 | 13,768 | 2,260 | 19,114 | 3,734 | 50,846
WikiProject Cycling | 13 | 67 | 463 | 9,202 | 312 | 10,602 | 551 | 21,210
WikiProject Dungeons & Dragons | 14 | 139 | 357 | 1,346 | 1,168 | 1,032 | 6 | 4,062
WikiProject Europe | 22 | 79 | 332 | 1,393 | 80 | 2,448 | 737 | 5,091
WikiProject Historic sites | 24 | 125 | 1,001 | 4,051 | 78 | 3,263 | 11 | 8,553
WikiProject Horror | 30 | 169 | 514 | 5,167 | 346 | 5,169 | 707 | 12,102
WikiProject Iran | 108 | 617 | 769 | 2,976 | 8,019 | 68,961 | 8,170 | 89,620
WikiProject Judaism | 235 | 513 | 1,636 | 4,083 | 364 | 3,337 | 868 | 11,036
WikiProject Korea | 66 | 545 | 1,916 | 9,295 | 501 | 8,469 | 843 | 21,635
WikiProject Malaysia | 24 | 463 | 1,619 | 2,681 | 134 | 2,500 | 1,625 | 9,046
WikiProject Motorsport | 51 | 130 | 751 | 3,197 | 148 | 3,018 | 2,320 | 9,615
WikiProject National Football League | 365 | 516 | 3,008 | 4,425 | 338 | 19,112 | 2 | 27,766
WikiProject Olympics | 25 | 647 | 7,051 | 47,797 | 373 | 47,662 | 5,038 | 108,593
WikiProject Pharmacology | 116 | 583 | 1,901 | 3,152 | 285 | 4,456 | 464 | 10,957
WikiProject Politics | 115 | 1,123 | 4,133 | 20,015 | 1,037 | 13,818 | 7,315 | 47,556
WikiProject Politics of the United Kingdom | 175 | 319 | 1,469 | 9,322 | 418 | 19,385 | 6,077 | 37,165
WikiProject Rock music | 118 | 581 | 1,642 | 4,665 | 358 | 5,171 | 2,811 | 15,346
WikiProject Rugby league | 23 | 175 | 951 | 6,296 | 753 | 4,684 | 1,162 | 14,044
WikiProject Television | 76 | 605 | 3,013 | 32,237 | 14,832 | 37,081 | 19,721 | 107,565
WikiProject Television Stations | 5 | 12 | 186 | 624 | 1,853 | 3,331 | 3,049 | 9,060
WikiProject United Nations | 0 | 3 | 92 | 21 | 19 | 3,290 | 1,697 | 5,122
WikiProject Yugoslavia | 32 | 139 | 305 | 989 | 48 | 892 | 317 | 2,722
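
As a quick sanity check on the table, here is a minimal Python sketch that works on a handful of rows copied from it by hand. It checks that n_articles equals the sum of the seven class counts and orders the projects by n_top. The names `ROWS`, `mismatched_totals`, and `sort_by_top` are made up for this illustration and are not part of any existing tooling.

```python
# Rows copied by hand from the table (a subset, for brevity).
# Column order: n_top, n_high, n_mid, n_low, n_na, n_unknown, n_unassessed, n_articles
ROWS = {
    "WikiProject Africa": (2289, 1261, 4097, 25952, 2220, 34945, 10173, 80937),
    "WikiProject Cycling": (13, 67, 463, 9202, 312, 10602, 551, 21210),
    "WikiProject Europe": (22, 79, 332, 1393, 80, 2448, 737, 5091),
    "WikiProject Television Stations": (5, 12, 186, 624, 1853, 3331, 3049, 9060),
    "WikiProject United Nations": (0, 3, 92, 21, 19, 3290, 1697, 5122),
}


def mismatched_totals(rows):
    """Return the projects whose class counts do not sum to n_articles."""
    return [
        name
        for name, counts in rows.items()
        if sum(counts[:-1]) != counts[-1]
    ]


def sort_by_top(rows):
    """Return project names ordered by n_top, ascending (ties keep input order)."""
    return sorted(rows, key=lambda name: rows[name][0])


if __name__ == "__main__":
    print("Rows with inconsistent totals:", mismatched_totals(ROWS))
    for name in sort_by_top(ROWS):
        print(f"{ROWS[name][0]:>6,}  {name}")
```

Because Python's `sorted` is stable, projects with the same n_top keep the order they had in the input.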

Sorting the table on the number of Top-importance articles (in ascending order), we can make some notes about the WikiProjects, starting with those that have a low number of such articles:

  • WikiProject United Nations is something of a meta-project and appears to be a collaboration between the United Nations and Wikimedia. On talk pages, its tag shows up as part of WikiProject International relations. The project does not define its importance ratings, but does use them.
  • WikiProject Television Stations refers to the WP 1.0 importance scale in its talk page templates. The project does not define its own importance scale.
  • WikiProject Beauty Pageants also refers to the WP 1.0 importance scale instead of its own.
  • WikiProject Cycling has its own importance scale. They define importance by the geographic level at which a topic is notable (e.g. internationally means "top importance", locally means "low importance"). Note that they also specifically point out that importance is orthogonal to article quality.
  • WikiProject Dungeons & Dragons also defines their own importance scale, which adds "Bottom importance" to the existing set of categories.
  • WikiProject Europe both defines and describes their importance scale. Their description specifically mentions the "average reader of Wikipedia needing to look up the topic", suggesting that our measure of viewership should fit well in their case. Unsure whether the project only having 22 articles (and 1 portal) rated Top-importance will be a problem or not, though.
  • WikiProject Rugby league appears to refer to the WP 1.0 importance scale.
  • WikiProject Historic sites defines their own importance scale, which appears to separate classes mainly based on geography.
  • WikiProject Malaysia defines their own importance scale, which mainly distinguishes between international and national notability when it comes to Top/High importance. They also feature a list of certain categories of topics and their general importance ratings right below their importance scale, e.g. companies are generally classed as low to mid importance.
  • WikiProject Olympics have a detailed definition of how various categories of Olympics-related articles should be rated.
  • WikiProject Horror both defines their importance scale and describes in more detail how geographic regions can play into notability, suggesting that something that might not be particularly popular in the U.S. (and therefore perhaps not on English Wikipedia either) can be highly important because it is highly notable in a different area.
  • WikiProject Yugoslavia also define their own importance scale, which focuses on a core set of topics and their related pages. They also cap the number of Top-importance articles at approximately 100, and limit the number of biographies. High importance is also tied to how well known the topic is outside of Yugoslavia.
  • WikiProject Motorsport define their own importance scale, and it seems like the scale is mainly divided by coverage area. Internationally notable topics are Top-importance, those that are notable on a specific continent are High-importance, and so on.
  • WikiProject Chicago define their own importance scale and describe each importance level in great detail. As with WikiProject Europe, the description mentions the need of the average reader of Wikipedia to look up the topic, suggesting that our importance criteria can be a good fit for this project.
  • WikiProject Korea also define their importance scale. The descriptions are fairly short, and appear to be mainly based on fitness for an encyclopedia (Top-importance) and whether the topic is notable on an international or various national levels.
  • WikiProject Television define their own importance scale, but mention that assessing importance is oftentimes not necessary (e.g. for "less influential television articles"). That can explain why they have 37,081 articles of "unknown" importance.
  • WikiProject Iran define their importance scale, with Top-importance based mainly on whether the topic is mentioned in the Iran article. High-importance articles are "vital to understanding Iran", while Mid-importance articles are about topics that are "well known".
  • WikiProject Buddhism define their importance scale in much the same way as WikiProject Iran, but of course centering it around Buddhism instead.
  • WikiProject Politics has an importance scale that is described overall through the notion of the "average reader of Wikipedia", while the individual classes depend on whether the topic has global, international, or local notability.
  • WikiProject Pharmacology's importance scale defines importance mainly through how specific the subject is (e.g. a major class of drugs is Top-importance, while individual drugs are lower) and whether it is well-known, commonly prescribed, etc.
  • WikiProject Rock music has a fairly short definition of their importance scale, mainly centered around the concept of "key" articles.
  • WikiProject Albums defines their importance scale partly relative to whether the article can attain Featured Article or Good Article status, and partly by whether the album is historically and culturally notable. How the albums performed with regard to sales and/or charts is secondary to historical/cultural impact.
  • WikiProject Politics of the United Kingdom are working on rewriting their importance criteria, and have split them up into individual sections for constituencies, parties, and politicians. Constituencies are Low-importance by definition, and there should be no exceptions (this is specifically spelt out on the page). Importance for political parties spans the whole spectrum and depends mainly on representation in UK government, European parliaments, etc. When it comes to politicians, importance relates to their (elected) position, as well as the importance of the party they represented.
  • WikiProject Judaism also define their importance scale using the "average reader of Wikipedia", allowing them to define more popular topics as more important.
  • WikiProject National Football League also uses the average reader to define importance. Their list of examples also specifically differentiates between general and biographical articles, meaning we might have to train two models in their case?
  • WikiProject China has a short definition of importance. It does not describe the "average Wikipedia reader" like some others do, but the descriptions refer to the same concept. The project requests that Top-importance ratings be discussed on its talk page. The descriptions of importance also appear to be somewhat tied to article quality (e.g. "contributes a depth of knowledge to the encyclopedia").
  • WikiProject Africa also has a fairly short definition of importance. Importance is mainly defined through the importance to a specific field, or if the subject is internationally notable.
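
To put the question about small Top-importance classes (e.g. WikiProject Europe's 22 articles) into numbers, here is a minimal Python sketch that computes the share of Top-importance articles among those rated Top, High, Mid, or Low. It covers the five projects whose definitions mention the average reader. The counts are copied by hand from the table; `RATED` and `top_share` are names made up for this illustration, and treating this share as the relevant measure is an assumption, not something decided in this log.

```python
# (n_top, n_high, n_mid, n_low) copied by hand from the table for the
# projects whose importance definitions mention the "average reader".
RATED = {
    "WikiProject Europe": (22, 79, 332, 1393),
    "WikiProject Chicago": (63, 142, 1447, 12113),
    "WikiProject Politics": (115, 1123, 4133, 20015),
    "WikiProject Judaism": (235, 513, 1636, 4083),
    "WikiProject National Football League": (365, 516, 3008, 4425),
}


def top_share(counts):
    """Fraction of Top-importance articles among those rated Top/High/Mid/Low."""
    total = sum(counts)
    # Guard against a project with no rated articles at all.
    return counts[0] / total if total else 0.0


if __name__ == "__main__":
    for name, counts in RATED.items():
        print(f"{name}: {counts[0]} Top of {sum(counts):,} rated "
              f"({top_share(counts):.1%})")
```

Articles in the n_na, n_unknown, and n_unassessed columns are left out here, since they carry no importance label to compare against.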