Research:Automated classification of article importance/Gathering importance data

Some Wikipedia editions have WikiProjects, volunteer groups of contributors interested in a topic (e.g. medicine) or a specific type of work (e.g. the guild of copy editors). These WikiProjects typically add project templates to talk pages, identifying that the corresponding article (or redirect/disambiguation page) is within the scope of the project. When adding these templates, the project members also often assess the article's quality and importance. This information can then be used to gather statistics on how the WikiProject is doing, for example whether their important articles are also of high quality.

Article importance is typically assessed using one of four categories, which are defined in the WP 1.0 Release Version Criteria. In descending order of importance, the four common categories are Top, High, Mid, and Low. The referenced list of criteria also contains two optional categories: Bottom and No. We have also seen the rating "Related" in use (for example by WikiProject National Register of Historic Places). The "No" rating is not in common use; instead, most WikiProjects appear to use the "NA" (for "not an article") rating for disambiguation pages and redirects. "Bottom" is also rare, but it is in use (e.g. by WikiProject Rocketry).

These importance ratings are added and maintained by WikiProject members, and the ratings are only relevant within the scope of a specific WikiProject. Two articles that exemplify this are Waffle and Jimmy Carter. The article about Waffle is tagged by two WikiProjects on the article's talk page. WikiProject Breakfast rates it as "Top-importance", a rating the project defines as "core, highly notable topics about breakfast and breakfast-related topics." (ref Wikipedia:WikiProject Breakfast/Assessment#Importance scale) WikiProject Food and drink, on the other hand, rates it as "High-importance", a rating they define as: "Subject is extremely notable, but has not achieved international notability, or is only notable within a particular continent." (ref Wikipedia:WikiProject Food and drink/Assessment#Importance scale) The ratings of the article about Jimmy Carter span the entire importance spectrum, with at least one project assigning each of the four importance ratings: WikiProject US State Legislatures rates it as "Low-importance", WikiProject Human rights rates it "Mid-importance", WikiProject Politics rates it "High-importance", and WikiProject U.S. Presidents and WikiProject Georgia both rate it "Top-importance". There are also other WikiProjects that have tagged the article, both with and without importance ratings.

As mentioned, these ratings are specific to each WikiProject, allowing project members to rank articles by importance in order to prioritize work. There is currently no known approach for taking these project-specific importance ratings and converting them into a measure of importance across an entire Wikipedia edition. The Version 1.0 Editorial Team project, which publishes sets of Wikipedia articles (for example for offline use), has a selection process for which articles to include in their sets. This process does take the importance rating into account, but also considers other factors such as the quality of the article and the number of page views it gets.

Gathering data

There are two approaches to gathering data on articles with importance ratings:

  1. Processing talk pages
  2. Gathering articles using the category structure

Because WikiProjects put their templates on article talk pages, one approach to gathering articles with ratings is to process these talk pages. The templates often have self-explanatory names (e.g. Template:WikiProject Medicine), but they also have aliases: the WikiProject Medicine template, for example, has 13 template redirects pointing to it (ref this search). In the templates, the named parameter "importance" holds the importance rating.
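As a rough illustration of the talk-page approach, the following Python sketch pulls the "importance" parameter out of template calls in a piece of wikitext. It is a minimal sketch and not the method used in this project: the function name and the sample wikitext are made up for illustration, and the regular expression only handles template calls that contain no nested templates and no piped links in their parameters.

```python
import re

# Matches a template call without nested templates inside it,
# e.g. {{WikiProject Medicine|class=B|importance=high}}.
TEMPLATE_RE = re.compile(r"\{\{([^{}]*)\}\}")


def extract_importance(wikitext):
    """Return {template name: importance rating} for every template
    call that carries a named "importance" parameter."""
    ratings = {}
    for match in TEMPLATE_RE.finditer(wikitext):
        parts = [part.strip() for part in match.group(1).split("|")]
        name, params = parts[0], parts[1:]
        for param in params:
            key, sep, value = param.partition("=")
            if sep and key.strip().lower() == "importance":
                # Ratings are lower-cased so "High" and "high" compare equal.
                ratings[name] = value.strip().lower()
    return ratings


# Hypothetical talk page content, for illustration only.
sample = (
    "{{WikiProject Medicine|class=B|importance=high}}\n"
    "{{WikiProject Food and drink|class=C|importance=Mid}}\n"
    "{{Talk header}}"
)
ratings = extract_importance(sample)
print(ratings)
```

Templates without an "importance" parameter, such as the last one in the sample, are simply skipped.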

The articles and their importance ratings can also be identified through the category structure, because most WikiProjects organize their articles into categories by importance. For example, WikiProject Medicine has a category Category:Medicine articles by importance, which has six sub-categories. In addition to categories for the four aforementioned importance ratings, there is one for "NA-importance", which as discussed earlier stands for "Not an Article" and contains disambiguation and redirect pages, and one for "Unknown-importance", which contains articles that are tagged by the project but do not yet have an importance rating. This makes it fairly easy to identify the articles tagged by a specific WikiProject that have a specific importance rating.
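The six sub-categories can be generated from the phrase a project uses in its category names. The sketch below assumes the "<rating>-importance <phrase> articles" pattern seen in the WikiProject Medicine example; the function name is hypothetical, and the phrase has to be looked up per project because it is not always the project's name.

```python
# The four common ratings plus the two maintenance categories described above.
IMPORTANCE_LEVELS = ["Top", "High", "Mid", "Low", "NA", "Unknown"]


def importance_categories(project_phrase):
    """Build the six importance category titles for one WikiProject.

    `project_phrase` is the phrase the project uses in its category
    names (e.g. "medicine"); it must be supplied by the caller.
    """
    return {
        level: f"Category:{level}-importance {project_phrase} articles"
        for level in IMPORTANCE_LEVELS
    }


cats = importance_categories("medicine")
for level, title in cats.items():
    print(level, "->", title)
```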

Challenges

If one is gathering importance data by parsing talk pages, one challenge is that the WikiProject templates can have aliases, as discussed above. For example, {{Wikiproject Medicine}} and {{WPMED}} are both the WikiProject Medicine template, and the latter variant is widely used. It will therefore be necessary to maintain a list of aliases in order to correctly identify which project owns a given template.
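One way to maintain such a list is a lookup table from every alias to the canonical template name. The sketch below is illustrative only: the table contains just the two aliases mentioned above, not the full set of redirects, and the normalization assumes the usual English Wikipedia title rules (underscores are equivalent to spaces, and the case of the first letter is ignored).

```python
# Illustrative alias table; a real one would list every template redirect.
ALIASES = {
    "WikiProject Medicine": {"Wikiproject Medicine", "WPMED"},
}


def normalize_title(name):
    """Normalize a template name: underscores become spaces, runs of
    whitespace collapse, and the first letter is upper-cased."""
    name = " ".join(name.replace("_", " ").split())
    return name[:1].upper() + name[1:]


def build_lookup(aliases):
    """Map every normalized alias (and canonical name) to the canonical name."""
    lookup = {}
    for canonical, names in aliases.items():
        lookup[normalize_title(canonical)] = canonical
        for alias in names:
            lookup[normalize_title(alias)] = canonical
    return lookup


LOOKUP = build_lookup(ALIASES)


def resolve_template(name):
    """Return the canonical template name, or None if the name is unknown."""
    return LOOKUP.get(normalize_title(name))


print(resolve_template("WPMED"))
print(resolve_template("wikiproject_Medicine"))
print(resolve_template("WikiProject Biology"))
```

Unknown names resolve to None, so templates that are not in the table can be logged and reviewed rather than silently attributed to the wrong project.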

Secondly, some project templates also contain references to related WikiProjects and define importance with regard to these, whereas some templates reference task forces, groups that are interested in a more specific part of the WikiProject's topic. For example, the talk page of the article Arab world contains four templates from WikiProject Africa: one for the project itself, and one each for the three related projects of Western Sahara, Mauritania, and Sudan. In this case all of the ratings are consistent, but given that these are not necessarily added automatically, we would expect there to be cases where they are not. When it comes to task forces, one example is WikiProject Politics, which has a task force working on American politics. This task force adds information about the articles they are interested in to said articles' talk pages. For example, the article about campaign songs falls within the task force, and the task force's importance rating ("Mid-importance") is different from that of the overarching project ("Low-importance"). Because of these additions to the WikiProject templates, it is necessary to identify these relationships in order to correctly determine the importance rating and which project it actually belongs to.
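To keep a project's own rating apart from the ratings of its task forces or related projects, the parameters of a parsed template can be split by name. The sketch below assumes that scoped ratings use parameter names ending in "-importance" (such as the hypothetical "american-importance"); actual parameter names vary between templates and would need to be checked for each one.

```python
def split_importance_params(params):
    """Separate the project's own rating from task force or related
    project ratings.

    `params` maps template parameter names to values. Returns a tuple
    (main rating or None, {scope: rating}).
    """
    main = None
    scoped = {}
    for key, value in params.items():
        name = key.strip().lower()
        rating = value.strip().lower()
        if name == "importance":
            main = rating
        elif name.endswith("-importance"):
            # Assumed convention: "<scope>-importance" rates the article
            # for a task force or related project.
            scoped[name[: -len("-importance")]] = rating
    return main, scoped


# Hypothetical parameters mirroring the campaign song example above.
params = {"class": "C", "importance": "Low", "american-importance": "Mid"}
main, scoped = split_importance_params(params)
print(main, scoped)
```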

When it comes to utilizing the category structure for identifying articles, it is tempting to assume that the categories are consistently named. They are to some degree, but the way the project is named as part of the category is not consistent. This means that it is easy to find articles if you know the category (as mentioned above), but somewhat challenging to identify the project if you know the category. For example, articles in WikiProject China use the "China-related" phrase (e.g. "Top-importance China-related articles"), while WikiProject Medicine uses just "medicine" (e.g. "Top-importance medicine articles"). It is therefore not straightforward to go from a WikiProject name to its corresponding category, nor the other way around (as another example, en:Category:Top-importance U.S. Presidents articles uses the abbreviation "U.S." but WikiProject United States Presidents does not). Lastly, some projects have categories for combinations of quality and importance ratings (e.g. WikiProject Russia has Category:High-importance B-Class Russia articles). Because of these differences in naming conventions, some amount of post-processing is most likely required if one wishes to gather all articles and at the same time connect them to their respective WikiProjects.
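Part of this post-processing can be sketched as a parser that splits a category title into its importance rating, an optional quality class, and the project phrase. This is an illustration based only on the category names quoted above; mapping the extracted phrase (e.g. "China-related" or "U.S. Presidents") to a WikiProject still requires a manually maintained table, which is not shown.

```python
import re

# "<rating>-importance [<quality>-Class ]<project phrase> articles"
CATEGORY_RE = re.compile(
    r"^(?P<importance>[A-Za-z]+)-importance "
    r"(?:(?P<quality>\S+)-Class )?"
    r"(?P<phrase>.+) articles$"
)


def parse_importance_category(title):
    """Return (importance, quality or None, project phrase), or None if
    the title does not look like an importance category."""
    if title.startswith("Category:"):
        title = title[len("Category:"):]
    match = CATEGORY_RE.match(title)
    if match is None:
        return None
    return match.group("importance"), match.group("quality"), match.group("phrase")


examples = [
    "Category:Top-importance China-related articles",
    "Category:Top-importance medicine articles",
    "Category:Top-importance U.S. Presidents articles",
    "Category:High-importance B-Class Russia articles",
    "Category:Medicine articles by importance",
]
for title in examples:
    print(title, "->", parse_importance_category(title))
```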

In both cases it is important to note that talk page archives (pages containing archived discussions moved from the article's talk page), as well as redirects and disambiguation pages, will affect the findings. A WikiProject template might have been incorrectly moved to an archive page, meaning there might be a duplicate template on the actual talk page. It is therefore important to check whether the talk page has a corresponding article page, and if it does not, said template should be ignored. Redirects and disambiguation pages can also be incorrectly rated. Both will normally have an "NA" rating, but it is not uncommon for an article to become a redirect, for example, without someone updating the talk page template at the same time. It is therefore also important to check if the corresponding article page is a redirect or a disambiguation page, and either update the rating to NA or ignore the rating altogether.
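These checks can be expressed as a small filter over the gathered pages. The sketch below assumes each page has already been looked up and is represented by a hypothetical record with flags for whether the article exists, is a redirect, or is a disambiguation page; how those flags are obtained (database query, API call) is left open. It takes the "update the rating to NA" option for redirects and disambiguation pages.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RatedPage:
    """Hypothetical record for one rated talk page."""
    talk_title: str
    rating: str
    article_exists: bool
    is_redirect: bool = False
    is_disambiguation: bool = False


def effective_rating(page: RatedPage) -> Optional[str]:
    """Return the rating to use for a page, or None to ignore it."""
    if not page.article_exists:
        # E.g. a template that was moved to a talk page archive.
        return None
    if page.is_redirect or page.is_disambiguation:
        # The stored rating may be stale; treat the page as "not an article".
        return "NA"
    return page.rating


# Hypothetical input, for illustration only.
pages = [
    RatedPage("Talk:Waffle", "Top", article_exists=True),
    RatedPage("Talk:Waffle/Archive 1", "Top", article_exists=False),
    RatedPage("Talk:Old title", "Mid", article_exists=True, is_redirect=True),
]
results = [effective_rating(page) for page in pages]
print(results)
```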

Chosen approach

In this research project, we have chosen to only use the category structure for gathering articles. The main reason for this is that we work on a per-WikiProject basis, so we only need to know the six importance-related categories for that specific WikiProject. Secondly, a well-maintained WikiProject can be expected to follow the naming conventions. This means that if it does not follow the conventions it is likely a stale project, and we are then not as interested in its importance ratings. Lastly, the differences in naming conventions above are only important when it comes to connecting a specific article or category to a specific WikiProject. As a result, to identify all articles in the English Wikipedia that are rated Top-importance, one first identifies all talk pages in categories matching the regular expression ^Top-importance.*articles$ (or the corresponding SQL LIKE-clause "Top-importance%articles"). One then finds all the articles corresponding to those talk pages, taking care to identify archives, talk pages with missing article pages, redirects, and disambiguation pages as discussed above.
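The two matching strategies can be compared on a small list of category names. The sketch below uses Python's re module for the regular expression and an in-memory SQLite table for the LIKE clause; the table, its column name, and the last category name are made up for illustration and do not reflect the MediaWiki database schema. Note that SQLite's LIKE is case-insensitive for ASCII characters by default, which may differ from other database engines.

```python
import re
import sqlite3

TOP_RE = re.compile(r"^Top-importance.*articles$")

categories = [
    "Top-importance medicine articles",
    "Top-importance China-related articles",
    "High-importance B-Class Russia articles",
    # Hypothetical name that starts correctly but has a different ending.
    "Top-importance medicine articles needing review",
]

# Regular expression approach.
regex_matches = [name for name in categories if TOP_RE.match(name)]

# SQL LIKE approach, using a throwaway in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE category_names (name TEXT)")
conn.executemany(
    "INSERT INTO category_names (name) VALUES (?)",
    [(name,) for name in categories],
)
like_matches = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM category_names "
        "WHERE name LIKE 'Top-importance%articles' ORDER BY rowid"
    )
]
conn.close()

print(regex_matches)
print(like_matches)
```

Both approaches select the same two Top-importance categories here, and neither matches the High-importance category or the name with the extra suffix.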