Research talk:Automated classification of article importance/Work log/2017-03-31

Friday, March 31, 2017

Today I'll continue my work on the WPMED categorization problem.

Relationships

Yesterday I gathered data on whether articles in WPMED had Wikidata items that were an instance of, a subclass of, or a part of something else. I'm now looking for a reasonably efficient way to identify relationships between these items. The "instance of" and "subclass of" relationships should be two sides of the same coin, because something is always an instance of a class. Following "instance of" moves us from "instance" space to "class" space, so the problem becomes one of resolving those two. The question then is whether "part of" refers to a class or an instance in the cases where we cannot use either of the other two properties.

I first wrote a query to find the number of items that only have a "part of" relationship:

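# Count items whose only relationship is "part of" (P361).
# The label service is not needed, since only a count is returned.
SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {
  ?item wdt:P361 ?other .                           # has a "part of" statement
  FILTER NOT EXISTS { ?item wdt:P31 ?whatever } .   # no "instance of"
  FILTER NOT EXISTS { ?item wdt:P279 ?whatever }    # no "subclass of"
}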

There are 18,580 such items. I then add a check for whether "other" is itself an instance of, a subclass of, or a part of something else, and require that these relationships are incrementally exclusive (i.e. something counted as "subclass of" cannot also be an "instance of", as we'd otherwise prefer that). That should give us an idea of what we would be looking at when we walk up the tree:

Type          N
instance of   14,144
subclass of   2,318
part of       797
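
The "incrementally exclusive" check can be sketched as a simple precedence rule. This is only an illustration of the idea, not the query that produced the counts: the function name and the example parents are hypothetical, and it assumes that "part of" is only counted when neither of the other two properties is present.

```python
# Wikidata property IDs, listed in order of preference (highest first).
PRECEDENCE = [
    ("P31", "instance of"),
    ("P279", "subclass of"),
    ("P361", "part of"),
]


def classify_parent(properties):
    """Return the preferred relationship type for a set of property IDs, or None."""
    for pid, label in PRECEDENCE:
        if pid in properties:
            return label
    return None


# Hypothetical parent items, each described by the properties it has.
examples = {
    "parent_a": {"P31", "P279"},   # both present: "instance of" is preferred
    "parent_b": {"P279", "P361"},  # no P31: falls through to "subclass of"
    "parent_c": {"P361"},          # only "part of"
    "parent_d": set(),             # none of the three: unaccounted for
}
labels = {name: classify_parent(props) for name, props in examples.items()}
```

Each parent lands in exactly one bucket, which is what lets the three counts be summed without double counting.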

The three relationship types in the table account for 17,259 of the 18,580 items (92.9%), leaving 1,321 items unaccounted for. I think it's fair to conclude that the remaining items are ones we cannot investigate further at this time, because the nature of their relationships cannot be determined as easily.
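
As a quick check of the arithmetic, the coverage can be recomputed from the figures reported above. The variable names are only illustrative.

```python
# Figures taken from the count query and the table above.
total = 18_580
counts = {"instance of": 14_144, "subclass of": 2_318, "part of": 797}

classified = sum(counts.values())              # items with a typed parent
unaccounted = total - classified               # items with none of the three
coverage = round(100 * classified / total, 1)  # percentage of items covered

print(classified, unaccounted, coverage)
```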
