Research:Language-Agnostic Topic Classification/Countries/Species

One known gap in Wikidata's coverage of country-level associations with items is in species data. This page collects thoughts / explorations on how to best overcome this gap in the model.

Wikidata properties

Ideally (from the modeling perspective), the countries relevant to each species (plant, animal, etc.) would be documented on Wikidata in the same way that, e.g., country of citizenship is documented for people. Two existing properties do this, endemic to (P183) and taxon range (P9714), but usage is low. Out of 3,110,443 species on Wikidata (query), only 96,714 (~3%) have either property (query): 71,641 have taxon range and 27,047 have endemic to (the two counts overlap because an item can have both).[1]
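As a rough illustration of how these two properties could be read off an item, the sketch below walks a claims dictionary in the standard Wikidata entity JSON layout and collects the item IDs that appear as values of P183 or P9714. The function name and the example data are hypothetical, and the values of these properties may be regions rather than countries, so mapping them to countries would be a separate step that is not shown.

```python
from typing import Dict, List, Set

# endemic to (P183) and taxon range (P9714), as named in the text above
RANGE_PROPERTIES = ("P183", "P9714")


def places_from_claims(claims: Dict[str, List[dict]]) -> Set[str]:
    """Return the QIDs used as values of P183 or P9714 in a claims dict."""
    found: Set[str] = set()
    for prop in RANGE_PROPERTIES:
        for statement in claims.get(prop, []):
            snak = statement.get("mainsnak", {})
            if snak.get("snaktype") != "value":
                continue  # skip "unknown value" / "no value" statements
            value = snak.get("datavalue", {}).get("value", {})
            if isinstance(value, dict) and "id" in value:
                found.add(value["id"])
    return found


# Made-up example item: a species marked as endemic to France (Q142).
example_claims = {
    "P183": [
        {
            "mainsnak": {
                "snaktype": "value",
                "property": "P183",
                "datavalue": {
                    "type": "wikibase-entityid",
                    "value": {"entity-type": "item", "id": "Q142"},
                },
            }
        }
    ]
}

print(places_from_claims(example_claims))
```

Items without either property simply yield an empty set, which is the case for the large majority of species given the coverage figures above.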

Categories tracked in Wikidata

Another direct way in which these species-country relations are tracked is via categories on Wikipedia. A common set of categories is of the form "<taxon> of <region>", such as Flora of France.[2] On Wikidata, these categories can be modeled with category combines topics (P971), which links the category item directly to its component topics, i.e. "flora" and "France" in the above example. This is slightly more complicated to leverage because it requires knowing which categories map to which countries and then checking whether any of the categories in a given Wikipedia article are in that set, but it is not so different from how wikilinks in an article might be leveraged. Usage of this property is a bit spotty: there are about 2,400 such categories, reasonably spread out across many countries. On English Wikipedia, there are 355,203 articles about individual species, and 72,664 of them (~20%) can be mapped to at least one country via a category.
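The two-step lookup described above can be sketched as follows, assuming the P971 values for each category and the set of country items have already been fetched. All function names are hypothetical, and "Q_FLORA" is a stand-in rather than a real item ID; Q142 is France.

```python
from typing import Dict, Iterable, List, Set


def build_category_map(
    category_topics: Dict[str, List[str]], country_qids: Set[str]
) -> Dict[str, Set[str]]:
    """Keep only categories whose P971 topics include at least one country."""
    mapping: Dict[str, Set[str]] = {}
    for title, topics in category_topics.items():
        countries = set(topics) & country_qids
        if countries:
            mapping[title] = countries
    return mapping


def countries_for_article(
    article_categories: Iterable[str], category_map: Dict[str, Set[str]]
) -> Set[str]:
    """Union of the countries for every mapped category on the article."""
    result: Set[str] = set()
    for title in article_categories:
        result |= category_map.get(title, set())
    return result


# Made-up inputs: category title -> values of "category combines topics".
category_topics = {
    "Category:Flora of France": ["Q_FLORA", "Q142"],
    "Category:Plants described in 1753": ["Q_FLORA"],
}
category_map = build_category_map(category_topics, country_qids={"Q142"})

article_categories = [
    "Category:Flora of France",
    "Category:Plants described in 1753",
]
print(countries_for_article(article_categories, category_map))
```

Building the map once and reusing it for every article keeps the per-article work to a handful of dictionary lookups.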

External identifiers on Wikidata

There are ~200 external identifiers in use on Wikidata for species that also refer to country-specific collections (query). For some of these, "species has an identifier in the database" clearly implies "the country associated with the database applies to that species". An example is Lepidoptera of Belgium (P5862), which tracks lepidoptera that are found in Belgium. Others do not carry this implication: the Tierstimmenarchiv ID refers to an animal sound archive that is maintained in Berlin but contains species from all over the world. Unfortunately, there is no good way to distinguish automatically between these two types of external identifiers, so these properties would be more likely to introduce noise than accurate country assignments (unless they are added individually via an allowlist).
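A minimal sketch of the allowlist approach, assuming a hand-curated mapping from external-identifier property to country. Only the P5862 entry comes from the example above (Q31 is Belgium); the function name is hypothetical, and "P0000" is a placeholder for any identifier property that is deliberately left off the list.

```python
from typing import Dict, Iterable, Set

# Hand-reviewed identifier properties whose presence implies a country.
# Lepidoptera of Belgium (P5862) -> Belgium (Q31)
IDENTIFIER_ALLOWLIST: Dict[str, str] = {
    "P5862": "Q31",
}


def countries_from_identifiers(
    item_properties: Iterable[str],
    allowlist: Dict[str, str] = IDENTIFIER_ALLOWLIST,
) -> Set[str]:
    """Map the properties present on an item to countries via the allowlist."""
    return {allowlist[prop] for prop in item_properties if prop in allowlist}


# "P0000" is a placeholder for a non-allowlisted identifier and is ignored.
print(countries_from_identifiers(["P5862", "P0000"]))
```

Any identifier that is not explicitly listed contributes nothing, so a worldwide archive hosted in one country cannot leak that country into the result.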

References

  1. There are two other relevant properties -- taxon especially protected in area (P6569) and taxon found at location (P6803) -- but these are (rare) properties present on items for places rather than on the species themselves, so they are harder to leverage in a model.
  2. This general pattern occurs outside of species and so could be useful for assigning countries to subjects in other topical areas too.