Research:Assessing gaps in structured data

Tracked in Phabricator: Task T321224
Duration: October 2023 – ??

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


One of the core content gaps within the Knowledge Gaps Taxonomy is structured data. This gap captures the role of alternative forms of content beyond just text -- akin to multimedia -- and encapsulates content such as Wikidata items, infoboxes and other templated content, and many forms of annotation such as categories or depicts statements. Given the diversity of forms this (semi-)structured data can take, this project initially focuses on assessing the completeness of Wikidata items as a major component with relevance to many Wikimedia projects.

Background

While the Wikidata community cares greatly about quality, editors do not currently assess items and annotate them with their perceived completeness or quality.[1] The closest thing to community annotation of Wikidata item quality is via schemas, but these unfortunately do not have high coverage or consistency at this stage. This lack of data makes it difficult to train models for the task. Previous research generated a rubric by which quality might be assessed, along with several thousand annotations and an accompanying ORES model.

There are three existing tools that we draw inspiration from: ORES itemquality, Recoin, and PropertySuggester.

ORES itemquality has the same goal as this project -- assessing the quality of a Wikidata item. It makes its assessment based on a number of features engineered to capture references, labels, and statements. In practice, its outputs track very closely with the total number of statements on a given item, but it performs very well on the annotated data mentioned above.[2] From this model, we borrow the focus on references, labels, and statements as well as some of the nuance in, e.g., separating external identifiers from standard properties. The model hard-codes some exceptions -- e.g., astronomical objects and humans -- but otherwise does not differentiate between different types of items or have a clear pathway for adapting as Wikidata changes.
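To make the feature-based approach concrete, here is a minimal sketch of extracting itemquality-style signals from a Wikidata item's JSON. The feature names and weighting are illustrative assumptions, not the actual ORES implementation; only the JSON structure and the "external-id" datatype are as Wikidata defines them:

```python
# Sketch: itemquality-style feature extraction from Wikidata item JSON.
# Feature names are illustrative; the real model's feature set differs.

EXTERNAL_ID = "external-id"  # datatype Wikidata assigns to external identifiers

def extract_features(item: dict) -> dict:
    statements = [s for stmts in item.get("claims", {}).values() for s in stmts]
    referenced = [s for s in statements if s.get("references")]
    external_ids = [
        s for s in statements
        if s.get("mainsnak", {}).get("datatype") == EXTERNAL_ID
    ]
    return {
        "num_labels": len(item.get("labels", {})),
        "num_descriptions": len(item.get("descriptions", {})),
        "num_statements": len(statements),      # dominates scores in practice
        "num_referenced": len(referenced),
        "num_external_ids": len(external_ids),  # counted apart from standard properties
        "num_sitelinks": len(item.get("sitelinks", {})),
    }
```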

Recoin is also focused on assessing item completeness but uses a different, more item-specific strategy: it uses statistics on the co-occurrence of different properties across items (as grouped by instance-of and occupation properties) to assess how "complete" an item is. This strategy is more flexible and adaptive to the changing state of Wikidata. Recoin, however, does not incorporate any information about labels or references.
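A rough sketch of the co-occurrence idea in Python. The grouping (instance-of only) and the scoring function are simplifications for illustration, not Recoin's actual relatedness computation, and the item data structure is assumed:

```python
from collections import Counter, defaultdict

def property_frequencies(items):
    """For each class (instance-of value), the fraction of its items that
    carry each property. `items` are assumed to look like
    {"instance_of": ["Q5"], "properties": {"P31", "P569"}}."""
    class_sizes = Counter()
    prop_counts = defaultdict(Counter)
    for item in items:
        for cls in item["instance_of"]:
            class_sizes[cls] += 1
            prop_counts[cls].update(item["properties"])
    return {cls: {p: n / class_sizes[cls] for p, n in counts.items()}
            for cls, counts in prop_counts.items()}

def completeness(item, freqs):
    """1 minus the mean peer-frequency of the properties the item lacks:
    the more common its absent properties are among peers, the lower the score."""
    expected = Counter()
    for cls in item["instance_of"]:
        for p, f in freqs.get(cls, {}).items():
            expected[p] = max(expected[p], f)
    missing = [f for p, f in expected.items() if p not in item["properties"]]
    return 1.0 if not missing else 1.0 - sum(missing) / len(missing)
```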

PropertySuggester is quite similar to Recoin but is focused on recommending properties to add. An upcoming version[3] uses more nuanced data-mining rules that also take existing properties (beyond just instance-of) into account when recommending additions.
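For illustration, a crude stand-in for this style of recommendation: rank absent properties by how strongly they co-occur with the properties an item already has. SchemaTree itself uses a trie-based maximum-likelihood computation over full property sets; the `cooccur` table here is an assumed input, not its actual data structure:

```python
from collections import Counter

def recommend(item_props, cooccur, k=10):
    """cooccur[p][q]: observed probability that q appears on items having p."""
    scores = Counter()
    for p in item_props:
        for cand, prob in cooccur.get(p, {}).items():
            if cand not in item_props:
                scores[cand] += prob
    return [prop for prop, _ in scores.most_common(k)]

# e.g. recommend({"P31", "P569"}, {"P31": {"P21": 0.7}, "P569": {"P570": 0.4}})
# -> ["P21", "P570"]
```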

Methods

We take the rubric and scope of the ORES itemquality model and combine it with the more flexible expected-property approach taken by Recoin and PropertySuggester. We further apply the expected-property approach to references, incorporating ideas from Amaral et al.,[4] and make some other small tweaks to the model.
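A hedged sketch of how such a combined score might be structured: per-dimension completeness as the fraction of expected labels/references/properties that are present, then a weighted combination. The weights, field names, and sources of the expectation sets are placeholders, not the model's actual parameters:

```python
def dimension_completeness(present, expected):
    """Fraction of expected elements present; vacuously complete if nothing is expected."""
    expected = set(expected)
    return 1.0 if not expected else len(set(present) & expected) / len(expected)

def quality_score(item, expectations, weights=(0.25, 0.35, 0.40)):
    """Weighted mix of label, reference, and property completeness.
    Expectations might come from, e.g., co-occurrence statistics (properties),
    peer items (labels), and reference-need heuristics (references)."""
    w_lab, w_ref, w_prop = weights
    return (w_lab * dimension_completeness(item["labels"], expectations["labels"])
            + w_ref * dimension_completeness(item["referenced_props"],
                                             expectations["referenced_props"])
            + w_prop * dimension_completeness(item["properties"],
                                              expectations["properties"]))
```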

When we assessed our model on the annotated data from ORES,[5] the presence of a feature for the number of statements was far more predictive than the features for label, reference, and property completeness -- i.e., the proportion of the labels/references/properties we expect for the item that are actually present. This predictive power is misleading, though: items for Wikimedia disambiguation pages, for example, really only need an instance-of statement and, rarely, properties for different from or partially coincident with. A strategy that relies on the number of statements, however, will always assess these items as low-quality even though there is realistically nothing to be done to improve them. As such, we have removed the claim-quantity feature and are currently developing a new evaluation to test whether this change is effective.
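A toy contrast makes the disambiguation-page argument concrete (hypothetical numbers, reusing the `dimension_completeness` idea from the sketch above):

```python
disambig_props = {"P31"}         # instance of: Wikimedia disambiguation page
expected_props = {"P31"}         # peers of Q4167410 rarely carry anything else

# Expected-property completeness: 1 / 1 = 1.0 -- nothing left to add.
print(len(disambig_props & expected_props) / len(expected_props))

# A raw statement-count feature would instead place this one-statement item
# near the bottom of Wikidata, despite there being nothing to improve.
```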

References

  1. Note that "completeness" and "quality" are separate ideas but often conflated, including by me here.
  2. https://public-paws.wmcloud.org/User:Isaac_(WMF)/Annotation%20Gap/v2_eval_wikidata_quality_model.ipynb#Comparison-to-ORES
  3. Gleim, Lars C.; Schimassek, Rafael; Hüser, Dominik; Peters, Maximilian; Krämer, Christoph; Cochez, Michael; Decker, Stefan (2020). "SchemaTree: Maximum-Likelihood Property Recommendation for Wikidata". In Harth, Andreas; Kirrane, Sabrina; Ngonga Ngomo, Axel-Cyrille; Paulheim, Heiko; Rula, Anisa; Gentile, Anna Lisa; Haase, Peter; Cochez, Michael (eds.). The Semantic Web. Lecture Notes in Computer Science. Cham: Springer International Publishing. pp. 179–195. ISBN 978-3-030-49461-2. PMC 7250627. doi:10.1007/978-3-030-49461-2_11.
  4. Amaral, Gabriel; Piscopo, Alessandro; Kaffee, Lucie-Aimée; Rodrigues, Odinaldo; Simperl, Elena (2021-10-15). "Assessing the Quality of Sources in Wikidata Across Languages: A Hybrid Approach". Journal of Data and Information Quality 13 (4): 23:1–23:35. ISSN 1936-1955. doi:10.1145/3484828.
  5. https://public-paws.wmcloud.org/User:Isaac_(WMF)/Annotation%20Gap/v2_eval_wikidata_quality_model.ipynb