Research:Measuring the Gender Gap: Attribute-based Class Completeness Estimation

06:05, 1 November 2022 (UTC)
Duration:  2023-07 – 2024-06

This page is an incomplete draft of a research project.
Information is incomplete and is likely to change substantially before the project starts.

The problem. Successful crowdsourcing projects like Wikipedia and Wikidata naturally grow and evolve over time, with editors focussing on some parts of the project rather than others. While letting editors decide what to contribute to brings flexibility, it may result in biased content where, for example, one gender is better represented than others. An example of this is the number of male astronauts compared to the number of female astronauts (73 out of 574) in Wikipedia.

The solution. There are several possible approaches to addressing this issue. For example, the editor community may decide to stop adding new male astronauts to the project to allow content about female astronauts to catch up. Alternatively, the community may decide to represent the real distribution in the profession. In any case, this remains a community decision.

Research Contribution. Rather than deciding how to deal with gender-unbalanced content in Wikimedia projects, the aim of this research is to automatically identify underrepresented classes by estimating the expected size of a class, in order to empower the community to make decisions and set editorial priorities. This is possible by making use of the edit history of a Wikimedia project.



Approach. Our method estimates the completeness of a class of entities, and hence can be used to answer questions such as “Does the knowledge base have a complete list of all female astronauts?”. Our techniques are derived from species estimation and data management and are applied to the case of collaborative editing. We use the entities observed in a project’s edit history as a proxy for observations in a capture-recapture study setup. This allows us to use estimators for species population (e.g., Jackknife Estimators [1]) to predict class cardinality.
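
To illustrate the idea, the following is a minimal sketch of a first-order jackknife estimate computed over repeated observations of entities. It is not the project's implementation. It assumes that the edit history has been split into time windows and that each window is treated as one sampling occasion; the function name, the windowing, and the entity identifiers are hypothetical and chosen only for this example.

```python
from collections import Counter
from typing import Hashable, Iterable, Sequence


def jackknife1(samples: Sequence[Iterable[Hashable]]) -> float:
    """First-order jackknife estimate of the number of distinct entities.

    Each element of `samples` holds the entity IDs observed in one sampling
    occasion (assumption: one time window of the edit history).
    """
    n = len(samples)
    if n == 0:
        return 0.0
    # Count in how many samples each entity appears (not how many edits it got).
    incidence = Counter(e for sample in samples for e in set(sample))
    observed = len(incidence)
    # Entities seen in exactly one sample suggest that others are still unseen.
    singletons = sum(1 for count in incidence.values() if count == 1)
    return observed + singletons * (n - 1) / n


# Hypothetical entity IDs touched by edits in four consecutive time windows.
windows = [
    {"E1", "E2", "E3"},
    {"E2", "E3", "E4"},
    {"E3", "E5"},
    {"E2", "E3"},
]
estimate = jackknife1(windows)
observed = len(set().union(*windows))
completeness = observed / estimate  # share of the estimated class already present
print(f"observed={observed}, estimated={estimate:.2f}, completeness={completeness:.2f}")
```

In this sketch, a class in which many entities were observed only once receives a larger correction, which in turn lowers its estimated completeness.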

Generalization. This approach can also be applied to attributes with more than two values (e.g., non-binary genders, or age groups, such as counting how many astronauts in the age ranges 20-30, 30-40, and 40-50 there should be in Wikipedia) to estimate attribute-based class cardinality.
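
One way to picture attribute-based estimation is to partition the observed entities by attribute value and estimate each partition separately. The sketch below does this under stated assumptions: observations are grouped into time windows, every entity has a single known attribute value, and a first-order jackknife is used as the estimator. The helper names and the sample data are hypothetical, and the project may partition or estimate differently.

```python
from collections import Counter


def jackknife1(samples):
    """First-order jackknife: observed + singletons * (n - 1) / n."""
    n = len(samples)
    if n == 0:
        return 0.0
    incidence = Counter(e for sample in samples for e in set(sample))
    singletons = sum(1 for count in incidence.values() if count == 1)
    return len(incidence) + singletons * (n - 1) / n


def estimate_by_attribute(windows, attribute_of):
    """Return {attribute value: (observed count, estimated count)}.

    `windows` is a list of sets of entity IDs; `attribute_of` maps each
    entity ID to its attribute value (e.g. a gender or an age range).
    """
    result = {}
    for value in sorted(set(attribute_of.values())):
        # Keep every window, even if empty, so n is the same for all values.
        per_value = [
            {e for e in window if attribute_of.get(e) == value}
            for window in windows
        ]
        observed = len(set().union(*per_value))
        result[value] = (observed, jackknife1(per_value))
    return result


# Hypothetical observations and attribute values.
windows = [
    {"E1", "E2", "E3"},
    {"E2", "E3", "E4"},
    {"E3", "E5"},
    {"E2", "E3"},
]
attribute_of = {"E1": "female", "E2": "male", "E3": "male",
                "E4": "female", "E5": "male"}

result = estimate_by_attribute(windows, attribute_of)
for value, (seen, expected) in result.items():
    print(f"{value}: observed={seen}, estimated={expected:.2f}")
```

The same function works unchanged if the attribute values are age ranges or any other categorical attribute, since it only relies on the values being hashable and sortable.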

[1] Heltshe, J.F., Forrester, N.E.: Estimating species richness using the jackknife procedure. Biometrics pp. 1–11 (1983)



M3 - Data collection and preparation. Working on wiki dumps, we will collect data about entities for a few classes and attributes and prepare their edit history for experimentation.

M6 - Analysis and class cardinality estimation for gender bias. We will apply our methods to the prepared dataset.

M9 - Generalisation to other attributes. We will adapt the method to other non-binary attributes.

M12 - Publication. We will write up the work by presenting the method and experimental results in a paper to be submitted for publication.

Policy, Ethics and Human Subjects Research


The work makes use of the Wikidata edit history only, and thus will not disrupt the current work of Wikidata editors. We have previously obtained ethics approval from our institution for similar work, but we will seek fresh approval (ETA two weeks) should this proposal be funded.



To close the gap (the gender gap in Wikimedia project content, in our case) it is first critical to be able to measure it. While it is up to the editor community to make decisions on how to best balance content by prioritising editorial focus, we instead aim at empowering editors in their decisions by providing them with relevant information about the current gender balance across classes in Wikidata and categories in Wikipedia. This aligns with the 2030 Wikimedia Strategic Direction as our contribution enables the platform and the community to collect “knowledge that fully represents human diversity”.

Upon completion of the research, we plan to disseminate our findings through several channels. First, we plan to describe our research approach as well as our experimental findings (e.g., a list of classes with related gender balance information) on the relevant Wikimedia project. We also plan to disseminate our approach and results to the academic research community by means of peer-reviewed scientific publications in computer science conferences and journals with a topical focus on fairness.

Our previous research [2] has looked at how to use statistical estimators to measure class cardinality in Wikidata. We used the knowledge graph edit history as evidence for the estimators and to measure class completeness. In this project we plan to extend our approach by looking at attribute-specific cardinality estimations (e.g., How many female astronauts should there be? Do we have them all?) and beyond the single Wikidata project.



[2] Michael Luggen, Djellel Difallah, Cristina Sarasua, Gianluca Demartini, and Philippe Cudré-Mauroux. Non-Parametric Class Completeness Estimators for Collaborative Knowledge Graphs. In: The International Semantic Web Conference (ISWC 2019 - Research Track). Auckland, New Zealand, October 2019.