Research:Measuring the Gender Gap: Attribute-based Class Completeness Estimation

Created

06:05, 1 November 2022 (UTC)

Contact

Gianluca Demartini

The University of Queensland, Australia

Collaborators

Lei Han

The University of Queensland, Australia

Hrishi Patel

The University of Queensland, Australia

Tianwa Chen

The University of Queensland, Australia

Ivano Bongiovanni

The University of Queensland, Australia

Dhaval Vyas

The University of Queensland, Australia

Duration: 2023-07 – 2024-06


This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

The problem. Successful crowdsourcing projects like Wikipedia and Wikidata naturally grow and evolve over time, with editors focussing on certain parts of the project rather than others. While the freedom for editors to decide what to contribute to brings flexibility, it may result in biased content where, for example, one gender is better represented than others. An example is the number of female astronauts covered in Wikipedia compared to the number of male astronauts (73 out of 574; see https://en.wikipedia.org/wiki/List_of_female_astronauts and https://en.wikipedia.org/wiki/List_of_space_travelers_by_name).

The solution. There are several viable approaches to address this issue. For example, the editor community may decide to stop adding new male astronauts to the project to allow content about female astronauts to catch up. Alternatively, the community may decide to represent the real distribution in the profession. In any case, this remains a community decision.

Research Contribution. Rather than deciding how to deal with gender-unbalanced content in Wikimedia projects, the aim of this research is to automatically identify underrepresented classes by estimating the expected size of a class, in order to empower the community in making decisions and setting editorial priorities. This is possible by making use of the edit history of a Wikimedia project.

Methods

The research conducted for this project has explored three complementary directions:

Study 1. Data-driven work. In this work package we focussed on adapting statistical methods to estimate the completeness of Wikipedia for different types of persons. The aim was to estimate the level of Wikipedia coverage and completeness across genders. Our results indicate a high level of estimated completeness across different sub-classes of the class person.

Our method can estimate the completeness of a class of entities. Hence, it can be used to answer questions such as “Does the knowledge base have a complete list of all female astronauts?”. Our techniques are derived from species estimation and data management and are applied to the case of collaborative editing. We make use of entities observed in a project’s edit history as a proxy for observations in a capture/recapture study setup. This allows us to use estimators for species population (e.g., Jackknife Estimators [1]) to predict class cardinality. A full report is available here: https://doi.org/10.48550/arXiv.2401.08993 This work was also presented at the Wiki Workshop 2024.
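The capture/recapture idea can be illustrated with a minimal sketch. It assumes the edit history has already been split into time windows and that each window is reduced to the set of entity identifiers of the class that were edited in it. It then applies the standard incidence-based first-order jackknife, which adds to the observed count a correction based on entities seen in exactly one window. The function names, the windowing, and the example identifiers are hypothetical and chosen for illustration only; the estimators used in the project report may be formulated differently.

```python
from collections import Counter
from typing import Iterable, List, Set


def jackknife1_estimate(samples: Iterable[Set[str]]) -> float:
    """First-order jackknife estimate of class size from incidence data.

    Each sample is the set of entities observed in one sampling unit
    (here, hypothetically, one window of the edit history).
    """
    windows: List[Set[str]] = [set(s) for s in samples]
    m = len(windows)
    if m == 0:
        raise ValueError("at least one sample is required")
    # Count in how many windows each entity was observed.
    incidence = Counter(entity for w in windows for entity in w)
    observed = len(incidence)
    # Entities seen in exactly one window hint at members not yet seen.
    uniques = sum(1 for count in incidence.values() if count == 1)
    return observed + uniques * (m - 1) / m


def completeness(samples: Iterable[Set[str]]) -> float:
    """Ratio of observed entities to the estimated class size."""
    windows = [set(s) for s in samples]
    estimate = jackknife1_estimate(windows)
    if estimate == 0:
        return float("nan")  # nothing observed, so the ratio is undefined
    observed = len(set().union(*windows))
    return observed / estimate


# Hypothetical identifiers edited in three consecutive windows.
edit_windows = [{"Q1", "Q2", "Q3"}, {"Q2", "Q3", "Q4"}, {"Q3", "Q5"}]
print(jackknife1_estimate(edit_windows))  # 5 observed + 3 * (2 / 3) = 7.0
print(round(completeness(edit_windows), 3))
```

In this toy input, five distinct entities are observed and three of them appear in only one window, so the estimate exceeds the observed count and the completeness ratio falls below 1. When no entity is unique to a single window, the estimate equals the observed count and the class is considered complete under this estimator.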

This work has been accepted for publication at the 35th Australasian Conference on Information Systems (ACIS 2024).

Study 2. We have been interviewing Wikipedia editors who focus on gender balance to understand the strategies they use, and to inform the design of data tools that could aid their work. An analysis of the collected data is ongoing.

Study Design. We are conducting semi-structured interviews with Wikipedia editors to understand their decision process in balancing the gender gap when contributing content to Wikipedia. In total, we plan to interview 25 editors, and we have so far interviewed 9. We ask each participant the following semi-structured questions to understand their behaviours in line with the objectives of the study:

(1) Do you think about gender balance when you edit Wikipedia? Do you find gender an important dimension that you think about when editing Wikipedia? Why is it, in your opinion, important to focus on gender balance?

(2) Based on your editorial work for Wikipedia, what’s your usual process when considering gender balance while editing? Would you share an example?

(3) How do you make decisions on what content to contribute and what to focus on? And why?

(4) Could you comment on any difficulties or challenges you experienced when working towards balancing gender in Wikimedia projects? And how do you address them?

(5) What sources of information do you use when balancing gender in Wikipedia?

(6) What data and information would you like to be able to access to help you make decisions on gender balancing contributions to Wikipedia? In which format, modality, or interface would you like to access this data? For example, if you are provided with a dashboard of gender distribution data in Wikipedia, would you use it? How?

This interview protocol allows participants to reveal their decision process in balancing the gender gap, helping us understand the intentions behind the editing process as well as the difficulties or challenges they experienced.

Ethical considerations. We obtained human research ethics approval from the University of Queensland before recruiting Wikipedia editor participants. We provided all participants with the participant information sheet and consent form and asked them to provide either written or verbal consent. The target participants are all Wikipedia editors. Any Wikipedia editor can participate in this project and no screening is required.

Recruitment. We recruit interviewees from groups of Wikimedia editors recommended by other researchers, as well as through Wikipedia user talk pages, snowball sampling, and face-to-face interactions at Wikimedia-related events. The target participants do not belong to any special ethnic, age or gender groups.

Data collection. The interviews with Wikipedia editors are conducted online via Zoom using the semi-structured protocol, over a period of 14 months spanning 2023 to 2024. All data collected from the participants are anonymised and are neither identifiable nor re-identifiable. Each interview took 70 to 90 minutes. All interviews were conducted in English, audio recorded, and then transcribed.

Data analysis. To analyse the insights collected from the interviews, we plan to use NVivo 12 (see https://lumivero.com/products/nvivo/) for verbal protocol analysis. The verbal protocols are being transcribed and provided to two independent coders to reduce bias in our analysis. We will follow the procedure outlined by Gioia et al. to show the “dynamic relationships among the emergent concepts that describe or explain the phenomenon of interest and one that makes clear all relevant data-to-theory connections”. We will start the analysis with an inductive approach and then transition to a more abductive approach, to ensure that “data and existing theory are now considered in tandem”.

Study 3. We have been building a dashboard to visualise gender data distributions and statistics to support editors in their editorial choices.

This work is currently under review at the Eighteenth International Conference on Web Search and Data Mining (2024).

[1] Heltshe, J.F., Forrester, N.E.: Estimating species richness using the jackknife procedure. Biometrics pp. 1–11 (1983)

Timeline and Research Plan

High-level activities:

- Conduct interviews with a sample (e.g., 5-7) of Wikipedia editors engaged with issues of gender representation and balance to understand their decision-making processes and to gather requirements for a data tool that can support them in their editorial work;

- Refine the development of a data tool based on the findings from the study described in the previous point;

- Disseminate project outcomes within both the Wikipedia editor community (e.g., by attending Wikimania) and the academic research community (e.g., by presenting the conducted research at academic conferences).

Expected deliverables:

- An additional scientific publication about the data requirements of Wikipedia editors engaged with issues of gender representation and balance;

- A data tool for the community of Wikipedia editors to use to inform their editorial contributions.

Timeline: M3 - Data collection and preparation. Working on wiki dumps, we will collect data about entities for a few classes and attributes and prepare their edit history for experimentation.

M6 - Analysis and class cardinality estimation for gender bias. We will apply our methods to the prepared dataset.

M9 - Conducting interviews with editors. Developing the dashboard.

M12 - Publication. We will write up the work, presenting the method and experimental results in papers to be submitted for publication. We plan one paper about Study 1 and one about Study 2.

Policy, Ethics and Human Subjects Research

The work makes use of Wikidata edit history and thus will not disrupt the current work of Wikidata editors. We have obtained ethics approval from our institution for this type of analysis as well as for the interviews conducted with editors.

Vision

To close the gap (the gender gap in Wikimedia project content, in our case), it is first critical to be able to measure it. While it is up to the editor community to decide how best to balance content by prioritising editorial focus, we aim to empower editors in their decisions by providing them with relevant information about the current gender balance across classes in Wikidata and categories in Wikipedia. This aligns with the 2030 Wikimedia Strategic Direction, as our contribution enables the platform and the community to collect “knowledge that fully represents human diversity”.

Upon completion of the research, we plan to disseminate our findings through several channels. First, we plan to describe our research approach as well as our experimental findings (e.g., a list of classes with related gender balance information) on the relevant Wikimedia project. We also plan to disseminate our approach and results to the academic research community by means of peer-reviewed scientific publications in computer science conferences and journals with a topical focus on fairness.

Resources

Michael Luggen, Djellel Difallah, Cristina Sarasua, Gianluca Demartini, and Philippe Cudré-Mauroux. Non-Parametric Class Completeness Estimators for Collaborative Knowledge Graphs. In: The International Semantic Web Conference (ISWC 2019 - Research Track). Auckland, New Zealand, October 2019.

References

Research Publications that originated from this research project:

- Hrishikesh Patel, Tianwa Chen, Ivano Bongiovanni, and Gianluca Demartini (2024). Estimating Gender Completeness in Wikipedia. The 35th Australasian Conference on Information Systems (ACIS 2024).

Under review:

- Yahya Yunus, Tianwa Chen, and Gianluca Demartini. WGD: The Wikipedia Gender Dashboard. Exploring Wikipedia Gender Diversity Over Time. The Eighteenth International Conference on Web Search and Data Mining (2024).