Research:Knowledge Gaps Index/Measurement
After finalizing the Taxonomy of Knowledge Gaps, which contains a structured grouping and description of all the potential gaps in Wikimedia projects, our next milestone is to provide insights and tools to help measure these gaps.
Scope
The taxonomy of knowledge gaps identified 3 macro-dimensions across which inequalities exist in Wikimedia projects: Readers, namely the set of individuals who access Wikimedia sites to consume content; Contributors, the community of editors of Wikimedia projects; and Content, i.e. the knowledge contained in Wikimedia projects.
The aim of the Knowledge Gap Measurement project is to generate a set of metrics to quantify the gaps we identified in the 3 dimensions. We want to map each gap to one or a few numbers (a "metric") reflecting the extent to which the gap is present in Wikimedia projects.
Methods
Mapping readers, contributors, and content to specific gaps.
We interviewed community members and other stakeholders to learn in more depth how they understand and frame knowledge gaps. Based on these insights, we operationalized each gap: we identified its underlying categories and developed methods to assign readers, contributors, and pieces of content to those categories. For example, we used Wikidata to associate Wikipedia biographies with the gender identity of their subject. Depending on the knowledge gap dimension, we use one of two mapping methods:
- Survey based: We design survey questions specifically tailored to categorize readers and contributors into groups that are relevant for knowledge gaps (for example, gender groups). Based on the answers to these questions, we can estimate the distribution of Readers and Contributors across different categories that are relevant to measure inequalities in Wikimedia Projects. A complete list of mappings for readers and contributors can be found here
- Observation based: We quantify knowledge gaps in Content by estimating the distribution of pieces of content (e.g., Wikipedia articles, Wikidata items) across different categories (e.g., gender, geographic distribution, cultural background). A complete list of mappings for content can be found here. More details about the research behind content measurements are in the Developing Metrics for Content Gaps (Knowledge Gaps Taxonomy) page.
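Both mapping methods end in the same kind of output: a distribution of individuals or pieces of content across the categories of a gap. The following is a minimal sketch of that last step, not the project's actual pipeline. The function name `category_distribution` and the sample labels are hypothetical and only illustrate the idea.

```python
from collections import Counter


def category_distribution(labels):
    """Return the share of items in each category (shares sum to 1)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


# Hypothetical mapping output: one category label per biography, as it
# might be derived from a Wikidata statement or from a survey answer.
biography_gender = ["female", "male", "male", "non-binary", "male"]
distribution = category_distribution(biography_gender)
print(distribution)
```

The same function applies to survey answers: passing the list of responses to a gap-specific question yields the distribution of readers or contributors across categories.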
Quantifying the gap based on a selection of relevant metrics.
We reviewed different models describing the various ways in which gaps can be measured and conducted interviews with affiliates to capture the community's interests. We obtained a set of metrics quantifying the content coverage for each category, taking into account the scientific maturity of each metric as well as project constraints. For survey-based measurements, the metric is generally a version of "distribution of answers to the gap-specific question". For content-based measurements, we aggregate mappings according to two different sets of metrics:
- Selection-Score (e.g., number of articles for each category of the gap), which reflects how much content exists for each category on a wiki.
- Extent-Score (e.g., quality of articles based on length, number of sections, number of images), which reflects "how good" the articles in each category are.
More about content metrics here
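As a rough illustration of the two content metrics, the sketch below computes a Selection-Score (article count) and an Extent-Score (average quality) per category. The quality proxy, its normalization caps, and the article fields are assumptions made for this example; the source only states that quality is based on length, number of sections, and number of images.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical normalization caps; these values are not from the source.
CAPS = {"length": 10_000, "sections": 10, "images": 5}


def quality_proxy(article):
    """Toy quality score in [0, 1]: mean of capped, normalized features."""
    return mean(min(article[key] / cap, 1.0) for key, cap in CAPS.items())


def content_gap_scores(articles):
    """Group articles by category and compute both scores per category."""
    by_category = defaultdict(list)
    for article in articles:
        by_category[article["category"]].append(quality_proxy(article))
    return {
        category: {"selection": len(scores), "extent": mean(scores)}
        for category, scores in by_category.items()
    }


# Made-up articles, already mapped to a gap category.
articles = [
    {"category": "female", "length": 5_000, "sections": 5, "images": 5},
    {"category": "male", "length": 10_000, "sections": 10, "images": 5},
    {"category": "male", "length": 2_000, "sections": 2, "images": 0},
]
scores = content_gap_scores(articles)
print(scores)
```

Keeping the two scores separate shows whether a category is under-covered in quantity, in quality, or in both.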
Results
So far, we have developed metrics for 5 content gaps and for most Readers and Contributors gaps:
- 5 out of the 11 Gaps in Content, with 2 metrics under development
- 11 out of the 12 Gaps in Readership
- 10 out of the 11 Contributorship Gaps
For readers and contributors, the unmapped gap is the "Tech Skills" gap in the "Interaction" facet. While there exist surveys to test individuals' Wikipedia Editing Skills, or more generic Internet Skills [1], more research is needed to understand the types of skills we want to test for both readers and contributors, and then implement a questionnaire accordingly.
Readership Metrics
FACET | GAP | Metric |
The Dimensions' Facet. | The Knowledge gap. e.g. Gender | How do we measure the gap? |
Representation | Gender | Distribution of survey responses to the gender question. Dataset |
 | Age | Distribution of survey responses to the age question. |
 | Geography | Distribution of pageviews and unique devices by geographic categories inferred from readers' IP addresses, and distribution of survey responses to the urban/rural question |
 | Language | Distribution of survey responses to the language questions |
 | Socio-economic Status | Distribution of survey responses to the socio-economic status questions |
 | Cultural Background | Distribution of survey responses to the cultural background questions (ethnicity and discrimination) |
 | Sexual Orientation | Distribution of survey responses to the sexual orientation question |
Interaction | Motivation | Distribution of survey responses to the motivation question |
 | Information Depth | Distribution of survey responses to the information depth question |
 | Familiarity | Distribution of survey responses to the familiarity question |
 | Tech Skills | Not yet developed |
 | Disabilities | Distribution of survey responses to the disability question |
See Readers Main Page for a complete list of metrics and their current status.
Contributorship Metrics
FACET | GAP | Metric |
The Dimensions' Facet. | The Knowledge gap. e.g. Gender | How do we measure the gap? |
Representation | Gender | Distribution of survey responses to the gender question. |
 | Age | Distribution of survey responses to the age question. |
 | Geography | Distribution of edits and (active) editors by geographic categories inferred from editors' IP addresses, and distribution of survey responses to the urban/rural question. |
 | Language | Distribution of survey responses to the language questions |
 | Socio-economic Status | Distribution of survey responses to the socio-economic status questions |
 | Cultural Background | Distribution of survey responses to the cultural background questions (ethnicity and discrimination) |
 | Sexual Orientation | Distribution of survey responses to the sexual orientation question |
Interaction | Motivation | Distribution of survey responses to the motivation question |
 | Role | Distribution of survey responses to the role questions (Experience and Role on Wiki) |
 | Disabilities | Distribution of survey responses to the disability question |
See Contributors Main Page for a complete list of gaps and their current status.
Content Gap Metrics
FACET | GAP | Metric |
The Dimensions' Facet. | The Knowledge gap. e.g. Gender | How do we measure the gap? |
Representation | Gender | Time series of content gap metrics over gender mappings |
 | Age | Time series of content gap metrics over time mappings |
 | Geography | Time series of content gap metrics over geographic mappings |
 | Language | More research is planned to measure this gap. |
 | Socio-economic Status | More research is planned to measure this gap. |
 | Cultural Background | More research is planned to measure this gap. |
 | Topics for Impact | More research is planned to measure this gap. |
 | Sexual Orientation | Time series of content gap metrics over sexual orientation mappings |
Interaction | Readability | In progress: follow our research on multilingual readability |
 | Structured Data | In progress: follow our research on Wikidata item quality |
 | Multimedia | Time series of content gap metrics over multimedia mappings |
Information about how articles are mapped to specific content gap categories can be found here; a complete list of content gap metrics and their current status can be found here; and technical background about the data pipeline architecture can be found here.
Ideas for Summarizing Metrics
The final output of the metrics generation process, for both survey-based and observation-based measurements, is an estimate of the coverage/representation of readers, contributors, or pieces of content across different categories. While the raw distribution remains the most informative output reflecting the extent of a gap, different stakeholders (C-level, affiliates, community members) will need to look at knowledge gap values at different depths.
To this end, we started putting together some ideas about how to summarize the distribution-based metrics into a few numbers reflecting the questions people might want to ask of this data.
- What is the representation of each category for this gap in this project?
Probability distribution for a gap in a language edition for a specific year
- What is the most represented category for this gap in this project?
- What is the least represented category for this gap in this project?
- How dominant is the most represented category with respect to the least represented one?
- How dominant is the most represented category with respect to the second most represented one?
- How unbalanced is the representation of different categories?
- How diverse is this project with respect to this gap?
- How are gaps evolving over time?
Cumulative distribution for a gap in a language edition over all years.
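Several of the questions above can be answered from a single category distribution. The sketch below is one possible way to do so, not an agreed-upon definition. The choice of total variation distance from the uniform distribution as the imbalance measure, and of normalized Shannon entropy as the diversity measure, are assumptions made for illustration, as are the function name `summarize` and the sample values.

```python
import math


def summarize(distribution):
    """Summarize a category distribution (shares summing to 1)."""
    if len(distribution) < 2:
        raise ValueError("need at least two categories")
    ranked = sorted(distribution.items(), key=lambda kv: kv[1], reverse=True)
    (top, p_top), (least, p_least) = ranked[0], ranked[-1]
    p_second = ranked[1][1]
    k = len(ranked)
    # Shannon entropy; zero-share categories contribute nothing.
    entropy = -sum(p * math.log(p) for _, p in ranked if p > 0)
    return {
        "most_represented": top,
        "least_represented": least,
        "top_to_least_ratio": p_top / p_least if p_least > 0 else math.inf,
        "top_to_second_ratio": p_top / p_second if p_second > 0 else math.inf,
        # Assumed imbalance measure: distance from a uniform distribution,
        # 0 when all categories are equally represented.
        "imbalance": 0.5 * sum(abs(p - 1 / k) for _, p in ranked),
        # Assumed diversity measure: entropy scaled to [0, 1].
        "diversity": entropy / math.log(k),
    }


# Made-up distributions for one gap in one project, keyed by year.
by_year = {
    2020: {"male": 0.7, "female": 0.25, "non-binary": 0.05},
    2021: {"male": 0.6, "female": 0.3, "non-binary": 0.1},
}
summary = summarize(by_year[2021])
diversity_trend = {year: summarize(d)["diversity"] for year, d in by_year.items()}
print(summary)
print(diversity_trend)
```

Computing the same summary for each year, as in `diversity_trend`, gives a simple view of how a gap evolves over time.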
Visualizations
We have generated preliminary visualizations for each of the gaps. We are now working on a set of tools to expose and visualize gaps.
See also
- Early research on measuring content gaps
- Early research on measuring the gender content gap in particular.
- [1] "The Pipeline of Online Participation Inequalities: The Case of Wikipedia Editing". Journal of Communication. 2018-02-28. doi:10.1093/joc/jqx003.