Research:Knowledge Gaps Index/Datasets

The knowledge gaps project defines readership metrics, contributorship metrics, and content gap metrics. This page describes the datasets generated for each of these categories, including links to download the data and descriptions of the data formats.

Readership metrics

Contributorship metrics

Content gap metrics

The pipeline architecture is described here. The datasets are documented both as publicly available downloads and for internal usage on the data engineering infrastructure. Where applicable, a link to a notebook with example usage of the data is included.

Content Gap Metrics

The schema of the content gap metric datasets is:

  • wiki_db: enwiki, itwiki, etc
  • category: the underlying categories for each gap, for example "men", "women", "Europe", etc.
  • time_bucket: the time bucket, with monthly granularity (e.g. 2020-02)
  • metrics: the values for the aggregated metrics
  • quantiles: the 5th, 25th, 50th (median), 75th, and 95th percentiles
    • article_created: quantiles for the number of articles created (in the time bucket)
    • pageviews: quantiles for the number of pageviews
    • revision_count: quantiles for the number of edits
    • quality_score: quantiles for the average article quality score
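Assuming the Parquet files are loaded into a pandas DataFrame, a row following the schema above can be sketched as follows. The numbers are made up for illustration, not real metric values:

```python
import pandas as pd

# Illustrative rows following the documented schema; the figures are
# invented for demonstration, not real data.
rows = [
    {
        "wiki_db": "enwiki",
        "category": "women",
        "time_bucket": "2020-02",
        "metrics": {"article_created": 120, "pageviews": 50000,
                    "revision_count": 900, "quality_score": 0.41},
        # 5th, 25th, 50th, 75th, 95th percentiles
        "quantiles": {"article_created": [0, 0, 1, 2, 5]},
    },
]
df = pd.DataFrame(rows)

# Access the aggregated metric values for a given wiki/category/month.
row = df[(df.wiki_db == "enwiki") & (df.category == "women")].iloc[0]
print(row["metrics"]["pageviews"])          # 50000
print(row["quantiles"]["article_created"])  # [0, 0, 1, 2, 5]
```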

Access the data edit

The content gap metrics datasets are published for each new MediaWiki snapshot that becomes available. Note that each snapshot contains the full history, so you will most likely only need the most recent snapshot. The default file format is Apache Parquet.
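Because each snapshot contains the full history, selecting the latest one is usually all that is needed. Assuming the snapshot folders are named by month in zero-padded YYYY-MM form (e.g. "2023-01"), the names sort lexicographically, so the most recent snapshot can be picked with a plain `max()`:

```python
# Hypothetical list of snapshot folder names; in practice this would be
# obtained by listing the published snapshot directories.
snapshots = ["2022-11", "2022-12", "2023-01"]

# Zero-padded YYYY-MM sorts lexicographically, so max() is the latest.
latest = max(snapshots)
print(latest)  # 2023-01
```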

The content gap metrics are available both publicly and within the WMF data infrastructure.

  • The datasets are available for download here. Refer to this notebook for examples on how to load the data.
  • Within the WMF data infrastructure, the content gap metrics are available on hive in the content_gap_metrics database, as documented on datahub (Wikimedia SSO required). Refer to this notebook for examples on how to load the data.
Aggregation levels

The content gap metrics are available at different aggregation levels. More background on aggregation levels is available here.

  • by category: metrics for each category of each gap for each wiki (highest granularity)
  • by content gap: metrics for each content gap for each wiki (aggregated across all categories of a gap)
  • by category for all wikis: metrics for each category of each gap (aggregated across all wikis)
  • by content gap for all wikis: metrics for each content gap (aggregated across all categories of a gap, and across all wikis)
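The relationship between the aggregation levels can be sketched with pandas: starting from the by-category level, summing over categories yields the by-content-gap level, while summing over wikis yields the all-wikis levels. The frame below uses invented numbers purely for illustration:

```python
import pandas as pd

# By-category metrics (highest granularity); illustrative numbers only.
by_category = pd.DataFrame({
    "wiki_db":     ["enwiki", "enwiki", "itwiki", "itwiki"],
    "gap":         ["gender", "gender", "gender", "gender"],
    "category":    ["men", "women", "men", "women"],
    "time_bucket": ["2020-02"] * 4,
    "pageviews":   [800, 200, 80, 20],
})

# "by content gap": aggregate across all categories of each gap.
by_gap = (by_category
          .groupby(["wiki_db", "gap", "time_bucket"], as_index=False)
          ["pageviews"].sum())

# "by category for all wikis": aggregate across wikis instead.
by_category_all_wikis = (by_category
                         .groupby(["gap", "category", "time_bucket"],
                                  as_index=False)
                         ["pageviews"].sum())

print(by_gap)
print(by_category_all_wikis)
```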
CSV format

A simplified version of the metrics is also stored in the csv folder. The CSV files are only available for the most recent snapshot. The columns are:

  • wiki_db: enwiki, itwiki, etc
  • category: the underlying categories for each gap, for example "men", "women", "Europe", etc.
  • time_bucket: the time_bucket at which the metric is recorded, with monthly granularity (e.g. 2020-02)
  • [metric value columns] which contain the measurements for the following: article_count_value; article_created_value; pageviews_sum_value; pageviews_mean_value; standard_quality_value; standard_quality_count_value; quality_score_value; revision_count_value. An explanation of relevant metrics:
    • article_created_value: number of articles created for each category in the time bucket
    • pageviews_sum_value: total number of pageviews for each category in the time bucket
    • pageviews_mean_value: mean number of pageviews for each category in the time bucket
    • revision_count_value: total number of edits for each category in the time bucket
    • quality_score_value: average article quality score for each category in the time bucket
    • standard_quality_value: percentage of articles in the category that are above a standard quality threshold in the time bucket
  • [total columns] which contain the totals across all categories for the following: article_count_total; article_created_total; pageviews_sum_total; pageviews_mean_total; standard_quality_total; standard_quality_count_total; quality_score_total; revision_count_total. An explanation of relevant metrics:
    • article_created_total: number of articles created across all categories in the time bucket
    • pageviews_sum_total: total number of pageviews across all categories in the time bucket
    • pageviews_mean_total: mean number of pageviews across all categories in the time bucket
    • revision_count_total: total number of edits across all categories in the time bucket
    • quality_score_total: average article quality score across all categories in the time bucket
    • standard_quality_total: percentage of articles above a standard quality threshold, across all categories, in the time bucket

The _value columns are equivalent to the metrics in "by category" dataset, while the _total columns are equivalent to the "by content gap" dataset (i.e. the total refers to "across all categories", not "across all wikis").
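Since a `_value` column and its `_total` counterpart refer to the same wiki and time bucket, a category's share of a metric can be computed by dividing the two. A minimal sketch with invented numbers:

```python
import pandas as pd

# Simplified CSV-style rows; the figures are invented for illustration.
df = pd.DataFrame({
    "wiki_db":             ["enwiki", "enwiki"],
    "category":            ["men", "women"],
    "time_bucket":         ["2020-02", "2020-02"],
    "pageviews_sum_value": [800, 200],
    # the total is the same for every category of the gap
    "pageviews_sum_total": [1000, 1000],
})

# Share of the gap's pageviews going to each category.
df["pageviews_share"] = df["pageviews_sum_value"] / df["pageviews_sum_total"]
print(df[["category", "pageviews_share"]])
```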

Content Gap features

The content gap features dataset connects articles with information (aka features) about the various content gaps. For example, the article about Angkor Wat is labelled as being about Cambodia for the geography gap, associated with the 12th century for the time gap, and marked as illustrated with multimedia.

The schema of the dataset is documented on datahub; the features for the various content gaps are described here.

The content gap features dataset is used as input for the aggregation of the content gap metrics themselves. It is useful for

  • doing custom metrics aggregations. For an example, see this notebook computing metrics for the intersection between the gender and the geography gap.
  • creating lists of articles filtered by criteria based on content gaps. See this notebook for an example constructing a list of articles about women associated with France and the 1970s.
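A filtered article list like the one above can be sketched with pandas. The feature column names and values below (`gender`, `country`) are stand-ins for illustration; the real schema is documented on DataHub:

```python
import pandas as pd

# Hypothetical feature rows; column names and values are stand-ins,
# not the real dataset schema.
features = pd.DataFrame({
    "wiki_db":    ["enwiki", "enwiki", "enwiki"],
    "page_title": ["Angkor Wat", "Simone Veil", "Marie Curie"],
    "gender":     [None, "women", "women"],
    "country":    ["Cambodia", "France", "Poland"],
})

# List articles about women associated with France.
selection = features[(features.gender == "women")
                     & (features.country == "France")]
print(selection.page_title.tolist())  # ['Simone Veil']
```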

Metric features

In order to provide insights into content gaps over time, the knowledge gap pipeline aggregates commonly used metrics for analyzing Wikipedia content and editor activity into a metric features dataset.

  • wiki_db: enwiki, itwiki, etc
  • page_id: the page id of the article
  • time_bucket: the time bucket, with monthly granularity (e.g. 2020-02)
  • article_created: boolean of whether the article was created in the time_bucket
  • pageviews: number of pageviews in the time_bucket
  • quality_score: article quality score of the last revision to that article in the time_bucket
  • page_revision_count: number of edits in the time_bucket

The metric features are associated with a particular Wikipedia article, i.e. the Frida Kahlo article exists in 152 projects, and the above metrics are calculated for each of them individually. The metrics are calculated at a monthly granularity, which in turn enables the analysis of trends over time.
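Because every row is a (wiki, page, month) combination, a time series for one article falls out of a sort and a diff. A minimal sketch, with invented numbers, computing month-over-month pageview growth for a single page:

```python
import pandas as pd

# Metric features for one article over three months (invented numbers).
mf = pd.DataFrame({
    "wiki_db":             ["enwiki"] * 3,
    "page_id":             [1234] * 3,
    "time_bucket":         ["2020-01", "2020-02", "2020-03"],
    "pageviews":           [100, 150, 250],
    "page_revision_count": [2, 0, 5],
})

# Month-over-month pageview growth for the article.
trend = (mf.sort_values("time_bucket")
           .assign(pageview_growth=lambda d: d.pageviews.diff()))
print(trend[["time_bucket", "pageview_growth"]])
```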

See this notebook for more details on the schema and example usage.