Research:Knowledge Gaps Index/Datasets

The knowledge gaps project defines readership metrics, contributorship metrics, and content gap metrics. This page describes the datasets generated for each of these categories, including links to download the data and descriptions of the data formats.

Readership metrics

Contributorship metrics

Content gap metrics

The pipeline architecture is described here. The datasets are documented both as publicly available downloads and for internal usage on the data engineering infrastructure. Where applicable, a link to a notebook with example usage of the data is included.

Content Gap Metrics

The schema of the content gap metric datasets is:

  • wiki_db: enwiki, itwiki, etc
  • category: the underlying categories for each gap, for example "men", "women", "Europe", etc.
  • time_bucket: the time bucket, with monthly granularity (e.g. 2020-02)
  • metrics: the values for the aggregated metrics
  • quantiles: the 5th, 25th, 50th (median), 75th, and 95th percentiles
    • article_created: quantiles for the number of articles created (in the time bucket)
    • pageviews: quantiles for the number of pageviews
    • revision_count: quantiles for the number of edits
    • quality_score: quantiles for the average article quality score
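Assuming the Parquet files are loaded into a pandas DataFrame, a row following the schema above can be sketched as follows. The numbers are made up for illustration, not real metric values:

```python
import pandas as pd

# Illustrative rows following the documented schema; the figures are
# invented for demonstration, not real data.
rows = [
    {
        "wiki_db": "enwiki",
        "category": "women",
        "time_bucket": "2020-02",
        "metrics": {"article_created": 120, "pageviews": 50000,
                    "revision_count": 900, "quality_score": 0.41},
        # 5th, 25th, 50th, 75th, 95th percentiles
        "quantiles": {"article_created": [0, 0, 1, 2, 5]},
    },
]
df = pd.DataFrame(rows)

# Access the aggregated metric values for a given wiki/category/month.
row = df[(df.wiki_db == "enwiki") & (df.category == "women")].iloc[0]
print(row["metrics"]["pageviews"])          # 50000
print(row["quantiles"]["article_created"])  # [0, 0, 1, 2, 5]
```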

Access the data edit

The content gap metrics datasets are published for each new MediaWiki snapshot that becomes available. Note that each snapshot contains the full history, so you will most likely only need the most recent snapshot. The default file format is Apache Parquet.
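Because each snapshot contains the full history, selecting the latest one is usually all that is needed. Assuming the snapshot folders are named by month in zero-padded YYYY-MM form (e.g. "2023-01"), the names sort lexicographically, so the most recent snapshot can be picked with a plain `max()`:

```python
# Hypothetical list of snapshot folder names; in practice this would be
# obtained by listing the published snapshot directories.
snapshots = ["2022-11", "2022-12", "2023-01"]

# Zero-padded YYYY-MM sorts lexicographically, so max() is the latest.
latest = max(snapshots)
print(latest)  # 2023-01
```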

The content gap metrics are available both publicly and within the WMF data infrastructure.

  • The datasets are available for download here. Refer to this notebook for examples on how to load the data.
  • Within the WMF data infrastructure, the content gap metrics are available on hive in the content_gap_metrics database, as documented on datahub (Wikimedia SSO required). Refer to this notebook for examples on how to load the data.
Aggregation levels

The content gap metrics are available at different aggregation levels. More background on aggregation levels is available here.

  • by category: metrics for each category of each gap for each wiki (highest granularity)
  • by content gap: metrics for each content gap for each wiki (aggregated across all categories of a gap)
  • by category for all wikis: metrics for each category of each gap (aggregated across all wikis)
  • by content gap for all wikis: metrics for each content gap (aggregated across all categories of a gap, and across all wikis)
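The relationship between the aggregation levels can be sketched with pandas: starting from the by-category level, summing over categories yields the by-content-gap level, while summing over wikis yields the all-wikis levels. The frame below uses invented numbers purely for illustration:

```python
import pandas as pd

# By-category metrics (highest granularity); illustrative numbers only.
by_category = pd.DataFrame({
    "wiki_db":     ["enwiki", "enwiki", "itwiki", "itwiki"],
    "gap":         ["gender", "gender", "gender", "gender"],
    "category":    ["men", "women", "men", "women"],
    "time_bucket": ["2020-02"] * 4,
    "pageviews":   [800, 200, 80, 20],
})

# "by content gap": aggregate across all categories of each gap.
by_gap = (by_category
          .groupby(["wiki_db", "gap", "time_bucket"], as_index=False)
          ["pageviews"].sum())

# "by category for all wikis": aggregate across wikis instead.
by_category_all_wikis = (by_category
                         .groupby(["gap", "category", "time_bucket"],
                                  as_index=False)
                         ["pageviews"].sum())

print(by_gap)
print(by_category_all_wikis)
```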
CSV format

A simplified version of the metrics is also stored in the csv folder. The CSV files are only available for the most recent snapshot. The columns are:

  • wiki_db: enwiki, itwiki, etc
  • category: the underlying categories for each gap, for example "men", "women", "Europe", etc.
  • time_bucket: the time_bucket at which the metric is recorded, with monthly granularity (e.g. 2020-02)
  • [metric value columns] which contain the measurements for the following: article_count_value; article_created_value; pageviews_sum_value; pageviews_mean_value; standard_quality_value; standard_quality_count_value; quality_score_value; revision_count_value. An explanation of relevant metrics:
    • article_created_value: number of articles created for each category in the time bucket
    • pageviews_sum_value: total number of pageviews for each category in the time bucket
    • pageviews_mean_value: mean number of pageviews for each category in the time bucket
    • revision_count_value: total number of edits for each category in the time bucket
    • quality_score_value: average article quality score for each category in the time bucket
    • standard_quality_value: percentage of articles in the category that are above a standard quality threshold in the time bucket
  • [total columns] which contain the totals across all categories for the following: article_count_total; article_created_total; pageviews_sum_total; pageviews_mean_total; standard_quality_total; standard_quality_count_total; quality_score_total; revision_count_total. An explanation of relevant metrics:
    • article_created_total: number of articles created across all categories in the time bucket
    • pageviews_sum_total: total number of pageviews across all categories in the time bucket
    • pageviews_mean_total: mean number of pageviews across all categories in the time bucket
    • revision_count_total: total number of edits across all categories in the time bucket
    • quality_score_total: average article quality score across all categories in the time bucket
    • standard_quality_total: percentage of articles above a standard quality threshold, across all categories, in the time bucket

The _value columns are equivalent to the metrics in "by category" dataset, while the _total columns are equivalent to the "by content gap" dataset (i.e. the total refers to "across all categories", not "across all wikis").
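Since a `_value` column and its `_total` counterpart refer to the same wiki and time bucket, a category's share of a metric can be computed by dividing the two. A minimal sketch with invented numbers:

```python
import pandas as pd

# Simplified CSV-style rows; the figures are invented for illustration.
df = pd.DataFrame({
    "wiki_db":             ["enwiki", "enwiki"],
    "category":            ["men", "women"],
    "time_bucket":         ["2020-02", "2020-02"],
    "pageviews_sum_value": [800, 200],
    # the total is the same for every category of the gap
    "pageviews_sum_total": [1000, 1000],
})

# Share of the gap's pageviews going to each category.
df["pageviews_share"] = df["pageviews_sum_value"] / df["pageviews_sum_total"]
print(df[["category", "pageviews_share"]])
```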

Content Gap features

The content gap features dataset connects articles with information (aka features) about the various content gaps. For example, the article about Angkor Wat is labelled as being about Cambodia for the geography gap, associated with the 12th century for the time gap, and marked as illustrated with multimedia.

The schema of the dataset is documented on datahub; the features for the various content gaps are described here.

The content gap features dataset is used as input for the aggregation of the content gap metrics themselves. It is useful for

  • doing custom metrics aggregations. For an example, see this notebook computing metrics for the intersection between the gender and the geography gap.
  • creating lists of articles filtered by criteria based on content gaps. See this notebook for an example constructing a list of articles about women associated with France and the 1970s.
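A filtered article list like the one above can be sketched with pandas. The feature column names and values below (`gender`, `country`) are stand-ins for illustration; the real schema is documented on DataHub:

```python
import pandas as pd

# Hypothetical feature rows; column names and values are stand-ins,
# not the real dataset schema.
features = pd.DataFrame({
    "wiki_db":    ["enwiki", "enwiki", "enwiki"],
    "page_title": ["Angkor Wat", "Simone Veil", "Marie Curie"],
    "gender":     [None, "women", "women"],
    "country":    ["Cambodia", "France", "Poland"],
})

# List articles about women associated with France.
selection = features[(features.gender == "women")
                     & (features.country == "France")]
print(selection.page_title.tolist())  # ['Simone Veil']
```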

Metric features

In order to provide insights into content gaps over time, the knowledge gap pipeline aggregates commonly used metrics for analyzing Wikipedia content and editor activity into a metric features dataset.

  • wiki_db: enwiki, itwiki, etc
  • page_id: the page id of the article
  • time_bucket: the time bucket, with monthly granularity (e.g. 2020-02)
  • article_created: boolean of whether the article was created in the time_bucket
  • pageviews: number of pageviews in the time_bucket
  • quality_score: article quality score of the last revision to that article in the time_bucket
  • page_revision_count: number of edits in the time_bucket

The metric features are associated with a particular Wikipedia article, i.e. the Frida Kahlo article exists in 152 projects, and the above metrics are calculated for each of them individually. The metrics are calculated at a monthly granularity, which in turn enables the analysis of trends over time.
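Because every row is a (wiki, page, month) combination, a time series for one article falls out of a sort and a diff. A minimal sketch, with invented numbers, computing month-over-month pageview growth for a single page:

```python
import pandas as pd

# Metric features for one article over three months (invented numbers).
mf = pd.DataFrame({
    "wiki_db":             ["enwiki"] * 3,
    "page_id":             [1234] * 3,
    "time_bucket":         ["2020-01", "2020-02", "2020-03"],
    "pageviews":           [100, 150, 250],
    "page_revision_count": [2, 0, 5],
})

# Month-over-month pageview growth for the article.
trend = (mf.sort_values("time_bucket")
           .assign(pageview_growth=lambda d: d.pageviews.diff()))
print(trend[["time_bucket", "pageview_growth"]])
```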

See this notebook for more details on the schema and example usage.