Wikilytics Dataset

This page describes which variables are available for further analysis in Wikilytics.

These variables are stored primarily in the following collection:

<languagecode><projectname>_editors_dataset
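
As an illustration of how the two placeholders combine, the sketch below builds a collection name from a language code and a project name. The helper name `collection_name` and the example values `en` and `wiki` are assumptions made for this example; they are not taken from this page.

```python
def collection_name(language_code: str, project: str,
                    suffix: str = "editors_dataset") -> str:
    """Join the language code and project name, then append the suffix."""
    return f"{language_code}{project}_{suffix}"


# Hypothetical values: language code "en" and project name "wiki".
print(collection_name("en", "wiki"))
# The same pattern applies to the other collections described on this page.
print(collection_name("en", "wiki", suffix="editors_raw"))
```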

Variables <languagecode><projectname>_editors_dataset

_id: internal MongoDB ID, not relevant for analysis
editor: the ID of the editor (String).
username: the name by which an editor is identified (String).
first_edit: the date of the first edit under this username (Date).
final_edit: the date of the last edit under this username (Date).
new_wikipedian: the date this editor became a New Wikipedian. If False, this editor is not a New Wikipedian (Date or False).
articles_edited: A dictionary by year, by month, by namespace of the article IDs that were edited by this editor.
namespaces_edited: A dictionary by year, by month, by namespace of the namespaces that were edited by this editor.
edit_count: A dictionary by year, by month, by namespace of the number of edits that were made by this editor.
article_count: A dictionary by year, by month, by namespace of the number of articles that were edited by this editor.
revert_count: A dictionary by year, by month, by namespace of the number of edits that were reverted. Note that the editor here is the person whose edit was reverted, not the person who performed the revert.
character_count: A dictionary by year, by month, by namespace, and then by 'added' and 'removed', that counts separately the number of characters added and removed by this editor.
totals: A helper dictionary that holds, for edit_count, revert_count, character_count and article_count, a quick total by year and by namespace. This is useful for quickly determining whether an editor has reached a certain threshold in a given year.
last_edit_by_year: For each year the editor was active, the date of the final edit (Date).

A missing month or year means there were 0 observations for that period.
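
The nested dictionaries can be read with a small helper that treats any missing level as 0. The sketch below is a minimal illustration on a hand-written document. It assumes that the year, month and namespace keys are stored as strings (MongoDB field names are strings) and that totals is laid out as totals[variable][year][namespace]. The function names `monthly_value` and `reached_threshold` and the sample values are hypothetical.

```python
def monthly_value(doc, field, year, month, namespace):
    """Return doc[field][year][month][namespace], or 0 if any level is missing."""
    node = doc.get(field, {})
    for key in (year, month, namespace):
        if not isinstance(node, dict):
            return 0
        node = node.get(str(key))  # assumed: keys are stored as strings
        if node is None:
            return 0
    return node


def reached_threshold(doc, year, namespace, minimum, field="edit_count"):
    """Check the yearly total for one namespace against a minimum."""
    # Assumed layout: totals[field][year][namespace].
    by_field = doc.get("totals", {}).get(field, {})
    total = by_field.get(str(year), {}).get(str(namespace), 0)
    return total >= minimum


# Hand-written sample document; all values are invented for illustration.
editor_doc = {
    "editor": "12345",
    "edit_count": {"2010": {"3": {"0": 12, "1": 4}, "5": {"0": 7}}},
    "totals": {"edit_count": {"2010": {"0": 19, "1": 4}}},
}

print(monthly_value(editor_doc, "edit_count", 2010, 3, 0))  # 12
print(monthly_value(editor_doc, "edit_count", 2010, 4, 0))  # 0 (month missing)
print(reached_threshold(editor_doc, 2010, 0, minimum=10))   # True
```

In practice the document would come from a query on the collection rather than from a literal; the lookup logic stays the same.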

Variables <languagecode><projectname>_editors_raw

_id: internal MongoDB ID, not relevant for analysis
editor: the ID of the editor (String).
username: the name by which an editor is identified (String).
edits: a dictionary that contains the edits by year. Each edit contains the following variables:

  • cur_size: The current size of the article
  • delta: The number of characters added / removed compared with the previous version of the article.
  • ns: The namespace of the edit
  • revert: Whether the edit was reverted
  • article: The ID of the article
  • date: The timestamp of the edit
  • bot: Whether the edit was made by a bot
  • hash: MD5 hash of the edit
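
As a rough sketch of how the raw edits might be aggregated, the function below summarizes one year of edits for a single editor. It assumes that edits maps each year (as a string) to a list of edit dictionaries and that revert and bot are booleans; neither detail is stated on this page. The name `summarize_year` and the sample document are hypothetical.

```python
def summarize_year(doc, year, include_bots=False):
    """Count edits, characters added/removed and reverted edits for one year."""
    # Assumed layout: doc["edits"][year] is a list of edit dictionaries.
    edits = doc.get("edits", {}).get(str(year), [])
    summary = {"edits": 0, "added": 0, "removed": 0, "reverted": 0}
    for edit in edits:
        if edit.get("bot") and not include_bots:
            continue  # skip bot edits unless explicitly requested
        summary["edits"] += 1
        delta = edit.get("delta", 0)
        if delta >= 0:
            summary["added"] += delta
        else:
            summary["removed"] += -delta
        if edit.get("revert"):
            summary["reverted"] += 1
    return summary


# Invented sample; the date and hash fields are omitted for brevity.
raw_doc = {
    "editor": "12345",
    "username": "ExampleUser",
    "edits": {
        "2010": [
            {"cur_size": 1200, "delta": 200, "ns": 0, "revert": False,
             "article": 42, "bot": False},
            {"cur_size": 1150, "delta": -50, "ns": 0, "revert": True,
             "article": 42, "bot": False},
            {"cur_size": 300, "delta": 300, "ns": 1, "revert": False,
             "article": 77, "bot": True},
        ]
    },
}

print(summarize_year(raw_doc, 2010))
```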


Variables <languagecode><projectname>_diffs_raw

These variables are collected only for articles in the User Talk, Talk, and Wikipedia Talk namespaces.

_id: internal MongoDB ID, not relevant for analysis
comment: Comment that was made when editing; can be null / None.
username: Username of the editor
title: Title of the article
timestamp: UTC date and time of the edit
diff: Subversion-style diff compared with the previous edit.
id: ID of editor (String)
rev_id: ID of revision (Integer)
article_id: ID of article (Integer)
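
One simple use of the diff variable is to count how many lines an edit added or removed. The sketch below assumes that the diff is stored as text in unified diff format (the format that `svn diff` produces by default); this page does not confirm the exact format. The name `count_changed_lines` and the sample diff are hypothetical.

```python
def count_changed_lines(diff_text):
    """Return (added, removed) line counts for a unified-style diff."""
    added = removed = 0
    for line in (diff_text or "").splitlines():
        # Skip the file header lines. In this simple sketch, a changed line
        # whose own content starts with "++" or "--" is also skipped.
        if line.startswith("+++") or line.startswith("---"):
            continue
        if line.startswith("+"):
            added += 1
        elif line.startswith("-"):
            removed += 1
    return added, removed


# Invented sample diff for a talk page edit.
sample_diff = (
    "--- previous\n"
    "+++ current\n"
    "@@ -1,2 +1,3 @@\n"
    " Hello\n"
    "-Old reply\n"
    "+New reply\n"
    "+Another line\n"
)

print(count_changed_lines(sample_diff))  # (2, 1)
```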