Research:Outreach evaluation

Nimish Gautam
Ayush Khanna
Duration:  2012-2 – ??

This page documents a planned research project.
Information may be incomplete and change before the project starts.

Key Personnel

  • Mani Pande
  • Nimish Gautam
  • Ayush Khanna

Project Summary

This project will evaluate the outcomes of various WMF outreach programs in terms of user contributions and participation in various projects.


We will ask for self-reported information at outreach events, collect it, and then compare the contribution activity of the accounts of users who attended those events to estimate the potential effectiveness of each event.

Survival Metric

Given that an article has revisions $r_1, r_2, \dots, r_n$ (where $r_n$ is the most current revision of the article available at the time of performing the analysis) and the revision we're interested in is $r_i$:

  • A byte is considered significant if it is non-whitespace
  • A byte is considered to have survived if it was put in by the user in revision $r_i$, and persisted to revision $r_n$
  • The set of survived significant bytes for a revision $r_i$ is then $S_i$
Survival is calculated for this given revision as  

Note on reordering text: There's a small, static "bonus" added to the number of significant bytes if any reordering of text was detected in that revision whatsoever (for instance, paragraphs being moved around). The reordered bytes aren't counted otherwise.
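The definitions above can be illustrated with a minimal Python sketch. It is not the project's actual implementation: the function names, the use of `difflib.SequenceMatcher` for diffing, and the shortcut of comparing the revision of interest directly against the latest revision are all assumptions made for illustration. It only counts survived significant bytes; it does not compute the final Survival value or the reordering bonus.

```python
from difflib import SequenceMatcher

# Byte values treated as whitespace, i.e. not "significant" (assumed set).
WHITESPACE = frozenset(b" \t\n\r\x0b\x0c")


def added_positions(parent: bytes, revision: bytes) -> set:
    """Indices in `revision` of bytes that have no match in `parent`."""
    matcher = SequenceMatcher(None, parent, revision, autojunk=False)
    added = set()
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            added.update(range(j1, j2))
    return added


def survived_significant_bytes(parent: bytes, revision: bytes, latest: bytes) -> int:
    """Count non-whitespace bytes added in `revision` that still appear in `latest`.

    `parent` is the revision just before the one of interest, and `latest`
    is the most current revision available at analysis time.
    """
    added = added_positions(parent, revision)
    matcher = SequenceMatcher(None, revision, latest, autojunk=False)
    survived = 0
    for block in matcher.get_matching_blocks():
        for k in range(block.a, block.a + block.size):
            # Keep only bytes that were added by this edit and are significant.
            if k in added and revision[k] not in WHITESPACE:
                survived += 1
    return survived


if __name__ == "__main__":
    parent = b"The cat sat."
    revision = b"The black cat sat."
    latest = b"A black cat sat down."
    print(survived_significant_bytes(parent, revision, latest))
```

In this toy example the user added the word "black", and it is still present in the latest revision, so its five non-whitespace bytes count as survived. If the latest revision were identical to the parent, none of the added bytes would count.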


We want to measure the number of bytes a user has added or changed in a given set of revision differences, and to see whether those changes persisted, as an approximation of the community's judgement that the added information is of high quality. Although persistence is not always an accurate measure of quality, an edit is more likely to be of high quality if it has survived 1000 revisions than if it has survived only 1.


  • The ratio of survived significant bytes to edit count can aid in identifying users whose editing patterns consist of high-content, highly survivable edits
  • The ratio of Survival to edit count can aid in identifying users with high-content, highly survivable edits with consistency over time.
  • The ratio of survived significant bytes to bytes added can aid in identifying users who produce highly survivable edits in general.

  • Ranges : still TBD
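As a rough illustration of the three ratios listed above, the following Python sketch computes them from per-user totals. The `UserStats` container, its field names, and the choice to return `None` for an undefined ratio are hypothetical and not taken from the project's code; the `survival` field is assumed to hold a Survival total that has already been computed elsewhere.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserStats:
    """Per-user totals (hypothetical field names)."""
    edit_count: int
    bytes_added: int
    survived_significant_bytes: int
    survival: float  # Survival summed over the user's revisions


def safe_ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def survival_ratios(stats: UserStats) -> dict:
    """Compute the three ratios described in the list above."""
    return {
        "survived_bytes_per_edit": safe_ratio(
            stats.survived_significant_bytes, stats.edit_count),
        "survival_per_edit": safe_ratio(stats.survival, stats.edit_count),
        "survived_bytes_per_byte_added": safe_ratio(
            stats.survived_significant_bytes, stats.bytes_added),
    }


if __name__ == "__main__":
    user = UserStats(edit_count=10, bytes_added=2000,
                     survived_significant_bytes=500, survival=40.0)
    print(survival_ratios(user))
```

Returning `None` rather than zero keeps users with no edits or no added bytes distinguishable from users whose edits did not survive.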

Known shortcomings

  • Edits to articles, or sections of articles, whose content is time-sensitive, such as a sports score. If a user enters a score of 40, and soon afterwards the team scores 15 more points and the article now says 55, the bytes entered by the user will be counted as not having survived. This is a poor approximation of quality, because the original edit was of high quality.
  • Reversions of vandalism. A revert restores text that the reverting user did not write, so the metric will count an unfairly large number of bytes as having survived for that edit.
    • Note: there are numerous methods to detect vandalism reversion, and in the code implementation there is room for use of these heuristics if they are needed
  • Collaborative editing sessions. This can be remedied by looking at a group of collaborative editors as one unit.
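One commonly used heuristic for detecting reverts is the "identity revert": a revision whose full text is byte-for-byte identical to an earlier revision of the same article. The Python sketch below shows that idea only; it is not necessarily the heuristic used in the project's code, and the function name and the `window` parameter are assumptions made for illustration.

```python
import hashlib


def find_identity_reverts(revision_texts, window=15):
    """Return (reverting_index, restored_index) pairs.

    A revision is flagged when its text exactly matches one of the
    preceding `window` revisions, excluding its immediate parent
    (an exact match with the parent is a null edit, not a revert).
    """
    digests = [hashlib.sha1(text.encode("utf-8")).hexdigest()
               for text in revision_texts]
    reverts = []
    for i, digest in enumerate(digests):
        start = max(0, i - window)
        # Walk backwards from the grandparent to the edge of the window.
        for j in range(i - 2, start - 1, -1):
            if digests[j] == digest:
                reverts.append((i, j))
                break
    return reverts


if __name__ == "__main__":
    history = ["A", "A B", "VANDALISM", "A B", "A B C"]
    print(find_identity_reverts(history))
```

Revisions flagged this way could then be excluded from the survival calculation, or have their restored bytes attributed to the original authors rather than to the reverting user.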


Code that performs this analysis is available under the GPL in the Wikimedia SVN repository.


All findings will be publicly available on a WMF wiki.

Wikimedia Policies, Ethics, and Human Subjects Protection

Benefits for the Wikimedia community

The community and the Foundation will be able to better gauge and adopt effective outreach practices.





External links