Research:Wikipedia Edit Types

Tracked in Phabricator:
Task T293465
Duration:  2021-07 – ??

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

This project seeks to reboot past work on the automated classification of edit diffs, namely that of Halfaker and Taraborelli, to identify a basic taxonomy of edit types and a set of language-agnostic detectors for each edit type, so that they can be used to analyze edits on Wikipedia.

Edit Diffs and Detectors

Main article: Python Package

The initial phase of the project focused on the technical implementation of processing Wikipedia diffs and mapping changes to basic edit types. The resulting Python library (mwedittypes) can identify insertions, removals, changes, and moves to the following types of nodes: tables, references, lists, formatting, categories, media, wikilinks, external links, templates, headings, comments, whitespace, punctuation, words (or characters), sentences, paragraphs, and sections.
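To make the idea of mapping a diff to edit types concrete, here is a minimal sketch that uses only the Python standard library. It is not the mwedittypes implementation or its API: the regular expressions, the `NODE_PATTERNS` table, and the function names `extract_nodes` and `toy_edit_types` are all assumptions made for illustration, and the sketch only covers insertions and removals of four node types.

```python
import re
from collections import Counter

# Hypothetical, heavily simplified patterns for a few node types.
# Real wikitext needs a proper parser; these are for illustration only.
NODE_PATTERNS = {
    "Wikilink": re.compile(r"\[\[(?!Category:|File:)[^\[\]]+\]\]"),
    "Category": re.compile(r"\[\[Category:[^\[\]]+\]\]"),
    "Reference": re.compile(r"<ref[^>/]*>.*?</ref>", re.DOTALL),
    "Heading": re.compile(r"^=+[^=\n]+=+[ \t]*$", re.MULTILINE),
}


def extract_nodes(wikitext):
    """Count every occurrence of each node type in one revision."""
    return {name: Counter(pattern.findall(wikitext))
            for name, pattern in NODE_PATTERNS.items()}


def toy_edit_types(prev_wikitext, curr_wikitext):
    """Compare two revisions and report inserted/removed nodes per type."""
    prev_nodes = extract_nodes(prev_wikitext)
    curr_nodes = extract_nodes(curr_wikitext)
    summary = {}
    for name in NODE_PATTERNS:
        # Counter subtraction keeps only positive counts, i.e. the nodes
        # that exist on one side of the diff but not the other.
        inserted = sum((curr_nodes[name] - prev_nodes[name]).values())
        removed = sum((prev_nodes[name] - curr_nodes[name]).values())
        counts = {}
        if inserted:
            counts["insert"] = inserted
        if removed:
            counts["remove"] = removed
        if counts:
            summary[name] = counts
    return summary


prev = "== History ==\nParis is in [[France]].\n[[Category:Cities]]"
curr = ("== History ==\nParis is the capital of [[France]].<ref>Atlas</ref>\n"
        "See [[Europe]].\n[[Category:Cities]]")
print(toy_edit_types(prev, curr))
```

Because the sketch compares nodes by exact text, a modified link shows up as one removal plus one insertion. Distinguishing changes and moves from insert/remove pairs, and handling text-level nodes such as words and sentences, requires the more careful diffing that the library itself performs.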

Edit Categories

Main article: Content Maintenance

The second phase of the project focuses on taking the core edit types and mapping them to higher-order categories of edits. For instance, this might involve identifying combinations of edit types that differentiate between edits that generate content and those that maintain or annotate existing content.
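As a rough illustration of what such a mapping could look like, the sketch below assigns coarse labels to an edit based on which node types it touched. The grouping of node types, the label names, the input shape (`{node_type: {action: count}}`), and the function `categorize_edit` are all hypothetical; the project's actual categories may be defined quite differently.

```python
# Hypothetical groupings of node types; not the project's actual taxonomy.
CONTENT_NODES = {"Word", "Sentence", "Paragraph", "Section"}
ANNOTATION_NODES = {"Reference", "Category", "Wikilink", "Template", "Media"}


def categorize_edit(edit_types):
    """Map per-node edit counts to coarse, illustrative category labels.

    edit_types is assumed to look like {"Sentence": {"insert": 2}, ...}.
    """
    labels = set()
    for node, actions in edit_types.items():
        if node in CONTENT_NODES and actions.get("insert", 0) > 0:
            # New text was added, so treat the edit as generating content.
            labels.add("content generation")
        elif node in ANNOTATION_NODES:
            # Links, references, and similar nodes annotate existing content.
            labels.add("annotation")
        else:
            # Everything else (e.g. whitespace or punctuation fixes).
            labels.add("maintenance")
    return sorted(labels)


print(categorize_edit({"Sentence": {"insert": 2}, "Reference": {"insert": 1}}))
print(categorize_edit({"Whitespace": {"change": 3}}))
```

A single edit can receive several labels, which reflects the fact that one revision often mixes new content with maintenance work.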

Edit Summaries

Main article: Edit Summaries

The third phase of the project examines how edit types might be used to help improve edit summaries. It focuses on the hard case of auto-generating edit summary recommendations for edits that changed textual content on English Wikipedia.

Use Cases

The mwedittypes library supports a range of use cases, some of which are listed below: