Research:Wikipedia Edit Types/Python package

Tracked in Phabricator:
Task T293465
18:33, 15 October 2021 (UTC)
Jesse Amamgbu
no affiliation
Duration:  2021-07 – 2022-06
This page documents a completed research project.

This sub-project covers the technical implementation of the initial edit-types Python package, which identifies what information a revision changes on Wikipedia. The package will continue to evolve as needs change, but this initial implementation generates wikitext-based diffs for edits to main namespace (0) articles on Wikipedia and identifies the specific nodes that were changed (e.g., Templates, References, Words), as opposed to broader categories such as Content Generation vs. Content Maintenance.

Edit Types


There are two main approaches to building a taxonomy of edit types:

  • Edit actions -- i.e. the "what", or all of the types of changes you might make to wikitext. These are easier to define / detect, more "basic" / "atomic", and can be thought of as a straightforward way of turning an edit diff into a set of structured features. They are very useful for making edit recommendations / building ML models, but it can be hard to interpret the "why" behind edit actions for analysis purposes. This is typified by the Structured Tasks.
  • Edit intentions / semantics -- i.e. the "why" or all the goals you might have in making an edit. These are more amorphous, often composed of different edit actions, and tell the story of what a given editor is seeking to do with an edit. They are less useful for recommenders / modeling but more useful for summary analysis / computational social science. This is typified by Yang et al.[1]

This work begins with the edit actions component. We have mainly defined edit actions based on the available wikitext syntax. This is both for practical reasons -- it makes detecting the different types far more straightforward, as there are existing, well-maintained Python-based parsers (mwparserfromhell) -- and because the wikitext syntax often determines the ease of making an edit and captures interesting cross-action dynamics, such as whether text was added without a citation or tables without any links.

Edit Types Taxonomy


Below are current edit types (presented in a hierarchy that provides some insight into how they are detected):

  • Wikitext Nodes:
    • Tags:
      • Tables
      • References
      • List items
      • Formatting (e.g., bold/italics)
    • Links:
      • Categories (namespace 14)
      • Media (namespace 6)
      • Wikilinks (namespace 0 and in practice all other namespaces)
    • External links
    • Templates
    • Headings (sections)
    • Comments
  • Text:
  • Contextual Information:
    • Sentences
    • Paragraphs
    • Sections
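The split of links into Categories, Media, and Wikilinks can be illustrated with a minimal sketch. It assumes English-language namespace prefixes only ("Category" for namespace 14, "File" and its alias "Image" for namespace 6). The function name classify_link and the prefix tuples are hypothetical and are not taken from the package.

```python
# Hypothetical helper: prefixes are the English namespace names only.
CATEGORY_PREFIXES = ("category:",)        # namespace 14
MEDIA_PREFIXES = ("file:", "image:")      # namespace 6 and its "Image" alias


def classify_link(target: str) -> str:
    """Return 'Category', 'Media', or 'Wikilink' for a wikilink target."""
    normalized = target.strip().lower()
    # A leading colon (e.g. [[:Category:Foo]]) links to the page rather than
    # categorizing or embedding, so it is treated as an ordinary wikilink.
    if normalized.startswith(":"):
        return "Wikilink"
    if normalized.startswith(CATEGORY_PREFIXES):
        return "Category"
    if normalized.startswith(MEDIA_PREFIXES):
        return "Media"
    # Namespace 0 and, in practice, every other namespace.
    return "Wikilink"
```

A multilingual version would need the localized namespace names and aliases for each wiki, which this sketch does not attempt.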

Each edit type has four potential associated actions: insert, remove, change, and move. The counts of each edit type and action pair are summed across the whole diff. Most types have clear boundaries, but text is aggregated by section: several changes to the text within a single section are counted just once, whereas text changes across multiple sections are recorded independently.
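The counting rule described above can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: the input format (a list of (edit_type, action, section_index) tuples) and the name summarize are hypothetical, and text changes are assumed to be deduplicated per action within a section.

```python
from collections import Counter

ACTIONS = ("insert", "remove", "change", "move")


def summarize(changes):
    """Count (edit_type, action) pairs over a diff.

    `changes` is an iterable of (edit_type, action, section_index) tuples
    (a hypothetical input format). Text edits are counted at most once per
    action within a section; every other edit type is counted individually.
    """
    counts = Counter()
    text_seen = set()  # (action, section_index) pairs already counted for Text
    for edit_type, action, section in changes:
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action!r}")
        if edit_type == "Text":
            if (action, section) in text_seen:
                continue
            text_seen.add((action, section))
        counts[(edit_type, action)] += 1
    return dict(counts)


example = [
    ("Text", "change", 0),
    ("Text", "change", 0),       # same section: not counted again
    ("Text", "change", 2),       # different section: counted separately
    ("Reference", "insert", 1),
    ("Reference", "insert", 1),  # non-text types are always counted
]
summary = summarize(example)
```

Here summary records two Text changes (one per affected section) and two Reference inserts.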

Edit Diffs and Detectors


Computing textual diffs is a long-standing challenge and a central feature of the wikis -- i.e. wikitext diffs (and more recently visual diffs) underlie the ability of editors to efficiently patrol edits to articles. We do not directly reuse these technologies for two reasons: 1) our goal is to support large-scale analyses, which generally means that our implementation must exist in Python (which can easily be applied to the Data Lake via PySpark UDFs), and 2) the goal of on-wiki diffs is to provide a visually coherent, human-interpretable explanation of changes, whereas our goal is to provide a structured, machine-interpretable description of what has changed. That said, our work most closely matches (and draws much inspiration and code from) the Visual Editor diffs, which also contain some semantic explanations of the changes occurring in a diff.

The diffing and detection process can be split into three stages:

  • Tree diffing: this is the high-level determination of what changed and where in an article -- e.g., a template was changed. It is the first stage in the diffing process and is particularly helpful for detecting moves and bringing more structure to the diff. The outputs are then passed on to the node differ (explained below) for further processing.
  • Node diffing: this is the specific determination of what happened -- e.g., a parameter was added to that template. This stage is also where we do some more fine-grained disambiguation of what was changed -- e.g., whether a tag is a reference, table, list, etc.
  • Counting: this is the summary of what happened based on all the changes. While this sounds simple, it's actually one of the harder parts because it depends on a clear idea of how to interpret changes in wikitext. For example, how should one count a reference that was added within a template? Just a template edit? Or that plus a reference edit? Or just a reference edit if the template syntax wasn't altered otherwise?
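The three stages above can be sketched in a heavily simplified form. This is not the package's algorithm: it assumes each revision has already been parsed into a flat list of (node_type, text) tuples, and it substitutes the standard library's difflib.SequenceMatcher for a real tree differ. All function names are hypothetical. Because the node list is flat, the sketch sidesteps the nesting question raised under Counting (e.g., a reference inside a template).

```python
import re
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical mapping from HTML-like tag names to edit types.
TAG_TYPES = {"ref": "Reference", "table": "Table", "li": "List item"}


def tree_diff(prev_nodes, curr_nodes):
    """Stage 1 (simplified): locate removed, inserted, and changed nodes."""
    removed, inserted, changed = [], [], []
    matcher = SequenceMatcher(a=prev_nodes, b=curr_nodes, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        old, new = prev_nodes[i1:i2], curr_nodes[j1:j2]
        # Pair nodes positionally; same-typed pairs count as an in-place change.
        for o, n in zip(old, new):
            if o[0] == n[0]:
                changed.append((o, n))
            else:
                removed.append(o)
                inserted.append(n)
        removed.extend(old[len(new):])
        inserted.extend(new[len(old):])
    return removed, inserted, changed


def refine(node):
    """Disambiguate generic tags into references, tables, list items, etc."""
    node_type, text = node
    if node_type == "Tag":
        match = re.match(r"<\s*(\w+)", text)
        if match:
            return TAG_TYPES.get(match.group(1).lower(), "Tag")
    return node_type


def node_diff(removed, inserted, changed):
    """Stage 2 (simplified): decide the action for each affected node."""
    edits = []
    remaining = list(inserted)
    for node in removed:
        if node in remaining:  # removed here, re-inserted verbatim elsewhere
            remaining.remove(node)
            edits.append((refine(node), "move"))
        else:
            edits.append((refine(node), "remove"))
    edits.extend((refine(node), "insert") for node in remaining)
    edits.extend((refine(new), "change") for _old, new in changed)
    return edits


def edit_types(prev_nodes, curr_nodes):
    """Stage 3: count (edit_type, action) pairs across the whole diff."""
    return dict(Counter(node_diff(*tree_diff(prev_nodes, curr_nodes))))


heading = ("Heading", "== History ==")
text = ("Text", "Founded in 1990.")
prev = [heading, text, ("Template", "{{Infobox|name=A}}")]
curr = [heading, text, ("Template", "{{Infobox|name=A|year=1990}}"),
        ("Tag", "<ref>Smith 2020</ref>")]
result = edit_types(prev, curr)
```

In this example, result contains one Template change and one Reference insert. A node that disappears from one position and reappears verbatim in another is reported as a single move rather than a remove plus an insert.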





Background on approaches to diffing can be found here: PAWS:Edit Types Diffs.ipynb



The library can be viewed here:

The current state of the detectors can be explored through this interface:

You can also see an example deployment of the library on the cluster here:

See Also



  1. Yang, Diyi; Halfaker, Aaron; Kraut, Robert; Hovy, Eduard (2017). "Identifying Semantic Edit Intentions from Revisions in Wikipedia" (PDF). pp. 2000-2010. doi:10.18653/v1/D17-1213. Retrieved 15 October 2021.