Research:Wikipedia Edit Types
This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.
This project seeks to reboot past work on automated identification of edit types, namely that of Halfaker and Taraborelli. The goal is to identify a basic taxonomy of edit actions (mainly the syntactic types here) and a set of language-agnostic detectors for each edit action so that they can be used to analyze edits on Wikipedia.
This project's initial scope is edits to main namespace (0) articles on Wikipedia. While some of the actions and associated detectors will be applicable anywhere that uses wikitext syntax, others will not. The project focuses on actions that are identifiable through a single edit diff alone. Anything that requires information beyond the diff is not considered: for example, whether an edit is vandalism (edit comments / tags / richer revision history), actions such as page moves / protections (logs), or whether the edit was generated via ContentTranslation (edit tags).
This project has five main phases:
- Determine the edit type taxonomy to be implemented
- Design an approach to generating edit diffs (in Python) that can support detection of the edit types
- Design detectors for each edit type
- Evaluate detectors on a more robust sample of data
- Apply detectors to study impact of campaigns or other tooling, better understand newcomers, etc.
There are two main approaches to building a taxonomy of edit types:
- Edit actions -- i.e. the "what" or all of the types of changes you might make to wikitext. These are easier to define / detect, more "basic" / "atomic", and can be thought of as a straightforward way of turning an edit diff into a set of structured features. They are very useful for making edit recommendations / building ML models, but for analysis purposes it can be hard to interpret the "why" behind them. This is typified by the Structured Tasks.
- Edit intentions / semantics -- i.e. the "why" or all the goals you might have in making an edit. These are more amorphous, often composed of different edit actions, and tell the story of what a given editor is seeking to do with an edit. They are less useful for recommenders / modeling but more useful for summary analysis / computational social science. This is typified by Yang et al.
This work will begin with the edit actions component. We have mainly defined edit actions based on the various wikitext syntax available. This is partly for practical reasons: it makes detection of the different types far more straightforward, as there are existing, well-maintained Python-based parsers (mwparserfromhell). It is also because the wikitext syntax often determines the ease of making an edit and captures interesting cross-action dynamics, such as whether text was added without a citation or tables without any links.
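As a rough illustration of what syntax-based detection can look like, the sketch below counts a few wikitext node types before and after an edit and reports the difference. It is not the project's implementation: it uses standard-library regular expressions as a simplified stand-in for a real parser such as mwparserfromhell, and the names `NODE_PATTERNS`, `count_nodes`, and `node_delta` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, simplified patterns. A real parser handles nesting and
# edge cases (e.g. links inside media captions) that these regexes do not.
NODE_PATTERNS = {
    "Category": re.compile(r"\[\[\s*Category\s*:[^\]]*\]\]", re.IGNORECASE),
    "Media": re.compile(r"\[\[\s*(?:File|Image)\s*:[^\]]*\]\]", re.IGNORECASE),
    "Wikilink": re.compile(
        r"\[\[(?!\s*(?:Category|File|Image)\s*:)[^\]]*\]\]", re.IGNORECASE
    ),
    "External link": re.compile(r"(?<!\[)\[https?://[^\s\]]+[^\]]*\]"),
    "Heading": re.compile(r"^=+[^=\n]+=+\s*$", re.MULTILINE),
    "Reference": re.compile(r"<ref[^>/]*>.*?</ref>", re.DOTALL | re.IGNORECASE),
}


def count_nodes(wikitext: str) -> Counter:
    """Count how many times each node type appears in a wikitext string."""
    return Counter(
        {name: len(pattern.findall(wikitext)) for name, pattern in NODE_PATTERNS.items()}
    )


def node_delta(old: str, new: str) -> dict:
    """Return the net change in count for each node type that changed."""
    before, after = count_nodes(old), count_nodes(new)
    return {
        name: after[name] - before[name]
        for name in NODE_PATTERNS
        if after[name] != before[name]
    }


old = "A [[cat]] sat."
new = "A [[cat]] sat on a [[mat]].<ref>Book</ref>\n[[Category:Cats]]"
print(node_delta(old, new))
```

Net counts like these cannot distinguish a changed node from a removal plus an insertion, and cannot detect moves, which is one reason a structured diff is needed rather than a simple before/after tally.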
Edit Types Taxonomy
Below are current edit types (presented in a hierarchy that provides some insight into how they are detected):
- Wikitext Nodes:
  - List items
  - Formatting (e.g., bold/italics)
  - Categories (namespace 14)
  - Media (namespace 6)
  - Wikilinks (namespace 0 and in practice all other namespaces)
  - External links
  - Headings (sections)
- Contextual Information:
Each edit type then has four potential associated actions: insert, remove, change, and move. The counts of each edit type and action combination are summed across the whole diff. Most types have clear boundaries, but text is aggregated by section: several changes to the text of a single section are counted just once, while text changes across multiple sections are recorded independently.
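The counting rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: the `Action`, `Change`, and `summarize` names are hypothetical, the `"Text"` label is assumed, and the sketch assumes text changes are deduplicated per section and action (the description above does not say whether different actions in one section are merged).

```python
from collections import Counter
from enum import Enum
from typing import Iterable, NamedTuple


class Action(Enum):
    INSERT = "insert"
    REMOVE = "remove"
    CHANGE = "change"
    MOVE = "move"


class Change(NamedTuple):
    edit_type: str  # e.g. "Wikilink", "Heading", "Text" (labels are assumed)
    action: Action
    section: str    # title of the section the change falls in


def summarize(changes: Iterable[Change]) -> dict:
    """Sum (edit type, action) pairs across a diff.

    Text changes are counted once per (section, action); every other
    edit type is counted once per occurrence.
    """
    counts = Counter()
    seen_text = set()
    for change in changes:
        if change.edit_type == "Text":
            key = (change.section, change.action)
            if key in seen_text:
                continue  # already recorded a text change for this section
            seen_text.add(key)
        counts[(change.edit_type, change.action.value)] += 1
    return dict(counts)


changes = [
    Change("Text", Action.CHANGE, "Lead"),
    Change("Text", Action.CHANGE, "Lead"),      # same section: not recounted
    Change("Text", Action.CHANGE, "History"),   # different section: counted
    Change("Wikilink", Action.INSERT, "Lead"),
    Change("Wikilink", Action.INSERT, "Lead"),  # non-text types always count
]
print(summarize(changes))
```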
Edit Diffs and Detectors
Computing textual diffs is a long-standing challenge and a central feature of the wikis -- i.e. wikitext diffs (and more recently visual diffs) underlie the ability of editors to efficiently patrol edits to articles. We do not directly reuse these technologies for two reasons: 1) our goal is to support large-scale analyses, which generally means that our implementation must exist in Python (which can easily be applied to the Data Lake via PySpark UDFs), and 2) the goal of on-wiki diffs is to provide a visually-coherent, human-interpretable explanation of changes, whereas our goal is to provide a structured, machine-interpretable description of what has changed. That said, our work most closely matches (and draws much inspiration and code from) the Visual Editor diffs, which also contain some semantic explanations of the changes occurring in a diff.
The diffing and detection process can be split into three stages:
- Tree diffing: this is the high-level determination of what changed and where in an article -- e.g., a template was changed. It's the first stage in the diffing process and is particularly helpful for detecting moves and bringing more structure to the diff. The outputs are then passed on to the node differ (explained below) for further processing.
- Node diffing: this is the specific determination of what happened -- e.g., a parameter was added to that template. This stage is also where we do some more fine-grained disambiguation of what was changed -- e.g., whether a tag is a reference, table, list, etc.
- Counting: this is the summary of what happened based on all the changes. While this sounds simple, it's actually one of the harder parts because it depends on a clear idea of how to interpret changes in wikitext. For example, how should one count a reference that was added within a template? Just a template edit? Or that plus a reference edit? Or just a reference edit if the template syntax wasn't altered otherwise?
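To make the three stages concrete, here is a deliberately simplified, standard-library-only sketch. It is not the project's algorithm: sections stand in for the tree, only headings, wikilinks, and text are handled, moves and "change" actions on links are not detected, and sections with duplicate titles would collide. All function names (`split_sections`, `tree_diff`, `node_diff`, `count_edit_types`) are hypothetical.

```python
import re
from collections import Counter

HEADING = re.compile(r"^(=+[^=\n]+=+)\s*$", re.MULTILINE)
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")      # captures the target
LINK_MARKUP = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")   # captures the label


def split_sections(wikitext):
    """Map section title -> body; text before the first heading is 'Lead'."""
    parts = HEADING.split(wikitext)
    sections = {"Lead": parts[0]}
    for heading, body in zip(parts[1::2], parts[2::2]):
        sections[heading.strip("= ")] = body
    return sections


def tree_diff(old, new):
    """Stage 1: find which sections differ (None marks a missing section)."""
    before, after = split_sections(old), split_sections(new)
    titles = sorted(before.keys() | after.keys())
    return [(t, before.get(t), after.get(t)) for t in titles
            if before.get(t) != after.get(t)]


def node_diff(old_body, new_body):
    """Stage 2: work out what happened inside one changed section."""
    changes = []
    if old_body is None:
        changes.append(("Heading", "insert"))
    elif new_body is None:
        changes.append(("Heading", "remove"))
    old_body, new_body = old_body or "", new_body or ""
    # Compare link targets as multisets to find insertions and removals.
    old_links = Counter(WIKILINK.findall(old_body))
    new_links = Counter(WIKILINK.findall(new_body))
    changes += [("Wikilink", "insert")] * sum((new_links - old_links).values())
    changes += [("Wikilink", "remove")] * sum((old_links - new_links).values())
    # Compare text with link markup reduced to its label, so that wrapping
    # an existing word in [[...]] is not also reported as a text change.
    old_text = LINK_MARKUP.sub(r"\1", old_body).strip()
    new_text = LINK_MARKUP.sub(r"\1", new_body).strip()
    if old_text != new_text:
        action = "insert" if not old_text else "remove" if not new_text else "change"
        changes.append(("Text", action))  # at most one text change per section
    return changes


def count_edit_types(old, new):
    """Stage 3: sum (edit type, action) pairs over all changed sections."""
    counts = Counter()
    for _title, old_body, new_body in tree_diff(old, new):
        counts.update(node_diff(old_body, new_body))
    return dict(counts)


old = "A cat sat on a mat.\n== History ==\nCats are old.\n"
new = ("A cat sat on a [[mat]].\n== History ==\nCats are very old.\n"
       "== Diet ==\nCats eat [[fish]].\n")
print(count_edit_types(old, new))
```

Even in this toy version the counting questions raised above show up: linking the existing word "mat" is reported only as a wikilink insert, whereas the new "Diet" section is reported as a heading insert, a text insert, and a wikilink insert. Whether that is the right interpretation is a design decision rather than a given.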
Details on the approach to diffing used by this project can be found here: PAWS:Edit Types Diffs.ipynb
The current state of the detectors can be tracked through this interface: https://wiki-topic.toolforge.org/diff-tagging?lang=en
You can also see an example deployment of the library on the cluster here: https://gitlab.wikimedia.org/isaacj/miscellaneous-wikimedia/-/blob/master/edit-types-cluster/EditTypes.ipynb
Some examples of how the outputs could be used:
- Fine-grained metrics of the impact of a tool: https://public-paws.wmcloud.org/User:Isaac_(WMF)/Edit%20Diffs/Campaign%20Impact.ipynb
- Inferring editor intention: https://public-paws.wmcloud.org/User:Isaac_(WMF)/Edit%20Intentions/Editor%20Intention.ipynb
- Detecting vandalism: https://public-paws.wmcloud.org/User:Isaac_(WMF)/Vandalism/Language-agnostic%20Vandalism%20Detection.ipynb
- Yang, Diyi; Halfaker, Aaron; Kraut, Robert; Hovy, Eduard (2017). "Identifying Semantic Edit Intentions from Revisions in Wikipedia" (PDF). aclweb.org: 2000-2010. doi:10.18653/v1/D17-1213. Retrieved 15 October 2021.