This page is an incomplete draft of a research project.
Information is incomplete and is likely to change substantially before the project starts.



This page has been deprecated. See Research:Detox for current info.

Proposal Overview


The purpose of this project is to understand how abusive or "toxic" language affects the contributor community on Wikipedia. The focus of our analysis will be on talk page comments that exhibit harassment, personal attacks and aggressive tone. We aim to:

  1. build models to detect toxic comments,
  2. determine how prevalent toxic comments are, who they target and what impact they have on user retention and productivity.

We expect the outputs of this research to include:

  1. publication of our methods and findings
  2. an API for scoring talk page comments for toxicity
  3. an open dataset of all talk page comments
  4. an open dataset of talk page comments labeled for toxicity



Nithum Thain will be working with the WMF Research and Data team under an NDA in order to access the WMF Hadoop cluster for analysis.



A challenge in this project is to identify what is meant by toxic language. We are initially focussing on aggressiveness of revisions, personal attacks, and harassment, guided by the resources below:

Data Collection


To train our initial models, we used CrowdFlower to crowdsource the labelling of talk page revisions as aggressive or attacking. More details about the specific questions asked and preliminary analyses of the data can be found here.

We are always looking for help with data collection. If you would like to help rate the tone of Wikipedia revisions, please participate on CrowdFlower here. (WARNING: This task involves exposure to very explicit language.)



While we have tried several algorithmic approaches to detecting aggressiveness, harassment, and attacks in Wikipedia comments, our best results come from simple n-gram models. Details about these models are available here (in progress). The code we are using to build and analyze these models is available on our GitHub repository. You can test our current models here.
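As a rough illustration of what an n-gram representation of a comment looks like, the sketch below counts character and word n-grams using only the Python standard library. The function names, the n-gram ranges, and the lowercasing step are illustrative assumptions, not a description of the actual models.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping word n-grams of `text`, split on whitespace."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(comment, char_n=(3,), word_n=(1, 2)):
    """Count character and word n-grams in a lowercased comment.

    Keys are prefixed with "c:" or "w:" so the two feature families
    cannot collide. The default ranges are assumptions.
    """
    text = comment.lower()
    counts = Counter()
    for n in char_n:
        counts.update("c:" + g for g in char_ngrams(text, n))
    for n in word_n:
        counts.update("w:" + g for g in word_ngrams(text, n))
    return counts


features = ngram_features("You are wrong, you are rude")
print(features["w:you are"])  # the word bigram "you are" occurs twice
```

Counts like these would then be fed to a classifier; which classifier and which weighting scheme to use is left open here.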



This section contains links to the papers that this research project has produced. We'd love to hear your thoughts on this work in the discussion section below.

Data Sources


Talk Page Comments


The MediaWiki software that underlies Wikipedia does not have any functionality that is dedicated to discussions. Instead, users carry out discussions on dedicated "talk pages". The software does not impose any constraints on how these pages can be edited, but users follow certain conventions to structure their discussions into threads of comments. In most cases a talk page revision (i.e. an edit) corresponds to adding a comment, so we treat the text that was inserted during a talk page revision as a comment. To get the text that was inserted during a revision, we compute diffs using the mwdiffs package.
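The idea of treating inserted text as a comment can be sketched with Python's standard `difflib` module. This is a stand-in for illustration only: it is not the mwdiffs API, and the line-level granularity and the function name `inserted_text` are assumptions.

```python
import difflib


def inserted_text(old_revision, new_revision):
    """Return the lines present in the new revision but not in the old one.

    Line-level diffing is a simplification; a token-level diff would also
    catch text inserted into the middle of an existing line.
    """
    old_lines = old_revision.splitlines()
    new_lines = new_revision.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" opcodes both contribute new lines.
        if tag in ("insert", "replace"):
            added.extend(new_lines[j1:j2])
    return "\n".join(added)


# Hypothetical talk page revisions for illustration.
old = "== Sources ==\nI think the second source is unreliable. ~Alice"
new = old + "\n:I disagree, it is peer reviewed. ~Bob"
print(inserted_text(old, new))
```

Revisions that only delete or move text produce an empty string under this sketch and would not count as comments.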

Block Events


Users can have their privileges suspended for violating Wikipedia policies. When a user is blocked there is a log entry containing the name of the user, the timestamp, and the reason for the block. The reason for the block often references the specific policy that the blocked user violated. We collect all block events where a user violated the No personal attacks or No Harassment policies.
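Selecting the relevant block events amounts to filtering log entries by their stated reason. The sketch below assumes a hypothetical list of log entries with `user`, `timestamp`, and `reason` fields, and a simple pattern match on the policy names and the shortcuts WP:NPA and WP:HARASS. The real log schema and the matching rules may differ.

```python
import re

# Assumed pattern: policy names or their common shortcuts in the block reason.
POLICY_PATTERN = re.compile(
    r"no personal attacks|WP:NPA|harass",
    re.IGNORECASE,
)


def policy_blocks(log_entries):
    """Keep block events whose reason mentions one of the two policies."""
    return [e for e in log_entries if POLICY_PATTERN.search(e["reason"])]


# Hypothetical log entries for illustration only.
log = [
    {"user": "ExampleUserA", "timestamp": "2015-03-01T12:00:00Z",
     "reason": "[[WP:NPA|Personal attacks]] after final warning"},
    {"user": "ExampleUserB", "timestamp": "2015-03-02T08:30:00Z",
     "reason": "Vandalism-only account"},
    {"user": "ExampleUserC", "timestamp": "2015-03-05T19:45:00Z",
     "reason": "Violation of [[WP:HARASS|No Harassment]] policy"},
]

blocked = policy_blocks(log)
print([e["user"] for e in blocked])
```

Free-text reasons are noisy, so a pattern like this will miss blocks that describe the behaviour without naming the policy.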

Labeling Talk Page Comments


We are using the CrowdFlower platform to annotate a large number of talk page comments. We currently collect labels for each comment using the following two questions:

Q1: How aggressive or friendly is the tone of this comment?

  • --- (very aggressive)
  • --
  • -
  • 0 (neutral)
  • +
  • ++
  • +++ (very friendly)

Q2: Does the comment contain a personal attack or harassment? Please mark all that apply:

  • Targeted at the recipient of the message (e.g. "you suck").
  • Targeted at a third party (e.g. "Bob sucks").
  • Being reported or quoted (e.g. "Bob said Henri sucks").
  • Another kind of attack or harassment.
  • This is not an attack or harassment.
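One way to turn these answers into training labels is to map the Q1 scale to integers from -3 to +3 and average across annotators, and to take a majority vote on whether any attack option was selected in Q2. Both aggregation rules and all names below are assumptions for illustration; they are not necessarily how the labels are combined in our analysis.

```python
from statistics import mean

# Q1 answer options mapped to a numeric scale (assumed encoding).
TONE_SCORE = {"---": -3, "--": -2, "-": -1, "0": 0, "+": 1, "++": 2, "+++": 3}

NOT_ATTACK = "not_attack"  # hypothetical code for the last Q2 option


def mean_tone(q1_answers):
    """Average the tone ratings that annotators gave to one comment."""
    return mean(TONE_SCORE[a] for a in q1_answers)


def is_attack(q2_answers):
    """Majority vote over annotators.

    An annotator votes "attack" if they selected at least one option
    other than NOT_ATTACK.
    """
    votes = [bool(set(selected) - {NOT_ATTACK}) for selected in q2_answers]
    return sum(votes) > len(votes) / 2


# Three hypothetical annotators rating the same comment.
q1 = ["--", "---", "-"]
q2 = [{"recipient"}, {"recipient", "third_party"}, {NOT_ATTACK}]

print(mean_tone(q1))
print(is_attack(q2))
```

Averaging treats the seven-point scale as evenly spaced, which is itself a modelling choice; a median or a per-annotator normalisation would be reasonable alternatives.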

Since there are roughly 31M talk page comments, we need a strategy for subsampling the data. We collect two types of samples. We started with random sampling, but found that it is relatively rare for a comment to be labeled as an instance of harassment or as very aggressive. Since it can be hard to train machine learning models on very skewed data, we also want a sample that contains a higher proportion of bad comments. To build it, we take the k comments that were made by a blocked user before a block event.
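The two sampling strategies can be sketched as follows. The record layout (`user`, `timestamp`, `text`), the function names, the value of k, and the choice to take the k most recent comments before each block are all hypothetical. Timestamps are assumed to be ISO 8601 strings in the same timezone, so that string comparison matches chronological order.

```python
import random


def random_sample(comments, size, seed=0):
    """Uniform random sample of comments, seeded for reproducibility."""
    rng = random.Random(seed)
    return rng.sample(comments, min(size, len(comments)))


def comments_before_block(comments, block_events, k):
    """For each block event, take the k most recent comments that the
    blocked user made before the block timestamp."""
    if k <= 0:
        return []
    sample = []
    for event in block_events:
        prior = [
            c for c in comments
            if c["user"] == event["user"] and c["timestamp"] < event["timestamp"]
        ]
        prior.sort(key=lambda c: c["timestamp"])
        sample.extend(prior[-k:])
    return sample


# Hypothetical comments and block events for illustration only.
comments = [
    {"user": "ExampleUserA", "timestamp": "2015-02-27T10:00:00Z", "text": "first"},
    {"user": "ExampleUserA", "timestamp": "2015-02-28T10:00:00Z", "text": "second"},
    {"user": "ExampleUserA", "timestamp": "2015-03-01T11:00:00Z", "text": "third"},
    {"user": "ExampleUserA", "timestamp": "2015-03-02T09:00:00Z", "text": "after block"},
    {"user": "ExampleUserB", "timestamp": "2015-03-01T09:00:00Z", "text": "unrelated"},
]
blocks = [{"user": "ExampleUserA", "timestamp": "2015-03-01T12:00:00Z"}]

print([c["text"] for c in comments_before_block(comments, blocks, k=2)])
```

Because the second sample is deliberately enriched, prevalence estimates should come from the random sample only.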

Our detailed analysis can be found in the following notebooks on our GitHub repository:

  • Experiment 1: Tests different wording for our questions and compares the random sample with the blocked-user sample.
  • Experiment 2: Focuses on the blocked-user dataset and shows how the new wording leads to better response quantiles and higher inter-annotator agreement.
  • Experiment 3: Focuses on the posts nearest to a block event (within 5 posts) and additionally analyzes how aggressiveness changes before and after a block event.
  • Experiment 4: Compares the revisions at various proximity intervals around a block event.

Model Building


N-Gram Models


Prevalence and Impact Analysis


Questions we hope to address include:

  • How prevalent is toxic language on Wikipedia user talk pages?
  • How does the experience of toxic language change for different segments of the contributor base (e.g. by gender or other variables)?
  • How does toxic language affect contributors (e.g. user retention, user productivity)?

Getting Involved


We welcome input on anything we have outlined in our proposal. Just leave us a message on the project talk page!

You can also get involved by checking out the Discussion-modeling Phabricator project.

Concrete Needs

  1. Please reach out if you are interested in volunteering to help us label talk page comments.
  • I am very interested in taking part in this project, but would you please direct me to where I would best register my interest? I don't see a link here and I would like to know where I can put my name on the table. Would you please also leave a message at my talk page or ping me so that I know where to respond? I find this project really fascinating and I think it's probably quite useful. The trick will be in defining what is actually harassment or personal attack, versus what is calling out harassment or personal attack done by others. It's really important not to let this thing boomerang back onto the people who are trying to make a friendlier environment that is more collaborative. There is definitely a huge problem within Wikipedia of people being toxic, obstructionist editors instead of collaborative ones. There is a great problem of lack of integrity in dialogue on talk pages. And there is really no enforcement mechanism at all that actually works. We have these boards and we have administrators, but they really don't seem to work in my experience. Instead I think that we have roving gangs of editors enforcing certain agendas by being toxic and wikilawyering. SageRad (talk) 12:34, 30 May 2016 (UTC)