Machine learning models are often biased in unintended ways. You can find a quick introduction to human bias in ML on YouTube. Unintended bias tends to come from skewed or incomplete training data. In text classification, this typically happens because of:

  • Gaps in the training data: for example, there are few outright threats of violence posted in Wikipedia comments, so machine learning algorithms trained on that data will not be good at identifying them.
  • Skewed annotator population: a skewed demographic of crowd-workers might affect ratings, e.g. if the population of crowd-workers labelling the training data skews toward a particular political orientation and does not consider attacks on opponents to be toxic.[1] Here, we’re loosely using the term “demographic” to refer to personal characteristics of annotators (gender, age, sexual orientation, but also membership in interest groups, etc.).

For more details on the challenge of unintentional bias in machine learning, see the "Attacking discrimination with smarter machine learning" interactive visualization and the FAT ML resources page (maintained by the Fairness, Accountability and Transparency in Machine Learning community).

Detecting & mitigating algorithmic biases


The real question about our models is not whether they have unintended biases - we are sure they must - but in what ways they are biased, how much, and how much of a problem that is.

  • Identifying whether a model carries unintended biases can be difficult. A particular analysis that fails to find a bias does not show that the bias is absent; other methods or additional data might still find it. There is no conclusive test showing that a model has no unintended bias.
  • There are several different concepts of what it means for a machine-learned model to be fair, and they are not necessarily compatible [2].
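
As a small arithmetic illustration of how fairness criteria can pull against each other (a simplified sketch, not the specific result in [2]), suppose a classifier has the same false negative rate and false positive rate for two groups whose comments have different rates of toxicity. The group names, base rates and error rates below are hypothetical numbers chosen only for illustration.

```python
def precision(base_rate, fnr, fpr):
    """Fraction of flagged comments that are actually toxic.

    base_rate: fraction of the group's comments that are toxic (hypothetical)
    fnr: false negative rate, fpr: false positive rate
    """
    true_positives = base_rate * (1 - fnr)
    false_positives = (1 - base_rate) * fpr
    return true_positives / (true_positives + false_positives)

# Same error rates for both groups, different base rates.
precision_a = precision(base_rate=0.5, fnr=0.2, fpr=0.1)
precision_b = precision(base_rate=0.1, fnr=0.2, fpr=0.1)

print(f"group A precision: {precision_a:.3f}")
print(f"group B precision: {precision_b:.3f}")
```

With these numbers, equal error rates across the two groups go together with unequal precision: a flag on a comment from group B is much less likely to be correct than a flag on a comment from group A. Which of these properties matters more depends on how the model is used.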

We want our models to support more inclusive conversations, and thereby to allow more points of view to be shared.

  • We can characterize unintended biases as those that result in excluding a group of people, for instance by failing to recognize toxic attacks targeting their communities. A natural starting point is to consider groups who are frequently targeted (e.g. protected classes).
  • We also do not want our model to be unfairly biased with respect to topic: we want to support conversations on any topic.
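
One way to look for the first kind of bias is to compare, per targeted group, how often the model misses comments that annotators labelled as toxic. The following is a minimal sketch under stated assumptions: the tuple layout, the group labels, the scores and the 0.5 threshold are all hypothetical, and it presumes we already know which group each toxic comment targets.

```python
from collections import defaultdict

def false_negative_rate_by_group(examples, threshold=0.5):
    """Share of toxic comments scored below the threshold, per targeted group.

    examples: iterable of (targeted_group, is_toxic, model_score) tuples.
    """
    missed = defaultdict(int)
    total = defaultdict(int)
    for group, is_toxic, score in examples:
        if not is_toxic:
            continue  # only toxic comments can be missed
        total[group] += 1
        if score < threshold:
            missed[group] += 1
    return {group: missed[group] / total[group] for group in total}

# Hypothetical labelled comments with hypothetical model scores.
examples = [
    ("group_a", True, 0.9),
    ("group_a", True, 0.8),
    ("group_a", False, 0.1),
    ("group_b", True, 0.3),
    ("group_b", True, 0.7),
    ("group_b", False, 0.2),
]
rates = false_negative_rate_by_group(examples)
print(rates)
```

A noticeably higher miss rate for one group suggests the model is worse at recognizing attacks on that group. Similar rates on one dataset do not show that the model is unbiased.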

Given that our machine learning models are trained by the judgements of crowd-workers, the demographic of crowd-workers may impact the judgements they make on the toxicity of comments.

  • This can be challenging: it is hard to get accurate demographic information on crowdsourcing platforms, and crowd-workers are unlikely to be representative of the wider population.

Regarding topics, the challenge is the following: if all examples about a particular topic were rated as toxic by crowd-workers (e.g. all training comments mentioning "breast" happen to be sexist attacks), then the trained model would likely learn that the topic is always toxic (e.g. breastfeeding forums end up being considered toxic). This would prohibit constructive and important conversations on that topic, which is exactly what we do not want to do.

  • This is difficult because the concept of 'topic' is itself hard to define.
  • Like bias toward a demographic group, failure to find an unintentional topic bias only means that the method used to find it failed.
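
A rough probe for this kind of topic bias is to compare how often non-toxic comments are flagged when they mention a given term versus when they do not. The sketch below uses a single word as a crude stand-in for a 'topic', which is a simplifying assumption; the comments, scores and 0.5 threshold are hypothetical.

```python
def false_positive_rates(examples, term, threshold=0.5):
    """Return (rate_with_term, rate_without_term) over non-toxic comments.

    examples: iterable of (text, is_toxic, model_score) tuples.
    A rate is None when there are no non-toxic comments in that bucket.
    """
    # has_term -> [number flagged, number of non-toxic comments]
    counts = {True: [0, 0], False: [0, 0]}
    for text, is_toxic, score in examples:
        if is_toxic:
            continue  # false positives only concern non-toxic comments
        has_term = term in text.lower().split()
        counts[has_term][1] += 1
        if score >= threshold:
            counts[has_term][0] += 1

    def rate(flagged, total):
        return flagged / total if total else None

    return rate(*counts[True]), rate(*counts[False])

# Hypothetical comments with hypothetical model scores.
examples = [
    ("tips for breast feeding at night", False, 0.8),
    ("breast cancer screening saved my life", False, 0.6),
    ("breast milk storage guidelines", False, 0.3),
    ("thanks for fixing the citation", False, 0.1),
    ("i disagree with this edit", False, 0.2),
    ("you are an idiot", True, 0.95),
]
with_term, without_term = false_positive_rates(examples, "breast")
print(with_term, without_term)
```

A large gap between the two rates hints that the model associates the term itself with toxicity. As noted above, a small gap only means that this particular probe did not find a bias.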

Of course, topic and demographic challenges intersect: how toxic a comment is perceived to be may depend on both the characteristics of the annotator and the topic.

We believe understanding and mitigating unintended bias in this work requires analysis and scrutiny from an interdisciplinary perspective. This is one of the key motivations for opening access to an API that hosts models and for creating public datasets like our Wikipedia datasets; doing so enables other researchers to help address the challenge of bias in machine learning.

Bias in our models is an active area of research, and we welcome others getting involved via the WikiDetox wiki pages or via the Conversation AI research projects on GitHub.


  1. Binns, Reuben; Veale, Michael; Van Kleek, Max; Shadbolt, Nigel (2017). "Like trainer like bot? Inheritance of bias in algorithmic content moderation". Proceedings of the 9th International Conference on Social Informatics (SocInfo 2017), Oxford, UK, 13-15 September 2017. arXiv:1707.01477. doi:10.1007/978-3-319-67256-4_32. 
  2. Kleinberg, Jon; Mullainathan, Sendhil; Raghavan, Manish (2017). "Inherent Trade-Offs in the Fair Determination of Risk Scores". Proceedings of Innovations in Theoretical Computer Science (ITCS), 2017.