Research:Automatically labeling low quality content

04:02, 7 October 2020 (UTC)
Aaron Halfaker
Nikola Banovic
labeling, machine learning

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

We are trying to build models that can automatically flag issues with statements on Wikipedia, such as poor grammar, bias, or missing citations. The goal is to learn quality-improving behaviors from edits across Wikipedia along these dimensions, and then use what is learned to build models that automatically identify such issues.


Data Extraction

We extracted about 6 million Wikipedia edits from articles of varying quality and preprocessed them to identify their semantic intention. We extracted edits with the following semantic intentions:

  1. Point-of-view
  2. Citations
  3. Clarifications

For each semantically labeled edit, we use the statement as it stood before modification as a positive example of text needing that improvement. For example, an edit that makes a point-of-view change is trying to make a statement or paragraph more neutral, so the pre-edit statement is extracted as a positive example of a point-of-view issue. We use these statements to train models that can automatically identify issues such as point-of-view, clarification, and citation needs in unseen Wikipedia statements.
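The labeling step described above can be sketched as follows. This is only an illustration, not the project's actual preprocessing: the keyword patterns, the function names and the `comment`/`before` fields are assumptions made for the example. The only comment string this page itself mentions is "NPOV/POV".

```python
import re

# Hypothetical keyword patterns keyed by semantic intention.
# Only the "NPOV/POV" comment is taken from this page; the citation and
# clarification patterns are assumptions made for illustration.
INTENTION_PATTERNS = {
    "point-of-view": re.compile(r"\bN?POV\b", re.IGNORECASE),
    "citations": re.compile(r"\b(cite|citation|ref|reference)s?\b", re.IGNORECASE),
    "clarifications": re.compile(r"\b(clarify|clarification|clarified)\b", re.IGNORECASE),
}


def label_intentions(comment):
    """Return the intentions whose pattern matches the edit comment."""
    return [name for name, pattern in INTENTION_PATTERNS.items()
            if pattern.search(comment)]


def positive_examples(edits):
    """Yield (pre-edit statement, intention) pairs for every labeled edit."""
    for edit in edits:
        for intention in label_intentions(edit["comment"]):
            yield edit["before"], intention


edits = [
    {"comment": "NPOV/POV", "before": "He is the greatest singer ever."},
    {"comment": "fix typo", "before": "The city has 2 milion residents."},
]
for statement, intention in positive_examples(edits):
    print(intention, "->", statement)
```

Edits whose comment matches no pattern produce no examples, so unrelated edits such as typo fixes are skipped.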

Statement Quality Identification

At present, we are focusing on three categories of improvement: point-of-view, need for citations, and need for clarification. From the extracted and labeled edits described above, we take the statements that were modified in those edits as examples needing those quality improvements. We then use these statements to build quality identification models, to show that meaningful quality-improving behaviors can be learned from good-quality, non-vandalism edits.
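One possible way to find the statements an edit modified is a sentence-level diff of the text before and after the edit. The sketch below uses Python's standard `difflib` module. It is an assumption about how this could be done, not a description of the project's pipeline, and the naive sentence splitter and the function names are hypothetical.

```python
import difflib
import re


def split_statements(text):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def modified_statements(before_text, after_text):
    """Return statements from the pre-edit text that were changed or removed."""
    before = split_statements(before_text)
    after = split_statements(after_text)
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    changed = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "replace" and "delete" cover pre-edit statements that did not survive.
        if tag in ("replace", "delete"):
            changed.extend(before[i1:i2])
    return changed


before = ("Paris is the capital of France. "
          "It is obviously the best city in the world. "
          "It lies on the Seine.")
after = ("Paris is the capital of France. "
         "It is a major European city. "
         "It lies on the Seine.")
print(modified_statements(before, after))
```

Statements that appear only in the post-edit text are ignored, since only the pre-edit wording serves as an example of text needing improvement.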

Based on the results, we intend to expand this to detecting a variety of issues with statements, using the same approach of learning quality-improving behaviors from Wikipedia edits.

Visual depiction of the proposed pipeline for identifying content issues in statements on Wikipedia using edits

Policy, Ethics and Human Subjects Research

Our work does not record any data that is not publicly available in the Wikipedia ecosystem. We do not need any information other than Wikipedia editors' assessments of the statements that are part of the study. The University of Michigan Institutional Review Board Health Sciences and Behavioral Sciences has determined that this study is exempt from IRB oversight (Date: 9/24/2020, IRB No. HUM00187850). To solicit feedback from the community, we intend to post a small number of predictions from the model on English Wikipedia's Featured Article Review space to show the potential of the research to ease the review process.


This work will directly benefit the Wikipedia community in its efforts to improve the quality of Wikipedia articles. By automatically detecting issues with statements on Wikipedia articles, efforts around article quality improvement can be accelerated.

For testing, we use a split of the dataset created with the edit-labeling strategy discussed in the previous section. For example, for minor POV problems, the testing statements are the statements modified in edits with the comment "NPOV/POV". Here are the preliminary results for minor POV and missing-citation problems:

Category  | Testing Examples | Precision | Recall | F1-score | ROC-AUC
POV       | 37,000           | 81%       | 86%    | 83%      | 92%
Citations | 300,000          | 65%       | 73%    | 69%      | N/A
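As a reading aid for the table above, the F1-score is the harmonic mean of precision and recall. The short sketch below recomputes it from the reported precision and recall, assuming the standard definition is the one used here. The function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the preliminary results table.
reported = {
    "POV": (0.81, 0.86),
    "Citations": (0.65, 0.73),
}

for category, (precision, recall) in reported.items():
    print(f"{category}: F1 = {f1_score(precision, recall):.0%}")
# POV: F1 = 83%
# Citations: F1 = 69%
```

Both recomputed values match the F1-scores in the table after rounding to whole percentages.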