Grants:Project/University of Virginia/Machine learning to predict Wikimedia user blocks

Machine learning to predict Wikimedia user blocks
summary: conduct and report machine learning to predict user blocks for misconduct
contact: user:bluerasberry • cws3v(_AT_)
organization: Data Science Institute at University of Virginia
created on: 05:48, 1 December 2018 (UTC)

Project idea


What is the problem you're trying to solve?


Wikimedia user misconduct includes spam, harassment, vandalism, inappropriate use of a proxy, sockpuppeting, and "Clearly not being here to build an encyclopedia". Right now humans evaluate these things. Since 2017 the mw:ORES review tool has been the Wikimedia Foundation's own service for automatically ranking incoming user contributions; it can detect some kinds of misconduct, but not all. Also since 2017, the Wikimedia Foundation Community health initiative has proposed various interventions for protecting good contributions from the misconduct of other users.

The problem with "misconduct" is that it is challenging to define, hard to detect, tedious for a human to manage, and needs a speedier response than humans can give in any case. There are also many types of misconduct, each of which probably merits an individual response. The 2017 research papers Ex Machina: Personal Attacks Seen at Scale (Q28731251) and Algorithms and insults: Scaling up our understanding of harassment on Wikipedia (Q55994649) are correct that an automated response to misconduct is part of the solution. The research in these papers only presents the problem, and more experiments are necessary to detect misconduct more quickly, with more precision, and in a wider variety of circumstances.

What is your solution to this problem?

  1. Conduct Wikimedia community research applying machine learning to the issue of user misconduct
  2. Share the outcomes in Wikimedia community discussion forums and establish a precedent of seeking community conversation on responses to this issue
  3. Develop the Wikimedia documentation so that other researchers can more easily do research using machine learning and seeking Wikimedia community review

Project goals


What are your goals for this project? Your goals should describe the top two or three benefits that will come out of your project. These should be benefits to the Wikimedia projects or Wikimedia communities. They should not be benefits to you individually. Remember to review the tutorial for tips on how to answer this question.

  1. Use machine learning to research user misconduct on English Wikipedia
    1. identify and characterize all past behavior which has ever resulted in a block to a user account
    2. based on past blocks, rank other accounts to predict which ones seem to merit a block
  2. Seek, collect, and report community response to the research results
    1. Feedback focus will be on community guidance to use this project as a model for research in the field of machine learning to address user misconduct rather than a practical solution
    2. Collect lists of community ethical concerns related to predicting blocks for user accounts, to advise future, potentially more risky research
    3. Report ways to have casual conversation on the research. This conversation will be in the context of the Wikimedia community values of having transparent discussions and making Wikimedia safer, and of reporting technical machine learning methods to a community which will not be familiar with AI applications.
  3. Present this relatively simple research project as a model for others to do similar, more comprehensive research
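
As an illustration of goal 1 (learn from past blocks, then rank other accounts), here is a minimal sketch using only the Python standard library. It is not the research team's method. The two features, the toy training rows, and the account names are all hypothetical, and logistic regression is simply one common choice for producing a score that can be sorted.

```python
import math

# Hypothetical per-account features (NOT from the proposal):
#   (fraction of edits reverted, edits in first day / 100)
# Label 1 = account was blocked, 0 = not blocked. Toy data for illustration only.
TRAIN = [
    ((0.90, 0.40), 1),
    ((0.80, 0.10), 1),
    ((0.70, 0.60), 1),
    ((0.10, 0.05), 0),
    ((0.05, 0.20), 0),
    ((0.20, 0.10), 0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(x, w, b):
    """Estimated probability that an account with features x resembles past blocked accounts."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train(rows, epochs=2000, lr=0.5):
    """Fit logistic regression weights with plain batch gradient descent."""
    w = [0.0] * len(rows[0][0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in rows:
            err = score(x, w, b) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(rows)
    return w, b


def rank_accounts(accounts, w, b):
    """Return (name, score) pairs sorted from most to least block-like."""
    scored = [(name, score(x, w, b)) for name, x in accounts.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


weights, bias = train(TRAIN)
# Hypothetical unlabeled accounts to rank.
unlabeled = {
    "account_a": (0.85, 0.30),
    "account_b": (0.10, 0.10),
    "account_c": (0.50, 0.20),
}
ranking = rank_accounts(unlabeled, weights, bias)
for name, probability in ranking:
    print(name, round(probability, 3))
```

The output is a ranked list for human review rather than a decision, which matches the proposal's framing of this work as a research model rather than a practical solution.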

Project impact


How will you know if you have met your goals?


For each of your goals, we’d like you to answer the following questions:

  1. During your project, what will you do to achieve this goal? (These are your outputs.)
  2. Once your project is over, how will it continue to positively impact the Wikimedia community or projects? (These are your outcomes.)

For each of your answers, think about how you will capture this information. Will you capture it with a survey? With a story? Will you measure it with a number? Remember, if you plan to measure a number, you will need to set a numeric target in your proposal (i.e. 45 people, 10 articles, 100 scanned documents). Remember to review the tutorial for tips on how to answer this question.

  1. Publish research results in an academic journal in 2019
  2. Publish research results in a Wikimedia community forum, such as The Signpost, to raise community awareness of the increasing role of machine learning in Wikimedia projects and also of the social issue of user misconduct
  3. Publish documentation related to the Wikimedia research process, including
    1. Guide on how to access the public Wikimedia datasets
    2. Guide on how to contact Wikimedia community stakeholders regarding the research which concerns them
    3. Guide encouraging other researchers to do similar research in machine learning

Beyond the research outcomes themselves, this project also seeks to document its methodology and good Wikimedia collaboration practices for applying machine learning to Wikimedia data in general and to user conduct analysis in particular.

Do you have any goals around participation or content?


This project will not result in any Wikimedia mainspace edits.

There is a goal to produce Wikimedia documentation targeting community researchers and community stakeholders who seek to review research.

Here is an estimate of the three shared metrics outcomes:

  1. Total number of participants - 0
  2. Total number of newly registered users - 0
  3. Total number of content pages created or improved - 10-20, including documentation pages on meta and community discussion forums related to misconduct

Project plan



  1. Do research as described at University of Virginia/Machine learning to predict Wikimedia user blocks
  2. Beyond the research, and outside the scope of the researchers' machine learning goals, connect the outcomes to Wikimedia community forums

The major activities of the project are the research and labor required to demonstrate understanding of data science so that the 3 student researchers can qualify for their Master of Science degrees from the Data Science Institute.

To complement that research, and beyond the scope of data science, regular Wikimedian Lane Rasberry / bluerasberry will present these results and instructions on general Wikimedia research in on-wiki documentation to the Wikimedia community and future data science researchers.


  • $5000 research sponsorship

The Data Science Institute at the University of Virginia offers graduate student research sponsorship opportunities to companies for $15,000. In this scheme there is no budget allocation or management, because the presumption is that the process is complicated, most of the expenses get subsidized by the university research department, and the cost is relatively small in comparison to other ways to hire this kind of research for this period of time. I am seeking 1/3 of the typical cost, which would sponsor the research as a whole and cover the costs of the aspects of the research which are out of scope of the typical university experience but which make the experience more relevant to the Wikimedia community. Most academic research on Wikipedia is disconnected from, and mostly unknown to, the Wikimedia community. I want this project to be more interconnected and to encourage Wikimedia-compatible community reporting.

Here are some things that this sponsorship covers:

  • 3 master's-level graduate students doing research for 6 months
  • faculty adviser
  • staff Wikimedia professional supporting research at the university
  • computation / laboratory costs
  • publishing fees for academic journal

Ethics statement

  • The content and outcomes will conform to the wmf:Open access policy.
  • Note that this project only uses public data which anyone can access and does not seek any confidential Wikimedia user information.
  • Even though this project will only use public data, it is possible to surface information which could harm individuals. The Wikimedia community needs simple introductory projects like this one to begin to identify risks. Because of this, and balanced against the open access policy, this project will not share all of its data openly and will withhold identifying information of Wikimedia user accounts. This data will be available to the WMF on request. The intent here is user safety.
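
One way to withhold identifying account names while keeping records linkable within a dataset is keyed hashing. The sketch below is an assumption for illustration only: the proposal does not say how identifiers will be withheld, and the key, function name, and truncation length are hypothetical. It uses HMAC-SHA256 from the Python standard library.

```python
import hashlib
import hmac

# Hypothetical secret key. In practice it would be kept private by the research
# team and never published alongside the data.
SECRET_KEY = b"replace-with-a-private-random-key"


def pseudonymize(username, key=SECRET_KEY):
    """Map a username to a stable pseudonym that cannot be recomputed without the key."""
    digest = hmac.new(key, username.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability; the length is an arbitrary choice


# The same account always maps to the same pseudonym, so rows stay linkable.
print(pseudonymize("ExampleUser"))
print(pseudonymize("ExampleUser") == pseudonymize("ExampleUser"))
```

Replacing names does not by itself remove risk, because edit timestamps and page titles can still point back to a public account. That is one reason the concerns listed above call for community review.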

Community engagement


How will you let others in your community know about your project? Why are you targeting a specific audience? How will you engage the community you’re aiming to serve at various points during your project? Community input and participation helps make projects successful.

Reporting targets for this project include the following:

  • Research:Index
    • Target audience is other researchers
  • English Wikipedia community forums addressing user misconduct
    • These will vary depending on research outcomes, because spam, harassment, etc. may look similar from a machine learning perspective but are addressed by different communities
  • Technical spaces
    • There is still limited documentation on how Wikimedia researchers can access public datasets
    • This project will build documentation to make research easier for future data science teams to begin
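
As an example of the kind of access such documentation could cover, public block log entries can be read through the MediaWiki Action API. The sketch below assumes the standard `list=logevents` query with `letype=block` on English Wikipedia. The User-Agent string and helper names are hypothetical, and the field names read in `summarize` follow the usual response shape but should be checked against a live response.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def block_log_url(limit=10):
    """Build a query URL for the most recent public block log entries."""
    params = {
        "action": "query",
        "list": "logevents",
        "letype": "block",
        "lelimit": str(limit),
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def fetch_block_log(limit=10):
    """Download block log entries. Requires network access; not called at import time."""
    request = urllib.request.Request(
        block_log_url(limit),
        # Hypothetical value; a real client should identify itself with contact details.
        headers={"User-Agent": "block-research-example/0.1 (hypothetical contact)"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def summarize(payload):
    """Reduce a response to (target page, action, timestamp) tuples."""
    events = payload.get("query", {}).get("logevents", [])
    return [(e.get("title"), e.get("action"), e.get("timestamp")) for e in events]


if __name__ == "__main__":
    for row in summarize(fetch_block_log(limit=5)):
        print(row)
```

For research at the scale the proposal describes, the public database dumps are likely a better fit than paging through the API, but the API is convenient for exploring what the log contains.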

Get involved




The team is described at University of Virginia/Machine learning to predict Wikimedia user blocks.

  • Lane Rasberry / User:bluerasberry, Wikimedian in Residence at the Data Science Institute, is the point of contact for reporting project outcomes to Wikimedia. Projects from this organization are posted to the University of Virginia project page.
  • Arnab Sarka, Charu Rawat, and Sameer Singh are researchers and candidates for Master of Science in Data Science degrees from the university.
  • Raf Alvarado is faculty at the Data Science Institute and adviser for data science aspects of this project.
  • There are other expert advisers who will get credit for their contributions to the project. Advisers will include others who have done research in online user interaction, in online misconduct, and in machine learning on Wikimedia data sets.

Community notification


You are responsible for notifying relevant communities of your proposal, so that they can help you! Depending on your project, notification may be most appropriate on a Village Pump, talk page, mailing list, etc. Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions. Need notification tips?

  • English Wikipedia policy spaces on harassment
    • posted en:Wikipedia_talk:Blocking_policy#Seeking_comment_on_WMF_funding_request_for_blocking_research
    • posted en:Wikipedia_talk:Harassment#request_for_funding_-_machine_learning_research_on_wiki-misconduct
  • English Wikipedia blocking noticeboards
    • posted en:Wikipedia_talk:Sockpuppet_investigations#Seeking_comment_on_WMF_funding_request_for_blocking_research
  • Meta-Wiki spaces on data science research policy
    • perhaps does not yet exist
    • posted Research_talk:Index#Seeking_comment_on_WMF_funding_request_for_blocking_research



Do you think this project should be selected for a Project Grant? Please add your name and rationale for endorsing this project below! (Other constructive feedback is welcome on the discussion page).