Research:Community-centered Evaluation of AI Models on Wikipedia


This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

In this project, we are interested in understanding opportunities to better support Wikipedians in making collective decisions around the evaluation and deployment of artificial intelligence (AI) systems on Wikipedia, such as ORES. Based on this understanding, we plan to develop a tool that can better support Wikipedians in making collective decisions around AI on Wikipedia.


Wikipedia has been using AI systems to maintain edit and article quality for years. Examples of these AI systems include the well-established ORES and Liftwing, which is still under development. However, there might be a disconnect between the technical evaluation metrics that the engineering team uses and the objectives that Wikipedians expect the AI systems to achieve. This disconnect might prevent Wikipedians from evaluating the AI systems and making informed decisions about AI deployment on Wikipedia.

This project aims to explore potential mechanisms that may support Wikipedians' needs to evaluate AI systems on Wikipedia. Specifically, we are interested in exploring whether there's a way to facilitate the curation of evaluation data that may be used for evaluating AI systems. Usually, the evaluation data is shaped and curated by the engineering team. At most, Wikipedians help label data pre-selected by engineers. Wikipedians with rich domain expertise do not have the chance to decide the composition of the evaluation data or the evaluation criteria they value. Through a series of interview studies and prototyping, we hope to develop a tool that may support the curation of evaluation data led and driven by Wikipedians. The research outcome can benefit the Wikipedia community by supporting Wikipedians in collectively evaluating AI systems on Wikipedia and making more informed decisions around AI deployment.


This research consists of three main activities, described below:

Formative study

The research begins with a formative study that aims to understand opportunities to better support Wikipedians in making collective decisions around the evaluation and deployment of AI systems. We plan to interview Wikipedians who self-identify with at least one of the five roles listed below. The questions for the semi-structured interviews are publicly available for reference.

  • community organizer: who organizes community efforts in AI on Wikipedia
  • AI reviewer: who evaluates AI's effectiveness by identifying and reporting its errors
  • AI user: who uses AI predictions for their daily work on Wikipedia
  • engineer: who builds AI and/or launches data labeling campaigns on Wikipedia
  • data labeler: who labels data for AI on Wikipedia and helps engineers understand their wiki

System prototyping

Based on our findings from the formative study, we plan to iteratively design a tool with interested Wikipedians. We plan to conduct several rounds of pilot studies before the formal community evaluation. The final artifact will be a functioning prototype.

Community evaluation

We plan to recruit Wikipedians who are interested in evaluating AI models on Wikipedia to participate in the formal evaluation of the tool we build. During the study, participants will use the tool to collectively curate a dataset that can be used to evaluate AI models deployed on Wikipedia. We plan to analyze the results both quantitatively and qualitatively. For quantitative analysis, we will measure various metrics around the datasets and the interaction between participants, such as the engagement level in deliberation. For qualitative analysis, we will interview participants about their subjective experiences using the tool we build and their desire for further improvements.


The following timeline is aspirational and is subject to change based on the actual progress of the research.

  • 10.2022: create project page, finish interview protocol [done]
  • 11.2022: start recruitment for the formative study, finish interviews and analysis [done]
  • 12.2022: define concrete design objectives and start ideation [done]
  • 01.2023: continued ideation [in progress]
  • 02.2023: wrap up ideation, iterate on low-, mid-, and high-fidelity prototypes
  • 03.2023: evaluate the system
  • 04.2023: finish paper writing, submit paper

Policy, Ethics and Human Subjects Research

This research was approved by the Institutional Review Board (IRB) at Carnegie Mellon University on August 1, 2022. Researchers are required to obtain participants' verbal consent before the study begins.


