Research:Curation workflows on Wikimedia Commons

Tracked in Phabricator:
Task T175185
Duration:  2018-January – 2018-April
This page documents a completed research project.


The Structured Data for Commons project will fundamentally change how media metadata are entered, stored, and discovered on Wikimedia Commons.

This research project seeks to understand the current workflows of Commons contributors who curate media (categorize it, delete it, link to it from other projects, etc.) in order to identify opportunities to support these workflows better with new software or new metadata, and to avoid disruption of critical workflows during the transition to storing most/all existing metadata in Wikibase.

Overview

The primary goal of this project is to figure out how structured data-based functionality can support editors who are doing important work to:

A. improve Commons itself as a media repository

B. improve integration between Commons and other projects like Wikidata or Wikipedia(s)

C. improve individual media files/pages, or collections of media files within Commons

In addition to figuring out how the structured data on Commons project can build new features to make these editors more effective or their work easier, we also—just as importantly—need to identify key workflows that we should avoid interfering with or breaking when we roll out new functionality.

What do we mean by curation?

At the start of the project, we are using the term 'curation' very generally. It refers to any Commons editing work that involves editing existing media, metadata, and tools (e.g. bots, gadgets). It's meant to exclude one major activity: uploading new media files and working with their metadata, which is addressed in a previous study. It also (tentatively) excludes building and maintaining some supporting content (e.g. process documentation, policies), or general communication and collaboration (e.g. Village Pump, working in Wikiprojects), unless those activities are shown to be directly relevant to our central curation focus.

We are very interested to know whether this definition aligns with community members' understanding of what 'curation' means in the context of Commons, and/or if there are other terms (or meaningful sub-divisions under the umbrella of 'curation work') that are important to capture or interrogate more deeply.

Examples of curation workflows

  • Deletion requests and file deletion
  • Patrolling for unsuitable content (e.g. copyright violations)
  • Anti-vandalism patrolling
  • Categorizing media
  • Improving media metadata
  • Building and maintaining templates and categories
  • Building and maintaining curation tools

Methods

We conducted semi-structured interviews with five Commons administrators who are involved in curation activities:

Participant   Joined Commons   Edit count
p1            2009             70k
p2            2008             12k
p3            2005             200k
p4            2004             100k
p5            2004             200k


Timeline

  • January 2018: scope project, identify potential interview participants
  • February 2018: develop interview protocol, begin interviews
  • March 2018: complete interviews and publish results

Policy, Ethics and Human Subjects Research

Interviews will be conducted and data stored in accordance with Wikimedia's guidelines for research consent, data access, and data retention.

Results

  • Categorization is the work that engages Commonists most. It’s the work they think is most important. And one of the main reasons they think it is important is that categories are (currently) the best way for readers/end-users to navigate Commons content.
    • This makes sense: search works poorly on Commons, and File pages have relatively few wikilinks to support browsing related content the way Wikipedia articles do.
    • Right now, Commonists categorize using categories. If we want to encourage Commonists to start capturing media metadata in Wikibase properties instead of categories, we need to make that work feel as satisfying, meaningful, and (at least as) easy as the current category-first approach.
  • Commonists think of categorization as a local thing, not a global one. Commonists generally work on creating a consistent and comprehensive category system within a particular topic, collection, etc. that is of interest to them. As a Commonist, I’m less concerned about there being one true way to categorize paintings by artists by era than I am about figuring out the best way to categorize paintings by Degas that feature ballet.
    • This is important to note, because Wikibase properties need to be consistent in scope and granularity across all instances, whereas it’s not necessarily a big deal (from a Commonist’s perspective) if Degas’ paintings are categorized differently than Matisse’s paintings, or Seattle’s historic buildings are organized by neighborhood whereas Cologne’s are organized by street.
  • There is a feeling among Commonists that their needs have been ignored in the past by WMF and by other projects, especially Wikipedias and Wikidata.
    • Since the success of Commons relies on the work of these volunteers, we need to make sure that we communicate our reasons for making the changes we do in terms of value to Commonists and Commons readers, rather than, say, Wikidata or “the semantic web”.
    • When WMF pursues software changes that are intended primarily to support non-Commonist audiences (structured data community, new contributors, external consumers), we should consider showing good faith by also agreeing to devote resources to specific long-standing community concerns, community tech wishlist-style.
  • Commonists are receptive to UI and tool improvements overall. They know that categories are (in many ways) an awkward kludge.
    • But there is concern about unintended consequences of WMF-driven UI improvements that could make their work more difficult or less satisfying.
      • For example, if we change Upload Wizard by creating a dozen new required metadata fields that must be filled out individually for each upload.
      • Or if we make it easier for inexperienced uploaders to create new (orphan/redlink) categories when they upload media, but don’t encourage them to connect those categories to existing ones.
    • Various search improvements are a promising direction. Interviewees didn't talk much about using search during their curation work, probably because it doesn't work very well. But better faceted search (e.g. intersections on categories, properties, not-categories, and not-properties) and multilingual search support would both certainly be well received.
    • Improvements to existing tools, like VisualFileChange and Cat-a-lot, would be welcome. Adding support for filtering by and performing batch actions on properties, as well as categories and page text, in these widely used tools would help familiarize Commonists with the potential of structured data.
    • Some Commonists are also excited about microtasks/gamification, but there are a few caveats here.
      • It’s not clear whether these Commonists are excited to perform microtasks themselves or to have someone else perform them. So there’s an open question of, for instance, “if we build a caption translation app, who is going to perform the translating?”
      • Commonists will want to be able to monitor and review the work done through these interfaces, so it’s important to make it easy to see feeds and reports of microtask edits.
  • Some Commonists are excited about machine learning approaches to categorization and search. But these can’t be fully automated; there needs to be a human in the loop to ensure quality control (similar to microtask/gamification workflows).
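
To make the faceted search idea above more concrete, here is a minimal sketch of what "intersections on categories, properties, not-categories, and not-properties" could mean in practice. It is purely illustrative: the `MediaFile` class, the `faceted_search` function, the sample files, and the property labels ("creator", "depicts") are all hypothetical and do not reflect any existing Commons or Wikibase API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MediaFile:
    """Hypothetical stand-in for a Commons file page and its metadata."""
    title: str
    categories: frozenset = frozenset()
    properties: frozenset = frozenset()  # (property label, value) pairs


def faceted_search(files, in_categories=(), not_categories=(),
                   with_properties=(), without_properties=()):
    """Return files that match every positive facet and no negative facet."""
    in_c, not_c = set(in_categories), set(not_categories)
    with_p, without_p = set(with_properties), set(without_properties)
    results = []
    for f in files:
        if not in_c <= f.categories:      # must be in all requested categories
            continue
        if not_c & f.categories:          # must be in none of the excluded ones
            continue
        if not with_p <= f.properties:    # must carry all requested properties
            continue
        if without_p & f.properties:      # must carry none of the excluded ones
            continue
        results.append(f)
    return results


# Invented sample data, for illustration only.
FILES = [
    MediaFile("Degas ballet 1.jpg",
              frozenset({"Paintings by Edgar Degas", "Ballet in art"}),
              frozenset({("creator", "Edgar Degas"), ("depicts", "ballet")})),
    MediaFile("Degas portrait.jpg",
              frozenset({"Paintings by Edgar Degas"}),
              frozenset({("creator", "Edgar Degas")})),
    MediaFile("Matisse dance.jpg",
              frozenset({"Paintings by Henri Matisse"}),
              frozenset({("creator", "Henri Matisse"), ("depicts", "dance")})),
]

# Degas paintings that are NOT tagged as depicting ballet.
matches = faceted_search(
    FILES,
    in_categories=["Paintings by Edgar Degas"],
    without_properties=[("depicts", "ballet")],
)
print([f.title for f in matches])
```

The point of the sketch is that categories and properties can be combined in a single query, so a curator could, for example, find files that sit in a category but still lack a corresponding structured-data statement.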
