Research:Supporting Commons contribution by GLAM institutions/Preserving important metadata about media items

"The categorization system on Wikimedia Commons requires insider knowledge."

Certain kinds of metadata are considered high priority by GLAM contributors, who will go to extra effort, including creating novel hacks and workarounds, to include this metadata. Understanding why certain types of metadata are considered high priority, and how GLAM contributors currently add this metadata, will help identify item properties that are high priority for structuring.


Preserving important metadata within Wikimedia Commons' existing (limited, idiosyncratic) structured metadata capabilities is the most common challenge that GLAM projects face when donating media. Whether a GLAM organization is using Wikimedia Commons as the primary hosting site of a media collection, or donating selections of media from its existing online repository, the ability to preserve existing metadata about uploaded collections and the individual items in those collections is critical to GLAM project success.

Rich metadata not only document the provenance of a media item, but also its historical, cultural, and educational significance. For many GLAMs, the most important piece of metadata is the connection to the original institution or collection: the name of the donating institution, and links to the institution's web presence, its online collection, or even the individual item ID from within that collection.

Issues with categories

Participant Issues reported
p1 "lots of decision making: because many of the images are not 'traditional' images of animals, we needed to create new categories."
p6 "It was tricky to find the categories. I don’t have a complete knowledge of how categories in Commons work. I have this big generic collection of media from the museum. So I categorize by artist. If they have many pictures, for example I do 'glass plate photography by ARTIST NAME' I try to make it so most categories don’t have too much media in them so people aren’t overwhelmed. I came up with the categories by thinking 'if I were looking for this media, which steps would I take?' I figured I’d first go to the museum [category] then look for the artist. I tried to categorize each item with two different categories: for example by who made the artwork, and what does the artwork represent, what artistic movement is this artist a part of. And I do a lot of exploring Commons for art, and I know there are different discovery paths."
p4 "It’s difficult to categorize them in commons. It’s not easy to find all the categories you need to add to an item to make it easy to find. The search tool works pretty nice because we have OCR on the books, but the frustrating thing is the categories. We have everything in one category for the project. After that we look for them under the author/book, if the book’s famous. There’s a lot of categories like 'this file is a PDF' but they don’t extend to all the items, so hopefully other categories will be added in time."
p2 "Categorizing was hard. I researched what other museums were doing. I did searches on basic terms and tried to understand what categories were being used by other museums. I used the Smithsonian first. Didn’t have that many categories. Tried to search the most basic term I could, see what was in those categories. I tried to find English equivalents for [Maori terms for artifacts] and had to qualify items under generic names/categories based on what other museums did. Example: putting up Maori sinkers (fish weights). Currently just categorized under fishing weights and Maori fishing culture. Couldn’t find a New Zealand archeology category. Couldn’t find English equivalent of Maori names for items. Some things I categorized as jewelry, even though they are more significant than that [in Maori culture]."
R_UhCRgWd0FjJKwr7 "Search for categories (in English for a non-English based GLAM...), or create a new category directly related to the upload."
R_2aDpIRqOGFNbzhG "I used AWB to add some items to specific categories (but I did most of the category adding by hand)."

Issues with the Commons category system were widely reported and often seriously affected the success of the GLAM project.

  • Finding relevant content categories. Many participants were unable to locate relevant categories for their content, even when they believed these categories existed (or, in some cases, confirmed that the categories existed when they were added later by volunteers). Strategies for finding relevant categories differed: some interview participants reported that they used the site Search feature to look for keywords associated with potentially relevant categories; others looked at the categories that had been added to similar media items (for example, pictures of other monuments) and selected those that seemed most relevant; still others attempted to find the right category by reading the category 'tree' directly, starting with a generally related high-level content category and browsing sub-categories until they found one with an appropriate level of specificity. In general, participants reported that they often avoided adding categories beyond those that were explicitly necessary (for example, a category for the GLAM organization, if it existed), or very general (e.g. a category for a country where the events depicted in a media item occurred).
  • Language of categories. The majority of GLAM projects discussed by interview participants took place in countries where the predominant language was not English. Although all interview participants were comfortable with spoken and written English, many GLAM project participants are not. Commons' lack of multilingual categories, and its policy of using English language variants for most common and proper names, made the process of identifying, creating, and using appropriate categories a struggle for many participants.
  • Selecting the right category. Selecting the correct category can be a fraught endeavor: participants reported that they were concerned about adding the 'wrong' category, and becoming the target of negative attention (although no participants reported actually being targeted or scolded). Even when an ambitious GLAM participant identified a promising-sounding category, it was not always clear that this category was in fact the 'right' category for a media item. For example, one interviewee reported that they initially used the Glass Plates category for a collection of photographic images on glass plates, only to discover later that this category is actually meant for dinner plates made of glass.

Issues with templates

Participant Issues reported
p11 "Normally, Upload Wizard would be fine. But we have to use the BASA-image template and include all of that information, so we have to use the old-type upload form. The image should be categorized at upload, so you have to add the template at upload. But that’s a barrier for participation for some people, who don’t get template syntax and are intimidated."
p6 "Creator, photographer templates. These had to be done manually. We couldn’t fish the information out of the API about the photographer who took the photo. So I took the most common photographers who had a red link and just created them."
p4 "I started with a template in Commons, go to WikiData, there is structure there. I checked the links and realized some of the links were broken, and information was wrong. I don’t think it has a validation process. Not in Commons, maybe in WikiData we have some validation, but still it’s pretty free to do what you want and make a lot of mistakes/inconsistencies. Definitely in dates (this is very specific). And maybe when I link to a place where the book was printed. The template will tell me it’s wrong, but it’s the only info I have. I link to external database and try to fix that. I do another book and see that it has the wrong author. You need to know which templates exist. They’re not in the upload wizard, it would be nice to be able to do that first, and not after. Right now, you need to do a lot of things manually. If we could automate it, like having templates in upload wizard, and having the information you put in those templates already in WikiData, that would be good."
p7 "In terms of the content, the issue is that we need to comply with metadata structure that we can follow. Metadata fields on Artwork template are not descriptive enough, don’t match the fields we’d like to map to/upload. When we chose the artwork template, there was no template for video content. We couldn’t choose between the infobox template or the artwork template, or any other template because they couldn’t find the mapping to be very clear."
R_2aDpIRqOGFNbzhG "I would have liked to see more features or suggestions for "best practices" for a mass upload, like having a creator template and stuff like that. I feel like I'm just guessing at what the best format is."
R_6s4jWKOBeeKv5xn "The Art templates and similar metadata templates on Commons were useful, but in future I hope it will be possible to share separate metadata about the image and about the thing itself (the political cartoon or map)."

Interview and survey participants reported similar issues with templates as with categories: the primary barriers to using these mechanisms for capturing media metadata in a structured way were awareness of the existence of templates, deciding which template to use, and the labor involved in creating new templates that describe the collection appropriately.

  • Awareness of templates. Simple awareness of the existence of relevant templates such as Book, Art, and Creator is the first challenge that GLAM participants face. None of these templates are called out in the upload tools. Unless the uploader is already very experienced, or knows where to ask for guidance (and how to frame the right question), most uploaders won't take advantage of the capabilities of these templates to capture relevant metadata about their media items.
  • Selecting an appropriate template. Even when GLAM participants are aware that relevant templates exist, selecting a template that fits both the type of media items in the collection and the available metadata is an additional challenge.
  • Creating custom templates. One option that some GLAM participants explored was creating new templates in order to describe the items in their collection. At least one participant (working on a recent GLAM project) decided against this option because they were aware that structured data functionality was coming to Commons in the near term, and they didn't want to put in the work necessary to create a new template that they thought would be rendered obsolete. They opted instead to make do with existing options and update their collection metadata after structured data was supported.
  • Linking to Wikidata. Several interview and survey participants were aware of Wikidata and how media items on Commons could be associated with Wikidata entities (such as GLAM institutions like museums or universities) through special templates. One interview participant who had performed this linkage on several GLAM projects expressed frustration that the process was so manual, requiring edits to both Commons and Wikidata, and in some cases the creation of the relevant Wikidata entity.
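
To make the template syntax that participants found intimidating more concrete, the following is a minimal sketch of how a batch uploader might generate template wikitext from a GLAM database record. The helper name `render_template`, the field names, and the sample values are all illustrative assumptions, not taken from the interviews; the parameters a given Commons template actually accepts should be checked against that template's documentation.

```python
def render_template(name: str, fields: dict[str, str]) -> str:
    """Render a MediaWiki template call as wikitext, one parameter per line.

    Empty values are skipped so that missing GLAM metadata does not
    produce blank parameters in the output.
    """
    lines = ["{{" + name]
    for key, value in fields.items():
        if value:
            lines.append(f" |{key} = {value}")
    lines.append("}}")
    return "\n".join(lines)


# Hypothetical record: every field name and value here is a placeholder.
record = {
    "title": "Example title",
    "institution": "Example Museum",
    "accession number": "ACC-0001",
    "wikidata": "",  # left empty: no Wikidata entity exists yet
}

print(render_template("Artwork", record))
```

Skipping empty fields is a design choice for this sketch: it keeps the generated wikitext short, at the cost of hiding which fields the source record lacked.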

Hacks and work-arounds

  • Capturing metadata in the item title. GLAM participants sometimes used the title field to capture metadata about each media item. Items in the same collection were given titles with consistent formats. One advantage of using titles for metadata, instead of categories or templates, is that filenames can be defined in the GLAM's preferred language without violating policy (unlike categories) and without needing to provide information in English first and then translate to the preferred language (as in templates and the description field). In one case, a GLAM contributor uploaded each scanned page of the book Een aardig prentenboek met leerzame vertellingen as a separate file, titled each file with the format Book-title-in-Dutch_Book-accession-number_Page-number.jpg, and used the book template on a category page to tie all the images together. Using this naming format made it much easier to find the images through Search. People from the GLAM or familiar with the GLAM's collection could search by the accession number, which was unlikely to change; the title was presented in the language that Dutch-speaking readers would expect; and individual pages in the book could be easily referenced and re-used.
  • Re-purposing fields from existing templates. Rather than creating new templates, which can be technically challenging and may be discouraged by Commons community members, some GLAM participants elected to 'hack' existing templates by using existing fields in new ways to fit in important metadata. One participant used the 'notes' field of the Artwork template to provide a list of keywords or tags (gleaned from the GLAM database) in order to make the items more searchable.
  • Creating compound categories. In cases where the correct category was not clear or could not be found, participants who were more comfortable with editing wikis sometimes elected to create new categories to ensure some sort of coverage. Often these were compound categories: categories that contained multiple pieces of metadata about the item or collection. In the example of 'glass plates' above, the participant eventually decided to create a set of nested, compound categories of the form media-type_media-creator_location, and then added those categories as children of the categories for each sub-component of the compound category: for example, the category Glass plates by Baldomer Gili i Roig at Museu d'Art Jaume Morera was created as a child of the existing categories Category:Glass_plates, Category:Baldomer_Gili_i_Roig, and Category:Collections_of_the_Museu_d'Art_Jaume_Morera. In other cases, ambitious GLAM participants created new multi-level category structures. This type of work is labor-intensive, and the labor is sometimes of dubious worth: compound categories and hierarchical relationships defined by GLAM participants are liable to be overwritten or extensively altered by Commons contributors later on, because the new categories do not conform with established standards or do not align with the preferences of other Commons volunteers. Even if the categorization scheme for a collection integrated in this way eventually stabilizes into a form acceptable to both the GLAM and the general Commons community, the outcome may not justify the amount of work involved, and unpredictable changes to the categorization of a collection can make tracking impact and reuse difficult.
  • Uploading to Flickr instead. One ingenious participant elected to upload the GLAM's collection to Flickr instead of Commons (but under a Commons-compatible license). Flickr offered an easier and more robust batch upload workflow at the time, and the GLAM participant (a Wikimedian in Residence) knew that community-created bots would eventually port these items to Commons. The GLAM participant also knew that Flickr's community would likely add additional metadata (such as tags) to these items in the meantime, and that new metadata could eventually be ported to Commons once the files had been transferred there. This work-around saved the GLAM project time at the point of upload, yielded valuable new metadata, and offloaded some of the responsibility of how to best capture that metadata to the Commons volunteers, but at the cost of delaying the availability of the collection on Wikimedia Commons by several months.
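
The title and compound-category conventions described in the list above can be sketched in a few lines of Python. This is only an illustration of the naming patterns participants reported, not a tool they used: the function names, the accession number `ACC-0001`, and the zero-padded page number are assumptions introduced here.

```python
import re


def slug(text: str) -> str:
    """Replace runs of whitespace with hyphens, leaving other characters intact."""
    return re.sub(r"\s+", "-", text.strip())


def page_filename(title: str, accession: str, page: int, ext: str = "jpg") -> str:
    """Build a Book-title_Accession-number_Page-number filename.

    Zero-padding the page number is an assumption so that files sort in page order.
    """
    return f"{slug(title)}_{accession}_{page:03d}.{ext}"


def compound_category(media_type: str, creator: str, institution: str) -> str:
    """Build a compound category of the form 'media type by creator at institution'."""
    return f"Category:{media_type} by {creator} at {institution}"


def parent_categories(media_type: str, creator: str, institution: str) -> list[str]:
    """List the existing categories the compound category would be filed under."""
    names = [media_type, creator, f"Collections of the {institution}"]
    return ["Category:" + name.replace(" ", "_") for name in names]


# "ACC-0001" is a hypothetical accession number used only for illustration.
print(page_filename("Een aardig prentenboek met leerzame vertellingen", "ACC-0001", 7))
print(compound_category("Glass plates", "Baldomer Gili i Roig", "Museu d'Art Jaume Morera"))
print(parent_categories("Glass plates", "Baldomer Gili i Roig", "Museu d'Art Jaume Morera"))
```

Generating names from the GLAM's own records keeps the format consistent across a collection, which is what made searching by accession number practical in the case described above.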