Wikimedia Blog/Drafts/The Open Access Media Importer

This post is now available at

Title

  • Suggestions
    • Making scientific multimedia files routinely available on Wikimedia Commons
    • Wikimedia Commons as a repository of scientific multimedia files
    • The Open Access Media Importer makes scientific multimedia files available on Wikimedia Commons
    • The second life of scientific multimedia files on Wikimedia Commons
    • PubMed Central's openly licensed multimedia files are now on Wikimedia Commons
    • Mutual enrichment of scientific multimedia and Wikimedia platforms
    • ?

Keywords

Text

On Wikimedia projects, audio and video content has traditionally taken a back seat to text and static images (however, changes are underway). Conversely, more and more scholarly publications come with audio and video files, though these are typically relegated to the "supplementary material", a legacy of the print era, rather than embedded next to the relevant text passages. A rising number of these publications are Open Access, i.e. freely available under Creative Commons licenses that allow the materials to be reused in other contexts.

Why not enrich thematically related Wikimedia pages with such multimedia files? That's where the Open Access Media Importer (OAMI) comes in. It makes scientific video and audio clips accessible to the Wikimedia community and a broader public audience. The OAMI is an open-source program (or 'bot') that crawls PubMed Central — a full-text database of over 3 million biomedical research articles — and extracts multimedia files from those publications in the database that are available under Wikimedia-compatible licenses.
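
The license filter behind that last step can be sketched in a few lines. This is a minimal illustration, not the OAMI's actual code: the function name `is_commons_compatible` and the URL-matching approach are assumptions made for this example. It relies only on the fact that Wikimedia Commons accepts CC BY, CC BY-SA and CC0, but not licenses with NonCommercial (NC) or NoDerivatives (ND) restrictions.

```python
from urllib.parse import urlparse

# License path prefixes treated as acceptable in this sketch:
# CC BY, CC BY-SA and the CC0 public-domain dedication.
ACCEPTED_PREFIXES = ("/licenses/by/", "/licenses/by-sa/", "/publicdomain/zero/")


def is_commons_compatible(license_url: str) -> bool:
    """Return True if the URL names a Creative Commons license
    without NonCommercial or NoDerivatives restrictions.

    Hypothetical helper for illustration only."""
    parsed = urlparse(license_url.strip().lower())
    if parsed.netloc not in ("creativecommons.org", "www.creativecommons.org"):
        return False
    # Normalize the path so that ".../by/3.0" and ".../by/3.0/" match alike.
    path = parsed.path if parsed.path.endswith("/") else parsed.path + "/"
    return path.startswith(ACCEPTED_PREFIXES)


if __name__ == "__main__":
    for url in (
        "http://creativecommons.org/licenses/by/2.5/",
        "http://creativecommons.org/licenses/by-nc/3.0/",
    ):
        print(url, is_commons_compatible(url))
```

Matching on the license URL rather than on free-text license statements keeps the check strict: anything that is not recognized is rejected, so a file with ambiguous licensing is skipped rather than uploaded.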

Over 700 OAMI-contributed media files are currently used in Wikipedia and other Wikimedia projects. This X-ray video of a breathing American alligator, originally published by Claessens et al. (2009) in PLOS ONE, currently illustrates the "Respiratory system" entries in the Bulgarian, Chinese, English, German, Russian, and Serbo-Croatian Wikipedias.

Reuse-friendly license terms are the key ingredient in making scholarly materials useful beyond the article in which they were originally published. However, OAMI aims to make these materials even more useful by making them accessible:

  • in places where people actually look for them (Wikimedia platforms are a prime example),
  • in one coherent format (in our case Ogg Vorbis/Theora, which isn't encumbered by patent restrictions), and
  • in a way that allows for collaborative annotation with relevant metadata. This makes it a lot easier to browse and search the media files.

Status and Statistics

Since the first tests with the bot in mid-2012, the number of video files on Commons has more than doubled from about 15k to well over 38k (of which about 10% use the WebM format, the rest Ogg Theora). Much of this increase is due to the OAMI, which has uploaded over 14k files so far, mainly videos, with a few hundred audio files among them. According to the BaGLAMa tool, over 700 of these files are currently used across more than 50 Wikimedia projects, which exposes them to a total of about 3 million page views a month. That is a scale most of them would never have reached via the supplementary materials in which they were originally published.

After the bot was approved for automated uploads in October last year, its initial focus was on importing multimedia files from previously published articles, while in parallel it already processed new ones as they came in. Since mid-2013, the focus has shifted: with the exception of about a hundred files that failed to convert or upload properly, the import of backfiles has been completed, so the bot is now chiefly processing files from newly published articles, several hundred per month.

This video is one of over 14,000 multimedia files that OAMI uploaded to Wikimedia Commons: Macromolecular juggling to tunes based on music by Ernst Toch, whose entry on the English Wikipedia it currently illustrates. It was originally published by Lorenz et al. (2013) in BMC Biology.

Contributors

The Open Access Media Importer is being developed by a team of three: Daniel Mietchen (project lead), Nils Dagsson Moskopp (software development), and Raphael Wimmer (infrastructure). Its initial development has been supported by a grant from Wikimedia Germany e.V. Once the materials are uploaded to Wikimedia Commons, the community there helps improve categorization and file descriptions, fix file conversion or thumbnail problems, and rename or crop files. On that basis, Wikipedia editors looking for illustrations can readily find these materials and incorporate them into the articles they work on. WikiProject Open Access helps with that too, and it has featured a number of OAMI-provided files in its Open Access File of the Day series, which highlights files from Open Access sources that are used multiple times in Wikimedia contexts.

Unforeseen but welcome side effects

Operating the bot uncovered inconsistencies in the metadata that publishers deliver to PubMed Central along with their articles. These issues range from license information to keywords and MIME types, and they have reached a scale that led not only to some adjustments of PubMed Central's workflows but also to a conference paper that was to be presented on Tuesday at JATS Con, a conference on the XML standard that publishers use to exchange information about the content of their publications. Due to the recent shutdown of the US government, the conference had to be canceled, but a JATS user meeting will take place that same day, where an abbreviated version of the talk will be presented.
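
To give a flavor of what a license metadata check can look like, here is a minimal sketch. It is not the OAMI's actual code, and the two sample documents are made up for illustration. It assumes that a JATS article records its license on a `<license>` element inside `<permissions>`, ideally with a machine-readable `xlink:href` attribute, and it flags the case where the license is given only as free text. The function name `license_urls` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List

# Fully qualified name of the xlink:href attribute as ElementTree exposes it.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def license_urls(article_xml: str) -> List[str]:
    """Collect the xlink:href values of all <license> elements.

    Hypothetical helper for illustration only."""
    root = ET.fromstring(article_xml)
    return [el.get(XLINK_HREF) for el in root.iter("license") if el.get(XLINK_HREF)]


# Made-up sample: the license is given as a machine-readable URL.
WITH_HREF = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front><article-meta><permissions>
    <license xlink:href="http://creativecommons.org/licenses/by/3.0/">
      <license-p>Distributed under the Creative Commons Attribution License.</license-p>
    </license>
  </permissions></article-meta></front>
</article>"""

# Made-up sample: the license is stated in free text only.
TEXT_ONLY = """<article>
  <front><article-meta><permissions>
    <license>
      <license-p>This is an open-access article.</license-p>
    </license>
  </permissions></article-meta></front>
</article>"""

if __name__ == "__main__":
    for name, xml_text in (("with href", WITH_HREF), ("text only", TEXT_ONLY)):
        urls = license_urls(xml_text)
        status = urls[0] if urls else "no machine-readable license, needs review"
        print(f"{name}: {status}")
```

A bot cannot safely interpret a free-text statement such as the second one, which is why records of this kind have to be set aside for manual review or corrected at the source.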

While converting all multimedia files to Ogg Vorbis/Theora, OAMI encounters a wide range of codecs and container formats. Some of these, e.g. animated GIFs inside a QuickTime container, were not initially supported by the GStreamer library used for conversion. Bug reports were submitted to the developers and resulted in several code changes that benefit all GStreamer users.

Another side effect is that the bot has been nominated as a finalist in the first year of the Accelerating Science Award Program, whose award ceremony is to take place today as part of the kick-off event for this year's Open Access Week. This nomination further increases the visibility of Wikimedia Commons within the scientific community. Video interviews with all six finalists are available on Commons.

While the majority of OAMI-contributed media files are from biomedical research articles, there are also videos from nearly every field of research, such as this video of a driver assistance system, originally published by Morales et al. (2013) in the journal Sensors, which now illustrates the article about vehicle reversing on the English Wikipedia.

Future plans

The bot was designed to be extensible to sources other than PubMed Central (manual tests with materials from Dryad and from PANGAEA have already been performed), to media types other than audio and video, and to target sites other than Wikimedia Commons. Work on a derived pipeline for exporting these videos to YouTube has started. We welcome anyone who wants to submit patches or plugins, e.g. to make other collections of openly licensed media more accessible (CERN just started to release materials under an open license), to use video tagging for improving the file description pages and their categorization, or to suggest Wikipedia articles where a given file might fit in. Of course, the code for the bot is free and open-source software, and forks to build something cool based on the OAMI (e.g. games or citizen science projects) are most welcome.

This recording of an advertisement call of the frog Atelopus franciscus was originally published by Boistel et al. (2011) in PLOS ONE and now enriches articles about the species or its genus in eight Wikipedias.

Daniel Mietchen (WikiProject Open Access) and Raphael Wimmer

Notes

  • Discoverability
    • files:
      • embeds in Wikimedia pages (or, via InstantCommons, MediaWiki pages more generally)
      • search engines
    • categories and pages
      • links from within Commons, especially its category tree
      • links from other Wikimedia projects
  • Problems with reuse:
    • language: most of the materials are English-only
    • from an embedded multimedia file (especially in gallery environments), it is not always obvious how to get to the corresponding file page on Commons
    • focus (e.g. concatenated videos about different species)
    • video metadata often provided in individual frames rather than via the video container's metadata or the video's captions in the original scholarly article

  • Wikimedia Commons is not widely known as a repository of PDF files, but it has more than 215,000 of them. That number is dwarfed by the more than 18 million image files but is about the same as the number of all audio and video files combined.
  • The overall number of files on Wikimedia Commons currently stands at 17 million, which is about twice the number of topics covered across all Wikipedias.
  • Half a dozen Wikipedias have more than a million articles.

Examples

The top ten most embedded items

Zenaida macroura vocalizations - pone.0027052.s009
Taeniopygia guttata song - pone.0025506.s001
Atelopus franciscus male territorial call - pone.0022080.s002
Archilochus alexandri (Black-chinned hummingbird) vocalizations - pone.0027052.s003
Carpodacus mexicanus vocalizations - pone.0027052.s006
X-ray video of a female American alligator (Alligator mississippiensis) while breathing - pone.0004497.s009
Spizella passerina vocalizations - pone.0027052.s005
Praealticus labrovittatus jumping - pone.0011197.s006

A hydrophilic termite (Schedorhinotermes sp.) attached to the surface of a wetted citrus leaf - pone.0024368.s007
An interview about the project

Other examples

  • Rare instruments or conditions
  • Animations
  • Interviews
  • Conference talks
  • Corrected -NC file

Further notes

File initially uploaded by the bot, then deleted because of unclear license information, and finally restored after the original file had been corrected to clarify the license information.