Community Tech/Allow categories in Commons in all languages

Tracked in Phabricator:
Task T120451

The Commons category project aims to make categories on Wikimedia Commons work in many different languages.

Rationale

Content on Wikimedia Commons is categorized to make it easier to find the right media to use. This is very helpful if you speak English, but if you don't, Commons becomes very difficult to navigate. It would be very good if categories worked in more languages. This was the sixth most popular suggestion in the 2015 Community Wishlist Survey, with 78 supporting votes.

Status

After evaluating various possible solutions, we've decided that the best long-term solution to this request is to add a multilingual tagging system via Wikibase (the same software that powers Wikidata). This will best facilitate doing complex searches and queries on Commons and will be language agnostic.

The Wikidata team is currently working on implementing structured data support for Commons (see task T68108 and task T125822). The Community Tech team did a deeper investigation into what we could do to assist Wikidata with the project. Unfortunately, the conclusion is that we won't be able to help in any meaningful way in 2016. The Wikidata team is still working through some foundational architecture issues, and adding more people at that stage of the project won't make it go any faster.

See ticket T121731 for the full evaluation.

We hope that we'll be able to help out with this project in the future, once it reaches a later stage. That won't happen in 2016, so for the purposes of the 2015 Community Wishlist Survey, we have to consider this wish referred to Wikidata.

Update, July 28, 2016: Wikidata has announced that they have a first demo of a new entity type, mediainfo, which can be attached to a file in Commons. It's still early days for the project, but this is an important milestone. The demo is at http://structured-commons.wmflabs.org

Technical discussion and background

Internal Community Tech team assessment

Support: Very high. Unanimous support votes for the concept; comments were about different ways to implement the idea.
Impact: High with structured data, Medium to Low for a straight translation. Searching and sorting images on Commons is already difficult and confusing in English. If we really want to make Commons work well across languages, then we may need to do more than take the confusing English categories and make them confusing multilingual categories. Incorporating structured data would be a more scalable long-term solution.
Feasibility: Difficult. There are hundreds of thousands of categories, maybe more than a million. Starting with A in Special:Categories, the 20,000th category on the list is Abraham. There are also a lot of overlapping categories, for example: Camels, Camelus, Camelus dromedarius, Camel anatomy, Camels eating, Drinking camels, Camel markets, Camel milk, Camels in art, etc. Can humans realistically make a dent in translating all these categories into every language? If the solution is creating duplicate categories in each language, it's daunting to think about how to manage up to 200 times the current number of categories. That being said, using structured data concepts is also difficult.
Risk: High. This needs scoping and consensus on how this could work, with the Commons community as well as Wikidata.
Status: This is an important problem, and we want to help figure out the best solution. Right now, the most promising line of thought seems to be the "concept tagging" that Wikidata could provide through a structured data platform. The idea is: tag images using concepts that exist in Wikidata, and then cross-reference concepts to find the images you're looking for. Made-up example: instead of having separate categories for Camels, Camel milk and Camel markets, you might be able to mark the images with the concepts "Camels", "Milk" and "Markets". Then you can search for images that have both Camels and Markets, and then drill down into concepts like specific locations. If the concepts are pulled from Wikidata, then they're already translated or translatable. We don't know for sure if that's going to be the ideal solution for this project, but we want to learn more when Wikidata has a working prototype. We'll have more to say as we learn more.
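
To make the made-up camel example a little more concrete, here is a minimal Python sketch of how concept tagging and cross-referencing might behave. It is only an illustration: the concept IDs (C1, C2, C3), the file names, the French labels and the helper functions find_concept and search are all hypothetical, and none of this reflects how Wikibase or the structured data project actually stores or queries anything.

```python
# Hypothetical concepts with labels in two languages.
# The IDs are invented for this sketch and are not Wikidata identifiers.
CONCEPT_LABELS = {
    "C1": {"en": "Camels", "fr": "Chameaux"},
    "C2": {"en": "Milk", "fr": "Lait"},
    "C3": {"en": "Markets", "fr": "Marchés"},
}

# Hypothetical files, each tagged with language-independent concept IDs
# instead of belonging to English-only categories.
FILE_CONCEPTS = {
    "camel_market.jpg": {"C1", "C3"},
    "camel_milk.jpg": {"C1", "C2"},
    "fish_market.jpg": {"C3"},
}


def find_concept(label, lang):
    """Return the concept ID whose label in `lang` matches, or None."""
    for concept_id, labels in CONCEPT_LABELS.items():
        if labels.get(lang, "").casefold() == label.casefold():
            return concept_id
    return None


def search(labels, lang):
    """Return the files tagged with every concept named in `labels`."""
    wanted = set()
    for label in labels:
        concept_id = find_concept(label, lang)
        if concept_id is None:
            return []  # unknown label in this language: nothing to match
        wanted.add(concept_id)
    # Cross-reference: keep files whose tags include all wanted concepts.
    return sorted(name for name, tags in FILE_CONCEPTS.items() if wanted <= tags)


english_results = search(["Camels", "Markets"], "en")
french_results = search(["Chameaux", "Marchés"], "fr")
```

Because the files are tagged with concept IDs rather than with text in one language, the English and French searches in this sketch resolve to the same concepts and return the same files. Drilling down would simply mean adding another concept, such as a location, to the list.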