Community Tech/Allow categories in Commons in all languages/Notes

Email from Johan, Dec 15

When you want to illustrate something, you typically go look for images on Commons using the category system. At the moment, this is only possible if you speak English, which is the language of the category system. Understandably, this is a pretty big issue for some communities.

Apart from the technical solutions, one thing to keep in mind is that languages do not always have words that correspond well to specific words in other languages. An example (not the best one, but the first that comes to mind) is that Swedish doesn't have a word for "grandmother" (or "grandfather", "aunt" or "uncle" for that matter) that covers both your father's mother and your mother's mother. You have to specify whether it's your maternal (mormor) or paternal (farmor) grandmother. "Grandmothers" would have to be translated as "maternal grandmothers and paternal grandmothers" (mormödrar och farmödrar). This is normally not an issue, but occasionally it would probably cause some problems. On the one hand, the same is true for Wikipedia articles already. On the other hand, articles can describe different concepts in different languages. Here we'd all have to try to describe one concept, whether it fits the language or not. We'd have to talk to Wikidata folks.

Notes from preliminary assessment meeting, Dec 17

Johan's got a good point -- would we want separate category trees for different languages? Put a delineating line in for some cases?

But building a new category system on top of the existing Commons is almost certainly out of scope. Asked whether they want separate category trees or translatable tags, everyone agreed that translatable tags are better and less confusing, even though in some cases it's going to be difficult to do an exact translation.

The tricky thing is - how to make them translatable, and how people would edit them. Ideally, the titles would be MediaWiki messages, then the category would have a key or an ID rather than a name. The key would map to a translatable MW message.
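
The key-to-message idea can be sketched as follows. This is only an illustration, not how MediaWiki's message system is implemented: the `MESSAGES` store, the key names such as `category-grandmothers`, and the English fallback are all assumptions made for the example.

```python
# Hypothetical message store: category key -> {language code -> title}.
# Keys and structure are illustrative assumptions, not MediaWiki internals.
MESSAGES = {
    "category-grandmothers": {
        "en": "Grandmothers",
        "sv": "Mormödrar och farmödrar",
    },
    "category-radio": {
        "en": "Radio",
    },
}


def category_title(key, lang, fallback="en"):
    """Return the category title for `key` in `lang`.

    Falls back to the `fallback` language, and finally to the raw key,
    when no translation exists.
    """
    translations = MESSAGES.get(key, {})
    if lang in translations:
        return translations[lang]
    if fallback in translations:
        return translations[fallback]
    return key


if __name__ == "__main__":
    print(category_title("category-grandmothers", "sv"))
    print(category_title("category-radio", "sv"))
```

The category itself is identified only by its key, so renaming or translating a title never changes which category a file belongs to.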

Tech challenges? The category table stores each category with an ID and a title. The title is the canonical name of the category page. We would want the title to also change based on user language. That's a significant departure from how wiki pages work -- although there are translatable pages, and you could do subpages.

But switching dynamically is tricky because of caching. We wouldn't want to change the URL because somebody looked at it.

The title could be a message key? In database it could be, but there's also the title at the top of the category page, which needs to be shown in their language.

It's possible to vary content based on language -- the content is still cached, but this fragments the cache. It negatively affects performance, but it's still possible. For example, you can use the int: parser function.
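
To make the fragmentation concrete, here is a small sketch. It assumes a hypothetical `RenderCache` keyed by page and language; it is not MediaWiki's parser cache. Every extra language a page is viewed in adds another cache entry and another render.

```python
class RenderCache:
    """Toy cache keyed by (page, language). Hypothetical, for illustration."""

    def __init__(self):
        self._entries = {}
        self.renders = 0  # how many times we had to render from scratch

    def get(self, page, lang, render):
        key = (page, lang)
        if key not in self._entries:
            self.renders += 1
            self._entries[key] = render(page, lang)
        return self._entries[key]

    def size(self):
        return len(self._entries)


def render(page, lang):
    # Stand-in for an expensive page render.
    return f"<h1>{page} [{lang}]</h1>"


if __name__ == "__main__":
    cache = RenderCache()
    for lang in ["en", "sv", "de", "en", "sv"]:
        cache.get("Category:Radio", lang, render)
    # One page, three distinct languages: three entries instead of one.
    print(cache.size(), cache.renders)
```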

Another q: How do we implement the interface? How do people translate title names? Can it be done on translatewiki? We could set the titles up as MW messages and not provide an interface on-wiki; people could translate them on translatewiki. That's probably the simplest approach to start with.

Is a one-to-one translation what people had in mind when they voted for this, or something more complicated? Looking at the discussion, there isn't an overall consensus on that. We need to make sure we have agreement on what people want.

Some people mentioned Wikidata support. Not sure how well that would work, because there isn't a one-to-one mapping. It would work for basic nouns -- people, cities, animals -- but there are many categories (most?) that are the cross-section of multiple concepts.

Example: Category:Radio -- there's a template at the top saying that the category is too crowded, and encouraging people to put images into subcats like Radio by country, History of radio, People associated with radio, Radio events, etc. These don't have a clear Wikidata mapping.

Multichill created a Next generation categories proposal, with a Commons Wikidata roadmap explaining how to tie categories to Wikidata concepts. We should talk to Multichill and look into these ideas.

We should also talk to Multimedia team, Language team.

For the future: A tagging system would be much easier. Rather than "images of children with dogs in black and white," each of those could be a tag, and then the category is the intersection of these tags.
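
A minimal sketch of "the category is the intersection of these tags", assuming a hypothetical in-memory mapping from file names to tags. The file names and tag names are made up for the example.

```python
from collections import defaultdict


def build_index(file_tags):
    """Invert {file: tags} into {tag: set of files}."""
    index = defaultdict(set)
    for filename, tags in file_tags.items():
        for tag in tags:
            index[tag].add(filename)
    return index


def files_with_all(index, *tags):
    """Return the files that carry every one of the given tags."""
    if not tags:
        return set()
    return set.intersection(*(index.get(tag, set()) for tag in tags))


# Hypothetical sample data.
FILE_TAGS = {
    "a.jpg": {"children", "dogs", "black and white"},
    "b.jpg": {"children", "dogs"},
    "c.jpg": {"dogs", "black and white"},
}

if __name__ == "__main__":
    index = build_index(FILE_TAGS)
    print(files_with_all(index, "children", "dogs", "black and white"))
```

With this shape, only the individual tags need translating, not every combination of them.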

Note: There are hundreds of thousands of categories. Starting with A in Special:Categories, the 20,000th category on the list is Abraham. There are also a lot of overlapping categories, e.g. Camels, Camelus, Camelus dromedarius, Camel anatomy, Camels eating, Drinking camels, Camel markets, Camel milk, Camels in art, etc. Is it realistic for humans to make a dent in translating all these categories into every language?

Dev Summit notes, Jan 4-5

Talked to Lydia. The big problem: There are x million images. It's hard to sort and search in English; in another language, even harder. But just replacing the English words with other languages doesn't really solve the problem -- it's confusing in English too. A real solution would actually make it easier to sort and search images.

Wikidata is working on adding structured data to Commons, using "concept tagging". (Not "tags", which are seen as freeform and not translatable. These will be tied to the Wikidata concepts.)

They're going to work on support for Commons this quarter, no detailed plans yet. Commons:Structured data has documentation on the Commons/Wikidata work.

They hope to have a rough prototype demo in three months (?), a useable version in 6-9 months.

Ryan says: Community Tech should also discuss other options so we can give a full report to the community. One option: A bot that adds categories in every translation. (There are overlaps -- in German, "gift" means poison.) Another option: a gadget that replaces the category name when you view it, translates on the fly. (Wouldn't help with search, and where would it pull the translation from?)
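
The overlap problem with the bot option can be sketched like this. The concept keys and the translation table are hypothetical; only the "Gift" example comes from the notes above.

```python
from collections import defaultdict


def find_collisions(translations):
    """Find titles that would be claimed by more than one concept.

    `translations` maps a concept key to {language code: title}.
    Returns {casefolded title: set of (concept, language)} for every
    title shared by at least two different concepts.
    """
    seen = defaultdict(set)
    for concept, titles in translations.items():
        for lang, title in titles.items():
            seen[title.casefold()].add((concept, lang))
    return {
        title: entries
        for title, entries in seen.items()
        if len({concept for concept, _ in entries}) > 1
    }


# Hypothetical translation table.
TRANSLATIONS = {
    "gifts": {"en": "Gift"},
    "poison": {"en": "Poison", "de": "Gift"},
}

if __name__ == "__main__":
    for title, entries in find_collisions(TRANSLATIONS).items():
        print(title, sorted(entries))
```

A bot that created one category per translated name would need some rule for these collisions, since a page title can only point to one category.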

Talk to Multichill about ideas for Next generation categories and Commons Wikidata roadmap.

Wikidata/WMF sync, March 2

Wikidata is currently working on structured data for Commons, and they plan to have a first bare-bones prototype by the end of March. The current tickets:

  • T68108: (Epic) Store media information for files on Wikimedia Commons as structured data
  • T125822: (Epic) Basic first prototype for structured data support for Commons

There are several WMF teams interested in helping with this project, including Community Tech. We may be able to work on some of these tasks next quarter (the quarter starting in April).

Possible tasks:

directly helping

  • allowing content handler to expose structured data to the search engine: T89733 (Discovery team will start working on this one)
  • ability to use items/properties from Wikidata to make statements on other wikis: T76007
  • a new Wikibase datatype for smart URIs: T127929
  • Multi-Content Revisions T107595 (in close collaboration with Daniel)
  • Thoughts/concepts on integration of query and search in the context of multimedia meta data

indirectly helping

  • Add comments on associated namespaces RfC: T487
  • make Parser::getTargetLanguage aware of multilingual wikis: T114640
  • Per-language URLs for multilingual wikis: T114662

Wikimania notes, June 25

Niharika's notes, after talking with Lydia:

The main thing they want to do is add a new Type on Wikidata; right now we only have Items and Properties. They want a new type for storing media info. I asked her if it would be similar to Item and she said it'll include some of the Item properties but also some new ones, which is why it needs a new type.

Ability to use items/properties from Wikidata to make statements on other wikis (T76007)

If I upload a picture of a Mango tree on Commons, I should be able to pick what kind of tree, what color etc. from Wikidata options (sort of like an auto-complete interface for specific properties the user chooses). If it's something new, the data has to first go on Wikidata and then can be used on the wikis.

A new Wikibase datatype for smart URIs (T127929)

Wikibase (the software that Wikidata runs on) supports a fixed set of data types as of now. They want support for a new data type (T127929): basically, we accept the user's profile link (for a bunch of possible sources) and display only the relevant handle while retaining the URI link underneath.
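
A rough sketch of the "show the handle, keep the URI" behaviour. It assumes the handle is the last path segment of the profile link, which will not hold for every source; the function name and the example URL are hypothetical and not part of Wikibase.

```python
from urllib.parse import urlparse


def display_handle(uri):
    """Return (handle, uri): the text to display and the link underneath.

    Assumption for this sketch: the handle is the last non-empty path
    segment. If the URI has no path, the host name is shown instead.
    """
    parsed = urlparse(uri)
    segments = [segment for segment in parsed.path.split("/") if segment]
    handle = segments[-1] if segments else parsed.netloc
    return handle, uri


if __name__ == "__main__":
    handle, link = display_handle("https://example.org/users/ExampleUser")
    print(handle, "->", link)
```

A real datatype would likely need per-source rules, since each site structures its profile URLs differently.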

Multi-Content Revisions (T107595, in close collaboration with Daniel)

Need to talk more to Daniel about this one.

Thoughts/concepts on integration of query and search in the context of multimedia metadata

Ability to run complex searches from the wiki itself. For example: {{dog:white|male|poodle}} should turn up all images of dogs with those specifications. The syntax and logistics of this task are still up in the air and possibly dependent on the first task being completed.
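
Since the syntax is undecided, the following is only a sketch of how the example above could be read: a subject before the colon, then pipe-separated attributes. The `parse_query` and `matches` helpers and the record layout are assumptions made for illustration.

```python
def parse_query(text):
    """Split '{{subject:attr1|attr2}}' into (subject, [attr1, attr2])."""
    inner = text.strip()
    if not (inner.startswith("{{") and inner.endswith("}}")):
        raise ValueError("query must be wrapped in {{ }}")
    inner = inner[2:-2]
    subject, _, rest = inner.partition(":")
    attributes = [part.strip() for part in rest.split("|") if part.strip()]
    return subject.strip(), attributes


def matches(record, subject, attributes):
    """True if the record has this subject and all requested attributes."""
    return record["subject"] == subject and set(attributes) <= record["attributes"]


# Hypothetical image records.
RECORDS = [
    {"file": "a.jpg", "subject": "dog", "attributes": {"white", "male", "poodle"}},
    {"file": "b.jpg", "subject": "dog", "attributes": {"white", "female", "poodle"}},
]

if __name__ == "__main__":
    subject, attributes = parse_query("{{dog:white|male|poodle}}")
    print([r["file"] for r in RECORDS if matches(r, subject, attributes)])
```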

Conclusion, July 21

Niharika has written up an evaluation of what the Community Tech team can do to assist Wikidata on this project: T121731. Unfortunately, the conclusion is that we won't be able to help in any meaningful way this year. The Wikidata team is still working through some foundational architecture issues to support multilingual content tags.