Linguistic merging

One of the greatest triumphs of the wiki way is the scale and efficiency at which people can pool their resources and work together. But the wiki way has its limits, and it hits them rather quickly when people do not speak the same language. Because of the language barrier, for example, a French and an English Wikipedian are forced to work apart. This is especially frustrating when some of these communities speak languages which are mutually intelligible (or at least mutually readable), and even more so when each is too small to support its own wiki.

This page aims to provide a central resource and discussion centre for dealing with the political and technical issues behind the linguistic merging of Wikipedias. Linguistic merging is when Wikipedians of similar languages decide to work together within a single wiki. The largest example of a linguistically merged wiki is the English Wikipedia, which effectively merges the US and UK variants of the English language. Perhaps the next largest is the Chinese Wikipedia, which merges Mainland and Taiwanese Mandarin, separated by a simple change of script.

The point of this page is not to argue that certain Wikipedias should merge (that is a decision for those communities), but to provide resources for dealing with the issues that would come up if they did. Also note that decisions can be applied differently by the various mergings, because languages can differ in different ways.

We classify issues into three overlapping categories: political, technical, readability.

political: Even if languages are actually the same from a linguistic standpoint, it is sometimes very important politically for them to be called by different names. A possible solution is to use all of these names simultaneously:

Welcome! This is the Wikipedia for the Dutch and Flemish languages.

technical: Wikipedias should have the option of having two domain names, for example, one for Indonesian (id) and another for Malay (ms). This might also be useful for the other technical issues below.

technical: Mainland Chinese uses a simplified script whereas Taiwanese Chinese uses a traditional one. Is it possible to convert between the two? If one-to-one mappings are not possible, does that necessarily mean that all hope for automation is lost?

Perhaps the different domain names could be useful for this. For example, if you create a page on the zh-tw domain (or if your preferences are set to Taiwanese), then the page has a linguistic flag set, so that it can be displayed differently or so that edits to the page are properly converted.
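To illustrate why the lack of one-to-one mappings need not end all hope of automation, here is a minimal sketch of one common idea: a per-character table plus a phrase table that is consulted first. The tables and the function name to_traditional are hypothetical and deliberately tiny; a real converter would need far larger tables and would still make mistakes.

```python
# Hypothetical, deliberately tiny tables for illustration only.
# The simplified character 发 corresponds to two traditional characters:
# 發 (as in 发展, "develop") and 髮 (as in 头发, "hair").
CHAR_MAP = {
    "头": "頭",
    "发": "發",  # default choice; wrong whenever the meaning is "hair"
}

# Phrase-level overrides, checked before the per-character table.
PHRASE_MAP = {
    "头发": "頭髮",  # "hair": here 发 must become 髮, not 發
}

MAX_PHRASE_LEN = max(len(phrase) for phrase in PHRASE_MAP)


def to_traditional(text: str) -> str:
    """Convert simplified text, preferring the longest known phrase."""
    out = []
    i = 0
    while i < len(text):
        # Try multi-character phrases first, longest first.
        for length in range(min(MAX_PHRASE_LEN, len(text) - i), 1, -1):
            chunk = text[i:i + length]
            if chunk in PHRASE_MAP:
                out.append(PHRASE_MAP[chunk])
                i += length
                break
        else:
            # No phrase matched: fall back to the per-character table,
            # leaving unknown characters untouched.
            ch = text[i]
            out.append(CHAR_MAP.get(ch, ch))
            i += 1
    return "".join(out)
```

With these tables, to_traditional("发展") uses the default character mapping, while to_traditional("头发") is caught by the phrase table. Anything the phrase table does not cover silently falls back to the default, which is where the remaining errors would come from.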

case studies: The Further Pitfalls and Complexities of Chinese to Chinese Conversion

Some articles might have different names in different languages. This can be handled with the page inclusion method.

technical/readability: How do we address the problem of different spellings? The English Wikipedia ignores it quite successfully (with occasional exceptions like the Guerilla UK spelling campaign :-) ) but maybe this is not an acceptable solution for other languages.

The lowest-tech solution (ignoring the problem) might be the most prudent. Otherwise we could use simple tables, but if the differences are systematic and large enough, some kind of morphological analysis might be in order. See Lexical alternates for a possible technological fix.

case studies: English (ignores it)
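As a rough idea of what the "simple tables" approach to spelling could look like, here is a minimal sketch. The table contents and the function name respell are assumptions made for illustration, not an existing wiki feature.

```python
import re

# Hypothetical spelling table (US -> UK); a real one would be much larger.
US_TO_UK = {
    "color": "colour",
    "center": "centre",
    "gray": "grey",
}

_WORD = re.compile(r"[A-Za-z]+")


def respell(text: str, table: dict) -> str:
    """Replace whole words found in the table, keeping an initial capital."""
    def swap(match):
        word = match.group(0)
        target = table.get(word.lower())
        if target is None:
            return word  # not in the table: leave the word alone
        # Only an initial capital is preserved; other casing is not handled.
        return target.capitalize() if word[0].isupper() else target

    return _WORD.sub(swap, text)
```

Note that respell("colors", US_TO_UK) leaves the word unchanged, because only exact table entries match. Inflected forms like this are the point at which a plain table stops being enough and some morphological analysis would be needed.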

readability: What happens when different languages simply use different words? One solution is simply to put clarification text the first time one of these words appears:

An elevator (a lift in British English) is blah blah blah. Elevators (now we don't bother to tell the user that this is a lift) blah blah blah

I would argue that a low-tech solution (such as ignoring the problem) would be prudent. Imagine if we had some kind of word choice converter for American and British English:

before                         after                          result
I parked my car by the curb    I parked my car by the kerb    ok
I should curb my smoking       I should kerb my smoking       NOT ok

Unless you are willing to run a part-of-speech tagger on your documents, the technical solution might not be a good idea. (But what happens when there are many differences?)
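The failure in the table above is easy to reproduce. The following sketch of a naive word choice converter (the names WORD_CHOICES and naive_convert are made up for illustration) replaces whole words without looking at their part of speech:

```python
import re

# Hypothetical word choice table (American -> British).
WORD_CHOICES = {"curb": "kerb"}

# Match any table entry as a whole word.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, WORD_CHOICES)) + r")\b"
)


def naive_convert(sentence: str) -> str:
    """Swap every table word, with no idea whether it is a noun or a verb."""
    return _PATTERN.sub(lambda m: WORD_CHOICES[m.group(1)], sentence)


print(naive_convert("I parked my car by the curb"))  # noun: conversion is fine
print(naive_convert("I should curb my smoking"))     # verb: conversion is wrong
```

Both sentences are converted, because the converter cannot tell the noun from the verb. Distinguishing them is exactly the job a part-of-speech tagger would have to do.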

technical/readability: How should interwiki links be displayed? Should we duplicate links for each language?

Where could linguistic merging possibly be done?

  • Serbian and Croatian?
  • Malay and Indonesian?
  • Chinese (simplified and Taiwanese)? [ some issues left ]
  • Portuguese and Galician - with some well-designed spelling correction, it would be quite readable.
  • Moldovan and Romanian (modern Moldovan is not written in Cyrillic)

