Talk:Abstract Wikipedia/Representation of languages


More on lists of languages


There are several places where lists of supported languages are defined:

  1. MediaWiki Message files - 376 languages (not a requirement for language support: see phab:T201461)
  2. MediaWiki i18n - 408 languages can be used in the interface; these files store the i18n of MediaWiki in various languages (the MediaWiki languages supported in the interface are the union of #1 and #2)
  3. Names.php - 458 languages; contains autonyms only; proposed to be phased out, see phab:T190129 (superset of both #1 and #2)
  4. language-data - 642 languages; contains autonyms (superset of #3)
  5. CLDR extension - names of 619 languages in 166 languages (not complete); some languages are not supported by MediaWiki; this is where translated language names come from
  6. CLDR local list - names of 222 languages; translation needs a patch (see phab:T231755); also contains some special codes (und, zxx) and overrides of CLDR names
  7. LanguageInfo API: the result is actually the union of Names.php + CLDR + CLDR local + wmgExtraLanguageNames (see the sketch after this list)
  8. IANA English language names in the Babel extension: includes 7859 languages but is used nowhere other than the Babel extension
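
For reference, a hedged sketch of calling the LanguageInfo API mentioned in #7, via the standard MediaWiki action API (action=query&meta=languageinfo); the endpoint and module are real, but the exact fields requested here are just an example, and the full list may require continuation requests:

```typescript
// Query the LanguageInfo API for language names and autonyms.
// Results come back in batches; a complete list needs `licontinue` handling.
const url =
  "https://www.mediawiki.org/w/api.php?action=query&meta=languageinfo" +
  "&liprop=autonym%7Cname&uselang=en&format=json&formatversion=2&origin=*";

fetch(url)
  .then((r) => r.json())
  .then((data) => {
    // data.query.languageinfo maps language codes to { autonym, name }
    console.log(data.query.languageinfo["zh"]);
  });
```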

(more comments to follow below soon) --GZWDer (talk) 00:49, 12 March 2021 (UTC)

Thank you! I referred to this enumeration from the front page. Please feel free to actually integrate this list there (or tell me to do it). --DVrandecic (WMF) (talk) 00:20, 27 March 2021 (UTC)

Using QIDs?


@DVrandecic (WMF): We can use Wikidata IDs for identifying languages, which may save us the need to invent a new kind of object. For well-known languages, there can be a hardcoded mapping between QID and code, eliminating the need to query Wikidata every time. --GZWDer (talk) 00:52, 12 March 2021 (UTC)
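
A minimal sketch of what such a hardcoded mapping could look like (the QIDs are the actual Wikidata items for these languages, but the table and function names are illustrative assumptions, not an existing API):

```typescript
// Hand-maintained QID-to-code table for well-known languages, so that no
// Wikidata query is needed at runtime for the common cases.
const languageCodeByQid: Record<string, string> = {
  Q1860: "en", // English
  Q150: "fr",  // French
  Q7850: "zh", // Chinese
};

function codeForQid(qid: string): string | undefined {
  return languageCodeByQid[qid]; // fall back to a Wikidata query if undefined
}
```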

  • Note that I do not support pre-emptively assigning ZIDs to each ISO 639-3 entry, as the standard is evolving and there may be duplicated or spurious language codes. --GZWDer (talk) 02:23, 12 March 2021 (UTC)

Using QIDs would mean that we start relying on QIDs, their stability and possibly representation for a very core component of Wikifunctions. I am happy to rely on QIDs for higher-level components, but the list of languages probably needs to be under local control of the Wikifunctions community. And I can totally see a future where the topics for d:Q7850 or d:Q1860 or d:Q9301 change their exact meaning over time in Wikidata.

We should definitely map to QIDs, though. --DVrandecic (WMF) (talk) 00:27, 27 March 2021 (UTC)

A language such as Toki Pona has no good ISO 639 entry ("mis-x-Q36846"!?), as far as I know. Wikidata lexemes are to some degree already relying on the language Q-items. — Finn Årup Nielsen (fnielsen) (talk) 23:06, 28 May 2021 (UTC)
That's good to know. Another reason to rely on our own catalog of languages. DVrandecic (WMF) (talk) 23:42, 25 June 2021 (UTC)

Redefine Z11


To me, a "monolingual text" is a text in some form of natural language. It is a type of string. An "English text" is a text in some form of English. It is a type of monolingual text. An "Oxford English text" is a text in the form of English mandated by the Oxford University Press. It is a type of English text. The string "internationalization" is an English text. It is also an Oxford English text and an American English text. Arguably, it is also a British English text. The more common British English text would be "internationalisation", which might equally be a French text. The important [fr, en] word is "type" [fr, en]. When we come to define natural language functions, we expect them to accept and return texts of a specific type (i.e., in a particular natural language). That type will not (or should not) generally be "monolingual text" but something more specific, like "English text" or "French text". If that is where we intend to end up, I suggest we should make Z11 a generic type. I suspect this helps with fallback, but that's a topic for another day. Thinking only about labelization, I don't know if it makes much difference, but it's still the right thing to do. --GrounderUK (talk) 13:49, 12 March 2021 (UTC)
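
To make the suggestion concrete, here is a minimal sketch of the idea using TypeScript's type parameters as a stand-in for Wikifunctions generic types; every name in it is an illustrative assumption, not an existing Wikifunctions object:

```typescript
// "Monolingual text" as a generic type: the language is part of the type,
// so functions can demand text of a specific language, not just any text.
type MonolingualText<Lang extends string> = {
  language: Lang;
  value: string;
};
type EnglishText = MonolingualText<"en">;
type FrenchText = MonolingualText<"fr">;

// A renderer typed to accept and return English text only (naive pluralization):
function pluralizeEn(noun: EnglishText): EnglishText {
  return { language: "en", value: noun.value + "s" };
}

const word: EnglishText = { language: "en", value: "internationalization" };
const plural = pluralizeEn(word); // OK; a FrenchText argument would not compile
```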

I think you should probably try to represent the writing system or variant as well. MediaWiki distinguishes between zh (roughly, Mandarin, but orthography unknown) and zh-Hans (Mandarin, written in Simplified characters) and zh-Hant (Mandarin, written in Traditional characters). Mandarin speakers are not typically literate in more than one of these orthographies, so generating zh-Hans for a zh-Hant reader would be unusable (as well as potentially a political faux pas). (Not to mention the political distinction between zh-yue and yue.)
Really this should be reusing BCP-47 instead of trying to reinvent the wheel, as there are other relevant distinctions drawn in BCP-47 which are useful: es-MX is in many concrete ways different from es-ES, and similarly for pt-BR and pt-PT. Using MediaWiki as a guide isn't terribly useful, as there are many social/political reasons for "how language groups are divided into Wikipedia projects" which aren't necessarily relevant to Wikifunctions. Languages which are distinct can be treated as mere "variants" of each other in order to efficiently pool resources; and conversely, languages which are "the same" can be separated into different projects for political reasons. (For example, a number of Indic languages are artificially divided into "Devanagari script" and "Arabic script" by the India/Pakistan border, despite the languages being fundamentally variants.)
I agree with the idea of making language a generic first-class type, so that some of these distinctions can be captured in code. In most cases a string which is es-ES can be processed by a function which accepts es, as the distinctions are mostly in vocabulary, not grammar -- but a function which takes zh is not likely to get very far without distinguishing zh-Hans and zh-Hant. Cscott (talk) 18:08, 17 January 2024 (UTC)
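
A hedged sketch of the kind of variant acceptance Cscott describes, using simple BCP-47 prefix matching (the function name and the rule itself are illustrative; real BCP-47 fallback, per RFC 4647, is more involved):

```typescript
// A tag is usable by a function declared for `base` if it equals the base
// or extends it with further subtags (script, region, variant...).
function isVariantOf(tag: string, base: string): boolean {
  return tag === base || tag.startsWith(base + "-");
}

console.log(isVariantOf("es-ES", "es"));   // true:  an `es` function may accept it
console.log(isVariantOf("zh-Hant", "zh")); // true:  but the script still matters here
console.log(isVariantOf("yue", "zh"));     // false: a different primary subtag
```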

Three Types and Z60 ZIDs


The main page talks about two "lists of languages", interface and target, and suggests separate ZID ranges for each. To me (and, obviously, I might be wrong), this implies two Z4/Types ("interface language" and "target language"). "English" as an instance of the "interface language" type would have one ZID and "English" as an instance of the "target language" type would have a different ZID. Both these objects would refer to (have a key with a value equal to the ZID of) the same Z60/natural language, which implies that "English" as an instance of the Z60/natural language type would have a third ZID, different from the other two. The main page, however, proposes to reserve only two ranges of ZIDs. I propose a naive hash for Z60 ZIDs, based (where possible) on the ISO 639-1 codes. Each of the two characters is translated into a two-digit numeric string from "01" through "26", and the result is prefixed with "Z1". Thus, for example, "en" becomes "Z10514" and "zh" becomes "Z12608". This obviously leaves a lot of guaranteed gaps, which can be used when there is no ISO 639-1 code (for example, "sco" might be translated to "Z11929" or "Z11933"). If we need to stick to 4-digit ZNumbers, we could use, say, 9 instead of a leading zero for the first character and prefix with just "Z". This would mean "en" would become "Z9514" and "zh" would become "Z2608" (and "sco" might be "Z1929"). This would overlap the "interface language" (instance) range and reduce the "target language" (instance) range, however.
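
A minimal sketch of this naive hash, assuming lowercase ISO 639-1 input (the function name is illustrative):

```typescript
// 'a' -> "01" ... 'z' -> "26"; concatenate the digit pairs and prefix with "Z1".
function z60Zid(code: string): string {
  const digits = [...code]
    .map((c) => c.charCodeAt(0) - "a".charCodeAt(0) + 1)
    .map((n) => n.toString().padStart(2, "0"))
    .join("");
  return "Z1" + digits;
}

console.log(z60Zid("en")); // "Z10514"
console.log(z60Zid("zh")); // "Z12608"
```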

Interfaces


Alternatively, relabel Z60 to "interface language" (or "label language", or...) and relabel Z11 to "interface text" (or "label", or...). We can worry about target languages later, but we should aim to have only one object with a value equal to any given code (for the same referent), and that's the role Z60 has, as proposed. Of course, we could look at an approach that is not specific to languages. See, for example, Talk:Abstract Wikipedia/Object creation requirements#Wikifunctions and Wikidata. At that time, I supposed that Wikidata would be our route out to external authorities, and I still suppose that will generally be the case. From a Wikifunctions viewpoint, however, Wikidata is just another external authority. And MediaWiki could be another. For each set of values we wish to adopt, we need a different Z4/Type, and for each value within the set we are adopting, we need an instance of that particular Z4/Type, containing the value. At the higher level, I would call these Z4/Types "interface types" (or just "interfaces"), and generic interface design is a task that we might bring forward. In particular, we should look at a generic validator, a generic type constructor and a generic instance constructor. --GrounderUK (talk) 01:04, 14 March 2021 (UTC)

Generics


Just to clarify, in the previous paragraph I used "generic" in the ordinary sense of the English word, not to refer to "generic types" in the Wikifunctions sense. So "generic type constructor" means a type constructor that does not construct only one Z4/Type. It need not be a Z8/Function. A (Wikifunctions) "generic type" is its own constructor (or the resultant Z4/Type). For a simple interface, we don't have to import all the valid values (which may not even be possible) because we can use a "generic type" to specify the applicable validation, which could be non-functional. If we need to specify the valid values explicitly, we can do so (non-functionally) within the validation or we can fully enumerate the type as persistent objects, as is proposed for the Z60. This is where we might use an "instance constructor". This constructs persistent objects of a particular type, but the same implementation can be re-used for many different types (so it's a generic constructor of instances whose type is not generic, I think). In effect (and I apologise in advance for using terms that I would usually avoid), such a fully enumerated type is a "whitelist". We can also use the same approaches to implement a "blacklist". That is, we can declare particular values to be invalid within the validation (non-functionally) or we can construct a persistent object for each invalid value, re-using our "generic instance constructor" where appropriate. --GrounderUK (talk) 11:28, 14 March 2021 (UTC)
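
A minimal sketch of a "generic instance constructor" in this ordinary-English sense (one implementation reused to build instances of many fully enumerated types); all names are illustrative assumptions, not Wikifunctions APIs:

```typescript
type Instance<T extends string> = { type: T; code: string };

// Build a constructor for one enumerated type from its type name and the
// explicit list ("whitelist") of valid values; reuse for any number of types.
function makeInstanceConstructor<T extends string>(type: T, valid: Set<string>) {
  return (code: string): Instance<T> => {
    if (!valid.has(code)) throw new Error(`"${code}" is not a valid ${type}`);
    return { type, code };
  };
}

const makeLanguage = makeInstanceConstructor("Z60", new Set(["en", "fr", "zh"]));
const english = makeLanguage("en"); // { type: "Z60", code: "en" }
```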

Reply


Thank you for your thoughts! I think that "English text(value: x)" is isomorphic with "Monolingual text(language: English, value: x)", so that shouldn't really make a difference. At the same time, we can use the language objects for other things too.
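
The isomorphism can be made explicit as a pair of inverse conversions, sketched here under the same illustrative types as above (nothing in this block is an existing Wikifunctions object):

```typescript
type Monolingual = { language: string; value: string };
type EnglishText = { value: string }; // the language is fixed by the type itself

// Round-tripping loses no information, which is the sense of the isomorphism.
const toEnglishText = (t: Monolingual): EnglishText => {
  if (t.language !== "en") throw new Error("not an English text");
  return { value: t.value };
};
const toMonolingual = (t: EnglishText): Monolingual =>
  ({ language: "en", value: t.value });
```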

@DVrandecic (WMF): Isomorphic but not equivalent ;) It's off-topic if we're not considering target languages. I'm happy to expand on this elsewhere, but I really do not expect the results of natural-language rendering functions to be Z11/monolingual text (although they might be converted to that representation as a final step). We could do it that way (because of the isomorphism), but I currently favour explicitly-typed text, particularly for intermediate results. Beyond that, I just favour consistency, all else being equal (and isomorphic). --GrounderUK (talk) 10:26, 27 March 2021 (UTC)

Regarding your points about the three types, yes, that triggered some more thinking about it, and yeah, we changed the model to have only one type, and not several. I hope we can get away with that :) Thanks! --DVrandecic (WMF) (talk) 02:09, 27 March 2021 (UTC)

Comment from Verdy P.


Note, however, that with ISO 639-3 alone, if we wanted to be able to represent all of its entries, we would need about 7000 ZIDs (a few of them are deprecated or merged, but they are kept as stable references in the IANA database for BCP 47), so the minimum needed range would be from Z1000 to Z8000, and maybe we should keep the whole range of 4-digit ZIDs (Z1000-Z9999) reserved.

Then it's up to us to define how to map these ZIDs for each ISO 639-3 entry, or each 2-letter or 3-letter entry for base language subtags in BCP 47. The mapping could easily be made arithmetically, because these 3-letter subtags are limited by their syntax and a limited set of letters. 2-letter subtags that may (or may not) be recommended for localisation instead of the ISO 639-3 codes kept as aliases in BCP 47 may be ignored by just focusing on the 3-letter subtags canonically mapped to them (there will be some difficulties, for example with Chinese, depending on whether we focus only on the written language or add distinctions for the oral languages, but also because there are important variants using script or variant subtags, which could need extra assignments outside the canonical space for 3-letter ISO 639-3-based subtags).

For example, each 3-letter subtag can be seen as a base-26 encoded integer, so add the first ZID Z1000 as the base and we instantly get a non-ambiguous mapping for 26^3 = 17576 possible subtags (so it would map to the range from Z1000 to Z18575, including the reserved ranges for special or local-use subtags like "qaa"). We immediately see that this simple mapping would need more than just 4 digits, as it would not be compact. So we could then reserve a larger range of 20,000 ZIDs for all languages we want to represent (e.g. Z10000 to Z29999), including additional space for our own special languages (Z10000 for a "neutral" root language, Z10001 to Z27576 for ISO 639-3 languages, or 3-letter codes also already assigned to ISO 639-5 for language groups, then Z27577 to Z29999 left for our own local use, e.g. for variants).
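
A minimal sketch of this base-26 mapping, assuming 'a' through 'z' map to 0 through 25 and Z1000 is the base (the function name is illustrative):

```typescript
// Read the 3-letter subtag as a base-26 integer and offset it by 1000.
function iso3ToZid(subtag: string): string {
  const n = [...subtag].reduce(
    (acc, c) => acc * 26 + (c.charCodeAt(0) - "a".charCodeAt(0)),
    0
  );
  return "Z" + (1000 + n);
}

console.log(iso3ToZid("aaa")); // "Z1000"
console.log(iso3ToZid("zzz")); // "Z18575" (1000 + 26**3 - 1)
```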

— The preceding unsigned comment was added by Verdy p (talk) 03:46, 12 March 2021 (UTC). It was NOT a comment but was made on the page while it was a "draft" (itself not signed either). In that case the page is also a talk thread and should have been signed! verdy_p (talk) 03:44, 16 March 2021 (UTC)

I oppose creating entries for all ISO 639-3 codes. As the codes are mostly stable, you can just use the code itself without inventing a new kind of ID. However, there are ISO 639-3 IDs that have been deprecated (merged with others or retired as spurious). --GZWDer (talk) 00:55, 13 March 2021 (UTC)
That's definitely NOT what I proposed. I just proposed to reserve a suitable and sufficient space for all languages we'll need now, and in the future.
The "estimated" space currently described in the page is clearly insufficient, and it is already inventing a new kind of ID (which will require its own maintenance)
And that's exactly what I wanted to avoid, by just using an automatic, predictable static assignment (the fact that some ISO 639 codes are deprecated is NOT relevant for this discussion).
In fact even ISO 639 is not the good reference, because it is unstable: translations are based on BCP 47, which has very long term stability guarantees that ISO 639 does not offer (the IANA database for language subtags), and still allows extensions with subtags (in addition to the leading subtag for the primary language or language family, coming partly from ISO 639, with a few additional special assignments). The IANA database correctly encodes how to use "deprecated" codes, but NO codes are removed; they remain valid even if some of them may be "replaced" (the database contains enough info about if and when this can be done automatically, but BCP 47 applications are never required to make such changes at any time, so existing translations tagged with BCP 47 are stable and usable "as is").
The alternative would be simply to NOT use any numeric ZID for each human language, but to represent them directly with BCP 47 tags (i.e. making them a type using another kind of ZID). This would require a minor change in the parser (but, still based on JSON, it would still use a short JSON string). Despite this, it still allows Z-Objects (of the special type "Natural language") to be created; only the encoding form of the ZID would be different, not "Znnnn" but directly the BCP 47 code. There would then be no need to manage a registry, and there's still the possibility to use it for page names on the wiki (possibly in its own namespace, or using a common prefix, which could be a numeric ZID followed by a "/", in the common namespace used by all other ZObjects). So the representation would just be a single numeric ZID, a slash "/" and the BCP 47 tag (such as "Z1000/en", "Z1000/fr", etc. instead of "Z1000", "Z1001", etc.). This alternative would also be easier to read in encoded form.
Why do all unique ZIDs assigned to ZObjects have to be limited to the "Znnnn" numeric form? This is not needed if some ZObject types are allowed to have their own ZID format.
And this remark applies to many other types of identifiers that don't need to be remapped to numeric "Znnnn" for creating a ZID (e.g. URIs, ISBN, ISSN: each type of identifier can have its own ZID datatype, and its instances can be "subpages" of the numeric ZID for the datatype; this could also apply to instances of enumerated types in a small set, such as "false" and "true", representable as "Z70/0" and "Z70/1" if "Z70" represents the Boolean type, or "Z80/0", "Z80/1" for representing the integer values zero, one...). As well, this would apply to Q-IDs if we want to reference Wikidata objects (no need to reinvent the mapping between Q-ID and Z-ID if they represent the same object).
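
A hedged sketch of parsing such compound IDs (the type ZIDs shown follow the examples above; the function and result shape are illustrative assumptions):

```typescript
type CompoundZid = { type: string; value?: string };

// Accept both plain "Znnnn" IDs and "Znnnn/value" compound IDs.
function parseZid(zid: string): CompoundZid | null {
  const m = /^(Z\d+)(?:\/(.+))?$/.exec(zid);
  return m ? { type: m[1], value: m[2] } : null;
}

console.log(parseZid("Z1000/en")); // { type: "Z1000", value: "en" }
console.log(parseZid("Z70/1"));    // { type: "Z70", value: "1" }
console.log(parseZid("Z60"));      // { type: "Z60", value: undefined }
```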
I don't like the idea of using the new numeric "Znnnn" form for all ZIDs (this is clearly not needed; this numeric form is only suitable as a default for new objects specific to the local WikiLambda instance, i.e. Wikifunctions if this is that instance), when in fact a ZID can be any form acceptable for a page title in a wikilink (possibly with its namespace), as long as it unambiguously and universally describes the object independently of the wikis or applications that will use our repository (so if namespaces are used, they should be described locally in our repository itself, defining its own ZID resolver for its known namespaces, each namespace being bound to a ZType).
Other common ZID namespaces that could be used: SI measurement units, ISO currency codes... In all these cases we no longer need to invent and maintain a local mapping to our default local numeric ZIDs. verdy_p (talk) 03:28, 16 March 2021 (UTC)

I'll re-read in detail, but I didn't see what to do with script (uz-Latn) or locale (en-IN) tags. The base-26 transformation, although interesting and explainable, wouldn't cover that. --DVrandecic (WMF) (talk) 00:37, 27 March 2021 (UTC)

Regarding the other points on introducing more namespaces: I think the advantages of having uniform ZIDs for the objects far outweigh those of an explicit mapping in the namespaces. In short, in that case we could just work on URIs directly instead of using our own namespace - which is probably something we should enable at some point, but not for objects which are so core to the inner workings of the system. --DVrandecic (WMF) (talk) 02:12, 27 March 2021 (UTC)
