Community Wishlist Survey 2020/Wiktionary/Context-dependent sort key

Context-dependent sort key

  • Problem: In most Wiktionary projects, words of different languages share a page if their spellings are identical. Currently, the magic word DEFAULTSORT applies to an entire page, which means we cannot define a default sort key for each language on the same page. This is an issue especially for Chinese, Japanese and Korean (hanja). They share characters, but their sort keys are totally different (radicals or pinyin for Chinese, kana for Japanese, hangeul for Korean). If a default sort key could be defined for each section, it would be much easier to categorize pages correctly.
  • Who would benefit: Editors of Wiktionary, especially those who edit Chinese and Japanese entries.
  • Proposed solution: Introduction of a new magic word, say, SECTIONSORT, that works for all categories after it up to the next usage of the same magic word. SECTIONSORT should override DEFAULTSORT if both are defined. The use of SECTIONSORT without a sort key should clear the previous sort key (and should not define an empty sort key).
  • More comments: See Community Wishlist Survey 2017/Wiktionary/Context-dependent sort key for a discussion in 2017. It is still a problem.
  • Phabricator tickets: phab:T183747
  • Proposer: TAKASUGI Shinji (talk) 12:19, 11 November 2018 (UTC)
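
The semantics in the proposed solution can be sketched in Python. This is only an illustration of the rules stated above, not MediaWiki code: SECTIONSORT does not exist, the regular expression is a simplified stand-in for the real parser, and the category names are made up. Two points are assumptions beyond the proposal: an explicit key on a category link (`[[Category:X|key]]`) wins over both magic words, and the last DEFAULTSORT on the page applies page-wide.

```python
import re

# Simplified, hypothetical tokenizer for the constructs this sketch needs.
TOKEN = re.compile(
    r"\{\{(DEFAULTSORT|SECTIONSORT)(?::([^}]*))?\}\}"
    r"|\[\[Category:([^\]|]+)(?:\|([^\]]*))?\]\]"
)

def resolve_sort_keys(title, wikitext):
    """Return (category, sort key) pairs under the proposed semantics."""
    # DEFAULTSORT is page-wide, so resolve it before walking the page.
    default_key = title
    for m in TOKEN.finditer(wikitext):
        if m.group(1) == "DEFAULTSORT" and m.group(2):
            default_key = m.group(2)

    section_key = None  # None means "no SECTIONSORT currently in effect"
    result = []
    for m in TOKEN.finditer(wikitext):
        word, value, category, explicit = m.groups()
        if word == "SECTIONSORT":
            # A bare SECTIONSORT clears the key rather than setting "".
            section_key = value or None
        elif category is not None:
            # Assumed precedence: explicit key, then SECTIONSORT, then DEFAULTSORT.
            result.append((category.strip(), explicit or section_key or default_key))
    return result

# Illustrative page text for the entry 日本 (category names are hypothetical).
EXAMPLE_PAGE = """
==Chinese==
{{SECTIONSORT:日00木01}}
[[Category:Mandarin proper nouns]]
==Japanese==
{{SECTIONSORT:にほん}}
[[Category:Japanese proper nouns]]
==Other==
{{SECTIONSORT}}
[[Category:Entries needing review]]
"""

if __name__ == "__main__":
    for category, key in resolve_sort_keys("日本", EXAMPLE_PAGE):
        print(category, "->", key)
```

In this sketch the last category falls back to the page title, because the bare SECTIONSORT cleared the Japanese key and the page defines no DEFAULTSORT.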


How will it be visible in a category? Sections can't be added to a category. --Wargo (talk) 21:48, 16 November 2018 (UTC)

Currently, one adds a sort key to an entire page. The goal of this proposal is to allow more than one sort key per page: one per section; e.g. one sort key for the Chinese section of an entry, one sort key for the Japanese section of the same entry, etc. This is because the same word may not be sorted the same way in different languages, and Wiktionaries often have entries from multiple languages on the same page, as a page corresponds to a specific spelling (which may occur in multiple languages). -- Automatik (talk) 14:06, 20 November 2018 (UTC)
Notifying Wargo. Automatik (talk) 14:07, 20 November 2018 (UTC)

See also my somewhat related proposal (I keep missing the deadline) Community Wishlist Survey 2017/Archive/Allow multiple entries within each category. Urhixidur (talk) 13:30, 17 November 2018 (UTC)

  • I've been thinking a bit about this. The problem here is that you have multiple types (languages) of content inside a single page, with a single title. The page 日本#References, for instance (quoted as an example in the ticket), is English. Therefore, all categorisation of the page is based on the English title of the page (even though the title is not in the English language). This is a fundamental problem (a mismatch with the wiki page concept). It really means that the entire system should be changed to make use of MCR and specialised MW content handlers, so that more semantic information can be extracted from the page (like how Wikidata deals with different types of information in a single page). On top of that, a category could be in a certain language, and the category could use the correct sort key for a page by referring to the information of the applicable 'language section' inside the page. --TheDJ (talk • contribs) 11:25, 6 November 2019 (UTC)
    • To further clarify, the community has put meaning (a convention) into some of the content, which MW cannot capture for them. When you want software features that make use of those meanings, the meaning first has to be machine-extractable (at scale) before we can do things with it that go beyond 'a simple wiki page that complies with the assumptions of the original Wikipedia'. --TheDJ (talk • contribs) 11:28, 6 November 2019 (UTC)
      If I got your idea right, you are saying that "Page content language" in Page information should be able to deal with more than one language, through specific tagging in the page or by using a template for language section titles. Then, the ordering for each language could be fixed in MediaWiki. I think this is another way to solve the same issue, and maybe a more MediaWiki-centered one. Noé (talk) 10:05, 9 November 2019 (UTC)
      This is not what I understand. For en.wikt, the "Page content language" is always English (for apple as well as for pomme or Apfel), for fr.wikt, it's always French, etc. Anyway, there is no such issue with the "multiple collations" proposal. Lmaltier (talk) 13:57, 10 November 2019 (UTC)
  • This proposal seems to become useless if the "Multiple collations per site" proposal is adopted (i.e. a magic word stating the language for each category). Or am I missing something? Lmaltier (talk) 20:27, 8 November 2019 (UTC)
    It is mainly for Japanese and optionally for Chinese and Korean (hanja). You cannot generate a correct sortkey for each language in a page of Chinese characters. In the example above, the correct sortkey for 日本 is “にほん” for Japanese and “일본” for Korean. You can have only one default sortkey now. -- TAKASUGI Shinji (talk) 23:08, 10 November 2019 (UTC)
    That concerns far more languages than Japanese, Chinese and Korean. For example, Ásia shouldn't get the same sort key in Portuguese and in Northern Sami. Unsui (talk) 13:36, 21 November 2019 (UTC)
  • Using the "multiple collations" proposal together with a language-dependent sortkey seems to me a more correct solution than a context-dependent sortkey. (This would probably require a magic word that contains a language code and a sortkey, that specifies the sortkey to be used in categories of that language. I recall reading a discussion about such a magic word on Phabricator, but can't find where that was.) Then instead of specifying a section-dependent sortkey you would specify a language-dependent sortkey: the sortkey for Japanese categories (such as "Japanese nouns"), for Korean categories, for Chinese categories.

    This is more correct, I think, because the sortkey actually depends on the language of the category. It's only by convention or because of practical considerations (for instance, that headword-line templates include category links in them) that categories for a given language are added in that language's section.

    If categories are added by mistake to the wrong section (usually at the bottom of the page), a context-dependent sortkey would be applied to the wrong category, whereas with a language-dependent sortkey, the wrong sortkey would be applied if the category is classified under the wrong language or under no language.

    TAKASUGI Shinji, do you think that the language-dependent sortkey idea would work as a solution? I could be missing some details about how CJK(V) entries work. Erutuon (talk) 20:23, 1 December 2019 (UTC)

@Erutuon: unfortunately it doesn’t work for Japanese or Chinese. The sort key for 日本 in Japanese is にほん while that for Mandarin is Rìběn (or 日00木01 based on strokes, depending on each Wiktionary project's policy). It is impossible to generate correct sort keys algorithmically for all the entries. You need to give a sort key manually in some cases, but you can’t have two default keys in a page now. We need both of the two proposals. -- TAKASUGI Shinji (talk) 23:59, 1 December 2019 (UTC)
Hmm, it doesn't sound like a problem for the language-specific sortkey idea. In the case that you describe, there could be a magic word like {{LANGUAGESORT:cmn|Rìběn}} or {{LANGUAGESORT:cmn|日00木01}} in the 日本 entry to set the sortkey for all Mandarin categories and {{LANGUAGESORT:ja|にほん}} to set the sortkey for all Japanese categories. There would have to be another magic word on the pages for the Mandarin and Japanese categories (for instance Category:Mandarin lemmas, Category:Japanese lemmas) to indicate the language, like {{LANGCAT:cmn}} and {{LANGCAT:ja}}. Then, when there are multiple Mandarin categories in the Chinese section (and multiple Cantonese, Wu), as is true in the English Wiktionary, the sortkey doesn't have to be specified for each category link, and the categories for each Chinese language do not have to use the same sortkey even though they are in the same section. This actually is different from the "multiple collations" proposal, but it might be compatible with it. Erutuon (talk) 00:55, 10 December 2019 (UTC)
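
For comparison, here is a rough Python sketch of how the language-dependent idea might resolve sort keys. LANGUAGESORT and LANGCAT are only the hypothetical magic words named in the comment above. The `category_languages` mapping stands in for a `{{LANGCAT:code}}` declaration on each category page, the regular expressions are a simplification of real wikitext parsing, and falling back to the page title when no language applies is an assumption.

```python
import re

# Hypothetical syntax taken from the discussion: {{LANGUAGESORT:code|key}}.
LANGUAGESORT = re.compile(r"\{\{LANGUAGESORT:([^|}]+)\|([^}]*)\}\}")
CATEGORY = re.compile(r"\[\[Category:([^\]|]+)(?:\|([^\]]*))?\]\]")

def resolve_language_sort_keys(title, wikitext, category_languages):
    """Map each category on the page to the sort key it would receive.

    category_languages maps a category name to a language code, standing in
    for a {{LANGCAT:code}} declaration on that category's page.
    """
    # One key per language code, regardless of where it appears on the page.
    language_keys = {code.strip(): key for code, key in LANGUAGESORT.findall(wikitext)}
    result = {}
    for name, explicit in CATEGORY.findall(wikitext):
        name = name.strip()
        language = category_languages.get(name)  # None if no LANGCAT declared
        # Assumed precedence: explicit key on the link, then the language key,
        # then the page title.
        result[name] = explicit or language_keys.get(language, title)
    return result

# Categories placed at the bottom of the page, outside any language section.
EXAMPLE_ENTRY = """
{{LANGUAGESORT:cmn|Rìběn}}
{{LANGUAGESORT:ja|にほん}}
[[Category:Mandarin lemmas]]
[[Category:Japanese lemmas]]
[[Category:Pages with references]]
"""

EXAMPLE_LANGCATS = {"Mandarin lemmas": "cmn", "Japanese lemmas": "ja"}

if __name__ == "__main__":
    keys = resolve_language_sort_keys("日本", EXAMPLE_ENTRY, EXAMPLE_LANGCATS)
    for category, key in keys.items():
        print(category, "->", key)
```

Because the key is looked up from the category's language rather than from the position of the link, the categories here resolve correctly even though they all sit at the bottom of the page.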