Requests for comment/Switch default category collation to UCA collation with numeric sorting

The following request for comments is closed. Although there were very few responses to this RfC (despite it being open for a year and a half), there seem to be no significant objections to changing the default from "uppercase" to "uca-default-u-kn", except that Wiktionaries should be exempted (although opinions differed on this). Another suggestion, added by Bawolff, is to switch to the localized version of ICU where available at the same time that the default is changed.

Currently the default category collation for all wikis is "uppercase" (essentially a case-insensitive sort by code point). This collation is extremely basic: it neither groups letters such as "U" and "Ü" together nor puts numbers in numeric order (e.g. 99 before 100), which necessitates the widespread use of DEFAULTSORT keys. We now have more sophisticated collations available, and most of the larger wikis have already switched over to one of the better options. This RfC proposes that we change the default from "uppercase" to "uca-default-u-kn" (essentially a language-agnostic version of the Unicode Collation Algorithm with numeric sorting).

What is UCA collation?

The long answer is at http://www.unicode.org/reports/tr10/. The short answer is that it is the official standard for how to sort Unicode characters. The most noticeable difference is that UCA groups together letters that differ only in diacritics. For example, uppercase collation would sort articles in the following manner: Abbot, Aztec, Mango, Sabbeth, Ärsenik, Åland Islands; while UCA collation would sort them as: Abbot, Åland Islands, Ärsenik, Aztec, Mango, Sabbeth. There are several language-specific versions of UCA, but for the default it would be best to use the language-agnostic version.
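The difference between the two orderings can be sketched in a few lines of Python. This is only an illustration, not how MediaWiki or ICU implements collation: `uppercase_key` imitates the "uppercase" collation by comparing uppercased titles by code point, and `approx_uca_key` is a hypothetical stand-in for UCA that simply strips diacritics before comparing. Real UCA uses multi-level weights from the Unicode collation tables, so the approximation is only meant to reproduce the example above.

```python
import unicodedata


def uppercase_key(title):
    # Imitates the "uppercase" collation: uppercase the title and let
    # Python compare the result code point by code point.
    return title.upper()


def strip_marks(text):
    # Decompose accented letters (NFD) and drop the combining marks,
    # so that "Ä" and "Å" both reduce to "A".
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def approx_uca_key(title):
    # Hypothetical approximation of UCA grouping (an assumption for this
    # sketch): compare base letters first, then fall back to the original
    # title to break ties deterministically.
    return (strip_marks(title).casefold(), title)


titles = ["Mango", "Åland Islands", "Aztec", "Sabbeth", "Abbot", "Ärsenik"]

by_uppercase = sorted(titles, key=uppercase_key)
by_approx_uca = sorted(titles, key=approx_uca_key)

print(by_uppercase)
print(by_approx_uca)
```

With the uppercase key, "Ä" and "Å" land after "Z" because their code points are higher than those of the unaccented Latin letters; with the approximate key they are filed among the other titles beginning with "A".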

What is numeric sorting?

Under numeric sorting, pages would be sorted as such: 1, 2, 9, 10, 11, 20, 21, 99, 100. Under regular (uppercase) sorting, pages are sorted as such: 1, 10, 100, 11, 2, 20, 21, 9, 99. If numeric sorting is used, all pages starting with a number will be sorted together under a single header ("0–9") rather than under separate headers for whichever number each title begins with: "0", "1", "2", etc. Note that numeric sorting only works for unbroken sequences of digits. Digits separated by commas, periods, or spaces are treated as separate numbers.
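As a rough illustration of the behaviour described above, the following Python sketch compares plain string sorting with a numeric-aware sort key. The `numeric_key` function is a hypothetical helper written for this example, not the ICU implementation behind "uca-default-u-kn"; it treats each unbroken run of digits as one number and everything else as uppercased text.

```python
import re

# A capturing group makes re.split keep the digit runs in the result,
# so text chunks sit at even indices and digit runs at odd indices.
_DIGIT_RUN = re.compile(r"(\d+)")


def numeric_key(title):
    # Hypothetical sort key (an assumption for this sketch): every
    # unbroken run of digits is compared by numeric value, all other
    # text is compared as uppercased strings.
    key = []
    for index, chunk in enumerate(_DIGIT_RUN.split(title)):
        if not chunk:
            continue
        if index % 2 == 1:
            key.append((0, int(chunk), ""))
        else:
            key.append((1, 0, chunk.upper()))
    return key


pages = ["1", "10", "100", "11", "2", "20", "21", "9", "99"]

plain_order = sorted(pages)
numeric_order = sorted(pages, key=numeric_key)

print(plain_order)
print(numeric_order)

# A comma breaks the digit run, so "1,000" is read as 1 followed by 0
# and therefore sorts before "2".
print(sorted(["2", "1,000"], key=numeric_key))
```

The last line shows why separators matter: because the comma splits "1,000" into two separate numbers, the title is ordered by its leading 1 rather than by one thousand.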

Which wikis would be affected?

English Wikipedia and Test Wikipedia are already using uca-default-u-kn. Most other large wikis are using a language-specific version of UCA collation, some with numeric sorting, some without. (See the full list at Collation.) None of those wikis would be affected by this change. Only wikis not listed at Collation (and that don't opt out) would be affected.

Can wikis opt out?

Yes. This proposal is only about changing the default. It won't affect any wikis that have an explicit collation set in the config files, so any wiki that wants to opt out can simply request that its collation be set to "uppercase" (or whatever it prefers).

More info

To test out the "uca-default-u-kn" collation, visit https://test.wikipedia.org/ or the ICU Collation Demo. For more information on the different collation options in MediaWiki, see Manual:$wgCategoryCollation.

Discussion

For any wiki where we have a localized version of icu but that wiki is still using the default collation, can we switch to the localized version instead of uca-default-u-kn? Bawolff (talk) 02:52, 15 November 2016 (UTC)
- That seems reasonable to me. Kaldari (talk) 05:05, 15 November 2016 (UTC)
I am not sure whether this would be good for Wiktionaries, which are effectively multilingual projects -- not only do they have page titles in many different languages, but they also have many monolingual categories. While it might be OK to sort wikt:Category:en:Fish using a diacritic-equivalent collation, it doesn't make sense to do the same with wikt:Category:sv:Fish. As such, it might be better to leave Wiktionaries with the existing "uppercase" collation for now. Certainly, applying language-specific collations to these projects does not make sense, and I am surprised to see that this has been done in a few instances. The English Wiktionary community has been clear for a number of years that it wants the ability to specify a collation for each category (phab:T30397), which is obviously not in scope here, but really adds weight to the idea that this collation change is not suitable for Wiktionaries. This, that and the other (talk) 00:04, 19 November 2016 (UTC)
- @This, that and the other: I originally planned to exclude Wiktionaries, but we started discussions on a few of the larger Wiktionaries and surprisingly they were all in favor of switching to UCA, although they re-iterated their desire for per-category collation setting. I'll see if I can dig up some of the discussions for you. Ryan Kaldari (WMF) (talk) 23:09, 22 November 2016 (UTC)
  - Actually I remembered that wrong. We asked them if they wanted numeric sorting, not UCA specifically. I'm fine with excluding Wiktionaries for now. Ryan Kaldari (WMF) (talk) 18:30, 23 November 2016 (UTC)
I think UCA collation is a sensible default, potentially also in the worst case mentioned above (multilingual Wiktionary categories). --Nemo 14:37, 28 November 2016 (UTC)