Community Tech/Numerical sorting in categories

Tracked in Phabricator:
Task T8948

The numerical sorting in categories project aims to make sure that numbers in category listings are sorted in the order a human reader expects.

Nov 2016: There will be one more batch of wikis converted to numerical sorting (T149002) and then we're going to take a break from it. We'll be looking into making numerical sorting the default across all wikis. Feel free to ask questions on the talk page.

Numerical sorting is now live on the following Wikipedias:

Bengali (bn)
Bosnian (bs)
Croatian (hr)
Czech (cs)
English (en)
French (fr)
Hebrew (he)
Hungarian (hu)
Italian (it)
Macedonian (mk)
Norwegian (no)
Polish (pl)
Russian (ru)
Swedish (sv)
Ukrainian (uk)
Vietnamese (vi)

Rationale

When categories are sorted, numbers are not treated as numbers but as strings of digit characters, compared one character at a time. This means that 100 comes before 27, because 1 comes before 2 (en:Category:English-language films). This makes category listings harder to read, as it doesn't reflect how humans think about sorting. We need to treat numbers like numbers.
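The difference between the two orderings can be sketched in a few lines of Python. This is only an illustration of the idea, not the code MediaWiki uses; the `natural_key` helper and the sample titles are assumptions made up for the example.

```python
import re

def natural_key(title):
    """Build a sort key in which runs of digits compare as integers."""
    # re.split with a capturing group alternates text (even indexes)
    # and digit runs (odd indexes), so like is always compared with like.
    parts = re.split(r"(\d+)", title)
    return [int(part) if i % 2 else part.casefold() for i, part in enumerate(parts)]

# Illustrative sample titles, not taken from the category itself.
titles = ["100 Rifles", "27 Dresses", "9 to 5"]

character_order = sorted(titles)                 # compares "1" < "2" < "9"
numeric_order = sorted(titles, key=natural_key)  # compares 9 < 27 < 100

print(character_order)  # ['100 Rifles', '27 Dresses', '9 to 5']
print(numeric_order)    # ['9 to 5', '27 Dresses', '100 Rifles']
```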

There is also a request from the German community to sort umlauts correctly – it was one of the top wishes on TCB's wishlist survey. (survey entry)

Technical discussion and background

Status

This is a slightly less technical overview. For more details, please see our meeting notes and the Phabricator task.

Sept 22, 2016

We're now putting out the call to all projects and languages, offering numerical sorting to projects that want to use it.

See How to request numerical sorting on your wiki!

The following messages have been posted to Wikimedia-L and Wikitech-Ambassadors-L:

Hi everyone,
In the Community Wishlist survey that the WMF Community Tech team did last year, the #5 wish was numerical sorting in categories (e.g. let 99 come before 101). This is now working and has been rolled out to Swedish and English Wikipedia.
If your wiki wants it, the Community Tech team is happy to help, of course. Re-sorting the categories is done using a script, which can take a day or so to complete, depending on the size of the wiki. During the time that the script is running, sorting in some categories will be unreliable. This issue goes away when the script is done.
If you’d like numerical sorting on your wiki:
1) Please start a community discussion – RfC, vote, or however your wiki normally decides these things – to make sure there’s support for it.
2) Once you’re sure it has support, post on User:DannyH (WMF)’s talk page on Meta with a link to the discussion.
Regards,
//Johan Jönsson, User:Johan (WMF)

Also posted on Tech News: Tech/News/2016/39

Sept 7, 2016

English Wikipedia now has numerical sorting! The script took longer to run than we estimated -- we thought it would be one day, and it actually took seven -- but it's done, and it works the way we wanted it to.

Next: Bringing it to the other languages and projects that want it. There are some wikis that already have uca-collation; see Collation for the list.

The list of big Wikipedias that don't have numerical sorting yet includes: Arabic, Catalan, Chinese, German, Indonesian, Japanese, Korean, Norwegian, Romanian and Spanish. There are also a lot of people who voted for numerical sorting from Commons and Wikidata, and a long tail of other projects and languages.

Danny and Johan are going to work on talking to wikis about deciding whether they want the new sorting. When we have agreement from a wiki, then we can run the script. If you can help to open a discussion on your wiki, please let Danny know on his talk page.

Aug 29, 2016

The new collation was rolled out to Swedish Wikipedia last week, with no problems. We're deploying the new collation to English Wikipedia today. There will be some instability with categories during the rollout, which we expect will take 24 hours. For more information, see the announcement on ENwiki's village pump.

Aug 2, 2016

The German Wikipedia community asked to see a test version of the new collation, so that they can discuss it. That test version is live now on de.wikipedia.beta.wmflabs.

June 17, 2016

Okay, here's the bad news: We need to switch languages to UCA collation in order to change the numerical sorting, but switching a language to UCA turns out to be more complicated than we'd first thought. Many languages have unique rules for diacritics, digraphs and other special characters. Switching a language to uca-default collation could result in a character slipping out of its correct place in that language's alphabet, so we have to do some hand-coding to make sure that each language's rules are being respected. That's an unreasonable amount of work, if we have to do it for each of the 292 Wikipedia languages.
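To show what "slipping out of its correct place" can look like, here is a toy Python sketch using Swedish, where å, ä and ö come after z. Both key functions are rough stand-ins written for this example: `untailored_key` only approximates a default collation by ignoring diacritics, and `swedish_key` hard-codes the alphabet instead of using real collation rules. The place names are sample data.

```python
import unicodedata

# Swedish alphabet order: å, ä and ö follow z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"

def untailored_key(title):
    """Rough stand-in for a default collation: accented letters sort with their base letter."""
    decomposed = unicodedata.normalize("NFD", title.lower())
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base, title)

def swedish_key(title):
    """Toy tailoring: rank each letter by its position in the Swedish alphabet."""
    text = unicodedata.normalize("NFC", title).lower()
    return [SWEDISH_ALPHABET.index(ch) for ch in text if ch in SWEDISH_ALPHABET]

titles = ["Örebro", "Orsa", "Zinkgruvan", "Ängelholm", "Arboga"]

untailored = sorted(titles, key=untailored_key)
tailored = sorted(titles, key=swedish_key)

print(untailored)  # ['Ängelholm', 'Arboga', 'Örebro', 'Orsa', 'Zinkgruvan']
print(tailored)    # ['Arboga', 'Orsa', 'Zinkgruvan', 'Ängelholm', 'Örebro']
```

In the first ordering Ä and Ö land among A and O, which is wrong for a Swedish reader; the second ordering is the kind of per-language rule that has to be set up by hand.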

So we looked at the people who support-voted the Wishlist Survey proposal, checking the language of each person's top two most-edited wikis. From that not-really-very-scientific process, we discovered:

The top three by far were English WP (en), Commons and Wikidata.
There were 2–10 people from the following languages: German (de), Persian (fa), French (fr), Italian (it), Polish (pl) and Portuguese (pt).
There was 1 person from each of the following languages: Afrikaans (af), Bavarian (bar), Belarusian (be-tarask), Bengali (bn), Catalan (ca), Sorani (ckb), Greek (el), Hebrew (he), Hungarian (hu), Icelandic (is), Ripuarian (ksh), Ladino (lad), Malay (ms), Dutch (nl), Norwegian (no), Sanskrit (sa), Swedish (sv) and Ukrainian (uk).

The good news -- some of these languages are already using uca-collation. Those languages are: Persian, French, Italian, Portuguese, Polish, Hungarian, Icelandic, Dutch, Swedish and Ukrainian.

So the languages/projects that had more than 1 support vote and don't already use uca-collation are: English WP, German WP, Commons and Wikidata. Of the 18 languages that had one support vote each, 13 don't already use uca-collation.
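The tally can be checked with a few set operations. The language codes below are copied from the lists above; the variable names are just labels chosen for this sketch, and the codes for the languages already on uca-collation are matched by name to the codes given in the voter lists.

```python
# Languages with 2-10 supporters (English WP, Commons and Wikidata are counted separately).
several_votes = {"de", "fa", "fr", "it", "pl", "pt"}

# Languages with exactly one supporter.
one_vote = {
    "af", "bar", "be-tarask", "bn", "ca", "ckb", "el", "he", "hu",
    "is", "ksh", "lad", "ms", "nl", "no", "sa", "sv", "uk",
}

# Languages already using uca-collation.
already_uca = {"fa", "fr", "it", "pt", "pl", "hu", "is", "nl", "sv", "uk"}

several_votes_to_switch = several_votes - already_uca
one_vote_to_switch = one_vote - already_uca

print(sorted(several_votes_to_switch))            # ['de']
print(len(one_vote), len(one_vote_to_switch))     # 18 13
```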

Our current plan is to switch the four projects with more than 1 support vote: English WP, German WP, Commons and Wikidata (pending each community's agreement). Once we've done that, we'll be available to set up other languages that don't currently use uca, but we'll need someone who speaks that language to help us understand the sorting conventions for its alphabet.

May 24, 2016

New indexes have been added to the categorylinks tables for all the wikis (T130692). This will allow us to run the updateCollation.php script on even the largest wikis in a reasonable amount of time (T58041), which is a prerequisite for changing the collations.
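Why an index matters for this kind of job can be sketched with Python's built-in sqlite3 module. Everything here is an assumption made for illustration: the `demo_links` table, its columns and the `demo_collation` index are hypothetical stand-ins, not the real categorylinks schema, and the loop is not the actual logic of updateCollation.php. The sketch only shows the general pattern of re-keying rows in small batches, with an index so that each batch can find the rows still waiting for conversion without scanning the whole table.

```python
import sqlite3

OLD, NEW = "uppercase", "uca-default"  # collation labels used as plain markers here

def make_demo_db(row_count):
    """Create an in-memory table of links that all use the old collation."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE demo_links ("
        " link_id INTEGER PRIMARY KEY, sortkey TEXT, collation TEXT)"
    )
    # Index on the collation column, so unconverted rows are cheap to locate.
    db.execute("CREATE INDEX demo_collation ON demo_links (collation, link_id)")
    db.executemany(
        "INSERT INTO demo_links VALUES (?, ?, ?)",
        [(i, f"film {i}", OLD) for i in range(1, row_count + 1)],
    )
    db.commit()
    return db

def migrate(db, make_sortkey, batch_size=100):
    """Re-key rows batch by batch; return the number of batches processed."""
    batches = 0
    while True:
        rows = db.execute(
            "SELECT link_id, sortkey FROM demo_links"
            " WHERE collation = ? ORDER BY link_id LIMIT ?",
            (OLD, batch_size),
        ).fetchall()
        if not rows:
            return batches
        db.executemany(
            "UPDATE demo_links SET sortkey = ?, collation = ? WHERE link_id = ?",
            [(make_sortkey(key), NEW, link_id) for link_id, key in rows],
        )
        db.commit()  # short transactions keep other queries from waiting long
        batches += 1

db = make_demo_db(250)
batches_run = migrate(db, make_sortkey=str.upper)  # str.upper is a placeholder key function
remaining = db.execute(
    "SELECT COUNT(*) FROM demo_links WHERE collation = ?", (OLD,)
).fetchone()[0]

print(batches_run, remaining)  # 3 0
```

While such a job is part-way through, some rows carry old sort keys and some carry new ones, which is why category sorting is unreliable until the script finishes.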

March 11, 2016

There's a discussion on ticket T128502 about the impact on non-Arabic numerals. The new sorting works correctly for Japanese numerals, but Eastern Arabic numerals don't work, and we're not sure about Chinese. We're currently trying to start some conversations on Arabic, Japanese and Mandarin Chinese Wiktionaries, to see if we can figure out the right path.
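The general concern can be illustrated in Python. This sketch shows Python's own regular-expression and `int()` behaviour, not the behaviour of the collation library discussed in the ticket, and both key functions are hypothetical helpers. It shows that a numeric sort that only recognises the digits 0-9 falls back to character order for numbers written in another script.

```python
import re

def ascii_only_key(title):
    """Natural-sort key that only recognises the digits 0-9."""
    parts = re.split(r"([0-9]+)", title)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

def unicode_digit_key(title):
    """Natural-sort key that recognises any Unicode decimal digit."""
    parts = re.split(r"(\d+)", title)  # \d matches Unicode decimal digits for str patterns
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

# Eastern Arabic numerals, written as escapes to avoid right-to-left display issues.
twenty_seven = "\u0662\u0667"       # 27
one_hundred = "\u0661\u0660\u0660"  # 100

ascii_result = sorted([twenty_seven, one_hundred], key=ascii_only_key)
unicode_result = sorted([twenty_seven, one_hundred], key=unicode_digit_key)

print(ascii_result == [one_hundred, twenty_seven])    # True: 100 sorts before 27
print(unicode_result == [twenty_seven, one_hundred])  # True: 27 sorts before 100
```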

Birgit and Tobi from WMDE's TCB team say that we need to make sure that we have a conversation with the German WP community before we run the collation script -- they've got a system of Defaultsorts that sort all of the numbers by hand. (See Kategorie:Literaturverfilmung for an example.) If the community decides they want to use the new sorting, they/we will have to change all of these Defaultsorts.

March 1, 2016

There's now a new, faster conversion script, which will perform the re-sorting. Tech Ops is currently doing some more performance testing, to make sure the new script doesn't interrupt any existing queries. Once that's done, we'll be able to run the new script and get natural number sorting.

There's an ICU Collation Demo page for the library that we'll be switching to. You can input a series of page titles, and check that the collation works properly. (Just make sure that the "numeric" setting is switched to on.)

We also need to check that the Wiktionary sites are okay with switching to natural number sorting, because they might prefer to stick with lexicographical sorting. We've posted a question on the EN Wiktionary Grease pit, and we'll be checking with some other languages as well.

Feb 22, 2016

We're currently evaluating options for how to do this. There's an available library that we can use. The current problem is that the conversion script we'd need to run is slow, possibly taking up to three days on EN.wp. During this time, people would see inconsistent ordering within categories, which isn't optimal. At worst, a script with bad performance issues could take down the server.

We're discussing possible options on Phabricator ticket T58041. The answer may be to add a new index to the database table.

Timeline

At press time (late February), the team is actively working on two projects: Migrate dead external links to archives and Pageview stats tool. We'll probably be able to start running some Numerical sorting tests in April.

Internal Community Tech team assessment

Support: High. Near-unanimous support votes, with several people saying "Long overdue".

Impact: Medium. There's a clunky workaround, but it's a pain. The Wiktionary community may have an objection and opt out.

Feasibility: High. We may be able to solve this using the ICU library, which sorts numbers correctly. The current ICU conversion script is inefficient, and takes way too long on the biggest wikis. We’ll need to fix the conversion script if that's how we approach this.

Risk: Low. We need to make sure that we can efficiently regenerate all the sortkeys on large wikis like English Wikipedia. We also need to consider RTL and non-Latin numerals.

Status: We'll be looking into this soon; it seems like low-hanging fruit. We'll need to investigate the ICU library, potential drawbacks and i18n concerns. Wikimedia Deutschland's TCB team have a similar request on their wishlist: "Correct sorting of umlauts" (link in German). We'll work together with them on this.