Wikidata/Notes/Normalization

There seem to be problems with several of the scripts that will be part of the project. At the least, there are problems with some comparisons in Malayalam (note the URL and title at this page and compare them to the URL and title at this page) and Arabic, with sort orders in Arabic, Persian and Hebrew, and with composition in Bangla.
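The composition problem can be illustrated with a minimal Python sketch. It assumes the mismatch comes from canonically equivalent code point sequences; the Bangla syllable and the helper name `same_text` are illustrative choices, not taken from the cases linked above. The two strings render identically, yet they only compare equal after both are brought to the same normalization form (NFC here, which is itself an assumption).

```python
import unicodedata

# Two encodings of the same Bangla syllable (KA with vowel sign O).
precomposed = "\u0995\u09cb"       # KA + VOWEL SIGN O
decomposed = "\u0995\u09c7\u09be"  # KA + VOWEL SIGN E + VOWEL SIGN AA


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (assumed form)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# A plain comparison works on code points and therefore fails.
raw_equal = precomposed == decomposed

# Comparing the normalized forms succeeds.
normalized_equal = same_text(precomposed, decomposed)
```

Normalization only covers canonically equivalent sequences. It does not by itself address sort order, and it may not cover every Malayalam or Arabic case mentioned above.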

There are several places where these problems cause trouble:

  1. Adding and removing sitelinks, because the page names must match both the page names reported by the sites and the strings in the items
  2. Adding and removing aliases in specific languages in the items, because the strings must match
  3. Adding and removing label-description pairs in specific languages in the items, because the strings must match
  4. Lookup of items according to site-page pairs in sitelinks in secondary storage
  5. Lookup of aliases in specific languages in secondary storage
  6. Lookup of label-description pairs in specific languages in secondary storage
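One possible way to make the add, remove and lookup cases above agree is to normalize every string once, at the boundary, before it is stored or used as a key. The following Python sketch assumes NFC as the target form. The `SitelinkIndex` class, the site id `bnwiki` and the item id `Q1` are hypothetical, used only for illustration, and the dictionary merely stands in for the secondary storage.

```python
import unicodedata
from typing import Dict, Optional, Tuple


def norm(text: str) -> str:
    """Bring a string to NFC (assumed form) before storing or comparing it."""
    return unicodedata.normalize("NFC", text)


class SitelinkIndex:
    """Hypothetical in-memory stand-in for a site-page to item lookup table."""

    def __init__(self) -> None:
        self._items: Dict[Tuple[str, str], str] = {}

    def add(self, site: str, page: str, item_id: str) -> None:
        # The key is built from the normalized page name.
        self._items[(site, norm(page))] = item_id

    def remove(self, site: str, page: str) -> Optional[str]:
        # Returns the removed item id, or None if there was no such link.
        return self._items.pop((site, norm(page)), None)

    def lookup(self, site: str, page: str) -> Optional[str]:
        return self._items.get((site, norm(page)))


index = SitelinkIndex()
# Stored with a decomposed vowel sign, looked up with the precomposed one.
index.add("bnwiki", "\u0995\u09c7\u09be", "Q1")
found = index.lookup("bnwiki", "\u0995\u09cb")
```

The same `norm` step would apply to aliases and label-description pairs, with the language code taking the place of the site id in the key.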

See also mw:Unicode normalization considerations.

The includes/normal directory contains some Unicode normalization routines. See includes/normal/README for more information.