Wikidata/Notes/Normalization

Several of the scripts that will be part of the project appear to cause problems. At the least, some string comparisons fail in Malayalam (note the URL and title at this page and compare them to the URL and title at this page) and in Arabic, sort orders are problematic in Arabic, Persian and Hebrew, and composition is problematic in Bangla.

There are several places where we run into trouble because of this:

Adding and removing sitelinks, as the page names must match both the page names reported by the sites and the strings in the items
Adding and removing aliases in specific languages in the items, as the strings must match
Adding and removing label-description pairs in specific languages in the items, as the strings must match
Lookup of items according to site-page pairs in sitelinks in secondary storage
Lookup of aliases in specific languages in secondary storage
Lookup of label-description pairs in specific languages in secondary storage
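
The matching and lookup cases above can be illustrated with a minimal sketch in Python. It assumes that normalizing every string to Unicode NFC before comparing or storing it is an acceptable approach; these notes do not prescribe it. The helper names (`normalize_key`, `same_string`) and the `sitelink_index` dictionary standing in for secondary storage are hypothetical, as are the site identifier `bnwiki` and the item id `Q1`.

```python
import unicodedata


def normalize_key(text: str) -> str:
    """Return the NFC form of text, for use as a comparison or lookup key."""
    return unicodedata.normalize("NFC", text)


def same_string(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return normalize_key(a) == normalize_key(b)


# The Bangla syllable "ko" written two ways:
# KA + VOWEL SIGN O (one precomposed vowel sign) ...
composed = "\u0995\u09cb"
# ... and KA + VOWEL SIGN E + VOWEL SIGN AA (the canonically equivalent sequence).
decomposed = "\u0995\u09c7\u09be"

raw_equal = composed == decomposed                  # code point comparison
normalized_equal = same_string(composed, decomposed)

# Hypothetical stand-in for a secondary-storage index of (site, page) pairs.
# Keys are normalized on write, so lookups must normalize as well.
sitelink_index = {("bnwiki", normalize_key(composed)): "Q1"}


def lookup_item(site: str, page: str):
    """Find the item id for a site-page pair, or None if there is none."""
    return sitelink_index.get((site, normalize_key(page)))


found = lookup_item("bnwiki", decomposed)
```

Here the raw comparison fails while the normalized comparison and the lookup succeed, even though both spellings render identically. Normalization of this kind addresses only canonically equivalent sequences; sort order is a separate problem that needs collation rules.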

Directory `/includes/normal/`

This directory contains some Unicode normalization routines. See includes/normal/README for more information.
