Grants:IEG/Solve the display issue on rare characters in CJK fonts

status: withdrawn

Solve the display issue for missing characters in CJK fonts.

summary: Use Ideographic Description Sequences to support characters missing from Chinese-Japanese-Korean fonts in MediaWiki projects.

target: Chinese, Japanese, and Korean Wikisource, Wiktionary, and Wikipedia

strategic priority: improve quality

amount: 300 USD

grantee: Shangkuanlc, shoichi, reke

contact: shangkuanlc@gmail.com

created on: 18:21, 3 February 2015 (UTC)

2014 round 2

Project idea

What is the problem you're trying to solve?

Right now on Chinese, Japanese, and Korean MediaWiki projects, if we want to put an ancient book on Wikisource, we will very likely find that some Hanzi characters are not encoded in Unicode. Characters newly created in modern science are sometimes missing as well (for example, a newly discovered chemical element). At present the only solution is to display the character as a JPG file. This is a bad solution, because an image of a character cannot be indexed or searched semantically by computers, and it is hard to combine with other characters. It is also hard for users to input: they have to draw the character in graphics software.

Going back to the origin of the problem: Hanzi (or "Kanji") is an open character set used in Chinese, Japanese, and Korean. A Hanzi is a 2D composition of other Hanzi or basic components (in Chinese / Japanese: "末級部件"). In the early age of CJK computing (1970-1985), technological limits made it very hard to handle 2D composed characters on computers, so the approach to Hanzi encoding was to encode whole characters ("words") one by one rather than the basic elements of characters. Of course, it is very hard to assign a code to every Hanzi, so only modern, everyday characters were given codes. As a result, many rarely used characters that do not appear in daily life are not encoded. They are used in ancient cultural books, rare names, and advanced scientific research. Although Unicode exists now, many characters are still not encoded. The solution so far has been to submit the newly found character to the Unicode organization, then wait several years until Unicode accepts it into a new version of the standard (they may or may not) and the font companies implement it. During those years, the character can only be shown as a JPG image file.

What is your solution?

Implement a Unicode Ideographic Description Sequence (IDS) rendering system ([ref http://www.unicode.org/versions/Unicode6.0.0/ch12.pdf]) for MediaWiki.
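To make the idea concrete: an IDS describes a character in prefix notation, where an Ideographic Description Character (U+2FF0 to U+2FFB in the Unicode version referenced above) is followed by its two or three component operands. The following is a minimal Python sketch of how such a string could be parsed into a tree. It is an illustration only, not the project's rendering engine; the names `Node`, `parse_ids`, and `components` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Ideographic Description Characters (IDCs), U+2FF0..U+2FFB.
# U+2FF2 and U+2FF3 take three operands; the others take two.
TERNARY_IDCS = {"\u2FF2", "\u2FF3"}
BINARY_IDCS = {chr(cp) for cp in range(0x2FF0, 0x2FFC)} - TERNARY_IDCS


@dataclass
class Node:
    """One composition step: an IDC operator and its operand sub-trees."""
    operator: str
    operands: List[Union["Node", str]]


def _parse_at(text: str, pos: int) -> Tuple[Union[Node, str], int]:
    """Parse one operand starting at pos; return (tree, next position)."""
    if pos >= len(text):
        raise ValueError("sequence ended while an operand was still expected")
    ch = text[pos]
    if ch in TERNARY_IDCS:
        arity = 3
    elif ch in BINARY_IDCS:
        arity = 2
    else:
        return ch, pos + 1  # a plain component character
    operands = []
    pos += 1
    for _ in range(arity):
        operand, pos = _parse_at(text, pos)
        operands.append(operand)
    return Node(ch, operands), pos


def parse_ids(text: str) -> Union[Node, str]:
    """Parse a complete IDS string; raise ValueError if it is malformed."""
    tree, pos = _parse_at(text, 0)
    if pos != len(text):
        raise ValueError(f"unexpected trailing characters at index {pos}")
    return tree


def components(tree: Union[Node, str]) -> List[str]:
    """List the leaf components of a parsed IDS, left to right."""
    if isinstance(tree, str):
        return [tree]
    leaves: List[str] = []
    for operand in tree.operands:
        leaves.extend(components(operand))
    return leaves
```

For example, `parse_ids("⿱艹⿰日月")` describes "艹" placed above a left-right pair of "日" and "月", and `components` on that tree returns the three leaf characters. A list like this is the kind of information that could make a missing character searchable by its parts.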

It will consist of a web font server and a MediaWiki extension. The web font server outputs SVG files with semantic metadata (which characters the new character is assembled from). The MediaWiki extension recognizes IDS strings and sends them to the web font server, which returns a link to the rendered SVG.
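As a rough illustration of what "SVG with semantic metadata" could look like, the Python sketch below wraps an already rendered glyph outline in an SVG document that also records the IDS and its components. The element names `composition` and `component`, the function name `wrap_glyph_svg`, and the 1000-unit view box are assumptions made for this example; the actual server output format is not specified here.

```python
import xml.etree.ElementTree as ET
from typing import List

SVG_NS = "http://www.w3.org/2000/svg"


def wrap_glyph_svg(ids: str, parts: List[str], path_data: str) -> str:
    """Return an SVG string holding the glyph outline plus its composition.

    `path_data` stands in for the outline a rendering engine would produce;
    this sketch does not compute it.
    """
    svg = ET.Element("svg", {"xmlns": SVG_NS, "viewBox": "0 0 1000 1000"})
    # The title keeps the IDS visible to tools that read SVG text content.
    ET.SubElement(svg, "title").text = ids
    # Hypothetical metadata layout: one <component> per leaf character.
    metadata = ET.SubElement(svg, "metadata")
    composition = ET.SubElement(metadata, "composition", {"ids": ids})
    for part in parts:
        ET.SubElement(composition, "component").text = part
    ET.SubElement(svg, "path", {"d": path_data})
    return ET.tostring(svg, encoding="unicode")
```

Because the components are stored as text inside the file, a crawler or search index could read them, which an ordinary JPG image does not allow.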

The web font server can be hosted on an independent site or as a Wikimedia project site. The extension needs to be installed on the Wikimedia project sites that need the technology.
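The division of work between the extension and the server described above can be sketched as follows. A real MediaWiki extension would be written in PHP; Python is used here only to show the logic. The server address `https://ids-font.example.org/svg/`, the URL layout, the CSS class, and the function names are all hypothetical.

```python
from html import escape
from urllib.parse import quote

# Hypothetical font server address; the real deployment is not decided here.
FONT_SERVER = "https://ids-font.example.org/svg/"

# Number of operands for each Ideographic Description Character.
ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
ARITY["\u2FF2"] = ARITY["\u2FF3"] = 3


def ids_end(text: str, pos: int) -> int:
    """Return the index just past the IDS operand at pos, or -1 if incomplete."""
    if pos >= len(text):
        return -1
    arity = ARITY.get(text[pos])
    pos += 1
    if arity is None:
        return pos  # a plain component character
    for _ in range(arity):
        pos = ids_end(text, pos)
        if pos == -1:
            return -1
    return pos


def link_ids(text: str) -> str:
    """Replace each complete IDS in text with an <img> pointing at the server.

    Plain text is copied unchanged; the sketch assumes it is already safe
    to output and that any character may serve as a component.
    """
    out = []
    i = 0
    while i < len(text):
        end = ids_end(text, i) if text[i] in ARITY else -1
        if end == -1:
            out.append(text[i])  # not the start of a complete IDS
            i += 1
            continue
        ids = text[i:end]
        out.append('<img class="ids-glyph" src="%s%s.svg" alt="%s">'
                   % (FONT_SERVER, quote(ids), escape(ids)))
        i = end
    return "".join(out)
```

Keeping the original IDS in the `alt` attribute means the page text still carries the character description even if the server is unreachable.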

We invited two open source teams to collaborate on the project in February: the Moe dictionary team and the YiCHUAN team. We then held a hackathon on 2 May.

The web font server is now almost ready. Next, we need to complete the MediaWiki extension.

Project goals

Implement our solution so that the hundreds of missing characters in Dr. Wu's Taiwanese dictionary can be put into Wikisource properly. Missing characters in other ancient CJK books could then be handled in the same way.

Project plan

Activities

Several hackathons for developer meetings and testing.

Budget

Food and drinks for hackathon -- 50 USD for 5 times

Community engagement

We are working with other open source communities and a university professor. Wikimedia Taiwan has also received a new donation, a Taiwanese-Mandarin dictionary of approximately 4000 pages. We will use the new technology on Wikisource to test whether the newly donated dictionary can be displayed properly and searched.

Sustainability

After the grant ends, not only will our board's Taiwanese dictionary project use the technology, but other editors on Wikisource, Wikipedia, and Wiktionary will also be able to use it to display missing characters instead of drawing them as JPG files. It may even come to be used widely on the internet.

As for continuing and growing: solving the missing-character problem comes first. The second step is to make the newly assembled characters look better, which requires continued improvement of the algorithm.

We will form a maintenance group to continue development.

Measures of success

Rare characters can be displayed just like normal characters and can be searched in the Chinese Wikisource search bar.

Get involved

Participants

Shoichi Chou - A board member of Wikimedia Taiwan. He is a programmer and a musician. He has participated in Wikimedia projects since 2005 and has also taken part in IDS rendering engine development. He is the project manager.

Sing Hong Sih - The leader of YiCHUAN. He provides the source code and algorithm of his web font rendering engine.

Audrey Tang - The leader of the Moe dictionary team. She is a famous Perl hacker in Taiwan. In this project she led our team in digitizing the dictionary and building a new web font server based on YiCHUAN's algorithm.

Reke - Secretariat of Wikimedia Taiwan, who is in contact with the donors of the dictionary.

Ted Chien - Chair of the board of Wikimedia Taiwan, who will advise the program.

Chi Pan Wang - A professor of ancient Chinese language with expertise in IDS, who will also be an advisor.

Community notification

Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions. Need notification tips?

Endorsements

Do you think this project should be selected for an Individual Engagement Grant? Please add your name and rationale for endorsing this project below! (Other constructive feedback is welcome on the discussion page).