Grants:Project/Developing and Enhancing Gurmukhi-Shahmukhi Machine Transliteration Service for Wikipedias
Project idea
What is the problem you're trying to solve?
Explain the problem that you are trying to solve with this project. What is the issue you want to address? You can update and add to this later.
One of the great challenges for information technology is to overcome language and script barriers so that everyone can communicate with everyone else on the planet in real time. South Asia is one of the few parts of the world where a single language is written in different scripts. This is the case with Punjabi, the twelfth most widely spoken language in the world: in Indian East Punjab (20 million speakers) it is written in the Gurmukhi script (a left-to-right Brahmic script, like Devanagari), and in Pakistani West Punjab (80 million speakers) it is written in the Shahmukhi script (a right-to-left script based on Arabic). Whilst the Punjabi spoken in the eastern and western parts is mutually comprehensible, the written forms are not. The existence of two scripts has thus created a script barrier between the Punjabi literature written in India and that written in Pakistan. More than 60 per cent of the Punjabi literature of the medieval period (500-1450 AD) is available only in Shahmukhi, while modern Punjabi writing appears in both Gurmukhi and Shahmukhi.
Punjabi writing on Wikipedia is likewise split between the two scripts. As of 19 January 2017, there were 24,301 articles in Gurmukhi and 43,092 articles in Shahmukhi. Readers from Pakistani West Punjab cannot read the 24,301 Gurmukhi articles even though they are written in Punjabi, and readers from India cannot read the 43,092 Shahmukhi articles. Our Punjabi transliteration system is a tool for creating bi-script articles in real time across these Wikipedias. As Punjabis motivated to do something for our mother tongue, we have focused on this issue and put our efforts in this direction.
With the support of Punjabi University and PAN ASIA ICT Grants, we had developed both the Gurmukhi-to-Shahmukhi system and the more challenging reverse system by 2008. We now plan to redefine this system for the needs of Wikipedias, so that articles can be transliterated into the other script as and when the reader requests it. Potentially, all members of the substantial Punjabi community will benefit from integrating the transliteration system into Wikipedia resources.
What is your solution?
Our research has produced a Punjabi transliteration system that is the first of its kind.[1][2][3][4][5][6][7] Its main advantage is that it works on Shahmukhi text without diacritical marks, which matters because Shahmukhi is normally written without short vowels and other diacritics. The transliteration approach is corpus-based: corpora in both scripts are analysed to generate statistical data such as character frequencies, word frequencies, and bi-gram frequencies, and these statistics are used in different phases of transliteration. As a result, the transliteration accuracy is 97% at the word level. This makes the system well suited to creating bi-script articles in real time across Wikipedias.
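As a rough illustration of the kind of corpus statistics described above, the following minimal sketch builds character, word, and bi-gram frequency tables from a toy Gurmukhi corpus and uses them to choose between two candidate spellings. The corpus, the function names `corpus_statistics` and `pick_candidate`, and the selection rule are assumptions made for this example; they are not taken from the system described in the references.

```python
from collections import Counter

# Toy Gurmukhi corpus (illustrative only; a real corpus would be far larger).
corpus = [
    "ਇਹ ਕਿਤਾਬ ਹੈ",
    "ਇਹ ਲਿਪੀ ਹੈ",
    "ਉਹ ਕਿਤਾਬ ਹੈ",
]


def corpus_statistics(sentences):
    """Return character, word, and word bi-gram frequency tables."""
    char_freq, word_freq, bigram_freq = Counter(), Counter(), Counter()
    for sentence in sentences:
        words = sentence.split()
        word_freq.update(words)
        # Counts Unicode code points, so vowel signs are counted separately.
        char_freq.update(ch for word in words for ch in word)
        # Adjacent word pairs within the same sentence.
        bigram_freq.update(zip(words, words[1:]))
    return char_freq, word_freq, bigram_freq


def pick_candidate(previous_word, candidates, word_freq, bigram_freq):
    """Hypothetical rule: prefer the candidate seen most often after
    previous_word, breaking ties by overall word frequency."""
    return max(
        candidates,
        key=lambda w: (bigram_freq[(previous_word, w)], word_freq[w]),
    )


char_freq, word_freq, bigram_freq = corpus_statistics(corpus)

# Two possible Gurmukhi spellings for one undiacritized input word.
candidates = ["ਕਤਾਬ", "ਕਿਤਾਬ"]
best = pick_candidate("ਇਹ", candidates, word_freq, bigram_freq)
print(best)
```

Because Shahmukhi text usually omits short vowels, one input word can correspond to several Gurmukhi spellings; frequency tables of this sort are one simple way to rank such candidates in context.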
Project goals
Explain what you are trying to accomplish with this project, or what you expect will change as a result of this grant.
This project ultimately aims to create bi-script (Gurmukhi-Shahmukhi) Wikipedia articles on the fly for the Punjabi language, so that readers see more articles in their own script and have access to more of the knowledge that exists in either script on Wikipedias.
Project plan
Activities
- Bi-script data collection from Wikipedia and other sources, analysis, and system integration
- Enhancement of Shahmukhi-Gurmukhi Dictionaries
- Redefining System Architecture
- Code Migration from existing standalone application to Wikipedia Service
- System Training
- Testing Wikipedia Service
- Actual System Deployment
Budget
Item | Description | Commitment | Months | Cost (INR) | Cost (USD, at 1 USD = 68.19 INR) |
---|---|---|---|---|---|
1 | Coordinator | Part Time | 6 | 180000 | 2639.68 |
2 | Co-Coordinator | Part Time | 6 | 120000 | 1759.78 |
3 | Software developer | Part Time | 6 | 120000 | 1759.78 |
4 | Lexical Resources Developer (2 @ 15000 INR for 6 months) | Full Time | 12 | 180000 | 2639.68 |
5 | Web hosting, incidental costs and contingencies | | | 100000 | 1466.49 |
6 | Administration cost | | | 100000 | 1466.49 |
Total | | | | 800000 | 11732 |
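The USD column follows from dividing each INR amount by the stated rate of 68.19 INR per USD. The short sketch below re-derives the figures as a sanity check; the item labels are abbreviated from the table, and the helper name `to_usd` is an assumption for this example. Per-item values may differ from the table by a cent depending on whether amounts are rounded or truncated.

```python
RATE_INR_PER_USD = 68.19  # exchange rate stated in the table header

# (item, cost in INR), copied from the budget table
budget_inr = [
    ("Coordinator", 180000),
    ("Co-Coordinator", 120000),
    ("Software developer", 120000),
    ("Lexical Resources Developers", 180000),  # 2 x 15000 INR x 6 months
    ("Web hosting, incidentals, contingencies", 100000),
    ("Administration cost", 100000),
]


def to_usd(inr, rate=RATE_INR_PER_USD):
    """Convert an INR amount to USD, rounded to the nearest cent."""
    return round(inr / rate, 2)


total_inr = sum(cost for _, cost in budget_inr)
total_usd = to_usd(total_inr)

for item, cost in budget_inr:
    print(f"{item:<42} {cost:>7} INR {to_usd(cost):>10.2f} USD")
print(f"{'Total':<42} {total_inr:>7} INR {total_usd:>10.2f} USD")
```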
Community engagement
Once the system is running stably, it is easy to add new words to the dictionary, and if we manage to recruit other developers, they should also be able to keep expanding the data.
Sustainability
In NLP applications like this there is always room for improvement. Bi-script Wikipedia vocabulary and frequency lists will be used for development during this project, and we will keep improving the system for better accuracy afterwards.
Measures of success
This is a well-tested system and the first of its kind. As a mark of success, the existing system's transliteration accuracy is 97% at the word level. We are happy to integrate it with Wikipedia resources.
- Release of Shahmukhi to Gurmukhi Transliteration Service
- Release of Gurmukhi to Shahmukhi Transliteration Service
- Together, these two milestones will break the script barrier of the Punjabi language and provide a useful product to Wikimedia
- Other measurable goals include the size of the Gurmukhi-Shahmukhi dictionaries, which will cover more than 30,000 of the most frequent words
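Word-level accuracy, the measure quoted above, can be computed as the share of output words that exactly match a reference transliteration. The following is a minimal sketch under that assumption; the function name `word_accuracy` and the four-word sample are illustrative and are not the project's actual test set or evaluation procedure.

```python
def word_accuracy(predicted, reference):
    """Fraction of positions where the predicted word equals the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


# Illustrative sample: the last predicted word is missing a vowel sign.
reference = ["ਪੰਜਾਬੀ", "ਭਾਸ਼ਾ", "ਲਿਪੀ", "ਕਿਤਾਬ"]
predicted = ["ਪੰਜਾਬੀ", "ਭਾਸ਼ਾ", "ਲਿਪੀ", "ਕਤਾਬ"]

accuracy = word_accuracy(predicted, reference)
print(f"{accuracy:.0%}")  # 75%
```

The same function can be run separately on Shahmukhi-to-Gurmukhi and Gurmukhi-to-Shahmukhi output, so each release milestone can be tracked with its own figure.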
Get involved
Participants
- Professor Gurpreet Singh Lehal, Coordinator: more than 25 years of hands-on experience handling research projects funded by various organizations in India and abroad in the fields of NLP and OCR. See Biodata.
- Dr. Tejinder Singh, Co-Coordinator: 10 years of experience handling research projects. See Biodata.
Community notification
Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions.
Endorsements
Do you think this project should be selected for a Project Grant? Please add your name and rationale for endorsing this project below! (Other constructive feedback is welcome on the discussion page).
References
- [1] T. S. Saini and G. S. Lehal, “Shahmukhi to Gurmukhi Transliteration System: A Corpus based Approach”, Research in Computing Science (Mexico), Volume 33, pp. 151-162 (2008).
- [2] Gurpreet Singh Lehal and Tejinder Singh Saini, “Sangam: A Perso-Arabic to Indic Script Machine Transliteration Model”, Proceedings of 10th International Conference on Natural Language Processing, Goa (2014).
- [3] Gurpreet Singh Lehal and Tejinder Singh Saini, “Conversion between Scripts of Punjabi: Beyond Simple Transliteration”, Proceedings of COLING 2012: Posters, Mumbai, pp. 633-642 (2012).
- [4] Gurpreet Singh Lehal, Tejinder Singh Saini and Savleen Kaur Chowdhary, “An Omni-Font Gurmukhi to Shahmukhi Transliteration System”, Proceedings of COLING 2012: Demonstration Papers, Mumbai, pp. 313-320 (2012).
- [5] Tejinder Singh Saini and Gurpreet Singh Lehal, “Word Disambiguation in Shahmukhi to Gurmukhi Transliteration”, Proceedings of the 9th Workshop on Asian Language Resources, IJCNLP 2011, Chiang Mai, Thailand, pp. 79-87 (2011).
- [6] G. S. Lehal, “A Gurmukhi to Shahmukhi Transliteration System”, Proceedings of 7th International Conference on Natural Language Processing, Hyderabad, India, pp. 167-173 (2009).
- [7] T. S. Saini, G. S. Lehal and V. S. Kalra, “Shahmukhi to Gurmukhi Transliteration System”, Coling 2008: Companion volume: Posters and Demonstrations, Manchester, UK, pp. 177-180 (August 2008).