Research:Creating a standard orthography of an unwritten language in order to use it in creating Wikimedia Projects

Created: 14:11, 1 July 2015 (UTC)
Duration: 2015-04 – 2015-08
This page documents a completed research project.


Project Summary

A new orthography based on the Latin and Arabic scripts is proposed for Tunisian. The Latin-script version combines the Deutsche Morgenländische Gesellschaft (DMG) Umschrift with Buckwalter transliteration. The Arabic-script version is simplified with reference to Al-Toma's guidelines. The two orthographies were shown to be interconvertible: a Latin-to-Arabic script converter was successfully built for them.

Reasons

While Wikipedia and Wiktionary in Tunisian Arabic were being created from 2011 onward, it was observed that over 60% of young users write Tunisian in Latin Script, because it is the layout best supported by computers and because users from the Tunisian diaspora are not sufficiently proficient in Arabic Script. Moreover, Tunisian Arabic contains many borrowed phonemes that occur in loanwords, and these phonemes, mainly vowels, are not supported in Arabic Script. Additional diacritics could represent them, but this can confuse users. The only solution was to create an intuitive script converter from Latin Script to Arabic Script: the user edits in Latin Script and the website converts the written words to Arabic Script. This is technically possible thanks to the MediaWiki Language Converter developed by ZhengZhu Feng. However, an examination of the Latin Script in use for Tunisian showed clearly that it is based on a phonemic transcription of words, without any consideration of Arabic morphology. This makes the existing Latin-script spelling of Tunisian Arabic impossible to convert, because of the phenomenon of pronunciation simplification:

  • At the end of a word, [i:] and [ɪ] are pronounced as [ɪ], [u:] and [u] are pronounced as [u], and [a:], [ɛː], [a] and [æ] are pronounced as [æ]. This explains the lack of accuracy in grammatical descriptions of Tunisian. For example, none of the existing works explains why the present of /mʃæ/ is /yimʃi/ while the present of /bdæ/ is /yibdæ/.
  • If a word ends with a vowel and the next word begins with a short vowel, this short vowel and the space between the two words are not pronounced. Ignoring this simplification has made some rules of Tunisian hard to state, such as those for the determiner "il-", meaning «The».
  • If a word begins with two successive consonants, an [ɪ] is pronounced before them.
  • Further simplifications exist in some varieties of the dialect. For example, short vowels are pronounced as schwa in the Northwestern and Southwestern Tunisian dialects, as these varieties are pronounced using Algerian phonology. Another example is the simplification of /θ/ to /t/ at the beginning of a word in the Sahil dialect. These simplifications should not be considered when transliterating Tunisian. Tunisian should be transliterated in a way that lets it be read easily using Tunisian, Algerian and Libyan phonology.
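
The first rule alone is enough to make a purely phonemic spelling ambiguous. The following is a minimal Python sketch of that point, using only the word-final mergers listed above. The names FINAL_VOWEL_MERGER and underlying_candidates are illustrative and are not taken from the project's converter.

```python
# Word-final vowel mergers as listed in the first rule above:
# each underlying vowel maps to the vowel that is actually pronounced.
FINAL_VOWEL_MERGER = {
    "i:": "ɪ", "ɪ": "ɪ",
    "u:": "u", "u": "u",
    "a:": "æ", "ɛː": "æ", "a": "æ", "æ": "æ",
}


def underlying_candidates(surface_vowel: str) -> list[str]:
    """Return every underlying vowel that surfaces as `surface_vowel` word-finally."""
    return [u for u, s in FINAL_VOWEL_MERGER.items() if s == surface_vowel]


# A final [æ] has four possible sources, so the spoken form alone
# does not determine the written form.
print(underlying_candidates("æ"))
```

Because several underlying vowels share one surface form, a spelling that records only the pronounced vowel cannot be mapped back to a unique Arabic-script form, which is consistent with the /mʃæ/ and /bdæ/ example above.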

The solution was to create an orthography based on the transliteration of Arabic words. The idea was introduced by Mr. Mohamed Maamouri in 2004, building on the work of Timothy Buckwalter for Standard Arabic, and has been developed further by Mr. Nizar Habash since 2012. Although the method is efficient, it includes many graphs. For example, there are four letters for the glottal stop… It assigns a Latin graph to every Arabic graph without any consideration of Arabic morphology… Furthermore, this method distinguishes between the phonemes written with uppercase and lowercase letters. This is why the method is also deficient, and why we had to revise the Deutsche Morgenländische Gesellschaft Umschrift in light of the principles of Buckwalter transliteration.

Principles

In basic DMG transliteration, c and e are not used, so we assigned them to two consonant phonemes that previously required additional Latin letters. Dhah and Dhad are two different letters that correspond to the same consonant phoneme and are therefore transcribed in DMG with the same letter. We assign a separate Latin letter to each of them, since they are written with two different letters in Arabic Script. Final forms are also added: ħ for Ta Maghluqa meaning «of» and ä for Alif Maqsura.

In Buckwalter transliteration, h is added to feminine nouns ending with a short a. This is unnecessary, because only a limited number of words end with a short vowel without ending in «ه» in Arabic Script, and those exceptions can be dropped. In the new transcription, if a word ends with a short vowel, «ه» is automatically added to it. Converting the glottal stop to Arabic is very difficult, because the choice of graph depends on the letter before it or the letter after it. That is why we agree with Al-Toma's idea that the Arabic Script needs reform. The transcription of the glottal stop thus becomes fully automatic. In Latin Script it is written as an apostrophe ('). When it is at the end of the word and preceded by a long vowel, it is written in Arabic Script as ء. In all other situations, it is written as ئ.
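
The two automatic rules in this paragraph can be sketched in Python as follows. This is only an illustration under stated assumptions: the vowel inventories (a, i, u as short vowels and ā, ī, ū as long vowels) and the function names are hypothetical, since the project's full letter table is not given on this page.

```python
# Assumed vowel inventories, for illustration only.
SHORT_VOWELS = set("aiu")
LONG_VOWELS = {"\u0101", "\u012b", "\u016b"}  # ā, ī, ū

HAMZA_ALONE = "\u0621"   # ء
HAMZA_ON_YA = "\u0626"   # ئ


def hamza_form(word: str, index: int) -> str:
    """Pick the Arabic hamza shape for the apostrophe at word[index]."""
    if word[index] != "'":
        raise ValueError("no glottal stop at this position")
    is_final = index == len(word) - 1
    after_long_vowel = index > 0 and word[index - 1] in LONG_VOWELS
    # Word-final after a long vowel: ء. Every other position: ئ.
    return HAMZA_ALONE if is_final and after_long_vowel else HAMZA_ON_YA


def needs_final_ha(word: str) -> bool:
    """True when the word ends in a short vowel, so that «ه» is appended."""
    return bool(word) and word[-1] in SHORT_VOWELS
```

With placeholder strings (not attested Tunisian words), hamza_form("m\u0101'", 2) selects ء, while hamza_form("sa'al", 2) and hamza_form("ma'", 2) both select ئ.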

The determiner "il-", meaning «The», is written as "il" + hyphen + defined noun, so that it can be distinguished from the "il" with which several indefinite nouns begin, and so that a hamza is not added on the Alif of "il-". The canonical spelling is always "il-". However, when it is written as i + sun consonant + hyphen, it is converted to Arabic Script in the same way as "il-", to make the transcription method more flexible for users. When the noun begins with a vowel, the vowel is converted to Arabic automatically as an Alif, without anything having to be added, even if it is preceded by "il-". A stressed consonant is indicated in Arabic Script by adding a Shaddah after the consonant; in Latin Script it is indicated by doubling the consonant, as in the DMG method.
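
A minimal Python sketch of these two conventions is given below. It makes several assumptions that are not stated on this page: the consonant sets are small illustrative subsets, the assimilated article is only recognised when the noun starts with the same consonant, and mark_gemination returns an intermediate form in which the letters are still Latin and only the Shaddah has been placed.

```python
import re

SHADDAH = "\u0651"
# Illustrative subsets; the project's full letter table is not given here.
SUN_CONSONANTS = "tdrzsnl"
CONSONANTS = "bdfklmnrsty"


def normalize_article(token: str) -> str:
    """Rewrite an assimilated article such as 'is-' to the canonical 'il-'."""
    return re.sub(rf"^i([{SUN_CONSONANTS}])-(?=\1)", "il-", token)


def mark_gemination(latin: str) -> str:
    """Replace a doubled consonant with the consonant followed by a Shaddah."""
    return re.sub(rf"([{CONSONANTS}])\1", lambda m: m.group(1) + SHADDAH, latin)
```

Under these assumptions, normalize_article("is-sīf") yields "il-sīf", and a token already written with "il-" is returned unchanged.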

Influenced by the ideas of Al-Toma, all prepositions are written separately from nouns, for example b- il-sīf. This is done for four reasons:

  • This structure was used in the works of Taoufik Ben Brik and Ali Douagi about Tunisian Arabic.
  • This form improves the quality of tokenization and the understanding of Tunisian.
  • When il- is not preceded by a space, it is not detected by the script converter.
  • It helps distinguish some prepositions from the first syllable of some indefinite nouns, for example b- niyya and bnayya.
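
The third reason can be illustrated with a short Python sketch. The regular expression below is an assumption about how such detection might work, not the converter's actual rule: it accepts "il-" only at the start of the text or directly after whitespace.

```python
import re

# Hypothetical detection rule: "il-" must start a token.
ARTICLE = re.compile(r"(?<!\S)il-")


def has_detectable_article(text: str) -> bool:
    """True when an article 'il-' appears at the start of some token."""
    return ARTICLE.search(text) is not None


separated = "b- il-s\u012bf"   # preposition written apart, as proposed
fused = "bil-s\u012bf"         # preposition attached to the article

print(separated.split())                  # two tokens: preposition, then article + noun
print(has_detectable_article(separated))  # article found
print(has_detectable_article(fused))      # article missed
```

Writing the preposition separately also means that plain whitespace splitting already yields the preposition and the noun phrase as distinct tokens.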

The use of uw and iy in Buckwalter transliteration can also be carried over to our method.

  • [u:] is transliterated as ū when it is totally dropped, and as uw when it is not totally dropped, when the word changes gender or number.
  • [i:] is transliterated as ī when it is totally dropped, and as iy when it is not totally dropped, when the word changes gender or number.

For example:

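  • /tu:nsi:/ (sing.), /twa:nsa/ (plur.): [u:] leaves /w/ in the plural and the final [i:] disappears, so the word is written as tuwnsī.
  • /li:l/ (sing.), /lya:li:/ (plur.): [i:] leaves /y/ in the plural, so the word is written as liyl.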
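
The choice between the two spellings can be sketched in Python as below. The flag glide_survives is a hypothetical name for one reading of "not totally dropped", namely that the vowel leaves a /w/ or /y/ behind in another gender or number form; the function does not decide this itself and expects the caller to supply it.

```python
# Plain long-vowel letter versus vowel + glide digraph, per the two rules above.
SPELLINGS = {
    "u": ("\u016b", "uw"),  # ū or uw
    "i": ("\u012b", "iy"),  # ī or iy
}


def spell_long_vowel(vowel: str, glide_survives: bool) -> str:
    """Return the digraph when the vowel leaves a glide in inflected forms."""
    plain, digraph = SPELLINGS[vowel]
    return digraph if glide_survives else plain


# The two examples above, assembled from their consonants and long vowels.
tunisian = "t" + spell_long_vowel("u", True) + "ns" + spell_long_vowel("i", False)
night = "l" + spell_long_vowel("i", True) + "l"
```

Here tunisian evaluates to "tuwnsī" and night to "liyl", matching the spellings given in the examples.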

Results

The method has proved very efficient: it is the first morpho-phonological method of transcription for Tunisian, and it made it possible to create a script converter for Tunisian of only 21.9 KB. Previously, script converters for Standard Arabic and Arabic dialects relied on very large databases of Arabic Script substrings and their corresponding Latin Script substrings.

Benefits for the Wikimedia community

  • Maghrebi communities can write their wikis in Latin Script, so more contributors, mainly from the Maghrebi diaspora, will contribute to them.
  • A script converter for Tunisian is an important advance that can help in using Wikipedia and Wiktionary for NLP projects related to the Arabic dialects.

Timeline

  • Begin: 15 June 2015
  • End: 15 August 2015

Funding

References

Introduction:

  • Jabeur, M. (1987). A sociolinguistic study in Rades, Tunisia. Unpublished PhD dissertation. Reading: University of Reading.
  • Singer, H. R. (1994). Ein arabischer Text aus dem alten Tunis. Semitische Studien unter besonderer Berücksichtigung der Südsemitistik, 275–284.
  • Maamouri, M. (1967). The Phonology of Tunisian Arabic. Ithaca: Cornell University.
  • Gibson, M. (2009). Tunis Arabic. Encyclopedia of Arabic Language and Linguistics, 4, 563–571.
  • Ben Abdelkader, R., & Naouar, A. (1979). Peace Corps/Tunisia Course in Tunisian Arabic.
  • Zribi, I., Graja, M., Khmekhem, M. E., Jaoua, M., & Belguith, L. H. (2013). Orthographic transcription for spoken Tunisian Arabic. In Computational Linguistics and Intelligent Text Processing (pp. 153–163). Springer Berlin Heidelberg.
  • Wikimedia Foundation (2010). Language Converter, https://doc.wikimedia.org/mediawiki-core/master/php/classLanguageConverter.html#details
  • Younes, J., & Souissi, E. (2014). A quantitative view of Tunisian dialect electronic writing. 5th International Conference on Arabic Language Processing, CITALA 2014.
  • Maamouri, M., Graff, D., Jin, H., Cieri, C., & Buckwalter, T. (2004). Dialectal Arabic Orthography-based Transcription. In EARS RT-04 Workshop.
  • Habash, N., Diab, M. T., & Rambow, O. (2012). Conventional Orthography for Dialectal Arabic. In LREC (pp. 711–718).
  • Lawson, D. R. (2010). An assessment of Arabic transliteration systems. Technical Services Quarterly, 27(2), 164–177.

Letters:

  • Jabeur, M. (1987). A sociolinguistic study in Rades, Tunisia. Unpublished PhD dissertation. Reading: University of Reading.
  • Singer, H. R. (1994). Ein arabischer Text aus dem alten Tunis. Semitische Studien unter besonderer Berücksichtigung der Südsemitistik, 275–284.
  • Maamouri, M. (1967). The Phonology of Tunisian Arabic. Ithaca: Cornell University.
  • Gibson, M. (2009). Tunis Arabic. Encyclopedia of Arabic Language and Linguistics, 4, 563–571.
  • Zribi, I., Graja, M., Khmekhem, M. E., Jaoua, M., & Belguith, L. H. (2013). Orthographic transcription for spoken Tunisian Arabic. In Computational Linguistics and Intelligent Text Processing (pp. 153–163). Springer Berlin Heidelberg.
  • Maamouri, M., Graff, D., Jin, H., Cieri, C., & Buckwalter, T. (2004). Dialectal Arabic Orthography-based Transcription. In EARS RT-04 Workshop.
  • Habash, N., Diab, M. T., & Rambow, O. (2012). Conventional Orthography for Dialectal Arabic. In LREC (pp. 711–718).
  • Lawson, D. R. (2010). An assessment of Arabic transliteration systems. Technical Services Quarterly, 27(2), 164–177.

Advanced Reforms:

  • Singer, H. R. (1994). Ein arabischer Text aus dem alten Tunis. Semitische Studien unter besonderer Berücksichtigung der Südsemitistik, 275–284.
  • Zribi, I., Graja, M., Khmekhem, M. E., Jaoua, M., & Belguith, L. H. (2013). Orthographic transcription for spoken Tunisian Arabic. In Computational Linguistics and Intelligent Text Processing (pp. 153–163). Springer Berlin Heidelberg.
  • Maamouri, M., Graff, D., Jin, H., Cieri, C., & Buckwalter, T. (2004). Dialectal Arabic Orthography-based Transcription. In EARS RT-04 Workshop.
  • Habash, N., Diab, M. T., & Rambow, O. (2012). Conventional Orthography for Dialectal Arabic. In LREC (pp. 711–718).
  • Al-Toma, S. J. (1961). The Arabic writing system and proposals for its reform. The Middle East Journal, 403–415.
  • Attia, M. A. (2007, June). Arabic tokenization system. In Proceedings of the 2007 workshop on computational approaches to semitic languages: Common issues and resources (pp. 65–72). Association for Computational Linguistics.
  • Habash, N., & Rambow, O. (2005, June). Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (pp. 573–580). Association for Computational Linguistics.

Conclusion:

  • Sherif, T., & Kondrak, G. (2007, June). Substring-based transliteration. In Annual Meeting - Association for Computational Linguistics (Vol. 45, No. 1, p. 944).
  • Al-Onaizan, Y., & Knight, K. (2002, July). Machine transliteration of names in Arabic text. In Proceedings of the ACL-02 workshop on Computational approaches to semitic languages (pp. 1–13). Association for Computational Linguistics.

External links

Contacts

  • You can contact us using Wikipedia Mail or by leaving a comment on our user pages.
  • You can also contact us on the talk page of this page.