CIS-A2K/Events/Wikimedia orientation Session for volunteers group working on digitisation of North-East Indian languages

CIS-A2K has been exploring collaborations with several organisations in the North-East region of India for the last two years. Knowledge about this region and its various language communities is poorly represented in Wikimedia projects. The region has a rich tribal culture and great linguistic diversity. One volunteer group, VANI, which works on vulnerable and endangered languages, approached CIS-A2K for collaboration. The first session, held to introduce Wikimedia projects and to decide the components of the proposed project, was conducted on 3 June 2021. Two coordinators of VANI participated in this session.

Background

The focus of this project is on offering barrier-free, open access to resources in North-East Indian languages. North-East India is home to more than 200 languages, of which 82 are listed as Vulnerable, 63 as Definitely Endangered, 6 as Severely Endangered, 46 as Critically Endangered and 6 as Extinct (The Guardian Dataset). Some of these languages are official languages of their states and are widely spoken in the region. Others have only a few hundred native speakers. However, irrespective of the size of the native population or the official status of the language, all of them lack free and open-source data. Given the large number of languages spoken in this region, the current scope of the project covers 4 spoken languages in 2 states, leaving the rest as a future objective. These languages are Adi, Idu Mishmi, Khasi and Nyishi.

Goal

The goal of this project is to facilitate the study of these languages by making existing resources discoverable and by building open-source structured datasets and tools within the Wikimedia projects, in order to enrich the research landscape for North-East Indian languages.

Local institutional partners

  1. Idu Mishmi Cultural and Literary Society (IMCLS) - The apex body of the Idu Mishmi tribe.
  2. Native speakers of the Nyishi language
  3. Research Institute of World’s Ancient Traditions Cultures and Heritage (RIWATCH)
  4. Native speakers of the Miju Mishmi language
  5. Adi Bane Kebang - Adi literary body.
  6. Native speakers of the Khasi community

Report

At the start of the session, the VANI coordinators explained the background and context of the project and shared their work on revitalising languages through community participation. After several gatherings of community elders, a bilingual dataset of 981 English and Idu Mishmi words in Roman script was created; it is a structured dataset in Excel format. A dedicated team of Nyishi speakers of different dialects is contributing speech samples for the creation of speech corpora. We then introduced the coordinators to various Wikimedia projects. Specific platforms were discussed at length: lexicographical data in Wikidata, the Lingua Libre project for building a repository of pronunciations on Wikimedia Commons, and Wikisource for uploading digitised books. The various components of the proposed collaborative project were also discussed. An MoU detailing the process and the roles of both parties will be developed in the next month. Follow-up training sessions are planned for August 2021.
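
As an illustration of how a bilingual word list like the one described above might be prepared for lexicographical work, the following is a minimal sketch only. It assumes the spreadsheet has been exported to CSV with two columns named `english` and `idu_mishmi`; those column names, the record fields and the sample rows are hypothetical (the sample values are placeholders, not real Idu Mishmi words), and the output is a simplified record, not the actual Wikidata lexeme format.

```python
import csv
import io


def rows_to_entries(csv_text, lemma_col="idu_mishmi", gloss_col="english"):
    """Turn bilingual word-list rows into simple lexeme-style records.

    Column names and the record layout are assumptions made for this sketch.
    """
    entries = []
    seen = set()
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        lemma = (row.get(lemma_col) or "").strip()
        gloss = (row.get(gloss_col) or "").strip()
        if not lemma or not gloss:
            continue  # skip rows where either side is missing
        key = (lemma.lower(), gloss.lower())
        if key in seen:
            continue  # skip exact duplicate pairs
        seen.add(key)
        # "Latn" reflects that the source list is written in Roman script.
        entries.append({"lemma": lemma, "script": "Latn", "gloss_en": gloss})
    return entries


# Placeholder rows: "word_a" and "word_b" stand in for real entries.
SAMPLE = "english,idu_mishmi\nwater,word_a\nfire,word_b\nwater,word_a\nstone,\n"

entries = rows_to_entries(SAMPLE)
for entry in entries:
    print(entry)
```

Cleaning out incomplete and duplicate rows before any upload keeps the later review by native speakers focused on the words themselves rather than on spreadsheet errors.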