Diversifying Wikipedia: Technological and socio-political aspects

Submission no.
Title of the submission
Diversifying Wikipedia: Technological and socio-political aspects
Type of submission (discussion, hot seat, panel, presentation, tutorial, workshop)
presentation
Author of the submission
Ritesh Kumar
E-mail address
riteshkrjnu@gmail.com
Username
riteshkrjnu
Country of origin
India
Affiliation, if any (organisation, company etc.)
Dr. Bhim Rao Ambedkar University, Agra
Personal homepage or blog
Abstract (at least 300 words to describe your proposal)

Despite huge strides towards greater diversity in the coverage of its articles, Wikipedia still lags behind in the diversity of languages in which knowledge is disseminated. According to Census 2001 data, there are 1635 mother tongues in India. Of these, just 21 languages are represented on Wikipedia, and the number of articles in most of them is negligible compared with the number of articles in a language like English. Besides English, the largest Indian language represented on Wikipedia is Hindi, which has a global rank of 49 (based on the number of articles in the language) with over 116 thousand articles. Table 1 below summarises the number of articles in each of the Indian languages on Wikipedia and their global rank.

Sl. No.  Global Rank  Language              Articles
1        1            English               4,678,838
2        49           Hindi                 116,629
3        59           Newar                 71,163
4        61           Tamil                 65,589
5        63           Urdu                  63,568
6        66           Telugu                60,070
7        76           Marathi               40,994
8        78           Malayalam             37,576
9        81           Bengali               33,519
10       83           Western Punjabi       33,240
11       95           Nepali                26,926
12       97           Gujarati              25,636
13       98           Bishnupriya Manipuri  25,126
14       107          Kannada               17,306
15       109          Punjabi               16,237
16       127          Sanskrit              10,274
17       132          Oriya                 8,560
18       148          Bihari (Bhojpuri)     5,569
19       184          Assamese              2,998
20       243          Sindhi                577
21       248          Maithili              525

Table 1: Wikipedia articles in Indian languages (extracted from [1])
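
To make the scale of the gap concrete, the article counts in Table 1 can be expressed as a share of the English count. The following is only a minimal sketch: the dictionary holds a few rows copied from Table 1, and the function name `share_of_english` is a hypothetical helper introduced here for illustration.

```python
# Article counts copied from a few rows of Table 1 (illustrative subset).
ARTICLES = {
    "English": 4_678_838,
    "Hindi": 116_629,
    "Tamil": 65_589,
    "Maithili": 525,
}


def share_of_english(language: str) -> float:
    """Return the article count of `language` as a fraction of the English count."""
    return ARTICLES[language] / ARTICLES["English"]


if __name__ == "__main__":
    for lang in ("Hindi", "Tamil", "Maithili"):
        # Print each share as a percentage of the English article count.
        print(f"{lang}: {share_of_english(lang):.2%} of the English article count")
```

Even Hindi, the largest of these editions, comes to only a few percent of the English article count.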


This biased representation of languages on Wikipedia could be understood in terms of two major factors:

a. Technological factors: Until very recently, the non-availability of technologies for the input and rendering of the scripts used to write Indian languages was a major hurdle to the widespread use of these scripts on the web. These languages were used only sporadically online, and mostly in the Roman script. However, with the inclusion of most of the characters of almost all the major Indian scripts in the Unicode standard and the development of phonetic keyboards, the issues of both input and rendering of Indian scripts have been solved to a large extent, except for a few grey areas. Even though the input methods for Indic scripts have been simplified and standardised to a great extent, a large population is still unaware of these developments and so continues to refrain from using these tools. This could be addressed through awareness programmes, workshops and training in Indic script input in different parts of the country, carried out through widespread collaboration among institutions and universities.

Another way of improving community participation, and so enriching the content in a particular language, is the use of technology, especially natural language processing tools, to make the task of contributors easier. One way to achieve this is to use machine translation systems to translate articles from a major language like English into other languages, and then have experts edit and adapt those articles in their own language. This would not only provide parallel articles in a large number of languages but also considerably ease the job of editors and contributors. As more articles are translated, the parallel corpora thus created could also be used to further improve the machine translation systems, thereby feeding back into Wikipedia content generation.
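
The translate, post-edit and feed-back loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `machine_translate` is a toy word-by-word glossary lookup standing in for a real machine translation system, `post_edit` stands in for a human expert, and all names (`ParallelCorpus`, `translate_article`, the glossary and corrections dictionaries) are hypothetical and not part of any existing Wikipedia tool.

```python
from dataclasses import dataclass, field


@dataclass
class ParallelCorpus:
    """Stores (source sentence, human-approved translation) pairs."""
    pairs: list = field(default_factory=list)

    def add(self, source: str, target: str) -> None:
        self.pairs.append((source, target))


def machine_translate(sentence: str, glossary: dict) -> str:
    """Toy stand-in for an MT system: word-by-word glossary lookup (assumption)."""
    return " ".join(glossary.get(word.lower(), word) for word in sentence.split())


def post_edit(draft: str, corrections: dict) -> str:
    """Stand-in for a human expert: return the corrected draft if one exists."""
    return corrections.get(draft, draft)


def translate_article(sentences, glossary, corrections, corpus):
    """Translate, post-edit, and record each sentence pair in the corpus."""
    final = []
    for sentence in sentences:
        draft = machine_translate(sentence, glossary)
        edited = post_edit(draft, corrections)
        # The approved pair becomes training data for a later MT system.
        corpus.add(sentence, edited)
        final.append(edited)
    return final


# Illustrative English-to-Hindi data (hypothetical, hand-written).
glossary = {"water": "पानी", "is": "है", "life": "जीवन"}
corrections = {"पानी है जीवन": "पानी जीवन है"}  # the editor fixes the word order

corpus = ParallelCorpus()
article = translate_article(["Water is life"], glossary, corrections, corpus)
```

In this sketch the machine draft has the wrong word order, the editor corrects it, and the corrected pair is what ends up in the corpus; it is these human-approved pairs, not the raw machine output, that would be used to improve the translation system.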

A bigger issue is the non-representation of a large number of languages on Wikipedia. While some of these languages have a script of their own, the majority of the world's languages are spoken languages without one. The question is how to create content in such languages. There are two possible solutions: using a standardised, universal script like the International Phonetic Alphabet (IPA), or using one of the more widely known scripts already used for a nearby major language. Both solutions have problems. IPA is not commonly known, so its use may defeat the purpose of accessibility, while a script commonly used for other languages may not be acceptable to the members of the community. Of the two, however, only a widely known script addresses the accessibility issue.

b. Socio-political factors: Along with the technological factors, the socio-political status of languages and the politics of language in India in general have played a crucial role in creating an imbalance in the representation of languages on Wikipedia. Officially, languages in India are divided into 22 scheduled languages (those included in the 8th Schedule of the Constitution) and 100 non-scheduled languages. Besides these, there are hundreds of languages which are not counted in the official figures because they have fewer than 10,000 speakers (even though most of these lesser-known languages have a robust and stable population of speakers and are not endangered). Furthermore, several languages are classified as dialects or varieties of some major language without any convincing reason. As a result, it becomes very difficult to distinguish between distinct languages and their varieties.

The global policy of the Wikimedia Foundation on opening a new language edition, meanwhile, presents a contradictory picture for the Indian situation. On the one hand, it states that it does not consider “political differences”, so that it can give “unbiased access” to the “sum of all human knowledge” to every single person; on the other hand, it categorically states that “regional dialects”, which are inherently political entities (in the sense that the distinction between languages and dialects is always political), are excluded from opening a new language edition. This policy, along with the socio-political status assigned to Indian languages, has created an environment in which many languages are actively excluded from Wikipedia, thereby undermining its basic goal. While a solution to the larger linguistic issues of India is beyond the scope of this paper, improving the linguistic diversity of Wikipedia so as to make it accessible to all requires a revision of the Wikimedia Foundation's eligibility criteria and language policy, informed by sociolinguistic research on languages, dialects and varieties.

Thus, in order to increase the linguistic diversity of Wikipedia, and thereby its accessibility to a large population, a two-fold effort is necessary: first, the use of the latest technologies, including improved input methods and advanced NLP techniques, for the rapid, large-scale development of articles in many different languages; and second, an understanding of the linguistic scenario of India and the adoption of a better-informed stance towards languages, so as to encourage Wikipedia articles in a large number of languages.

Track
WikiCulture & Community
Language of Track
English
Length of session (if other than 30 minutes, specify how long)
30 minutes
Will you attend the conference at Kolkata at your own cost if your submission is not accepted?
Yes
Slides or further information (optional)
Special requests


Interested attendees

If you are interested in attending this session, please sign with your username below. This will help reviewers to decide which sessions are of high interest. Sign with a hash and four tildes. (# ~~~~).

  1. riteshkrjnu.