CIS-A2K/Research/Study on Infrastructural Needs of Indian Language Wikisource Projects

Introduction

This is a pilot study to collect baseline data on infrastructural needs related to Wikisource platforms in India. It will primarily focus on identifying current challenges and knowledge gaps in technological capacity, resources and training on Wikisource platforms in selected languages. Through an analysis of this data, the study aims to provide learnings and recommendations on potential strategies to address these gaps, including through collaborative intervention and training.

Context

Indian Wikisource projects have been a crucial aspect of content enrichment efforts in the open knowledge movement in India. As a project dedicated to creating a ‘growing free content online library of source texts, as well as translations of source texts in any language’, Wikisource serves as an important hub of knowledge that supports content development on various sister projects. Wikisource has seen tremendous growth in Indian languages, especially over the last decade, with increased efforts in digitization, availability of source texts under open licenses and better Optical Character Recognition (OCR) technology. However, several communities also continue to grapple with persistent challenges, including a lack of active participation and of efforts to bring more content onto the platform, produce translations, and encourage the use of source texts across projects, among others. While the majority of contributors are comfortable with transcribing texts, more technical tasks such as importing new books, creating Index pages and transcluding books are left to a very small number of contributors.

A short research needs assessment conducted earlier this year also highlighted the need for technological resources for Wikimedia projects, such as advanced tools and platforms, and the capacity and training to work with them as important areas of work in the Indian language communities. While this is not specific to Wikisource alone, observations by community members and active Wikisource contributors over the last few years illustrate that these concerns persist in the Wikisource community as well. This pilot study is therefore an attempt to identify these challenges, by collecting baseline data on priority areas of work on Wikisource platforms in India, beginning with a focus on selected language communities, and areas of interest. This would help contributors arrive at a better understanding of the requirements of communities, and aid in developing potential strategies to address them.

Research objectives

The study will have two areas of focus:

What are the key challenges with working on Indian language Wikisource projects currently? These may include anything from obstacles in Wikisource workflow, policies and open licenses, to challenges such as quality of content and lack of community engagement.
What are the gap areas and spaces for improvement in the infrastructure of these platforms, especially related to technological capacity, resources and training?

Research methods

Through a mixed methods approach, comprising surveys, interviews with community members and analysis of Wikisource discussion pages and statistics over the last 2 years, the study will collect data related to common challenges and gaps on Wikisource platforms in India. The team will work with three selected Indian language communities; the criteria for selection will include language family and size, the amount of content on Wikisource, and level of activity, among others. These criteria will be clearly outlined in the report as well. The project team will work with 2 short-term volunteers/research assistants over a 2-3 month period for data collection through interviews and surveys. The research assistants will also provide translation support as needed, and work closely with community members. The analysis of the data collected will help identify some of the key areas of work on Indian Wikisource projects.

The criteria considered for selection of the language communities for the study are language family and size, amount of content on Wikisource (according to bytes/number of proofread pages), recent activity and a good track record/sustained progress and challenges with the same over the last several years. Non-Wikimedia factors, such as visibility and prevalence of the languages on other online platforms, technical and cultural resources and complexities of working with certain languages etc. were also considered during the selection process. Keeping these in mind, the languages selected for this study are as follows:

Tamil Wikisource (One of the largest Wikisource communities in India, which has considerable content, is active and has seen steady growth over the last few years)
Assamese Wikisource (A growing Wikisource community, which has also seen a lot of activity in recent years)
Malayalam Wikisource (A large and active Wikisource community, which in recent years has seen some decline in engagement, despite good resources and activity on other Wiki platforms)

Given the limited sample size of just three language communities, the study is an attempt to offer an insight into the current challenges of the Indian language communities, and ways to address the same. The study, therefore, utilises a purposive sampling technique and may need several steps before the observations/findings may be considered to be representative at scale.

Research outputs

The outputs of this study include a final report outlining key observations and learnings from an analysis of the data collected. These would comprise identifying thematic knowledge gaps, challenges and potential areas of work, including community and volunteer efforts. The report would aim to share a brief set of recommendations/strategies on how to address the above issues. The team will also produce at least two short blog posts on progress of work, and final recommendations.

Updated Timeline

July-August, 2021: Research plan/concept note, including objectives, methodology and proposed outputs finalised and published on Meta. Open call for community volunteers/research assistants for data collection published.

September-December, 2021: Survey and interview methods finalised, orientation and on-boarding for research assistants. Translations of questionnaires etc. where required.

January-March, 2022: Fieldwork to be completed (mostly online interviews and surveys). Analysis of data collected. Research assistants to draft and publish a blog post on data collection and review so far.

April 2022: Draft report to be shared for internal and external review. Recommendations, if any, to be considered in the proposal plan for 2022-23.

May-June, 2022: Report to be finalised and published.

Updates

July-December 2021

The first couple of months on this project were spent discussing, drafting and finalising the research design for the study. This process involved drafting an initial concept note, selecting the languages of focus and drawing up a list of reviewers for feedback. The concept note was published and shared for internal and external feedback, namely with team members, a small cohort of community members and the Wikimedia India mailing list. Apart from this, interview questionnaires and survey forms were developed for data collection. Translation of the interview questionnaire was undertaken with the help of community members. The list of respondents for the interviews across languages was finalised as well, and initial communication with them was completed ahead of the data collection process.

Given certain unforeseen challenges, such as technical, infrastructural and logistical issues with data collection, timelines for review of the research design, and availability of people, there have been changes in the timeline for the project, which are reflected above. Data collection is presently underway, and over the next couple of months the team will look at finishing this work and moving on to reviewing and putting together our learnings in the form of a final report.

January-September 2022

The first few months of this period were spent completing data collection through the survey and interviews with community members from the selected languages. Given the limited number of responses, and because the period coincided with a surge in Covid cases across the country, the deadline for the survey had to be extended a few times. The exercise was finally completed by the end of May 2022, with a total of 21 responses received. Similarly, there were some challenges with conducting interviews with community members across the selected languages, given limited availability, challenges due to the pandemic and technical issues with connectivity. As a result, the data set and the scope of the study had to be reduced significantly, and the team decided to work primarily with the available data collected through the survey. A draft report based on these responses was prepared and shared with the team for internal and external review. The report offers several insights on key aspects related to Wikisource, namely technological infrastructure, community engagement, capacity-building, open access and content curation policies. Apart from an overview of challenges, it also provides reflections on potential strategies to address them. The final report on the study is now published and available on Wikimedia Meta-Wiki.