Research:Lexeme based approach for the development of technical vocabulary for underserved languages: A case Study on Moroccan Darija
This page documents a planned research project.
Information may be incomplete and change before the project starts.
While English dominates academic research [1][2], many languages struggle to keep pace with emerging scientific concepts [3]. This linguistic disparity creates significant obstacles for knowledge equity and accessibility within Wikimedia projects, particularly Wikipedia.
This project addresses the challenge of developing technical vocabulary for low-resource, unstandardized languages, using Moroccan Darija as a case study. The research aims to create a methodology that empowers Wikimedian communities to generate new terms, thereby enriching their wikis and promoting knowledge equity.
The methodology will leverage Wikidata lexemes and involve editors in creating new, grammatically sound, and semantically accurate technical words. This approach seeks to overcome the limitations of current ad-hoc methods, which lead to inconsistencies and edit conflicts and hinder the development of comprehensive knowledge bases in underrepresented languages.
The project will cover data collection, linguistic analysis, and the development of accessible guidelines. The resulting methodology will be applied to generate new terms, which will then be assessed through a community survey. The research findings will be disseminated to facilitate adoption by other language communities, with project outcomes including enriched Wikidata lexemes, datasets, scripts, and a scientific publication.
Research Questions
This project seeks to answer the following research questions:
- How can Wikidata lexemes be effectively leveraged to generate new technical terms in low-resource languages?
- What linguistic patterns can be extracted from existing lexical data to facilitate the creation of new terms that adhere to a language's grammatical, syntactic, and morphological rules?
- How can a methodology be developed to enable Wikipedia editors, without specialized linguistic expertise, to contribute to the creation of new, standardized technical vocabulary in their native languages?
By investigating these questions, the project aims to provide a robust and accessible methodology for generating new terms, directly addressing the challenges faced by editors and promoting greater linguistic diversity and knowledge equity within the Wikimedia movement.
Methods
Our research employs a mixed-methods approach to develop and evaluate a methodology for generating new terminology in low-resource, unstandardized languages, taking Moroccan Darija as a use case. The activities and processes planned for the project are described below.
Data Collection
The project will gather a comprehensive dataset in Moroccan Darija from a variety of sources. This process involves multiple stages:
- Source Identification - We will identify and prioritize high-quality sources, both online and in paper format. Online sources will encompass digital dictionaries (including Wiktionaries) and other reputable online content in Moroccan Darija. Offline sources will include scanned versions of relevant dictionaries and linguistic texts.
- Data Gathering - Data will be collected using two main techniques. For physical documents, Optical Character Recognition (OCR) software will be employed to extract text. For online sources, web-scraping techniques will be utilized to systematically gather lexical data.
- Data Preparation - The raw data obtained will undergo a rigorous preparation and cleanup process. This will involve correcting errors introduced by OCR, standardizing the data format, and organizing the words and their associated information (definitions, etymologies, etc.) into a format compatible with Wikidata. We will also establish the relationships between these words to capture their semantic and morphological connections.
- Wikidata Integration - The prepared data will be uploaded to Wikidata using QuickStatements and OpenRefine, two well-established tools for efficient data import and management in Wikidata. This step ensures that the generated lexical data is structured, accessible, and interoperable.
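The Data Preparation stage above can be illustrated with a minimal sketch. It assumes a hypothetical intermediate format in which each OCR or scraped line reads `lemma | category | gloss`; the project has not specified one. The `Entry` class, the function names, and the choice to strip the Arabic tatweel character are all illustrative assumptions, not the project's actual pipeline.

```python
import unicodedata
from dataclasses import dataclass

# Arabic tatweel (kashida): a purely typographic stretching character.
# Stripping it is an assumption of this sketch, not a project decision.
TATWEEL = "\u0640"


@dataclass(frozen=True)
class Entry:
    """One cleaned lexical record (hypothetical structure)."""
    lemma: str
    category: str
    gloss: str


def clean_text(text):
    """Normalize Unicode, drop tatweel, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace(TATWEEL, "")
    return " ".join(text.split())


def parse_line(line):
    """Parse 'lemma | category | gloss'; return None for malformed lines."""
    parts = [clean_text(part) for part in line.split("|")]
    if len(parts) != 3 or not all(parts):
        return None
    lemma, category, gloss = parts
    return Entry(lemma, category.lower(), gloss)


def prepare(lines):
    """Return cleaned entries in input order, skipping bad lines and duplicates."""
    seen = set()
    entries = []
    for line in lines:
        entry = parse_line(line)
        if entry is None or entry in seen:
            continue
        seen.add(entry)
        entries.append(entry)
    return entries
```

Records cleaned this way would still need to be mapped to Wikidata's lexeme structure (language, lexical category, forms, senses) before import with QuickStatements or OpenRefine; that mapping is not shown here.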
Data Analysis and Methodology Development
The core of this research involves analyzing the collected lexical data to identify patterns in terminology generation and developing a robust methodology. This will be conducted in collaboration with a linguistics researcher specializing in Moroccan Darija, our case-study language. The steps in this part are:
- Pattern Extraction - We will analyze the data to extract existing patterns of terminology generation in Moroccan Darija, considering various linguistic factors such as morphology, syntax, and semantics. This analysis will examine how new terms are formed, adapted from other languages, and used in different contexts.
- Methodology Creation - Based on the extracted patterns, we will develop general guidelines and methods for generating new words. This methodology will be designed to be accessible and applicable by individuals without specialized linguistic expertise.
- Methodology Application - The developed approach will be applied to a chosen list of terms. This list will be drawn from a set of known concepts (e.g., a subset of the list of articles every Wikipedia should have) to ensure that the generated terminology is relevant and widely applicable.
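One possible form of the pattern extraction and application steps above is root-and-template analysis, sketched below. This is a hedged illustration, not the project's method: the function names are hypothetical, and the sample pairs are romanized Standard Arabic forms chosen only because they are widely known. They are not Darija data from the project.

```python
from collections import Counter


def extract_template(word, root):
    """Replace the root consonants, matched left to right, with 1, 2, 3, ...

    Returns None if the root consonants do not all occur in order.
    Assumes romanized input without digits and roots of at most 9 consonants.
    """
    out = []
    i = 0
    for ch in word:
        if i < len(root) and ch == root[i]:
            i += 1
            out.append(str(i))
        else:
            out.append(ch)
    return "".join(out) if i == len(root) else None


def apply_template(template, root):
    """Fill the numbered slots of a template with the consonants of a root."""
    return "".join(root[int(ch) - 1] if ch.isdigit() else ch for ch in template)


def count_templates(pairs):
    """Count how often each template occurs across (word, root) pairs."""
    counts = Counter()
    for word, root in pairs:
        template = extract_template(word, root)
        if template is not None:
            counts[template] += 1
    return counts


# Illustrative pairs only (romanized Standard Arabic, not project data).
PAIRS = [
    ("katib", "ktb"), ("maktub", "ktb"), ("maktaba", "ktb"),
    ("daris", "drs"), ("madrus", "drs"), ("madrasa", "drs"),
]

if __name__ == "__main__":
    for template, n in count_templates(PAIRS).most_common():
        print(template, n)
```

Frequent templates could then serve as candidate word-formation rules for a linguist to review. The greedy left-to-right matching is a simplification that would mis-handle words in which a root consonant also appears as part of an affix.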
Community Survey
To assess the acceptance and usability of the newly generated terminology, a survey will be conducted among native speakers of Moroccan Darija, both within and outside the Wikimedia movement. The aim of this survey is to gather feedback on the new words. The activities involved in this part are:
- Survey Design - The survey will present a set of approximately 100 words generated using our developed methodology. Participants will be asked to assess the terms for clarity, accuracy, naturalness, and appropriateness for use in technical articles.
- Participant Recruitment - The survey will be shared with Moroccan Darija speakers, prioritizing Wikimedians, through various channels, including wikis, social media, and relevant community networks, to ensure a diverse range of perspectives. We will actively engage with Wikipedia editors and other interested stakeholders to maximize participation.
- Analysis - The survey results will be analyzed to evaluate the effectiveness of the terminology generation methodology. We will measure the level of agreement among respondents regarding the acceptability of the generated terms and identify any patterns or areas for improvement.
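The analysis step above could start with a simple per-term summary such as the sketch below. It assumes respondents rate each term on a 1 to 5 scale and that a rating of 4 or higher counts as acceptance. The scale, the thresholds, and the function names are assumptions of this sketch; the survey design has not fixed them.

```python
from statistics import mean


def summarize(ratings, accept_at=4):
    """Per-term count, mean rating, and share of ratings >= accept_at.

    `ratings` maps each term to a list of integer scores.
    Terms with no responses are skipped.
    """
    summary = {}
    for term, scores in ratings.items():
        if not scores:
            continue
        accepted = sum(1 for score in scores if score >= accept_at)
        summary[term] = {
            "n": len(scores),
            "mean": mean(scores),
            "acceptance": accepted / len(scores),
        }
    return summary


def needs_revision(summary, min_acceptance=0.6):
    """Return, in sorted order, the terms whose acceptance share is too low."""
    return sorted(
        term for term, stats in summary.items()
        if stats["acceptance"] < min_acceptance
    )
```

Terms returned by `needs_revision` would be candidates for regeneration or closer linguistic review. A fuller analysis might add a formal inter-rater agreement statistic, which is omitted here.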
Dissemination and Reporting
The findings of this research, including the developed methodology and its evaluation, will be disseminated through a project report, conference presentations, and an academic publication. We will also explore ways to integrate the methodology and its results into Wikimedia resources and tools to facilitate its adoption by the Wikimedian communities. A visual representation of our approach is provided below, outlining the process flow and the input-output relationships between individual activities.
Timeline
The timeline below summarizes the activities planned for the project implementation phase (July 2025 - June 2026).
Period | Activity |
---|---|
July 2025 - June 2026 | Project Management |
July - September 2025 | Gathering Sources for Lexemes |
July - September 2025 | Data Preparation and Cleanup |
October 2025 - February 2026 | Data Analysis & Methodology Development |
February - March 2026 | Survey Preparation and Sharing |
April - May 2026 | Analysis of Survey Results |
January - June 2026 | Reporting and Paper Writing |
Policy, Ethics and Human Subjects Research
Results
TBD
Resources
References
1. Marginson, S., & Xu, X. (2023). “Hegemony and inequality in global science: Problems of the center-periphery model”. Comparative Education Review, 67(1), 31-52.
2. Ortega, R. P. (2020). “Science's English dominance hinders diversity, but the community can work toward change”. Science.
3. Amano, T., González-Varo, J. P., & Sutherland, W. J. (2016). “Languages are still a major barrier to global science”. PLoS Biology, 14(12), e2000933.