Grants:Programs/Wikimedia Research Fund/AI-based Categorization of new Wikipedia Articles

status: not funded
AI-based Categorization of new Wikipedia Articles
start and end dates: July 2023 - July 2024
budget (USD): 50,000 USD
fiscal year: 2022-23
applicant(s): Ashutosh Modi, Shreyansh Agarwal, Vansh Bansal, Aryan Vora, Mandar Wayal and Prem Bharwani

Overview

Applicant(s)

Ashutosh Modi, Shreyansh Agarwal, Vansh Bansal, Aryan Vora, Mandar Wayal and Prem Bharwani

Affiliation or grant type

IIT Kanpur

Author(s)

Ashutosh Modi, Shreyansh Agarwal, Vansh Bansal, Aryan Vora, Mandar Wayal and Prem Bharwani

Wikimedia username(s)

Project title

AI-based Categorization of new Wikipedia Articles

Research proposal

Description

Description of the proposed project, including aims and approach. Be sure to clearly state the problem, why it is important, why previous approaches (if any) have been insufficient, and your methods to address it.

Problem Description:

Wikipedia groups similar articles into overlapping trees of categories. Although the process is reasonable, it is entirely manual, and the categories on Wikipedia in languages other than English are inconsistent.

Manual categorization is prone to errors. The editor has complete discretion when categorizing an article, which may cause issues such as:

-Creation of similar category pages, resulting in redundancy

-Inaccurate and incomplete categorization, leading to longer review times

Wikipedia supports 328 foreign languages along with English. Although some articles have English translations, the links between them aren't reliable enough to use as a navigation tool. This raises some issues, including:

-Incorrect categorization of foreign language articles, e.g., a Hindi article on ‘यंत्र शिक्षण’ (Machine Learning) is incorrectly categorized under ‘शिक्षा की पद्धतियां’ (methods of education), which differs vastly from the categories associated with its English counterpart

-Uncertain and discordant classification of foreign texts without an English equivalent, such as a Hindi article on ‘षड्रस’ (shadras), a term with limited regional usage in India. Such articles have no corresponding English categories, which widens the differences between language editions

Insufficient community-specific contributors, inconsistent grammatical standards, and locally developed jargon all increase the probability of editorial errors. Therefore, Wikipedia's categorization process must be standardized across languages.

Proposed Solution:

We have worked on automating the categorization of Wikipedia articles by using the hierarchy (knowledge graph) over Wikipedia categories in an Extreme Classification model, and we obtained encouraging results on English Wikipedia.

After optimizing categorization on English Wikipedia, we want to address the challenges of categorization in foreign languages. To that end, we propose using NLP to develop a multilingual knowledge graph that can connect documents and categories in other languages to those in English Wikipedia. Furthermore, we will use this graph to model document-document correlations, making classification more robust and consistent across all languages.
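As an illustration only, the idea of connecting a non-English article to English categories can be sketched as a nearest-neighbour lookup in a shared embedding space. This is not the applicants' model: the function name `suggest_categories`, the three-dimensional vectors, and the category list are all hypothetical toy values, and a real system would obtain embeddings from a multilingual encoder that the proposal does not specify.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def suggest_categories(article_vec, category_vecs, top_k=2):
    """Rank English categories by similarity to an article embedding."""
    scored = [(cosine(article_vec, vec), name)
              for name, vec in category_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:top_k]]


# Hypothetical toy embeddings; real vectors would come from a multilingual encoder.
category_vecs = {
    "Machine learning": [0.9, 0.1, 0.0],
    "Artificial intelligence": [0.8, 0.3, 0.1],
    "Educational practices": [0.1, 0.9, 0.2],
}

# Hypothetical embedding of a Hindi article about machine learning.
article_vec = [0.85, 0.15, 0.05]

suggestions = suggest_categories(article_vec, category_vecs)
print(suggestions)
```

With these made-up vectors, the education-related category ranks last, which is the behaviour the Hindi example in the problem description calls for.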

Our research will help avoid category loops/duplication, suggest the most relevant categories along with hierarchical relationships, use foreign language embeddings for English categorization of foreign text, propose new categories if none exist, and build an efficient article suggestion mechanism.
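One of the goals above, avoiding category loops, can be illustrated with a standard depth-first cycle search over a category graph. The sketch below is a generic technique, not the applicants' implementation; the function name `find_category_loop` and the toy category names are hypothetical, and the graph is assumed to be a plain mapping from each category to its subcategories.

```python
def find_category_loop(subcats):
    """Return one loop as a list of category names, or None if there is none.

    `subcats` maps a category name to a list of its subcategory names.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for child in subcats.get(node, []):
            state = color.get(child, WHITE)
            if state == GRAY:
                # The child is already on the current path, so this edge closes a loop.
                return path[path.index(child):] + [child]
            if state == WHITE:
                loop = visit(child)
                if loop:
                    return loop
        path.pop()
        color[node] = BLACK
        return None

    for root in list(subcats):
        if color.get(root, WHITE) == WHITE:
            loop = visit(root)
            if loop:
                return loop
    return None


# Hypothetical toy hierarchies.
acyclic = {
    "Science": ["Computer science"],
    "Computer science": ["Machine learning"],
    "Machine learning": [],
}
looped = {"A": ["B"], "B": ["C"], "C": ["A"]}

print(find_category_loop(acyclic))
print(find_category_loop(looped))
```

The recursive form keeps the sketch short; a graph as deep as a real category tree would likely need an iterative traversal to stay within Python's recursion limit.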

Personnel

N/A

Budget

Approximate amount requested in USD.

50,000 USD

Budget Description

Briefly describe what you expect to spend money on (specific budgets and details are not necessary at this time).

The tentative associated costs are:

Equipment (CPUs + GPUs) - $26,500

Manpower (Research Assistant + 3 Students + Project Manager) - $11,000

Contingency Overhead - $5,000

University Overhead - $7,500

The total grant request is $50,000.

Impact

Address the impact and relevance to the Wikimedia projects, including the degree to which the research will address the 2030 Wikimedia Strategic Direction and/or support the work of Wikimedia user groups, affiliates, and developer communities. If your work relates to knowledge gaps, please directly relate it to the knowledge gaps taxonomy.

-Spur more contributions from academics worldwide and assist WikiProjects in automating the time-consuming and arduous process of categorizing Wikipedia articles

  • By listing articles under every appropriate category, we make Wikipedia category pages a more useful resource for finding relevant articles
  • Help build an error-free, user-friendly, and efficiently navigable network of Wikipedia pages, thereby improving the user experience of Wikimedia projects
  • Our approach intends to standardize the linking of multilingual articles and streamline the sharing of free information across communities and interfaces, in line with the 2030 Wikimedia Strategic Direction

Dissemination

Plans for dissemination.

We (the professor and researchers of the Indian Institute of Technology Kanpur) propose to bridge the linguistic gaps across all Wikis through this research. The end users are the academics and community advocates who can take our findings and initiate action to address the issues identified in the proposal. Our solutions for English Wikipedia performed better on publicly available datasets than the approaches already in use. We will make full use of the Wikimedia community's resources to further our research objectives.

Past Contributions

Prior contributions to related academic and/or research projects and/or the Wikimedia and free culture communities. If you do not have prior experience, please explain your planned contributions.

In 2021, two of our proposals were accepted by the Google AI for Social Good program and received cumulative funding of $60,000. One project focused on ‘Digitizing and AI-fying the Maternity Care in Rural India’ and the other on ‘Predicting the evolution of deforestation in Uganda’. Both projects have been completed and successfully deployed in their respective regions.


I agree to license the information I entered in this form excluding the pronouns, countries of residence, and email addresses under the terms of Creative Commons Attribution-ShareAlike 4.0. I understand that the decision to fund this Research Fund application, the application itself along with all the information entered by me in this form excluding the pronouns, countries of residence, and email addresses of the personnel will be published on Wikimedia Foundation Funds pages on Meta-Wiki and will be made available to the public in perpetuity. To make the results of your research actionable and reusable by the Wikimedia volunteer communities, affiliates and Foundation, I agree that any output of my research will comply with the WMF Open Access Policy. I also confirm that I have read the privacy statement and agree to abide by the WMF Friendly Space Policy and Universal Code of Conduct.

Yes