

Get important modules and find modules similar to each other

GitHub: abstract-wikipedia-data-science
Demo Video (3 minutes, YouTube)
Abstract Wikipedia Data Science with Outreachy Demo
Demo Audio (41 minutes)


Scribunto modules across wiki projects and languages perform a wide range of functions. To support Abstract Wikipedia, these community-authored functions need to be pooled in one place, with redundancy removed and functions modularized where possible. This tool gives users and contributors a place to analyze modules and begin consolidating them, starting with the most important modules and then merging or refactoring similar ones.

This work started as an Outreachy internship project with Liudmila Kalina and Aisha Khatun as interns. Read the blog posts they (and others) posted throughout the internship period in the biweekly reports.

What it contains

  • A list of important modules. The notion of importance may differ slightly across tasks, so the tool lets users weight the features that contribute to it. The weights are normalized afterwards, so users can enter any number in the weight inputs; a higher number gives that feature more importance.
  • Per-project filters (select a few or all projects, such as Wikipedia, Wikibooks, etc.)
  • Language filters.
  • Clicking a module shows a list of similar modules. Users can then help merge these or write more modular versions of these functions.
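
The weighting of features described in the list above can be sketched as follows. This is only an illustration under stated assumptions, not the tool's actual scoring code: the feature names (`transclusions`, `edits`, `languages`), the function names, and the premise that each feature has already been scaled to the range 0 to 1 are all hypothetical. It shows why any positive numbers work as input: the weights are divided by their sum before they are applied.

```python
def normalize_weights(weights):
    """Scale user-supplied weights so that they sum to 1."""
    if any(value < 0 for value in weights.values()):
        raise ValueError("weights must not be negative")
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return {name: value / total for name, value in weights.items()}


def score_modules(features, weights):
    """Return a weighted importance score per module.

    `features` maps a module title to a dict of feature values.
    Assumption: feature values are already scaled to [0, 1];
    a feature missing for a module counts as 0.
    """
    normalized = normalize_weights(weights)
    return {
        module: sum(normalized[name] * values.get(name, 0.0) for name in normalized)
        for module, values in features.items()
    }


def rank_modules(features, weights):
    """Return module titles ordered from most to least important."""
    scores = score_modules(features, weights)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical feature values for two modules.
example_features = {
    "Module:Example A": {"transclusions": 0.9, "edits": 0.2, "languages": 0.1},
    "Module:Example B": {"transclusions": 0.3, "edits": 0.8, "languages": 0.7},
}
# Weights of 1, 1, 2 behave exactly like 10, 10, 20 after normalization.
example_ranking = rank_modules(
    example_features, {"transclusions": 1, "edits": 1, "languages": 2}
)
```

Because only the ratio between weights matters, a user who cares mostly about language coverage can enter a large number for that feature without having to make the inputs add up to any particular total.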


To find important modules and modules similar to each other, the following subtasks were completed in order. All of this work led to the final product in the GitHub repository.

  • Collect the source code of all modules in the Module namespace using the MediaWiki API (T270494).
  • Collect data related to these modules from the replica databases (T270492).
  • Analysis of collected data to identify priority modules (T272003):
    • Summary report of data analysis: PDF.
    • Summary report on scoring mechanism: PDF.
    • Performing data analysis: notebook, PDF.
    • Scoring modules in terms of importance: notebook, PDF.
  • Clustering modules to isolate similar modules (T270827):
    • Summary report on clustering methods tested and findings: PDF.
    • Analysis of contents for modules under the same title: notebook, PDF.
    • Similarity analysis: notebook, PDF.
    • Tuning clustering methods: notebook, PDF.
  • Additionally, an attempt to collect pageview data was made (T271400): notebook, PDF.
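
As an illustration of the first subtask above, the sketch below shows one way to page through the Module namespace of a single wiki with the MediaWiki Action API. It is not the project's collection script; the function names and the batch size of 50 are assumptions made for this example. It relies on namespace 828, the ID that Scribunto assigns to Module pages, and on the API's `continue` mechanism for fetching further batches.

```python
import json
import urllib.parse
import urllib.request

MODULE_NAMESPACE = 828  # namespace ID Scribunto uses for "Module:" pages


def build_query(continuation=None):
    """Build Action API parameters for one batch of module pages."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "generator": "allpages",
        "gapnamespace": str(MODULE_NAMESPACE),
        "gaplimit": "50",  # assumed batch size for this sketch
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
    }
    # The API returns a "continue" object that is merged into the next request.
    params.update(continuation or {})
    return params


def extract_sources(response):
    """Map page title to source text for pages that carry revision content."""
    sources = {}
    for page in response.get("query", {}).get("pages", []):
        revisions = page.get("revisions") or []
        if revisions:
            sources[page["title"]] = revisions[0]["slots"]["main"]["content"]
    return sources


def fetch_all_sources(api_url, user_agent):
    """Follow continuation until every module page of one wiki is collected."""
    sources, continuation = {}, None
    while True:
        url = api_url + "?" + urllib.parse.urlencode(build_query(continuation))
        request = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(request) as reply:
            data = json.load(reply)
        sources.update(extract_sources(data))
        continuation = data.get("continue")
        if not continuation:
            return sources
```

Calling `fetch_all_sources` with a wiki's `api.php` URL and a descriptive User-Agent string would return a dictionary of module titles and Lua source for that wiki. Covering all projects and languages would mean repeating the call per wiki, which this sketch leaves out.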