Get important modules and find modules similar to each other
- GitHub: abstract-wikipedia-data-science
- Demo Video (3 minutes, YouTube)
- Abstract Wikipedia Data Science with Outreachy Demo
- Demo Audio (41 minutes)
Scribunto modules are used across wiki projects and languages to perform a wide range of functions. To support the aim of Abstract Wikipedia, these community-authored functions now need to be pooled in one place, with redundancy removed and functions modularized where possible. This tool gives users and contributors a place to analyze modules and start merging functions, beginning with important modules and then merging or refactoring similar modules.
This task started as an Outreachy internship project with Liudmila Kalina and Aisha Khatun as interns. Read the blog posts they (and others) published throughout the internship period in the biweekly reports.
What it contains
- A list of important modules. What counts as important may differ slightly across tasks, so the tool provides a way to weight features. The weights are normalized afterwards, so users can enter any number in the weight inputs; a higher number gives the corresponding feature more importance.
- Filters by wiki project (select a few or all projects, such as Wikipedia, Wikibooks, etc.).
- Language filters.
- A list of similar modules, shown when you click a module. Users can then contribute by merging these modules or by writing more modular versions of their functions.
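The weighting described above can be illustrated with a minimal sketch. It assumes each module has numeric features already scaled to a comparable range and that the score is a plain weighted sum; the feature names (`transclusions`, `edits`), the function names, and the scoring formula are hypothetical and are not taken from the tool's code.

```python
def normalize_weights(weights):
    """Scale user-entered weights so they sum to 1 (assumes non-negative inputs)."""
    if any(value < 0 for value in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return {name: value / total for name, value in weights.items()}


def importance_score(features, weights):
    """Weighted sum of a module's features; missing features count as 0."""
    normalized = normalize_weights(weights)
    return sum(normalized[name] * features.get(name, 0.0) for name in normalized)


def rank_modules(modules, weights):
    """Return module titles sorted from most to least important."""
    return sorted(
        modules,
        key=lambda title: importance_score(modules[title], weights),
        reverse=True,
    )


# Hypothetical feature values, assumed to be pre-scaled to the range [0, 1].
example_modules = {
    "Module:A": {"transclusions": 1.0, "edits": 0.0},
    "Module:B": {"transclusions": 0.0, "edits": 1.0},
}
# Any positive numbers work as weights; only their ratio matters.
example_ranking = rank_modules(example_modules, {"transclusions": 2, "edits": 6})
```

Because the weights are normalized, entering `2` and `6` has the same effect as entering `1` and `3`.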
To find important modules and modules similar to each other, the following subtasks were completed in order. All of this work led to the final product in the GitHub repository.
- Collect the source code of all modules in the Module namespace using the MediaWiki API (T270494).
- Collect data related to these modules from replica databases (T270492):
- Analyze the collected data to identify priority modules (T272003):
- Cluster modules to isolate similar modules (T270827):
- Additionally, an attempt to collect pageview data was made (T271400): notebook, PDF.
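As an illustration of the first subtask, the sketch below pages through the Module namespace (namespace ID 828) with the MediaWiki Action API and collects the wikitext of each module. It is a simplified sketch, not the project's actual collection script: the helper names, the `User-Agent` string, and the batch size are assumptions, and error handling and rate limiting are left out.

```python
import json
import urllib.parse
import urllib.request

MODULE_NAMESPACE = 828  # namespace ID of "Module:" pages on Wikimedia wikis


def build_params(continuation=None, batch_size=50):
    """Query parameters: list pages in the Module namespace with their content."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "generator": "allpages",
        "gapnamespace": str(MODULE_NAMESPACE),
        "gaplimit": str(batch_size),
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
    }
    # The API returns a "continue" object that must be sent back unchanged.
    params.update(continuation or {})
    return params


def extract_sources(response):
    """Map page title to source code for pages that include revision content."""
    sources = {}
    for page in response.get("query", {}).get("pages", []):
        revisions = page.get("revisions")
        if revisions:
            sources[page["title"]] = revisions[0]["slots"]["main"]["content"]
    return sources


def http_get_json(api_url, params):
    """Perform one GET request and decode the JSON reply."""
    url = api_url + "?" + urllib.parse.urlencode(params)
    headers = {"User-Agent": "module-collector-sketch/0.1 (example)"}  # hypothetical
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


def collect_modules(api_url, fetch=http_get_json):
    """Follow API continuation until every batch has been read."""
    sources = {}
    continuation = None
    while True:
        response = fetch(api_url, build_params(continuation))
        sources.update(extract_sources(response))
        continuation = response.get("continue")
        if not continuation:
            return sources
```

A call such as `collect_modules("https://en.wikipedia.org/w/api.php")` would fetch one wiki; the `fetch` parameter exists so the paging logic can be exercised offline with a stub.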