Wikipedia Administrative Pages Analytics/Technical Documentation
General information
The aim of the project Wikipedia Administrative Pages Analytics is to generate datasets, visualizations, and tools to understand administrative pages across Wikipedia language editions.
This page serves as technical documentation for understanding the code, datasets, and databases generated.
Code. The scripts to generate all outputs are available in the public GitHub repository:
https://github.com/marcmiquel/WAPA
It is built with:
- Python 3 - To manage the data.
- SQLite 3 - To store the data.
Databases. The code generates several databases containing the relevant data and features for the content gap metrics.
Pages and features database:
https://wapa.wmcloud.org/databases/wikipedia_administrative_pages_analytics_production.db (3.5 GB)
This database contains the admin pages mapping for every Wikipedia language edition (pages, features, and groups).
Stats and metrics database:
https://wapa.wmcloud.org/databases/stats_production.db (< 1 GB)
This database contains the basic metrics and statistics (e.g., selection, average of metrics, etc.).
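Once downloaded, both files can be opened with Python's built-in sqlite3 module. The following is a minimal sketch, not the project's own code: it assumes that the per-language tables follow the `cawiki_pages` naming pattern used in this documentation, and the helper names `list_page_tables` and `top_pages_by_edits` are hypothetical. The demo runs against a small in-memory database so that it does not depend on the downloaded file.

```python
import sqlite3


def list_page_tables(conn):
    """Return the names of the per-language page tables, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    # Assumption: page tables are named <languagecode>wiki_pages (e.g. cawiki_pages).
    return sorted(name for (name,) in rows if name.endswith("wiki_pages"))


def top_pages_by_edits(conn, table, limit=5):
    """Return (page_title, num_edits) for the most edited pages of one table."""
    # Table names cannot be bound as SQL parameters, so validate them first.
    if table not in list_page_tables(conn):
        raise ValueError(f"unknown page table: {table}")
    query = (
        f"SELECT page_title, num_edits FROM {table} "
        "ORDER BY num_edits DESC LIMIT ?"
    )
    return conn.execute(query, (limit,)).fetchall()


# Demo on an in-memory database with made-up rows.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE cawiki_pages (page_title text, num_edits integer)")
demo.executemany(
    "INSERT INTO cawiki_pages VALUES (?, ?)",
    [("Example_page_A", 3), ("Example_page_B", 10)],
)
print(list_page_tables(demo))
print(top_pages_by_edits(demo, "cawiki_pages", limit=1))
```

To query the real data, replace `":memory:"` with the local path of the downloaded .db file and skip the CREATE/INSERT statements.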
Database
Intermediate database
- Wikidata Qitems, properties and Labels (wikidata.db):
https://wdo.wmcloud.org/databases/wikidata.db (24 GB)
This database contains annotations of Wikidata items, which we later use to create Wikidata-derived features for pages.
wikidata-item, features of wikidata-items
This database is not a final output, but it is used to generate the final datasets database.
Final database
- pages, features, and groups database (wikipedia_administrative_pages_analytics.db):
https://wapa.wmcloud.org/databases/wikipedia_administrative_pages_analytics_production.db
This database contains the admin pages mapping for every Wikipedia language edition (pages, features, and groups).
CREATE TABLE cawiki_pages (
    -- general
    qitem text,
    page_id integer,
    page_title text,
    page_title_same_category_title text, -- when there is a category or page with the same title
    date_created integer,
    first_timestamp_lang text, -- language of the oldest timestamp for the page
    num_interwiki integer,

    -- ANNOTATION 1: NAMESPACES
    page_namespace integer,

    -- ANNOTATION 2: CATEGORIES
    -- characteristics of categorization
    -- for categories only
    num_categories_contains integer, -- the number of category pages it contains
    num_pages_contains integer, -- the number of pages it contains
    num_pages_admin_contains integer, -- the number of pages (NS = 4, 12, 14, 100) it contains
    num_level_from_top integer, -- the number of jumps from the top of the crawling (based on the very top: "Main")
    -- for all pages
    num_categories_has integer, -- the number of categories it has
    actual_categories text, -- names separated by ;
    main_category text, -- name of the largest category at that level, i.e., the category containing more pages
    -- admin space type
    admin_categories_top_level text, -- from the top of the crawling (based on the selected category or the very top); names separated by ; and with the level, as category:level

    -- ANNOTATION 3: WIKIDATA
    -- wikidata instance of
    instance_of_Wikimedia_project_page integer,
    instance_of_Wikimedia_internal_item integer,
    instance_of_Wikimedia_project_policies_guidelines integer,
    instance_of_Wikimedia_help_page integer,
    instance_of_Wikimedia_wikiproject integer,
    instance_of_Wikimedia_wikimedia_portal integer,

    -- characteristics of relationships
    num_inlinks_from_admin_pages integer,
    num_outlinks_to_admin_pages integer,
    percent_inlinks_from_admin_pages real,
    percent_outlinks_to_admin_pages real,

    -- characteristics of page relevance
    num_bytes integer,
    num_external_links integer,
    num_images integer,
    num_inlinks integer,
    num_outlinks integer,
    num_pageviews integer,

    -- metrics of history
    -- initial metrics
    num_edits integer,
    num_discussions integer,
    num_anonymous_edits integer,
    num_bot_edits integer,
    num_reverts integer,
    num_editors integer,
    num_admin_editors integer,
    median_year_first_edit integer,
    median_editors_edits integer,
    -- new last month metrics
    num_edits_last_month integer,
    num_edits_last_month_by_admin integer,
    num_edits_last_month_by_anonymous integer,
    num_edits_last_month_by_newcomer_90d integer,
    num_edits_last_month_by_newcomer_1y integer,
    num_edits_last_month_by_newcomer_5y integer,
    -- regularity and engagement metrics
    total_months integer,
    active_months integer,
    max_active_months_row integer,
    max_inactive_months_row integer,
    percent_active_months float,
    editing_days integer,
    percent_editing_days float,
    days_last_50_edits integer,
    days_last_5_edits integer,
    days_last_edit integer,
    date_last_edit integer,
    date_last_discussion integer,

    -- metrics of wikidata
    sister_projects text,
    num_multilingual_sisterprojects integer,
    num_wdproperty integer,
    num_wdidentifiers integer,

    PRIMARY KEY (qitem, page_id)
);
- Stats and metrics database (stats.db):
https://wdo.wmcloud.org/databases/stats_production.db (10 GB) This database contains the basic metrics and statistics.
CREATE TABLE wapa_cumulative (content text not null, editor text not null, set1 text not null, set1descriptor not null, set2 not null, set2descriptor not null, abs_value integer, rel_value float, period text, PRIMARY KEY (content, set1, set1descriptor, set2, set2descriptor, period));
CREATE TABLE wapa_incremental (content text not null, editor text not null, set1 text not null, set1descriptor text, set2 text, set2descriptor text, abs_value integer, rel_value float, period text, PRIMARY KEY (content, set1, set1descriptor, set2, set2descriptor, period));
CREATE TABLE wapa_stats (content text not null, set1 text not null, set1descriptor text, statistic text, value float, period text, PRIMARY KEY (content, set1, set1descriptor, statistic, period));
CREATE TABLE admin_categories (languagecode text not null, qitem text not null, category_name text, category_name_local text, page_id integer, run integer, alternative_category integer, PRIMARY KEY (languagecode, qitem, category_name));
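As an illustration of how the wapa_stats table could be read, the sketch below recreates it in memory and retrieves one statistic over time. It is only a sketch under stated assumptions: the helper `statistic_over_time` is hypothetical, and the values used for content, set1, set1descriptor, statistic, and period are made up, since this page does not list the actual vocabulary stored in those columns.

```python
import sqlite3

# Same definition as the wapa_stats table documented above.
WAPA_STATS_SCHEMA = """
CREATE TABLE wapa_stats (
    content text not null, set1 text not null, set1descriptor text,
    statistic text, value float, period text,
    PRIMARY KEY (content, set1, set1descriptor, statistic, period)
)
"""


def statistic_over_time(conn, content, set1, statistic):
    """Return (period, value) rows for one statistic, ordered by period."""
    return conn.execute(
        "SELECT period, value FROM wapa_stats "
        "WHERE content = ? AND set1 = ? AND statistic = ? "
        "ORDER BY period",
        (content, set1, statistic),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(WAPA_STATS_SCHEMA)
# Hypothetical rows: every value below is invented for illustration only.
conn.executemany(
    "INSERT INTO wapa_stats VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("pages", "cawiki", "admin", "average_num_edits", 12.5, "2023-02"),
        ("pages", "cawiki", "admin", "average_num_edits", 11.0, "2023-01"),
    ],
)
for period, value in statistic_over_time(conn, "pages", "cawiki", "average_num_edits"):
    print(period, value)
```

Ordering by period as text only yields chronological order if periods are stored in a sortable form such as YYYY-MM, which is an assumption here.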
Dumps
Dumps: all the dumps are available at dumps.wikimedia.org. On Wikimedia Cloud, we access them through a direct symbolic link without having to download them (the default location is /public/dumps/public/).
(wikirank is an external dataset used for the calculation of the extent score; in the future, we might calculate this score directly from the dumps as well).
The dumps used are the following (in order of appearance):
- Wikidata dump: https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz
- page titles dump: https://dumps.wikimedia.org/cawiki/latest/cawiki-latest-page.sql.gz
- page views dump: https://dumps.wikimedia.org/other/pageview_complete//
- external links dump: https://dumps.wikimedia.org/cawiki/latest/cawiki-latest-externallinks.sql.gz
- images dump: https://dumps.wikimedia.org/cawiki/latest/cawiki-latest-imagelinks.sql.gz
- mediawiki history: https://dumps.wikimedia.org/other/mediawiki_history/
Some of the dumps are independent of a specific language (such as the Wikidata dump), while most other dumps are language-specific. Where possible, it is preferable to use the "latest" links so that the dump dates do not have to be generated.
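The language-specific URLs above share one pattern, so the local path of a "latest" SQL dump can be built from the language code and the table name. This is a minimal sketch: the helper `latest_sql_dump_path` is hypothetical, and it assumes that the local mount mirrors the directory layout of dumps.wikimedia.org under /public/dumps/public/.

```python
from pathlib import PurePosixPath

# Default dumps location on Wikimedia Cloud, as stated above.
DUMPS_ROOT = PurePosixPath("/public/dumps/public")


def latest_sql_dump_path(languagecode, table, root=DUMPS_ROOT):
    """Build the path of the latest SQL dump of `table` for one Wikipedia.

    Assumption: the mount mirrors the URL layout
    <root>/<wiki>/latest/<wiki>-latest-<table>.sql.gz
    """
    # Database names use underscores where language codes use hyphens
    # (e.g. zh-min-nan -> zh_min_nanwiki).
    wiki = languagecode.replace("-", "_") + "wiki"
    return root / wiki / "latest" / f"{wiki}-latest-{table}.sql.gz"


for table in ("page", "externallinks", "imagelinks"):
    print(latest_sql_dump_path("ca", table))
```

Using PurePosixPath keeps the result identical on any operating system, since the path always refers to the Linux host where the dumps are mounted.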
Source code
Scripts are available at this GitHub address: https://github.com/marcmiquel/WAPA
- wikipedia_administrative_pages_analytics.py
- creates the wikidata.db from the current wikidata JSON dump and includes the information about a) groups of annotation, b) interwiki links and sister projects, and c) labels and page titles for each page in every Wikipedia language edition.
- creates the database wikipedia_administrative_pages_analytics.db with a table for each Wikipedia language edition.
- adds the page_ids to the wikipedia_administrative_pages_analytics.db tables using a page_titles dump.
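The first step above reads the Wikidata JSON dump, which stores one entity per line inside a single JSON array. The sketch below shows one way to stream such a file without loading it into memory and to extract the "instance of" (P31) values and the sitelinks of each entity. It is an illustration rather than the project's wd_dump_iterator(): the helper names `iter_entities`, `instance_of_ids`, `sitelink_titles`, and `read_dump` are hypothetical.

```python
import gzip
import json


def iter_entities(lines):
    """Yield one entity dict per line of a Wikidata JSON dump."""
    for line in lines:
        line = line.strip().rstrip(",")
        # The array brackets sit on their own lines and carry no entity.
        if line in ("", "[", "]"):
            continue
        yield json.loads(line)


def instance_of_ids(entity):
    """Return the Qitems this entity is an instance of (property P31)."""
    ids = []
    for claim in entity.get("claims", {}).get("P31", []):
        # Snaks with no value ("novalue"/"somevalue") have no datavalue.
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
        if isinstance(value, dict) and "id" in value:
            ids.append(value["id"])
    return ids


def sitelink_titles(entity):
    """Return {site: page title} for every sitelink of the entity."""
    return {site: link["title"] for site, link in entity.get("sitelinks", {}).items()}


def read_dump(path):
    """Stream a gzip-compressed dump from disk (not called in this sketch)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from iter_entities(handle)
```

A caller would loop over `read_dump(path)` and write the extracted values to wikidata.db in batches, since the full dump contains far too many entities to hold in memory at once.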
wikipedia_administrative_pages_analytics.py
- wd_dump_iterator()
it creates the wikidata.db from the current Wikidata JSON dump and stores, for each Wikidata Qitem, a) its instance-of annotations, b) its interwiki links and sister projects, and c) its labels and page titles for each Wikipedia language edition.
- create_wikipedia_administrative_pages_analytics_db()
it creates the database wikipedia_administrative_pages_analytics.db with a table for each Wikipedia language edition. This table is populated in the following steps.
- insert_page_ids_page_titles_qitems_db()
it adds the page_ids to the wikipedia_administrative_pages_analytics.db tables using the page_titles dump (at this point the wikidata-dump only provides wikidata-qitem id and page-title in the respective language).
- extend_page_title_same_category_title()
stores a value 1 when there is a category or a page having the same title.
- extend_instance_of()
stores the values of instance_of properties from Wikidata into the corresponding fields:
instance_of_Wikimedia_project_page
instance_of_Wikimedia_internal_item
instance_of_Wikimedia_project_policies_guidelines
instance_of_Wikimedia_help_page
instance_of_Wikimedia_wikiproject
instance_of_Wikimedia_wikimedia_portal
- extend_interwiki_qitem_properties_identifiers_sister_projects()
stores the value of the following fields:
num_interwiki - number of interwiki links
num_wdproperty - number of wikidata properties
num_wdidentifiers - number of wikidata properties which are an identifier
num_multilingual_sisterprojects - number of sister projects across languages
sister_projects - names of the sister projects in which the page has an equivalent
- search_highest_largest_category_from_list()
it receives a list of categories and runs down the category graph for each of them (without repeating paths), accumulating categories and pages in order to find the largest category (the one with the most pages below it) and the highest one (the one with the most levels below it). It returns both.
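The traversal described for search_highest_largest_category_from_list() can be sketched as a breadth-first crawl with a visited set. This is not the project's implementation: the function names are hypothetical, the category graph is passed in as plain dictionaries instead of being read from the database, and the visited set is kept per starting category, which is an assumption about what "without repeating paths" means.

```python
from collections import deque


def crawl(category, subcategories, pages_in):
    """Return (number of distinct pages, number of levels) below a category."""
    visited = {category}
    pages = set(pages_in.get(category, ()))
    depth = 0
    queue = deque([(category, 0)])
    while queue:
        current, level = queue.popleft()
        depth = max(depth, level)
        for child in subcategories.get(current, ()):
            # Skipping visited categories avoids loops and repeated paths.
            if child not in visited:
                visited.add(child)
                pages |= set(pages_in.get(child, ()))
                queue.append((child, level + 1))
    return len(pages), depth


def largest_and_highest(categories, subcategories, pages_in):
    """Return the category with most pages below it and the one with most levels."""
    results = {c: crawl(c, subcategories, pages_in) for c in categories}
    largest = max(categories, key=lambda c: results[c][0])
    highest = max(categories, key=lambda c: results[c][1])
    return largest, highest


# Toy graph with a cycle (C points back to A) to show that the crawl terminates.
subcategories = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["E"]}
pages_in = {"A": {"p1"}, "B": {"p2"}, "C": {"p3"}, "D": {"p4", "p5", "p6", "p7"}, "E": {"p8"}}
print(largest_and_highest(["A", "D"], subcategories, pages_in))
```

Pages are collected in a set so that a page filed under several subcategories is counted once.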
- extend_categories()
computes and stores all the metrics related to the categories.
# for categories
num_categories_contains # the num. of category pages it contains.
num_pages_contains # the num. of pages it contains.
num_pages_admin_contains # the num. of pages (NS = 4, 12, 100) it contains.
num_level_from_top # the number of jumps from the top of the crawling (based on the very top: "Main").
# for all pages
num_categories_has # the number of categories it has.
actual_categories # names separated by ;
main_category # name of the largest category at that level, i.e., category containing more pages.
- store_admin_categories_local()
checks the availability of the 10 categories in all the language editions and stores the result in a table in stats.db.
- retrieve_admin_categories_local()
retrieves the categories of the 10 admin types for all languages.
- admin_category_category_crawling()
takes one category and crawls the category graph below it.
- extend_admin_categories_existing_crawling()
does the category crawling for the existing categories.
- extend_admin_categories_interwiki_approach()
looks for the category most equivalent to the missing one and does the category crawling.
- extend_links()
computes the number of inlinks and outlinks from and to administrative pages.
- extend_pageviews()
gets the number of pageviews from the previous month for each page.
- extend_external_links()
gets the number of external links for each page.
- extend_images()
gets the number of images for each page.
- extend_editing_history()
# metrics of history
# initial metrics
num_edits
num_discussions
num_anonymous_edits
num_bot_edits
num_reverts
num_editors
num_admin_editors
median_year_first_edit
median_editors_edits
# new last month metrics
num_edits_last_month
num_edits_last_month_by_admin
num_edits_last_month_by_anonymous
num_edits_last_month_by_newcomer_90d
num_edits_last_month_by_newcomer_1y
num_edits_last_month_by_newcomer_5y
# regularity and engagement metrics
total_months
active_months
max_active_months_row
max_inactive_months_row
percent_active_months
editing_days
percent_editing_days
days_last_50_edits
days_last_5_edits
days_last_edit
date_last_edit
date_last_discussion
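The regularity and engagement fields listed above are not defined on this page, so the following sketch only illustrates one plausible reading of five of them. All definitions are assumptions: total_months is taken as the number of calendar months from the first edit to a reference date (inclusive), an active month is a month with at least one edit, and percent_active_months is expressed on a 0 to 100 scale. The helper names are hypothetical.

```python
from datetime import date


def month_index(d):
    """Map a date to a running month number so that months can be compared."""
    return d.year * 12 + (d.month - 1)


def regularity_metrics(edit_dates, reference_date):
    """Compute month-based regularity metrics from a page's edit dates."""
    last = month_index(reference_date)
    active = {month_index(d) for d in edit_dates if month_index(d) <= last}
    if not active:
        return None
    first = min(active)
    total_months = last - first + 1
    longest_active = longest_inactive = run_active = run_inactive = 0
    for month in range(first, last + 1):
        if month in active:
            run_active, run_inactive = run_active + 1, 0
        else:
            run_active, run_inactive = 0, run_inactive + 1
        longest_active = max(longest_active, run_active)
        longest_inactive = max(longest_inactive, run_inactive)
    return {
        "total_months": total_months,
        "active_months": len(active),
        "max_active_months_row": longest_active,
        "max_inactive_months_row": longest_inactive,
        "percent_active_months": 100 * len(active) / total_months,
    }


edits = [date(2020, 1, 5), date(2020, 1, 20), date(2020, 2, 1), date(2020, 5, 3)]
print(regularity_metrics(edits, reference_date=date(2020, 6, 30)))
```

In this toy example the page is active in January, February, and May out of six months, so the longest active run and the longest inactive run are both two months.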
- extend_first_timestamp_lang()
looks for the language edition in which an article was created first.
- create_stats_db()
It creates the database and tables where we store the extent of each type of admin page among other metrics.
It stores stats such as the number of pages on each month of a Wikipedia language edition history (“monthly increment”, “monthly cumulative”).
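The difference between the "monthly increment" and the "monthly cumulative" counts can be shown with a short sketch. It is not the project's code: the helper `monthly_counts` is hypothetical, and it assumes that each page's creation month is already available as a YYYY-MM string.

```python
from collections import Counter
from itertools import accumulate


def monthly_counts(creation_months):
    """Return (incremental, cumulative) lists of (period, number of pages)."""
    increments = Counter(creation_months)
    periods = sorted(increments)
    incremental = [(p, increments[p]) for p in periods]
    # The cumulative value of a month is the sum of all increments up to it.
    cumulative = list(zip(periods, accumulate(increments[p] for p in periods)))
    return incremental, cumulative


incremental, cumulative = monthly_counts(["2020-01", "2020-01", "2020-03"])
print(incremental)
print(cumulative)
```

Months in which no page was created do not appear in either list in this sketch; a complete time series would need those months filled in with an increment of zero.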