Data analysis/mining of Wikimedia wikis
This page is kept for historical interest. Any policies mentioned may be obsolete. If you want to revive the topic, you can use the talk page or start a discussion on the community forum.
See also e.g. Research:Data or wikitech:Analytics#Datasets
This page aims to collect relative strengths and weaknesses of different approaches to data mining Wikimedia wikis. Other data sources (e.g. squid logs) are out of scope.
XML Dumps
- Strengths
- XML dumps act as an abstraction layer that shields users from physical database changes
- Such schema changes were not uncommon in the early years and are probably less common now
- Existing support structure
- Dumps available for all projects / languages
- Suited for offline processing by community / researchers (see the sketch after this list)
- Weaknesses
- Only part of the database contents is available in XML format (some is available as SQL dumps, some not at all)
- Dump generation is a lengthy process; despite improvements, even in the best circumstances some dumps trail the live data by many weeks
- The English Wikipedia dump job runs for about two weeks, and the job may start up to two weeks after the close of the month
- Scripts for dump generation are maintenance-intensive
- Restructuring the dumps could make them more download-friendly
- This is a long-standing issue, but not a trivial one (for instance, incremental dumps would still require updates due to article/revision deletions)
- Dump generation, although much improved, is still an inherently unreliable process
- The dump code is intertwined with the general parser code, therefore:
- The process is suspended during MediaWiki code upgrades
- No regression tests
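To give a feel for the offline processing mentioned above, here is a minimal sketch (not the actual Wikistats code) that streams a compressed pages dump with the Python standard library and counts pages and revisions; the dump file name is an assumption, so substitute whichever dump you downloaded.

```python
# Minimal sketch: stream a MediaWiki XML dump without loading it into memory.
import bz2
import xml.etree.ElementTree as ET

DUMP = "enwiki-latest-pages-articles.xml.bz2"  # hypothetical local file

def local_name(tag):
    # ElementTree reports tags as '{namespace}name'; keep only 'name'
    return tag.rsplit("}", 1)[-1]

pages = revisions = 0
with bz2.open(DUMP, "rb") as f:
    context = ET.iterparse(f, events=("start", "end"))
    _, root = next(context)      # keep a handle on the root element
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "revision":
            revisions += 1
        elif name == "page":
            pages += 1
            root.clear()         # prune processed <page> subtrees to keep memory flat

print(f"{pages} pages, {revisions} revisions")
```

The same streaming pattern is what keeps memory usage manageable even for the full English Wikipedia dumps.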
Wikistats scripts
- Strengths
- Lots of functionality, developed in constant dialog with the core community
- Produces a large set of intermediate CSV files
- Many of these are reused by the community and by researchers (but they lack proper documentation)
- Monthly batch update for all reports
- New functionality and bug fixes work for all historic months, as all reports are rebuilt from scratch every time
- This has a flip side as well: long run times, and deleted content vanishes from the stats
- Serves all Wikimedia projects and languages on an equal footing
- (some Toolserver projects also take this approach)
- Many reports are multilingual
- (but much work is needed here; translatewiki.net seems the path to go)
- Wikistats portal as navigational aid
- Extensive, well-formatted activity log that helps to track program flow
- This is a great tool for bug fixing, but also for learning what the script does: it partly compensates for the lack of documentation
- Weaknesses
- Prerendered reports, not designed for ad hoc querying
- Hardly any documentation (but see activity log above)
- Many reports are too rich in details/granularity for casual readers
- Some scripts score low on maintainability
- The code contains many optimization tweaks, some of which may be entirely obsolete given current hardware resources.
- Not KISS: the WikiReports section contains lots of code to fine-tune layout (even per project), with added complexity as a result
- Some scripts still contain test code tuned to one particular test environment (file paths)
- Where WikiCounts might be seen as largely self-documenting code (sensible function and variable names, etc.), this is less true for WikiReports
- (having been a one-person hobby project for many years, this simply did not have the highest priority)
- The WikiCounts job is not restartable.
- WikiCounts has evolved since 2003, when a full English dump took 20 minutes on one thread rather than a full week on 15.
In a new design, reprocessing the collected data would have been put in a separate step, in order to maximize restartability.
Other dump clients
- Strengths
- Existing code - third-party scripts in several scripting languages can process the XML dumps
- Simplicity - these scripts tend to be single-purpose and are often very simple and efficient, so they can be a good starting point for exploring the dumps
- Weaknesses
- Support - presumably only some third-party scripts are supported; of course, their simplicity makes this less of an issue
API
- Strengths
- Well designed feature set
- Lots of functionality
- Acts as an abstraction layer
- (like the XML dumps, it shields users from physical database changes)
- Data from live database, always up to date results
- Supports several data formats
- Expertise available among staff and community
- Weaknesses
- Even with a special bot flag, clients are limited to x calls per second, each returning a limited quantity of data
- Inherently too slow for transferring large volumes of data (see the sketch after this list)
- Data from the live database also means:
- Regression testing is more difficult
- Limited ability to rerun reports for earlier periods, whether based on new insights or because of bug fixes
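As a concrete illustration of the call-per-call nature of API access, the sketch below (Python, standard library only) pulls a few batches of recent changes through the MediaWiki web API and follows its continuation protocol; the endpoint shown is the English Wikipedia, and the one-second throttle and User-Agent string are assumptions to adapt to your own bot policy.

```python
# Minimal sketch: fetch recent changes via the MediaWiki API (action=query, list=recentchanges).
import json
import time
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def api_get(params):
    params = dict(params, format="json", maxlag="5")  # ask the API to refuse work when replication lag is high
    url = API + "?" + urllib.parse.urlencode(params)
    req = urllib.request.Request(url, headers={"User-Agent": "data-mining-example/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def recent_changes(batches=3, limit=500):
    """Yield recent change records batch by batch, honouring the 'continue' protocol."""
    cont = {}
    for _ in range(batches):
        data = api_get({"action": "query", "list": "recentchanges",
                        "rclimit": str(limit), **cont})
        yield from data["query"]["recentchanges"]
        if "continue" not in data:
            break
        cont = data["continue"]
        time.sleep(1)  # crude throttle: this pacing is why bulk transfers through the API are slow

if __name__ == "__main__":
    for rc in recent_changes(batches=1, limit=10):
        print(rc["timestamp"], rc.get("title", ""))
```

Even at 500 records per call (more with a bot flag), moving millions of revisions this way takes orders of magnitude longer than reading a dump, which is the weakness noted above.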
MySQL
- Strengths
- Flexibility - ad hoc queries are easy (see the sketch after this list)
- Access to all data
- Querying language widely known, even among advanced end users
- Weaknesses
- Too slow for some purposes
- Compare the generation of the English full-history dump: even with 95%+ reuse of cached data, it takes a full week on 15 nodes
- No abstraction layer (see XML dumps above)
- Requires good knowledge of database schema
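As a sketch of such an ad hoc query, assuming a dump has been imported into a local MySQL instance and the third-party pymysql driver is installed (host, credentials and database name below are placeholders), the following counts pages per namespace from the core page table:

```python
# Minimal sketch: one ad hoc query against a locally imported MediaWiki database.
import pymysql

conn = pymysql.connect(host="localhost", user="research", password="secret",
                       database="enwiki", charset="utf8mb4")
try:
    with conn.cursor() as cur:
        # page_namespace is part of the core MediaWiki schema
        cur.execute("SELECT page_namespace, COUNT(*) FROM page GROUP BY page_namespace")
        for namespace, n in cur.fetchall():
            print(f"namespace {namespace}: {n} pages")
finally:
    conn.close()
```

The flexibility is obvious, but so is the weakness: every query like this presumes familiarity with the physical schema, which can change between MediaWiki versions.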
Live Database
Traditionally widely used by admins
- Strengths
- The 'real thing': reliability of data as good as it gets
- What third parties are most likely to use and help with: see for instance the existing StatMediaWiki
- Weaknesses
- Performance - only suitable for trivial queries (with the risk that they turn out not to be trivial after all)
- Access - few people have MySQL access to Wikimedia live database, for obvious reasons
Slave Database at WMF
- Strengths
- Quick to set up, with limited effort
- Ability to do user analysis by geography (?)
- Complex ad-hoc queries for editor history without disrupting site operations
- Weaknesses
- No access outside WMF staff
- This will hamper reusability of the code, although in theory the code might be reused on the Toolserver
- Possibility of losing complex queries
- Reusability of queries on all wikis
- Need for extra work to create API or script calls for queries found to be useful
- Bottleneck of complex queries (?)
Tool Server
- Strengths
- Existing support structure
- Large community of volunteer developers
- Weaknesses
- Time sharing puts an upper limit on resource usage
- Due to its scale and small staff, the Toolserver is less tuned to 24/7 operations
NoSQL solutions
- Strengths
- Built-in data replication & failover (all implementations?)
- Scales Horizontally
- Developed by leading Web 2.0 players
- Designed for really huge data collections
- Optimized for fast response times on ad hoc queries
- Weaknesses
- Still maturing technology
- Compared to MySQL, few large implementations exist yet
- Limited expertise available
- How important is the choice of implementation?
- Export from live MySQL databases will require extensive effort (?)
- Needs good initial design for best performance
- Opportunities
- New technology frontier, will appeal to potential new volunteers and staff who want to make their mark
- Threats
- Will it be feasible to compact stored information over time (e.g. only preserve aggregated data after x days)?
- Some NoSQL solutions reputedly are better tuned to addition of new info than updating/filtering existing info
Cassandra
- Strengths
- Decentralized - no master server, no single point of failure
- Elasticity - machines can be added on the fly, read and write scale linearly with nodes added
- Fault-tolerant - redundant storage on multiple nodes, replication over multiple data centers, hot swapping of failed nodes
- Tunable consistency - from 'writes never fail' to 'deliver fast at the expense of replication integrity' (see the sketch below)
- Weaknesses
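To make the 'tunable consistency' strength above concrete, here is a minimal sketch using the DataStax Python driver for Cassandra, in which each statement chooses its own consistency level; the keyspace and table names are hypothetical.

```python
# Minimal sketch: per-statement consistency levels with the DataStax Cassandra driver.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])        # contact point is a placeholder
session = cluster.connect("wikistats")  # hypothetical keyspace

# 'Writes never fail' end of the spectrum: acknowledge as soon as any node accepts the write.
fast_write = SimpleStatement(
    "INSERT INTO page_views (page, day, views) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.ANY)
session.execute(fast_write, ("Main_Page", "2010-07-01", 12345))

# Stricter end: require a quorum of replicas before answering the read.
safe_read = SimpleStatement(
    "SELECT views FROM page_views WHERE page = %s AND day = %s",
    consistency_level=ConsistencyLevel.QUORUM)
row = session.execute(safe_read, ("Main_Page", "2010-07-01")).one()
print(row.views if row else "no data")

cluster.shutdown()
```

Each statement trades replication guarantees against latency independently, which is exactly the knob the strengths list refers to.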