Research:Wiki Participation Challenge zeditor

About the Algorithm

Author : Dell Zhang (dell.z@ieee.org)

Team: zeditor

Source code: dumps.wikimedia.org

Ranking in Wikichallenge: Honorable Mention

Publication: arxiv

Dependencies

The program uses the following open-source software tools, all of which are included in the software package Python(x,y) 2.7.2.

  • Python 2.7.2
  • NumPy 1.6.1
  • scikit-learn 0.8.1 (Note: The winning submission did not use it.)
  • OpenCV 2.3.1

Dataset Files

The program uses the following dataset files, which are available to all participants.

(1) data/training.tsv The official dataset for training and testing.

(2) data/validation.tsv and data/validation_solutions.csv The official dataset for validation. Note: The winning submission did not use it.

(3) data/moredata_raw.tsv The additional dataset generously provided by Twan van Laarhoven. [http://www.kaggle.com/c/wikichallenge/forums/t/719/more-training-data] It was created by downloading a list of Wikipedia users and then, for each user (in random order), downloading the list of edits made by that user between 2001-01-01 and 2010-08-31. The format of the original file is slightly different from that of the official datasets, so it is named moredata_raw, while the pre-processed file is named moredata.

Source Code Files

(1) preprocess_moredata.py It converts moredata_raw.tsv to moredata.tsv so that its format is consistent with the official datasets.

(2) editlog.py It provides a utility function that turns a timestamp string into a real number of months from 2001-01-01. For example, parse_timestamp('2001-06-16 00:00:00') = 5.5 as five and a half months have passed since 2001-01-01.
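
A minimal sketch of such a conversion, assuming the 'YYYY-MM-DD HH:MM:SS' timestamp format of the example above and a 30-day month (the actual parse_timestamp in editlog.py may use a different convention):

  from datetime import datetime

  def parse_timestamp(ts):
      # Convert a 'YYYY-MM-DD HH:MM:SS' string into a (fractional) number of
      # months elapsed since 2001-01-01, counting 30 days per month
      # (an assumed convention that reproduces the example given above).
      dt = datetime.strptime(ts, '%Y-%m-%d %H:%M:%S')
      whole_months = (dt.year - 2001) * 12 + (dt.month - 1)
      fraction = (dt.day - 1 + dt.hour / 24.0) / 30.0
      return whole_months + fraction

  print(parse_timestamp('2001-06-16 00:00:00'))  # 5.5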

(3) prepare.py Given a dataset, it extracts useful editing statistics for each active user (one who made at least one edit in the last year) from the tsv data file, and stores them in a binary Python pickle file to facilitate future access.
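
A rough illustration of this step, assuming each line of the tsv file contains a user id, an article id and a timestamp (the real column layout, and the statistics actually gathered by prepare.py, may differ):

  import csv
  import pickle
  from collections import defaultdict

  from editlog import parse_timestamp  # timestamp string -> months since 2001-01-01

  def prepare(tsv_path, pkl_path, now=116.0):
      # Collect (month, article_id) edit records per user and pickle the records
      # of active users, i.e. those with at least one edit in the last 12 months.
      edits = defaultdict(list)
      with open(tsv_path) as f:
          for row in csv.reader(f, delimiter='\t'):
              user_id, article_id, timestamp = row[0], row[1], row[2]  # assumed columns
              edits[user_id].append((parse_timestamp(timestamp), article_id))
      active = {u: es for u, es in edits.items() if any(m >= now - 12 for m, _ in es)}
      with open(pkl_path, 'wb') as f:
          pickle.dump(active, f)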

(4) predictor.py It provides the core functions for feature representation, labelling, learning (training), and estimation (prediction). Please refer to the following sections about algorithm description and algorithm pseudo-code.

(5) experiment.py Given a labelled dataset (by default 'moredata'), it predicts the number of future edits for each active user, and then evaluates the results with RMSLE.
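
For reference, RMSLE (root mean squared logarithmic error), the competition's evaluation metric, can be computed with a helper along these lines (the function name here is illustrative):

  import numpy as np

  def rmsle(predicted, actual):
      # Root mean squared logarithmic error between two arrays of edit counts.
      predicted = np.asarray(predicted, dtype=float)
      actual = np.asarray(actual, dtype=float)
      return np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))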

(6) submit.py Given an unlabelled dataset (by default 'training'), it predicts the number of future edits for each active user, and then writes the results into a file outputs.csv that is to be submitted.

Reproducing the Results

(1) Conduct Experiments for Parameter Tuning

  • $ ./preprocess_moredata.py
  • $ ./prepare.py moredata
  • $ ./experiment.py

(2) Make the Final Submission - outputs.csv

  • $ ./prepare.py training
  • $ ./submit.py

Algorithm Description

Our main idea is to construct a predictive function/model (that estimates a user's future number of edits based on his/her recent editing dynamics) through supervised learning, more specifically, regression.

Let tst_time denote the time-point at which we would like to predict each active user's number of edits in the next 5 months. To train the function/model, we move 5 months backwards and assume that we were at the time-point trn_time = tst_time-5, so that we know the actual number of edits made by each active user in the 5 months after trn_time, i.e., the label for our supervised learning (regression). More precisely, the target value for regression is y = log(1+n), where n is the number of edits in the next 5 months, which makes the loss function directly relate to the evaluation metric RMSLE. Given a time-point t, each active user u_j is represented as a vector x_j that consists of the following dynamics features:

(1) the number of edits in the last 1/16, 1/8, 1/4, 1/2, 1, 2, 4, 12, 36, and 106 months;

(2) the number of edited articles in the last 1/16, 1/8, 1/4, 1/2, 1, 2, 4, 12, 36, and 106 months;

(3) the length of time from the first edit to the last edit, scaled logarithmically.

The temporal scales (1/16, 1/8, 1/4, 1/2, 1, 2, 4, 12, 36, and 106 months) are chosen at exponentially increasing steps because we conjecture that the influence of an editing activity decays exponentially with its time distance from the present. The period doubles at each step from 1/16 to 4 and then triples at each step from 4 to 106 (106 is used instead of 36*3=108 to keep the periods within the time scope of the given datasets).
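
As an illustration, a feature vector along these lines could be built as in the sketch below, which assumes each user's edit history is a list of (month, article_id) pairs with timestamps already converted to months; the exact features in predictor.py may differ in detail:

  import numpy as np

  SCALES = [1/16., 1/8., 1/4., 1/2., 1, 2, 4, 12, 36, 106]  # months

  def make_features(edit_history, t):
      # Dynamics features for one user at time-point t (months since 2001-01-01).
      features = []
      # (1) number of edits in each trailing window
      for w in SCALES:
          features.append(sum(1 for m, _ in edit_history if t - w <= m <= t))
      # (2) number of distinct edited articles in each trailing window
      for w in SCALES:
          features.append(len({a for m, a in edit_history if t - w <= m <= t}))
      # (3) time span from the first to the last edit, scaled logarithmically
      # (log1p is an assumed choice of logarithmic scaling)
      months = [m for m, _ in edit_history]
      features.append(np.log1p(max(months) - min(months)))
      return np.array(features, dtype=float)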

The machine learning methods that we tried for our regression task come from two open-source Python modules: scikit-learn (Generalized Linear Models, Support Vector Machines, Stochastic Gradient Descent, Nearest Neighbours, etc.) and OpenCV (Gradient Boosted Trees etc.). I happen to know that the computer vision toolkit OpenCV provides very good implementations of some machine learning algorithms, as I am actually teaching a computer vision course. In the end, Gradient Boosted Trees outperformed the other methods.
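
Using today's scikit-learn API as a stand-in (the submission itself used scikit-learn 0.8.1 and OpenCV's gradient boosted trees, whose interfaces differ), the candidate regressors can be compared along these lines; the hyper-parameters are placeholders, not the tuned values of the submission:

  from sklearn.ensemble import GradientBoostingRegressor
  from sklearn.linear_model import Ridge, SGDRegressor
  from sklearn.neighbors import KNeighborsRegressor
  from sklearn.svm import SVR

  # Candidate regressors, roughly matching the families listed above.
  candidates = {
      'ridge': Ridge(alpha=1.0),
      'svr': SVR(C=1.0),
      'sgd': SGDRegressor(),
      'knn': KNeighborsRegressor(n_neighbors=10),
      'gbt': GradientBoostingRegressor(n_estimators=200, max_depth=3),
  }

  def compare(candidates, X_train, y_train, X_valid, y_valid, metric):
      # Fit each candidate on the training split and score it on the validation
      # split; metric takes (predictions, true targets), e.g. RMSE on log targets.
      return {name: metric(model.fit(X_train, y_train).predict(X_valid), y_valid)
              for name, model in candidates.items()}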

The feature selection and the learning algorithm parameter tuning were carried out on the moredata dataset, which was filtered to contain only active users (who made at least one edit in the last year) so as to exhibit the same 'survival bias' as the official training dataset.

We have also introduced a global 'drift' term (i.e., how much the average number of edits would change after 5 months) when making the final predictions. Again, its value is estimated from the situation 5 months ago.

Algorithm Pseudo-Code

Learning

  • t = trn_time (i.e., tst_time-5)
  • find all active users who made at least one edit in [t-12, t]
  • represent each active user u_j as a vector x_j consisting of dynamics features (please refer to the above description)
  • label each active user by y_j = log(1 + n_j), where n_j is the number of edits in [t, t+5]
  • learn a predictive model/function f from the (x_j, y_j) pairs using a regression technique such as en:Gradient_boosting
  • guess the drift d by comparing the number of edits in [t-5, t] and that in [t, t+5] (see the Python sketch after this list)
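
A compact Python rendering of the learning step, reusing the make_features sketch above, with scikit-learn's GradientBoostingRegressor standing in for OpenCV's gradient boosted trees and the drift taken as a simple difference of window averages in log space (both are assumptions; predictor.py is the authoritative version):

  import numpy as np
  from sklearn.ensemble import GradientBoostingRegressor

  def count_edits(edit_history, start, end):
      # Number of edits made in the window [start, end], measured in months.
      return sum(1 for m, _ in edit_history if start <= m <= end)

  def learn(users, tst_time):
      # users: dict mapping user_id -> list of (month, article_id) edit records
      t = tst_time - 5  # trn_time
      active = {u: es for u, es in users.items() if count_edits(es, t - 12, t) > 0}
      X = np.array([make_features(es, t) for es in active.values()])
      y = np.array([np.log1p(count_edits(es, t, t + 5)) for es in active.values()])
      model = GradientBoostingRegressor(n_estimators=200, max_depth=3).fit(X, y)
      # One possible reading of the global drift: the change in the average
      # log-scaled edit count from [t-5, t] to [t, t+5].
      before = np.mean([np.log1p(count_edits(es, t - 5, t)) for es in active.values()])
      after = np.mean([np.log1p(count_edits(es, t, t + 5)) for es in active.values()])
      return model, after - before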

Predicting

  • t = tst_time (e.g., 116 for the dataset training)
  • find all active users who made at least one edit in [t-12, t]
  • represent each active user u_j as a vector x_j consisting of dynamics features (please refer to the above description)
  • for each active user u_j, compute y_j = f(x_j) using the learnt f, and output exp(y_j) - 1 as the estimated number of edits in [t, t+5] (see the sketch below)
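
The corresponding prediction step, again only a sketch reusing count_edits and make_features from above; how the drift adjustment is actually applied in predictor.py may differ from the additive log-space correction assumed here:

  import numpy as np

  def predict(users, model, drift, tst_time):
      # Return a dict mapping user_id -> estimated number of edits in [t, t+5].
      t = tst_time  # e.g. 116 for the dataset 'training'
      active = {u: es for u, es in users.items() if count_edits(es, t - 12, t) > 0}
      predictions = {}
      for u, es in active.items():
          y = model.predict(make_features(es, t).reshape(1, -1))[0] + drift
          predictions[u] = max(0.0, float(np.expm1(y)))  # back from log(1 + n) to a count
      return predictions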