WSoR datasets/trending articles
This dataset is used for the sprint on new editor retention in trending articles.
Location
Dropbox/wsor/yusuke/trending-20110627/data/bursts_Jan2009_daily_rev.tsv
Dropbox/wsor/yusuke/trending-20110627/data/bursts_Jul2009_daily_rev.tsv
Dropbox/wsor/yusuke/trending-20110627/data/bursts_Jan2010_daily_rev.tsv
Dropbox/wsor/yusuke/trending-20110627/data/bursts_Jul2010_daily_rev.tsv
Dropbox/wsor/yusuke/trending-20110627/data/bursts_Jan2011_daily_rev.tsv
Fields
$ head -4 ~/Dropbox/wsor/yusuke/trending-20110627/data/bursts_Jan2009_daily_rev.tsv
title page_id redirect? pageview timestamp predicted pageview actual pageview trending hours surprisedness revision timestamp user type username editcount new? protect editcount_120d+90d
National_Football_League_Most_Valuable_Player_Award 2019918 REDIRECT_RESOLVED 2009/01/03 12:00:00 212 1270 1 4.99056603774 261557507 20090103000010 REG Howdythere 880 OLD NO_PROTECT 1
La_Toya_Jackson 152297 REDIRECT_RESOLVED 2009/01/03 12:00:00 65 1668 1 24.6615384615 261557521 20090103000013 ANON 81.151.114.161 0 NEW NO_PROTECT 0
National_Football_League_Most_Valuable_Player_Award 2019918 REDIRECT_RESOLVED 2009/01/03 12:00:00 212 1270 1 4.99056603774 261557736 20090103000124 REG Howdythere 881 OLD NO_PROTECT 1
Each row represents a revision to a page whose 'surprisedness' value exceeded the threshold. Each file covers the revisions found in one month.
title
page_id
redirect?
pageview timestamp
predicted pageview
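: linear prediction from the previous two days
actual pageview
trending hours
: the duration of the continued trending days
surprisedness
: relative increase from the predicted to the actual page view count, (actual - predicted) / predicted, stored as a fraction rather than a percentage; in the first sample row, (1270 - 212) / 212 ≈ 4.99
revision ID
revision timestamp
: date, hour, minute, and second, written as YYYYMMDDHHMMSS (for example 20090103000010)
user type
: registered user, bot, or anonymous user (REG and ANON appear in the sample rows)
username
editcount
: edit count up to the revision timestamp
new user?
: whether the user had 30 days of editing history as of the revision (NEW or OLD in the sample rows)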
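As a rough illustration of the fields above, the following Python sketch parses one data row and recomputes its surprisedness value. It assumes the columns are tab-separated (suggested by the .tsv extension) and that they appear in the order seen in the sample data rows; the names parse_row and relative_increase are hypothetical helpers, not part of the original scripts.

```python
from datetime import datetime

# First sample row from the head output; tab separation is an assumption
# based on the .tsv extension.
SAMPLE_ROW = "\t".join([
    "National_Football_League_Most_Valuable_Player_Award", "2019918",
    "REDIRECT_RESOLVED", "2009/01/03 12:00:00", "212", "1270", "1",
    "4.99056603774", "261557507", "20090103000010", "REG", "Howdythere",
    "880", "OLD", "NO_PROTECT", "1",
])


def parse_row(line):
    """Split one data row into a dict (hypothetical helper).

    Field positions follow the sample data rows, not the header line.
    """
    f = line.rstrip("\n").split("\t")
    return {
        "title": f[0],
        "page_id": int(f[1]),
        "predicted_pageview": float(f[4]),
        "actual_pageview": float(f[5]),
        "surprisedness": float(f[7]),
        "revision_id": int(f[8]),
        "revision_timestamp": datetime.strptime(f[9], "%Y%m%d%H%M%S"),
        "user_type": f[10],
        "username": f[11],
        "editcount": int(f[12]),
        "new_user": f[13] == "NEW",
    }


def relative_increase(predicted, actual):
    """Relative increase from predicted to actual page views."""
    return (actual - predicted) / predicted


row = parse_row(SAMPLE_ROW)
recomputed = relative_increase(row["predicted_pageview"], row["actual_pageview"])
print(row["title"], row["surprisedness"], recomputed)
```

For the sample rows shown on this page, the recomputed value agrees with the stored surprisedness column to the printed precision; whether that holds for every row in the files has not been checked here.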
Reproduction
Use the scripts available at [1] and follow the documentation.
Notes
Since it takes 1-2 days to produce the dataset for one month, samples were prepared for only five months, one every half year between January 2009 and January 2011. Edit count values may be incorrect (see bug:19311).