Research talk:Newcomer task suggestions/Work log/2014-07-24
Latest comment: 10 years ago by Halfak (WMF) in topic Thursday, July 24th
Thursday, July 24th
editTime to write some queries. First I need to sample for returnTos. Let's grab all of the returnTos from the last month in Enwiki.
First, I need to know how many we're going to get.
> SELECT -> COUNT(*) -> FROM ServerSideAccountCreation_5487345 -> WHERE -> wiki = "enwiki" AND -> timestamp BETWEEN "20140716" AND "20140723" AND -> event_returnTo IS NOT NULL; +----------+ | COUNT(*) | +----------+ | 30078 | +----------+ 1 row in set (0.76 sec)
30k! That aint no thang. I added some filtering to limit it to main namespace pages. (A little hacky, but it shouldn't matter)
-> SELECT -> event_returnTo, -> count(*) -> FROM ServerSideAccountCreation_5487345 -> WHERE -> wiki = "enwiki" AND -> timestamp BETWEEN "20140716" AND "20140723" AND -> event_returnTo IS NOT NULL AND -> event_returnTo NOT LIKE "%:%" AND -> event_returnTo != "Main Page" -> GROUP BY event_returnTo -> ORDER BY COUNT(*) DESC -> LIMIT 20; +-------------------------------------------------+----------+ | event_returnTo | count(*) | +-------------------------------------------------+----------+ | Wikipedia | 83 | | Wiki | 76 | | Gmail | 45 | | Facebook | 45 | | Malaysia Airlines Flight 17 | 44 | | India | 33 | | Login | 28 | | Nelson Mandela | 24 | | Vehicle insurance in the United States | 22 | | WhatsApp | 18 | | Deaths in 2014 | 17 | | Google | 17 | | Email | 17 | | Yahoo! | 16 | | James Rodríguez | 16 | | Indian Railway Catering and Tourism Corporation | 16 | | Snapchat | 14 | | VK (social network) | 14 | | Christina Aguilera | 14 | | Victor Ponta | 13 | +-------------------------------------------------+----------+ 20 rows in set (1.13 sec)
It looks like a lot of the visits get to "Wikipedia". Otherwise, this looks reasonable. Here's my sample query:
SELECT DISTINCT REPLACE(return_to, " ", "_") AS title FROM ( SELECT event_returnTo AS return_to FROM ServerSideAccountCreation_5487345 WHERE wiki = "enwiki" AND timestamp BETWEEN "20140716" AND "20140723" AND event_returnTo IS NOT NULL AND event_returnTo NOT LIKE "%:%" AND event_returnTo != "Main Page" ORDER BY RAND() LIMIT 201 ) AS sample
Now to start working on a script that can get recs from elastic search. --Halfak (WMF) (talk) 19:21, 24 July 2014 (UTC)