Research talk:Newcomer task suggestions/Work log/2014-07-24

Thursday, July 24th

edit

Time to write some queries. First I need to sample for returnTos. Let's grab all of the returnTos from the last month in Enwiki.

First, I need to know how many we're going to get.

> SELECT 
    ->     COUNT(*) 
    -> FROM ServerSideAccountCreation_5487345
    -> WHERE 
    ->     wiki = "enwiki" AND
    ->     timestamp BETWEEN "20140716" AND "20140723" AND
    ->     event_returnTo IS NOT NULL;
+----------+
| COUNT(*) |
+----------+
|    30078 |
+----------+
1 row in set (0.76 sec)

30k! That aint no thang. I added some filtering to limit it to main namespace pages. (A little hacky, but it shouldn't matter)

    -> SELECT 
    ->     event_returnTo,
    ->     count(*)
    -> FROM ServerSideAccountCreation_5487345
    -> WHERE 
    ->     wiki = "enwiki" AND
    ->     timestamp BETWEEN "20140716" AND "20140723" AND
    ->     event_returnTo IS NOT NULL AND
    ->     event_returnTo NOT LIKE "%:%" AND
    ->     event_returnTo != "Main Page"
    -> GROUP BY event_returnTo
    -> ORDER BY COUNT(*) DESC
    -> LIMIT 20;
+-------------------------------------------------+----------+
| event_returnTo                                  | count(*) |
+-------------------------------------------------+----------+
| Wikipedia                                       |       83 |
| Wiki                                            |       76 |
| Gmail                                           |       45 |
| Facebook                                        |       45 |
| Malaysia Airlines Flight 17                     |       44 |
| India                                           |       33 |
| Login                                           |       28 |
| Nelson Mandela                                  |       24 |
| Vehicle insurance in the United States          |       22 |
| WhatsApp                                        |       18 |
| Deaths in 2014                                  |       17 |
| Google                                          |       17 |
| Email                                           |       17 |
| Yahoo!                                          |       16 |
| James Rodríguez                                 |       16 |
| Indian Railway Catering and Tourism Corporation |       16 |
| Snapchat                                        |       14 |
| VK (social network)                             |       14 |
| Christina Aguilera                              |       14 |
| Victor Ponta                                    |       13 |
+-------------------------------------------------+----------+
20 rows in set (1.13 sec)

It looks like a lot of the visits get to "Wikipedia". Otherwise, this looks reasonable. Here's my sample query:

SELECT DISTINCT
    REPLACE(return_to, " ", "_") AS title
FROM (
    SELECT
        event_returnTo AS return_to
    FROM ServerSideAccountCreation_5487345
    WHERE
        wiki = "enwiki" AND
        timestamp BETWEEN "20140716" AND "20140723" AND
        event_returnTo IS NOT NULL AND
        event_returnTo NOT LIKE "%:%" AND
        event_returnTo != "Main Page"
    ORDER BY RAND()
    LIMIT 201
) AS sample

Now to start working on a script that can get recs from elastic search. --Halfak (WMF) (talk) 19:21, 24 July 2014 (UTC)Reply

Return to "Newcomer task suggestions/Work log/2014-07-24" page.