Research talk:Reading time/Work log/2018-09-25

Latest comment: 5 years ago by Tbayer (WMF) in topic Consistency Checks

Tuesday, September 25, 2018 edit

Consistency Checks edit

Taking care of this task: https://phabricator.wikimedia.org/T160492

DONE/UPDATED TO EXCLUDE SAFARI: Having found that there are sometimes page tokens (pageviews) with > 2 events, check that at least there is only one pageloaded event among them each time edit

Page tokens with more than 1 pageLoaded event exist.
   -- Are there any cases where pageLoaded > 1 ? 
   SELECT COUNT(*) AS nLoadedTwice FROM (
         SELECT event.pageToken AS pageToken, event.action AS eventAction, COUNT(*) as nAction
         FROM event.readingdepth
         WHERE
             year = 2018 
             AND month = 9
             AND day = 24
             AND useragent.browser_family != "Safari" 
         GROUP BY event.pageToken, event.action) t1
    WHERE t1.eventAction = "pageLoaded" AND
          t1.nAction > 1;
   ---- result is 64
   
   -- What persentage of these cases is this?
   SELECT COUNT(*) AS nEvents FROM event.readingdepth
   WHERE
          year = 2018 
          AND month = 9
          AND day = 24
          AND useragent.browser_family != "Safari";
   
   ---- result is 575760

So we have a 0.000111 ratio of cases where pageLoaded more than once. This seems pretty small, but acceptable. Filtering these out in analysis seems like it wouldn't hurt.

DONE: Check if such pageviews (tokens) with multiple unloaded events will need to be filtered out during analysis, or whether their impact is likely small enough to be ignored edit

   ---- What about cases where pageUnloaded > 1 ?
   -- Are there any cases where pageLoaded > 1 ? 
   SELECT COUNT(*) AS nUnloadedTwice FROM (
          SELECT event.pageToken AS pageToken, event.action AS eventAction, COUNT(*) as nAction
          FROM event.readingdepth
          WHERE
          year = 2018 
          AND month = 9
          AND day = 24
          AND useragent.browser_family != "Safari"
          GROUP BY event.pageToken, event.action) t1
          WHERE t1.eventAction = "pageUnloaded" AND
                t1.nAction > 1;
   -- result is 1792

So we have a 0.00311 (0.311%) rate of cases where pageUnloaded more than once. This is more than multiple page loaded events. Filtering these out seems like it wouldn't hurt, especially if they are outliers.


DONE/UPDATED to exclude safari: Consistency checks: totalLength and visibleLength should be less (apart from rounding errors) than the difference between timestamps of the loaded and unloaded events (@Zareenf already worked on these) edit

   ADD JAR /srv/deployment/analytics/refinery/artifacts/refinery-hive.jar;
   CREATE TEMPORARY FUNCTION GetMediawikiTimestamp as 'org.wikimedia.analytics.refinery.hive.GetMediawikiTimestampUDF';
   
   -- Query to see what browsers exist
   SELECT DISTINCT(useragent.browser_family) AS browser_ FROM event.readingdepth WHERE 
                 year = 2018 AND
                 month = 9 AND
                 day = 23;
     
   CREATE TABLE IF NOT EXISTS nathante.cleanReadingDepth LOCATION "/user/nathante/cleanReadingDepth" AS 
   SELECT * FROM event.readingdepth
   WHERE year > 0 and event.pageToken in
   ( SELECT DISTINCT(pageToken)
             FROM (SELECT event.pageToken AS pageToken, event.action AS eventAction, COUNT(*) as nAction
                   FROM event.readingdepth
                   WHERE
                        useragent.browser_family != "Safari" AND
                        year > 0      
                   GROUP BY event.pageToken, event.action) t1
                   WHERE t1.eventAction = "pageLoaded" AND
                         t1.nAction == 1
    );
   
   -- OK now we gotta 
   -- compute the differences between loaded and unloaded events
   
   use nathante;
   ---- the lengths in the record are in ms, but the dt are in s
   CREATE TABLE IF NOT EXISTS nathante.pageEventTimings LOCATION "/user/nathante/pageEventTimings" AS
    SELECT lo.pageToken AS pageToken, lo.dt AS lo_dt, ul.dt AS ul_dt, ul.dt - lo.dt AS EventSpan, (ul.dt - lo.dt) - totalLength/1000.0 AS errTotalLength, (ul.dt - lo.dt) - visibleLength/1000.0 AS errVisibleLength  FROM 
          ( SELECT GetMediaWikiTimestamp(dt) AS dt, event.pageToken as pageToken FROM cleanReadingDepth WHERE event.action = "pageLoaded") lo,
          ( SELECT GetMediaWikiTimestamp(dt) AS dt, event.pageToken as pageToken, event.totalLength AS totalLength, event.visibleLength AS visibleLength FROM cleanReadingDepth WHERE event.action = "pageUnloaded") ul
    WHERE lo.pageToken = ul.pageToken AND lo.dt > 0;
   
   ---- This might be too big or too small. What is a reasonable rounding threshhold?
   ---- set roundingThreshhold=-2; -- I can't get the variable to work
   
   SELECT COUNT(*) AS nrow, AVG(IF(errTotalLength <  -2,1,0)) AS avgTotalLengthErrs, SUM(IF(errVisibleLength < -2,1,0)) AS nVisibleLengthErrs FROM nathante.pageEventTimings;
   --   nrow    ntotallengtherrs        nvisiblelengtherrs
   -- 80900688  0.07590008381634529     0.06944185690979537

So about 7.5% error rate for total time and 7% error rate for visible time.

   SELECT COUNT(*) AS nrow, AVG(IF(errTotalLength <  -5,1,0)) AS nTotalLengthErrs, SUM(IF(errVisibleLength < -5,1,0)) AS nVisibleLengthErrs FROM nathante.pageEventTimings;
   
   -- nrow     avgtotallengtherrs        avgvisiblelengtherrs
   --80900688  0.015145619033548887  0.013411258010562284

Visible length seems better: only 1.5% error rate.

Groceryheist (talk) 21:47, 25 September 2018 (UTC)Reply

Great! (We should update the Phab task with these results too.)
Regarding the timing inconsistencies: Some noise is to be expected, in particular because the time when an event request is received by the server (the dt timestamp) is always different from the time it was logged in the browser, etc.) That said, these percentages seem a bit high, and also, it is weird that the distribution of the total length error appears seem to have periodic peaks with a periodicity of about 40 seconds. That seems to point to a larger problem.
SELECT rerrTotalLength, COUNT(*) AS views FROM (
  SELECT ROUND(errTotalLength,1) AS rerrTotalLength
  FROM nathante.pageEventTimings ) AS reTL 
GROUP BY rerrTotalLength 
HAVING ABS(rerrTotalLength) < 200
ORDER BY rerrTotalLength LIMIT 100000;
Regards, Tbayer (WMF) (talk) 23:11, 25 September 2018 (UTC)Reply
Wow that 40 second periodicity is crazy! Who might have an explanation? Groceryheist (talk) 23:21, 25 September 2018 (UTC)Reply
We should ask the web engineers on the Phab ticket whether they have ideas... It might be worth checking first whether these events with n x 40 seconds discrepancy are concentrated in any dimension, e.g. browser, domain (webhost), mobile vs. desktop (webhost LIKE '%.m%'). A histogram chart might be useful for illustration.
Regards, Tbayer (WMF) (talk) 23:45, 25 September 2018 (UTC)Reply
@Groceryheist: Are we sure the calculation of errTotalLength in the query above is correct? It seems it is subtracting one MediaWiki timestamp from another (ul.dt - lo.dt). But those timestamps have the format YYYYMMDDHHMMSS, so that subtraction will give wrong results (e.g. the difference between 20180921000000 and 20180921000100 is 60 seconds, not 100 seconds). Regards, Tbayer (WMF) (talk) 00:34, 4 October 2018 (UTC)Reply

DONE: Generate histogram for totallength and visiblelength that don't agree with timestamps. edit

 
Distribution of apparent errors in attempts to measure dwell times on Wikipedia. The number of views characterized by a given discrepancy between log timestamps and estimated dwell times is shown on the Y axis and the magnitude of the discrepancy is shown on the x axis. There appears to be a period spike in errors at 40second intervals.

What spikes we have every 40 seconds!

DONE: Generate histograms for totalLength and visibleLength edit

 

TODO: Investigate negative values in the data for page event times. edit

TODO: Investigate very large values in the data for page event times. edit

Return to "Reading time/Work log/2018-09-25" page.