Research talk:Reading time/Work log/2018-09-25
Latest comment: 6 years ago by Tbayer (WMF) in topic Consistency Checks
Tuesday, September 25, 2018
editConsistency Checks
editTaking care of this task: https://phabricator.wikimedia.org/T160492
DONE/UPDATED TO EXCLUDE SAFARI: Having found that there are sometimes page tokens (pageviews) with > 2 events, check that at least there is only one pageloaded event among them each time
edit- Page tokens with more than 1 pageLoaded event exist.
-- Are there any cases where pageLoaded > 1 ? SELECT COUNT(*) AS nLoadedTwice FROM ( SELECT event.pageToken AS pageToken, event.action AS eventAction, COUNT(*) as nAction FROM event.readingdepth WHERE year = 2018 AND month = 9 AND day = 24 AND useragent.browser_family != "Safari" GROUP BY event.pageToken, event.action) t1 WHERE t1.eventAction = "pageLoaded" AND t1.nAction > 1; ---- result is 64 -- What persentage of these cases is this? SELECT COUNT(*) AS nEvents FROM event.readingdepth WHERE year = 2018 AND month = 9 AND day = 24 AND useragent.browser_family != "Safari"; ---- result is 575760
So we have a 0.000111 ratio of cases where pageLoaded more than once. This seems pretty small, but acceptable. Filtering these out in analysis seems like it wouldn't hurt.
DONE: Check if such pageviews (tokens) with multiple unloaded events will need to be filtered out during analysis, or whether their impact is likely small enough to be ignored
edit---- What about cases where pageUnloaded > 1 ? -- Are there any cases where pageLoaded > 1 ? SELECT COUNT(*) AS nUnloadedTwice FROM ( SELECT event.pageToken AS pageToken, event.action AS eventAction, COUNT(*) as nAction FROM event.readingdepth WHERE year = 2018 AND month = 9 AND day = 24 AND useragent.browser_family != "Safari" GROUP BY event.pageToken, event.action) t1 WHERE t1.eventAction = "pageUnloaded" AND t1.nAction > 1; -- result is 1792
So we have a 0.00311 (0.311%) rate of cases where pageUnloaded more than once. This is more than multiple page loaded events. Filtering these out seems like it wouldn't hurt, especially if they are outliers.
DONE/UPDATED to exclude safari: Consistency checks: totalLength and visibleLength should be less (apart from rounding errors) than the difference between timestamps of the loaded and unloaded events (@Zareenf already worked on these)
editADD JAR /srv/deployment/analytics/refinery/artifacts/refinery-hive.jar; CREATE TEMPORARY FUNCTION GetMediawikiTimestamp as 'org.wikimedia.analytics.refinery.hive.GetMediawikiTimestampUDF'; -- Query to see what browsers exist SELECT DISTINCT(useragent.browser_family) AS browser_ FROM event.readingdepth WHERE year = 2018 AND month = 9 AND day = 23; CREATE TABLE IF NOT EXISTS nathante.cleanReadingDepth LOCATION "/user/nathante/cleanReadingDepth" AS SELECT * FROM event.readingdepth WHERE year > 0 and event.pageToken in ( SELECT DISTINCT(pageToken) FROM (SELECT event.pageToken AS pageToken, event.action AS eventAction, COUNT(*) as nAction FROM event.readingdepth WHERE useragent.browser_family != "Safari" AND year > 0 GROUP BY event.pageToken, event.action) t1 WHERE t1.eventAction = "pageLoaded" AND t1.nAction == 1 ); -- OK now we gotta -- compute the differences between loaded and unloaded events use nathante; ---- the lengths in the record are in ms, but the dt are in s CREATE TABLE IF NOT EXISTS nathante.pageEventTimings LOCATION "/user/nathante/pageEventTimings" AS SELECT lo.pageToken AS pageToken, lo.dt AS lo_dt, ul.dt AS ul_dt, ul.dt - lo.dt AS EventSpan, (ul.dt - lo.dt) - totalLength/1000.0 AS errTotalLength, (ul.dt - lo.dt) - visibleLength/1000.0 AS errVisibleLength FROM ( SELECT GetMediaWikiTimestamp(dt) AS dt, event.pageToken as pageToken FROM cleanReadingDepth WHERE event.action = "pageLoaded") lo, ( SELECT GetMediaWikiTimestamp(dt) AS dt, event.pageToken as pageToken, event.totalLength AS totalLength, event.visibleLength AS visibleLength FROM cleanReadingDepth WHERE event.action = "pageUnloaded") ul WHERE lo.pageToken = ul.pageToken AND lo.dt > 0; ---- This might be too big or too small. What is a reasonable rounding threshhold? ---- set roundingThreshhold=-2; -- I can't get the variable to work SELECT COUNT(*) AS nrow, AVG(IF(errTotalLength < -2,1,0)) AS avgTotalLengthErrs, SUM(IF(errVisibleLength < -2,1,0)) AS nVisibleLengthErrs FROM nathante.pageEventTimings; -- nrow ntotallengtherrs nvisiblelengtherrs -- 80900688 0.07590008381634529 0.06944185690979537
So about 7.5% error rate for total time and 7% error rate for visible time.
SELECT COUNT(*) AS nrow, AVG(IF(errTotalLength < -5,1,0)) AS nTotalLengthErrs, SUM(IF(errVisibleLength < -5,1,0)) AS nVisibleLengthErrs FROM nathante.pageEventTimings; -- nrow avgtotallengtherrs avgvisiblelengtherrs --80900688 0.015145619033548887 0.013411258010562284
Visible length seems better: only 1.5% error rate.
Groceryheist (talk) 21:47, 25 September 2018 (UTC)
- Great! (We should update the Phab task with these results too.)
- Regarding the timing inconsistencies: Some noise is to be expected, in particular because the time when an event request is received by the server (the
dt
timestamp) is always different from the time it was logged in the browser, etc.) That said, these percentages seem a bit high, and also, it is weird that the distribution of the total length error appears seem to have periodic peaks with a periodicity of about 40 seconds. That seems to point to a larger problem.
SELECT rerrTotalLength, COUNT(*) AS views FROM ( SELECT ROUND(errTotalLength,1) AS rerrTotalLength FROM nathante.pageEventTimings ) AS reTL GROUP BY rerrTotalLength HAVING ABS(rerrTotalLength) < 200 ORDER BY rerrTotalLength LIMIT 100000;
- Regards, Tbayer (WMF) (talk) 23:11, 25 September 2018 (UTC)
- Wow that 40 second periodicity is crazy! Who might have an explanation? Groceryheist (talk) 23:21, 25 September 2018 (UTC)
- We should ask the web engineers on the Phab ticket whether they have ideas... It might be worth checking first whether these events with n x 40 seconds discrepancy are concentrated in any dimension, e.g. browser, domain (
webhost
), mobile vs. desktop (webhost LIKE '%.m%'
). A histogram chart might be useful for illustration. - Regards, Tbayer (WMF) (talk) 23:45, 25 September 2018 (UTC)
- @Groceryheist: Are we sure the calculation of
errTotalLength
in the query above is correct? It seems it is subtracting one MediaWiki timestamp from another (ul.dt - lo.dt
). But those timestamps have the format YYYYMMDDHHMMSS, so that subtraction will give wrong results (e.g. the difference between 20180921000000 and 20180921000100 is 60 seconds, not 100 seconds). Regards, Tbayer (WMF) (talk) 00:34, 4 October 2018 (UTC)
- We should ask the web engineers on the Phab ticket whether they have ideas... It might be worth checking first whether these events with n x 40 seconds discrepancy are concentrated in any dimension, e.g. browser, domain (
- Wow that 40 second periodicity is crazy! Who might have an explanation? Groceryheist (talk) 23:21, 25 September 2018 (UTC)
DONE: Generate histogram for totallength and visiblelength that don't agree with timestamps.
editWhat spikes we have every 40 seconds!