Research talk:Reading time/Work log/2018-10-06

Saturday, October 6, 2018

I wanted to look at the distributions of reading time on several wikis. [User:OVasileva (WMF)] suggested English (en), Dutch (nl), Spanish (es), Hindi (hi), Arabic (ar), and Punjabi (pa). Because the amount of data is highly skewed across wikis, I took a stratified sample using this script:

-- Create a table that has at most 10000 rows from each stratum (wiki).
-- Compute weights so that strata with more than 10000 true rows are weighted appropriately.
CREATE TABLE IF NOT EXISTS nathante.readingDepthSampleByWiki LOCATION "/user/nathante/readingDepthSampleByWiki"
AS
SELECT dt, 
        pagetoken,
        dominteractivetime, 
        totalLength, 
        visibleLength, 
        browser_family, 
        os_family,
        wmf_app_version,
        wiki,
        mobile,
        year,
        month,
        day,
        hour,
        LEAST(10000, N_strat) AS N_strat_samp,
        N_strat/LEAST(10000, N_strat) AS weight,
        webhost
FROM (
    SELECT
        dt,
        event.pagetoken AS pagetoken,
        event.dominteractivetime AS dominteractivetime, 
        event.totallength AS totalLength, 
        event.visiblelength AS visibleLength, 
        useragent.browser_family AS browser_family, 
        useragent.os_family AS os_family,
        useragent.wmf_app_version AS wmf_app_version,
        wiki,
        year,
        month,
        day,
        hour,
        mobile,
        webhost,
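        -- N_strat is the true stratum size; rank_strat is a random ordering within the stratum
        COUNT(*) OVER (PARTITION BY hash(wiki)) AS N_strat,
        RANK() OVER (PARTITION BY hash(wiki) ORDER BY rand()) AS rank_strat
    FROM (SELECT *, webhost LIKE '%.m%' AS mobile FROM nathante.cleanReadingDepth) A ) B
-- use <= so that up to 10000 rows are kept, matching LEAST(10000, N_strat) in the weight
WHERE (rank_strat <= 10000)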
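The sampling logic of the query above can be sketched in plain Python, which may make the role of the weight column easier to see. This is only an illustration under stated assumptions: the function name `stratified_sample`, the toy rows, and the small cap of 10 are made up for the example and are not part of the original pipeline. Each kept row gets weight N_strat / min(cap, N_strat), so the weights within a stratum sum back to its true size.

```python
import random
from collections import defaultdict

CAP = 10000  # maximum rows kept per stratum, as in the query above


def stratified_sample(rows, key, cap=CAP, seed=0):
    """Keep at most `cap` randomly chosen rows per stratum and attach a weight."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)

    sample = []
    for members in strata.values():
        n_strat = len(members)       # true stratum size (N_strat)
        n_samp = min(cap, n_strat)   # LEAST(cap, N_strat)
        weight = n_strat / n_samp    # each kept row stands for this many rows
        for row in rng.sample(members, n_samp):
            sample.append({**row, "N_strat_samp": n_samp, "weight": weight})
    return sample


# Toy data (hypothetical sizes): one large wiki and one small wiki.
rows = [{"wiki": "enwiki", "visiblelength": i} for i in range(50)]
rows += [{"wiki": "pawiki", "visiblelength": i} for i in range(5)]

sample = stratified_sample(rows, key="wiki", cap=10)

# Summing the weights per wiki recovers the true stratum sizes (50 and 5).
weighted_total = defaultdict(float)
for row in sample:
    weighted_total[row["wiki"]] += row["weight"]
print(dict(weighted_total))
```

Any statistic that pools wikis together would need to use these weights; per-wiki statistics do not, because the weight is constant within a stratum.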

Summary stats from the stratified table

Remember that the units are milliseconds.

visiblelength (milliseconds)
wiki    max           percentile_95  percentile_75  median   mean          percentile_25  percentile_5  min    count
arwiki  1.525657e+12  358095.30      79087.25       28957.0  3.862190e+08  9719.25        2150.25       219.0  3954
dewiki  2.994148e+08  444023.55      79470.50       24497.0  3.996264e+05  7482.75        1750.10       261.0  4084
enwiki  1.515280e+12  374439.40      64480.00       21509.0  3.579515e+08  6854.00        1649.20       188.0  4237
eswiki  1.526892e+12  577875.50      100811.00      32537.0  3.725822e+08  10733.50       2238.15       125.0  4102
hiwiki  2.021348e+07  359279.70      78303.00       31319.0  1.000993e+05  11674.50       2665.90       7.0    3679
nlwiki  1.825265e+08  472283.75      66615.00       22465.0  2.952587e+05  6989.75        1729.25       267.0  4114
pawiki  1.528875e+12  274091.00      54982.25       20308.0  5.298374e+08  7543.25        2007.75       105.0  2886
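Several maxima in the table are on the order of 1.5e12 ms, which is decades and cannot be a real reading time, so the means are dominated by a few extreme values while the medians and percentiles stay in a plausible range. The sketch below shows that effect on made-up numbers. The helper names `percentile` and `summarize` and the sample values are assumptions for illustration, not the code that produced the table.

```python
import statistics


def percentile(sorted_values, q):
    """Percentile q (0-100) by linear interpolation between neighbouring values."""
    pos = (len(sorted_values) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def summarize(values):
    """Return the same columns as the summary table for one list of values."""
    v = sorted(values)
    return {
        "max": v[-1],
        "percentile_95": percentile(v, 95),
        "percentile_75": percentile(v, 75),
        "median": percentile(v, 50),
        "mean": statistics.fmean(v),
        "percentile_25": percentile(v, 25),
        "percentile_5": percentile(v, 5),
        "min": v[0],
        "count": len(v),
    }


# Hypothetical visible times in ms; the last value mimics an implausible outlier.
times_ms = [2000, 7000, 20000, 60000, 1_500_000_000_000]
stats = summarize(times_ms)

# The median is unaffected by the outlier, but the mean is dominated by it.
print(stats["median"], stats["mean"])
```

For this reason the medians and percentiles are the more trustworthy columns when comparing wikis.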

Visible time histogram by wiki

 
This chart shows box plots of the distribution of time that Wikipedia pages were open in the browser on a selection of wikis. The plots were computed on random samples of several thousand observations for each wiki and truncated at 300 seconds. Spanish, Hindi, and Arabic appear to have longer reading times, while English and Punjabi appear to have somewhat shorter reading times.
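The numbers behind a box plot like this one can be sketched as follows. This is an assumption-laden illustration: it reads "truncated at 300 seconds" as dropping observations above the cutoff (clipping them to the cutoff is another possible reading), and the function name `box_stats` and the sample values are hypothetical.

```python
import statistics

TRUNCATE_MS = 300 * 1000  # 300 seconds, the cutoff mentioned in the caption


def box_stats(values_ms, cutoff_ms=TRUNCATE_MS):
    """Quartiles of the observations at or below the cutoff.

    Assumption: values above the cutoff are dropped, not clipped.
    """
    kept = [v for v in values_ms if v <= cutoff_ms]
    q1, median, q3 = statistics.quantiles(kept, n=4, method="inclusive")
    return {"q1": q1, "median": median, "q3": q3, "n": len(kept)}


# Hypothetical visible times in milliseconds, including two very long ones.
times_ms = [1000, 5000, 20000, 60000, 250000, 900000, 1_500_000_000_000]
summary = box_stats(times_ms)
print(summary)
```

Truncating before plotting keeps the box and whiskers readable, at the cost of hiding the long right tail.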