Research talk:Reading time/Work log/2018-10-06
Saturday, October 6, 2018
I wanted to look at the distributions on several wikis. [User:OVasileva (WMF)] suggested English (en), Dutch (nl), Spanish (es), Hindi (hi), Arabic (ar), and Punjabi (pa). Because the amount of data is highly skewed across wikis, I took a stratified sample using this script:
-- Create a table that has at most 10000 rows from each stratum (wiki).
-- Strata with more than 10000 true rows get a weight above 1, so that
-- weighted estimates still reflect the full data.
CREATE TABLE IF NOT EXISTS nathante.readingDepthSampleByWiki LOCATION "/user/nathante/readingDepthSampleByWiki"
AS
SELECT dt,
       pagetoken,
       dominteractivetime,
       totalLength,
       visibleLength,
       browser_family,
       os_family,
       wmf_app_version,
       wiki,
       mobile,
       year,
       month,
       day,
       hour,
       LEAST(10000, N_strat) AS N_strat_samp,       -- rows kept in this stratum
       N_strat / LEAST(10000, N_strat) AS weight,   -- true rows represented by each sampled row
       webhost
FROM (
    SELECT
        dt,
        event.pagetoken AS pagetoken,
        event.dominteractivetime AS dominteractivetime,
        event.totallength AS totalLength,
        event.visiblelength AS visibleLength,
        useragent.browser_family AS browser_family,
        useragent.os_family AS os_family,
        useragent.wmf_app_version AS wmf_app_version,
        wiki,
        year,
        month,
        day,
        hour,
        mobile,
        webhost,
        COUNT(*) OVER (PARTITION BY hash(wiki)) AS N_strat,                  -- true stratum size
        rank() OVER (PARTITION BY hash(wiki) ORDER BY rand()) AS rank_strat  -- random order within stratum
    FROM (SELECT *, webhost LIKE '%.m%' AS mobile FROM nathante.cleanReadingDepth) A ) B
-- Use <= so that up to 10000 rows are kept, matching N_strat_samp and weight.
WHERE (rank_strat <= 10000)
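The sampling and weighting logic of the script can be illustrated outside Hive. The following is a minimal Python sketch, not the code that was actually run: the function name `stratified_sample`, the `cap` and `seed` parameters, and the toy rows are all hypothetical. It keeps at most `cap` random rows per stratum and attaches the same `N_strat_samp` and `weight` columns as the query.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, cap=10000, seed=0):
    """Keep at most `cap` random rows per stratum and attach a weight.

    weight = (true stratum size) / (sampled stratum size), mirroring
    N_strat / LEAST(10000, N_strat) in the Hive query.
    """
    rng = random.Random(seed)

    # Group rows by stratum (here: by wiki).
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)

    sample = []
    for members in strata.values():
        n_strat = len(members)          # true number of rows in the stratum
        n_samp = min(cap, n_strat)      # rows actually kept
        weight = n_strat / n_samp       # rows represented by each kept row
        for row in rng.sample(members, n_samp):
            sample.append({**row, "N_strat_samp": n_samp, "weight": weight})
    return sample


# Toy data (hypothetical): one large stratum and one small stratum.
rows = [{"wiki": "enwiki", "visibleLength": i} for i in range(50)]
rows += [{"wiki": "pawiki", "visibleLength": i} for i in range(5)]

sample = stratified_sample(rows, key=lambda r: r["wiki"], cap=10)
weights = {r["wiki"]: r["weight"] for r in sample}
total_weight = sum(r["weight"] for r in sample)
```

Summing the weights over the sample recovers the true number of rows, which is the property that makes weighted estimates across wikis comparable to estimates on the full table.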
Summary stats from the stratified table
Remember that the units are milliseconds. The table summarizes visiblelength for each wiki.

wiki | max | percentile_95 | percentile_75 | median | mean | percentile_25 | percentile_5 | min | count |
---|---|---|---|---|---|---|---|---|---|
arwiki | 1.525657e+12 | 358095.30 | 79087.25 | 28957.0 | 3.862190e+08 | 9719.25 | 2150.25 | 219.0 | 3954 |
dewiki | 2.994148e+08 | 444023.55 | 79470.50 | 24497.0 | 3.996264e+05 | 7482.75 | 1750.10 | 261.0 | 4084 |
enwiki | 1.515280e+12 | 374439.40 | 64480.00 | 21509.0 | 3.579515e+08 | 6854.00 | 1649.20 | 188.0 | 4237 |
eswiki | 1.526892e+12 | 577875.50 | 100811.00 | 32537.0 | 3.725822e+08 | 10733.50 | 2238.15 | 125.0 | 4102 |
hiwiki | 2.021348e+07 | 359279.70 | 78303.00 | 31319.0 | 1.000993e+05 | 11674.50 | 2665.90 | 7.0 | 3679 |
nlwiki | 1.825265e+08 | 472283.75 | 66615.00 | 22465.0 | 2.952587e+05 | 6989.75 | 1729.25 | 267.0 | 4114 |
pawiki | 1.528875e+12 | 274091.00 | 54982.25 | 20308.0 | 5.298374e+08 | 7543.25 | 2007.75 | 105.0 | 2886 |
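In the table above, the wikis whose maximum is around 1.5e+12 ms also have means several orders of magnitude above their medians, so the mean is dominated by a few extreme values. The sketch below reproduces the same set of columns on toy data to show the effect. It is an illustration only: the helper names `percentile` and `summarize` are hypothetical, the toy values are made up, and linear interpolation between order statistics is an assumption, since the log does not say how the percentiles were computed.

```python
def percentile(sorted_values, q):
    """Percentile q (0 to 100) of a pre-sorted list, with linear interpolation."""
    if not sorted_values:
        raise ValueError("need at least one value")
    pos = (len(sorted_values) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * frac


def summarize(values):
    """Return the same summary columns as the table, for one wiki."""
    s = sorted(values)
    return {
        "max": s[-1],
        "percentile_95": percentile(s, 95),
        "percentile_75": percentile(s, 75),
        "median": percentile(s, 50),
        "mean": sum(s) / len(s),
        "percentile_25": percentile(s, 25),
        "percentile_5": percentile(s, 5),
        "min": s[0],
        "count": len(s),
    }


# Toy visible times in milliseconds (hypothetical), with one extreme value.
toy = [1000, 2000, 3000, 4000, 1_500_000_000_000]
stats = summarize(toy)
```

With a single extreme observation, the median stays at a plausible reading time while the mean is pulled up by many orders of magnitude, which is why the percentile columns are the more informative ones here.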
Visible time histogram by wiki
This chart shows box plots of the distribution of time that Wikipedia pages were open in the browser on a selection of wikis. The plots were computed on random samples of several thousand observations for each wiki and truncated at 300 seconds. Spanish, Hindi, and Arabic appear to have longer reading times, while English and Punjabi appear to have somewhat shorter reading times.
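The numbers behind such a box plot can be computed as follows. This is a sketch under stated assumptions rather than the code used for the chart: it assumes that "truncated" means observations above 300 seconds were dropped (not clipped), and that whiskers follow the common 1.5 × IQR convention. The function name `box_stats` and the toy values are hypothetical.

```python
import statistics

TRUNCATE_MS = 300 * 1000  # 300 seconds, expressed in milliseconds


def box_stats(values_ms, limit_ms=TRUNCATE_MS):
    """Box-plot statistics after dropping observations above limit_ms."""
    kept = sorted(v for v in values_ms if v <= limit_ms)
    if len(kept) < 2:
        raise ValueError("need at least two observations after truncation")

    # Quartiles computed from the kept data only.
    q1, median, q3 = statistics.quantiles(kept, n=4, method="inclusive")
    iqr = q3 - q1

    # Whiskers reach the most extreme data points inside the 1.5 * IQR fences.
    whisker_low = min(v for v in kept if v >= q1 - 1.5 * iqr)
    whisker_high = max(v for v in kept if v <= q3 + 1.5 * iqr)

    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": whisker_low,
        "whisker_high": whisker_high,
        "n_kept": len(kept),
        "n_dropped": len(values_ms) - len(kept),
    }


# Toy visible times in milliseconds (hypothetical).
toy = [2000, 7000, 21000, 64000, 290000, 1_500_000_000_000]
box = box_stats(toy)
```

Truncating before computing the quartiles keeps the implausibly large values from stretching the plot, at the cost of slightly understating the upper quartile if long but genuine reading sessions are removed.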