Research:Reading time/Online Supplement

This online supplement is still a work in progress.

This is the online supplement for TeBlunthuis, Nathan; Bayer, Tilman; Vasileva, Olga (2019-08-20). Dwelling on Wikipedia: Investigating time spent by global encyclopedia readers. OpenSym ’19, The 15th International Symposium on Open Collaboration. Skövde, Sweden. p. 14.

Here you can find plots of the distributions of page dwell times for all the MediaWiki wikis that we analyzed in the paper, along with details of our model selection process that we omitted from the paper. We also show a marginal effects plot of the relationship between page length and reading time.

Plots of distributions of page dwell times

This large chart shows box plots for each wiki. Wikis are in alphabetical order by language code.

Marginal effects plot showing how the time spent on pages depends on page length according to Model 1a.

Univariate Model Selection

In our analysis, we performed model selection to explore how well different probability models can describe the processes generating reading time data. Here, we provide additional details of our procedure.

Our method for model selection is inspired in part by Liu et al. (2010), who compared the log-normal distribution to the Weibull distribution of dwell times on a large sample of web pages ^[1]. They fit both models to data for each web page and then compare two measures of model fit: the log-likelihood, which measures the probability of the data given the model (higher is better), and the Kolmogorov-Smirnov distance (KS distance), which is the maximum difference between the model CDF and the empirical CDF (lower is better). For the sample of web pages they consider, the Weibull model outperformed the log-normal model in a large majority of cases according to both goodness-of-fit measures.
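The two fit measures described above can be illustrated with a minimal sketch. It is not the code used in the paper: the data are synthetic draws from a log-normal distribution, the two candidate models (log-normal and exponential) are chosen only because their maximum-likelihood estimates have closed forms, and all function names are made up for this example.

```python
import math
import random

def fit_lognormal(times):
    """Closed-form maximum-likelihood estimates (mu, sigma) of a log-normal."""
    logs = [math.log(t) for t in times]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma

def lognormal_logpdf(t, mu, sigma):
    z = (math.log(t) - mu) / sigma
    return -math.log(t * sigma * math.sqrt(2 * math.pi)) - 0.5 * z * z

def lognormal_cdf(t, mu, sigma):
    return 0.5 * (1 + math.erf((math.log(t) - mu) / (sigma * math.sqrt(2))))

def fit_exponential(times):
    """Maximum-likelihood rate of an exponential model: 1 / sample mean."""
    return len(times) / sum(times)

def exponential_logpdf(t, rate):
    return math.log(rate) - rate * t

def exponential_cdf(t, rate):
    return 1 - math.exp(-rate * t)

def log_likelihood(times, logpdf):
    """Sum of log densities of the data under a fitted model (higher is better)."""
    return sum(logpdf(t) for t in times)

def ks_distance(times, cdf):
    """Largest gap between the model CDF and the empirical CDF (lower is better)."""
    xs = sorted(times)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

# Synthetic "dwell times" in seconds (an assumption, not real reading data).
random.seed(0)
sample = [random.lognormvariate(3.0, 1.2) for _ in range(2000)]

mu, sigma = fit_lognormal(sample)
rate = fit_exponential(sample)

ll_lognormal = log_likelihood(sample, lambda t: lognormal_logpdf(t, mu, sigma))
ll_exponential = log_likelihood(sample, lambda t: exponential_logpdf(t, rate))
ks_lognormal = ks_distance(sample, lambda t: lognormal_cdf(t, mu, sigma))
ks_exponential = ks_distance(sample, lambda t: exponential_cdf(t, rate))
```

Because the synthetic data are log-normal by construction, the log-normal model should come out ahead on both measures here; with real data, neither outcome is guaranteed.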

Similar to the approach of Liu et al. (2010), we fit each of the models we consider to reading time data, separately for each Wikipedia project ^[1]. We also use the KS distance to evaluate goodness-of-fit. The KS distance provides a statistical test of the null hypothesis that the model is a good fit for the data ^[2]. The KS test is quite sensitive to deviations between the model and the data, especially in large samples, so failing to reject this hypothesis with a large sample supports the conclusion that the model is a good fit for the data. For the sample sizes we use, passing the KS test is a high bar. This allows us to go beyond Liu et al. (2010) by evaluating whether each distribution is a plausible model, instead of just whether one distribution is a better fit than another.
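To see why passing the KS test becomes harder as samples grow, the following sketch converts a KS distance into an approximate p-value using the classical asymptotic Kolmogorov series. This is an illustration only: the supplement does not state how the p-values were computed, the function name is hypothetical, and the approximation assumes the model parameters were not estimated from the same sample, so it is a rough guide at best when they were.

```python
import math

def kolmogorov_p_value(d, n, terms=100):
    """Asymptotic p-value for a KS distance d observed on n data points.

    Uses Q(lam) = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 * k^2 * lam^2)
    with lam = sqrt(n) * d.
    """
    lam = math.sqrt(n) * d
    if lam < 0.2:
        # The alternating series converges slowly here; the p-value is ~1.
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2 * total))

# The same KS distance is unremarkable in a small sample but
# overwhelming evidence against the model in a large one.
p_small_sample = kolmogorov_p_value(0.02, 500)
p_large_sample = kolmogorov_p_value(0.02, 50_000)
```

With the same distance of 0.02, the small sample gives a p-value close to 1 while the large sample gives one that is effectively zero, which is the sense in which the test is a high bar for large samples.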

Liu et al. (2010) compare two distributions that each have 2 parameters, but the models we consider have different numbers of parameters (the exponentiated Weibull model has 3 parameters and the exponential model has only 1). Adding parameters can increase model fit without improving out-of-sample predictive performance or explanatory power. To avoid the risk of over-fitting and to make a fair comparison between models, we use the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) instead of the log-likelihood. Both criteria attempt to quantify the amount of information lost by the model (lower is better): they are based on the log-likelihood but add a penalty for the number of model parameters. The difference between AIC and BIC is that the BIC penalty grows with the logarithm of the sample size, while the AIC penalty does not depend on the sample size, so BIC penalizes additional parameters more heavily in large samples. For a more detailed example of this procedure see this work log.
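The standard definitions of the two criteria are AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of parameters, n the number of observations, and ln L the maximized log-likelihood. The sketch below applies them to invented log-likelihood values (not results from the paper) chosen so that the two criteria disagree.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Hypothetical values for illustration only: (log-likelihood, parameter count).
n_obs = 10_000
candidates = {
    "exponential": (-48_000.0, 1),
    "log-normal": (-46_003.0, 2),
    "exponentiated Weibull": (-46_000.0, 3),
}

scores = {
    name: (aic(ll, k), bic(ll, k, n_obs))
    for name, (ll, k) in candidates.items()
}
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])
```

In this made-up case the third parameter buys a small gain in log-likelihood: enough to win under AIC, but not enough to offset the larger BIC penalty at n = 10,000, so BIC prefers the two-parameter model.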

We compute these goodness-of-fit measures for each wiki and rank the models from best to worst. For each distribution, we report the mean and median of these ranks. In addition, we report the mean and median p-values of the KS tests and the proportion of wikis that pass the KS test for each model. We also use diagnostic plots to compare the empirical and modeled distributions of the data in order to explain where models are failing to fit the data. Because the data are so skewed, we use a logarithmic scale on the x-axis of these plots.
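The ranking step can be sketched as follows. The wiki names and BIC values are placeholders invented for this example, not data from the paper, and ties between models are not handled.

```python
from statistics import mean, median

# Hypothetical BIC values (lower is better) for three made-up wikis.
bic_by_wiki = {
    "wiki_a": {"Lomax": 1010.0, "log-normal": 1004.0,
               "exponentiated Weibull": 1000.0, "Weibull": 1100.0},
    "wiki_b": {"Lomax": 2001.0, "log-normal": 2000.0,
               "exponentiated Weibull": 2003.0, "Weibull": 2200.0},
    "wiki_c": {"Lomax": 502.0, "log-normal": 507.0,
               "exponentiated Weibull": 500.0, "Weibull": 560.0},
}

def rank_models(scores):
    """Map each model to its rank on one wiki (1 = lowest score = best)."""
    ordered = sorted(scores, key=scores.get)
    return {model: position for position, model in enumerate(ordered, start=1)}

# Collect each model's rank across all wikis.
ranks = {}
for wiki, scores in bic_by_wiki.items():
    for model, rank in rank_models(scores).items():
        ranks.setdefault(model, []).append(rank)

# Summarize each model by the mean and median of its per-wiki ranks.
summary = {model: (mean(r), median(r)) for model, r in ranks.items()}
```

Reporting both the mean and the median rank guards against a model that does very well on most wikis but very badly on a few, or the reverse.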

The diagnostic plots are shown for data from English Wikipedia. On this wiki, the likelihood-based measures indicate that the exponentiated Weibull model is the best fit, followed by the Lomax model and then the log-normal model, but only the log-normal model passes the KS test.

Distribution fitting plots

To further explore how well these distributions fit the data, we present a series of diagnostic plots that compare the empirical distribution of the data with the distributions predicted by the models. For each of the four models under consideration (Lomax, log-normal, exponentiated Weibull, Weibull), we present a density plot, a distribution plot, and a quantile-quantile plot (Q-Q plot). The density plots compare the probability density function of the estimated parametric model to the normalized histogram of the data. Similarly, the distribution plots compare the estimated cumulative distribution to the empirical distribution. The Q-Q plots show the values of the quantile function for the data on the x-axis and for the estimated model on the y-axis. These plots can help us diagnose the ways in which the data diverge from each of the models. We present the x-axis of all these plots on a logarithmic scale to improve the visibility of the data.
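As a rough illustration of how the points in a Q-Q plot are obtained, the sketch below pairs each sorted observation with the fitted model's quantile at the same probability level. It uses synthetic log-normal data and a log-normal model only because that quantile function is available in the Python standard library; the function names and the plotting positions (i - 0.5) / n are choices made for this example, not necessarily those used for the figures.

```python
import math
import random
from statistics import NormalDist

def lognormal_quantile(p, mu, sigma):
    """Quantile function of a log-normal with parameters mu and sigma."""
    return math.exp(mu + sigma * NormalDist().inv_cdf(p))

def qq_points(times, quantile):
    """Pair each sorted observation (x) with the model quantile (y)."""
    xs = sorted(times)
    n = len(xs)
    return [(x, quantile((i - 0.5) / n)) for i, x in enumerate(xs, start=1)]

# Synthetic "dwell times" in seconds (an assumption, not real reading data).
random.seed(1)
sample = [random.lognormvariate(3.0, 1.2) for _ in range(1000)]

# Fit the log-normal by maximum likelihood on the log scale.
logs = [math.log(t) for t in sample]
mu = sum(logs) / len(logs)
sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))

points = qq_points(sample, lambda p: lognormal_quantile(p, mu, sigma))

# Distance of each point from the diagonal, measured on a log10 scale.
log_gaps = [abs(math.log10(x) - math.log10(q)) for x, q in points]
```

If the model fits well, the points lie close to the diagonal; systematic departures at the low or high end indicate that the model misjudges very short or very long reading times.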

We show these plots for data from English Wikipedia. For this wiki, the likelihood-based goodness-of-fit measures indicate that the exponentiated Weibull model is the best fit (BIC = 19321), followed in order by the Lomax (BIC = 19351), the log-normal (BIC = 19373), and the Weibull (BIC = 20111), but the log-normal model is the only model that passes the KS test ($p$ = 0.089).

Figure 2.2. The Lomax model accurately estimates the rate of long reading times, but its monotonic density overestimates the probability of very short reading times and underestimates that of reading times in the range of 1-10 seconds.

Figure 2.4. The exponentiated Weibull model fits the data somewhat better than the log-normal model, but still overestimates the occurrence of very short reading times.

Figure 2.3. The log-normal model fits the data well, but overestimates the probability of very short reading times and underestimates the probability of very long reading times.

Figure 2.5. The Weibull model is not a good fit for the data. On a log scale, the PDF is not only monotonically decreasing, it is also concave up everywhere. It greatly overestimates the probability of very short and very long reading times while underestimating the probability of reading times between 10 and 1000 seconds.

[1] Liu, Chao; White, Ryen W.; Dumais, Susan (2010). "Understanding Web Browsing Behaviors Through Weibull Analysis of Dwell Time". Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '10 (New York, NY, USA: ACM): 379–386. doi:10.1145/1835449.1835513.
[2] Clauset, A.; Shalizi, C.; Newman, M. (2009-11-04). "Power-Law Distributions in Empirical Data". SIAM Review 51 (4): 661–703. ISSN 0036-1445. doi:10.1137/070710111. Retrieved 2019-01-01.
