Research talk:Reading time/Work log/2018-11-17
Saturday, November 17, 2018
Today I worked on model selection for the HDI variable and the interaction between HDI and mobile. I decided to work on this first because I was suspicious of the result showing a strikingly large difference in reading time between lower-HDI and higher-HDI countries that was greatly diminished on mobile. I tried out many specifications, looked at residual plots, and evaluated the models on a hold-out sample.

My conclusions are that adding a higher-order polynomial for HDI provides a quite modest but statistically significant improvement to model fit, while adding a higher-order interaction between HDI and mobile does not substantively improve the model. This allays my primary concern, which was that mobile reading in the lowest-HDI contexts would be quite different from the middle-HDI and high-HDI contexts where most of our data lie. Despite the introduction of a higher-order polynomial for HDI, the substantive results from the model with only a first-order term are largely robust to this specification: readers in lower-HDI contexts have longer dwell times, readers on the non-mobile site read for longer than readers on the mobile site, and the gap between mobile and PC declines in higher-HDI contexts.

Moreover, since all observations from the same country have the same value of HDI, higher-order terms for HDI are probably picking up on information about countries that is correlated with, and potentially confounded with, HDI. This suggests that we should fit a mixed model with random intercepts (maybe even slopes, or wikis) for countries. I should give some attention to other tasks, but I think that is what should be done for the next round of modeling.
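The point about country-level HDI can be illustrated with a small sketch on synthetic data. Everything here is hypothetical (four made-up countries, made-up effects that are unrelated to HDI by construction); it is not the reading-time data. With only as many distinct HDI values as countries, a sufficiently high-order polynomial in HDI can reproduce the country means exactly, so it behaves like a set of country dummies:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 4 countries, each with a single HDI value shared by
# all of its observations, and a made-up country effect unrelated to HDI.
hdi_by_country = np.array([0.55, 0.70, 0.85, 0.95])
country_effect = np.array([0.3, -0.2, 0.4, -0.1])

country = rng.integers(0, 4, size=2000)
hdi = hdi_by_country[country]
y = country_effect[country] + rng.normal(scale=1.0, size=country.size)

# Cubic polynomial in HDI: 4 coefficients for 4 distinct HDI values.
X_poly = np.vander(hdi, N=4, increasing=True)
beta, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
fitted_poly = X_poly @ beta

# Fitted values from a model with one free mean per country.
country_means = np.array([y[country == k].mean() for k in range(4)])
fitted_means = country_means[country]

# The two sets of fitted values coincide up to floating-point error.
max_gap = float(np.max(np.abs(fitted_poly - fitted_means)))
print(max_gap)
```

With many countries and a low-order polynomial the match is only partial, but the direction of the concern is the same: extra HDI terms can absorb between-country variation that has nothing to do with development.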
Model Selection
Model 3 (HDI*mobile)
The model fit improves slightly, but significantly (as indicated by F-tests, shown below), with the addition of higher-order terms for HDI. One matter of concern is that the coefficients become quite large in the higher-order specifications, suggesting overfitting. I'll see which model has the best predictive performance on a hold-out sample.
Term | model 2 | model 3 | model 3 with quadratic HDI | model 3 with cubic HDI | model 3 with quartic HDI | model 3 with quintic HDI |
---|---|---|---|---|---|---|
Intercept | 10.8247 (0.0122)*** | 11.6286 (0.0165)*** | 6.4282 (0.1259)*** | 65.8866 (1.2369)*** | 253.8476 (11.0605)*** | -2637.6658 (77.9810)*** |
mobile | -0.2016 (0.0011)*** | -1.5927 (0.0191)*** | | | | |
Human Development Index | -1.0219 (0.0103)*** | -1.8893 (0.0157)*** | 9.8203 (0.2826)*** | -198.1313 (4.3139)*** | -1138.6273 (52.9416)*** | 16669.4991 (473.8399)*** |
mobile : HDI | | 1.5146 (0.0207)*** | -1.9572 (0.0228)*** | -2.5460 (0.3171)*** | 51.7714 (3.4166)*** | -76.4951 (30.0172)* |
mobile : HDI^2 | | | 1.8852 (0.0246)*** | 3.3265 (0.7124)*** | -185.1585 (11.8494)*** | 415.2980 (143.1758)** |
mobile : HDI^3 | | | | -0.8700 (0.3991)* | 216.3254 (13.6596)*** | -832.7491 (255.2358)** |
mobile : HDI^4 | | | | | -83.1222 (5.2339)*** | 727.8907 (201.5475)*** |
mobile : HDI^5 | | | | | | -234.1483 (59.4847)*** |
HDI^2 | | | -6.5621 (0.1587)*** | 234.7857 (5.0081)*** | 1983.9482 (94.8060)*** | -41665.3318 (1149.0281)*** |
HDI^3 | | | | -92.9699 (1.9344)*** | -1527.1808 (75.2644)*** | 51709.8261 (1389.8805)*** |
HDI^4 | | | | | 437.7320 (22.3464)*** | -31878.6736 (838.6029)*** |
HDI^5 | | | | | | 7812.3979 (201.9034)*** |
Revision length (bytes) | 0.1665 (0.0005)*** | 0.1664 (0.0005)*** | 0.1665 (0.0005)*** | 0.1663 (0.0005)*** | 0.1663 (0.0005)*** | 0.1662 (0.0005)*** |
time to first paint | -0.0154 (0.0006)*** | -0.0152 (0.0006)*** | -0.0150 (0.0006)*** | -0.0148 (0.0006)*** | -0.0149 (0.0006)*** | -0.0148 (0.0006)*** |
time to dom interactive | 0.0036 (0.0009)*** | 0.0036 (0.0009)*** | 0.0036 (0.0009)*** | 0.0036 (0.0009)*** | 0.0036 (0.0009)*** | 0.0038 (0.0009)*** |
R2 | 0.0520 | 0.0525 | 0.0526 | 0.0528 | 0.0529 | 0.0531 |
Adj. R2 | 0.0520 | 0.0525 | 0.0526 | 0.0528 | 0.0529 | 0.0530 |
Num. obs. | 9873641 | 9873641 | 9873641 | 9873641 | 9873641 | 9873641 |
RMSE | 14.3860 | 14.3821 | 14.3812 | 14.3795 | 14.3791 | 14.3780 |
***p < 0.001, **p < 0.01, *p < 0.05
Anova Results
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
9873619 | 2043418459 | NA | NA | NA | NA |
9873618 | 2042312944 | 1 | 1105514.7 | 5347.7375 | 0 |
9873617 | 2042058043 | 1 | 254901.4 | 1233.0417 | 0 |
9873615 | 2041575118 | 2 | 482924.4 | 1168.0320 | 0 |
9873613 | 2041441930 | 2 | 133187.8 | 322.1365 | 0 |
9873611 | 2041128943 | 2 | 312987.5 | 757.0118 | 0 |
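The F column can be reproduced from the Res.Df and RSS columns alone. The following is a minimal sketch, not the code that produced the table. It assumes (the log does not say) that each F statistic compares a model with the one listed before it and uses the residual mean square of the largest model as the denominator. The function name is hypothetical. Under that assumption the table's values are recovered to within rounding:

```python
# (residual df, residual sum of squares) for each model, copied from the
# ANOVA table above, smallest model first.
models = [
    (9873619, 2043418459),
    (9873618, 2042312944),
    (9873617, 2042058043),
    (9873615, 2041575118),
    (9873613, 2041441930),
    (9873611, 2041128943),
]


def sequential_f_statistics(models):
    """F statistic for each model against the previous (nested) one."""
    # Assumption: the error variance is estimated from the largest model,
    # i.e. the one with the fewest residual degrees of freedom.
    big_df, big_rss = min(models, key=lambda m: m[0])
    scale = big_rss / big_df
    stats = []
    for (df0, rss0), (df1, rss1) in zip(models, models[1:]):
        extra_terms = df0 - df1  # number of coefficients added
        stats.append(((rss0 - rss1) / extra_terms) / scale)
    return stats


f_stats = sequential_f_statistics(models)
for f in f_stats:
    print(round(f, 2))
```

With roughly 9.9 million residual degrees of freedom, even the smallest of these statistics is far above conventional critical values, which is why every p-value in the table is reported as 0.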
Residual Analysis
I made partial residual plots to diagnose the fit of model 3. The first set of plots shows the partial residuals (the residuals plus the fitted contribution of the variable in question) on the Y axis and the variable on the X axis. If the trend is not a straight line, that can motivate including higher-order terms in the model. I use a loess smoother to look for curvature.
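As a rough illustration of the idea, here is a sketch on synthetic data. The variables `x` and `z` are hypothetical stand-ins for HDI and mobile, the coefficients are made up, and binned means are used as a crude substitute for a loess smoother. The partial residuals for `x` from a first-order fit are the ordinary residuals plus the fitted linear contribution of `x`:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x = rng.uniform(0.5, 1.0, size=n)              # stand-in for HDI
z = rng.integers(0, 2, size=n).astype(float)   # stand-in for mobile

# Made-up data-generating process with a bend in x.
y = 1.0 + 2.0 * x - 8.0 * (x - 0.75) ** 2 - 0.3 * z + rng.normal(scale=0.2, size=n)

# First-order model: y ~ 1 + x + z.
X = np.column_stack([np.ones(n), x, z])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Partial residuals for x: residuals plus the fitted contribution of x.
partial = resid + beta[1] * x

# A straight-line fit through the partial residuals recovers the model slope.
slope = np.polyfit(x, partial, 1)[0]

# Curvature check: quadratic coefficient of partial residuals against x.
quad_coef = np.polyfit(x, partial, 2)[0]

# Crude smoother: mean partial residual in five equal-width bins of x.
edges = np.linspace(0.5, 1.0, 6)
bin_index = np.digitize(x, edges[1:-1])
bin_means = [float(partial[bin_index == k].mean()) for k in range(5)]
print(round(slope, 3), round(quad_coef, 3), [round(m, 3) for m in bin_means])
```

A clearly nonzero quadratic coefficient, or binned means that rise and then fall, is the numerical counterpart of a visible bend in the smoothed line.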
Partial residuals plot for the interaction of HDI and mobile. Residuals plot for modeling Wikipedia reading times with respect to the interaction between HDI and mobile. There doesn't seem to be much evidence of curvature (the line is pretty straight) suggesting that the current specification is appropriate, but that including higher order terms does not seem totally unreasonable. However, HDI is pretty much only in the range of (0.5,1) so the separation of the data might obscure curvature.
Partial residuals plot for the interaction of HDI (scaled) and mobile. Similar to the plot to the left, except that I transformed HDI by logging and scaling it before I fit the model. Now there is even less appearance of curvature suggesting that the current (first-order) specification is appropriate.
Partial residuals plot for HDI (Not scaled). There is a clear bend between 0.8 and 0.9 on the range of HDI. To me this suggests that we can include higher-order polynomial terms for HDI in the model.
Specifications without higher-order mobile:HDI interaction, but with higher order HDI polynomial
Per the results of the residual analysis, I realized that the right specification for HDI and mobile might be a higher-order polynomial for HDI, but not for the interaction between HDI and mobile. It now appears that the complex interaction is not very helpful; what is helpful is the higher-order specification for HDI.
Also, the F-test is still significant for the quadratic interaction, but the statistic is much smaller (F ≈ 26.5, compared with roughly 570 to 2278 for the HDI polynomial terms). We have so much data that it will be hard to fail any test.
Term | model 2 | model 3 | model 3 with quadratic HDI | model 3 with cubic HDI | model 3 with quartic HDI | model 3 with quadratic HDI:mobile |
---|---|---|---|---|---|---|
Intercept | 9.9687 (0.0077)*** | 10.0457 (0.0078)*** | 10.0487 (0.0078)*** | 10.0171 (0.0078)*** | 10.0308 (0.0079)*** | 10.0314 (0.0079)*** |
mobile | -0.2016 (0.0011)*** | -0.3236 (0.0020)*** | -0.3137 (0.0020)*** | -0.3082 (0.0020)*** | -0.3053 (0.0020)*** | -0.3071 (0.0020)*** |
Human Development Index | -0.0946 (0.0009)*** | -0.1746 (0.0014)*** | -0.1172 (0.0022)*** | -0.0520 (0.0026)*** | -0.0243 (0.0028)*** | -0.0147 (0.0034)*** |
mobile : HDI | | 0.1399 (0.0019)*** | 0.1275 (0.0019)*** | 0.1196 (0.0020)*** | 0.1186 (0.0020)*** | 0.1029 (0.0036)*** |
mobile : HDI^2 | | | | | | 0.0145 (0.0028)*** |
HDI^2 | | | -0.0482 (0.0014)*** | 0.0199 (0.0020)*** | -0.0690 (0.0042)*** | -0.0785 (0.0046)*** |
HDI^3 | | | | -0.0764 (0.0016)*** | -0.0868 (0.0017)*** | -0.0863 (0.0017)*** |
HDI^4 | | | | | 0.0421 (0.0018)*** | 0.0424 (0.0018)*** |
Revision length (bytes) | 0.1665 (0.0005)*** | 0.1664 (0.0005)*** | 0.1664 (0.0005)*** | 0.1663 (0.0005)*** | 0.1664 (0.0005)*** | 0.1664 (0.0005)*** |
time to first paint | -0.0154 (0.0006)*** | -0.0152 (0.0006)*** | -0.0150 (0.0006)*** | -0.0148 (0.0006)*** | -0.0148 (0.0006)*** | -0.0148 (0.0006)*** |
time to dom interactive | 0.0036 (0.0009)*** | 0.0036 (0.0009)*** | 0.0035 (0.0009)*** | 0.0036 (0.0009)*** | 0.0036 (0.0009)*** | 0.0036 (0.0009)*** |
R2 | 0.0520 | 0.0525 | 0.0526 | 0.0528 | 0.0529 | 0.0529 |
Adj. R2 | 0.0520 | 0.0525 | 0.0526 | 0.0528 | 0.0529 | 0.0529 |
Num. obs. | 9873641 | 9873641 | 9873641 | 9873641 | 9873641 | 9873641 |
RMSE | 14.3860 | 14.3821 | 14.3812 | 14.3796 | 14.3792 | 14.3791 |
***p < 0.001, **p < 0.01, *p < 0.05
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
9873619 | 2043410883 | NA | NA | NA | NA |
9873618 | 2042305921 | 1 | 1104961.432 | 5344.18611 | 0e+00 |
9873617 | 2042058100 | 1 | 247821.330 | 1198.59687 | 0e+00 |
9873616 | 2041587173 | 1 | 470927.432 | 2277.65764 | 0e+00 |
9873615 | 2041469312 | 1 | 117861.191 | 570.03993 | 0e+00 |
9873614 | 2041463831 | 1 | 5480.562 | 26.50694 | 3e-07 |
Moreover, the quadratic term for the interaction appears to be over-fitting. When I test the models on out-of-sample prediction, the model with the quadratic interaction has a slightly lower R² than the quartic HDI model (0.0614 vs. 0.0616), with essentially the same RMSE.
Rmse | Rsqr | name |
---|---|---|
1.677372 | 0.0542021 | model 2 |
1.678363 | 0.0575015 | model 3 |
1.679984 | 0.0562511 | model 3 with quadratic HDI |
1.679288 | 0.0590972 | model 3 with cubic HDI |
1.680671 | 0.0616457 | model 3 with quartic HDI |
1.680658 | 0.0613616 | model 3 with quadratic HDI:mobile |
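A hold-out comparison of this kind can be sketched as follows. The data are synthetic and the helper name `holdout_metrics` is hypothetical. The log does not say how the Rsqr column was computed, so the sketch reports both 1 - SS_res/SS_tot and the squared correlation between predictions and observations:

```python
import numpy as np


def holdout_metrics(y_true, y_pred):
    """RMSE and two common R-squared definitions on held-out data."""
    resid = y_true - y_pred
    rmse = float(np.sqrt(np.mean(resid ** 2)))
    r2 = float(1.0 - np.sum(resid ** 2) / np.sum((y_true - y_true.mean()) ** 2))
    corr2 = float(np.corrcoef(y_true, y_pred)[0, 1] ** 2)
    return {"rmse": rmse, "r2": r2, "corr2": corr2}


rng = np.random.default_rng(2)
n = 4000
x = rng.uniform(0.5, 1.0, size=n)                   # stand-in for HDI
y = 1.0 - 2.0 * x + rng.normal(scale=1.0, size=n)   # made-up first-order truth

# Random split: roughly a quarter of the rows are held out.
test_mask = rng.random(n) < 0.25
x_train, y_train = x[~test_mask], y[~test_mask]
x_test, y_test = x[test_mask], y[test_mask]

results = {}
for degree in (1, 2, 3, 4):
    # Center x before fitting so the polynomial terms are better conditioned.
    coefs = np.polyfit(x_train - 0.75, y_train, degree)
    y_pred = np.polyval(coefs, x_test - 0.75)
    results[degree] = holdout_metrics(y_test, y_pred)

for degree, m in results.items():
    print(degree, round(m["rmse"], 4), round(m["r2"], 4), round(m["corr2"], 4))
```

On a fixed hold-out sample, 1 - SS_res/SS_tot always ranks models the same way as RMSE, whereas the squared correlation can rank them differently, so it matters which definition a reported Rsqr uses.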
I wonder whether the small increase in the mid-range of HDI on mobile is robust.