Research talk:Reading time/Work log/2018-11-17

Saturday, November 17, 2018

Today I worked on model selection for the HDI variable and the interaction between HDI and mobile. I decided to work on this first because I was suspicious of the result showing a strikingly large difference in reading time between lower-HDI and higher-HDI countries that was greatly diminished on mobile. I tried out many specifications, looked at residual plots, and evaluated the models on a hold-out sample. My conclusions are that adding a higher-order polynomial for HDI provides a quite modest, but statistically significant, improvement to model fit, and that adding a higher-order interaction between HDI and mobile makes no substantive difference. This alleviates my primary concern, which was that mobile in the lowest-HDI contexts would be quite different from the middle-HDI and high-HDI contexts where most of our data lie. Despite the introduction of a higher-order polynomial for HDI, our substantive results from the model with only a first-order term are largely robust to this specification: readers in lower-HDI contexts have longer dwell times, readers on the non-mobile site read for longer than readers on the mobile site, and the gap between mobile and PC declines in higher-HDI contexts. Moreover, since all observations from the same country have the same value of HDI, higher-order terms for HDI are probably picking up on information about countries that is correlated with, and potentially confounded with, HDI. This suggests that we should fit a mixed model with random intercepts (maybe even slopes, or wikis) for countries. I should give some attention to other tasks, but I think that is what should be done for the next round of modeling.

Model Selection

Model 3 (HDI*mobile)

The model fit improves slightly, but significantly (as indicated by F-tests, shown below), with the addition of higher-order terms for HDI. One matter of concern is that the coefficients become quite large in the higher-order specifications, which may indicate overfitting. I'll see which model has the best predictive performance on a hold-out sample.

Statistical models
	model 2	model 3	model 3 with quadratic HDI	model 3 with cubic HDI	model 3 with quartic HDI	model 3 with quintic HDI
Intercept	10.8247 (0.0122)^***	11.6286 (0.0165)^***	6.4282 (0.1259)^***	65.8866 (1.2369)^***	253.8476 (11.0605)^***	-2637.6658 (77.9810)^***
mobile	-0.2016 (0.0011)^***	-1.5927 (0.0191)^***
Human Development Index	-1.0219 (0.0103)^***	-1.8893 (0.0157)^***	9.8203 (0.2826)^***	-198.1313 (4.3139)^***	-1138.6273 (52.9416)^***	16669.4991 (473.8399)^***
mobile : HDI		1.5146 (0.0207)^***	-1.9572 (0.0228)^***	-2.5460 (0.3171)^***	51.7714 (3.4166)^***	-76.4951 (30.0172)^*
mobile : HDI^2			1.8852 (0.0246)^***	3.3265 (0.7124)^***	-185.1585 (11.8494)^***	415.2980 (143.1758)^**
mobile : HDI^3				-0.8700 (0.3991)^*	216.3254 (13.6596)^***	-832.7491 (255.2358)^**
mobile : HDI^4					-83.1222 (5.2339)^***	727.8907 (201.5475)^***
mobile : HDI^5						-234.1483 (59.4847)^***
HDI^2			-6.5621 (0.1587)^***	234.7857 (5.0081)^***	1983.9482 (94.8060)^***	-41665.3318 (1149.0281)^***
HDI^3				-92.9699 (1.9344)^***	-1527.1808 (75.2644)^***	51709.8261 (1389.8805)^***
HDI^4					437.7320 (22.3464)^***	-31878.6736 (838.6029)^***
HDI^5						7812.3979 (201.9034)^***
Revision length (bytes)	0.1665 (0.0005)^***	0.1664 (0.0005)^***	0.1665 (0.0005)^***	0.1663 (0.0005)^***	0.1663 (0.0005)^***	0.1662 (0.0005)^***
time to first paint	-0.0154 (0.0006)^***	-0.0152 (0.0006)^***	-0.0150 (0.0006)^***	-0.0148 (0.0006)^***	-0.0149 (0.0006)^***	-0.0148 (0.0006)^***
time to dom interactive	0.0036 (0.0009)^***	0.0036 (0.0009)^***	0.0036 (0.0009)^***	0.0036 (0.0009)^***	0.0036 (0.0009)^***	0.0038 (0.0009)^***
R²	0.0520	0.0525	0.0526	0.0528	0.0529	0.0531
Adj. R²	0.0520	0.0525	0.0526	0.0528	0.0529	0.0530
Num. obs.	9873641	9873641	9873641	9873641	9873641	9873641
RMSE	14.3860	14.3821	14.3812	14.3795	14.3791	14.3780
^***p < 0.001, ^**p < 0.01, ^*p < 0.05
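To make the first-order interaction concrete, here is a small worked calculation using the mobile and mobile : HDI coefficients from the "model 3" column above. It assumes HDI enters that model on its raw 0 to 1 scale, which the log does not state explicitly, and the result is in whatever units the response was modeled in. The function name is only for illustration.

```python
# Coefficients copied from the "model 3" column of the table above.
MOBILE = -1.5927        # main effect of mobile
MOBILE_X_HDI = 1.5146   # mobile : HDI interaction


def mobile_gap(hdi):
    """Predicted mobile-minus-desktop difference at a given HDI.

    Assumption: HDI is on its raw 0-1 scale. The result is in the
    model's response units, whatever transformation was applied.
    """
    return MOBILE + MOBILE_X_HDI * hdi


for hdi in (0.5, 0.7, 0.9):
    print(f"HDI = {hdi:.1f}: mobile gap = {mobile_gap(hdi):+.4f}")
```

Under that assumption the gap shrinks from about -0.84 at HDI 0.5 to about -0.23 at HDI 0.9, which is the "gap declines in higher-HDI contexts" pattern.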

Anova Results

Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
9873619	2043418459	NA	NA	NA	NA
9873618	2042312944	1	1105514.7	5347.7375	0
9873617	2042058043	1	254901.4	1233.0417	0
9873615	2041575118	2	482924.4	1168.0320	0
9873613	2041441930	2	133187.8	322.1365	0
9873611	2041128943	2	312987.5	757.0118	0
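The F column can be checked by hand from the Res.Df and RSS columns. The sketch below assumes the usual convention for sequential F-tests on nested linear models, where each step's mean square is divided by the residual mean square of the largest model; that convention reproduces the values in the table to within rounding. The function name is made up for this example.

```python
# (Res.Df, RSS) pairs copied from the ANOVA table above, smallest model first.
MODELS = [
    (9873619, 2043418459),
    (9873618, 2042312944),
    (9873617, 2042058043),
    (9873615, 2041575118),
    (9873613, 2041441930),
    (9873611, 2041128943),
]


def sequential_f(models):
    """F statistic for each added block of terms.

    The denominator is the residual mean square of the largest model,
    which is an assumption about how the table was produced.
    """
    full_df, full_rss = models[-1]
    scale = full_rss / full_df
    stats = []
    for (df0, rss0), (df1, rss1) in zip(models, models[1:]):
        num_df = df0 - df1                      # terms added in this step
        stats.append(((rss0 - rss1) / num_df) / scale)
    return stats


for f in sequential_f(MODELS):
    print(round(f, 2))
```

With nearly 10 million residual degrees of freedom the denominator is very stable, so even a tiny reduction in RSS gives a large F.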

Residual Analysis

I made partial residual plots to diagnose the fit for model 3. The first set of plots shows the partial residuals (the residuals plus the fitted contribution of the variable) on the Y axis and the variable on the X axis. If the trend is not a straight line, that can motivate including higher-order terms in the model. I use a loess smoother to look for curvature.
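For reference, a partial residual for a single first-order predictor is the model residual plus that predictor's fitted contribution. A minimal sketch follows; the numbers are made up for illustration and are not taken from the reading-time data, and the slope is hypothetical.

```python
def partial_residuals(residuals, x, beta):
    """Partial residual for one predictor: model residual plus beta * x."""
    return [r + beta * xi for r, xi in zip(residuals, x)]


# Made-up values purely for illustration.
hdi = [0.55, 0.70, 0.85, 0.95]
resid = [0.10, -0.05, 0.02, -0.07]
beta_hdi = -1.9   # hypothetical fitted slope for HDI

for xi, pr in zip(hdi, partial_residuals(resid, hdi, beta_hdi)):
    print(f"{xi:.2f}  {pr:+.3f}")
```

Plotting these values against the predictor and overlaying a smoother shows whether a straight line with slope beta is an adequate description.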

Partial residuals plot for the interaction of HDI and mobile. Residuals plot for modeling Wikipedia reading times with respect to the interaction between HDI and mobile. There doesn't seem to be much evidence of curvature (the line is pretty straight), suggesting that the current specification is appropriate, although including higher-order terms does not seem totally unreasonable. However, HDI is pretty much only in the range (0.5, 1), so the separation of the data might obscure curvature.

Partial residuals plot for the interaction of HDI (scaled) and mobile. Similar to the previous plot, except that I transformed HDI by logging and scaling it before I fit the model. Now there is even less appearance of curvature, suggesting that the current (first-order) specification is appropriate.

Partial residuals plot for HDI (Not scaled). There is a clear bend between 0.8 and 0.9 on the range of HDI. To me this suggests that we can include higher-order polynomial terms for HDI in the model.

Specifications without higher-order mobile:HDI interaction, but with higher order HDI polynomial

Following the residual analysis, I realized that the right specification for HDI and mobile might be a higher-order polynomial for HDI, but not for the interaction between HDI and mobile. It now appears that the complex interaction is not very helpful; what is helpful is the higher-order specification for HDI.

The F-test still rejects the null for the quadratic interaction (F = 26.5), but the statistic is much smaller than for the HDI polynomial terms. With almost 10 million observations, nearly any added term will test as significant.
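As a rough picture of the specification described above (a polynomial in HDI, but only a first-order mobile : HDI interaction), the following sketch builds one row of the design matrix. It uses raw powers of HDI and leaves out the control variables; the models in the table below may well have been fit with a scaled or orthogonal polynomial instead, so this is only an assumption about the shape of the model. The function name is hypothetical.

```python
def design_row(hdi, mobile, degree):
    """One design-matrix row: intercept, mobile, HDI..HDI^degree, mobile*HDI.

    Assumptions: raw (not orthogonal) powers of HDI, mobile coded 0/1,
    and no control variables.
    """
    row = [1.0, float(mobile)]
    row += [hdi ** k for k in range(1, degree + 1)]   # polynomial in HDI
    row.append(mobile * hdi)                          # first-order interaction only
    return row


print(design_row(0.8, 1, 3))
```

Raising the degree adds columns for HDI only, so the mobile effect stays linear in HDI no matter how flexible the HDI curve becomes.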

Statistical models
	model 2	model 3	model 3 with quadratic HDI	model 3 with cubic HDI	model 3 with quartic HDI	model 3 with quadratic HDI:mobile
Intercept	9.9687 (0.0077)^***	10.0457 (0.0078)^***	10.0487 (0.0078)^***	10.0171 (0.0078)^***	10.0308 (0.0079)^***	10.0314 (0.0079)^***
mobile	-0.2016 (0.0011)^***	-0.3236 (0.0020)^***	-0.3137 (0.0020)^***	-0.3082 (0.0020)^***	-0.3053 (0.0020)^***	-0.3071 (0.0020)^***
Human Development Index	-0.0946 (0.0009)^***	-0.1746 (0.0014)^***	-0.1172 (0.0022)^***	-0.0520 (0.0026)^***	-0.0243 (0.0028)^***	-0.0147 (0.0034)^***
mobile : HDI		0.1399 (0.0019)^***	0.1275 (0.0019)^***	0.1196 (0.0020)^***	0.1186 (0.0020)^***	0.1029 (0.0036)^***
mobile : HDI^2						0.0145 (0.0028)^***
HDI^2			-0.0482 (0.0014)^***	0.0199 (0.0020)^***	-0.0690 (0.0042)^***	-0.0785 (0.0046)^***
HDI^3				-0.0764 (0.0016)^***	-0.0868 (0.0017)^***	-0.0863 (0.0017)^***
HDI^4					0.0421 (0.0018)^***	0.0424 (0.0018)^***
Revision length (bytes)	0.1665 (0.0005)^***	0.1664 (0.0005)^***	0.1664 (0.0005)^***	0.1663 (0.0005)^***	0.1664 (0.0005)^***	0.1664 (0.0005)^***
time to first paint	-0.0154 (0.0006)^***	-0.0152 (0.0006)^***	-0.0150 (0.0006)^***	-0.0148 (0.0006)^***	-0.0148 (0.0006)^***	-0.0148 (0.0006)^***
time to dom interactive	0.0036 (0.0009)^***	0.0036 (0.0009)^***	0.0035 (0.0009)^***	0.0036 (0.0009)^***	0.0036 (0.0009)^***	0.0036 (0.0009)^***
R²	0.0520	0.0525	0.0526	0.0528	0.0529	0.0529
Adj. R²	0.0520	0.0525	0.0526	0.0528	0.0529	0.0529
Num. obs.	9873641	9873641	9873641	9873641	9873641	9873641
RMSE	14.3860	14.3821	14.3812	14.3796	14.3792	14.3791
^***p < 0.001, ^**p < 0.01, ^*p < 0.05

Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
9873619	2043410883	NA	NA	NA	NA
9873618	2042305921	1	1104961.432	5344.18611	0e+00
9873617	2042058100	1	247821.330	1198.59687	0e+00
9873616	2041587173	1	470927.432	2277.65764	0e+00
9873615	2041469312	1	117861.191	570.03993	0e+00
9873614	2041463831	1	5480.562	26.50694	3e-07

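Moreover, the quadratic term for the interaction appears to be over-fitting. When I test the models on out-of-sample prediction, the model with the quadratic interaction has a slightly lower R² than the quartic HDI model without it (0.0614 vs. 0.0616), although the RMSE values are nearly identical.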

Rmse	Rsqr	name
1.677372	0.0542021	model 2
1.678363	0.0575015	model 3
1.679984	0.0562511	model 3 with quadratic HDI
1.679288	0.0590972	model 3 with cubic HDI
1.680671	0.0616457	model 3 with quartic HDI
1.680658	0.0613616	model 3 with quadratic HDI:mobile
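For anyone reproducing this kind of comparison, here is a minimal sketch of the two hold-out metrics. The log does not say how the Rsqr column was computed; the squared correlation between predicted and observed values is used here as an assumption. Unlike 1 - SSE/SST, it can rank models differently from RMSE on the same sample. The function name and the toy numbers are made up.

```python
import math


def holdout_metrics(y_true, y_pred):
    """Return (RMSE, squared Pearson correlation) for hold-out predictions."""
    n = len(y_true)
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return rmse, cov ** 2 / (var_t * var_p)


# Toy values, not from the reading-time data.
rmse, r2 = holdout_metrics([1, 2, 3, 4], [1.5, 2, 2.5, 4])
print(f"RMSE = {rmse:.4f}, R^2 = {r2:.4f}")
```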

I wonder if the little increase in the mid range of HDI on mobile is robust?
