Research talk:Reading time/Work log/2018-10-20
Saturday, October 20, 2018
Predicting dwell time
Today I begin the second phase of this project in earnest. Many of the questions that were asked in the proposal can be answered using multivariate regression to model reading time as a function of relevant covariates. Since reading time is well approximated by a log-normal distribution, we can fit the models using OLS on log-transformed data. This is great because it will allow us to use Spark ML's linear regression functionality to fit these models. Link to API docs.
Hypotheses Rationale
For the first iteration of this analysis we study how characteristics of the reader correlate with reading time. We will also look at project size.
HDI and reading times
Per discussion with Olga, we are interested in testing hypotheses about how the nature of different projects and their audiences relates to reading behavior.
A working paper proposes that less developed countries will have more in-depth reading (reflected in longer/deeper reads of the page) than countries with higher development scores.
Underlying this idea is a theory about Wikipedia's role in addressing inequalities of access to knowledge between countries and languages. Here is my thinking:
- Readers in lower HDI countries are more likely to use Wikipedia for intrinsic learning and to meet their information needs (this may be because, relative to other available information sources in these languages, Wikipedia is more accessible or of higher quality).
- Readers seeking in-depth knowledge and seeking knowledge for intrinsic learning are more likely to spend more time reading (they have greater need for the information).
Therefore H1 (wikimotifs validation): Readers from lower HDI countries have longer reading times.
Mobile and reading times
Mobile devices are ubiquitous, but they have been criticized as inferior to PCs in ways that are important to studying information (smaller screen and keyboard, sometimes inferior non-English language support, inferior tools for saving and organizing information).[1] A recent review of digital inequalities points to such "device gaps" in drawing mixed conclusions about the promise of mobile phones to alleviate the digital divide.[2]
A second difference between mobile and desktop reading behavior, reported in wikimotifs, is that mobile readers are more likely to engage in tasks that likely require shorter reading times, such as fact finding or reading motivated by a conversation.
The "device gap" and the association between device and task suggest H2: mobile readers have shorter reading times.
Digital divides and mobile devices
Although mobile devices may be technologically limited, they are also relatively available and accessible to low socioeconomic status (SES) people compared to desktops, especially in developing countries.[2]
A plausible mechanism for the relationship between HDI and session length / reading time is that people in lower HDI contexts have:
- High information needs
- Low access to quality information sources in their languages (other than Wikipedia)
A second plausible mechanism is:
- Low levels of literacy (even the literate may be relatively slow readers)
We can test these mechanisms under the assumption that, within a given country, mobile readers are more likely to be low SES and desktop readers are more likely to be high SES. If this assumption holds and either of the above mechanisms drives the relationship between HDI and reading time, then we should observe H3: Readers on mobile devices in lower HDI countries will read longer (there should be a negative interaction between mobile and HDI).
Project size
Even though I predicted in H1 that lower HDI countries will have longer reading times, User:OVasileva_(WMF) suggested that smaller wikis may be likely to have longer reading times, even when most readers are in high HDI contexts, because many smaller wikis (for example, Welsh) are made by and for language preservationists.
H4: smaller wikis will have longer reading times
Measures
Here is how I will measure the analytic variables:
Reader's country's HDI: Observe country of reader from IP-based geolocation data in the event log and map these to UN HDI data. Later we can build on this to look at more granular country data like literacy.
Mobile: If the reader is using the mobile webhost or a mobile app, we assume they are reading on a mobile device.
Project size: Number of pages. There are obviously several other ways we could measure this. We should probably try a few to see if results are robust to this measurement.
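The mobile measure above could be sketched roughly as follows. This assumes that mobile web hosts carry an `m` label (as in `en.m.wikipedia.org`) and that app views are tagged through a hypothetical `access_method` field with the value `"mobile app"`; neither the function name nor the field name comes from the event log schema.

```python
def is_mobile(webhost, access_method=None):
    """Guess whether a page view came from a mobile device.

    Assumption: mobile web hosts contain an "m" label before the
    registered domain, e.g. "en.m.wikipedia.org". The `access_method`
    argument and its "mobile app" value are hypothetical.
    """
    labels = webhost.lower().split(".")
    # Ignore the last two labels (e.g. "wikipedia", "org").
    on_mobile_host = "m" in labels[:-2]
    return on_mobile_host or access_method == "mobile app"

views = [
    ("en.m.wikipedia.org", None),
    ("cy.wikipedia.org", None),
    ("cy.wikipedia.org", "mobile app"),
]
flags = [is_mobile(host, method) for host, method in views]
```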
Controls
Here are some important potential confounders we should try to control for:
- Time of day: fixed effects for the hour (or maybe half hour) of the day (in our best guess for reader's timezone).
- Day of week
- Month of year
- Nth view in session
- Wiki
- Year
- Page Length
- DomInteractiveTime
- FirstPaint
- Last in session
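The time-based controls in this list could be derived along the following lines. This is a minimal sketch: `utc_offset_hours` stands in for whatever best guess at the reader's timezone is available and is a hypothetical input, and `one_hot` shows one common way to turn a category into fixed-effect dummies with a dropped reference level.

```python
from datetime import datetime, timedelta, timezone

def time_controls(utc_timestamp, utc_offset_hours):
    """Return categorical time controls in the reader's (guessed) local time.

    `utc_offset_hours` is a hypothetical best guess at the reader's offset;
    it is not a field named in this log.
    """
    local = utc_timestamp.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    return {
        "hour_of_day": local.hour,       # 0-23, one fixed effect per hour
        "day_of_week": local.weekday(),  # 0 = Monday ... 6 = Sunday
        "month": local.month,            # 1-12
        "year": local.year,
    }

def one_hot(value, levels):
    """Dummy-code a category, dropping the first level as the reference."""
    return [1 if value == level else 0 for level in levels[1:]]

# 23:30 UTC viewed from a reader assumed to be at UTC+5:30.
example = time_controls(datetime(2018, 10, 20, 23, 30, tzinfo=timezone.utc), 5.5)
hour_dummies = one_hot(example["hour_of_day"], list(range(24)))
```

Converting to local time before bucketing matters here because the same UTC instant falls in different hours, and sometimes different days, for different readers.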
Filters
- Only look at namespace 0
- No bots
- Remove negative and null visible times
- No Safari
- No cases where the unload event comes before the load event.
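The filters above could be expressed as a single predicate over events. In this sketch the dictionary keys (`namespace_id`, `is_bot`, `visible_length`, `browser_family`, `load_time`, `unload_time`) are hypothetical names rather than the actual event log schema, and dropping events with missing load or unload timestamps is an extra assumption not stated in the list.

```python
def keep_event(event):
    """Return True if a page-view event passes the filters listed above.

    All keys are hypothetical placeholders for the real schema.
    """
    if event.get("namespace_id") != 0:
        return False  # only namespace 0
    if event.get("is_bot"):
        return False  # no bots
    visible = event.get("visible_length")
    if visible is None or visible < 0:
        return False  # remove null and negative visible times
    if event.get("browser_family") == "Safari":
        return False  # no Safari
    load, unload = event.get("load_time"), event.get("unload_time")
    # Assumption: events missing either timestamp are dropped as well.
    if load is None or unload is None or unload < load:
        return False  # unload must not come before load
    return True

good = {"namespace_id": 0, "is_bot": False, "visible_length": 12000,
        "browser_family": "Firefox", "load_time": 100, "unload_time": 160}
events = [
    good,
    {**good, "namespace_id": 1},
    {**good, "visible_length": -5},
    {**good, "browser_family": "Safari"},
    {**good, "load_time": 200},
]
kept = [e for e in events if keep_event(e)]
```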
Variables of interest to WMF
- Continent / Region
- Global North / Global South
SELECT
SUM(IF (FIND_IN_SET(country_code,
'AD,AL,AT,AX,BA,BE,BG,CH,CY,CZ,DE,DK,EE,ES,FI,FO,FR,FX,GB,GG,GI,GL,GR,HR,HU,IE,IL,IM,IS,IT,JE,LI,LU,LV,MC,MD,ME,MK,MT,NL,NO,PL,PT,RO,RS,RU,SE,SI,SJ,SK,SM,TR,VA,AU,CA,HK,MO,NZ,JP,SG,KR,TW,US') > 0, view_count, 0)) AS Global_North_views
See List_of_countries_by_regional_classification.
Model specifications
We can use fixed effects for wiki to get a better estimate of the effect of mobile. Language family will be collinear with Wiki.
M1:
A reader's country is probably highly correlated with the project they are reading. Also, project size is obviously collinear with Wiki, so in this model we don't include the control for Wiki.
M2:
To help with interpretation and robustness we should try a specification with and without the interaction.
M3:
Y is visible length, and we will fit OLS models on log-transformed Y.
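For orientation, here is one way a log-linear specification with the mobile and HDI interaction could be written. This is a sketch under assumed notation, not a reconstruction of the M1 to M3 formulas: $Y_i$ is the visible length of page view $i$, $c(i)$ is the reader's country, $w(i)$ is the wiki, and $X_i$ collects the controls. All symbols are introduced here for illustration.

```latex
\log Y_i = \beta_0
         + \beta_1\,\mathrm{mobile}_i
         + \beta_2\,\mathrm{HDI}_{c(i)}
         + \beta_3\,\bigl(\mathrm{mobile}_i \times \mathrm{HDI}_{c(i)}\bigr)
         + \beta_4\,\mathrm{size}_{w(i)}
         + \gamma^{\top} X_i
         + \varepsilon_i
```

Under this notation, H1 corresponds to $\beta_2 < 0$, H2 to $\beta_1 < 0$, H3 to $\beta_3 < 0$, and H4 to $\beta_4 < 0$. Dropping the $\beta_3$ term gives the variant without the interaction.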
We might also want to see how mobile varies by language family.
M4a:
For this it might be easier to interpret a "secret weapon" for language family or region. So we would fit a different model with the following specification for each language family or region.
M4b:
Bike Rack
Multilingual Readers
The wikimotifs paper suggests studying readers who read wikis of different languages in the same session. My intuition is that multilingual readers will read for longer times for a few reasons:
- Multilingual readers may be more likely to perform deep-info tasks if the reason they are viewing multiple languages is to obtain more complete information.
- Multilingual readers may be reading Wikipedia as a language learning activity.
Therefore:
HX: Readers who read in multiple languages within a session are likely to read for longer.
References
- [1] Pearce, Katy E.; Rice, Ronald E. (2013-07-11). "Digital Divides From Access to Activities: Comparing Mobile and Personal Computer Internet Users". Journal of Communication 63 (4): 721–744. ISSN 0021-9916. doi:10.1111/jcom.12045.
- [2] Marler, Will (2018-04-07). "Mobile phones and inequality: Findings, trends, and future directions". New Media & Society 20 (9): 3498–3520. ISSN 1461-4448. doi:10.1177/1461444818765154.