Research:Characterizing Wikipedia Reader Behaviour/Demographics and Wikipedia use cases/Pilot

On March 4-5, 2019, a small-scale pilot of the survey was run on English Wikipedia. It produced 771 responses, of which 626 were complete and came from respondents aged 18 or over. The pilot (and the start of the survey translation process) identified a number of issues, described below, that were worked through before the survey was expanded to more languages and respondents.

QuickSurveys Sampling

Sampling for inclusion in a given survey is done by browser. The first time a user navigates to a Wikipedia article with an active survey, a token is stored in their browser's local storage. The token is associated with that survey's name and determines, in a deterministic way, whether the survey will be displayed on that browser. Because a survey is active for at least several days, readers who visit Wikipedia only occasionally are just as likely to be sampled as frequent readers. Frequent readers who are sampled are, however, more likely to respond. In the pilot, respondents viewed an average of 6.9 pages and 52% viewed only a single page, while individuals who did not respond viewed an average of 4.7 pages and 61% viewed only a single page. Additionally, selection bias or issues with the translation or wording of the questions could differentially affect response rates.
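
The per-browser, deterministic decision described above can be sketched as follows. This is a minimal illustration, not the QuickSurveys implementation: the function name `in_sample`, the use of SHA-256, and the `coverage` parameter (the fraction of browsers that should see the survey) are all assumptions made for the example.

```python
import hashlib


def in_sample(browser_token: str, survey_name: str, coverage: float) -> bool:
    """Decide deterministically whether a browser is shown a survey.

    Hypothetical sketch: the same (token, survey) pair always yields the
    same answer, so a browser is either in or out for the survey's lifetime,
    no matter how many pages it views.
    """
    key = f"{survey_name}:{browser_token}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # The browser is sampled if its bucket falls in the lowest `coverage` share.
    return bucket < coverage * 2**64


# Example with made-up tokens: roughly 10% of browsers are selected.
tokens = [f"token-{i}" for i in range(10_000)]
share = sum(in_sample(t, "pilot-survey", 0.10) for t in tokens) / len(tokens)
print(f"share of browsers sampled: {share:.3f}")
```

Because the decision depends only on the stored token and the survey name, the number of pages a reader views does not change whether their browser is sampled. It only changes how often a sampled reader is shown the survey.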

A small minority of survey respondents did not have associated EventLogging data, which limits our ability to understand the relationship between reader demographics / motivations and the types of pages that they are reading. The causes and their approximate magnitudes are listed below:

  • People we completely miss (~3-5%): EventLogging and QuickSurveys do not work on some platforms because these platforms do not support JavaScript. This mainly affects older IE platforms (any IE version before 11) but also includes "lite" browsers (e.g., Opera Mini) that are optimized for low data use or privacy. We cannot do much about this. It is not a huge proportion of the internet-connected world, but it is more likely to exclude older users and people from regions with poor internet connectivity, so we should be aware of it. See this for more details.
  • People who can see QuickSurveys but don't have EventLogging (~10%): Slower browsers may be failing to load the EventLogging code, in which case these users could see and respond to surveys but would not be logged appropriately. See this phabricator task for more details. There is a chance that some of this is fixable (phab:T218243 and phab:T220627#5107667), but we cannot recover data in any real way for these respondents, so any analysis that relies on EventLogging data will miss them. There were no strong demographic patterns in who was missing EventLogging data, though these respondents tended to be below 40 and male.
  • People who right-click and open in a new tab to take external surveys (~5%): We get QuickSurveyInitiation EventLogging but not QuickSurveysResponses EventLogging for this group. This happens almost exclusively on desktop and should only be a problem for external surveys (there is no reason to right-click on internal surveys). For this group, contextual information is harder to recover, but approximate methods make it possible. The main feature we lose is the editCountBucket.
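
The last two groups can be told apart by which events exist for a respondent. The sketch below is illustrative only: the function names, the labels, and the input format (one pair of booleans per respondent) are hypothetical, and the sample data is made up rather than taken from the pilot.

```python
from collections import Counter


def classify_respondent(has_response_event: bool, has_initiation_event: bool) -> str:
    """Label a survey respondent by the EventLogging data available for them."""
    if has_response_event:
        # A QuickSurveysResponses event exists: full context is available.
        return "full EventLogging"
    if has_initiation_event:
        # Only QuickSurveyInitiation exists: likely opened the survey in a new tab.
        return "initiation only"
    # No events at all: EventLogging probably failed to load.
    return "no EventLogging"


def coverage_summary(respondents):
    """Return the share of respondents in each category.

    `respondents` is an iterable of (has_response_event, has_initiation_event).
    """
    counts = Counter(classify_respondent(r, i) for r, i in respondents)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Made-up example data: 17 fully logged, 1 initiation-only, 2 with no events.
example = [(True, True)] * 17 + [(False, True)] * 1 + [(False, False)] * 2
print(coverage_summary(example))
```

Any analysis that joins survey responses to reading behavior would be restricted to the first category, plus whatever can be approximately recovered for the second.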

Age / Gender Skew

The survey respondents skewed heavily young and male. Including those who were under the age of 18, 70% of respondents were under the age of 30. Of those who completed the survey, 76% identified as men. There were no clear interactions with other variables -- that is, the gender balance was consistent across age groups. The same held across countries, except that the United States was slightly more balanced gender-wise (only 67% men). Respondents from the United Kingdom and India, the other two most well-represented countries, were 75% and 83% men, respectively.

This was a surprising level of skew for the reader population, which led to the question: is the readership truly skewed that far toward men, or does the skew result from individuals of different gender identities self-selecting into the survey at different rates? We looked at past surveys and found the following data points regarding gender and frequency of Wikipedia reading:

  • Based on a survey of 1,000 Amazon Mechanical Turk (AMT) workers in the US: "Second, men use Wikipedia more often — they are twice as likely than women to use Wikipedia daily"[1]
  • Global Insights phone surveys gave mixed evidence on gender, although younger respondents were consistently more likely to read Wikipedia frequently:
    • India: women more likely to be frequent readers of Wikipedia
    • Mexico: men more likely to be frequent readers of Wikipedia
    • Nigeria: men slightly more likely to be frequent readers of Wikipedia
    • Iraq: ~equal likelihood by gender of being frequent readers of Wikipedia
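
To make the self-selection question concrete, the sketch below shows how a difference in response rates alone could produce a skewed sample. It is a simplified illustration under stated assumptions: respondents are split into only two groups (men and everyone else), `response_ratio` is a hypothetical parameter for how many times more likely men are to respond, and the 50% "true" share is an arbitrary reference point, not an estimate of the readership.

```python
def observed_share(true_share: float, response_ratio: float) -> float:
    """Share of men among respondents.

    `true_share` is the share of men among sampled readers and
    `response_ratio` is men's response rate divided by everyone else's.
    """
    weighted_men = true_share * response_ratio
    return weighted_men / (weighted_men + (1.0 - true_share))


def required_ratio(observed: float, true_share: float) -> float:
    """Response-rate ratio needed to turn `true_share` into `observed`."""
    observed_odds = observed / (1.0 - observed)
    true_odds = true_share / (1.0 - true_share)
    return observed_odds / true_odds


# If sampled readers were evenly split, how much more often would men have
# to respond for 76% of completed responses to come from men?
ratio = required_ratio(observed=0.76, true_share=0.50)
print(f"required response-rate ratio: {ratio:.2f}")
```

Under these assumptions, an evenly split readership would need men to respond a little over three times as often as other readers to reach the 76% seen in the pilot. Any true skew toward men in the readership would lower the ratio needed.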

Urban / Rural Question

See locale analysis.

References

  1. Hinnosaar, Marit (26 April 2019). "Gender Inequality in New Media: Evidence from Wikipedia". Social Science Research Network.