Research:Characterizing Wikipedia Reader Behaviour/Code

Attribution: The code described on this page is largely the work of (in no particular order) Florian Lemmerich, Philipp Singer, and Ellery Wulczyn.

Code Repository: https://gerrit.wikimedia.org/g/research/reader-survey-analysis

This page provides an overview of the code used to support the analysis of the surveys run as part of the Characterizing Wikipedia Reader Behaviour projects. These surveys, at their simplest, randomly sampled readers from a given Wikipedia project and asked them three questions about their motivations for reading Wikipedia. The evaluation of these surveys then consists of two stages: pre-processing the survey responses, and analyzing the relationships between a given survey response and other responses or reader behavior on Wikipedia.

Stage One: Preparing the Surveys

In order to survey readers on Wikipedia, we need the following elements:

  • Code: contains information necessary to run a survey on a given project (implemented via QuickSurveys)
    • Start and end times of survey in UTC
    • Wikimedia projects involved (e.g., en-wiki)
    • Platform: mobile and/or desktop; mobile can target 'beta' and/or 'stable'.
    • Sampling rate (e.g., 1% of readers)
    • Finalized survey questions and answers
    • Survey service provider: for this research, we used Google Forms. One requirement for the choice of service provider is that the service must be able to interoperate with EventLogging.
    • Unique IDs (the surveyInstanceToken, now called pageviewToken, mentioned below) that are shared between EventLogging and Google Forms (or an alternative provider) and are associated with each survey response. Here's an example configuration, lines 458-475.
      • This token is passed on to Google Forms via a special form of the survey URL that pre-fills a question with an answer. In this case, the final page of the survey has a short-text question that we ask users not to change. If you go to the Google Form, select Get a pre-filled link, and enter any value into that field, the necessary information can be extracted from the resulting link that Google provides; it will look something like entry.1791119923, as in phab:T217080. The instanceTokenParameterName in the QuickSurvey schema is filled with this information, and the pageviewToken is then passed as a query parameter to the survey, which pre-fills the user's token (see the sketch at the end of this list).
    • You must also create the necessary interface pages, akin to phab:T217049
  • Schema:QuickSurveysResponses: defines how EventLogging should be implemented to connect the surveys to webrequest logs
    • Make sure to review this schema to ensure that it captures everything that is needed for analysis.
  • Content and Translations:
  • Coordination:
    • Consult Deployment to make sure that the survey can be deployed in the window of time that is desired.
    • If working with different language communities, establish a Point of Contact in each language community who can work with you throughout the experiment and afterwards.
    • Add survey information to Community Engagement Calendar
    • Once the survey is ready to be released, provide a 72-hour notice (example) to the corresponding village pumps. Monitor the conversations in case adjustments are needed.
  • Testing: make sure the data collection works
    • Deploy the survey on the beta labs and verify that the data is correctly saved both in EventLogging and the external survey provider.
      • You need interface administrator permissions to create the necessary special pages (e.g., phab:T217171). The friendly folks in #wikimedia-releng can help with this.
      • For testing, make sure the browser's DNT feature is turned off.
      • Go to https://en.wikipedia.beta.wmflabs.org/wiki/Book (on desktop) and look for the survey on the top right of the article content.
      • If you see another survey, dismiss it and reload the page. Multiple surveys may be enabled at one time, so keep dismissing other surveys until you see yours.
      • Once you see your survey, take it and see if your survey token is being passed to the external URL.
      • Check other things as you see fit.
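
For illustration, here is a minimal sketch of how such a pre-filled survey link could be constructed. The form URL is hypothetical and the entry ID is only an example; in production, QuickSurveys performs this substitution via instanceTokenParameterName rather than any custom code:

```python
from urllib.parse import urlencode

# Hypothetical Google Form URL; the entry ID comes from "Get a pre-filled link".
FORM_URL = "https://docs.google.com/forms/d/e/EXAMPLE_FORM_ID/viewform"
TOKEN_FIELD = "entry.1791119923"  # the short-text question that holds the token

def build_survey_link(pageview_token: str) -> str:
    """Return the survey URL with the reader's token pre-filled."""
    return f"{FORM_URL}?{urlencode({TOKEN_FIELD: pageview_token})}"

print(build_survey_link("0f63cf1191cb3b9b"))
```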

Stage Two: Pre-Processing the Surveys

The most straightforward approach to analyzing the surveys would be to aggregate the survey results and analyze the raw data (the equivalent of just Step 1 below). For example, 19% of survey respondents indicated that their motivation for visiting Wikipedia was work/school, so we could report that 19% of Wikipedia readers are motivated by work/school-related reasons. This approach assumes, however, that readers who visit Wikipedia for work/school-related reasons were just as likely as everyone else to take our survey. In reality, we find that this population is over-represented among survey respondents and, through the corrections described below, determine that it would be more accurate to say that 16% of readers on Wikipedia visit for work/school-related reasons. This bias arises mainly from sampling bias and non-response bias. A comparison of the raw and weighted survey responses on English Wikipedia is shown below.

[Figure: Comparison of raw and weighted survey responses from Why We Read Wikipedia.]

The code described in Steps 2 and 3 below goes through this process of determining how likely each survey respondent was to respond to the survey and re-weighting their responses to even out these likelihoods. While we use log data (e.g., time of day, what types of pages the respondent was reading) to make these corrections, this is analogous to common sample re-weighting procedures that use features such as demographics (e.g., in polling [1]).
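
As a toy illustration of what the re-weighting accomplishes (the respondents, weights, and column names below are made up, not the project's actual data):

```python
import pandas as pd

# Hypothetical toy data: each row is one survey respondent with the
# motivation they selected and an inverse-propensity weight (see Step 3).
responses = pd.DataFrame({
    "motivation": ["work/school", "work/school", "media", "bored", "work/school"],
    "weight": [0.6, 0.8, 1.4, 1.3, 0.9],
})

# Raw share: every respondent counts equally.
raw_share = (responses["motivation"] == "work/school").mean()

# Weighted share: respondents who were more likely to take the survey
# (higher propensity, hence lower weight) count for less.
mask = responses["motivation"] == "work/school"
weighted_share = responses.loc[mask, "weight"].sum() / responses["weight"].sum()

print(f"raw: {raw_share:.0%}, weighted: {weighted_share:.0%}")
```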

Step 1: Collect and Clean Survey Results

Through Google Forms, one can download survey responses in CSV format. We clean the data before any further processing: 1) remove duplicate and incomplete responses, and 2) recode the survey questions and answers so that they are labeled consistently across all languages. After this stage, the raw survey response counts can be computed.
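
A minimal sketch of this cleaning step, assuming hypothetical file and column names rather than the survey's actual schema:

```python
import pandas as pd

# Hypothetical export from Google Forms; column names are illustrative.
raw = pd.read_csv("survey_responses_en.csv")

# 1) Drop duplicate submissions (same token) and incomplete responses.
required = ["surveyInstanceToken", "motivation", "prior_knowledge", "depth"]
clean = (
    raw.drop_duplicates(subset="surveyInstanceToken")
       .dropna(subset=required)
)

# 2) Recode answers so labels are consistent across languages
#    (illustrative mapping; unmapped values are left unchanged).
motivation_map = {
    "Arbeit/Schule": "work/school",  # e.g., German form answer
}
clean["motivation"] = clean["motivation"].replace(motivation_map)

# Raw (unweighted) response shares.
print(clean["motivation"].value_counts(normalize=True))
```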

Step 2: Extract Log Traces for Debiasing Results

Our process requires linking three separate data sources:

  • survey results from Google Forms: CSV with one row per survey submitted with the respondent’s answers and a unique surveyInstanceToken
  • survey response EventLogging metadata as defined in the QuickSurvey schema: MariaDB table with one record per survey response with Google Forms surveyInstanceToken and corresponding unique surveySessionToken to match to the webrequest logs
  • reader sessions from webrequest logs: Hive table from which reader sessions -- i.e. sequences of pageviews -- can be reconstructed from client-IP and user-agent information and linked to the survey results via the surveySessionToken (see the sketch below)
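
A minimal sketch of how the three sources could be linked once exported to flat files; the file names and non-token column names are hypothetical (in practice the EventLogging and webrequest data live in MariaDB and Hive, respectively):

```python
import pandas as pd

# Hypothetical flat-file exports of the three data sources.
forms = pd.read_csv("google_forms_responses.csv")          # surveyInstanceToken + answers
eventlogging = pd.read_csv("quicksurvey_eventlogging.csv") # surveyInstanceToken, surveySessionToken
webrequests = pd.read_csv("quicksurvey_webrequests.csv")   # surveySessionToken, hashed userID, ...

# Keep only EventLogging records that correspond to a completed survey.
completed = forms.merge(eventlogging, on="surveyInstanceToken", how="inner")

# Attach the respondents' webrequests via the shared surveySessionToken.
linked = completed.merge(webrequests, on="surveySessionToken", how="inner")
```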

The end result is the set of survey responses and the trace data (i.e. pageviews) associated with each survey response from the period of time when the survey was running. A random set of sample traces is also generated to serve as the control group for re-weighting the survey responses in the next step. The steps to complete this are as follows:

  1. Identify Approximate UserIDs for Survey Respondents
    1. Collect all webrequests that contain a QuickSurvey token from Hive (webrequests associated with users who might have completed the survey).
    2. Collect all QuickSurvey EventLogging records from MariaDB for the time period in which the survey ran.
    3. Merge (inner join) Google Forms CSV and EventLogging (set of surveySessionTokens that are associated with a completed survey).
    4. Merge (inner join) QuickSurvey webrequests with the completed surveySessionTokens (webrequests associated with users who did complete surveys).
    5. From these survey-completed webrequests, extract a list of approximate userIDs as a hash of client-IP and user-agent information and load into Hive (see unique devices research for more information on the accuracy and limitations of this approach).
    6. For each language, write joined data to CSV, where each row corresponds to a respondent and includes their survey response, EventLogging data, and userID from first matching webrequest.
  2. Extract survey and sample traces
    1. Pull all webrequests in Hive, including hashed userID, from the days when the survey was live.
      1. Filter: access_method != 'mobile app' AND agent_type = 'user'
    2. For survey traces:
      1. Filter all of these webrequests down to those matching survey respondent userIDs and store in Hive.
      2. For each language, filter by pageviews to the main namespace and associate any webrequests with their matching userIDs and survey responses in a single record.
      3. Export to CSV.
      4. Join these survey traces with survey responses and convert to Pandas object for cleaning (drop any duplicates) and exploratory analysis. The pageview associated with the actual survey (i.e. where the respondent clicked on the survey link) is identified based on the page title and timestamp recorded in EventLogging.
    3. For sample traces:
      1. For each language, filter by pageviews to the main namespace and select a random sample of userIDs. Gather all of the webrequests associated with each of the userIDs into a single record.
      2. Filter down these samples to remove userIDs that are associated with more than 500 requests, limit to exactly 200,000 random userIDs with their matching traces, and export to CSV.
      3. Convert sample traces into a Pandas object for cleaning (drop any duplicates) and exploratory analysis. A random pageview for each userID is chosen to be the equivalent of the survey request article (for building features in the propensity-score modeling).
    4. For both survey and sample traces:
      1. Anonymize trace data. At this point, there are no raw IP addresses and no geographic data other than continent, country, and timezone.
      2. Split traces into sessions, where a session is defined as no more than one hour between successive pageviews (see the sketch below). These sessions are used for constructing features for modeling in the next step.
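
A minimal sketch of the one-hour session-splitting rule, assuming a trace table with hypothetical user_id and timestamp columns:

```python
import pandas as pd

# Hypothetical columns; one row per pageview in a user's trace.
pageviews = pd.read_csv("traces.csv", parse_dates=["timestamp"])

SESSION_GAP = pd.Timedelta(hours=1)

pageviews = pageviews.sort_values(["user_id", "timestamp"])

# Start a new session whenever more than one hour passes between
# successive pageviews of the same user (and at each user's first pageview).
gap = pageviews.groupby("user_id")["timestamp"].diff()
new_session = gap.isna() | (gap > SESSION_GAP)
pageviews["session_id"] = new_session.astype(int).groupby(pageviews["user_id"]).cumsum()
```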

Step 3: Reweight Survey Responses

We use inverse propensity score weighting[1] to determine how much weight should be given to each survey response. At a high level, this requires building a model to predict the likelihood that a particular reader session is associated with a survey. The components of this are as follows:

  • Data Points: We combine the reader sessions associated with completed surveys (positive examples) and randomly-selected reader sessions (negative examples).
  • Y: Is the reader session associated with a completed survey?
  • X: Extensive features are computed for each reader session that are related to the requests (e.g., country, referer class), articles (e.g., in-degree, topic model representation), and session characteristics (number of pages, topic distance between pages). For a complete list, see Table 1 in Why We Read Wikipedia.
  • Model: We use gradient-boosted decision trees to predict whether a given reader session resulted in a survey or not.

The weight of a given survey response is then set to 1 / p(x), where p(x) is the model's predicted probability that the session resulted in a survey. These weighted results are then used in the analyses in Stage Three.
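
A minimal sketch of this propensity-score step, using scikit-learn's gradient-boosted trees as a stand-in for the actual model and hypothetical feature columns:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical feature matrix: one row per reader session, with
# y = 1 for sessions that completed the survey, 0 for sampled sessions.
sessions = pd.read_csv("session_features.csv")
feature_cols = [c for c in sessions.columns if c not in ("y", "session_id")]
X, y = sessions[feature_cols], sessions["y"]

model = GradientBoostingClassifier()
model.fit(X, y)

# Propensity p(x): predicted probability that a session produced a survey.
p = model.predict_proba(X[y == 1])[:, 1]

# Inverse propensity weight for each survey response
# (clipped to avoid division by very small probabilities).
weights = 1.0 / np.clip(p, 1e-6, None)
```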

Stage Three: Analyzing the Surveys

Step 1: Cross-Tabulation

For a given survey response, evaluate the lift associated with other responses to the survey and how its relative share of the results changes over time (by day and part of day).
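
A minimal sketch of a weighted lift calculation between two answers, with hypothetical column names:

```python
import pandas as pd

# Hypothetical: one row per respondent, with weights from Step 3 of Stage Two.
df = pd.read_csv("weighted_responses.csv")
w = df["weight"]

def weighted_share(mask, weights):
    """Weighted proportion of respondents matching the mask."""
    return weights[mask].sum() / weights.sum()

a = df["motivation"] == "work/school"
b = df["prior_knowledge"] == "familiar"

# Lift > 1 means the two answers co-occur more often than expected
# if they were independent.
lift = weighted_share(a & b, w) / (weighted_share(a, w) * weighted_share(b, w))
print(f"lift: {lift:.2f}")
```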

Step 2: Sub-Group Discovery

For a given survey response, evaluate which article or reader-session features have the highest lift.
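
A minimal sketch of scanning binary features for high-lift subgroups for one response; the feature columns are hypothetical, and the actual analysis may rely on dedicated subgroup-discovery tooling rather than this brute-force loop:

```python
import pandas as pd

# Hypothetical: weighted responses joined with binary session/article features.
df = pd.read_csv("responses_with_features.csv")
w = df["weight"]
target = df["motivation"] == "work/school"
baseline = w[target].sum() / w.sum()

binary_features = ["referer_search", "is_weekday", "article_is_long"]  # hypothetical

lifts = {}
for feat in binary_features:
    in_group = df[feat] == 1
    share_in_group = w[in_group & target].sum() / w[in_group].sum()
    lifts[feat] = share_in_group / baseline

# Features whose subgroup most over-represents the target response.
print(sorted(lifts.items(), key=lambda kv: kv[1], reverse=True))
```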

Anonymization

In order to preserve user privacy, we do the following to anonymize webrequest data that we analyze:

  • Remove any inferred geographic data from the IP address that is at a granularity finer than the country-level (e.g., city, zip code)
  • Remove any raw IP information (only included in hashed form as part of the approximate userID)

References