Research talk:Characterizing Wikipedia Reader Behaviour/Demographics and Wikipedia use cases/Work log/2019-02-05

Tuesday, February 5, 2019 edit

A core part of the analysis that we do involves reconstructing what we believe to be page view sessions for a given device. This is done approximately through hashing of the user-agent and IP addresses. Past work has shown this to be effective for a 24-hour period, but that work is several years old so we decided to revisit it to make sure that this method still worked well enough for research purposes.

Mobile User IDs edit

When a page view is recorded in the webrequest logs, certain x-analytics headers are sent along, including a "wmfuuid" value that uniquely identifies an installation of the Wikipedia mobile app. Thus, any page views that have the same "wmfuuid" value are known to come from the same device. By computing the distribution of user-agent, IP addresses, and browser languages for all the known page views for a device, we can evaluate the expected proportion of page views that we would expect to capture in a given time period when recreating page view sessions.

We stratify the results by country and language edition as likely sources of variation.

Caveats edit

There are some key caveats:

  • We do not know what page views occurred on other devices, and so are estimating an upper bound to the proportion of page views we can expect to detect through this method. It is possible that the mobile app views comprise only a small proportion of a given user's total views.
  • It is possible that these are shared devices, in which case though we are capturing page views for a device, we are not actually capturing page views for a given individual.
  • This analysis is based off of mobile app users, who are certainly not a representative set of the population. Most obviously, it does not tell us about people who primarily access the desktop version.
@Isaac (WMF): I think it's an excellent idea to look into the wmfuuid field for this purpose, but I would add another caveat: The apps only contribute a tiny share of our overall traffic (1-2% of pageviews, also, a large part of these come without wmfuuid value, especially on iOS where users have to opt into data collection for that). Thus, this analysis will likely be more informative regarding the question how often a Wikipedia reader's UA+IP hash changes, and less so for the question how often several readers might share the same hash.
A similar concern actually applies to the "Past work has shown this to be effective for a 24-hour period" statement, see Research_talk:Mobile_sessions#"Grouping"_result_and_sampling.
Regards, Tbayer (WMF) (talk) 07:30, 6 February 2019 (UTC)Reply
@Tbayer (WMF): Thanks for providing this context and the link to prior discussions. You are right to emphasize that what this analysis can provide some insight on is how the UA+IP degrades over time and, thankfully, not at all the degree to which multiple people may share a device. --Isaac (WMF) (talk) 18:11, 6 February 2019 (UTC)Reply

Analyses edit

To ensure anonymity, we only examine data points for which we have at least 500 data points (following https://arxiv.org/pdf/1812.00474.pdf) and do not report numbers associated with specific countries or projects (just high-level takeaways and trends). Code below are Hive queries executed via Python scripts.

How stable is the hash of User-Agent + IP-address? edit

Overall, we find 607,010 unique user IDs in a 24-hour period that are associated with at least two page views (there are another 338,060 user IDs that are only associated with one page view). Of these 607,010 unique IDs:

  • 606,249 (99.9%) were associated with a single user-agent
  • 434,011 (71.5%) were associated with a single IP address
  • 537,161 (88.5%) were associated with a single browser language

We see that browser language is not consistent (it appears largely due to the value changing when the same device accesses a different language edition) but has a lot of overlap (there are only 13,764 unique browser language strings in our dataset). As a result, we discard it from the rest of the analyses; user-agent is much more stable and sufficiently unique (3,717 unique values) as to differentiate between devices sharing the same IP.

# Gather a sample of users from one day: fingerprint_list
"SELECT reflect('org.apache.commons.codec.digest.DigestUtils', 'sha512Hex', concat(client_ip, <HASHKEY>)) as haship, "
       "reflect('org.apache.commons.codec.digest.DigestUtils', 'sha512Hex', concat(user_agent, <HASHKEY>)) as hashua, "
       "reflect('org.apache.commons.codec.digest.DigestUtils', 'sha512Hex', concat(accept_language, <HASHKEY>)) as hashlang, "
       "reflect('org.apache.commons.codec.digest.DigestUtils', 'sha512Hex', concat(x_analytics_map['wmfuuid'], <HASHKEY>)) as hashuid, "
       "concat(LPAD(year, 4, '0'), '-', LPAD(month, 2, '0') , '-', LPAD(day, 2, '0')) AS date, "
       "dt, "
       "uri_host, "
       "geocoded_data['country'] as country "
  "FROM wmf.webrequest "
 "WHERE year = 2019 AND month = 2 AND day = 4 "
       "AND x_analytics_map['wmfuuid'] IS NOT NULL "
       "AND is_pageview AND agent_type = 'user' "
 "ORDER BY hashuid, date LIMIT 10000000"
# Same day: stability of IP address, User-Agent, and language by user: ipuastability
"SELECT hashuid, "
       "COUNT(DISTINCT(haship)) AS unique_ip, "
       "COUNT(DISTINCT(hashua)) AS unique_ua, "
       "COUNT(DISTINCT(hashlang)) AS unique_lang, "
       "COUNT(hashuid) AS num_views "
 "FROM fingerprint_list "
"GROUP BY hashuid"

# Same day: overall stability of IP address, User-Agent, and language
"SELECT COUNT(*) as num_uids, "
       "SUM(IF(unique_ip = 1, 1, 0)) as num_ips, "
       "SUM(IF(unique_ua = 1, 1, 0)) as num_uas, "
       "SUM(IF(unique_lang = 1, 1, 0)) as num_langs "
  "FROM ipuastability"

How much variation is there by country and project? edit

We evaluate whether the number of IP-addresses associated with each device varies by country or Wikipedia project. Notably, we do not see any variation in the number user-agents associated with each device by country or host.

For country, we find that many European countries like the Netherlands and Switzerland have the fewest IP-addresses per user (~1.15 IP addresses / user ID). At the other end of the spectrum, we see that many countries in Africa and Southeast Asia have 1.3 to 1.4 IP addresses / user ID. In the middle are mainly countries from North and South America.

For host, we find many of the same patterns as above given that many language editions overlap greatly with specific countries (e.g., sv-wiki and Sweden). Notable new patterns include simple-wikipedia having very few IP addresses per user ID and en-wiki having a relatively high number of IP addresses per user ID (in line with patterns from the United States).

# Same day: group by host (alternatively country):
"SELECT uri_host, "
       "COUNT(DISTINCT(hashuid)) AS unique_uid, "
       "COUNT(DISTINCT(concat(haship, hashuid))) / COUNT(DISTINCT(hashuid)) AS unique_ip_per_uid, "
       "COUNT(DISTINCT(concat(hashua, hashuid))) / COUNT(DISTINCT(hashuid)) AS unique_ua_per_uid, "
       "COUNT(DISTINCT(concat(hashlang, hashuid))) / COUNT(DISTINCT(hashuid)) AS unique_lang_per_uid "
  "FROM fingerprint_list "
 "GROUP BY uri_host"

Expected page views captured by UA-IP hash edit

We would expect to capture 72% of total page views associated with a device in a 24-hour period if we use a UA-IP hash to reconstruct sessions. Related, this would capture all of page views associated with a given device for 433,836 (71.5%) of the 607,010 users with more than one page view in the dataset.

# Same day: page views per devices split by sessions as identified via UA+IP hash: viewsperfprint
"SELECT hashuid, "
       "COUNT(*) AS num_pageviews "
  "FROM fingerprint_list "
 "GROUP BY hashuid, haship, hashua, "
 "ORDER BY hashuid, num_pageviews "
 "LIMIT 1000000"

# Same day: expected page views captured by UA+IP hash
"SELECT SUM(w.num_pageviews) "
  "FROM (SELECT hashuid, "
               "AVG(num_pageviews) as num_pageviews "
          "FROM viewsperfprint "
         "GROUP BY hashuid) w"

# Same day: all page views in sample
"SELECT SUM(num_pageviews) "
  "FROM viewsperfprint"

How many page views are associated with these devices that are not through the app (potentially grouping)? edit

There was very little evidence of device grouping (a single UA-IP hashes being associated with multiple devices). The number of page views that matched the UA-IP hashes from the wmfuuid dataset but were not associated with a wmfuuid was less than 1% of the wmfuuid dataset.

# pageviews that match the IP+UA hash but do not have associated wmfuuids
"SELECT w.haship, w.hashua, w.hashlang, w.date, w.dt, w.uri_host, w.country "
  "FROM (SELECT reflect('org.apache.commons.codec.digest.DigestUtils', 'sha512Hex', concat(client_ip, <HASHKEY>)) as haship, "
               "reflect('org.apache.commons.codec.digest.DigestUtils', 'sha512Hex', concat(user_agent, <HASHKEY>)) as hashua, "
               "reflect('org.apache.commons.codec.digest.DigestUtils', 'sha512Hex', concat(accept_language, <HASHKEY>)) as hashlang, "
               "concat(LPAD(year, 4, '0'), '-', LPAD(month, 2, '0') , '-', LPAD(day, 2, '0')) AS date, "
               "dt, "
               "uri_host, "
               "geocoded_data['country'] as country "
          "FROM wmf.webrequest "
         "WHERE year = 2019 AND month = 2 AND day = 4 "
               "AND x_analytics_map['wmfuuid'] IS NULL "
               "AND is_pageview AND agent_type = 'user') w "
 "WHERE CONCAT(w.haship, w.hashua) IN (SELECT CONCAT(haship, hashua) FROM fingerprint_list)"

Future Work edit

  • Check whether IP-UA degrades in the following day(s)
Return to "Characterizing Wikipedia Reader Behaviour/Demographics and Wikipedia use cases/Work log/2019-02-05" page.