Research:Improving link coverage/Release page traces

We are investigating the possibility of releasing to the broader community the page traces that can be generated as a result of the research on improving link coverage. The release could help amplify research and tool development in areas that focus on how Wikipedia's content is currently consumed. On this page, we describe how the data can be collected and we propose an anonymization process to apply prior to sharing the data publicly. We will collect feedback in three stages: 1) from a team of Wikimedia Foundation staff, 2) from 1-2 privacy and security experts, 3) from the community. The decision on whether to release the data, and in what form, will be made after all feedback is collected.

What data is proposed to be shared publicly?


We are proposing to share page traces with relative timestamps. Here is an example of a JSON object representing a browse path as we propose to publish it:

{"rel_time":0,"http_status":"200","title":"Great_Britain","children":[{"rel_time":10,"http_status":"200","title":"Kingdom_of_Great_Britain","children":[{"rel_time":64,"http_status":"200","title":"United_Kingdom"}]}],"referer":GOOGLE|FACEBOOK|TWITTER|...}

How is the data generated?


The following description is based on the refined data collected in wmf.webrequest in the Analytics cluster.

Mapper:

  • Discard requests for pages from domains other than “$LANG.wikipedia.org”
  • Keep only article pageviews (“/wiki/…”) or wiki searches (“/w/index.php?search=...”)
  • Discard requests for special pages (“/wiki/Special:...”, “/wiki/Spezial:...”, etc.)
  • Filter bots based on the UAParser API and hand-crafted regex
  • Only accept HTTP statuses 200 (OK) and 304 (Not Modified) for regular pageviews, and 200 (OK) and 302 (Found) for wiki searches
  • All requests from the same “user” are sent to the same reducer. Here a “user” is defined by the concatenation of IP address, user-agent string, and browser language. If the “X-Forwarded-For” field contains an IP address, we use that address; otherwise we use the request's IP address. (A sketch of these mapper filters follows the list.)
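
A minimal sketch of the mapper-side filtering described above, written in Python. The record field names, the bot check, and the constants are illustrative assumptions rather than the actual production pipeline.

import re

# Illustrative constants; the real pipeline covers all language variants.
SPECIAL_PREFIXES = ("/wiki/Special:", "/wiki/Spezial:")
PAGEVIEW_STATUSES = {"200", "304"}   # OK, Not Modified
SEARCH_STATUSES = {"200", "302"}     # OK, Found

def is_bot(user_agent):
    # Placeholder for the UAParser-based check plus hand-crafted regexes.
    return bool(re.search(r"bot|crawler|spider", user_agent, re.I))

def mapper_key(request):
    # Return the per-"user" key a request is grouped by, or None to discard it.
    if not request["uri_host"].endswith(".wikipedia.org"):
        return None
    path, query = request["uri_path"], request["uri_query"]
    is_pageview = path.startswith("/wiki/") and not path.startswith(SPECIAL_PREFIXES)
    is_search = path == "/w/index.php" and "search=" in query
    if not (is_pageview or is_search):
        return None
    status = str(request["http_status"])
    if is_pageview and status not in PAGEVIEW_STATUSES:
        return None
    if is_search and status not in SEARCH_STATUSES:
        return None
    if is_bot(request["user_agent"]):
        return None
    # A "user" = IP (X-Forwarded-For if present) + user-agent string + browser language.
    ip = request.get("x_forwarded_for") or request["ip"]
    return (ip, request["user_agent"], request["accept_language"])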

Reducer:

  • Discard users with more than 10,000 events (i.e., one every 4 min over the entire month), since they are likely to be bots
  • Order all of the user's requests by time
  • Resolve redirects based on a fixed Wikipedia snapshot
  • URL-decode article titles
  • Transform the sequence of requests into a set of trees: stitch together URL and referer, appending a pageview to the most recent pageview of the URL specified in the referer; if no such pageview exists, make the node the root of a new tree; store whether the referer is ambiguous
  • Discard trees that are more than 100 levels deep
  • Discard a tree if the root’s referer is a page from Wikipedia; this excludes traces that are in fact the continuation of a previous session that we somehow missed
  • Optionally (we do not do this), discard trees with ambiguous referers. (A sketch of the tree construction follows the list.)
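
A minimal Python sketch of the reducer-side tree construction, using the same illustrative field names as the mapper sketch. Redirect resolution, URL-decoding, the referer-ambiguity flag, and the check on roots whose referer is a Wikipedia page are omitted for brevity.

def reduce_user(requests, max_events=10000, max_depth=100):
    # Turn one "user"'s requests into a set of trees.
    if len(requests) > max_events:       # likely a bot
        return []
    requests = sorted(requests, key=lambda r: r["timestamp"])   # order by time
    last_seen = {}    # title -> most recently created node with that title
    roots = []
    for r in requests:
        node = {"title": r["title"], "children": []}
        parent = last_seen.get(r.get("referer_title"))   # referer mapped to a title, if any
        if parent is not None:
            parent["children"].append(node)   # attach to the most recent matching pageview
        else:
            roots.append(node)                # no such pageview: start a new tree
        last_seen[r["title"]] = node
    # Discard trees deeper than the threshold.
    return [t for t in roots if tree_depth(t) <= max_depth]

def tree_depth(node):
    return 1 + max((tree_depth(c) for c in node["children"]), default=0)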

Anonymization proposal


This proposal is a draft. It will most probably change as we receive more feedback.

  1. Find all logs for which uri_query is like '%action=edit%' or '%action=submit%', and record the corresponding (user_agent, ip) pairs.
  2. Remove all logs with (user_agent, ip) pairs identified in step 1. This ensures that, as far as possible, we have removed browse data that could be associated with edit activity.
  3. Remove every page trace that contains at least one page viewed by fewer than k unique users. A value such as k=100 can be a strong (or too strong) threshold; we can choose the threshold by looking at the histogram and seeing how much data we lose as we increase k. (See the sketch after this list.)
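
A minimal Python sketch of step 3, under the interpretation clarified in the feedback below (a trace is dropped if any page in it was viewed by fewer than k unique users). The temporary user keys are an assumption for counting only and would not appear in the released data.

from collections import defaultdict

def titles_in(node):
    # Yield every article title that appears in a trace (tree).
    yield node["title"]
    for child in node.get("children", []):
        yield from titles_in(child)

def k_anonymize(traces, k=100):
    # traces: list of (temporary_user_key, root_node) pairs; the key is used
    # only for counting unique viewers and is never part of the released data.
    viewers = defaultdict(set)
    for user_key, root in traces:
        for title in titles_in(root):
            viewers[title].add(user_key)
    # Keep a trace only if every page in it was viewed by at least k unique users.
    return [root for user_key, root in traces
            if all(len(viewers[t]) >= k for t in titles_in(root))]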

Feedback


Staff feedback

  • Define a bound on the session time. Page traces created from activity over long periods of time are not interesting for many research use cases, and they may reveal outlier information that can potentially put user privacy at risk.
    • Browsing sessions can be delimited using an interactivity timeout. A new identifier could be provided in the output for each unique session. This would make individual sessions of a user difficult to associate with each other. See R:Activity session. --Halfak (WMF) (talk) 13:19, 13 August 2015 (UTC)
  • Define a maximum tree depth and discard all trees with at least one chain longer than the threshold.
    • We discard trees with a depth of more than 100 due to technical considerations. It would be easy to put a limit on chain length; however, step 3 of the anonymization takes care of the possible privacy concerns.
  • Ask for feedback more broadly. Consider communicating via wikimedia-l and running a well-scoped consultation.
  • Ideally, the browse traces should have timestamps. Do not consider a time bucket smaller than a day. If you can provide value with one week as a time bucket, go with that.
    • Bob and Leila agree with this. Bob recommends providing relative timestamp within the bucket.
    • +1 for relative timestamp. --Halfak (WMF) (talk) 13:22, 13 August 2015 (UTC)
  • Check whether the browsing trace contains redlinks.
    • We agreed that the presence of redlinks is not concerning because of the existence of step 3 in the anonymization process.
  • Chains with deleted and suppressed articles should not be included in the data release. This may not be a big problem given step 3 of the anonymization; however, we should make sure we don't release, for example, page_titles of deleted/suppressed articles accidentally.
    • We release titles of deleted and suppressed content all of the time. E.g. the XML dumps contain pages and titles that were deleted after the data was collected. I'm not sure this is a problem. --Halfak (WMF) (talk) 13:22, 13 August 2015 (UTC)
  • I fully support the effort of trying to make this public. This data set is of immense value for anyone interested in understanding how information on Wikipedia is consumed. In terms of anonymization, it would be prudent to make sure all User and User Talk pages are excluded, as this may allow you to match a trace to an editor. If you are publishing data from many languages make sure you get all the different variations for how these namespaces can be specified. --Ewulczyn (WMF) (talk)
  • I think it's fantastic to release that type of data :) About anonymization, I can't think of anything other than what has already been mentioned. I particularly like the idea of relative timestamps (no precise date/time) and a session max length. The only concern I might think of is about community communication, making sure that our members don't feel tracked. Joal (WMF) (talk) 14:37, 19 August 2015 (UTC)
    • Thanks for your note, Joal (WMF). Can you explain more specifically what you mean by members? Step 2 of the proposal, for example, is meant to make sure that we do not share traces that could be associated with edit activity. If you also have recommendations on how we can do better communications in this area, please let us know. --LZia (WMF) (talk) 23:26, 24 August 2015 (UTC)
  • Are the article titles (per-wiki) canonicalized? Are actual search strings in the trace, or is this for search strings that only result in the user landing directly on a page after the fulltext search is submitted and the user is 30x'd? If the criterion is merely that some k users have visited a singular node in the trace, is there a risk of identifying searches being divulged? For example, one person's trace may include several popular articles, but the search may be very specific. If there is a heightened risk of identifying searches being divulged, would a countermeasure to this risk be to enforce a rule that, for traces that become eligible per the k criterion yet contain searches, there must be some j number of traces containing the identical page-search pair?
    • What exactly do you mean by "canonicalized" and "30x'd"? Regarding search strings (i.e., queries): you may think of page transitions from X to Y happening in one of two ways: via a link click from X to Y, or via a search using the search box on page X that results in the user landing on page Y. The data set will only specify (X,Y) and whether the transition was of the first or the second kind; it will not specify the query that was used if the transition was of the second kind. In short: no search queries will be published. --Cervisiarius (talk) 00:45, 27 August 2015 (UTC)
  • Will the concatenated fields for identifying a "unique" be obfuscated (e.g., hashing with master and per-user salts rotated on some frequency)? Or are user "identities" not actually involved in the output data, but rather used only temporarily in the construction of the trees and their weighted paths? — The preceding unsigned comment was added by ABaso (WMF) (talk) 22:00, 19 August 2015 (UTC)
    • The user "identities" will not be involved in the output data, they will only be used temporarily in the construction of the trees. We are not planning to share any sort of unique IDs in the data-set. --LZia (WMF) (talk) 23:15, 24 August 2015 (UTC)
  • Please clarify what exactly you are planning to release. Just sequences of page titles, no other data whatsoever? Will browsing sequences that were done by the same user but not connected (e.g. two different browser tabs) be associated?
    • Here's a JSON object representing a browse path, as we're planning on publishing it: {"rel_time":0,"http_status":"200","title":"Great_Britain","children":[{"rel_time":10,"http_status":"200","title":"Kingdom_of_Great_Britain","children":[{"rel_time":64,"http_status":"200","title":"United_Kingdom"}]}],"referer":"GOOGLE|FACEBOOK|TWITTER|..."} -- Note that there's no user id, i.e., it will not be possible to tell which paths come from the same user. --Cervisiarius (talk) 00:45, 27 August 2015 (UTC)
  • Also, is this one-time or will it happen continuously, for an indefinite period of time? What will be the delay between a pageview happening and getting released?
  • You should treat special pages like edits (discard that session); some of them (e.g. MovePage, Upload) result in log records which can connect the user to the trace. Also protect/unprotect and delete actions, patrolling/flagrev and veaction=edit (and there are probably others).
  • For extra paranoia, do Unicode normalization on page titles after URL-decoding them. Theoretically, someone could trick users into following links that have been crafted to include unusual Unicode compositions, and then search for those in the trace. Apparently we do a redirect for non-standard forms, so that's not a concern.
  • Limiting to article space is probably a good idea, too. There are many scenarios in which one can make inferences from what Wikipedia: pages a navigation sequence contains.
    • Yes! Thanks. Correcting the description now. We only focus on main namespace pageviews. --LZia (WMF) (talk) 23:06, 24 August 2015 (UTC)
    • To be language-independent, we are planning on pursuing the following approach: discard all page names that contain the pattern "non-whitespace followed by ':' followed by non-whitespace". This should discard everything that's not in the main namespace. Could people please jump in to confirm or refute this assumption? Thanks! --Cervisiarius (talk) 16:31, 27 August 2015 (UTC)
  • More generally, someone could trick a user into visiting some very unusual page or even a set of pages, and then use that to identify their further activity. That seems very hard to prevent, unless you are willing to filter out all pages with only one unique (still not perfect though, an attacker can just generate artificial uniques) or disassociate pageviews which are not in the same navigation sequence. --Tgr (WMF) (talk) 09:28, 22 August 2015 (UTC)
    • Tgr (WMF) We are filtering out any trace that contains at least one article/node with unique pageview counts of value k (k=100 is considered right now) or less. This is explained in step 3. Does this address your point? Regarding your suggestion to disassociate pageviews: do you mean disassociating pageviews that happen in the same session, or pageviews of the same user in a separate session later? If you mean the latter, the different traces are not associated with unique IDs, so the person tricking the user in your example will not know that a future trace is from the same user.--LZia (WMF) (talk) 23:01, 24 August 2015 (UTC)
      Thanks, that's good to know. Step 3 as phrased now is ambiguous; the way I understood it was that there are less than 100 users in the whole trace ("less than k unique users have viewed at least one page" -> there are less than k users who have each viewed at least one page).
      It doesn't fully rule out an attack based on sending a user to a specified article (although it definitely makes it more of an effort); the attacker can just use a proxy to create a bunch of fake uniques. The only way I can see to defuse that is to disassociate pageviews in the same session. (Or you can just ignore it. I'm not sure how much of a threat it is realistically.) --Tgr (WMF) (talk) 02:13, 25 August 2015 (UTC)
      Tgr (WMF) This concern is also expressed below, and I see you suggest that theoretically one could set up a bunch of proxies to generate fake uniques. Building on the example below, if I understand the proposal correctly, an attacker who creates the pages Draft:Quiz, Draft:Red, and Draft:Blue has to visit both paths Draft:Quiz -> Draft:Red and Draft:Quiz -> Draft:Blue (k-1) times and then trick the user into visiting the page Draft:Quiz. That would imply that the attacker could discover the user's answer to the fake quiz, but not that they would be able to track the user further in their session, since the k constraint is applied to all the pages of the trace. So if the user visits Draft:Quiz, then Draft:Red, and then follows on with their navigation to visit, say, Tiananmen Square, only the trace Draft:Quiz -> Draft:Red would be included in the logs, unless there are another (k-1) unique visitors who have followed the same path Draft:Quiz -> Draft:Red -> Tiananmen Square. The attack is, in theory, still open because the attacker could use the proxies to mine k-1 long paths trying to guess what the user would visit next, but it requires increasing effort by the attacker. A possible mitigation is to delete the traces that contain pages that have been deleted before processing the log (e.g. 3 months), and to release the log after a given period of time (e.g. 6 months). This would mean that the fake quiz page has to go undetected for 3 months to be included in the released dataset. --CristianCantoro (talk) 07:31, 18 November 2015 (UTC)
      Yes, that captures well what I had in mind. Again, I'm not sure if this is something we need to be concerned about; expending huge effort to learn which pill someone clicks on would hardly motivate an attacker. What I would worry about is whether this can somehow be used to deanonymize users, by sending an email to an IRL-identifiable person who is suspected to be an editor, pointing them to a prepared page, and then making inferences from their behavior (such as which username in a long list of usernames they click). I can't really think of a realistic example of this, though. --Tgr (WMF) (talk) 07:55, 18 November 2015 (UTC)

External privacy/security experts feedback


We discussed the proposal for releasing the data with two external security and privacy experts. Both individuals expressed concerns about releasing such data. Here is a summary of what we heard from them:

  • Every entity (company, organization, etc.) that has released such data-sets has regretted doing so. This is a very risky path to walk on.
  • Even if releasing such a data-set does not violate users' privacy, it violates the users' expectation of privacy. No good will come of this.
  • There are always people out there who will be eager to use this data to work against you by trying to combine the data-set with other publicly or privately available data-sets. You can't know all the data that is already available out there, and you don't have the time/resources/interest to put yourself in their shoes to figure out what they can do to harm you or the users.
  • If after hearing the above concerns you still want to release some version of the proposal, consider the following:
    • Consider trees that are common. Instead of releasing trees for which each node has at least k unique views, release trees only if the whole tree has been viewed by at least k unique devices. (We discussed extensively that this would potentially make the data-set completely useless for research purposes, as we have observed that only a small percentage of trees are common.)
    • Take out biographies from the trees, or drop trees that include biography pages. People tend to go to their own pages often, and this can become a source of issues in releases like this.
    • Look into methods developed in the field of differential privacy.

Community feedback

  • While the suggested k of 100 might be needed to counter straightforward de-anonymization, it also means that the dataset will only cover very prominent pages. Even for big wikis, the majority of pages receive <100 pageviews, and hence are viewed by <100 unique users. For example, in the whole of October 2015, only 8% of the visited pages on dewiki saw >=100 page views (according to http://dumps.wikimedia.org/other/pagecounts-ez/merged/pagecounts-2015-10-views-ge-5-totals.bz2 , which even has the long tail cut at 5 page views, so the real number is even lower than 8%).
Thanks for the feedback. Two points:
  • 100 is an example; the actual k may be lower or higher than that.
  • We agree with you that large values of k may result in very small and not-so-interesting data-sets for the community, which calls the value of the release into question. We won't release the data if it's not valuable for the community, and we won't release the data if we don't feel confident about its privacy. The discussion continues.  --LZia (WMF) (talk) 22:38, 18 November 2015 (UTC)
  • The k on its own is not sufficient to achieve anonymization. The following example is a bit contrived, and ignores some details, but it exhibits the core problem in a simple form. Let's assume I create three pages: Draft:Quiz, Draft:Red, and Draft:Blue. Draft:Quiz contains the question "Which pill would you take?" with links to Draft:Red and Draft:Blue. Now I visit each of those three pages (k-1) times. Then I send an email to a friend saying "Here, I created this quiz just for you" with a link to the Draft:Quiz page. If my friend follows the link, the proposed dataset would reveal my friend's answer to me. While the example is contrived, the steps needed to set up such a trap are simple, and can easily be done by machines. Using the same approach, I can use the content of the Barack Obama article to create a look-alike Draft:Bracka_Obama article that has all links replaced with links to other crafted articles. Then visiting each of those (k-1) times before luring a friend onto the Draft:Bracka_Obama article would reveal to me which links my friend followed.
    A mitigation to this is that the proposal says that k unique visitors need to visit the page. An attacker would need to be able to make (k-1) requests from different IPs. So, is this still a concern? --CristianCantoro (talk) 06:52, 18 November 2015 (UTC)
This is a possible scenario. Two responses that control for this case:
  1. The data under discussion will only include traces from main namespace, so in your example, Draft traces will not be included.
  2. We are discussing a one-time release of data that has either been kept from the past or will be collected at an unknown future time. Planning by an attacker for such a one-time release is unlikely if we use future data, and impossible if we use past data. --LZia (WMF) (talk) 22:38, 18 November 2015 (UTC)

Decision


Based on the feedback we have collected so far, we have decided not to release the data-set proposed on this page. We are stopping the discussion/consultation early (before officially kick-starting stage 3, the community consultation) as the risks identified are simply too high. Thank you to everyone who participated in this discussion and helped us make a more informed decision. --LZia (WMF) (talk) 21:28, 29 December 2015 (UTC)