Research:Differential privacy for Wikimedia data/User filtering

A core component of differential privacy is determining what level of contribution will be protected within the dataset -- i.e. the privacy unit -- and enforcing sufficient filtering to provide that protection. Commonly the goal is to protect any person who might contribute data -- i.e. in a differentially-private dataset, it should not be possible to determine whether any given person has contributed data.[1] In practice, this "person" is actually a user account, as that is how most tech platforms track who contributes what data -- e.g., a phone plus a user account with location history turned on for Google's mobility reports.[2]

The privacy unit is not just theoretical. If the unit is a person, the dataset must be filtered such that each person contributes no more than a certain number of data points to the entire dataset; otherwise, the formal privacy guarantees will not hold. For most tech platforms, this is a trivial point -- all of their users are required to be logged in, so every action is associated with a user ID that can easily be used for filtering. Wikimedia projects, however, generally do not require accounts, in order to preserve users' privacy and reduce barriers to access.
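To make concrete why this bound matters, here is the standard Laplace-mechanism argument (a general property of differential privacy, not something specified by this project): if each user contributes at most k pageviews per day, then adding or removing one user changes the day's pageview counts by at most k in total.

```latex
% With a per-user contribution bound of k pageviews per day, the vector of
% per-article daily counts f has L1 sensitivity at most k:
\Delta f = \max_{D, D' \text{ differing in one user}} \lVert f(D) - f(D') \rVert_1 \le k
% so releasing each count c(a) with Laplace noise of scale k / epsilon,
\tilde{c}(a) = c(a) + \operatorname{Lap}(k/\varepsilon),
% yields user-level epsilon-differential privacy for that day's release.
```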

This privacy-first policy on Wikimedia projects leads to an ironic tension: by virtue of not collecting user identification data, the strong privacy guarantees of differential privacy at the user level become much more challenging to achieve. Below is a possible approach that seeks to provide these strong guarantees without expanding the data collected by the Wikimedia Foundation. It focuses on the use-case of differentially-private datasets of pageviews -- e.g., how many times each Wikipedia article is read by individuals from a given country.

Client-side filtering

Following a rich history of creative uses of generic cookies to collect high-quality data that would otherwise require fingerprinting (e.g., unique-devices counts, SessionLength metrics), the gist of this approach is to track contributions on the client side and pass a simple indicator of whether a pageview should be excluded from any differentially-private datasets. Some possibilities for implementation, and potential issues, are described below.

Basic design

  • A cookie would be established in the client's browser that keeps track of pages read and is used to determine whether a page should be filtered or not.
  • Filtering would occur on a daily cadence -- i.e. any cookies would reset after no more than 24 hours. This reduces issues associated with users switching devices frequently or clearing cookies from their browser.
  • The status of a given pageview will be determined based on the cookie and passed to the Wikimedia Foundation's servers via the x-analytics header so that it is included in the webrequest table and can be used for filtering.

At its simplest, this is a cookie that is set to 1 on the first pageview, with an expiry at midnight UTC. Each subsequent pageview increments it by 1. Once it exceeds a pre-determined threshold -- e.g., 10 -- it stops being incremented, and every subsequent pageview instead carries an x-analytics header with a new field -- e.g., dp-do-filter. On the server side, a table is generated for differentially-private datasets that is the subset of pageviews that do not contain this field.
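To make the mechanics concrete, below is a minimal client-side sketch of the design above. The cookie name, the threshold constant, and the exact shape of the returned x-analytics fields are illustrative assumptions rather than a settled implementation.

```typescript
// Sketch of the client-side pageview counter described above.
const DP_COOKIE = 'dpPageviewCount'; // hypothetical cookie name
const DP_THRESHOLD = 10;             // pre-determined daily pageview limit

// Expire the cookie at the next midnight UTC so counts reset daily.
function nextMidnightUtc(): Date {
  const d = new Date();
  d.setUTCHours(24, 0, 0, 0); // rolls over to 00:00 UTC of the next day
  return d;
}

function readCount(): number {
  const match = document.cookie.match(new RegExp(`(?:^|; )${DP_COOKIE}=(\\d+)`));
  return match ? parseInt(match[1], 10) : 0;
}

// Returns the x-analytics fields to attach to this pageview, if any.
function recordPageview(): Record<string, string> {
  const count = readCount();
  if (count >= DP_THRESHOLD) {
    // Over the threshold: stop incrementing and flag the pageview for exclusion.
    return { 'dp-do-filter': '1' };
  }
  document.cookie =
    `${DP_COOKIE}=${count + 1}; expires=${nextMidnightUtc().toUTCString()}; path=/; Secure`;
  return {};
}
```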

A further benefit of this approach is that it would be easy to support user preferences that allow for private viewing sessions by automatically including the dp-do-filter field in all pageviews.
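If such a (hypothetical) private-reading preference existed, it could reuse the mechanism from the sketch above with one extra branch:

```typescript
// Hypothetical opt-out: flag every pageview when a private-reading
// preference is enabled, reusing recordPageview() from the sketch above.
function recordPageviewWithOptOut(privateReading: boolean): Record<string, string> {
  return privateReading ? { 'dp-do-filter': '1' } : recordPageview();
}
```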

Open questions

Which pageviews

One challenge with this approach is ensuring that a representative set of pageviews is included. In traditional approaches, all of a user's contributions would be considered and, if they exceed the limit, a random subset would be included in the dataset. This is not possible with client-side filtering, and including just the first 10 pageviews, for example, might be an inaccurate reflection of what content is being read -- e.g., if many reader sessions start at a wiki's Main Page and click through to articles from there, the Main Page and pages linked from it would be overrepresented in the final dataset. Most readers do not exceed 10 pageviews per day, however, and there likely is no perfect fix for this issue. Larger sensitivity values -- i.e. how many pageviews are included -- will reduce this bias (though they require more noise to achieve equivalent privacy guarantees). Ensuring that a page is not counted twice is one small step (and likely a good practice anyways) towards reducing this bias, though it requires a slightly more elaborate cookie that records which articles have been viewed (see the sketch below).
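One way that more elaborate cookie could look, again as a sketch: instead of a bare integer, the cookie stores short hashes of the pages already counted. The hash function, separator, and the choice to filter repeat views of an already-counted page are all assumptions for illustration.

```typescript
const SEEN_COOKIE = 'dpSeenPages'; // hypothetical cookie name
const DP_THRESHOLD = 10;           // limit on distinct pages per day

// Compact placeholder "hash"; occasional collisions only cause slight
// under-counting, which is tolerable for this purpose.
function pageHash(pageId: number): string {
  return (pageId % 0xffff).toString(36);
}

function recordUniquePageview(pageId: number): Record<string, string> {
  const match = document.cookie.match(new RegExp(`(?:^|; )${SEEN_COOKIE}=([^;]*)`));
  const seen = match && match[1] ? match[1].split('.') : [];
  const h = pageHash(pageId);
  if (seen.includes(h)) {
    // This page already contributed once today; filter the repeat view so
    // each page is counted at most once per user per day.
    return { 'dp-do-filter': '1' };
  }
  if (seen.length >= DP_THRESHOLD) {
    return { 'dp-do-filter': '1' }; // distinct-page budget exhausted
  }
  seen.push(h);
  const expires = new Date();
  expires.setUTCHours(24, 0, 0, 0);
  document.cookie =
    `${SEEN_COOKIE}=${seen.join('.')}; expires=${expires.toUTCString()}; path=/; Secure`;
  return {};
}
```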

Fixed or flexible threshold

The simplest approach is to define a threshold like 10 a priori. The information passed to the server can then be a simple binary "include" or "exclude". If more flexible thresholds are required -- e.g., 5 for some datasets but 10 for others -- then either the pageview's position within the session would need to be provided to the server, which carries privacy implications as it could be used to reconstruct more accurate reader sessions on the server, or multiple thresholds would have to be supported. The latter might be a reasonable trade-off for two or three thresholds, but with more thresholds it slowly converges on just sending the pageview count (see the sketch below).
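A sketch of the multi-threshold variant, with illustrative threshold values and field names: the client emits one flag per supported threshold, and each dataset filters on the flag matching its own threshold.

```typescript
const THRESHOLDS = [5, 10]; // illustrative; one entry per supported dataset threshold

function filterFlags(count: number): Record<string, string> {
  const flags: Record<string, string> = {};
  for (const t of THRESHOLDS) {
    if (count > t) {
      // Exclude this pageview from any dataset using threshold t.
      flags[`dp-do-filter-${t}`] = '1';
    }
  }
  return flags;
}

// e.g., filterFlags(7) returns { 'dp-do-filter-5': '1' }: the pageview is
// excluded from the threshold-5 dataset but included in the threshold-10 one.
// Note that each additional threshold reveals a bit more about where the
// session's count falls.
```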

Alternatives not under consideration

  • Collecting device-specific user IDs, even if these are quickly discarded, is not under consideration.
  • We are very hesitant to use approximate user IDs -- e.g., based on IP address and user-agent information -- as it is difficult to quantify the privacy loss this introduces, its efficacy is likely to shift over time (see task T242825), and the privacy costs would be unequally distributed, with, e.g., mobile users whose IP changes frequently receiving weaker privacy guarantees than desktop users whose IP is stable.[3]
  • We would like to avoid weaker privacy guarantees such as pageview-level privacy. Our most vulnerable users are generally frequent editors -- because they may be subject to retribution for what they write, much of their data is available via their edit history, and they generate lots of pageviews -- and they would receive the least protection under pageview-level privacy.
  • Local differential privacy -- where data is made differentially-private on the client before being sent to the Wikimedia Foundation -- is not currently under consideration. It would be substantially more complicated, and it is unclear how it could be implemented effectively.

References

  1. This can be further complicated when there are recurring dataset releases, in which case the privacy unit becomes, e.g., a user's contributions in a given day, with guarantees that weaken over time. A common alternative to the user as a privacy unit is a single data point -- e.g., in a differentially-private dataset of pageview counts, it should not be possible to determine whether a given pageview is in the dataset. This is a weaker form of protection, however, as one might still be able to determine whether a given user who contributed many pageviews is present in the data.
  2. Aktay, Ahmet; Bavadekar, Shailesh; Cossoul, Gwen; Davis, John; Desfontaines, Damien; Fabrikant, Alex; Gabrilovich, Evgeniy; Gadepalli, Krishna; Gipson, Bryant; Guevara, Miguel; Kamath, Chaitanya; Kansal, Mansi; Lange, Ali; Mandayam, Chinmoy; Oplinger, Andrew; Pluntke, Christopher; Roessler, Thomas; Schlosberg, Arran; Shekel, Tomer; Vispute, Swapnil; Vu, Mia; Wellenius, Gregory; Williams, Brian; Wilson, Royce J. (3 November 2020). "Google COVID-19 Community Mobility Reports: Anonymization Process Description (version 1.1)". arXiv:2004.04145 [cs]. 
  3. Saxon, James; Feamster, Nick (27 May 2021). "GPS-Based Geolocation of Consumer IP Addresses". arXiv:2105.13389 [cs].