Differential privacy/Active/CentralNotice/Feasibility report

The CentralNotice statistics dataset reports the number of users who view and interact with the banner notices that appear at the top of Wikipedia and other affiliated sites, broken down by geographic region, date, and Wikipedia project. This dataset allows Wikimedia to assess the effectiveness of banner notices across different regions and user bases. Currently, Wikimedia does not release this information due to privacy concerns. The adoption of differential privacy would let Wikimedia release these statistics for the first time.

Problem definition

The CentralNotice input data will be stored in two tables. The impressions table will contain six columns: a campaign_id denoting the campaign the banner belongs to, a banner_id denoting the specific banner, the geographic boundary (country), project, date (day), and impressions. The impressions column will be a pre-aggregated count of the number of impressions for that specific boundary, project, and time. The clicks table has a different format: each row represents an individual click and contains the same campaign, banner, geographic boundary, project, and day, as well as user identifier information.
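
For illustration, one row from each input table might look as follows. The dict representation and values are hypothetical, but the columns follow the description above.

# Hypothetical example rows; values are illustrative only.
impressions_row = {
    "campaign_id": "my_campaign",
    "banner_id": "my_campaign_desktop",
    "country": "DE",
    "project": "de.wikipedia",
    "date": "2023-01-01",
    "impressions": 987654,   # pre-aggregated count
}
click_row = {
    "campaign_id": "my_campaign",
    "banner_id": "my_campaign_desktop",
    "country": "DE",
    "project": "de.wikipedia",
    "date": "2023-01-01",
    "user_id": "example-user",  # account ID, or IP plus user identifier
}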

A sample of a possible release can be seen in the table below. Each row denotes the number of clicks and impressions for a specific banner, country, date, and project. The impressions column denotes the overall impressions for this set of identifiers and will be a private sum of a single row from the impressions table. The clicks column will consist of a private count of all the rows from the clicks table which match the date, geographic area, and banner identifiers.

Example CentralNotice output data

campaign_id   banner_id            country  project       impressions  clicks  date
my_campaign   my_campaign_desktop  DE       de.wikipedia  987,654      321     1/1/2023
my_campaign   my_campaign_mobile   DE       de.wikipedia  102,938      456     1/1/2023

Privacy risk

If the underlying CentralNotice data were combined with additional outside information, it could reveal sensitive information about individuals, such as their locations and website access patterns. The risk is heightened for users who interact with CentralNotice banners often, such as donors or frequent contributors. Due to this risk, this release will not include data from countries identified on the country protection list as potentially dangerous for journalists or internet freedom.

Objectives

The objective of this project is to use differential privacy (or a variant thereof) to accurately report both the number of views of and clicks on banners, broken down by region, date, and Wikipedia project. This report will be generated periodically, either daily or campaign by campaign.

Privacy analysis

The CentralNotice statistics release is well suited for differential privacy. All of the released counts are aggregate statistics to which each user contributes a small amount. Likewise, the keyset is relatively small, as each CentralNotice banner is only shown for a limited date range and in a limited set of countries and projects. Since these parameters are chosen in advance and are not affected by user behavior, the keyset can be considered public knowledge. The one challenge this dataset presents is that the impression counts are pre-aggregated; we discuss approaches to deal with this below.
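
Because the campaign parameters are fixed ahead of time, the keyset can be built as a simple cross product of those parameters. A minimal sketch, with hypothetical campaign parameters:

from itertools import product

# Campaign parameters chosen in advance (hypothetical values); because
# they do not depend on user behavior, the keyset reveals nothing private.
countries = ["DE", "FR"]
projects = ["de.wikipedia", "fr.wikipedia"]
dates = ["2023-01-01", "2023-01-02"]

keyset = list(product(countries, projects, dates))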

Neighboring relation

The dataset contains counts of user interactions with CentralNotice banners. This captures both users who only view a banner and those who both view and click on a banner. Because of this, a user can contribute to both the impression and clickthrough counts, or only the impression counts. Users' clicks are identified by their account identifier if they have one, and by IP address and a user identifier if they do not. This ensures that the contributions of users who have and consistently use their accounts can be bounded across multiple click events. However, users who do not have accounts and switch between devices may not be identifiable across multiple click events and may suffer additional privacy loss. Meanwhile, impression counts are pre-aggregated, and therefore we do not know which users contributed which impressions.

Since impressions and clickthroughs are measured differently, they require different definitions of privacy. The impressions may include multiple contributions by a single user (though bounded by a small amount), and as such each user will incur a small amount of privacy loss per some number of impressions. Clickthrough events, however, can be bounded more strictly, and as such the neighboring relation of a change to a single row will suffice. Below we give the neighboring relation options for both impressions and clickthroughs.

Impressions

(Geo, Project, k)-Differential Privacy: This definition protects the addition or deletion of any k (impression, geo, project) pairs per release. This results in a user receiving one unit of privacy loss for each k records they contribute in a single time period. If a user contributes on multiple days, they receive privacy loss for each day contributed. For example, if data were released on a daily basis, a user would receive one unit of privacy loss per k records they contribute each day. A user who contributes 1.5k records on one day but 0.5k records the next day would receive 2 units of privacy loss on the first day but one on the second, for a total of 3 units of privacy loss.

(Geo, Project, Campaign, k)-Differential Privacy: This definition protects the addition or deletion of any k (impression, geo, project) pairs over the entire campaign. This will result in a user receiving one unit of privacy loss per k records they contribute during a given campaign. This definition is stronger than the previous one. To demonstrate, consider the previous example: a user contributes 1.5k records the first day and 0.5k records the second day. Under (Geo, Project, Campaign, k)-differential privacy this user only receives 2 units of privacy loss, as the total number of records contributed is only 2k. While this is a stronger definition of privacy, it requires more noise per record released.
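
To make the two accounting schemes concrete, the following sketch computes the privacy loss of the example above, assuming one unit of loss per started block of k records:

import math

def daily_loss(records_per_day, k):
    # One unit of privacy loss per k records, accounted each day.
    return sum(math.ceil(r / k) for r in records_per_day)

def campaign_loss(records_per_day, k):
    # One unit of privacy loss per k records over the whole campaign.
    return math.ceil(sum(records_per_day) / k)

k = 100
days = [150, 50]               # 1.5k records on day one, 0.5k on day two
print(daily_loss(days, k))     # 3 units (2 on day one, 1 on day two)
print(campaign_loss(days, k))  # 2 units (2k records in total)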

Clicks

Row-Differential Privacy: Due to the small expected number of click events per individual, the traditional notion of row differential privacy, in which only one record is removed or added, is a viable option. Under this option each user receives one unit of privacy loss per row contributed over the entire dataset. Due to the nature of click events, the vast majority of individuals are expected to contribute exactly one (or very few) click events, and should therefore expect fairly minimal privacy loss. As before, the privacy loss for each user can be accounted on a release-by-release basis or over an entire campaign.

User-Differential Privacy: Given the existence of user identifiers which persist across days, it is possible to enforce a global sensitivity across days or an entire campaign. As a preprocessing step, one can trim the dataset such that any user contributes at most k records over a given time period. This ensures that no user receives more than one unit of privacy loss per time period. It is important to note that this guarantee only applies to users who appear under a single identifier; users who switch accounts or otherwise use multiple identifiers risk additional privacy loss.
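
A minimal sketch of this trimming step, assuming each click row carries a persistent user_id field (the function name is hypothetical):

import random

def bound_contributions(click_rows, k):
    # Group click rows by user identifier.
    by_user = {}
    for row in click_rows:
        by_user.setdefault(row["user_id"], []).append(row)
    # Keep at most k randomly chosen rows per user, so no user
    # contributes more than k records in the time period.
    trimmed = []
    for rows in by_user.values():
        random.shuffle(rows)
        trimmed.extend(rows[:k])
    return trimmed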

Initial mechanisms

Here we include two separate mechanisms: Algorithm 1 measures the number of impressions while Algorithm 2 measures the number of clickthrough events.

Algorithm 1 works in two steps. In the first step it splits the number of impressions for each time period into blocks of size k. This allows us to express (Geo, Project, k)-differential privacy to Tumult Analytics. In the second step a noisy sum is taken over the split rows using clamping bounds of [0, k], which ensures that no data is clamped while scaling the noise to a per-row contribution of k. In addition to the privacy budget ε, the block size k (the upper bound on impressions in a split row, and hence the number of impressions protected by one unit of privacy budget) can also be tuned. As k increases, the strength of the privacy guarantee increases, at the cost of adding additional noise. As such we suggest tuning the value of k prior to release.

Algorithm 2 measures the number of clickthrough events for a given (geo, date) pair. Since clickthrough events are fairly sparse and any individual contributes few of them, a simple private count of all clickthrough events can be used. In this case the only tunable parameter is the privacy budget ε.

import numpy as np

def algorithm_1(
    k,               # num impressions to protect with one unit of budget
    epsilon,         # daily privacy loss per k impressions
    w,               # set of (geo, date) pairs in campaign_table
    campaign_table,  # rows of (geo, date, impressions), pre-aggregated
    release_thresh,  # minimum noisy total required for release
):
    output = {}
    # Drop keys for countries on the country protection list
    # (assumed to be defined elsewhere).
    keys = [(geo, date) for (geo, date) in w
            if geo not in country_protection_list]
    # Split each pre-aggregated count into rows of exactly k impressions
    # plus one remainder row, so no single row exceeds k.
    campaign_table_split = []
    for (geo, date, impressions) in campaign_table:
        n_full, remainder = divmod(impressions, k)
        campaign_table_split += [(geo, date, k)] * n_full
        if remainder > 0:
            campaign_table_split.append((geo, date, remainder))

    for (geo, date) in keys:
        # Each split row contributes at most k (the clamping bound), so
        # Laplace noise with scale k / epsilon gives epsilon-DP per row.
        total = sum(v for (g, d, v) in campaign_table_split
                    if (g, d) == (geo, date))
        output[(geo, date)] = total + np.random.laplace(scale=k / epsilon)

    # Suppress keys whose noisy totals fall below the release threshold.
    return {key: x for key, x in output.items() if x > release_thresh}

def algorithm_2(
    epsilon,         # daily privacy loss
    w,               # set of (geo, date) pairs in campaign_table
    campaign_table,  # one row per click event: (geo, date, ...)
    release_thresh,  # minimum noisy count required for release
):
    output = {}
    keys = [(geo, date) for (geo, date) in w
            if geo not in country_protection_list]
    for (geo, date) in keys:
        # Each row is a single click event, so the count has sensitivity 1
        # and Laplace noise with scale 1 / epsilon gives epsilon-DP per row.
        n = sum(1 for row in campaign_table if (row[0], row[1]) == (geo, date))
        output[(geo, date)] = n + np.random.laplace(scale=1 / epsilon)

    # Suppress keys whose noisy counts fall below the release threshold.
    return {key: x for key, x in output.items() if x > release_thresh}
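
For reference, the noisy sum at the core of Algorithm 1 could be expressed to Tumult Analytics roughly as follows. This is a sketch based on the library's published query API; the session setup, source name, and column names are assumptions:

from tmlt.analytics.keyset import KeySet
from tmlt.analytics.privacy_budget import PureDPBudget
from tmlt.analytics.query_builder import QueryBuilder

# "session" is assumed to be a tmlt.analytics Session whose
# "impressions_split" source holds the split rows, so that protecting
# one row protects k impressions.
keys = KeySet.from_dict({"geo": countries, "date": dates})
query = (
    QueryBuilder("impressions_split")
    .groupby(keys)
    .sum("impressions", low=0, high=k)  # clamping bounds [0, k]
)
result = session.evaluate(query, PureDPBudget(epsilon))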

Extensions

One possible alteration to Algorithm 2 would be the use of zCDP instead of traditional DP. Due to the low sensitivity of the clickthrough count, it is compatible with zCDP. This would allow for the use of the Gaussian mechanism, which adds Gaussian noise instead of Laplace noise. This comes with a few benefits, including significantly fewer spurious results. If zCDP were used, it would require additional accounting to convert the privacy parameter ρ of zCDP to an (ε, δ) guarantee in order to properly combine the pure DP guarantee of Algorithm 1 with the zCDP guarantee of Algorithm 2.
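
A ρ-zCDP guarantee converts to (ε, δ)-DP via the standard bound ε = ρ + 2·sqrt(ρ·ln(1/δ)), which holds for any δ > 0 (Bun and Steinke, 2016). A minimal sketch of this accounting step:

import math

def zcdp_to_approx_dp(rho, delta):
    # A rho-zCDP mechanism satisfies (epsilon, delta)-DP with
    # epsilon = rho + 2 * sqrt(rho * ln(1/delta)) for any delta > 0.
    return rho + 2 * math.sqrt(rho * math.log(1 / delta))

# Example: a sensitivity-1 count with Gaussian noise of standard
# deviation sigma satisfies rho = 1 / (2 * sigma**2) zCDP.
sigma = 20.0
rho = 1 / (2 * sigma**2)
epsilon = zcdp_to_approx_dp(rho, delta=1e-6)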