We will record all events, sampling at 100%.
The code that logs to the EditorJourney schema lives in the extension's PageViews class.
The following business rules control when we log events to the EditorJourney schema:
- $wgWMEUnderstandingFirstDay is set to true in the MediaWiki configuration repository. Currently, this is set for the Korean and English beta labs. The plan is to enable it on the Korean and Czech Wikipedias.
- The user is in the cohort we are logging data for: users who created their accounts within the last 24 hours.
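The two rules above can be sketched as a single gate. This is a minimal illustration in Python, not the extension's PHP code: the function name, its parameters, and the treatment of the exact 24-hour boundary are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "within the last 24 hours" is treated as an inclusive window.
COHORT_WINDOW = timedelta(hours=24)


def should_log_event(feature_enabled: bool,
                     registered_at: datetime,
                     now: datetime) -> bool:
    """Return True only when both business rules hold (hypothetical helper)."""
    # Rule 1: the wiki has the feature flag switched on
    # (stands in for $wgWMEUnderstandingFirstDay).
    if not feature_enabled:
        return False
    # Rule 2: the account was created within the last 24 hours.
    age = now - registered_at
    return timedelta(0) <= age <= COHORT_WINDOW


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(should_log_event(True, now - timedelta(hours=3), now))
    print(should_log_event(True, now - timedelta(days=2), now))
```

Passing the current time in as an argument, rather than reading the clock inside the function, keeps the check easy to test.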
For users in the cohort, on each page view, redirected page view, login or logout event we construct an event object containing the following:
- The title text, e.g. "Help desk" if the user is on
- The page ID, e.g. 564696
- The HTTP request method: either GET or POST, depending on whether the user is reading or writing data.
- The action associated with the page view. For example, if the user is on /w/index.php?title=Wikipedia:Help_desk&action=history, then the action would be "history".
- If the action is "edit", a comma-separated list of the permission errors, if any, associated with the edit action for this page, for example "protectedpagetext,editprotected,edit" or "badaccess-group0". This helps us understand whether the user is running into permission problems, for instance when attempting to edit a protected article.
- Whether the page view is associated with the mobile front end
- The namespace associated with the event, e.g. 1, 2, etc
- The path associated with the page view, for example if the URL is
https://en.wikipedia.org/w/index.php?title=Wikipedia:Help_desk&action=info, then the path would be
- The query parameters associated with the page view, for example if the URL is
https://en.wikipedia.org/w/index.php?title=Wikipedia:Help_desk&action=info, then the query parameters would be
- The user ID associated with the page view
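As an illustration of two of the fields listed above, the following Python sketch derives the action from a URL and builds the comma-separated permission error list. The function names are hypothetical, and falling back to "view" when no action parameter is present is an assumption for the example, not something this page specifies. The real extension is PHP and may obtain these values differently.

```python
from typing import List
from urllib.parse import parse_qs, urlsplit


def action_for_url(url: str, default: str = "view") -> str:
    """Return the `action` query parameter, or `default` if it is absent.

    The "view" default is an assumption made for this sketch.
    """
    params = parse_qs(urlsplit(url).query)
    values = params.get("action")
    return values[0] if values else default


def format_permission_errors(errors: List[str]) -> str:
    """Join permission error keys into one comma-separated string."""
    return ",".join(errors)


if __name__ == "__main__":
    print(action_for_url("/w/index.php?title=Wikipedia:Help_desk&action=history"))
    print(format_permission_errors(["protectedpagetext", "editprotected", "edit"]))
```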
Redacting sensitive information
Once we have the event object prepared, we redact sensitive data from it.
- First, we hash or redact sensitive query parameters. Hashed query parameters are
token is simply replaced with the string
- Then, we check to see if the event is in a sensitive namespace. The namespaces are defined in the MediaWiki-config repository, and vary by wiki.
- If the event's title is not in a sensitive namespace, then we log the event data and are done.
If it is in a sensitive namespace, then we have to hash several fields.
- Path: we replace all instances of the title db_key (e.g. Main_Page) in the path with the hashed value of the title db_key.
- Query: we replace all instances of the title db_key (e.g. Main_Page) in the query with the hashed value of the title db_key.
- Title: we replace the title with the hashed value of the title.
- Page title: we replace any instances of the title db_key or the title text with the hashed value of the title db_key.
With the above steps done, we send the hashed data via the EditorJourney schema.
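The namespace check and field replacement described above might look roughly like the following Python sketch. The event field names ("namespace", "path", "query", "title", "page_title"), the function name, and the injected hash_fn are all hypothetical. The real logic lives in the PHP extension and may differ in detail.

```python
from typing import Callable, Dict, Set


def redact_event(event: Dict[str, object],
                 db_key: str,
                 sensitive_namespaces: Set[int],
                 hash_fn: Callable[[str], str]) -> Dict[str, object]:
    """Return a copy of `event`, hashing title data in sensitive namespaces."""
    redacted = dict(event)
    if event["namespace"] not in sensitive_namespaces:
        return redacted  # not sensitive: the event is logged unchanged

    hashed_key = hash_fn(db_key)
    title_text = str(event["title"])

    # Path and query: replace every occurrence of the title db_key.
    redacted["path"] = str(event["path"]).replace(db_key, hashed_key)
    redacted["query"] = str(event["query"]).replace(db_key, hashed_key)
    # Title: replaced by the hash of the title itself.
    redacted["title"] = hash_fn(title_text)
    # Page title: occurrences of the db_key or the title text both become
    # the hashed db_key.
    page_title = str(event["page_title"]).replace(db_key, hashed_key)
    redacted["page_title"] = page_title.replace(title_text, hashed_key)
    return redacted
```

Passing the hash function in as a parameter keeps the replacement rules separate from the choice of hashing algorithm and secret.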
- We are using PHP's hash_hmac function with the whirlpool algorithm. The hash secret is generated once per user and is stored in Redis with a 24-hour TTL. The key used to look up the secret is generated by hashing the user ID and the account registration timestamp. The end result is that we obfuscate the URLs a user visits to protect their privacy: we can still see patterns in page views, but we do not know which particular pages were viewed.
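The keyed-hash scheme can be sketched as follows. This is a Python illustration under several assumptions, not the extension's code: an in-memory dictionary stands in for Redis, HMAC-SHA-512 stands in for PHP's hash_hmac with whirlpool (whirlpool is not reliably available in Python's hashlib), and the class name, function names, key format, and 32-byte secret length are all hypothetical.

```python
import hashlib
import hmac
import secrets
import time
from typing import Dict, Optional, Tuple

DAY_SECONDS = 24 * 60 * 60


class SecretStore:
    """In-memory stand-in for Redis: each secret expires after `ttl` seconds."""

    def __init__(self, ttl: float = DAY_SECONDS) -> None:
        self.ttl = ttl
        self._data: Dict[str, Tuple[bytes, float]] = {}

    def get_or_create(self, key: str, now: Optional[float] = None) -> bytes:
        """Return the unexpired secret for `key`, generating one if needed."""
        now = time.time() if now is None else now
        entry = self._data.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]
        secret = secrets.token_bytes(32)  # assumed secret length
        self._data[key] = (secret, now + self.ttl)
        return secret


def lookup_key(user_id: int, registration_ts: str) -> str:
    """Derive the store key from the user ID and registration timestamp."""
    material = f"{user_id}:{registration_ts}".encode()
    return hashlib.sha256(material).hexdigest()


def hash_value(value: str, secret: bytes) -> str:
    """Keyed hash of `value`; SHA-512 is a stand-in for whirlpool."""
    return hmac.new(secret, value.encode(), hashlib.sha512).hexdigest()
```

Because the secret is fixed per user while it lasts, the same title hashes to the same value across that user's events, which is what allows patterns in page views to be seen. Once the secret expires, the hashes can no longer be recomputed from candidate titles.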