Schema talk:EditorJourney
| Maintainer: | Kosta Harlan |
|---|---|
| Team: | Growth |
| Project: | Understanding first day usage |
| Status: | inactive |
| Purge: | After 90 days, auto-purge PII contained in the event capsule, as well as the user ID; keep the rest indefinitely |
This page holds a JSON schema that specifies a data model for EventLogging.
Sampling
We will record all events, sampling at 100%.
Business rules
The code that logs to the EditorJourney schema is contained in the extension's PageViews class.
The following business rules control when we log events to the EditorJourney schema:
- $wgWMEUnderstandingFirstDay is set to true in the MediaWiki configuration repository. Currently, this is set for Korean and English beta labs. The plan is to have this enabled on Korean and Czech Wikipedias.
- The user has to be in the cohort we are interested in logging data for. The cohort is users who have created their accounts within the last 24 hours.
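The two rules above can be sketched as a single gate function. This is only an illustration in Python, not the extension's PHP code: the function name `should_log`, its parameters, and the handling of the exact 24-hour boundary are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Cohort window taken from the rule above: accounts created within the last 24 hours.
COHORT_WINDOW = timedelta(hours=24)


def should_log(feature_enabled: bool, registered_at: datetime, now: datetime) -> bool:
    """Return True only when the feature flag is on and the account is in the cohort.

    `feature_enabled` stands in for the value of $wgWMEUnderstandingFirstDay.
    Treating an account that is exactly 24 hours old as still in the cohort
    is an assumption of this sketch.
    """
    if not feature_enabled:
        return False
    age = now - registered_at
    return timedelta(0) <= age <= COHORT_WINDOW


now = datetime(2019, 1, 2, 12, 0, tzinfo=timezone.utc)
print(should_log(True, now - timedelta(hours=3), now))   # new account, flag on
print(should_log(True, now - timedelta(hours=30), now))  # account too old
```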
For users in the cohort, on each page view, redirected page view, login or logout event we construct an event object containing the following:
- The title text, e.g. "Help desk" if the user is on /wiki/Wikipedia:Help_desk.
- The page ID, e.g. 564696
- The HTTP request method: either GET or POST, depending on whether the user is reading or writing data.
- The action associated with the page view, for example if the user is on /w/index.php?title=Wikipedia:Help_desk&action=history then the action would be "history".
- If the action is "edit", we generate a comma-separated list of permission errors associated with the edit action for this page, if any. An example would be "protectedpagetext,editprotected,edit" or "badaccess-group0". This is helpful in understanding whether the user is encountering access permission problems when attempting to edit a protected article, for example.
- Whether the page view is associated with the mobile front end
- The namespace associated with the event, e.g. 1, 2, etc
- The path associated with the page view, for example if the URL is https://en.wikipedia.org/w/index.php?title=Wikipedia:Help_desk&action=info, then the path would be /w/index.php
- The query parameters associated with the page view, for example if the URL is https://en.wikipedia.org/w/index.php?title=Wikipedia:Help_desk&action=info then the query parameters would be ?title=Wikipedia:Help_desk&action=info
- The user ID associated with the page view
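As a rough illustration of how the fields above relate to a request URL, the following Python sketch assembles an event dictionary. The function name `build_event` and the dictionary keys are hypothetical and are not the schema's actual property names; defaulting the action to "view" when the URL carries no action parameter is also an assumption of this sketch.

```python
from urllib.parse import parse_qs, urlsplit


def build_event(url, method, title_text, page_id, namespace, user_id,
                is_mobile=False, permission_errors=()):
    """Assemble an event dict from a page view (key names are illustrative only)."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    # Assumption: fall back to "view" when no explicit action is present.
    action = params.get("action", ["view"])[0]
    event = {
        "title": title_text,
        "page_id": page_id,
        "request_method": method,
        "action": action,
        "is_mobile": is_mobile,
        "namespace": namespace,
        "path": parts.path,
        "query": "?" + parts.query if parts.query else "",
        "user_id": user_id,
    }
    if action == "edit":
        # Comma-separated permission errors, only recorded for the edit action.
        event["permission_errors"] = ",".join(permission_errors)
    return event


event = build_event(
    "https://en.wikipedia.org/w/index.php?title=Wikipedia:Help_desk&action=info",
    method="GET", title_text="Help desk", page_id=564696, namespace=4, user_id=1,
)
print(event["path"], event["query"], event["action"])
```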
Redacting sensitive information
Once we have the event object prepared, we redact sensitive data from it.
- First, we hash[1] or redact sensitive query parameters. Hashed query parameters are search, return, and returnto, while token is simply replaced with the string "redacted".
- Then, we check to see if the event is in a sensitive namespace. The namespaces are defined in the MediaWiki-config repository, and vary by wiki.
- If the event's title is not in a sensitive namespace, then we log the event data and are done.
If it is in a sensitive namespace, then we have to hash several fields.
- Path: we replace all instances of the title db_key (e.g. Main_Page) in the path with the hashed value of the title db_key
- Query: we replace all instances of the title db_key (e.g. Main_Page) in the query with the hashed value of the title db_key
- Title: we replace the title with the hashed value of the title.
- Page title: we replace any instances of title db_key or title text with the hashed value of title db_key
With the above steps done, we send the hashed data via the EditorJourney schema.
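The redaction steps above could look roughly like the following Python sketch. Several things here are assumptions rather than the extension's behaviour: the function names, the event keys, the example namespace IDs, and the use of HMAC-SHA-256 as a stand-in for the whirlpool algorithm named in the footnote. The "Page title" rule is left out for brevity.

```python
import hashlib
import hmac
from urllib.parse import parse_qsl, urlencode

HASHED_PARAMS = {"search", "return", "returnto"}
REDACTED_PARAMS = {"token"}


def keyed_hash(secret: bytes, value: str) -> str:
    # Stand-in for the real hash: SHA-256 is used here only because it is
    # always available in Python's hashlib.
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


def redact_query(query: str, secret: bytes) -> str:
    """Hash or redact the sensitive query parameters, keeping the others."""
    pairs = []
    for name, value in parse_qsl(query.lstrip("?"), keep_blank_values=True):
        if name in HASHED_PARAMS:
            value = keyed_hash(secret, value)
        elif name in REDACTED_PARAMS:
            value = "redacted"
        pairs.append((name, value))
    encoded = urlencode(pairs, safe=":/")
    return "?" + encoded if encoded else ""


def redact_event(event: dict, secret: bytes, sensitive_namespaces: set, db_key: str) -> dict:
    """Return a redacted copy of the event (keys are hypothetical)."""
    out = dict(event)
    out["query"] = redact_query(event["query"], secret)
    if event["namespace"] not in sensitive_namespaces:
        return out  # not sensitive: only the query parameters were touched
    hashed_key = keyed_hash(secret, db_key)
    out["path"] = out["path"].replace(db_key, hashed_key)
    out["query"] = out["query"].replace(db_key, hashed_key)
    out["title"] = keyed_hash(secret, event["title"])
    return out


event = {"title": "Alice", "namespace": 2,
         "path": "/wiki/User:Alice", "query": "?token=abc123&search=cats"}
# Namespaces 2 and 3 are used purely as an example of a sensitive set.
print(redact_event(event, b"example-secret", {2, 3}, "Alice"))
```

Keeping the hash keyed by a secret means the same title always maps to the same opaque value for a given secret, which is what lets page view patterns remain visible after redaction.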
Footnotes
- ↑ We are using PHP's hash_hmac function with the whirlpool algorithm. The hash secret is generated once per user, and is stored in Redis with a 24-hour TTL. The key used to look up the secret is generated by hashing the user ID and user account registration timestamp. The end result is that we obfuscate the URLs that a user is visiting to protect privacy, but can still see patterns in page views, though we don't know which pages in particular they are viewing.
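A minimal Python sketch of the per-user secret scheme described in this footnote follows. It is an illustration under stated assumptions, not the extension's code: an in-memory class (`TtlStore`, hypothetical) stands in for Redis, SHA-256 stands in for whirlpool, and the way the user ID and registration timestamp are combined into a lookup key is invented for the example.

```python
import hashlib
import hmac
import secrets
import time

SECRET_TTL_SECONDS = 24 * 60 * 60  # 24-hour TTL, as described in the footnote


class TtlStore:
    """Tiny in-memory stand-in for a key-value store with expiry (not Redis)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: behave as if it was never stored
            return None
        return value

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)


def secret_lookup_key(user_id: int, registration_ts: str) -> str:
    # Assumption: the real key derivation differs; this only shows the idea.
    return hashlib.sha256(f"{user_id}:{registration_ts}".encode("utf-8")).hexdigest()


def get_user_secret(store: TtlStore, user_id: int, registration_ts: str) -> bytes:
    """Fetch the user's secret, generating and storing one if none is live."""
    key = secret_lookup_key(user_id, registration_ts)
    secret = store.get(key)
    if secret is None:
        secret = secrets.token_bytes(32)
        store.set(key, secret, SECRET_TTL_SECONDS)
    return secret


def hash_value(secret: bytes, value: str) -> str:
    # SHA-256 is a stand-in for the whirlpool algorithm used with hash_hmac.
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


store = TtlStore()
secret = get_user_secret(store, 42, "20190101000000")
# The same page hashes to the same value while the secret is alive.
print(hash_value(secret, "Main_Page") == hash_value(secret, "Main_Page"))
```

Once the stored secret expires, a fresh one is generated, so hashes produced before and after expiry cannot be linked to each other.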