This page documents technical information, procedures, processes, and other knowledge that Discovery's Analysts should be aware of. An Employee Operations Manual (EOM), if you will. For onboarding steps, please refer to the Wikimedia Discovery/Team/Analyst onboarding page on MediaWiki. Certain information has been withheld and is available as a supplement on the internal Office wiki. This EOM, its supplement, and the onboarding page provide all the necessary documentation and instructions for a newly hired Data Analyst in the Discovery Department.

For hiring new members of the team, refer to the write-up on the Wikimedia Blog.

Data Sources, Databases, and Datasets

Be sure to check out Discovery data access guidelines and data retention guidelines.

Data Sources

Web requests

  • Web requests are generated when a browser (or API client) navigates to a Wikimedia site.
  • There may then be subsequent requests for additional content, e.g. images and other data referenced by the page.
  • A subset of those web requests is considered "page views" based on the profile of the requests.
  • "Page views" is a metric that attempts to capture how many times a page has been viewed by a real human being.
  • Page views tools: https://tools.wmflabs.org/pageviews/

EventLogging

  • Almost all, but not *all*, EventLogging requires JavaScript to be enabled in the client.
  • It's often used to track user interactions with features to determine how well those features serve users' needs.
  • As you perform different actions, those actions fire events from the JavaScript engine in your browser, which are sent to our EventLogging servers.

Databases

MySQL

Once SSH'd to stat1002 ("stat2") or stat1003 ("stat3"), connect to the MySQL server with the following command: mysql -h analytics-store.eqiad.wmnet

log

Contains event logging tables, defined by schemas. The following tables are of particular interest (a sample query follows the list):

  • Search_* (Schema:Search) captures the events from the autocomplete text field in the top right corner of wikis on desktop.
  • MobileWikiAppSearch_* (Schema:MobileWikiAppSearch) captures events from people searching on the Wikipedia app on Android and iOS.
  • MobileWebSearch_* (Schema:MobileWebSearch) captures events from people searching on the *.m.* domains.
  • TestSearchSatisfaction2_* (Schema:TestSearchSatisfaction2) which allows us to track user sessions and derive certain metrics.
  • WikipediaPortal_* (Schema:WikipediaPortal) captures events from people going to the Wikipedia Portal (wikipedia.org)
  • PrefUpdate_* (Schema:PrefUpdate) contains users' preference change events (e.g. opting in to / out of a beta feature).
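
For example, one could pull a quick summary from one of these tables with the wmf package. This is a sketch: the _XXXXXXXX revision suffix is a placeholder for the schema's current revision ID, and the timestamp column is assumed to be the standard EventLogging timestamp; run SHOW TABLES in the log database to find the actual table names.

# Daily event counts from a hypothetical revision of the WikipediaPortal table:
portal_events <- wmf::mysql_read("
  SELECT LEFT(timestamp, 8) AS date, COUNT(*) AS events
  FROM WikipediaPortal_XXXXXXXX
  GROUP BY date;", "log")
head(portal_events)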

Hadoop Cluster

The cluster contains many databases, some of which we describe in this section. It holds a lot of data; so much, in fact, that we can only retain about a month of it at any point.

wmf.WebRequest

The WebRequest table in the wmf database contains refined request data stored in the Parquet column-based format (as opposed to the raw JSON imported directly from Kafka to wmf_raw.WebRequest). Refined means some fields (columns) are computed from the raw data using user-defined functions (UDFs) in Analytics' Refinery. For example, client_ip is computed from ip and x_forwarded_for to reveal the true IP of the request, and geocoded_data (country code, etc.) is computed from client_ip so we can easily write queries that fetch requests from a specific country.

Due to the volume of the data, the requests are written out to specific partitions indexed by webrequest_source (e.g. "text" or "misc"), year, month, day, and hour. It is important to include a WHERE webrequest_source = "__" AND year = YYYY AND etc. clause in your Hive query to avoid querying all partitions, which will take a very, very, VERY long time.
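
For example, here is a sketch of such a query run through the wmf package, touching only a single partition. The field names follow the description above; accessing the country via geocoded_data['country_code'] is an assumption worth checking against the table's schema.

# Count "text" webrequests by country for a single hour:
requests_by_country <- wmf::query_hive("
  SELECT geocoded_data['country_code'] AS country, COUNT(1) AS requests
  FROM wmf.webrequest
  WHERE webrequest_source = 'text'
    AND year = 2017 AND month = 6 AND day = 1 AND hour = 0
  GROUP BY geocoded_data['country_code'];")
head(requests_by_country)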

As of Summer 2017,[1] certain special pageviews have tags (for example, visits to www.wikipedia.org/ are tagged as "portal").

wmf_raw.CirrusSearchRequestSet

The CirrusSearchRequestSet table in the wmf_raw database contains raw JSON of Cirrus searches.

User-defined Functions

User-defined functions (UDFs) are custom functions written in Java that can be called in a Hive query to perform complex tasks that can't be done with the built-in HiveQL functions. To see examples of UDFs, look in Analytics' Refinery source (e.g. the IsPageview UDF).

To get started with writing your own UDFs, git clone https://gerrit.wikimedia.org/r/analytics/refinery/source, install maven (e.g. brew install maven on Mac OS X if you have Homebrew installed), and then package the codebase into a Java .jar file (via mvn package) that you can import into Hive:

ADD JAR /home/bearloga/Code/analytics-refinery-jars/refinery-hive.jar;
CREATE TEMPORARY FUNCTION my_udf AS 'org.wikimedia.analytics.refinery.hive.MyUDF';

USE [Database];
SELECT my_udf([Field]) AS [UDF-processed Field] FROM [Table]
  WHERE year = YYYY AND month = MM AND day = DD;

I have a clone of the analytics refinery source repository in /home/bearloga/Code, along with an update script scheduled via cron (12 20 * * * cd /home/bearloga/Code && bash update-refineries.sh). The script checks whether origin/master is ahead of the clone; if it is, it pulls the updates, runs mvn package, and copies the latest JARs to /home/bearloga/Code/analytics-refinery-jars so that our data collection scripts (as of T130083) always use the latest and greatest. The script is:

cd analytics-refinery-source

# Make sure the local origin/master ref is current before comparing against it:
git fetch origin

MERGES=$(git log HEAD..origin/master --oneline)
if [ ! -z "$MERGES" ]; then
  git pull origin master
  mvn package
  for refinery in {'core','tools','hive','camus','job','cassandra'}
  do
    # Grab the most recently built SNAPSHOT jar for this module...
    jar=$(ls -t "/home/bearloga/Code/analytics-refinery-source/refinery-${refinery}/target" | grep 'SNAPSHOT.jar' | head -1)
    # ...and copy it to the directory the data collection scripts point at:
    cp "/home/bearloga/Code/analytics-refinery-source/refinery-${refinery}/target/${jar}" "/home/bearloga/Code/analytics-refinery-jars/refinery-${refinery}.jar"
  done
fi

Public Datasets

Our golden (data) retriever codebase fetches data from the MySQL and Hadoop databases described above, does additional munging and tidying, and then writes the data out to the /a/published-datasets/discovery/[search|portal|maps|wdqs|external_traffic] directory on stat1002, which is rsynced to https://analytics.wikimedia.org/datasets/ (specifically the discovery directory). It is a collection of R scripts and SQL/HiveQL queries. All the scripts are executed daily via Reportupdater, which is scheduled as a crontab job (in Mikhail's bearloga account):

0 5 * * * cd /a/discovery/golden/ && sh main.sh >> /home/bearloga/discovery-golden.log 2>&1

We still need to get it Puppet-ized so it's not dependent on any staff account. Note that, in the meantime, any time golden is updated it needs to be git pull'd in /a/discovery/.
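
Because these datasets are public, they can also be read directly over HTTP from any machine. For example (the file name below is hypothetical; browse https://analytics.wikimedia.org/datasets/discovery/ for the actual files):

library(readr)
# Hypothetical file name, for illustration only:
search_counts <- read_tsv("https://analytics.wikimedia.org/datasets/discovery/example_search_counts.tsv")
head(search_counts)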

Workflow

Process Overview

Typically, tasks go through Product Manager approval before they go into our backlog (currently the "Analysis" column on the Discovery Phabricator board). During sprint planning meetings, we pull tasks from that column into our sprint board to work on during that sprint. Occasionally we sidestep that process and add emergency ("Unbreak now!") tasks directly to the sprint. For a more thorough description of the process at Discovery, see this article on MediaWiki.

Analysis with R

In the past, we've focused on doing our analyses in R (statistical analysis software and programming language). In this section we describe packages we've developed internally, as well as important packages we tend to use to accomplish our tasks. Remember to set proxies if you're going to be downloading packages from CRAN or GitHub:

Sys.setenv("http_proxy" = "http://webproxy.eqiad.wmnet:8080")
Sys.setenv("https_proxy" = "http://webproxy.eqiad.wmnet:8080")

Internal Codebases

  • polloi contains common functions used by the dashboards
  • golden isn't a package but contains all the data collection scripts that retrieve and aggregate data from the MySQL and Hadoop databases
  • wmf contains common functions used in analyses and data collection (e.g. querying Hive/MySQL)

By the way, all of our Gerrit-hosted repositories are mirrored to GitHub/wikimedia. So to install the wmf package (GitHub mirror), you could run one of the following:

devtools::install_git("https://gerrit.wikimedia.org/r/p/wikimedia/discovery/wmf.git")
devtools::install_github("wikimedia/wikimedia-discovery-wmf")

Common Packages

You can use the uaparser package for parsing user agents, but installing it is awful. Note that it is really slow and should only be used on data extracted from the MySQL event logs, since we already have a UA-parsing UDF and the refined webrequests already contain parsed UAs. To install it into your R library on stat1002:

# 'uaparser' requires C++11, and libyaml-cpp 0.3, boost-system, boost-regex C++ libraries
devtools::install_github("ua-parser/uap-r", configure.args = "-I/usr/include/yaml-cpp -I/usr/include/boost")

For installation on Mac OS X, refer to these instructions. Heads up that the current version of uaparser segfaults in RStudio[2] but works in Terminal.
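
Once it is installed, usage looks roughly like this (a sketch; parse_agents is the function referenced in the issue cited in the footnote, and the exact output columns may vary by version):

library(uaparser)
user_agents <- c(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
  "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E277 Safari/602.1"
)
parsed <- parse_agents(user_agents) # one row per user agent, with browser/OS/device fields
head(parsed)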

Statistical Computing

Sometimes we may need to run computationally intensive jobs (e.g. machine learning, MCMC) without having to worry about hogging stat100x. For tasks like that, we have a project on Wikimedia Labs called discovery-stats under which we can create 2-core, 4-core, and 8-core (with 16GB of RAM!) instances. The instances must be managed through Horizon and can be set up with the following shell script after SSH-ing in (ssh <LDAP username>@<instance name>.eqiad.wmflabs):

#!/bin/bash

sudo sh -c 'echo "deb http://cran.rstudio.com/bin/linux/debian jessie-cran3/" >> /etc/apt/sources.list'
sudo apt-key adv --keyserver keys.gnupg.net --recv-key 6212B7B7931C4BB16280BA1306F90DE5381BA480

sudo apt-get update --fix-missing && sudo apt-get -y upgrade
sudo apt-get -fy install gcc-4.8 g++-4.8 gfortran-4.8 make \
 libxml2-dev libssl-dev libcurl4-openssl-dev \
 libopenblas-dev libnlopt-dev libeigen3-dev libarmadillo-dev libboost-all-dev \
 liblapack-dev libmlpack-dev libdlib18 libdlib-dev libdlib-data \
 libgsl0ldbl gsl-bin libgsl0-dev \
 libcairo2-dev libyaml-cpp-dev \
 r-base r-base-dev r-recommended

sudo su - -c "R -e \"dotR <- file.path(Sys.getenv('HOME'), '.R'); \
if (!dir.exists(dotR)) { dir.create(dotR) }; \
M <- file.path(dotR, 'Makevars'); \
if (!file.exists(M)) { file.create(M) }; \
cat('\nCXXFLAGS+=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function', file = M, sep = '\n', append = TRUE); \
cat('\nCXXFLAGS+=-flto -ffat-lto-objects -Wno-unused-local-typedefs', file = M, sep = '\n', append = TRUE)\""

sudo su - -c "R -e \"install.packages(c('arm', 'bayesplot', 'bclust', 'betareg', 'bfast', 'BH', 'BMS', 'brms', 'bsts', 'C50', 'caret', 'coda', 'countrycode', 'data.table', 'data.tree', 'deepnet', 'devtools', 'e1071', 'ElemStatLearn', 'forecast', 'gbm', 'ggExtra', 'ggfortify', 'ggthemes', 'glmnet', 'Hmisc', 'import', 'inline', 'iptools', 'irlba', 'ISOcodes', 'knitr', 'lda', 'LearnBayes', 'lme4', 'magrittr', 'markdown', 'mclust', 'mcmc', 'MCMCpack', 'mcmcplots', 'mice', 'mlbench', 'mlr', 'nlme', 'nloptr', 'NLP', 'nnet', 'neuralnet', 'prettyunits', 'pROC', 'progress', 'quanteda', 'randomForest', 'randomForestSRC', 'Rcpp', 'RcppArmadillo', 'RcppDE', 'RcppDL', 'RcppEigen', 'RcppGSL', 'RcppParallel', 'reconstructr', 'rgeolocate', 'rstan', 'rstanarm', 'scales', 'sde', 'tidytext', 'tidyverse', 'tm', 'triebeard', 'urltools', 'viridis', 'xgboost', 'xtable', 'xts', 'zoo'), repos = c(CRAN = 'http://cran.rstudio.com/'))\""

This will install all the necessary software packages, libraries, and R packages at an instance-level. Additional R packages can then be installed at a user-level in the user's home dir.

PAWS Internal

We can perform data analyses using Jupyter notebooks (with R and Python kernels) via PAWS Internal.

Assuming you’ve got production access and SSH configured (see Discovery/Analytics on Office wiki for more examples of SSH configs), you need to create an SSH tunnel like you would if you wanted to query Analytics-Store on your local machine:

ssh -N notebook1003.eqiad.wmnet -L 8000:127.0.0.1:8000 # or notebook1004

Then navigate to localhost:8000 in your favorite browser and login with your LDAP credentials (username/password that you use to login to Wikitech).

By the way, if you want a quick way to get into PAWS Internal, you can make an alias (in, say, ~/.bash_profile) that creates the SSH tunnel and opens the browser:

alias PAWS="ssh -N notebook1001.eqiad.wmnet -L 8000:127.0.0.1:8000 & open http://localhost:8000/"

Then in terminal: PAWS

This will launch your default browser and output a numeric process ID. When you want to close the tunnel: kill [pid]

R in PAWS Internal

Sys.setenv("http_proxy" = "http://webproxy.eqiad.wmnet:8080")
Sys.setenv("https_proxy" = "http://webproxy.eqiad.wmnet:8080")
options(repos = c(CRAN = "https://cran.rstudio.com/"))

If you want, you can put those 3 lines in ~/.Rprofile and they will be executed every time you launch R. We need to install the devtools and Discovery’s wmf packages. If you want to work with user agents in R, I’ve included the command to install the uaparser package.

install.packages("devtools")
devtools::install_git("https://gerrit.wikimedia.org/r/wikimedia/discovery/wmf")

# uaparser:
devtools::install_github("ua-parser/uap-r", configure.args = "-I/usr/include/yaml-cpp -I/usr/include/boost")

MySQL in PAWS Internal with R

The mysql_connect function in wmf looks for some common MySQL config files (these vary between stat1002, stat1003, and notebook1001). It'll let you know if it encounters any problems, but I doubt you'll have any. Try querying with mysql_read:

log_tables <- wmf::mysql_read("SHOW TABLES;", "log")
head(log_tables)
Fetched 375 rows and 1 columns.
Tables_in_log
BannerImpression_5329872
CentralAuth_5690875
CentralNoticeBannerHistory_13447710
ChangesListFilters_15876023
ChangesListFilters_16174591
CommandInvocation_15237653

Hive in PAWS Internal with R

Since Hive has been configured on notebook1001, we don’t need to do anything extra to use the query_hive function in wmf:

wmf_tables <- wmf::query_hive("USE wmf; SHOW TABLES;")
head(wmf_tables)
tab_name
aqs_hourly
browser_general
last_access_uniques_daily
last_access_uniques_monthly
mediacounts
mediawiki_archive

Installing Python modules on PAWS Internal

Madhu said that the global version of pip is out of date and needs to be updated on a per-user basis.

Upgrading should get you to pip 8+, after which wheels (the new Python distribution format) will be installed instead of eggs.

You can update & install within the notebook, but if you prefer to do it in Terminal after SSH’ing to notebook1001.eqiad.wmnet, you can add the path to your ~/.bash_profile:

[[ -r ~/.bashrc ]] && . ~/.bashrc
export PATH=${PATH}:~/venv/bin
export http_proxy=http://webproxy.eqiad.wmnet:8080
export https_proxy=http://webproxy.eqiad.wmnet:8080

Then you can use and upgrade pip:

!pip install --upgrade pip
Downloading/unpacking pip from https://pypi.python.org/packages/b6/ac/7015eb97dc749283ffdec1c3a88ddb8ae03b8fad0f0e611408f196358da3/pip-9.0.1-py2.py3-none-any.whl#md5=297dbd16ef53bcef0447d245815f5144
  Downloading pip-9.0.1-py2.py3-none-any.whl (1.3MB): 1.3MB downloaded
Installing collected packages: pip
  Found existing installation: pip 1.5.6
    Uninstalling pip:
      Successfully uninstalled pip
Successfully installed pip
Cleaning up...

Then we can install (for example):

  • Data
    • Pandas for data structures and analysis, plus pandas-datareader for pulling in remote data
    • Requests and Beautiful Soup 4 for fetching and parsing web pages
    • Feather format for fast data frame serialization (interoperable with R)
  • Visualization
    • Seaborn for statistical graphics
    • Bokeh for interactive visualizations
  • Statistical Modeling and Machine Learning
    • StatsModels for statistical analysis
    • Scikit-Learn for machine learning
    • PyStan interface to Stan probabilistic programming language for Bayesian inference
    • PyMC3 for Bayesian modeling and probabilistic machine learning
    • Patsy for describing statistical models (especially linear models, or models that have a linear component) and building design matrices. (Patsy brings the convenience of R “formulas” to Python)
    • TensorFlow for machine learning using data flow graphs
    • Edward for probabilistic modeling, inference, and criticism
pip install \
    pandas pandas-datareader requests beautifulsoup4 feather-format \
    seaborn bokeh \
    statsmodels scikit-learn pystan pymc3 patsy

Warning: TensorFlow v0.12.0 and 0.12.1 broke compatibility with Edward. Use at most TensorFlow v0.11.0 for now:

export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.11.0-cp34-cp34m-linux_x86_64.whl
pip install $TF_BINARY_URL
pip install edward

This nifty command to update installed Python modules comes to us courtesy of rbp at Stack Overflow:

pip freeze --local | grep -v '^\-e' | cut -d = -f 1  | xargs -n1 pip install -U

Event Logging

MediaWiki-powered websites use the EventLogging extension, while the Wikipedia Portal (wikipedia.org) has a "lite" version with components copied from the MW extension.

Search Satisfaction

The TSS2 schema is implemented in searchSatisfaction.js in the Wikimedia Events repo. Users have different probabilities of being enrolled into search satisfaction event logging depending on which wiki they're on. When somebody searches, they are assigned a searchSessionID that expires after 10 minutes of no searching. It persists across multiple searches because it's stored in the browser's local storage (same with the Wikipedia Portal session ID). Once it expires, the user gets a "rejected" token that lasts for 20 minutes, preventing them from being immediately entered into search satisfaction event logging again.

To enable Erik's awesome debugging (via change 270798), run the following in the JS console:

mw.loader.using('mediawiki.api').then(function () {
  new mw.Api().saveOption('eventlogging-display-web', '1');
});

This will show all events (and their data) as they are sent. To force yourself into Search Satisfaction, refer to change 335160.

For users enrolled in EL, links on the SERP should have a ?wprov= parameter appended to the query string so that the visit-page and check-in events start firing after the visited article loads. See Provenance for the list of acceptable values. It probably shouldn't be unique. For example, when you click a search result, the wprov attached is srpw1_4, which indicates that you clicked the 4th result on a search result page from the web. Basically, it passes only the information needed for the JS to pick things back up on page load: having srpw1_ lets the code know you came from a search result page, and the 4 at the end is used in the 'visitPage' event. Say we made it "srpw1_es" for explore similar links. The code currently looks for 'srpw1_' at the start of the wprov, so that would still trigger given how the initFromWprov function is implemented; however, since `es` doesn't parse to an int, the resultPosition will be NaN and cameFromSearch will be set to false.
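
As a rough illustration of that logic, here is a sketch in R (not the actual JavaScript; parse_wprov is a made-up name for this example):

# Sketch of the wprov parsing described above:
parse_wprov <- function(wprov) {
  came_from_search <- grepl("^srpw1_", wprov)
  suffix <- sub("^srpw1_", "", wprov)
  # A non-numeric suffix (e.g. "es") yields NA here, analogous to NaN in the JS:
  result_position <- suppressWarnings(as.integer(suffix))
  if (came_from_search && is.na(result_position)) came_from_search <- FALSE
  list(cameFromSearch = came_from_search, resultPosition = result_position)
}
parse_wprov("srpw1_4")  # cameFromSearch = TRUE, resultPosition = 4
parse_wprov("srpw1_es") # cameFromSearch = FALSE, resultPosition = NA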

To see events (e.g. clicks) coming in in real time:

kafkacat -b kafka1012:9092 -C -t eventlogging_TestSearchSatisfaction2 2>/dev/null | grep click

With kafkacat you can choose the offset: you can start at the oldest offset, which goes back 7 days (our retention period), or at the latest offset, which is basically real time. By default there's no offset and it prints the events as they arrive.

Wikipedia.org Portal

Overview

The relevant JavaScript files are located in dev/wikipedia.org/assets/js.

Development

Refer to CONTRIBUTING.md for detailed instructions on contributing to the Wikimedia Portals repo. Here are the instructions (assuming you're using a Mac):

First, install Homebrew if you don't have it already. The repo has a few dependencies on packages that need to be installed globally, like Casper.js and Python v2.7 (the default version on Mac).

# Run the following after installing node.js 0.12.7:
brew update && brew upgrade && brew install npm casperjs

# Repository and webserver set up:
sudo apachectl start
cd /Library/WebServer/Documents
# (give yourself R+W permissions on Documents)
git clone ssh://bearloga@gerrit.wikimedia.org:29418/wikimedia/portals
cd portals

# Set NPM to use Python 2.7
npm config set python python2.7
# Install NPM modules:
npm install

# Do not use: `npm install --python=python2.7 node-gyp postcss cssnext handlebars imagemin jshint jscs lwip sprity-lwip sprity gulp`

git checkout -b patch_nickname

gulp watch --portal wikipedia.org
# ^ Watches for changes in dev/wikipedia.org/ and generates an index.html file at dev/wikipedia.org/index.html

# ...coding...

# Test and debug by browsing to: http://localhost/portals/dev/wikipedia.org/

# Generate the production version with minified JS & CSS assets:
gulp --portal wikipedia.org
# Test the production version by browsing to: http://localhost/portals/prod/wikipedia.org/

git commit -a -m "message"
# git commit --amend # detailed patch notes
git review

There are some sort-of unit tests in the tests/ folder which can be run using npm.

Dashboarding

As of 16 June 2017, our team maintains 5 dashboards (Search Metrics, Portal Metrics, WDQS Metrics, Maps Metrics, and External Referral Metrics). Each dashboard has its own repository on Gerrit; the links to the repositories can be found on the Discovery Dashboards homepage. The dashboards are powered by Shiny, a web-development framework written in R, run on Shiny Server, and are maintained via Puppet on a Labs instance. We've documented most of the dashboarding process in the Building a Shiny Dashboard article on Wikitech. To ensure a unique but uniform look, we might use the shinythemes package for the dashboard's overall appearance.

Discovery Dashboards

As of June 2017,[3] there is a Shiny Server Puppet module that is used by our Puppet-configured dashboards.

The dashboards live on the discovery-production (discovery.wmflabs.org) and discovery-testing (discovery-beta.wmflabs.org) instances on Labs, and deploying new versions of the dashboards is different between beta and production. The production instance uses the discovery::dashboards role which uses the discovery_dashboards::production profile (which pulls each dashboard from its "master" branch). The beta testing instance uses the discovery::beta_dashboards role which uses the discovery_dashboards::development profile (which pulls each dashboard from its "develop" branch). Both profiles use the discovery_dashboards::base profile which uses the Shiny Server module and installs R packages used by the dashboards.

Remember that deploying to the production server should not be done lightly. Product Managers and Leads use the dashboards to make data-informed decisions, so it's really important that they are stable and always up. Submit major patches (e.g. new features or refactorings) for CR to the "develop" branch; they go live on the beta testing instance when merged. Once ready to release to production, merge into the "master" branch.

New Labs Instances

  1. Spin up an instance: Wikitech → Manage Instances → Select "shiny-r" as the project → Set filter → Add instance (e.g. discovery-production.eqiad.wmflabs)
  2. Create a proxy: Wikitech → Manage Web Proxies → Select "shiny-r" as the project → Set filter → Create proxy (e.g. discovery.wmflabs.org)

Shiny/Dashboarding Resources

Research and Testing

We prefer to upload our analysis codebases and report sources (we mostly write our reports in RMarkdown and compile/knit them into PDFs to be uploaded to Wikimedia Commons) to GitHub where we have the wikimedia-research organization. We use the following naming convention for the repositories: Discovery-[Research|Search|Portal|WDQS]-[Test|Adhoc]-NameOrDescription.

A/B Tests and Experiments

Our past and current A/B tests – and process guidelines for future tests – are documented on Discovery's Testing page. We perform the analyses of clickthroughs using our BCDA R package, among others. We prefer the Bayesian approach because traditional null hypothesis significance testing methods do not work well with the volume of data we usually generate from our A/B tests.
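
For intuition, here is a minimal sketch of the kind of Bayesian comparison involved, written in plain R with a Beta-Binomial model (an illustration with made-up numbers, not BCDA's actual interface):

# Hypothetical clickthroughs out of search sessions for two test groups:
clicks   <- c(a = 5200, b = 5650)
sessions <- c(a = 20000, b = 20000)
# Beta(1, 1) priors updated with the observed data give Beta posteriors:
posterior_a <- rbeta(1e5, 1 + clicks["a"], 1 + sessions["a"] - clicks["a"])
posterior_b <- rbeta(1e5, 1 + clicks["b"], 1 + sessions["b"] - clicks["b"])
# Posterior probability that group B's clickthrough rate is higher than group A's:
mean(posterior_b > posterior_a)
# 95% credible interval for the difference in clickthrough rates:
quantile(posterior_b - posterior_a, c(0.025, 0.975))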

Forecasting Usage

We have an ongoing research project to forecast usage volume. A prototype dashboard for this endeavor is live on the experimental instance.

Miscellaneous

Responsibilities

Monthly Metrics

We try to keep Discovery's KPI table on the Wikimedia Product page updated with the previous month's metrics, specifically before the 15th of the month. To that end, we created the Monthly Metrics module on the Search Metrics dashboard, which allows us to very quickly fill out the table with the latest numbers. The table also has sparklines, which can be generated from scratch using the following R code:

library(magrittr)

devtools::source_url("https://raw.githubusercontent.com/wikimedia/wikimedia-discovery-rainbow/develop/utils.R")

read_desktop()
read_apps()
read_web()
read_api()
read_failures()
read_augmented_clickthrough()

smoothed_load_times <- list(
    Desktop = desktop_load_data,
    Mobile = mobile_load_data,
    Android = android_load_data,
    iOS = ios_load_data
  ) %>%
  dplyr::bind_rows(.id = "platform") %>%
  dplyr::group_by(date) %>%
  dplyr::summarize(Median = median(Median)) %>%
  polloi::smoother("month", rename = FALSE) %>%
  dplyr::rename(value = Median)
smoothed_zrr <- polloi::smoother(failure_data_with_automata, "month", rename = FALSE) %>%
  dplyr::rename(value = rate)
smoothed_api <- split_dataset %>%
  dplyr::bind_rows(.id = "api") %>%
  dplyr::filter(referrer == "All") %>%
  dplyr::group_by(date) %>%
  dplyr::summarize(total = sum(calls)) %>%
  polloi::smoother("month", rename = FALSE) %>%
  dplyr::rename(value = total)
smoothed_engagement <- augmented_clickthroughs %>%
  dplyr::select(c(date, `User engagement`)) %>%
  polloi::smoother("month", rename = FALSE) %>%
  dplyr::rename(value = `User engagement`)

smoothed_data <- dplyr::bind_rows(list(
  `user engagement` = smoothed_engagement,
  `zero rate` = smoothed_zrr,
  `api usage` = smoothed_api,
  `load times` = smoothed_load_times
), .id = "KPI") %>%
  dplyr::arrange(date, KPI) %>%
  dplyr::distinct(KPI, date, .keep_all = TRUE) %>%
  tidyr::spread(KPI, value, fill = NA) %>%
  dplyr::filter(lubridate::mday(date) == 1) %>%
  dplyr::mutate(unix_time = as.numeric(as.POSIXct(date))) %>%
  dplyr::filter(date < lubridate::floor_date(Sys.Date(), "month"))

sparkline <- function(dates, values) {
  unix_time <- as.numeric(as.POSIXct(dates)) # "x values should be Unix epoch timestamps"
  if (any(is.na(head(values, 10)))) {
    offset <- max(which(is.na(values)))
    unix_time <- unix_time[(offset + 1):length(values)]
    values <- values[(offset + 1):length(values)]
  }
  # mw:Template:Sparkline supports up to 24 values
  unix_time <- tail(unix_time, 24); values <- tail(values, 24)
  points <- paste0("x", 1:length(unix_time), " = ", unix_time, " | ", "y", 1:length(unix_time), " = ", values, "|")
  return(paste0(c("|{{Sparkline|", paste0(points, collapse = "\n"), "}}"), collapse = "\n"))
}

sparkline(smoothed_data$date, smoothed_data$`user engagement`) %>% cat("\n")
sparkline(smoothed_data$date, smoothed_data$`zero rate`) %>% cat("\n")
sparkline(smoothed_data$date, smoothed_data$`api usage`) %>% cat("\n")
sparkline(smoothed_data$date, smoothed_data$`load times`) %>% cat("\n")

This generates wiki markup for inserting Sparklines into tables.

Explanations

MediaWiki Train

There may be times when you need to patch MediaWiki Core or an extension, so it is helpful to know about the MediaWiki train. Many years ago everybody deployed their own changes in an ad-hoc manner and had to figure out how to do what and when. Then Release Engineering (RE) was created, and they now do weekly releases. Releases are created by branching off master for every MW extension and for Core every Tuesday, depending on when one of the RE engineers manually runs the script that creates the appropriate branches and submodules. The deployment is gradual, by group:

  • Group 0 (test wiki) - deployed on Tuesday
  • Group 1 (Wikimedia wikis that are not Wikipedia -- although maybe there are 2 Wikipedias in this group) - deployed on Wednesday
  • Group 2 (Wikipedia) - deployed on Thursday

RE monitors error rates with each group as the release branch is deployed. Use https://noc.wikimedia.org/conf/ to check: if the page shows one version, the changes have rolled out; more than one version means the rollout is in progress. SWAT ("setting wikis ablaze team") deployments should only be used for fixing problems with the new branch (bugs that didn't get caught before merge), not for new features; SWAT deployments include ticket numbers in the SAL (server admin log). If there is a problem with the deployment, the train is halted and we cherry-pick into the next train or a SWAT. A new branch is created every week (for consistency) even if it never gets deployed (because of problems with the train, or because everyone is at Wikimania or an offsite).

Visit the Special:Version page on any wiki. "MW 1.30 wmf-13" means the 13th week of MediaWiki 1.30's development before its release to the world (when it's released as a static thing to third-party users); then, after about wmf-26 (there are roughly 2 releases per year), we move on to "MW 1.31 wmf-1".

For more information, refer to this page on deployment and this page on how to deploy code.

Style Guide

This section provides tips for consistent and efficient code.

Programming Style

R code should follow Hadley Wickham's style guide (based on Google's style guide for R). The following are some of our additional suggestions:

# Okay, but could be better:
foo <- function(x) {
  if (x > 0) {
    return("positive")
  } else {
    return("negative")
  }
}

# Better:
foo <- function(x) {
  if (x > 0) {
    return("positive")
  }
  return("negative")
}

This is because when x > 0, return("positive") exits the function, so return("negative") is never reached anyway and the else is unnecessary.

x <- TRUE

# Unnecessary comparison:
if ( x == TRUE ) return("x is true")

# Efficient:
if (x) return("x is true")

Same with ifelse(<condition>, TRUE, FALSE) :P
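
For example, the following are equivalent, and the second form is simpler:

# Unnecessary:
ifelse(x > 0, TRUE, FALSE)

# Efficient:
x > 0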

Syntax Checking

RStudio IDE

If using RStudio IDE, check the following global options:

  • In Editing:
    • Insert spaces for tabs (with tab width of 2)
  • In Saving:
    • Ensure that source files end with newline
    • Strip trailing horizontal whitespace when saving

Also install the lintr package, which has support for viewing errors and warnings in the RStudio IDE.
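
For example, you can lint a script or an entire package from the R console (the file name here is just an illustration):

lintr::lint("my_analysis.R") # lint a single file
lintr::lint_package()        # lint every file in the current package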

Sublime Text

There are some more steps in getting lintr up and running in Sublime Text:

  1. Install Lua (e.g. via Homebrew: brew install lua)
  2. Install luacheck: luarocks install luacheck
  3. Install SublimeLinter via Package Control in Sublime Text
  4. Install SublimeLinter-contrib-lintr via Package Control

Version Control

Version control is very important on this team. All our codebases use version control, which allows us to track changes, collaborate without clashing, and review each other's work before it is deployed. Git is a tool that tracks changes to your code and shares those changes with others. You no longer have to email files back and forth, or fight over who's editing which file in Dropbox. Instead, you can work independently, and trust Git to combine (aka merge) your work. Git allows you to go back in time to before you made that horrific mistake. You can replay history to see exactly what you did, and track a bug back to the moment of its creation. If you haven't used Git before but have 15 minutes and want to learn it, try this interactive lesson inside your web browser.

Gerrit

Code pipeline: commit (refer to these commit message guidelines) → review (make sure to install git-review)

Resources for learning how to use Gerrit

Someone on the team (senior analyst?) should add themselves to the wikimedia/discovery/* section on the Git/Reviewers page so that they are automatically added as a reviewer to all wm/discovery repositories.

GitHub

Code pipeline after forking or branching: commit → push → pull request.

If somebody else owns the repository ("repo"), you fork to have your own copy of the repo that you can experiment with. Once you are ready to submit a commit for review (and potentially deployment), you create a pull request that allows the owner to accept or reject your proposed changes.

Resources for learning how to use GitHub

Resources

Free/open books:

Citations

  1. "⚓ T164021 Create tagging udf". phabricator.wikimedia.org. Retrieved 2017-06-16. 
  2. "parse_agents crashes RStudio · Issue #8 · ua-parser/uap-r". GitHub. Retrieved 2017-06-16. 
  3. "⚓ T161354 [Dashboards] Migrate from Vagrant to Puppet config". phabricator.wikimedia.org. Retrieved 2017-06-16.