Abstract Wikipedia/Wikifunctions performance product narrative

Wikifunctions introduces a new mode of knowledge representation to the Wikimedia ecosystem. Interactive web applications are not novel in themselves, and users expect a reliably fast, responsive interface in order to stay engaged. At the same time, Wikifunctions runs on Wikimedia infrastructure, which has finite resources, so the system must make smart resource-allocation decisions. Both expectations shape the performance characteristics of Wikifunctions.

This document outlines, in the context of Wikifunctions, why performance matters, the key challenge and solution themes, and the delivery approach. As Wikifunctions evolves and new use cases and user requirements come up, this document should evolve accordingly.

About Wikifunctions

This section provides the minimum context needed for the performance-specific sections that follow, and makes this a readable standalone document. See the inline reference links for more details.

What is Wikifunctions

Wikifunctions is a platform on which anyone can collaboratively create, maintain, access, and use an open-source library of computer functions.[1]

What is a function, and how is it represented on Wikifunctions

A computer function (“function”) takes some input, performs computation on the input according to user-defined logic, and returns some output.[2]

On Wikifunctions, every function is encapsulated in its own wiki page. Beyond the typical wiki community-management content, a complete function wiki page includes:

  1. a description of the function, such as its required input and output formats
  2. the user-defined computation logic for the function, in the form of user-written code
  3. test cases for the function[3]
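To make these three parts concrete, here is a minimal, hypothetical sketch of what a function page's parts might look like if expressed in Python. The function, its name, and its test cases below are illustrative inventions, not actual Wikifunctions content:

```python
# Illustrative only: a hypothetical "uppercase" function page, in Python.

# 1) Description: takes a string as input; returns the same string in upper case.

# 2) User-defined computation logic (the implementation).
def uppercase(text: str) -> str:
    """Return `text` with all characters converted to upper case."""
    return text.upper()

# 3) Test cases: input/output pairs the implementation must satisfy.
test_cases = [
    ("hello", "HELLO"),
    ("Wiki", "WIKI"),
    ("", ""),
]

for given, expected in test_cases:
    assert uppercase(given) == expected
```

On a real function page, each of these parts lives in its own section of the page rather than in a single script.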

Wikifunctions as a part of the MediaWiki tech stack

To handle function code execution, two new Kubernetes services, the orchestrator and the evaluator, have been introduced; they are integrated into MediaWiki through an extension called WikiLambda. Function input and output values are currently treated as ephemeral, though new storage solution(s) may be introduced to cache function outputs, or persist metadata for function calls.
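The division of labor can be sketched roughly as follows. The function and class names here are assumptions for exposition, not the actual service interfaces: the orchestrator receives a function call, routes the implementation to an evaluator for its programming language, and returns the result without persisting it.

```python
# Illustrative sketch of the orchestrator/evaluator split; not the real interfaces.

def python_evaluator(code: str, arg):
    """Stand-in for the Python evaluator: executes user-written code and
    applies it to the argument. In production this runs in an isolated
    Kubernetes service, not in-process."""
    namespace = {}
    exec(code, namespace)
    return namespace["run"](arg)

EVALUATORS = {"python": python_evaluator}

def orchestrate(language: str, code: str, arg):
    """Stand-in for the orchestrator: routes the call to the evaluator for
    the implementation's language. Input and output are ephemeral -- nothing
    is stored here, though an output cache could be introduced at this layer."""
    evaluator = EVALUATORS[language]
    return evaluator(code, arg)

result = orchestrate("python", "def run(x):\n    return x * 2", 21)
```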

Ways to use Wikifunctions functions

Wikifunctions functions can be invoked interactively through the Wikifunctions web UI at wikifunctions.org. Other interfaces to Wikifunctions include an HTTP API and a command-line interface for system developers. Users can also export functions’ source code to use elsewhere.[4] When functions are used directly through wikifunctions.org, the request flow is illustrated here.[5]
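As a sketch of what an HTTP API call might involve, the snippet below constructs (but does not send) a request. The action and parameter names are assumptions for illustration; consult the live API documentation for the real ones. The payload is a ZObject representing a function call, here assumed to call Z801 ("Echo"), which returns its argument:

```python
import json
from urllib.parse import urlencode

# A ZObject for a function call (assumed shape, for illustration):
# Z1K1 = the object's type (Z7, "function call"),
# Z7K1 = the function to call, Z801K1 = that function's first argument.
function_call = {
    "Z1K1": "Z7",
    "Z7K1": "Z801",
    "Z801K1": "Hello, Wikifunctions",
}

# Hypothetical request construction against the site's API endpoint.
params = {
    "action": "wikifunctions_run",
    "format": "json",
    "function_call": json.dumps(function_call),
}
url = "https://www.wikifunctions.org/w/api.php?" + urlencode(params)
```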

About Performance for Wikifunctions

Challenge

Wikifunctions is different from Wikipedia and other Wikimedia projects because its primary unit of content is user-written executable code, paired with a web-based code execution environment. These characteristics mean that Wikifunctions has unique performance needs: it must ensure not only a positive user experience but also an operationally manageable system.

User personas

Two key user personas have been identified to drive the performance needs of Wikifunctions. The first persona is a Wikimedia Foundation (WMF) system administrator responsible for managing the Wikifunctions system. The second persona is a user who accesses wikifunctions.org to invoke functions. Their corresponding user stories are captured as follows.

As a WMF system administrator, I want to…  

  1. access current and historical information and patterns of how the orchestrator and evaluator services are performing / have performed, using the tools I am familiar with, so that I can enable my team to make better decisions about resource requirements, capacity planning, configuration changes, and optimization opportunities
  2. be confident that orchestrator and evaluator services emit logging and monitoring signals in a way that complies with established Foundation standards, so that I can enable my team to manage Wikifunctions services in a consistent manner
  3. be alerted as soon as possible when the orchestrator or evaluator service has failed or is in a degraded state, so that I can enable my team to respond to restore the failed or degraded service
  4. understand, in the event of a service failure or degradation, what events as captured in logs led up to that state, so that I can enable my team to conduct root cause analysis and prevent or mitigate similar issues in the future
  5. be able to limit the amount of system resources that any single user can utilize, so that I can ensure the system is resilient to abuse, and provide an equitable level of service to all users of the system
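The last user story, per-user resource limits, can be illustrated with a token-bucket rate limiter. This is a generic sketch of the technique, not a description of how Wikifunctions actually enforces limits:

```python
import time

class TokenBucket:
    """Allow at most `rate` requests per second, with bursts up to
    `capacity`. Each request spends one token; tokens refill over time."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens accrued since the last check, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: reject or queue the request

# One bucket per user keeps any single user from exhausting the system.
buckets = {}
def allow_request(user: str) -> bool:
    bucket = buckets.setdefault(user, TokenBucket(rate=1.0, capacity=5))
    return bucket.allow()
```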

User stories for the second persona can be found in the List of components and pages that need to be available.

Themes

Three key themes are distilled from the user stories, and are used to drive feature identification, refinement, and prioritization in the performance roadmap.

  • Wikifunctions is observable
    System administrators need a comprehensive view into how the system is behaving in order to identify failure modes or other undesirable behaviors. This view is represented by an aggregation of signals from the request, service (orchestrator and evaluator), and system levels. Additionally, administrators need the ability to clearly understand, explore, and drill down into specific behaviors or signals to make informed decisions about what to do next.
    Expected outcomes: Better and faster failure diagnosis, detection, and recoverability. Better planning and decision making about capacity needs, configuration needs, optimization needs, architecture evolution needs.
  • Wikifunctions is efficient
    Computing resources are limited. Additionally, Wikimedia's technical staff is small compared to its traffic and scale. System administrators need to ensure that limited resources are used efficiently and effectively, so that more tasks can be done with the same amount of resources, or the same tasks can be done with fewer resources.[6] A variety of work can contribute to efficiency, including profiling and tracing bottlenecks in code, identifying and restructuring resource-hungry components, implementing caching, and triggering system actions such as autoscaling.
    Expected outcomes: Better allocation of computing resources. Less waste. Speed gains.
  • Wikifunctions is stable
    Both system administrators and function users on wikifunctions.org need to be able to rely on a system that not only behaves in predictable ways, but also responds gracefully to temporary stress such as request overload.  
    Expected outcomes: Less time spent in responding to system or component degradation. Better user adoption and lower churn.
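The observability and stability themes connect in practice: a degraded state is derived from aggregated signals. The sketch below illustrates that idea with made-up signal names, numbers, and thresholds, not Wikifunctions' actual metrics:

```python
import statistics

# Each record is one orchestrator request: (latency in ms, succeeded?).
# Values and thresholds are illustrative only.
requests = [(120, True), (95, True), (3400, False), (110, True), (105, True)]

latencies = [ms for ms, _ in requests]
signals = {
    "p50_latency_ms": statistics.median(latencies),
    "max_latency_ms": max(latencies),
    "error_rate": sum(1 for _, ok in requests if not ok) / len(requests),
}

# A degraded state (stability) is derived from aggregated signals
# (observability) and would feed an alert to system administrators.
degraded = signals["error_rate"] > 0.05 or signals["max_latency_ms"] > 1000
```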

Delivery

A note about approach.

In delivering an observable, efficient, and stable Wikifunctions system, we take an evolutionary approach that starts with the foundational building blocks at the initial stage, while providing room for iterative development in each stage. This approach is captured from two angles.

  • Measure → Analyze → Optimize
    To be able to do something about performance, we first need to see the performance signals. Examples of such signals include request latency, wall time, and cache hit ratio. Instrumenting the system to collect and surface performance signals is key to the measure stage. From the pool of signals, we ask what is inefficient. Establishing benchmarks and parsing out how the system is spending its resources relative to the benchmark (and why) are key to the analyze stage. This stage can also reveal that not enough information has been collected to conduct the analysis. Using the analysis results, we then decide what to tune, and how, to optimize performance. Optimizations can be code improvements, cutting clutter such as obsolete metrics, or rebalancing the allocation of system resources to specific processes. This is the optimize stage.
    The stages do not always represent a linear progression. Not only do the outputs of each stage inform the others (e.g. optimization prunes noisy signals collected during measurement); it is also often necessary to revisit the assumptions and hypotheses in each stage as new user requirements emerge.
  • Manual → Automate
    Automation is an optimization opportunity. When implemented well, automation reduces performance woes by lowering operational burden and improving system and team scalability. Deciding what to automate requires thoughtful analysis of which candidates, whether processes or something else, would return the most value for the time and resources spent on automation. Manually setting up the plumbing, having teams manually inspect performance signals for usefulness and accuracy, and recognizing places where human-in-the-loop diligence should be kept to provide equitable outcomes are essential starter activities.
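The measure → analyze → optimize loop can be made concrete with one of the example signals, cache hit ratio. The counts and the benchmark below are invented for illustration:

```python
# Measure: instrument the cache to count hits and misses (illustrative counts).
cache_hits, cache_misses = 870, 1130

# Analyze: compare the observed signal against a benchmark.
hit_ratio = cache_hits / (cache_hits + cache_misses)
benchmark = 0.80  # illustrative target, not a real Wikifunctions benchmark

# Optimize: the analysis result drives a tuning decision, e.g. caching more
# function outputs or allocating more memory to the cache.
needs_tuning = hit_ratio < benchmark
```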

Notes
