Abstract Wikipedia/First evaluation engine

Where and how should our first evaluation engine be deployed and how should it be architected?

Eventually, Wikifunctions aims to have a healthy ecosystem of interoperable evaluation engines. We will provide a well-defined description of what an evaluation engine MUST and SHOULD do, and we hope that there will be evaluation engines running in many different contexts: from small embedded evaluation engines living in native apps that want to use certain functions, through evaluation engines running as mobile apps or local servers, to powerful evaluation engines running in the cloud or on a distributed peer-to-peer network. The Wikimedia Foundation will need to run one or more evaluation engines in order to provide the functionality that the Wikimedia projects will rely on. We will think about these and many others at another time.

For now we need to figure out where our very first evaluation engine should live. While it would be great to have all of these implemented eventually, we need to be realistic about our resources and should focus on only one of them first. There seem to be three main possibilities to consider for the first implementation:

Embedded in the WikiLambda extension.
A standalone service.
We evaluate everything in the reader’s browser.

We need to decide this before we can start work on Phase δ, where we will implement this evaluation engine. Here we discuss pros and cons of the three main approaches and also discuss possible options within the approaches.

Main options

WikiLambda extension

The WikiLambda extension itself not only provides the wiki extension to edit and maintain the ZObjects on wiki, but also embeds an evaluation engine. The extension exposes the evaluation engine through an API to the world and to its internal uses.

Pros:

Deployment of the WikiLambda extension is easy, as we don’t have to deploy a secondary stack for the evaluation engine. (Even if we don’t go this route now, it would make sense to have an embedded evaluation engine in the extension in order to allow easy deployment, which also makes it easier for people to contribute to the code base.)
We already started the creation of objects with certain capabilities in the PHP code, and can continue building on top of that.
It is by far the fastest and easiest way to access all the other ZObjects that one would need to evaluate a given call.
As we want the extension to call contributed functions living on the wiki to provide some of its experiences, having the evaluation engine immediately at hand is by far the most convenient solution with the least overhead.
Can reuse everything that already exists in MediaWiki to provide an API, e.g. authentication, tokens, etc.

Cons:

From a security standpoint, it is high risk, as in deployment this would run on the main cluster. If someone manages to break out of the embedded system, they would have direct access to the production databases, as the wiki has access to the respective credentials.
From a security standpoint, it can also be a vector for an intentional or even unintentional DoS attack, as expensive evaluations would run on, and could lock up, production instances of MediaWiki.
It is hard to sandbox the evaluation inside the extension as it all lives in the same PHP code.
Need to implement monitoring, time and memory constraints all within MediaWiki.

Standalone service

We develop a standalone service which can be called via REST to evaluate function calls. The service is used both by external and internal clients.

Pros:

Can develop everything from scratch. There are no constraints from MediaWiki; the service can be as lean as the stack we build on allows.
The service can be sandboxed, monitored, and time- and memory-budgeted as a whole.
Can build a holistic and uniform approach to caching within the evaluation engine.
Can scale easily by running more services on more boxes. This even allows an architecture where a single server call may decide to call other servers and thereby parallelize parts of the call, something which is hard to implement in the other two approaches discussed here.

Cons:

Have to develop everything from scratch. Plenty of potential for bikeshedding regarding implementation language, stack, and deployment approach.
This also means many more opportunities to introduce new bugs.
Need to figure out how to start a new production-strength service for launch (but can run on WMFCloud or other infrastructure until then).
Needs to read a lot from the DB, so it is important that this is fast. However, it only needs read-only access.
Incurs cost on anyone who deploys and wants to work on the extension, as they need to set up the service as well.

Browser

The function evaluations do not have to run on Wikimedia infrastructure at all, but might all run in the user’s browsers.

Pros:

Easy to deploy.
Hard to cause an adverse effect on the serving infrastructure.
No need to constrain resources for readers, as it is their own resources.

Cons:

Have to be very careful not to allow attacks against readers.
Will likely lead to a sluggish experience on many clients, making this accessible only to users with better equipment, which runs counter to our goals.
The evaluation engine might need to read a lot from the DB, with long round-trip times since the browser has to access the wiki.
No way to call the code server-side (unless we develop it in parallel on, say, Node so that it can run both in the browser and in the backend, but that is basically developing and maintaining two solutions, even if they share some code).
Opportunity for caching is severely limited, especially across users, where huge benefits are expected.
Not search-engine friendly.
The result cannot be processed further in wikitext (such as {{Str left|{{#lambda:xxx}}|123}}).

Architecture of the standalone service

We are currently leaning towards developing the first evaluation engine as a standalone service. Deploying it as part of MediaWiki is viewed as having too many potential security and performance risks, as does running it in the browser. We hope to revisit the decision about the browser at a later point.

Ideally the evaluation engine can be spun up many times and orchestrated as a stateless service. The service is 'read-only', modulo caching and monitoring.

The evaluation engine exposes an API over HTTP that receives a ZObject as the input and returns a ZObject as the output. The returned ZObject is the result of evaluating the incoming ZObject until a fixpoint is reached. This is most interesting for function calls.

The service will dramatically benefit from caching. There are (at least) three levels of caching regarding the evaluation engine:

HTTP cache: caching the whole API call to the evaluation engine at the level of HTTP calls. If the same call is made, the same result is returned.
intermediate results cache: when the evaluation engine evaluates a function call, it should check whether it has the given function call in its cache, and return that result instead. This would benefit from all evaluation engines that are spun up in parallel being able to use the same cache.
content cache: references are looked up in the wiki. This might also be pushed from the wiki, particularly if we have a shared cache. This can be part of the intermediate results cache.
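For several engines to share one intermediate results cache, they would need to agree on how a function call maps to a cache key. The following is a minimal sketch of one way to do that, assuming a function call can be represented as JSON; the `cache_key` helper, the dictionary standing in for a shared cache, and the `call`/`args` shape are all hypothetical and not the actual ZObject format.

```python
import hashlib
import json


def cache_key(zobject) -> str:
    """Derive a stable key from a JSON-like object.

    Sorting keys and fixing separators means two structurally equal
    objects serialize to the same bytes, regardless of key order.
    """
    canonical = json.dumps(zobject, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Assumption: a plain dict stands in for a cache service shared by all engines.
shared_cache = {}

call_a = {"call": "head", "args": [["1", "2"]]}
call_b = {"args": [["1", "2"]], "call": "head"}  # same call, different key order

# One engine stores a result; another engine can find it under the same key.
shared_cache[cache_key(call_a)] = "1"
```

Canonicalizing before hashing matters here because otherwise two engines that serialize the same call slightly differently would never hit each other's entries.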

Eventually we will need to allow for REST calls to other Wikimedia and external services (e.g. Wikidata, weather reports, etc.), as well as for access to changing values such as the current time and random values.

What about race conditions when functions are being edited while a function is running?

Would evaluation engines be homogeneous or diverse? I.e., could there be some evaluation engines that can use a TPU and others that cannot, and how would we route queries? A simpler version of the question: do all evaluation engines need to understand all programming languages, or can we make them lighter by having dedicated evaluation engines, where some can run Python, others JavaScript, etc.?
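If engines were diverse, something would have to route each call to an engine that supports the language of the chosen implementation. The sketch below only illustrates what such routing could look like; the engine names, the `ENGINES` registry, and the "prefer the most specialized engine" rule are assumptions made for illustration, not decisions recorded on this page.

```python
# Hypothetical registry: engine name -> programming languages it can run.
ENGINES = {
    "engine-py": {"python"},
    "engine-js": {"javascript"},
    "engine-all": {"python", "javascript"},
}


def route(language, engines=ENGINES):
    """Pick an engine that supports the language.

    Assumed policy: prefer the engine supporting the fewest languages
    (the most specialized one), breaking ties by name.
    """
    candidates = [name for name, langs in engines.items() if language in langs]
    if not candidates:
        raise LookupError(f"no engine can run {language}")
    return min(candidates, key=lambda name: (len(engines[name]), name))
```

With this registry, `route("python")` picks `engine-py`, and a language no engine supports raises `LookupError`.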

Some technologies to consider:

Draft of steps for the first evaluation engine

The order of the following steps is not necessarily as given. Built-ins go first, but most of the others are rather independent of each other.

Built-ins

See phabricator:T260321.

Very small API: maybe even just start with a single evaluate method and that's it.
No caching.
Implements some (or all) of the first set of builtins as described by phabricator:T261474.
Everything that is not a function call gets returned as is (maybe canonicalized).
Example: "Test" gets evaluated to "Test".
Everything that is a function call gets evaluated until fixpoint.
Example: if(true, "Yes", "No") gets evaluated to "Yes".
If the arguments are function calls, they also get evaluated, until fixpoint.
Example: head(tail(["1", "2", "3"])) evaluates to "2".
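The behaviour listed above can be sketched in a few lines. This is a minimal illustration only, assuming a hypothetical representation in which a function call is a dictionary with `call` and `args` entries and everything else is a plain value; it is not the real ZObject format, and the `BUILTINS` table is a stand-in for the built-ins described in the Phabricator task.

```python
def _if(condition, then, otherwise):
    return then if condition else otherwise


# Hypothetical built-in table; the real set is defined elsewhere.
BUILTINS = {
    "if": _if,
    "head": lambda items: items[0],
    "tail": lambda items: items[1:],
}


def is_call(obj) -> bool:
    return isinstance(obj, dict) and "call" in obj


def evaluate(obj):
    """Evaluate obj until it is no longer a function call (a fixpoint)."""
    while is_call(obj):
        # Arguments that are themselves function calls are evaluated first.
        # Note: this sketch is eager, so both branches of "if" are evaluated.
        args = [evaluate(arg) for arg in obj["args"]]
        obj = BUILTINS[obj["call"]](*args)
    return obj


nested = {"call": "head", "args": [{"call": "tail", "args": [["1", "2", "3"]]}]}
```

Here `evaluate("Test")` returns the string unchanged, while `evaluate(nested)` first reduces the inner `tail` call and then applies `head`. The loop also covers the case where a function returns another function call, which is then evaluated in turn.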

Note that all of this already requires that references are being resolved to the wiki. The function head is a function in the wiki and needs to be looked up, its implementations need to be gathered (we only have built-in implementations for now), an implementation needs to be chosen, and evaluated.

The first version of choosing an implementation is to choose any. This needs to be refined later.
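The lookup steps described above (find the function, gather its implementations, choose one, evaluate it) might look roughly as follows. The `WIKI` dictionary is a hypothetical stand-in for the wiki, and "choose any" is realized here simply as "take the first"; a later refinement would replace `choose_implementation`.

```python
# Hypothetical stand-in for the wiki: function name -> list of implementations.
WIKI = {
    "head": [
        {"kind": "builtin", "run": lambda items: items[0]},
    ],
}


def gather_implementations(function_name):
    """Look up the function and return its implementations."""
    if function_name not in WIKI:
        raise KeyError(f"unknown function: {function_name}")
    return WIKI[function_name]


def choose_implementation(implementations):
    """First version: choose any implementation; here simply the first."""
    if not implementations:
        raise LookupError("function has no implementations")
    return implementations[0]


def run(function_name, *args):
    implementation = choose_implementation(gather_implementations(function_name))
    return implementation["run"](*args)
```

Keeping the choice in its own function means the selection policy can change later without touching lookup or evaluation.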

Sandboxing and monitoring

Implement all necessary steps for sandboxing and monitoring the evaluation engines (Phabricator:T261470).

A lot of this should be available through Kubernetes. But we probably would also like to keep statistics such as “how often was a function called?”, “how often did we have cache misses”, etc.
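The engine-level statistics mentioned above are simple counters. As a rough sketch, assuming the engine reports each function call together with whether it was served from the cache, they could be collected like this; the `EngineStats` class and its method names are hypothetical.

```python
from collections import Counter


class EngineStats:
    """Counts function calls and cache hits/misses inside one engine."""

    def __init__(self):
        self.calls = Counter()  # function name -> number of calls
        self.cache_hits = 0
        self.cache_misses = 0

    def record_call(self, function_name, cache_hit):
        self.calls[function_name] += 1
        if cache_hit:
            self.cache_hits += 1
        else:
            self.cache_misses += 1

    def miss_rate(self):
        total = self.cache_hits + self.cache_misses
        return self.cache_misses / total if total else 0.0


stats = EngineStats()
stats.record_call("head", cache_hit=False)
stats.record_call("head", cache_hit=True)
stats.record_call("tail", cache_hit=True)
stats.record_call("if", cache_hit=True)
```

In a real deployment these numbers would presumably be exported to whatever monitoring system the infrastructure provides rather than kept in process memory.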

Intermediate caching

Introduce caching of intermediate results.

Before evaluating a function call, look up if this is in the cache and return that instead.

Allow the cache to be reset.
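A minimal sketch of this step, assuming function calls are JSON-serializable so that they can be turned into dictionary keys; the `IntermediateCache` class and the `compute` callback are hypothetical names used only for illustration.

```python
import json


class IntermediateCache:
    """Stores results of function calls, keyed by the call itself."""

    def __init__(self):
        self._results = {}

    @staticmethod
    def _key(call):
        # Calls are nested dicts/lists and not hashable, so serialize them.
        return json.dumps(call, sort_keys=True)

    def get_or_compute(self, call, compute):
        """Return the cached result, or compute and store it on a miss."""
        key = self._key(call)
        if key not in self._results:
            self._results[key] = compute(call)
        return self._results[key]

    def reset(self):
        """Drop all cached results."""
        self._results.clear()


evaluations = []  # records every time the expensive path is taken


def slow_evaluate(call):
    evaluations.append(call["call"])
    return "1"


cache = IntermediateCache()
call = {"call": "head", "args": [["1", "2"]]}
first = cache.get_or_compute(call, slow_evaluate)
second = cache.get_or_compute(call, slow_evaluate)  # served from the cache
```

After `cache.reset()`, the next lookup for the same call goes through `slow_evaluate` again.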

Editing invalidates the cache

Editing an existing ZObject invalidates the whole cache. We will slowly improve this behaviour. (Ideally an edit only invalidates those caches which need to be invalidated, but that will become complex quickly. We will get to it later.)

Requires intermediate caching.
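One simple way to invalidate the whole cache on every edit, without having to delete entries one by one, is to tag each entry with an epoch number and bump the epoch on edit. This is a sketch of that technique under the assumption that the wiki can notify the engine of edits; the `EpochCache` class and the `on_edit` hook are hypothetical.

```python
class EpochCache:
    """Cache whose entries all become stale whenever the epoch changes."""

    def __init__(self):
        self.epoch = 0
        self._entries = {}  # key -> (epoch when stored, value)

    def put(self, key, value):
        self._entries[key] = (self.epoch, value)

    def get(self, key, default=None):
        entry = self._entries.get(key)
        if entry is None or entry[0] != self.epoch:
            return default  # missing, or stored before the last edit
        return entry[1]

    def on_edit(self):
        """Called whenever any ZObject is edited: invalidates everything."""
        self.epoch += 1


cache = EpochCache()
cache.put("head(['1', '2'])", "1")
before_edit = cache.get("head(['1', '2'])")
cache.on_edit()
after_edit = cache.get("head(['1', '2'])")
```

A finer-grained version would track which ZObjects each cached result depends on, which is the complexity the text above defers to later.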

Composing implementations

Allow implementations that are compositions (Phabricator:T261468).
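As an illustration of what a composition could mean, the sketch below defines a function whose implementation is itself a function call over its arguments. The `second` function, the `arg` reference syntax, and the `COMPOSITIONS` table are hypothetical and not taken from the Phabricator task.

```python
BUILTINS = {
    "head": lambda items: items[0],
    "tail": lambda items: items[1:],
}

# Hypothetical composition: second(items) is defined as head(tail(items)).
COMPOSITIONS = {
    "second": {
        "params": ["items"],
        "body": {
            "call": "head",
            "args": [{"call": "tail", "args": [{"arg": "items"}]}],
        },
    },
}


def evaluate(obj, env=None):
    """Evaluate obj; env maps parameter names to already evaluated values."""
    env = env or {}
    if isinstance(obj, dict) and "arg" in obj:
        return env[obj["arg"]]  # argument reference inside a composition
    if isinstance(obj, dict) and "call" in obj:
        args = [evaluate(arg, env) for arg in obj["args"]]
        name = obj["call"]
        if name in BUILTINS:
            return BUILTINS[name](*args)
        composition = COMPOSITIONS[name]
        inner_env = dict(zip(composition["params"], args))
        return evaluate(composition["body"], inner_env)
    return obj  # plain values are returned as is


result = evaluate({"call": "second", "args": [["1", "2", "3"]]})
```

The composition needs no code in any programming language: it is evaluated purely by substituting arguments and reducing the resulting built-in calls.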

JavaScript implementation

Allow for JavaScript implementations.

Wikidata

Able to call out to Wikidata.

Commons

Able to call out to files in Commons.

Other Wikimedia

Able to call out to other Wikimedia properties.

Calls to the Web

Able to call out to the Web in general.

Non-functional functions

Particularly "current time" and "random".
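Functions like these interact with caching: their results must not be served from the intermediate results cache, since the same call can legitimately return different values. A minimal sketch of that distinction, assuming each function carries a hypothetical `is_pure` flag:

```python
import random
import time

# Hypothetical function table: name -> (callable, is_pure).
FUNCTIONS = {
    "double": (lambda x: 2 * x, True),
    "current_time": (lambda: time.time(), False),
    "random": (lambda: random.random(), False),
}

cache = {}


def call(name, *args):
    function, is_pure = FUNCTIONS[name]
    if not is_pure:
        # Non-functional: always evaluate, never read or write the cache.
        return function(*args)
    key = (name, args)
    if key not in cache:
        cache[key] = function(*args)
    return cache[key]


doubled = call("double", 21)
now = call("current_time")
sample = call("random")
```

Only the `double` call ends up in `cache`; the other two are evaluated afresh on every call. A result that depends on a non-functional call anywhere in its evaluation would need the same treatment.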