Talk:Abstract Wikipedia/First evaluation engine
Initial thoughts
Some thoughts and notes on the essay.
- We've been advised to keep this off the main application cluster and to be mindful of architecting the database so that it could be moved to separate hardware later. This is partly to avoid simply saturating those resources, but it can provide some level of security isolation, too.
- It seems conceivable we might be able to contain execution via https://phabricator.wikimedia.org/T260330. For pure function composition using only our system-defined PHP primitives, that might be useful (they can exist as pure classes in a library and don't even have to run in the context of the main app server), and it's probably even more interesting if we want to shell out to run UDFs on a shortlist of permitted and locked-down runtimes (containers/runtimes need to be speculatively and reactively pre-spawned / re-spawned, as Tim and Joe note); there's a rough sketch of that kind of locked-down invocation after this list. UDFs would still need to have certain security properties about their code construction, and additional guards would need to be in place to avoid badness.
- If we have a long running standalone service, we have options to have it as either (a) a sidecar with MediaWiki, (b) its own truly separate scalable service, or (c) something that's clever about when to use (a) versus (b). For long running functions (even outside of Abstract Wikipedia), usually some sort of queue/topic needs to be available, too, in order to let clients know when a computation is done.
- With a standalone service, I believe we want MediaWiki to receive the client POSTs with the client-provided function names and parameters, so MediaWiki can access the database for the real function material, and then MediaWiki should ship those off to the standalone service, which does the heavy lifting (see the orchestration sketch after this list). This way the standalone service doesn't need to make any database calls. Alternatively, MediaWiki could proxy the (validated as well-formed) calls to the service, and the service could itself subscribe to updated function implementation material (e.g., via ChangeProp) so it has up-to-date definitions and can performantly run the known code without expensive on-demand database lookups in most cases. This would need to guard against split-brain errors.
- But if we determine we want a standalone service to make database calls, I think we need to look into getting read-only credentials on production replicas. I believe we generally try to route all direct access to MediaWiki's MariaDB tables through MediaWiki PHP today, although there are cases of standalone databases that may already have read-only access; it's just that, in our case, if we want the database to conform to Wikimedia's MediaWiki conventions, we probably want the correctness guarantees of the MediaWiki data access layer. In any case, a read-only MariaDB ID specific to the application, with a safelisted set of tables, seems strictly required if we find we do want a service to query the database without security risks. Architecturally, there's a tendency toward avoiding cycles where MediaWiki calls a service and that service in turn ends up calling back into MediaWiki to complete the processing that ultimately returns a result.
- For the developer persona and perhaps worker nodes in P2P land, it sure would be nice to be able to run disconnected from our servers (most of the time, anyway). I don't think developers mind having to install something via, for example, Docker Compose, which could have a data volume or some sort of sync routine to fetch updated function definitions opportunistically.
- The embedded device and mobile device scenarios today in practice wouldn't really be able to use Docker Compose, although eventually that may change. Those devices probably just need a copy of the functions plus pertinent runtime(s) (e.g., natively compiled runtimes for embedded, and in a mobile context maybe for simplicity Blink/WebKit with or without WASM and depending on uptake maybe embedded Python). Cross-compilation is another approach used here. It's not the primary thing to consider right now, but client-side function invocation for native computing purposes, plus network-issued invocation on a distributed backplane, are use cases worth keeping in mind in system design so that people don't have to end up re-implementing too many pieces of the evaluation engine.
- Can we utilize hashing of the normalized, client-POSTed function-and-data in order to have a key for storing results (e.g., as a small graph data structure) in Swift storage or some other elastically scalable place for quicker lookups? (A minimal hashing sketch follows this list.) We may want heuristics to determine when to do this, or ways to let a user specify that something is long running and so would benefit from being persisted for future fast lookup. SIDEBAR: This concept may be useful for constructing units of checkpointed, composed functions (e.g., minified/tree-shaken and babel-ized JS) and storing them after they're called on demand, although a good JIT runtime (e.g., Node.js) that examines hot paths may work for that just as well, if not better. There's also the option of a monolithic compilation of all functions into one big binary or a set of binaries.
- A purely browser-based implementation up front seems a stretch for most of the reasons described in the Cons section about this (although I'm less skeptical about search engine friendliness, just to note - there are ways to do it). I'd like to get there, and I can see how pure functions written in JavaScript (or compiled to WASM, or with JS as first class but WASM embedded for the other languages) that can run isomorphically could be made performant (e.g., if precomputed values can be stored in scalable storage and bundled intelligently as small graph data structures in server responses), and furthermore, as an end user, I'd really like to be able to run disconnected when connectivity is intermittent. Maybe a way to think about this is: how would we go about allowing end users to have the functions cached client side, so that if they need to run offline (and if we want them to not need our server just to run a function) they can do so? That's typically in the realm of a Progressive Web App / ServiceWorker context, which has fewer caps on its storage and long-running code (a rough ServiceWorker sketch follows this list). That seems doable for system-defined primitives and UDFs written in JS. But I think we assume the developer persona is more likely to access exported libraries through some other sort of interface (e.g., a native IDE or codesandbox / codepen), even if we need some basic CodeMirror-type UI on the web for our first-party experience for writing UDFs. As for articles, yeah, it's hard to imagine full composition at the client in the near term... for full-length articles, at least - and I don't think anyone has fully solved the UX of how to cleverly update an article and surface the update to the user, even when it's computationally possible.
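To make the locked-down UDF point above a bit more concrete, here's a rough TypeScript (Node) sketch of shelling out to a network-less, resource-limited container; this is explicitly not how T260330 actually works, and the image name, entry point, and limits are all made up:

```typescript
import { execFile } from 'node:child_process';

// Rough sketch only: run a user-defined function inside a locked-down container
// with no network, a read-only filesystem, and hard resource/time limits.
// The image name and entry point are hypothetical.
function runUdfInContainer(udfSource: string, input: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const args = [
      'run', '--rm',
      '--network=none',                    // no outbound network from the UDF
      '--read-only',                       // immutable root filesystem
      '--memory=256m', '--cpus=0.5', '--pids-limit=64',
      'wikifunctions-udf-runtime:latest',  // hypothetical pre-spawned runtime image
      'run-udf', '--input', input,
    ];
    const child = execFile('docker', args, { timeout: 5000 }, (err, stdout) => {
      if (err) {
        reject(err);                       // timeouts, non-zero exits, OOM kills land here
      } else {
        resolve(stdout.trim());
      }
    });
    child.stdin?.end(udfSource);           // pass the function body over stdin, not argv
  });
}
```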
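And a minimal sketch of the MediaWiki-in-the-middle orchestration flow I described above, where the database lookup stays on the MediaWiki side and the evaluator only does the heavy lifting; the evaluator URL, the request/definition shapes, and the lookup stub are all assumptions:

```typescript
// Minimal sketch, not an actual API: MediaWiki receives the client POST, resolves
// the function definition from the database, and ships definition + args to the
// standalone evaluator so the evaluator never needs database access.

interface ClientCall {
  functionId: string;                  // e.g., a Z-id supplied by the client
  args: Record<string, unknown>;
}

interface FunctionDefinition {
  id: string;
  implementation: string;              // the "real function material" stored in the wiki
  language: 'javascript' | 'python';
}

// Stand-in for MediaWiki's data access layer; in reality this is a database read.
async function loadFunctionDefinition(id: string): Promise<FunctionDefinition> {
  throw new Error(`lookup of ${id} is not implemented in this sketch`);
}

export async function handleClientPost(call: ClientCall): Promise<unknown> {
  // 1. Validate and resolve the client-supplied name against the database.
  const definition = await loadFunctionDefinition(call.functionId);

  // 2. Ship the resolved definition plus arguments to the standalone evaluator.
  const response = await fetch('http://function-evaluator.internal/evaluate', { // hypothetical service URL
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({ definition, args: call.args }),
  });
  if (!response.ok) {
    throw new Error(`evaluator returned HTTP ${response.status}`);
  }
  return response.json();
}
```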
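For the result-caching idea, something like the following; the canonicalization here is a naive stand-in for whatever normalized form the function model actually uses:

```typescript
import { createHash } from 'node:crypto';

// Naive canonical JSON: recursively sort object keys so that equivalent calls
// serialize identically regardless of the key order in the client's POST body.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalize).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => a.localeCompare(b))
      .map(([k, v]) => `${JSON.stringify(k)}:${canonicalize(v)}`);
    return `{${entries.join(',')}}`;
  }
  return JSON.stringify(value);
}

// Hash the normalized function-and-data; the digest can serve as the object name
// in Swift (or any other key-value store) for quick result lookups.
export function resultCacheKey(functionId: string, args: unknown): string {
  const normalized = canonicalize({ function: functionId, args });
  return createHash('sha256').update(normalized).digest('hex');
}

// Example: resultCacheKey('Z10000', { x: 2, y: 3 }) yields the same key as a call
// whose POST body listed y before x.
```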
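And for the offline / Progressive Web App thought, a rough ServiceWorker sketch; the API paths and the runtime bundle URL are hypothetical, and in reality we'd want a smarter update strategy than cache-first:

```typescript
/// <reference lib="webworker" />
// sw.ts: compiled with the "WebWorker" lib; `self` is the ServiceWorkerGlobalScope.
declare const self: ServiceWorkerGlobalScope;

const CACHE_NAME = 'wikifunctions-offline-v1';

// Hypothetical URLs: a function-definition fetch and a client-side JS runtime bundle.
const PRECACHE_URLS = [
  '/w/api.php?action=wikilambda_fetch&format=json&zids=Z801', // hypothetical API module/params
  '/runtime/evaluator.js',                                     // hypothetical runtime bundle
];

self.addEventListener('install', (event: ExtendableEvent) => {
  // Pre-cache the function definitions and the runtime needed to run them offline.
  event.waitUntil(caches.open(CACHE_NAME).then((cache) => cache.addAll(PRECACHE_URLS)));
});

self.addEventListener('fetch', (event: FetchEvent) => {
  // Cache-first so function calls keep working while disconnected; fall back to the
  // network (and refresh the cache) when online.
  event.respondWith(
    caches.match(event.request).then(
      (cached) => cached ?? fetch(event.request).then((response) => {
        const copy = response.clone();
        caches.open(CACHE_NAME).then((cache) => cache.put(event.request, copy));
        return response;
      })
    )
  );
});

export {};
```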
--ABaso (WMF) (talk) 20:31, 11 September 2020 (UTC)
@ABaso (WMF): Very much out of my depth here, but I think Google Sheets is a helpful analogue. This manages to be functional offline and (as far as I know) cannot be used to attack Google when online. I'm not sure how far we'd get actually trying to create an implementation of (some subset of) Wikifunctions in Google Sheets, but I thought thinking it through might be worthwhile.--GrounderUK (talk) 13:58, 9 December 2020 (UTC)
@GrounderUK: thank you. --ABaso (WMF) (talk) 19:23, 9 December 2020 (UTC)