Wikimedia monthly activities meetings/Quarterly reviews/Architecture, Operations, Release Engineering, Services, and Security, July 2016

Notes from the Quarterly Review meeting with the Wikimedia Foundation's Technology II: Architecture, Operations, Release Engineering, Services, Security teams, July 14, 8:00 - 9:30 AM PT.

Please keep in mind that these minutes are mostly a rough paraphrase of what was said at the meeting, rather than a source of authoritative information. Consider referring to the presentation slides, blog posts, press releases and other official material.

Present (in the office): Heather, Gabriel, Michelle, Ori, RobLa, Jaime, Madhu, Katherine, Joady; participating remotely: Darian, Faidon, Greg, Maggie, Mark, Nuria, Chad, Chase, Wes

Backup Datacenter

 
  • Wes: Great rollout. We had to make adjustments, moved out a quarter to be prepared and do this well. Good learning and improvements.
  • Katherine: reiterating last session; huge accomplishment for the team, community members were very positive; confidence in the results; board appreciated the work

Architecture

 

Implement ArchCom Subteams

 
  • Rob: Goal is to scale ArchCom with sub-teams. Hoped to spin up at least one (security), but not fully done yet.

Document RFC

 
  • Improving process. No single big improvement to call out. Status page for RFC shepherding.

Develop Fellowships

 
  • Developing fellow program. Did not create any controversy, but also wasn't really discussed widely.
  • Background: Idea is to give senior engineers a tenure-like status, with a lot of freedom.

Technical operations

 
  • 17 FTE, plus some ops engineers embedded in other teams
  • Main KPI is availability, improved slightly to 99.987%.
  • Katherine: congratulations on a job well done (fail-over). Presented at Wikimania, well received there. Choice to delay was the right choice.
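As a rough worked example of what the availability figure above implies, the sketch below converts an availability percentage into minutes of unavailability. It assumes availability is measured as a simple fraction of wall-clock time and that a quarter is about 91 days; neither assumption comes from the meeting, and `downtime_minutes` is a hypothetical helper name.

```python
def downtime_minutes(availability_pct: float, period_days: float) -> float:
    """Minutes of unavailability implied by an availability percentage.

    Assumes availability is the fraction of wall-clock time the service
    was up (an assumption; the minutes do not say how the KPI is computed).
    """
    total_minutes = period_days * 24 * 60
    unavailable_fraction = 1 - availability_pct / 100
    return total_minutes * unavailable_fraction


if __name__ == "__main__":
    # 99.987% over an assumed 91-day quarter and a 365-day year.
    print(f"{downtime_minutes(99.987, 91):.1f} minutes per quarter")
    print(f"{downtime_minutes(99.987, 365):.1f} minutes per year")
```

Under these assumptions, 99.987% corresponds to roughly 17 minutes of unavailability over a 91-day quarter.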

HTTP/2 support

 
  • Needed to move to HTTP/2, as major browsers like Chrome dropped SPDY support.
  • Switched successfully on May 4th.

Objective: Varnish 4 migration

 
  • Moving away from Varnish 3. Complicated transition.
  • Encountered issues, but solved them.

Objective: Tools on k8s

 
  • Target is Tool Labs, the community project running a lot of small projects.
  • Replaced legacy system with Kubernetes. Needed to learn a lot about how users use our platform.
  • Katherine: is this aligned around bd808's work?
  • Chase: yes, he's basically an integrated member of the team. Ideally transparent to the team. Explains things to users. Honorary member of the team.
  • Katherine: great to see x-team work
  • Gabriel: K8s work is providing us stuff that we might want to use in production
  • Katherine: Looks like a very busy quarter. Seems to have been a very satisfying way of coming into the quarter with everything that was in play last quarter

Other successes and misses

 
  • As usual, lots of large & small things that didn't make it to official goal status.

Core workflows and metrics

 
  • No major changes, slight improvements.
  • enwiki test didn't seem to trigger problem; may need to talk to vendor about this

Core workflows and metrics

 
  • Faidon: Labs has more than 700 instances supporting 1000 tools.
  • Chase: Hard to get a count on the tools
  • Wes: thank you for all your efforts over the quarter, Faidon and Mark. A lot of communication and coordination; at the same time, upgraded a lot of equipment that needed a refresh.
  • Jaime: echo that on capital side
  • Katherine: thank you for taking advantage of the opportunity we had at the end of the last fiscal year

Release engineering

 
  • Greg presenting.
  • Team size: effectively 5
  • KPIs related to CI merge times slightly improved (-4.5%)

Time spent

 
  • Bulk in maintenance

Consolidate deploy tools

 
  • Hoped to move to Scap3
  • Finished about 50% of repos

Retire Gerrit in favor of Phabricator

 
  • Gerrit still in use, but preparation done.
  • Rob: Gerrit isn't gone yet, correct?
  • Greg: (explaining slide reporting process)
  • Katherine: What is the timeline?
  • Greg: another few quarters / hard to tell due to our current focus on technical debt (analysis starting in Q1, follow-ups prioritized accordingly after)

JavaScript Browser Testing

 
  • Wes: how has participation been from overall eng?
  • Greg: goal is to reduce burden on others. Survey: almost 30 responses. Working closely with Ed Sanders from VE team.

Other successes and misses

 
  • Retired gitblit. thanks to Danny_B and Paladox (community) and @mutante
  • Released MW LTS 1.27
  • Phab: thanks to Quim Gil for figuring out collab. Graph stuff really nice
  • CI server failure. large scramble to fix. thanks to ops. doing tech debt analysis

Core workflows and metrics

 
  • upgrade phab biweekly
  • daily swat deploys

Core workflows and metrics

 
  • CI config changes. Metric isn't perfect, but gives you a good idea. Will soon go away; making it so teams can change their own configs.
  • Selenium: 2 rels
  • Malu: pre releases
  • Browser tests: title doesn't really make sense. Defining how browser tests are run and where they are run

Skill tracking matrix

 
  • we do this every quarter to make sure bus factor is healthy. obvious bus factor issues
  • Katherine: with security releases, we have obvious problem. what's the plan?
  • Greg: this started with our team offsite in May last year. RelEng made up of many parts. Started with deployments. now we can migrate to a new focus area.
  • Katherine: really impressed you are doing this skill mapping. it feels like healthy team maintenance
  • Wes: I'd like to add: you had some time away as well and people stepped in accordingly. Good coverage with a much reduced team.
  • Katherine: echoing that

Services

 

Slide 24

 
  • task distribution is very approx. Large increase in REST API traffic. cache misses only increased marginally, caching very effective. Increase mostly driven by math switching to SVG and MathML served through REST API. Other big change, Android switched to using new app. We don't have full data
  • big change was the SVG math. lower graph is cache misses

RestAPI Buildout

 
  • second goal was improving support for devs. made sense to integrate with k8s effort in ops. effect of android app
  • reading team is working on app, so they're in charge. this quarter, several APIs they've exposed. definition endpoint; wikimania.
  • k8s discussion at Wikimania. Lot of support; lots of people rallying around it, even 3rd party users.

Eventbus and Change Propagation

 
  • Eventbus is exposing events in a stream. change prop uses this to execute actions. rules can be configured. migrated several job queue items. now working on a better Kafka binding. that blocked multi-dc, we need to upgrade to Kafka 0.9
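The idea described above (events arrive on a stream, configurable rules decide which actions run) can be sketched roughly as follows. This is an illustrative sketch only, not the actual Change Propagation rule format or code; `Rule`, `dispatch`, and the topic and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


@dataclass
class Rule:
    """A hypothetical rule: run `action` for events on `topic` whose
    fields equal every key/value pair in `match`."""
    topic: str
    match: Dict[str, Any]
    action: Callable[[Event], None]

    def applies_to(self, topic: str, event: Event) -> bool:
        return topic == self.topic and all(
            event.get(key) == value for key, value in self.match.items()
        )


def dispatch(rules: List[Rule], topic: str, event: Event) -> int:
    """Run every matching rule's action; return how many rules fired."""
    fired = 0
    for rule in rules:
        if rule.applies_to(topic, event):
            rule.action(event)
            fired += 1
    return fired


if __name__ == "__main__":
    # Hypothetical topic and field names, for illustration only.
    rules = [Rule("page_edit", {"wiki": "enwiki"}, lambda e: print("rerender", e["title"]))]
    dispatch(rules, "page_edit", {"wiki": "enwiki", "title": "Foo"})
```

Keeping the rules as data rather than code is what makes them configurable: adding a new reaction to an event means adding a rule, not changing the dispatcher.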

Successes and Misses

 
  • math rendering: Moritz Schubotz worked on getting this work over the line. Community demanded this be made the default.
  • Katherine: it's fascinating to see this was a challenge
  • Wes: improving the caching
  • Gabriel: community developed service that is used by 3rd parties
  • Katherine: what I find interesting is that it was driven by outside needs
  • Wes: note: as we went into annual planning, this had direct impact on operations
  • Introduced rate limiting for expensive API end points like page view stats
  • Katherine: are you seeing a notable increase in impact?
  • Gabriel: A few heavy users had to throttle their requests, but were cooperative and understanding. Enforcing limits has made the API more reliable and performant for everybody despite temporarily limited backend capacity esp. in pageview API
  • Wes: API rate limits?
  • Gabriel: terms of use: we talked about this as documentation for entry point. Devs are actually happy to have concrete limits spelled out, as they can work with those. Otherwise, it's hard to guess for users what something vague like "moderate use" means in practice.
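To illustrate what a concrete, spelled-out limit can look like in practice, here is a minimal token-bucket sketch. It is not the rate limiter used for the REST API; the token bucket is simply one common technique, and `TokenBucket`, its parameters, and the numbers in the example are assumptions made for illustration.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the limiter can be tested
        self.tokens = capacity      # start with a full bucket
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if the request fits the limit."""
        now = self.clock()
        # Refill for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    # Hypothetical limit: 10 requests per second with bursts of up to 20.
    bucket = TokenBucket(rate=10, capacity=20)
    allowed = sum(bucket.allow() for _ in range(30))
    print(f"{allowed} of 30 back-to-back requests allowed")
```

A client that knows the documented rate and burst size can throttle itself to stay under them, which is the kind of predictability the developers in the discussion asked for.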

Successes and Misses

 
  • library: title normalization library used in RESTBase, Mobile Content Service & Parsoid. Good to see cross team sharing.
  • Migration of services to Jessie. Parsoid is almost over the line.
  • Cassandra outage during upgrade: We need to improve testing, more automation.

Core Workflows

 
  • workflows
  • Katherine: are you seeing more interest in this?
  • Gabriel: it's very project based. we want to help people help themselves. amount of handholding we need to do diminishing over time

Q1 Preview

 
  • Preview Q1
  • firewalling off authentication and sessions. looking into authentication, rollout Q2
  • eventbus and change propagation. use cases

Security

 
  • Darian presenting
  • Chris's departure impacted, but Brian has stepped up. Brian is fluent in community issues. very little experimentation this quarter.

CentralAuth

 
  • we had hoped to complete high level design. Chris and Darian completed eval. we settled on nodejs. we did not complete design. Services taking the lead, we are taking the secondary role.

Successes and Misses

 
  • goal from last quarter was to complete 2FA. We haven't formally notified. Once authmgr became ready, redeployed. Already working through functionality bugs; one remaining outstanding. We'll follow up with survey
  • Katherine: working with OIT for staff rollout?
  • Darian: I hadn't been thinking about that
  • Jaime: I've been talking to Wes about having a session
  • Katherine: I heard from Reading this was a goal. Congrats, especially where team was short staffed
  • Darian: missed a couple of reviews. Django app review for bd808; we didn't have appropriate security controls and what we should be looking for.

Core Workflows

 
  • added a row
  • Wes: Thanks for support and diligence in transition this quarter to Darian and Brian and RobLa. Next quarter: we have a number of headcount to begin pursuing
  • Katherine: thanks Darian for stepping up. Just got sent a referral today

Session wrap-up

  • Katherine: I appreciate seeing how these things work from quarter to quarter
  • Jaime: a lot going on in the org, and it's often quite complicated
  • Katherine: wouldn't be here without you