Wikimedia monthly activities meetings/Quarterly reviews/Architecture, Operations, Release Engineering, Services, and Security, October 2016

Notes from the Quarterly Review meeting with the Wikimedia Foundation's Technology II: Architecture, Operations, Release Engineering, Services, Security teams, July 14, 8:00 - 9:30 AM PT.

Please keep in mind that these minutes are mostly a rough paraphrase of what was said at the meeting, rather than a source of authoritative information. Consider referring to the presentation slides, blog posts, press releases and other official material.

Present (in the office): RobLa, Michelle Paulson, Zhou, Gabriel; participating remotely: Maggie Dennis, Aeryn, akosiaris, Andrew Bogott, Brandon Black, Darian Patrick, Emanuele, Eric E., Faidon, Filippo G., Giuseppe, Greg, Jaime V., Jaime Crespo, Katherine Maher, Mark, Petr, Sarah R., Wes

Architecture

 

Objective: Support Wikimedia Security

 
  • Rob: See Slide
  • Rob: Working with Darian to Brian W. to help out. Darian really stepped up. I look forward to his management of security going forward.

Objective: Develop Fellowships program

 
    • Clarifying relationships between Brion V and Tim S. Not done this quarter because of staffing issues in T&C and Executive Staff. This should be worked on with the upcoming CTO.

Other successes and misses

 
  • Rob: Worked with Kevin S. of TPG to improve documentation about how ArchCom works. Creating continuity for the architecture committee. Which components are essential and which are manager-specific?
  • Rob: Archcom meets weekly. We also have well-attended IRC meetings.
  • Rob: DevSummit is coming together with Quim Gil and his team. Quim is chairing and RobLa is taking an active role.

  • Wikitext: Parsoid and Parsing have been working on documenting wikidom(?) and wikitext (which has been in use for 15 years). Parsing discussed this extensively at its offsite, and it will be a focus for the upcoming quarter.

  • Rob: Misses: would like to work more with colleagues in tech.

Release Engineering

 
  • Greg: Team size is 6
  • Greg: KPI went down about 8% (about a minute from Q4)
  • Katherine: That seems a significant gain? Reasons?
  • Greg: It's hard to tell. Part of it may be Nodepool, which we were migrating to; we went back to permanent slaves, which don't have the cost associated with them. We don't have the benefits of isolation, but the positive is that quick tests run faster. That is informing a lot of what we talk about at our offsite and the decisions we have made.

Time spent

 
  • Greg: Consolidated graphical view of time spent.
  • Greg: The team fills out a spreadsheet with what people remember from the previous week. It is based on memory, but it gives a general snapshot of where time is going.

Time Spent by Category

 
  • Greg: All categories that we use to track time allocation.

Objective: Phase out Ubuntu Precise

 
  • Greg: Objectives: We have a meta objective of phasing out Ubuntu Precise (See Slide)
  • Greg: Done with strong collaboration with Ops
  • Greg: Learning: We made some assumptions about Ops-level items that we should not have made; we should have left those questions open for Ops. This created some confusion.

Objective: Reduce Tech Debt

 

Stretch Goals

 

Successes and Misses

 


Core workflows and metrics

 

Technical Operations

 
  • Mark: 19 staff members
  • Mark: 3 people joined this quarter
  • Mark: DBA position was finally filled
  • Mark: Madhu from analytics joined
  • Mark: Ricardo also joined -- he will work specifically on automation
  • Mark: Main KPI is availability

Objective: Puppet

 
  • Mark: Puppet: First goal. We were running on an old, slow version of this. It wasn't present in the new data center yet, so if we had lost the original we could have had some problems. This had been put off for a while, so we made it a focus this quarter. It is now running on multiple machines. We spent a bit of time on it. There was some frustration for the engineers: a lack of documentation slowed things down, but we finished two or three weeks before the end of the quarter.
  • Mark: Puppet runs are less than 20. This is not the end-all solution, but in general this is a good thing.

Objective: Prometheus Metrics

 
  • Metrics monitoring. We have many systems across Wikimedia and we keep adding more. Many have problems. In the last year we did some work scaling Graphite. Our systems couldn't keep up and there weren't solutions for that. Prometheus was one of the software packages we decided to experiment with, using our data in production. We deployed it in both data centers from the start, across multiple servers. It was met with praise and is efficient with bandwidth and storage: several orders of magnitude more efficient than Graphite.
  • Additional flexibility. We will move ahead in the next quarter.

Objective: Openstack Horizon

 
  • A custom program was written by a Labs engineer years ago, and over the last few quarters we have been migrating management of Puppet classes to the new interface (OpenStack).
  • We did have a snag. One of our draft goals was posted as a goal but was not meant to be. The team decided to move forward with the goal as discussed rather than the goal as written. Next time, we will explicitly check the official posted goal on the wiki. Transitioning to a new system takes a lot of coordination and prep work, and this should be done separately from the actual goal.

Objective: Varnish 4

 
  • Part of a year-long effort. This was a big migration. We have a lot of traffic and content on Varnish. The storage backend has some issues. There is a solution, but it is not open source, so it is not available to us. We are trying to decide whether to stay with the open source version or migrate away.

Objective: Object invalidation with X-Key

 
  • With current Varnish servers we can only purge one page at a time if we know the content. It is not efficient. Right now we have several thousand purges per second per server, and there is a new way to optimize that. It is called XKey, and we will need coordination from the org.
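
As a rough illustration of the idea behind key-based invalidation (not something shown at the meeting, and not Varnish's actual interface), the toy Python model below tags each cached object with one or more keys so that a single purge request can drop every object sharing a key. The class name `TaggedCache`, the URLs, and the `page:Foo` key format are all hypothetical.

```python
from collections import defaultdict


class TaggedCache:
    """Toy in-memory cache; each object carries a set of keys (tags)."""

    def __init__(self):
        self._objects = {}               # url -> (body, keys)
        self._by_key = defaultdict(set)  # key -> urls tagged with that key

    def store(self, url, body, keys):
        keys = frozenset(keys)
        self.purge_url(url)              # drop any older copy and its tags
        self._objects[url] = (body, keys)
        for key in keys:
            self._by_key[key].add(url)

    def purge_url(self, url):
        """Per-URL purge: the caller has to know each URL. Returns 0 or 1."""
        entry = self._objects.pop(url, None)
        if entry is None:
            return 0
        for key in entry[1]:
            self._by_key[key].discard(url)
        return 1

    def purge_key(self, key):
        """Key-based purge: one request drops every object tagged with key."""
        urls = list(self._by_key.get(key, ()))
        return sum(self.purge_url(url) for url in urls)

    def __contains__(self, url):
        return url in self._objects


cache = TaggedCache()
cache.store("/wiki/Foo", "desktop HTML", {"page:Foo"})
cache.store("/m/wiki/Foo", "mobile HTML", {"page:Foo"})
cache.store("/wiki/Bar", "other page", {"page:Bar"})

purged = cache.purge_key("page:Foo")  # one purge covers both renderings of Foo
```

In this sketch the sender of the purge only needs to know the key, not every URL under which the content was cached, which is the property that would reduce per-URL purge traffic.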

Objective: Kubernetes

 
  • Managing containers. We did not finish but we felt it was important to get started. It did not get started until the last two weeks but it will be a focus next quarter.

Successes and Misses

 

Core workflows and metrics

 

  • Gradually expanding across quarters
  • Katherine: Thanks, that was great. Appreciated the discussion of availability.

Services

 

Objective: Improve Services Platform

 
  • Gabriel: Goal: Improve Services Platform (See Slide for details)
  • Gabriel: Goals from last quarter and for next quarter.
  • Gabriel: Next quarter: we have several use cases in the pipeline with Editing and Reading

Objective: Improve Services and Security

 
  • Protect sensitive user info (See Slide)
  • Not as successful this quarter. It is a multi-team collaboration with dependencies on other teams. This goal was de-prioritized and will be revisited.

Objective: Overhaul Legacy Systems

 
  • Provide a maintainable, cost-effective PDF generation service for offline/mobile use. (See slide)
  • Wikimedia Germany has agreed to be product owner on this. We will prepare for deployment and this will go out in this quarter.
  • Wikimedia DE was interested in table support, and this was a community goal as well

Core Workflows

 
  • Support offline, improve performance, increase flexibility

Scorecard

 
 

Security

 
  • Darian: Team is the same size at about 1.5 people
  • Darian: 90% of time on core focus addressing security bugs and working with other teams
  • Darian: 2 Critical, 59 High security bugs

Objectives

 

Successes and Misses

 

Core Workflows

 

  • Katherine: Curious about hiring.
  • Darian: We have several solid candidates. A contractor, Sam Reed, who has worked with us before will be working with us.

Bugs

 

Session wrap-up

  • Wes: We had a Product and Technology onsite this quarter, and there were many great outcomes. One big thing that came out was looking at improvements to QA and Beta Cluster and working with Ops. For those watching the presentation, Gabriel presented in a new format that we are testing out. If you have feedback, please let me or Kristen Lans (from Subteam) know, to help improve the process.