Wikimedia monthly activities meetings/Quarterly reviews/Architecture, Operations, Release Engineering, Services, and Security, April 2016
Notes from the Quarterly Review meeting with the Wikimedia Foundation's Technology II: Architecture, Operations, Release Engineering, Services, Security teams, April 14, 10:00 - 11:30 AM PT.
Please keep in mind that these minutes are mostly a rough paraphrase of what was said at the meeting, rather than a source of authoritative information. Consider referring to the presentation slides, blog posts, press releases and other official material
Architecture
Slide 2 - Architecture cover slide
Rob: I am the architecture dept, running the virtual team responsible for the arch committee. We do not have any KPIs for the arch group. I think that's OK; everybody has a lot of their own KPIs they are trying to achieve.
Slide 3 - Lending ArchCom authority
aka piloting the Rust process
Rob: 1st goal: pilot the Rust-style process. (Rust is a programming language by Mozilla; it was a tool to rewrite Firefox.) Gabriel has been observing the Rust process and thought, why can't we do something like that?
I have been trying to work out what it would mean for the foundation.
ArchCom said yes, we should give this a shot. We do not have a sub-team yet.
On the plus side, the process has the idea that each RfC has a shepherd. Using this we have a better way to talk about what the next steps are and when something is blocked or only on the back burner.
This collab is super complicated.
Slide 4 Improve RFC documentation
Rob: With better documentation, people understand what to expect, that meetings are happening, and that there is a process.
The work will continue (it will never be done).
I had one more goal about renaming but that was not done.
Katherine: I'm interested in whether there are next steps from the roadmap. What is this work preparing for?
Rob: This is leading to consensus based conceptual integrity (see: <https://www.mediawiki.org/wiki/User:RobLa-WMF/CBCI>)
Conceptual integrity is making a system that makes sense. Conventional wisdom is that you have a single person responsible as visionary to make sure system fits together. To do that in a consensus oriented way is challenging but necessary.
How do we do this in a group way where no single person is laying down the law? That's the long term goal.
Immediate steps: when someone has a good idea, how do they get support?
What we are looking at is "what does it mean to be consensus based?" Are we looking for an advice process? We've been talking about it in the ArchCom but still haven't decided if it is advice or authority that we are looking for.
Geoff: who is in the ArchCom?
Rob: Is MediaWiki the center of what the ArchCom is about, or Wikimedia software specifically? It's not clear.
The formal authority of the ArchCom is +2 access - the ability to commit code that goes live on site quickly. They also have the ability to take away someone's +2 rights. It's very rarely been done, but they have the trust to do it. (after meeting note: see <https://www.mediawiki.org/wiki/plus2>)
Geoff: who is on the committee?
Rob: Gabriel, Tim S, Daniel Kinzler from WMDE
Mostly staff, dominated by staff (answer: https://www.mediawiki.org/wiki/Architecture_committee#Members )
Operations
Slide 5
Mark: last qtr was fundraising, uptime is fairly similar to last qtr.
Slide 6 - DR testing
Mark: Biggest goal was to test the Dallas datacenter. A secondary datacenter should be able to be primary. We had data backed up there for a while, and the equipment was there, but it wasn't tested and fully ready. We haven't actually served MW traffic from there; there was still a fair amount of work needed. The goal this quarter was to do that. Largely completed, but EOQ timing. Planned originally for March. We chose the week before Easter, then hit a few unexpected setbacks (e.g. job queue). Some security infrastructure. Backup date used. Planned for next week. We did switch over all of the other services in a test. MW is the most complicated part. One thing we should have done: planned the exact date before planning the goal. Also, we should have planned the comms plan. Sherry Snyder jumped on it (many thanks to her).
We will work on completing the goal next week and follow up on learnings on the process.
Slide 7 Labs dashboard
Mark: Move dashboard from homegrown solution to Horizon (used by upstream). Most important parts moved to Horizon. Users can now create new systems, can access DNS, http proxies. All done by Horizon: we almost didn't make it, but Alex Monk (Krenair) stepped in and gave his front-end support for the work.
Slide 8 Migrated from Ubuntu 12.04 to Debian Jessie
Mark: We have 1400 servers running multiple versions of Linux. We have to keep getting deprecated versions out. Some systems were harder to migrate. Some need to be done one by one. We were able to meet the goal to do over 60. We will continue to work on this but will make it a KPI: systems running on deprecated SW.
Slide 9 - Monitoring
Mark: This is a follow-up of a goal from last quarter which we had missed. So we started with what was needed, which depended on work not completed. Monitoring doesn't scale anymore and needs to be modernized.
Since this was a follow-up goal, we decided mid qtr to redefine and reduce scope, which we did reach (evaluate solutions).
We're not going to make it a goal for next qtr because we don't have resources, but hope to rectify with hiring.
Slide 10 Varnish/caching
Stretch goal. Upgrade our caching. We've been running on Varnish 3 for a long time. Finally upgrading to Varnish 4. Lots of custom code to support Wikipedia Zero and analytics infrastructure. There's a lot of tech debt to solve before upgrading. This was on the roadmap for over a year. We hired a new ops engineer, who did the entire goal. We also got support from Luca - new hire on analytics team - to help unblock issues on analytics infrastructure. It got done in the end. We are following up on the goal with more migrations. Next qtr we should have everything running on Varnish 4.
Slide 11 other successes and misses (1 of 2)
We tend to have a lot of work outside of our goals so we listed accomplishments.
Slide 12 other successes and misses (2 of 2)
More accomplishments
Slide 13 Metrics
Main KPI: availability. We need everyone's help to keep this number up. Slightly lower this quarter. Several audiences. Strategy process was about connecting to them. Reader, Contributors, Partners, Donors.
Katherine: difference between partners?
Mark: movement partners are people accessing our xml dumps (people working with our data)
external partners - rest of the world
It's admittedly hazy, we could make that distinction clearer.
Questions?
Wes: Mark did a lot of work around annual planning this quarter while keeping the team working and 99% uptime. He did a really good job of making this work.
Mark: thank you. there's a lot to keep track of.
Geoff: You have service readers, and content contributors. How do you distinguish these groups?
Mark: e.g next week, we're going to have to make the site read-only. Editing stuff is downtime for editors, but not for readers. There could be
Katherine: Thank you, Mark. I understand work on switching over to Dallas is significant and it's important to call out judicious decision making and flexibility when it came to delaying the switchover test. Great planning and judicious decision making.
Release Engineering
Slide 14 Release Engineering
KPI - amount of time it takes to get stuff merged. YoY still going down.
Slide 15 Consolidate deploy tools
Consolidate deployment tools: Scap3. We didn't complete this. We realized it would be easier to migrate non-MW services, so this quarter we implemented a few features needed for migrating things. This should continue into next quarter.
Slide 16 retiring Gerrit
Not completed. We wanted to integrate Differential into CI.
A lot of consensus was built at the dev summit, but the RFC is still in progress.
Stretch goal: get 1 early adopter per team. We didn't reach this but got one early adopter.
Overall goals were too large for one qtr. Retiring gerrit is not a single qtr project.
Slide 17 reduce CI wait time
Nodepool migration. Not done for _all_ CI jobs, but did it for the npm ones. Other ones are in progress. User:Paladox helps a lot with this stuff.
Newer version of HHVM
Slide 18 other successes and misses
(read off long list on slide)
scap3 - necessary for large binaries
Slide 19 metrics
Phab upgrades every other week. Most people don't notice downtime from that.
MW SWAT deployments Mon-Thu. Rolling responsibility - less stress, we can shift that responsibility.
Slide 20 metrics slide 2
CI - trying to make this as self-serve as possible. A lot of churn in this. MW Selenium - two releases of that this quarter; test cleanup.
Slide 21 SPOF tracking / skill matrix
We keep track of all the things we are responsible for. Trend is very positive. There are still areas where we have SPOFs.
Questions for RelEng
Katherine: thank you for supporting your team. still learning this area. I liked the skill matrix, I really appreciated that.
Chad: we had an offsite prior to the Lyon hackathon. Pairing, making things a rolling . Greg? came up with it. just tracking
Rob: that has Greg's fingerprints all over it :-)
Katherine: are there plans to roll this over into next quarter?
Chad: scap and differential migration. lots of unknown unknowns. we have a lot of good momentum
Slide 22 Security
KPI 2.25 people. (Chris & Darian) Brian Wolff is a contractor working for us as well.
We made a little progress over last qtr on our KPI.
Security
Slide 23 2FA for CentralAuth
Chris: We had one goal: 2 factor auth on wikis.
The easy way to do this (we thought) was to build off Reading Infrastructure's AuthManager. AuthManager was not ready in time, so we moved ahead with integrating with CentralAuth directly.
We had more security bugs reported this qtr, so were interrupted more.
Goal: will be rolled out soon.
Slide 24 - other successes and misses
RFC on meta to up password requirements to 8 characters.
Missed one security review
Slide 25 metrics
Chris: we do a lot. we can't narrow focus because it's Security.
We're either getting worse at writing SW or better at reporting security flaws.
We're finding more issues that are of less severity. We did find one big security flaw.
Geoff: I know you all are working hard on this, but do we really know if we should be worried?
Chris: yes, we don't really know for sure
We could do better with security testing tools, we could have bug bounty programs. It's hard to say how much our threat is increasing. The amount of data we are storing is exploding so the impact of compromise is increasing.
Gabriel: do you have plans to reduce the impact of compromises?
Chris: yes, working next quarter on Auth Service, to better protect sensitive data if mediawiki is compromised (T120484)
Katherine: do we have more eyeballs on code? (we're getting more bugs)
Chris: I don't know.
Rob: One thing that would be nice is to do postmortems after a security problem is found so teams learn how to write SW without security issues. Right now Chris does review at the end of the process. By incorporating learning from mistakes we'll get better.
Geoff: Do designers have security knowledge?
Rob: not necessarily. all teams would tell you that having more people involved early would be better. Should we have more people involved at the start, yes but that's a balance.
Wes: some of these things are addressed in the annual plan.
Chris: some PMs I talk with regularly, some I don't, it varies drastically.
Chris: We're only able to complete a portion of the security reviews requested each quarter. Sometimes what we have to drop is issues from the community.
We work with a lot of other teams as well.
Slide 26 metrics part 2
(redacted)
Slide 27 bug counts
(redacted)
Q&A
Geoff: we seem to need more here
Chris: training, scanning, we need people in the middle
Gabriel: changing infrastructure can help
Ori: we're going to be tied to MW for the foreseeable future
Chris: there are things we can do
Ori: two suggestions: bug bounty programs? There should be strong collective voice to discourage unnecessary collection of data. we could get by with a lot less data.
Chris: bug bounties: totally agree it could be a great method (redacted conversation).
Services
Slide 28 - Services
Gabriel: 4 people. Core tasks. Usually we experiment more. KPIs: total requests to REST API. Refined the metric.
Mark: you are only testing the Varnish layer at the moment
Gabriel: yes, we need to look at that
Slide 29 REST API request rates
Gabriel: Increasing traffic: Increase due to Android app rollout
Slide 30 - Goal: REST API expansion and documentation
Gabriel: Documentation index page created. Service template created, used in mobile content services. API policies.
Second sub goal: building out the API, targeting high traffic endpoints.
Integrating better with caching. Created RFC for versioning
Issue we ran into repeatedly: lots of APIs request a specific size for thumbnails. We need to let the client select the size. We have an RfC on this.
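As a rough illustration of letting the client select the thumbnail size: one possible shape (an assumption for this sketch, not the design in the RfC) is for an API response to carry a URL template with a width placeholder that the client fills in. The `{width}` placeholder, the function name and the example.org URL are all hypothetical.

```python
def thumbnail_url(template: str, width: int) -> str:
    """Fill a hypothetical '{width}' placeholder with the client's chosen width."""
    if width <= 0:
        raise ValueError("width must be a positive number of pixels")
    if "{width}" not in template:
        raise ValueError("template has no {width} placeholder")
    return template.replace("{width}", str(width))


# The API would return the template once; each client picks its own size.
template = "https://example.org/thumb/{width}px-Example.jpg"
phone_url = thumbnail_url(template, 320)
desktop_url = thumbnail_url(template, 1280)
```

With this shape the same cached API response can serve clients with different screen sizes, since the size choice moves out of the response body.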
Slide 31 - REST API documentation (Swagger) screenshot
Gabriel: Swagger specs drive this. Automatically generates this, creates a sandbox.
Slide 32 - Scaling storage
Gabriel: We are storing HTML for all Parsoid. It takes a lot of space and the cost benefit algorithm is not there. We noticed people edit one line at a time. There is a lot of repetition. We experimented with compression and found one algorithm - Brotli - that has a large window and compresses by a factor of 5. It also saves CPU, but increases mem usage. We built a patch for Cassandra and are working on upstreaming it. So we could store the entire history in HTML.
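The effect of window size on near-identical revisions can be sketched with the standard library. This is only an illustration under stated assumptions: it uses zlib (DEFLATE) as a stand-in for Brotli, synthetic random "HTML" instead of real Parsoid output, and hypothetical helper names. It does not reproduce the factor of 5 reported in the meeting or the Cassandra patch.

```python
import random
import zlib

ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def random_line(rng: random.Random) -> str:
    """One synthetic paragraph of 60 random characters."""
    return "<p>" + "".join(rng.choice(ALPHABET) for _ in range(60)) + "</p>"


def make_revisions(n_revisions: int = 20, n_lines: int = 200, seed: int = 0) -> list[str]:
    """Build revisions of one page where each edit changes a single line."""
    rng = random.Random(seed)
    lines = [random_line(rng) for _ in range(n_lines)]
    revisions = []
    for _ in range(n_revisions):
        lines[rng.randrange(n_lines)] = random_line(rng)
        revisions.append("\n".join(lines))
    return revisions


def compressed_size(data: bytes, wbits: int) -> int:
    """Compress with a window of 2**wbits bytes and return the output size."""
    compressor = zlib.compressobj(9, zlib.DEFLATED, wbits)
    return len(compressor.compress(data) + compressor.flush())


# All revisions stored back to back, as one blob.
blob = "\n".join(make_revisions()).encode("utf-8")

# A 512-byte window cannot reach back to the previous revision (about 13 KB
# earlier), so it cannot exploit the repetition between revisions.
small_window = compressed_size(blob, 9)
# A 32 KiB window spans more than one whole revision, so unchanged lines
# become back-references to the previous copy.
large_window = compressed_size(blob, 15)

print(len(blob), small_window, large_window)
```

The same reasoning is why a compressor with a larger window helps when real pages are bigger than DEFLATE's 32 KiB limit: the previous revision has to fit inside the window for the repetition to be found.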
Second was moving latest Content API to the edge (closer to user). We introduced a new storage format and increased throughput of API.
Right now when you load an older version of a page, it's rendered on demand because it is not stored. Storing this would reduce the latency to get this.
Maggie: what about being able to view the history because the community is interested in attribution?
Gabriel: this will help.
Slide 33 - 3rd objective: Reliable event production & change propagation
Gabriel: Propagating events when editing. Orchestrating everything that needs to change as the result of a change being made.
Job queue is not the most reliable system right now.
This qtr we finished EventBus and made it multi datacenter ready. We can also handle failover.
The second half: making event production reliable is delayed.
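The idea of change propagation described above can be sketched as a small in-memory publish/subscribe loop with retries. This is a toy model under stated assumptions: the class, method and topic names are hypothetical and do not reflect the actual EventBus API or its transport.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class ChangePropagator:
    """Toy in-memory event bus: handlers subscribe to a topic, edits publish events."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, event: dict, max_attempts: int = 3) -> list[Handler]:
        """Deliver the event to every subscriber, retrying failed handlers.

        Returns the handlers that still failed after max_attempts tries.
        """
        failed = []
        for handler in self._handlers.get(topic, []):
            for _ in range(max_attempts):
                try:
                    handler(event)
                    break
                except Exception:
                    continue
            else:
                # No attempt succeeded; report it instead of dropping it silently.
                failed.append(handler)
        return failed


# Example: an edit should trigger a (hypothetical) cache purge for the page.
purged = []
bus = ChangePropagator()
bus.subscribe("page-edit", lambda event: purged.append(event["title"]))
failed = bus.publish("page-edit", {"title": "Example", "rev": 2})
```

Returning the failed handlers makes delivery problems visible to the caller, which is the kind of reliability concern raised about the job queue.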
Slide 34 Other successes and misses
Multi-DC support
A lot was built for this already. straightforward to do the failover testing. Latency increased, but less than 100ms. There were no user issues.
Reliable deploys and testing
good test coverage and good deploy scheme. So we didn't have any outages that we're aware of.
API result format versioning
Lets us move forward without breaking things.
Slide 35 workflows & metrics
Lots of guiding and mentoring. e.g. Math work is close to ready (already an option)
Trying to help other teams get the skills they need. Citoid. Cassandra pageview API.