Wikimedia monthly activities meetings/Quarterly reviews/Release Engineering/January 2015

The following are notes from the Quarterly Review meeting with the Wikimedia Foundation's Release Engineering and Quality Assurance (RelEng/QA) team, January 23, 2015, 09:00 - 09:30 PST.

Please keep in mind that these minutes are mostly a rough transcript of what was said at the meeting, rather than a source of authoritative information. Consider referring to the presentation slides, blog posts, press releases and other official material.

Present (in the office): Mark Bergsma, Faidon Liambotis, Greg Grossmeier, Tilman Bayer (taking minutes), Rob Lanphier, James Forrester, Chris Steipp, Željko Filipin, Antoine Musso, Erik Moeller, Toby Negrin, Arthur Richards, Chris McMahon, Mukunda Modell, Kunal Mehta; participating remotely: Dan Duvall, Chad Horohoe

Meeting overview

Agenda:
Team intro - 3 minutes
What we said we would do/what we did - 8 minutes
What we learned - 2 minutes
Metrics and callouts - 3 minutes
What’s next - 5 minutes
Asks - 5 minutes

Team intro

Greg: Welcome

slide 3

Team

What we said/did

became a team (formally) in July

slide 6

Phabricator: Mukunda's main work for last quarter(ish)
Jenkins perf improvements: not quite done yet, but picked low-hanging fruit
Beta cluster monitoring: done, even though we still need more metrics (for uptime - but have indirect data)
Yuvi set up Shinken in Labs
...
Ops supported that a lot
Toby: interested in how long it now takes to run jobs
Greg: if we restart Jenkins, it has to re-read entire history (Antoine: sometimes), so we limit to 30 days' logs
have data about job performance, but we don't display it yet
Toby: comfortable about hitting those targets?
Greg: not quite
Toby: we [Analytics] have some lightweight graphing things, feel free to ping us on IRC

slide 7

browser tests, workshops (as a scalable alternative to earlier 1:1 coachings)

Erik: how many participants?
Greg, Zeljko: Later today we'll have 10 (15-20 signing up)

best practices/getting started documentation
EAL: Dan, with support from Chris, Antoine and Zeljko[?]

will give us cleaner code

slide 8

stretch goal
Optimize Vagrant memory usage (to help people run it on their laptops): not done

slide 10

What we learned

we are a fairly reactive service team, with good reaction times
flipside: longer-term projects get pushed back, as usual for such teams
improving on that with help from team practices group (TPG)

Metrics and other key accomplishments

slide 13

Participated in scorecard exercise

slide 14

browser test maintenance/growth

slide 15

maintaining and cleaning them is a big part of our work
open questions about the right way to maintain these
unclear ownership/escalation process
if a build breaks, we would like that team to be responsible for fixing it
this is a cultural change, a team practices thing
ChrisS: code coverage or these?
ChrisM: I like to talk about feature coverage instead
it's highest for Mobile, second: VisualEditor, 3rd: Flow
also significant for Echo, UploadWizard...
Erik: Fixing UW issues is main focus of Multimedia team, so now is a good time to raise browser test ownership
ChrisS: of our 140 extensions, how many have them?
ChrisS, Zeljko: maybe 20

slide 17

same graph, I know ;)
as indicator for beta cluster uptime, which is pretty bad
ChrisM: majority of that red part is builds that target test2wiki
RobLa: everything in red should be either fixed or killed

What's next

slide 20

Greg: Focus

Beta cluster stability

running nightly
Toby: Do you own this? What if these nightly tests break?
Greg: then it doesn't deploy
Toby: shouldn't take a metric I can't control as my success metric
Greg: green means ...[?]
We don't criticize test quality in public now, but in future, we might call people out for red
Zeljko: On Jenkins, it's public right now
Greg: but nobody looks ;)
Ops will soon make Swift cluster available in Labs
this will help a lot - right now it's a significant difference
Faidon: that's the long term fix, but we are also doing shorter term fixes
(Greg:)

MW releases: As announced earlier this week, we ceased the relationship with Mark and Markus, taking these on ourselves again

security: a bit more load on ChrisS

Isolated CI (continuous integration) instances

ChrisM: JamesF suggested that, we looked at it and it seems very feasible
Greg: I think this will improve processes a lot
This is a two-quarter goal (~July)
Toby: could we find an intermediate goal?
Antoine: this quarter it's about architecture - my goal (for this quarter): have something we all agree on, so we can start implementing
Greg: this will be a topic at dev summit next week
Antoine: I'm open to changes
(Greg:)

Team practices:

Ongoing, with Arthur's team
offsite tomorrow

Asks

slide 22

want TPG to support (another) offsite around hackathon
Ops buy-in
Get culture buy-in for testing from across Engineering

Erik: this varies across teams, start with those that are furthest behind
on goals, I second Toby's point
question: how well is team understanding happiness/needs of rest of org?
Greg: right now, via anecdotes
want to mimic what we did with Vagrant and do a survey
ChrisM: have sense of mission inside team, but to the outside it's less clear
I think we need to have that story before survey
RobLa: some of the disparity is because we don't have data
e.g. some people say beta cluster does fine, some say it's terrible
Erik: that's fair, but also want to know if people are happy with browser test infrastructure, and what their needs are
Arthur, can you help with the survey?
Arthur: we can at least talk about it ;)
it might be a nice side project, as we talk about these things anyway in the team health check surveys