Wikimedia monthly activities meetings/Quarterly reviews/TechOps/August 2014

The following are notes from the Quarterly Review meeting with the Wikimedia Foundation's Tech Ops team, August 28, 2014, 9:30AM - 11AM PDT.

Present: Jeff Gage, Lila Tretikov, Rob Lanphier, Erik Moeller, Toby Negrin, Tilman Bayer (taking minutes), Rob Halsell

Participating remotely: Alexandros Kosiaris, Andre, Andrew Bogott, Andrew Otto, Chase Pettet, Faidon Liambotis, Filippo Giunchedi, Giuseppe Lavagetto, Greg Grossmeier, Marc-Andre Pelletier, Mark Bergsma, Daniel Zahn (from 10:20)

Please keep in mind that these minutes are mostly a rough transcript of what was said at the meeting, rather than a source of authoritative information. Consider referring to the presentation slides, blog posts, press releases and other official material instead.

Presentation slides from the meeting


Mark:
Welcome to our first ever quarterly review
I had attended quite a few for other teams, found them useful
This time, kind of looking back at last 6 months too
Introducing Team: ...
Staffing:
Backfilled 3 positions
new Dallas contractor (physical maintenance on the ground)
Faidon missing this q :(
Looking at some more hires
I (Mark) spread a bit thin now, tech project manager hire will help
ops security engineer: hard to find, might need to train someone internally
Yuvi has been very active in Labs as a volunteer, joining the team in October
Had our first offsite meeting in Athens - very positive experience; team had had some difficult times before
team-building, hacking
e.g. Graphite, improve monitoring, Heartbleed firefighting, talk about processes. And lots of food, organized by Faidon ;)
want to do it again next year
Problem had been isolation from other teams
Skeptical of the idea of an embedded op (e.g. single point of failure)
but we established a liaison system - main contact person for other teams:

  • Brandon (for Zero and Mobile) - our Varnish infrastructure is complicated
  • MW Core: Faidon, now Giuseppe
  • Analytics: Andrew O (from that team), Gage
  • RE/QA: Andrew B
  • Services: Faidon - expect to need a lot of communication with that team in the future

Erik: [Analytics - ] Toby?
Toby: has been good to have Andrew working with the team
Concern about relying on a single person - getting Gage up to speed has been useful, also Christian from my team
Mark: Liaison is not meant to do all the work, but to lead communications
for actual project work, pick from team
Liaison is primary contact
Toby: OK, but still they have a lot of knowledge that's not well diffused
Mark: in case of Fundraising, other people have been helping out, want to extend that further
Toby: Great that you guys attend Scrum of Scrums
Mark: yes, that has had positive impact
Faidon and Andrew O attended (now only Andrew O)
that already removed blockers
Caution: normally SoS is only used for immediate blockers, not for longer term planning
Ops can't always be as agile as you would like

Phabricator deployment
move RT there, as much as possible

projects (slide)
puppet upgrade done (very smoothly by Giuseppe)
Hiera: put data about our systems into Puppet data tree
Labs migration to eqiad (delayed because contractor disappeared)
Reduce Tampa (pmtpa)
data center RFP took a long time (18 sites visited)
Dallas: great price/infrastructure/location
Dallas buildout now happening
ulsfo - unexpected floor move, but seamless
admin user accounts management revamp in puppet
new data retention policy implementation (90 day period), Ariel working on checking that
Lila: how is Dallas set up?
Mark:
depends on service - where we can, we deploy active/active
MediaWiki not suitable for that though, so active/standby
Lila: what happens if primary datacenter fails, how long does the switchover take?
Mark: earlier it would have taken anywhere from 1h to a day; did work on it since then, now still about 1h
should get quicker still
Lila: when planning first disaster recovery drill?
Mark: next q
after that, every q
Lila: how about replication, ...
Mark: nearly instant
for backup, 8h dumps
24h for...
Lila: what are we doing CDN-wise?
Mark: not quite doing that, we have very specific requirements about e.g. purging
our caching infrastructure is fairly small
Hoping to set up another (lightweight) caching center in Asia next year[?]
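A hedged illustration of the purging requirement mentioned above: every edit has to invalidate the cached copies of that page on the frontend caches. A minimal sketch of one common mechanism, an HTTP PURGE request sent to a cache box (hostnames and paths here are made up; the actual Wikimedia purge pipeline is more involved):

  import * as http from "node:http";

  // Ask one frontend cache to drop its copy of a page. The Host header carries the
  // site name, since it is part of the cache key.
  function purgePage(cacheHost: string, site: string, path: string): void {
    const req = http.request(
      { host: cacheHost, port: 80, method: "PURGE", path, headers: { Host: site } },
      (res) => console.log(`PURGE ${site}${path} -> ${res.statusCode}`)
    );
    req.on("error", (err) => console.error("purge failed:", err.message));
    req.end();
  }

  // e.g. purgePage("cp1001.example.wmnet", "en.wikipedia.org", "/wiki/Main_Page");

Every edit needs such an invalidation fanned out to each cache server holding a copy, which is part of what makes the requirement hard to hand off to a generic CDN.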
Gage: ulsfo dropped latency for Asia by 100ms
Toby: blog post https://blog.wikimedia.org/2014/07/11/making-wikimedia-sites-faster/
Lila: just speed of light is such a limiting factor to e.g. India
Mark: all our data centers are connected via private links
Lila: how many providers going into our dcs?
Mark: about 4
Lila: do we (peer)?
Mark: yes, have done for a long time[?]
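A rough back-of-the-envelope for the speed-of-light point above, assuming a ~12,500 km great-circle distance from the US west coast to India and light travelling at roughly two thirds of c in fibre (real fibre paths are longer and add routing hops):

  $ t_{\mathrm{RTT}} \gtrsim \frac{2d}{c/1.5} \approx \frac{2 \times 12{,}500\ \mathrm{km}}{200{,}000\ \mathrm{km/s}} \approx 125\ \mathrm{ms} $

That physical floor is part of why a caching point of presence closer to Asian users matters for those readers.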

(Mark:)
OSM - moved over with Toolserver shutdown
MariaDB 10 upgrade - some interesting new features which e.g. help with switchover
HTTPS PFS (perfect forward secrecy) - done by a volunteer with support from Giuseppe
was meant for Q2 as part of general SSL revamp, now done earlier, world is happy
virtualization - had trouble resourcing this, had to put on hold (e.g. Chase had to work on Phabricator instead)
Greg: does that mean that virt Dallas will happen later?
Mark: later and in slimmer form
not an immediate problem because we have quite a few from Tampa still
monitoring revamp: have not yet been able to follow up on Athens
Yuvi is doing similar things for Labs, might be able to profit from that

Projects - externally driven:

  • HHVM - going well. new app servers also upgraded to Trusty

RobLa: agree
(Mark:)

  • Phabricator: difficult to implement security levels, but progress, still hope to make deadline
  • Analytics: get Hadoop infrastructure up, needs two engineers right now

Toby: Kafka already up, replacing Webstatscollector, ... could revisit staffing soon

  • Toolserver migration to Tool Labs was a lot of work for the team

completed, users seem fairly happy

  • Elasticsearch: last year, Ops was not able to take that on, so Platform did

RobLa: appreciate the recent help on search
decision needed: optimizing code vs. throwing hardware at it
back of the envelope: 20 servers needed
Mark: we are definitely throwing hardware at it, budgeted
Toby: Growth team is using search for task suggestions, not sure how much load
search is super useful
(Mark:)

  • PDF rendering:

had a service from an external company that we could not manage at all
(still two servers in Tampa we can't touch/fix)
new service has mostly been written, but not activated by default yet (deployed before Wikimania, still failing on ~40% of PDFs, and Matt Walker has since left)
want to shut down Tampa in 4 weeks
Erik: will ping Scott, I think he found that 40% bug already
Mark: we are still paying a lot of money for Tampa ;)

Ongoing Ops responsibilities
A lot of these
about 50% of our workload

Tampa/Dallas migration
next week: Tampa backup data migrated to Dallas, complete in 4 weeks

Budget (haven't seen this in other teams' quarterly reviews, but it's significant for us:)
Migrating Tampa is actually quite a large net benefit
almost $50k/month already
Total net savings will be $49k/month
(and this wasn't even the primary reason for the migration)

Capital expenditures:
April-July: $620k in purchases for the new dc (waited for a tax reduction); more soon, but not too much before we need them
Lila: reusing servers?
RobH: we brought over about 300 servers from Tampa (50%)
Lila: how often do we replace servers?
Mark: on large clusters, we can fix easily, have spare parts: 5-6 years
on smaller ones, sometimes 3, but normally 4-5 years
for the migration, only took those which still had 6-8 months of warranty left, otherwise not worth it

Performance and uptime metrics:
not too great
navigation timing for desktop sites (slide)
set up by Ori 1.5y ago. great data
fairly stable, median 2000ms
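A rough sketch of the kind of client-side measurement behind these navigation timing numbers, using the browser's Navigation Timing API; the field names come from that API, but the sampling and reporting details are assumptions, not the actual instrumentation:

  // Read the browser's Navigation Timing data for the current page view.
  const t = window.performance.timing;
  const nav = {
    responseStart: t.responseStart - t.navigationStart, // time to first byte
    domComplete:   t.domComplete   - t.navigationStart,
    loadEventEnd:  t.loadEventEnd  - t.navigationStart, // full page load
  };
  // A sampled fraction of page views would report numbers like these back to the
  // servers, which aggregate them into e.g. the ~2000ms median on the slide.
  console.log(nav);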
outside monitoring: Watchmouse (in addition to our own Icinga)
they also do perf metrics, but these are not clearly defined and have bugs
they report average page load time
This means what an anon user will see on our main sites (but not e.g. Bugzilla)
Toby: this (Watchmouse perf for enwiki main page) is 10% of the other times?
Gage: yes, suspicious
Mark: we have seen reported times faster than the speed of light would allow ;) Really don't trust it too much
Lila: What is our cache hit ratio?

Mark: Faidon did a project with someone at RIPE, they have lots of worldwide probes https://blog.wikimedia.org/2014/07/09/how-ripe-atlas-helped-wikipedia-users/
helps decide which dc to use
Amsterdam most suitable for Europe
we can use these probes to get additional data
Measure latency from our users to our network
This should help a lot, and it's basically free

Incidents: have gotten a lot better at (documenting) that, helped along by Greg
Also doing followups now, quarterly incident review meeting, already led to improvement
Seeing very few outages in last few months (partly due to lower summer traffic)
Lila: effective uptime?
Mark: couldn't get decent single number from Nimsoft
Lila: looking for total minutes of downtime
probably separated out for the large Wikipedias etc.
Mark: more complicated, e.g. downtime for specific pages only
Erik: let's find something else than Nimsoft
Keynote?
Mark: hoped Gage would
old server crashed, lost data
Faidon: can get 95% of data from Nimsoft
Lila: I think about uptime as user-perceived uptime
which cluster failed etc. might be important for team, but not for user
Erik: it's not even Ops only, could also be app failure
Lila: set up a threshold at which perf degradation counts as a failure
Mark: it's complicated to define
Lila: let's get the best possible monitoring solution, it's central for us. agree it needs careful preparation/definition
Mark: will work with Platform and maybe Analytics too
Erik: yes, Ori also started working on alert system for e.g. deploys that are causing perf issues
Toby: start with HHVM
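For context on the uptime question above, a quick illustration of how minutes of downtime map onto an uptime percentage (numbers purely illustrative):

  $ \mathrm{uptime} = 1 - \frac{\text{downtime minutes}}{\text{minutes in period}} \approx 1 - \frac{43}{30 \times 24 \times 60} \approx 99.9\% $

So a "three nines" month allows roughly 43 minutes of total downtime, before even deciding whether a partial or page-specific outage counts.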

(Mark:)
This q: site operation goals

  • IPSec for private data over cross dc links
  • Trusty upgrade, assist with HHVM

...

This q: Labs goals
...

  • backups of user data

Next q: site ops goals

  • core services at codfw (near end of q)
  • upgrade Varnish - mostly compatible, but need to port over some of our custom code (Brandon)
  • codfw Fundraising infrastructure - don't expect problems there

...

Next q: Labs goals

  • also spread Tool Labs over both dcs (as done with Tampa and Ashburn)
  • ...

new data center (codfw)
migrate wikis - ability to serve from codfw
RobLa: what's the goal here?
Mark: ensure wikis can run from codfw, test
don't expect a lot of work from your team, but some things
RobLa: ok
Mark:
migrate remaining services like Bugzilla - many components, long-term project

Project: metrics
need better perf metrics
Toby: Analytics has been thinking about A/B testing as well, good opportunity for collaboration
let's talk about this in September
Mark: thinking about this in terms of team's strategy
Erik: some of the tested features only in JavaScript (i.e. not impacting Ops at all), but others affect caching etc.
Mark:
metrics gathering structure needs revamp - Graphite so far, needs scaling
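For reference, a minimal sketch of Graphite's plaintext carbon ingestion protocol, "<metric path> <value> <unix timestamp>" sent to the carbon listener (default port 2003); the host and metric name below are invented:

  import * as net from "node:net";

  // Send one data point to Graphite's carbon listener using the plaintext protocol:
  // "<metric path> <value> <unix timestamp>\n"
  function sendMetric(host: string, path: string, value: number): void {
    const socket = net.createConnection({ host, port: 2003 }, () => {
      socket.end(`${path} ${value} ${Math.floor(Date.now() / 1000)}\n`);
    });
    socket.on("error", (err) => console.error("graphite send failed:", err.message));
  }

  // e.g. sendMetric("graphite.example.wmnet", "frontend.navtiming.loadEventEnd.median", 2000);

Each sampled measurement becomes one such line, which is roughly where the scaling pressure mentioned above comes from.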
Lila: want to separate general data collection / instrumentation and instrumentation for rollouts
e.g. can we roll out VE to 100%, performance-wise
RobLa: Platform working on such things

(Mark:)
Project: monitoring
incident monitoring: replace Icinga

Discussion

Lila: very informative, thank you
Erik: thanks in particular for the details on metrics

Erik: blockers, things that the org can do for you?
Mark: only small requests, not for this venue
Erik: HR support?
Mark: fairly OK - recent weeks with Wikimania and Jobvite migration have been a bit slow, but no complaints