IRC office hours/Office hours 2012-07-30

Day: Monday
Date: 2012-07-30
Time: 19:00 UTC until 20:00 UTC
Location: #wikimedia-analytics
With: WMF Analytics team
Topic(s): Analytics and statistics related questions about Wikipedia and the other Wikimedia projects, and about the new "Kraken" platform

(Timestamps are in UTC-7.)
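
The announced time is given in UTC while the log stamps are in UTC-7, so a reader comparing the two has to shift each stamp by seven hours. A minimal Python sketch of that conversion follows; the function name `log_time_to_utc` and the default year are illustrative assumptions, since the log lines themselves carry no year.

```python
from datetime import datetime, timedelta, timezone

# Offset taken from the note above: the log's timestamps are in UTC-7.
LOG_TZ = timezone(timedelta(hours=-7))


def log_time_to_utc(stamp: str, year: int = 2012) -> datetime:
    """Parse a stamp such as 'Jul 30 12:00:31' and return it as a UTC datetime.

    The year is an assumption supplied by the caller (the log omits it).
    """
    # Parse the year together with the stamp so the date is complete.
    local = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    # Attach the fixed UTC-7 offset, then convert to UTC.
    return local.replace(tzinfo=LOG_TZ).astimezone(timezone.utc)


if __name__ == "__main__":
    # First line of the log; should land at the announced 19:00 UTC start.
    print(log_time_to_utc("Jul 30 12:00:31").isoformat())
```

A fixed offset is used rather than a named time zone because the note only states "UTC-7".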

Jul 30 12:00:31 <HaeB> Hi all, welcome to the first ever office hour of the WMF Analytics team :)
Jul 30 12:00:31 <HaeB> I'm Tilman Bayer from the WMF communications team, just helping to facilitate this a bit.
Jul 30 12:00:31 <HaeB> So hopefully everybody has seen their recent blog post and its exciting announcements
Jul 30 12:01:46 <drdee__> Hi everyone, my name is Diederik and I am the Analytics product manager
Jul 30 12:02:33 <dschoon> <-- Dave Schoonover
Jul 30 12:02:36 <ezachte> Hello, my name is Erik Zachte. Data Analyst for WMF.
Jul 30 12:02:38 <peteforsyth> hi drdee !
Jul 30 12:03:00 <ottomata> Hi all!
Jul 30 12:03:09 <dschoon> I do data alchemy and software engineering. It's good times.
Jul 30 12:03:13 <ottomata> I'm Andrew Otto, a Systems Engineer for Analytics
Jul 30 12:03:41 <HaeB> (as always, you can find more info on the team members at )
Jul 30 12:04:06 <robla> I'm Rob Lanphier. This clip describes what I do best:
Jul 30 12:04:28 <HaeB> so we thought this should be a good opportunity to ask questions about what to expect from the new analytics framework the guys are setting up
Jul 30 12:04:53 <dschoon> (as well as anything else analytics-related that you might have questions about.)
Jul 30 12:05:04 <HaeB> but also, whether a particular piece of data that you are interested in is available, and where to find it
Jul 30 12:05:30 <HaeB> who wants to shoot the first question?
Jul 30 12:06:34 <WereSpielChqrs> Will this give us better stats on the size of our active admin community? On EN wikipedia we have had a dwindling admin cadre for some years and it would be helpful to know how many are active as admins
Jul 30 12:07:00 <WereSpielChqrs> Also how well we cover the 24 hour clock
Jul 30 12:07:35 <drdee__> WereSpielChqrs, what do you mean by how well we cover 24 hour clock?
Jul 30 12:07:35 <jeremyb> how much of analytics hours/hardware/etc. are used for fundraising vs. reportcard vs. other kinds of analytics?
Jul 30 12:07:42 <jeremyb> or does fundraising do their own?
Jul 30 12:08:00 <WereSpielChqrs> Things like vandal blocking require admins to be available and it would be good to know how often we have gaps
Jul 30 12:08:17 <drdee__> jeremyb: the fundraising analytics is part of the fundraising team and so it has its own budget
Jul 30 12:08:24 <dschoon> jeremyb: absolutely. we aim to vastly increase the amount of instrumentation we collect and process
Jul 30 12:08:34 <jeremyb> drdee__: but its own people?
Jul 30 12:08:46 <jeremyb> dschoon: huh?
Jul 30 12:08:48 <ezachte> how many are active as admins:
Jul 30 12:09:06 <Katja_WMDE> i also have a report card related question: is it possible to get numbers for the different language versions instead of continents?
Jul 30 12:09:33 <peteforsyth> Question: It seems to me that there are many kinds of questions that would require data to answer. Is there (or is there planned) a wiki page to collect/curate questions?
Jul 30 12:09:34 <dschoon> well, i was commenting on processing admin-related stats, but it extends to anything we're *not* actually processing right now.
Jul 30 12:10:09 <ezachte> the link I posted lists privileged people per wiki, not per region
Jul 30 12:10:42 <peteforsyth> In connection with that last question: it would be nice to know how the Analytics Team is defining the kind of questions it seeks to answer, as well as how much emphasis it will place on the needs of Wikipedians, of researchers, of readers etc. that are outside the WMF's strategic goals.
Jul 30 12:11:11 <drdee__> peteforsyth: there are a number of pages where we collect questions that we want to be able to answer:
Jul 30 12:11:14 <drdee__> mobile:
Jul 30 12:11:24 <drdee__> glam:
Jul 30 12:11:41 <dschoon> Katja_WMDE: ezachte is the gent who supplies the data that powers the graphs on -- we'll certainly be expanding the graphs available there over time. erik can comment more about those datasets if necessary.
Jul 30 12:11:49 <drdee__> we also started a glossary of definitions:
Jul 30 12:12:11 <WereSpielChqrs> @ ezachte, that is cool, but if our active admins are all in the US evening or simply don't use the tools that doesn't tell us that. We know that a few very active admins do a disproportionate amount of deleting and blocking, and that many active admins rarely use the tools
Jul 30 12:12:14 <Katja_WMDE> thanks. we can't really work with numbers that only refer to europe in general
Jul 30 12:12:54 * jeremyb points to and in particular
Jul 30 12:12:58 <jeremyb> how much of that should be solved by FR vs. general centralnotice dev. vs. change by analytics team or new data from e.g. kraken
Jul 30 12:13:01 <ezachte> peteforsyth: excellent point, I think for our new data cluster it is crucially important to get the metrics sharply defined from the start so we don't see definitions shift over time (as has happened a few times so far)
Jul 30 12:13:32 <Katja_WMDE> dschoon: can you also point me in the direction where we can access numbers for the german wikipedia that show us unique users instead of page views?
Jul 30 12:13:34 <ezachte> the glossary is mostly in concept phase, but we will get there :-)
Jul 30 12:13:40 <HaeB> (just to clarify for other readers: of course there are already numbers available per language version of wikipedia, e.g. for the german wikipedia at , i guess katja referred to the language of readers)
Jul 30 12:13:56 <peteforsyth> thanks for those links drdee__ and ezachte !
Jul 30 12:14:09 <Katja_WMDE> thanks, HaeB, i also referred to a nice graphic solution :)
Jul 30 12:14:35 <drdee__> WereSpielChqrs, your use case is very specific and as such I do not think it's very high on our list, i think this could probably be addressed faster using a tool server script
Jul 30 12:14:47 <dschoon> also related to jeremyb's earlier question: fundraising *presently* does its own work, but we aim to make our resources available to them once the platform is stable enough to be relied on. fundraising requires bulletproof availability and data-integrity, so it'll likely be a while (if ever) before fundraising cuts over.
Jul 30 12:14:50 <ezachte> WereSpielChqrs: you're right, actually I've never seen this request before but of course it makes total sense
Jul 30 12:15:18 <dschoon> Katja_WMDE: I don't think that data has been made available to Limn yet, but ezachte would know for sure.
Jul 30 12:15:28 <jeremyb> dschoon: well what about the other way around then? :) you adopt something that was made for FR?
Jul 30 12:15:43 <drdee__> jeremyb: like what specifically?
Jul 30 12:15:47 <dschoon> jeremyb: well, the goal is to create a computational platform.
Jul 30 12:15:58 <jeremyb> drdee__: well idk. see my diff link? :)
Jul 30 12:16:03 <ezachte> till 2 years ago we did not have any ip->region lookup, now we do, but mostly for ip addresses from squid log (Glob Dev did some work to track editor names per region)
Jul 30 12:16:04 <WereSpielChqrs> @ Erik, It may well be a new requirement. As the number of active admins falls so it becomes important to know how often we have gaps
Jul 30 12:16:29 <dschoon> jeremyb: so jobs written to process fundraising-related data would run on the platform. they're not engaged in infrastructure work, so it wouldn't make sense to see things flowing in that way.
Jul 30 12:17:01 <Katja_WMDE> ezachte: will you include the new data into the stats in the future?
Jul 30 12:17:17 <dschoon> jeremyb: that's why i say the goal is to make our resources available in a way that meets everybody's needs. eventually that should involve trusted members of the community as well, much as we do with toolserver.
Jul 30 12:17:19 <HaeB> (Limn is the visualization toolkit that the team wrote to turn the numbers in to the graphs you can see at )
Jul 30 12:17:52 <dschoon> the source for Limn is available here
Jul 30 12:18:05 <jeremyb> dschoon: sure. but if they have something that's more robust or reliable than what you do then you could copy what they did?
Jul 30 12:18:07 <WereSpielChqrs> @Drdee - all wikis rely on admins and I'm fairly sure there is a widespread issue about declining numbers of available admins. Some wikis have had to have interventions from global sysops
Jul 30 12:18:20 <ezachte> Katja_WMDE and soon there will be more in dashboard from Global Dev
Jul 30 12:18:31 <Nemo_bis> hello Nikerabbit
Jul 30 12:18:43 <Nemo_bis> Nikerabbit: do you want the scrollback?
Jul 30 12:19:08 <drdee__> jeremyb: looking at your diff, it seems that the fundraising team needs more the ability to track behaviours while we initially focus on counts
Jul 30 12:19:12 <dschoon> jeremyb: they're operating at a different scale, so that's unlikely. we're planning to be able to process and store 130% of all events that currently occur in all wiki projects. that's every commons request, every edit, every view. it includes internal mediawiki statistics and mobile app instrumentation.
Jul 30 12:19:39 <jeremyb> dschoon: even upload.wm.o ? wow
Jul 30 12:19:56 <dschoon> jeremyb: the target is ~200,000 events/second.
Jul 30 12:20:05 <ezachte> Katja_WMDE you can find most current reports in portal expect new links to be added there
Jul 30 12:20:06 <drdee__> WereSpielChqrs: i do not dispute the relevance, but you were asking specifically whether admins cover the 24 hour clock and that is a very specific question which is probably better answered using different sources.
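
The question of whether admins "cover the 24 hour clock" can be approximated from timestamps of admin log actions such as blocks and deletions. The Python sketch below is only an illustration of that idea: the helper names and the sample timestamps are made up, and nothing here describes an existing toolserver script or a way of fetching the log entries.

```python
from collections import Counter
from datetime import datetime


def actions_per_hour(timestamps):
    """Count admin actions for each UTC hour of the day (0-23)."""
    counts = Counter(ts.hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]


def uncovered_hours(timestamps, minimum=1):
    """Return the hours of the day with fewer than `minimum` actions."""
    per_hour = actions_per_hour(timestamps)
    return [hour for hour, n in enumerate(per_hour) if n < minimum]


if __name__ == "__main__":
    # Hypothetical sample data: three admin actions, assumed to be in UTC.
    sample = [
        datetime(2012, 7, 30, 1, 15),
        datetime(2012, 7, 30, 1, 40),
        datetime(2012, 7, 30, 14, 5),
    ]
    print(uncovered_hours(sample))
```

In practice the input would cover weeks or months of log entries, and `minimum` could be raised to flag hours that are thinly rather than completely uncovered.
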
Jul 30 12:20:10 <dschoon> uncompressed, it's probably around 250G/hour
Jul 30 12:20:26 <dschoon> as we say: big data.
Jul 30 12:20:53 <dschoon> (of course, for reference, Facebook generates ~10TB/half-hour. which is just mindbreakingly huge.)
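
The figures quoted above can be related with some quick arithmetic. The Python sketch below assumes that "250G" means decimal gigabytes and that the hourly volume and the ~200,000 events/second target describe the same load; the log confirms neither, so the per-event size is only a rough indication.

```python
EVENTS_PER_SECOND = 200_000                # target quoted in the log
BYTES_PER_HOUR = 250 * 10**9               # "around 250G/hour", assumed decimal gigabytes
FACEBOOK_BYTES_PER_HOUR = 2 * 10 * 10**12  # "~10TB/half-hour", doubled to an hourly figure


def events_per_hour(events_per_second: int) -> int:
    """Convert an events/second rate into events/hour."""
    return events_per_second * 3600


def mean_bytes_per_event(bytes_per_hour: int, events_per_second: int) -> float:
    """Average uncompressed size of one event, if both figures describe the same load."""
    return bytes_per_hour / events_per_hour(events_per_second)


if __name__ == "__main__":
    print(events_per_hour(EVENTS_PER_SECOND))                       # events per hour
    print(mean_bytes_per_event(BYTES_PER_HOUR, EVENTS_PER_SECOND))  # bytes per event
    print(FACEBOOK_BYTES_PER_HOUR / BYTES_PER_HOUR)                 # Facebook vs. this estimate
```

Under these assumptions the target works out to 720 million events per hour at roughly 350 bytes each, and the quoted Facebook volume is about 80 times larger.
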
Jul 30 12:21:25 <jeremyb> and google is unknown i guess?
Jul 30 12:21:44 <dschoon> jeremyb: correct. google does not even publish information about the number of machines they have.
Jul 30 12:21:49 <jeremyb> right
Jul 30 12:22:02 <dschoon> (it's estimated to be ~2-3 orders of magnitude larger than any other single consumer of compute in the world.)
Jul 30 12:22:20 <jeremyb> would something like be in scope for analytics?
Jul 30 12:22:24 <dschoon> there's a lot more [highly technical] info about the planned capacity in
Jul 30 12:22:37 <HaeB> Katja_WMDE or anybody else who wants to reuse the graphs from , do you find the format useful for your purposes, or would you like different options to export the graphics (e.g. as PNG file)?
Jul 30 12:22:53 <dschoon> jeremyb: absolutely. that's precisely the kind of instrumentation data we've been discussing.
Jul 30 12:23:05 <ezachte> WereSpielChqrs: the idea is that with Kraken time to market for new ad hoc reports will diminish (over time), and require less code
Jul 30 12:23:21 <WereSpielChqrs> @drdee. what other analytics are available for a "health of the community" thing? Remember that most metrics are stable or increasing, available admins is the only one that is clearly falling
Jul 30 12:23:47 <peteforsyth> drdee__: (or anyone) can you give an idea of when you're aiming to have something ready for public consumption?
Jul 30 12:23:55 <dschoon> WereSpielChqrs: I won't speak for Diederik, but I'd say that's something we hope to work with the community to determine
Jul 30 12:24:03 <Katja_WMDE> HaeB: i mostly just make screenshots but we can't really use data where no numbers about the german language version are included. since the german WP is the second largest, a couple of numbers would be helpful
Jul 30 12:24:04 <peteforsyth> Is there a core framework that will be released first, and then features/kinds of data added onto it?
Jul 30 12:24:20 <drdee__> well, the updated report card is our first product
Jul 30 12:25:20 <dschoon> jeremyb: other examples might be A/B testing data for features, mobile application instrumentation (how often do people open the Android app? how long do they stick around?), how often do MW parser cache lookups fail?, what's the mean time for page-render for certain classes of pages?, etc
Jul 30 12:25:26 <drdee__> WereSpielChqrs, as you know, we have a lot of adhoc solutions that focus on editor retention and i think we want to give those kind of tools a real home within Kraken, examples would be WikiPride and the Editor Trend Studies
Jul 30 12:26:08 <dschoon> peteforsyth: we're currently in the planning stages of Kraken, so it'd be a bit presumptuous to attach dates to things. that only really causes trouble :)
Jul 30 12:26:16 <jeremyb> dschoon: oh, damn now i have to go remember what the android app bug i saw last night was ;)
Jul 30 12:27:17 <peteforsyth> dschoon: yes, I understand that :) but curious to understand how you are defining your project a bit more clearly. If not by time scale, by what? by product? Is there a product that would become a priority after the updated report card? Is that known yet, or would it emerge over time?
Jul 30 12:27:18 <drdee__> Katja_WMDE, which metrics in particular are you looking for?
Jul 30 12:28:02 <drdee__> peteforsyth: right now we are benchmarking different solutions to stream the data into Kraken, this is a very complicated issue
Jul 30 12:28:05 <peteforsyth> ah, this updated report card looks beautiful, btw :)
Jul 30 12:28:11 <Katja_WMDE> drdee_ unique users, for example. how many people access the german wikipedia every day/hour/etc. via computer and via smartphone
Jul 30 12:28:21 <HaeB> Katja_WMDE, yes, this question was about the format rather than the content. and i think what you said earlier about including language versions got through (although i'd like to point out that the pageviews, mobile pageviews, new editors and active editors charts there actually all include the german wikipedia, i believe you meant the unique visitors chart)
Jul 30 12:28:23 <drdee__> once we have solved this we will be able to give a clearer plan
Jul 30 12:28:28 <dschoon> peteforsyth: sure, good question. we're publishing the planning docs. as we get a better idea about what components we'll be using, we'll have better answers to your original question :) like, once we settle on the data stream import toolchain, we'll publish whatever glue code and configuration is needed to make it work.
Jul 30 12:28:51 <drdee__> Katja_WMDE: unique visitors is something we depend on from comScore for the moment, so that breakdown is not yet possible
Jul 30 12:29:03 <Katja_WMDE> drdee_ if you could provide that in the nice report card kind of way that'd be wonderful :)
Jul 30 12:29:09 <Katja_WMDE> oh ok drdee_
Jul 30 12:29:12 <dschoon> peteforsyth: we aim to use existing open-source tools wherever possible, so replicating our work will be a matter of walking through our deployment and configuration steps. (it wouldn't really make sense to have "releases")
Jul 30 12:29:15 <drdee__> totally and we will work on that
Jul 30 12:30:03 <dschoon> I'll also add, Katja_WMDE, our goal is to make the reportcard data available in a self-service fashion. that way you can visualize things to fit the needs of your site or presentation, rather than waiting for us slowpokes.
Jul 30 12:30:40 <Katja_WMDE> well, that'd be even better, dschoon. any time soon?
Jul 30 12:30:51 <Katja_WMDE> (no pressure :)) dschoon
Jul 30 12:30:54 <peteforsyth> drdee__: hooray for that principle :) :) :)
Jul 30 12:31:24 <drdee__> Katja_WMDE the customization is already possible and we are happy to show you how to do it
Jul 30 12:31:54 <Katja_WMDE> really? awesome! can you show me where to find a how-to?
Jul 30 12:32:01 <Katja_WMDE> drdee_
Jul 30 12:32:03 <dschoon> Katja_WMDE: we're in the process of getting there now. you can play around a bit with the edit interface we use to build the reportcard graphs
Jul 30 12:32:12 <drdee__> your metric might be missing though :)
Jul 30 12:32:49 <dschoon> (i'll note the major graphs are write-protected, in case there are any mischievous hackers in the room.)
Jul 30 12:32:55 <peteforsyth> I have to go -- but I hope somebody will be posting the transcript! Very excited about this project, thanks to you all for taking the time to discuss.
Jul 30 12:33:07 <drdee__> you are very welcome peteforsyth!
Jul 30 12:33:08 <ezachte> cheers Pete
Jul 30 12:34:07 <HaeB> cool, i think the above questions have been answered by now, right?
Jul 30 12:34:08 <Katja_WMDE> that's something, drdee__. thanks for showing!
Jul 30 12:34:22 <HaeB> who wants to ask the next one?
Jul 30 12:35:14 <dschoon> (back)
Jul 30 12:35:38 <drdee__> so we hope to release two new projects shortly:
Jul 30 12:35:58 <drdee__> 1) gerrit-stats, built on top of limn, it shows code review metrics for individual repos
Jul 30 12:36:42 <drdee__> 2) anonymized search queries on the different wikis, this will contain a timestamp, query, number of results, url of best hit and score of best hit
Jul 30 12:37:10 <drdee__> those files will be downloadable from
Jul 30 12:38:13 <jeremyb> hrmmmm, will there be an ishmael counterpart?
Jul 30 12:38:34 <drdee__> what is ishmael?
Jul 30 12:39:20 <Katja_WMDE> well, my questions are answered for now. thanks a lot guys!
Jul 30 12:39:39 <WereSpielChqrs> For the search queries will the best hit take account of redirects?
Jul 30 12:39:41 <dschoon> pleasure!
Jul 30 12:39:47 <drdee__> you are welcome Katja_WMDE!
Jul 30 12:40:01 <Katja_WMDE> who can i contact in case of more questions?
Jul 30 12:40:14 <drdee__> all of us, either here or on the mailinglist
Jul 30 12:40:53 <drdee__> WereSpielChqrs, it will be an exact replica of the results shown to the user, so if a redirect is the best hit then the redirect will show up in the results page
Jul 30 12:41:52 <WereSpielChqrs> @drdee thanks and will we get most frequent unsuccessful searches?
Jul 30 12:42:32 <drdee__> yes, you will get all searches, so obviously we will need some tools to be built around this data
Jul 30 12:42:54 <drdee__> but we believe that this will give valuable feedback to editors on what topics people are searching for and cannot find
Jul 30 12:43:07 <WereSpielChqrs> For each wiki the most frequent unsuccessful searches give you a good idea of the most wanted articles or redirects
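
A tool of the kind mentioned here could be as simple as counting the queries that returned no results. The Python sketch below assumes a tab-separated file whose columns follow the order given in the log (timestamp, query, number of results, url of best hit, score of best hit); the actual file format was not specified in the discussion, and the column names and sample rows are invented for illustration.

```python
import csv
from collections import Counter

# Assumed column order, following the field list given in the log.
FIELDS = ["timestamp", "query", "num_results", "best_hit_url", "best_hit_score"]


def top_unsuccessful(lines, limit=10):
    """Return the most frequent queries that produced zero results."""
    reader = csv.DictReader(
        lines, fieldnames=FIELDS, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    misses = Counter()
    for row in reader:
        if int(row["num_results"]) == 0:
            # Fold case and surrounding whitespace so trivial variants count together.
            misses[row["query"].strip().lower()] += 1
    return misses.most_common(limit)


# Hypothetical sample records, not real search data.
SAMPLE = (
    "2012-07-30T19:00:00\tkraken platfrom\t0\t\t\n"
    "2012-07-30T19:00:05\tlimn\t3\thttp://example.org/Limn\t0.9\n"
    "2012-07-30T19:00:09\tKraken Platfrom\t0\t\t\n"
    "2012-07-30T19:00:12\tgerrit statz\t0\t\t\n"
)

if __name__ == "__main__":
    for query, count in top_unsuccessful(SAMPLE.splitlines()):
        print(count, query)
```

Quoting is disabled so that search queries containing quotation marks are read literally rather than treated as quoted fields.
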
Jul 30 12:43:26 <ezachte> of course a large part will be typos
Jul 30 12:43:53 <drdee__> that shouldn't matter so much, as the lucene based search engine corrects for that
Jul 30 12:43:57 <WereSpielChqrs> typos are some of the easiest redirects to create
Jul 30 12:44:52 <WereSpielChqrs> We just need to know what the common typos are in reader search queries
Jul 30 12:44:59 <ezachte> sure, now in Domas' hourly files one can lots of typos, also combining views for all redirects would be useful too, like grok does right now
Jul 30 12:46:05 <ezachte> correct: now in Domas' hourly files one can already find lots of typos
Jul 30 12:47:11 <HaeB> any other remarks? about what other data would be useful, or on privacy questions, or on how to find existing data?
Jul 30 12:48:57 <ezachte> oops, views for redirects are not combined there, would be nice though
Jul 30 12:50:48 <WereSpielChqrs> We have some divides in the community which data could resolve. For example at newpage patrol there is a big divide between those who believe that it is important to tag articles quickly to catch editors before they log off and those like me who think that is bitey. Some stats as to which theory is more accurate would make it much easier to get consensus as to how we treat new articles
Jul 30 12:53:18 <drdee__> WereSpielChqrs, your question is on the intersection of the E3 and analytics teams
Jul 30 12:53:18 <WereSpielChqrs> Also there are myths and rumours as to article load times by geography, and article size in bytes. This has a potential effect on the maximum size of featured articles. It would be very useful to see how long it takes for people in different parts of the world to read or edit wikipedia
Jul 30 12:53:23 <dschoon> we on the analytics team are in favor of data-driven decision making
Jul 30 12:53:44 <WereSpielChqrs> drdee e3=?
Jul 30 12:53:50 <dschoon> yeah, WereSpielChqrs, jeremyb mentioned boomerang earlier. i think getting client-side pageload times would be a great idea.
Jul 30 12:54:11 <drdee__> E3 == Editor Experiments and Engagement
Jul 30 12:54:30 <drdee__> it is a dedicated team that focuses on the editor retention problem
Jul 30 12:54:37 <ezachte> of course more data often primarily leads to refined questions, rather than answers
Jul 30 12:55:02 <WereSpielChqrs> OK I knew about them but not the e3 name
Jul 30 12:55:48 <drdee__> the loading thing is mostly an Ops issue and they are working hard on it.
Jul 30 12:56:04 <drdee__> there will be a new data centre on the west coast that will also serve Asia IIRC
Jul 30 12:56:05 <WereSpielChqrs> Does intersection mean that you will both be keen to resolve this or that it could fall into a crack in between you?
Jul 30 12:57:01 <dschoon> Intersection?
Jul 30 12:57:07 <dschoon> (Did I miss something? Sorry.)
Jul 30 12:57:14 <drdee__> and there is of course the E2 team, that is working on redesigning the new page workflow
Jul 30 12:57:15 <ezachte> intersection of the E3 and analytics teams
Jul 30 12:57:19 <drdee__> IIRC
Jul 30 12:57:39 <WereSpielChqrs> Thanks, I knew about the new datacentre, but we often have discussions at en FAC about maximum article length, I suspect other languages will have the same issue
Jul 30 12:59:17 <drdee__> WereSpielChqrs, page loading times are a very complicated issue and there are so many factors influencing that, that i don't expect that you would find clear cut answers
Jul 30 12:59:26 <dschoon> well, in the future, we aim to provide infrastructure for teams like E3. the intersection is by design -- they're eager to use what we're building, as it makes it easier and faster for them to run experiments and investigate the results.
Jul 30 12:59:36 <dschoon> dunno if that is precisely what you were asking about, though.
Jul 30 12:59:51 <dschoon> (i can give concrete examples if it'd help)
Jul 30 13:00:35 <HaeB> ok, we're wrapping up now
Jul 30 13:00:46 <HaeB> thanks everyone for participating!
Jul 30 13:00:55 <HaeB> as usual, the log will be posted at shortly
Jul 30 13:01:09 <WereSpielChqrs> @drdee yup things will be complicated, but even if we can have a worst case scenario that would help and might even encourage new cache centres
Jul 30 13:01:26 <drdee__> a big TY for everyone asking questions and participating! i hope this was useful and let us know if we should do this more often
Jul 30 13:01:53 <ezachte> Great questions! Thanks
Jul 30 13:02:02 <WereSpielChqrs> Thanks from me for your answers
Jul 30 13:02:22 <dschoon> Yeah! Thank you all, especially question-askers, but also listeners! It's really nice to know people care about what we're up to :)
Jul 30 13:02:22 * HaeB has changed the topic to: Meet the Analytics Team!