User:Halfak (WMF)/WMF research libraries

I'd like to perform a substantial upgrade and consolidation of our (WMF's) Python code for research in preparation for some dramatic improvements to my analysis/development environment. I'll use this page to document some of those ideas.

Python 3

Transitioning from Python 2.7 to 3 is annoying, so I plan to bundle it with a larger transition. I'm also hoping to move away from R (love the community, hate the language) for statistical work. This transition will rely heavily on support from numpy and scipy.

Python as an analysis environment

IPython Notebook

The environment is relatively straightforward. I found myself picking up markdown in a matter of minutes. It's fun to run code and then complain about what happened. I do have a few complaints; for example, I have to reach for my mouse to switch from code mode to markdown mode. However, the system mostly just works, and it's much smarter and cleaner than an R document. 22:24, 4 November 2013 (UTC)

Pandas for data tables

I just finished a quick run through the Pandas documentation and checked for some of the functionality that I regularly use in R. I found that most of it is there, but quite a lot of the transformations and filtering I'd like to do are a little quirky and convoluted. I'm finding myself missing R's data.table a lot, but I can do what I need to do. 22:24, 4 November 2013 (UTC)
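
To make the comparison concrete, here is a minimal sketch of the kind of filter-then-aggregate step I mean, with the rough data.table equivalent in the comments. The table and its column names (`user`, `namespace`, `bytes_added`) are made up for illustration; this is not real data or an existing script.

```python
import pandas as pd

# Hypothetical per-revision table; column names are illustrative only.
revisions = pd.DataFrame({
    "user": ["a", "a", "b", "b", "c"],
    "namespace": [0, 1, 0, 0, 0],
    "bytes_added": [120, 15, 300, -40, 60],
})

# Filter rows.
# data.table: revisions[namespace == 0]
main_ns = revisions[revisions["namespace"] == 0]

# Group and aggregate.
# data.table: main_ns[, .(edits = .N, bytes = sum(bytes_added)), by = user]
per_user = (
    main_ns.groupby("user")["bytes_added"]
    .agg(["count", "sum"])
    .rename(columns={"count": "edits", "sum": "bytes"})
    .reset_index()
)

print(per_user)
```

The result is the same, but the Pandas version needs a chain of method calls plus a rename where data.table does it in one bracket expression.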

Plotting with Bokeh

I ran through a few of the examples for plotting in IPython notebook. It seems like the library is quite capable, but it's not ready yet. For example, it has no equivalent of geom_errorbar (from R's ggplot2), one of my favorite functions, and that's just one example. I think I'll try out bokeh again another time, but I'm worried that moving away from my awesome R plotting environment will make me less productive. 22:24, 4 November 2013 (UTC)

Map reduce

Python streaming

Utilities

There are two sets of problems that I'd like to solve in a set of utilities.

Common scripts (see clize)
A set of utility scripts for extracting information from the database, dumps, etc. and performing operations and transformations or gathering stats.
Common utilities
A set of Python modules that support extension of these common actions (e.g. XML dump processing scripts & Persistence) or data transformations (e.g.
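
As a rough sketch of what a small XML dump processing utility could look like, the following streams `<page>` elements with the standard library's `xml.etree.ElementTree.iterparse` and counts revisions per page. The names `iter_pages` and `local_name` are hypothetical, and the sample document is a heavily simplified stand-in for a real dump, so treat it as an illustration under those assumptions rather than an existing module.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix like '{http://...}page' down to 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(xml_file):
    """Yield (title, revision_count) for each <page> in a dump-like stream.

    Elements are cleared once handled so memory use stays roughly flat,
    which matters for multi-gigabyte dumps.
    """
    title, revisions = None, 0
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            revisions += 1
            elem.clear()  # drop the revision text we no longer need
        elif name == "page":
            yield title, revisions
            title, revisions = None, 0
            elem.clear()


# Simplified sample input (assumption: real dumps carry many more fields
# and declare an XML namespace, which local_name() is meant to absorb).
SAMPLE = b"""<mediawiki>
  <page><title>Foo</title>
    <revision><id>1</id><text>a</text></revision>
    <revision><id>2</id><text>b</text></revision>
  </page>
  <page><title>Bar</title>
    <revision><id>3</id><text>c</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
print(pages)
```

Writing this as a generator means a common script can wrap it for command-line use while other modules reuse the same iteration logic.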

User stats ~ Wikimetrics

It's important that any statistics generation/extraction is closely tied to m:Wikimetrics or we'll end up duplicating a bunch of work.