User:Halfak (WMF)/WMF research libraries
I'd like to substantially upgrade and consolidate our (WMF's) Python code for research in preparation for some dramatic improvements to my analysis/development environment. I'll use this page to document some of those ideas.
Python 3
Transitioning from Python 2.7 to 3 is annoying, so I plan to bundle it with a larger transition. I'm also hoping to move away from R (love the community, hate the language) for statistical work. That move will rely heavily on support from NumPy and SciPy.
Python as an analysis environment
IPython Notebook
The environment is relatively straightforward. I found myself picking up markdown in a matter of minutes. It's fun to run code and then complain about what happened. I do have a few complaints. For example, I have to reach for my mouse to switch from code mode to markdown mode. However, the system mostly just works, and it's much smarter and cleaner than an R document. 22:24, 4 November 2013 (UTC)
Pandas for data tables
I just finished a quick run through the Pandas documentation and checked for some of the functionality that I regularly use in R. I found that most of it is there, but quite a lot of the transformations and filtering I'd like to do are a little quirky and convoluted. I find myself missing R's data.table a lot, but I can do what I need to do. 22:24, 4 November 2013 (UTC)
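To illustrate the kind of filtering and transformation in question, here is a minimal sketch of a filter followed by a grouped aggregate in Pandas, with the rough data.table equivalent noted in comments. The `edits` table and its column names are made up for illustration and do not come from any WMF dataset.

```python
import pandas as pd

# Hypothetical per-edit data; the column names are assumptions for this sketch.
edits = pd.DataFrame({
    "user": ["a", "a", "b", "b", "c"],
    "namespace": [0, 1, 0, 0, 0],
    "bytes_changed": [120, -30, 45, 10, 300],
})

# Filter rows. Roughly edits[namespace == 0] in data.table.
article_edits = edits[edits["namespace"] == 0]

# Group and aggregate. Roughly
# edits[namespace == 0, .(edits = .N, total_bytes = sum(bytes_changed)), by = user].
per_user = (
    article_edits
    .groupby("user")["bytes_changed"]
    .agg(["count", "sum"])
    .rename(columns={"count": "edits", "sum": "total_bytes"})
    .reset_index()
)

print(per_user)
```

The chained `groupby`/`agg`/`rename`/`reset_index` calls do in four steps what data.table expresses in one bracketed expression, which may be part of why the Pandas version feels more convoluted.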
Plotting with Bokeh
I ran through a few of the examples for plotting in IPython notebook. The library seems quite capable, but it's not ready. For example, an equivalent of ggplot2's geom_errorbar, one of my favorite functions, is missing. That's just one example. I think I'll try out Bokeh another time, but I'm worried that moving away from my awesome R plotting environment will make me less productive. 22:24, 4 November 2013 (UTC)
Map reduce
Python streaming
Utilities
There are two sets of problems that I'd like to solve with a set of utilities.
- Common scripts (see clize)
  - A set of utility scripts for extracting information from the database/dumps/etc. and performing operations & transformations or gathering stats.
- Common utilities
  - A set of Python modules that supports extension of these common actions (e.g. XML dump processing scripts & Persistence) or data transformations (e.g.
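As a rough illustration of the kind of dump-processing utility described above, here is a minimal sketch that streams through a MediaWiki-style XML dump with the standard library and counts revisions per page. The function name `count_revisions` and the tiny sample dump are hypothetical. Real dumps declare an XML namespace and carry many more fields, and this is not the existing WMF code.

```python
import io
import xml.etree.ElementTree as ET


def count_revisions(dump_file):
    """Yield (page title, revision count) pairs from an XML dump stream."""
    title, revisions = None, 0
    # iterparse streams the file, so the whole dump is never parsed in one go.
    for _event, elem in ET.iterparse(dump_file, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop any "{namespace}" prefix
        if tag == "title":
            title = elem.text
        elif tag == "revision":
            revisions += 1
            elem.clear()  # discard the revision's content once counted
        elif tag == "page":
            yield title, revisions
            title, revisions = None, 0
            elem.clear()


# Hypothetical, heavily simplified sample dump for demonstration only.
SAMPLE_DUMP = b"""<mediawiki>
  <page><title>Foo</title><revision><id>1</id></revision><revision><id>2</id></revision></page>
  <page><title>Bar</title><revision><id>3</id></revision></page>
</mediawiki>"""

page_stats = list(count_revisions(io.BytesIO(SAMPLE_DUMP)))
print(page_stats)
```

Writing the extraction as a generator keeps it reusable: a command-line script can print the pairs, while a module can feed them into further transformations.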
User stats ~ Wikimetrics
It's important that any statistics generation/extraction is closely tied to m:Wikimetrics, or we'll end up duplicating a bunch of work.