User:EpochFail/Journal/2011-08-01
Monday, August 1st
I'm writing this entry after the fact. I was sick on Monday, but I did get some things done. I found out that MySQL won't perform efficient grouping operations with some aggregate functions, most notably SUM(). I just... I can't understand... I guess MySQL wasn't really built for the kinds of aggregations I'd like to do. So I wrote a script that performs the aggregation in a way I suspect will be a little more efficient, but sadly, it still won't use the index. I plan to test my assumption by racing the queries. I don't think this will be wasteful, since I suspect MySQL is going to write a temporary table to disk, perform a merge sort, and then aggregate. If I'm right, that will take at least an order of magnitude more time. If not, I'll be very happy to be wrong :) --EpochFail 17:56, 2 August 2011 (UTC)
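For reference, the kind of aggregation script I have in mind can be sketched roughly like this (a hypothetical sketch, not the actual script): if the input rows are already sorted by the group key, a single streaming pass can compute the sums without the on-disk temporary table and merge sort that MySQL would do.

```python
from itertools import groupby
from operator import itemgetter

def aggregate_sorted(rows, key_idx, value_idx):
    """Stream over rows already sorted by the group key and yield
    (key, sum) pairs without materializing a temporary table."""
    for key, group in groupby(rows, key=itemgetter(key_idx)):
        yield key, sum(int(row[value_idx]) for row in group)

# hypothetical usage: group by column 0, sum column 1
rows = [("a", "1"), ("a", "2"), ("b", "5")]
print(list(aggregate_sorted(rows, 0, 1)))  # → [('a', 3), ('b', 5)]
```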
Tuesday, August 2nd
I started today by trying to finish loading Fabian's dataset into MySQL so I can race MySQL vs. my aggregation script. Zack had a query running against the table that blocked my load, so while waiting for it to finish (or for him to shut it down), I started loading data into one of the new analytics machines. This is a perfect platform for testing MySQL aggregation w/o the index vs. my own aggregation script, since they'll be querying from different machines. Once this finishes and I can get things moving (around lunchtime), I'll start working on huggle outcomes and ANOVA. --EpochFail 17:56, 2 August 2011 (UTC)
Data loaded into MySQL and indexed. Query running. I worked on some huggle data today and am planning to do some hand coding tomorrow to check outcomes for editors who continue to edit after reading their huggle warning. I hope to find a simple way to algorithmically predict success. Either way, I'll be producing an ANOVA of registration (as a metric of huggled anon editor success) tomorrow. --EpochFail 00:51, 3 August 2011 (UTC)
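As a note to myself, the planned ANOVA boils down to a one-way F test. A minimal pure-Python sketch of the F statistic (the real analysis will use a stats package and the actual huggle data, so this is just the math):

```python
def one_way_anova_f(groups):
    """Compute the one-way ANOVA F statistic for a list of samples."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    # between-group sum of squares (variance explained by group membership)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    # within-group sum of squares (residual variance)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# hypothetical usage with two small samples
print(one_way_anova_f([[1, 2, 3], [2, 3, 4]]))  # → 1.5
```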
Wednesday, August 3rd
MySQL to the rescue! The large GROUP BY finished *much* faster than expected when I ran it by itself on one of the new analytical database machines. I now have a dataset that groups edit activity by user_id, rev_year, rev_month and namespace with the following fields:
- first_edit: An editor's first edit
- first_edit_year: An editor's first edit year (int)
- first_edit_month: An editor's first edit month (int)
- reverting_edits: The number of reverting edits performed
- noop_edits: The number of edits that did not change the length of an article
- add_edits: The number of edits that increased the length of an article
- remove_edits: The number of edits that reduced the length of an article
- len_added: The total length added to articles
- len_removed: The total length removed from articles
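The classification the fields above encode can be sketched like this (a hypothetical Python version of the same per-group tallying the GROUP BY performs; the tuple layout is an assumption for illustration):

```python
from collections import defaultdict

def summarize(edits):
    """Tally per-group edit counts and length deltas.
    Each edit is (user_id, rev_year, rev_month, namespace, len_delta)."""
    stats = defaultdict(lambda: {"noop_edits": 0, "add_edits": 0,
                                 "remove_edits": 0, "len_added": 0,
                                 "len_removed": 0})
    for user_id, year, month, ns, delta in edits:
        s = stats[(user_id, year, month, ns)]
        if delta == 0:
            s["noop_edits"] += 1        # no change in article length
        elif delta > 0:
            s["add_edits"] += 1         # article grew
            s["len_added"] += delta
        else:
            s["remove_edits"] += 1      # article shrank
            s["len_removed"] += -delta
    return dict(stats)
```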
It's time to check Fabian's work! --EpochFail 16:39, 3 August 2011 (UTC)