Research talk:Modeling monthly active editors/Work log/2014-06-02
Monday, June 2nd
Everything is set up and now I have to reboot. Bah! See https://github.com/halfak/Monthly-Active-Editors-Model. --Halfak (WMF) (talk) 20:34, 2 June 2014 (UTC)
OK, so I wrote a script. It takes a query and a list of wiki database names, runs the query against each database, and returns the UNION ALL of the per-wiki results.
$ ./query_wikis --help
usage: -c [-h] [--query QUERY] [--no-headers] wikis

Queries a set of wikis and aggregates results

positional arguments:
  wikis          A file containing the list of wikis to process.

optional arguments:
  -h, --help     show this help message and exit
  --query QUERY  The query to run.
  --no-headers   Skip printing headers.
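For a sense of what that aggregation looks like, here is a minimal sketch of a query_wikis-style script. This is not the code from the repo linked above; it assumes Python with pymysql, a stand-in host name, and credentials read from ~/.my.cnf, so treat every detail below as illustrative only.

#!/usr/bin/env python
"""Rough sketch of a query_wikis-style aggregator (not the actual repo code)."""
import argparse
import os
import sys

import pymysql  # assumption: MySQL driver; the real script may use something else


def main():
    parser = argparse.ArgumentParser(
        description="Queries a set of wikis and aggregates results")
    parser.add_argument("wikis",
                        help="A file containing the list of wikis to process.")
    parser.add_argument("--query", help="The query to run.")
    parser.add_argument("--no-headers", action="store_true",
                        help="Skip printing headers.")
    args = parser.parse_args()

    # Assumption: the query is passed as SQL text, or piped in on stdin.
    query = args.query or sys.stdin.read()
    headers_printed = args.no_headers

    with open(args.wikis) as f:
        wikis = [line.strip() for line in f if line.strip()]

    for wiki in wikis:
        # "analytics-store.eqiad.wmnet" is a stand-in host; adjust as needed.
        conn = pymysql.connect(
            host="analytics-store.eqiad.wmnet", db=wiki,
            read_default_file=os.path.expanduser("~/.my.cnf"))
        try:
            with conn.cursor() as cur:
                cur.execute(query)
                if not headers_printed:
                    # Print the column headers once, before the first result set.
                    print("\t".join(col[0] for col in cur.description))
                    headers_printed = True
                for row in cur:
                    print("\t".join(str(v) for v in row))
        finally:
            conn.close()


if __name__ == "__main__":
    main()

Because each wiki's rows are written out as they arrive, concatenating the per-wiki result sets gives the same effect as a UNION ALL across databases.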
The script is running this query (sql/editor_months.sql):
SELECT
    month,
    wiki,
    user_id,
    user_name,
    user_registration,
    archived,
    revisions
FROM (
    SELECT
        CONCAT(LEFT(rev_timestamp, 4), "-", SUBSTRING(rev_timestamp, 5, 2)) AS month,
        DATABASE() AS wiki,
        rev_user AS user_id,
        FALSE AS archived,
        COUNT(*) AS revisions
    FROM revision
    GROUP BY LEFT(rev_timestamp, 6), rev_user
    UNION ALL
    SELECT
        CONCAT(LEFT(ar_timestamp, 4), "-", SUBSTRING(ar_timestamp, 5, 2)) AS month,
        DATABASE() AS wiki,
        ar_user AS user_id,
        TRUE AS archived,
        COUNT(*) AS revisions
    FROM archive
    GROUP BY LEFT(ar_timestamp, 6), ar_user
) AS editor_months
INNER JOIN user USING (user_id);
... on all 916 of the databases listed in datasets/wikimedia_wikis.tsv. The query is quite fast for its size: the expensive parts (scanning revision and archive) don't read the actual tables -- just their indexes. I also made sure that the query_wikis script won't crash when a single wiki errors out (it just reports the problem noisily), so strangeness in a random db can't kill the whole run; a rough sketch of that pattern is below. We'll see how far I get by morning. --Halfak (WMF) (talk) 00:48, 3 June 2014 (UTC)
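For reference, the "report noisily but keep going" behavior could be as simple as the pattern below. This is a sketch only, not the repo's actual code; run_one stands in for whatever executes the query against a single wiki.

import sys
import traceback


def run_tolerantly(wikis, run_one):
    """Call run_one(wiki) for each wiki; report failures loudly but keep going."""
    failed = []
    for wiki in wikis:
        try:
            run_one(wiki)
        except Exception:
            # Complain on stderr with a full traceback, then move on to the next wiki.
            sys.stderr.write("Failed while processing %s:\n%s\n" %
                             (wiki, traceback.format_exc()))
            failed.append(wiki)
    return failed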