Research talk:Modeling monthly active editors/Work log/2014-06-02

Monday, June 2nd edit

.... everything set up and now I have to reboot. bah!. See https://github.com/halfak/Monthly-Active-Editors-Model. --Halfak (WMF) (talk) 20:34, 2 June 2014 (UTC)Reply


OK, so I wrote a script. It will take a query and a list of wiki database names and return the UNION ALL of the queries.

$ ./query_wikis --help
usage: -c [-h] [--query QUERY] [--no-headers] wikis

Queries a set of wikis and aggregates results

positional arguments:
  wikis          A file containing the list of wikis to process.

optional arguments:
  -h, --help     show this help message and exit
  --query QUERY  The query to run.
  --no-headers   Skip printing headers.

The script is running this query (sql/editor_months.sql):

SELECT
    month,
    wiki,
    user_id,
    user_name,
    user_registration,
    archived,
    revisions
FROM (
    SELECT
        CONCAT(LEFT(rev_timestamp, 4), "-", SUBSTRING(rev_timestamp, 7, 2)) AS month,
        DATABASE() AS wiki,
        rev_user AS user_id,
        FALSE AS archived,
        COUNT(*) AS revisions
    FROM revision
    GROUP BY LEFT(rev_timestamp, 6), rev_user

    UNION ALL

    SELECT
        CONCAT(LEFT(ar_timestamp, 4), "-", SUBSTRING(ar_timestamp, 7, 2)) AS month,
        DATABASE() AS wiki,
        ar_user AS user_id,
        FALSE AS archived,
        COUNT(*) AS revisions
    FROM archive
    GROUP BY LEFT(ar_timestamp, 6), ar_user
) AS editor_months
INNER JOIN user USING (user_id);

... on all 916 of the databases in (datasets/wikimedia_wikis.tsv). This query is very fast for a large SQL query. The expensive part of the query (reading revision and archive) don't read the actual table -- just the index. I made sure that the query_wikis script won't crash on error (just report the problem noisily) so that I don't run into issues with any strangeness in a random db. We'll see how far I get by morning. --Halfak (WMF) (talk) 00:48, 3 June 2014 (UTC)Reply

Return to "Modeling monthly active editors/Work log/2014-06-02" page.