Research talk:AfC processes and productivity/Work log/2014-04-05

Saturday, April 5th

Today, my goal is to get a data processing workflow together that will produce a dataset to work with.

AfC status change dataset fields:
* page_id
* page_namespace
* page_title
* status
* timestamp
* rev_id
* duration
* old_status
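The fields above can be pictured as one row per status-change event. Here is a hypothetical container for such a row (a sketch only; the names mirror the list above and nothing here is taken from the repo):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StatusChange:
    """One AfC status-change event. duration and old_status are left
    empty on the first pass and filled in during a second pass."""
    page_id: int
    page_namespace: int
    page_title: str
    status: str          # e.g. "submitted", "declined", "accepted"
    timestamp: str       # MediaWiki timestamp, e.g. "20140405163600"
    rev_id: int
    duration: Optional[int] = None   # seconds spent in old_status
    old_status: Optional[str] = None
```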

I have two primary datasources for getting text: the dumps and the API. Given the speed with which I can process dumps, my plan is to process a recent dump first and then to try to request the rest of the data (deleted pages) from the API.
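The dump pass can be streamed with the standard library alone. This is a minimal sketch of that approach, not the repo's actual code; the export namespace URI is an assumption and varies by dump version:

```python
import xml.etree.ElementTree as ET

# Namespace URI assumed from ~2014-era dumps; check the dump's root element.
NS = "{http://www.mediawiki.org/xml/export-0.8/}"

def iter_revisions(dump_file):
    """Stream (page_title, page_ns, rev_id, timestamp, text) tuples from
    a pages-meta-history XML dump without loading it all into memory."""
    title = page_ns = None
    for _, elem in ET.iterparse(dump_file, events=("end",)):
        tag = elem.tag.replace(NS, "")
        if tag == "title":
            title = elem.text
        elif tag == "ns":
            page_ns = int(elem.text)
        elif tag == "revision":
            rev_id = int(elem.find(NS + "id").text)
            timestamp = elem.find(NS + "timestamp").text
            text_el = elem.find(NS + "text")
            text = text_el.text if text_el is not None and text_el.text else ""
            yield title, page_ns, rev_id, timestamp, text
            elem.clear()  # free memory as we go
```

Deleted pages would then be fetched separately from the API, since they never appear in the public dumps.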

I've already gathered all "AfC pages" from the database with the following SQL:

CREATE TABLE halfak.afc_page_20140331
SELECT
    page_id,
    page_namespace,
    page_title,
    archived,
    identified_by
FROM (
    (
        -- Live pages under Wikipedia talk:Articles for creation/
        SELECT
            page_id,
            page_namespace,
            page_title,
            FALSE AS archived,
            "namespace and title" AS identified_by
        FROM page
        WHERE page_namespace = 5
        AND page_title LIKE "Articles_for_creation/%"
    )
    UNION
    (
        -- Deleted pages under the same prefix, from the archive table
        SELECT
            ar_page_id,
            ar_namespace,
            ar_title,
            TRUE AS archived,
            "namespace and title (archive)" AS identified_by
        FROM archive
        WHERE ar_namespace = 5
        AND ar_title LIKE "Articles_for_creation/%"
        GROUP BY 1,2,3
    )
    UNION
    (
        -- Articles whose talk page carries an AfC submission category
        SELECT
           main_page.page_id,
           main_page.page_namespace,
           main_page.page_title,
           FALSE AS archived,
           "categories" AS identified_by
        FROM page talk_page
        INNER JOIN page main_page ON
            main_page.page_namespace = 0 AND
            main_page.page_title = talk_page.page_title
        INNER JOIN categorylinks ON
            cl_from = talk_page.page_id AND
            cl_to IN (
                "Accepted_AfC_submissions",
                "Declined_AfC_submissions",
                "Pending_AfC_submissions",
                "Draft_AfC_submissions"
            )
        WHERE talk_page.page_namespace = 1
        GROUP BY 1,2,3
    )
) AS group_set
GROUP BY 1,2,3,4;

--EpochFail (talk) 16:36, 5 April 2014 (UTC)


Since I'm going to have to process main article pages and talk pages independently, and there is no guarantee which one I'll see first in the dumps, I left the duration computation for a second pass over the dataset. For now, I'm just extracting the following fields:

* page_id
* page_namespace
* page_title
* rev_id
* timestamp
* status
* source (either "id match" or "talk page")
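The second pass described above can be sketched as: sort the extracted rows by page and timestamp, then fill each row's old_status and duration from the row before it. This is a hypothetical implementation (row dicts keyed by the field names above), not code from the repo:

```python
from itertools import groupby
from datetime import datetime

MW_FMT = "%Y%m%d%H%M%S"  # MediaWiki timestamp format

def add_durations(rows):
    """Second pass over extracted status changes: group by page, sort by
    timestamp, and set old_status and duration (seconds the page spent
    in the previous status) on every row after a page's first."""
    out = []
    rows = sorted(rows, key=lambda r: (r["page_id"], r["timestamp"]))
    for _, page_rows in groupby(rows, key=lambda r: r["page_id"]):
        prev = None
        for row in page_rows:
            if prev is None:
                # First observed status for this page: nothing to diff.
                row["old_status"] = None
                row["duration"] = None
            else:
                row["old_status"] = prev["status"]
                t1 = datetime.strptime(prev["timestamp"], MW_FMT)
                t2 = datetime.strptime(row["timestamp"], MW_FMT)
                row["duration"] = int((t2 - t1).total_seconds())
            out.append(row)
            prev = row
    return out
```

Because the rows are sorted first, it doesn't matter whether the article or the talk page was seen first in the dump pass.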

I pushed the updated code up to the repo here: https://github.com/halfak/Articles-for-Creation --EpochFail (talk) 18:31, 5 April 2014 (UTC)
