Research talk:Wikipedia article creation/Work log/Thursday, November 14th

Thursday, November 14th edit

I just finished some substantial changes to the work entry. I've got a plan together for building a database of AfC events. I'll likely end up doing it based on a sample of pages since I'll need to go to the API to get deleted article text. Props to TheOriginalSoni for helping work out hypotheses about AfC and vetting my AfC event extraction strategy. I'm blocking on getting the permission to be able to extract deleted page content via the API.

So in the meantime, I'm going to try to get some ballpark numbers about how pages were created and what their success rate is. I want to focus on pages that are intended to be "articles", so I'll be looking at any page that was *ever* in the main namespace (ns=0).

So, I have a table of all pages, deleted & not. For each of those pages, I want to quickly select out those that are currently in the main namespace -- deleted or not. For pages that are currently not in the main namespace, I'll use the logging table to look for evidence that they once were in the main namespace. Then I'll need to figure out what namespace all of the pages in started in. This is the most difficult part.

Time for an example. First, let's look for pages that started out in a user sandbox and were moved to main. I have an example on hand. It turns out that this is how I wrote en:GroupLens Research.

> select page_id, page_namespace, page_title from page where page_namespace = 0 AND page_title = "GroupLens_Research";
| page_id  | page_namespace | page_title         |
| 21301483 |              0 | GroupLens_Research |
1 row in set (0.03 sec)

OK. Now I should be able to find some log entries for this page_id

> select * from logging where log_page = 21301483;
Empty set (0.04 sec)

Hmm... Am I mis-remembering?

Nope. Here is the move [1]. So...

> select * from logging where log_title = "EpochFail/Grouplens_Lab"
    -> ;
| log_id   | log_type | log_action | log_timestamp  | log_user | log_namespace | log_title               | log_comment                         | log_params               | log_deleted | log_user_text | log_page |
| 26739417 | move     | move       | 20100106210652 |  6396742 |             2 | EpochFail/Grouplens_Lab | Article ready for main namespace.   | GroupLens Research
      |           0 | EpochFail     | 25705943 |
| 26739418 | move     | move       | 20100106210653 |  6396742 |             3 | EpochFail/Grouplens_Lab | Article ready for main namespace.   | Talk:GroupLens Research
 |           0 | EpochFail     | 25705944 |
2 rows in set (0.04 sec)

That seems to have gotten it. Strangely, there is a page_id in log_page. Let's see what it corresponds to.

> select page_id, page_namespace, page_title from page where page_id = 25705943;
| page_id  | page_namespace | page_title              |
| 25705943 |              2 | EpochFail/Grouplens_Lab |
1 row in set (0.04 sec)

Hmmm. So, it looks like the page move resulted in a change to the page_id. This is the silliest thing. There's only one revision in that my sandbox page. [2]

So, what this means is that I can't use a page_id to look for move events to see if a page was ever in ns=0 (or came from another namespace). However, I should be able to use the logging table to see where the last move was from.

It looks like the only way I can gather that is by matching on log_params. So, the move record in 2010 had a log_params that looks like this:

GroupLens Research

Note the line break after GroupLens Research.

A more recent move record has a log_params that looks like this:

a:2:{s:9:"4::target";s:61:"Keep Your Health Plan Act of 2013 (H.R. 3350; 113th Congress)";s:10:"5::noredir";s:1:"0";}

The crazy-looking thing above is a PHP serialization of the following associative array:

  '4::target' ==> "Keep Your Health Plan Act of 2013 (H.R. 3350; 113th Congress)", 
  '5::noredir' ==> "0"

So, if I'm going to work with move entries in the logging table, I'm going to need to be able to parse both types of params.

Just to check my math, I wonder if this new log_page will contain an apt page_id.

> select page_namespace, page_title from page where page_id=41082226
    -> ;
| page_namespace | page_title             |
|              2 | HistoricMN44/113hr3350 |
1 row in set (0.04 sec)

Nope.  :\

Time to recap. So, I want to figure out which pages were *ever* located in the main namespace and where they originally started. Since a move changes the page_id, the best strategy I can use for determining which namespaces a page lived in is by recursively searching the logging table for matching titles corresponding to moves.

So, it looks like I made a mistake earlier. I thought that the logging table was storing an old page_id when a page was moved, but it turns out that it was storing the redirect's page Id. Even looking back at my notes now, that seems obvious. See bug #57084.

Return to "Wikipedia article creation/Work log/Thursday, November 14th" page.