User:EpochFail/Journal/PEF 1

Wednesday, Aug. 22nd

I'm loading in the log data from RFaulk's initial load and filtering out the weirdness. This filtering seems to be common practice; check out RFaulk's log for details.

CREATE TABLE halfak.pef1_assignment
SELECT 
	user_id, 
	user_registration,
	SUBSTRING(e3pef_event, 33) as assignment,
	count(*) as logged_revisions,
	min(e3pef_timestamp) as first_edit,
	max(e3pef_timestamp) as last_edit
FROM rfaulk.e3_pef_iter1_log_data 
INNER JOIN enwiki.revision 
	ON e3pef_rev_id = rev_id
INNER JOIN enwiki.user
	ON rev_user = user_id
WHERE e3pef_event LIKE "ext.postEditFeedback@1-postEdit-%"
AND rev_user != 0
AND user_registration BETWEEN "20120730220000" AND "20120813060000"
AND rev_page = e3pef_page_id
GROUP BY 1,2;

CREATE UNIQUE INDEX user_idx ON halfak.pef1_assignment (user_id);
Query OK, 5282 rows affected (0.26 sec)
Records: 5282  Duplicates: 0  Warnings: 0

Query OK, 5282 rows affected (0.03 sec)
Records: 5282  Duplicates: 0  Warnings: 0

Most of these events should be right. Hopefully we can learn something from them. First things first, I need a table of all of their revisions that happened within an interesting timespan. I'm going to go two weeks from registration, since that should (1) include the experimental period and (2) avoid any right-truncation effects, given that it has been over a week since the end of the period.

CREATE TABLE halfak.pef1_revision
SELECT 
	rev_id, 
	rev_user as user_id,
	page_id,
	page_namespace,
	rev_timestamp as timestamp, 
	rev_comment as comment, 
	rev_sha1 as sha1, 
	FALSE as deleted 
FROM enwiki.revision
INNER JOIN halfak.pef1_assignment 
	ON user_id = rev_user
INNER JOIN enwiki.page
	ON rev_page = page_id
WHERE rev_timestamp 
	BETWEEN user_registration 
	AND DATE_FORMAT(DATE_ADD(user_registration, INTERVAL 14 DAY), "%Y%m%d%H%i%s")
UNION
SELECT 
	ar_rev_id as rev_id,
	ar_user as user_id,
	ar_page_id as page_id,
	ar_namespace as page_namespace,
	ar_timestamp as rev_timestamp,
	ar_comment as comment,
	ar_sha1 as sha1,
	TRUE as deleted
FROM enwiki.archive
INNER JOIN halfak.pef1_assignment
	ON ar_user = user_id;

CREATE UNIQUE INDEX rev_idx ON halfak.pef1_revision (rev_id);
Query OK, 44752 rows affected (14.92 sec)
Records: 44752  Duplicates: 0  Warnings: 0

Query OK, 44752 rows affected (0.25 sec)
Records: 44752  Duplicates: 0  Warnings: 0

OK. Now time to look at reverts. I think the best way to do this is with a script.

Tuesday, Sept. 4th

I just got back from WikiSym and got right back to work on this problem. It looks like some differences I had previously observed were due to data anomalies. First things first, let's talk about data.

I re-aggregated my dataset based on a set of users generated by re-hashing user_ids during the experimental period. I pulled them from Dario's db (dartar.e3_pef_iter1_users).

CREATE TABLE halfak.pef1_assignment
SELECT 
	u.user_id, 
	u.user_registration,
	bucket as assignment
FROM dartar.e3_pef_iter1_users 
INNER JOIN enwiki.user u USING (user_id)
GROUP BY u.user_id;
CREATE UNIQUE INDEX user_idx ON halfak.pef1_assignment (user_id);

Then I pulled in the first couple weeks of revisions as I did before:

CREATE TABLE halfak.pef1_revision
SELECT 
	rev_id, 
	rev_user as user_id,
	page_id,
	page_namespace,
	rev_timestamp as timestamp, 
	rev_comment as comment, 
	rev_sha1 as sha1, 
	FALSE as deleted 
FROM enwiki.revision
INNER JOIN halfak.pef1_assignment 
	ON user_id = rev_user
INNER JOIN enwiki.page
	ON rev_page = page_id
WHERE rev_timestamp 
	BETWEEN user_registration 
	AND DATE_FORMAT(DATE_ADD(user_registration, INTERVAL 14 DAY), "%Y%m%d%H%i%s")
UNION
SELECT 
	ar_rev_id as rev_id,
	ar_user as user_id,
	ar_page_id as page_id,
	ar_namespace as page_namespace,
	ar_timestamp as rev_timestamp,
	ar_comment as comment,
	ar_sha1 as sha1,
	TRUE as deleted
FROM enwiki.archive
INNER JOIN halfak.pef1_assignment
	ON ar_user = user_id
WHERE ar_timestamp 
	BETWEEN user_registration 
	AND DATE_FORMAT(DATE_ADD(user_registration, INTERVAL 14 DAY), "%Y%m%d%H%i%s");
CREATE UNIQUE INDEX rev_idx ON halfak.pef1_revision (rev_id);

Then I ran a script that used the sha1 checksums in the DB to look for reverts among the revisions in the table above and loaded the results into the following table.

CREATE TABLE halfak.pef1_reverted (
	rev_id INT(8) UNSIGNED,
	rvtg_id INT(8) UNSIGNED,
	rvt2_id INT(8) UNSIGNED,
	distance INT UNSIGNED,
	PRIMARY KEY(rev_id)
);
--mysqlimport --local -h db42 halfak pef1_reverted.tsv
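
The revert-detection script itself isn't reproduced in this journal. The sketch below captures the general idea under a few assumptions of mine: each page's revisions are fed in timestamp order, a 15-revision window limits how far back a revert can reach, and distance counts the revisions between a reverted edit and the edit that reverted it. The real script may differ on all of these points.

# Sketch of sha1-based identity-revert detection (not the actual script).
# Assumes the caller supplies one page's (rev_id, sha1) pairs in timestamp
# order, e.g. pulled from halfak.pef1_revision.
REVERT_WINDOW = 15  # assumed limit on how far back a revert can reach

def detect_reverts(revisions):
    """Yield (reverted_id, reverting_id, reverted_to_id, distance) rows,
    matching the columns of halfak.pef1_reverted."""
    history = []  # (rev_id, sha1) seen so far, in timestamp order
    for rev_id, sha1 in revisions:
        window = history[-REVERT_WINDOW:]
        # Look back through the window for a revision with the same checksum.
        for offset, (past_id, past_sha1) in enumerate(reversed(window), start=1):
            if past_sha1 == sha1:
                # Everything between the matched revision and this one was reverted.
                reverted = window[len(window) - offset + 1:]
                for distance, (reverted_id, _) in enumerate(reversed(reverted), start=1):
                    yield reverted_id, rev_id, past_id, distance
                break
        history.append((rev_id, sha1))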

Ok... Now back to plotting.
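For reference, the computation behind these plots amounts to a revision-level success rate per condition. Here is a rough sketch in matplotlib/pandas, assuming a hypothetical revisions frame with an assignment column and a boolean reverted column joined from the tables above; the error bars here are a normal-approximation guess, not necessarily what the plots use.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_success_rates(revisions: pd.DataFrame):
    """Bar chart of the revision-level success rate (proportion not reverted) per condition."""
    grouped = revisions.groupby("assignment")["reverted"]
    n = grouped.size()
    rate = 1 - grouped.mean()
    err = 1.96 * np.sqrt(rate * (1 - rate) / n)  # ~95% CI on a proportion
    fig, ax = plt.subplots()
    ax.bar(rate.index, rate.values, yerr=err.values, capsize=4)
    ax.set_xlabel("experimental condition")
    ax.set_ylabel("success rate")
    return fig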

 
Figure: The success rate of editors, plotted by experimental condition.

Figure: A zoomed-in version of the success rate of editors, plotted by experimental condition.

Hmm... this doesn't look like much of a difference. I ran a chi-squared test to check.

Neither experimental condition is significantly different from control:

  • e_1 vs control: p=0.1083
  • e_2 vs control: p=0.2239
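
For the record, each comparison boils down to a chi-squared test on a 2x2 success/revert table per condition pair. Below is a sketch using Python's scipy as a stand-in for the tool I actually used; the counts come from grouping pef1_revision joined with pef1_reverted by assignment.

from scipy.stats import chi2_contingency

def compare_conditions(ctl_success, ctl_reverted, exp_success, exp_reverted):
    """Chi-squared test of independence on a 2x2 success/revert contingency table."""
    table = [[ctl_success, ctl_reverted],
             [exp_success, exp_reverted]]
    chi2, p, dof, expected = chi2_contingency(table)
    return p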

If this result is being driven by a few power-users, we should be able to see a difference by collapsing to one value per user. It's hard to measure a per-user success rate with any precision unless the user has a lot of revisions to look at, so I limited the dataset to users who made at least 10 revisions and took the mean success rate for each group.
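
In code, that per-user collapse looks roughly like the sketch below (pandas, using the same hypothetical revisions frame as in the plotting sketch above; not the exact analysis code).

import pandas as pd

def mean_success_by_condition(revisions: pd.DataFrame, min_revisions: int = 10) -> pd.Series:
    """Mean per-user success rate (fraction of edits not reverted) per condition,
    restricted to users with at least min_revisions revisions."""
    per_user = revisions.groupby(["assignment", "user_id"]).agg(
        n_revisions=("reverted", "size"),
        success_rate=("reverted", lambda r: 1 - r.mean()),
    )
    eligible = per_user[per_user["n_revisions"] >= min_revisions]
    return eligible.groupby(level="assignment")["success_rate"].mean()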


 
Figure: The mean success rate of editors with >= 10 edits, plotted by experimental condition.

Figure: A zoomed-in version of the mean success rate of editors with >= 10 edits, plotted by experimental condition.