Research talk:Teahouse long term new editor retention/Work log/2015-10-06
Wednesday, October 7, 2015
editI need to gather all the edits to the talkpages of the 14K users in this sample, made between the time the user registered and when the bot invited some of them. Aaron and I will use these data to explore what kind of messages these users received before they were invited to the Teahouse, and maybe identify some factors that make users more likely to seek help.
- First, build out the th_retention_sample table to include talkpages.
- get all the talkpages for the users whose talkpage can be found
update th_retention_sample trs join enwiki.page p on REPLACE(trs.user_name, " ", "_") = p.page_title set trs.user_talk_id = p.page_id, trs.ut_is_redirect = p.page_is_redirect where p.page_namespace = 3;
#13735 rows
13735/14766=0.9302 or 93% of the total. Some of those that remain (in the control group) never had talkpages. How many?
select count(user_id) from th_retention_sample where user_talk_id IS NULL and sample_group = "control";
#1013
So 13735+1013=14748 , which means that only 18 of these pages are otherwise unaccounted for (tho some of the control users may have also had their talkpages deleted.
- next, let's get the ids of the pages that are redirects
- I assume most of these were redirected because of a user name change (tho we don't know this for certain). We'll try to control for that by only looking for new page ids that are still in the user_talk namespace
update th_retention_sample trs join enwiki.redirect r on trs.user_talk_id = r.rd_from join enwiki.page p on r.rd_title = p.page_title set trs.ut_new_id = p.page_id where trs.ut_is_redirect = 1 and r.rd_namespace = 3 and p.page_namespace = 3;
#50 rows updated
A couple of people's user talkpages redirect to content space. We'll ignore those.
select * from th_retention_sample where ut_is_redirect = 1 and ut_new_id is null;
#Jaspreet singh mankoo
#Nasranilokam
- now, get the ids of any deleted talkpages from the archive table
- add them in here, and set the page_is_deleted flag so we can find them later
first, set page_is_deleted = 0 for everything that has a page
update th_retention_sample trs set ut_is_deleted = 0 where user_talk_id is not null;
#13735 rows updated
how many invited users don't have talkpages?
select * from th_retention_sample where sample_group = "invited" and user_talk_id is null;
#18
- Some of these were deleted: https://en.wikipedia.org/wiki/User_talk:Megha_Gupta_India
- Others were moved without a redirect: https://en.wikipedia.org/wiki/User_talk:Maoist
Let's get the ids of the deleted ones, redirect or no
update jmorgan.th_retention_sample trs join enwiki.archive a on REPLACE(trs.user_name, " ", "_") = a.ar_title set trs.user_talk_id = a.ar_page_id, ut_is_deleted = 1 where a.ar_timestamp > 20141001000000 and trs.user_talk_id is null and trs.ut_is_deleted is null and a.ar_namespace = 3;
#11 pages updated
We have about 10 users left who are lised as "invited" but don't have a talkpage. Several of these are SUL finalization moves that didn't do a redirect. For example: https://en.wikipedia.org/wiki/User_talk:Hocngo
At least two are is a mystery. Looks like this invite was never delivered:
Let's get the new page ids for the SUL condition ones, from the log table.
update jmorgan.th_retention_sample trs join enwiki.logging l on REPLACE(trs.user_name, " ", "_") = l.log_title set trs.ut_new_id = l.log_page, trs.ut_is_deleted = 1 where trs.ut_is_redirect is null and trs.ut_new_id is null and l.log_namespace = 3 and l.log_action = "move" and l.log_type = "move" and l.log_timestamp > 20141001000000;
#9 rows updated
Are there any deleted user pages for control users that we need to account for?
select * from jmorgan.th_retention_sample trst join enwiki.logging l on REPLACE(trst.user_name, " ", "_") = l.log_title where l.log_namespace = 3 and l.log_action = "move" and l.log_type = "move" and l.log_timestamp > 20141001000000;
#78 rows returned
Yes, looks like there are a variety of edge cases to account for here. Mostly, people who were moved without redirects.
- finally, update the new page ids
- for those last control users whose pages were deleted after moving
update jmorgan.th_retention_sample trs join enwiki.logging l on REPLACE(trs.user_name, " ", "_") = l.log_title set trs.ut_new_id = l.log_page, trs.ut_is_deleted = 1 where l.log_namespace = 3 and trs.sample_group = "control" and l.log_timestamp > 20141001000000 and log_action in ("delete") and trs.user_talk_id != l.log_page;
#3 rows
- Alright! Now let's create a table to hold all the edits.
first, everyone who has a non-deleted talkpage that isn't a redirect. the most straightforward case, and the largest number of users
create table th_retention_sample_preinvite_talkpage_edits select rev_id, rev_user, rev_user_text, rev_page, rev_comment, rev_timestamp, rev_minor_edit, rev_deleted, rev_len from enwiki.revision r join jmorgan.th_retention_sample t on r.rev_page = t.user_talk_id where r.rev_timestamp between t.user_registration and DATE_FORMAT(t.sample_date, '%Y%m%d%H%i%s') and t.ut_is_redirect = 0 and t.ut_is_deleted = 0;
#14857
now, for users whose talkpage redirects to another page, and we have the other page, but that page is not deleted, give me edits to that page
insert ignore into th_retention_sample_preinvite_talkpage_edits (rev_id, rev_user, rev_user_text, rev_page, rev_comment, rev_timestamp, rev_minor_edit, rev_deleted, rev_len) select rev_id, rev_user, rev_user_text, rev_page, rev_comment, rev_timestamp, rev_minor_edit, rev_deleted, rev_len from enwiki.revision r join jmorgan.th_retention_sample t on r.rev_page = t.ut_new_id where r.rev_timestamp between t.user_registration and DATE_FORMAT(t.sample_date, '%Y%m%d%H%i%s') and t.ut_is_redirect = 1;
#49
now for edits to deleted pages, using the original id (non-redirects)
insert ignore into th_retention_sample_preinvite_talkpage_edits (rev_id, rev_user, rev_user_text, rev_page, rev_comment, rev_timestamp, rev_minor_edit, rev_deleted, rev_len, ut_is_deleted) select ar_rev_id, ar_user, ar_user_text, ar_page_id, ar_comment, ar_timestamp, ar_minor_edit, ar_deleted, ar_len, ut_is_deleted from enwiki.archive a join jmorgan.th_retention_sample t on a.ar_page_id = t.user_talk_id where a.ar_timestamp between t.user_registration and DATE_FORMAT(t.sample_date, '%Y%m%d%H%i%s') and t.ut_is_deleted = 1;
#20 rows
now for edits to deleted pages, using the new id (redirects)
insert ignore into th_retention_sample_preinvite_talkpage_edits (rev_id, rev_user, rev_user_text, rev_page, rev_comment, rev_timestamp, rev_minor_edit, rev_deleted, rev_len, ut_is_deleted) select ar_rev_id, ar_user, ar_user_text, ar_page_id, ar_comment, ar_timestamp, ar_minor_edit, ar_deleted, ar_len, ut_is_deleted from enwiki.archive a join jmorgan.th_retention_sample t on a.ar_page_id = t.ut_new_id where a.ar_timestamp between t.user_registration and DATE_FORMAT(t.sample_date, '%Y%m%d%H%i%s') and t.ut_is_deleted = 1;
#11 rows
Sanity check
editwe should have around as many pre-invite revisions between these two groups. taking the largest subset (those whose talkpage wasn't redirected
select avg(revs) from (select count(rev_id) as revs from th_retention_sample_preinvite_talkpage_edits tt join th_retention_sample t on t.user_talk_id = tt.rev_page where t.sample_group = "control" and t.ut_is_redirect = 0 and t.ut_is_deleted = 0 group by t.user_talk_id) a;
#control average = 2.1519
#invited average = 2.0886
Looks good!