Research talk:Teahouse long term new editor retention/Work log/2015-10-16
Friday, October 16, 2015
editJust getting some thoughts together this evening. I'll actually start writing code tomorrow.
There are three questions that I really want to be able to ask of the data:
- How does the initial investment (edits saved before invite) moderate the effect on long term retention?
- How does the completion of R:productive edits before the invite moderate the effect on long term retention?
- How does receiving negative templated messages before the invite moderate the effect on long term retention?
For initial investment, I can just count the number of edits per editor from J-Mo's tables. It would be nice to run an R:activity session analysis on those edits too, but I'll leave that as a nice-to-have.
For productive edits, I'll need to write a script that uses mwreverts to look for edits to articles that did not get reverted.
For receiving negative templated messages, I'll have to do a little bit more work. I should be able to borrow the template matching code from snuggle, but I'll want to do some checks to make sure that there haven't been some new negative template types added since I wrote that code 2 years ago.
Given that I think that #3 is one of most interesting followup analyses, I think I'll be focusing on that tomorrow. To do some spot checking, I'll need to write a script that extracts the comments added to the talk page and visually examine them in comparison to the regex matchers. That means I'll want to do this analysis in two stages. Stage 1 will just get the added text of the diff. Stage 2 will process the text and flag negative template comments (e.g. "uw-vandalism"). --EpochFail (talk) 00:20, 16 October 2015 (UTC)
And I'm back. So first things first, I want to get the text added in edits. First, let's find out how many edits I need to process.
> select count(*) from th_retention_sample_preinvite_talkpage_edits; +----------+ | count(*) | +----------+ | 14937 | +----------+ 1 row in set (0.00 sec)
That's more than expected, but totally tractable. I've been considering how I'll do the diff to extract the added text and I realized that I could really just process the content of the page at the time the invitation was posted. If I do that, I only need to do one request per talk page.
mysql:halfak@db1047.eqiad.wmnet [jmorgan]> select count(distinct rev_page) from th_retention_sample_preinvite_talkpage_edits; +--------------------------+ | count(distinct rev_page) | +--------------------------+ | 7102 | +--------------------------+ 1 row in set (0.05 sec) mysql:halfak@db1047.eqiad.wmnet [jmorgan]> select count(distinct sample_user) from th_retention_sample_preinvite_talkpage_edits; +-----------------------------+ | count(distinct sample_user) | +-----------------------------+ | 7102 | +-----------------------------+ 1 row in set (0.01 sec)
So it looks like that would dramatically reduce the amount of requests I need to send. It would also let me use the Snuggle code directly as it is designed to process an entire talk page -- not diffs.
In order to do this, I'll need the most recent edit to the talk page before the message was posted. That's pretty easy.
> select sample_user, rev_page, max(rev_id) from th_retention_sample_preinvite_talkpage_edits group by 1,2 limit 10; +-------------+----------+-------------+ | sample_user | rev_page | max(rev_id) | +-------------+----------+-------------+ | 22891690 | 44156633 | 630340168 | | 22892705 | 44151112 | 630301808 | | 22892807 | 44149237 | 630170993 | | 22893113 | 44159763 | 630333721 | | 22895159 | 44151203 | 630203593 | | 22895570 | 44151846 | 630211480 | | 22895602 | 44161065 | 630346113 | | 22896534 | 44161977 | 630365203 | | 22897845 | 44153274 | 630234049 | | 22897859 | 44153408 | 630232261 | +-------------+----------+-------------+ 10 rows in set (0.02 sec)
So now I need to write a python script that uses mwapi to get the text of those revisions. I'll need to do a little bit of work to query both prop=revisions and prop=deletedrevisions, but that should be easy. I'll just try one and then the other. --Halfak (WMF) (talk) 14:42, 16 October 2015 (UTC)
Well that was easy. See my code here: https://github.com/halfak/teahouse_experiment/blob/master/th2/extract_rev_text.py
I ran a test and it just completed on the first try, so now I have my data! Now to adapt the Snuggle code. --Halfak (WMF) (talk) 15:27, 16 October 2015 (UTC)
One more thing. I should record the counts.
$ wc th_preinvite_talk_text.tsv 7096 2684040 26571370 th_preinvite_talk_text.tsv $ wc th_preinvite_talk.tsv 7103 14206 134953 th_preinvite_talk.tsv
Looks like ~7 of the revisions were revdeleted. That seems like a good miss rate. --Halfak (WMF) (talk) 15:29, 16 October 2015 (UTC)
Flagging templates
editI just finished writing and running the template finder script. It seems to be working pretty good. Here's the meet of the script.
TEMPLATE_RE = [(type, re.compile(r'Template:({0})'.format(exp)))
for type, exp in
[('vandal_warning', r'uw-vandalism([1-4])?(im)?'),
('spam_warning', r'uw-spam([1-4])?(im)?'),
('copyright_warning', r'uw-copyright(-([a-z]+))?([1-4])?'),
('general_warning', r'uw-.+?([1-4])?(im)?'),
('block', r'.*?block.*?|uw-[a-z]*block[a-z]*'),
('welcome', r'w-[a-z]+|welcome|First article'),
('csd', r'.*?csd|db-|speedy.*?'),
('deletion', r'proposed del.*?|prod|afd|.*?delet.*?'),
('afc', r'afc.*?'),
('teahouse', r'teahouse.*?')]]
Good news: None of the texts matched the teahouse template type. Win. --Halfak (WMF) (talk) 16:15, 16 October 2015 (UTC)
SQL for edit stats
editSELECT
user_id,
COUNT(rev_id) AS edits,
SUM(page_namespace = 0) AS main_edits,
SUM(page_namespace = 1) AS talk_edits,
SUM(page_namespace = 2) AS user_edits,
SUM(page_namespace = 3) AS user_talk_edits,
SUM(page_namespace IN (4,5)) AS wp_edits,
SUM(page_namespace > 5) AS other_edits
FROM th_retention_sample
LEFT JOIN th_retention_sample_preinvite_edits ON
rev_user = user_id
GROUP BY user_id;
Just gets some simple counts. There should be some good signal there. I'll need to spend more time looking for reverted edits and all that jazz. That will have to happen later. --Halfak (WMF) (talk) 16:40, 16 October 2015 (UTC)
Sanity checking the pre-invite talkpage edit counts
editEpochFail raised concern over IRC that the number of users who have talkpage edits (7102) seems low. Is this dataset incomplete? I checked over my work log from 10/6 and can't see any issues that would lead to substantially under-counting. And even if we are under-counting, that's mostly only a problem if we have more data on one sample group than the other.
My previous sanity check showed that users in each group who had SOME edits had the same number of edits to their talkpage. This is what we expected, but averages can be deceiving. We also expect that that the same proportion of users in each group would will have had at least ONE edit to their talkpage before invite time, since the users were randomly assigned to the two groups after these edits took place. Let's test that.
How many total users in each sample group?
select sample_group, count(user_id) from th_retention_sample group by sample_group;
group count control 3092 invited 11674
How many of these users had at least 1 edit to their talkpage before the point of invitation?
select sample_group, count(user_id) from th_retention_sample where user_id in (select distinct sample_user from th_retention_sample_preinvite_talkpage_edits) group by sample_group;
group count control 1474 invited 5628
- 1474/3092 = .4767. 48% of users in the control group have pre-invite talkpage edits
- 5628/11674 = .4821. 48% of users in the invited group have pre-invite talkpage edits
Half of all users had at least one talkpage edit, no matter what group they're in. So we can be confident that we aren't missing data that would skew our comparative analysis. The 48% figure also jibes with my experience running HostBot: as often as not, a Teahouse invite is the first edit to a new user's talkpage. Jmorgan (WMF) (talk) 18:48, 16 October 2015 (UTC)