User:Halfak (WMF)/Pre-registration anonymous activity

Using the cu_changes table, we generated a dataset that contains newcomers' pre-registration and first-session activity. See the following diagram to get a sense of the kind of activity we're interested in:

To generate this dataset, we gathered a sample of newly registered users, with their registration IP taken from the cu_changes table, using this query:

SQL source code
-- Define a 30-day window ending at the most recent change in recentchanges.
SET @month_end = (SELECT max(rc_timestamp) FROM recentchanges);
SET @month_start = DATE_FORMAT(@month_end - INTERVAL 30 DAY, "%Y%m%d%H%i%S");

SELECT 
    user_id AS id, 
    user_name AS name, 
    user_registration AS registration, 
    user_editcount AS editcount, 
    cuc_ip AS registration_ip  -- IP logged with the account-creation entry
FROM user
INNER JOIN cu_changes ON 
    cuc_user = user_id AND
    cuc_type = 3 AND  -- log entries (RC_LOG)
    cuc_actiontext LIKE "User account % was created"
WHERE user_registration BETWEEN @month_start AND @month_end
AND user_editcount > 0  -- keep only accounts with at least one edit
AND user_id % 10 = 0;   -- roughly a 1-in-10 sample by user_id

Using the registration_ip from this dataset, we scan the revision and archive tables for revisions whose user_text field matches the registration_ip. Using the edit session method with a 1-hour cutoff, we then gather the last session before registration (if any) and the first session after registration.
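The session step can be sketched in Python as follows. This is a minimal illustration, not the code used for the analysis: the function names, the choice to treat a gap of exactly one hour as the same session, and the assumption that post-registration edits are taken from the account's own revisions are all ours, and the timestamps in the usage example are made up.

```python
from datetime import datetime, timedelta

# Assumption: a gap longer than one hour between edits starts a new session.
CUTOFF = timedelta(hours=1)

def sessionize(timestamps, cutoff=CUTOFF):
    """Group edit timestamps into sessions split at gaps longer than cutoff."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= cutoff:
            sessions[-1].append(ts)   # continues the current session
        else:
            sessions.append([ts])     # gap too long (or first edit): new session
    return sessions

def sessions_around_registration(ip_edits, user_edits, registration):
    """Return (last session before registration, first session after it).

    ip_edits: timestamps of revisions whose user_text is the registration IP
    user_edits: timestamps of revisions saved under the new account (assumed)
    Either element of the result is None when no such session exists.
    """
    pre = sessionize(ts for ts in ip_edits if ts < registration)
    post = sessionize(ts for ts in user_edits if ts >= registration)
    return (pre[-1] if pre else None, post[0] if post else None)

# Usage with made-up timestamps.
registration = datetime(2014, 1, 1, 12, 0)
ip_edits = [datetime(2014, 1, 1, 9, 0),
            datetime(2014, 1, 1, 11, 30),
            datetime(2014, 1, 1, 11, 50)]
user_edits = [datetime(2014, 1, 1, 12, 5),
              datetime(2014, 1, 1, 12, 20),
              datetime(2014, 1, 1, 15, 0)]
pre, post = sessions_around_registration(ip_edits, user_edits, registration)
print(len(pre), len(post))  # prints "2 2"
```

In this example the 09:00 IP edit falls in an earlier session, so only the two edits at 11:30 and 11:50 count as the pre-registration session.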

Results

In my sample, 6.9% of newcomers who ended up making at least one edit in their first week edited as an IP before registering their account. I'm generating some histograms now to get a sense of how much they edit before and after registering.

 
Histogram of session revisions pre-registration
 
Histogram of session revisions post-registration (with zeros filtered)
 
Histogram of session revisions post-registration

So how do users who edit before registering behave compared to users who don't? The following plots include data from both the pre-registration session and the post-registration session.

 
 
 
 
Data Table
metric                   pre_session=FALSE  pre_session=TRUE
revisions.geo_mean       1.723199           4.281658
revisions.geo_se         0.01047070         0.03594024
main_revisions.geo_mean  1.115439           3.452403
main_revisions.geo_se    0.009144279        0.037101374
revert_prop.mean         0.2321521          0.3916604
revert_prop.sd           0.4122159          0.4287481
revert_prop.se           0.006101715        0.023252133
productive.k             2342               232
n                        4564               340
productive.prop          0.5131464          0.6823529
productive.se            0.007398557        0.025248611

What if we limit the analysis to just the post-registration session?

 
 
 
 
Data Table
metric                   pre_session=FALSE  pre_session=TRUE
revisions.geo_mean       1.723199           2.178460
revisions.geo_se         0.01047070         0.04449625
main_revisions.geo_mean  1.115439           1.728482
main_revisions.geo_se    0.009144279        0.038133895
revert_prop.mean         0.2321521          0.3089849
revert_prop.sd           0.4122159          0.4439494
revert_prop.se           0.006101715        0.024076534
productive.k             2342               195
n                        4564               340
productive.prop          0.5131464          0.5735294
productive.se            0.007398557        0.026821492
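
The derived columns in both data tables appear to follow standard formulas: productive.prop matches productive.k / n, productive.se matches the binomial standard error sqrt(p * (1 - p) / n), and revert_prop.se matches revert_prop.sd / sqrt(n). The sketch below recomputes them from the reported values for the post-registration table. The function names are ours, and the formulas are inferred from the numbers rather than taken from the original analysis code. It also shows that the two group sizes reproduce the 6.9% share reported in the results.

```python
from math import sqrt

def binomial_prop_se(k, n):
    """Proportion k/n and its binomial standard error sqrt(p * (1 - p) / n)."""
    p = k / n
    return p, sqrt(p * (1 - p) / n)

def se_of_mean(sd, n):
    """Standard error of a mean from a standard deviation and a sample size."""
    return sd / sqrt(n)

# Values copied from the post-registration data table.
rows = {
    "pre_session=TRUE": {"k": 195, "n": 340, "revert_sd": 0.4439494},
    "pre_session=FALSE": {"k": 2342, "n": 4564, "revert_sd": 0.4122159},
}

for label, r in rows.items():
    prop, prop_se = binomial_prop_se(r["k"], r["n"])
    revert_se = se_of_mean(r["revert_sd"], r["n"])
    # Compare these against productive.prop, productive.se and revert_prop.se.
    print(label, round(prop, 7), round(prop_se, 9), round(revert_se, 9))

# Share of sampled newcomers with a pre-registration session.
share_pre_session = 340 / (340 + 4564)
print(f"{share_pre_session:.1%}")  # prints "6.9%"
```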