User:EpochFail/Journal/2011-07-18

This content has been copy-pasted from another wiki. Effort has been made to move content and images, but some links might be dead.

Monday, July 18th

It looks like the process that was aggregating data over the weekend pooped out because it lost its connection to the MySQL server. I've fixed the code so that such disconnections won't be disastrous and restarted it. I plan to spend the morning examining the first bits of the output. --Ahalfaker 16:41, 18 July 2011 (UTC)
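
A minimal sketch of the kind of fix described above, assuming the aggregation can simply retry the item it was working on after reconnecting. The names `run_with_reconnect`, `connect` and `process_item` are hypothetical, and `ConnectionError` stands in for whatever exception the MySQL driver actually raises on a lost connection. This is not the code that was actually changed.

```python
import time

def run_with_reconnect(connect, process_item, items, max_retries=5, delay=1.0):
    """Process items in order, reconnecting when the connection drops.

    connect: callable returning a fresh connection object (hypothetical).
    process_item: callable(conn, item) that may raise ConnectionError.
    Returns the number of items processed.
    """
    conn = connect()
    done = 0
    for item in items:
        attempts = 0
        while True:
            try:
                process_item(conn, item)
                done += 1
                break
            except ConnectionError:
                attempts += 1
                if attempts > max_retries:
                    raise  # give up only after several tries on the same item
                time.sleep(delay)
                conn = connect()  # replace the dead connection, then retry the same item
    return done
```

The point of the design is that a dropped connection costs one retried item rather than the whole weekend's run.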

So I spent the morning and an hour after lunch working on the huggle experiment. We are getting the template put in huggle for hugglers!!! This is awesome.

I've got my dataset for first user sessions so I'm planning to work on that for a little bit (descriptive stats and visualizations) then do some writing. I have a few ideas that would be good to document. --Ahalfaker 21:13, 18 July 2011 (UTC)

Edit count distributions by year. Log bucketed.

Edit count proportions by year. Log bucketed.

I got side-tracked by some work for the huggle experiment. I just checked in svn/mediawiki/trunk/tools/wsor/newbie_warning/track_hugglers.py. I ran a test overnight to make sure I could handle a lot of talk posts. This morning, I came in to find that I had tracked 500+ messages with 351 still waiting to be read. --Ahalfaker 16:25, 19 July 2011 (UTC)

Tuesday, July 19th

Worked on huggle a bit and am going back and forth with Stu on getting it running. I've taken a back seat since I got the tracking script finished, and that has given me time to get back to analysis.

Early editor survival by year.

To start, I came up with a simple metric for early editor survival. I define an editor as surviving if they perform at least one edit one month after their first edit session. I then plotted the average survival rate by year for editors who started editing in that year.
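
The metric above can be sketched as follows. This is only an illustration under stated assumptions: "one month" is taken as 30 days measured from the end of the first session, the end of the first session is supplied by the caller (the session definition is not given here), and the function names and input layout are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

ONE_MONTH = timedelta(days=30)  # assumption: "one month" taken as 30 days

def survived(first_session_end, edit_times, window=ONE_MONTH):
    """True if any edit falls at least `window` after the first session ended."""
    cutoff = first_session_end + window
    return any(t >= cutoff for t in edit_times)

def survival_rate_by_year(editors):
    """editors: iterable of (first_edit, first_session_end, edit_times) tuples.

    Returns {year of first edit: proportion of those editors who survived}.
    """
    counts = defaultdict(lambda: [0, 0])  # year -> [survivors, total]
    for first_edit, first_session_end, edit_times in editors:
        counts[first_edit.year][1] += 1
        if survived(first_session_end, edit_times):
            counts[first_edit.year][0] += 1
    return {year: s / n for year, (s, n) in counts.items()}
```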

Early editor survival by year (no vandals).

Then, I wondered if the introduction of non-productive editors could explain the decrease in survival so I limited the samples to editors that performed at least two edits in their first session and had less than 25% of those edits reverted for vandalism. The resulting plot shows essentially the same trend.
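
As a small sketch, the sample restriction described above amounts to a filter like the one below. The function name is hypothetical, and the count of edits reverted for vandalism is taken as an input because how those reverts are identified is not described here.

```python
def is_productive(first_session_edits, reverted_for_vandalism):
    """Keep editors with at least 2 first-session edits and
    fewer than 25% of those edits reverted for vandalism."""
    if first_session_edits < 2:
        return False
    return reverted_for_vandalism / first_session_edits < 0.25
```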

Early editor survival by year and initial edits.

Now I wonder if the amount of work that editors do in their first sitting predicts retention and how that has been changing over the years. I hypothesize that editors who perform more work in their first session are starting off with a higher level of motivation and will withstand more barriers and demotivation before being scared away. To the right is a lattice of years of three edit groups, editors that performed ~1 (1 edit), ~2 (2-3 edits), ~4 (4-6 edits), ~8 (7-13 edits), ..., ~64 (46-90 edits) in their first session.

Early editor survival by year grouped by initial edits.

So... that last plot is a difficult visualization. What I really want to see is how editors, grouped by their first session edits, change over time. The plot to the right is the same data, but the panels merged and axis split. Although the ~32 and ~64 are a bit noisy, the other groups consistently reduce in early survival proportion over the years.

I just met with Fabian and Dario to discuss a really interesting result of one of Fabian's plots. It looks like editors who joined in 2007 are still contributing 90+% of the content to Wikipedia namespaces. This is really interesting because it supports the hypothesis that new editors aren't making it into the community spaces. Further, other work suggests that new editors are still editing Wikipedia articles at an acceptable rate. This could be the explanation for the decline. As soon as I finish up the last couple of things I am working on, I want to focus directly on testing the Impenetrable Community hypothesis. GOOD NIGHT! --Ahalfaker 00:36, 20 July 2011 (UTC)

Wednesday, July 20th

Average edit activity for first three edit sessions by year of first edit and bucketed by first session edit group.

I just sat down and started getting data together this morning so I didn't report in. Now I've finished a plot of the average number of edits for the first three edit sessions for editors grouped by how many edits they did in their first session. I think I am seeing a regression towards the mean effect and no notable changes between years. Sadly, I think I have defeated my hypothesis that number of edits performed in the first session has a non-linear effect on retention. In other words, editors who do lots of work right away hit the same wall, but just hit it a little harder.

On the other hand, I finally got to modeling retention based on rejection today. So I took the edits an editor performed in their first three sessions and gathered the proportion that were deleted or reverted. I call this the 'initial rejection proportion'. I used a logistic regression over early survival to find out what sort of effect it had. I also included the number of edits made in the first session (es_0_edits) and the time since 2001 (years_since_2001) in the regression.
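
A minimal sketch of how the initial rejection proportion described above could be computed. The input layout (a list of sessions, each a list of edits carrying boolean `deleted` and `reverted` flags) and the function name are assumptions made for illustration, not the actual data format.

```python
def initial_rejection(sessions, max_sessions=3):
    """Proportion of edits in the first `max_sessions` sessions that were
    deleted or reverted.

    sessions: list of sessions in chronological order; each session is a list
    of dicts with boolean 'deleted' and 'reverted' keys (hypothetical layout).
    Returns None when there are no edits to measure.
    """
    edits = [edit for session in sessions[:max_sessions] for edit in session]
    if not edits:
        return None
    rejected = sum(1 for edit in edits if edit["deleted"] or edit["reverted"])
    return rejected / len(edits)
```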

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.1680  -1.0351   0.6162   0.9300   1.6762  

Coefficients:
                                                          Estimate Std. Error z value Pr(>|z|)  
(Intercept)                                                0.49731    0.01666  29.855  < 2e-16
sc(es_0_edits)                                             1.28777    0.14685   8.769  < 2e-16
sc(years_since_2001)                                      -0.43307    0.01708 -25.354  < 2e-16
sc(initial_rejection)                                     -0.61209    0.01722 -35.552  < 2e-16
sc(es_0_edits):sc(years_since_2001)                        0.20489    0.14880   1.377 0.168508
sc(es_0_edits):sc(initial_rejection)                      -0.43796    0.13207  -3.316 0.000913 
sc(years_since_2001):sc(initial_rejection)                 0.13335    0.01784   7.474 7.78e-14
sc(es_0_edits):sc(years_since_2001):sc(initial_rejection)  0.36832    0.14688   2.508 0.012157 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 25749  on 19221  degrees of freedom
Residual deviance: 22900  on 19214  degrees of freedom
AIC: 22916

Number of Fisher Scoring iterations: 7

The regression above confirms that, independent of the number of edits in the first session and the age of the encyclopedia, initial rejection has a powerful effect. Let me enumerate the likely explanations for the effects observed.

First, effects on retention can be ordered by the size of the effect.

+ es_0_edits (Initial motivation/investment)
- This makes sense and relates to what I talked about above with "hitting the wall harder". Editors who make a bigger initial investment are more likely to stick around.
- initial_rejection
- Of the remaining effects, this one is the strongest and that should be telling. What might be more telling is that the interaction of initial_rejection and es_0_edits (rejection and investment) is strongly negative. This means that the more initial investment you make, the more negatively affected you are by rejection!
- years_since_2001
- This term represents the general decline in editor survival over the years. (because I am so excited) I'd like to point out again that this term has less of an effect than rejection and the interaction of rejection and investment, which could suggest that rejection and investment represent the lion's share of the decline in newcomer retention.
- There's one more bit here that was unexpected. The interaction of rejection and years_since_2001 is significantly positive and this could suggest that editors who joined recently tend to have a thicker skin for rejection. I hypothesized the exact opposite!

20:57, 20 July 2011 (UTC)
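
To make the interaction terms in the list above concrete, here is a small worked sketch using the estimates from the regression output. It assumes that `sc()` centers and scales each predictor, so that 0 is the sample mean and 1 is one standard deviation above it. That is an assumption, since `sc()` is not defined here. The function names are hypothetical.

```python
import math

# Estimates copied from the regression output above (log-odds scale).
INTERCEPT = 0.49731
B_REJ = -0.61209          # sc(initial_rejection)
B_ES_REJ = -0.43796       # sc(es_0_edits):sc(initial_rejection)
B_YEARS_REJ = 0.13335     # sc(years_since_2001):sc(initial_rejection)
B_ES_YEARS_REJ = 0.36832  # three-way interaction

def rejection_slope(es_z=0.0, years_z=0.0):
    """Change in log-odds of survival per unit of scaled initial_rejection,
    at the given scaled values of es_0_edits and years_since_2001."""
    return (B_REJ
            + B_ES_REJ * es_z
            + B_YEARS_REJ * years_z
            + B_ES_YEARS_REJ * es_z * years_z)

def logistic(x):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Predicted survival probability when every scaled predictor is 0.
baseline = logistic(INTERCEPT)
```

Under that assumption, and with es_0_edits held at 0, `rejection_slope(years_z=1)` is -0.47874 while `rejection_slope(years_z=-1)` is -0.74544: rejection hurts less in later years, which is the "thicker skin" reading. With years held at 0, `rejection_slope(es_z=1)` is -1.05005, which is the "more investment, more affected by rejection" reading.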

Early survival proportions are plotted by year and grouped by the proportion of early edits that were deleted or reverted.

The plot to the right visualizes the changing effect of early rejection over the years. This plot confirms the diminishing effect of rejection on editor survival over the years.

I'm just about to start work on a script that should regenerate the data for Fabian's plots in the Wikipedia namespaces so I can confirm his results and ask my own questions. --Ahalfaker 23:46, 20 July 2011 (UTC)

Thursday, July 21st

I spent most of this day monitoring huggle data, helping Fabian with WSGI and aggregating data to replicate Fabian's work with bytes added. --Ahalfaker 16:30, 22 July 2011 (UTC)

Friday, July 22nd

I came in this morning to find my simple table for bytes changed still not built. All I am doing is adding columns to a table that exists. It's all based on a set of very simple joins, so it should have been fast. I've done a bit of reading and it looks like the internet doesn't know what's up. I'm starting to troubleshoot, but I'm really mad/embarrassed that this is taking so long. It really should be a simple operation. --Ahalfaker 16:30, 22 July 2011 (UTC)

It looks like indexing tall tables on MySQL is pretty ridiculously slow. I think that something is blocking the operation. I've never experienced such poor performance with these kinds of operations. I've done some reading, but it's hard to know what is up without being able to check the status of db42. In response, I currently have two running processes competing to produce the dataset I need. One is a re-aggregation of the data in Python and the other is building indexes so that I can get what I need through a table join. So far, I think the Python script is winning. Sadly, I still have to make indexes on this once I get it into the database :(. --Ahalfaker 18:16, 22 July 2011 (UTC)