User:EpochFail/Journal/2011-07-11

Monday, July 11th

The dataset is ready! Loading into R at the moment. --Ahalfaker 16:37, 11 July 2011 (UTC)

Found a strange problem with loading the dataset near user "on wheels". I just found out that the name belongs to a classic troll, so that's fun.

Right now, I've been side-tracked to work on cleaning up some of the vandal-fighter plots for the Signpost, so I'm getting back to that immediately. --Ahalfaker 18:20, 11 July 2011 (UTC)

Just finished cleaning up bots. Reloading into R and sending emails. Lunch in a few minutes. --Ahalfaker 19:28, 11 July 2011 (UTC)

Back from lunch a bit ago and Alpha is behaving strangely--or rather my R session is. I think it might be swapping. --Ahalfaker 20:59, 11 July 2011 (UTC)

I finished the work for the Signpost. That took longer than I expected, but it is done. I talked with Ryan and he has already pulled 12K(!!!) converted vandals from my dataset. I'll be looking at those tomorrow. --Ahalfaker 16:53, 12 July 2011 (UTC)


Tuesday, July 12th

Today I hope to:

  • review problems with stream reading XML in Hadoop
  • propose a few metrics for measuring editor fitness after receiving a level 1 warning.
  • spot check the converted vandals dataset and convince one of the quals to look at it with me.

16:53, 12 July 2011 (UTC)

So, we have confirmed that the XML streamer is the problem in Hadoop. It looks like we might be able to solve our XML streaming issues by plugging into the Mahout XML streaming interface. Shawn is looking into that. In the meantime, I'm hoping to reason about what is happening inside the XML streamer to cause the problem. --Ahalfaker 17:32, 12 July 2011 (UTC)

I spent the rest of the morning working with Shawn on Hadoop to discover the source of the problem. We ruled out a lot of hypotheses, but Hadoop is pretty difficult to penetrate. --Ahalfaker 21:50, 12 July 2011 (UTC)

Back from lunch and worked out the numbers for Stuart. It looks like we are going to need about 450 observations to detect a significant effect if the number of sessions an editor engages in after being revert-warned increases by 0.5 on average. This is based on a t-test and lots of algebra. --Ahalfaker 21:50, 12 July 2011 (UTC)
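
For reference, a minimal sketch of this kind of sample-size calculation, using the normal approximation to a two-sample t-test. Only the 0.5-session difference comes from the entry above. The standard deviation, significance level and power are assumptions for illustration, so the result is not expected to reproduce the ~450 figure, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.8):
    """Approximate observations needed per group to detect a mean
    difference `delta` with a two-sided, two-sample test.

    Uses the normal approximation to the t-test:
        n = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / delta) ** 2
    """
    z = NormalDist()  # standard normal
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


# delta = 0.5 sessions is from the journal entry; sigma = 2.0 is an assumed value.
print(n_per_group(delta=0.5, sigma=2.0))
```

The required sample grows with the square of sigma / delta, so the estimate is very sensitive to the assumed spread in sessions per editor.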

I'm currently waiting for Hadoop to finish, so I am writing a script to try to rectify the editor registration rate around the end of 2005. --Ahalfaker 21:51, 12 July 2011 (UTC)

Finished the update to the pre-2006 editor registration dates. It is a lame approximation, but I think it will work for what we need it to do. --Ahalfaker 22:49, 12 July 2011 (UTC)

Wednesday, July 13th

First thing I did today was write a logging script to monitor the huggling that Stuart is doing. We had some frustration with cron, so I just let him run the script. Everything appears to be groovy. --Ahalfaker 20:48, 13 July 2011 (UTC)

After lunch I gave Giovanni, Yusuke and Fabian a brain dump of the way I see problems in Wikipedia and why I think a live chat client would help. It was fun. I'm even more excited about that now. Now back to looking for converted vandals. --Ahalfaker 20:48, 13 July 2011 (UTC)

I've been looking through the suspected converted vandals. I'm finding a lot of false positives, so I've loaded up the dataset in R to see if I can make sense of the best predictors. I'm struggling with a fever at the moment, so I might have to give up. I'll give it another half hour and glass of ice-water to see if it subsides. --Ahalfaker 23:17, 13 July 2011 (UTC)

Thursday, July 14th

So there were a bunch of hangups for the newbie warning experiment, but it is off and running. Now it's time to get back to the converted vandals. I've classified about 20 of my suspected converted vandals and it looks like there is a lot of error. 5/24 look like conversions, but another 8/24 were *not* doing damage initially, but had their revisions reverted for vandalism anyway. That means 13/24 were at least initially suspected of being vandals, but ended up being good editors. It is interesting to look at these, but I think it might be more fruitful to check out the work of top contributors in a given year and look at their first set of edits. --Ahalfaker 16:11, 14 July 2011 (UTC)

I think it might be more fruitful to check out the work of top contributors in a given year and look at their first set of edits – YES! Please do it :) --Mpinchuk 17:04, 14 July 2011 (UTC)
OK. As soon as I've either found the light for detecting conversion or decided that it may not be fruitful, I'll look into this. I think I'll start by getting lists from Fabian for top content contributors. --Ahalfaker 18:16, 14 July 2011 (UTC)

In the meantime, plots!

So it looks like the right way to capture initial vandalism is to look for anything that was reverted for vandalism in the first edit session. There are just so many editors with no vandalism in the first edit session that can be ignored. I wish I had the future edits in this dataset. In fact, I think I might write a script to add that to this dataset. I bet that I can use a regression to make some predictions. OK... I'm doing that now. I think I will have the data ready for this afternoon. --Ahalfaker 18:22, 14 July 2011 (UTC)
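
A rough sketch of that first-session check, under stated assumptions: the one-hour inactivity cutoff used to end a session, the dict layout of an edit, and the function names are all hypothetical and not taken from the actual dataset.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(hours=1)  # assumed inactivity cutoff, not from the journal


def first_session(edits, gap=SESSION_GAP):
    """Return the leading run of edits in which no two consecutive
    edits are separated by more than `gap`."""
    ordered = sorted(edits, key=lambda e: e["timestamp"])
    session = ordered[:1]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur["timestamp"] - prev["timestamp"] > gap:
            break  # the first long pause ends the first session
        session.append(cur)
    return session


def initial_vandal(edits):
    """True if any edit in the first session was reverted for vandalism."""
    return any(e["reverted_for_vandalism"] for e in first_session(edits))


# Hypothetical editor: two edits ten minutes apart, then one two days later.
t0 = datetime(2011, 7, 14, 12, 0)
example = [
    {"timestamp": t0, "reverted_for_vandalism": False},
    {"timestamp": t0 + timedelta(minutes=10), "reverted_for_vandalism": True},
    {"timestamp": t0 + timedelta(days=2), "reverted_for_vandalism": False},
]
print(initial_vandal(example))
```

Editors for whom `initial_vandal` is false are the ones that can be ignored when looking for conversions.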

Just kidding. It looks like I'll have the data before lunch. --Ahalfaker 18:30, 14 July 2011 (UTC)

 
Histogram of the number of edits an editor performs after their first session for editors with more than 20 total edits.

Since I am limiting editors to those with at least 20 edits, I have a weird distribution of future work with a spike at ~19 future edits. This makes sense, since most editors' first session is 1 edit and they should have performed at least 20 edits total, so 19 should have come after that. I'm not sure what to do with this, but I'm starting to get excited about predicting an editor's retention (future work) by looking at what happens to the edits they make in their first session. --Ahalfaker 19:10, 14 July 2011 (UTC)
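
The spike is an artifact of the filter, which a toy example makes concrete. The (total edits, first-session edits) pairs below are made up for illustration; only the 20-edit threshold comes from the entry above.

```python
from collections import Counter

MIN_TOTAL_EDITS = 20  # threshold from the entry above

# Hypothetical (total_edits, first_session_edits) pairs.
editors = [(20, 1), (20, 1), (25, 3), (12, 1), (300, 2), (21, 2)]

# Keep only editors that pass the filter, then count edits after session one.
kept = [(total, first) for total, first in editors if total >= MIN_TOTAL_EDITS]
future_edits = [total - first for total, first in kept]

# Anyone with a one-edit first session who barely passes the filter
# lands at exactly 20 - 1 = 19 future edits.
print(Counter(future_edits).most_common(1))
```

Because the filter is on total edits, the lower end of the future-work distribution is truncated rather than observed, which matters for any regression on this outcome.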

Time to go find lunch and ponder next move. I plan to ping Steven and Maryana then to help me prioritize. --Ahalfaker 19:15, 14 July 2011 (UTC)

After a quick chat it looks like this is my plan:

  1. Find out the proportion of prolific editors who look like they started editing by doing damage.
    • Hopefully report on this by the end of the week.
  2. Generate dataset based on a random sample of first edit sessions of editors and future activity then build a predictive model.
    • Hopefully have dataset ready by the end of the week.

20:12, 14 July 2011 (UTC)

I spent a good hour working on Hadoop and then an hour meeting with Diederik to tell him where we were with Hadoop, but in the meantime my computer crashed, so I lost the proportions for the top 100 editors. I'm going to get them back quickly and then head out. --Ahalfaker 01:00, 15 July 2011 (UTC)

Top 100 first edit session:

  • Editors with at least one edit discarded: 33
  • Editors with at least one edit reverted: 16
  • Editors with at least one edit reverted for vandalism: 3

--Ahalfaker 01:10, 15 July 2011 (UTC)

Friday, July 15th

The first thing I did when I got in today was read the bikeshed article that has been floating around. Then I talked Hadoop with Shawn and Diederik a bit. Now I am running the query to sample across new users who make at least one edit so that I can examine their first edit session, their level of activity, the community reaction and how those predict future work. I think this one is going to be really fruitful :). However, it is going to take me a little while to get the data together. Here is my plan:

  1. Fix the user_meta table so it includes edits to deleted pages.
  2. Generate samples of 10k editors (who made at least one edit) for each year of Wikipedia
  3. Gather first, second, third session data for every sampled editor (should be ~100k. I expect this to take ~2 hours to run.)
  4. Load sessions and future edits into TSV for analysis in R
  5. Plots and regressions (predictive modeling)
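
Step 2 of the plan could be sketched roughly as below. This is a guess at the sampling logic, not the query actually run: it assumes "year" means registration year, that editors arrive as (user_id, year) pairs already restricted to those with at least one edit, and the function name is hypothetical.

```python
import random
from collections import defaultdict


def sample_by_year(editors, per_year, seed=0):
    """Draw up to `per_year` editors uniformly at random from each year.

    `editors` is an iterable of (user_id, registration_year) pairs.
    Years with fewer than `per_year` editors are returned in full.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_year = defaultdict(list)
    for user_id, year in editors:
        by_year[year].append(user_id)
    return {
        year: rng.sample(ids, min(per_year, len(ids)))
        for year, ids in sorted(by_year.items())
    }


# Toy population: 30 hypothetical editors spread over three years.
population = [(user_id, 2001 + user_id % 3) for user_id in range(30)]
sample = sample_by_year(population, per_year=5)
print({year: len(ids) for year, ids in sample.items()})
```

Sampling a fixed number per year, rather than proportionally, keeps the early and much smaller cohorts from being swamped by later ones.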

If everything goes well, I expect plots by Wednesday. If something goes wrong (as it often does with the amount of ad hoc code and baby-sitting I'll need to do), I'll write about that here. --Ahalfaker 17:25, 15 July 2011 (UTC)

Everything is going well. The dataset is being generated.

I spent about an hour on Hadoop and about an hour going over research with Fabian and Yusuke. --Ahalfaker 00:14, 16 July 2011 (UTC)