Research talk:Measuring edit productivity/Work log/2015-04-15

Wednesday, April 15, 2015

The diff job finished! Here are the Hadoop stats:

        File System Counters
                FILE: Number of bytes read=11992158169291
                FILE: Number of bytes written=11342016265314
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=634498881239
                HDFS: Number of bytes written=337764574821
                HDFS: Number of read operations=13317
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=4000
        Job Counters
                Launched map tasks=2439
                Launched reduce tasks=2000   
                Data-local map tasks=2438
                Rack-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=446505506700
                Total time spent by all reduces in occupied slots (ms)=26616033150
                Total time spent by all map tasks (ms)=44650550670
                Total time spent by all reduce tasks (ms)=2661603315
                Total vcore-seconds taken by all map tasks=44650550670
                Total vcore-seconds taken by all reduce tasks=2661603315
                Total megabyte-seconds taken by all map tasks=228610819430400
                Total megabyte-seconds taken by all reduce tasks=13627408972800
        Map-Reduce Framework
                Map input records=583741359  
                Map output records=415592383 
                Map output bytes=8579023624969
                Map output materialized bytes=3778907328006
                Input split bytes=508991
                Combine input records=0
                Combine output records=0
                Reduce input groups=415592383
                Reduce shuffle bytes=3778907328006
                Reduce input records=415592383
                Reduce output records=415592383
                Spilled Records=1246338388   
                Shuffled Maps =4878000
                Failed Shuffles=0
                Merged Map outputs=4878000   
                GC time elapsed (ms)=183793619
                CPU time spent (ms)=45173296420
                Physical memory (bytes) snapshot=6270283120640
                Virtual memory (bytes) snapshot=14864163971072
                Total committed heap usage (bytes)=8861103685632
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=634498372248
        File Output Format Counters
                Bytes Written=337764574821   
15/04/09 14:00:04 INFO streaming.StreamJob: Output directory: /user/halfak/streaming/enwiki-20141106/diffs-snappy

real    8408m36.464s
user    4m50.086s
sys     7m25.772s
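For scale, that real time is about 140 hours, i.e. roughly 5.8 days of wall-clock time, while the map tasks alone consumed 44,650,550,670 task-milliseconds, or about 1.4 CPU-years.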

I had to implement some diff timeouts in order to get it to finish. For that reason, there are some edits that have no diff. I wasn't able to find them with a simple grep for "diff: null", so I'm just going to kick off the persistence job and see how it goes while I prepare to perform an analysis of the diffs. --Halfak (WMF) (talk) 16:15, 15 April 2015 (UTC)


I had to make some modifications, but the script is now running. In the meantime, I want to (1) confirm that all the diffs are in fact not "null" and (2) plot the diff timing data that I extracted. --Halfak (WMF) (talk) 16:28, 15 April 2015 (UTC)


Well... it looks like I should have been looking for "ops: null". Oh well. Let's grab a sample and start working with it.
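(For the record, the "ops": null check can be done with a quick Python sketch like the following. It assumes one JSON document per line, piped in from something like hadoop fs -text over the output directory, and that the operations live at doc["diff"]["ops"]; the exact field layout is an assumption.)

import json
import sys

# Count revision documents whose diff has no operations ("ops": null).
# Assumes one JSON document per line on stdin.
null_ops = 0
total = 0
for line in sys.stdin:
    doc = json.loads(line)
    total += 1
    # Field layout assumed: treat a missing/null diff the same as null ops.
    diff = doc.get("diff") or {}
    if diff.get("ops") is None:
        null_ops += 1

print("{0}/{1} revisions have null ops".format(null_ops, total))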

So, I randomly sampled 100k revisions from the first reducer. That might result in some bias; I'm not sure. So I'll do some analysis on this while I pull a larger sample.
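A one-pass way to draw a sample like this is reservoir sampling. This is a minimal sketch, assuming one revision record per line on stdin; running it over all of the reducers' output rather than just the first part file would sidestep the bias question.

import random
import sys

SAMPLE_SIZE = 100000

# Keep a uniform random sample of SAMPLE_SIZE lines in one pass
# (Algorithm R: line i replaces a reservoir slot with
# probability SAMPLE_SIZE / (i + 1)).
reservoir = []
for i, line in enumerate(sys.stdin):
    if i < SAMPLE_SIZE:
        reservoir.append(line)
    else:
        j = random.randint(0, i)
        if j < SAMPLE_SIZE:
            reservoir[j] = line

sys.stdout.writelines(reservoir)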

OK.

 
[Figure: Diff time density. The density of time spent performing a diff between revisions is plotted for English Wikipedia.]

Well, that looks fast to me.
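(For reproducibility, a density plot like the one above can be generated with something like the following Python sketch; the input file name and column name are assumptions, and the original figure may well have been drawn in R.)

import matplotlib.pyplot as plt
import pandas as pd

# Load the 100k-revision sample (tab-separated; file and column names assumed).
diff_stats = pd.read_csv("diff_stats.sample.tsv", sep="\t")

# A normalized histogram approximates the density of per-diff times.
plt.hist(diff_stats["diff.time"], bins=100, density=True)
plt.xlabel("diff time (seconds)")
plt.ylabel("density")
plt.title("Diff time density (English Wikipedia sample)")
plt.savefig("diff_time_density.png")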

Let's look at some stats.

# Distribution of time spent computing each diff (units presumably seconds)
quantile(diff_stats$diff.time)
#  0%  25%  50%  75% 100% 
#0.00 0.02 0.05 0.13 3.85

# Whether any sampled diff was truncated (presumably by the diff timeout)
summary(diff_stats$truncated)
# False 
#100000

Cool. It looks like we're performing about right for my expectations. Now I'm just waiting for the proper sample to finish. --Halfak (WMF) (talk) 18:40, 15 April 2015 (UTC)


Looks like the persistence generator failed. That was because I changed the format of diffs in order to track stats. I've released a new version of mwstreaming (0.5.5) to fix this and restarted the job. --Halfak (WMF) (talk) 18:41, 15 April 2015 (UTC)


Well, it's all running, but this is going to take at least a couple of hours, so I'm going to go work on other things. If the sample finishes today, I'll update here. If not, look for future work logs. --Halfak (WMF) (talk) 20:18, 15 April 2015 (UTC)


Update from the FUTURE! The proper sample completed. It looks like stats didn't change in any meaningful way, but I did update the #Diff time density plot above. --Halfak (WMF) (talk) 17:38, 16 April 2015 (UTC)
