User:Brooke Vibber/Dump build split

Current plan: split the dump build into four threads: one for enwiki, one for the next few largest wikis, one for a few dozen more medium-to-large wikis, and a fourth for everything else.

This should spread the run times out more evenly and make better use of the database servers.

Currently going to run:

  • thread 1 (enwiki) on srv31
  • thread 2 (large) on benet
  • thread 3 (medium) on srv31
  • thread 4 (small) on benet

ZOMG

 

Handy splitter tool

Attempts to break up the database list into similar-sized chunks (by revision count). Not totally successful. ;)

<?php

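$total = 0;          // total revisions across all databases
$counts = array();   // one entry per database: name and revision count
$threads = 4;        // number of dump threads to split across
$fudge = 1.0;        // multiplier on the per-thread target; raise it to let each thread fill further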

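// Despite the name, dbsizes.csv is tab-separated: revision count, then database name.
foreach( file("dbsizes.csv") as $line ) {
	$fields = explode( "\t", trim( $line ) );
	if( count( $fields ) < 2 ) continue; // skip blank or malformed lines
	list( $revs, $db ) = $fields;
	if( $db == "Database" ) continue; // skip the header row

	//echo "$db: $revs\n";
	$counts[] = array( "db" => $db, "revs" => intval( $revs ) );
	$total += intval( $revs );
}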

$perthread = intval( $total / $threads );

echo "Total: $total\n";
echo "Desired threads: $threads\n";
echo "Ideal count per thread: $perthread\n";

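// Fill each thread with consecutive databases, in file order, until it
// reaches the per-thread target. The last thread takes whatever is left, so
// rounding or a fudge factor below 1.0 cannot leave databases unassigned.
$assignments = array();
$dbindex = 0;
for( $i = 0; $i < $threads; $i++ ) {
	$assignments[$i] = array();
	$dbcount = 0;
	$revcount = 0;
	$last = ( $i == $threads - 1 );
	
	while( ( $last || $revcount < $perthread * $fudge ) && $dbindex < count( $counts ) ) {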
		$revcount += $counts[$dbindex]["revs"];
		$assignments[$i][] = $counts[$dbindex];
		$dbindex++;
		$dbcount++;
	}
	
	echo "Thread $i: $dbcount databases, $revcount revisions\n";
}

foreach( $assignments as $i => $dbs ) {
	echo "\n# Thread $i\n";
	usort( $dbs, 'sortDatabases' );
	foreach( $dbs as $item ) {
		echo $item["db"] . "\n";
	}
}

function sortDatabases( $a, $b ) {
	return strcmp( $a["db"], $b["db"] );
}

?>
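The tool above fills threads with consecutive databases in file order, which keeps wikis of similar size together but cannot even out the totals. For comparison, here is a minimal sketch of a different heuristic: sort largest first, then give each database to whichever thread currently has the fewest revisions. This is an assumption for illustration, not what the tool does; `balanceByLoad()` and `compareRevsDescending()` are hypothetical helpers and the sample sizes are made up.

```php
<?php
// Sketch only: "largest first, least-loaded thread" assignment.
// This is NOT the splitter tool's method; it is shown for comparison.
function balanceByLoad( $counts, $threads ) {
	// Place the biggest databases first, while all threads are still light.
	usort( $counts, 'compareRevsDescending' );

	$loads = array_fill( 0, $threads, 0 );
	$assignments = array_fill( 0, $threads, array() );
	foreach( $counts as $item ) {
		// Find the thread(s) with the smallest load and take the first one.
		$keys = array_keys( $loads, min( $loads ) );
		$target = $keys[0];
		$loads[$target] += $item["revs"];
		$assignments[$target][] = $item["db"];
	}
	return array( $assignments, $loads );
}

function compareRevsDescending( $a, $b ) {
	return $b["revs"] - $a["revs"];
}

// Made-up database names and sizes, for illustration only.
$sample = array(
	array( "db" => "bigwiki", "revs" => 50 ),
	array( "db" => "midwiki1", "revs" => 20 ),
	array( "db" => "midwiki2", "revs" => 15 ),
	array( "db" => "smallwiki1", "revs" => 10 ),
	array( "db" => "smallwiki2", "revs" => 5 ),
);
list( $assignments, $loads ) = balanceByLoad( $sample, 3 );
foreach( $loads as $i => $revs ) {
	echo "Thread $i: " . implode( ", ", $assignments[$i] ) . " ($revs revisions)\n";
}
```

With the sample data this puts bigwiki alone on thread 0 (50 revisions) and pairs the rest into two threads of 25 each. The trade-off is that large and small wikis end up mixed on the same thread, which departs from the size-class grouping in the plan. A single oversized database still dominates whichever thread it lands on.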

Suggested splits from the tool

Total: 113956291
Desired threads: 4
Ideal count per thread: 28489072
Thread 0: 1 databases, 48078833 revisions
Thread 1: 4 databases, 29382691 revisions
Thread 2: 40 databases, 28577478 revisions
Thread 3: 635 databases, 7917289 revisions
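To see how uneven the split is, each thread's revision count can be compared with the ideal per-thread count. The short sketch below only redoes arithmetic on the figures printed above; `shareOfIdeal()` is a hypothetical helper, not part of the tool.

```php
<?php
// Figures copied from the tool output above.
$ideal = 28489072;
$revisions = array( 48078833, 29382691, 28577478, 7917289 );

// Hypothetical helper: a thread's load as a percentage of the ideal count.
function shareOfIdeal( $revs, $ideal ) {
	return round( 100 * $revs / $ideal, 1 );
}

foreach( $revisions as $i => $revs ) {
	echo "Thread $i: " . shareOfIdeal( $revs, $ideal ) . "% of ideal\n";
}
// Thread 0: 168.8% of ideal
// Thread 1: 103.1% of ideal
// Thread 2: 100.3% of ideal
// Thread 3: 27.8% of ideal
```

enwiki alone is well over the ideal count, so thread 0 cannot be brought down by any regrouping, and thread 3 ends up with only what is left over.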

Thread 0

  • enwiki

Thread 1

  • dewiki
  • frwiki
  • nlwiki
  • plwiki

Thread 2

  • arwiki
  • bgwiki
  • bgwiktionary
  • cawiki
  • commonswiki
  • cswiki
  • dawiki
  • dewiktionary
  • enwikibooks
  • enwikinews
  • enwikiquote
  • enwiktionary
  • eowiki
  • eswiki
  • etwiki
  • fiwiki
  • frwiktionary
  • hewiki
  • hrwiki
  • huwiki
  • idwiki
  • iowiktionary
  • itwiki
  • ltwiki
  • metawiki
  • nowiki
  • plwiktionary
  • ptwiki
  • rowiki
  • ruwiki
  • sep11wiki
  • skwiki
  • slwiki
  • sourceswiki
  • srwiki
  • svwiki
  • trwiki
  • ukwiki
  • viwiki
  • zhwiki

Thread 3

  • everything else!
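Since thread 3 is defined as whatever the other threads do not cover, its list can be derived rather than written out. A minimal sketch, assuming the full database list and the explicit thread lists are already available as arrays; `remainingDatabases()` is a hypothetical helper and the input here is a tiny illustrative sample, not the real list.

```php
<?php
// Hypothetical helper, not part of the splitter tool: returns every database
// in $all that is not named in any of the explicitly listed threads.
function remainingDatabases( $all, $explicitThreads ) {
	$assigned = array();
	foreach( $explicitThreads as $dbs ) {
		$assigned = array_merge( $assigned, $dbs );
	}
	// array_diff keeps the original keys, so reindex for a clean list.
	return array_values( array_diff( $all, $assigned ) );
}

// Illustrative input only: a real run would use the full database list.
$all = array( "enwiki", "dewiki", "frwiki", "aawiki", "zuwiki" );
$explicit = array(
	array( "enwiki" ),
	array( "dewiki", "frwiki" ),
);
echo implode( "\n", remainingDatabases( $all, $explicit ) ) . "\n";
```

Deriving the list this way means a newly created wiki falls into the catch-all thread without anyone having to edit it in by hand.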