Thursday, July 23, 2009

Speedlink: Engine Yard SHA-1 Contest Winner Writeup

Another blog from the Engine Yard's winning team: Paul Wins an iPhone.

This fills in some of the details left out in the more prominent post, such as the hand-optimized C code for getting 10x improvement on the SHA-1 for ordinary CPUs (not the GPU).

Tuesday, July 21, 2009

A Crack at Cracking the EngineYard's SHA-1 Challenge

That was fun. This is a brief report on my strategy for the just-concluded Engine Yard programming contest. My farm of thirty-something CPU cores, with Polar SSL's SHA-1 hash code and my wrappings managed to come in about #57 with a Hamming distance of 36 for key dijkstra parameters format i10n expo dedicated crispin programming output inspect beta.

I wasn't originally going to enter, but after reading some of the blogs and forum discussions, my interest was piqued and I started thinking about the problem.

I began to get pulled in by counting up the order of magnitude for the search space, and the claim of someone in a discussion forum that he could "solve" the contest in about 5 minutes with about 30,000 nodes. My estimate was that the search space was about 10^63, large enough to make a lot of decisions pretty clear.

Also, I had higher priority activities (work, family, intermittent power outages that were traced back to a tree rubbing away the insulation on one wire of our service line and neutral), so I didn't start coding for the contest until it was already underway, which of course is a bit self-defeating, but I'm happy with where I ended up.

Hardware: I have a collection of nodes at work that are typically unused, most of them of relatively recent vintage meaning somewhere around 3GHz. Mac, Linux, and Windows are all represented, most of them dual core with a couple of 4- or 8-core machines, all of them AMD 64 chips. At the peak I had 37 cores running.

Software: Prototyped in Mathematica, then recoded in C. Polar SSL's SHA1, perfectly acceptable Hamming distance function based on a bytewise bit counting lookup table, and a simple enumeration technique for generating candidate phrases that ignores the capitalization and random character suffix parts of the search space.

Parallelism: Manually start the contest executable by hand:

% nohup contest &
Not scalable, but fine for the dozen-odd machines I had. I did two or three rounds of manually killing all the executables and restarting them with an updated version, which did get old. To distribute the workload, the executable picks a random initial phrase. Random generator is seeded with the random seed based on the current time and process ID.

Tracking minimum: contest executable is compiled with the current best known minimum.

Status reporting: All programs are started from the same network file system directory and write updates there. When an executable finds a new minimum, it writes the phrase to a file using the Hamming distance, hostname, and PID in the filename. My Windows explorer window, sorted by name looks something like this, and I just ignore or delete files with higher scores from nodes that don't know a better score has been found:


Performance: the SHA-1 algorithm (of course) seemed to be the limiting factor for my C code, clocking in somewhere around 1M hashes/sec. My initial simple and stupid method of forming candidate phrases was responsible for cutting performance in about half:
sprintf(phrase,"%s %s %s %s %s %s %s %s %s %s %s %s",
contestDictionaryWords[word[11]]], contestDictionaryWords[word[10]]],
contestDictionaryWords[word[9]]], contestDictionaryWords[word[8]]],
contestDictionaryWords[word[7]]], contestDictionaryWords[word[6]]],
contestDictionaryWords[word[5]]], contestDictionaryWords[word[4]]],
contestDictionaryWords[word[3]]], contestDictionaryWords[word[2]]],
contestDictionaryWords[word[1]]], contestDictionaryWords[word[0]]]);
All this copying is horribly slow, but it got me underway. Before retiring for the night on Monday, I refined it so this copy was only done every 1000 times.

When I went back and wrote something that optimally copies only what's changed, I saw around a 50% to 75% speed improvement. One of the tricks in the improvement is to make two copies of the dictionary: one with a space appended to each word (for the first 11 words) and one without the space. Here's how the final phrase initialization code looks:

void initPhrase()
int i, w;
unsigned char *p;


/* Start with a random enumeration to partition work */
for(i=0; i < 12; i++)
word[i] = random() % 1000;

p = phrase;
for(i=11; i > 0; i--){
w = word[i];
strcpy(p, contestDictionaryWordsWithSpace[w]);
phrasepartend[i] = p += contestDictionaryWordWithSpaceLen[w];

w = word[0];
strcpy(p, contestDictionaryWords[w]);
p += contestDictionaryWordLen[w];
phraselen = p-phrase;

printf("initial phrase: %s\n", phrase);

Hamming distance: my initial code for this seemed to perform well enough that I never bothered optimizing it further, e.g. performing the XORs a 32-bit word at a time:

unsigned challengeDistance(unsigned char hash[20])
int i, distance;
distance = 0;
for(i=0; i < 20; i++){
distance += numOnes[challengeHash[i] ^ hash[i]];
return distance;
For each byte, xor the hash with the corresponding byte in the challenge hash, and use a precomputed lookup table to count the ones.

Performance reporting: The main program structure is a while(1) loop that generates and evaluates candidate phrases in batches of a few million. These batches are times and current results are printed to the console. I had dozens of console windows where I could keep an eye on progress:

390392448 evals, 1000000.000000 eval/sec, 40 session min, 36 min
400392448 evals, 909090.909091 eval/sec, 40 session min, 36 min
410392448 evals, 909090.909091 eval/sec, 40 session min, 36 min
420392448 evals, 909090.909091 eval/sec, 40 session min, 36 min
430392448 evals, 1000000.000000 eval/sec, 40 session min, 36 min
440392448 evals, 909090.909091 eval/sec, 40 session min, 36 min

Conclusion: I put together a modest-performing program that performed about 100 billion searches during the contest and found a phrase that had a score of 36. One thing I'd like to do is analyze the performance of the SHA-1 hash implementation itself and understand why it takes a few thousand machine cycles, what the limiting factor is, and whether there's a better performing implementation.