Tuesday, December 12, 2006

Splitting the Zip

Can you remember a time when you thought "Goodbye to splitting a file across a bunch of floppies, it all fits on a CD now"? Maybe, like me, you're never 100% comfortable with this form of information surgery, where you split a file, usually a valuable one, into parts, transmit it somewhere, and reassemble it. If so, you breathed a sigh of relief thinking those days were over, because hey, CDs store everything. Well, it was only a matter of time before the scenario revisited me.

In this case, the patient was a 1.2GB zip file. To help a certain close relation meet an academic commitment at the end of the semester, I was called upon to install a trial version of the program used in the school lab on our home computer. The problem came after the two-hour download, when the file was reported to be corrupted and we noticed it was about 45 megabytes short of the expected size.

We figured that transmitting such an enormous file in one piece was simply too much opportunity for error; something got dropped in transmission. Re-downloading the file on a computer with a better, faster Internet connection than our DSL hookup went quickly enough, but we still needed to get the file to our computer at home somehow. I was surprised that Firefox wasn't doing some kind of data integrity checking, so my solution was to do it manually: split the file into chunks and verify each chunk against its MD5 hash. Any chunk that made it across intact was progress banked, because we would never have to start over from scratch.

Hoping to find standard Unix commands for the job, I found that split supports a byte mode and that cat concatenates files byte-for-byte, so between them they could do it. A quick check confirmed that the Cygwin versions worked as I expected. The procedure I used is this:

Step 1. Decide what size chunks to use

I settled on a chunk size of 300,000,000 bytes, which breaks the 1.2GB file into five chunk files.
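
If you want to sanity-check the chunk count before splitting, a bash-style shell can compute it at the prompt. The byte count below is only illustrative; substitute whatever ls -l reports for your file:

% ls -l valuablefile.zip
% echo $(( (1288490189 + 300000000 - 1) / 300000000 ))

The second line is ceiling division (file size plus chunk size minus one, divided by chunk size) and prints 5 for this example.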

Step 2. Split the file


% split -b 300000000 valuablefile.zip

This created a series of files xaa, xab, xac, xad, xae. The first four have the exact size specified, 300000000, and the last (xae) has the remaining bytes.
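
Incidentally, if the xaa-style names feel too anonymous, split takes an optional prefix argument. Something like this (the .part prefix is just my choice) names the chunks valuablefile.zip.partaa, valuablefile.zip.partab, and so on:

% split -b 300000000 valuablefile.zip valuablefile.zip.part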

Step 3. Transmit and check integrity of the chunks

As each chunk file finished download, we ran md5sum on the chunk and compared the hash with that obtained on the originating computer.
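
For the record, the commands look something like the following, where chunks.md5 is just a name I made up for the hash list. On the originating computer:

% md5sum x?? > chunks.md5

Then on the receiving end, once everything has arrived:

% md5sum -c chunks.md5

While the transfer is still in progress, you can also hash a single finished chunk (md5sum xaa) and compare it by eye.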

Step 4. Reassemble the chunks

% cat x* > valuablefile.zip

This step took about 18 minutes on a Sony Vaio laptop, and I suppose there's no surprise that virtually none of it was CPU time. (The shell expands x* in sorted order, so the chunks go back together in the right sequence.)
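
One last check is worth the minute it takes: hash the reassembled file and compare it with the hash of the original on the source machine, or, if you have unzip installed, let it verify the archive's internal CRCs:

% md5sum valuablefile.zip
% unzip -t valuablefile.zip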

At this point we had our file back, whole and complete. I offer this account in hopes it will be helpful. If you find it helpful, leave a comment!

1 comment:

Susan's Husband said...

Buy a DVD drive, then you get up to 4.7G, presuming no compression.

Some CD burner programs support chunking as well, but I don't know how you'd re-assemble on a system without that installed (ghost, maybe?).