Friday, November 30, 2007

Dollars Per Gigabyte Per Month: On Backups and Reliability

Following "Half a Terabyte, Hold the Gravy," several JFKBits readers have been discussing the JungleDisk interface to Amazon S3, and I wanted to explore storage reliability, extending the discussion from dollars per gigabyte for raw ownership to dollars per gigabyte per month.

Reliability and Backups

First, a note about reliability in general, and CDs/DVDs in particular. A backup device or medium fails for basically two reasons: aging and mishandling. NIST publishes "Care and Handling of CDs and DVDs: A Guide For Librarians and Archivists", which tells you in great detail exactly what goes wrong when you mishandle a disc. When we talk about reliability, we may easily remember numbers from quotes like "An accelerated aging study at NIST estimated the life expectancy of one type of DVD-R for authoring disc to be 30 years if stored at 25°C (77°F) and 50% relative humidity" (which is far under the industry consensus of 100-200 years). What we may not remember as easily is that reliability is directly affected by handling. Part of any backup plan needs to be hygiene practices, if you will, for taking care of the backup media and hardware.

Costing Storage Over Time

Amazon S3 costs $0.15/GB/month, or $1.80/GB/year, and for someone looking to store a few hundred gigabytes forever, this is an important consideration. Surely a hard drive lasts longer than one year, so why would we pay Amazon $1.80 per gigabyte every year when we can get away with copies on a couple of drives at 36 cents per gigabyte each, or a couple of DVDs at 9 cents per gigabyte each?
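To put that question in numbers, here is a minimal Python sketch that compares S3's recurring charge with a one-time media purchase. It uses only the prices quoted above and deliberately ignores replacement, handling, and failure, which the rest of this post is about; the function names are my own.

```python
S3_PER_GB_MONTH = 0.15  # Amazon S3 price quoted in the post, in USD


def s3_cumulative_cost(months, per_gb_month=S3_PER_GB_MONTH):
    """Total S3 spend per gigabyte after the given number of months."""
    return per_gb_month * months


def months_to_match(one_time_cost_per_gb, per_gb_month=S3_PER_GB_MONTH):
    """Months of S3 charges that add up to a one-time media purchase."""
    return one_time_cost_per_gb / per_gb_month


if __name__ == "__main__":
    print(f"S3 per GB per year: ${s3_cumulative_cost(12):.2f}")
    # Two copies each, at the per-gigabyte prices quoted in the post.
    for label, price in [("two drive copies", 2 * 0.36),
                         ("two DVD copies", 2 * 0.09)]:
        months = months_to_match(price)
        print(f"{label}: matched after {months:.1f} months of S3")
```

On raw purchase price alone, S3 overtakes either option within a few months, which is why the comparison only becomes interesting once failures and replacements are counted.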

The JungleDisk web site quotes a stat from a recent independent study on drive reliability at Carnegie Mellon by Schroeder and Gibson, stating that 15% of hard drives will fail in 5 years. If you ignore multiple drive failures and count on retiring a drive at 5 years old, I figure a drive at price P will actually cost 1.15 P once expected replacements are included. Of course you'll probably want to buy the replacement drives up front, and unless you're buying 100 drives or more, you're actually spending more than 15% extra on replacements, perhaps up to 5 P depending on how much redundancy you want (we'll get to RAID later). At the minimum of 2 P for us small-scalers who aren't putting 100 drives into active use simultaneously, my $182 LaCie drive comes out to $364 over 5 years, or $0.013/GB/mo. That's still a healthy distance from $0.15/GB/mo. I could buy five more replacements, acknowledge that a "500GB drive" is only 460GB, and still be at $0.04/GB/mo. What else do we need to consider?
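The arithmetic above is easy to rerun with your own prices. This is a rough Python sketch, not a cost model: it spreads the purchase price of every drive bought up front over the usable gigabytes and the months of service. The function name is mine, and reading "five more replacements" as six drives in total is my assumption.

```python
def cost_per_gb_month(drive_price, drives_bought, usable_gb, months):
    """Amortized cost of owning one drive's worth of storage.

    drive_price   -- price of a single drive, in USD
    drives_bought -- the drive in service plus spares bought up front
    usable_gb     -- capacity you can actually store data on
    months        -- planned service life
    """
    return drive_price * drives_bought / (usable_gb * months)


if __name__ == "__main__":
    DRIVE_PRICE = 182.0  # the drive price quoted in the post
    MONTHS = 5 * 12      # retire the drive after five years

    # One drive plus one spare (2 P), at the nominal 500 GB.
    minimum = cost_per_gb_month(DRIVE_PRICE, 2, 500, MONTHS)
    # Assumption: six drives in total, at 460 GB of real capacity.
    cautious = cost_per_gb_month(DRIVE_PRICE, 6, 460, MONTHS)

    print(f"2 drives, 500 GB: ${minimum:.4f}/GB/mo")
    print(f"6 drives, 460 GB: ${cautious:.4f}/GB/mo")
    print("Amazon S3:        $0.1500/GB/mo")
```

Both scenarios land close to the figures above and well under the S3 price. The sketch leaves out power, enclosures, and your own time.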

Inexpensive Disks or Cheap Disks?

Marcus Smith owns a company that provides IT services, including backup and recovery, to small businesses. He had some things to say about hard drives as an archival medium on the Association of Moving Image Archivists mailing list:

I completely agree that hard drives are only one component of an overall backup strategy and that multiple technologies will need to be employed in order to provide truly safe nests for our precious ones and zeros. There are a couple of universities experimenting with the idea of ditching tape systems and instead building very large hard drive arrays in which drives are given a certain time to live and are replaced on a rotating schedule. If every drive actually lived up to an exact span this idea would probably be used more often. The sad truth is the hard drives can fail at any time and this unpredictability alone is enough to show the overwhelming undesirability to use them as an archival media.

The idea that cost-per-megabyte savings is an enabling feature needs more clarification. How does this translate exactly into providing new methods of storage? One major problem with using a cost-per-megabyte approach in favor of large capacity drives is the intended market. Who is buying these drives? Who is using them for long-term storage?

When it comes down to it, I'd be willing to wager that far more people are using hard drives as their sole mechanism for storage because of their low cost, which therefore makes price a liability, not a benefit. As the price of drives drops, it only traps us tighter into relying on unreliable media.

To store the same amount of data on other mediums - tapes, really - we're already spending thousands of dollars. The low cost and high capacity of hard drives is exactly why ATA is used instead of SCSI, and it continues to have a draining effect on quality of any solution because of the lack of any other cheap storage media. As long as cheap drives push the market, that is what most people will use. Again, this points to the question of who the market really is. Are we talking about large archives who may or may not be able to spend money on developing proper, multiple technology storage solutions? Or are we talking about the "Prosumer" level where individuals migrate film or store native digital video on hard drives because there is simply no other reasonably attainable storage device? Since Fry's are offering cheap drives, and archival institutions probably aren't shopping there for serious storage solutions, it seems to me that we're really talking about individual consumers who need basic, fast, easy access storage. In short, folks who probably don't have tape drives, multiple technology backup systems, or well-planned rotating storage schedules.

Costing Data Failure

For JFKBits readers, I know our needs are all over the map: high res digital photos; Linux backups including system configuration; personal creative works representing who knows how many hundreds of hours of labor. In most cases, losing our data affects mostly ourselves. I contacted Marcus for permission to quote him in this article, and he described his small business clients: lots of professionals, where other people depend on the data and losing information invokes the specter of a lawsuit. When we make our backup plans, we know the purpose is to prevent data disaster. But when we start costing it, the cost of the data disaster itself tends to be left out:

Determining the value of the data you want insured may not be easy. A single database or secret recipe may be the heart of your business, and estimating the worth of that intangible asset is a problem that can give CFOs nightmares.
--Kevin Savetz in "Data Insurance", published in New Architect May 2002

RAID When Things Go Wrong

At this point, RAID comes forcefully to mind. Duplicate the data and add some automated error correction and detection. More caution, more consideration, urges Marcus:

RAID arrays can offer some benefit, but there are dangers here also, and they need to be addressed. RAID arrays do not technically have to be constructed with identical drives, unless one uses crappy hardware controllers. But for RAID to work efficiently, then identical geometry and capacity drives should be used to build the array. Most RAID systems are based on using identical drives. Here again money comes into play. RAID 5 is nice for the overall capacity, but using identical drives may be seen to have similar failure characteristics. Let's mentally build an array with five drives, and buy five more for replacement. It could very well be that within two years all ten drives may be exhausted and getting two year old drives can be very difficult to procure. Nearly impossible, actually. There are two obvious problems: (1) you're only as safe as the number of drives you buy, which is going to be many. Go ask Compaq for a 9.1gig hot-swap replacement for a Proliant server. For that matter, head over to Fry's and ask for an IBM 30-gig Deskstar drive. It will never happen. (2) If identical drives have similar failure characteristics, then it stands to reason that when one drive fails, the other drives in the array are not far behind. There are also two non-obvious problems: (1) If two drives fail before the operator knows there is a failure, all of the data on the array is lost. (2) Data recovery from missing RAID volumes is notoriously difficult and usually ends in failure of recovery.

Luckily there are other RAID options, like mirroring and striped mirrors. Again, be prepared to invest in drives. Lots of drives. These options use half the drive capacity of the total number of drives. I.e., for 500 gigs of space you need not five 100gig drives, but 10 100gig drives. You cannot do this with ATA drives because of the limitations of ATA controllers. I believe the largest possible number of drives on a single controller is the 3Ware 8-channel RAID adapter. Eight drives plus eight spares. Plus more if this is where the Company Policy places its long term future. Add a storage rack for this and a Very Expensive RAID adapter and suddenly the one-year warranties on the drives turn an expensive project into an expensive and unreliable project. The benefit over RAID 5 is that data recovery is much easier from this kind of array if it comes to it.

So yes, we need to think very, very differently about storage. RAID solutions are not as safe as they're cracked up to be, and depend greatly on the vigilance of the systems administrator to keep things running smoothly.
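The drive counts in the quote above can be checked with a short Python sketch. It uses the standard capacity rules: RAID 5 gives up one drive's worth of space to parity, and mirroring (or striped mirrors) gives up half. The function names and the "raid5"/"mirror" labels are mine, and spares are not counted.

```python
import math


def usable_capacity(level, drives, drive_gb):
    """Usable gigabytes for an array of identical drives."""
    if level == "raid5":
        if drives < 3:
            raise ValueError("RAID 5 needs at least three drives")
        return (drives - 1) * drive_gb   # one drive's worth goes to parity
    if level == "mirror":
        if drives < 2 or drives % 2:
            raise ValueError("mirroring needs an even number of drives")
        return (drives // 2) * drive_gb  # every drive has a twin
    raise ValueError(f"unknown level: {level}")


def drives_needed(level, target_gb, drive_gb):
    """Fewest identical drives that reach target_gb of usable space."""
    data_drives = math.ceil(target_gb / drive_gb)
    if level == "raid5":
        return max(3, data_drives + 1)
    if level == "mirror":
        return 2 * data_drives
    raise ValueError(f"unknown level: {level}")


if __name__ == "__main__":
    for level in ("raid5", "mirror"):
        count = drives_needed(level, 500, 100)
        print(f"{level}: {count} x 100 GB drives for 500 GB usable")
```

By these rules, mirroring 500 gigs on 100gig drives does take ten drives, as the quote says, while a RAID 5 array offering the same space takes six rather than five.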

Whew! The overriding message here, I think, is that RAID doesn't solve everything: you'll want to buy your replacement disks up front, as you probably can't get them in 5 years, and you still need to keep watch over the health of the RAID array. (I had extra time to think about this on my way home tonight as I changed a flat tire.) And as the experience of one of our readers reveals, you still need a strategy for checking that your backups happen and that the backups themselves are OK. One JFKBits reader checked his backups after reading the recent article and discovered that his backup software had actually scrozzled the data. Now how much would you pay to make sure this all gets done right? Ready to fork over that extra dime per gig per month? I thought you might be.
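Checking that a backup is intact doesn't have to be elaborate. As a minimal sketch, assuming the backup is a plain file-for-file copy in a directory you can read (not an archive or a proprietary backup format), the Python below compares SHA-256 checksums of every source file against its copy. The function names are my own.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files don't have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source_dir, backup_dir):
    """Return (relative path, problem) pairs for missing or altered copies."""
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        copy = backup_dir / rel
        if not copy.is_file():
            problems.append((str(rel), "missing"))
        elif sha256_of(src) != sha256_of(copy):
            problems.append((str(rel), "differs"))
    return problems
```

Calling verify_backup("photos", "/mnt/backup/photos") (both paths hypothetical) returns an empty list when every file has a matching copy. It only tells you the copy matches the source today, so it needs to run on a schedule, and it won't notice a source file that was already damaged when it was copied.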

Conclusions

We all know we need backups. If we're thinking long term enough to realize that even the backup media may fail, we've started planning the cost over time. Making your plan depends on what you're doing; backups of a machine image or a database have different needs than backups of segmentable data such as are needed by pro photographers. Amazon S3's figure of $0.15/GB/mo is not unreasonable, but you will still want a plan to verify integrity for yourself. A $1000 RAID NAS box (maybe two) is reasonable, provided you buy the replacement drives up front and verify the backups. If your data is segmentable, DVD backups are quite reasonable, provided you have the time to burn plenty of copies, distribute them geographically, store them in a climate-controlled environment, and organize them well. Tape backups are not dead either, and we'll close by letting Marcus have the last word:

I personally am much more in favor of tape systems than hard drives. I just wish they were less expensive. My problem is the same as many others - I don't have $5000 to shell out for a tape drive to store 400gigs of materials. Unless, of course, someone would like to buy me a new LTO drive and a bunch of tapes.

Does anyone want to wax poetic about the beauty of Tar as backup software?

2 comments:

Susan's Husband said...

Tape has obsolescence problems as well. I have a box of backup tape in the basement of a variety of types for none of which exists reader hardware on the market.

What may emerge in the future is some sort of cooperative bit-torrent like scheme, where everyone stores on unreliable disks but shares copies around, i.e. I keep your photos and you keep mine so there's no single point of failure.

Unknown said...

I have often thought about this problem and the driving factor for me is the continued drop in storage cost. I run a website from a 4 * 750G sata raid5 array. I only own 1 replacement disk and plan to buy a new one only when I have had an actual drive failure. If in two or three years it becomes difficult to buy replacement 750 gig drives I will simply buy a whole new computer with a new 4 * 4terabyte array or whatever is common in 2 years. This solution will not only save me money but effort and worry. Migrate my application, my data over gigabit ethernet and I am done. If you want to store long term you still need to think in short hops. This is all digital data, the beauty is how easily and safely it can be replicated. Buy what you need now, keep it safe, and worry about what you will need later; it just makes sense.