Friday, November 30, 2007

Dollars Per Gigabyte Per Month: On Backups and Reliability

Following "Half a Terabyte, Hold the Gravy," several JFKBits readers have been discussing the JungleDisk interface to Amazon S3 and I wanted to explore storage reliability, extending the discussion from dollars per gigabyte for raw ownership to dollars per gigabyte per month.

Reliability and Backups

First, a note about reliability in general, and CDs/DVDs in particular. Failure of any backup device or medium comes down to basically two things: aging and mishandling. NIST publishes "Care and Handling of CDs and DVDs: A Guide For Librarians and Archivists," which tells you in great detail exactly what goes wrong when you mishandle a disc. When we talk about reliability, we easily remember numbers from quotes like "An accelerated aging study at NIST estimated the life expectancy of one type of DVD-R for authoring disc to be 30 years if stored at 25°C (77°F) and 50% relative humidity" (far under the industry consensus of 100-200 years). What we don't remember as easily is that reliability is directly affected by handling. Part of any backup plan needs to be hygiene practices, if you will: taking care of the backup media and hardware.

Costing Storage Over Time

Amazon S3 is $0.15/GB/month, or $1.80/GB/year, and for someone looking to store a few hundred gigabytes forever, this is an important consideration. Surely the reliability of hard drives is better than one year, so why would we pay Amazon $1.80 per year for our gigabyte when we can get away with copies on a couple of drives at 36 cents per gigabyte each, or a couple of DVDs at 9 cents per gigabyte each?

The JungleDisk web site quotes a recent independent study on drive reliability at Carnegie Mellon by Schroeder and Gibson, stating that 15% of hard drives will fail within 5 years. If you ignore multiple drive failures and count on retiring a 5-year-old drive, I figure that means a drive at price P will actually cost 1.15 P once replacements are included. Of course you'll probably want to buy the replacement drives up front, and unless you're buying 100 drives or more, you'll actually spend more than 15% extra on replacements, perhaps up to 5 P depending on how much redundancy you want (we'll get to RAID later). At the minimum of 2 P for us small-scalers who aren't putting 100 drives into active use simultaneously, my $182 LaCie drive comes out to $364 over 5 years, or $0.013/GB/mo. That's still a healthy distance from $0.15/GB/mo. I could buy five more replacements, acknowledge that a "500GB drive" is only 460GB, and still be at $0.04/GB/mo. What else do we need to consider?
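
To make the arithmetic concrete, here's a minimal sketch of the amortized-cost comparison in Python. The prices and capacities are the figures quoted above; the five-year service life and the buy-spares-up-front multipliers are my assumptions for illustration, not anything from the Schroeder and Gibson study.
# Rough cost model comparing owned drives to S3-style monthly pricing.
# Figures are the ones quoted in the text; the 5-year service life and
# the spare-drive multipliers are assumptions.
def cost_per_gb_month(total_price, usable_gb, months):
    """Dollars per gigabyte per month, amortized over the period."""
    return total_price / usable_gb / months
MONTHS = 5 * 12          # assume a 5-year service life
DRIVE_PRICE = 182.60     # the LaCie 500GB external drive
USABLE_GB = 460          # what a "500GB" drive really holds
for spares in (1, 5):    # one spare vs. five spares bought up front
    total = DRIVE_PRICE * (1 + spares)
    print("%d spare(s): $%.3f/GB/mo" % (spares, cost_per_gb_month(total, USABLE_GB, MONTHS)))
print("Amazon S3: $0.150/GB/mo")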

Inexpensive Disks or Cheap Disks?

Marcus Smith is the owner of a company providing IT services, including backup and recovery, to small businesses, and he had some things to say about hard drives as an archival medium on the Association of Moving Picture Archivists mailing list:

I completely agree that hard drives are only one component of an overall backup strategy and that multiple technologies will need to be employed in order to provide truly safe nests for our precious ones and zeros. There are a couple of universities experimenting with the idea of ditching tape systems and instead building very large hard drive arrays in which drives are given a certain time to live and are replaced on a rotating schedule. If every drive actually lived up to an exact lifespan, this idea would probably be used more often. The sad truth is that hard drives can fail at any time, and this unpredictability alone is enough to show how undesirable they are as an archival medium.

The idea that cost-per-megabyte savings is an enabling feature needs more clarification. How exactly does this translate into providing new methods of storage? One major problem with using a cost-per-megabyte argument in favor of large capacity drives is the intended market. Who is buying these drives? Who is using them for long-term storage?

When it comes down to it, I'd be willing to wager that far more people are using hard drives as their sole mechanism for storage because of their low cost, which therefore makes price a liability, not a benefit. As the price of drives drops, it only traps us tighter into relying on unreliable media.

To store the same amount of data on other media - tapes, really - we're already spending thousands of dollars. The low cost and high capacity of hard drives is exactly why ATA is used instead of SCSI, and it continues to have a draining effect on the quality of any solution because of the lack of any other cheap storage media. As long as cheap drives push the market, that is what most people will use. Again, this points to the question of who the market really is. Are we talking about large archives who may or may not be able to spend money on developing proper, multiple-technology storage solutions? Or are we talking about the "prosumer" level, where individuals migrate film or store native digital video on hard drives because there is simply no other reasonably attainable storage device? Since Fry's is offering cheap drives, and archival institutions probably aren't shopping there for serious storage solutions, it seems to me that we're really talking about individual consumers who need basic, fast, easy-access storage. In short, folks who probably don't have tape drives, multiple-technology backup systems, or well-planned rotating storage schedules.

Costing Data Failure

For JFKBits readers, I know our needs are all over the map: high-res digital photos; Linux backups, including system configuration; personal creative works representing who knows how many hundreds of hours of labor. In most cases the scenario of losing our data affects mostly ourselves. I contacted Marcus for permission to quote him in this article, and he described his small business clients: lots of professionals, where other people depend on the data, and losing information invokes the specter of a lawsuit. When we make our backup plans, we know the purpose is to prevent data disaster. But when we start costing the plan, the cost of the data disaster itself tends to be left out:

Determining the value of the data you want insured may not be easy. A single database or secret recipe may be the heart of your business, and estimating the worth of that intangible asset is a problem that can give CFOs nightmares.
--Kevin Savetz in "Data Insurance", published in New Architect May 2002

RAID When Things Go Wrong

At this point, RAID comes forcefully to mind. Duplicate the data and add some automated error correction and detection. More caution, more consideration, urges Marcus:

RAID arrays can offer some benefit, but there are dangers here also, and they need to be addressed. RAID arrays do not technically have to be constructed with identical drives, unless one uses crappy hardware controllers. But for RAID to work efficiently, drives of identical geometry and capacity should be used to build the array. Most RAID systems are based on using identical drives. Here again money comes into play. RAID 5 is nice for the overall capacity, but identical drives may be expected to have similar failure characteristics. Let's mentally build an array with five drives, and buy five more for replacement. It could very well be that within two years all ten drives are exhausted, and two-year-old drives can be very difficult to procure. Nearly impossible, actually. There are two obvious problems: (1) you're only as safe as the number of drives you buy, which is going to be many. Go ask Compaq for a 9.1gig hot-swap replacement for a Proliant server. For that matter, head over to Fry's and ask for an IBM 30-gig Deskstar drive. It will never happen. (2) If identical drives have similar failure characteristics, then it stands to reason that when one drive fails, the other drives in the array are not far behind. There are also two non-obvious problems: (1) If two drives fail before the operator knows there is a failure, all of the data on the array is lost. (2) Data recovery from missing RAID volumes is notoriously difficult and usually ends in failure.

Luckily there are other RAID options, like mirroring and striped mirrors. Again, be prepared to invest in drives. Lots of drives. These options use half the total drive capacity: for 500 gigs of space you need not five 100gig drives, but ten 100gig drives. You cannot do this with ATA drives because of the limitations of ATA controllers. I believe the largest configuration possible on a single controller is the 3Ware 8-channel RAID adapter. Eight drives plus eight spares. Plus more if this is where the Company Policy places its long-term future. Add a storage rack for this and a Very Expensive RAID adapter, and suddenly the one-year warranties on the drives turn an expensive project into an expensive and unreliable project. The benefit over RAID 5 is that if it comes to it, data recovery is much easier from this kind of array.

So yes, we need to think very, very differently about storage. RAID solutions are not as safe as they're cracked up to be, and they depend greatly on the vigilance of the systems administrator to keep things running smoothly.

Whew! The overriding message here, I think, is that RAID doesn't solve everything - you'll want to buy your replacement disks up front, since you probably can't get them in 5 years, and you still need to keep watch over the health of the RAID array. (I had extra time to think about this on my way home tonight as I changed a flat tire.) And as the experience of one of our readers reveals, you also need a strategy to verify that your backups happen and that the backups themselves are OK. One JFKBits reader checked his backups after reading the recent article and discovered that his backup software had actually scrozzled the data. Now how much would you pay to make sure this all gets done right? Ready to fork over that extra dime per gig per month? I thought you might be.
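
On the point of verifying that the backups themselves are OK, here is a minimal sketch of one way to do it, assuming source and backup are both mounted as directory trees; the two paths at the bottom are placeholders, not a recommendation of any particular layout.
# Hash every file under the source tree and compare it against the file
# at the same relative path in the backup tree. Placeholder paths; error
# handling kept to a minimum.
import hashlib
from pathlib import Path
def file_digest(path, chunk_size=1 << 20):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            digest.update(block)
    return digest.hexdigest()
def verify(source_root, backup_root):
    source_root, backup_root = Path(source_root), Path(backup_root)
    mismatches = []
    for src in source_root.rglob("*"):
        if not src.is_file():
            continue
        dst = backup_root / src.relative_to(source_root)
        if not dst.is_file() or file_digest(src) != file_digest(dst):
            mismatches.append(src)
    return mismatches
if __name__ == "__main__":
    for path in verify("/data/photos", "/mnt/backup/photos"):
        print("MISMATCH:", path)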

Conclusions

We all know we need backups. If we're thinking long-term enough to realize that even the backup media may fail, we've started planning the cost over time. Making your plan depends on what you're doing: backups of a machine image or a database have different needs than backups of segmentable data, such as a pro photographer's. Amazon S3's figure of $0.15/GB/mo is not unreasonable, but you will still want a plan to verify integrity for yourself. A $1000 RAID NAS box (maybe two) is reasonable, provided you buy the replacement drives up front and verify the backups. If your data is segmentable, DVD backups are quite reasonable, provided you have the time to burn plenty of copies, distribute them geographically, store them in a climate-controlled environment, and organize them well. Tape backups are not dead either, and we'll close by letting Marcus have the last word:

I personally am much more in favor of tape systems than hard drives. I just wish they were less expensive. My problem is the same as many others - I don't have $5000 to shell out for a tape drive to store 400gigs of materials. Unless, of course, someone would like to buy me a new LTO drive and a bunch of tapes.

Does anyone want to wax poetic about the beauty of Tar as backup software?

Wednesday, November 28, 2007

Fifty Cent Lecture on the Unification-Based Type Inference Algorithm

In December 1978 the Journal of Computer and System Sciences published "A Theory of Type Polymorphism in Programming." In 1992 I was introduced to Standard ML in compilers class, and used it to write a typechecker for our Ada-subset compiler. I was in love with types, typecheckers, and type inference. In 1993 my good friend and I tried unsuccessfully to implement type inference as part of a much larger project, without actually researching the prior art; we needed to learn the fine art of finding and reading research papers. A few years later I was back in grad school, and got the help I needed to locate the right papers and understand them well enough to implement an inferencing typechecker. The paper for that project is dated December 8, 1997. I'm very satisfied that ten years later, type inference has not gone by the wayside but is poised to go mainstream, following the path automatic memory management (GC) took via Java.

In commemoration, then, let's explore this fascinating feature a bit, since it may soon be coming to a programming language near you.

Cocktail Party Explanation (Type Inference in 30 Seconds)

The way most bloggers are explaining type inference lately is with a Java/C# assignment example. Instead of typing
WaitingPageHandler handler = new WaitingPageHandler(args);
which has this offensively redundant "X x = new X" form, you can get away with something like
var handler = new WaitingPageHandler(args);
How does the typechecker figure this out? This doesn't look too hard to infer. What about an even simpler form:
var delay = 250;
Obviously, the literal 250 is an int, and so is delay. The types of literals like 250, 3.14159, -1L or "duck" are well-defined by the language. An expression involving the new operator has the type of the class being instantiated. Here we see the easiest part of understanding type inference -- that there are some expression tree leaves whose types can be inferred immediately.
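
To make the leaf case concrete, here is a toy Python sketch that classifies a few literal forms; the rules are loosely Java-flavored and purely illustrative, not any particular language's grammar.
# Leaf case of inference: literal types are fixed by the language.
# A toy classifier for a few literal forms (illustrative only).
def literal_type(token):
    if token.startswith('"') and token.endswith('"'):
        return "String"
    if token.endswith("L"):
        return "long"
    if "." in token:
        return "double"
    return "int"
for tok in ['250', '3.14159', '-1L', '"duck"']:
    print(tok, ":", literal_type(tok))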

Of course you don't just write assignments. As you know from studying lambda calculus, there are three fundamental operations to care about: lambda introduction (corresponding to introducing literals), variable reference, and function application. So how does type inference handle variable reference and function application?

Checking Variable References with Known Type

Let's tackle variable reference first. What if we were given this:
var timeout = 500;
var timeoutTime = System.currentTimeMillis() + timeout;
In the second line, what type does timeout have? It's an int, but how does the typechecker know it? It was inferred from the first line, you're right. But technically, the typechecker probably doesn't know or care that it was inferred. After the typechecker inferred the type of timeout in the first line, it updated the environment so that in subsequent processing timeout's actual type is known. The typechecker does a lookup in the environment for the type of timeout before doing anything else, just as it would in an ordinary statically typed language's typechecker.

So we come to the clue that, as with ordinary static typechecking, an environment is needed. The environment is a stack of identifier-to-type bindings. Environments will come very much into play as we develop the algorithm.
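
As a concrete, if toy-sized, picture of that, here's a sketch of an environment as a stack of scopes in Python; the class and method names are mine, not from any particular implementation.
# A toy typing environment: a stack of scopes, each mapping identifier
# names to types; lookups search from the innermost scope outward.
class TypeEnv:
    def __init__(self):
        self.scopes = [{}]                 # start with a global scope
    def push_scope(self):
        self.scopes.append({})
    def pop_scope(self):
        self.scopes.pop()
    def bind(self, name, ty):
        self.scopes[-1][name] = ty
    def lookup(self, name):
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        raise KeyError("unbound identifier: " + name)
env = TypeEnv()
env.bind("timeout", "int")                 # inferred from: var timeout = 500;
print(env.lookup("timeout"))               # -> int, used when checking line two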

We have looked at how variable reference works when the type is known. There is another possibility, that the type of the variable being referenced is unknown, but first let's look at function application when the function's type is known.

Checking Function Application with Known Type

What about this simple line:
var delay = 15 * 60;
Given two ints, * returns an int, making int the type of the function application expression *(15,60), and thus the type of the variable delay. More generally, when we see a function call and the type of the function is known, the type of the expression is that of the function's return type.
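
In sketch form, checking a call to a function of known type might look like this; the types are plain strings and the (parameter list, result) pair representing a function type is something I made up for illustration.
# Checking a call to a function of known type: each argument type must
# match the corresponding parameter type, and the call expression takes
# the function's return type. Types are plain strings for brevity.
def check_application(fn_type, arg_types):
    param_types, result_type = fn_type
    if len(param_types) != len(arg_types):
        raise TypeError("wrong number of arguments")
    for expected, actual in zip(param_types, arg_types):
        if expected != actual:
            raise TypeError("found %s when expecting %s" % (actual, expected))
    return result_type
int_times = (["int", "int"], "int")        # the type of * on ints
print(check_application(int_times, ["int", "int"]))   # -> int, so delay : int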

Inferring a Type

The interesting part of type inference comes when a variable is referenced but its type cannot be immediately inferred. A good example of this situation is when the variable is the argument to a function:
function(x)
{
print(x);
if(x < 2)
1
else
x*factorial(x-1)
}
This puts us into a little deeper water. But stay with me, because this is it: the heart of basic type inference using unification. First, notice that the type of x is obviously int, from the comparison in the if test and also from the expression x-1, which is similar to the 15*60 example above. The fun comes in how we can process this line by line and arrive at the right conclusions. The print(x) line is a function call where x's type is as yet unknown. We assume that, as in most languages, print is overloaded for all types, or at least all the basic ones like int and String. By the time the typechecker is done, we do need to know x's type so our code generator or interpreter has enough information to call the right print code. How?

When we encounter a reference to a variable with unknown type, we tag its type with a unique type variable. Wuzzat? Hello? you say. Yes, we will need a separate environment that tracks type variables. For example, we will say that x's type is that of type variable t1, and in our type environment we say that t1 is bound to an unknown type -- the variable is kept in symbolic form; it's unbound. That's all we can do, and all we need to do, for the print(x) line, so we move on.

When we encounter the next line, the if expression, we will type check its three arguments: the test expression, the "if true" expression, and the "if false" or "else" expression. So far we haven't mentioned unification, but here's where it really becomes necessary. What we actually do to typecheck the x < 2 expression is run the unification algorithm. The types of the arguments to the less-than operator (which you should always remember to properly escape in your blog editor if you don't want to mysteriously lose portions of your text) are unified with their expected types. For example, we know the less-than operator takes two numeric operands. We'll skip over the overload resolution part here and assume that the literal 2 tells us we're using the int version of the less-than operator. So we know that this operator is a function which takes two ints. We unify the type of the first argument's expression x with the type of the function's first argument, which we just said is int:
Unify(t1,int)
In this case the unification algorithm is getting a type variable and a concrete type, indicating we've discovered what type the variable should be bound to. So at this point we can update t1 in the type environment:
t1 = int
That's it! That's perhaps the key part of the unification-based type inference algorithm.
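
In code, the case just described, a type variable meeting a concrete type, might look like the following sketch. It is a deliberate simplification: concrete types are plain strings, and function types, constructors, and the occurs check are all left out.
# Minimal unification sketch: a type variable meets a concrete type
# (here just a string like "int"). Real checkers also handle function
# and constructor types and perform the occurs check.
class TypeVar:
    count = 0
    def __init__(self):
        TypeVar.count += 1
        self.name = "t%d" % TypeVar.count
        self.binding = None                # None means "still unknown"
def prune(ty):
    """Follow a chain of bound type variables to whatever it ends in."""
    while isinstance(ty, TypeVar) and ty.binding is not None:
        ty = ty.binding
    return ty
def unify(a, b):
    a, b = prune(a), prune(b)
    if a is b:
        return                             # already the same thing
    if isinstance(a, TypeVar):
        a.binding = b                      # we just learned what a stands for
    elif isinstance(b, TypeVar):
        b.binding = a
    elif a != b:
        raise TypeError("found %s when expecting %s" % (b, a))
t1 = TypeVar()                             # the type given to x at print(x)
unify(t1, "int")                           # from checking x < 2
print(prune(t1))                           # -> int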

From this point on, whenever we look up the type of x in the environment, we'll still be referred to the type variable t1, but when we ask the type environment to evaluate that type expression, we'll get the concrete type int. We can construct our typechecker to have two passes: one that runs this algorithm and guarantees the type environment has nothing but known types afterwards, and then a subsequent pass that updates the environment to concrete types.

Also, note an assumption in the algorithm: once we have any concrete type for x, we use it. If later in your program x is used in a context requiring a different concrete type, the typechecker will flag a type error. This is where you get errors like "found a real when expecting an int".

Equivalent But Still Unknown Types

Another key part of unification comes when a variable of unknown type is unified with another variable of unknown type. For example:
function(x,y)
{
z = if(choose()) x else y;
2^z
}
In this case the first line of the function tells us only that x, y and z are the same type. This is important information that must not be lost. Let's step through these expressions. In the first line, we check the if/else expression. We assume the type of choose() is known to be boolean-valued, to match the expected type of an if test expression (and if the return type of choose weren't known, it would be inferred from its use here). We then check the two branches of the if, and since they are simple variable references, we generate a unique type variable for each and bind the variable reference appropriately. In other words:

In the type environment,
t1 -> unknown
t2 -> unknown
In the user variable environment,
x -> t1
y -> t2
This is the state of things after checking both expressions. The last part of checking the if/else is to unify those two expressions, since we know they must match. In other words,

Unify(t1, t2)
is evaluated. We saw what happens when unifying a type variable and a concrete type: we bind the type variable to the type. But here both are variables. In this case, unification says we simply remember the equivalence of these unknown type variables. The way I handled this was to augment the type environment to keep a list of known equivalences for each type variable:

In the type environment,
t1 -> equivalences[t2]
t2 -> equivalences[t1]
If we ever find a concrete binding for t2 we can bind t1 as well as t2 at that time. And that should explain the rest of the example.

When we evaluate the assignment statement z = if, we enter a similar type variable entry for z and make it equivalent to both t1 and t2:

In the type environment,
t1 -> equivalences[t2,t3]
t2 -> equivalences[t1,t3]
t3 -> equivalences[t1,t2]
The way you handle this equivalence list is an implementation choice but I hope you get the idea.
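
One common implementation choice is to keep the equivalences in a union-find (disjoint-set) structure instead of explicit lists: every type variable points toward a representative, and binding any member of the class to a concrete type binds them all. Here is a minimal sketch of that idea; the names are mine.
# Equivalence classes of type variables as a union-find structure. When
# any member of a class receives a concrete type, the representative
# records it and the whole class sees it.
class TVar:
    def __init__(self, name):
        self.name = name
        self.parent = self                 # each variable starts as its own class
        self.concrete = None               # e.g. "int" once discovered
def find(v):
    while v.parent is not v:
        v.parent = v.parent.parent         # path compression
        v = v.parent
    return v
def union(a, b):
    ra, rb = find(a), find(b)
    if ra is not rb:
        rb.parent = ra
        ra.concrete = ra.concrete or rb.concrete
def bind(v, ty):
    find(v).concrete = ty
t1, t2, t3 = TVar("t1"), TVar("t2"), TVar("t3")   # the types of x, y and z
union(t1, t2)                              # from unifying the if/else branches
union(t1, t3)                              # from z = if ...
bind(t3, "int")                            # from resolving 2^z
print([find(t).concrete for t in (t1, t2, t3)])   # ['int', 'int', 'int']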

When we finally hit the function's final expression 2^z, we again have an overloaded operator, but the presence of 2 helps us resolve it to the integer power operator, so we can unify the type of z with int, letting us simultaneously bind the type variables t1, t2 and t3, since we know they are equivalent.

Where Polymorphic Variables Come From

Sometimes when checking a function the parameter may never be used in a context specific enough to bind its type. What happens then? For example:
function(x)
{
(x, [])
}
This takes an argument and returns a pair of the original argument x and an empty list, possibly a normalizing step for a function that will visit x and accumulate some results seeded with the empty list. The pair construction works for values of any type, so it adds no information that can be used to deduce a particular type for x. When the typechecker gets to the end of the function declaration, x is found to still be bound to its type variable. When this happens, x is "promoted" to a polymorphic type. In other words, since x is not used in any particular way, it can be used with any type. A language designer could choose not to support polymorphic types for some reason, and could flag this as an error.
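
As a tiny sketch of that promotion step, simplified down to "is the parameter's type variable still unbound?", with illustrative names only:
# After the function body is checked, any parameter whose type variable
# is still unbound gets generalized: the function is polymorphic in it.
class TVar:
    def __init__(self, name):
        self.name = name
        self.binding = None                # stays None if never constrained
def generalize(param_types):
    """Names of the type variables to quantify over ("promote")."""
    return [v.name for v in param_types.values() if v.binding is None]
params = {"x": TVar("t1")}                 # function(x) { (x, []) } never constrains x
print(generalize(params))                  # ['t1'], so x gets a polymorphic type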

This example also introduces the notion that type inference operates within a certain boundary. There has to be a point in the abstract syntax tree where the typechecker knows this is where to promote still-symbolic types to be polymorphic. In Standard ML, I believe this boundary is the named function definition introduced by the fun keyword.

Conclusion

The complete algorithm is a little more involved than I've described here, but I hope this helps you understand the essential pieces and the general flow. For more details please see "Principal type-schemes for functional programs" by Luis Damas and Robin Milner (ACM 1982) or the original "A Theory of Type Polymorphism in Programming" by Robin Milner (Journal of Computer and System Sciences, Vol.17, No.3, December 1978, pp.348-375), particularly section 4 "A Well-Typing Algorithm and Its Correctness."

Tuesday, November 27, 2007

Warehouse Robot Video Clips

My dear JFKBits readers, for your enjoyment I submit these video clips of Kiva Systems' warehouse robotic picking system:

Friday, November 23, 2007

Different type returned by getElementsByTagName in Safari, Others

The JavaScript function getElementsByTagName recently gave me cross-browser grief. I had a web page with a bunch of checkboxes (a type of input element) and wanted to iterate over the ones with a certain class. So I wrote this:

var nodes = document.getElementsByTagName('input')
for(i in nodes)
{
if(nodes[i].className == 'aCertainClass')
; // process checkbox
}
This worked fine in Firefox 2 and IE 6. Later I started testing with Safari for Windows and noticed a problem. I tracked it down to this discovery: getElementsByTagName returns different types in Firefox and Safari.

In Firefox, getElementsByTagName returns an HTMLCollection, which works fine with the iterator style for(i in nodes). In Safari, it returns a NodeList, and iterating with for(i in nodes) takes you on a trip through the object's methods, rather than its collection contents.

The fix for this was to change the iteration style, since both HTMLCollection and NodeList support the length field:

var nodes = document.getElementsByTagName('input')
for(var i=0; i < nodes.length; i++)
{
// same as before
}
For what it's worth, I've only seen NodeList documented as the expected return type from getElementsByTagName. Why it doesn't work with the for(i in nodes) iteration syntax is something for which I'd like an explanation.

Thursday, November 22, 2007

Half a Terabyte, Hold the Gravy

What do you do when you've got a device that creates very valuable 2MB files with the press of a button, an action that you may repeat up to 600 times an hour (maybe 20GB a week)? You end up buying another device, such as this puppy: the 500GB LaCie D2 HD Quadra 7200RPM 16MB external drive. I had never before heard of LaCie, a French concern, but this drive seemed to fit the bill as a backup device. It has a reasonable dollars-per-gigabyte number, the reviews seemed encouraging, and I went ahead and paid the extra $60 to get the FireWire-capable model, as FireWire seems to be recommended over USB for the sustained reads and writes typical of an external drive.

PriceGrabber got me thinking about the whole backup plan from the dollars per gigabyte angle for different types of storage devices, summarized here:

Tech          Sample Product                           Price w/ S&H   Capacity   $/GB
DVD-R         Verbatim DVD-R 16x 100-pack              $42.27         470GB      $0.09/GB
external HD   500GB LaCie D2 HD Quadra 7200RPM 16MB    $182.60        500GB      $0.36/GB
flash memory  PNY 4GB USB flash drive                  $31.89         4GB        $7.97/GB
DVDs are an economical choice as a backup medium, especially in my case where making modifications is not important. The 100-disc spindle is almost a direct comparison to the LaCie drive in terms of capacity, since the LaCie's actual capacity is more like 460GB. The external drive has the advantage that you don't need to find the right disc when you need to retrieve your backup, so it is a good "live" backup solution. The external drive also gives you a little bit of portability, and the model I bought has four connection technologies (hence the name "Quadra"), giving some flexibility in working with other devices.

The USB flash drive, on the other hand, is a terrible choice for long-term storage, as far as I can tell. That salesman at the electronics store certainly had a lot of nerve trying to talk my Dad out of buying blank CDs in favor of a flash drive. The advantage of flash drives is portability, not economics. They weren't looking much cheaper than $8 a gigabyte on PriceGrabber for 1GB and up.

I've left out some options that I didn't research as much: the venerable tape backup, internal hard drives, and NAS. Internal hard drives in a RAID might be a good alternative to the single external drive because you get some automatic failure detection. And NAS, or Network Attached Storage, goes one better in simplicity.

For this exercise, my personal backup plan ditches the idea of an "archive quality" medium such as tape in favor of any storage medium that lives long enough to copy the data onto something newer.

Another idea is to build layers of redundancy; one backup is not good enough, and maybe not even two. Somewhere in the back of my head in all this is my dear Grandma's judgment of disgust at the idea of computers, where you can erase everything at the press of a button. This information vulnerability is certainly an Achilles' heel of computing, and it's been an interesting exercise to price out and plan a moderately serious backup system. I'm certainly a novice at this, so if you have some experience to share, I'm all ears.

Friday, November 16, 2007

Quick Thoughts: Programming Productivity

I was thinking hard yesterday about which languages let you be more productive. I'd like to say functional languages do. It seemed it would be so nice to have a few studies done on this, some clinical trials of Scala the wonder drug, if you will, with comparisons to languages that have a high placebo effect (they make you think you're writing a program when really you're just creating more work for everyone).

Just now I connected that thought with an earlier one: software is difficult to estimate because it's always the creation of something that has never been done before (even when you're writing a rip-off of an existing program). I reckon that the more unknowns you have, the more risk that your initial estimates will be wrong.

The connection between these two ideas is that it's hard to judge the productivity of a programming language when the utility of the language is to write things that have never been done before. You really do have to actually use and test it in practice. Hence it seems especially useless to have online debates about the productivity of languages without trying them out. It's a little like trying to predict the time performance of a piece of code by examining its source; you really need to actually run it.

For further reading, see Ian Chai's doctoral thesis on documenting object-oriented frameworks (Ian studied under Gang of Four member Ralph Johnson), in which he conducted experiments with human subjects to compare how successful different types of documentation were at training someone to use a new framework. It's the only research I know of (not that I'm an expert) where a psychology-style experiment has been done to test the productivity of a software environment.

Update: added parenthetical remark identifying Ian's advisor.

Wednesday, November 14, 2007

It's Doing What You Tell It, or Concurrent Programming Rediscovered

How did you come by your expertise? If you're a seasoned programmer you probably have a treasure chest of "war stories", experiences that you use to do your job. Some of the best are those that help you "get" a concept in some new profound way. I want to share a very simple such story.

A coworker of mine was having problems with some multi-process programming, and since the behavior looked like a race condition, we looked through the code for synchronization problems.

"Right here," I said, "you're reading this shared value without locking it, and writing the changed value back here."

"But," my coworker objected, "most of the time the two processes are not running at the same time. I don't really need to lock that, do I?"

When put that way, I saw the basic synchronization problem in a new helpful light, and I responded "Yes, you're right, and most of the time you don't have the race condition. It's doing what you told it to do."

The problem is that with a computer, our intuition of "most of the time" is skewed by the fact that it's got, like, billions of things to do before this second is up.

The program was behaving exactly as it was written: no synchronization problems most of the time, just some of the time. The frequency of opportunities for the race to occur was simply too high for the application.
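
The lost-update pattern we were staring at is easy to reproduce. Here is a minimal sketch in Python, not the coworker's code, just the shape of the problem: two workers doing an unlocked read-modify-write on a shared value.
# Two threads increment a shared counter with an unlocked
# read-modify-write. Most iterations are fine; every so often one
# thread's write clobbers the other's, and updates are lost.
import threading
counter = 0
def worker(iterations):
    global counter
    for _ in range(iterations):
        value = counter                    # read the shared value (no lock)
        value = value + 1                  # modify
        counter = value                    # write back; may overwrite a peer's update
threads = [threading.Thread(target=worker, args=(200000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("expected 400000, got", counter)     # often comes up short; timing-dependent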

I think we make this type of tradeoff in programming all the time, where we reject developing a more complicated mechanism as being not worth the effort for the amount of time it would cost us. I think primarily of error handling, but any edge case outside the typical flow of control is a candidate for being "undeveloped." Indeed, sometimes this is the only way we can manage the prohibitive number of different cases that exist. But of course this was ordinary concurrent programming. I think when you get used to it, you develop the instinct to program for the contention points, knowing they must be handled. But it is very tempting for the novice to model the situation as we do these edge cases: the typical no-contention case and the infrequent contention case.

Got any stories about concurrent programming? Leave a comment or share it with me, jfkbits at gmail.

Wednesday, November 07, 2007

Remove This - A Search Interface Improvement

The computer world has a lot of search interfaces these days, and I don't mean just GOOG (732.94, down 8.85 in heavy trading). I'm talking about things like "Search Messages" in Thunderbird and product searches on PriceGrabber. In the old library search style, you had to think about which property you wanted to search by: Title, Author, or Subject, depending on which index you wanted. Google let us focus on the search term and let the computer search all the indexes.

Either way, the user interface for reporting results is the same: you get a big list to sort through.

And either way, if you find your search isn't working, you start a new, slightly different search.

But I'd like to propose a twist: a "Remove This" button to exclude certain results.

Deleting...From What? with Thunderbird

I should explain how I got this idea: I thought Thunderbird already did it, and wound up deleting some of my email.

I set up some elegant query and still got back a whole lot of results in the nice table view. I noticed the Delete button, so I selected a number of results that were not relevant and clicked it, thinking that would remove them from the search view. Ho, ho, ho. They were list traffic archived elsewhere, so no biggie. But then I wondered: why not also offer the feature I thought it was offering?

Remove This

This got me wishing that all search interfaces worked like that - with a Remove This button or link. If you know you don't want a result, eliminate it from consideration in future variations of your query. It helps remove the clutter of seeing the same undesired but superficially related result come back each time you refine your query.

Of course this implies some state to track your list of exclusions, and some way to clear or manage them. For an application like Thunderbird, it's no problem, and for product searches like PriceGrabber's, which already support a complex filtering system, it's a logical extension. For Google, I'm not sure whether it makes sense, interface issues aside, but I think it's worth considering for the uncluttering effect it can have.
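
To make that state concrete, here is a toy sketch of a search session that remembers a Remove This exclusion list across query refinements; the class, the corpus, and the crude matching rule are all made up for illustration.
# Toy search session with a persistent exclusion set. The matching rule
# (every term must appear in the result's title) stands in for a real
# search engine.
class SearchSession:
    def __init__(self, corpus):
        self.corpus = corpus
        self.excluded = set()              # survives across refined queries
    def search(self, query):
        terms = query.lower().split()
        return [doc for doc in self.corpus
                if doc not in self.excluded
                and all(term in doc.lower() for term in terms)]
    def remove(self, doc):                 # the "Remove This" button
        self.excluded.add(doc)
    def clear_exclusions(self):            # manage or clear the exclusion state
        self.excluded.clear()
session = SearchSession([
    "LaCie external drive review",
    "LaCie drive firmware download",
    "External drive carrying case",
])
print(session.search("drive"))             # all three results
session.remove("External drive carrying case")
print(session.search("external drive"))    # the removed result stays gone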

(Above is an artist's conception of how Remove This could appear in a Google search result.)

As a footnote, Supercomputing 2007 kicks off this weekend in Reno, Nevada. I'm not going this year, but I hear my software is, and I wish everyone attending a fun and productive time.