Monday, December 31, 2007

2007 Year-End Language Version Wrapup

Following the 2006 language version wrapup, we present the latest stable version or other improvements to some of our favorite languages and tools at the end of 2007 :

While I don't use all of these a lot (or even a little, like Ada), they're all tools close to me.

As we close the new year, I'd like to invite my readers to leave a comment (can be done anonymously) stating (1) your #1 most used programming language this year (2) a language you'd most like to try out.

Friday, December 28, 2007

Revisiting Tree Traversal Patterns

Experts may disagree, but I like to agree with LtU that the killer app for functional languages is compiler writing. I propose that one simple reason for the success is structural pattern matching, because it lets you write compact and natural tree traversal code:

datatype Expr = Atom of string | Pair of {car : Expr, cdr : Expr};
fun textFormatter(Atom a) = a
| textFormatter(Pair{car=x, cdr=y}) =
concat["(",textFormatter x,",",textFormatter y,")"]
Do OO languages have any hope of approaching this kind of conciseness? That led to my question in Tree Traversal in Object-Oriented Languages "which of these two techniques do you like better and is there a better one?"

In that post JFKBits reader "peter" commented my "event model" was just the Visitor Pattern. I'd had another idea of what the Visitor Pattern was. After some digging I learned there are two variations both considered the Visitor Pattern, and today we'll look at what difference it makes for writing programming language processors.

Why Tree Traversal Matters

For the programming language tool writer, a tree traversal pattern is like the for loop (or other iteration pattern) for most other programmers -- after all, your bread and butter data structure is the tree, not a flat sequence.

In a word, tree traversal matters if you traverse a lot of trees - sum up the number of traversals you write in a single program or the number of processors you write over time.

You could be writing a series of toy interpreters each with a single eval loop, like this one for Scheme and this one for Lisp both written in Python.

On the other hand, you could have one large project with many phases. I have a 12,000 line compiler for Ila written in Standard ML with five major tree traversals areas in the compiler core, plus some utility traversals like error message pretty printing and a Javadoc-like HTML generator. Writing that compiler one time impressed me that tree traversal pattern matters, especially since there were quite a few node types.

Further, your choice of tree traversal pattern will immediately affect the way you think about your algorithm. I think one variation of visitor makes it unnecessarily difficult, and one makes it quite a bit more natural.

Visitor Pattern in General

The visitor pattern is generally described as a way to decouple algorithms that operate on a data structure of related node classes from the node classes themselves.

The double-dispatch technique is often described in presentations of visitor. To use the visitor pattern, take the tree object t to traverse, a visitor object v corresponding to the algorithm (type checking, syntax highlighting) and call the virtual method t.accept(v) (this is the first dispatch). The node subclasses (Atom, Pair) each have an implementation of accept that calls the visitor method corresponding to their particular class (the second dispatch, hence "double dispatch"). That way the visitor can be written as a collection of methods that each know the exact node subtype they're dealing with.

Visitor Pattern for Language Processors

What's often glossed over is the critical question where is the knowledge about a node's children? In other words, where does the recursion/iteration occur - in the acceptor or the visitor? I originally imagined the visitor pattern was defined to mean recursion occurs in the acceptor. Either way is considered an accepted variation of the pattern, but I claim it makes a substantial difference for most compiler algorithms.

Recursion in the Acceptor

Here's recursion-in-the-acceptor for the Lisp structure above:

public abstract class Expr
{
public abstract void accept(Visitor v);
}

public class Atom extends Expr
{
public String value;
public void accept(Visitor v)
{
v.visit(this); // Atom has no children, simple accept
}
}

public class Pair extends Expr
{
public Expr car;
public Expr cdr;

public void accept(Visitor v)
{
car.accept(v); // children visited in the acceptor
cdr.accept(v);
v.visit(this); // postorder traversal
}
}

public interface Visitor
{
void visit(Atom a);
void visit(Pair p);
}
Now, given those definitions, how would you write the TextFormatter iterator that produces the same output as the SML example given earlier?

Recursion in the acceptor introduces some problems. One is that it fixes the traversal order -- postorder in this case. Now the iterators need to know what the traversal order is, a dynamic property that must be remembered when writing each visitor. Also, what if postorder isn't what you need? In particular, in order to generate the grouping parentheses, which are associated with the Pair, you would like code associated with the Pair to run both before and after visiting the Pair's children. Further, you wanted to generate a comma in between -- that means a visit in between the two children. You're forced to write something like this:

import java.util.Stack;
public class TextFormatter implements Visitor
{
private Stack m_stack = new Stack();
private String m_result = null;

public void visit(Atom a)
{
m_stack.push(a.value);
}

public void visit(Pair p)
{
// We need to know the order of children traversal. Hmmm.
String cdr = (String)m_stack.pop();
String car = (String)m_stack.pop();
String value = "("+car+","+cdr+")";
m_stack.push(value);
}

public String toString()
{
if(m_result == null)
m_result = (String)m_stack.pop();
return m_result;
}
}
Using an explicit stack is not 100% always bad -- if you tend to process programs with extremely deep nesting, you may hit your implementation language's control stack limit, but overall this is not the style in which I'd prefer to write dozens of visitors.

Note that the SAX parser model, which follows recurse-in-the-acceptor, addresses the fixed-order problem by having multiple visit methods for nodes with children: it has startElement as well as endElement, and the like.

The other problem with this approach is that not every visitor needs to visit every node. Some may only need to go so deep. That brings us to the recurse in the visitor approach, which I would consider the recommended one for programming language processors.

Recurse in the Visitor

This approach has the advantage that the accept method is always the trivial v.visit(this), and the visitors can be written in a more natural style:

public class TextFormatter implements Visitor
{
private StringBuffer m_buff = new StringBuffer();
private String m_result = null;

public void visit(Atom a)
{
m_buff.append(a.value);
}

public void visit(Pair p)
{
m_buff.append("(");
p.car.accept(this);
m_buff.append(",");
p.cdr.accept(this);
m_buff.append(")");
}

public String toString()
{
if(m_result == null)
m_result = m_buff.toString();
return m_result;
}
}
Is this what I'd like to write as a functional programmer? Of course not. I'd prefer to have accept return a value, so I could write visit(Pair) more like this:

public void visit(Pair p)
{
m_buff.append("("+p.car.accept(this)+","+p.cdr.accept(this)+")");
}
Which of course means that visit should return a type, instead of accumulating results in a member variable. But then we need to know what type to return, which in Java most likely needs to be Object. It all gets ugly again.

Conclusion - Still Not Perfect

So, a visitor with recursion in the visitor is acceptable, but still not ideal, and it doesn't come close to the conciseness of pattern matching. The accept method needs to be added to each node class, which is a pain for the rapid prototyping, and the visitor of its nature requires that you accumulate state in member variables instead of using a functional style of combining returned results in a single expression.

Still, I'm just one guy with a certain slice of experience --- is there a better (more compact, less boilerplate) tree traversal technique for OO languages?

Or is this really the ultimate pattern, OO loses this particular battle, and Nice multi-methods or Scala case classes or other language extensions become necessary parts of an OO-capable language that supports programming language implementations as well as the functional languages?

Thursday, December 27, 2007

Foxtrot

Inspired by Jam-Packed FoxTrot I'm declaring FoxTrot the Official Cartoon of JFKBits:

Jason: So if the Times Square ball doesn't fall at midnight...
Does that mean the guy who drops the ball dropped the ball?
And if he drops it correctly, has he then not dropped the ball?
Roger: I could use a refill on the champagne, Hon.
After all, if the Official Cartoon doesn't have impredicative humor, what good is it?

Wednesday, December 26, 2007

XEmacs Groks file://path

On a whim, I just found out you can paste this into XEmacs' mini-buffer when you Ctrl-x Ctrl-f to open a file:

file:///C:/code/file-of-interest.html
It's nice to know XEmacs understands file:// URLs, and I expect this now falls into my recommendation as a best practice for user interfaces.

Unfortunately, XEmacs won't import a web page that way: you can't Ctrl-x Ctrl-f http://cnn.com and edit the news.

Friday, December 21, 2007

Lighthearted: Roundoff Error

I'm gettin' nothin' for Christmas
Amazon dot com is mad
I'm getting nothin' for Christmas
Cuz I reported their bug
On my blog

--

Inspired by the buy-this-get-that-free promotion where we put 2 of this and 3 of that in the shopping cart, and all 3 thats were accounted to be free. Don't want any server queue riots now...

Wednesday, December 19, 2007

Tree Traversal in an Object-Oriented Language

There are two methods I know of for representing an expression in classes: the event-model and the subclass model. The main problem I see is in how to traverse the tree without replicating verbose "if this type, then" boilerplate and without piling all the code that might ever use the expression into the same sets of classes. I'm merely observing in this post, if you have any opinions on which you like better, let me know.

Grammar

For both of these, w'll use the following grammar:

E -> INT | STRLIT | VARREF | List[E,...] | E(E) | FN(x){E}

An expression is either an integer constant or a string literal, a variable reference, a function call, or a function definition.

Subclass Model

In the subclass model, we'd start with a class for E itself:
class E
{
}
We'd augment E with lots of handy utility functions and fancy abstract (pure virtual) methods. Then we'd define a subclass of E for each of the possible "actual" expression types:
class Int extends E
{
int k;
}

class StringLit extends E
{
String s;
}

class Varref extends E
{
String x;
ScopeLevel level;
}

class EList extends E
{
List elts; // List of E objects
}

class Call extends E
{
E function;
E argument;
}

class FnDef extends E
{
String x;
E body;
}
One thing I don't like here is adding a virtual method to E to handle anything anybody might do to an expression. Type checking? Add a typeCheck method. Code generation? Add a generateCode method. Optimization or source transformation passes? Good luck with that. I think it's sticking too much unrelated code into each class.

And if you separate your concerns and have code elsewhere that gets handed E objects, you end up with this structure that suggests you haven't learned about old-code-calls-new-code virtual methods yet:

void f(E expr)
{
if(expr instanceof Int)
f((Int)expr);
else if(expr instanceof StringLit)
f((StringLit)expr);
else ...
...
else // Oh no!
;// It wasn't one of the types I thought it could be!
}

Event Model

The other model is already known in Java programming as the event model, as in the way SAX parsers work. I like to think of the True Origins of this concept, in lambda calculus encoding. Every basic type we know of in the functional languages like ML (tuples, lists, integers) has an encoding in the pure, 3-form lambda calculus. And the type we're talking about here is called a sum type. Sum is related to "OR" just as product is related to "AND." An expression is an int OR a string literal OR a variable reference, and so on.

Here's what I like about the event model: it factors out the "if expr is this choice, call this; else if expr is that choice, call that..." code to one place and you are freed up to write functions that are given the exact subtype:
interface EHandler
{
Object handle(Int i);
Object handle(StringLit s);
Object handle(Varref v);
Object handle(EList l);
Object handle(Call c);
Object handle(FnDef f);
}
Now your language processor phases can be written in terms of EHandlers:
class TypeCheck extends EHandlerImpl
{
Object handle(Int i)
{
return Types.INT;
}

Object handle(Varref v)
{
// look up v. if unbound, flag an error
// if bound, return its type
}

Object handle(Call c)
{
Object fnType = this.handle(c.function);
Object argType = this.handle(c.argument);
// check that fnType is a function,
// make sure fnType's range type matches argument type
// return fnType's return type
}
}
This is fine as long as you only need to look at one level of the tree at a time. And if we wanted these handle functions to return a type other than Object, or if we want to add another argument to each handle function such as an environment or accumulator object (not likely since we could make it a class member, but conceivable) we'd need to create another interface and/or abstract base class.

This second one appeals to me because I'm so accustomed to ML's pattern matching facility (which also appears in other languages like Haskell and Scala), which is the sweetest of syntactic sugar for sum types, and sum types in turn are the bread-and-butter of language processors.

Conclusion

Either of these two approaches seems fine for small and simple expression tree types (meaning the number of different types of nodes, not the size of a particular tree), but I think there's got to be a better way for representing sum types in classes that can handle larger sum types and more complex processing needs. Do you know what it is? Leave a comment or email to jfkbits at the mail domain of google.

Update 12/20: Fixed code formatting.

Monday, December 17, 2007

Your Idea on RAILs

Read this with the voice of Deep Thoughts by Jack Handy:

If you want your code to survive,
you want to put it on RAILs -
a Redundant Array of Inexpensive Languages.
That way, if one language fails, your ideas will still be around in one of the other ones.

Saturday, December 15, 2007

You Want Electrical Grounding

We just moved our home office to a different room with two-prong outlets. One outlet was on the fritz, so it got my attention enough to realize the lack of a ground connection was probably not a good thing. I wanted to understand it better, so I consulted the good ol' Electrical Wiring FAQ (written back in the days when FAQs weren't written by marketing people).

The FAQ first explains that without a ground, your surge suppressor is working at a disadvantage:

A word about grounding: most suppressors and EFI filters
require real grounds. Any that don't are next to useless.

For example, most surge suppressors use MOVs (metal oxide
varistors) to "clamp" overvoltages. Yes, you can have a
suppressor that only has a MOV between neutral and hot to
combat differential-mode voltage excursions, but that isn't
enough. You need common-mode protection too. Good suppressors
should have 3 MOVs, one between each pair of wires. Which
means you should have a good solid ground. Eg: a solidly
connected 14ga wire back to the panel. Not rusty BX armour or
galvanized pipe with condensation turning the copper connection
green.
And then it goes on to give three more good reasons to have a real ground installed on the outlets where your computer lives:
Without a ground, a surge or spike is free to "lift" your
entire electronics system well away from ground. Which is
ideal for blowing out interface electronics for printer ports
etc.

Secondly, static electricity is one of the major enemies of
electronics. Having good frame grounds is one way of
protecting against static zaps.

If you're in the situation of wanting to install computer
equipment on two wire groundless circuits take note:

Adding a GFCI outlet to the circuit makes the circuit safe for
you. But it doesn't make it safe for your equipment - you need
a ground to make surge suppressors or line filters effective.
If you ever open up the case and have the habit of discharging static electricity on the metal frame, it's not going to do any good if the case isn't actually connected to ground.

This has been a public service announcement about the importance of upgrading your hardware infrastructure.

Friday, December 14, 2007

TBB's parallel_for

Today I wanted to tell you about a very simple but interesting function, parallel_for, included in the Intel Threading Building Blocks (TBB) C++ library (available under either a commercial license for a fee, or as a GPL open source free library). It does the work of partitioning an embarrassingly parallel loop onto different threads, and all you need to do is write a couple of small adapter classes.

The sequential loop might be this:

for(i=0; i < vec.size(); ++i)
out[i] = vec[i]->serialize();
The parallel version requires that we abstract the loop into a Range and a Body, both classes, and let TBB partition the vector and assign chunks of the loop to different threads. You don't need to create the threads. The call site above would change to something like
parallel_for(Range(vec), Body(vec));
The algorithm uses methods you define in Range to partition the entire range into chunks of the vector that you (or the library code) decide is just small enough that there's no more parallelism to exploit but not so small that scheduling overheads eat into the benefit. A Range needs to implement methods
bool empty() const
bool is_divisible() const
Range(Range& original, tbb:split)
The empty predicate is used to know if a range is empty and thus can be abandoned. The is_divisible method implements your grain size policy, e.g. is_divisible returns true unless there are less than 1000 elements.The copy constructor actually performs a split. This is a classic C++ idiom, the constructor creates the new object representing one part of the range, and the original object being destructively updated to represent the other part of the range. I imagine these Range objects as being quite lightweight, containing perhaps begin and end pointers or array indices into the original collection.

Once TBB has gotten a range of the desired grain size, it gets put onto a thread queue for execution by invoking the Body's function operator:
void operator()(Range&) const
Invoking the function operator on an object looks just like the object is a function; a Body object b's function operator is invoked b(r).

The trivial extra work in wrapping your collection in these Range and Body classes is rewarded by the fact that you don't need to write a line of thread-aware code to get your loop partitioned, and that you can take advantage of the thought someone else put into how best to do the partitioning and queuing.

TBB has taken some insights from Cilk for scheduling, and while it splits your range depth-first (splits a non-empty range and recurses on the left range until reaching a non-divisible range), but "steals" bread-first. The insight is that a thread should be scheduled with parent/child and similar neighboring ranges to take advantage of cache locality between the ranges in the same CPU core. Stealing refers to the way that a thread's queue is replenished when it is emptied, in that an entry from a randomly selected alternate thread's queue is removed.

Wednesday, December 12, 2007

Things to Do With Your Programming Language Course

It's finals week across the country, and thousands of computer science students are finishing up a programming language course. At least a handful of these have been infected with the fascination of the subject, and I hope to meet them someday. From the enthusiast to the couldn't-care-less, here are my thoughts about where you can go from here with one programming language course under your belt.

Going Pro

Take more classes, participate in an open source language project, go to grad school, and get a job with a company that makes a compiler or interpreter (IBM, Microsoft, Wind River). Remember that there are lots of languages with a smaller reach that still need people to work on them. Embedded systems have lots of chips (purpose-built CPUs and microcontrollers) that need compilers and development environments.

Programming Tools Development

Similar to Going Pro. Not every parser is feeding a code generator. Join a team making an IDE, an IDE plugin, a code formatter, or a code checker (a la LINT). Eclipse puts this category within reach of anyone (so did EMACS). In order to develop at this level you probably want to take at least one more course or do a semester long project to have a fighting chance of getting enough experience.

In-House Tools Development

The need to make sweeping changes or do some kind of transformation on a code base comes up all the time on large projects. I should probably say the "opportunity" rather than the need, because some problems that could be solved by a transformational tool will be worked around. But here's where a little initiative and paying attention during those parsing lectures will pay off.

Better Programming and Debugging

Even if you don't directly use your programming languages course, even if you hated it, if you took the right kind of course you will have your mind trained in ways that make you a better programmer. It's not a substitute for reading good code and learning from experienced practitioners, but it is similar to learning how to disassemble, clean, and reassemble a car engine blindfolded. It's analytical training for a language, you see what parts compose the tokens and phrases of the language, and see how they're combined into your program. It helps dispel myths about how your program is likely to work under the hood. This of course is especially true if the language you implement is similar to the language you write in, but if you can't go too far wrong working with a language more "interesting" than you're likely to work in.

Even the time you spend studying language design is not wasted. At some point, programming on a large team involves discussions and development of a coding standard for the group. These discussions are in a sense exercises in language design, because a coding standard effectively limits the expressive power of a language into a subset of the language that everybody is comfortably working with.

Recommended Reading



"Essentials of Programming Language" by Friedman, Wand and Haynes. Build up your knowledge of interpreters and compilers, starting with a simple lambda calculus, then to Scheme, building up to the powerful and high-performance techniques of continuation-passing style and tail recursion. Sometimes used as a text in PL courses. Maybe you like my arch-geek friend Tony will use your newfound knowledge to spend your Christmas break writing a Scheme compiler for your HP scientific calculator.

"A Retargetable C Compiler" by Hanson and Fraser. A funny book, it uses literate programming to document and explain the implementation of lcc, a C compiler used to compile Quake III. From memory management (arenas with constant time deallocation) and string hashing, the authors show and explain their C code from the bottom up. The ANSI C style is a little odd, but for the modern student, C will perhaps seem odd anyway.

Friday, December 07, 2007

Hashtables and Wedding Days

While using a commercial STL hash implementation a few years ago, my program appeared to hang (the old halting problem dilemma; is it in an infinite loop or just slow?) on a hash table insert. I got a trace of the key/value pairs inserted and wrote a minimal program only doing the inserts. That hung too. Sent it off to the vendor who found a bug in his bucket rebalancing code.

He sent back a patch for the implementation with the advice to get a better hash function, since a lot of the inserts were hashing to the same location. Using the key object's address as the hash value was a poor choice: it was word aligned, and the lower 2 bits were always zero. So we changed the hash function to right-shift the address, and with the patched code, we were off and running again.

Jeff Atwood draws an analogy of the birthday paradox and the probability of two objects hashing to the same location. The problem with my poor hash function resembles the Wedding Day Paradox, as I'm calling it. I recently had a conversation with a coworker:

Me: My anniversary is coming up.
Coworker: Mine too! How many years?
Me: Four.
Coworker: Me too. When were you married?
Me: October, how about you?

We quickly discovered we both had our weddings fall on the same day of the same year. We realized this should not surprise us, since most weddings occur on a Saturday. Knowing that we were both married in the same month of the same year gives only 4 or 5 choices. This effect is even more significant than the seasonal variations of birthdays with more concentration in some ranges than in others.

I've filed this under bugs, so let this serve as a cautionary tale if you undertake to do your own hash functions; consider well the distribution of your keys. Also, if you write a hash table implementation, include in your test suite a scenario where the hash function is really dumb.

Wednesday, December 05, 2007

Type Directed Code Search

We live in a Golden Age. You can buy Ubuntu desktop computers at Wal*Mart for under $200, there are more choices of programming language than ever, and their software libraries are rich and well-designed with lots of orthogonality. Let's say you're learning one of these new libraries, you're feeling all baroque and XMLish, and you set your heart on acquiring an org.w3c.dom.Document object. Where do people get those, anyway? Where's the parser that spits it out? After much digging, you find that a javax.xml.parsers.DocumentBuilder will make you your document. Ah, but that's an abstract class. Now, who makes DocumentBuilders? DocumentBuilderFactory, of course. This kind of searching gets old quickly.

Maybe this kind of run-around is unavoidable when learning libraries, and documentation and books can certainly help orient you, but there's another tool for our kit which may help in some cases. What might be useful is an index keyed by type that points to methods using the type, either in parameters or in the return type. This index could answer the question "where can I find methods that return type X?" or "what are all the methods that take an X as an argument?" Such an index could help both in learning as described above but in any activity where you need to search the code.

Eclipse goes pretty far in implementing this, not far enough, but enough to hint at what we could do. In the dialog below I've typed to search for "DocumentBuilder", selected to Search For "Type", and limited the search to "References". Ideally I'd like to limit the search to methods returning the given type; I want to know what methods generate this type. Also note that we've checked "JRE libraries" because we want to know what built-in library generates the DocumentBuilder:

The search results:
This feature seems a natural for Java since it is both statically typed and has the necessary reflection to allow a tool to do the analysis. However, it looks like type inference is coming to the dynamically typed languages too, including Ruby and Javascript and even Wasabi. Since types are not as much part of their language, it may not be as natural to think of searching this way, but the need will grow as more and bigger applications get written in these languages.

With a type-informed search, you can do debugging research. One thing I've seen with large team projects is the bug that has an exact duplicate lurking elsewhere. For example, you find a URL object for a web server has omitted setting the port number via the config.getPortNumber() call. Since everybody tests with the well-known default port 80, things appear to work, but a real customer configures with a different port and suddenly links break. You track it down, add the right calls, and close the bug only to have the customer call back with a different broken link, same root cause but in a different spot in the code. Once you identify a bad pattern like that, you typically want to look for other places it might occur. To find this pattern you could do a text search for URL, but a type-informed one or an AST-search is likely to cut down on unrelated references to "URL" in comments and other areas.

In conclusion, this is yet another application of type information. It's a little on the fringes, perhaps not an everyday tool, but for learning a new library, for doing research prior to refactoring or for debugging, it could be a standard feature in IDEs of the future.

Monday, December 03, 2007

Bonus Links - Jonathan Coulton Music

For your listening pleasure, allow me to recommend this twin spin from Jonathan Coulton, your new favorite Internet singer/songwriter:

1. Chiron Beta Prime (mp3), a little relief from the "Well, Our Producer Says We Had to Record this Christmas Song (And We Only Have Two More Verses To Go Now)" music on the air since last week.

2. Code Monkey (mp3), which can be accompanied by the mesmerizing Guitar Hero wannabe video that visually demonstrates the choppy power chords are a dovetail fit to the monkey-talk lyric style:

Code Monkey get up get coffee
Code Monkey go to job
Code Monkey have boring meeting
With boring manager Rob
Rob say Code Monkey very diligent
But his output stink
His code not functional or elegant
What do code monkey think
I say let's take up a collection and commission a work from this guy. He, ahem, rocks. Chords, lyrics and tab available for you who know how to use them.


What one programming language design and implementation topic would you pay real money to hear a song about?

Update: Corrected lyrics. Revised line break formatting.

Friday, November 30, 2007

Dollars Per Gigabyte Per Month: On Backups and Reliability

Following "Half a Terabyte, Hold the Gravy," several JFKBits readers have been discussing the JungleDisk interface to Amazon S3 and I wanted to explore storage reliability, extending the discussion from dollars per gigabyte for raw ownership to dollars per gigabyte per month.

Reliability and Backups

First, a note about reliability in general, and CDs/DVDs in particular. Failure of any backup device or media can occur due to basically two things: aging and mishandling. The NIST publishes "Care and Handling of CDs and DVDs: A Guide For Librarians and Archivists" and it tells you in great detail exactly what goes wrong when you mishandle a disk. When we talk about reliability, we may easily remember numbers from quotes like "An accelerated aging study at NIST estimated the life expectancy of one type of DVD-R for authoring disc to be 30 years if stored at 25°C (77°F) and 50% relative humidity" (which is far under the industry consensus of 100-200 years). What we may not remember as easily is that reliability is directly affected by handling. Part of any backup plan needs to be hygiene practices, if you will, of taking care of the backup media and hardware.

Costing Storage Over Time

Amazon S3 is $0.15/GB/month, or $1.80/GB/year, and for someone looking to store a few hundred gigabytes forever, this is an important consideration. Surely the reliability of hard drives are better than one year, and why would we pay $1.80 for our gigabyte to Amazon when we can get away with copies on a couple of drives at 36 cents each or a couple of DVDs at 9 cents each?

The JungleDisk web site quotes a stat from a recent independent study on drive reliability at Carnegie-Mellon by Schroeder and Gibson stating that 15% of hard drives will fail in 5 years. If you ignore multiple drive failures, and counting on retiring a 5-year old drive, I figure that means a drive at price P will actually cost 1.15 P in replacement costs. Of course you'll probably want to be the replacement drives up front, and unless you're buying 100 drives or more, you're actually spending more on replacements that 15% extra, perhaps up to 5 P depending on how much redundance you want (we'll get to RAID later). At the minimum 2 P for us small-scalers who aren't putting 100 drives into active use simultaneously, my $182 LaCIE drive comes out to $364 over 5 years, or $0.013/GB/mo. That's still a healthy distance from $0.15/GB/mo. I could buy five more replacements, acknowledge that a "500GB drive" is only 460GB and still be at $0.04/GB/mo. What else do we need to consider?

Inexpensive Disks or Cheap Disks?

Marcus Smith is the owner of a company providing IT services including backup and recovery to small business, and had some things to say about hard drives as archival medium on the Association of Moving Picture Archivists mailing list:

I completely agree that hard drives are only one component of an overall backup strategy and that multiple technologies will need to be employed in order to provide truly safe nests for our precious ones and zeros. There are a couple of universities experimenting with the idea of ditching tape systems and instead building very large hard drive arrays in which drives are given a certain time to live and are replaced on a rotating schedule. If every drive actually lived up to an exact span this idea would probably be used more often. The sad truth is the hard drives can fail at any time and this unpredictability alone is enough to show the overwhelming undesirability to use them as an archival media.

The idea that cost-per-megabyte savings is an enabling feature needs more clarification. How does this translate exactly in to providing new methods of storage? One major problem with using a cost-per-megabyte approach in favor of large capacity drives is the intended market. Who is buying these drives? Who is using them for long-term storage?

When it comes down to it, I'd be willing to wager that far more people are using hard drives as their sole mechanism for storage because of their low cost, which therefore makes price a liability, not a benefit. As the price of drives drops, it only traps us tighter into relying on unreliable media.

To store the same amount of data on other mediums - tapes, really - we're already spending thousands of dollars. The low cost and high capacity of hard drives is exactly why ATA is used instead of SCSI, and it continues to have a draining effect on quality of any solution because of the lack of any other cheap storage media. As long as cheap drives push the market, that is what most people will use. Again, this points to the question of who the market really is. Are we talking about large archives who may or may not be able to spend money on developing proper, multiple technology storage solutions? Or are we talking about the "Prosumer" level where individuals migrate film or store native digital video on hard drives because there is simply no other reasonably attainable storage device? Since Fry's are offering cheap drives, and archival institutions probably aren't shopping there for serious storage solutions, it seems to me that we're really talking about individual consumers who need basic, fast, easy access storage. In short, folks who probably don't have tape drives, a multiple technology backup systems, or a well-planned rotating storage schedules.

Costing Data Failure

For JFKBits readers, I know our needs are all over the map: high res digital photos; Linux backups including both system configuration; personal creative works representing who knows how many hundreds of hours of labor. In most cases the scenario of losing our data affects mostly ourselves. I contacted Marcus for permission to quote him in this article, and he described his small business clients: lots of professionals, where other people depend on the data; losing information invokes the specter of a lawsuit. When we make our backup plans, we know the purpose is to prevent data disaster. But when we start costing it, sometimes the cost of the data disaster tends to be left out:

Determining the value of the data you want insured may not be easy. A single database or secret recipe may be the heart of your business, and estimating the worth of that intangible asset is a problem that can give CFOs nightmares
--Kevin Savetz in "Data Insurance", published in New Architect May 2002

RAID When Things Go Wrong

At this point, RAID comes forcefully to mind. Duplicate the data and add some automated error correction and detection. More caution, more consideration, urges Marcus:

RAID arrays can offer some benefit, but there are dangers here also, and they need to be addressed. RAID arrays do not technically have to be constructed with identical drives, unless one uses crappy hardware controllers. But for RAID to work efficiently, then identical geometry and capacity drives should be used to build the array. Most RAID systems are based on using identical drives. Here again money comes into play. RAID 5 is nice for the overall capacity, but using identical drives may be seen to have similar failure characteristics. Let's mentally build an array with five drives, and buy five more for replacement. It could very well be that within two years all ten drives may be exhausted and getting two year old drives can be very difficult to procure. Nearly impossible, actually. There are two obvious problems: (1) you're only as safe as the number of drives you buy, which is going to be many. Go ask Compaq for a 9.1gig hot-swap replacement for a Proliant server. For that matter, head over to Fry's and ask for an IBM 30-gig Deskstar drive. It will never happen. (2) If identical drives have similar failure characteristics, then it stands to reason that when one drive fails, the other drives in the array are not far behind. There are also two non-obvious problems: (1) If two drives fail before the operator knows there is a failure, all of the data on the array is lost. (2) Data recovery from missing RAID volumes is notoriously difficult and usually ends in failure of recovery.

Luckily there are other RAID options, like mirroring and striped mirrors. Again, be prepared to invest in drives. Lots of drives. These options use half the drive capacity of the total number of drives. I.e., for 500 gigs of space you need not five 100gig drives, but 10 100gig drives. You cannot do this with ATA drives because of the limitations of ATA controllers. I believe the largest possible number of drives on a single controller is the 3Ware 8-channel RAID adapter. Eight drives plus eight spares. Plus more if this is where the Company Policy places its long term future. Add a storage rack for this and a Very Expensive RAID adapter and suddenly the one-year warranties on the drives turn an expensive project into an expensive and unreliable project. The benefit over RAID 5 is that data recovery is much easier of this kind of array if it comes to it.

So yes, we need to think very, very differently about storage. RAID solutions are not as safe as their cracked up to be, and depend greatly on the vigilance of the systems administrator to keep things running smoothly.

Whew! The overriding message here I think is that RAID doesn't solve everything - you'll want to buy your replacement disks up front, as you probably can't get them in 5 years. You still need to keep watch over the health of the RAID array. (I had extra time to think about this on my way home tonight as I changed a flat tire.) And as the experience of one of our readers reveals, you still need a strategy to watch that your backups happen and that the backups themselves are OK. One JFKBits reader checked his backups after reading the recent article and discovered that his backup software actually scrozzled the data. Now how much would you pay to make sure this all gets done right? Ready to fork over that extra dime per gig per month? I thought you might be.

Conclusions

We all know we need backups. If we're thinking long term enough to realize that even the backup media may fail, we've started planning the cost over time. Making your plan depends on what you're doing; backups of a machine image or a database have different needs than backups of segmentable data such as are needed by pro photographers. Amazon S3's figure of $0.15/GB/mo is not unreasonable, but you still will want a plan to verify integrity for yourself. A $1000 RAID NAS box (maybe two) is reasonable, providing you buy the replacement drives up front and verify the backups. If your data is segmentable, DVD backups are quite reasonable, providing you have the time to burn plenty of copies, distribute them geographically, store them in a climate-controlled environment, and organize them well. Tape backups are not dead either, and we'll close by letting Marcus have the last word:

I personally am much more in favor of tape systems than hard drives. I just wish they were less expensive. My problem is the same as many others - I don't have $5000 to shell out for a tape drive to store 400gigs of materials. Unless, of course, someone would like to buy me a new LTO drive and a bunch of tapes.

Does anyone want to wax poetic about the beauty of Tar as backup software?

Wednesday, November 28, 2007

Fifty Cent Lecture on the Unification-Based Type Inference Algorithm

In December 1978 the Journal of Computer and System Sciences published "A Theory of Type Polymorphism in Programming." In 1992 I was introduced to Standard ML in compilers class, and used it to write a typechecker for our Ada subset compiler. I was in love with types, typecheckers, and type inference. In 1993 my good friend and I tried unsuccessfully to implement type inference as part of a much larger project, without actually researching the prior art. We needed to learn the fine art of finding and reading research papers. A few years later I was back in grad school, and got the help I needed to locate the right papers and understand them enough to implement an inferencing type checker. The paper for that project is dated December 8, 1997. I'm very satisfied that ten years later, type inference has not gone by the wayside but is poised to go mainstream, going the way of automatic memory management (GC) via Java.

In commemoration then, let's explore a bit this fascinating feature that may soon be coming to a programming language near you.

Cocktail Party Explanation (Type Inference in 30 Seconds)

The way most bloggers are explaining type inference lately is with a Java/C# assignment example. Instead of typing
WaitingPageHandler handler = new WaitingPageHandler(args);
which has this offendingly redundant "X x = new X" form, you can get away with something like
var handler = new WaitingPageHandler(args);
How does the typechecker figure this out? This doesn't look too hard to infer. What about an even simpler form:
var delay = 250;
Obviously, the literal 250 is an int, and so is delay. The type of literals like 250, 3.14159, -1L or "duck" are well-defined by the language. An expression involving the new operator is that of new's argument. Here we see the easiest part of understanding type inference -- that there are some expression tree leaves that can be immediately inferred.

Of course you don't just write assignments. As you know from studying lambda calculus, there are three fundamental operations to care about: lambda introduction (corresponding to introducing literals), variable reference, and function application. So how does type inference handle variable reference and function application?

Checking Variable References with Known Type

Let's tackle variable reference first. What if we were given this:
var timeout = 500;
var timeoutTime = System.currentTimeMillis() + timeout;
In the second line, what type does timeout have? It's an int, but how does the typechecker know it? It was inferred from the first line, you're right. But technically, the typechecker probably doesn't know or care that it was inferred. After the typechecker inferred the type of the timeout in the first line, it updated the environment so that in subsequent processing timeout's actual type is known. The typechecker did a lookup in the environment for the type of timeout before doing anything else, as it would in an ordinary static typing typechecker.

So we come to the clue that as with ordinary static typechecking, an environment is needed. The environment is a stack of identifier to type bindings. Environments will come very much into play as we develop the algorithm.

We have looked at how variable reference works when the type is known. There is another possibility, that the type of the variable being referenced is unknown, but first let's look at function application when the function's type is known.

Checking Function Application with Known Type

What about this simple line:
var delay = 15 * 60;
Given two ints, * returns an int, making int the type of the function application expression *(15,60), and thus the type of the variable delay. More generally, when we see a function call and the type of the function is known, the type of the expression is that of the function's return type.

Inferring a Type

The interesting part of type inference comes when a variable is referenced but not immediately inferred. A good example of this situation is when the variable is the argument to a function:
function(x)
{
print(x);
if(x < 2)
1
else
x*factorial(x-1)
}
This puts us into a little deeper water. But stay with me, because this is it; the heart of basic type inference using unification. First, notice that the type of x is obviously int from the comparison in the if test, and also from the expression x-1 which is similar to the 15*60 example above. The fun comes in how we can process this line by line and arrive at the right conclusions. The print(x) line is a function call where x's type is as yet unknown. We assume that like in most languages print is overloaded for all types, or at least all the basic ones like int and String. By the time the typechecker is done, we do need to know x's type so our code generator or interpreter has enough information to call the right print code. How?

When we encounter a reference to a variable with unknown type, we tag its type with a unique type variable. Wuzzat? Hello? you say. Yes, we will need a separate environment that tracks type variables. For example, we will say that x's type is that of type variable t1, and in our type environment we say that t1 is bound to an unknown type -- the variable is kept in symbolic form, it's unbound. That's all we can and need to do for the print(x) line, so we move on.

When we encounter the next line, the if expression, we will type check its three arguments: the test expression, the "if true" expression, and the "if false" or "else" expression. So far we haven't mentioned unification, but here's where it really becomes necessary. What we actually do to typecheck the x < 2 expression is run the unification algorithm. The types of the arguments to the less-than operator (which you should always remember to properly escape in your blog editor if you don't want to mysteriously lose portions of your text) are unified with their expected types. For example, we know the less-than operator takes two numeric operands. We'll skip over the overload resolution part here and assume that the literal 2 tells us we're using the int version of the less-than operator. So we know that this operator is a function which takes two ints. We unify the type of the first argument's expression x with the type of the function's first argument, which we just said is int:
Unify(t1,int)
In this case the unification algorithm is getting a type variable and a concrete type, indicating we've discovered what type the variable should be bound to. So at this point we can update t1 in the type environment:
t1 = int
That's it! That's perhaps the key part of the unification-based type inference algorithm.

From this point on, whenever we lookup the type of x in the environment, we'll still be referred to the type variable t1, but when we ask the type environment to evaluate that type expression, we'll get a concrete type of int. We can construct our typechecker to have two passes, one that runs this algorithm and guarantees the type environment has nothing but known types afterwards, and then a subsequent pass that updates the environment to concrete types.

Also, note an assumption in the algorithm: once we had any concrete type for x we used it. If later in your program x was used in a context of a different concrete type, the typechecker would flag a type error. This is where you get errors like "found a real when expecting an int".

Equivalent But Still Unknown Types

Another key part of unification comes when a variable of unknown type is unified with another variable of unknown type. For example:
function(x,y)
{
z = if(choose()) x else y;
2^z
}
In this case the first line of the function tells us only that x, y and z are the same type. This is important information that must not be lost. Let's step through these expressions. In the first line, we check the if/else expression. We assume the type of choose() is known to be boolean-valued to match the expected type of an if test expression (and if the return type of choose wasn't known, it would be inferred by its use here). We then check the two branches of the if, and since they simple variable references we generate a unique type variable for each and bind the variable reference appropriately. In other words:

In the type environment,
t1 -> unknown
t2 -> unknown
In the user variable environment,
x -> t1
y -> t2
This is the state of things after checking both expressions. The last part of checking the if/else is to unify those two expressions, since we know they must match. In other words,

Unify(t1, t2)
is evaluated. We saw what happens when unifying a type variable and a concrete type; we assign the type variable to the type. But here both are variables. In this case, unification says we simply remember the equivalence of these unknown type variables. The way I handled this was to augment the type environment to keep a list of known equivalences for a type variable:

In the type environment,
t1 -> equivalences[t2]
t2 -> equivalences[t1]
If we ever find a concrete binding for t2 we can bind t1 as well as t2 at that time. And that should explain the rest of the example.

When we evaluate the assignment statement z = if, we enter a similar type variable entry for z and make it equivalent to both t1 and t2:

In the type environment,
t1 -> equivalences[t2,t3]
t2 -> equivalences[t1,t3]
t3 -> equivalences[t1,t2]
The way you handle this equivalence list is an implementation choice but I hope you get the idea.

When we finally hit the function's final expression 2^z we again have an overloaded operator but the presence of 2 helps us resolve it to the integer power operator, so we can unify the type of z with int, letting us simultaneously bind the type variables t1, t2 and t3 since we know their equivalence.

Where Polymorphic Variables Come From

Sometimes when checking a function the parameter may never be used in a context specific enough to bind its type. What happens then? For example:
function(x)
{
(x, [])
}
This takes an argument and returns a pair of the original argument x and an empty list, possibly a normalizing step for a function that will visit x and accumulate some results seeded with the empty list. The pair construction works for values of any type, so it adds no information that can be used to deduce a particular type for x. When the typechecker gets to the end of the function declaration, x is found to still be bound to its type variable. When this happens, x is "promoted" to be a polymorphic-typed variable. In order words, since x is not used in any particular way, it can be used for any type. A language designer could choose not to support polymorphic types for some reason, and could flag this as an error.

This example also introduces the notion that type inference operates within a certain boundary. There has to be a point in the abstract syntax tree where the typechecker knows this is where to promote still-symbolic types to be polymorphic. In Standard ML, I believe this boundary is the named function definition introduced by the fun keyword.

Conclusion

The complete algorithm is a little more involved than I've described here, but I hope this helps you understand the essential pieces and the general flow. For more details please see "Principal type-schemes for functional programs" by Luis Damas and Robin Milner (ACM 1982) or the original "A Theory of Type Polymorphism in Programming" by Robin Milner (Journal of Computer and System Sciences, Vol.17, No.3, December 1978, pp.348-375), particularly section 4 "A Well-Typing Algorithm and Its Correctness."

Tuesday, November 27, 2007

Warehouse Robot Video Clips

My dear JFKBits readers, for your enjoyment I submit these video clips of Kiva Systems' warehouse robotic picking system:

Friday, November 23, 2007

Different type returned by getElementsByTagName in Safari, Others

The Javascript function getElementsByTagName recently gave me cross-browser grief. I had a web page with a bunch of checkboxes (a type of input element) and wanted to iterate over the ones with a certain class. So I wrote this:

var nodes = document.getElementsByTagName('input')
for(i in nodes)
{
if(nodes[i].className == 'aCertainClass')
; // process checkbox
}
This worked fine in Firefox 2 and IE 6. Later I started testing with Safari for Windows and noticed a problem. I tracked it down to this discovery: getElementsByTagName returns different types in Firefox and Safari.

In Firefox, getElementsByTagName returns an HTMLCollection, which works fine with the iterator style for(i in nodes). In Safari, it returns a NodeList, and iterating with for(i in nodes) takes you on a trip through the object's methods, rather than its collection contents.

The fix for this was to change the iteration style, since both HTMLCollection and NodeList support the length field:

var nodes = document.getElementsByTagName('input')
for(var i=0; i < nodes.length; i++)
{
// same as before
}
For what it's worth, I've only seen NodeList documented as the expected return type from getElementsByTagName. Why it doesn't work with the for(i in nodes) iteration syntax is something for which I'd like an explanation.

Thursday, November 22, 2007

Half a Terabyte, Hold the Gravy

What do you do when you've got a device that creates very valuable 2MB files with the press of a button, an action that you may repeat up to 600 times an hour (maybe 20GB a week)? You end up buying another device, such as this puppy: the 500GB Lacie D2 HD Quadra 7200RPM 16MB external drive. I had never before heard of LaCIE, a French concern, but this drive seemed to fit the bill as a backup device. It has a reasonable dollars per gigabyte number, the reviews seemed encouraging, and I went ahead and paid the extra $60 to get the Firewire-capable model as this seems to be recommended over USB for sustained reads and writes typical of an external drive.

PriceGrabber got me thinking about the whole backup plan from the dollars per gigabyte angle for different types of storage devices, summarized here:




Tech Sample Product Price w/ S&H Capacity $/GB
DVD-R Verbatim DVD-R 16x 100-pack $42.27 470GB $0.09/GB
external HD 500GB Lacie D2 HD Quadra 7200RPM 16MB $182.60 500 GB $0.36/GB
flash memory PNY 4GB USB flash drive $31.89 4 GB $7.97/GB
DVD's are an economical choice as a backup medium, especially in my case where making modifications is not important. The 100 disk spindle is almost a direct comparison to the LaCIE drive in terms of capacity, since the LaCIE's actual capacity is more like 460GB. The external drive has the advantage that you don't need find the right disk when you need to retrieve your backup, so it is a good "live" backup solution. The external drive also gives you a little bit of portability, and the model I bought has four connection technologies (hence the name "Quadra") giving some flexibility in working with other devices.

The USB flash drive, on the other hand, is a terrible choice for long term storage, as far as I can tell. That salesman at the electronics store certainly had a lot of nerve trying to talk my Dad out of buying blank CDs in favor of buying a flash drive. The advantage of flash drives is portability, not economics. They weren't looking much cheaper than $8 a gigabyte on PriceGrabber for 1GB and up.

I've left out some options here which I didn't research as much. I've left out the venerable tape backup as well as internal hard drives and NAS. Internal hard drives in a RAID might be a good alternative to the single external drive because you get some automatic failure detection. And the NAS, or Network Attached Storage, goes one better in simplicity.

For this exercise, I've included in my personal backup plan ditching the idea of an "archive quality" medium such as tape in favor of any storage medium which has a long enough life to copy the data onto something newer.

Another idea is to build layers of redundance; one backup is not good enough, and maybe not even two. Somewhere in the back of my head in all this is my dear Grandma's judgment of disgust at the idea of computers, where you can erase everything at the press of a button. This information vulnerability is certainly an Achille's Heel of computing, and it's been an interesting exercise to price out and plan a moderately serious backup system. I'm certainly a novice to this, so if you have some experience to share, I'm all ears.

Friday, November 16, 2007

Quick Thoughts: Programming Productivity

I was thinking hard yesterday about which languages let you be more productive. I'd like to say functional languages let you be more productive. It seemed that it would be so nice to have a few studies done on this, some clinical trials on Scala the wonder drug if you will, with comparisons to languages which have a high placebo effect (they make you think you're writing a program when really you're just creating more work for everyone).

Just now I connected that thought with an earlier one: software is difficult to estimate because it's always the creation of something that has never been done before (even when you're writing a rip-off of an existing program). I reckon that the more unknowns you have, the more risk that your initial estimates will be wrong.

The connection between these two ideas is that it's hard to judge the productivity of a programming language when the utility of the language is to write things that have never been done before. You really do have to actually use and test it in practice. Hence it seems especially useless to have online debates about the productivity of languages without trying them out. It's a little like trying to predict the time performance of a piece of code by examining its source; you really need to actually run it.

For further reading, see Ian Chai's doctoral thesis on documenting object-oriented frameworks (Ian studied under Gang of Four member Ralph Johnson), where he conducted experiments with human subjects to compare how successful different types of documentation were on training someone to use a new framework. It's the only research I know of (not that I'm an expert) where a psychology-esque experiment has been done to test for productivity of a software environment.

Update: added parenthetical remark identifying Ian's advisor.

Wednesday, November 14, 2007

It's Doing What You Tell It, or Concurrent Programming Rediscovered

How did you come by your expertise? If you're a seasoned programmer you probably have a treasure chest of "war stories", experiences that you use to do your job. Some of the best are those that help you "get" a concept in some new profound way. I want to share a very simple such story.

A coworker of mine was having some problems with some multi-process programming, and since the behavior looked like a race condition, we looked through the code for synchronization problems.

Right here, I said, you're reading this shared value without locking it, and writing the changed value back here.

But, my coworker objected, most of the time the two processes are not running at the same time. I don't really need to lock that, do I?

When put that way, I saw the basic synchronization problem in a new helpful light, and I responded "Yes, you're right, and most of the time you don't have the race condition. It's doing what you told it to do."

The problem is that with a computer our intuition of "most of the time" is skewed by the fact that they've got, like, billions of things to do before this second is up.

The program as it was written was behaving exactly as it was written - most of the time, no synchronization problems. Just some of the time. The frequency of opportunities for this to happen were simply too high for the application.

I think we make this type of tradeoff in programming all the time, where we reject developing a more complicated mechanism as being not worth the effort for the amount of time it would cost us. I think primarily of error handling, but any edge case outside the typical flow of control is a candidate for being "undeveloped." Indeed, sometimes this is the only way we can manage the prohibitive number of different cases that exist. But of course this was ordinary concurrent programming. I think when you get used to it, you develop the instinct to program for the contention points, knowing they must be handled. But it is very tempting for the novice to model the situation as we do these edge cases: the typical no-contention case and the infrequent contention case.

Got any stories about concurrent programming? Leave a comment or share it with me, jfkbits at gmail.

Wednesday, November 07, 2007

Remove This - A Search Interface Improvement

The computer world has a lot of search interfaces these days, and I don't mean just GOOG (732.94 down 8.85 in heavy trading). I'm talking about things like "Search Messages" in Thunderbird and product searches in pricegrabber. In the old library search style, you had to think about what property you wanted to search by: Title, Author, Subject, depending on which index you wanted. Google let us focus on the search term and let the computer search all the indexes.

Either way, the user interface for reporting results is the same: you get a big list to sort through.

And either way, if you find your search isn't working, you start a new, slightly different search.

But I'd like to a twist: a "remove this" button to exclude certain results.

Deleting...From What? with Thunderbird

I should explain how I got this idea: I thought Thunderbird already did it, and wound up deleting some of my email.

I set up some elegant query, and still got back a whole lot of results in the nice table view. I noticed the Delete button, so I selected a number of them that were not relevant, and clicked it, thinking that would remove them from the search view. Ho, ho, ho. They were list traffic archived elsewhere, so no biggie. But then I wondered why not also offer the feature I thought it was offering.

Remove This

This got me wishing that all search interfaces would work like that - with a Remote This button or link. If you know you don't want a result, eliminate it from consideration in future variations of your query. It helps remove the clutter of seeing the same undesired but superficially-related result come back each time you refine your query.

Of course this implies some state to track your list of exclusions, and some way to clear or manage the exclusions. For an application, e.g. Thunderbird, it's no problem, and for product searches as in PriceGrabber that already support a complex filtering system, it's a logical extension. For Google, I'm not sure if it makes sense, interface issues aside, but I think it's worth considering for the uncluttering effect it can have.

(Above is an Artists Conception of how Remove This could appear in a Google search result.)

As a footnote, Supercomputing 2007 kicks off this weekend in Reno, Nevada. I'm not going this year, but I hear my software is, and I wish everyone attending a fun and productive time.

Tuesday, August 14, 2007

Product Announcement Template

I came across a development tool product announcement today that was so perfectly cliched, I was compelled to make a template from it. "Will abstract for food" is the hand-lettered cardboard sign I would hold if I was out of work, so I'll share this one:

[Product name] is an innovative, high performance [product area] framework designed to accelerate your development while, at the same time, providing your application with a solid foundation based on best practice domain-driven design patterns. With features like [buzz-word feature from a competing product], [synonym for high performance], [feature that would be useful in most software products, such as checking user input] and [feature that any reasonable product in the category needs], it allows you to focus your efforts where they should be: Solving business problems and not writing tedious infrastructure and "plumbing" code.
Now, for the original, less the product name (we call this partial evaluation):
[Product name] is an innovative, high performance .NET domain model and O/R mapping framework designed to accelerate your development while, at the same time, providing your application with a solid foundation based on best practice domain-driven design patterns. With features like convention-over-configuration, fast data access, model validation and data binding, it allows you to focus your efforts where they should be: Solving business problems and not writing tedious infrastructure and "plumbing" code.
Sigh. Nothing particularly compelling in this announcement, unless you just woke up from a debilitating illness and didn't know what was on the market and needed something to mention at this afternoon's meeting.

The only other funny thing is the choice of the word "tedious." Tedious is the word that comes to mind when you don't use a framework at all, and you keep wondering "isn't there a way to abstract this out so I don't need to write almost the same 5 lines thousands of times?" But it's a good choice of word for selling anything automated, as "tedious and error-prone" is the accepted justification for the effort of writing a program to abstract the process in question.

Tuesday, July 24, 2007

Not programming for the ICFP 2007 Programming Contest

Earlier this year I read Tom Moertel's Seven Lessons from the ICFP Programming Contest. The appeal of a round-the-clock contest has dwindled over the years for me, but Tom's experience made me resolve to look into entering this year, and above all, not to write any code until it was patently clear why I needed to. I think this turned out to be a good thing. I did register but I didn't have the time to spend much more than a few hours Friday afternoon and a couple more Saturday afternoon. Still, in that time, I had the satisfaction of thinking about the problem and anticipating several things which turned out to be true. It's been interesting to compare my experience with those of people actually entering the contest.

Think, then code

The chief lesson of Tom's that I want to focus on is Lesson 3, understand the problem before solving it. This is so important that's it odd to see how often it gets broken. Tom relates:
So, doing the math, it would seem that in the space of thirty scant minutes, I managed to get the Task, print it out, shower, dress, read the entire excruciatingly detailed specification -- all of seven 10-point typeset pages -- and decide, as the next course of action, that it would be logical to begin writing a parser Right Freakin' Now.
Learning from this, I printed out this year's 22 page specs and took them to the library, away from my computer, where I could read. I got the idea that DNA was transformed to RNA by the execute task, followed by the transformation of RNA to a bit-mapped image by the build task. I noted that the given DNA input was supposed to contain hints and messages.

Keeping the goal at the fore

Section 3 of the spec, "From DNA to RNA" started moving me from the fairy tale about Endo and his spaceship into the realm of programming. I started reading the specification of DNA sequences and the operations on them:
xs [diamond] ys - concatenate two sequences, xs and ys
[m..] - subsequence of xs, starting at m (abbreviation of xs[m..len xs])").
I could feel gears turn and cams snap into place in my head. Upon turning the page, and the next, we started to get into real code.

When I hit page 5, decoding patterns I actually stopped reading. I did something that I'd started to do recently when assembling cheap furniture: I turned to the end of the instructions to see how it would turn out, and work backward from that. I'm glad I did, because I was starting to lose the thread of what the problem was about among the details of an algorithm.

On page 20, section 5 "How to make Endo live", it was revealed that the whole point, what we were supposed to submit, was a prefix to Endo's DNA which would transform the image. The shorter a prefix, the better, and the lower the number of wrong pixels, was ten times better. This helped me realize that whatever approach was taken, there was an optimization component to it.

After reading about the RNA graphics language, I had enough of the basics to proceed. I had the idea that the DNA represented a sort of programming language and that one needed to write code in it which would generate graphics commands to transform the images. I hoped there would be symbols to generate things like the blades of grass or the lettering or the flowers. I think this now turns out to be right, as the "genes" hinted at in the reference are subroutines for helping to transform the image, or at least generate the new target image.

Observing

The next thing I did, again without programming, was to study the two pictures, noting any differences I found. A framework I use for studying informs that "observation is the first step." This was useful, as I started to realize how many differences there were. Even similar elements, like the grass, the fish in the stream and the trees on the left horizon differ in the details (the fish are facing different directions). My office mate got pulled into the problem, and he postulated that perhaps even the blades of the windmill were not at the same angle; he was right. We started using Mathematica to test our hypotheses. This turns out to be a good tool for the problem, because imported images immediately are accessible as Raster objects, which contain a list of lists of RGB values, perfect for manipulating and running statistical analyses on. We wired up a Dynamic expression which would allow us to move the mouse over the source image and display information about that pixel position in the source and target images. With this interactive tool we quickly found that the night-to-day transformation was not a simple gray-level shift or color channel shift.

The conclusion of this observation, as I walked to my car late Friday afternoon, was that more knowledge would be needed to do the transformation. The target picture was simply too complex to think of the problem as a series of basic graphic operations. The result would be not much different than encoding the target picture as the DNA prefix. I remember telling my office mate, the best thing to do is probably to write a disassembler for the DNA and study the original string.

Disappointing Saturday

I wish I'd followed my advice, of studying the original DNA string, because I never did it. My two or three hours of work on Saturday consisted of going back to study the DNA and RNA specs in detail, followed by my first coding session which only got as far as a non-functioning pattern function.

Conclusion

Even though I submitted nothing, for the time that put into it the non programming approach at the outset was good. Of course to solve the problem you need the computer time, but the forced discipline of thinking about the goal and the problem at a higher level seemed to work. Others reported spending most of their weekend coding things that in the end did not help them solve the problem.

If I'd truly followed my own thinking on Friday, I could also have submitted something like a prefix to set all the pixels to the target's dominant color, or learned more about the problem by reverse engineering the DNA than I did by coding up an algorithm that I already understood.

Monday, July 23, 2007

Diffing Endo

In the brief time I had to look at the ICFP Programming Contest this past weekend, I came up with one result I'd like to share. For background, roughly speaking the task was to provide a prefix to Endo's DNA which would transform this nighttime picture of an alien saying "lambda" to this other daytime picture of an alien cow say "mu":



Source picture



Target picture

Without time to inspect the DNA to find how it generated the source picture, I resorted to merely studying the two pictures to imagine how the transformation could be done in a symbolic way that wouldn't devolve to enumerating all the pictures of the target. What I come up with was this image, which is what you get when you subtract the target picture's RGB values from the corresponding values in the source:


Diff of target-source

Monday, June 25, 2007

Speedlink: Compilers.net

Compilers.net was recommended this morning on the Types mailing list as a good source for compiler frameworks. It has lists of compilers organized by language, a potentially handy resource.

Friday, June 22, 2007

Using Groovy to profile .class vs. getClass()

In Log4j template code in Eclipse, I mentioned two different ways of constructing a Logger instance:

Logger.getLogger(getClass());
and
Logger.getLogger(Foo.class);
and mentioned that the latter is supposed to be faster. Wondering how much faster led to an impulse to start creating a new Eclipse project and...

Then I thought "why write a new Java file? This is a perfect time to try Groovy." (If my goal was not to produce something persistent, I've failed because now we have this post.)

I recently installed Groovy, the dynamic scripting language with Java-like syntax and doubtful name, and fired up the Groovy console, a simple Swing application that functions as a kind of read-eval-print loop environment with one very useful feature, an object field and method explorer.

Bottom line: .class vs. getClass()

The bottom line is that accessing the static .class object was faster by about 20 times, at least in my tests using Java 1.4 in Windows:
Operation Average time
getClass() 1281 ns
.class 62 ns
x++ for int x 1859 ns
loop overhead
(not counted in other numbers)
78 ns
I would surely like to find some explanation for this gap. I'm sure all it would take is knowledge of how these two calls translate in the JVM, but that alas lies outside my knowledge.

Groovy code for profiling

I follow a standard pattern for profiling a block of code where you simply repeat the block some large number of times, sampling a timer before and after the loop, and then subtracting the loop overhead by timing the same looping construct with an empty body:
limit=1000*1000
import java.util.Date
t0 = new Date().time
for(i in 0..<limit)
{
;
}
t1 = new Date().time
loopoverhead = t1-t0
println "loopoverhead=${loopoverhead/limit} ms"

s = "foo"

t0 = new Date().time
for(i in 0..<limit)
{
c = s.getClass()
}
t1 = new Date().time
tgetclass = (t1-t0)-loopoverhead
println "getClass()=${tgetclass/limit} ms"
The Groovy feature I liked here was the ability to write an expression inside the string literals. Also note the automatic getter translation of the phrase new Date().time, where the Java getTime() getter method call is available as a field reference time. One thing I wanted to do was to call System.currentTimeMillis() instead of constructing a Date object, but actually couldn't find in the docs how to invoke a static method.

Appendix: Groovy source code


limit=1000*1000
import java.util.Date
t0 = new Date().time
for(i in 0..<limit)
{
;
}
t1 = new Date().time
loopoverhead = t1-t0
println "loopoverhead=${loopoverhead/limit} ms"

acc=0
t0 = new Date().time
for(i in 0..<limit)
{
acc++
}
t1 = new Date().time
intincr = t1-t0
println "integer increment=${intincr/limit} ms"

s = "foo"

t0 = new Date().time
for(i in 0..<limit)
{
c = s.getClass()
}
t1 = new Date().time
tgetclass = (t1-t0)-loopoverhead
println "getClass()=${tgetclass/limit} ms"

t0 = new Date().time
for(i in 0..<limit)
{
c = java.lang.String.class
}
t1 = new Date().time
tstaticclass = (t1-t0)-loopoverhead
println "String.class=${tstaticclass/limit} ms"

// Try with a non-system class

class Bob
{
}
testbob = new Bob()

t0 = new Date().time
for(i in 0..<limit)
{
c = testbob.getClass()
}
t1 = new Date().time
tbobgetclass = (t1-t0)-loopoverhead
println "getClass() on non-system class=${tbobgetclass/limit} ms"

t0 = new Date().time
for(i in 0..<limit)
{
c = Bob.class
}
t1 = new Date().time
tbobstaticclass = (t1-t0)-loopoverhead
println "Bob.class=${tbobstaticclass/limit} ms"

Tuesday, June 19, 2007

This Impredicative Sentence Talks about Itself

In which we learn a new word and check in on recent research in type inference.

Word of the day

William Safire's "Fumblerules" is a book on writing, a collection of short essays on various points of grammar and style, each based on a memorable self-explanatory example: "A writer must not shift your point of view" and "No sentence fragments." Fumblerules make the mistake they name.

While it's not a mistake, the title of this post is my self-explanatory example of our programming language word of the day:
impredicative
which is said of a definition that is self-referencing. Russell's Paradox uses a well-known example: "the set of sets that don't include themselves."

Self-reference and types

This word of the day comes to us via the paper "Boxy Types: Inference for Higher-Rank Types and Impredicativity" by Simon Peyton-Jones (with Microsoft Research in Cambridge, England) along with Stephanie Weirich, and Dimitrios Vytiniotis (University of Pennsylvania) from ICFP 2006, to my knowledge the only paper containing the phrase "plain as a pikestaff." (My money is on Peyton-Jones for contributing that one. ) But what were higher-rank types and impredicativity, I wanted to know.

A higher-rank type only makes sense with some background in parametric polymorphism, but I'll take a stab at explaining it. A polymorphic type is expressed with a universal quantifier "for all types a...", as in "for all types a, list of a -> int", the type of a polymorphic list length function. In general a polymorphic type can have any number of universal quantifiers. In Hindley-Milner, these universal quantifiers all come at the beginning of the type phrase. For any particular invocation of a polymorphic function, this means the polymorphic variables are each bound to one particular type. The higher-rank type means that the universal quantifiers can appear inside function types, allowing those polymorphic variables to be bound to different types within the body of the function. The utility of this is apparently in a kind of generic programming, as explained in Peyton-Jones work "Scrap Your Boilerplate", work done with Ralf Lämmel, another Microsoft Research employee.

The impredicative part of type inference is interesting. At first, I thought it meant recursive types, such as are used for linked lists. But I was puzzled how that could relate to this statement from the paper:
In the Hindley-Milner system, type variables range only over monotypes, but to allow function g we must allow a type variable—a, in the definition of Maybe—to be instantiated with a polytype. In the jargon, we need an impredicative type system, whereas Hindley-Milner is predicative.
After some digging, I think perhaps an impredicative type system is as the Wikipedia "Type polymorphism" indicates, one allowing impredicative polymorphism, allowing an instantiation of a variable in a type with the type itself.

The difference comes down to what is meant by "for all types" in the universal quantifier. In the predicative type system "for all types" is ranging over the monotypes, anything without a universal quantifier. More to the point, for all types does not include other polymorphic types... and therefore cannot include itself. In the impredicative type system, "for all types" can range over the polymorphic types too, and can include itself, hence the term "impredicative."