Wednesday, December 19, 2007

Tree Traversal in an Object-Oriented Language

There are two methods I know of for representing an expression in classes: the event-model and the subclass model. The main problem I see is in how to traverse the tree without replicating verbose "if this type, then" boilerplate and without piling all the code that might ever use the expression into the same sets of classes. I'm merely observing in this post, if you have any opinions on which you like better, let me know.

Grammar

For both of these, w'll use the following grammar:

E -> INT | STRLIT | VARREF | List[E,...] | E(E) | FN(x){E}

An expression is either an integer constant or a string literal, a variable reference, a function call, or a function definition.

Subclass Model

In the subclass model, we'd start with a class for E itself:
class E
{
}
We'd augment E with lots of handy utility functions and fancy abstract (pure virtual) methods. Then we'd define a subclass of E for each of the possible "actual" expression types:
class Int extends E
{
int k;
}

class StringLit extends E
{
String s;
}

class Varref extends E
{
String x;
ScopeLevel level;
}

class EList extends E
{
List elts; // List of E objects
}

class Call extends E
{
E function;
E argument;
}

class FnDef extends E
{
String x;
E body;
}
One thing I don't like here is adding a virtual method to E to handle anything anybody might do to an expression. Type checking? Add a typeCheck method. Code generation? Add a generateCode method. Optimization or source transformation passes? Good luck with that. I think it's sticking too much unrelated code into each class.

And if you separate your concerns and have code elsewhere that gets handed E objects, you end up with this structure that suggests you haven't learned about old-code-calls-new-code virtual methods yet:

void f(E expr)
{
if(expr instanceof Int)
f((Int)expr);
else if(expr instanceof StringLit)
f((StringLit)expr);
else ...
...
else // Oh no!
;// It wasn't one of the types I thought it could be!
}

Event Model

The other model is already known in Java programming as the event model, as in the way SAX parsers work. I like to think of the True Origins of this concept, in lambda calculus encoding. Every basic type we know of in the functional languages like ML (tuples, lists, integers) has an encoding in the pure, 3-form lambda calculus. And the type we're talking about here is called a sum type. Sum is related to "OR" just as product is related to "AND." An expression is an int OR a string literal OR a variable reference, and so on.

Here's what I like about the event model: it factors out the "if expr is this choice, call this; else if expr is that choice, call that..." code to one place and you are freed up to write functions that are given the exact subtype:
interface EHandler
{
Object handle(Int i);
Object handle(StringLit s);
Object handle(Varref v);
Object handle(EList l);
Object handle(Call c);
Object handle(FnDef f);
}
Now your language processor phases can be written in terms of EHandlers:
class TypeCheck extends EHandlerImpl
{
Object handle(Int i)
{
return Types.INT;
}

Object handle(Varref v)
{
// look up v. if unbound, flag an error
// if bound, return its type
}

Object handle(Call c)
{
Object fnType = this.handle(c.function);
Object argType = this.handle(c.argument);
// check that fnType is a function,
// make sure fnType's range type matches argument type
// return fnType's return type
}
}
This is fine as long as you only need to look at one level of the tree at a time. And if we wanted these handle functions to return a type other than Object, or if we want to add another argument to each handle function such as an environment or accumulator object (not likely since we could make it a class member, but conceivable) we'd need to create another interface and/or abstract base class.

This second one appeals to me because I'm so accustomed to ML's pattern matching facility (which also appears in other languages like Haskell and Scala), which is the sweetest of syntactic sugar for sum types, and sum types in turn are the bread-and-butter of language processors.

Conclusion

Either of these two approaches seems fine for small and simple expression tree types (meaning the number of different types of nodes, not the size of a particular tree), but I think there's got to be a better way for representing sum types in classes that can handle larger sum types and more complex processing needs. Do you know what it is? Leave a comment or email to jfkbits at the mail domain of google.

Update 12/20: Fixed code formatting.

3 comments:

Anonymous said...

This is just the visitor pattern. I don't see how it eliminates subclassing. In fact you need both. You use subclasses to describe the hierarchy and the visitor pattern to provide a structure independent way to traverse this hierarchy.

JFKBits said...

peter: Yes, what I called the "event model" appears to be the visitor pattern, though there's a distinction in what object does the recursion (the visitor or the subclasses) that at the time of writing I didn't realize either can be considered visitor pattern.

I didn't mean to imply the second model excluded subclasses; in fact I originally called the first technique "subclass-and-cast", to emphasize the need to check types and cast.

Anonymous said...

Hmmm. It just this moment occurred to me (as I was working on code with a similar issue) that this is an excellent place for multiple inheritance.

For each related group of methods (e.g., a term than can contain other terms) you write an abstract protocol class. Then the type specific code is a single if - can I cast to the APC or not? This is also robust in the face of additional structure in the main class hierarchy. If you add another sort of containing term, you simply inherit the container APC and it works.

I suppose in Java this would be done by using Interfaces.