Friday, December 28, 2007

Revisiting Tree Traversal Patterns

Experts may disagree, but I like to agree with LtU that the killer app for functional languages is compiler writing. I propose that one simple reason for this success is structural pattern matching, because it lets you write compact and natural tree traversal code:

datatype Expr = Atom of string | Pair of {car : Expr, cdr : Expr};

fun textFormatter (Atom a) = a
  | textFormatter (Pair{car=x, cdr=y}) =
      concat ["(", textFormatter x, ",", textFormatter y, ")"]
Applied to Pair{car=Atom "a", cdr=Atom "b"}, textFormatter returns "(a,b)".

Do OO languages have any hope of approaching this kind of conciseness? That led to my question in Tree Traversal in Object-Oriented Languages: "which of these two techniques do you like better, and is there a better one?"

In that post, JFKBits reader "peter" commented that my "event model" was just the Visitor Pattern. I'd had a different idea of what the Visitor Pattern was. After some digging, I learned there are two variations, both considered the Visitor Pattern, and today we'll look at what difference the choice makes when writing programming language processors.

Why Tree Traversal Matters

For the programming language tool writer, a tree traversal pattern is like the for loop (or other iteration pattern) for most other programmers -- after all, your bread and butter data structure is the tree, not a flat sequence.

In a word, tree traversal matters if you traverse a lot of trees - sum up the number of traversals you write in a single program or the number of processors you write over time.

You could be writing a series of toy interpreters, each with a single eval loop, like this one for Scheme and this one for Lisp, both written in Python.

On the other hand, you could have one large project with many phases. I have a 12,000-line compiler for Ila, written in Standard ML, with five major tree traversal areas in the compiler core, plus some utility traversals like error message pretty-printing and a Javadoc-like HTML generator. Writing that compiler just once was enough to impress on me that the tree traversal pattern matters, especially since there were quite a few node types.

Further, your choice of tree traversal pattern will immediately affect the way you think about your algorithm. I think one variation of visitor makes it unnecessarily difficult, and one makes it quite a bit more natural.

Visitor Pattern in General

The visitor pattern is generally described as a way to decouple algorithms that operate on a data structure of related node classes from the node classes themselves.

The double-dispatch technique is often described in presentations of the visitor. To use the visitor pattern, take the tree object t to traverse and a visitor object v corresponding to the algorithm (type checking, syntax highlighting), and call the virtual method t.accept(v) (this is the first dispatch). The node subclasses (Atom, Pair) each have an implementation of accept that calls the visitor method corresponding to their particular class (the second dispatch, hence "double dispatch"). That way the visitor can be written as a collection of methods that each know the exact node subtype they're dealing with.
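To make the two dispatches concrete, here's a sketch of the call sequence, using the Expr and Visitor types defined in the next section (makeTree is a hypothetical helper that builds some expression):

Expr t = makeTree();             // dynamic type is Atom or Pair
Visitor v = new TextFormatter(); // the algorithm, packaged as an object
t.accept(v);                     // dispatch 1: the virtual call picks
                                 // Atom.accept or Pair.accept
// Inside Atom.accept the body is v.visit(this); since 'this' has static
// type Atom there, the overload visit(Atom) is selected -- dispatch 2.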

Visitor Pattern for Language Processors

What's often glossed over is the critical question: where does the knowledge about a node's children live? In other words, where does the recursion/iteration occur -- in the acceptor or in the visitor? I originally imagined the visitor pattern was defined to mean recursion occurs in the acceptor. Either way is considered an accepted variation of the pattern, but I claim the choice makes a substantial difference for most compiler algorithms.

Recursion in the Acceptor

Here's recursion-in-the-acceptor for the Lisp structure above:

public abstract class Expr
{
    public abstract void accept(Visitor v);
}

public class Atom extends Expr
{
    public String value;

    public void accept(Visitor v)
    {
        v.visit(this); // Atom has no children, simple accept
    }
}

public class Pair extends Expr
{
    public Expr car;
    public Expr cdr;

    public void accept(Visitor v)
    {
        car.accept(v); // children visited in the acceptor
        cdr.accept(v);
        v.visit(this); // postorder traversal
    }
}

public interface Visitor
{
    void visit(Atom a);
    void visit(Pair p);
}
Now, given those definitions, how would you write the TextFormatter visitor that produces the same output as the SML example given earlier?

Recursion in the acceptor introduces some problems. One is that it fixes the traversal order -- postorder in this case. Now every visitor needs to know what the traversal order is, a dynamic property that must be remembered when writing each one. Also, what if postorder isn't what you need? In particular, to generate the grouping parentheses, which are associated with the Pair, you would like code associated with the Pair to run both before and after visiting the Pair's children. Further, you want to generate a comma in between -- that means a visit in between the two children. You're forced to write something like this:

import java.util.Stack;

public class TextFormatter implements Visitor
{
    private Stack<String> m_stack = new Stack<String>();
    private String m_result = null;

    public void visit(Atom a)
    {
        m_stack.push(a.value);
    }

    public void visit(Pair p)
    {
        // We need to know the order of children traversal. Hmmm.
        String cdr = m_stack.pop();
        String car = m_stack.pop();
        String value = "(" + car + "," + cdr + ")";
        m_stack.push(value);
    }

    public String toString()
    {
        if (m_result == null)
            m_result = m_stack.pop();
        return m_result;
    }
}
Using an explicit stack is not always bad -- if you tend to process programs with extremely deep nesting, you may hit your implementation language's control stack limit -- but overall this is not the style in which I'd prefer to write dozens of visitors.

Note that the SAX parser model, which follows recursion-in-the-acceptor, addresses the fixed-order problem by having multiple visit methods for nodes with children: it has startElement as well as endElement, and the like.
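Applying that idea to the running example would mean giving the visitor separate enter/leave methods for Pair (the names here are hypothetical) and calling both from the acceptor; a sketch:

public interface OrderedVisitor
{
    void visit(Atom a);
    void enterPair(Pair p); // called before the children
    void leavePair(Pair p); // called after the children
}

// Pair's acceptor then brackets the recursion:
public void accept(OrderedVisitor v)
{
    v.enterPair(this);
    car.accept(v);
    cdr.accept(v);
    v.leavePair(this);
}

Even this doesn't give you the in-between visit needed for the comma, though; you'd need yet another hook between the two children.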

The other problem with this approach is that not every visitor needs to visit every node. Some may only need to go so deep. That brings us to the recursion-in-the-visitor approach, which I would consider the recommended one for programming language processors.

Recursion in the Visitor

This approach has the advantage that the accept method is always the trivial v.visit(this). Every acceptor, Atom's and Pair's alike, collapses to a one-liner:
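public void accept(Visitor v)
{
    v.visit(this); // no recursion here; the visitor decides when to descend
}

With the recursion moved out of the node classes, the visitors can be written in a more natural style: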

public class TextFormatter implements Visitor
{
    private StringBuffer m_buff = new StringBuffer();
    private String m_result = null;

    public void visit(Atom a)
    {
        m_buff.append(a.value);
    }

    public void visit(Pair p)
    {
        m_buff.append("(");
        p.car.accept(this);
        m_buff.append(",");
        p.cdr.accept(this);
        m_buff.append(")");
    }

    public String toString()
    {
        if (m_result == null)
            m_result = m_buff.toString();
        return m_result;
    }
}
Is this what I'd like to write as a functional programmer? Of course not. I'd prefer to have accept return a value, so I could write visit(Pair) more like this:

public void visit(Pair p)
{
    m_buff.append("(" + p.car.accept(this) + "," + p.cdr.accept(this) + ")");
}
Which of course means that visit should return a type, instead of accumulating results in a member variable. But then every visitor has to agree on what type to return, which in Java most likely means Object and a cast at every use. It all gets ugly again.
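For what it's worth, Java generics can recover typed returns without the Object casts. A sketch, assuming we're free to change the node classes (it fixes the typing, not the boilerplate):

public interface Visitor<R>
{
    R visit(Atom a);
    R visit(Pair p);
}

public abstract class Expr
{
    public abstract <R> R accept(Visitor<R> v);
}

// Both Atom and Pair implement the same trivial acceptor:
// public <R> R accept(Visitor<R> v) { return v.visit(this); }

public class TextFormatter implements Visitor<String>
{
    public String visit(Atom a)
    {
        return a.value;
    }

    public String visit(Pair p)
    {
        return "(" + p.car.accept(this) + "," + p.cdr.accept(this) + ")";
    }
}

The four-line ML version still wins handily, which is rather the point.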

Conclusion - Still Not Perfect

So, a visitor with recursion in the visitor is acceptable, but still not ideal, and it doesn't come close to the conciseness of pattern matching. The accept method needs to be added to each node class, which is a pain for rapid prototyping, and the visitor by its nature requires that you accumulate state in member variables instead of using a functional style of combining returned results in a single expression.

Still, I'm just one guy with a certain slice of experience -- is there a better (more compact, less boilerplate) tree traversal technique for OO languages?

Or is this really the ultimate pattern? Does OO simply lose this particular battle, so that Nice multi-methods or Scala case classes or other language extensions become necessary parts of an OO-capable language that supports programming language implementations as well as the functional languages do?

7 comments:

Anonymous said...

I'm not so sure. Those kinds of tree traversal are like frictionless surfaces. Nice for illustrating a basic principle, but nothing like real life. What I find, when I do this kind of work, is that input error detection and description massively dominate the traversal logic. Making tree traversal simpler or even zero effort wouldn't make much difference to the total effort.

The only thing I have written in the last few years where the traversal was significant was a library where the traversal was so critical that it had to be hand rolled anyway, down to implementing the red-black logic.

JFKBits said...

aog: It's helpful to think about comparing one area to the overall effort. I'm sure I agree with you: the things you actually do to or with the nodes are more effort.

I have this memory of starting to write the umpteenth such algorithm in my compiler and wishing there was a way to abstract at least some of the boilerplate. And this was in ML, where I'm saying it's convenient. I think I did identify some algorithms where essentially an iterator is what was called for, so I could map a function over a tree.

It can also be argued that most of these traversal techniques become second nature after a while, so it really only matters when (a) teaching someone else (b) trying to remember how you did that stuff eight years ago.

Perhaps my point is simply that we do this stuff all the time, so we might as well find an optimal way to do it -- and ask what that is. It's for a similar reason that folks care about map versus foreach versus for(i=0; i < max; i++), even though the body of the loop is what matters and is where you put your coding effort.

Fabien said...

The two styles (tree data types + external functions, and objects with embedded methods) are dual, and one can be mechanically translated into the other (visitor patterns are the canonical way for tree->OO translation). The catch is that a given representation is easy to extend in one dimension, and hard to extend in the other:

- in tree style, it's easy to add new treatments (new functions), and hard to add new data types (every function must be touched to add a case)

- in OO style conversely, it's easy to add new node types (create a new subclass), but hard to add new treatments (methods must be added into every class).

Unifying both styles' strengths is still an open problem. If you google `The Expression Problem' you'll find plenty of unconvincing academic papers about it :)

In the case of a language, clearly, you're more likely to have to create new treatments than new AST nodes, so you'd better pick the tree approach. The best compromise I've found yet is to provide a patchable traversal function: it explores the tree methodically without doing anything, but you can add arbitrary treatments in it, triggered by customizable predicates. I've blogged something about it here: metalua code walkers. I wouldn't call it easy to use, but all the other approaches I've seen suck more :) Unsurprisingly, patch functions use a lot of pattern matching (which is implemented as a metalua macro).

Anonymous said...

fabien;

Your approach sounds like a callback-style variant, e.g. you have a general traverser to which you provide a callback which it invokes on each node. C++ functors (and the Boost.Function library) make that very easy to implement, and it's the style I am adopting these days. In particular, if you combine that with the multiple-inheritance protocol class approach, you could even template the functor for maximal re-use. E.g., the functor takes the protocol class type and a pointer to method, and then calls the method on every node that casts to the protocol class. Then it's easy to add both new node types and new treatments.

Barry Kelly said...

What you're looking for is basically multiple dispatch. The fact that most OO languages don't have it is annoying; I agree with you there.

I deal with the problem in .NET by dynamically creating code at runtime to select the correct method. I have a function which takes the visitor class (the class implementing the algorithm), and it creates a DynamicMethod instance with the right code to dispatch the different argument types to the correct methods, and it returns a delegate referencing this dynamic method.

It's not as fast as it could be with free access to the underlying platform's OO implementation, or even with help from the node classes, but it does work a lot faster than most other schemes.

JFKBits said...

barry: If I understand you from our off-line conversation, your technique lets you write just the visitor class, without needing to add the acceptor into the node classes.

That would be a bit better, especially if it allows different method signatures for different visitors. E.g. maybe an interpreter visitor's visit methods all return an Expr and a code formatter's visitor methods all return String.

The dispatch speed is another issue which would be interesting to cover sometime. It may not be quite so critical for some types of applications, if rolling your own double-dispatch lets you write a more compact structure. Standard disclaimer: it depends on what you're doing.

JP said...

When you iterate in the acceptor, you want to use a hierarchical visitor pattern http://c2.com/cgi/wiki?HierarchicalVisitorPattern instead of a regular visitor pattern. It doesn't allow you to change the traversal order but does solve your other issues.