Friday, June 23, 2006

Yacking about a multi-language parser generator

If you've ever written even a toy-sized compiler or interpreter, the parser is one of the first things that slows you down. You have an implementation language in mind, and you cast about for a parser generator. Maybe you're old-school with C/Lex/Yacc, maybe it's Java/JavaCC, or maybe a less mainstream pair like SML/SMLYacc. Maybe you go the route of the self-written recursive descent parser and forgo a parser generator. But this choice is a bit like choosing a bank for a CD, because once you have the parser written, you get substantial penalties for early withdrawal from the language and parser implementation choice.

Enter GOLD, a multi-language parser generator. The concept is dirt simple: generate the DFA state tables and the LR parse table as text-formatted data, and take your lexer and parse tables to any language that has a parsing engine. Even if an engine is not available for your platform, it is not hard to write the part that accepts tokens and consults the parse tables.
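To make the idea concrete, here is a minimal sketch in Python of the "engine" half: a driver that knows nothing about the grammar and only consults tables loaded from text. The JSON layout, the toy grammar (E -> E + n | n), and the names TABLES_TEXT and parse are assumptions made up for illustration. This is not GOLD's actual table file format or engine API.

```python
import json

# Hypothetical text-formatted parse tables for the toy grammar
#   rule 0: E -> E + n
#   rule 1: E -> n
# The layout is an assumption for illustration, not GOLD's real format.
TABLES_TEXT = """
{
  "rules": [["E", 3], ["E", 1]],
  "action": {
    "0": {"n": ["shift", 2]},
    "1": {"+": ["shift", 3], "$": ["accept", 0]},
    "2": {"+": ["reduce", 1], "$": ["reduce", 1]},
    "3": {"n": ["shift", 4]},
    "4": {"+": ["reduce", 0], "$": ["reduce", 0]}
  },
  "goto": {"0": {"E": 1}}
}
"""

TABLES = json.loads(TABLES_TEXT)


def parse(tokens, tables):
    """Run a generic LR driver over tokens; return the rule numbers reduced, in order."""
    rules, action, goto = tables["rules"], tables["action"], tables["goto"]
    stack = [0]                      # stack of LR state numbers
    stream = list(tokens) + ["$"]    # "$" marks end of input
    pos = 0
    reductions = []
    while True:
        state = str(stack[-1])       # JSON object keys are strings
        symbol = stream[pos]
        entry = action[state].get(symbol)
        if entry is None:
            raise SyntaxError(f"unexpected {symbol!r} in state {state}")
        kind, arg = entry
        if kind == "shift":
            stack.append(arg)        # push the new state, consume the token
            pos += 1
        elif kind == "reduce":
            lhs, length = rules[arg]
            del stack[-length:]      # pop one state per right-hand-side symbol
            stack.append(goto[str(stack[-1])][lhs])
            reductions.append(arg)   # a real engine would build an AST node here
        else:                        # "accept"
            return reductions
```

Calling parse(["n", "+", "n"], TABLES) walks the tables and returns the list of reductions it performed. The point of the design is that the driver loop is the only part that has to be rewritten per language, while the table text travels unchanged.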

The key to this is questioning the approach of most parser generator systems, which put the parsing rules inline with the code that builds the abstract syntax tree (AST). Once the grammar file contains AST-building code, you've tied yourself to a particular language.

Reuse



By extracting the parse table data to a portable file, GOLD has opened up two different avenues of reuse:

1. GOLD parse tables can be executed in a large number of languages.

Right now, the GOLD page lists at least 7 different languages you can write your interpreter/compiler in while using a GOLD-generated parse table, including C, C++, C#, Java, and Python.

You've already written an interpreter in Java/JavaCC and want to switch to C++? Now you can switch without having to learn Lex/Yacc.

2. Because of the language portability, parsers can be shared more easily.

If somebody has already written a Ruby parser or an SQL parser for sharing, you're now probably more willing to pick up the GOLD input and compiled grammar table, because you're free to drop it into the language of your choice. The GOLD site makes available grammars for about 14 major languages including C, LISP, Pascal, Smalltalk IV, and SQL 89.

GOLD as it is now



After all this talk about portability, the bad news is that the current GOLD parser builder is a closed-source, Windows-only GUI application. Ouch. While obviously not a theoretical limitation, it's a practical barrier to entry for plenty of would-be adopters who may work only on a Unix box or who don't want to run a Windows executable of unknown origin. The builder looks like an IDE, and one possible portable solution would be an Eclipse plug-in. There is also a command-line version of the builder, but for some reason it is also Windows-only at present.

The good news is that the builder itself looks fairly capable. You can regression-test your grammar and get decent visual feedback on your state tables.

Conclusion



GOLD looks like a great idea in a still-maturing incarnation. I like how it's taken the table-driven notion of parsing one step further. With more community support to make a portable builder to complement the portable tables and engines, I think this tool or something like it will be standard in the years to come.
