Wednesday, April 30, 2008

StreamTokenizer and Declarative Lexing

One thing I've noticed about marketing is how frequently an appeal is made to change your lifestyle to include more of whatever is being sold. I first noticed this when I read in a catalog that Ikea wants me out of that rat race of the outside world, so I can spend more time at home. And by the way, don't I want my home to be a comfortable inviting place to spend time? After that, I've seen it everywhere. Here at JFKBits we want to know why the world doesn't write more language processors. Everyone can enjoy a world of recreational symbol translation. I was on vacation last week, reading a programming language book by the beach. Maybe someday we'll run programming language designs in this space and JFKBits readers can implement them as a kind of crossword puzzle of the day, in an hour on the train home from work. So today, we're thinking about how to get a lexer up and going more quickly.

When you're designing a syntax, you typically work at a high level:
  • Comment syntax: Line-oriented or delimited, or both? Do comments nest?

  • Identifier syntax: Limited choice for the first character (e.g. letters and '_'), followed by a mix of alphanumeric characters and underscores. In Lisp, symbol syntax is much more flexible.

  • Operators and grouping syntax: parentheses, brackets, multi-character operators like ++, ->, and &&=.

It might be useful, in terms of quickly prototyping a domain-specific language or language-aware utility (e.g. a tool that finds buggy code patterns), to have a lexer generator that lets you declare properties in these terms rather than translating to regular expressions.

Java's StreamTokenizer, while not a universal tool, is a step in that direction.

What I like about StreamTokenizer is that it raises the level of programming closer to the problem domain. Regular expressions are the traditional and well-understood means of specifying a lexer's accepted language, but this declarative style is possibly a better programmer's tool. (Note well that I'm talking about StreamTokenizer, not StringTokenizer which is a much simpler state machine.)
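As a sketch of that declarative style (this example is mine, not from Kamin): each call below states a character-class property — comment syntax, string-literal delimiter, identifier characters — rather than a regular expression. The tokens are returned as tagged strings just to make the output easy to inspect.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class DeclarativeLexDemo {
    static List<String> lex(String src) {
        StreamTokenizer t = new StreamTokenizer(new StringReader(src));
        t.slashSlashComments(true);   // line-oriented comments: // ... to end of line
        t.quoteChar('"');             // string literals delimited by "
        t.wordChars('_', '_');        // '_' joins the identifier character class
        List<String> tokens = new ArrayList<>();
        try {
            while (t.nextToken() != StreamTokenizer.TT_EOF) {
                switch (t.ttype) {
                    case StreamTokenizer.TT_WORD:   tokens.add("word:" + t.sval); break;
                    case StreamTokenizer.TT_NUMBER: tokens.add("number:" + t.nval); break;
                    case '"':                       tokens.add("string:" + t.sval); break;
                    default:                        tokens.add("char:" + (char) t.ttype); break;
                }
            }
        } catch (IOException e) {     // cannot happen for a StringReader
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [word:my_var, char:=, number:42.0, char:;, string:hi]
        System.out.println(lex("// a comment\nmy_var = 42; \"hi\""));
    }
}
```

Notice there isn't a regex in sight; the default syntax table already knows about letters, digits, and whitespace, and we only declare what differs.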

For example, in playing with Kamin's interpreters, which use an S-expression-based syntax, I've been using this configuration to scan Lisp-style code in Java:

tokenizer = new StreamTokenizer(new BufferedReader(reader));
tokenizer.resetSyntax();              // we don't like the default settings
tokenizer.whitespaceChars(0, ' ');    // control characters and space are whitespace
tokenizer.wordChars(' ' + 1, 255);    // every printable character is a word character

The basic idea is that StreamTokenizer is a state machine with hard-coded states and transition arcs, and you configure it with what characters are in the relevant character classes. For example, from the state of skipping whitespace, there's a transition to a string literal state, when a character from the "quoteChar" class is encountered. The string literal state accumulates, transitioning back to itself on characters not in the quoteChar class. It's simply up to you to configure which character or characters constitute a quote delimiter. The essential observation is that lexers for many useful languages share state machines of identical shapes, and they differ only in the definitions of character classes.
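To round out the configuration above into a working scanner, here is a self-contained version. The two ordinaryChar calls for the parentheses are my assumed addition — without them, '(' and ')' fall into the word class and glue themselves onto adjacent symbols.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class SexpLexDemo {
    static List<String> lex(String src) {
        StreamTokenizer tokenizer =
                new StreamTokenizer(new BufferedReader(new StringReader(src)));
        tokenizer.resetSyntax();              // we don't like the default settings
        tokenizer.whitespaceChars(0, ' ');    // control characters and space are whitespace
        tokenizer.wordChars(' ' + 1, 255);    // every printable character is a word character
        tokenizer.ordinaryChar('(');          // ...except the grouping characters,
        tokenizer.ordinaryChar(')');          // which come back as single-char tokens
        List<String> tokens = new ArrayList<>();
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                tokens.add(tokenizer.ttype == StreamTokenizer.TT_WORD
                        ? tokenizer.sval
                        : String.valueOf((char) tokenizer.ttype));
            }
        } catch (IOException e) {             // cannot happen for a StringReader
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [(, +, 1, (, *, 2, 3, ), )]
        System.out.println(lex("(+ 1 (* 2 3))"));
    }
}
```

Note that after resetSyntax() there's no number parsing: "1" comes back as a word, which suits a Lisp reader that does its own atom classification.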

Of course, not every language fits the preprogrammed state machine shape. StreamTokenizer can't even handle Java, because it has no capacity for multi-character operators. It can be configured so that '+' is an "ordinary character", meaning it is lexed as a single-character token, but there's no way for a parser to know if two '+' tokens in sequence came from the input "++" or "+ +". This is what I mean when I say StreamTokenizer is not a universal lexing tool.
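The limitation is easy to demonstrate. In this little experiment of mine, once '+' is declared an ordinary character, the token streams for "++" and "+ +" are identical, so nothing downstream can tell them apart:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class PlusPlusDemo {
    static List<Integer> tokenTypes(String src) {
        StreamTokenizer t = new StreamTokenizer(new StringReader(src));
        t.ordinaryChar('+');          // '+' lexes as a single-character token
        List<Integer> types = new ArrayList<>();
        try {
            while (t.nextToken() != StreamTokenizer.TT_EOF) {
                types.add(t.ttype);   // for an ordinary char, ttype is the char itself
            }
        } catch (IOException e) {     // cannot happen for a StringReader
            throw new UncheckedIOException(e);
        }
        return types;
    }

    public static void main(String[] args) {
        // Both inputs yield the same two-token stream ['+', '+']:
        System.out.println(tokenTypes("++").equals(tokenTypes("+ +")));  // true
    }
}
```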

But I still wonder if there's room for this declarative style of input, building on higher-level concepts like identifiers and operators, in a more general lexing tool. It could be combined with a programmatic knowledge base of the standard lexical idioms in circulation. You could say "give me Python-style identifiers, with the standard Java operators except these four, which don't apply to my language." I'm not at all sure there is a need for such generality, but I think it's worth writing down as an idea for future inspiration.
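To make the idea concrete, here's a purely hypothetical sketch. None of this is a real library: the names (LexerSketch, pythonIdentifiers, javaOperatorsExcept) are invented for illustration, and the "knowledge base" is reduced to one canned identifier pattern and a small operator set.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LexerSketch {
    private final Pattern identifier;
    private final List<String> operators;   // sorted longest-first

    private LexerSketch(Pattern identifier, Collection<String> ops) {
        this.identifier = identifier;
        this.operators = new ArrayList<>(ops);
        this.operators.sort((a, b) -> b.length() - a.length()); // longest match wins
    }

    // "Python-style identifiers": letter or '_', then letters, digits, '_'
    static Pattern pythonIdentifiers() {
        return Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");
    }

    // A slice of the standard Java operators, minus the ones the caller drops
    static LexerSketch javaOperatorsExcept(String... dropped) {
        Set<String> ops = new HashSet<>(List.of(
                "++", "--", "->", "&&", "||", "==", "+", "-", "*", "/", "=",
                "(", ")", "{", "}", ";"));
        List.of(dropped).forEach(ops::remove);
        return new LexerSketch(pythonIdentifiers(), ops);
    }

    List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        outer:
        while (i < src.length()) {
            if (Character.isWhitespace(src.charAt(i))) { i++; continue; }
            Matcher m = identifier.matcher(src).region(i, src.length());
            if (m.lookingAt()) { tokens.add(m.group()); i = m.end(); continue; }
            for (String op : operators) {            // longest-first scan
                if (src.startsWith(op, i)) { tokens.add(op); i += op.length(); continue outer; }
            }
            throw new IllegalArgumentException("unexpected character at " + i);
        }
        return tokens;
    }

    public static void main(String[] args) {
        LexerSketch lexer = javaOperatorsExcept("->", "&&");
        // Prints: [x, ++, =, y, ||, _z]
        System.out.println(lexer.tokenize("x++ = y || _z"));
    }
}
```

Because the operator table is sorted longest-first, "++" wins over "+" — exactly the multi-character-operator capability StreamTokenizer lacks, but still driven by declarations rather than hand-written regexes.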


James Iry said...

Take a look at Scala's parser combinator library for inspiration. It allows you to create a full-blown EBNF-ish grammar in code, and you can specify down to whatever level you want. But for many purposes you can explicitly pull out standard lexing concepts like keywords and delimiters.

JFKBits said...

james: Excellent pointer, and a good reminder of the suitability of Scala for implementing language processors (starting with case classes and pattern matching).