Friday, May 09, 2008

Thoughts and Code for the S-Expression Lexer

My recent project has been a tiny Schemish interpreter, and more recently I've been considering a design that would work well as the Java counterpart to Kamin's chapter 1 interpreter: a calculator language using Lisp syntax, intended to be morphed into any one of a variety of different languages based on S-expression syntax.

So, this post is just for sharing some observations from working on the lexical analyzer part, as well as its code.

Just so it's abundantly clear, the problem a lexical analyzer solves, for S-expressions anyway, is to get a stream of parens, quotes, and symbols out of a character input source.
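
To make that concrete, here's the kind of breakdown we're after for a couple of inputs (just an illustration, not the output of any particular tool):

// Input:   (define x 42)
// Tokens:  (   define   x   42   )
//
// Input:   '(lambda (x) x)
// Tokens:  '   (   lambda   (   x   )   x   )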

The apostrophe quote operator, as used for creating list literals like '(lambda (x) x), appears to be a token requiring no whitespace. I'd never seen the apostrophe without preceding whitespace, but after testing MzScheme, ChezScheme and a few other minor Scheme implementations, it's apparent that a'b'c is the same as a 'b 'c. I've not done any serious development in Scheme, so I wonder whether this is common knowledge among Scheme programmers.
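
Since the tokenizer below marks the apostrophe as its own "ordinary" character, this behavior falls out naturally. Here's a check one could add to the unit tests at the end of this post (a hypothetical test method, written against the test() helper defined there):

// Hypothetical addition to LispTokenizerTest: a'b'c lexes to the
// same five tokens as the whitespace-separated version.
public void testQuoteNeedsNoWhitespace()
{
    test("a'b'c", new String[]{"a","'","b","'","c"});
    test("a 'b 'c", new String[]{"a","'","b","'","c"});
}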

Similarly, Scheme read-eval-print loops appear to accept forests of S-expressions at the prompt, not just one expression. If you type 1 2 3 4, the values get echoed back. This is useful for multiple defines, or for pasting code. Obviously, any REPL should support this behavior if at all possible.
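
In terms of this post's tokenizer, supporting a forest just means the driver keeps pulling expressions until the token stream runs dry. A sketch, where parse() and eval() are hypothetical stand-ins for interpreter pieces not shown here:

// Sketch of a forest-friendly REPL driver. parse() and eval() are
// hypothetical placeholders; only the tokenizer exists in this post.
LispTokenizer tzr = new LispTokenizer("(define x 42) (+ x 1) x");
while(tzr.hasNext())
{
    Object expr = parse(tzr);       // consume exactly one S-expression
    System.out.println(eval(expr)); // echo each value, as Scheme REPLs do
}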

I was happy to see that StreamTokenizer tracks line numbers. Having line numbers available "for free" instantly improves error messages. For what it's worth, I want my scanner to track column numbers too, but I understand if James Gosling (listed as the author of StreamTokenizer) didn't want to get into arguments about what a tab is worth.
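
With the line number riding along on each token (see the Token class below), a usable error message is one concatenation away. A hypothetical reporting helper, purely for illustration:

// Hypothetical error-reporting helper; Token is defined later in this post.
static String syntaxError(Token token, String message)
{
    return "line " + token.line + ": " + message + " near '" + token + "'";
}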

The xUnit test frameworks are a most welcome tool for testing language processors. In 1998 I was fiddling with diff-based Perl scripts for automated regression testing. Writing JUnit tests is a lot more fun, productive, and exact than relying on textual comparison with blessed output.

Rather than subclass StreamTokenizer, I wanted to change the consumer's model of looping from "while the current token is not end-of-file" to a standard Iterator terminating on !hasNext(). This requires calling pushBack() after every lookahead, but I decided the ease of use for the client gave it a slight edge.
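
To see the difference in the consumer's loop, compare a sketch of the raw StreamTokenizer style (which would also need to declare or catch IOException) with the Iterator style of the LispTokenizer below:

// Raw StreamTokenizer style: the client watches for TT_EOF itself.
while(tokenizer.nextToken() != StreamTokenizer.TT_EOF)
{
    // ... examine tokenizer.ttype and tokenizer.sval ...
}

// Iterator style, as LispTokenizer provides below.
while(tzr.hasNext())
{
    Token token = tzr.nextToken();
    // ... examine token.type and token.text ...
}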

Using an iterator means you need an object to return, so there is a small Token class returned by the iterator. Token bundles the public fields of StreamTokenizer that represent the token value. I opted not to have StreamTokenizer parse numbers. Since the intended use is ultimately interpreters for arbitrary languages merely based on S-expression syntax, I needed to let them have their own numeric literal syntax and domain of representation (double? float? BigInteger?). Now for some code. Token looks like this:


package jfkbits;
import java.io.StreamTokenizer;

public class Token
{
    // Words (atoms) are reported by StreamTokenizer as TT_WORD;
    // we re-export that constant under a Lisp-flavored name.
    public static final int SYMBOL = StreamTokenizer.TT_WORD;
    public int type;
    public String text;
    public int line;

    public Token(StreamTokenizer tzr)
    {
        this.type = tzr.ttype;
        this.text = tzr.sval;
        this.line = tzr.lineno();
    }

    public String toString()
    {
        switch(this.type)
        {
            case SYMBOL:
            case '"':
                return this.text;
            default:
                // Ordinary characters like ( ) ' carry their own
                // character code as the type.
                return String.valueOf((char)this.type);
        }
    }
}
Here's how the type field (an int, as defined in StreamTokenizer) works: for "ordinary characters", the type is the character literal cast to an int. The LispTokenizer marks '(', ')', and the apostrophe quote operator '\'' as ordinary. In code using Token, this reads very naturally, as in if(token.type == '(') return parseExprList();. For "words", atoms in our case, the type is a negative int defined by StreamTokenizer.TT_WORD, which Token redefines as SYMBOL. If we read a Token t with t.type==Token.SYMBOL, the good stuff (like "42", "x", "eval", or "lambda") is in t.text. String literals have the type code of the delimiter, so t.type=='"' means we've got a string literal, the contents of which (without the delimiters!) are also in t.text.
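
Put together, a parser's dispatch on Token might look like the following sketch; parseExprList(), parseQuoted(), StringLit, and Atom are hypothetical names, not part of the code in this post:

// Sketch of dispatching on Token.type in a hypothetical parser.
Object parseExpr(LispTokenizer tzr)
{
    Token t = tzr.nextToken();
    switch(t.type)
    {
        case '(':          return parseExprList(tzr); // rest of a list
        case '\'':         return parseQuoted(tzr);   // 'x => (quote x)
        case '"':          return new StringLit(t.text);
        case Token.SYMBOL: return new Atom(t.text);   // "42", "x", "lambda", ...
        default:
            throw new RuntimeException("line "+t.line+": unexpected token "+t);
    }
}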

And what about string literals? Strictly speaking, the same decision I made about numeric literals should also apply to string literals: different languages have different syntaxes and potentially different representations. Perhaps I should not configure StreamTokenizer to recognize string literals. In that case the parser would get atoms containing the double quotes themselves, and would be expected to pick them apart. Currently, I don't expect this tool to be used for studying string literals very much.

And finally, for the code itself:

package jfkbits;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.Iterator;

public class LispTokenizer implements Iterator
{
    // Instance variables have default access to allow unit tests access.
    StreamTokenizer m_tokenizer;
    IOException m_ioexn;

    /** Constructs a tokenizer that scans input from the given string.
     * @param src A string containing S-expressions.
     */
    public LispTokenizer(String src)
    {
        this(new StringReader(src));
    }

    /** Constructs a tokenizer that scans input from the given Reader.
     * @param r Reader for the character input source
     */
    public LispTokenizer(Reader r)
    {
        if(r == null)
            r = new StringReader("");
        BufferedReader buffrdr = new BufferedReader(r);
        m_tokenizer = new StreamTokenizer(buffrdr);
        m_tokenizer.resetSyntax(); // We don't like the default settings

        // Everything from NUL through space is whitespace; everything
        // above space is a word character unless overridden below.
        m_tokenizer.whitespaceChars(0, ' ');
        m_tokenizer.wordChars(' '+1, 255);
        m_tokenizer.ordinaryChar('(');
        m_tokenizer.ordinaryChar(')');
        m_tokenizer.ordinaryChar('\'');
        m_tokenizer.commentChar(';');
        m_tokenizer.quoteChar('"');
    }

    public boolean hasNext()
    {
        if(m_ioexn != null)
            return false;
        try
        {
            m_tokenizer.nextToken();
        }
        catch(IOException e)
        {
            m_ioexn = e;
            return false;
        }
        if(m_tokenizer.ttype == StreamTokenizer.TT_EOF)
            return false;
        // Not at end of file: un-read the lookahead token so the
        // next call to next() sees it again.
        m_tokenizer.pushBack();
        return true;
    }

    /** Returns the most recently caught IOException, if any.
     * @return the saved exception, or null if none has occurred
     */
    public IOException getIOException()
    {
        return m_ioexn;
    }

    /** Convenience wrapper around next() that spares clients the cast. */
    public Token nextToken()
    {
        return (Token)next();
    }

    public Object next()
    {
        try
        {
            m_tokenizer.nextToken();
        }
        catch(IOException e)
        {
            // Save the exception as a "return code": clients get null
            // here and can retrieve the cause via getIOException().
            m_ioexn = e;
            return null;
        }

        Token token = new Token(m_tokenizer);
        return token;
    }

    public void remove()
    {
        // Removal makes no sense for a token stream.
        throw new UnsupportedOperationException();
    }
}
A token stream can be processed something like this:

LispTokenizer tzr = new LispTokenizer("(define x 42)");
for(Iterator it=tzr; it.hasNext(); )
{
    // Iterator.next() returns Object, so we cast; processToken()
    // stands for whatever the client does with each token.
    Token token = (Token)it.next();
    processToken(token);
}
And finally, some unit tests.

package jfkbits;

import java.io.StreamTokenizer;
import java.util.Iterator;

import junit.framework.TestCase;

public class LispTokenizerTest extends TestCase
{
    public LispTokenizerTest(String name)
    {
        super(name);
    }

    public void testLispTokenizerIterator()
    {
        LispTokenizer tzr;

        tzr = new LispTokenizer("");
        assertFalse(tzr.hasNext());

        tzr = new LispTokenizer(" ");
        assertFalse(tzr.hasNext());

        tzr = new LispTokenizer("\n");
        assertFalse(tzr.hasNext());

        tzr = new LispTokenizer("7");
        assertTrue(tzr.hasNext());
        checkToken(1, "7", Token.SYMBOL, tzr.next());
        assertFalse(tzr.hasNext());

        tzr = new LispTokenizer("()");
        assertTrue(tzr.hasNext());
        checkToken(1, null, '(', tzr.next());
        checkToken(1, null, ')', tzr.next());
        assertFalse(tzr.hasNext());

        tzr = new LispTokenizer("(newline)");
        assertTrue(tzr.hasNext());
        checkToken(1, null, '(', tzr.next());
        checkToken(1, "newline", Token.SYMBOL, tzr.next());
        checkToken(1, null, ')', tzr.next());
        assertFalse(tzr.hasNext());
    }

    private void checkToken(int line, String text, int type, Object tokenObj)
    {
        assertNotNull(tokenObj);
        assertTrue(tokenObj instanceof Token);
        Token token = (Token)tokenObj;
        assertEquals(line, token.line);
        if(text != null && token.type == StreamTokenizer.TT_WORD)
            assertEquals(text, token.text);
        assertEquals(type, token.type);
    }

    public void testCharacterMapping()
    {
        assertEquals((int)'(', mkTokenizer("(").nextToken().type);
        assertEquals((int)')', mkTokenizer(")").nextToken().type);
        assertEquals((int)'\'', mkTokenizer("'").nextToken().type);

        assertEquals(StreamTokenizer.TT_WORD, mkTokenizer("0").nextToken().type);
    }

    public void testSimpleLispExpressions()
    {
        test("", new String[]{});
        test("()", new String[]{"(",")"});
        test(" ()", new String[]{"(",")"});
        test("\n()", new String[]{"(",")"});
        test("() ", new String[]{"(",")"});
        test("()\n", new String[]{"(",")"});
    }

    public void testLispExpressionsWithComments()
    {
        test(";Comment here\n()", new String[]{"(",")"});
    }

    public void testLispExpressionsWithStrings()
    {
        test("\"\"", new String[]{""});
        test("\"a\"", new String[]{"a"});
        test("\" a\"", new String[]{" a"});
        test("\"a \"", new String[]{"a "});
        test("(print \"Hello world.\\n\");",
             new String[]{"(","print","Hello world.\n",")"});
    }

    public void testFactorial()
    {
        String src =
            ";;\n"+
            ";; Compute the factorial of a given number\n"+
            "\n"+
            "(defun fact (n)\n"+
            " (if (< n 2)\n"+
            " 1\n"+
            " (* n (fact (- n 1)))\n"+
            " )\n"+
            " )\n";
        String[] expected = {
            "(","defun","fact","(","n",")",
            "(","if","(","<","n","2",")",
            "1",
            "(","*","n","(","fact","(","-","n","1",")",")",")",
            ")",
            ")"
        };
        test(src, expected);
    }

    static void test(String src, String[] expectedTokens)
    {
        LispTokenizer tzr = mkTokenizer(src);
        int i = 0;
        for(Iterator it=tzr; it.hasNext(); i++)
        {
            Token token = (Token)it.next();
            assertNotNull(token);
            assertTrue("Expected "+expectedTokens.length+" tokens, got more",
                i < expectedTokens.length);
            assertEquals(expectedTokens[i], token.toString());
        }
        // Also catch the case where we got fewer tokens than expected.
        assertEquals(expectedTokens.length, i);
    }

    static LispTokenizer mkTokenizer(String src)
    {
        return new LispTokenizer(src);
    }
}

2 comments:

Mike Kucera said...

This is a very nice and simple implementation that looks like it could be very useful. The only mistake I see, from a standard Java idiom perspective, is swallowing and ignoring the IOException thrown by StreamTokenizer.nextToken(). What if something goes wrong when reading from the input? The standard way of handling this in Java is try/catch.

I would argue that for this reason it's not appropriate for LispTokenizer to implement Iterator. You can still have nextToken() and hasNext() methods, so your class can be used just like an iterator, but nextToken() should throw IOException.

JFKBits said...

Hi Mike,

This is a reasonable point, and I may change it (I intend to make the whole package available soon). I did want to point out that next() isn't swallowing and ignoring the exception: it saves the exception, accessible with getIOException, and returns null from next(), turning it from an exception mechanism into a return code mechanism. But I agree this is not the most standard way of handling things.

I hope you can follow along with me for the rest of the interpreter basis!