Wednesday, May 14, 2008

Thoughts on an S-Expression Parser

In this post we look at a tiny Scheme parser, and generalize it to a non-language-specific S-expression parser that can be subclassed to handle any language design, as a base class for studying language varieties.

As I mentioned last time, I recently wrote a tiny Schemish interpreter, and then started extracting from it a basic interpreter platform that other people or myself could use to try out different language designs or implementations. One practical goal would be for instructors to provide the framework and have students modify it. The approach is not mine, it's from Sam Kamin's "Programming Languages: An Interpreter-Based Approach" (1990).

The original Schemish parser had these two parsing methods:


protected Expr parseExpr() throws ParseError
{
Token token = consumeToken();
switch(token.type)
{
case '"': return new StringLit(token.text);
case '(': return parseExprList(token);
default:
// Test the token to see if it's a numeric
Int intExpr = Int.fromString(token.text);
if(intExpr == null)
return new Atom(token.text);
else
return intExpr;
}
}

protected Expr parseExprList(Token openParen) throws ParseError
{
Vector acc = new Vector();
while(peekToken().type != ')')
{
Expr expr = parseExpr();
acc.add(expr);
}
Token closeParen = consumeToken();

// Handle special forms
ExprList retval = null;
if(acc.size() > 0 && ((Expr)acc.firstElement()).isAtom())
{
Expr head = (Expr)acc.firstElement();
if(head.isAtom("lambda"))
{
String headName = head.getAtom().getText();
String lambdaUsage = "Syntax error: "+
"expected ("+headName+" (arg ...) body)";
if(acc.size() != 3)
throwError(openParen, lambdaUsage);
Expr argExpr = (Expr)acc.get(1);
if(!argExpr.isExprList())
throwError(openParen, lambdaUsage);
ExprList argList = (ExprList)argExpr;
Expr[] args = argList.getElements();
HashSet argSet = new HashSet(); // to check for duplicates
for(int i=0; i < args.length; ++i)
{
if(!args[i].isAtom())
throwError(openParen, lambdaUsage);
boolean wasAdded = argSet.add(args[i]);
if(!wasAdded)
throwError(openParen, "Syntax error: argument "+
args[i].getAtom()+" appears more than once");
}
Expr bodyExpr = (Expr)acc.get(2);
retval = Lambda(argList, bodyExpr);
}
}

if(retval == null)
retval = new ExprList(acc);

retval.filename = m_filename;
retval.firstLine = openParen.line;
retval.lastLine = closeParen.line;
return retval;
}

The lambda-handling code is big, and obscures the structure of the list parsing part. That may not be so bad if you're working on this one interpreter, you don't forget the basic list-parsing structure, because it's simple. You'll care more about all the special forms. But for my purposes, of wanting to write many interpreters from the same code base, we want to be able to talk about different interpreters, and it's a little awkward: you're always sharing code patches, and having to explain where they go. Instead, why not give a sufficient base class, and share complete subclasses?

The improvement gives a parser class which handles only S-expressions; it knows only atoms and expression lists, it doesn't know any language constructs at all: it knows no keywords. You can think of it as generating just a tree structure, like an XML parser. Parsers for a particular language design will be written as subclasses, and will override the methods constructAtom and constructExprList. These methods are reminiscent of the actions in YACC-like parser generators, the blocks of code that construct parse tree elements given data named by the grammar rule symbols and from the general lexer and parser state (e.g. line numbers in our case).

Thus, parseExpr and parseExprList reduce to tiny fragments and subclasses can flesh out the meat of special forms in constructExprList:

public Expr parseExpr() throws ParseException
{
Token token = consumeToken();
Expr retval = (token.type == '(')?
parseExprList(token)
: constructAtom(token);
return retval;
}

protected Expr parseExprList(Token openParen)
throws ParseException
{
Vector acc = new Vector();
while(peekToken().type != ')')
{
Expr element = parseExpr();
acc.add(element);
}
Token closeParen = consumeToken();

Expr retval = constructExprList(acc, m_filename,
openParen.line, closeParen.line);
return retval;
}

protected Expr constructAtom(Token token)
{
return new Atom(token.text);
}

protected Expr constructExprList(
AbstractCollection exprs, String srcId,
int startLine, int endLine)
{
ExprList retval = new ExprList(exprs);
retval.filename = srcId;
retval.firstLine = startLine;
retval.lastLine = endLine;
return retval;
}

Now the Schemish interpreter can subclass Parser and call it SchemeParser, overriding constructExprList to handle special syntactic forms, and overriding constructAtom to handle language-specific literals, such as character literals or rational number literals (2/3).

1 comments:

Mike Kucera said...

Thanks for this post. I think I might find something like this useful for a parser testing framework I'm thinking of writing. I have a parser for C++ (written in Java) and I want to verify that the ASTs it outputs are correct.

The testing strategy I'm thinking of is this. Provide a piece of code in C++, then write the equivalent code using just S-expressions where the structure of the AST is explicit. Then parse both pieces of code and verify the the resulting ASTs are identical.

Right now writing test cases is very time consuming and verbose. I think something like this would make it a lot easier.