Wednesday, May 14, 2008

Thoughts on an S-Expression Parser

In this post we look at a tiny Scheme parser, and generalize it to a non-language-specific S-expression parser that can be subclassed to handle any language design, as a base class for studying language varieties.

As I mentioned last time, I recently wrote a tiny Schemish interpreter, and then started extracting from it a basic interpreter platform that other people or myself could use to try out different language designs or implementations. One practical goal would be for instructors to provide the framework and have students modify it. The approach is not mine, it's from Sam Kamin's "Programming Languages: An Interpreter-Based Approach" (1990).

The original Schemish parser had these two parsing methods:


protected Expr parseExpr() throws ParseError
{
Token token = consumeToken();
switch(token.type)
{
case '"': return new StringLit(token.text);
case '(': return parseExprList(token);
default:
// Test the token to see if it's a numeric
Int intExpr = Int.fromString(token.text);
if(intExpr == null)
return new Atom(token.text);
else
return intExpr;
}
}

protected Expr parseExprList(Token openParen) throws ParseError
{
Vector acc = new Vector();
while(peekToken().type != ')')
{
Expr expr = parseExpr();
acc.add(expr);
}
Token closeParen = consumeToken();

// Handle special forms
ExprList retval = null;
if(acc.size() > 0 && ((Expr)acc.firstElement()).isAtom())
{
Expr head = (Expr)acc.firstElement();
if(head.isAtom("lambda"))
{
String headName = head.getAtom().getText();
String lambdaUsage = "Syntax error: "+
"expected ("+headName+" (arg ...) body)";
if(acc.size() != 3)
throwError(openParen, lambdaUsage);
Expr argExpr = (Expr)acc.get(1);
if(!argExpr.isExprList())
throwError(openParen, lambdaUsage);
ExprList argList = (ExprList)argExpr;
Expr[] args = argList.getElements();
HashSet argSet = new HashSet(); // to check for duplicates
for(int i=0; i < args.length; ++i)
{
if(!args[i].isAtom())
throwError(openParen, lambdaUsage);
boolean wasAdded = argSet.add(args[i]);
if(!wasAdded)
throwError(openParen, "Syntax error: argument "+
args[i].getAtom()+" appears more than once");
}
Expr bodyExpr = (Expr)acc.get(2);
retval = Lambda(argList, bodyExpr);
}
}

if(retval == null)
retval = new ExprList(acc);

retval.filename = m_filename;
retval.firstLine = openParen.line;
retval.lastLine = closeParen.line;
return retval;
}

The lambda-handling code is big, and obscures the structure of the list parsing part. That may not be so bad if you're working on this one interpreter, you don't forget the basic list-parsing structure, because it's simple. You'll care more about all the special forms. But for my purposes, of wanting to write many interpreters from the same code base, we want to be able to talk about different interpreters, and it's a little awkward: you're always sharing code patches, and having to explain where they go. Instead, why not give a sufficient base class, and share complete subclasses?

The improvement gives a parser class which handles only S-expressions; it knows only atoms and expression lists, it doesn't know any language constructs at all: it knows no keywords. You can think of it as generating just a tree structure, like an XML parser. Parsers for a particular language design will be written as subclasses, and will override the methods constructAtom and constructExprList. These methods are reminiscent of the actions in YACC-like parser generators, the blocks of code that construct parse tree elements given data named by the grammar rule symbols and from the general lexer and parser state (e.g. line numbers in our case).

Thus, parseExpr and parseExprList reduce to tiny fragments and subclasses can flesh out the meat of special forms in constructExprList:

public Expr parseExpr() throws ParseException
{
Token token = consumeToken();
Expr retval = (token.type == '(')?
parseExprList(token)
: constructAtom(token);
return retval;
}

protected Expr parseExprList(Token openParen)
throws ParseException
{
Vector acc = new Vector();
while(peekToken().type != ')')
{
Expr element = parseExpr();
acc.add(element);
}
Token closeParen = consumeToken();

Expr retval = constructExprList(acc, m_filename,
openParen.line, closeParen.line);
return retval;
}

protected Expr constructAtom(Token token)
{
return new Atom(token.text);
}

protected Expr constructExprList(
AbstractCollection exprs, String srcId,
int startLine, int endLine)
{
ExprList retval = new ExprList(exprs);
retval.filename = srcId;
retval.firstLine = startLine;
retval.lastLine = endLine;
return retval;
}

Now the Schemish interpreter can subclass Parser and call it SchemeParser, overriding constructExprList to handle special syntactic forms, and overriding constructAtom to handle language-specific literals, such as character literals or rational number literals (2/3).

3 comments:

Mike Kucera said...

Thanks for this post. I think I might find something like this useful for a parser testing framework I'm thinking of writing. I have a parser for C++ (written in Java) and I want to verify that the ASTs it outputs are correct.

The testing strategy I'm thinking of is this. Provide a piece of code in C++, then write the equivalent code using just S-expressions where the structure of the AST is explicit. Then parse both pieces of code and verify the the resulting ASTs are identical.

Right now writing test cases is very time consuming and verbose. I think something like this would make it a lot easier.

eMBee said...

i am trying to implement a parser based on this right now. the tokenizer is straightforward but i am stuck here because your example is missing some bits. i assume the classes Vector and AbstractCollection are standard java.util classes.

Atom understand, but i am not sure what to do with Expr and ExprList.

what i know is needed is a Class to create a list or vector which can contain more lists and atoms.

is Expr the base class here that Atom and ExprList are derived from?

then further, consumeToken should simply drive the LispTokenizer to return the next token, but for peekToken the LispTokenizer is missing something. or am i missing something?

anyways, it looks like i can get it to work.

you can see my efforts here on my rosettacode page

i am new to java, and this is actually my first project in that language.

eMBee said...

i have slightly simplified the code, as i have discovered that if i make ExprList inherit from Vector i get nice properties to add elements to the list and thus conveniently manipulate the structure.

the code is now public on the S-Expressions Page

thank you again for sharing your code.

greetings, eMBee.