Friday, May 30, 2008

Multiple Definitions

We had an interesting discussion at work yesterday debating whether Mathematica supports multimethods with its variety of pattern-matching. So I'm taking the opportunity to do a mini-survey of the multiple dispatch spectrum, starting with overloading. As for Mathematica, it clearly gives you the power of selecting from multiple definitions based on runtime information; more on this in a minute.

Overloading allows you to write different definitions of a function or method, and the definition used when you call the function depends on the number and static types of the arguments you pass to it. That is, overload resolution is done by the compiler at compile time with static type analysis: the declared types of your arguments, and exactly those types, determine which definition is chosen.

Multimethods also let you write different definitions of a function or method, but the definition used when you call the function depends on the number and types of the arguments as examined at runtime. Now you can write one definition for a class, and specializations for its subclasses if so desired, and the definition will be chosen based on the actual type of the arguments at runtime.
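
A small Java example (mine, not from the original discussion; the class names are hypothetical) makes the compile-time nature of overload resolution concrete:

class Shape {}
class Circle extends Shape {}

public class OverloadDemo
{
    static String describe(Shape s)  { return "some shape"; }
    static String describe(Circle c) { return "a circle"; }

    public static void main(String[] args)
    {
        Shape s = new Circle();
        // Overload resolution happens at compile time from the static type
        // of s (Shape), so this prints "some shape" even though s really
        // holds a Circle. A multimethod would consult the runtime type and
        // pick the Circle definition instead.
        System.out.println(describe(s));
    }
}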

You get into a fuzzy gray third area if the runtime values can also be used to select different definitions. This is where Mathematica lies, because its pattern matches can be used to differentiate between a list of at least one element and a list of at least two elements, or between the symbols Null and Infinity. What's useful for writing symbolic algorithms turns out to be useful for regular programmers.

It seems that ML's structural pattern matching is also in this fuzzy gray third area, and that helps me make an interesting connection. For my purposes, multiple dispatch is interesting because it's the way to do expression tree traversal. That is, it lets you write pretty printers and type checkers and the like without needing to code the dispatch yourself (if the node is actually a lambda abstraction, do this, but if it's a cons cell, do that). What I'm noticing now is that one way or another, multimethods and pattern matching are giving you the notational convenience that I enjoy in writing tree traversals, with still perhaps an edge to pattern matching on that score.
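To make that concrete, here is the kind of hand-coded dispatch (a self-contained Java sketch of my own; the node types are hypothetical) that multimethods or pattern matching let you avoid writing in a tree traversal:

abstract class TreeNode {}

class VarNode extends TreeNode
{
    String name;
    VarNode(String name) { this.name = name; }
}

class LambdaNode extends TreeNode
{
    String arg;
    TreeNode body;
    LambdaNode(String arg, TreeNode body) { this.arg = arg; this.body = body; }
}

public class PrettyPrinter
{
    // The dispatch we'd rather not code by hand: test the runtime type,
    // cast, and recurse.
    static String print(TreeNode n)
    {
        if(n instanceof LambdaNode)
        {
            LambdaNode lam = (LambdaNode)n;
            return "(lambda (" + lam.arg + ") " + print(lam.body) + ")";
        }
        else if(n instanceof VarNode)
            return ((VarNode)n).name;
        else
            return "?";
    }

    public static void main(String[] args)
    {
        System.out.println(print(new LambdaNode("x", new VarNode("x"))));
        // prints: (lambda (x) x)
    }
}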

Wednesday, May 28, 2008

The ucc Compiler

Someone with the name dreamAnders has posted announcements to comp.compilers and comp.lang.c for his small open-source C compiler, ucc, the chief attraction of which is that the source code is meant to be small and self-explanatory.

Wednesday, May 21, 2008

On "I can't believe I'm praising Tcl"

Today we look at a refreshing use case for a programming language, one where economy of expression in calling functions from a read-eval-print loop is prized. Raganwald recently tagged "I can't believe I'm praising Tcl", in which the embedded-systems author helps us understand how Tcl makes for a great debugger command environment, and how the "pop infix languages (C/Java/Python/Ruby/you name it)" don't.

In this case the author wants to define some glue function or functions, and then, in the language's interactive interpreter, call his functions over and over. He's not programming; he's commanding, so the function calls need to be short and sweet, because he'll be typing them hour after hour while he thinks about the real problem: a buggy piece of embedded hardware. The author wants a command shell, where he uses his command interface to an embedded device as a kind of gdb replacement. An example session to set breakpoints, inspect memory, etc., looks like this:


$ pmem 0 stat
IDLE
$ pmem 0 bkpt 0 0xbff
$ pmem 0 bkpt 1 0xa57
$ pmem 0 cmd run
$ pmem 0 stat
DEBUG
$ pmem 0 pc
0xbff
$ pmem 0 rstack
3 return addresses
addr 0: 0x0005
addr 1: 0x05a8
addr 2: 0x0766
$ pmem 0 cmd stp
$ pmem 0 pc
0xc00
The question then is whether a language you may be designing or using could support something close to this syntactic economy for calling functions.

The argument for Tcl over the pop infix languages is perhaps best summarized by this quote:
And then we have interactive shells. And in Python it’s doit("xx","yy"). And in Lisp it’s (doit "xx" "yy"), or (doit :xx :yy), or (doit xx yy) if you make it a macro. And in Ruby it’s doit :xx :yy, if you use symbols and omit parens. And that’s about as good as you can get without using your own parser as in doit "xx yy", which can suck in the (more rare) case when you do need to evaluate expressions before passing parameters, and doesn’t completely remove overhead. Also note how all these languages use (), which makes you press Shift, instead of [] which doesn’t. Ruby and Perl let you omit (), but it costs in readability. And [] is unanimously reserved for less important stuff than function calls.

Analysis


First, we see the emphasis is not on defining functions (programming) but on calling them: the syntax of use. The author wants an internal DSL (Domain Specific Language).

Second, regarding (), it should be noted that Scheme lets you use [] as well as (). Good Scheme style reserves [] for let bindings and the like, but if you open up DrScheme or Chez Scheme, define some choice Turtle graphics functions, and start typing commands like [penup] [rt 45] [pendown] [fd 100], it will work fine.

One thing the author noted is that Tcl's preference for strings over variables makes bkpt a string and $bkpt a variable, whereas in the pop infix languages it's the variables that get the lexical economy and the strings that need delimiters. Because of this preference, calling a Tcl command lets you pass in what look like symbols and treat them as strings in the command definition. Hence, for the author's use case, a chief consideration seemed to be a way to write symbolic arguments, where the command in question takes a one-of-several subcommand or option name, without lexical decoration like string delimiters, single quotes, or colons. I wonder if this was really a language design goal of Tcl, because it's hard to understand the motivation for the string-vs-variable syntax any other way. For all that, enumerated types or sum types are a known language feature that meets the author's criterion. In Standard ML you could define a datatype Subcommand = bkpt | status | memset or the like, and then undecorated references like bkpt can appear as arguments.
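
A rough Java analogue of that datatype idea (my own sketch, not the author's; all names here are hypothetical) uses an enum for the subcommand names:

enum Subcommand { stat, bkpt, cmd, pc, rstack }

public class DebugShell
{
    // Hypothetical glue method standing in for the author's pmem command.
    static void pmem(int unit, Subcommand sub, int... args)
    {
        StringBuilder line = new StringBuilder("pmem " + unit + " " + sub);
        for(int arg : args)
            line.append(" 0x").append(Integer.toHexString(arg));
        System.out.println(line);
        // ... talk to the embedded device here ...
    }

    public static void main(String[] args)
    {
        // With a static import of the Subcommand constants these could read
        // pmem(0, stat) and pmem(0, bkpt, 0, 0xbff), closer to the Tcl commands.
        pmem(0, Subcommand.stat);
        pmem(0, Subcommand.bkpt, 0, 0xbff);
    }
}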

Note that if you define your functions as Scheme macros, to address the symbol/string problem, and if you modified Scheme to accept an S-expression forest on each line (i.e. no need to delimit a top-level input line with parens), you'd have the economy of expression of Tcl. I think this is worth considering in circles where Scheme may be more familiar.

Footnote

This could be a nice motif for the "language design of the day": extend the basic Scheme-like interpreter to support an extensible debugger command interface.

Friday, May 16, 2008

Call by Need Lambda a Poor Man's Macro?

I've been seriously considering why more languages don't include a call-by-need lambda (hereafter called "lazy lambda"). With its delayed evaluation it offers macro-like power for writing non-looping forms (it's a poor fit for looping forms, since each argument is evaluated at most once by design), it doesn't have the bloating effect of macro expansion (if code size is more important to you than the time overhead of a function call), and it is hygienic. It's not a cure-all, but it seems to be an approach that can still be wielded effectively by trained professionals.

How to Use Lazy Lambda


Here's how a lazy lambda would work in an existing language like Scheme. You have a new abstraction lazy-lambda with precisely the same syntax as lambda. When applied, lazy-lambda follows call by need evaluation rules. That means arguments to the lazy procedure are not evaluated at the call site, but only when their corresponding formal parameter is first referenced inside the body. On this reference, the evaluated argument value is remembered for future references inside the body. Here's how you might write something like Groovy's Elvis operator:

(define ?: (lazy-lambda (expr default)
(if (null? expr) default expr)))
The thing I like is that this is automatically hygienic: it works fine even if you call it in an environment with variables named expr or default.

Implementation


I like to divide my thinking about new features into two phases: introduction and use. When lazy-lambda is introduced, the parser needs to create a structure essentially identical to that of a lambda, namely an arguments list and a body expression, but of course it needs to be marked as a different type from lambda so the evaluator can distinguish the two. Lazy lambda is used in two ways, once when it is evaluated (e.g. when a lazy-lambda expression is returned from a function) and once when it is applied.

Summary of Lazy Lambda Implementation
Introduction: '(lazy-lambda args body)
Evaluation: '(closure (lazy-lambda args body) env)
Application: Bind args to thunks, evaluate thunks in a memoizing way

When lazy-lambda is evaluated, it should create a closure, the pair of the lazy-lambda expression and the environment in which it was evaluated. This closure needs to be marked as a different type from a regular closure. Alternatively the evaluator can be arranged to check the type of the "lambda" expression: a closure may look like '(closure (lambda args body) env) or '(closure (lazy-lambda args body) env).

What happens when a lazy-lambda closure is applied? We know you don't evaluate the arguments, but what then? As with eager closures, you first create a new environment scope. Then, instead of binding the formal arguments to the evaluated values of the arguments, you bind the formal arguments to some box (container) holding the unevaluated argument expressions. The container needs to be distinct from any other language type so that the evaluator knows how to treat it. That is, once we set up the new environment scope, we will simply evaluate the body of the lazy-lambda under this new environment, and the references to the formal arguments need to be handled in this special memoizing way. So let's introduce an expression type delayed, which is not directly creatable in the language, and we bind each formal argument x with actual argument expression A to the "value" (delayed A env). The env value will be needed when we evaluate A, because we will need to evaluate it in the calling environment, not whatever environment is in effect when the symbol is first referenced. (Think about what gets returned by (lambda (x) ((lazy-lambda (x) (begin (set! x (+ 1 x)) x)) x)).) Then when the evaluator handles variable references and gets a delayed value back from the environment lookup, it's time to do the memoizing: evaluate A in environment env, rebind (set!) the referenced variable to its evaluated value, and return that.
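
Here is a minimal, self-contained Java sketch (my own, not from the interpreter) of the memoizing behavior a delayed value needs. The interpreter's real Expr and Environment types are replaced by a Thunk callback standing in for "evaluate A in the calling environment":

public class Delayed
{
    public interface Thunk { Object evaluate(); }

    private Thunk thunk;     // stands in for "argument expression A plus its calling env"
    private Object value;    // cached result of the first evaluation
    private boolean forced;  // distinguishes a legitimately null value from "not yet forced"

    public Delayed(Thunk thunk) { this.thunk = thunk; }

    public Object force()
    {
        if(!forced)
        {
            value = thunk.evaluate();  // evaluate once, logically in the caller's environment
            forced = true;
            thunk = null;              // drop the reference so the expression/env can be collected
        }
        return value;
    }

    public static void main(String[] args)
    {
        Delayed d = new Delayed(new Thunk() {
            public Object evaluate()
            {
                System.out.println("evaluating argument expression");
                return Integer.valueOf(42);
            }
        });
        System.out.println(d.force()); // evaluates once, prints 42
        System.out.println(d.force()); // prints 42 again without re-evaluating
    }
}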

Conclusion


None of the popular scripting languages I can think of (JavaScript, Perl, Python, Ruby) has a macro facility, but most of them have anonymous functions which evaluate to closures in the Scheme sense. On the other hand, they also tend to have a richer set of control structures (Perl's unless), and they have other mechanisms (dare I say classes and objects?) which address most of the killer applications for macros, and hence for lazy-lambda. For all that, I'd figure that those languages, and tinier languages or DSLs, could still add this feature.

Macros are subtle things to get right, and I'm sure there are deficiencies I haven't addressed here. But that shouldn't stop us from thinking about these issues, and I think there's some potential value in the call by need lambda.

Wednesday, May 14, 2008

Thoughts on an S-Expression Parser

In this post we look at a tiny Scheme parser, and generalize it to a non-language-specific S-expression parser that can be subclassed to handle any language design, as a base class for studying language varieties.

As I mentioned last time, I recently wrote a tiny Schemish interpreter, and then started extracting from it a basic interpreter platform that other people or I could use to try out different language designs or implementations. One practical goal would be for instructors to provide the framework and have students modify it. The approach is not mine; it's from Sam Kamin's "Programming Languages: An Interpreter-Based Approach" (1990).

The original Schemish parser had these two parsing methods:


protected Expr parseExpr() throws ParseError
{
Token token = consumeToken();
switch(token.type)
{
case '"': return new StringLit(token.text);
case '(': return parseExprList(token);
default:
// Test the token to see if it's a numeric
Int intExpr = Int.fromString(token.text);
if(intExpr == null)
return new Atom(token.text);
else
return intExpr;
}
}

protected Expr parseExprList(Token openParen) throws ParseError
{
Vector acc = new Vector();
while(peekToken().type != ')')
{
Expr expr = parseExpr();
acc.add(expr);
}
Token closeParen = consumeToken();

// Handle special forms
ExprList retval = null;
if(acc.size() > 0 && ((Expr)acc.firstElement()).isAtom())
{
Expr head = (Expr)acc.firstElement();
if(head.isAtom("lambda"))
{
String headName = head.getAtom().getText();
String lambdaUsage = "Syntax error: "+
"expected ("+headName+" (arg ...) body)";
if(acc.size() != 3)
throwError(openParen, lambdaUsage);
Expr argExpr = (Expr)acc.get(1);
if(!argExpr.isExprList())
throwError(openParen, lambdaUsage);
ExprList argList = (ExprList)argExpr;
Expr[] args = argList.getElements();
HashSet argSet = new HashSet(); // to check for duplicates
for(int i=0; i < args.length; ++i)
{
if(!args[i].isAtom())
throwError(openParen, lambdaUsage);
boolean wasAdded = argSet.add(args[i]);
if(!wasAdded)
throwError(openParen, "Syntax error: argument "+
args[i].getAtom()+" appears more than once");
}
Expr bodyExpr = (Expr)acc.get(2);
retval = new Lambda(argList, bodyExpr);
}
}

if(retval == null)
retval = new ExprList(acc);

retval.filename = m_filename;
retval.firstLine = openParen.line;
retval.lastLine = closeParen.line;
return retval;
}

The lambda-handling code is big, and obscures the structure of the list-parsing part. That may not be so bad if you're working on just this one interpreter: you won't forget the basic list-parsing structure, because it's simple, and you'll care more about all the special forms. But for my purposes, wanting to write many interpreters from the same code base and to be able to talk about the differences between them, it's a little awkward: you're always sharing code patches and having to explain where they go. Instead, why not give a sufficient base class, and share complete subclasses?

The improvement gives a parser class which handles only S-expressions: it knows only atoms and expression lists, and knows no language constructs or keywords at all. You can think of it as generating just a tree structure, like an XML parser. Parsers for a particular language design are written as subclasses, overriding the methods constructAtom and constructExprList. These methods are reminiscent of the actions in YACC-like parser generators: the blocks of code that construct parse tree elements from data named by the grammar rule symbols and from the general lexer and parser state (e.g. line numbers in our case).

Thus, parseExpr and parseExprList reduce to tiny fragments and subclasses can flesh out the meat of special forms in constructExprList:

public Expr parseExpr() throws ParseException
{
Token token = consumeToken();
Expr retval = (token.type == '(')?
parseExprList(token)
: constructAtom(token);
return retval;
}

protected Expr parseExprList(Token openParen)
throws ParseException
{
Vector acc = new Vector();
while(peekToken().type != ')')
{
Expr element = parseExpr();
acc.add(element);
}
Token closeParen = consumeToken();

Expr retval = constructExprList(acc, m_filename,
openParen.line, closeParen.line);
return retval;
}

protected Expr constructAtom(Token token)
{
return new Atom(token.text);
}

protected Expr constructExprList(
AbstractCollection exprs, String srcId,
int startLine, int endLine)
{
ExprList retval = new ExprList(exprs);
retval.filename = srcId;
retval.firstLine = startLine;
retval.lastLine = endLine;
return retval;
}

Now the Schemish interpreter can subclass Parser and call it SchemeParser, overriding constructExprList to handle special syntactic forms, and overriding constructAtom to handle language-specific literals, such as character literals or rational number literals (2/3).
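
As a sketch of what that subclass might look like (my own, beyond the class and method names already mentioned in this post), reusing Int.fromString and StringLit from the original parser:

package jfkbits;

import java.util.AbstractCollection;

public class SchemeParser extends Parser
{
    // Language-specific literals: string literals and numbers here,
    // everything else stays an Atom.
    protected Expr constructAtom(Token token)
    {
        if(token.type == '"')
            return new StringLit(token.text);
        Int intExpr = Int.fromString(token.text);
        return (intExpr != null) ? intExpr : new Atom(token.text);
    }

    // Special forms: recognize lambda (and friends) here, exactly as the
    // original parseExprList did, then fall back to the generic list.
    protected Expr constructExprList(AbstractCollection exprs, String srcId,
            int startLine, int endLine)
    {
        // ... lambda-recognition code from the original parser goes here ...
        return super.constructExprList(exprs, srcId, startLine, endLine);
    }
}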

Friday, May 09, 2008

Thoughts and Code for the S-Expression Lexer

My recent project has been a tiny Schemish interpreter, and more recently I have been considering a design that would work well as the Java counterpart to Kamin's chapter 1 interpreter: a calculator language using Lisp syntax, meant to be morphed into any one of a variety of different languages based on S-expression syntax.

So, this post is just for sharing some observations in working on the lexical analyzer part, as well as its code.

Just so it's abundantly clear, the problem a lexical analyzer solves, for S-expressions anyway, is to get a stream of parens, quotes, and symbols out of a character input source.

The apostrophe quote operator, as used for creating list literals like '(lambda (x) x), appears to be a token not requiring whitespace. I'd never seen the apostrophe without preceding whitespace, but after testing MzScheme, Chez Scheme and a few other minor Scheme implementations, it's apparent that a'b'c is the same as a 'b 'c. I've not done any serious development in Scheme, but I wonder whether this is common knowledge among Scheme programmers.

Similarly, Scheme read-eval-print loops appear to accept forests of S-expressions at the prompt, not just one expression. If you type 1 2 3 4, the values get echoed back. This is useful for multiple defines, or for pasting code. Obviously, any REPL should support this behavior if at all possible.

I was happy to see that StreamTokenizer tracks line numbers. Having line numbers available "for free" helps error messages instantly. For what it's worth, I want my scanner to track column numbers too, but I understand if James Gosling (marked as the author of StreamTokenizer) didn't want to get into arguments about what a tab is worth.

The xUnit test frameworks are a much welcome tool for testing language processors. In 1998 I was fiddling with diff-based Perl scripts for automated regression testing. Writing JUnit tests is a lot more fun, productive, and exact than relying on textual comparison with blessed output.

Rather than subclass StreamTokenizer, I wanted to change the consumer's model of looping from "while the current token is not end-of-file" to a standard Iterator terminating on !hasNext(). This required making use of pushBack() on every token get, but I judged that ease of use for the client had a slight edge.

Using an iterator means you need an object to return, so there is a small Token class returned by the iterator. Token bundles the public fields of the StreamTokenizer that represent the token value. I opted not to have StreamTokenizer parse numbers. Since the intended use is ultimately interpreters for arbitrary languages merely based on S-expression syntax, I needed to let them have their own numeric literal syntax and domain of representation (double? float? BigInteger?). Now for some code. Token looks like this:


package jfkbits;
import java.io.StreamTokenizer;

public class Token
{
public static final int SYMBOL = StreamTokenizer.TT_WORD;
public int type;
public String text;
public int line;

public Token(StreamTokenizer tzr)
{
this.type = tzr.ttype;
this.text = tzr.sval;
this.line = tzr.lineno();
}

public String toString()
{
switch(this.type)
{
case SYMBOL:
case '"':
return this.text;
default:
return String.valueOf((char)this.type);
}
}
}
Here's how the type field, an int, as defined in StreamTokenizer, works: for "ordinary characters", the type is the character literal cast to an int. The LispTokenizer marks '(', ')', and the apostrophe quote operator '\'' as ordinary. In code using Token, this reads very naturally, as in if(token.type == '(') return parseExprList();. For "words", atoms in our case, the type is a negative int defined by StreamTokenizer.TT_WORD, which Token redefines as SYMBOL. If we read a Token t with t.type==Token.SYMBOL, the good stuff (like "42", "x", "eval", or "lambda") is in t.text. String literals have the type code of the delimiter, so t.type=='"' means we've got a string literal, the contents of which (without the delimiter!) are also in t.text.

And what about string literals? Strictly speaking, the same decision I made about numeric literals should also apply to string literals: different languages have different syntaxes and potentially different representations. Perhaps I should not configure StreamTokenizer to recognize string literals; in that case, the parser would get atoms containing the double quotes themselves and would be expected to split them apart. Currently, I don't expect this tool to be used for studying string literals very much.

And finally, for the code itself:

package jfkbits;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.Iterator;

public class LispTokenizer implements Iterator
{
// Instance variables have default access to allow unit tests access.
StreamTokenizer m_tokenizer;
IOException m_ioexn;

/** Constructs a tokenizer that scans input from the given string.
* @param src A string containing S-expressions.
*/
public LispTokenizer(String src)
{
this(new StringReader(src));
}

/** Constructs a tokenizer that scans input from the given Reader.
* @param r Reader for the character input source
*/
public LispTokenizer(Reader r)
{
if(r == null)
r = new StringReader("");
BufferedReader buffrdr = new BufferedReader(r);
m_tokenizer = new StreamTokenizer(buffrdr);
m_tokenizer.resetSyntax(); // We don't like the default settings

m_tokenizer.whitespaceChars(0, ' ');
m_tokenizer.wordChars(' '+1,255);
m_tokenizer.ordinaryChar('(');
m_tokenizer.ordinaryChar(')');
m_tokenizer.ordinaryChar('\'');
m_tokenizer.commentChar(';');
m_tokenizer.quoteChar('"');
}

public boolean hasNext()
{
if(m_ioexn != null)
return false;
try
{
m_tokenizer.nextToken();
}
catch(IOException e)
{
m_ioexn = e;
return false;
}
if(m_tokenizer.ttype == StreamTokenizer.TT_EOF)
return false;
m_tokenizer.pushBack();
return true;
}

/** Return the most recently caught IOException, if any,
*
* @return
*/
public IOException getIOException()
{
return m_ioexn;
}

public Token nextToken()
{
return (Token)next();
}

public Object next()
{
try
{
m_tokenizer.nextToken();
}
catch(IOException e)
{
m_ioexn = e;
return null;
}

Token token = new Token(m_tokenizer);
return token;
}

public void remove()
{
}
}
A token stream can be processed something like this:

LispTokenizer tzr = new LispTokenizer("(define x 42)");
for(Iterator it=tzr; it.hasNext(); ) {
Token token = (Token)it.next();
processToken(token);
}
And finally, some unit tests.

package jfkbits;

import java.io.StreamTokenizer;
import java.util.Iterator;

import junit.framework.TestCase;

public class LispTokenizerTest extends TestCase
{
public LispTokenizerTest(String name)
{
super(name);
}

public void testLispTokenizerIterator()
{
LispTokenizer tzr;

tzr = new LispTokenizer("");
assertFalse(tzr.hasNext());

tzr = new LispTokenizer(" ");
assertFalse(tzr.hasNext());

tzr = new LispTokenizer("\n");
assertFalse(tzr.hasNext());

tzr = new LispTokenizer("7");
assertTrue(tzr.hasNext());
checkToken(1, "7", Token.SYMBOL, tzr.next());
assertFalse(tzr.hasNext());

tzr = new LispTokenizer("()");
assertTrue(tzr.hasNext());
checkToken(1, null, '(', tzr.next());
checkToken(1, null, ')', tzr.next());
assertFalse(tzr.hasNext());

tzr = new LispTokenizer("(newline)");
assertTrue(tzr.hasNext());
checkToken(1, null, '(', tzr.next());
checkToken(1, "newline", Token.SYMBOL, tzr.next());
checkToken(1, null, ')', tzr.next());
assertFalse(tzr.hasNext());
}

private void checkToken(int line, String text, int type, Object tokenObj)
{
assertNotNull(tokenObj);
assertTrue(tokenObj instanceof Token);
Token token = (Token)tokenObj;
assertEquals(line, token.line);
if(text != null && token.type == StreamTokenizer.TT_WORD)
assertEquals(text, token.text);
assertEquals(type, token.type);
}

public void testCharacterMapping()
{
assertEquals((int)'(', mkTokenizer("(").nextToken().type);
assertEquals((int)')', mkTokenizer(")").nextToken().type);
assertEquals((int)'\'', mkTokenizer("'").nextToken().type);

assertEquals(StreamTokenizer.TT_WORD, mkTokenizer("0").nextToken().type);
}

public void testSimpleLispExpressions()
{
test("",new String[]{});
test("()",new String[]{"(",")"});
test(" ()",new String[]{"(",")"});
test("\n()",new String[]{"(",")"});
test("() ",new String[]{"(",")"});
test("()\n",new String[]{"(",")"});
}

public void testLispExpressionsWithComments()
{
test(";Comment here\n()", new String[]{"(",")"});
}

public void testLispExpressionsWithStrings()
{
test("\"\"",new String[]{""});
test("\"a\"",new String[]{"a"});
test("\" a\"",new String[]{" a"});
test("\"a \"",new String[]{"a "});
test("(print \"Hello world.\\n\");",
new String[]{"(","print","Hello world.\n",")"});
}

public void testFactorial()
{
String src =
";;\n"+
";; Compute the factorial of a given number\n"+
"\n"+
"(defun fact (n)\n"+
" (if (< n 2)\n"+
" 1\n"+
" (* n (fact (- n 1)))\n"+
" )\n"+
" )\n"
;
String[] expected = {
"(","defun","fact","(","n",")",
"(","if","(","<","n","2",")",
"1",
"(","*","n","(","fact","(","-","n","1",")",")",")",
")",
")"
};
test(src,expected);
}

static void test(String src, String[] expectedTokens)
{
LispTokenizer tzr = mkTokenizer(src);
int i = 0;
for(Iterator it=tzr; it.hasNext(); i++)
{
Token token = (Token)it.next();
assertNotNull(token);
assertTrue("Expected "+expectedTokens.length+" tokens, got more",
i < expectedTokens.length);
assertEquals(expectedTokens[i], token.toString());
}
}

static LispTokenizer mkTokenizer(String src)
{
return new LispTokenizer(src);
}
}

Wednesday, May 07, 2008

Motivations for Little Interpreters

We're continuing a series about writing little interpreters. Writing one is fun and cool, but so are isolated guitar licks, for about the first five minutes; long-term you'll probably want to play whole songs if you want to have listeners. Similarly, you need an application, killer or not, for an interpreter, or it will lie around collecting bit dust and requiring "maintenance" (right, Dijkstra?).

I suppose you could say I'm talking about an interpreter for a Domain Specific Language here, but really all I mean is that if you care enough to write an interpreter, you may as well take the opportunity to build into it something you can use, and for this post I'll address the area of a useful domain.

Specifically, if you care about an interpreter it means it will support a domain you care about: photos, 3D graphics, matrices.

In the implementation language of the interpreter, you will want to define
1. Data structures
2. Ways to construct and operate on them

and you'll make provisions for these in your language. Thinking these through carefully, and probably writing them into a runtime library complete with unit tests, is a respectable way to approach things before sticking an interpreter front end onto the data and code of your domain of interest.

Let me give you one simple example of an interpreter I wrote for personal productivity. In 1995 I wrote "Joel's Calculator" to help me write status reports for my manager, and also to do some basic statistical analysis. (No, I didn't have Excel; I was working with an HP-UX workstation.) Our embedded systems group had gotten a rare opportunity for a complete product rewrite, and we needed to track our time and estimates carefully. We had to track time spent on particular modules and count overhead, and in our weekly status report submit an account of time on our various modules, with overhead (anything not in a particular module) evenly divided among our projects. I would keep a log of time spent that looked something like this:

Tue May 7
0840-0920 Email
0920-1135 Serial driver
1250-1300 Ovr
1300-1400 Phase 2 mtg
1400-1600 Serial driver testing
1600-1715 Code review
It was an easy way to keep track, in a text editor. Then at the end of the week, I would sum the "billable hours" for modules like Serial driver. What I ended up wanting was a way to do "time arithmetic":

serial = 11:35-9:20 + 16:00-14:00
ovr = 9:20-8:40+0:10+14:00-12:50+17:15-16:00
This worked very well for me as a workflow. I could write simple expressions like this, assign them to variables, combine the results gradually, and so on. This was really a desk calculator, but I was pleased with the effect that I could enter times textually in a very natural way, use simple infix arithmetic notation, and have it just work.

In this case my chief domain of interest was pretty simple, a pair of hours and minutes, and the operations on the domain were also fairly simple, but a little challenging to get right.
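
The original calculator is long gone, but a minimal Java sketch of that domain (my own; hour:minute values with plus and minus, all names hypothetical) might look like this:

public class HourMin
{
    private final int minutes;   // may be negative while combining terms

    public HourMin(int hours, int mins) { this.minutes = hours * 60 + mins; }
    private HourMin(int totalMinutes)   { this.minutes = totalMinutes; }

    public HourMin plus(HourMin other)  { return new HourMin(minutes + other.minutes); }
    public HourMin minus(HourMin other) { return new HourMin(minutes - other.minutes); }

    public String toString()
    {
        return (minutes / 60) + ":" + String.format("%02d", Math.abs(minutes % 60));
    }

    public static void main(String[] args)
    {
        // serial = 11:35-9:20 + 16:00-14:00
        HourMin serial = new HourMin(11, 35).minus(new HourMin(9, 20))
                         .plus(new HourMin(16, 0).minus(new HourMin(14, 0)));
        System.out.println(serial); // prints 4:15
    }
}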

Of course, you may be coming at this from the perspective that the goal and the domain are already well defined. For example, you're working with a library of interest and you'd like to play with it interactively; you can envision making Scheme bindings and writing abstractions to make it do something useful. Or perhaps you have a very powerful system written in an existing but limited language, and you need to bust out of the limits. Using a lightweight Lisp/Scheme interpreter that you can modify to script some macros to generate code in another language may solve some scaling problems for you.

So, if you want to start a little interpreter project, which I'd like to encourage, before you get started, pick a goal, pick a domain, and refine the domain as the first step.

Tuesday, May 06, 2008

Google AppEngine Activation Arrives

I signed up for Google AppEngine, well-covered on programming.reddit.com, about a month ago, but got the "don't email us, we'll email you" message at that time. Today I got my email letting me know I can start farming on the Google spread.

Friday, May 02, 2008

Little Lisps: Programming Candy or Spinach?

Last time (StreamTokenizer and Declarative Lexing), I mentioned an idea of presenting language designs in this space as a puzzle to be solved, like a crossword puzzle. I invented the idea there and it's been growing on me.

The previous week or so I'd been reading Sam Kamin's "Programming Languages: An Interpreter-Based Approach", so this didn't seem like such a crazy idea as it may sound. In "Programming Languages", Kamin starts with an interpreter for an extremely simple language using Lisp syntax, and proceeds with each chapter to show what modifications need to be made to get interpreters for Lisp, Scheme, SASL, APL, Smalltalk, Prolog and others. They all use Lisp syntax, so the code changes are kept quite manageable, and the reader can focus on the essential differences in scoping, evaluation strategies, and the like.

As a first step, if the Language Design Crossword Puzzle is to become a reality, making available a standard interpreter source such as Kamin's seems like a reasonable idea. "Given interpreter0, add macros" would be a puzzle. "Given interpreter0, add object serialization." This is similar in spirit to comments I made earlier about the utility of lambda calculus for studying language features.

But is the Language Design Crossword Puzzle a good idea? What's the point? Arcane Sentiment, a blog I discovered and subscribed to today, introspects on writing little language implementations. He describes certain parts of the exercise as "programming candy", and ironically, they're often the parts written in C, a series of little programming victories. The hard, ill-defined problems to be written in Lisp are the parts that tend to slow him down and demotivate him. (Arcane, I hope I'm fairly characterizing that post. Please correct me if not.)

I had to chuckle in self-recognition, as earlier this week I was watching the first few SICP videos, evaluating the examples in a Scheme-ish interpreter I'd whipped up on Monday and Tuesday, and extending it during pauses in the dialog while Sussman wrote on the blackboard. Then he hit the pi fraction example, and I realized I wasn't at all sure I wanted to be writing code right at that moment to rationalize denominators and reduce fractions or whatever else Scheme might do to support exact rational number (fraction) arithmetic (e.g. (/ 1/2 (+ 1/8 1/3))). That problem, at that moment, was not interesting to me, and was not well specified: are rationals always reduced to lowest terms? How is conversion to and from reals handled? I'd have to go study R5RS to learn the expected behavior. Handling the number tower was not my goal going into this project.

Why does Arcane Sentiment dabble in Lispy implementations? Why do I? Why reinvent the wheel? For me, it's a way to learn, to study. Toy implementations are rewarding, as they let you discard parts of a language implementation system that are indeed hard, and focus on particular points of interest. You need to be careful not to oversimplify, if you intend to take your lessons back to a real system. But this approach is something we advise junior programmers all the time: if you're struggling with how a library or language feature is working in the application, try writing a small example program first until you understand how it works.

So, I propose we have the best of both worlds. Language design problems can be programming candy as well as programming spinach, that is, something good for you. My wife has been making a spinach recipe from her Spain and Portugal cookbook which features raisins and pine nuts. It rocks.

The other question is, is there interest in a "language design puzzle" feature? Before we get to that, let me ask a more relevant question: what aspects of programming language implementation or operation are of interest to you? Macros? Evaluation strategy? Optimizations? Drop me a line to the Google mail account jfkbits and let me know.



Blog challenge: write a post using the phrase "free as in spinach".