Wednesday, April 30, 2008

StreamTokenizer and Declarative Lexing

One thing I've noticed about marketing is how frequently an appeal is made to change your lifestyle to include more of whatever is being sold. I first noticed this when I read in a catalog that Ikea wants me out of that rat race of the outside world, so I can spend more time at home. And by the way, don't I want my home to be a comfortable inviting place to spend time? After that, I've seen it everywhere. Here at JFKBits we want to know why the world doesn't write more language processors. Everyone can enjoy a world of recreational symbol translation. I was on vacation last week, reading a programming language book by the beach. Maybe someday we'll run programming language designs in this space and JFKBits readers can implement them as a kind of crossword puzzle of the day, in an hour on the train home from work. So today, we're thinking about how to get a lexer up and going more quickly.

When you're designing a syntax, it's typically done at a high level:
  • Comment syntax: line-oriented or delimited, or both? do comments nest?

  • Identifier syntax: limited choice for the first character (e.g. letters and '_'), followed by a mix of alphanumerics and underscores. In Lisp, symbol syntax is much more flexible.

  • Operators and grouping syntax: parentheses, brackets, multi-character operators like ++, ->, and &&=.

It might be useful, in terms of quickly prototyping a domain-specific language or language-aware utility (e.g. a tool that finds buggy code patterns), to have a lexer generator that lets you declare properties in these terms rather than translating to regular expressions.

Java's StreamTokenizer, while not a universal tool, is a step in that direction.

What I like about StreamTokenizer is that it raises the level of programming closer to the problem domain. Regular expressions are the traditional and well-understood means of specifying a lexer's accepted language, but this declarative style is possibly a better programmer's tool. (Note well that I'm talking about StreamTokenizer, not StringTokenizer which is a much simpler state machine.)

For example, in playing with Kamin's interpreters with their S-expression-based syntax, I've been using this configuration to scan LISP-type code in Java:

tokenizer = new StreamTokenizer(new BufferedReader(reader));
tokenizer.resetSyntax();             // we don't like the default settings
tokenizer.whitespaceChars(0, ' ');   // everything up through space is whitespace
tokenizer.wordChars(' ' + 1, 255);   // everything above space is a word character
The basic idea is that StreamTokenizer is a state machine with hard-coded states and transition arcs, and you configure it with what characters are in the relevant character classes. For example, from the state of skipping whitespace, there's a transition to a string literal state, when a character from the "quoteChar" class is encountered. The string literal state accumulates, transitioning back to itself on characters not in the quoteChar class. It's simply up to you to configure which character or characters constitute a quote delimiter. The essential observation is that lexers for many useful languages share state machines of identical shapes, and they differ only in the definitions of character classes.
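As a sketch of how this plays out in practice, here's a small driver loop for the configuration above. The ordinaryChar calls for parentheses and the sample input are my additions, on the assumption that an S-expression reader wants parens as standalone tokens rather than word characters:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class LispTokens {
    // Tokenize an S-expression string with the configuration from the post,
    // plus parentheses as single-character tokens (my addition).
    static List<String> tokens(String input) throws IOException {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(input));
        tokenizer.resetSyntax();            // we don't like the default settings
        tokenizer.whitespaceChars(0, ' ');  // controls and space skip silently
        tokenizer.wordChars(' ' + 1, 255);  // everything printable is a word char...
        tokenizer.ordinaryChar('(');        // ...except parens: single-char tokens
        tokenizer.ordinaryChar(')');

        List<String> result = new ArrayList<>();
        while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
            if (tokenizer.ttype == StreamTokenizer.TT_WORD)
                result.add(tokenizer.sval);          // identifier-like token
            else
                result.add(String.valueOf((char) tokenizer.ttype)); // single char
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokens("(plus 1 (times 2 3))"));
    }
}
```

Note that after resetSyntax() even digits are just word characters, so "1" comes back as the word token "1"; number parsing would need parseNumbers() or a later conversion.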

Of course, not every language fits the preprogrammed state machine shape. StreamTokenizer can't even handle Java, because it has no capacity for multi-character operators. It can be configured so that '+' is an "ordinary character", meaning it is lexed as a single-character token, but there's no way for a parser to know if two '+' tokens in sequence came from the input "++" or "+ +". This is what I mean when I say StreamTokenizer is not a universal lexing tool.
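To make that ambiguity concrete, here's a little experiment (the helper method is my own sketch, not part of any real lexer): under the "ordinary character" treatment, both inputs produce the same two single-character '+' tokens.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;

public class PlusPlus {
    // Count the '+' tokens produced for a given input when '+' is an
    // "ordinary character" -- i.e. lexed as a single-character token.
    static int plusTokens(String input) throws IOException {
        StreamTokenizer t = new StreamTokenizer(new StringReader(input));
        t.resetSyntax();
        t.whitespaceChars(0, ' ');
        t.ordinaryChar('+');
        int count = 0;
        while (t.nextToken() != StreamTokenizer.TT_EOF)
            if (t.ttype == '+') count++;
        return count;
    }

    public static void main(String[] args) throws IOException {
        // "++" and "+ +" are indistinguishable downstream: two '+' tokens each.
        System.out.println(plusTokens("++") + " " + plusTokens("+ +"));
    }
}
```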

But I still wonder if there's room for this declarative manner of input, building on higher-level concepts like identifiers and operators, for a more general lexing tool. This could be combined with a programmatic knowledge base of some of the standard lexical idioms running around. You could say "give me Python-style identifiers, with the standard Java operators except these four which don't apply to my language." I'm not at all sure there is need for such generality, but I think it's worth writing down, as an idea for future inspiration.

Thursday, April 17, 2008

Small Tokenizing World

I wanted to see what Java programs out there parse Lisp syntax using the handy StreamTokenizer, and the first one I found is Mike Bayne's, circa 1997.

Mike was one of the bright guys who did version 2 of the Prizm IDE that I helped work on for my senior project at Rose-Hulman. Now he runs his own online game company out of San Francisco.

Tuesday, April 15, 2008

AP Computer Science Language

I've been reading Sam Kamin's "Programming Languages: An Interpreter-Based Approach" from 1990, which compares programming languages through the device of writing an interpreter in Pascal for a form of each language using Lisp-based syntax.

Reading all that Pascal source code takes me back. I reflect that I learned Pascal as a senior in high school, to prepare for the AP Computer Science test. That was in 1990. I see that the AP tests now use Java as their language. I wonder: was there any other language between Pascal in 1990 and Java in 2008?

If I had to bet on what language would succeed Java as that used for the AP exam, I'd bet on Python.

Friday, April 11, 2008

Variables as Objects in Ila

It's time for another in the series about Ila. This time we'll look at the var type and the Var[t] family of classes.

What is Ila? In January 1998 Uday Reddy presented Objects and Classes in Algol-Like Languages (refined in 2000). In Spring of 1998 I found myself Reddy's M.S. student, designing and implementing a call-by-value variant of the IA+ language presented there. I've taken to calling the implemented language Ila in this blog.

Second Class Objects

Let's look at another case of "everything is an object -- except when it's not." Pop quiz: how many objects do you see:

a = 1
If we take this to be from a language where "everything is an object", then there is at least one object: the integer 1. But what about a? Some would say it's just a name or reference to the object 1. But think about it this way. After this assignment, somewhere between the name a and the object 1 is a container. A container you can't "get your hands on": a container which can't be created on the fly anonymously or returned from functions; in short, a second-class citizen in the language, and furthermore not an object.

Notice I didn't say this language (which could be any one of "most" languages today) doesn't let you create containers. In fact, the simplest of all useful classes could be a simple container:

class Box {
    Object contents;
    Object getContents() { return this.contents; }
    Object setContents(Object src) { return this.contents = src; }
}

In order to be at all interesting, objects must have instance variables, some form of local state. Why not make variables be primitive classes?

References and Boxes

Before we look at what this even means and how this technique plays out in Ila, I did want to make a related-work note. Explicit storage cell types are certainly out there: the ref type is part of what's biting Robot Monkey's shiny metal posterior, and there are others, like box in Scheme. The difference with this approach is that var is an object. (I know, tortured sighs from certain parties. But for some people this represents the leveraging power of orthogonality. Ok. Groans from other corners. All right, let's move on.)

Var[t] - Variables as Objects

First, the formal introduction. In Reddy's IA+ paper, there was a var t polymorphic type and a family of classes Var[t] implementing it. The object signature for var t is {get : unit -> t, set : proc t}. So, given a variable object v we retrieve with v.get() and assign to it with v.set(k). That's the object syntax. As a built-in type it also can have special syntactic treatment to provide the more usual forms.

How do you declare a variable? Well, it's an object, so you should be asking how you instantiate a new variable; more specifically, how you get a new Var[t] object. (The Var[t] notation is generics syntax; you'd write Var[int] or Var[NuclearSubmarine] to talk about the classes for variables of specific types.) In Ila, the syntax for a new object has the form new id isa class-expr, where id is the object name, in scope to the end of the basic block, and class-expr is a class-typed expression. More about class-typed expressions another time; here we'll just write the names of classes.

Putting it all together, here's a snippet of code using variables:

new username isa Var[str];
readstr("Hello, what's your name?",username);
println("Pleased to meet you, "+username+".");
println("Would you like to play a game?");
new score isa Var[int];
score := 0;
Aside from the new username isa Var[str] syntax, which is a little obnoxious and could be reduced to something like var username, the variable usage is entirely ordinary. Here we see

1. Implicit .get() call in the string expression "Pleased to meet you, "+username+".".

2. Implicit .set() call in score := 0.

3. Passing the whole container so it can get updated in readstr("Hello, what's your name?",username).

To see how these work, we need to look at the types. In Ila, the type of an object is the type of its class signature, so username has the type {get : unit -> str, set : proc str}.

Let's look at these again in slow motion.

1. In username+".", we have something added to a string literal. In Ila, + is overloaded, so we only have a few choices, and from the string literal we resolve this to be the string + operator. So the expression username is expected to be a string. But it is not a string; it's a variable containing a string. What makes this work is a subtype relation in the type system and a coercion at runtime: the type var[t] is a subtype of t, and at runtime variables of t are coerced to values of type t by calling the get method.

In other words, username+"." is implicitly turned into username.get()+".".

2. An assignment like score := 0 is simply syntactic sugar for score.set(0). The only other thing worth noting here is that there's nothing very vague or shadowy about what is an lvalue. The only things that can go on the left of the assignment are things with type var[t], or more correctly, things whose type is a subtype of var[t]. If you wrote some arbitrary class and gave it a set method, you could put it on the left of an assignment operator.

3. Passing values of type var[t] to functions and procedures is similar to the first case, but you will want to know the type of the function or procedure's argument. If the type it expects is var[t], look out (as in the "bang" of Scheme's set!): it means the variable is being passed as a container object and its contents could be modified. In the case of the built-in procedure readstr this is expected.

Conclusion: Translation to Java

The Ila implementation needed to represent the Var[t] class in Java in a way consistent with the rest of the language. It was not possible to directly use Java local variables. As an optimization, yes, when their use matched the semantics of Java local variables. But as the general translation, I simply had to make an iap.lang.Var container class basically like the Box class mentioned earlier. This is probably where the idea breaks down, and I'm sure some readers have thought of this point already: who wants to pay a method dispatch cost for every variable access?

If you tend to have a lot of getters and setters in your classes, or if you're attaching listener management methods for lots of an object's state, consider that a variable object could be one place to hang all that functionality, as methods of the variable itself instead of methods of its containing class. Try obj.field.addListener and obj.field.set on for size instead of obj.addFieldListener and obj.setField. To the degree that "everything is an object", that universal orthogonality, is useful, to the degree that it is useful for metaprogramming, this technique deserves some consideration.
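For illustration, here's a minimal Java sketch of a variable-as-object with a listener hook. The Var name follows the post's Var[t]; the Consumer-based listener machinery is my own invention by analogy, not the actual iap.lang.Var:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// A sketch of a Var[t]-style container in Java generics, with the listener
// functionality hung on the variable itself rather than its enclosing class.
public class Var<T> {
    private T contents;
    private final List<Consumer<T>> listeners = new ArrayList<>();

    public T get() { return contents; }

    public void set(T value) {
        contents = value;
        for (Consumer<T> listener : listeners)
            listener.accept(value);   // notify on every assignment
    }

    public void addListener(Consumer<T> listener) { listeners.add(listener); }

    public static void main(String[] args) {
        Var<Integer> score = new Var<>();
        score.addListener(v -> System.out.println("score is now " + v));
        score.set(0);                   // the desugared form of score := 0
        score.set(score.get() + 10);    // explicit coercion via get()
    }
}
```

Here obj.field.addListener falls out for free: the field is a Var and carries its own listener list, so the containing class needs no addFieldListener boilerplate.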

Wednesday, April 02, 2008

Next Time Think of Some Tests

If you volunteer to help your sister-in-law with wedding preparations on Thursday, and you don't get to it until Friday, and you decide to write a program to do it, don't forget to think about some basic sanity checks on the final output. Otherwise on Sunday you'll hear from your father-in-law wondering why some people didn't have placecards and didn't know where to sit.

So yeah, wedding preparations in the sense of "can you research how to go from an Excel spreadsheet of names and table assignments to a Word document with the names in the right places to print on these pre-scored placecard sheets from Hobby Lobby, SKU #717694."

The more you think about Excel, the more you think "Not too bad, Mathematica imports from Excel," and the more you think about Word, the more you think "Ugh". And the more you think about Mathematica, the more you remember the vector graphics language, and that you can trivially export to PDF or rasterized image files.

tables = Import["c:/tables.xls"];
sheet = First[tables];
(* trim: whitespace-trimming helper, defined elsewhere *)
(* Extract name from row by pattern match *)
nameOfRow[{title_, names_, ___}] :=
  trim[title] <> " " <> trim[names];
(* Extract table assignment text from row *)
tableOfRow[{_, _, _, tbl_, ___}] := trim[tbl];
(* Filter out blank rows by checking the first (title) column *)
rowsWithNames[sheet_] := Select[sheet,
  Function[row, trim[First[row]] =!= ""]];
Some Opening Lines of Code, Which May Become Important Later

Not very long after, it seems, the Excel spreadsheet is imported, blank rows are skipped, the names and table numbers are constructed and placed onto the vectorized layout of the placecard sheet, with or without preview construction lines. Very cool.

placecardPageTemplate[] :=
  With[{w = placecardPageWidth, h = placecardPageHeight,
      pgborder = placecardPageBorder},
    {(* Solid page border *)
     Dashing[{}], Line[{{0, 0}, {w, 0}, {w, h}, {0, h}, {0, 0}}],

     (* Scored page borders *)
     Dashed,
     Line[{{pgborder, 0}, {pgborder, h}}],
     Line[{{w - pgborder, 0}, {w - pgborder, h}}],
     Line[{{0, pgborder}, {w, pgborder}}],
     Line[{{0, h - pgborder}, {w, h - pgborder}}],

     (* Scored between-card borders *)
     (* vertical *) Line[{{w/2, 0}, {w/2, h}}],
     (* horizontal *)
     Table[With[{cardHt = pgborder + i*placecardHeight},
       Line[{{pgborder, cardHt}, {w - pgborder, cardHt}}]], {i, 1, 2}],
     Flatten[Table[placecardTemplate[r, c],
       {c, 1, numPlaceCardColumns},
       {r, 1, numPlaceCardRows}], 2]}];
Sample of the Vector Graphics Code

More time is spent rasterizing, with long interruptions for meals, toddler nap and bedtime, and the necessary socializing that comes from being in the bride's family's house the day before the wedding (e.g. rehearsal and dinner).

By late evening the table assignments are finalized and the requested high-resolution JPGs are being rasterized from the vector graphic objects. Sometime the next morning, the placecards are printed. The ceremony comes off with only one hitch, and by evening the reception is ready to swing.

Except... for that one little hack. When you wrote that simple filter to remove blank rows, you did it by checking whether the first column was blank. And as it turns out, the first column is for titles, and some names, about 6 of them, didn't have a title.

Fault? Everyone agreed it was yours, the programmer's. Some issue can be made about the given spreadsheet format, with its separate column for titles. But the bottom line is you had all the data, and simply spent too much time coding and not enough time planning. There was time to do it right, and no time to do it over. Some time should have been devoted up front to thinking about how to check the correctness of the final output.

Oh well. The next time someone asks you to help print placecards, you're all set...

If you are interested in the Mathematica notebook that can generate custom placecards with your database of names, send to the account jfkbits. And please note I've fixed the aforementioned bug.

Tuesday, April 01, 2008

April Curmudgeon

Disclaimer: The April[0] Post you are about to read is not meant to be amusing. Please adjust your expectations now. Lighthearted, yes, but it will not make you laugh aloud. Really, it will not be amusing unless you somehow make it so by assigning a meaning it was never intended to have. By reading this post hopefully you will gain insights that allow you to be a funnier blogger. When Frank Bunker Gilbreth's children, whom he was teaching to touch type, asked if he could touch type, he only replied "They tell me Caruso's voice teacher can't sing a by-jingoed note." May it be so with this post.

It's challenging to write truly good April Fools' material about programming issues, but almost programmatically easy to write lame material. One reason things are funny is that (1) they violate expectations, and (2) they do so in a way that doesn't personally impact you, as in Mel Brooks's observation "Tragedy is when I cut my finger. Comedy is when you walk into an open sewer and die".

We're immersed in dark humor by the nature of our process-based profession. Programmers with customers (the most interesting kind of programmers) keep huge databases of failed expectations. These only get funny with distance: either in time or in organizational-chart hops. Your coworkers' bugs are more likely to be funnier than your own, but only until your manager decides to reassign them to you. Another team's bugs are funnier. Unless you rely on that feature. And so on.

The other kind of expectation, that program P (browser, language, Turing-capable text editor) expects input format X and generates output format Y, is also a target for humor: "Bwahaha, a web browser that accepts COBOL as an HTML scripting language" works if it's well-known that this combination doesn't exist and doesn't seem to make sense. Humor playing on this type of expectation is fragile, because the expectations are broken on purpose all the time by the current of software development progress. It's only a matter of time before "Ruby VM in Javascript" moves from being funny to being an accepted way of life. Maybe COBOL is funny today, but legacy systems have a way of creating needs that can be solved by software. If you want a joke candidate generator, write a program that generates T-diagrams with all possible combinations of known programming languages at the different points. The funny ones will be funny because they violate the intended design of the language: Logo as an implementation language for a Fortress-to-Haskell compiler. (It's probably a good sign for a language if it's never funny at any point on the T.)

Take a look at the April Fools gags out there related to programming and see how many are analogous to "that idiot tried to plug a USB cable into a Firewire socket". The DailyWTF material is kind of a catalog of such expectations:
  • You should write code without explicitly specifying every input case

  • Code should have a purpose, unlike this if check:
    if( nextIndex != currIndex )
        nextIndex = currIndex;

  • You should use a standard library function rather than writing your own buggy version

  • You shouldn't transform data through several unnecessary layers (the digital camera/wooden table technique)

  • Code should, ideally, do something productive, especially if you're paid to create it
The real humor, the best humor in programming topics and others, is the kind that also incorporates the self-recognition element, which I think is simple imitation or mimicry in its rawest form. This is the Jerry Seinfeld style, where you observe and describe a situation or behavior. There's something about hearing a description of a behavior, especially one we wouldn't readily admit to unprompted, that makes us laugh when we recognize we're guilty of it too.

Note: COBOL is one of those all-purpose humor words, kind of the "weasel" of programming humor. Source: Ask Mr. Language Person.

Lame subtlety spoiler: If you skipped the disclaimer, and thought this was supposed to be somehow funny just because it's April[0], check the title. Also, it's not realistic to expect to get a funny blog post in polynomial time from the T-diagram generator. Not that I tried.