Wednesday, February 07, 2007

XML-based Languages

[Image: Example Ant script]
There's a thread on the Ant user list today which kicked off when someone expressed displeasure with the XML-based nature of Ant scripts:


Hans wrote:
What do you think about the XML format used for writing Ant scripts? I don't like it.

What about writing Ant scripts in a script language like Python or Jython instead of writing them in XML? I think it would be much more productive.

This is the topic I'd like to address: the choice of syntax for a language, and especially the choice between an unambiguous, easily parsed, structured syntax like XML or Lisp, which I'll call the "model" syntax, and a more conventional "pretty" syntax, which I'll call the "front end" syntax. What is this choice? Let me explain.

When I first heard about XML, I thought it might serve a role in language implementations as a standard abstract syntax tree (AST) representation. There would be a separable parser component for the language, and the XML representation of the parsed form of the language would be standardized and published. This could make it easier to build tools like syntax-highlighting editors, because they could access the parser component. Automatic code generators could target the model syntax instead of the front end syntax, because the model syntax is unambiguous. Why does that matter, you ask? Have you ever written a program that generates C source code using the minimum number of parentheses and braces? It's not easy. The easiest way to write this kind of auto-generated code is to wrap parens, braces, and other structural delimiters around every generated expression. Guess what? That's one of the defining characteristics of this "unambiguous structured syntax" I'm talking about.
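To make the code generation point concrete, here is a minimal sketch in Python of a generator that takes the easy route and parenthesizes every binary expression. The `Var` and `BinOp` classes and the `emit` function are hypothetical names made up for this illustration; they don't come from any real compiler.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Var:
    """A variable reference (hypothetical AST node for this sketch)."""
    name: str


@dataclass
class BinOp:
    """A binary operation (hypothetical AST node for this sketch)."""
    op: str
    left: "Expr"
    right: "Expr"


Expr = Union[Var, BinOp]


def emit(e: Expr) -> str:
    """Emit C-like source, wrapping every binary expression in parens.

    No precedence or associativity table is needed: the output is always
    unambiguous, at the cost of redundant parentheses.
    """
    if isinstance(e, Var):
        return e.name
    return f"({emit(e.left)} {e.op} {emit(e.right)})"


tree = BinOp("*", BinOp("+", Var("a"), Var("b")), Var("c"))
print(emit(tree))  # prints ((a + b) * c)
```

Emitting the minimal `(a + b) * c` instead would require the generator to know each operator's precedence and associativity, which is exactly the knowledge the model syntax lets you skip.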

This idea has another wrinkle: if there's a standard AST format in XML, you can write code in the language directly in the XML format, and bypass the pretty syntax altogether.

This Ant mailing list thread points out that Ant is an existing language with the model syntax, in search of a front end syntax.

You may wonder if it's kind of weird to have two syntaxes for the same language. Maybe it is. Maybe they could be considered separate languages in that case. But there is precedent for it. Mathematica has an ordinary programming syntax, and also something called FullForm. In the front end syntax, as I'm calling it, you can write an expression a^2 + b (c d + e), which has operator precedence and implicit multiplication going on. Then you can also view or write the FullForm representation of the same expression, Plus[Power[a,2],Times[b,Plus[Times[c,d],e]]]. This ability to call upon the FullForm view can give the programmer direct insights into the rules of the parser, in a way reminiscent of how WordPerfect's reveal codes function could sometimes explain mysteries of a document's formatting.
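Python offers something in the same spirit through its standard `ast` module. The sketch below parses a front end expression and prints it in a FullForm-like notation. Python has no implicit multiplication and spells exponentiation `**`, so the input is written `a**2 + b*(c*d + e)`. The `full_form` function and the `HEADS` table are my own illustration, covering only the three operators used here.

```python
import ast

# Map Python operator node types to Mathematica-style heads.
# Only the operators needed for this example are covered.
HEADS = {ast.Add: "Plus", ast.Mult: "Times", ast.Pow: "Power"}


def full_form(node: ast.AST) -> str:
    """Render a parsed Python expression in a FullForm-like notation."""
    if isinstance(node, ast.Expression):
        return full_form(node.body)
    if isinstance(node, ast.BinOp):
        head = HEADS.get(type(node.op))
        if head is None:
            raise ValueError(f"unsupported operator: {type(node.op).__name__}")
        return f"{head}[{full_form(node.left)},{full_form(node.right)}]"
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return repr(node.value)
    raise ValueError(f"unsupported node: {type(node).__name__}")


tree = ast.parse("a**2 + b*(c*d + e)", mode="eval")
print(full_form(tree))
# prints Plus[Power[a,2],Times[b,Plus[Times[c,d],e]]]
```

The printed form makes the parser's precedence decisions visible: the exponent binds tighter than the multiplication, which binds tighter than the addition.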

As we've mentioned, Ant has managed to grow and prosper with nothing but its cumbersome model syntax. Of course, one reason is that Ant is designed to be a build scripting environment, not a general purpose programming language. Ant scripts for 1000-file Java projects can be just a few hundred lines or fewer. The success of Ant, though, has given these scripts reason to grow, as more people make their build processes more elaborate.

So we've seen some of what the model syntax can give you. What does the front end syntax give you? It only takes a moment's thought to realize you don't want to write C code in XML, or worse yet, C++. (Any of my readers care to submit "ported" code examples? I'm sure we could have fun porting some Boost template stuff to an XML-ized C++ syntax.) In general, and without a shred of evidence, I'd venture to say the purpose of the conventional front end syntax is to provide an efficient means of expression that lines up with a considerable body of linguistic precedent, including previous programming languages and notation from mathematics and other scientific studies. The problem is that this notation is usually awfully difficult to parse, and sometimes it's more important to get some kind of computation going.

There are some things to learn from XML in particular when designing a front end syntax. In the Ant thread, Matt Benson offers that he is intending to design a "custom language for Ant". To this, Steve offers these thoughts about such a language design:

One problem I have with any DSL language (and that includes smartfrog) is remembering all the rules about whether to use colon or equals for assignment, whether you should terminate lines with semicolons or not, whether lists are allowed a trailing , at the end ["like","this",], what the unicode story is (or whether it just uses the local encoding), how to escape stuff, etc. All the little details that XML handles for you.

For all its ugliness (and if you think XML is bad, look at RDF-in-XML), it at least gets some things right
-tool neutral (though the way ant abuses XMLNS blurs this)
-good internationalization
-not as terse as perl

In mentioning things like ":=" vs. "=", I think Steve's comments point out the Tower of Babel effect in front end language syntax, where disagreement, or a lack of convention, leads to confusion. (Speaking of Babel, notice that one of the areas lacking convention is internationalization.) Of course, part of the fun of defining a pretty language syntax is expressing the world in your own favorite terms. It's making innovations like "let's use indentation for structural information" and "let's make semicolons optional".

I think there's clearly a place for the front end syntax, the efficient way of expression, even if it breaks with convention somewhat. If people aren't having fun writing the syntax, they're not going to use it. But there's also a place for the structure and conventions of XML, or another structured model syntax.

This brings us back to the idea in the post's title, XML-based languages. It may be that in the future you follow Ant's path. You need to design some kind of language processor, and your first step is to define an XML format for it. You can instantly begin writing programs in the XML format, and can begin work on the interpreting or translating side. Maybe you'll write a compiler as an XSL file. Then, once you get the breathing room, you can design your front end language and write the parser.
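As a rough sketch of that first step, here is a toy arithmetic language whose only syntax is XML, interpreted in Python with the standard `xml.etree.ElementTree` parser. The element names `add`, `mul`, and `num` and the `value` attribute are invented for this example; no parser had to be written to get a working interpreter.

```python
import xml.etree.ElementTree as ET

# A program in the hypothetical XML "model syntax": 2 + 3 * 4
PROGRAM = """<add>
  <num value="2"/>
  <mul>
    <num value="3"/>
    <num value="4"/>
  </mul>
</add>"""


def evaluate(node: ET.Element) -> float:
    """Evaluate one element of the toy language recursively."""
    if node.tag == "num":
        return float(node.attrib["value"])
    # Every other element is an operator applied to its child elements.
    args = [evaluate(child) for child in node]
    if node.tag == "add":
        return sum(args)
    if node.tag == "mul":
        result = 1.0
        for arg in args:
            result *= arg
        return result
    raise ValueError(f"unknown element: {node.tag}")


print(evaluate(ET.fromstring(PROGRAM)))  # prints 14.0
```

A front end parser added later would only need to produce the same element tree, and `evaluate` would not have to change.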

Or, perhaps, you'll just decide to hijack the ingenious parenthesized syntax of LISP/Scheme, the languages whose front end syntax is also a perfectly legitimate, unambiguous, tree-structured model syntax.
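To give a sense of how little machinery that syntax needs, here is a minimal S-expression reader in Python. It is only a sketch: it handles parentheses and bare atoms, with no strings, quoting, or comments, and the function names are my own.

```python
from typing import List, Union

SExpr = Union[str, List["SExpr"]]


def tokenize(src: str) -> List[str]:
    """Split source text into parentheses and atoms."""
    return src.replace("(", " ( ").replace(")", " ) ").split()


def read(tokens: List[str]) -> SExpr:
    """Consume one expression from the front of the token list."""
    if not tokens:
        raise SyntaxError("unexpected end of input")
    tok = tokens.pop(0)
    if tok == ")":
        raise SyntaxError("unexpected )")
    if tok != "(":
        return tok  # an atom
    items: List[SExpr] = []
    while tokens and tokens[0] != ")":
        items.append(read(tokens))
    if not tokens:
        raise SyntaxError("missing )")
    tokens.pop(0)  # drop the closing paren
    return items


def parse(src: str) -> SExpr:
    """Parse exactly one S-expression from a string."""
    tokens = tokenize(src)
    tree = read(tokens)
    if tokens:
        raise SyntaxError("trailing tokens after expression")
    return tree


print(parse("(+ (^ a 2) (* b (+ (* c d) e)))"))
# prints ['+', ['^', 'a', '2'], ['*', 'b', ['+', ['*', 'c', 'd'], 'e']]]
```

The whole reader fits in a few dozen lines, and the nested lists it returns are already the tree, which is the sense in which the front end syntax doubles as the model syntax.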

1 comment:

Susan's Husband said...

A good analysis. I still despise the decision in XML to require the end tag to be repeated. Get rid of that and you'd basically have LISP syntax. It would have extremely little effect on parsing and generation (if anything, they would be simpler) and greatly increase the ease of human use. Heck, you could probably solve the ANT issue by having a front end language that stuck the closing tags in for you.