Wednesday, February 21, 2007

XML-Driven Language Tools

In Java in XML, JFKBits reader Mark kindly pointed us to several tools of the sort I imagined in XML-Based Languages. These tools parse the "front end" syntax and produce an AST, or some other representation of the source, in a form that is more accessible to other programs. The material on these tools puts the issues and applications more succinctly than I did, so let's sample some quotes.

Why would you want to?


From an FTPOnline article about JSIS, the Java Semantic Interface Specification:

Sample applications include browsing and navigation tools, code formatting and restructuring tools, source code generation tools, code coverage and test generation tools, code analysis and metric reporting tools, style and standard-compliance reporting, UML diagram and round-trip engineering tools, and interactive source code editing, to name just a few.


What's the issue?


From the introduction to JavaML:

The classical plain-text representation of source code is convenient for programmers but requires parsing to uncover the deep structure of the program. While sophisticated software tools parse source code to gain access to the program's structure, many lightweight programming aids such as grep rely instead on only the lexical structure of source code.

The reference to grep is telling: early in my career I worked for two weeks on an Emacs-Lisp program that would convert the structured pseudocode we were writing to C source code. It was a successful program, since it gave everyone a head start on their C source files, but it definitely relied on grep-style (regular expression) parsing and thus had certain inherent limitations.

From the introduction to GCC-XML:

Development tools that work with programming languages benefit from their ability to understand the code with which they work at a level comparable to a compiler. C++ has become a popular and powerful language, but parsing it is a very challenging problem. This has discouraged the development of tools meant to work directly with the language.

There is one open-source C++ parser, the C++ front-end to GCC, which is currently able to deal with the language in its entirety. The purpose of the GCC-XML extension is to generate an XML description of a C++ program from GCC's internal representation. Since XML is easy to parse, other development tools will be able to work with C++ programs without the burden of a complicated C++ parser.

This is particularly nice, considering the difficulty of constructing a C++ parser.

As I said in responding to Mark's comment, it would be nice if these parser modules, if you will, became standard practice in new language development. Tool support helps a new language gain acceptance, and providing programmatic access to the AST would lower the barrier to entry for tool development.
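
To make that concrete, here is a rough sketch of the kind of downstream tool this enables: a few lines of Java using the standard DOM API to walk an XML rendering of a program and list each class's fields. The element and attribute names (class, field, name, type) are placeholders of my own; JavaML and GCC-XML each define their own schema, so read this as the shape of such a consumer, not a working client for either format.

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;

// Sketch: list each class and its fields from an XML-encoded AST.
// The element/attribute names here are made up; real schemas differ.
public class ListMembers {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File(args[0]));
        NodeList classes = doc.getElementsByTagName("class");
        for (int i = 0; i < classes.getLength(); i++) {
            Element cls = (Element) classes.item(i);
            System.out.println("class " + cls.getAttribute("name"));
            NodeList fields = cls.getElementsByTagName("field");
            for (int j = 0; j < fields.getLength(); j++) {
                Element f = (Element) fields.item(j);
                System.out.println("  " + f.getAttribute("type")
                        + " " + f.getAttribute("name"));
            }
        }
    }
}

The point is just that once the front end has done the parsing, a tool like this needs only an off-the-shelf XML library, not a C++ or Java parser.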

5 comments:

Susan's Husband said...

I think Wave is a better approach, at least for C++. XML is just a mechanism. The goal is to be able to operate on the semantic information in the source code.

jfklein said...

I agree, there's nothing super special about XML. You get existing tools to handle marshalling and unmarshalling (parsing and serializing), to format it, and all that, but you're right, that's not the main goal.

Wave is a "C++ lexer with ISO/ANSI Standards conformant preprocessing capabilities". I have to think about that a little bit; it's different from what I'd been imagining, namely a parser emitting an AST in a structured persistent format (XML, LISP, etc.). I'm trying to imagine what use just the lexer would be.

I have heard of using just the lexer portion to do tricks with syntax-aware editors. If you're editing a large file, thousands of lines, you run the lexer pass on the whole file after some typing and see if the file's token sequence has changed, and if so where. This is faster than re-parsing the whole thing every time the user pauses in typing.
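
Just to sketch the comparison step (in Java for concreteness; the tokens here are plain strings, where a real editor's lexer would carry positions and token kinds too):

import java.util.Arrays;
import java.util.List;

// Sketch of the token-diff idea: compare the previous and current token
// sequences and report the first index where they differ, so the editor
// only has to re-examine the region around the change.
class TokenDiff {
    static int firstDifference(List<String> before, List<String> after) {
        int n = Math.min(before.size(), after.size());
        for (int i = 0; i < n; i++) {
            if (!before.get(i).equals(after.get(i))) return i;
        }
        // One sequence is a prefix of the other, or they're identical.
        return before.size() == after.size() ? -1 : n;
    }

    public static void main(String[] args) {
        List<String> before = Arrays.asList("DOUBLE", "ID:re", "SEMI");
        List<String> after  = Arrays.asList("DOUBLE", "ID:rho", "SEMI");
        System.out.println(firstDifference(before, after)); // prints 1
    }
}

Matching from the end of the sequences as well would bound the changed region on both sides, which is what lets the editor limit how much it re-examines.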

Susan's Husband said...

There are a lot of tools out there that want to do meta-programming with a C++ base (the Qt graphics library comes to mind, but there are a lot of other tools that do that, e.g., the ITE, if you remember that). What Wave does is support that kind of activity by letting client code iterate over the syntactic elements the way you would an XML document. E.g., client code can iterate until it gets to a class definition, then iterate over the members and their types and emit ancillary code on the side. Wave (AFAIK) is designed with that kind of application in mind, although it might also be useful for development environments.

jfklein said...

I missed the part of the Wave description where it gave you access to the syntactic structure -- all I saw was it talking about lexing, i.e. emitting token streams.

I suppose in the example you gave, of a tool that wants to discover the members of a class, you could write a recursive descent parser that simply skips most of its input. When you hit a method definition, you skip it. Still, you need to "count parentheses", don't you?

A class like this:
class Complex
{
  double re;
  double im;
  double abs()
  {
    return sqrt(re*re+im*im);
  }
  double angle()
  {
    return atan(im/re);
  }
  // what if members are declared here?
};

would generate a token stream like this:

CLASS ID:Complex LBRACE DOUBLE ID:re SEMI DOUBLE ID:im SEMI DOUBLE ID:abs LPAREN RPAREN LBRACE RETURN ID:sqrt LPAREN ID:re TIMES ID:re PLUS ID:im TIMES ID:im RPAREN SEMI RBRACE DOUBLE ID:angle LPAREN RPAREN LBRACE RETURN ID:atan LPAREN ID:im DIVIDE ID:re RPAREN SEMI RBRACE RBRACE SEMI

The token stream lacks the syntactic structure, so even if you were only interested in gathering members and wanted to skip the method definitions, you'd have to count left and right braces and make sure you didn't leave something out.
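
Here's roughly what I mean, with the token stream above hard-coded as strings (Java just for brevity; a real lexer would hand you richer token objects than these spellings):

import java.util.*;

// Sketch of the brace-counting idea over a flat token stream: collect
// declarations at class-body depth and skip anything nested deeper,
// i.e. the method bodies.
class MemberScan {
    public static void main(String[] args) {
        List<String> toks = Arrays.asList(
            "CLASS", "ID:Complex", "LBRACE",
            "DOUBLE", "ID:re", "SEMI",
            "DOUBLE", "ID:im", "SEMI",
            "DOUBLE", "ID:abs", "LPAREN", "RPAREN", "LBRACE",
            "RETURN", "ID:sqrt", "LPAREN", "ID:re", "TIMES", "ID:re",
            "PLUS", "ID:im", "TIMES", "ID:im", "RPAREN", "SEMI", "RBRACE",
            "DOUBLE", "ID:angle", "LPAREN", "RPAREN", "LBRACE",
            "RETURN", "ID:atan", "LPAREN", "ID:im", "DIVIDE", "ID:re",
            "RPAREN", "SEMI", "RBRACE",
            "RBRACE", "SEMI");
        int depth = 0;
        for (int i = 0; i < toks.size(); i++) {
            String t = toks.get(i);
            if (t.equals("LBRACE")) { depth++; continue; }
            if (t.equals("RBRACE")) { depth--; continue; }
            // At depth 1 we're directly inside the class body.
            // "DOUBLE ID:x SEMI" is a data member; "DOUBLE ID:x LPAREN" starts a method.
            if (depth == 1 && t.equals("DOUBLE") && i + 2 < toks.size()
                    && toks.get(i + 2).equals("SEMI")) {
                System.out.println("member: " + toks.get(i + 1).substring(3));
            }
        }
    }
}

The depth counter is exactly the "counting parentheses" you can't avoid once the structure has been flattened into a token stream.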

Susan's Husband said...

Hmmm. I haven't read much of the Wave documentation, just a lot of discussions about it and anticipated applications, so it's possible that I have misread those.