JFKBits: March 2007

Wednesday, March 21, 2007

Introducing Ila

Today I'd like to tell you a little bit about a programming language you've never heard of. I know you've never heard of it, because I'm not going to use its real name. (That was comment bait: name the logical error I just made.) It wouldn't really help if I did use its real name, because it exists only in academic papers written by my Master's degree advisor and in the compiler work I did for that degree.

This language is the chief reason this blog exists at all: I want to share the things I learned from the work on Ila. From the design of the language, I learned about object-oriented programming: what it is, what it isn't, and how simple it actually is. The language design, which is mostly my advisor's, exposes the nuts and bolts of individual OO features. And even though I had worked on several interpreter and compiler projects before, with the Ila compiler I learned a lot about language implementation, such as how to do type inference in the presence of recursive types.

I don't claim to be an omniscient programming language expert. But the Ila project exposed me to the inner workings of some stuff you usually don't hear too much about, and I think I have something to contribute to both the programming community about things object-oriented and the language and tools community. I had hoped to write some things up for PLDI 2002, but life went on and I never did. Now this blog gives me an outlet to articulate some of what I learned.

My hope is that you can learn from what I did by reading about it, without actually having to do it, though perhaps my words to come on this topic will inspire you to do something with the ideas.

Let's get to a very brief high level introduction then.

The Ila Programming Language

Ila (pronounced "EE-la") is what I'm now calling the language accepted by the compiler I wrote for my master's thesis work between 1998 and 2001. Here's a quick run-down of what it is:

Ila is a statically-typed object-oriented language conducive to functional programming with SML-like syntax.
Hello, world looks like
```
println "Hello, World."
```
Ila is influenced by and conceptually related to the Pizza programming language and the Moby programming language.
Ila is implemented as a translation to Java source code.
Ila is type-crazy (static typing, no typecasts, elaborate subtyping rules) but has the type inference to make it endurable.
Side-effects are clearly isolated by the type system: code is divided into expressions (free of side-effects) and commands (allow side-effects). An Ila program can be either of these, an expression (purely functional) or a command (usually a sequence of commands).
Functions are pure and have no side-affects, in syntax something like fn x => x.
Procedures can have side-effects, and in syntax look something like prc n => println "Hi, "+n. Functions cannot call procedures (of course procedures can call functions).

Data structures are built from lists, tuples, and records. The syntax for these is thus:

[]                               (* empty list *)
["lax","sjc","ord"]              (* list with 3 strings *)
(54,40)                          (* pair, or 2-tuple *)
{lat=41.97944,long=-87.90417}    (* record with 2 fields *)

{dep=("21/3/2007 1020-8","lax"),
 arr=("21/3/2007 2312-6","ord")} (* tuples nested in a record *)

The most primitive object is the variable, which has get and put methods. Get is a function, and put is a procedure. Variables are instantiated by the Var class, which is a generic class that operates on any type, the type of the value you want the variable to store.

Some of the more mind-blowing, or at least esoteric, stuff about Ila:

When we say "conducive to functional programming", we include the notion that classes are first-class. This makes it possible to use a combining style of programming, writing functions that take classes as arguments and return a new class.
Ila has limited support for parametric polymorphism (or "generics") , in the sense that of necessity some built-in types and functions could operate on any type. (User-introduced generics were something that was on the hazy roadmap.)
Most objects are represented by records, which in Ila (as in many theoretical languages used for academic study of object-oriented theory) is an orderless collection of named values, like {lat=40,long=37}. This notion of record is a little like a dictionary or hash with a fixed collection of key-value pairs. Though an individual record value has a fixed number of fields, there are record operators for producing records with more fields (to implement subclassing) or with different values for some fields (to implement method override in a subclass).
There is also a fixed-point operator which works on records. While the need for the effect of the fixed-point operator is universal and included in all OO languages I'm aware of, I've never heard of its being exposed as an operator. Implementing this was quite interesting.

Since you're a smart reader who can fill in a lot of details without my explaining, let's end this introduction with a couple of real Ila programs.

The first, add2.ila, is an elaborate interface to your computer's ALU, but it also shows how variables (x and y) are created, which happens to be the same way any other objects would be instantiated, and it also shows a subtlety of subtyping in action. Bonus points to whoever can identify the subtlety. Hint: you will need to make an assumption about the types of promptread's parameters.

add2.ila

new x isa Var[int];
new y isa Var[int];
promptread("Enter a number:",x);
promptread("Enter another number:",y);
print "The sum of your numbers is ";
print(x+y);
println "."

The second program, redhen.ila, is an example fun to code up in a functional language because it emits a familiar children's story with a strong repetitive pattern, and the code shows data structures, functional style, and the command/expression interaction. The app function is like in SML, it takes a procedure and a list and applies the procedure to each element of the list.

redhen.ila

let
    val LRH = "Little Red Hen"
    val lazyanimals = ["duck","dog","cat"]
    val actions = ["plant the wheat","harvest the wheat", 
     "grind the wheat","bake the bread"]
    fun respond willing = prc animal =>
    (println("\t\""+
      (if willing then
          "I will"
       else
          "Not I")+
       ",\" said the "+animal+"."))
    fun questionResponse willing = prc action =>
    (println("\t\"Who will help me "+action+"?\" asked the Little Red Hen.");
     app (respond willing) lazyanimals)

    proc unwilling action =
    (questionResponse false action;
     println("\t\"Then I will do it all myself,\" said the "+LRH+".");
     println"")

    proc willingly action =
    (questionResponse true action;
     println("\t\"Hmm, that seems strange,\" remarked the "+LRH+",\"");
     app (prc action => (println("\t I did "+action+"."))) actions;
     println "\t I should also eat the bread.\"")
in
     println("\n\n\t\tThe Story of the "+LRH+"\n\t\t\tas told by Ila\n\n");
     app unwilling actions;
     willingly "eat the bread";
     println("\tSo the "+LRH+" shared the bread with her little chicks.");
     println"\n\t\t\t\tTHE END"
end

That's Ila in a tiny nutshell. Let me know what you think, and I'll try to pick a good place to begin discussing how Ila can help us understand the rest of the programming world.

Monday, March 19, 2007

Identifiers in any language

You might be a language geek if you ever thought...

Wouldn't it be great if there was a program that you could feed an identifier, tell it a language, and it would tell you if that identifier was valid for the given language? Or tell you useless stuff like all the languages the identifier is valid in? Wouldn't it be great if this program knew all about identifiers in any programming language you could think of?

You might also just be too lazy to memorize the rules or try running a program with it, and too particular to go with something safe, an identifier you know will work.

The inspiration for this idea: I want to know if hyphens are OK in CSS identifiers. It turns out they are (the W3.org syntax reference says so).

This sort of question pops up a lot in my experience, and I suppose the identifier question is not that critical. But the idea of a one-stop language reference with some level of interactivity is appealing. So, I think there's room for a C3PO "protocol droid" program that just knows stuff about all the languages out there. Rules for identifiers would be a decent place to start, followed by rules and information for simple stuff like constants and literals.

As is often the case with my ideas, they've already been done, marketed, and obsoleted by something better, so if you know of an implementation of this idea, leave a comment.

Saturday, March 17, 2007

Progress report: Reproducing the TIOBE PCI

In which reproducing the TIOBE PCI is found to be trickier than it first appeared.

Recently we looked at how TIOBE describes that its Programming Community Index is computed. The description, while detailed, somewhat elusive, and curiosity overcame me. I've been hacking at a system for reproducing the results.

I've now got results, and they don't match what's on the web. The current disconnect is that I forgot about the phrase "for the last 12 months" that comes at the end of "The search query is executed for the regular Google, MSN, and Yahoo! web search and the Google newsgroups and blogs."

So, before I give you some notes about what I've learned so far, let's compare my results (so far) with what TIOBE published for March:

Position	Language	My calculated ranking	TIOBE ranking

Thursday, March 15, 2007

XMLisp

Just when you thought the XML thread (see XML-Based languages) here was going to die...

Lambda The Ultimate points us to XMLisp, an integration of XML into LISP.

I had been toying with some ideas for a humorous post here proposing that HTTP be modified to accept LISP syntax for web pages, so we could write things like


(html
   (head
      (meta content = "text/lhtml; charset=utf-8" http-equiv = "Content-type")
      (title "JFKBits"))
   (body
      (h1 "A blog about languages and stuff")
   ))

But I quickly realized it was not as trivial as it first seemed. For one thing, I haven't attained sufficient literary control to attempt humor.

In the series on XML, we were talking about using XML and LISP syntax for representing parse trees. But XML and HTML (and SGML) are markup languages, fundamentally text marked up with semantic tags. LISP syntax is fundamentally a structure with text thrown in. Or is it?

XMLisp, which is really about integrating CLOS and XML, like an XML binding system for CLOS I *think*, does make use of integration XML syntax directly into LISP without the use of quotes:


(inspect

<HTML>

<font face="arial, sans-serif" size="-2">Small Text here</font>
 <font face="arial, sans-serif" size="+2">Large Text here</font>
 <a href="http://www.cs.colorado.edu">Go CU</a>

</HTML> )

It's fascinating to watch the interplay of the two simple syntaxes. They're able to take advantage of LISP syntax's minimal token list. Most XML, at least the stuff in the examples, doesn't itself include parentheses, as most other syntaxes rely on. The whitespace naturally present in the XML simply delimits which parts are atoms, and the original XML can trivially be reconstructed by inserting a single space.

Tuesday, March 13, 2007

Types of garbage

At some point, many a Java programmer has had a little bulb go on, and exclaimed with dropped jaw "I thought Java couldn't have memory leaks!"

Well, you don't have leaks like in C or C++, where leak occurs simply by forgetting to call free or delete. Objects leak by default:


   new std::string("hello");

is a leak. In Java, objects get deleted by default:


   new String("hello");

is instant garbage to be collected. You have to do something to remember an object, like assign it to a variable or store it in a live object. And that's actually the type of "leak" that can happen, where the programmer stores a reference to an object in a long-lived structure but never uses the object again.

I was pleased to discover that the Wikipedia article on garbage collection and on garbage give terms to distinguish these two cases:

Syntactic garbage: Memory that is unreachable from variables in the program.
Semantic garbage: Memory that is unusable by further execution of the program.

The latter category is not possible to algorithmically determine in the general case (halting problem type of stuff, bad idea to want by algorithm how the execution of a program will turn out), garbage collection algorithms tend to focus on the syntactic variety of garbage.

The next time someone asks you to take out the trash, you have fuel for a smart remark. Something like "which did you mean, the objects that we're not using because they're in the trashcan, or the objects that we could use but we haven't been using, and we never will, but we don't know that for sure now, so we better not discard them yet..." Let me know how far you get with this type of reasoning as a way of avoiding your GC chores.

Friday, March 09, 2007

Marketable language names

OK, it's Friday afternoon on a spring-like day and the Editorial Staff is in a goofy mood. In case you didn't realize, we had about a thousand visitors this week thanks to a couple of posts being picked up by reddit and dzone.

This discussion of marketable names for programming languages draws a parallel between the success of Coca Cola and its name, and the opposite effect for Slug Cola, despite the latter's being regarded as tasting better.

"A Nickel's Worth" approves of the names C, C++, FORTRAN, and Java (I would also add Ruby and perhaps Perl). He disapproves of most other names, apparently, including ERLang, Smalltalk (doesn't sit well with managers and executives), and LISP (obvious reasons?).

One of the reasons given for the approval of "Java" is "It's short, it's cryptic, it's friendly; it implies a connection with a caffeinated beverage." Other good ones that fit the same description could be Cola and Jolt. But I thought of another good one.

Darjeeling.

Stop that sniggering, and let me finish. It's Friday, remember? I gotta get home.

Darjeeling is a real place, like Java. It's associated with a caffeinated beverage. It has a great filename extension: drj. I like to think it's got a more sophisticated ring than Java, so obviously it needs to support the Thinking Programmer's language features like lambdas and sum types.

It does have some down sides. It's not short, that part is true. Nobody knows how to pronounce it. Rather than jars, people would want to know whether to ship their code in .bags or as loose leaf...

OK, so this is harder than it looked at first. You got any bright ideas?

CodeCodex

Today at DZone.com I ran across the CodeCodex which could subsume the Software Provenance Database. The latter is (will be?) a place to store code snippets that help your program figure out "where am I?". The idea of CodeCodex is simply to be a repository of any small bit of useful code, from sorting implementations to CSV import code, in any language.

Google still does a reasonable job at finding existing code of course, and for things like these they're usually in the format of articles with some good explanatory text. CodeCodex uses a Wiki so the code can also be presented as an article, and can include implementations in multiple languages. Anyone can contribute, refine, and discuss once they land on the page.

This thing has been submitted to Digg (twice) and reddit within the past year, but I hadn't heard anything about it before.

Tuesday, March 06, 2007

Small talk about Eclipse Shell

Today we're chatting with Werner Schuster about his Eclipse plug-in Eclipse Shell. Werner is giving a talk on his work at eclipseCON 2007, March 5th-8th in Santa Clara, California. I was able to get a few words with Werner about Eclipse Shell and Eclipse Plug-in development before he headed off to the conference.

Screencast: Beanshell editing

Eclipse Shell puts an interpreter environment for a dynamic language, like JRuby, inside Eclipse. BeanShell and JavaScript (Rhino) are also supported, and the language interface is abtracted so you can add more languages. What would you do with such a shell? You can prototype new Java classes. You can debug Eclipse plug-ins directly. You can script repetitive tasks inside Eclipse. I've included a couple of screencasts in this post such as the one at right, and there are more where that came from.

What does all this mean? A Javascript interpreter inside an IDE? A plug-in debugger? Let's take a step back and survey the landscape before returning to these questions.

Eclipse and software development

While we've had IDEs for a while, even ones you can extend, nothing has caught on like the zero-cost Eclipse "open development platform" with its system of plug-ins and rich-client applications.

"Eclipse is the new EMACS" is something we've been hearing lately. And there's a striking similarity. Both are useful programming tools right out of the box, intended to be widely extended. Out of the box, Eclipse has an extensive set of Java development tools complete with web browser. With plug-ins you can not only get development tools for other languages but email and ssh clients, tetris and sudoku diversions, and stock quotes. I haven't yet found an equivalent of M-x doctor but I'm sure it's out there somewhere or soon coming.

Eclipse may be the new EMACS, but it's headed toward a higher goal: becoming the old Smalltalk. A lot of the people involved with Eclipse, such as Erich Gamma, came from a Smalltalk background. Eclipse itself grew out of Visual Age for Java. If you compare the interfaces, you see similarities, like the Smalltalk class browser and the Eclipse Java browsing perspective. But where Smalltalk historically has combined the language and the development environment, that's not the case with Java and Eclipse (note the origin of the name "Eclipse"), where Sun first brought us the language and then IBM the environment. The Java language is not Smalltalk, and the Eclipse environment inherits the limitations. "You can fiddle with the bytecodes for a class," says Werner, "but you can't change the class hierarchy at runtime." Java is also something that is not generally interpreted, whereas Smalltalk has the workspace.

Smalltalk workspace

One of the nice things about the Smalltalk environment is that you can change the development environment while you're using it. This is a little mind-blowing, so let's focus on one aspect, the workspace. The workspace is a Smalltalk shell window, or interpreter window. You enter code, evaluate it, see the results.

An interpreter is nice to have, even for a compiled language. It lets you quickly experiment with language features without the overhead of creating a file and building it. What makes the workspace more powerful than a command-line interpreter is that it's running within the same virtual machine that your deployed code is running, giving you access to the code and data you're already working with. It's a little like the difference between the artificial practice situations in a foreign language classroom compared with the immersion experience of being in the country where that language is spoken; it's the power of the language plus the environment where the language is really needed and used.

That's what Eclipse Shell is, the Smalltalk Workspace for Eclipse.

Interacting with Eclipse internals

Screencast: Prototyping

You may have noticed the languages supported by Eclipse Shell are dynamic languages that run on the Java Virtual Machine. This allows you to do things like write Ruby code that creates, inspects and manipulates Java objects.

You can leverage the interactivity for prototyping. I know when I was developing Swing applications, compiling and running soon got very old for activities like tweaking layouts, and I started to long for the interactivity of HTML. Other people had the same idea that occurred to me at that point, and invented a slew of UIMLs (User Interface Markup Languages) for AWT/Swing. To help me figure out layout issues, I spent time writing an object inspector that walked the Component hierarchy and showed me all the settings, a tool which had to be compiled and built into the original application. An alternative technique to both of these is an interactive interpreter that allows you to inspect and tweak the layout at runtime without recompiling. When developing Eclipse plug-ins, which are based on SWT, Eclipse's own GUI widget toolkit, you have similar issues. Something like Eclipse Shell can be a welcome change to launching a new copy of Eclipse each time you want to tweak that great object heap walker GUI you're writing.

Guerrilla debugging

Prototyping and debugging are sometimes very similar. Russ Olsen describes the technique of using BeanShell to debug Java programs live in Guerrilla Debugging for Java. He talks about debugging a Java web application by creating a servlet wrapper for one of the scripting languages, allowing you to use something like JRuby to poke at the state of your web application. If you're developing Eclipse Plug-ins, the equivalent approach is Eclipse Shell.

"This is the way I most often use Eclipse Shell myself," says Werner. "When I'm running a plug-in and something doesn't look right, I can fire up Eclipse Shell and examine the live data."

Conclusion: towards a more Smalltalkish Eclipse

The Eclipse Shell is a useful utility and finds its place in the software development toolkit universe alongside Smalltalk Workspace and the Emacs-Lisp interpreter. It definitely finds its best use by Eclipse plug-in developers, but as Eclipse matures I think it also has room to grow as an Eclipse task scripting tool. And even if you're not a Java programmer let alone an Eclipse user, I hope you've gotten an idea or two about the advantages of interactivity in the programmer's development environment.

Werner Schuster (murphee) is a programmer with a focus on Java, Eclipse, and dynamic languages such as (J)Ruby and Mathematica. Among other things, he's developed Eclipse plugins since 2003, both commercially and in the OpenSource space. His blog is at http://jroller.com/page/murphee.

Friday, March 02, 2007

How TIOBE Says its Programming Community Index is calculated

Recently I mentioned and then featured the TIOBE Programming Community Index. This is one measure of the relative popularity of about 100 different programming languages.

As has been noted elsewhere, popularity is important to the user base of a particular language, because a popular language is an employable language. There are both considerations of vested interests ("I spent 2 years and $1400 in books, courses and conference fees learning C#") and personal preference ("I would love to write OCaml code all day long"). Unless you're a Paul Graham with your own business and the executive authority to decree All Will Be Done in LISP, you want to proselytize other developers to use your language, and monitoring popularity statistics is a way to track progress towards adoption of your pet language. (I should note that 'Siamese Fighting Fish' would make a great name for a programming language, especially if it was in a beta release.)

How TIOBE says the PCI is calculated

On the front page of the Index, TIOBE says

the ratings are based on the world-wide availability of skilled engineers, courses and third party vendors. The popular search engines Google, MSN, and Yahoo! are used to calculate the ratings.

However, there's a link to the "definition" of the PCI, and it says something a little different.

TIOBE rating comes from search engine result counts

If you follow the link in the sentence that says "the definition of the TIOBE index can be found here" (TIOBE loses points for web design as the definition page redirects itself if loaded outside a frame set), it explains how the various columns are calculated. The "rating" column is the one used to sort the table, and the one most people will quote when they say "Language X is number 12". It turns out this value is calculated from a cleaned-up page hit count for the search query

+"<language> programming"

Jobs, courses and vendors not mentioned in the definition

There doesn't seem to be a mention here in the definition page of those elements mentioned on the front page of "availability of skilled engineers, courses and third party vendors." The FoxPro community was kind of counting on that to be the case. I can't reconcile the discrepancy between the description on the index page and the definition given later. I've checked the TIOBE index via the WayBack Machine for 2004, the year of that FoxPro post, and the description and definition are unchanged in this respect.

"<language> programming" as a choice of search query

So now we all know what we need to do: stop writing about "programming in Ocaml", "OCaml hacking", and "writing a widget in OCaml", and starting writing about "making a widget through OCaml programming". I can see how this search query phrase might generate fewer false positives than using simply the language name (where "C", "D", "Natural", and "Logo" have an unfair edge), but is this really the best and only query to use? Fortunately TIOBE does a few more queries to fix up the results. They have a list of "groupings" and "exceptions", where searches for "J2EE programming" are counted towards "Java", and "3-D Programming" is excluded from results for "D programming". (There, I just artificially inflated some counts.) Still, I think there might be additional queries of different patterns that could be helpful.

Page hit count not well-understood

Technorati CEO David Sify wonders where the Google and Yahoo! page hit counts come from, if you can only view about 1000, and this thread is continued here in a discussion on blog.searchenginewatch.com. These numbers are also presented as an approximation, so I wonder how much these counts can be off and how that might affect the TIOBE rating.

Reproducing the PCI

Since TIOBE has given a recipe for the way the rating is calculated, it should be easy to reproduce it and compare the results to those published. The current recipe is a little unclear on the rating calculation, I think they're missing some parentheses, but the April 2006 description was a little more clear.

The algorithm seems to say the rating is an average of the ratings on each search engine, and the rating for a search engine is done by dividing the page hit count for the language in question by the sum of the page hit counts for the top 50 languages.

How do you know what the top 50 languages are until you calculate the ratings? Or do they start with the previous top 50? I also wonder how they handle languages not in the top 50.

So, if you have the inclination, you can try to recreate the TIOBE PCI and see what you get. I have great confidence I can do this quickly in my spare time with my favorite web scraping secret weapon, but for now, I leave you with this thought:
Logo (#27) beats out Haskell (#41), so all we have to do is add monads to Logo and we'll be cranking out MIT-bound schoolkids like there's no tomorrow.

JFKBits