Tuesday, July 18, 2006

Alphonse Bertillon and 19th Century Databases

Earlier I described an idea that I called the Software Provenance Database. The essence of the idea is a database of information on how to tell what version of a piece of software you're using.

I didn't like the name though, and remembered reading about Alphonse Bertillon, a French police employee who in 1882 presented "anthropometry", a technique of identifying a person based on body measurements and other observations of unchanging features such as scars.

Hence I'd like to consider using M. Bertillon's name or story for this database.

Further, I wanted to add that this database certainly should not be limited to determining versions of an executable. The idea is to gather in one spot any information about how you can dynamically determine the state of your computer, and this can include configuration information (how much memory is installed?) and data schemas or file formats.

If you have a few moments, it would be interesting to read this account of how Bertillon's method works, and to think about a database search which involves no computational machinery. Notice, for example, his method for partitioning the records into equal portions to make the search quicker. It is to me a good example of what I think Dijkstra meant when he said "Computer Science is no more about computers than astronomy is about telescopes."
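The partitioning idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the field names (head_length, head_width), the sample numbers, and the use of the same cut points at every level are all invented for the example, not taken from Bertillon's records. Each measurement is classed as small, medium, or large, with the cut points chosen so that each class holds about a third of the cards, and a card is filed in the drawer named by its sequence of classes.

```python
from collections import defaultdict

def tercile_cuts(values):
    """Return two cut points that split the values into three roughly equal groups."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered[n // 3], ordered[(2 * n) // 3]

def classify(value, cuts):
    """Map a measurement to 0 (small), 1 (medium), or 2 (large)."""
    low, high = cuts
    if value < low:
        return 0
    if value < high:
        return 1
    return 2

def build_cabinet(records, fields):
    """File every record in the drawer keyed by its class for each field."""
    cuts = {f: tercile_cuts([r[f] for r in records]) for f in fields}
    drawers = defaultdict(list)
    for r in records:
        key = tuple(classify(r[f], cuts[f]) for f in fields)
        drawers[key].append(r)
    return cuts, drawers

def lookup(measurements, cuts, drawers, fields):
    """Classify a new set of measurements and open only the matching drawer."""
    key = tuple(classify(measurements[f], cuts[f]) for f in fields)
    return drawers.get(key, [])

# Hypothetical cards: nine people, two measurements each (millimetres).
FIELDS = ["head_length", "head_width"]
cards = [
    {"name": f"p{i}", "head_length": 180 + i, "head_width": 150 + (i * 4) % 9}
    for i in range(9)
]
cuts, drawers = build_cabinet(cards, FIELDS)
found = lookup({"head_length": 184, "head_width": 157}, cuts, drawers, FIELDS)
```

With two measurements there are nine drawers, so a search looks at roughly one ninth of the cards; every extra measurement divides the pile by three again. A real system would also have to check neighbouring drawers when a measurement falls close to a cut point, which this sketch ignores.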

Monday, July 17, 2006

Rantlet: Look who (or what) knows so much

My CD is dirty or damaged. I knew this already. My computer knows it too:

Windows Media Player cannot play the CD. The disc might be dirty or damaged. Turn on error correction, and then try again. [Close] [More information]

So it knows the problem. I'm impressed. But I don't know how to turn on error correction. Why doesn't it do something about it? Why isn't there a button labelled "So turn on error correction already!"? I don't want to get my hopes up here, but maybe they could even just turn on error correction for me, if they think that might help.

OK, I'm calming down. I had popped in a familiar CD to help me focus, and now I'm having to write to get out my frustration. If you click "More information", it does at least tell you how to turn on error correction. And the error correction, as they tell you, doesn't always work and might cause your CD to skip when it wasn't skipping before. OK, so there's a legitimate choice involved that is best left to the user. And I'll allow that this dialog may be fitting into a standardized error dialog which always has those two buttons, "Close" and "More information". OK, no more to say. Just a vague, undefined dissatisfaction that will probably induce me to try something different if it comes along.

Friday, July 14, 2006

False friends and Picasa web movies

Yesterday I tried out Python for real, applied to a problem I have with Picasa's web page generator.

What problem? For movie files, Picasa generates an embedded Windows Media Player control that does nothing. This is easily fixed by substituting this simple embed link:

<embed src="movie-filename" width="320" height="256" />
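For the curious, the substitution could look something like the sketch below. It is not the actual fixpicasamovies.py script. It assumes, purely for illustration, that the broken player control appears in the page as an `<object>...</object>` block; the names OBJECT_BLOCK, embed_tag, and fix_page are made up for this example.

```python
import re
from html import escape

# Assumption: the non-working control is an <object>...</object> block.
OBJECT_BLOCK = re.compile(r"<object\b.*?</object>", re.IGNORECASE | re.DOTALL)

def embed_tag(movie_filename, width=320, height=256):
    """Build the simple embed link for one movie file."""
    src = escape(movie_filename, quote=True)
    return f'<embed src="{src}" width="{width}" height="{height}" />'

def fix_page(html, movie_filename):
    """Replace every object block in the page with the embed link."""
    return OBJECT_BLOCK.sub(lambda _match: embed_tag(movie_filename), html)

page = '<p>before</p><OBJECT id="x"><param name="a"></OBJECT><p>after</p>'
fixed = fix_page(page, "clip.avi")
```

A regular expression is good enough for a page produced by a generator with predictable output; for arbitrary HTML a real parser would be the safer choice.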

It was an interesting, refreshing experience to write in a new language, bringing to it expectations and baggage from Perl, Scheme, ML, and Mathematica. It was a process of "how do I read a command line argument?", "now how do I open a file?", and "what about creating an empty list?" I could bring my assumptions about the language semantics and capabilities and ask "how do I..." over and over.

A little like learning one Romance language after knowing two or three others, yes?

I spent most of my time consulting the Quick Reference and occasionally the full Library Reference, with a boost from Introduction to Python/Hello World! to get started.

The things that gave me the most frustration in writing my 89 line script were definitely novice mistakes. If your critical attention is on a piece of code and you say "hey, that's not right!" it's a step up from the novice who doesn't realize something is out of place.

Mistake #1: Using parens to surround list literals: list = (). Not sure where I picked this up. Just as in SML, my native language, Python lists are surrounded by square brackets: [], [1,2,3],["and","so","on"]. Parentheses surround tuples, which are fixed-length. Since Python uses dynamic types, I could write list = () and nothing bad ever happened to me until I tried to append something to it. The error message finally clued me in on what type my variable was.
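A small reconstruction of the mistake, not the original script (the variable name items is mine):

```python
items = ()  # parentheses make an empty tuple, not an empty list

try:
    items.append("first")  # tuples have no append method
except AttributeError as err:
    message = str(err)  # the message names the offending type, 'tuple'

items = []  # square brackets make the empty list that was intended
items.append("first")
```

Nothing complains on the first line; the assignment is perfectly legal, which is why the mistake stays hidden until the first append.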

Mistake #2: Using comma instead of colon in a slice. I had written an expression like line[i+1,j]. Again, not sure why my fingers glibly typed the comma. The error was TypeError: string indices must be integers. That communicated to me that a substring, which is what I meant, could only be done with literal integers, not with integer-valued expressions. Spent a bunch of time in the documentation for other ways to get a substring before figuring out I had a comma when I needed a colon: line[i+1:j]. I think a non-novice Python programmer would have spotted the mistake immediately.
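Again a reconstruction rather than the original code; the sample string and the variable names are invented. The comma turns the index into a tuple, so Python sees a single, non-integer index instead of a slice:

```python
line = "key=value;rest"
i = line.index("=")  # position of the equals sign
j = line.index(";")  # position of the semicolon

try:
    line[i + 1, j]  # comma: indexes the string with the tuple (i + 1, j)
except TypeError as err:
    message = str(err)  # complains that string indices must be integers

value = line[i + 1:j]  # colon: the slice between the two positions
```

Integer-valued expressions are fine inside a slice; the only problem was the punctuation between them.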

I explained all this to my linguistically-minded wife, someone who actually does speak multiple languages, and she thought it sounded like I had made the "false friends" mistake. I used something with a meaning in one language, and instead of being gibberish in the current language, it meant something else. Something related enough to cause confusion for a while.

These trivial problems were fixed, and the fixpicasamovies.py script worked great.

Monday, July 10, 2006

Idea: Software Provenance Database

You know how Help -> About... tells you an application's version number? Wouldn't it be great if there were a similarly convenient way to find out, manually or programmatically, what version of something you've got, whether it's an operating system, browser, virtual machine, or compiler? The idea here is a database of how-to-tell-what-it-is information. It would store ways of determining version info manually and also programmatically, in as many different languages and environments as possible. For example, on a Mac you can get the OS name and version from a file. This works for both manual and programmatic access in almost any language. In Java you can also call System.getProperty("os.name") and get the OS name (the companion property "os.version" gives the version number). There can be both algorithmic and heuristic ways of determining the version of something. As an example of a heuristic how-to-tell, I've heard that some machines and TCP/IP stacks can be identified by examining their output with a network sniffer. These are the kinds of things that would be useful to have cataloged somewhere.
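As one example of the kind of entry such a database might hold, here is how the same question can be asked from Python using the standard library's platform module. The function name describe_environment and the dictionary keys are my own choices for the sketch.

```python
import platform

def describe_environment():
    """Collect a few how-to-tell facts about the machine running this code."""
    return {
        "os_name": platform.system(),        # e.g. "Linux", "Darwin", "Windows"
        "os_release": platform.release(),    # OS or kernel release string
        "machine": platform.machine(),       # hardware type, e.g. "x86_64"
        "python_version": platform.python_version(),
    }

info = describe_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```

The values depend entirely on where the code runs, and some of them can come back as empty strings when the platform cannot determine them, which is itself something a catalog entry would want to record.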

Now somebody just needs to find or create such a database.

I did find something called "bitprints", which is a collection of identifying information, such as hash values, for file images. This is certainly one way to identify a piece of software, by characteristics of its bit content, e.g. 1 million bytes long with an MD5 hash value of such-and-such. But I'd want a software provenance database to record as many ways to identify it as possible.
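Identifying something by its bit content is easy to sketch with Python's hashlib. This shows only the length-plus-MD5 idea from the paragraph above, not the actual bitprint format; the function names are invented for the example.

```python
import hashlib

def fingerprint(data: bytes) -> dict:
    """Identify a blob of bytes by its length and MD5 hash."""
    return {"length": len(data), "md5": hashlib.md5(data).hexdigest()}

def fingerprint_file(path, chunk_size=65536):
    """Same idea for a file on disk, read in chunks to keep memory use small."""
    digest = hashlib.md5()
    length = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            length += len(chunk)
    return {"length": length, "md5": digest.hexdigest()}

sample = fingerprint(b"hello world")
```

A fingerprint like this says that two files are the same bits, but nothing about what the file is, which is why it would be only one kind of entry among many.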

Hopefully later I'll post a mockup of a domain model for such a database.

If you have refinements on this idea, or know of places that realize all or part of this idea, please leave a comment.

Friday, July 07, 2006

Spell checkers for language processors

I've been finding the spell checking feature of Mathematica surprisingly helpful at identifying typos in variable names. It makes me wonder why more language processors (compilers and interpreters) don't make use of this.

The feature works like this. Each new identifier reference is checked for similarity to an already known symbol. If it is similar but not the same, a warning is generated, as follows:

In[1] := rootDir = "/tmp/foo";
setup = rootdir <> "/index.html"

General::spell1 :
Possible spelling error: new symbol name "rootdir" is
similar to existing symbol "rootDir".

(For more info see the documentation for General::spell1 and its examples).

The rootdir is flagged as similar to rootDir. In Mathematica this is important because variables don't need to be declared before use, which makes sense for supporting symbolic algebra. In a language like Java where variables must be declared, the compiler would catch rootdir as undeclared, but would leave it to you to figure out why. With a spellchecker to catch the similarity you could get a possibly helpful hint.
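To show how little machinery such a check needs, here is a rough sketch in Python using difflib from the standard library. It is not how Mathematica implements General::spell1; the similarity measure, the 0.8 cutoff, and the function name spelling_warning are all assumptions made for the example.

```python
import difflib

def spelling_warning(new_name, known_names, cutoff=0.8):
    """Return a warning if new_name is similar, but not identical, to a known symbol."""
    if new_name in known_names:
        return None  # an exact match is an ordinary reference, not a typo
    close = difflib.get_close_matches(new_name, known_names, n=1, cutoff=cutoff)
    if not close:
        return None  # nothing similar enough: treat it as a genuinely new symbol
    return (
        f'Possible spelling error: new symbol name "{new_name}" '
        f'is similar to existing symbol "{close[0]}".'
    )

known = ["rootDir", "setup"]
warning = spelling_warning("rootdir", known)
```

The cutoff is the knob that trades missed typos against noise: lower it and more near-misses are caught, along with more pairs of names that are similar on purpose.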

The chief drawback I see is the noise you get from false positives, when the spell checker finds a similarity between a symbol you just wrote and an existing symbol, and you know full well that they're different and it's OK. It's a little like the Dick Van Dyke Show exchange when Laura asks Rob "Do you or do you not like reminders?" and Rob replies "Only when I forget." Mathematica lets you disable the spell check feature, so you don't see the warning every time you load a bit of code with a spelling similarity.

It would be cool to combine this spellchecking feature with the "Quick fix" feature in Eclipse. You make a typo in a variable name, it gets a compile error perhaps because it's not declared in its current form, and the compiler additionally warns "similar to symbol X". The quick fix would be "change to X".