Wednesday, March 26, 2008

Planes Without Heaps

I ran across something interesting while reading up on abstract interpretation. I was prompted by a redditor comment that the approach in "Symbolic Disassembly" was an abstract interpretation technique. Patrick Cousot's team has a static analyzer for C code that they've applied to software in the Airbus A340 and A380:


The development of ASTRÉE started from scratch in Nov. 2001. Two years later, the main applications have been the static analysis of synchronous, time-triggered, real-time, safety critical, embedded software written or automatically generated in the C programming language. ASTRÉE has achieved the following two unprecedented results:
  • A340-300 In Nov. 2003, ASTRÉE was able to prove completely automatically the absence of any RTE in the primary flight control software of the Airbus A340 fly-by-wire system, a program of 132,000 lines of C analyzed in 1h20 on a 2.8 GHz 32-bit PC using 300 Mb of memory (and 50mn on a 64-bit AMD Athlon™ 64 using 580 Mb of memory).

  • A380 From Jan. 2004 on, ASTRÉE was extended to analyze the electric flight control codes then in development and test for the A380 series. The operational application by Airbus France at the end of 2004 was just in time before the A380 maiden flight on Wednesday, 27 April, 2005.

The analyzer has its limits, and so, by extension, does the analyzed code: "ASTRÉE analyzes structured C programs, with complex memory usages, but without dynamic memory allocation and recursion." Of course we don't know from this whether other code running in the A340 or A380 does use dynamic memory allocation, but presumably the 132,000 lines of C are an important and interesting part of the system. So, it's a largish useful program that doesn't use the heap.

Do you remember the last time you wrote a program with no dynamic memory allocation (in a language that doesn't do it as a matter of course)?

Wednesday, March 19, 2008

Symbolic Disassembly

In looking at OCaml-generated x86 assembly last week, I noticed that having a language which keeps undefined identifiers in symbolic form can be handy. I started making notes in a Mathematica notebook to keep track of some heap shuffling code I was trying to understand. To explain the technique, let's use this OCaml snippet:


((1.0 -. alpha) *. (Array.get y i))
where the floating-point arithmetic becomes this x86 code:

fld1
fsub REAL8 PTR [eax]
fmul REAL8 PTR [esi]
In Mathematica, I translated lines of assembly into real assignments, using the literal register names as my variable names, except at the very beginning of the assembly fragment. For example, when starting the above fragment, eax points to the alpha value, so I would write the read reference to eax as alpha, without defining alpha. Similarly, esi points to Array.get y i, so I was able to write this sequence:
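The notebook content isn't reproduced in this post, but the idea can be sketched in Python with a toy symbolic-expression class (the class Sym and the variable names alpha and y_i are my inventions, not the original notebook):

```python
# Toy symbolic-expression class: an undefined name stays symbolic,
# and arithmetic on it builds up an expression string.
class Sym:
    def __init__(self, text):
        self.text = text
    def _s(self, other):
        # Render the other operand: another Sym, or a plain number.
        return other.text if isinstance(other, Sym) else repr(other)
    def __sub__(self, other):
        return Sym("(%s - %s)" % (self.text, self._s(other)))
    def __rsub__(self, other):
        return Sym("(%s - %s)" % (self._s(other), self.text))
    def __mul__(self, other):
        return Sym("(%s * %s)" % (self.text, self._s(other)))
    def __repr__(self):
        return self.text

alpha = Sym("alpha")   # what [eax] points to
y_i   = Sym("y_i")     # what [esi] points to: Array.get y i

st0 = 1                # fld1: push 1.0 onto the FP stack
st0 = st0 - alpha      # fsub REAL8 PTR [eax]
st0 = st0 * y_i        # fmul REAL8 PTR [esi]
print(st0)             # prints ((1 - alpha) * y_i)
```

Each assembly line becomes an assignment to the destination register, and the original source expression falls out of the final value of st0.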


It was interesting, a little startling actually, to see the original expression re-emerge by simulating running the code. I didn't get very far into it, but you can also do things like simulate writes to memory with a register transfer definition:

Mem[eax] = 2301
In a traditional language, eax has to be bound to a value. But in a symbolic language, eax can be a symbolic expression like _caml_young_ptr + 4.
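A minimal sketch of the same trick for stores, assuming we simply represent a symbolic address as a string (my own simplification):

```python
# Simulate a store through a symbolic address: memory becomes a
# dictionary keyed by the address *expression*, not a numeric address.
mem = {}
eax = "_caml_young_ptr + 4"   # a symbolic expression, not a number
mem[eax] = 2301               # Mem[eax] = 2301

# A later read through the same symbolic address finds the value.
print(mem["_caml_young_ptr + 4"])   # prints 2301
```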

Do my partial-evaluation readers have any remarks about this?

Friday, March 14, 2008

List Append and List Representation

When writing "RC Filter with OCaml" I was surprised that list append performed so poorly (the documentation notes it is not tail-recursive, and it takes time proportional to the length of its first argument). As explained earlier, the RC algorithm is defined as processing elements of the input array from first to last, but in doing so new elements must be appended to the result list, not prepended (consed). Since OCaml didn't have a constant-time way to append to a list, I was forced to prepend and reverse the result later.

It depends on the list data structure, I think. If lists are represented with a simple cons cell, namely a pair of head and tail pointers, then there's no other way to do it. But lists could be represented with a list header containing information such as the length and a last-element pointer to speed up certain operations, as well as the pointer to a chain of traditional cons cells:


struct ConsCell {
    void* hd;               /* head: pointer to the element data */
    struct ConsCell* tl;    /* tail: next cell, or NULL */
};
typedef struct ConsCell* LinkedList;

struct ListHeader {
    unsigned length;
    struct ConsCell* first;
    struct ConsCell* last;
};
typedef struct ListHeader* List;
Representing lists with the header makes several operations constant-time: appending or prepending an element, concatenating two lists (called append in ML), and getting the list length. It does so at an obvious memory overhead cost, as well as the time cost of an extra level of memory indirection: getting the head or tail has to read header->first as well. As I found in the generated assembly, OCaml calls a hd function anyway, so I don't think that's a concern. And I imagine the Ideal Functional Language Optimizing Compiler (IFLOC) would identify places in the source where the simpler LinkedList would suffice and just use that.
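As a sketch of why the header helps, here is the same representation in Python rather than C (the class and method names are mine); append touches only the header and the last cell, so it is constant-time:

```python
class ConsCell:
    def __init__(self, hd, tl=None):
        self.hd = hd        # the element
        self.tl = tl        # next cell, or None

class HeaderList:
    """List header tracking length, first cell, and last cell."""
    def __init__(self):
        self.length = 0
        self.first = None
        self.last = None

    def append(self, x):
        # O(1): link the new cell after 'last'; no traversal needed.
        cell = ConsCell(x)
        if self.last is None:
            self.first = cell
        else:
            self.last.tl = cell
        self.last = cell
        self.length += 1

    def to_list(self):
        # Walk the chain of cons cells for inspection.
        out, cell = [], self.first
        while cell is not None:
            out.append(cell.hd)
            cell = cell.tl
        return out

lst = HeaderList()
for v in [1, 2, 3]:
    lst.append(v)
print(lst.to_list(), lst.length)   # prints [1, 2, 3] 3
```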

Wednesday, March 12, 2008

RC Filter with OCaml

In "Dependent Iteration" we looked at a small numeric program given in imperative programming style and wondered how it might look in functional or hybrid environments, and whether it would perform well. I felt the RC filter algorithm posed some challenges for being elegantly expressed in a functional style, in a way that might be appealing to procedural programmers but that also is fast. In "RC Filter with Python and Java" we got some real numbers for an input file of about 1.2 million entries. What I really wanted to try was a compiled functional language, like OCaml. So, now we can add OCaml, procedural and functional variants, to our previous results (note Psyco-enhanced Python was added, see post comments):

Language                     Total      Time/Iteration
Java                           29 ms      24 ns/iteration
OCaml native procedural        56 ms      47 ns/iteration
OCaml bytecode procedural     547 ms     456 ns/iteration
OCaml native functional       940 ms     783 ns/iteration
Python with Psyco             588 ms     490 ns/iteration
Python                       1507 ms    1255 ns/iteration
OCaml bytecode functional    1692 ms    1410 ns/iteration

I was surprised to see OCaml not beat Java. I am using the OCaml compiled for the Microsoft toolchain, for which there is a release note about degraded performance from lack of computed gotos; maybe that's it. It was interesting to see the bytecode times so similar to the Psyco-enhanced Python version. The functional version is just a bit disappointing, partly because I had to reverse the list.

Let's take a look at the code, and some of the practical things I learned about getting started with OCaml.

Fractal Benchmark Page

My favorite resource for this post is this Fractal Benchmark page, from which I gleaned two clues about OCaml:

1. Fine-grain timing (under a second) can be done with Unix.gettimeofday().

2. OCaml has for loops:
for i = 1 to 10 do
  print_int i
done


Using OCaml Modules

If you bother to read the documentation ahead of time, you won't get caught here. But silly me, given the preceding benchmark code and the fact that the Unix module is in the OCaml library documentation, I assumed Unix.gettimeofday() would work out of the box.

No, if you want to use the Unix module, and you're running the ocaml interpreter, you need to first use the #load (that's "pound load", one word, not the "#" prompt followed by a load command) directive:

#load "unix.cma";;

You have to know to use the lower-case name "unix" to load the module whose capitalized name is "Unix", and you have to know the special file extension ".cma". Did I mention that you need to do this for some modules, but not for others that come pre-loaded? I'm not sure I can tell you which ones are which.

I didn't find this the most user-friendly introduction to getting something done; you can draw your own conclusions. But the cherry on top was the next point.

If you want to compile source code, you must not have a #load directive in the source file. Presumably one can load source code into the interpreter, as you can with SML/NJ's use command, and if so, does that mean you need to manually type the #load(s) first? This seems incomprehensible. Obviously it's not a showstopper issue for the OCaml community, one way or the other.

What you do need to do if compiling with a module like Unix is to mention its compiled module file to the compiler. I needed to list the module before my source file. Here's my bytecode compile command:

ocamlc -o lp.exe unix.cma lp.ml


Using the Native Compiler

I had installed OCaml for the Microsoft toolchain. I needed to do just a couple setup things:

1. Make sure the Visual C bin directory is on the path:
set PATH=C:\Program Files\Microsoft Visual Studio 8\VC\bin;%PATH%

2. Make sure the LIB environment variable has a path of the Microsoft library directories:
set LIB=C:\Program Files\Microsoft Visual Studio 8\VC\lib;C:\Program Files\Microsoft Visual Studio 8\VC\PlatformSDK\Lib
If you don't have #1 right, you get a poignantly confusing error message:

'ml' is not recognized as an internal or external command,
operable program or batch file.
Assembler error, input left in file C:\DOCUME~1\jfkbits\LOCALS~1\Temp\camlasm705c28.asm
The "ml" mentioned here is the Microsoft Assembler, not any ML language thing.

The compile command for the native compiler is like that for the bytecode compiler, only you change .cma files to .cmxa:

ocamlopt -o lpopt.exe unix.cmxa lp.ml


Procedural Code


Without further ado, the procedural code:

let lowpassProcedural x dt rc =
  let n = Array.length x in
  let y = Array.make n 0.0 in
  let alpha = dt /. (rc +. dt) in
  begin
    Array.set y 0 (Array.get x 0);
    for i = 1 to n-1 do
      Array.set y i (alpha *. (Array.get x i) +. ((1.0 -. alpha) *. (Array.get y (i-1))))
    done;
    y
  end;;
This is about as close as it gets to the universal procedural implementation: arrays and for/do loops. Notice that overloading is not to be found in OCaml, so the floating-point operators have identifiers like +. and *..

Functional Code with Fold

The correct code:

let lowpassFunctionalFold x dt rc =
  let x0 = List.hd x in
  let alpha = dt /. (rc +. dt) in
  let helper acc xi =
    match acc with
    | (yprev, ys) ->
      let yi = alpha *. xi +. ((1.0 -. alpha) *. yprev) in
      (* (yi, ys @ [yi]) don't do this *)
      (yi, yi :: ys)
  in
  match List.fold_left helper (x0, []) x with
  | (_, acc) -> List.rev acc  (* rev accounts for about 40% of the running time *)
;;
Turn your attention to the last line of helper where the returned list is constructed. The given implementation uses :: (cons) and the result is reversed at the end. The reverse is done to give the proper ordering of the output list when the list elements are visited first to last as with fold_left. Note that we must visit first to last, because the RC filter algorithm is specified that way. (Even if it works to filter a signal with time reversed, I'm looking for code that can be explained to those familiar with the subject.) This means we can't use fold_right.

As noted in the source, I originally wrote this using @ (list append) to tack elements onto the end of the list. But this is exactly what the documentation is warning about when it says "Not tail-recursive." I gave up timing that implementation. OCaml could use snoc here (append a single element at the end of a list, the counterpart to cons, which prepends).

There may be a better way to do this in OCaml using fold or similar list iterator. I was simply trying to get a reasonably compact functional definition that didn't degenerate into hand-coded recursion.
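For comparison, the cons-then-reverse idiom translates directly to Python's functools.reduce, using nested pairs as cons cells; this is my own sketch, not code that was timed in this post:

```python
from functools import reduce

def lowpass_fold(x, dt, rc):
    alpha = dt / (rc + dt)

    def helper(acc, xi):
        yprev, ys = acc
        yi = alpha * xi + (1 - alpha) * yprev
        return (yi, (yi, ys))     # cons yi onto a linked list of pairs

    # Seed yprev with x[0]; folding over x[0] then yields y[0] = x[0].
    _, ys = reduce(helper, x, (x[0], None))

    # The List.rev step: unwind the cons chain, then reverse.
    out = []
    while ys is not None:
        yi, ys = ys
        out.append(yi)
    out.reverse()
    return out

print(lowpass_fold([0.0, 1.0], 1.0, 1.0))   # prints [0.0, 0.5]
```

As in the OCaml version, the fold visits the input first to last, conses each output onto the accumulator, and pays one reversal at the end.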



Updated: Added native compilation numbers and commentary.

Friday, March 07, 2008

RC Filter with Python and Java

Following on to "Dependent Iteration", I tried the RC filter simulation on a semi-realistic problem with 1.2 million iterations in a few platforms. The two I want to report on seem to run the gamut of fast to slow:

Language      Total      Time/Iteration
Java            29 ms      24 ns/iteration
Python        1507 ms    1255 ns/iteration


I ran the filter function in a loop and averaged the times to report the numbers above. The Java code is reasonably close to the metal here: on a 3GHz CPU at 0.3ns/cycle, 24ns is 72 CPU cycles to perform 3 double precision arithmetic operations and two array lookups and some loop overhead, probably with cache block fetches amortized across the life of the loop:

public double[] lowpass(double[] x, double dt, double RC) {
    double[] y = new double[x.length];
    double alpha = dt / (RC + dt);
    y[0] = x[0];
    for (int i = 1; i < x.length; ++i)
        y[i] = alpha * x[i] + (1 - alpha) * y[i-1];
    return y;
}
There must be a better way to do this than my Python code (running on version 2.5.1). Aside from whether this uses single or double-precision arithmetic, I tried using both arrays and lists, and reported the faster (the list version, actually):

import array

def lowpassArray(x, dt, RC):
    y = array.array('f', range(len(x)))
    y[0] = x[0]
    yprev = x[0]
    alpha = dt / (RC + dt)
    for i in range(1, len(x)):  # index loop; as posted, the index was undefined
        y[i] = alpha * x[i] + (1 - alpha) * yprev
        yprev = y[i]
    return y

def lowpassList(x, dt, RC):
    yprev = x[0]
    alpha = dt / (RC + dt)
    y = [x[0]]                  # y[0] = x[0], per the algorithm
    for xi in x[1:]:
        yi = alpha * xi + (1 - alpha) * yprev
        yprev = yi
        y.append(yi)
    return y
One interesting thing in these exercises is that for the number of language platforms I'd like to try, it takes a nontrivial amount of time to figure out (a) how to load from a file in one-float-per-line format and (b) how to get wall-clock time within the environment.
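For what it's worth, here is a sketch of those two chores in Python (the one-float-per-line format is from these posts; the helper names are my own):

```python
import os
import tempfile
import time

def load_samples(path):
    # One floating-point sample per line, as used in these posts.
    with open(path) as f:
        return [float(line) for line in f if line.strip()]

def time_call(f, *args):
    # Wall-clock timing; coarse, but fine for millisecond-scale runs.
    start = time.time()
    result = f(*args)
    return result, time.time() - start

# Tiny demo: write two samples, load them back, time the load.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("1.0\n2.5\n")
samples, elapsed = time_call(load_samples, tmp.name)
os.remove(tmp.name)
print(samples)   # prints [1.0, 2.5]
```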

What about your favorite functional or scripting platform? How fast can you get this dependent iteration loop with an input size of 1.2 million?

Wednesday, March 05, 2008

Dependent Iteration

Consider the following code:

// Return RC low-pass filter output samples, given input samples,
// time interval dt, and time constant RC
function lowpass(real[0..n] x, real dt, real RC)
  var real[0..n] y
  var real alpha := dt / (RC + dt)
  y[0] := x[0]
  for i from 1 to n
    y[i] := alpha * x[i] + (1-alpha) * y[i-1]
  return y
This is from the section "Digital Simulation" of the Wikipedia article "Low-pass filter", and it has an element that strikes me as a fundamental challenge to the "for loops considered so 20th century" mindset: each iteration uses the results of the previous iteration.

The example code is a digital simulation of a physical phenomenon, a simple resistor-capacitor (RC) circuit which allows frequencies below the cutoff and blocks frequencies above the cutoff. The y[i-1] is helping model the physical phenomenon of the first derivative of output voltage which the capacitor introduces.

Similar physical reasons, like neighboring physical objects and derivatives, make it very natural to use the procedural paradigm, where access to neighboring data and previous results is an array access. The whole area of physical simulation programming, including digital signal processing, is centered on the procedural paradigm, and pseudocode and examples are largely written in terms of x[i], y[i-1] notation. And the for or do loops they write may run over a million, ten million, or more elements. At those input sizes, speed matters everywhere.

In light of these requirements, can this type of programming be done in a functional style? Should it?

I helped a customer fix a performance problem in some code very similar to this in the "current iteration depends on previous iteration's results" sense. Access to his output data structure was slowing things down and we solved the problem by using a local variable to remember the previous iteration's output value. But I started wondering what a good general iterating idiom would be for this type of problem: one that lets the dependent iteration be expressed elegantly, but one that also performs well.

A directly recursive version might look like this:

fun lowpass (x, dt, RC) =
  let
    val n = length x
    val alpha = dt / (RC + dt)
    fun lowpass' i yprev acc =
      if i >= n then
        acc
      else
        let val yi = alpha * sub (x, i) + (1.0 - alpha) * yprev
        in
          lowpass' (i+1) yi (append (acc, [yi]))
        end
  in
    lowpass' 0 (sub (x, 0)) []
  end
With a good native compiler this should perform well, with the tail call compiled to an assembly jump and the lowpass' function parameters assigned to registers. But is this as intuitive? It certainly looks longer than the procedural version.

Of course this is also a natural for fold, with the previous output value in the accumulator spot. But I wonder if it's less intuitive than the explicitly recursive version, and whether the performance can come close to the close-to-the-metal languages unless the compiler can inline fold's function argument and reconstruct the implicit loop.

I think a list-comprehension version would be more compact, and provide both readability and performance. The Mathematica version using Table is reasonable:

lowPass[x_, dt_?NumberQ, RC_] :=
Module[{yprev = First@x, alpha = dt/(RC + dt)},
Table[
With[{ycurrent = alpha*x[[i]] + (1 - alpha)*yprev},
yprev = ycurrent;
ycurrent],
{i, 1, Length[x]}
]
]
But then this is not purely functional: it uses a stateful local variable yprev. The fold option may be the best purely functional one, but it still seems like a hybrid has the best chance of high performance without the token overhead of for/do loops.

Monday, March 03, 2008

Quickie: Sometimes Simpler Is Harder

The Microsoft Knowledge Base article "How to Automate Folder Permissions" recommends creating three files: Addperm.cmd, Addperm2.cmd and Yes.txt.

As you know or can guess if you've used the cacls command, Yes.txt contains just the letter "y" and is piped or redirected to cacls:

cacls %1\%2 /T /G Administrators:F MUG2000\%2:C "MUG2000\Domain Admins":F <\yes.txt
What gave me a chuckle was the line "The third file is a little more difficult" in the directions, which I've abbreviated here:

Addperm.cmd

(17 lines of script code here)


Addperm2.cmd

(6 lines of script code here)

Yes.txt
The third file is a little more difficult.

Open a command prompt (Cmd.exe) and change directories to the root directory of the drive to which you have saved the other two files.

Type the following:
COPY CON YES.TXT
y

This creates a text file with the Y and ENTER needed to automate the CACLS command.


It's funny, but I can't argue with them too much. What's the alternative? A gray box with the single letter "y" in it?