Wednesday, March 19, 2008

Symbolic Disassembly

In looking at OCaml-generated x86 assembly last week, I noticed that having a language which keeps undefined identifiers in symbolic form can be handy. I started making notes in a Mathematica notebook to keep track of some heap shuffling code I was trying to understand. To explain the technique, let's use this OCaml snippet:


((1.0 -. alpha) *. (Array.get y i))
where the floating-point arithmetic becomes this x86 code:

fld1
fsub REAL8 PTR [eax]
fmul REAL8 PTR [esi]
In Mathematica, I translated lines of assembly into real assignments, using the literal register names for my variable names, except at the very beginning of the assembly fragement. For example, when starting the above fragment, eax points to the alpha value, so I would write the read reference to eax as alpha, without defining alpha. Similarly esi points to Array.get y i, so I was able to write this sequence:


It was interesting, a little startling actually, to see the original expression re-emerge by simulating running the code. I didn't get very far into it, but you can also do things like simulate writes to memory with a register transfer definition:

Mem[eax] = 2301
In a traditional language, eax has to be bound to a value. But in a symbolic language, eax can be a symbolic expression like _caml_young_ptr + 4.

Do my partial-evaluation readers have any remarks about this?

4 comments:

Barry Kelly said...

Putting jumps etc. aside and staying within basic blocks, pretty much all low-level representations of code are just serialized expression graphs, as you've seen. It's similar and in some ways even simpler on machines like the JVM and CLR: the IR is just a post-order serializing traversal of the expression tree.

lorg said...

Arkon and I are writing a disassembler which will use similar ideas.
Check out http://www.ragestorm.net/distorm/ for the previous version of the disassembler, and my blog for posts about the subject.

We didn't yet write a lot about it, but it's coming.

jfklein said...

Barry: The surprising thing here for me is not to see the mapping from high-level to low-level as just that, but to see (1) the reverse mapping pops out by keeping some parts symbolic, and (2) a somewhat novel use for a symbolic language like Mathematica, as opposed to doing it all by hand. Sure (1) is not surprising if you think about it for a second, but for something with practical implications I am a bit surprised I hadn't heard of it before. Anyone can do it by hand, but since this is tedious and error prone, (2) is relevant.

Mike Kucera said...

I'm not familiar with Mathematic but I know this can be done with Maple as well. Just have to watch out for the built-in automatic simplification, the result may not be an accurate representation of what is actually being computed. For example if you input (a + a) Maple will respond with 2a. The symbols are supposed to represent machine numbers but Maple will assume they are real numbers which may lead to incorrect results.

And why are you asking for your partial-evaluation readers to remark, are you trying to get at something specific?