Wednesday, November 28, 2007

Fifty Cent Lecture on the Unification-Based Type Inference Algorithm

In December 1978 the Journal of Computer and System Sciences published "A Theory of Type Polymorphism in Programming." In 1992 I was introduced to Standard ML in compilers class, and used it to write a typechecker for our Ada subset compiler. I was in love with types, typecheckers, and type inference. In 1993 my good friend and I tried unsuccessfully to implement type inference as part of a much larger project, without actually researching the prior art. We needed to learn the fine art of finding and reading research papers. A few years later I was back in grad school, and got the help I needed to locate the right papers and understand them enough to implement an inferencing type checker. The paper for that project is dated December 8, 1997. I'm very satisfied that ten years later, type inference has not gone by the wayside but is poised to go mainstream, going the way of automatic memory management (GC) via Java.

In commemoration then, let's explore a bit this fascinating feature that may soon be coming to a programming language near you.

Cocktail Party Explanation (Type Inference in 30 Seconds)

The way most bloggers are explaining type inference lately is with a Java/C# assignment example. Instead of typing
WaitingPageHandler handler = new WaitingPageHandler(args);
which has this offendingly redundant "X x = new X" form, you can get away with something like
var handler = new WaitingPageHandler(args);
How does the typechecker figure this out? This doesn't look too hard to infer. What about an even simpler form:
var delay = 250;
Obviously, the literal 250 is an int, and so is delay. The type of literals like 250, 3.14159, -1L or "duck" are well-defined by the language. An expression involving the new operator is that of new's argument. Here we see the easiest part of understanding type inference -- that there are some expression tree leaves that can be immediately inferred.

Of course you don't just write assignments. As you know from studying lambda calculus, there are three fundamental operations to care about: lambda introduction (corresponding to introducing literals), variable reference, and function application. So how does type inference handle variable reference and function application?

Checking Variable References with Known Type

Let's tackle variable reference first. What if we were given this:
var timeout = 500;
var timeoutTime = System.currentTimeMillis() + timeout;
In the second line, what type does timeout have? It's an int, but how does the typechecker know it? It was inferred from the first line, you're right. But technically, the typechecker probably doesn't know or care that it was inferred. After the typechecker inferred the type of the timeout in the first line, it updated the environment so that in subsequent processing timeout's actual type is known. The typechecker did a lookup in the environment for the type of timeout before doing anything else, as it would in an ordinary static typing typechecker.

So we come to the clue that as with ordinary static typechecking, an environment is needed. The environment is a stack of identifier to type bindings. Environments will come very much into play as we develop the algorithm.

We have looked at how variable reference works when the type is known. There is another possibility, that the type of the variable being referenced is unknown, but first let's look at function application when the function's type is known.

Checking Function Application with Known Type

What about this simple line:
var delay = 15 * 60;
Given two ints, * returns an int, making int the type of the function application expression *(15,60), and thus the type of the variable delay. More generally, when we see a function call and the type of the function is known, the type of the expression is that of the function's return type.

Inferring a Type

The interesting part of type inference comes when a variable is referenced but not immediately inferred. A good example of this situation is when the variable is the argument to a function:
function(x)
{
print(x);
if(x < 2)
1
else
x*factorial(x-1)
}
This puts us into a little deeper water. But stay with me, because this is it; the heart of basic type inference using unification. First, notice that the type of x is obviously int from the comparison in the if test, and also from the expression x-1 which is similar to the 15*60 example above. The fun comes in how we can process this line by line and arrive at the right conclusions. The print(x) line is a function call where x's type is as yet unknown. We assume that like in most languages print is overloaded for all types, or at least all the basic ones like int and String. By the time the typechecker is done, we do need to know x's type so our code generator or interpreter has enough information to call the right print code. How?

When we encounter a reference to a variable with unknown type, we tag its type with a unique type variable. Wuzzat? Hello? you say. Yes, we will need a separate environment that tracks type variables. For example, we will say that x's type is that of type variable t1, and in our type environment we say that t1 is bound to an unknown type -- the variable is kept in symbolic form, it's unbound. That's all we can and need to do for the print(x) line, so we move on.

When we encounter the next line, the if expression, we will type check its three arguments: the test expression, the "if true" expression, and the "if false" or "else" expression. So far we haven't mentioned unification, but here's where it really becomes necessary. What we actually do to typecheck the x < 2 expression is run the unification algorithm. The types of the arguments to the less-than operator (which you should always remember to properly escape in your blog editor if you don't want to mysteriously lose portions of your text) are unified with their expected types. For example, we know the less-than operator takes two numeric operands. We'll skip over the overload resolution part here and assume that the literal 2 tells us we're using the int version of the less-than operator. So we know that this operator is a function which takes two ints. We unify the type of the first argument's expression x with the type of the function's first argument, which we just said is int:
Unify(t1,int)
In this case the unification algorithm is getting a type variable and a concrete type, indicating we've discovered what type the variable should be bound to. So at this point we can update t1 in the type environment:
t1 = int
That's it! That's perhaps the key part of the unification-based type inference algorithm.

From this point on, whenever we lookup the type of x in the environment, we'll still be referred to the type variable t1, but when we ask the type environment to evaluate that type expression, we'll get a concrete type of int. We can construct our typechecker to have two passes, one that runs this algorithm and guarantees the type environment has nothing but known types afterwards, and then a subsequent pass that updates the environment to concrete types.

Also, note an assumption in the algorithm: once we had any concrete type for x we used it. If later in your program x was used in a context of a different concrete type, the typechecker would flag a type error. This is where you get errors like "found a real when expecting an int".

Equivalent But Still Unknown Types

Another key part of unification comes when a variable of unknown type is unified with another variable of unknown type. For example:
function(x,y)
{
z = if(choose()) x else y;
2^z
}
In this case the first line of the function tells us only that x, y and z are the same type. This is important information that must not be lost. Let's step through these expressions. In the first line, we check the if/else expression. We assume the type of choose() is known to be boolean-valued to match the expected type of an if test expression (and if the return type of choose wasn't known, it would be inferred by its use here). We then check the two branches of the if, and since they simple variable references we generate a unique type variable for each and bind the variable reference appropriately. In other words:

In the type environment,
t1 -> unknown
t2 -> unknown
In the user variable environment,
x -> t1
y -> t2
This is the state of things after checking both expressions. The last part of checking the if/else is to unify those two expressions, since we know they must match. In other words,

Unify(t1, t2)
is evaluated. We saw what happens when unifying a type variable and a concrete type; we assign the type variable to the type. But here both are variables. In this case, unification says we simply remember the equivalence of these unknown type variables. The way I handled this was to augment the type environment to keep a list of known equivalences for a type variable:

In the type environment,
t1 -> equivalences[t2]
t2 -> equivalences[t1]
If we ever find a concrete binding for t2 we can bind t1 as well as t2 at that time. And that should explain the rest of the example.

When we evaluate the assignment statement z = if, we enter a similar type variable entry for z and make it equivalent to both t1 and t2:

In the type environment,
t1 -> equivalences[t2,t3]
t2 -> equivalences[t1,t3]
t3 -> equivalences[t1,t2]
The way you handle this equivalence list is an implementation choice but I hope you get the idea.

When we finally hit the function's final expression 2^z we again have an overloaded operator but the presence of 2 helps us resolve it to the integer power operator, so we can unify the type of z with int, letting us simultaneously bind the type variables t1, t2 and t3 since we know their equivalence.

Where Polymorphic Variables Come From

Sometimes when checking a function the parameter may never be used in a context specific enough to bind its type. What happens then? For example:
function(x)
{
(x, [])
}
This takes an argument and returns a pair of the original argument x and an empty list, possibly a normalizing step for a function that will visit x and accumulate some results seeded with the empty list. The pair construction works for values of any type, so it adds no information that can be used to deduce a particular type for x. When the typechecker gets to the end of the function declaration, x is found to still be bound to its type variable. When this happens, x is "promoted" to be a polymorphic-typed variable. In order words, since x is not used in any particular way, it can be used for any type. A language designer could choose not to support polymorphic types for some reason, and could flag this as an error.

This example also introduces the notion that type inference operates within a certain boundary. There has to be a point in the abstract syntax tree where the typechecker knows this is where to promote still-symbolic types to be polymorphic. In Standard ML, I believe this boundary is the named function definition introduced by the fun keyword.

Conclusion

The complete algorithm is a little more involved than I've described here, but I hope this helps you understand the essential pieces and the general flow. For more details please see "Principal type-schemes for functional programs" by Luis Damas and Robin Milner (ACM 1982) or the original "A Theory of Type Polymorphism in Programming" by Robin Milner (Journal of Computer and System Sciences, Vol.17, No.3, December 1978, pp.348-375), particularly section 4 "A Well-Typing Algorithm and Its Correctness."

No comments: