Monday, April 24, 2006

Bug Story: When Is 9+1 not supposed to be 10

It was a simple mistake, but in an embedded system it was hard to debug and had mysterious effects. Our team had a limited region of RAM for each task to store its own copy of a fixed-sized chunk of data. (A "task" in our operating system, pSOS, is the same as a thread or process.) In a C header, we defined constants to use as the start of each tasks's fixed region of memory, something like this:


#define TASK_ALPHA_STATICS BASE+0x1 * sizeof(STATIC_DATA)
#define TASK_BRAVO_STATICS BASE+0x2 * sizeof(STATIC_DATA)

And so on. Things tooled along, we got the first features going, and quite a ways into the project we started seeing weird crashes and unexplained error conditions. We found that some of the tasks were writing past that limited region of RAM. But why? While it was limited, we had calculated it had room enough for twenty-some tasks, and we only had 18 or so in our application.

What went wrong? Well, imagine you are an engineer who has just finished some feature development, and for the next feature in the application needs to code up a new task (thread), let's say to process a new stream of input from a real-time source. One of the various things you need to do is edit that header we talked about, and add your task, let's say it's the Golf task:

#define TASK_ALPHA_STATICS BASE+0x0 * sizeof(STATIC_DATA)
#define TASK_BRAVO_STATICS BASE+0x1 * sizeof(STATIC_DATA)
#define TASK_CHARLIE_STATICS BASE+0x2 * sizeof(STATIC_DATA)
#define TASK_DELTA_STATICS BASE+0x3 * sizeof(STATIC_DATA)
#define TASK_ECHO_STATICS BASE+0x4 * sizeof(STATIC_DATA)
#define TASK_FOXTROT_STATICS BASE+0x5 * sizeof(STATIC_DATA)
#define TASK_GOLF_STATICS BASE+0x6 * sizeof(STATIC_DATA)

Simple: it's a cut-and-paste job, you just change the name and the number of your task, so your pointer points to the next section in this self-constructed array.
Let's look at the next few entries in this progression:

#define TASK_HOTEL_STATICS BASE+0x7 * sizeof(STATIC_DATA)
#define TASK_INDIA_STATICS BASE+0x8 * sizeof(STATIC_DATA)
#define TASK_JULIET_STATICS BASE+0x9 * sizeof(STATIC_DATA)
#define TASK_KILO_STATICS BASE+0x10 * sizeof(STATIC_DATA)

and so on up to eighteen, which in this telling is the last task:

#define TASK_SIERRA_STATICS BASE+0x18 * sizeof(STATIC_DATA)

The keen-eyed C programmers among you will have seen the error. What we have here is the Sierra task addressing data that is actually 24 entries past the BASE pointer, not 18, because every one of those multipliers that a programmer had to edit has the "0x" hex prefix, that is the constant is in base 16, not base 10. So really those latter tasks were actually writing past their permitted region of memory, as if we really had 24 tasks in the system. It was a false lack of resources. We could have but didn't test for this at runtime, because we figured we had already accounted for it when we counted tasks and added up the memory.

It is interesting to see where the error crept it. The TASK_KILO_STATICS entry is most likely where things went awry, continuing the subtly wrong pattern of "8, 9, 10" when it should have been "8,9,A", or, "0x8, 0x9, 0xA".

How could this error have been prevented? Is writing our constants in hex not the best choice? What other code structures could have prevented this? The only reasonable alternative I can think of is the "previous+1" approach:

#define TASK_ALPHA_STATICS BASE+0x0 * sizeof(STATIC_DATA)
#define TASK_BRAVO_STATICS TASK_ALPHA_STATICS+sizeof(STATIC_DATA)

This eliminates the "9+1" problem, but did we introduce some other possible problem, like forgetting to link an entry to a previous entry, especially when rearranging the order of these entries?

If anyone can think of other ways to have avoided or tested for this problem (without resorting to another programming language), I especially invite your comments.

If you're interested in reading excerpts of other bug stories and a discussion of causal analyses, check out "My hairiest bug war stories" by Marc Eisenstadt. I like the way he refers to the "detective work" programmers do in debugging.

No comments: