Today I wanted to tell you about a very simple but interesting function, parallel_for, included in the Intel Threading Building Blocks (TBB) C++ library (available under either a commercial license for a fee, or as a GPL open source free library). It does the work of partitioning an embarrassingly parallel loop onto different threads, and all you need to do is write a couple of small adapter classes.
The sequential loop might be this:
for(i=0; i < vec.size(); ++i)The parallel version requires that we abstract the loop into a Range and a Body, both classes, and let TBB partition the vector and assign chunks of the loop to different threads. You don't need to create the threads. The call site above would change to something like
out[i] = vec[i]->serialize();
parallel_for(Range(vec), Body(vec));The algorithm uses methods you define in Range to partition the entire range into chunks of the vector that you (or the library code) decide is just small enough that there's no more parallelism to exploit but not so small that scheduling overheads eat into the benefit. A Range needs to implement methods
bool empty() constThe empty predicate is used to know if a range is empty and thus can be abandoned. The is_divisible method implements your grain size policy, e.g. is_divisible returns true unless there are less than 1000 elements.The copy constructor actually performs a split. This is a classic C++ idiom, the constructor creates the new object representing one part of the range, and the original object being destructively updated to represent the other part of the range. I imagine these Range objects as being quite lightweight, containing perhaps begin and end pointers or array indices into the original collection.
bool is_divisible() const
Range(Range& original, tbb:split)
Once TBB has gotten a range of the desired grain size, it gets put onto a thread queue for execution by invoking the Body's function operator:
void operator()(Range&) constInvoking the function operator on an object looks just like the object is a function; a Body object b's function operator is invoked b(r).
The trivial extra work in wrapping your collection in these Range and Body classes is rewarded by the fact that you don't need to write a line of thread-aware code to get your loop partitioned, and that you can take advantage of the thought someone else put into how best to do the partitioning and queuing.
TBB has taken some insights from Cilk for scheduling, and while it splits your range depth-first (splits a non-empty range and recurses on the left range until reaching a non-divisible range), but "steals" bread-first. The insight is that a thread should be scheduled with parent/child and similar neighboring ranges to take advantage of cache locality between the ranges in the same CPU core. Stealing refers to the way that a thread's queue is replenished when it is emptied, in that an entry from a randomly selected alternate thread's queue is removed.