Attached is a parallel merge sort, built on parallel_reduce, that is substantially faster than TBB's parallel_sort. The speedup comes from using Intel's IPP to do the actual sorting. ParallelMerge (provided with TBB) then merges the sorted data from each thread within parallel_reduce.
The sort always works with range size = height / (# of processors), and this is probably the most efficient tile size anyway. However, when testing smaller range sizes, the stack gets corrupted in ParallelMerge. Parallel Studio reports a race condition at the location of the crash. (Parallel Studio has been patched with Update 2, the latest.)
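For anyone who does not want to open the attachment, here is a rough serial sketch of the tile-then-merge structure described above. It is not the attached code: std::sort stands in for the IPP sort, plain loops stand in for parallel_reduce and ParallelMerge, and the name tiled_merge_sort is made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative only: sort fixed-size tiles independently, then merge
// adjacent sorted runs pairwise until a single sorted run remains.
// std::sort is a stand-in for the IPP sort; the loops are serial
// stand-ins for the parallel_reduce / ParallelMerge steps.
void tiled_merge_sort(std::vector<int>& data, std::size_t tile) {
    assert(tile > 0 && "tile size must be positive");
    const std::size_t n = data.size();

    // Phase 1: sort each tile on its own (one tile per task in the parallel version).
    for (std::size_t lo = 0; lo < n; lo += tile) {
        const std::size_t hi = std::min(lo + tile, n);
        std::sort(data.begin() + lo, data.begin() + hi);
    }

    // Phase 2: merge neighbouring sorted runs, doubling the run width each pass.
    for (std::size_t width = tile; width < n; width *= 2) {
        for (std::size_t lo = 0; lo + width < n; lo += 2 * width) {
            const std::size_t mid = lo + width;
            const std::size_t hi = std::min(lo + 2 * width, n);
            std::inplace_merge(data.begin() + lo, data.begin() + mid, data.begin() + hi);
        }
    }
}
```

With tile = 1 the first phase does nothing useful and all the work lands in the merge phase, which is the configuration that stresses the merge step the most.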
To reproduce the problem, set TileSize = 1 in the attached code. It fails on my 8-way Core i7, where TileSize = Height / 8 works fine. The failure does not occur if TBB is initialized with only 1 thread (single-threaded mode). The problem can also be circumvented by making ParallelMerge's Is_Divisible method always return false, but this leaves a lot of the speedup on the table.
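To illustrate why the Is_Divisible workaround costs speed, here is a small sketch of a range type shaped like TBB's Range concept (empty, is_divisible, splitting constructor). Everything in it is hypothetical: IndexRange, split_tag (a local stand-in for tbb::split), and count_leaves are not taken from ParallelMerge, and the recursion only imitates what a scheduler would do with the range.

```cpp
#include <cstddef>

struct split_tag {};  // local stand-in for tbb::split (assumption, not the TBB type)

// Hypothetical range over [begin, end); not ParallelMerge's actual range class.
struct IndexRange {
    std::size_t begin;
    std::size_t end;
    bool allow_split;  // false mimics an is_divisible() that always returns false

    IndexRange(std::size_t b, std::size_t e, bool allow)
        : begin(b), end(e), allow_split(allow) {}

    // Splitting constructor: the new range takes the upper half, r keeps the lower half.
    IndexRange(IndexRange& r, split_tag)
        : begin(r.begin + (r.end - r.begin) / 2), end(r.end), allow_split(r.allow_split) {
        r.end = begin;
    }

    bool empty() const { return begin == end; }
    bool is_divisible() const { return allow_split && end - begin > 1; }
};

// Split recursively for as long as the range reports it is divisible and
// return the number of leaf subranges, i.e. the pieces that could run as
// separate tasks.
std::size_t count_leaves(IndexRange r) {
    if (!r.is_divisible()) return 1;
    IndexRange upper(r, split_tag{});
    return count_leaves(r) + count_leaves(upper);
}
```

With allow_split set to false the whole range stays in one piece, so the merge is effectively serial: that would explain both why the crash goes away and why the speedup is lost.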
If anyone could help resolve this problem, it will probably benefit TBB or ParallelMerge. I don't believe the problem is in my code, but I am willing to test any suggestions.