In this product how-to article, Vector Fabrics’ Paul Stavers describes a more efficient way to parallelize code for embedded multicore designs illustrating the process using the company’s online tool to parallelize Google’s VP8 video decoder.
Many silicon vendors rely on multicore architectures to improve performance. However, some engineers might say that those vendors have not been able to deliver compilation tools that have kept pace with the architecture improvements. The tools that are available require both a good understanding of the application and a deep understanding of the target platform.
However, an alternative approach is possible. This article will highlight a way to parallelize complex applications in a short time span, without the need to understand neither the application nor the target platform.
This can be achieved with interactive mapping and partitioning design flow. The flow visualizes the application behavior and allows the user to interactively explore feasible multithreaded versions. The program semantics are guaranteed to be preserved under the proposed transformations.
In many cases the design of an embedded system starts with a software collection that has not yet been partitioned to match the multicore structure of the target ***a*** hardware.
As a result, the software does not meet its pe ***a***rformance requirements and hardware resources are left idle. To resolve this, an expert (or a team of experts) comes in to change the software so that it fits the target multicore structure.
Current multicore design flow practice
Figure 1 Traditional multicore design practice involves an
iterative process of analysis, partitioning, loop parallelization,
incorporation of semaphores, analysis, testing and retuning of code.
A typical approach, illustrated in Figure 1 above, would include the following steps:
Analyze the application. Find the bottlenecks.
Partition the software over the available cores. This requires a good understanding of data access patterns and data communication inside the application to match this with the cache architecture and available bandwidth of buses and channels on the target platform. Optimize some of the software kernels for the target instruction set (e.g. Intel SSE, ARM Neon)
Identify the loops that can be parallelized. This requires a good understanding of the application: find the data dependencies, find the anti- and output dependencies, and find the shared variables. The dependencies can be hidden very deeply, and to find them often requires complex pointer analysis.
Predict the speedup. Predict the overhead of synchronizing, the cost of creating and joining threads. Predict the impact of additional cache overhead introduced by distributing workload over multiple CPUs. If parallelizing a loop still seems worth it, go to the next step.
Change the software to introduce semaphores, FIFOs and other communication and synchronization means. Add thread calls to create ***a***and join threads. This requires a good understanding of the API’s available on the target platform. In this stage subtle bugs are often introduced, related to data races, deadlock or livelock that may only manifest themselves much later, e.g. after the product has been shipped to the customer.
Test. Does it seem to function correctly? Measure. Does the system achieve the required performance level? If not: observe and probe the system. Tooling exists to observe the system; The experts need to interpret these low-level observations in the context of their expert system knowledge, then draw conclusions.
Try again to improve performance or handle data races and deadlocks. This involves repeating the above from Step 1.
Close analysis of Figure 1 clearly shows there are many problems with this design flow. Experts that can successfully complete this flow are a rare-breed. Even if you can find them, at the start of a project it is hard to predict how many improvement and bug fix iterations the experts need to go through until the system stabilizes.
Multicore platforms are quickly becoming a very attractive option in terms of their cost-performance ratio. But they also become more complex every year, making it harder for developers to exploit their benefits. Therefore we need a new design flow that enables any software developer to program multicore platforms. This flow is depicted in Figure 2 below.
Figure 2 Multicore design flow that enables any software developer
In this alternative flow, a tool analyzes the program before it is partitioned. It finds the loops that can be parallelized and devises a synchronization strategy for these loops. The tool also has detailed knowledge of the target platform and it can estimate the cost of different partitioning and synchronization strategies.
In the remainder of this article we will show this design flow at work in a demonstration of it use in parallelizing Google’s VP8 video decoder by using a cloud-based tool from Vector Fabrics. We show that tools can answer the important questions that normally require experts to handle. Namely, which loops dominate the workload and which of these can be parallelized, how much speedup will I get, and how do I change my program to implement my parallelization choices?
After compiling and analyzing the VP8 source code, its behaviour is visualized as shown in Figure 3 below.
Figure 3 The dominant filter loop with all its loop-carried data dependencies
Here we see a 3D view on the function call and loop stacking hierarchy of the VP8 decoder program. The main() function is all the way in the back, with callers stacked up in front. The width of each function and loop invocation corresponds directly to its relative contribution to the program execution time. Therefore, wide boxes represent hot spots in the program, and they are displayed most prominently on the screen.
In VP8’s case the dark blue loop shown in Figure 3 dominates the execution time of the VP8 decoder. We therefore instruct the tool to parallelize this filter loop. It responds with a parallelization proposal for a range of CPU workloads. All the developer now has to do is choose how many CPUs he wants to dedicate to the filter job.
To provide the developer with insight in the pr ***a***oposed parallelization of the filter loop, the tool displays the loop carried dependencies as horizontal bars.
In the example of Figure 3 some of these dependencies really are induction expressions. Such expressions can be removed from the loop because their value only depends on the iteration sequence number and not on the actual computations going on inside the loop. Only the red-coloured data dependencies present a problem requiring deeper analysis.
The filter loop actually is a stack of loops. The outer loop iterates over macroblock rows, the next nesting level iterates over macroblock columns, and deeper loops iterate over individual pixels, as follows:
for (row = 0; row < rows; row += 1)
for (col = 0; col < cols; col += 1)
When the program is analyzed, the tool not only observes the presence of data dependencies between successive calls to filter_macroblock() but it also computes the corresponding dependency distance vectors. These distance vectors are an important clue for parallelization, because they tell the parallelization engine exactly how many iterations of each program loop are executed before a data value is consumed.
Even if parallelization is possible, it does not necessarily mean it will make the program run faster. Often the overhead of introducing threads and synchronizing them introduces more delay than can be offset by the concurrency gain. In the case of VP8’s 2-dimensional filter loop transformation (Figure 4 below), tools can compute a schedule that includes the overhead of synchronization.
Figure 4 Schedule of the 2-dimensional loop parallelization (using 4 CPUs)
Figure 4 shows what the schedule looks like with 4 CPUs. In this picture time flows from left to right. The ***a*** induction expressions do not appear in the schedule, but the red-coloured filter dependency with distance (1, -1) does. To resolve this dependency, rows start executing with a delay of 2 columns with respect to the immediately preceding row. At the beginning and at the end of the schedule the CPUs are only partially active, resulting in a speedup of 3.5×.
Figure 5 On a multicore x86 with two to eight cores, measured speedup is within 10 percent of the predicted estimate