Down & dirty with HW/SW co-design: Part 3 – Cosynthesis of multiprocessors

Following on Part 1 and Part 2 in this series, we now will reconsider hardware/software co-design for more general multiprocessor architectures.

Many useful systems can be designed by hardware/software partitioning algorithms based on the CPU+ accelerator template. Hardware/software partitioning can also be used to design PEs that are part of larger multiprocessors.

But if we want to design a complete application-specific multiprocessor system, we need to use more general co-synthesis algorithms that do not rely on the CPU+ accelerator template.

In the most general case, all these tasks are related. Different partitions of functionality into processes clearly changes scheduling and allocation. Even if we chose a partitioning of functions, scheduling, allocating, and binding are closely related.

We want to choose the processing element for a process – both the general allocation and the binding to a specific type – based on the overall system schedule and when that process has to finish.

But we can’t determine the schedule and the completion time of a process until we at least choose an allocation and most likely a binding. This is the Gordian knot that co-synthesis designers must face-the set of intertwined problems that must somehow be unraveled.

Kalavade and Lee [Kal97] developed the Global Criticality/Local Phase (GCLP) algorithm to synthesize complex embedded system design. Their algorithm performs the hardware/software partitioning-style task of determining whether a task should be implemented in hardware or software and the schedule of the tasks in the system.

It also allows for several different hardware and/or software implementations of a given task and selects the best implementation for the system. The application is specified as a directed acyclic graph. The target architecture includes a CPU and a custom data path.

The hardware is limited to a maximum size and the software is limited to a maximum memory size. The various hardware and software implementations of the task are precharacterized.

Each node in the task graph has a hardware implementation curve and a software implementation curve that describe hardware and software implementation costs, respectively, for various bins that represent different implementations.

Their GCLP algorithm adaptively changes the objective at each step to try to optimize the design. It looks at two factors: the global criticality and the local phase. The global criticality is a glo ***a***bal scheduling measure.

It computes the fraction of as-yet unmapped nodes that would have to be moved from software to hardware implementations in order to meet the schedule’s deadline. The global criticality is averaged over all the unmapped nodes in each step.

The local phase computation improves hardware/software mapping. Heuristics are used to determine whether a node is best implemented in hardware or software; bit manipulations have a bias toward hardware implementation and extensive memory operations have a bias toward software.

These heuristics are termed repelling forces-for example, bit manipulations repel a node away from software implementation. The algorithm uses several repeller properties, each with its own repeller value. The sum of all the repeller properties for a node gives its repeller measure.

The algorithm computes extremity measures to help optimize mappings. A software extremity node has a long runtime but has a small hardware area cost. Similarly, a hardware extremity node has a large area cost but a small software execution time.

The co-synthesis algorithm iteratively maps nodes in the task graph. After selecting a node to map, it classifies a node as a repeller, an extremity, or a normal node.

SpecSyn [Gaj98] is designed to support a specify-explore-refine methodology. As shown in Figure 7-13 below, SpecSyn transforms functional specifications into an intermediate form known as SLIF. A variety of tools can then work on the design.

A refiner generates software and hardware descriptions. SpecSyn represents designs using a program-state machine model, which uses a Statechart-like description that allows complex sequential programs at the leaf states. The SLIF representation is annotated with attributes such as area, profiling information, number of bits per transfer, and so on.

Down & dirty with HW/SW co-design: Part 3 - Cosynthesis of multiprocessors

Figure 7-13 The SpecSyn system.

The allocation phase can allocate standard or custom processors, memories, or busses. Partitioning assigns operations in the functional specification to allocated hardware units.

During partitioning, performance is estimated using parameters such as bus frequency, bit width of transformations, and profiling values. Hardware size is estimated using estimates for the processing elements, busses, and so on, while software size is estimated by compiling code into generic three-operand instructions, then estimating code size for a specific processor by multiplying generic code size by a processor-code size coefficient.

A refined design adds detail but is simulatable and synthesizable, so its components can be used for further verification and synthesis. A variety of refinements can be performed, including: