Thread synchronization techniques for better multicore system power/performance tradeoffs

It’s a common refrain heard among embedded software design teams everywhere, when the team manager declares, “We need better system power management from both the hardware and software, but we also need to optimize the design for increased functionality and performance.”

So, how can a software designer hope to accomplish such a feat when in order to maximize one side of the equation the other side has to be minimized? You just can’t have the best of both worlds.

Interestingly, within embedded multicore system design, unique situations present themselves in which power and performance can complement one another – rather than being in a state of perpetual competition. This article considers one such scenario where a symmetrical multiprocessing (SMP) RTOS, multicore hardware, and power management features combine to facilitate parallel embedded programming.

While power management features of a full-featured RTOS contribute significantly to power efficient design, this article takes an in-depth look at a thread synchronization mechanism based on hardware supplied primitives, which guarantees power savings while improving system performance.

Multicore to the rescue

The embedded world is quickly embracing multicore technology. Almost all major silicon vendors offer multicore chips such as ARM (Cortex A9), Freescale Semiconductor’s PowerPC (QorIQ), and MIPS (1004K MT).

Contrary to what you might think, increased performance is not the only factor behind this trend. Embedded systems cannot afford high performance at the cost of elevating the power budget or compromising legacy software. Embedded silicon providers, therefore, are focusing on a small number of cores (usually no more than eight). Already systems in mobile and networking with handheld devices and high speed switches/routers, respectively, fit this multicore paradigm quite well.

An embedded runtime software solution including the OS, middleware, and the end applications targeting multicore platforms has to take performance, power requirements, and existing software design into consideration. Although parallel computing is a fairly advanced field with many programming, architectural, and compiler technologies available – parallel embedded development poses new challenges because of the variety of hardware and tools involved that underline the need for flexibility.

Thread-level parallelism

With no widely accepted multicore programming standard available, most embedded application developers rely on the thread-level parallelism (TLP) supplied by the RTOS. In theory, this approach promises no change in the legacy code if the RTOS support for the multicore architecture is robust.

Unfortunately overhead is an unavoidable byproduct of TLP, existing in part because of the creation and termination of worker threads as well as the need of synchronization between individual contexts. Typical embedded systems have a limited number of threads and usually a software context is activated as a result of an external event such as an interrupt. This thread then has to complete its execution in stipulated time.

Based on these properties of embedded runtime applications, this article presents a scheme to reduce TLP overhead by using essential thread attributes of preemption and priority in a RTOS. The basic idea is to prioritize parent and child threads at different levels so synchronization time is not required. In situations where synchronization is unavoidable, the proposed scheme utilizes hardware-supplied, power-saving primitives.

Certainly similar problems have been tackled in terms of compiler dependent technologies which include speculative scheduling and OpenMP-based solutions. Another approach is to devise load balancing strategies and new real-time scheduling algorithms to get maximum performance/utilization from multiple resources.

But these schemes often require significant changes in either application logic or the operating system code, which is a significant drawback for a practical system in the sense that a lot of the legacy/certified/robust code is either jeopardized or rendered unusable. Instead of finding an optimal scheduling technique or counting on a fancy compiler technology, a portable scheme that will not affect application code at all and only involves changes in wrapper function of RTOS interface is proposed.