Multicore programming: it’s all in the messaging

As an embedded developer with an application running on a single processor, you may want to improve performance, or performance per watt, by moving to multicore. You'd likely want the answer to this question: Is there a magic bullet? In other words, can you simply move your application onto a multicore platform and have it automatically run faster?


Looking for the magic bullet

If you're using a personal computer or server with multiple applications, it is possible to get performance improvements automatically because the different applications are independent.


Yes, they share the file system on the hard disk and a few other things managed by the OS, but they don't share data or need to synchronize with each other to perform their services. As long as they are independent and don't interfere with one another, you will likely see performance improvements. However, as the number of cores increases, this type of system yields diminishing returns.


In contrast, if you are working with an embedded system, your application will not automatically be distributed across multiple cores; it has to be explicitly parallelized to achieve a performance improvement. All applications have some code that is inherently sequential, and they generally have regions that can run concurrently but require synchronization.


Different parts of an application may also share global variables and may use pointers to reference those global variables. When distributing an application across multiple cores, you gain true concurrency.


This means that global variables that could safely be accessed on a single processor now have to be safeguarded, so that multiple cores cannot access them at the same time and corrupt the data.
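As a minimal sketch in C, assuming a POSIX threads environment, a mutex can serialize access to such a shared global (the counter here is purely illustrative):

```c
#include <pthread.h>

/* A global that was safe on one core but is now visible to
   threads running truly concurrently on several cores. */
static long hit_count = 0;
static pthread_mutex_t hit_lock = PTHREAD_MUTEX_INITIALIZER;

void record_hit(void)
{
    pthread_mutex_lock(&hit_lock);    /* serialize access          */
    hit_count++;                      /* read-modify-write is now  */
    pthread_mutex_unlock(&hit_lock);  /* safe against other cores  */
}
```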


You may now be wondering how to combine speed, planning, and continuous assessment and adjustment in a multicore approach to achieve the desired results in the shortest possible time.


Start by selecting an appropriate programming model for the current project, one that potentially offers long-term cost savings on future projects if done right.


Four common models are familiar to embedded system programmers: OpenMP (Open Multiprocessing), OpenCL (Open Computing Language), and two message-passing APIs, MPI (Message Passing Interface) and MCAPI (Multicore Communications API).


OpenMP is commonly used to parallelize loops across multiple cores in SMP environments. It is a language extension based on compiler directives and is typically used on systems with a general-purpose SMP operating system such as Linux, shared memory, and homogeneous cores.
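As a minimal sketch, assuming a compiler with OpenMP support (e.g., built with -fopenmp), a single directive splits a loop's iterations across the available cores; the function and array names are illustrative:

```c
/* Scale an input array into an output array; the OpenMP runtime
   divides the iterations among the cores. */
void scale(const float *in, float *out, int n, float k)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        out[i] = k * in[i];
}
```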


While embedded systems may include general-purpose computing elements, they often have requirements better served by asymmetric approaches, which fall outside OpenMP's focus. OpenMP is managed by the OpenMP Architecture Review Board (ARB), a technology consortium.

OpenCL is primarily used for GPU programming. It is a framework for writing programs that span CPUs, GPUs (graphics processing units), and potentially other processors. OpenCL provides a path to executing applications that go beyond graphics processing, for example vector and math processing, covering some areas of embedded computing. The OpenCL open standard is maintained by the Khronos Group.
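As a minimal sketch of the model, a kernel is written in OpenCL C (a C dialect) and runs once per work-item; the host-side setup of platform, device, and command queue is omitted here, and the kernel name is illustrative:

```c
/* OpenCL C kernel: each work-item adds one pair of elements. */
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    size_t i = get_global_id(0);  /* this work-item's global index */
    c[i] = a[i] + b[i];
}
```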


Message passing: A familiar route to multicore

Message passing, an even more commonly used model, appears in networking, in high-performance computing, and often within operating systems themselves. In other words, it is a ubiquitous model and familiar to many programmers, and it is used in both SMP and AMP environments. MPI is native to high-performance computing, whereas MCAPI is primarily focused on embedded computing.


Both use explicit communication functions such as send and receive. MPI provides capabilities for widely distributed computing in dynamic networks, typically with general-purpose OSes supporting the process model, whereas MCAPI is focused on closely distributed (embedded) computing in static topologies, where there may be a general-purpose OS, an RTOS, or no OS at all.
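As a minimal sketch of the explicit send/receive style, here is a two-rank MPI exchange in C; it assumes the program is launched with at least two ranks (for example, via mpirun):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Explicit, blocking send to rank 1 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Explicit, blocking receive from rank 0 */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```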


MPI and MCAPI have similarities as well as complementary differences and can therefore beneficially be combined in a distributed cluster of many-core computers.


What is message passing?

By whatever name, message passing uses explicit communication functions such as send and receive. Message passing can be connectionless or connection-oriented, and the functions can be blocking or non-blocking.
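To illustrate the blocking/non-blocking distinction, here is a non-blocking variant of the MPI send from the sketch above; it assumes MPI has already been initialized, and the buffer must stay valid until the wait completes:

```c
#include <mpi.h>

/* MPI_Isend returns immediately, so the sender can overlap useful
   work with the transfer and block only when it must be complete. */
void send_async(int *value)
{
    MPI_Request req;

    MPI_Isend(value, 1, MPI_INT, 1 /* dest */, 0 /* tag */,
              MPI_COMM_WORLD, &req);

    /* ... do useful work while the message is in flight ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```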


The message itself consists of a header, i.e., metadata such as source and destination addresses, message and/or payload size, priority, and so on, and sometimes payload data. This is analogous to a letter, where the address and return address are the metadata and the content is the payload. And just like a letter in the mail, the message is processed to determine the appropriate route and its priority and latency requirements. A network packet is another example of a message.
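A minimal sketch of such a layout in C; the field names and sizes are illustrative assumptions, not taken from any particular standard:

```c
#include <stdint.h>

typedef struct {
    uint16_t src_addr;      /* return address: sending endpoint     */
    uint16_t dst_addr;      /* destination endpoint                 */
    uint8_t  priority;      /* routing/queueing priority            */
    uint8_t  flags;         /* e.g., payload by value or reference  */
    uint32_t payload_size;  /* number of payload bytes that follow  */
} msg_header_t;

typedef struct {
    msg_header_t header;    /* metadata: always present             */
    uint8_t      payload[]; /* optional data, payload_size bytes    */
} message_t;
```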


If the sender and the receiver share memory, only the metadata, with a reference (pointer) to the payload, needs to be exchanged between them. The receiver can then access the payload data without actually moving it; the advantage of sharing data by reference is that less data has to be moved, which reduces transaction time. When the sender and receiver cannot see the same memory, however, the data must be copied.
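A sketch of the by-reference case, assuming the sender and receiver can dereference the same addresses; the types and names are illustrative:

```c
#include <stdint.h>

/* Only the metadata and a pointer travel; the payload stays put. */
typedef struct {
    uint32_t payload_size;
    void    *payload_ref;   /* pointer valid on both sides */
} ref_msg_t;

/* The receiver reads the payload in place, with no copy. */
uint8_t first_byte(const ref_msg_t *m)
{
    const uint8_t *data = m->payload_ref;
    return m->payload_size ? data[0] : 0;
}
```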


Message passing is generally platform-agnostic. In networking, where the sender and the receiver may be running on different kinds of computers and operating systems, and where a packet may pass through many machines along the way, you use the sockets API, which looks the same on every system involved in the communication. Using a similar paradigm, message passing can be applied to a range of different multicore platforms.
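For comparison, the same explicit send/receive pattern through the BSD sockets API; the peer address and port here are placeholders:

```c
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    /* Connect to a hypothetical peer at 192.0.2.1:5000 */
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in peer = {0};
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(5000);
    inet_pton(AF_INET, "192.0.2.1", &peer.sin_addr);

    if (connect(fd, (struct sockaddr *)&peer, sizeof peer) == 0) {
        const char msg[] = "hello";
        send(fd, msg, sizeof msg, 0);                 /* explicit send    */

        char reply[64];
        ssize_t n = recv(fd, reply, sizeof reply, 0); /* explicit receive */
        if (n > 0)
            printf("received %zd bytes\n", n);
    }
    close(fd);
    return 0;
}
```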


Message passing is also scalable, as networking and the MPI-based high-performance computing systems that run on enormous numbers of cores demonstrate. Because it is transport-agnostic, it can avoid the bus bandwidth bottlenecks that can occur in SMP environments with many cores.


Something to pay attention to with message passing is the transaction cost, which comes from processing the message and transferring the metadata and the data payload. Any time you can avoid moving the payload, you save transaction time. Message passing is a time-tested method, and expertise is widely available.