Using Java to deal with multicore programming complexity: Part 1

Summary: In this first part in a three part series on the use of the Java parallel programming language as the main method of doing multicore software development or or as an adjunct to traditional sequential C and C++ methodologies, Atego’s Kelvin Nilsen surveys some of the special issues that must be addressed when writing software for multiprocessor hardware and how Java can ease the transition.

Editor’s Note: Developers of embedded software are being forced to figure out ways to exploit the complexity of multicore platforms with languages and library software not designed for multiprocessor environments. In this first part in a series of four articles,Kelvin Nilsen surveys some of the issues that must be addressed when writing software for multiprocessors . In Part 2 he explains how Java, originally conceived with parallel programming in mind, can be combined with existing C and C++ software to preserve legacy software during the transition to multiprocessor execution. Part 3 deals with issues relating to combining C/C++ and Java for real-time multicore performance.

As semiconductor manufacturers continue to shrink silicon circuit sizes, computer engineers are able to pack more sophisticated logic and larger caches onto each chip. Over the years, these transistors have been used to increase typical data path size from 4 bits to 64 bits, add math coprocessors and specialized math instructions, implement multiple instruction dispatch, dynamic instruction scheduling, deep pipelines, superscalar operation, speculative execution, and branch prediction. They have also been used to expand the sizes of memory caches. Most of these hardware optimizations have served to automatically identify and exploit opportunities for parallel execution that is inherent in a single application’s instruction stream.

As more transistors are placed on a single chip, the connecting wires between adjacent transistors also shrink. This means less time is required to propagate a signal from one transistor to the next, enabling processors to run at higher clock frequencies, meaning less time is required to execute each instruction.

Prior to the turn of the century, nearly all high-volume microprocessors were single core solutions. Starting in about 1995, single core microprocessors began to show a scalability weakness. These processors were falling below the Moore’s law prediction (Figure 1) for ever-expanding transistor densities, sometimes significantly below it. This was due in large part to questions of balance. It is relatively easy, for example, to use the available transistor budget to expand on-chip memory cache. But expanding the on-chip cache is only worthwhile if the performance gains will be proportional to the increased chip costs, and chip designers have determined through simulation studies that further expansion is not cost effective without a balanced increase in other components.

Click on image to enlarge.

Figure 1. Microprocessor complexity as represented by transistor count. Moore’s law prediction is shown as black line, single core as blue., multi-core as red.

The typical balanced increase in other components would normally include expanding the word size and ratcheting up the clock frequency. Larger caches, wider word sizes, and faster clock rates all increase power consumption and heat dissipation requirements. Power and heat dissipation are limiting further growth along the traditional scalability paths. Faster CPU clock rates also increase the speed differential between on-chip activities and off-chip transactions with memory and I/O devices, which tends to decrease overall efficiency.

With uniprocessor platforms, each doubling of performance has required a doubling of power consumption. Switching to multicore architectures changes this dynamic. The dual-core AMD Opteron 275, for example, runs 1.8 times faster on certain benchmarks than its single-core equivalent, the Opteron 248, but dual core processor uses only 7% more power. Thus, the performance-per-watt rating of the dual-core architecture is roughly 68% better than the single-core architecture.

The motivations for the move towards multicore architectures are easy to understand and industry trends cannot be denied. For the foreseeable future, new processor designs will almost always feature multicore capabilities. And single-core architectures are being phased into obsolescence. Like it or not, software engineers are going to have to befriend multiprocessing platforms.

Multicore hardware demands on software
Developing software for deployment on multiprocessor platforms is very different than traditional development for uniprocessors. Kunle Okukotun, Director of the Pervasive Parallelism Lab at Stanford University, recently explained “Anything performance-critical will have to be rewritten. Eventually, you either have to rewrite that code or it will become obsolete” [7]. Alex Bachmutsky, a system architect and author of “System Design for Telecommunications Gateways”, adds “This is one of the biggest problems telecom companies face today. Their apps were not written for multiple cores and threads, and the apps are huge; they have millions and tens of millions of lines of code” [7].

The challenges are several fold. Among them, software engineers targeting a multiprocessor must:

  1. Assure that the application is represented as a collection of independent threads or tasks, with minimal information sharing and coordination requirements between these tasks;
  2. Design and implement efficient and reliable task coordination mechanisms;
  3. Address load distribution and load balancing issues;
  4. Analyze inter-task contention and schedulability in order to demonstrate compliance with real-time constraints; and
  5. Structure their software solutions so that they are easily maintained and ported between distinct multiprocessor implementations.

One of the challenges of porting software to a multiprocessor platform is that the maximum speedup available through running an application in parallel on multiple processors is limited by Amdahl’s law [9, 10]. That is, for any given application, certain parts of the application are inherently sequential and cannot run in parallel on multiple processors. Let Q represent the percentage of the application that is inherently sequential. Let N represent the number of processors comprising the targeted multiprocessor. Amdahl’s law states that the maximum speedup S available on this multiprocessor is limited by the following mathematical expression:
Suppose, for example, that Q = 5% and N = 4. Applying Amdahl’s law, we find that the maximum speedup S is 3.5. Increasing N to 8 changes the limit to 5.9. Though Amdahl’s law may limit overall speedup, it does not necessarily limit processing capacity. Consider, for example, a software-implemented telephone switch that involves a significant amount of sequential work to set up each call, and then a comparable sequential amount of work to tear down the call. Running this telephone switch software on a multiprocessor will not decrease the time required to set up or tear down each call, but it will very likely increase the number of calls that can be initiated and terminated every second.

In the quest to achieve the theoretical speedup limits of Amdahl’s law, most legacy software must be restructured to expose the potential for parallel execution paths. Tools that automatically detect parallelism inherent in numeric matrix algorithms provide limited success with automatic generation of parallel code for those specialized algorithms [8]. However, the generality required by typical information processing applications, with complex linked data structures and data structure traversals, makes it very difficult to do automatic parallelization of this code.

Thus, most experts believe that manual translation of sequential algorithms is necessary in order to effectively exploit multiprocessor parallelization of software systems. When manually translating sequential programs into parallel execution paths, software engineers need to target a programming language that allows them to express parallel constructs in concise notations that are robust, easily maintained, and scalable across a broad spectrum of multiprocessor hardware configurations. Since the Java language was invented after the importance of parallel execution on multiprocessors was well understood, it has features designed to facilitate multi-threaded programming.

“The ubiquitous C language is the worst [tool] because it is inherently sequential and obscures any parallelism inherent in an algorithm,” said Jeff Bier, President of DSP consulting firm Berkeley Design Technology [7]. Both C and C++ suffer because the design of these languages never considered the special needs of multiprocessing. Various real-time operating system (RTOS) vendors have found ways to extend C into the domains of concurrent and fully parallel execution. However, this approach is hindered because the extensions are bolted on to the language in the form of libraries, without any standards-based support from the language syntax or the compilers.

Over the years, C and C++ language compilers and the language definition itself have evolved to extract high performance from modern uniprocessors. Thus, the language allows caching values in registers and speculative and out-of-order execution. One important consideration, from the perspective of writing code for execution on multiprocessors, is there is no clean way to disable these optimizations in selected contexts that might involve coordination with other processors. Many of the mechanisms that might be used to implement reliable information sharing between processors using particular C compilers with particular hardware and particular RTOS are highly non-portable. Great difficulty is encountered when developers attempt to move the code to a new RTOS, CPU, or C compiler.