Using Java to deal with multicore programming complexity: Part 3 – Using Java with C and C++ for real-time multicore designs

Editor’s Note: In this final part in a three-part series, Atego’s Kelvin Nilsen provides guidelines on how to use the combination of the Java parallel programming language and traditional sequential C and C++ methodologies to achieve real-time multicore software operation.

When multiple processors work together to accomplish a particular work assignment, the goal to minimize coordination activities may conflict with the need to assure that each processor has plenty of worthwhile work to perform. Coordination is often required to check the status of each processor’s ongoing activities, as the work completed by one processor may affect the work assignments given to others.

Individual processors are most efficient if they can focus their efforts on a small set of tasks that remain resident in that processor’s local memory and cache. However, aligning the efforts of each processor with globally important objectives may require frequent redirection of each individual processor’s efforts. Trading off between productivity of individual processors and satisfaction of global objectives is an intricate balancing act. This section discusses several alternative approaches to achieving this, and ends with a discussion of how to use Java in combination with the C and C++ programming languages.

Throughout this series, we have assumed real-time operation is an important consideration. Thus, there is an expectation that priorities are honored by the Java virtual machine (JVM) and underlying operating system. Note that non-real-time virtual machines do not necessarily honor thread priorities and will sometimes schedule a lower priority thread to run even though higher priority threads are ready. Real-time virtual machines are generally configured to schedule in full accordance with thread priorities.

One JVM for each processor
With a four-way multiprocessor, you can start up four independent Java virtual machines (JVMs), each one bound to a different processor. One advantage of this approach is that each JVM sees itself as running on a uniprocessor, and can thus use simpler synchronization protocols. Another advantage is that there is strong isolation of behavior between the code running on the independent processors. Each instance of the JVM has independent memory and CPU budgets, with very little potential for interference from one JVM to the next.

The key disadvantage of this approach is that it is much more difficult to share information between threads running in distinct JVMs. The typical communication mechanism would use Java Remote Method Invocation (RMI), orders of magnitude more costly than synchronized access to shared objects residing in the same process space. This high cost for information sharing makes it especially difficult to balance load, because this would typically require copying of large amounts of data, and possibly code as well, from one JVM to another.

Single JVM with global scheduling of Java threads
An alternative configuration runs a multiprocessor-aware JVM across all – or some subset of – available processors. Individual threads are free to migrate automatically between processors. The JVM uses a global scheduling technique, assuring at all times that the N highest priority threads in the Java application are running on the N available processors.

This approach makes information sharing between threads (and processors) very easy because all objects reside in the same global address space. Load balancing is automatically performed by the global scheduler. The disadvantages of this approach are that (1) the isolation barriers around independent components are less rigid, making the application vulnerable to CPU or memory starvation from misbehaving components; (2) frequent global synchronizations are required to maintain the data structures required to support global scheduling decisions; and (3) scheduling the N highest priority ready threads in the system requires frequent migration of threads from one processor to the next, resulting in poor cache locality.

Single JVM with threads bound to individual processors
Standard edition Java does not offer an API to set the affinity of particular threads to particular processors. However, some virtual machines offer special support for this capability. The idea is that certain threads can be bound to particular processors, thereby improving the cache locality and reducing the burden on a global scheduler. Binding threads to particular processors also improves predictability of scheduling, enabling analysis of schedulability to guarantee that threads will satisfy their real-time constraints.

Since all of the threads are running within the same virtual machine, they all share access to the same objects. This makes information sharing and load balancing straightforward. Note that load balancing must be performed by application software logic and does not happen automatically, unlike the globally scheduled approach.

One difficulty with assigning thread affinity to processors is that this complicates the implementation of priority inheritance. Suppose, for example, that thread α, bound to processor 1 and running at priority 6, holds a lock on resource R. Suppose further that there is a higher priority thread β, also bound to processor 1, running at priority 10. Now suppose that thread
γ, running on processor 2 at priority 8, attempts to lock resource R. Since the resource is already locked by a lower priority thread, it will endow its priority to thread α because it currently holds the lock. However, raising thread α to priority 8 does not allow it to run, because it is bound to processor 1 and processor 1 is consumed by the demands of thread
β, which is running at priority 10.

If you are going to pursue an approach that binds individual threads running within a single JVM to different processors, it is important to know how the virtual machine deals with priority inheritance and take this into consideration when performing schedulability analysis. One JVM approach is to ignore a thread’s specified affinity whenever the thread is running with inherited priority. Another approach is to inherit not only thread priority but also thread affinity. A third approach is for the JVM to strictly honor thread affinity and simply document that priority inheritance does not work reliably when shared synchronization objects are accessed from threads bound to different processors.

Hybrid approaches. Various hybrid approaches are possible, each providing different tradeoffs between strong isolation of resources, ease of information sharing, and support for dynamic load balancing. A uniprocessor JVM may be bound to one processor, while a multiprocessor JVM may be bound to certain other processors. Each multiprocessor JVM may support a combination of global scheduling for certain threads and localized scheduling for threads bound to particular processors. In this sort of hybrid configuration, an unbound thread would be scheduled on any processor for which the highest priority bound ready thread has lower priority than the unbound thread’s priority. Note that allowing unbound threads to freely migrate between processors that are also scheduling bound threads adds complexity to the scheduling analysis performed on each processor. Address this complexity by limiting which processors an unbound thread is allowed to run on, or by assigning lower priorities to unbound threads.

Analysis of real-time constraints

An important activity in the development of any real-time system is analysis of compliance with real-time constraints. This activity, often described as schedulability analysis, requires a good understanding of the CPU time consumed by every real-time task, the frequency with which each task is triggered for execution, the maximum amount of time each task might be blocked contending for access to shared resources, and the deadline for completion of the task’s work measured from the moment it is triggered.

Many thousands of research papers have reported on various aspects of real-time schedulability analysis. One of the most pragmatic and common approaches, especially for software components that are independently developed and maintained, is rate monotonic analysis. Other scheduling techniques are able to extract higher utilization from the CPU, but these other techniques often incur higher run-time overheads, may require greater degrees of integration between the independently developed software components, and most do not deal as gracefully with transient work overloads as does the rate monotonic technique.

In hard real-time systems, which are typically characterized by an expectation that the system is proven to always meet all timing constraints with no tolerance for the slightest deviation from scheduling constraints, programmers are expected to theoretically analyze all of the worst case behaviors of each task and feed these worst-case numbers in as parameters to the schedulability analysis. In soft real-time systems, the expectation is less rigorous. Approximations of task behaviors based on various measurements are accepted as inputs to the schedulability analysis. The expectation is that scheduling analysis based on approximations of task behavior will usually satisfy all deadlines, but may occassionally miss certain deadlines by relatively small amounts of time.

RMA: uniprocessor vs. globally scheduled multiprocessing
Rate monotonic analysis (RMA) is much simpler, and provides higher guaranteed utilization of available CPU resources, on a uniprocessor. The basic approach of rate monotonic analysis is to assign task priorities in order of decreasing deadlines, so the tasks with the shortest deadlines get the highest priority.

In the absence of blocking behaviors, rate monotonic analysis promises that if the total CPU utilization of a workload for which task priorities have been assigned in rate monotonic order is less than 69%, all of the tasks will always satisfy their real-time constraints. Utilization is computed by multiplying every task’s worst-case execution time by its maximum execution frequency and summing the results. The analysis is more complex when tasks contend for exclusive access to shared resources. For a full description, see reference [13].

When an N-way multiprocessor is configured to always run the N highest priority ready threads in parallel on the N available processors, we call this global scheduling. Rate monotonic theory has been generalized to globally scheduled multiprocessors [14]. An interesting result is that the globally scheduled multiprocessor system needs much more slack than a uniprocessor in order to guarantee compliance with real-time constraints. One form of the result for globally scheduled multiprocessors is given by the following:

Corollary 1. Any periodic task system in which each task’s utilization is no more than one-third and the sum of the utilization of all the tasks is no more than N/3 is successfully scheduled by the Rate Monotonic Scheduling Algorithm upon N unit-capacity processors.

Note that the rate monotonic scheduling guarantee in this context applies only if the total utilization is less than 33.3% of the total capacity of all N processors. This is much lower than the 69% utilization that can be effectively scheduled on a uniprocessor. When applying this approach, it is also important to consider that the frequent migration of tasks from one processor to another is likely to increase the expected execution times of individual tasks compared to repeated execution of the tasks on the same processor because of loss of cache locality benefits.

Rate monotonic analysis for locally scheduled multiprocessors

When threads are bound to particular processors, the schedulability analysis for each processor is the same as the uniprocessor version of rate monotonic analysis. The greatest difficulty with this approach is deciding on the most optimal partitioning of threads between available processors as this is a form of the bin packing problem, known to be NP-hard. A 1998 paper describes a simple first-fit-decreasing partitioning heuristic and includes an analysis demonstrating that the use of this heuristic to assign threads to particular processors will result in a system that is rate-monotonic schedulable as long as:
where U is the total workload utilization and N is the number of processors [15]. Simplifying the math, this approach guarantees to satisfy real-time constraints as long as the total utilization is no greater than 41% of the multiprocessor capacity.

Real-time multicore garbage collection
As with uniprocessor real-time garbage collection, the key requirements for robust and reliable operation of real-time garbage collection are that it be

  • preemptible, allowing higher priority tasks to run even when garbage collection is active;
  • incremental, making appropriate forward progress during available increments of CPU time so that it does not need to restart itself whenever it is resumed following preemption;
  • accurate, guaranteeing to find and reclaim all garbage in the system;
  • defragmenting, relocating live objects to contiguous memory locations and coalescing contiguous free segments into larger free segments to improve reliability and allocation efficiency; and
  • paced, to ensure that the garbage collector replenishes the free pool at a rate consistent with the application’s appetite for new memory allocation.

The objective of garbage collection pacing is to make sure that garbage collection gets enough increments of CPU time to make sure it consistently replenishes the allocation pool before the available supply of memory has been exhausted.