Using Java to deal with multicore programming complexity: Part 2

Editor’s Note: In Part 2 of a three-part series on how embedded
developers can use Java to exploit multicore hardware more
effectively, Kelvin Nilsen details the factors to consider in deciding
whether to shift from C and C++ to Java, and provides some guidelines
for migrating legacy code to a Java-based multicore development
environment.

For the past several decades, huge legacies of
software have been created. Many embedded system industries have
experienced doubling of application size every 18 to 36 months. Now,
many of the companies responsible for maintaining these large legacies
are faced with the difficult challenge of modernizing the code to run in
multiprocessor environments.

One option: Keep your legacy code as is
Clearly,
the path of least resistance is to keep large code legacies as is, and
not worry about porting the code to new multiprocessor architectures.
Most code originally developed to run as a single non-concurrent thread
will run just fine on today’s multiprocessors as long as there is no
need to scale performance to more than one processor at a time.

Sometimes
you can scale to multiprocessor deployment by running multiple
instances of the original application, with each instance running on a
different processor. This improves bandwidth by enabling a larger number
of transactions to be processed in any given unit of time, but it does
not generally improve latency (the time required to process each
transaction).

Even if it is possible to preserve large portions
of an existing legacy application by replicating the application’s logic
across multiple processors, some parallel processing infrastructure
usually must be developed to manage the independent replicas. For
example, a supervisor activity might need to decide which of the N
replicas is responsible for each transaction to be processed. And at a
lower level, it may be necessary for all N replicas to share access to a
common database to assure that the independent processing activities do
not create integrity conflicts.
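
As a rough illustration of the supervisor role described above, the following sketch routes each transaction to one of N replicas. It assumes each replica can be modeled as a single-threaded executor wrapping the unchanged sequential logic, and that transactions carry a key (such as an account identifier) so that related transactions always land on the same replica. The class name `ReplicaSupervisor`, the `legacyLogic` function, and the key-hash routing policy are hypothetical choices made for this example. The sketch does not address access to a shared database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical supervisor that assigns each transaction to one of N replicas. */
final class ReplicaSupervisor<T, R> implements AutoCloseable {
    // One single-threaded executor per replica, so the legacy logic
    // inside a replica still runs strictly sequentially.
    private final List<ExecutorService> replicas = new ArrayList<>();
    private final Function<T, R> legacyLogic;

    ReplicaSupervisor(int replicaCount, Function<T, R> legacyLogic) {
        this.legacyLogic = legacyLogic;
        for (int i = 0; i < replicaCount; i++) {
            replicas.add(Executors.newSingleThreadExecutor());
        }
    }

    /** Assumed routing policy: the same key always maps to the same replica. */
    int replicaFor(Object key) {
        return Math.floorMod(key.hashCode(), replicas.size());
    }

    /** Hands the transaction to its replica and returns a handle to the result. */
    CompletableFuture<R> submit(Object key, T transaction) {
        ExecutorService replica = replicas.get(replicaFor(key));
        return CompletableFuture.supplyAsync(() -> legacyLogic.apply(transaction), replica);
    }

    @Override
    public void close() {
        for (ExecutorService replica : replicas) {
            replica.shutdown();
        }
    }
}
```

Routing by key is only one possible policy. A round-robin or least-loaded policy may balance work better when transactions do not share state.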

In general, it is good to reuse
time-proven legacy code as much as is feasible. However, in most cases,
significant amounts of code must be rewritten and/or added in order to
fully exploit the parallel processing capabilities of the more modern
hardware. We recommend the use of Java for new code development for
several reasons:

  1. In general, Java developers find themselves
    to be about twice as productive as C and C++ developers during the
    creation of new (sequential) functionality.
  2. During software
    maintenance and reuse activities, systems implemented with the Java
    language are maintained at one-fifth to one-tenth the cost of systems
    implemented with C or C++.
  3. Java has much stronger support than C and C++ for concurrent and parallel processing.
  4. There
    is a greater abundance of readily reusable, off-the-shelf,
    multiprocessor-aware software components written in the Java language.
  5. In recent years, it has been much easier to recruit competent Java developers than developers expert in the use of C and C++.

Bug-free Java code
Since
Java is designed for concurrent and parallel programming, many
off-the-shelf software components and many legacy Java applications are
already written to exploit concurrency. If you are lucky enough to
inherit a legacy of Java application software, migration to a
multiprocessor Java platform may be straightforward.

A word of
caution, however: even though Java code may have been written with
concurrency in mind, if the code has never been tested on a true
multiprocessor platform, it may be that the code is not properly
synchronized. This means certain race conditions may manifest on the
multiprocessor system even though they did not appear when the same code
ran as concurrent threads on a uniprocessor.
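
A minimal sketch of the kind of defect in question is shown below. The class and method names are hypothetical. The unsynchronized counter may appear to work when threads are time-sliced on one processor, yet it can lose updates once two processors execute `count++` at the same instant. The synchronized version shows one common fix.

```java
/** Hypothetical counters illustrating an unsynchronized update and one fix. */
final class EventCounters {

    /** Unsafe: count++ is a read, an add, and a write, so parallel updates can be lost. */
    static final class UnsafeCounter {
        private long count;
        void increment() { count++; }
        long get() { return count; }
    }

    /** Safe: the object's monitor serializes each update and makes it visible to other threads. */
    static final class SafeCounter {
        private long count;
        synchronized void increment() { count++; }
        synchronized long get() { return count; }
    }

    /** Runs several threads against a SafeCounter and returns the final count. */
    static long countWithThreads(int threadCount, int incrementsPerThread) {
        SafeCounter counter = new SafeCounter();
        Thread[] workers = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++) {
            workers[t] = new Thread(() -> {
                for (int n = 0; n < incrementsPerThread; n++) {
                    counter.increment();
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join(); // wait for every worker before reading the total
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.get();
    }
}
```

Swapping `UnsafeCounter` into `countWithThreads` is a quick way to see whether a given multiprocessor platform exposes the lost-update problem, although a passing run does not prove the code is correct.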

Therefore,
budgeting time to test and fix bugs is advisable. As Java code is
migrated from uniprocessor to multiprocessor platforms, code review or
inspections by peers may help identify and address shortcomings more
efficiently than extensive testing. Enrolling your entire staff of
uniprocessor-savvy Java programmers for training on multiprocessor
issues may be an investment that pays for itself many times over as part
of the preparation for migrating existing Java code from uniprocessor
to multiprocessor platforms.

The remainder of this section
discusses issues relevant to creating Java code for, or porting code into,
a multiprocessor environment.

Finding opportunities for parallel execution
To
exploit the full bandwidth capabilities of a multiprocessor system, it
is necessary to divide the total effort associated with a particular
application between multiple, largely independent threads, with at least
as many threads as there are processors. Ideally, the application
should be divided into independent tasks, each having the relatively
rare need to coordinate with other tasks. For efficient parallel
operation, an independent task would execute hundreds of thousands of
instructions before synchronizing with other tasks, and each
synchronization activity would be simple and short, consuming no more
than a thousand instructions at a time. Software engineers should seek
to divide the application into as many independent tasks as possible,
while maximizing the ratio between independent processing and
synchronization activities for each task. To the extent these objectives
can be satisfied, the restructured system will scale more effectively
to larger numbers of processors.
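
The following sketch illustrates this structure under simple assumptions: the work is a sum over a large array, each task owns a disjoint slice of the array, and the only coordination is the single hand-back of each task's partial result. The class name `ChunkedSum` and its methods are hypothetical, and the default task count of one per available processor is just one reasonable starting point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical example: independent tasks that synchronize only once, at the end. */
final class ChunkedSum {

    /** Uses one task per processor reported by the runtime. */
    static long parallelSum(int[] data) {
        return parallelSum(data, Runtime.getRuntime().availableProcessors());
    }

    /** Splits the array into taskCount slices (taskCount must be at least 1). */
    static long parallelSum(int[] data, int taskCount) {
        ExecutorService pool = Executors.newFixedThreadPool(taskCount);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            int chunk = (data.length + taskCount - 1) / taskCount; // ceiling division
            for (int t = 0; t < taskCount; t++) {
                final int from = Math.min(data.length, t * chunk);
                final int to = Math.min(data.length, from + chunk);
                tasks.add(() -> {
                    long local = 0; // private to this task, so no locking is needed
                    for (int i = from; i < to; i++) {
                        local += data[i];
                    }
                    return local;
                });
            }
            long total = 0;
            for (Future<Long> partial : pool.invokeAll(tasks)) {
                total += partial.get(); // the only coordination point per task
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Each task accumulates into a local variable rather than a shared total, which keeps the ratio of independent processing to synchronization high.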

Serial code is code that
multiple tasks cannot execute in parallel. As software engineers work to
divide an application into independent tasks, they must be careful not
to introduce serial behavior among their independent tasks because of
contention for shared hardware resources. Suppose, for example, that a
legacy application has a single thread that reads values from 1,000
input sensors. A programmer might try to parallelize this code by
dividing the original task into ten, each one reading values from 100
sensors.

Unfortunately, the new program will probably experience
very little speedup because the original thread was most likely I/O
bound. There is a limit on how fast sensor values can be fetched over
the I/O bus, and this limit is almost certainly much slower than the
speed of a modern processor. Asking ten processors to do the work
previously done by one will result in ineffective use of all ten
processors rather than just one. Similar I/O bottlenecks might be
encountered with file subsystem operations, network communication,
interaction with a database server, or even a shared memory bus.

Explore
pipelining as a mechanism to introduce parallel behavior into
applications that might appear to be highly sequential. See Figure 2 for an example of an application that appears inherently serial.


Figure 2. Soft PLC Implementation of Proportional Integral Derivative (PID) Control

This same algorithm can be effectively scheduled on two processors as shown in Figure 3.


Figure 3. Two-Processor Pipelined Implementation of Soft PLC

Pipelined
execution does not reduce the time required to process any particular
set of input values. However, it allows an increased frequency of
processing inputs. In the original sequential version of this code, the
control loop ran once every 20 ms. In the two-processor version, the
control loop runs every 10 ms.

This particular example is
motivated by discussions with an actual customer who was finding it
difficult to maintain a particular update frequency using a sequential
version of their software Programmable Logic Controller (PLC). In the
two-processor version of this code, one processor repeatedly reads all
input values and then calculates new outputs, while the other processor
repeatedly logs all input and output values and then writes all output
values. This discussion helped the customer understand how they could
use multiprocessing to improve update frequencies even though the
algorithm itself was inherently sequential.

The two-processor
pipeline was suggested based on an assumption that the same I/O bus is
fully utilized by the input and output activities. If inputs are
channeled through a different bus than outputs, or if a common I/O bus
has capacity to handle all inputs and outputs within a 5 ms time slice,
then the same algorithm can run with a four-processor pipeline, yielding
a 5 ms update period, as shown in Figure 4.


Figure 4. Four-Processor Pipelined Implementation of Soft PLC

Remember
that a goal of parallel restructuring is to allow each thread to
execute independently, with minimal coordination between threads. Note
that the independent threads are required to communicate intermediate
computational results (the 1,000 input values and 1,000 output values)
between the pipeline stages. A naive design might pass each value as an
independent Java object, requiring thousands of synchronization
activities between each pipeline stage. A much more efficient design
would save all of the entries within a large array, and synchronize only
once at the end of each stage to pass this array of values to the next
stage.
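
A minimal sketch of the array-based hand-off in a two-stage pipeline follows. It is not the customer's implementation. The class name `TwoStagePipeline`, the sentinel `STOP`, and the arithmetic standing in for sensor reads, the PID calculation, and output writes are all assumptions made to keep the example self-contained. The point it illustrates is that each control cycle involves exactly one synchronization, the `put` of a whole array, rather than one per value.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical two-stage pipeline that passes one array per control cycle. */
final class TwoStagePipeline {
    static final int CHANNELS = 1_000;
    private static final double[] STOP = new double[0]; // sentinel that ends the run

    /** Runs the given number of cycles and returns the sum of all values stage 2 "wrote". */
    static double run(int cycles) {
        BlockingQueue<double[]> handOff = new ArrayBlockingQueue<>(1);
        double[] written = new double[1]; // touched only by stage 2 until join()

        // Stage 2: stand-in for logging the values and writing the outputs.
        Thread stage2 = new Thread(() -> {
            try {
                for (double[] outputs = handOff.take(); outputs != STOP; outputs = handOff.take()) {
                    for (double value : outputs) {
                        written[0] += value;
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        stage2.start();

        try {
            // Stage 1: stand-in for reading the inputs and calculating new outputs.
            for (int c = 0; c < cycles; c++) {
                double[] outputs = new double[CHANNELS]; // fresh array for each cycle
                for (int i = 0; i < CHANNELS; i++) {
                    double input = c + i;       // placeholder for reading sensor i
                    outputs[i] = 0.5 * input;   // placeholder for the control calculation
                }
                handOff.put(outputs); // the only synchronization in this cycle
            }
            handOff.put(STOP);
            stage2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return written[0];
    }
}
```

Stage 1 allocates a fresh array for every cycle so that it never overwrites values stage 2 is still consuming. A production design might instead recycle a small pool of preallocated arrays to avoid allocation in the control loop.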