Transitioning to multicore processing

Hesitating to make the shift from single- to multiple-core processing in your design? Here’s a guide to making the transition.

The transition to multicore processing requires changing the software programming model, scheduling, partitioning, and optimization strategies. Software often requires modifications to divide the workload among cores and accelerators, to use all available processing in the system and maximize performance. Here’s how you and your team can make the switch.

Networking systems, for example, normally include control-plane and data-plane software (shown in Figure 1). The control plane is responsible for managing and maintaining protocols (such as OSPF, SNMP, IPSec/IKE) and other special functions such as high-availability processing, hot plug and play, hot swap, and status backup. Control-plane functions include management, configuration, protocol handshaking, security, and exceptions. These functions are reliability sensitive but not extremely time sensitive. Normally, control-plane packets and frames account for only ~5% of the overall system load.


Data-plane functions focus on high-throughput data processing and forwarding. Once the required connections and links are established by the control plane, most traffic is data-plane packets. Normally ~95% of the overall system load will be for data-plane packets and frames. Therefore, overall system throughput and performance primarily depends on data-plane processing capacity, and any optimization in this area can significantly increase system performance. The data plane’s software complexity is lower, primarily focusing on packet header analysis, table lookups, encapsulation/decapsulation, counting and statistics, quality of service (QoS), and scheduling, among others.

Example migration

A network router is a good example of the migration from single-core to multicore processing. The software architecture for these products has evolved over the last several years:

Unit routers: All software, including every control-plane and data-plane module, runs on a single-core CPU. These modules are standalone tasks, processes, or threads running on a real-time operating system (RTOS). Software integrators must carefully tune the priority of each task to get the best system performance. High-performance functions such as table lookups (FIB, 5-tuple classification, NAT) are performed in software, often with the help of offline assist engines (for example, encryption/decryption/authentication for IPSec-related applications) running on an FPGA, ASIC, or other acceleration device connected to the CPU. This architecture is used in low-end and ultra-low-end unit routers. System performance is limited because all processing is centralized on one CPU core.
Chassis routers: These products have a more distributed system architecture without significant support from an ASIC. The main processing unit (MPU) cards manage control-plane jobs. Line processing unit (LPU) cards manage data-plane jobs. Each MPU and LPU card contains one single-core CPU. These CPUs are connected to each other through the backplane (normally an FE/GE port switch). All user-end interfaces are provided on the LPU cards. The MPU cards provide only the management interface and the heartbeat/backup interfaces. The LPU cards may have optional acceleration engines (encrypt/decrypt/authenticate FPGAs or ASICs) sitting beside the CPU. The master MPU discovers the routing topologies and generates the forwarding information base (FIB) entries for each LPU. The LPUs perform the data-plane jobs (forwarding and so forth) for user data packets. Both the MPU and LPU run multiple tasks on top of an RTOS. Overall system performance is much better than in unit routers because of the distributed processing and the scalability of the LPUs.
Chassis high-end routers: These routers use a distributed system architecture with an ASIC or network processor (NP). Each LPU card contains additional acceleration (ASIC or NP) powerful enough to perform the data-plane jobs at high speed. Normally, the backplane connecting all the ASICs/NPs is a dedicated crossbar or fabric. The general-purpose CPU on each LPU card handles interprocessor communication (IPC) and configures the ASIC/NP tables. The two variants differ in one respect: the ASIC can provide higher and steadier data-processing rates than the NP, while the NP can provide more flexible functionality. Both the MPU and LPU run multiple tasks on top of an RTOS.

For all three architectures described, the software running on each CPU is still a logically standalone system: the programming model is still single-core. Even for distributed systems, the key system resources are still managed by each CPU, with limited IPC between the CPUs.

Making the switch

When porting to a multicore system, you and your team will be addressing:

The overall system partition (mainly cores, memory, and port resources).
The operating system (control-plane OS selection and migration, data-plane bare-board or lightweight run-time environment).
The working architecture of data-plane cores (functionality bound to each core/core-group).
A mutex mechanism.
The system sharing of data-plane tables among all data-plane cores (what shared memory mechanism to use).
The intercore communication mechanism.
Whether to use system and CPU global variables.
How to migrate the Rx/Tx driver.
The architecture-specific accelerators.
Communications between control-plane and data-plane partitions.

Partitioning the software system

The software system must be partitioned into two parts: control plane and data plane. First decide how many cores to assign to the control plane and how many to the data plane. You can use engineering estimates of standard software performance to determine the number of cores.

Migrating control-plane software

The control-plane partition will normally run an OS such as Linux or even an RTOS, to provide a multitasking environment for the user software components. Migrating the OS is fairly straightforward, and most legacy control-plane software components will not require large changes for this migration. But a few key points need attention:

In a single-core system, control-plane software shares all the data-plane tables within the same CPU memory space. Updating these tables requires only a direct write under semaphore-like mutex protection. On a multicore platform, table updates work differently: the control plane either sends self-defined messages to the data-plane cores, which then apply the update, or writes directly to the shared table (memory shared between partitions/cores) under spinlock or RCU protection.
When using more than one core in the control-plane partition, the most common configuration is the symmetric multiprocessing (SMP) mode. You should check the legacy multitasking software to make sure it will run correctly and efficiently in an SMP environment, especially the inter-task communication (mutex or synchronization) mechanisms.

Migrating data-plane software

Migrating data-plane software to multicore is more difficult. The data-plane partition will typically perform:

Data-packet processing.
Data communication with the control-plane partition.
Management proxy processing.

The legacy data-plane software typically runs on an RTOS that supports a multitasking environment. Data-packet processing follows a run-to-completion execution model, executing in a single task, process, or kernel thread. For example, in VxWorks, data-packet processing is done in the tNetTask environment. In Linux, data-packet processing is done in the NET_RX_SOFTIRQ software interrupt environment. Whether using tNetTask or softirq, the priority must be high to avoid preemption during processing and to keep overall system performance as high as possible.

The management-proxy component in legacy software is typically composed of one or more tasks running in parallel with the data-packet-processing task. The proxy component waits for management or configuration instructions from the control-plane modules to update the data tables or to perform other high-priority tasks. These tasks must have a priority as high as or even higher than the data-packet-processing task. Since the management-proxy task doesn’t execute often, the data-packet-processing task will not be preempted often, and impact on system performance will be minor.

When migrating to a multicore environment, the most efficient way to configure the data-plane partition is to run it in a “bare-metal” mode or a similar lightweight executive (LWE) mode. These are run-to-completion environments and are more efficient than a multitasking environment.
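
To make the run-to-completion model concrete, the sketch below processes each packet from start to finish before touching the next one, with no task switch in between. It is an illustration under stated assumptions: the `packet` structure, `lookup_port`, `process_packet`, and `run_to_completion` are hypothetical, and an in-memory burst array stands in for the hardware receive queue that a real bare-metal loop would poll continuously.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct packet {
    uint32_t dst;       /* destination address */
    uint8_t  ttl;
    int      out_port;  /* filled in by processing; -1 means drop */
};

/* Hypothetical stand-in for a real forwarding-table lookup. */
static int lookup_port(uint32_t dst)
{
    return (int)(dst & 0x3u);  /* pick one of four ports from the low bits */
}

/* Handle one packet completely: TTL check, lookup, forwarding decision. */
static void process_packet(struct packet *p)
{
    if (p->ttl <= 1) {
        p->out_port = -1;      /* TTL expired: drop */
        return;
    }
    p->ttl--;
    p->out_port = lookup_port(p->dst);
}

/* Drain one burst on this core and return how many packets were forwarded. */
static size_t run_to_completion(struct packet *burst, size_t n)
{
    assert(burst != NULL || n == 0);
    size_t forwarded = 0;
    for (size_t i = 0; i < n; i++) {
        process_packet(&burst[i]);  /* nothing preempts this step */
        if (burst[i].out_port >= 0)
            forwarded++;
    }
    return forwarded;
}
```

On a bare-metal data-plane core, a function like `run_to_completion` would sit inside an endless polling loop, with each core either handling whole packets on its own or owning one stage of a pipeline, depending on the working architecture chosen for the data-plane cores.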

At first glance, it may seem relatively straightforward to migrate legacy data-packet-processing task code to a multicore environment, because these tasks are run-to-completion code written in standard C. However, this is true only from a functional perspective. On the data plane, performance is king and the number one concern for the data-plane partition. To achieve the highest performance possible, some additional optimization is necessary.