Transporting bugs with virtual checkpoints

One of the most challenging problems in software debugging is to correctly and reliably reproduce a problem found by someone other than the software developer.

Typically, test departments and software users have to type long and brittle “instructions to reproduce the error” into bug tracking systems, along with the version of the software that contained the bug and anything else in the software environment that seems relevant.

More often than not, the developer has to go through several rounds of questions with the reporter to learn more about the system where the bug manifested itself and the precise steps that triggered it.

Such iterations can take days in a globally distributed development effort, and the questions are often hard for the reporter to answer in the way the developer wants. In some cases, developers can log in remotely to the failing system to get their hands on the precise failing setup. Frequently, however, this is not possible, either because of security restrictions or because the failing system is no longer available or has been reused for some other important test.

For embedded systems, the problem is compounded by the limited availability of the precise hardware needed to run the software and reproduce a bug. In the worst (but fairly common) case, both the reporter and the developer fail to reproduce the bug, making it a glitch that will never get properly fixed (quite possibly impacting an important customer as soon as the software is released).

Virtual platform checkpoints to the rescue

Developers can overcome the issues of reliable bug reproduction and communication of the target state using virtual platform checkpoints. By using a virtual platform to conduct software development and test [1], any bug can be captured, communicated, and reproduced any number of times, in any location.

As shown in Figure 1 below, the basic concept is as follows. The bug reporter runs the target software (operating system and custom applications) using a virtual platform rather than physical hardware.

When a bug occurs, the reporter saves a checkpoint (R) of the combined hardware and software state and sends the checkpoint to the software developer. Opening the checkpoint using the same virtual platform, the developer is able to reproduce the bug as well as investigate the target system state for clues as to what went wrong. There is no need for the developer to get back to the reporter to gather more facts, as everything is encapsulated within the checkpoint.

Figure 1: Following the bug with checkpoints

Virtual platforms are becoming a standard tool in the development of embedded systems. Modern virtual platforms are scalable and can run the target system fast enough to replace or complement physical development platforms [2]. Virtual platforms reduce risks and shorten time-to-market for embedded systems. They decouple hardware and software development and offer debug, test, and development environments superior to physical hardware systems.

Bug reporting with checkpoints

A virtual platform with checkpointing capability can store the complete state of the virtual platform to the host computer disk [2]. When the checkpoint is later loaded into the virtual platform, the exact same target system state is restored. The checkpoint includes the hardware setup (boards, networks, plug-in cards, and other configuration aspects), hardware state, and software state.

Specifically, it contains the contents of memories and disks, the state of processor registers, memory management units (MMUs), peripheral devices, and network connections. It also stores some core platform states, such as the current time and events queued for later execution, allowing the virtual platform to continue its execution seamlessly from a checkpoint.
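To make the list above concrete, the following is a minimal sketch of what such a checkpoint might look like as a data structure. It is not the format of any particular virtual platform: the `Checkpoint` class, its field names, and the `save_checkpoint` and `load_checkpoint` helpers are all assumptions made for illustration, and JSON is used only because it is easy to read.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Checkpoint:
    """Hypothetical snapshot layout; all field names are illustrative."""
    config: dict      # hardware setup: boards, networks, plug-in cards
    memory: dict      # sparse memory image, address string -> value
    registers: dict   # processor register name -> value
    devices: dict     # peripheral device name -> device state
    sim_time: int     # current virtual time, in cycles
    pending_events: list = field(default_factory=list)  # [time, event name]


def save_checkpoint(cp: Checkpoint) -> str:
    """Serialize the snapshot to text that could be attached to a bug report."""
    return json.dumps(asdict(cp), sort_keys=True)


def load_checkpoint(text: str) -> Checkpoint:
    """Rebuild the snapshot from its serialized form."""
    return Checkpoint(**json.loads(text))


# Reporter side: capture the state at the moment the bug shows up.
original = Checkpoint(
    config={"board": "example-board", "cores": 2},
    memory={"0x1000": 42},
    registers={"pc": 4096, "sp": 65536},
    devices={"uart0": {"tx_fifo": [72, 105]}},
    sim_time=123456,
    pending_events=[[123500, "timer_irq"]],
)

# Developer side: load the same text and obtain an identical state.
restored = load_checkpoint(save_checkpoint(original))
```

Note that the queued events and the virtual time are stored alongside memory and registers. Without them, the restored system would hold the right data but could not resume execution at the point where the reporter stopped.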

For this kind of use, it is best that the virtual platform chosen is designed to be a deterministic and repeatable simulator [3]. Each time a checkpoint is opened, the target system will execute in the exact same way. The execution remains perfectly repeatable even as a developer reruns the bug, adds debug probes and traces in the simulator, sets breakpoints, and stops, single-steps, and reverses the execution.
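The repeatability property can be illustrated with a toy model. The sketch below assumes a hypothetical `ToySimulator` whose next state depends only on its current state, with an arithmetic update standing in for the target software. It shows the idea only and says nothing about how a real virtual platform is built.

```python
import copy


class ToySimulator:
    """Hypothetical deterministic machine: next state depends only on current state."""

    def __init__(self, checkpoint):
        # Work on a private copy so the checkpoint itself is never modified.
        self.state = copy.deepcopy(checkpoint)
        self.trace = []  # debug probe output; kept outside the target state

    def step(self, probe=False):
        s = self.state
        # Arithmetic update standing in for "the target software runs one step".
        s["acc"] = (s["acc"] * 1103515245 + 12345) % 2**31
        s["time"] += 1
        if probe:
            # Observing the target does not change it.
            self.trace.append((s["time"], s["acc"]))

    def run(self, steps, probe=False):
        for _ in range(steps):
            self.step(probe)
        return copy.deepcopy(self.state)


checkpoint = {"acc": 7, "time": 1000}

first = ToySimulator(checkpoint).run(500)               # plain rerun
second = ToySimulator(checkpoint).run(500, probe=True)  # rerun with tracing on
```

Both runs start from the same checkpoint and end in the same state, even though the second has a trace probe attached. The key design choice in this sketch is that probe data lives outside the target state, so instrumentation cannot perturb the execution.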

Repeatability applies equally to single-processor systems, shared-memory multi-core systems, and distributed multi-board systems. All parts of the virtual system stop and run in synchronization. Single-stepping through interrupt routines or other code on a multiprocessor does not affect the system execution.
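One way to picture why single-stepping leaves a multiprocessor run unchanged is a simulator in which the interleaving of cores is a function of virtual time alone. The following sketch assumes a hypothetical two-core machine, `ToyMulticore`, with a fixed round-robin schedule. Real virtual platforms use more elaborate scheduling, so treat this as an illustration of the principle only.

```python
class ToyMulticore:
    """Hypothetical two-core machine sharing one memory word."""

    def __init__(self):
        self.shared = 0      # shared-memory location touched by both cores
        self.local = [0, 0]  # one private register per core
        self.tick = 0        # global virtual time

    def step(self):
        # The core to run is chosen by virtual time only, never by host
        # timing, so pausing between steps cannot reorder memory accesses.
        core = self.tick % 2
        self.local[core] = self.shared + core + 1
        self.shared = (self.local[core] * 2) % 1000003
        self.tick += 1

    def run(self, steps):
        for _ in range(steps):
            self.step()

    def snapshot(self):
        return (self.shared, tuple(self.local), self.tick)


# Run 1: free-running execution.
free_running = ToyMulticore()
free_running.run(1000)

# Run 2: a "debugger" stops after every step and inspects the whole system.
single_stepped = ToyMulticore()
inspected = []
for _ in range(1000):
    single_stepped.step()
    inspected.append(single_stepped.snapshot())
```

Both machines end in the same state. On physical hardware, stopping one core while the others keep running would typically change the order of shared-memory accesses. Here every core advances only when virtual time does.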

Checkpointing and repeatable execution thus achieve perfect bug reproduction by any person, at any point in time, anywhere in the world, irrespective of the target hardware needed to run the software of interest. Additionally, virtual platforms make target hardware availability a nonissue.

There is a virtually infinite supply of every type of board, with no need to physically ship hardware around. The checkpoint contains the information necessary to give the developer a perfect copy of the hardware setup used by the bug reporter.