Geospatial visualization helps manage million-line-plus embedded code bases

Node and edge diagrams used to visualize software often become difficult to interpret when they contain more than a few dozen nodes. When a large software system contains hundreds of thousands of entities linked in complex ways, those difficulties are greatly compounded.

This article describes an approach to visualizing large software dependency relationships using scalable and intuitive techniques developed originally for geospatial visualization. This approach allows us to show graphs representing twenty million lines of source code while maintaining a smooth user experience.

The appeal – and limitations – of simple node and edge diagrams
Node and edge diagrams are a common sight on whiteboards when software engineers meet. On a small scale, they succinctly capture dependency relationships (such as the calls relationship) between functions and files. With appropriate layout, such diagrams can quickly give a sense of larger dependency patterns, separating utility code from high level business logic, or highlighting fragile cyclic dependencies.

However, the utility of simple node and edge diagrams diminishes at larger scales; graph-layout algorithms struggle with large graphs, and the dependency edges of a typical project overwhelm a user. Large software projects can have multiple millions of lines of code, containing hundreds of thousands of functions. Unfortunately, such systems are those that need visualization the most, since they are the hardest to understand and monitor using other techniques, and the most costly when problems are missed.

The problem of creating a concise visual representation of dense information is not new; cartographers have been dealing with it in a slightly different form for millennia. Mapmakers select a level of detail based on the scale of the map, and leave out information that will not fit. This approach has been continued in online map and geospatial applications like Google Earth [1], with the added feature that the user can change the scale interactively (by zooming in or out). The result is an easy-to-use visualization of vast amounts of data. We adapted this approach to visualizing software.

Using mapping methods to visualize large software projects

Our tool gives engineers and managers a canvas for exploring and monitoring software projects. This canvas gives users a sense of the overall structure of the software and facilitates the display of additional data. For example, problematic modules or modules experiencing rapid change can be highlighted and explored.

The tool also supports more detailed investigation when warranted and was explicitly designed to target large systems with millions or tens of millions of lines of code. Currently the tool supports C/C++ programs, though the approach can be applied to other languages.

Learning from Google Earth. We adapted techniques employed by modern geospatial applications like Google Earth and NASA World Wind [2] to show the dense information that makes up the surface of the Earth. The key techniques are:

  • Divide information into levels of detail, where large scale maps (or maps where the virtual camera is at a higher altitude) show less detail than small scale maps.
  • Pre-compute the levels of detailso information can be retrieved at various levels of detail quickly when requested later.
  • Retrieve information on-demand based on what is on screen and the scale of the current image(or the height of the current camera position).
  • Use modern graphics hardwareto allow smooth zooming and panning of complex scenes.
  • Include a text-based search capabilityso the user can quickly locate information whose position may not be known.

Step #1: The Map View
The Map view is the tool’s primary mode for viewing large projects. In this view, the software is divided into components that form a hierarchy. The leaf elements of the hierarchy are procedures. Procedures are grouped into components, which can themselves be grouped into supercomponents, and so on.

By default, the tool will infer this hierarchy based on file structure. Procedures are grouped by their containing file, and files are grouped into directories; users can optionally specify an alternate component hierarchy. The complete call graph of a software project is extracted using static analysis.

It is not feasible to display the entire call graph for large graphs, so the tool computes bundled edges that summarize dependencies between files and components. For example, if a procedure read_char() in parser.c calls procedure log_msg() in the logging.c, we add an edge from parser.c to logging.c (Figure 1). We also add extension edges linking procedures to the bundled edges that contain their calls. Bundled edges give us scalability at the cost of some ambiguity; for example, in Figure 1, it is not clear if main() or read_char() calls the insert() procedure.

Click on image to enlarge.

Figure 1. A Map view for a simple program

The call graph is displayed using a hierarchical node and edge diagram, similar to that of Figure 1. Components are drawn as boxes. The size of a component is proportional to the cumulative number of functions contained within that component.

Edges between sibling components are drawn as arrows or wedges. Components are positioned according to one of several graph layout algorithms, each of which uses edges and node sizes to determine a compact layout. The contents of components are drawn in a similar manner-a component’s subcomponents and procedures are drawn as boxes within the outer component box, and positioned using the same layout algorithm, and edges are drawn as arrows or wedges.

If a procedure calls another procedure that is not in the same subcomponent, we draw an extender edge linking it to the bundled edge between the components that contain the caller and callee. Every node is labeled with a name, usually the name of the procedure, file or directory. In most real projects, the procedure nodes and many interior components are too small to be seen unless the user zooms in to a subsection of the graph. (We exploit this by not rendering boxes if they fall below a certain width on the screen.)

The Map view is intended to be used with controls for easy zooming and panning; the user can ‘dive in’ for details, or ‘back out’ to get an overall picture. When a node is selected but is too small to see, its enclosing component is colored to indicate that it contains a selected node; this allows users to easily locate selected nodes deep in the graph by zooming into the components that are colored. Similarly, colorings indicating other data about small nodes ‘bubble up’ to the smallest visible enclosing component so the user can locate the nodes when zoomed out.

Additional interactive features give access to other information: when a node is selected, information about it is shown in a sidebar, including a source code excerpt; when a bundled edge is selected, the sidebar shows the source and destination and the procedure-to-procedure edges it summarizes. A menu allows the user to select groups of nodes that are connected by a single step or transitively.

Step #2: Zooming in for more detail

The tool offers several other views that allow a user to choose a subgraph and investigate it in more detail. The user can hide nodes that are not of interest from the full map and use the same zooming and panning controls. The user can also switch to views where call graph edges are not bundled (that is, all edges between procedures are shown, even if the procedures are in different components). The tool also supports tree views.

In all the supporting views, the user can show nodes by clicking on the nodes that are already visible. The user can also show nodes by clicking on search engine results. In addition, the tool incorporates multiple tabs, each of which can show a different view of the dependency graph. For example, the user can view the whole Map on one tab, a partial Map on a second tab, and a call tree on the third tab.