(Editor’s Note: Li Mei is a character in a non-fiction book by Lisa Simone about the adventures of a fictional team of software developers working on various projects and the lessons they learned.
In the book, Li Mei develops a List of Debugging Secrets while debugging her own mysteries, acting as a sounding board for her teammates as they struggled with their debugging challenges, and listening to everyone’s “Lessons Learned” at the team’s regular status meetings. The action plan she develops includes:
1. Gather the Facts – Learn all you can before diving in.
2. Classify the Symptoms – Characterize the bug based on the facts.
3. Brainstorm Root Causes – Identify what could cause each symptom.
4. Understand the System – Target where to search based on possible root causes.
5. Hypothesize, Test and Verify – Be logical and methodical – nail the bug!
What follows here is the same action plan – but filled in with truisms she experienced – “Resist the Urge!”, “Think with your brain, not your debugger.” – and specific guidelines for writing better code.
It concludes with a list of symptoms and bugs gathered from the individual mysteries the team solved. Use it as a reference for your own work, and as a starting point for your own list of hard-earned Debugging Secrets.)
Step #1: Gather the Facts
* Interview problem reporter.
* Interview anyone who saw the system fail.
* Observe the system behavior – find out what is normal for the system.
* Isolate relevant facts about product, customer, environment, hardware/software, materials used, priority, safety.
* Reproduce the problem if possible.
* Realize bug report descriptions can be misleading.
* Be wary of assumptions when gathering facts – identify inputs and symptoms only. Draw no conclusions yet.
Step #2: Classify the Symptoms
* One-time repeatable events – the occurrence has a pattern (e.g., only at startup, or different behavior the first time through a function or feature).
* Periodic events – the failure repeats regularly or occurs every time (e.g., tied to a timer, an interrupt, repeated calls to a function/feature, or a HW or SW heartbeat).
* Sporadic events – seemingly random failures (e.g., boundary condition violations, parameters that change with state, loop counter limits, ranges in logicals, unexpected input/output conditions, unhandled error conditions, faulty logic, hardware, timing, memory corruption, performance issues).
Step #3: Brainstorm (Initial Root Causes, and When You’re “Stuck”)
* Dream up what could cause each symptom. The bug’s location in the code can sometimes be determined before looking at the software.
* Identify the set of inputs that causes unacceptable behavior, then find what must be changed to make the behavior acceptable.
* Create a truth table to classify inputs (user, hardware, software, configuration, etc.) and resulting behaviors.
* Patiently watch the system’s behavior.
* Periodically stop and summarize findings.
* Find a sounding board (doesn’t have to be a live person!) – talk through ideas to quickly identify good ideas and discard bad ones.
* Consult the gurus.
* Talk to internal groups (engineering, testing, marketing, etc.) and external groups (vendors, customers, beta testers, etc.).
* Go home. Or somewhere else. Let your brain chew at the problem while you’re doing something else.
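The truth-table idea from this step can be sketched in a few lines of C. This is only an illustration: the three input names and the `behavior_ok()` stand-in are hypothetical, and in practice each row would be filled in from an observation of the real system rather than computed.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical inputs for illustration; substitute the facts from Step #1. */
enum { NUM_INPUTS = 3 };
static const char *const input_names[NUM_INPUTS] = {
    "cold_start", "usb_attached", "low_battery"
};

/* ASSUMPTION: stand-in for observing the real system. In this made-up
   example the failure appears only when cold_start and low_battery are
   both true. */
bool behavior_ok(const bool in[NUM_INPUTS])
{
    return !(in[0] && in[2]);
}

/* Print one row per input combination; return the number of failing rows. */
int print_truth_table(void)
{
    int failures = 0;

    for (int i = 0; i < NUM_INPUTS; i++) {
        printf("%-13s", input_names[i]);
    }
    printf("result\n");

    for (unsigned combo = 0; combo < (1u << NUM_INPUTS); combo++) {
        bool in[NUM_INPUTS];
        for (int i = 0; i < NUM_INPUTS; i++) {
            in[i] = ((combo >> i) & 1u) != 0;   /* bit i selects input i */
            printf("%-13d", in[i]);
        }
        bool ok = behavior_ok(in);
        printf("%s\n", ok ? "ok" : "FAIL");
        if (!ok) {
            failures++;
        }
    }
    return failures;
}
```

Calling `print_truth_table()` from `main()` lists all eight combinations. Reading down the FAIL rows shows which inputs the failures have in common, which is often enough to narrow the search before opening the code.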
Step #4: Understand the System (hardware, software, mechanics)
* Focus on understanding main() first for overall program flow – don’t get lost in the details.
* Divide the code into logical chunks based on structure and flow.
* Use visual aids (flowcharts, graphs, function call trees) to show functional elements (blocks) and program control logic (connectors). This reveals what the program does, exposes its structure, informs testing, and identifies missing logical and functional elements.
* Play Computer. Doggedly step through the code line-by-line because sequential logic performed by a computer does not always match human assumptions. (Trace assembly language noting the contents of registers and stack pointer.)
* Debugging tools allow simple timing characterizations without stopping the program execution.
* When reverse-engineering code, check off functions as you go, so you know where you’ve been and can identify unused functions.
* Remember sometimes the comments are wrong.
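One common way to get a simple timing characterization without halting the program is to raise a spare output pin on entry to a function and lower it on exit, then measure the pulse width with an oscilloscope or logic analyzer. The sketch below assumes such a pin exists. `fake_gpio_out`, `DEBUG_PIN_MASK`, and `checksum()` are hypothetical names, and a plain variable stands in for the memory-mapped register so the code also runs on a host.

```c
#include <stdint.h>

/* ASSUMPTION: on real hardware this would be a memory-mapped GPIO output
   register. A plain variable stands in here so the sketch runs on a host. */
volatile uint32_t fake_gpio_out;
#define DEBUG_PIN_MASK (1u << 3)   /* hypothetical spare pin */

/* Host-only bookkeeping so the toggles can be checked without a scope. */
unsigned edge_count;

static void debug_pin_high(void)
{
    fake_gpio_out |= DEBUG_PIN_MASK;
    edge_count++;
}

static void debug_pin_low(void)
{
    fake_gpio_out &= ~DEBUG_PIN_MASK;
    edge_count++;
}

/* Hypothetical function whose execution time is in question. */
static uint32_t checksum(const uint8_t *data, uint32_t len)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < len; i++) {
        sum += data[i];
    }
    return sum;
}

/* Wrap the call in a pin pulse: the pulse width seen on the scope is the
   execution time of checksum() plus the small cost of the two pin writes. */
uint32_t timed_checksum(const uint8_t *data, uint32_t len)
{
    debug_pin_high();
    uint32_t sum = checksum(data, len);
    debug_pin_low();
    return sum;
}
```

Two register writes cost far less than a breakpoint or a formatted print, so the system's timing is disturbed very little.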
Step #5: Hypothesize, Test and Verify
* Decide exactly what information you would like to get from the embedded system, and choose the best tool accordingly.
* Consider nontraditional and low tech debugging methods (e.g., auditory, pin wiggling).
* Simple methods like pin wiggling can be powerful and unobtrusive.
* Auditory cues – the sense of hearing can discern fine differences in tone and rhythm, so a tone can also serve as a heartbeat or a code coverage flag.
* Use black-box testing to fully explore behavior without looking inside the device/software.
* Hypothesize what should happen in order to choose which variables to watch.
* Apply stressors to induce rare bugs to occur more frequently (more/larger inputs, increased loading, larger memory allocations, faster timing, etc.).
* Use breakpoints configured as watch points to check when values of variables and memory locations change without stopping code execution until the exact conditions you specify are met.
* Use patterned memory (e.g., 0xDEADBEEF, 0x55) to verify memory operations and to identify memory overruns.
* Test the bug fix with the same methods that were used to induce the bug.
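The patterned-memory technique can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: `guarded_buf_t`, the helper names, and `suspect_write()` are all hypothetical. The buffer is pre-filled with 0x55 and followed by a 0xDEADBEEF guard word. After the code under test runs, a damaged guard indicates an overrun, and the count of bytes still holding 0x55 gives a rough high-water mark.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FILL_BYTE  0x55u
#define GUARD_WORD 0xDEADBEEFu
#define BUF_SIZE   16u

/* A buffer immediately followed by a guard word, so that an overrun of buf
   lands in guard. With a 16-byte buf there is no padding between them. */
typedef struct {
    uint8_t  buf[BUF_SIZE];
    uint32_t guard;
} guarded_buf_t;

/* Paint the buffer with the fill pattern and arm the guard word. */
void guarded_init(guarded_buf_t *g)
{
    memset(g->buf, FILL_BYTE, sizeof g->buf);
    g->guard = GUARD_WORD;
}

/* True if nothing has written past the end of the buffer. */
bool guard_intact(const guarded_buf_t *g)
{
    return g->guard == GUARD_WORD;
}

/* Count trailing bytes that still hold the fill pattern. Legitimate data
   equal to 0x55 would be miscounted, so treat this as an estimate. */
size_t untouched_bytes(const guarded_buf_t *g)
{
    size_t n = 0;
    while (n < sizeof g->buf && g->buf[sizeof g->buf - 1 - n] == FILL_BYTE) {
        n++;
    }
    return n;
}

/* ASSUMPTION: stand-in for the code under test. It writes len zero bytes
   starting at dst and trusts the caller's length. */
void suspect_write(uint8_t *dst, size_t len)
{
    memset(dst, 0, len);
}
```

For example, after `guarded_init(&g)`, calling `suspect_write((uint8_t *)&g, 10)` leaves the guard intact with six untouched bytes, while a length of `BUF_SIZE + 1` corrupts the guard word. Re-running the same check after a fix confirms that the overrun is gone.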