Out of Order Memory

I’ve been thinking and talking about Linux memory management the last few weeks and, in particular, how it works with embedded systems. Remember, Linux is primarily meant to look like a nearly 40-year-old operating system that was made to run user programs from a terminal. Turns out it is great for some embedded systems (especially with a few enhancements here and there). On the other hand, you can’t expect it to be totally perfect.

One area that is a real sore spot for embedded systems is the default behavior for memory. Last week I mentioned that, unless you change some options, you can allocate memory successfully even if you are out of memory. The kernel only tries to fill your request “on demand,” so to speak.
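
If you want to see which allocation policy a given board is actually running, the setting lives in /proc/sys/vm/overcommit_memory. Here is a minimal sketch that reads and explains it. The function names are my own, and the one-line descriptions are my paraphrase of the three documented modes, so treat it as a starting point rather than a definitive tool:

```python
from pathlib import Path

# Paraphrased meanings of the three values /proc/sys/vm/overcommit_memory can hold.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (default): only obvious overcommits are refused",
    1: "always overcommit: allocations are never refused up front",
    2: "strict accounting: allocations fail once the commit limit is reached",
}


def describe_overcommit(raw: str) -> str:
    """Translate the text read from overcommit_memory into a description."""
    mode = int(raw.strip())
    return OVERCOMMIT_MODES.get(mode, f"unknown mode {mode}")


def current_overcommit(path: str = "/proc/sys/vm/overcommit_memory") -> str:
    """Read the live setting, or return a note if the file is not available."""
    try:
        return describe_overcommit(Path(path).read_text())
    except OSError:
        return "overcommit setting not available on this system"
```

Calling current_overcommit() on a stock system will most likely report the heuristic mode, which is the "allocate now, find the memory later" behavior described above.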

If you have a swap partition or file, the kernel can roll dirty data pages out to it to make room. It can always throw away clean pages (like most code pages, for example) because it can reload them from disk anyway. As I mentioned before, though, a lot of embedded systems don’t have swap files. Even if you do, you can still run out of swap. It just gives you more room, but it can’t give you infinite memory, even virtually.
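
A quick way to check whether a target has any swap at all, and how much of it is left, is to look at the SwapTotal and SwapFree lines in /proc/meminfo. The sketch below parses that text. The helper names are hypothetical, and it assumes the usual "Name:   value kB" layout:

```python
def parse_meminfo(text: str) -> dict:
    """Map each /proc/meminfo field name to its integer value (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def swap_summary(text: str) -> str:
    """Summarize swap usage from the contents of /proc/meminfo."""
    info = parse_meminfo(text)
    total = info.get("SwapTotal", 0)
    free = info.get("SwapFree", 0)
    if total == 0:
        return "no swap configured"
    return f"{total - free} of {total} kB swap in use"
```

On a live system you would pass in the contents of the file, for example swap_summary(open("/proc/meminfo").read()).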

What does the kernel do when it really runs out of memory? That depends on the version of the kernel. The exact behavior is very controversial, and it changes more often than most other parts of the kernel.

Early 2.4 kernels employ a hit man, of sorts, called the OOM (Out of Memory) killer. The OOM killer’s job is to find the most naughty process it can and terminate it. Naughty, of course, is a relative term. The OOM killer scores each process, and the process with the highest score gets the boot. Here are the comments from the kernel source about the OOM killer’s scoring algorithm:

 * oom_badness - calculate a numeric value for how bad this task has been
 * @p: task struct of which task we should calculate
 * @uptime: current uptime in seconds
 * The formula used is relatively simple and documented inline in the
 * function. The main rationale is that we want to select a good task
 * to kill when we run out of memory.
 * Good in this context means that:
 * 1) we lose the minimum amount of work done
 * 2) we recover a large amount of memory
 * 3) we don't kill anything innocent of eating tons of memory
 * 4) we want to kill the minimum amount of processes (one)
 * 5) we try to kill the process the user expects us to kill, this
 *    algorithm has been meticulously tuned to meet the principle
 *    of least surprise ... (be careful when you change it)

You can learn more by reading the kernel source, but the basic idea is that large processes are worse than small processes. Processes that have a lot of children score higher, too. Processes that have been “niced” (lowered in priority) get more points, but processes that have run for a long time get a decrease in score. Processes running as root or directly accessing hardware get a break and will score lower. There is also a per-process scaling factor exposed in /proc. For process 1009, for example, that factor is in /proc/1009/oom_adj.
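
To make those rules concrete, here is a toy model of the scoring. This is not the kernel’s code. The Proc fields, the weights, and the way oom_adj is applied as a power-of-two multiplier are all assumptions chosen only to mirror the direction of each rule described above:

```python
import math
from dataclasses import dataclass

OOM_DISABLE = -17  # oom_adj value that exempts a process entirely


@dataclass
class Proc:
    name: str
    vm_pages: int              # size of the process itself
    child_vm_pages: int = 0    # combined size of its children
    run_seconds: float = 0.0   # how long it has been running
    niced: bool = False        # priority has been lowered
    is_root: bool = False      # running as root
    raw_io: bool = False       # accesses hardware directly
    oom_adj: int = 0           # per-process scaling factor


def toy_badness(p: Proc) -> float:
    """Illustrative score: higher means a more likely victim."""
    if p.oom_adj == OOM_DISABLE:
        return 0.0
    points = p.vm_pages + p.child_vm_pages / 2          # big processes and big families score high
    points /= max(1.0, math.sqrt(p.run_seconds / 60))   # long runners get a discount
    if p.niced:
        points *= 2                                     # niced work is considered less important
    if p.is_root:
        points /= 4                                     # root processes get a break
    if p.raw_io:
        points /= 4                                     # so do processes touching hardware
    return points * 2.0 ** p.oom_adj                    # assumed: adjustment scales by powers of two


def pick_victim(procs):
    """Return the process with the highest toy score."""
    return max(procs, key=toy_badness)
```

With two equally large processes, a fresh one run by an ordinary user outscores a long-running root daemon, so the fresh one is the victim in this model.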

There are a few problems with this approach. The biggest problem is that the highest-scoring process may be the one thing you really don’t want to kill. On a desktop machine, killing the X server, for example, is quite rude. In an embedded system, killing something that is doing safety-critical processing may be highly undesirable. Older kernels allowed you to turn the killer off at your own risk (/proc/sys/vm/oom-kill could be set to 0). You could also set /proc/XXXX/oom_adj to -17 to prevent the killer from taking a process with PID XXXX.

Newer 2.4 kernels (2.4.23 and newer) don’t try to score the bad process. They just kill whatever process asked for enough memory to tip the scale. The 2.6.36 kernel gets another version of the OOM killer that is more like the original version but with different heuristics and some different actions. For example, the naughty score is almost entirely based on how much memory a process is using. Further, if a bad process has children, the OOM killer tries to kill any child processes first to free up memory.
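
The 2.6.36-style approach can be sketched in a few lines. Again, this is a toy model, not the kernel’s code: the 0 to 1000 scale, the dictionaries used to describe processes, and the "largest child first" rule are simplifying assumptions based only on the description above:

```python
def memory_score(used_pages: int, total_pages: int) -> int:
    """Share of system memory a process uses, on an assumed 0..1000 scale."""
    return min(1000, used_pages * 1000 // total_pages)


def choose_victim(usage: dict, children: dict, total_pages: int) -> str:
    """Pick a victim by name.

    usage maps process name -> pages used.
    children maps process name -> list of child process names.
    """
    worst = max(usage, key=lambda name: memory_score(usage[name], total_pages))
    kids = children.get(worst, [])
    if kids:
        # Sacrifice the largest child first rather than the parent itself.
        return max(kids, key=lambda name: usage[name])
    return worst
```

In this model, a server using 60% of memory with one worker child loses the worker first. The server itself is only chosen if it has no children left to give up.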

The newer kernels also augment /proc/XXXX/oom_adj with /proc/XXXX/oom_score_adj, which you can use to control the killer. This value ranges from -1000 to 1000 and is added to the process’s actual score. This lets you protect important processes with a very negative adjustment, or suggest processes to kill with a positive one. You can also change the behavior so that the kernel just panics if it runs out of memory (the vm.panic_on_oom setting). I’m not sure if that’s better or worse.
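
Setting the adjustment is just a matter of writing a number to the file. Here is a minimal sketch, assuming a kernel that provides oom_score_adj. The function names and the proc_root parameter (there so you can try it against a scratch directory) are my own, and lowering the value for a real process will generally require root privileges:

```python
from pathlib import Path

ADJ_MIN, ADJ_MAX = -1000, 1000  # documented range of oom_score_adj


def clamp_adj(value: int) -> int:
    """Keep an adjustment inside the range the kernel accepts."""
    return max(ADJ_MIN, min(ADJ_MAX, value))


def set_oom_score_adj(pid: int, value: int, proc_root: str = "/proc") -> int:
    """Write a clamped adjustment for pid and return the value written."""
    adj = clamp_adj(value)
    Path(proc_root, str(pid), "oom_score_adj").write_text(f"{adj}\n")
    return adj


def get_oom_score_adj(pid: int, proc_root: str = "/proc") -> int:
    """Read back the current adjustment for pid."""
    return int(Path(proc_root, str(pid), "oom_score_adj").read_text())
```

For example, set_oom_score_adj(1009, -1000) would ask the killer to leave process 1009 alone, while a positive value volunteers a process as an early victim.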

The flip answer, of course, is don’t run out of memory. As noted earlier, you can also set /proc/XXXX/oom_adj to OOM_DISABLE (-17), which will protect a specific process from the killer. As you can see, the way the kernel manages out-of-memory conditions changes frequently. The really important takeaway is that if you deploy a Linux system, you need to know what kernel you are using, how it handles out-of-memory situations, and how that will affect what you are doing.
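
As a first step toward knowing what you are dealing with, a deployment script can check the kernel release and guess which per-process knob to expect. This sketch is based only on the version numbers mentioned in this article, the function names are hypothetical, and it ignores kernels old enough to have neither file. Checking whether the file actually exists under /proc is the more reliable test:

```python
import re


def kernel_version(release: str) -> tuple:
    """Turn a release string such as '2.6.36-rc1' into (2, 6, 36)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(g) if g else 0 for g in match.groups())


def oom_interface(release: str) -> str:
    """Guess which /proc/<pid>/ file controls the OOM killer for this release."""
    if kernel_version(release) >= (2, 6, 36):
        return "oom_score_adj"
    return "oom_adj"
```

On a running system you can feed it the live release string with oom_interface(platform.release()) after importing the standard platform module.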

There is a patch, by the way, for an embedded-friendly OOM killer at http://is.gd/iKlmAk. I haven’t tried it, but if you have used it, leave a comment about how you liked it.

If all this isn’t confusing enough, Android, which uses the Linux kernel, uses a different way of scoring bad processes. I’ll have more to tell you about that next time.