Linux load averages track not just runnable tasks, but also tasks in the uninterruptible sleep state. Why? I’ve never seen an explanation. In this post I’ll solve this mystery, and summarize load averages as a reference for everyone trying to interpret them.
Linux load averages are “system load averages” that show the running thread (task) demand on the system as an average number of running plus waiting threads. This measures demand, which can be greater than what the system is currently processing. Most tools show three averages, for 1, 5, and 15 minutes:
Dstate in the output ps and top. The ps(1) man page calls it
uninterruptiblesleep (usually IO)”.
Adding the uninterruptible state means that Linux load averages can increase due to a disk (or NFS) I/O workload, not just CPU demand.
He was looking for the code with made the change in system average file.
loadavg.c, but the change was older than 2005, when Linus imported Linux 2.6.12-rc2.
He found it elsewhere, and confirmed that the change was in Linux 0.99 patchlevel 14, Nov 1993, but the changelog was
"Changes to the last official release (p13) are too numerous to mention (or even to remember)..." – Linus
Based on the date, he looked up the kernel mailing list archives to find the actual patch, but the oldest email available is from June 1995, when the sysadmin writes:
"While working on a system to make these mailing archives scale more effecitvely I accidently destroyed the current set of archives (ah whoops)."
Fortunately, he did finally find the change, in a compressed mailbox file from 1993 on oldlinux.org. Here it is:
From: Matthias Urlichs <firstname.lastname@example.org> Subject: Load average broken ? Date: Fri, 29 Oct 1993 11:37:23 +0200 The kernel only counts "runnable" processes when computing the load average. I don't like that; the problem is that processes which are swapping or waiting on "fast", i.e. noninterruptible, I/O, also consume resources. It seems somewhat nonintuitive that the load average goes down when you replace your fast swap disk with a slow swap disk... Anyway, the following patch seems to make the load average much more consistent WRT the subjective speed of the system. And, most important, the load is still zero when nobody is doing anything. ;-) --- kernel/sched.c.orig Fri Oct 29 10:31:11 1993 +++ kernel/sched.c Fri Oct 29 10:32:51 1993 @@ -414,7 +414,9 @@ unsigned long nr = 0; for(p = &LAST_TASK; p > &FIRST_TASK; --p) - if (*p && (*p)->state == TASK_RUNNING) + if (*p && ((*p)->state == TASK_RUNNING) || + (*p)->state == TASK_UNINTERRUPTIBLE) || + (*p)->state == TASK_SWAPPING)) nr += FIXED_1; return nr; } -- Matthias Urlichs \ XLink-POP N|rnberg | EMail: email@example.com Schleiermacherstra_e 12 \ Unix+Linux+Mac | Phone: ...please use email. 90491 N|rnberg (Germany) \ Consulting+Networking+Programming+etc'ing 42
“It’s amazing to read the thoughts behind this change from almost 24 years ago.”
His example of using a slower swap disk makes sense: by degrading the system’s performance, the demand on the system (measured as running + queued) should increase. However, load averages decreased because they only tracked the CPU running states and not the swapping states. Matthias thought this was nonintuitive, which it is, so he fixed it.
Likewise, which load averages? CPU load averages? System load averages? Clarifying it this way lets me make sense of it like this:
On Linux, load averages are (or try to be) "system load averages", for the system as a whole, measuring the number of threads that are working and waiting to work (CPU, disk, uninterruptible locks). Put differently, it measures the number of threads that aren't completely idle. Advantage: includes demand for different resources. On other OSes, load averages are "CPU load averages", measuring the number of CPU running + CPU runnable threads. Advantage: can be easier to understand and reason about (for CPUs only).
Note that there’s another possible type: “physical resource load averages”, which would include load for physical resources only (CPU + disk).
Perhaps one day we’ll add additional load averages to Linux, and let the user choose what they want to use: a separate “CPU load averages”, “disk load averages”, “network load averages”, etc. Or just use different metrics altogether.
mpstat -P ALL 1
$ iostat mpstat -P ALL
In 1993, a Linux engineer found a nonintuitive case with load averages, and with a three-line patch changed them forever from “CPU load averages” to what one might call “system load averages.” His change included tasks in the uninterruptible state, so that load averages reflected demand for disk resources and not just CPUs. These system load averages count the number of threads working and waiting to work, and are summarized as a triplet of exponentially-damped moving sum averages that use 1, 5, and 15 minutes as constants in an equation. This triplet of numbers lets you see if load is increasing or decreasing, and their greatest value may be for relative comparisons with themselves.
The use of the uninterruptible state has since grown in the Linux kernel, and nowadays includes uninterruptible lock primitives. If the load average is a measure of demand in terms of running and waiting threads (and not strictly threads wanting hardware resources), then they are still working the way we want them to.
In this post, I dug up the Linux load average patch from 1993 – which was surprisingly difficult to find – containing the original explanation by the author. I also measured stack traces and time in the uninterruptible state using bcc/eBPF on a modern Linux system, and visualized this time as an off-CPU flame graph. This visualization provides many examples of uninterruptible sleeps, and can be generated whenever needed to explain unusually high load averages. I also proposed other metrics you can use to understand system load in more detail, instead of load averages.
I’ll finish by quoting from a comment at the top of
kernel/sched/loadavg.c in the Linux source, by scheduler maintainer Peter Zijlstra:
This file contains the magic bits required to compute the global loadavg figure. Its a silly number but people think its important. We go through great pains to make it work on big machines and tickless kernels.