User and system time per process

Long time no blog post - so let's change that :-)

Today I've come across a system with a pretty high load, but couldn't quite make out which processes were responsible for the load. The ps command showed:

$ uptime
 20:16pm  up 1 day  2:40,  12 users,  load average: 132.81, 132.79, 133.43

$ ps -eo user,pid,rsz,vsz,pmem,pcpu,time,args --sort -pcpu
USER       PID    RSZ      VSZ  %MEM %CPU   TIME COMMAND
user0    18428   3124    58236   0.0 1.1 00:00:00 LLAWP /www/user0/eam/WebAgent.conf -A
user0     8387  14976   154220   0.0 4.2 01:07:44 LLAWP /www/user0/eam/WebAgent.conf -A
user0     4508  14828   152864   0.0 4.2 01:07:21 LLAWP /www/user0/eam/WebAgent.conf -A
user0     8045  15000   154220   0.0 4.2 01:07:33 LLAWP /www/user0/eam/WebAgent.conf -A
user0    23814  14892   152868   0.0 4.2 01:06:47 LLAWP /www/user0/eam/WebAgent.conf -A
user0    18384   3124    58236   0.0 0.8 00:00:00 LLAWP /www/user0/eam/WebAgent.conf -A
user0    17224  14932   152952   0.0 4.1 01:05:39 LLAWP /www/user0/eam/WebAgent.conf -A
So, while each of these processes is using some CPU time, it didn't quite explain the high load. top(1) shows:
Tasks: 1439 total,   9 running, 1430 sleeping,   0 stopped,   0 zombie
Cpu0  : 14.1%us, 49.5%sy,  0.0%ni, 34.3%id,  0.0%wa,  0.0%hi,  2.0%si,  0.0%st
Cpu1  : 19.0%us, 51.4%sy,  0.0%ni, 28.6%id,  1.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 15.2%us, 47.6%sy,  0.0%ni, 35.2%id,  0.0%wa,  0.0%hi,  1.9%si,  0.0%st
Cpu3  : 16.8%us, 45.8%sy,  0.0%ni, 37.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  : 17.4%us, 39.4%sy,  0.0%ni, 41.3%id,  0.9%wa,  0.0%hi,  0.9%si,  0.0%st
Cpu5  : 14.7%us, 39.4%sy,  0.0%ni, 14.7%id,  0.0%wa,  0.0%hi, 31.2%si,  0.0%st
Cpu6  : 14.5%us, 30.0%sy,  0.0%ni, 52.7%id,  0.0%wa,  0.0%hi,  2.7%si,  0.0%st
Cpu7  :  9.4%us, 34.9%sy,  0.0%ni, 55.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
So we have 1439 processes, but only 9 are running and 1430 are sleeping - and there's a quite a lot system time being used. But there's no ps option to display user oder system time. However, this being a Linux system, we still have the /proc filesystem. The /proc/PID/stat file of a 2.6.30 kernel contains 52 fields, among them are:
   2: tcomm   filename of the executable
   3: state   state (R=running, S=sleeping, D=uninterruptible wait, Z=zombie, T=traced/stopped)
  14: utime   user mode jiffies
  15: stime   kernel mode jiffies
So let's try this:
$ for PID in `ps -eo pid`; do 
   awk '{print "user: " $14 " kernel: " $15 " state: " $3 " process: " $2}' /proc/$PID/stat
done > foo
$ sort -t: -nk3 foo | tail -5
user: 71232 kernel: 335302 state: D process: (LLAWP)
user: 71105 kernel: 342684 state: D process: (LLAWP)
user: 71290 kernel: 346119 state: D process: (LLAWP)
user: 71009 kernel: 347570 state: D process: (LLAWP)
user: 71278 kernel: 348388 state: D process: (LLAWP)
This is only the top-5, but LLAWP was using the most system on this system while really doing nothing - it was in uninterruptible sleep ("D"). Restarting those processes helped and the system load returned to normal :-)