Troubleshooting of high CPU load but low CPU usage

2022-07-03 06:49:00 【The way of research and development】

The story

Recently, our service kept raising high CPU load alerts. The alerts often fired in the early morning, during the off-peak period, so the load was clearly not caused by user traffic, and CPU busy time was very low. Checking memory: mem.memused was close to 100%. Checking the disks: swap.used rose periodically (about every 30 minutes); disk.io.util was low, but disk.io.avgqu-sz (the average request queue length) also rose periodically (about every 30 minutes), in step with the high CPU load. We then ran crontab -l on the machine to look for scheduled tasks with a 30-minute period and found one: puppet. Its execution times also matched the CPU load peaks. Many of these symptoms move at the same frequency, but that only shows they are strongly correlated, like the "beer and diapers" story. What is the actual causal chain? Every link in the chain needs evidence.

Conclusion

`mem.memused` high (OS running out of memory)
			-> `swap.used` high -> `disk.io.avgqu-sz` high (disk requests queue up) -> "cpu load" high -> alert triggered
`puppet` periodic task performs a large number of disk reads

Analysis

Our machine has 8 GB of memory. The JVM parameters are:

-Xmx6g -Xms6g -Xmn3g

Question 1: Why does mem.memused stay close to 8 GB? The JVM is configured with 6 GB and only half of that is in use, so how can 8 GB fill up?

memused = MemTotal - MemFree - Buffers/Cached. Looking at this formula, as long as the JVM does not release memory back to the operating system, Buffers/Cached and MemFree do not change. A JVM GC only releases memory logically: the memory is still managed by the JVM and is not physically returned to the OS (which is why top shows about 6 GB in the RES column for the Java process). Only metrics such as jvm.memory.used closely track the memory changes that GC causes inside the JVM. From the operating system's point of view, close to 6 GB is in use.

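
The memused formula above can be sketched in a few lines of Python. This is only an illustration: it reads the formula as MemTotal - MemFree - Buffers - Cached, and the sample values are hypothetical, not measurements from the machine in this article. The helper names parse_meminfo and mem_used_kib are made up for this sketch.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of values in KiB."""
    values = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def mem_used_kib(info):
    """memused = MemTotal - MemFree - Buffers - Cached (all in KiB)."""
    return info["MemTotal"] - info["MemFree"] - info["Buffers"] - info["Cached"]


# Hypothetical sample values, chosen only to make the arithmetic easy to follow.
SAMPLE = """\
MemTotal:        8000000 kB
MemFree:          200000 kB
Buffers:          100000 kB
Cached:           300000 kB
"""

info = parse_meminfo(SAMPLE)
used = mem_used_kib(info)
used_percent = round(100 * used / info["MemTotal"], 1)
print(used, used_percent)
```

On a real Linux host the same functions could be fed the contents of /proc/meminfo instead of SAMPLE.
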
Question 2: Why is memory usage so high that the swap partition gets used?

When the machine was requested, Tomcat was installed on it (although it is not actually needed). After the service was deployed, there were two Java processes on the machine, one of them started by Tomcat. The following command shows that it uses about 1.5 GB of memory:

$ ps -p 3408 -o rss,vsz
  RSS    VSZ
1554172 8672328

As the business service's JVM requests more and more memory from the operating system, the RES column in top gradually grows to nearly 6 GB. Total memory usage = JVM 1 (6 GB) + JVM 2 (Tomcat, 1.5 GB) + non-JVM memory. Eventually the OS runs short of available memory and starts using the swap partition.
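
The memory budget can be worked through with the numbers quoted in this article. This is a rough sketch under two assumptions: 8 GB is treated as 8 GiB, and the business JVM is counted as exactly its 6 GB heap (a real JVM also uses some memory outside the heap, so the true remainder would be smaller).

```python
KIB_PER_GIB = 1024 * 1024

total_gib = 8.0            # machine memory, from the article (assumed to mean GiB)
business_jvm_gib = 6.0     # RES of the business JVM grows toward -Xmx6g
tomcat_rss_kib = 1554172   # RSS column from the ps output above, in KiB

tomcat_gib = tomcat_rss_kib / KIB_PER_GIB
left_gib = total_gib - business_jvm_gib - tomcat_gib

print(f"Tomcat RSS: {tomcat_gib:.2f} GiB")
print(f"Left for the OS and all non-JVM memory: {left_gib:.2f} GiB")
```

Roughly half a gibibyte is left for the kernel, page cache, and every other process, which is consistent with the machine falling back to swap.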

Question 3: Why is the CPU load high while CPU usage is low?
Too many processes are waiting for disk I/O to complete, so the process queue is long, but very few processes are actually running on the CPU. The load is therefore high while CPU usage is low.
Question 4: Why does a long disk request queue lead to high CPU load?

Commands such as uptime and top show the load average. The three numbers, from left to right, are the load averages over the last 1 minute, 5 minutes, and 15 minutes:

$ uptime
11:44:47  up 46 days 14:54,  2 users,  load average: 2.98, 3.08, 3.02

If the averages are 0.0, the system is idle.
If the 1-minute average is higher than the 5-minute or 15-minute average, the load is increasing.
If the 1-minute average is lower than the 5-minute or 15-minute average, the load is decreasing.
If the averages are higher than the number of CPUs in the system, the system is likely to have a performance problem (depending on the situation).
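
These rules of thumb can be expressed as a small Python sketch. It is a simplification: the trend is decided by comparing only the 1-minute and 5-minute values, and cpu_count is a parameter the caller must supply (the article does not state how many CPUs the machine has). The function names are made up for this example.

```python
def parse_load_average(uptime_line):
    """Extract the 1, 5 and 15 minute averages from an uptime-style line."""
    tail = uptime_line.split("load average:")[1]
    one, five, fifteen = (float(x) for x in tail.split(","))
    return one, five, fifteen


def describe(one, five, fifteen, cpu_count):
    """Return (trend, over_cpu_count) following the rules of thumb above."""
    if one == 0.0 and five == 0.0 and fifteen == 0.0:
        trend = "idle"
    elif one > five:
        trend = "increasing"
    elif one < five:
        trend = "decreasing"
    else:
        trend = "steady"
    # "Higher than the number of CPUs" is read here as: all three values exceed it.
    over_cpu_count = min(one, five, fifteen) > cpu_count
    return trend, over_cpu_count


line = "11:44:47  up 46 days 14:54,  2 users,  load average: 2.98, 3.08, 3.02"
averages = parse_load_average(line)
print(averages)
print(describe(*averages, cpu_count=2))  # cpu_count=2 is a hypothetical value
```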

On Linux, load averages are "system load averages" for the system as a whole. They measure the number of threads that are running or waiting (for CPU, disk, or uninterruptible locks), which includes processes in uninterruptible sleep. Unlike the CPU load defined by other operating systems, the Linux figure does not reflect only the demand on CPU resources. The advantage is that it covers demand for different kinds of resources.

When you see a high load average, you cannot tell from the number alone whether there are too many runnable processes or too many processes in uninterruptible sleep, so you cannot judge whether the CPU is insufficient or an I/O device is the bottleneck.

When a process running on the CPU needs to access a file on disk, it issues a request to the kernel, and the kernel fetches the data from disk via DMA. Meanwhile the CPU switches to another process or goes idle, and the task enters the uninterruptible sleep state. When there are too many read and write requests, too many processes are in uninterruptible sleep, which produces high load with low CPU usage.
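
One way to tell the two cases apart is to count processes by state: R means runnable and D means uninterruptible sleep. The Python sketch below does this for text shaped like the output of a command such as ps -eo state,pid,comm. The sample listing is hypothetical and only illustrates the shape of the data; it was not captured from the machine in this article.

```python
from collections import Counter

# Hypothetical listing: one-letter state, PID, command name.
SAMPLE_PS = """\
S 1 systemd
R 3408 java
D 4021 puppet
D 4022 kswapd0
S 4100 sshd
"""


def count_states(ps_text):
    """Count processes by the first letter of their state column."""
    counts = Counter()
    for line in ps_text.splitlines():
        fields = line.split()
        if fields:
            counts[fields[0][0]] += 1
    return counts


counts = count_states(SAMPLE_PS)
runnable = counts["R"]
uninterruptible = counts["D"]
print(f"R={runnable} D={uninterruptible} load contributors={runnable + uninterruptible}")
```

If D clearly outnumbers R while the load is high, the pressure is more likely coming from I/O than from the CPU.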

sched/loadavg.h:

#define LOAD_FREQ   (5*HZ+1) /* 5 sec intervals */

sched/loadavg.c:

 * The global load average is an exponentially decaying average of nr_running +
 * nr_uninterruptible.
 *
 * Once every LOAD_FREQ:
 *
 *   nr_active = 0;
 *   for_each_possible_cpu(cpu)
 *     nr_active += cpu_of(cpu)->nr_running + cpu_of(cpu)->nr_uninterruptible;
 *
 *   avenrun[n] = avenrun[0] * exp_n + nr_active * (1 - exp_n)

HZ is the kernel timer frequency, which is defined when compiling the kernel. On my system, it’s 250:

% grep "CONFIG_HZ=" /boot/config-$(uname -r)
CONFIG_HZ=250
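
The exponentially decaying average in the kernel comment above can be simulated in Python. This is a floating-point sketch, not the kernel's code: the kernel uses fixed-point arithmetic, and the decay factor exp(-5 s / window) used here is the value those fixed-point constants approximate. The function names and the sample workload are assumptions made for illustration.

```python
import math

LOAD_FREQ_SECONDS = 5  # one sample every 5 seconds, as LOAD_FREQ indicates


def decay_factor(window_seconds):
    """exp_n for a given averaging window (60, 300 or 900 seconds)."""
    return math.exp(-LOAD_FREQ_SECONDS / window_seconds)


def simulate(nr_active_samples, window_seconds, start=0.0):
    """Apply load = load * exp_n + nr_active * (1 - exp_n) once per sample."""
    exp_n = decay_factor(window_seconds)
    load = start
    for nr_active in nr_active_samples:
        load = load * exp_n + nr_active * (1 - exp_n)
    return load


# Hypothetical workload: 3 tasks stuck in uninterruptible sleep for one minute
# (12 samples), starting from an idle system.
one_minute = simulate([3] * 12, window_seconds=60)
fifteen_minute = simulate([3] * 12, window_seconds=900)
print(f"1-min load: {one_minute:.2f}, 15-min load: {fifteen_minute:.2f}")
```

The 1-minute figure reacts much faster than the 15-minute one, which is why a burst of disk-blocked tasks every 30 minutes shows up as periodic load spikes.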

Solution

Remove the preinstalled Tomcat.
Reduce the configured maximum JVM heap size.

Problem solved!

Appendix:

The top command:

# top
top - 12:13:22 up 167 days, 20:47,  2 users,  load average: 0.00, 0.01, 0.05
Tasks: 272 total,   1 running, 271 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.1 sy,  0.0 ni, 99.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 65759080 total, 58842616 free,   547908 used,  6368556 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used. 64264884 avail Mem
................
  
Explanation of the third line above:
us (user CPU time): percentage of CPU time spent in user mode. A high value means user processes consume a lot of CPU time; for example, if it stays above 50% for a long time, the program's algorithms or code should be optimized.
sy (system CPU time): percentage of CPU time spent in kernel mode.
ni (user nice CPU time): percentage of user-mode CPU time spent on processes whose priority has been adjusted with nice.
id (idle CPU time): percentage of time the CPU is idle. If this value stays at 0 while sy is twice us, the system is generally short of CPU resources.
wa (I/O wait CPU time): percentage of time the CPU spends waiting for disk I/O to complete. A high value means I/O waiting is serious, which may be caused by a large amount of random disk access or by a disk performance bottleneck.
hi (hardware IRQ): percentage of time spent handling hardware interrupts.
si (software IRQ): percentage of time spent handling software interrupts.
st (steal time): percentage of time taken from this virtual machine by the hypervisor.
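
The fields described above can be pulled out of the %Cpu(s) line with a short Python sketch. The regular expression assumes the exact layout shown in the top output above, and the 20% threshold used in looks_io_bound is a hypothetical cut-off chosen for illustration, not a recommended value.

```python
import re


def parse_cpu_line(line):
    """Turn a top '%Cpu(s):' line into a dict such as {'us': 0.0, 'id': 99.9}."""
    pairs = re.findall(r"([\d.]+)\s+([a-z]{2})", line)
    return {name: float(value) for value, name in pairs}


def looks_io_bound(cpu, wa_threshold=20.0):
    """Hypothetical heuristic: I/O wait above the threshold and above us + sy."""
    return cpu["wa"] > wa_threshold and cpu["wa"] > cpu["us"] + cpu["sy"]


line = "%Cpu(s):  0.0 us,  0.1 sy,  0.0 ni, 99.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st"
cpu = parse_cpu_line(line)
print(cpu)
print(looks_io_bound(cpu))
```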

Original site

Copyright notice
This article was written by [The way of research and development]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/02/202202150612065493.html