当前位置：网站首页>[stonedb fault diagnosis] system resource bottleneck diagnosis

[stonedb fault diagnosis] system resource bottleneck diagnosis

2022-07-27 22:07:00 【51CTO】

When there is a bottleneck in the resources of the operating system , Not only the application services on the operating system are affected , Moreover, executing simple commands in the operating system may not return results . Before the operating system is completely rammed , You can use related commands to CPU、 Memory 、IO And the use of network resources , Then analyze and confirm whether these resources are reasonably utilized , Is there a bottleneck .

CPU

top、vmstat Can be checked CPU Usage situation , but top The results are more comprehensive .top The returned result has two layers , The upper layer is the statistical information of system performance , The lower level is the process statistics , The default in accordance with the CPU Sort by usage . top An example of the return result is as follows ：

top - 10:12:21 up 5 days, 22:31,  4 users,  load average: 1.00, 1.00, 0.78
Tasks: 731 total,   1 running, 730 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.7 us,  0.0 sy,  0.0 ni, 98.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 257841.3 total,   1887.5 free,  45581.6 used, 210372.2 buff/cache
MiB Swap:   8192.0 total,   8188.7 free,      3.3 used. 210450.4 avail Mem 

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                
908076 mysql     20   0  193.0g  42.4g  44088 S 100.3  16.8 228:10.34 mysqld                                                                                                                 
823137 root      20   0 6187564  83772  51636 S   6.6   0.0   6:36.12 dockerd                                                                                                                
822938 root      20   0 3278696  58500  35420 S   0.7   0.0  38:37.69 containerd                                                                                                             
1483 root      20   0  239280   9260   8136 S   0.3   0.0   0:19.16 accounts-daemon                                                                                                        
928343 root      20   0    9936   4576   3240 R   0.3   0.0   0:00.04 top                                                                                                                    
  ......
     1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.

first line

10:12:21： Current system time up 5 days： The number of running days since the last system startup 4 user： Number of users logging in to the system load average： In the past 1 minute 、5 minute 、15 minute , The average value of the system load

The second line

total： Total number of system processes running： Number of running processes sleeping： Number of dormant processes stopped： The number of processes in the stopped state zombie： Number of processes in zombie state

The third line

us： User process occupation CPU Percent of sy： The system process occupies CPU Percent of ni： The priority is occupied by the changed process CPU Percent of id： Free CPU Percentage occupied wa：IO Waiting for occupation CPU Percent of hi： Hardware interrupt occupation CPU Percent of si： Software interrupt occupancy CPU Percent of st： Virtualized environments occupy CPU Percent of We need to focus on CPU The usage rate of , When us When the value is higher , Indicates the user process consumption CPU More time , If it takes longer than 50% when , Application services should be optimized as soon as possible . When sy When the value is higher , Description system process consumption CPU More time , For example, it may be the unreasonable configuration of the operating system or the emergence of the operating system Bug. When wa When the value is higher , Explain the system IO Waiting is more serious , For example, there may be a lot of randomness IO visit ,IO Bandwidth bottleneck .

In the fourth row

total： Total physical memory size , Unit is M free： Free memory size used： Size of memory used buff/cache： Cached memory size

The fifth row

total：Swap size free： Idle Swap size used： Used Swap size avail Mem： Cached Swap size

Process list

PID： Process id USER： The owner of the process PR： Priority of the process , The smaller the value, the more priority is given to execution NI： process nice value , A positive value indicates that the priority of the process is reduced , A negative value means to increase the priority of the process ,nice The value range is (-20,19), By default , Process nice The value is 0 VIRT： Virtual memory size occupied by the process RES： The physical memory size occupied by the process SHR： The size of shared memory occupied by the process S： Process status , among S Indicating dormancy ,R Indicates running ,Z Indicates a dead state ,N Indicates that the process priority value is negative %CPU： process CPU Usage rate %MEM： Process memory usage TIME+： After the process starts, it occupies CPU The total time of , I.e. occupation CPU Cumulative value of service time COMMAND： Process start command name appear CPU High usage diagnostic methods ： 1） Find the function called

notes ：xxx by top -H Return to the most consumed CPU The process of . 2） Find out the consumption CPU Of SQL

pidstat -t -p <mysqld_pid> 1 5
select * from performance_schema.threads where thread_os_id = xxx\G
select * from information_schema.processlist where id = performance_schema.threads.processlist_id\G
     1.
2.
3.

notes ：xxx by pidstat Return to the most consumed CPU The thread of .

Memory

top、vmstat、free Can check the memory usage . free An example of the return result is as follows ：

# free -g
total        used        free      shared  buff/cache   available
Mem:            251          44           1           0         205         205
Swap:             7           0           7
     1.
2.
3.
4.

total： Total physical memory size ,total = used + free + buff/cache used： Size of memory used free： Free memory size shared： Shared memory size buff/cache： Cache memory size available： Available physical memory size ,available = free + buff/cache There are diagnostic methods for high memory utilization ： 1） Check if the configuration is reasonable , for example ： Operating system physical memory 128G, And assign to the database instance 110G, Because operating system processes and other applications also need memory , It's easy to run out of memory ; 2） Check whether the number of concurrent connections is too high ,read_buffer_size、read_rnd_buffer_size、sort_buffer_size、thread_stack、join_buffer_size、binlog_cache_size All are session Grade , The more connections , The more memory you need , Therefore, these parameters cannot be set too large ; 3） Check whether there is unreasonable join, for example ： When multiple tables are associated , The result set of the driving table in the execution plan is relatively large , It needs to be executed repeatedly , Easy to cause memory leaks ; 4） Check whether there are too many open files and table_open_cache Whether the setting is reasonable , When accessing a table , The table will be put into the cache table_open_cache, The purpose is to visit faster next time , But if table_open_cache Set too large , And there are many open tables , It consumes a lot of memory .

iostat、dstat、pidstat Can be checked IO Usage situation . iostat An example of the return result is as follows ：

# iostat -x 1 1
Linux 3.10.0-957.el7.x86_64 (htap2)     06/13/2022      _x86_64_        (64 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
0.06    0.00    0.03    0.01    0.00   99.90

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.04     0.00    85.75     0.00    0.25    0.25    0.00   0.15   0.00
sdb               0.06     0.11    7.61    1.10  1849.41    50.81   436.48     0.36   40.93   46.75    0.48   1.56   1.35
dm-0              0.00     0.00    0.28    0.19     8.25    12.05    87.01     0.00    4.81    7.37    0.94   1.61   0.08
     1.
2.
3.
4.
5.
6.
7.
8.
9.
10.

rrqm/s: Every second merge Read operand of wrqm/s: Every second merge Write operand of r/s： Read every second IO frequency w/s： Write every second IO frequency rkB/s： Read every second IO size , Unit is KB wkB/s： Write every second IO size , Unit is KB avgrq-sz： Average request size , The unit is sector （512B） avgqu-sz： The average number of requests active in the driver request queue and in the device await： Average IO response time , Including the waiting in the driver request queue and the device IO response time r_await： Every read operation IO response time w_await： Every write operation IO response time svctm： Disk device IO Mean response time %util： The device is busy processing IO Percentage of requests （ Usage rate ）, How busy the disk is r/s + w/s：IOPS appear IO High usage diagnostic methods ： 1） Find the most used disk device

2） Find out the occupation IO High application

3） Find out the occupation IO High thread

4） Find out the occupation IO high SQL

select * from performance_schema.threads where thread_os_id = xxx\G
select * from information_schema.processlist where id = performance_schema.threads.processlist_id\G
     1.
2.