Technical practice: online fault analysis and solutions (Part 1)

2022-07-04 00:50:00 【51CTO】

Online faults usually refer to large-scale problems or events that affect the availability of online services. Handling them is not only a technical activity; above all, it tests the emergency response ability of the engineers and the technical team. This article covers the classification of online faults, ideas for coping with them, and their probable causes, and summarizes some experience and problem-solving methods for discussion with fellow practitioners.

One 、 Online fault classification

■　 Unexpected errors, no response, or slow response

■　 Occurs while the service is live and affects the user experience

■　 The service may still be running, or it may be down on a large scale

■　 It needs to be repaired as soon as possible

Two 、 Coping ideas

Start from experience. If someone on the emergency team has handled a similar problem and knows a way to restore normal operation, restore service as quickly as possible (by rolling back, for example), and be sure to preserve the scene for later diagnosis and repair. If no one has relevant experience, fall back on blunt measures that keep the service available, such as scheduled restarts, rate limiting, and degradation.
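As an illustration of the rate limiting and degradation mentioned above, here is a minimal token-bucket sketch. It is not taken from the original article: the names `TokenBucket`, `handle`, `process`, and `fallback` are hypothetical, and a production system would more likely rely on the limiter built into its gateway or framework.

```python
import time


class TokenBucket:
    """Allow about `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def handle(request, bucket, process, fallback):
    """Serve normally while under the limit; otherwise return a degraded response."""
    if bucket.allow():
        return process(request)
    return fallback(request)
```

The fallback is where degradation happens: it might return cached data, a default value, or a "busy, try again" response, so that the overloaded path is protected while the fault is investigated.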

The business owner, technical lead, core developers, architects, operations engineers, and business operations staff should then quickly analyze the cause of the problem together. The analysis should first consider recent changes to the system, including the following:

■　 Has the system been released or deployed recently?

■　 Are the consumers of the service running any campaigns or promotions?

■　 Is network traffic fluctuating?

■　 Has business volume increased recently?

■　 Have operations staff made changes in the system?

■　 Have the underlying platforms and resources the system depends on been released or changed recently?

■　 Have other systems that this system depends on been released recently?

Three 、 Probable causes

■　 Code bugs: logic that is not rigorous, connections that are not released

■　 Code performance: external calls inside loops, batch reads not used, regular loop, etc.

■　 Memory leaks: local caches

■　 Abnormal traffic: DDoS

■　 Increased business volume: capacity estimation errors

■　 External system problems: performance problems in the database, search engine, distributed cache, or middleware such as message queues

■　 Abnormal resource indicators: CPU, memory, IO
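To make the "external calls inside loops" and "batch reads not used" causes concrete, the sketch below compares the two access patterns. `FakeStore` is a hypothetical stand-in for a remote cache or database that only counts round trips; the batch size of 100 is an arbitrary assumption.

```python
class FakeStore:
    """Hypothetical stand-in for a remote cache or database; counts round trips."""

    def __init__(self, data):
        self.data = dict(data)
        self.round_trips = 0

    def get(self, key):
        self.round_trips += 1  # one network round trip per key
        return self.data.get(key)

    def get_many(self, keys):
        self.round_trips += 1  # one network round trip per batch
        return {k: self.data.get(k) for k in keys}


def load_one_by_one(store, keys):
    """Anti-pattern: an external call inside a loop."""
    return {k: store.get(k) for k in keys}


def load_in_batches(store, keys, batch_size=100):
    """Batch read: the same result with far fewer round trips."""
    result = {}
    for i in range(0, len(keys), batch_size):
        result.update(store.get_many(keys[i:i + batch_size]))
    return result
```

With 250 keys, the first function makes 250 round trips and the second makes 3. Under heavy traffic that difference in latency and connection usage is often what turns a slow interface into an outage.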

Four 、 Three steps

■　 Monitor: “ I don't know what I'm going to do ”. Monitoring mechanisms are needed to discover and expose system performance problems. This generally relies on system-level or business-level monitoring tools.

■　 Analyze: “ I know what I want to do ”. Basic computer knowledge and analysis tools are needed.

■　 Solve: “ I know what I need to know ”. Adjust system and program parameters; refactor and optimize code.

Understanding how a system is supposed to work does not make someone an expert; investigating why the system fails to work properly does.

1. Preparation

Knowledge needed to prepare for fault analysis, based on CentOS 6.5 and JDK 1.8.0_121:

■　 Basic computer knowledge: computer networks, operating systems, principles of computer organization

■　 Java memory management: garbage collection algorithms, garbage collectors, key GC parameters, the JVM memory model, etc.

■　 Java benchmark performance testing: JMH (a micro-benchmarking framework) can be used, which removes the effect of JIT compilation of hot code on the measurements

■　 HotSpot virtual machine architecture

■　 System parameter tuning

■　 Proficiency with the system's common diagnostic tools, the diagnostic tools bundled with the JDK, and other diagnostic tools

■　 Understanding of the business system: overall architecture, where the load comes from, capacity estimates, and the versions, modes, and parameters of the related software

2. Common system diagnostic tools: bundled with CentOS

■　uptime: Shows how long the system has been running and the load average over the past 1, 5, and 15 minutes. The load average is the average number of tasks that are running, runnable but waiting for a free processor, or blocked in uninterruptible sleep (waiting for IO, state D).

The first field is the current system time. In the example it is 22:36:32, printed in 24-hour format.

The second field is the system uptime. up 10 days, 11:21 means the machine has been running for 10 days, 11 hours, and 21 minutes. It is reset when the system restarts.

The third field is the number of logged-in users. 1 user means one user is currently logged in.

The last field is the system load average. 0.00, 0.05, 0.07 are the load averages over the past 1, 5, and 15 minutes. A lower load means fewer tasks are competing for the CPU and IO; judge the value relative to the number of CPU cores.
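The fields described above can be pulled out of the uptime output with a few lines of script. This is a minimal sketch: the sample line is assembled from the values quoted in the text, and the rule of thumb "1-minute load per core above 1.0 means tasks are queueing" is an assumption to be tuned for the workload, not a fixed standard.

```python
import re

# Sample line assembled from the values quoted in the text above.
SAMPLE_UPTIME = " 22:36:32 up 10 days, 11:21,  1 user,  load average: 0.00, 0.05, 0.07"


def parse_load_average(uptime_output):
    """Return the (1 min, 5 min, 15 min) load averages from `uptime` output."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_output
    )
    if match is None:
        raise ValueError("no load average found in uptime output")
    return tuple(float(value) for value in match.groups())


def is_overloaded(load_averages, cores, threshold=1.0):
    """Assumed rule of thumb: compare the 1-minute load per core with a threshold."""
    return load_averages[0] / cores > threshold
```

On a live machine the input would come from running `uptime`; `os.getloadavg()` and `os.cpu_count()` in the Python standard library return the same load averages and the core count without any parsing.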

■　dmesg | tail: Prints the last 10 lines of the kernel message buffer. Common events such as OOM kills and TCP packet drops are recorded here.

■　free -m: Shows system memory usage; the -m option displays values in megabytes. Buffers and cache are counted in the used column, so the second line (-/+ buffers/cache) is what really reflects memory usage. If little memory is available, the system starts using the swap area, which adds IO overhead and reduces performance.
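The arithmetic behind the second line of free -m can be sketched as follows. The sample output is illustrative, not captured from a real machine, and the parser assumes the older procps layout with separate buffers and cached columns (as on CentOS 6); newer versions of free print different columns and are rejected.

```python
# Illustrative numbers in the older procps layout (assumption: CentOS 6 style).
SAMPLE_FREE_M = """\
             total       used       free     shared    buffers     cached
Mem:          7869       7651        218          0        191       5081
-/+ buffers/cache:       2379       5490
Swap:         8191          0       8191
"""


def parse_free_m(output):
    """Recompute the '-/+ buffers/cache' figures from the 'Mem:' line (values in MB)."""
    lines = output.splitlines()
    if not lines or lines[0].split()[-2:] != ["buffers", "cached"]:
        raise ValueError("expected the older layout with buffers and cached columns")
    for line in lines[1:]:
        fields = line.split()
        if fields and fields[0] == "Mem:":
            total, used, free, _shared, buffers, cached = (int(v) for v in fields[1:7])
            return {
                "total": total,
                # Memory actually held by applications.
                "app_used": used - buffers - cached,
                # Memory applications could still obtain, since buffers and
                # cache can be reclaimed by the kernel.
                "app_free": free + buffers + cached,
            }
    raise ValueError("no 'Mem:' line found")
```

In the sample, the used column looks alarming (7651 of 7869 MB), yet applications hold only 2379 MB, which is why the second line is the one to read.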

■　vmstat 1: A real-time performance monitoring tool that reports the state of the server at a given interval, including core indicators such as CPU usage, memory usage, virtual memory swapping, and IO reads and writes. r is the number of processes waiting for CPU resources, which reflects how busy the CPU is better than the load average does; b is the number of processes blocked in uninterruptible sleep; si and so show swap activity, and non-zero values mean the swap area is in use; us, sy, id, wa, and st show CPU usage, with us + sy + id + wa + st = 100.

When given an interval and no count, vmstat collects data at that interval (every second for vmstat 1, every 2 seconds for vmstat 2) and keeps running until it is interrupted.
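The vmstat columns discussed above lend themselves to simple automated checks. The sketch below parses the output by column name and flags the three conditions the text mentions. The sample rows are illustrative, and the thresholds (run queue larger than the core count, any blocked process, any swap traffic) are assumptions rather than fixed rules.

```python
# Illustrative output; the first data row is deliberately unhealthy.
SAMPLE_VMSTAT = """\
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 9  2   1024 218000 191000 5081000    0   35    12   480 1500 9000 70 20  0 10  0
 1  0   1024 219000 191000 5081000    0    0     2    10  120  250  3  1 95  1  0
"""


def parse_vmstat(output):
    """Return one dict per data row, keyed by the vmstat column names."""
    header = None
    rows = []
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "r" and "us" in fields:
            header = fields  # the column-name line
        elif header and len(fields) == len(header) and all(f.isdigit() for f in fields):
            rows.append(dict(zip(header, (int(f) for f in fields))))
    return rows


def vmstat_findings(row, cores):
    """Flag the conditions described in the text (assumed thresholds)."""
    notes = []
    if row["r"] > cores:
        notes.append("run queue exceeds CPU count")
    if row["b"] > 0:
        notes.append("processes blocked in uninterruptible sleep")
    if row["si"] > 0 or row["so"] > 0:
        notes.append("swap area in use")
    return notes
```

Keying rows by column name rather than position keeps the checks readable and tolerates vmstat versions that add or drop columns.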

■　top: Shows a broad set of overall system indicators, including system load, memory usage, and CPU usage. It largely covers the functions of the commands above.

■　netstat -tanp: Shows the status of TCP network connections.

The iproute toolset (ss, ip) can replace netstat.
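A common first step with netstat -tanp output is to count connections per TCP state, since a pile-up of TIME_WAIT or CLOSE_WAIT connections often points to connection handling problems. The sketch below does that on illustrative sample output (the addresses and PIDs are made up).

```python
from collections import Counter

# Illustrative output; addresses, ports, and PIDs are made up.
SAMPLE_NETSTAT = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address      Foreign Address     State       PID/Program name
tcp        0      0 0.0.0.0:8080       0.0.0.0:*           LISTEN      2345/java
tcp        0      0 10.0.0.5:8080      10.0.0.9:51234      ESTABLISHED 2345/java
tcp        0      0 10.0.0.5:8080      10.0.0.9:51236      ESTABLISHED 2345/java
tcp        0      0 10.0.0.5:8080      10.0.0.9:51240      TIME_WAIT   -
"""


def count_tcp_states(netstat_output):
    """Count TCP connections by state (the sixth column of `netstat -tan` output)."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data lines start with "tcp" or "tcp6"; header lines are skipped.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states
```

Calling `count_tcp_states(SAMPLE_NETSTAT).most_common()` lists the states from most to least frequent, which is usually enough to spot an abnormal distribution at a glance.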

3. Common system diagnostic tools: Sysstat

■　mpstat -P ALL 1: Shows the usage of each CPU. If one CPU has extremely high usage while the others are idle, the cause may be a single-threaded application.

■　sar -n DEV 1: Here sar is used to check the throughput of network devices, which helps determine whether a network device is saturated.

■　sar -n TCP,ETCP 1: Shows TCP connection statistics. active/s is the number of connections actively initiated per second (connect); passive/s is the number of connections passively accepted per second (accept); retrans/s is the number of retransmissions per second, which reflects network conditions and whether packets are being lost.

■　iostat -xz 1: Shows the disk IO of the machine. await (ms) is the average time an IO request takes, including both queueing time and the actual service time; avgqu-sz is the average queue length of requests issued to the device; %util is the device utilization.

sar, iostat, mpstat, and pidstat all belong to the sysstat software suite.
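The "one CPU extremely busy" pattern from mpstat -P ALL can be detected mechanically. The sketch below parses per-CPU rows by column name and reports CPUs above a busy threshold. The sample output is illustrative, and the 90% threshold is an arbitrary assumption.

```python
# Illustrative output: CPU 0 is saturated while the others are nearly idle.
SAMPLE_MPSTAT = """\
10:15:01     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
10:15:02     all   25.50    0.00    0.50    0.00    0.00    0.00    0.00    0.00   74.00
10:15:02       0   99.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00
10:15:02       1    1.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.00
10:15:02       2    1.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.00
10:15:02       3    1.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00   98.00
"""


def parse_mpstat(output):
    """Return {cpu_number: busy_percent}, where busy = 100 - %idle."""
    header = None
    busy = {}
    for line in output.splitlines():
        fields = line.split()
        if "CPU" in fields and "%idle" in fields:
            header = fields  # the column-name line
            continue
        if header is None or len(fields) != len(header):
            continue
        row = dict(zip(header, fields))
        if row["CPU"] != "all":  # skip the aggregate row
            busy[int(row["CPU"])] = 100.0 - float(row["%idle"])
    return busy


def hot_cpus(busy, threshold=90.0):
    """CPUs at or above the (assumed) busy threshold, in ascending order."""
    return sorted(cpu for cpu, pct in busy.items() if pct >= threshold)
```

If `hot_cpus` returns a single CPU while the rest stay near idle, a single-threaded hot spot is a likely suspect, and the next step is to find which thread is responsible.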

4. JDK diagnostic tools

■　jstack: Java stack trace tool. Mainly used to print the stack traces of the Java threads in a given Java process, core file, or remote debug server.

■　jmap: Java memory map tool (Java Memory Map). Mainly used to print the shared object memory map or heap memory details of a given Java process, core file, or remote debug server.

■　jhat: Java heap analysis tool (Java Heap Analysis Tool). Used to analyze object information in the Java heap.

■　jinfo: Java configuration information tool (Java Configuration Information). Used to print the configuration of a given Java process, core file, or remote debug server; it can also modify some JVM parameters dynamically.

■　jstat: JVM statistics monitoring tool (Java Statistics Monitoring Tool). Mainly used to monitor and display JVM performance statistics, including GC statistics.

■　jcmd: Java command tool (Java Command). Used to send diagnostic command requests to a running JVM. Because jmap is officially marked as unsupported, jcmd can be used as an alternative.

■　visualvm: Connects to a JVM process through the JMX interface to show its threads, memory, classes, and other information. Various plug-ins can be installed. (Tomcat's JMX interface can be enabled through CATALINA_OPTS.)

■　jconsole: Similar in function to visualvm. It can show the stack of a specific thread and the memory usage of each generation, and supports invoking MBean operations remotely.
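A common way to combine these tools with the system tools above is to find the busiest thread of a Java process with top -H -p <pid>, then look up that thread in the jstack output. top prints the thread ID in decimal, while jstack prints it in hexadecimal as nid=0x..., so a conversion is needed. The sketch below shows the lookup; the sample dump and the class name com.example.Busy are made up for illustration.

```python
import re

# Illustrative thread dump excerpt; thread names, addresses, and classes are made up.
SAMPLE_JSTACK = """\
"main" #1 prio=5 os_prio=0 tid=0x00007f0a1400a000 nid=0x1a2b runnable [0x00007f0a1c5f2000]
   java.lang.Thread.State: RUNNABLE
\tat com.example.Busy.spin(Busy.java:12)
\tat com.example.Busy.main(Busy.java:6)

"VM Thread" os_prio=0 tid=0x00007f0a14078000 nid=0x1a2e runnable
"""


def thread_id_to_nid(thread_id):
    """Convert the decimal thread ID shown by top to the hex form used by jstack."""
    return hex(thread_id)


def find_thread_stack(jstack_output, thread_id):
    """Return the stack block whose first line carries the matching nid, or None."""
    pattern = re.compile(r"nid=" + re.escape(thread_id_to_nid(thread_id)) + r"\b")
    # Thread entries in a dump are separated by blank lines.
    for block in re.split(r"\n\s*\n", jstack_output):
        block = block.strip()
        if block and pattern.search(block.splitlines()[0]):
            return block
    return None
```

For example, thread ID 6699 in top corresponds to nid=0x1a2b, so `find_thread_stack(SAMPLE_JSTACK, 6699)` returns the stack of the "main" thread in the sample, pointing directly at the code that is consuming CPU.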

5. Other tools

■　jmc: Java Mission Control, a very powerful sampling-based tool that combines diagnosis, analysis, and monitoring. Because it is a paid tool, it is not widely used.

■　greys-anatomy: An online diagnostic tool. By dynamically modifying bytecode, it makes it possible to add logging, time methods, and otherwise enhance code dynamically without restarting the JVM.

■　arthas: Alibaba's open source Java diagnostic toolbox, derived from greys-anatomy. Its features include online diagnosis, bytecode decompilation, and finding the most resource-intensive Java threads.

■　jwebap: A JavaEE performance monitoring framework implemented with ASM bytecode enhancement. It supports HTTP requests, JDBC connections, method call traces, call counts, and timing statistics. The derived project suishen-webap adds Java 8 support and Redis connection monitoring.

■　awesome-scripts: Wraps many common diagnostic tools and scripts, including greys-anatomy, sjk, and VJTools, as well as scripts for getting the stack of the most resource-consuming threads and for counting TCP connections.

(To be continued)

Copyright notice
This article was created by [51CTO]. Please include the original link when reprinting. Thank you.
https://yzsam.com/2022/185/202207040034225788.html
