Technical Practice: Online Fault Analysis and Solutions (Part 1)
2022-07-04 00:50:00 【51CTO】
Online faults usually refer to large-scale problems or incidents that affect the availability of online services. Handling them is not only a technical task; it is above all a test of the emergency response capability of the engineers and the technical team. This article covers the classification of online faults, how to respond to them, and their probable causes, and summarizes some experience and troubleshooting methods for discussion with fellow engineers.
I. Online fault classification
■ Unexpected errors, no response, or slow response
■ The system is in service, so user experience is affected
■ The system cannot be shut down, or a large-scale outage has occurred
■ The fault needs to be repaired as soon as possible
II. Response approach
Start by analyzing from experience. If someone on the emergency team has handled a similar problem and knows a way to restore normal operation, recover as soon as possible (by rolling back, for example), and be sure to preserve the scene for later root-cause analysis and repair. If no one has relevant experience, fall back on blunt measures that keep the service available, such as scheduled restarts, rate limiting, and degradation.
The business owner, technical lead, core developers, architects, operations and maintenance engineers, and business operations staff should then quickly analyze the cause together. The analysis should first consider recent changes to the system, including:
■ Has the system been released or deployed recently?
■ Are the consumers of the service running any operational campaigns?
■ Is there traffic fluctuation in the network?
■ Has business volume increased recently?
■ Have the operations staff made changes in the system?
■ Have the underlying platforms and resources the system depends on been released or deployed?
■ Have other systems it depends on been released or deployed?
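The checklist above amounts to asking "what changed shortly before the incident?". As a minimal sketch, assuming change records from releases, operations, and platform teams can be gathered into one list, the question can be answered with a simple time-window filter. The record fields, service names, and the 24-hour window are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical change records; in practice these would come from release,
# operations, and infrastructure change logs.
CHANGES = [
    {"when": datetime(2022, 7, 3, 21, 0), "source": "release", "what": "order-service deployed"},
    {"when": datetime(2022, 7, 3, 10, 0), "source": "operations", "what": "promotion campaign started"},
    {"when": datetime(2022, 6, 28, 9, 0), "source": "platform", "what": "database minor upgrade"},
]

def recent_changes(changes, incident_time, window_hours=24):
    """Return changes made within window_hours before the incident, newest first."""
    start = incident_time - timedelta(hours=window_hours)
    hits = [c for c in changes if start <= c["when"] <= incident_time]
    return sorted(hits, key=lambda c: c["when"], reverse=True)

incident = datetime(2022, 7, 4, 0, 50)
for change in recent_changes(CHANGES, incident):
    print(change["when"], change["source"], change["what"])
```

Sorting newest first puts the most likely suspects at the top, since the change closest to the incident is usually the first one to rule out.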
III. Probable causes
■ Code bugs: logic that is not rigorous, connections that are not released
■ Code performance: external calls inside loops, batch reads not used, regular expression loops, and so on
■ Memory leaks: local caches
■ Abnormal traffic: DDoS
■ Increased business volume: capacity estimation errors
■ External system problems: performance problems in middleware such as databases, search engines, distributed caches, and message queues
CPU, memory, or IO indicators are abnormal
IV. Three steps
■ Monitor: "I don't know what I don't know." Monitoring mechanisms are needed to discover and expose system performance problems. This generally relies on system-level or business-level monitoring tools.
■ Analyze: "I know what I don't know." This requires computer fundamentals and analysis tools.
■ Solve: "I know what I know." This involves adjusting system and program parameters and refactoring and optimizing code.
Knowing how a system should work does not make someone an expert; investigating why the system fails to work does.
1. Preparation
Preparation and knowledge needed for fault analysis, based on CentOS 6.5 and JDK 1.8.0_121:
■ Computer fundamentals: computer networks, operating systems, and computer organization
■ Java memory management: garbage collection algorithms, garbage collectors, key GC parameters, the JVM memory model, and so on
■ Java benchmark performance testing: JMH (a microbenchmarking framework) can be used for this, and it can eliminate the effect of JIT hot-code compilation on the measurements
■ HotSpot virtual machine architecture
■ System parameter tuning
■ Proficiency with common system diagnostic tools, the diagnostic tools bundled with the JDK, and other diagnostic tools
■ Understanding of the business system: overall architecture, where the load concentrates, capacity estimates, and the versions, modes, and parameters of related software
2. Common system diagnostic tools: bundled with CentOS
■ uptime: Shows how long the system has been running and its load averages over the past 1, 5, and 15 minutes. The load average is the average number of runnable tasks: tasks that are running, tasks that are ready to run but waiting for an idle processor, and tasks blocked in uninterruptible sleep (waiting for IO, state D).
The first field is the current system time, here 22:36:32, in 24-hour format.
The second field is the system uptime. up 10 days, 11:21 means the machine has been running for 10 days, 11 hours, and 21 minutes. This value resets when the system restarts.
The third field is the number of logged-in users. 1 user means one user is currently logged in.
The last field is the system load average. 0.00, 0.05, 0.07 are the average loads over the past 1, 5, and 15 minutes. A lower load generally means the system is less busy.
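The field breakdown above can be turned into a small parser. The following is a minimal sketch, assuming the common Linux uptime layout; the sample line is assembled from the values discussed above, and the "load above CPU count" rule in is_saturated is a rule of thumb, not something the article states.

```python
import re

# Sample line built from the values discussed above; real output varies by system.
SAMPLE = " 22:36:32 up 10 days, 11:21,  1 user,  load average: 0.00, 0.05, 0.07"

def parse_load_averages(line):
    """Return the 1-, 5-, and 15-minute load averages from one uptime line."""
    match = re.search(r"load average: ([\d.]+), ([\d.]+), ([\d.]+)", line)
    if match is None:
        raise ValueError("no load average found in %r" % line)
    return tuple(float(value) for value in match.groups())

def is_saturated(load, cpu_count):
    """Rule of thumb (an assumption): load above the CPU count means tasks are queuing."""
    return load > cpu_count

one, five, fifteen = parse_load_averages(SAMPLE)
print(one, five, fifteen, is_saturated(one, cpu_count=4))
```

Comparing the load against the number of CPUs matters because the same load value means very different things on a 2-core and a 32-core machine.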
■ dmesg | tail: Prints the last 10 lines of the system log. Events such as OOM kills and TCP packet drops are commonly recorded here.
■ free -m: Shows system memory usage; the -m option displays values in megabytes. Buffers and cache are counted in used on the first line, so the second line (-/+ buffers/cache) is what really reflects memory usage. If little memory is available, the system uses the swap area, which increases IO overhead and reduces performance.
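To make the "second line" arithmetic concrete, here is a minimal sketch that recomputes it from the first line. It assumes the older free layout found on CentOS 6 (newer procps versions print a different set of columns), and the sample numbers are made up for illustration.

```python
# Illustrative output in the older procps layout (as on CentOS 6); values are made up.
SAMPLE = """\
             total       used       free     shared    buffers     cached
Mem:          7872       7640        232          0        180       5100
-/+ buffers/cache:       2360       5512
Swap:         4095          0       4095
"""

def summarize_free(output):
    """Derive application memory usage and swap usage (in MB) from free -m output."""
    summary = {}
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Mem:":
            total, used, free, _shared, buffers, cached = map(int, fields[1:7])
            summary["total_mb"] = total
            # Buffers and cache are reclaimable, so subtract them from "used".
            summary["app_used_mb"] = used - buffers - cached
            summary["reclaimable_free_mb"] = free + buffers + cached
        elif fields[0] == "Swap:":
            summary["swap_used_mb"] = int(fields[2])
    return summary

print(summarize_free(SAMPLE))
```

In this sample the first line suggests only 232 MB is free, while the derived figures show that applications actually hold 2360 MB and 5512 MB can still be made available.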
■ vmstat 1: A real-time performance monitoring tool that reports server state at a given interval, including core indicators such as CPU usage, memory usage, virtual memory swapping, and IO reads and writes. r is the number of processes running or waiting for CPU time, which reflects how busy the CPU is better than the load average does; b is the number of processes blocked in uninterruptible sleep; si and so show swap-in and swap-out activity, and nonzero values mean the swap area is in use; us, sy, id, wa, and st break down CPU time, with us + sy + id + wa + st = 100.
With an interval of 2 (vmstat 2), vmstat collects data every 2 seconds and keeps running until the program is terminated.
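The vmstat columns described above lend themselves to simple automated checks. The sketch below parses captured vmstat output and flags the three conditions mentioned: a long run queue, blocked processes, and swap activity. The sample row is made up, and the thresholds are illustrative assumptions.

```python
# Illustrative vmstat output; the numbers are made up.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 9  1  10240 232000 180000 5100000   0   12     4    96 1200 3400 71 20  5  4  0
"""

def parse_vmstat(output):
    """Turn each data row into a dict keyed by the vmstat column names."""
    rows, columns = [], None
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "r":  # the column-name line
            columns = fields
        elif columns and fields[0].isdigit():
            rows.append(dict(zip(columns, (int(x) for x in fields))))
    return rows

def warnings_for(row, cpu_count):
    """Flag conditions worth a closer look (thresholds are assumptions)."""
    notes = []
    if row["r"] > cpu_count:
        notes.append("run queue exceeds CPU count")
    if row["b"] > 0:
        notes.append("processes blocked on IO")
    if row["si"] > 0 or row["so"] > 0:
        notes.append("swap in use")
    return notes

for row in parse_vmstat(SAMPLE):
    print(warnings_for(row, cpu_count=4))
```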
■ top: Shows many overall system indicators, including system load, memory usage, and CPU usage. It largely covers the functions of the commands above.
■ netstat -tanp: Shows the status of TCP network connections.
The iproute toolset (ss, ip) can replace netstat.
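When looking at TCP connection status, a count per state is often more useful than the raw list. The following is a minimal sketch that tallies states from captured netstat -tanp output; the sample connections, addresses, and process names are made up.

```python
from collections import Counter

# Illustrative netstat -tanp output; addresses and processes are made up.
SAMPLE = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address       Foreign Address     State       PID/Program name
tcp        0      0 0.0.0.0:8080        0.0.0.0:*           LISTEN      2345/java
tcp        0      0 10.0.0.5:8080       10.0.0.9:51234      ESTABLISHED 2345/java
tcp        0      0 10.0.0.5:8080       10.0.0.9:51240      ESTABLISHED 2345/java
tcp        0      0 10.0.0.5:41022      10.0.0.7:3306       CLOSE_WAIT  2345/java
tcp        0      0 10.0.0.5:41030      10.0.0.7:3306       TIME_WAIT   -
"""

def count_tcp_states(output):
    """Count TCP connections per state (the sixth column of each tcp line)."""
    states = Counter()
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states

for state, count in count_tcp_states(SAMPLE).most_common():
    print(state, count)
```

A growing number of CLOSE_WAIT entries, for example, often points to the "connection not released" class of code bug listed among the probable causes.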
3. Common system diagnostic tools: Sysstat
■ mpstat -P ALL 1: Shows the usage of each CPU. If one CPU has extremely high usage while the others do not, a single-threaded application is the likely cause.
■ sar -n DEV 1: Here sar is used to check the throughput of network devices, which tells you whether a network device is saturated.
■ sar -n TCP,ETCP 1: Shows TCP connection status. active/s is the number of actively initiated connections per second (connect); passive/s is the number of passively accepted connections per second (accept); retrans/s is the number of retransmissions per second, which reflects network conditions and whether packet loss has occurred.
■ iostat -xz 1: Shows disk IO on the machine. await (ms) is the average time an IO operation takes from the application's point of view, including time waiting in the queue and time actually being serviced; avgqu-sz is the average queue length of requests issued to the device; %util is device utilization.
sar, iostat, mpstat, and pidstat all belong to the sysstat software suite.
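The iostat indicators above can be checked mechanically once the output is captured. This is a minimal sketch, assuming the device-table layout of the sysstat version shipped with CentOS 6 (column names differ in newer releases); the sample values and the 80% utilization threshold are illustrative assumptions.

```python
# Illustrative device table from iostat -xz; the numbers are made up.
SAMPLE = """\
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    12.00    3.00  250.00    48.00  4096.00    16.38     8.50   33.60   3.90  98.70
sdb               0.00     0.50    1.00    2.00    16.00    40.00    18.67     0.01    2.10   1.20   0.36
"""

def parse_iostat(output):
    """Map each device name to a dict of its extended statistics."""
    devices, columns = {}, None
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            columns = None  # a blank line ends the device table
            continue
        if fields[0].startswith("Device"):
            columns = fields[1:]
        elif columns and len(fields) == len(columns) + 1:
            devices[fields[0]] = dict(zip(columns, map(float, fields[1:])))
    return devices

def busy_devices(devices, util_threshold=80.0):
    """Return devices whose %util meets the (assumed) saturation threshold."""
    return sorted(name for name, stats in devices.items()
                  if stats["%util"] >= util_threshold)

stats = parse_iostat(SAMPLE)
for name in busy_devices(stats):
    print(name, stats[name]["await"], stats[name]["avgqu-sz"])
```

Reading the column names from the header rather than hard-coding positions keeps the sketch usable if the column order changes.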
4. JDK diagnostic tools
■ jstack: Java stack trace tool, mainly used to print thread stack traces for a specified Java process, core file, or remote debug server.
■ jmap: Java memory mapping tool (Java Memory Map), mainly used to print shared object memory maps or heap memory details for a specified Java process, core file, or remote debug server.
■ jhat: Java heap analysis tool (Java Heap Analysis Tool), used to analyze object information in Java heap memory.
■ jinfo: Java configuration information tool (Java Configuration Information), used to print configuration information for a specified Java process, core file, or remote debug server. It can also modify some JVM parameters dynamically.
■ jstat: JVM statistics monitoring tool (Java Virtual Machine Statistics Monitoring Tool), mainly used to monitor and display JVM performance statistics, including GC statistics.
■ jcmd: Java command-line tool (Java Command), used to send diagnostic command requests to a running JVM. Because jmap is officially marked as unsupported, jcmd can be used as an alternative.
■ visualvm: Connects to a JVM process through the JMX interface to show thread, memory, class, and other information. Various plug-ins can be installed. (Tomcat's JMX interface can be enabled through CATALINA_OPTS.)
■ jconsole: Similar in function to visualvm. It shows thread stack details and the memory usage of each generation, and supports invoking MBean operations remotely.
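A common way to use jstack output, not spelled out in the list above, is to match a busy operating-system thread (for example one found with top -H) to its Java stack: JDK 8 thread dumps print the native thread ID in hexadecimal as the nid field. The sketch below does that lookup on a captured dump. The sample dump, thread names, and the com.example class are made up.

```python
import re

# Illustrative excerpt of a JDK 8 thread dump; threads and classes are made up.
SAMPLE_DUMP = '''\
"main" #1 prio=5 os_prio=0 tid=0x00007f3c5c009800 nid=0x1a10 waiting on condition [0x00007f3c63d5e000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)

"http-nio-8080-exec-3" #27 daemon prio=5 os_prio=0 tid=0x00007f3c5c0a1000 nid=0x1a2b runnable [0x00007f3c2d5f4000]
   java.lang.Thread.State: RUNNABLE
        at com.example.OrderService.calc(OrderService.java:42)
'''

def find_thread(dump, os_thread_id):
    """Return the stack block whose nid equals the decimal OS thread ID, or None."""
    # jstack prints the native ID in hex, so convert before searching.
    nid = re.compile(r"\bnid=%s\b" % re.escape(hex(os_thread_id)))
    for block in dump.split("\n\n"):
        if nid.search(block):
            return block.strip()
    return None

# 6699 stands in for a thread ID reported by the operating system.
print(find_thread(SAMPLE_DUMP, 6699))
```

The word boundary in the pattern prevents a short ID such as 0x1a from matching a longer one such as 0x1a2b.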
5. Other tools
■ jmc: Java Mission Control, a very powerful sampling-based tool that combines diagnosis, analysis, and monitoring. Because it is a paid product, it is not widely used.
■ greys-anatomy: An online diagnostic tool. By dynamically modifying bytecode, it can add logging, measure method execution time, and otherwise enhance code at runtime without restarting the JVM.
■ arthas: Alibaba's open source Java diagnostic toolbox, derived from greys-anatomy. It covers online diagnosis, bytecode decompilation, finding the most resource-intensive Java threads, and more.
■ jwebap: A JavaEE performance monitoring framework that enhances bytecode using ASM. It supports HTTP requests, JDBC connections, method call traces, call counts, and timing statistics. suishen-webap, a secondary development based on it, adds Java 8 support and Redis connection monitoring.
■ awesome-scripts: Bundles many common diagnostic tools and scripts, including greys-anatomy, sjk, and VJTools, as well as scripts for getting the stacks of the most resource-consuming threads and counting TCP connections. (To be continued)