Technical practice: online fault analysis and solutions (Part 1)
2022-07-04 00:50:00 【51CTO】
Online faults usually refer to large-scale problems or events that affect the availability of online services. Handling them is not only a technical activity but, above all, a test of the emergency-response ability of the engineers and the technical team. This article covers the classification of online faults, how to respond to them, and their common causes, and summarizes some experience and problem-solving methods for discussion with fellow engineers.
One 、 Online fault classification
■ Unexpected errors, no response, or slow response
■ The service is live, so user experience is affected
■ The service may still be running, or may be down on a large scale
■ It needs to be repaired as soon as possible
Two 、 Coping ideas
Start from experience. If someone on the emergency team has seen the same kind of problem and is confident that some action will restore normal operation, recover as soon as possible (for example, by rolling back), and be sure to preserve the scene so that the problem can be located and fixed afterwards. If no one has relevant experience, use coarse measures to keep the service available, such as scheduled restarts, rate limiting, and degradation.
The business owner, technical lead, core developers, architects, operations engineers, and business operations staff should then analyze the cause together as quickly as possible. The analysis should first consider recent changes to the system, including the following:
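As an illustration of the "rate limiting" and "degradation" measures mentioned above, here is a minimal sketch that caps the number of concurrent calls and returns a fallback value when the cap is reached. The class name `LimitedCall` and the fallback values are hypothetical and only serve the example; a production system would more likely rely on an existing gateway or library.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical helper: caps concurrent calls and degrades to a fallback when saturated. */
final class LimitedCall<T> {
    private final Semaphore permits;
    private final Supplier<T> fallback;

    LimitedCall(int maxConcurrent, Supplier<T> fallback) {
        this.permits = new Semaphore(maxConcurrent);
        this.fallback = fallback;
    }

    /** Runs the action if a permit is free; otherwise returns the degraded result at once. */
    T call(Supplier<T> action) {
        if (!permits.tryAcquire()) {
            return fallback.get(); // limit reached: degrade instead of queueing
        }
        try {
            return action.get();
        } finally {
            permits.release(); // always give the permit back, even on exceptions
        }
    }

    public static void main(String[] args) {
        // Assumed values: at most 2 concurrent calls, a static default as the degraded answer.
        LimitedCall<String> limited = new LimitedCall<>(2, () -> "cached default");
        System.out.println(limited.call(() -> "live result"));
    }
}
```

Rejecting excess calls immediately, rather than queueing them, keeps threads from piling up behind a slow dependency, which is usually the point of limiting traffic during an incident.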
■ Has the system been released recently?
■ Are the callers of the service running any operational campaigns?
■ Has network traffic fluctuated?
■ Has business volume increased recently?
■ Have operations staff changed anything in the system?
■ Have the underlying platforms and resources the system depends on been released recently?
■ Have other systems it depends on been released recently?
Three 、 Probable causes
■ Code bugs: logic that is not rigorous, connections that are not released
■ Code performance: external calls inside loops, batch reads not used, regular expressions inside loops, etc.
■ Memory leaks: local caches
■ Abnormal traffic: DDoS
■ Increased business volume: capacity estimation errors
■ External system problems: performance problems in databases, search engines, distributed caches, message queues, and other middleware
■ Abnormal CPU, memory, or IO indicators
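To make the "Memory leaks: local caches" cause concrete: a local cache built on a plain map grows without bound unless something evicts entries. The sketch below shows one common way to cap it, using `LinkedHashMap` in access order with `removeEldestEntry`. The class name `BoundedCache` and the size limit are assumptions for illustration, and the class is not thread-safe as written.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical local cache with a hard size cap; evicts the least recently used entry. */
final class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    BoundedCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true keeps entries in LRU order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; returning true drops the eldest entry.
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        BoundedCache<String, String> cache = new BoundedCache<>(2);
        cache.put("a", "1");
        cache.put("b", "2");
        cache.get("a");      // touch "a" so that "b" becomes the eldest entry
        cache.put("c", "3"); // exceeds the cap, so "b" is evicted
        System.out.println(cache.keySet()); // prints [a, c]
    }
}
```

For concurrent use, the map could be wrapped with `Collections.synchronizedMap`, or replaced by a dedicated caching library.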
Four 、 Three steps
■ Monitor: "I don't know what I don't know." Monitoring mechanisms are needed to discover and expose system performance problems. This generally relies on system-level or business-level monitoring tools.
■ Analyze: "I know what I don't know." This requires basic computer knowledge and analysis tools.
■ Solve: "I know what I need to know." This involves adjusting system and program parameters, and refactoring and optimizing code.
Understanding how a system should work does not make someone an expert; investigating why the system fails to work properly does.
1. Preparation
Preparation and knowledge needed for fault analysis, based on CentOS 6.5 and JDK 1.8.0_121:
■ Basic computer knowledge: computer networks, operating systems, computer organization
■ Java memory management: garbage collection algorithms, garbage collectors, key GC parameters, the JVM memory model, etc.
■ Java benchmark testing: JMH (a micro-benchmarking framework) can be used, which removes the effect of JIT hot-code compilation on the measurements
■ HotSpot virtual machine architecture
■ System parameter tuning
■ Proficiency with common system diagnostic tools, the diagnostic tools bundled with the JDK, and other diagnostic tools
■ Understanding of the business system: overall architecture, where the load falls, capacity estimates, and the versions, modes, and parameters of the related software
2. Common system diagnostic tools: bundled with CentOS
■ uptime: shows how long the system has been running and its load averages over the past 1, 5, and 15 minutes. The load average is the average number of tasks that are running, runnable but waiting for a free processor, or blocked in uninterruptible sleep (waiting for IO, state D).
The first field is the current system time, printed in 24-hour format, for example 22:36:32.
The second field is the uptime. up 10 days, 11:21 means the machine has been running for 10 days, 11 hours, and 21 minutes. It resets when the system restarts.
The third field is the number of logged-in users. 1 user means one user is currently logged in.
The last field is the load average. 0.00, 0.05, 0.07 are the average loads over the past 1, 5, and 15 minutes respectively. Lower load means less contention for CPU and IO; compare the values with the number of CPU cores when judging whether the system is busy.
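The same 1-minute load average can also be read from inside a Java process through the standard `OperatingSystemMXBean`. The sketch below divides it by the CPU count, since a load of 4 means something different on 2 cores than on 16. The class name `LoadCheck` and the helper `loadPerCpu` are hypothetical, and treating a per-CPU value above 1.0 as "busy" is a rule of thumb rather than a hard threshold.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

/** Hypothetical helper that relates the system load average to the CPU count. */
final class LoadCheck {
    /** Load average divided by CPU count; above roughly 1.0, tasks are likely queueing. */
    static double loadPerCpu(double loadAverage, int cpus) {
        return loadAverage / cpus;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage(); // 1-minute average; negative if unavailable
        int cpus = os.getAvailableProcessors();
        if (load < 0) {
            System.out.println("load average is not available on this platform");
            return;
        }
        System.out.printf("load=%.2f cpus=%d perCpu=%.2f%n", load, cpus, loadPerCpu(load, cpus));
    }
}
```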
■ dmesg | tail: prints the last 10 lines of the kernel log. Common events such as OOM kills and TCP packet drops are recorded here.
■ free -m: shows system memory usage; -m displays the values in megabytes. Buffers and cache are counted in used on the first line, so the second line (-/+ buffers/cache) is what really reflects memory usage. If available memory runs low, the system uses the swap area, which adds IO overhead and reduces performance.
■ vmstat 1: a real-time performance monitoring tool that reports server state at a given interval, including CPU usage, memory usage, swap activity, IO reads and writes, and other core indicators. r is the number of runnable processes (running or waiting for CPU), which reflects how busy the CPU is better than the load average does; b is the number of processes blocked in uninterruptible sleep; si and so show swap-in and swap-out activity, and nonzero values mean the swap area is in use; us, sy, id, wa, and st break down CPU time as percentages that add up to 100.
The numeric argument is the sampling interval in seconds: vmstat 2, for example, collects data every 2 seconds and keeps running until the program is terminated.
■ top: shows many overall system indicators, including system load, memory usage, and CPU usage, and largely covers the functions of the commands above.
■ netstat -tanp: shows the state of TCP network connections.
The iproute toolset (ss, ip) can be used in place of netstat.
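When looking at netstat -tanp output during an incident, the first question is usually how many connections are in each state (for example, a pile of CLOSE_WAIT often points to connections that are not being closed). The sketch below counts states from netstat-style text. The class name `TcpStateCount`, the sample addresses, and the process names are made up for illustration; the parser assumes the state is the sixth whitespace-separated column, as in the usual Linux output.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper that counts TCP connections per state in netstat-style output. */
final class TcpStateCount {
    static Map<String, Integer> countStates(String netstatOutput) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : netstatOutput.split("\\R")) {
            String[] cols = line.trim().split("\\s+");
            // Skip headers and anything that is not a tcp/tcp6 row with a state column.
            if (cols.length < 6 || !cols[0].startsWith("tcp")) {
                continue;
            }
            counts.merge(cols[5], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Illustrative sample only, not captured from a real machine.
        String sample =
              "Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name\n"
            + "tcp 0 0 0.0.0.0:8080 0.0.0.0:* LISTEN 2345/java\n"
            + "tcp 0 0 10.0.0.5:8080 10.0.0.9:51234 ESTABLISHED 2345/java\n"
            + "tcp 0 0 10.0.0.5:8080 10.0.0.9:51240 ESTABLISHED 2345/java\n"
            + "tcp 1 0 10.0.0.5:40112 10.0.0.7:3306 CLOSE_WAIT 2345/java\n";
        System.out.println(countStates(sample)); // prints {CLOSE_WAIT=1, ESTABLISHED=2, LISTEN=1}
    }
}
```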
3. Common system diagnostic tools: Sysstat
■ mpstat -P ALL 1: shows the usage of each CPU. If a single CPU is extremely busy while the others are not, a single-threaded application is a likely cause.
■ sar -n DEV 1: reports the throughput of network devices, which can be used to judge whether a network device is saturated.
■ sar -n TCP,ETCP 1: shows TCP connection statistics. active/s is the number of connections actively initiated per second (connect); passive/s is the number of connections passively accepted per second (accept); retrans/s is the number of retransmissions per second, which reflects network conditions and whether packets are being lost.
■ iostat -xz 1: shows disk IO on the machine. await (ms) is the average time an IO request takes, including both time waiting in the queue and time being serviced; avgqu-sz is the average queue length of requests issued to the device; %util is the device utilization.
sar, iostat, mpstat, and pidstat all belong to the sysstat package.
4. JDK diagnostic tools
■ jstack: Java stack trace tool. Prints the thread stack traces of a given Java process, core file, or remote debug server.
■ jmap: Java memory map tool (Java Memory Map). Prints the shared-object memory map or heap details of a given Java process, core file, or remote debug server.
■ jhat: Java heap analysis tool (Java Heap Analysis Tool). Analyzes object information in a Java heap dump.
■ jinfo: Java configuration information tool (Java Configuration Information). Prints the configuration of a given Java process, core file, or remote debug server, and can also modify some JVM flags at runtime.
■ jstat: JVM statistics monitoring tool (JVM Statistics Monitoring Tool). Monitors and displays JVM performance statistics, including GC statistics.
■ jcmd: Java command-line tool (Java Command). Sends diagnostic command requests to a running JVM. Because jmap is officially marked as unsupported, jcmd can be used as an alternative.
■ visualvm: connects to a JVM process through the JMX interface to show its threads, memory, classes, and other information. Various plug-ins can be installed. (For Tomcat, the JMX interface can be enabled through CATALINA_OPTS.)
■ jconsole: similar in function to visualvm. It shows thread stack details and the memory usage of each generation, and supports invoking MBean operations remotely.
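Much of what jstack and the JMX-based tools display is also available inside the process through the standard `java.lang.management` API. As a rough sketch, the following prints a simplified thread dump (name, state, and stack frames) using `ThreadMXBean`. The class name `InProcessThreadDump` and the output layout are assumptions for illustration and do not reproduce the exact jstack format.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Hypothetical helper that builds a simplified, jstack-like dump of all live threads. */
final class InProcessThreadDump {
    static String dump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false, false: skip locked monitor and synchronizer details to keep the dump cheap.
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

An in-process dump like this can be exposed through an admin endpoint or written to a log before a forced restart, which helps preserve the scene when there is no time to run jstack by hand.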
5. Other tools
■ jmc: Java Mission Control, a very powerful sampling-based tool that combines diagnosis, analysis, and monitoring. Because it is a paid product, it is not widely used.
■ greys-anatomy: an online diagnostic tool that modifies bytecode dynamically, so that logging, method timing, and other code enhancements can be added without restarting the JVM.
■ arthas: Alibaba's open-source Java diagnostic toolbox, derived from greys-anatomy. It includes online diagnosis, bytecode decompilation, finding the most resource-intensive Java threads, and more.
■ jwebap: a JavaEE performance monitoring framework implemented with ASM bytecode enhancement. It supports tracing, call counts, and timing statistics for HTTP requests, JDBC connections, and methods. suishen-webap, a secondary development of it, adds Java 8 support and Redis connection monitoring.
■ awesome-scripts: a collection that wraps many common diagnostic tools and scripts, including greys-anatomy, sjk, VJTools, and scripts for getting the stack of the most resource-consuming thread, counting TCP connections, and so on. (To be continued)