Technical Practice: Online Fault Analysis and Solutions (Part 1)
2022-07-04 00:50:00 【51CTO】
Online faults usually refer to large-scale problems or events that affect the availability of online services. Handling them is not only a technical activity but, even more, a test of the emergency response ability of the engineers and the technical team. This article starts from the classification of online faults, the approach to handling them, and their likely causes, and summarizes some experience and problem-solving methods for discussion with peers.
One. Online fault classification
■ Unexpected errors, no response, or slow response
■ The service is live and the user experience is affected
■ The system has not gone down entirely, or has gone down on a large scale
■ It needs to be repaired as soon as possible
Two. Response approach
Start from experience. If someone on the emergency team has dealt with a similar problem and is confident that some measure will restore normal operation, recover as soon as possible (by rolling back, for example), while making sure to preserve the scene so that the problem can be located and fixed afterwards. If nobody has relevant experience, fall back on blunt measures that keep the service available, such as scheduled restarts, rate limiting, and degradation.
The business owner, technical lead, core developers, architects, operations engineers, and business operations staff then analyze the cause of the problem quickly. The analysis should first consider recent changes to the system, including the following:
■ Has the system had a recent release or deployment?
■ Are the consumers of the service running any operational campaigns?
■ Is there traffic fluctuation on the network?
■ Has the business volume increased recently?
■ Have the business operations staff made changes in the system?
■ Have the underlying platforms and resources that the system depends on had a recent release?
■ Have other systems that this system depends on had a recent release?
Three. Probable causes
■ Code bugs: logic that is not rigorous, connections that are not released
■ Code performance: external calls inside loops, batch reads not used, regular expressions that loop, and so on
■ Memory leaks: local caches
■ Abnormal traffic: DDoS
■ Business volume growth: capacity estimation errors
■ External system problems: performance problems in databases, search engines, distributed caches, message queues, and other middleware
The visible symptom is abnormal CPU, memory, or IO metrics.
Four. Three steps
■ Monitor: "I don't know what I don't know." Monitoring mechanisms are needed to discover and expose system performance problems. This generally relies on system-level or business-level monitoring tools.
■ Analyze: "I know what I don't know." This requires basic computer knowledge and analysis tools.
■ Solve: "I know what I know." This means adjusting system and program parameters, and refactoring and optimizing code.
Knowing how a system should work does not make someone an expert; only investigating why a system does not work properly does.
1. Preparation

Preparation and knowledge needed for fault analysis (based on CentOS 6.5 and JDK 1.8.0_121):
■ Basic computer knowledge: computer networks, operating systems, principles of computer organization
■ Java memory management: garbage collection algorithms, garbage collectors, key GC parameters, the JVM memory model, and so on
■ Java code benchmarking: JMH (a micro-benchmarking framework) can be used, which removes the impact of JIT hot-code compilation on the measurements
■ HotSpot virtual machine architecture
■ System parameter tuning
■ Proficiency with the system's common diagnostic tools, the diagnostic tools bundled with the JDK, and other diagnostic tools
■ Understanding of the business system: overall architecture, where the load falls, capacity estimates, and the versions, modes, and parameters of the related software
2. Common system diagnostic tools: bundled with CentOS
■ uptime: shows how long the system has been running and the load average, that is, the average number of runnable tasks over the last 1, 5, and 15 minutes. This counts tasks that are running, tasks that are runnable but waiting for a free processor, and tasks blocked in uninterruptible sleep (waiting for IO, state D).
The first part is the system time. The first field from the left, 22:36:32, is the current system time in 24-hour format.
The second part is the running time. up 10 days, 11:21 means the machine has been running for 10 days, 11 hours, and 21 minutes. It is reset when the system restarts.
The third part is the number of logged-in users. 1 user means that one user is currently logged in.
The last part is the load average. 0.00, 0.05, 0.07 are the average loads over the past 1, 5, and 15 minutes. The lower the load, the less contention there is for the system's resources.
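As a rough illustration of reading these fields programmatically, the sketch below parses the sample line described above and divides the load averages by a core count. The function names, the core count of 4, and the per-core comparison are assumptions made for illustration; the regular expressions assume the English-locale `uptime` format.

```python
import re

# The line described above, as printed by `uptime` in an English locale.
UPTIME_SAMPLE = " 22:36:32 up 10 days, 11:21,  1 user,  load average: 0.00, 0.05, 0.07"


def parse_uptime(line):
    """Return the user count and the 1/5/15-minute load averages from one uptime line."""
    users = re.search(r"(\d+)\s+users?", line)
    loads = re.search(r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", line)
    if loads is None:
        raise ValueError("no load average found in: %r" % line)
    return {
        "users": int(users.group(1)) if users else None,
        "load": tuple(float(value) for value in loads.groups()),
    }


def load_per_core(load, cores):
    """Divide each load average by the number of CPU cores (hypothetical helper)."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return tuple(value / cores for value in load)


info = parse_uptime(UPTIME_SAMPLE)
# cores=4 is an assumed value; on a real host it could come from os.cpu_count().
print(info["users"], load_per_core(info["load"], cores=4))
```

A per-core value that stays above 1.0 is a common rule of thumb for saturation, but because the load average also counts tasks in state D, a high value can point to IO waits rather than CPU pressure.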
■ dmesg | tail: prints the last 10 lines of the kernel log. Common events such as OOM kills and TCP packet drops are recorded here.
■ free -m: shows the system's memory usage; the -m parameter displays values in megabytes. Buffers and cache are both counted in used, so the line that really reflects memory usage is the second one (-/+ buffers/cache). If little memory is available, the system uses the swap area, which adds IO overhead and reduces performance.
■ vmstat 1: a real-time performance monitoring tool that reports the server's state at a given interval (here every second), including core indicators such as CPU usage, memory usage, swap activity, and IO reads and writes. r is the number of processes waiting for CPU, which reflects how busy the CPU is better than the load average does; b is the number of processes blocked in uninterruptible sleep; si and so show swap-in and swap-out activity, and non-zero values mean the swap area is being used; us, sy, id, wa, and st break down CPU time, and together they add up to roughly 100.
For example, vmstat 2 collects data every 2 seconds and keeps running until the program is terminated.
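The vmstat columns discussed above can be checked mechanically. The following is a minimal sketch that parses vmstat text into dictionaries and flags the two conditions mentioned (a run queue longer than the CPU count, and swap activity). The sample numbers are invented for illustration, and the function names and the core count of 4 are assumptions.

```python
# Hypothetical vmstat output; the values are invented to exercise the checks below.
VMSTAT_SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 812345  10240 402100    0    0     3     8  120  240  4  1 94  1  0
 9  2  20480  10112   2048  51200   12   30   450   900 2100 5200 70 20  2  8  0
"""


def parse_vmstat(text):
    """Turn vmstat output into a list of {column: int} dictionaries, one per sample row."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    # The column-name row is the one whose first token is "r".
    names = next(cols for cols in rows if cols[0] == "r")
    return [
        dict(zip(names, (int(token) for token in cols)))
        for cols in rows
        if len(cols) == len(names) and all(token.isdigit() for token in cols)
    ]


def warnings_for(sample, cores):
    """Flag the conditions described in the text for one sample row."""
    notes = []
    if sample["r"] > cores:
        notes.append("run queue longer than CPU count")
    if sample["si"] or sample["so"]:
        notes.append("swap in use")
    return notes


for row in parse_vmstat(VMSTAT_SAMPLE):
    cpu_total = sum(row[name] for name in ("us", "sy", "id", "wa", "st"))
    print(cpu_total, warnings_for(row, cores=4))
```

Keep in mind that the first row vmstat prints reports averages since boot, so in practice it is usually the later rows that matter.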
■ top: shows a large amount of overall system information, including system load, memory usage, and CPU usage. It basically covers the functions of the commands above.
■ netstat -tanp: shows the state of TCP network connections.
The iproute toolset (ss, ip) can replace netstat.
3. Common system diagnostic tools: Sysstat
■ mpstat -P ALL 1: shows the usage of each CPU. If one CPU has extremely high usage while the others do not, a single-threaded application may be the cause.
■ sar -n DEV 1: here sar is used to check the throughput of network devices, which shows whether a network device is saturated.
■ sar -n TCP,ETCP 1: shows TCP connection statistics. active/s is the number of actively initiated connections per second (connect); passive/s is the number of passively accepted connections per second (accept); retrans/s is the number of retransmissions per second, which reflects network conditions and whether packet loss has occurred.
■ iostat -xz 1: shows the machine's disk IO. await (ms) is the average time an IO request takes, including both the time spent waiting in the queue and the actual service time; avgqu-sz is the average queue length of requests issued to the device; %util is the device utilization.
sar, iostat, mpstat, and pidstat all belong to the sysstat package.
4. JDK diagnostic tools
■ jstack: the Java stack trace tool, mainly used to print the thread stack traces of a given Java process, core file, or remote debug server.
■ jmap: the Java memory map tool (Java Memory Map), mainly used to print the shared object memory map or heap memory details of a given Java process, core file, or remote debug server.
■ jhat: the Java heap analysis tool (Java Heap Analysis Tool), used to analyze object information in Java heap memory.
■ jinfo: the Java configuration information tool (Java Configuration Information), used to print the configuration of a given Java process, core file, or remote debug server. It can also modify some JVM parameters dynamically.
■ jstat: the JVM statistics monitoring tool (JVM Statistics Monitoring Tool), mainly used to monitor and display JVM performance statistics, including GC statistics.
■ jcmd: the Java command-line tool (Java Command), used to send diagnostic command requests to a running JVM. Because jmap is officially marked as unsupported, jcmd can be used as an alternative.
■ visualvm: connects to a JVM process through the JMX interface to show thread, memory, class, and other information about the JVM. Various plug-ins can be installed. (For Tomcat, the JMX interface is enabled through CATALINA_OPTS.)
■ jconsole: similar in function to visualvm. It can show detailed thread stack information and the usage of each memory generation, and it supports invoking MBean operations remotely.
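One common first pass over a jstack dump is to count threads by state, since many BLOCKED or WAITING threads often indicate lock contention or an exhausted pool. The sketch below does this for text shaped like HotSpot's `java.lang.Thread.State:` lines. The sample dump, thread names, and addresses are hypothetical, as is the function name.

```python
import re
from collections import Counter

# Hypothetical excerpt shaped like jstack output; names and addresses are invented.
JSTACK_SAMPLE = """\
"worker-1" #21 prio=5 os_prio=0 tid=0x00007f3c2c0a1000 nid=0x1a2b runnable [0x00007f3c0d5f4000]
   java.lang.Thread.State: RUNNABLE
    at com.example.Job.run(Job.java:42)

"worker-2" #22 prio=5 os_prio=0 tid=0x00007f3c2c0a3000 nid=0x1a2c waiting for monitor entry [0x00007f3c0d4f3000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at com.example.Job.update(Job.java:57)

"worker-3" #23 prio=5 os_prio=0 tid=0x00007f3c2c0a5000 nid=0x1a2d waiting for monitor entry [0x00007f3c0d3f2000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at com.example.Job.update(Job.java:57)

"timer" #24 daemon prio=5 os_prio=0 tid=0x00007f3c2c0a7000 nid=0x1a2e waiting on condition [0x00007f3c0d2f1000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
"""


def thread_states(dump):
    """Count threads per state name (RUNNABLE, BLOCKED, WAITING, TIMED_WAITING, ...)."""
    pattern = r"^\s*java\.lang\.Thread\.State: (\w+)"
    return Counter(re.findall(pattern, dump, flags=re.MULTILINE))


for state, count in thread_states(JSTACK_SAMPLE).most_common():
    print(state, count)
```

In practice the dump would come from running jstack against the target process and saving its output to a file; comparing the counts from several dumps taken a few seconds apart helps separate a transient spike from a persistent stall.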
5. Other tools
■ jmc: Java Mission Control, a very powerful sampling-based tool that combines diagnosis, analysis, and monitoring. Because it requires a paid license, it is not widely used.
■ greys-anatomy: an online diagnostic tool. By modifying bytecode dynamically, it can add logging, measure method execution time, and otherwise enhance code without restarting the JVM.
■ arthas: Alibaba's open-source Java diagnostic toolbox, derived from greys-anatomy. It covers online diagnosis, bytecode decompilation, finding the most resource-intensive Java threads, and more.
■ jwebap: a JavaEE performance monitoring framework, implemented with ASM bytecode enhancement. It supports HTTP requests, JDBC connections, method traces and call counts, and timing statistics. The derivative suishen-webap adds Java 8 support and Redis connection monitoring.
■ awesome-scripts: a collection that wraps many common diagnostic tools and scripts, including greys-anatomy, sjk, VJTools, and scripts for getting the stack of the most resource-consuming threads and for counting TCP connections. (To be continued)
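Finding the stack of the most resource-consuming thread usually follows one manual recipe: take the busy thread's ID from `top -H -p <pid>`, convert it to hexadecimal, and look for the matching `nid=0x...` in a jstack dump. The sketch below shows only the matching step. It is not how awesome-scripts implements it; the function name, sample dump, and thread ID 6699 are assumptions for illustration.

```python
import re

# Hypothetical jstack excerpt; thread names and addresses are invented.
DUMP_SAMPLE = """\
"http-exec-1" #25 daemon prio=5 os_prio=0 tid=0x00007f3c2c0a1000 nid=0x1a2b runnable [0x00007f3c0d5f4000]
   java.lang.Thread.State: RUNNABLE
    at com.example.Hot.spin(Hot.java:12)

"main" #1 prio=5 os_prio=0 tid=0x00007f3c2c009800 nid=0x1a2 waiting on condition [0x00007f3c33b5d000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
"""


def stack_for_tid(dump, tid):
    """Return the dump block whose nid matches the decimal OS thread ID, or None."""
    # jstack prints the native thread ID in hex, e.g. 6699 -> nid=0x1a2b.
    marker = re.escape("nid=" + hex(tid)) + r"\b"
    for block in re.split(r"\n\s*\n", dump.strip()):
        header = block.splitlines()[0]
        if re.search(marker, header):
            return block
    return None


# 6699 stands in for a thread ID read from the PID column of `top -H`.
print(stack_for_tid(DUMP_SAMPLE, 6699))
```

The trailing word boundary in the pattern matters: without it, a search for `nid=0x1a2` would also match the thread with `nid=0x1a2b`.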