Bilibili (Station B) went down: what if we were the developers responsible for the fix that night?
2022-07-27 21:45:00 【Technical Trivia】
About ten days ago I saw Bilibili's post explaining the site outage of a year earlier.
My first reaction was how fast time flies. It feels as if the outage happened yesterday; I can still picture how lively Weibo and WeChat Moments were at the time.
Based on what I saw back then, my analysis was that the CDN had failed, so traffic hit the backend directly. It was the evening peak, so the sudden load knocked the backend over. It may have come back up, but by then the outage was already trending, everyone rushed in to watch, and the extra load made things worse, setting off a chain of cascading failures.
But if that were the whole story, the fix would be to cut the traffic, bring the services back up, and then let traffic in gradually. The recovery actually took quite a long time, so I suspected there was something more to it. Not being an insider, I did not want to speculate.
A year later the answer finally arrived. Bilibili's article explains the specific cause clearly, so I won't repeat it here; the link is at the end of this article.
This article instead reviews the incident from a developer's perspective: if you had been responsible for the failing module that night, what would you have done?
In other words, when something goes wrong in production in a project you own, what line of thinking lets you stop the loss and recover quickly?
Ideas for dealing with service collapse
Whether at Bilibili or at a small company, when a project you own goes down or a bug shows up in production, I believe the psychological pressure is the same.
I have dealt with many production problems. Each time I was completely absorbed, neck stiff, staring at the screen while tracking the problem down. Once the fix landed, I felt drained and utterly exhausted.
During an outage you are under intense stress and your brain may not work well; I have even seen colleagues' hands shaking on the keyboard while they investigated. So you need enough composure and preparation to handle failures; that is what lets you fix the problem quickly.
Stop loss in time
When handling a production problem, the primary goal is not to find the root cause but to stop the loss quickly.
The quickest way to stop the loss is to restart.
We joke that a restart fixes everything, and there is truth in it. In many scenarios restarting is the fastest way to resolve the problem, bar none.
Often a restart is all it takes, and the cause may never be found afterwards because the trigger is so rare; tracking it down takes a dedicated investigation.
But at that moment the restart stopped the loss as fast as possible. If you insist on preserving the crime scene, finding the root cause, and only restoring service once it is fixed, your company may have gone bankrupt by then.
On the night of the Bilibili outage, the problem was traced to the SLB, which was restarted immediately, but CPU went straight back to 100%. At that point the restart trick was no use.
Roll back
When a restart does not help, you have to repair.
Repair is not guesswork. Investigate from the symptoms: for a CPU 100% problem, for example, locate the hot spot with tools such as a profiler.
For an OOM problem, analyze the dumps and stack traces to locate it. The goal is to find the code responsible, then review the recent commits, identify the ones likely to have caused the problem, roll them back, and finally rebuild and release.
This resolves most cases where a restart does not help.
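The "review the recent commits" step can be narrowed down mechanically. Below is a minimal sketch, assuming you keep a release history with deployment timestamps; the `suspect_releases` function, the record layout, and the 24-hour window are all hypothetical and only illustrate the idea of looking at what changed shortly before the incident.

```python
from datetime import datetime, timedelta

def suspect_releases(releases, incident_start, window=timedelta(hours=24)):
    """Return releases deployed within `window` before the incident, newest first."""
    recent = [
        r for r in releases
        if incident_start - window <= r["deployed_at"] <= incident_start
    ]
    # Newest first: the most recent change is usually the first rollback candidate.
    return sorted(recent, key=lambda r: r["deployed_at"], reverse=True)

# Made-up release history, for illustration only.
history = [
    {"id": "r1", "deployed_at": datetime(2024, 1, 1, 9, 0)},
    {"id": "r2", "deployed_at": datetime(2024, 1, 4, 15, 0)},
    {"id": "r3", "deployed_at": datetime(2024, 1, 4, 21, 0)},
    {"id": "r4", "deployed_at": datetime(2024, 1, 4, 23, 30)},  # after the incident
]
incident = datetime(2024, 1, 4, 22, 52)
print([r["id"] for r in suspect_releases(history, incident)])  # ['r3', 'r2']
```

The list is only a starting point for the investigation; a change outside the window can still be the culprit if its trigger is rare.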
Once service is restored, take your time reworking the problematic code, test it rigorously, and then release it again.
But pay attention to the upstream and downstream effects when rolling back code.
I have run into this myself: after a rollback, another service was affected and went down. So however urgent the situation, be careful, think through the upstream and downstream services, and avoid causing secondary damage.
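One way to make the upstream check less dependent on memory is to keep a map of which service calls which. The sketch below assumes such a map exists; the `CALLS` data and the `affected_by` function are hypothetical names, and the service names are made up.

```python
from collections import deque

# Hypothetical dependency map: service -> services it calls.
CALLS = {
    "gateway": ["feed", "user"],
    "feed": ["user", "recommend"],
    "recommend": ["user"],
    "user": [],
}

def affected_by(service, calls):
    """Return every service that directly or indirectly calls `service`."""
    # Invert the map: for each service, record who calls it.
    callers = {name: set() for name in calls}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    # Breadth-first walk up the caller chain.
    seen, queue = set(), deque([service])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return sorted(seen)

print(affected_by("recommend", CALLS))  # ['feed', 'gateway']
```

Before rolling back `recommend` in this made-up example, you would at least check `feed` and `gateway` for compatibility with the older version.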
The best approach is to have several people check together. One person's thinking has blind spots, especially under stress.
Even if you are not the main handler of the incident, look at it alongside your colleagues if you can, and have them review your work in turn. This is both more efficient and safer.
There is truth in the saying that there is strength in numbers.
Run a plan B in parallel if you can
What if rolling back the code does not help, that is, a temporary fix is not possible?
Ideally a plan B is already being executed in parallel at this point, though I admit this is hindsight.
After Bilibili's SLB went down, several code rollbacks failed to help. Only later did they build a new SLB cluster, and the business recovered gradually once it switched to the new SLB.
So, in hindsight, the best plan that night would have been to have other people build the new SLB in parallel, to avoid the awkward situation where the old SLB could not be repaired.
Bilibili itself said it was short-staffed: with only two people, parallel work was not possible.
Still, I think the normal approach is definitely to roll back the code and rebuild for release, since problems are usually caused by recent changes.
And business applications like ours can really only be repaired; unlike infrastructure, they cannot simply be rebuilt.
So for a business problem there is usually no suitable plan B. It is still worth asking when you hit a problem: what if there is one?
Preparation before failure
The above covers handling during an incident. In fact, infrastructure preparation before the incident is just as critical.
The public network architecture diagram Bilibili drew is only a high-level abstraction; there are many more nodes beneath it.
The key to troubleshooting is quickly locating the module at fault; only then can you bring in the right people to investigate and repair. That depends on monitoring.
Bilibili, for example, quickly narrowed the problem down to CPU at 100% on the layer-7 SLB in the business's main data center, which made the service unavailable.
So monitoring is very important; it is our eyes when handling a failure.
Monitoring is a complex topic that would take several articles to cover, so here are just the areas to monitor:
- Operating system
- Cache
- Database
- Message queue
- Application services
- Logs
If your services run on K8S, you also need K8S monitoring.
Each area has many finer-grained items. For a database, beyond the basics you also need to monitor the number of connections, the number of threads, lock information, slow queries, QPS, and so on.
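As a rough illustration of what an alert on those database items could look like, here is a minimal sketch that compares one sample of metrics against fixed limits. The metric names, the `THRESHOLDS` values, and the `breached` function are assumptions made up for the example; real limits depend on the database and the workload, and a real setup would use a monitoring system rather than a script.

```python
# Hypothetical thresholds; tune them for your own database and workload.
THRESHOLDS = {
    "connections": 800,
    "threads_running": 64,
    "lock_waits": 20,
    "slow_queries_per_min": 10,
    "qps": 5000,
}

def breached(sample, thresholds=THRESHOLDS):
    """Return the metrics in `sample` whose value exceeds its threshold."""
    return {
        name: value
        for name, value in sample.items()
        if name in thresholds and value > thresholds[name]
    }

sample = {"connections": 950, "qps": 1200, "slow_queries_per_min": 10}
print(breached(sample))  # {'connections': 950}
```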
To handle problems smoothly you also need to know some common analysis tools; Bilibili, for example, analyzed the CPU problem with the Linux perf command.
Developers may not be familiar with these; after all, we are not SREs. But some basic commands are worth knowing: top shows per-process CPU usage, vmstat shows the number of context switches, and dstat shows network and I/O activity.
There are also basic concepts such as the CPU metrics us (user), ni (nice), sy (system), and id (idle).
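To make these metrics concrete: on Linux, tools like top derive them from the tick counters on the aggregate `cpu` line of `/proc/stat`, by comparing two samples taken some time apart. The sketch below does the same arithmetic on two sample lines; the function names and the sample values are made up for illustration.

```python
# Order of the first eight counters on the "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")

def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of tick counters."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, map(int, parts[1:1 + len(FIELDS)])))

def cpu_percentages(before, after):
    """Percentage of time spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}

# Two made-up samples; on a real machine, read /proc/stat twice a second apart.
before = parse_cpu_line("cpu  100 0 50 850 0 0 0 0 0 0")
after = parse_cpu_line("cpu  160 0 70 970 0 0 0 0 0 0")
usage = cpu_percentages(before, after)
print(usage["user"], usage["system"], usage["idle"])  # 30.0 10.0 60.0
```

A high `user` share points at application code, while a high `system` share points at time spent in the kernel, which is a useful first split when CPU sits at 100%.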
I recommend learning this material so that you at least have the concepts.
Review after the failure
Every company should hold a review after an incident.
The review covers what caused the problem, where the process fell short, which controls need strengthening, and so on; I won't go on about the details.
And then there is the blame. How to put it: if it is yours, own it; if it is someone else's, don't take it on by mistake.
Finally
Troubleshooting is really not easy, especially when financial losses have already occurred. It takes solid skills; otherwise you will simply be at a loss.
Speaking of production problems, there is also chaos engineering, which originated with Chaos Monkey, built by Netflix engineers.
In short, it is like a monkey jumping around wrecking your system: it deliberately makes the production system fail from time to time, so developers have to fix it. This simulates real failures, exercises the coordination between upstream and downstream teams, and verifies the system's resilience.
It is essentially a drill: you make trouble for yourself and practice on purpose to improve the system's reliability. If you are interested, it is worth looking into.
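The core idea can be shown in a few lines. This is not Netflix's implementation, only a toy sketch of the selection step: pick one running instance at random, skipping anything marked as protected. The `pick_victim` function and the instance names are hypothetical, and actually terminating the instance is left out.

```python
import random

def pick_victim(instances, protected=(), rng=random):
    """Pick one instance to terminate at random, skipping protected ones."""
    candidates = [name for name in instances if name not in protected]
    if not candidates:
        return None  # nothing is safe to break
    return rng.choice(candidates)

# Hypothetical fleet; the database primary is excluded from the drill.
fleet = ["web-1", "web-2", "worker-1", "db-primary"]
victim = pick_victim(fleet, protected=("db-primary",))
print(f"drill: would terminate {victim}")
```

Keeping a protected list is a deliberate choice for the sketch: a drill should verify resilience, not cause the very losses it is meant to prevent.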