当前位置:网站首页>If we were the developer responsible for repairing the collapse of station B that night
If we were the developer responsible for repairing the collapse of station B that night
2022-07-27 23:57:00 【Xi. Technical chopping】
As early as ten days ago , And I saw B The post explaining the website crash a year ago .
The first reaction was Time past quickly. , Always feel that B The station collapsed as if yesterday , In my mind, I can still picture the lively microblogs and circles of friends at that time .
According to the scene , The reason I analyzed is CDN Something went wrong. , The traffic goes directly to the back , Due to the peak traffic at night , All of a sudden, the traffic is too large and I hang up , Although it's alive , But maybe after I hung up, I got hot search again , So everyone wants to see the excitement , So it gets worse , This led to a series of linkage failures .
However, if this is only the case, the flow should be cut , Then, after the service starts, it should be good if the traffic comes in bit by bit , But the final recovery time is still relatively long , I thought there should be something else , But we are not insiders , We dare not talk nonsense .
A year later , The answer finally came , The specific reason B The article of the station has been written clearly , I won't repeat it here , The article link I posted at the end of the article .
And my article is mainly to set a plate from the perspective of developers , If you are responsible for the development of the repair module that night , How to do ?
in other words , If something goes wrong online in the project you are responsible for , What kind of thinking can stop loss quickly and recover quickly ?
Ideas for dealing with service collapse
In fact, no matter in B The station is still some small enterprises , When the project in charge collapses or something happens online BUG When , I believe everyone's psychological pressure is the same .
I have encountered many online problems , Basically, they are all absorbed , Neck straight Stare at the screen and check the problem , With the problem successfully fixed , The whole person seems to be paralyzed , I feel very tired .
In the face of service collapse and other problems , The spirit is highly nervous , The brain may not work well , I even saw my colleagues shaking their hands on the keyboard to see problems , So in this scenario , We need enough psychological quality and preparation to deal with mistakes , This can quickly fix the problem .
Stop loss in time
Dealing with online problems , Our primary goal is not to identify problems , But stop loss in time .
The best way to stop loss in time is : restart .
We all say restart to solve everything , This is not unreasonable . In many scenarios, restarting is the fastest way to solve the problem , Not one of them. .
A lot of times , Just restart , The reason may not be found afterwards , Because the probability of triggering the problem is very low , This requires a special team to investigate and solve .
But at that time, the service stopped losing at the fastest speed , If you want to keep the crime scene , Try to find the root cause of the problem , After it is solved, the service will be improved , Maybe your company will go bankrupt .
image B Station that night is positioned to SLB It's broken down , Immediately restart , But immediately CPU 100% 了 , It's useless to restart Dafa at this time .
Backtracking rollback
It's useless to restart the method , It can only be repaired .
Repair is not random , Check according to the phenomenon , For example CPU 100% problem , Then locate the problem through tools such as performance analysis .
If it is OOM problem , Then analyze the stack positioning problem , The main purpose is to find the code that causes the problem , Then check the recent submission records , Analyze the submissions that may cause problems , Then rollback , Finally, repackage and release .
Basically, the invalid restart can be solved in this way .
After it is solved , Then slowly build the code with problems , After rigorous testing , Then publish .
But pay attention to , Pay attention to the upstream and downstream effects when rolling back code .
I have encountered this situation before , After rollback, other services are affected , Then I hung up another service , So even if the situation was urgent , Also be careful , Be careful Lenovo upstream and downstream Services , Prevent secondary injury .
The best way is to check with many people , A person sometimes has limitations in his thinking , Especially in stressful situations .
Even if you are not the main handler of this accident , But if you can, you'd better see things with your colleagues , In turn, let your colleagues watch it for you , This way is more efficient and safe .
It makes sense that many people have great power .
Conditional plan B parallel
In case rollback code is useless ? That is to say, temporary repair cannot !
At this time, in an ideal state, you need a plan B The line is executing synchronously , But I think this is a bit of an afterthought .
B Stations SLB After hanging up , Rolling back the code several times still doesn't work , Just behind A new group SLB, Business uses new SLB Only then gradually recovers .
therefore , Come back later ,B The best repair plan for the station that night should be to arrange other personnel to perform the new SLB The operation of , Prevent aging SLB Can't fix the embarrassing scene .
B Standing by itself means that there are insufficient personnel , Just Two people , There is no parallel .
But I think the normal way of thinking is definitely to rollback the code and repackage it for release , Generally, problems are caused by recent changes .
And like our business applications , In fact, it can only be repaired , Unlike infrastructure, which can be rebuilt , Right .
So if there is a business problem , Basically, there is no suitable plan B, But this is also an idea , When you encounter problems, you can think , In case there is ?
Preparation before failure
The above is the handling method in the accident , In fact, infrastructure preparation before the accident is very critical .
B The public network architecture diagram drawn by the station is only a high-dimensional abstraction , There are many more nodes below .

The key to troubleshooting is to quickly locate the module where the problem lies , Only in this way can we mobilize relevant personnel to check and repair , This involves monitoring .
image B The station quickly locates to the seventh floor of the business host room SLB CPU 100% The problem of , Lead to the unavailability of the business .
So monitoring is very important , It is our eyes when dealing with faults .
Of course , Monitoring is also very complex , It may take several articles to spread the story , Let me talk about the direction of monitoring :
operating system
cache
database
Message queue
Application service
journal
Of course , If your service is successful K8S, That still needs K8S Monitoring of .
There are many more specific subdivisions below , For example, in addition to some basic elements of the database , You also need to monitor the number of connections 、 Number of threads 、 Lock information 、 The slow query 、qps wait .
Want to deal with problems smoothly , You also need to master some common analysis tools , such as B Station analysis CPU The question is Linux Of perf command .
Of course , These may not be familiar to our development , After all, we are not SRE, But some basic commands still need to be mastered , such as top The command can see the process CPU Usage rate ,vmstat Look at the number of context switches ,dstat You can watch the Internet and I/O Situation, etc .
There are also basic concepts , such as CPU Relevant indicators us(user)、ni(nice)、sys(system)、id(idle) wait .
It is suggested to learn about this knowledge , You still have to have a concept .
Redo before the failure
Every company should resume after something happens .
The reason for the problem in the second round , Where are the deficiencies in the process , What aspects need to be strengthened control and so on , Not much specific BB 了 .
Of course , And the pot , How to say this , Is it your own or your own , It's someone else's. don't be confused .
Last
Troubleshooting is really not easy , Especially when asset losses have occurred , This requires solid strength , Otherwise, a monk will be confused directly .
Speaking of online problems , There's actually another one Chaos Engineering , This project comes from Netflix The engineer messed up the monkey (Chaos Monkey).
In short, it's like a monkey jumping up and down to destroy your system , From time to time, the online system will fail ( Build a monkey to destroy your system ), So developers are needed to fix it , In order to simulate the real error , So as to temper the cooperation ability of upstream and downstream , Verify the resilience of the system .
Is to do exercises , Make trouble for yourself , Mandatory practice , Improve the reliability of the system , Interested friends can get to know .

边栏推荐
猜你喜欢
![[NCTF2019]babyRSA1](/img/c1/52e79b6e40390374d48783725311ba.gif)
[NCTF2019]babyRSA1

基于mediapipe的姿态识别和简单行为识别

Design and implementation of spark offline development framework

Starfish OS X metabell strategic cooperation, metauniverse business ecosystem further

Is it really hard to understand? What level of cache is the recyclerview caching mechanism?

主数据管理理论与实践

Character stream learning 14.3

Redis 哈希Hash底层数据结构
Edit the copy and paste judgment problem (bug?), WYSIWYG display symbol problem feedback.

C # delegate usage -- console project, which implements events through delegation
随机推荐
BUUCTF-RSA
Edit the copy and paste judgment problem (bug?), WYSIWYG display symbol problem feedback.
洛谷 P1009 [NOIP1998 普及组] 阶乘之和
urllib.error. URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: un
Xss.haozi.me practice customs clearance
[NCTF2019]babyRSA1
Introduction to several common usage scenarios of message queue
Technical certification | Tupo software and Huawei cloud create a new situation of win-win cooperation
2022 International Conference on civil, building and Environmental Engineering (iccaee 2022)
主数据管理理论与实践
给网站套上Cloudflare(以腾讯云为例)
How to use FTP to realize automatic update of WinForm
[RoarCTF2019]babyRSA威尔逊定理
【飞控开发基础教程6】疯壳·开源编队无人机-SPI(六轴传感器数据获取)
Master data management theory and Practice
Spark 离线开发框架设计与实现
字符流学习14.3
BUUCTF-RSA4
New technology leads new changes in marketing of large and medium-sized enterprises, and UFIDA BiP CRM is launched!
BUUCTF-Dangerous RSA