当前位置：网站首页>Station B collapsed. If we were the developer responsible for the repair that night

Station B collapsed. If we were the developer responsible for the repair that night

2022-07-27 21:45:00 【Technical Trivia】

As early as ten days ago , And I saw B The post explaining the website crash a year ago .

The first reaction was Time past quickly. , Always feel that B The station collapsed as if yesterday , In my mind, I can still picture the lively microblogs and circles of friends at that time .

According to the scene , The reason I analyzed is CDN Something went wrong. , The traffic goes directly to the back , Due to the peak traffic at night , All of a sudden, the traffic is too large and I hang up , Although it's alive , But maybe after I hung up, I got hot search again , So everyone wants to see the excitement , So it gets worse , This led to a series of linkage failures .

However, if this is only the case, the flow should be cut , Then, after the service starts, it should be good if the traffic comes in bit by bit , But the final recovery time is still relatively long , I thought there should be something else , But we are not insiders , We dare not talk nonsense .

A year later , The answer finally came , The specific reason B The article of the station has been written clearly , I won't repeat it here , The article link I posted at the end of the article .

And my article is mainly to set a plate from the perspective of developers , If you are responsible for the development of the repair module that night , How to do ？

in other words , If something goes wrong online in the project you are responsible for , What kind of thinking can stop loss quickly and recover quickly ？

Ideas for dealing with service collapse

In fact, no matter in B The station is still some small enterprises , When the project in charge collapses or something happens online BUG When , I believe everyone's psychological pressure is the same .

I have encountered many online problems , Basically, they are all absorbed , Neck straight Stare at the screen and check the problem , With the problem successfully fixed , The whole person seems to be paralyzed , I feel very tired .

In the face of service collapse and other problems , The spirit is highly nervous , The brain may not work well , I even saw my colleagues shaking their hands on the keyboard to see problems , So in this scenario , We need enough psychological quality and preparation to deal with mistakes , This can quickly fix the problem .

Stop loss in time

Dealing with online problems , Our primary goal is not to identify problems , But stop loss in time .

The best way to stop loss in time is ： restart .

We all say restart to solve everything , This is not unreasonable . In many scenarios, restarting is the fastest way to solve the problem , Not one of them. .

A lot of times , Just restart , The reason may not be found afterwards , Because the probability of triggering the problem is very low , This requires a special team to investigate and solve .

But at that time, the service stopped losing at the fastest speed , If you want to keep the crime scene , Try to find the root cause of the problem , After it is solved, the service will be improved , Maybe your company will go bankrupt .

image B Station that night is positioned to SLB It's broken down , Immediately restart , But immediately CPU 100% 了 , It's useless to restart Dafa at this time .

Backtracking rollback

It's useless to restart the method , It can only be repaired .

Repair is not random , Check according to the phenomenon , For example CPU 100%   problem , Then locate the problem through tools such as performance analysis .

If it is OOM problem , Then analyze the stack positioning problem , The main purpose is to find the code that causes the problem , Then check the recent submission records , Analyze the submissions that may cause problems , Then rollback , Finally, repackage and release .

Basically, the invalid restart can be solved in this way .

After it is solved , Then slowly build the code with problems , After rigorous testing , Then publish .

But pay attention to , Pay attention to the upstream and downstream effects when rolling back code .

I have encountered this situation before , After rollback, other services are affected , Then I hung up another service , So even if the situation was urgent , Also be careful , Be careful Lenovo upstream and downstream Services , Prevent secondary injury .

The best way is to check with many people , A person sometimes has limitations in his thinking , Especially in stressful situations .

Even if you are not the main handler of this accident , But if you can, you'd better see things with your colleagues , In turn, let your colleagues watch it for you , This way is more efficient and safe .

It makes sense that many people have great power .

Conditional plan B parallel

In case rollback code is useless ？ That is to say, temporary repair cannot ！

At this time, in an ideal state, you need a plan B The line is executing synchronously , But I think this is a bit of an afterthought .

B Stations SLB After hanging up , Rolling back the code several times still doesn't work , Just behind A new group SLB, Business uses new SLB Only then gradually recovers .

therefore , Come back later ,B The best repair plan for the station that night should be to arrange other personnel to perform the new SLB The operation of , Prevent aging SLB Can't fix the embarrassing scene .

B Standing by itself means that there are insufficient personnel , Just Two people , There is no parallel .

But I think the normal way of thinking is definitely to rollback the code and repackage it for release , Generally, problems are caused by recent changes .

And like our business applications , In fact, it can only be repaired , Unlike infrastructure, which can be rebuilt , Right .

So if there is a business problem , Basically, there is no suitable plan B, But this is also an idea , When you encounter problems, you can think , In case there is ？

Preparation before failure

The above is the handling method in the accident , In fact, infrastructure preparation before the accident is very critical .

B The public network architecture diagram drawn by the station is only a high-dimensional abstraction , There are many more nodes below .

The key to troubleshooting is to quickly locate the module where the problem lies , Only in this way can we mobilize relevant personnel to check and repair , This involves monitoring .

image B The station quickly locates to the seventh floor of the business host room SLB CPU 100% The problem of , Lead to the unavailability of the business .

So monitoring is very important , It is our eyes when dealing with faults .

Of course , Monitoring is also very complex , It may take several articles to spread the story , Let me talk about the direction of monitoring ：

operating system
cache
database
Message queue
Application service
journal

Of course , If your service is successful K8S, That still needs K8S Monitoring of .

There are many more specific subdivisions below , For example, in addition to some basic elements of the database , You also need to monitor the number of connections 、 Number of threads 、 Lock information 、 The slow query 、qps wait .

Want to deal with problems smoothly , You also need to master some common analysis tools , such as B Station analysis CPU The question is Linux Of perf command .

Of course , These may not be familiar to our development , After all, we are not SRE, But some basic commands still need to be mastered , such as top The command can see the process CPU Usage rate ,vmstat Look at the number of context switches ,dstat You can watch the Internet and I/O Situation, etc .

There are also basic concepts , such as CPU Relevant indicators us（user）、ni（nice）、sys（system）、id（idle） wait .

It is suggested to learn about this knowledge , You still have to have a concept .

Redo before the failure

Every company should resume after something happens .

The reason for the problem in the second round , Where are the deficiencies in the process , What aspects need to be strengthened control and so on , Not much specific BB 了 .

Of course , And the pot , How to say this , Is it your own or your own , It's someone else's. don't be confused .

Last

Troubleshooting is really not easy , Especially when asset losses have occurred , This requires solid strength , Otherwise, a monk will be confused directly .

Speaking of online problems , There's actually another one Chaos Engineering , This project comes from Netflix The engineer messed up the monkey （Chaos Monkey）.

In short, it's like a monkey jumping up and down to destroy your system , From time to time, the online system will fail （ Build a monkey to destroy your system ）, So developers are needed to fix it , In order to simulate the real error , So as to temper the cooperation ability of upstream and downstream , Verify the resilience of the system .

Is to do exercises , Make trouble for yourself , Mandatory practice , Improve the reliability of the system , Interested friends can get to know .

原网站

版权声明
本文为[Technical Trivia]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/208/202207271910056713.html