Seven lines of code took Bilibili down for three hours, all because of "a scheming 0"
2022-07-25 12:54:00 【QbitAl】
By Yu Yang and Feng Se, from Aofei Temple
Qubit | Official account QbitAI
A single small character, "0", unexpectedly brought Bilibili down completely.

You may still remember that night, when stories that Bilibili's "building had lost power", that "the servers had exploded", and that "a programmer had deleted the database and run away" flew around until morning. (doge)
A year later, the real "culprit" behind it has finally been disclosed by Bilibili:

Who would have thought that just a few lines of code could take Bilibili down for two or three hours, keeping its programmers up all night and costing them some hair.
You may ask: this is an ordinary function for finding the greatest common divisor, so how can it be so destructive?
There is a long chain of causes, but in the end it comes down to one sentence: 0 really does not take well to being divided by.

For the details, let's look at the "incident report".
The string "0" behind the "murder case"
First, the root cause of the tragedy: the gcd function shown at the beginning.
Anyone who has learned a little programming will recognize it as a recursive function that computes the greatest common divisor by repeated division with remainder (the Euclidean algorithm).
Unlike the way we usually work out a greatest common divisor by hand, the algorithm goes like this:
A simple example: a=24, b=18, find the greatest common divisor of a and b;
Divide a by b; the remainder is 6, so set a=18, b=6 and continue;
Divide 18 by 6; this time the remainder is 0, so 6 is the greatest common divisor of 24 and 18.
In other words, a and b are divided with remainder over and over until b=0, at which point this statement in the function:
if b==0 then return a end
takes effect and the result is returned.
With this mathematical principle in mind, look at the code again. There seems to be no problem:
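For reference, the function can be reconstructed from the two statements quoted in this article (`if b==0 then return a end` and `return _gcd(b, a%b)`). The sketch below is that reconstruction, not a verbatim copy of the library source; the layout and comments are assumptions.

```lua
-- Euclid's algorithm, reconstructed from the statements quoted in the article.
local _gcd
_gcd = function(a, b)
    if b == 0 then
        return a            -- remainder reached 0, so a is the answer
    end
    return _gcd(b, a % b)   -- tail call with (divisor, remainder)
end

-- Worked example from the text: 24 % 18 = 6, then 18 % 6 = 0.
print(_gcd(24, 18))  --> 6
```

With numeric arguments the recursion always ends, because the remainder shrinks at every step until it reaches 0.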

But what if the input b is the string "0"?
Bilibili's technical analysis notes that the code behind the accident is written in Lua, which has the following characteristics:
It is a dynamically typed language: in common usage variables need no type declaration, and you simply assign a value to them.
When performing arithmetic on a numeric string, Lua tries to convert the string to a number.
In Lua, the arithmetic operation n%0 evaluates to nan (Not a Number).
Let's simulate the process:
1. When b is the string "0", the gcd function performs no type check on it, so at the conditional, "0" is not equal to 0. The statement "return _gcd(b, a%b)" therefore runs and returns _gcd("0", nan).
2. _gcd("0", nan) is then executed, so the return value becomes _gcd(nan, nan).
Now things are doomed: the condition b=0 in the conditional will never be satisfied, so an infinite loop appears.
In other words, the program spins frantically in pursuit of a result it can never reach, driving CPU usage to 100%, so other user requests naturally cannot be handled.
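The three characteristics can be checked in a few lines. This is a minimal sketch; the variable names are illustrative, and the float literal in step 4 is an assumption made for portability (Lua 5.3 and later raise an error for integer modulo by zero, whereas Lua 5.1 and LuaJIT have only floating-point numbers, which matches the behavior described here).

```lua
-- 1. Dynamic typing: the same variable can hold a number, then a string.
local weight = 0
weight = "0"
print(type(weight))          --> string

-- 2. Equality does not coerce: a string never equals a number.
print("0" == 0)              --> false

-- 3. Arithmetic does coerce: numeric strings are converted to numbers.
print("10" + 1 == 11)        --> true

-- 4. Floating-point modulo by zero yields NaN instead of raising an error.
local r = 24.0 % "0"
print(r ~= r)                --> true (NaN is the only value not equal to itself)
```

The mismatch between steps 2 and 3 is the important part: the same string is treated as text in a comparison and as a number in arithmetic.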
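The two steps above can be replayed safely with a bounded version of the same logic. This is a sketch under assumptions: `trace_gcd` and `max_steps` are hypothetical names, and the recursion is rewritten as a loop with a cap so that the demonstration terminates. In Lua a call of the form `return f(...)` is a proper tail call that reuses the stack frame, which would explain why the original function spins at full CPU rather than failing with a stack overflow.

```lua
-- Same logic as _gcd, written as a loop with a step cap so the demo ends.
local function trace_gcd(a, b, max_steps)
    for step = 1, max_steps do
        if b == 0 then
            return a, step      -- normal termination
        end
        a, b = b, a % b         -- same update as _gcd(b, a % b)
    end
    return nil, max_steps       -- cap reached: the real function would still be spinning
end

-- Float literals keep the sketch portable across Lua versions.
print(trace_gcd(24.0, 18.0, 1000))   -- terminates on the third check

-- b = "0": "0" == 0 is false, 24 % "0" is NaN, and NaN never equals 0.
local result, steps = trace_gcd(24.0, "0", 1000)
print(result, steps)                 --> nil  1000
```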

So the question is: how did this "0" get in?
The official explanation is:
In a certain release mode, the instance weight of an application is temporarily set to 0, and the weight that the registry returns to the SLB (load balancer) is then the string "0". This release mode is used only in the production environment, and very rarely, so the problem was not triggered during the SLB's earlier gray (canary) rollout.
In the balance_by_lua phase, the SLB passes the service IP, Port, and Weight stored in shared memory as parameters to the lua-resty-balancer module, which selects the upstream server. When a node has weight="0", the input parameter b received by the balancer module's _gcd function may be "0".
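Whether the string ends up in the b position depends on where it enters the calculation. The sketch below folds a gcd over a list of node weights; `bounded_gcd` and `gcd_of_weights` are hypothetical helpers written for illustration, not the lua-resty-balancer code, and the cap of 1000 steps is an assumption that keeps the example from hanging.

```lua
-- Bounded stand-in for _gcd so the sketch cannot hang.
local function bounded_gcd(a, b)
    for _ = 1, 1000 do
        if b == 0 then
            return a
        end
        a, b = b, a % b
    end
    return nil  -- the unbounded function would have looped forever
end

-- Hypothetical reducer: fold the gcd over a list of node weights.
local function gcd_of_weights(weights)
    local g = weights[1]
    for i = 2, #weights do
        g = bounded_gcd(g, weights[i])
        if g == nil then
            return nil
        end
    end
    return g
end

-- Float literals keep the sketch portable across Lua versions.
print(gcd_of_weights({ 10.0, 20.0, "0" }))  --> nil: "0" arrives as b and never equals 0
print(gcd_of_weights({ "0", 10.0, 20.0 }))  -- terminates: "0" arrives as a, and "0" % 10 is 0
```

In the second call the string is coerced by the modulo operation before it is ever compared with 0, so the calculation ends normally.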
How the bug was located
With hindsight, the root cause that brought all of Bilibili down looks like a case of "is that all?".
But from the point of view of the programmers involved, things were really not that simple.
At 22:52 that evening, when most programmers had just left work or had not left yet (doge), Bilibili's operations team received alerts that services were unavailable and immediately suspected an infrastructure problem: the data center, the network, the layer-4 LB, the layer-7 SLB, and so on.
An emergency voice conference with the relevant engineers was set up right away to start handling it.
Five minutes later, operations found that the CPU usage of the layer-7 SLB in the main data center, which carries all online business, had reached 100% and that it could not process user requests. After ruling out the other infrastructure, they pinned the fault on this layer.
(A layer-7 SLB balances load based on application-layer information such as the URL. Load balancing uses an algorithm to distribute client requests across a server cluster, reducing the pressure on each server.)
Amid the emergency there was also a side episode: engineers working remotely from home logged in to the VPN but could not reach the intranet, so they had to call the person in charge of the intranet and use a green channel to get online (because one of the domain names was proxied by the faulty SLB).

By now 25 minutes had passed, and the emergency repair officially began.
First, operations restarted the SLB, which did not recover. Then they tried a cold restart of the SLB while rejecting user traffic, but CPU usage stayed at 100% and it still did not recover.
Next, operations found that the SLB in the multi-active data center was seeing a large number of request timeouts although its CPU was not overloaded. Just as they were preparing to restart the multi-active SLB, internal chat groups reported that the main site's services had recovered: video playback, recommendations, comments, feeds, and other functions were basically normal.
It was now 23:23, 31 minutes after the accident began.
It is worth mentioning that these functions came back because the "high-availability disaster recovery architecture", which netizens had mocked at the time of the incident, did in fact kick in.

As for why this line of defense did not work at first, you and I may share a little of the blame.
Simply put, when everyone found they could not open Bilibili, they began refreshing frantically. CDN back-to-source retries plus user retries pushed Bilibili's traffic up by more than 4 times and the number of connections up by 100 times, to the tens of millions, which overloaded the multi-active SLB.

However, not all services have this kind of flexible architecture, so the matter was not yet fully resolved.
Over the next half hour, the team tried many things and rolled back the Lua code from the last two weeks or so, but the remaining services were not restored.
By midnight there was no other way: "never mind how the bug came about, let's restore service first".
Simple and brutal: operations spent one hour building a brand-new SLB cluster.
At 1 a.m., the new cluster was finally ready:
On one side, someone was responsible for switching the traffic of core businesses such as live streaming, e-commerce, comics, and payment to the new cluster, restoring all services (everything was done by 1:50 a.m., bringing an outage of nearly 3 hours to a temporary end);
On the other side, analysis of the cause of the bug continued.
After they produced detailed flame graph data with an analysis tool, the troublemaking "0" finally began to show itself:
The CPU hot spot was clearly concentrated in a call to the lua-resty-balancer module. After a certain execution, this module's _gcd function returned an unexpected value: NaN.
At the same time, they found the triggering condition: the weight of some container IP was 0.
They suspected that the function had triggered a bug in the JIT compiler, so that a computation error led to an infinite loop and drove the SLB CPU to 100%.
So JIT compilation was disabled globally to avoid the risk for the time being. By the time everything was settled it was almost 4 a.m., and everyone could finally get some sleep.
No one was idle the next day. After repeatedly reproducing the bug in an offline environment, they found that it was not a JIT compiler problem: under a special release mode of a service, the container instance weight becomes 0, and this 0 is a string.
As explained earlier, in the dynamically typed language Lua this string "0" was converted to a number in arithmetic but not in the comparison with 0, so the code took the wrong branch and fell into an infinite loop, which triggered Bilibili's unprecedented crash.
Is recursion to blame, or weakly typed languages?
Many netizens still remember the accident vividly. Some recall assuming the problem was their phone and then finding that switching to a computer did not help either; others remember that the incident hit the trending searches just 5 minutes after it happened.
Everyone was surprised that such a simple infinite loop could bring down such a large website.
However, some pointed out that infinite loops are not uncommon. What is rare is one in the SLB layer, in the request-distribution path, where it cannot be fixed quickly by a restart the way a backend problem can.

To avoid this, some people think recursion should be used with caution, and that where it must be used a counter should be set, so that the function returns directly once the counter reaches a value the business could never plausibly reach.
Others think recursion is not to blame and that the main culprit is weakly typed languages.
This also gave rise to the joke about the "scheming 0".
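The two suggestions above can be sketched together in one defensive variant. This is an illustration, not the fix that Bilibili or the library applied; `safe_gcd`, `MAX_STEPS`, and the error messages are assumptions.

```lua
-- Assumed cap: Euclid's algorithm needs far fewer steps for realistic weights.
local MAX_STEPS = 100

local function safe_gcd(a, b)
    -- Type check: convert numeric strings up front and reject everything else.
    a, b = tonumber(a), tonumber(b)
    if not a or not b then
        return nil, "weights must be numeric"
    end
    -- Reject NaN (the only value not equal to itself) and negative weights.
    if a ~= a or b ~= b or a < 0 or b < 0 then
        return nil, "weights must be non-negative numbers"
    end
    -- Counter: give up instead of spinning forever.
    for _ = 1, MAX_STEPS do
        if b == 0 then
            return a
        end
        a, b = b, a % b
    end
    return nil, "step limit exceeded"
end

print(safe_gcd(24, "18"))   --> 6
print(safe_gcd(10, "0"))    --> 10
print(safe_gcd(10, "abc"))  -- nil plus an error message
```

Converting with `tonumber` before the comparison removes the mismatch between how `==` and `%` treat a numeric string, and the counter turns any remaining non-terminating input into an error that the caller can handle.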

In addition, because the accident lasted so long and affected so much, Bilibili compensated all users with one day of membership at the time.
Someone did the math: these 7 lines of code cost Bilibili's boss about 157,500,000 yuan. (doge)

What would you like to say about this bug?
Reference link:
[1] "2021.07.13 We collapsed like this", by Bilibili Technology
https://mp.weixin.qq.com/s/nGtC5lBX_Iaj57HIdXq3Qg