
I want to say more about this communication failure

2022-07-06 17:58:00 Fresh jujube class

Over the past few days, everyone has been following the large-scale communication failure at the Japanese telecom operator KDDI.

The failure had a huge impact: it covered the whole of Japan and affected a total of 39.15 million users. It also lasted a long time, taking almost two days before service was basically restored.

The specific cause of the failure has already been covered by many official accounts, so I will not repeat the analysis here.

In today's article I want to broaden the topic and have an in-depth chat with you: it is already 2022, so why does our communication network still have so many failures, and do we have an ultimate solution?


█  Communication failures: a contest that has lasted a hundred years

Failure is an inherent property of communication networks. Just as people get sick, communication networks have been accompanied by failures since the day they were born. You could even say that we built the communication network in the course of troubleshooting it.


Bell invented the telephone only after working through countless faults

For more than a hundred years, countless communications engineers have fought a relentless contest against failures. They have worked hard to develop all kinds of technologies and have used every available means to combat communication failures.

At the macro level, the results of this struggle are remarkable. As experience accumulates and technology advances, the probability of communication network failures keeps declining.

Young readers may not know that more than 20 years ago, a landline that could not get through (not that many families had a telephone back then) was as common as a water or power outage. More than 10 years ago, mobile phones that could not make calls or get online were also a common occurrence.


Over the past ten years, these situations have become increasingly rare. When one does happen, people find it strange. When the network goes down, many people's first reaction is that their phone is broken or their account is in arrears, so they hurry to restart or top up. Isn't that so?

We now live in an information society, where the communication network, like water and electricity, is critical infrastructure. Our work and daily life, and the operation of every industry, depend on it.

Against this background, operators, as state-owned enterprises and as the builders and maintainers of the network, always put network security and stability first.


To ensure network stability, the Ministry of Industry and Information Technology has set strict assessment indicators for operators. If a network failure occurs in a province or city, the top leader there must take responsibility, and their career is at risk.

The pressure on operator leadership is passed down to employees, and on to equipment vendors and outsourcing contractors.

Market competition is now so fierce that once something goes wrong, the result is either huge compensation or the loss of market share in that province. Equipment vendors and contractors cannot bear such a loss.

So the whole communications industry certainly pays enough attention to network security and stability. The key issue is still capability and execution.

█  Where exactly are the weak points of the communication network?

First, I want to talk about how the security levels of communication networks are defined.

Depending on the scenario, communication network security is divided into different levels. From low to high, they are home grade, enterprise grade, and carrier grade.


Security levels of communication systems

The routers we use at home, for example, are all home grade. Such equipment has very low safety and reliability: when it breaks, it simply breaks, and it easily causes network interruptions.

Enterprise grade refers to the network equipment used inside organizations. Depending on network size and the number of users, enterprise-grade equipment has relatively high safety and reliability, and service is not easily interrupted.

Carrier-grade requirements are higher still. Operators such as China Mobile, China Telecom, and China Unicom serve hundreds of millions of users, so their networks absolutely cannot be allowed to break down easily. Generally speaking, carrier-grade reliability must reach five nines (99.999%) or better.
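
To put the "nines" in perspective, the short Python sketch below converts an availability level into the maximum downtime it allows per year. The function name and the choice of a 365-day year are assumptions made for illustration; they do not come from any operator's assessment rules.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(nines: int) -> float:
    """Maximum yearly downtime, in minutes, for an availability of `nines` nines."""
    unavailability = 10 ** (-nines)  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 7):
    minutes = allowed_downtime_minutes(n)
    print(f"{n} nines: about {minutes:.2f} minutes of downtime per year")
```

Five nines works out to roughly 5.26 minutes of downtime per year, which shows how far an outage lasting almost two days is from the carrier-grade target.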


The communication network that Xiaozao Jun is discussing today is the operators' public communication network, which includes both the cellular mobile network and the fixed broadband network. Both are carrier grade.

The architectures of the cellular mobile network and the fixed broadband network are similar. The main difference lies in the access network.


The cellular mobile network uses a wireless access network, whose access device is the base station. The fixed broadband network uses a wired access network, whose access devices are PON equipment (passive optical network equipment, including the optical modem).

Let's take the cellular mobile network as an example and analyze it.

A public communication network serves hundreds of millions of users, so it usually adopts a pyramid-shaped hierarchical architecture: the core network is the center, the transport network (bearer network) is the trunk, and the access network forms the limbs.


As you can see at a glance, the biggest weak points of this architecture are the core network and the transport network (especially the backbone).

The core network is the management center, the heart and brain of the network. Once it goes down, the whole network goes down. That is why core network engineers (as I once was) hold the post with the greatest risk and pressure.


Core network equipment room

The transport network (bearer network) is the blood vessels and nerves of the communication network. Faults at the extremities are manageable, since a break there affects only a small area. But what if the vessels of the heart and brain fail? That also means complete paralysis.


Optical transmission equipment

This KDDI failure, the DoCoMo failure in October 2021, the 2020 outage at the four major UK operators, and the 2020 CenturyLink failure in the US were all related to core routers. To put it bluntly, something went wrong in the vessels of the heart and brain, and the whole person (the network) collapsed.

By comparison, the probability of a major problem in the access network is very low. When an individual base station goes out of service, it affects from a few hundred to a thousand or so people at most. The scope of impact is limited, and complaints are manageable.


Base station equipment

If a large-scale failure does occur in the access network, it is most likely caused by a vendor software version problem or a hardware batch defect. The probability of this is extremely low.

█  What have communications engineers done to prevent failures?

So, to keep the communication network running safely and smoothly and to prevent failures, what methods have communications engineers adopted?

First, sound top-level architecture design.

Network architecture is the foundation of network security. A good architecture has to consider performance and capacity, cost, and also safety and redundancy.

Please keep one thing in mind here: communication equipment is a complex product, and no matter how you design it or stack up resources, it can still fail. It is only a question of probability and time.

For possible faults, rather than only guarding strictly against them, it is better to focus on what to do once a failure happens.

Therefore, introducing backup mechanisms is the most effective way to deal with faults.


Backup mechanism

Everyone has studied probability and statistics. If the failure probability of one device is 1%, then the probability of two devices failing at the same time, assuming their failures are independent, is 1% × 1% = 0.01%. Right?
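
The same arithmetic can be extended to more than two devices. The sketch below is only an illustration: the function name is made up, and it relies on the assumption that device failures are independent, which shared power feeds, a common software bug, or a single operator mistake can easily violate in practice.

```python
def simultaneous_failure_probability(p_single: float, devices: int) -> float:
    """Probability that all `devices` fail at once, assuming independent failures."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability between 0 and 1")
    if devices < 1:
        raise ValueError("devices must be at least 1")
    return p_single ** devices


for n in (1, 2, 3):
    p = simultaneous_failure_probability(0.01, n)
    print(f"{n} device(s): {p:.6%} chance that all fail together")
```

Each extra independent device multiplies the risk by another factor of 1%, which is why backup is so effective on paper. The figure is only as good as the independence assumption behind it.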

To maximize safety, network architecture designs adopt the POOL (pool) networking mode.


Several devices work together as a pool (POOL), each handling part of the traffic. If one fails, the others immediately take over, so that service is not affected.
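
The pool idea can be sketched in a few lines of Python. This is a toy model, not how any real core network element is implemented: the class names, the node names, and the least-loaded placement rule are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass, field


@dataclass
class PoolMember:
    name: str
    healthy: bool = True
    subscribers: list = field(default_factory=list)


class Pool:
    """A group of devices that share the load and cover for each other."""

    def __init__(self, names):
        self.members = [PoolMember(n) for n in names]

    def attach(self, subscriber):
        """Place a subscriber on the least-loaded healthy member."""
        candidates = [m for m in self.members if m.healthy]
        if not candidates:
            raise RuntimeError("no healthy member left in the pool")
        target = min(candidates, key=lambda m: len(m.subscribers))
        target.subscribers.append(subscriber)
        return target.name

    def fail(self, name):
        """Mark one member as failed and move its subscribers to the others."""
        failed = next(m for m in self.members if m.name == name)
        failed.healthy = False
        orphans, failed.subscribers = failed.subscribers, []
        for subscriber in orphans:
            self.attach(subscriber)


pool = Pool(["node-A", "node-B", "node-C"])  # hypothetical node names
for i in range(6):
    pool.attach(f"user-{i}")
pool.fail("node-B")
print({m.name: len(m.subscribers) for m in pool.members})
```

After `node-B` fails, its two users are spread over the two surviving nodes, so nobody loses service. The survivors each carry more load than before, which is why a pool needs spare capacity to be useful.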

Core equipment usually comes in two or more sets, deployed in different districts of the provincial capital and physically far apart.

In addition, when the network architecture is designed, important network elements are usually placed in core equipment rooms with a higher security level.


Core equipment room

For example, the HSS (the former HLR) is the most important element in the mobile network. It stores and manages subscriber data (each user's mobile number, authentication data, service information, and so on) and is housed in the core equipment room in the provincial capital. Maintenance staff also regularly back up the data to a physically separate, remote location.

In recent years, because of geological disasters as well as factors such as war or terrorist attacks, operators have even begun to keep backups in different provinces.

For example, during last year's Zhengzhou flood, the core equipment room was flooded and the HLR went out of service. A backup HLR was urgently brought into use to restore service temporarily.


Different disaster recovery levels

Second, the underlying active/standby mechanism.

So far we have been talking about redundancy in the top-level design. Down at the level of equipment rooms, subracks, boards, and cables, there are also active and standby designs, which can be called the underlying active/standby mechanism.

If you have been inside an equipment room, you will have noticed that the subracks in the cabinets are filled with all kinds of boards, and that these boards almost always come in pairs.


Front view of one vendor's 3G equipment

In other words, there are usually two boards of any given type.

The same goes for network cables and optical fibers. You will hardly ever see a single cable, as they all come in pairs.


Front view of one vendor's 4G equipment

The reason is mutual backup. If one board fails, the other continues working so that service is not affected. At the same time, the system raises an alarm to remind staff to replace the failed board as soon as possible.
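
The board-level behavior described above (switch over, keep the service running, raise an alarm) can be modeled roughly as follows. The class, the board names, and the alarm wording are all hypothetical; real equipment uses vendor-specific switchover logic and alarm formats.

```python
class RedundantPair:
    """A 1+1 board pair: one active board and one standby board."""

    def __init__(self, active, standby):
        self.active = active
        self.standby = standby
        self.alarms = []

    def report_failure(self, board):
        """Handle a board failure: switch over if possible and raise an alarm."""
        if board == self.active:
            if self.standby is None:
                # No backup left: the service carried by this pair is lost.
                self.active = None
                self.alarms.append(f"CRITICAL: {board} failed with no standby, service down")
                return
            # Switch over: the standby board takes the service.
            self.active, self.standby = self.standby, None
            self.alarms.append(f"MAJOR: {board} failed, switched to {self.active}, replace {board}")
        elif board == self.standby:
            # Service continues, but the pair has lost its protection.
            self.standby = None
            self.alarms.append(f"MAJOR: standby {board} failed, redundancy lost, replace {board}")

    def in_service(self):
        return self.active is not None


pair = RedundantPair("board-0", "board-1")  # hypothetical slot names
pair.report_failure("board-0")
print(pair.in_service(), pair.alarms)
```

The alarm matters as much as the switchover. After the first failure the pair is running without protection, and a second failure before the board is replaced takes the service down.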

The same applies to power supply: every cabinet in a telecom equipment room must have at least two power inputs.


Multiple power inputs (each red-and-blue pair is one feed)

Besides mains power, important equipment rooms are also equipped with emergency power supplies such as batteries, UPS units, and generators.


Battery bank in the equipment room

Third, sound management systems and regulations.

Technology is never the only factor affecting network security and stability. The biggest threat to the communication network is actually people, not technology.

Xiaozao Jun believes every communications engineer feels the same way about this.

In management processes and systems, and in engineering technical specifications, we have learned countless painful lessons.

Why must upgrade plans be reviewed again and again? Why must engineering specifications be so strict? Why build a spare parts warehouse? Why must cutover steps be double-checked, or even triple-checked? Why must staff stay on duty after major operations? Why is the network frozen during important holidays? ...

These are all lessons summed up by our predecessors.


Always stay in awe of network failures


In addition to internal management systems and process standards, the state has also established increasingly strict laws, regulations, and penalties against the deliberate sabotage of communication networks, which now happens frequently.

Cutting optical fibers through illegal construction work, deliberately destroying base stations, and cutting optical cables are all punishable by law.


A base station feeder that was maliciously cut

█  The deeper reasons behind communication failures

With a reasonable network architecture, a complete active/standby mechanism, and sound systems and specifications in place, why do so many failures still occur?

Next, let me talk about some of the deeper reasons.

First and foremost, and probably the point most people agree on, is the internal environment of the communications industry.

Over the years, malicious competition and lowest-price bidding have prevailed. Equipment vendors and subcontractors have to win orders and still maintain profits, so they can only cut costs desperately: product design costs, material costs, construction material costs and, more importantly, personnel salary costs.

Continual cost compression inevitably affects product reliability and engineering quality. Low pay has driven away large numbers of experienced people. To get the work done, subcontractors can only recruit fresh graduates and send them to work on site after simple training (or even none).

These people lack the necessary training and practice, and their professional standards and technical skills fall short, which makes them a major risk point.

Some of them have very poor professional standards. If pushed too hard, they might simply delete the database and run. That is not impossible.

Some years ago, to make sure that front-line employees' pay was not skimmed, some vendors even signed contracts with subcontractors that set a minimum level of pay for outsourced employees.

Besides low-price competition, another important factor affecting the security of network operation is increasing technical complexity.

The more advanced the technology, the more complex it is and the lower its reliability. As technology evolves, operators' networks keep growing in scale and networking becomes more and more complex, so the probability of problems rises sharply.

The tidal effect in communication networks is very pronounced. Traffic at busy times can be ten or even a hundred times that at idle times. If an emergency (such as a disaster) causes traffic to surge, the difference can even reach a thousand times.


Operators cannot possibly build in a thousandfold redundancy. So without reasonable bypass or threshold design, the probability of network congestion is extremely high. (Signaling traffic congestion was a factor in several major failures in recent years.)
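
One common way to implement a threshold is a token bucket, which admits requests up to a configured rate and rejects the excess so that a surge cannot overwhelm the equipment behind it. The article does not say which mechanism operators actually use, so the sketch below is only an illustration of the general idea; the class name, the rate, and the capacity are assumed values.

```python
class TokenBucket:
    """Admit at most `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.last = 0.0         # timestamp of the previous call, in seconds

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` is admitted."""
        # Refill tokens for the time elapsed since the previous call.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the threshold: shed this request


bucket = TokenBucket(rate=100, capacity=100)  # assumed limits
surge = sum(bucket.allow(now=0.0) for _ in range(1000))
print(f"admitted {surge} of 1000 requests in the surge")
```

Rejecting 900 of the 1,000 requests looks harsh, but the rejected users can retry later, whereas an overloaded core element can take every user down with it. Time is passed in explicitly here so that the example stays deterministic.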

At present, few people fully understand the complex networking of operators. As time passes and staff move on, it becomes even less well understood.

Communication networks are something of a dark art to begin with, full of strange problems. Who dares to claim they can anticipate every possibility?

The third potential risk to network security, and the one Xiaozao Jun worries about most, is external cyber attacks, such as hackers, viruses, and system vulnerabilities.

Today, communication equipment has largely become IP-based and cloud-based. Networks are increasingly open, and some are even deployed directly on public clouds. Physical isolation from the outside world keeps weakening, which makes them more vulnerable to attack than before.

Attackers today are also far more skilled than before and use more diverse methods, so they pose a great threat to the network.

Of course, operators and equipment vendors have invested heavily in defending against network attacks.

Nowadays, all vendors emphasize the concept of "security hardening". As the name suggests, security hardening means plugging system vulnerabilities to make the system more robust. Operators use third-party tools, or hire third-party vendors, to run security scans on live network equipment and look for vulnerabilities, and then require the equipment vendor to fix them.


All for safety

This contest, in which every advance in defense is met by a bigger advance in attack, will go on for a long time.

However, Xiaozao Jun thinks that the defending side currently has serious problems in both staff security awareness and technical capability. Going forward, we will run into more and more security incidents.

I hope the relevant organizations and departments will not just pay lip service to security, but will really spend time improving the caliber of their staff and strengthening training. Otherwise, when something really happens, it will be too late to remedy.

█  Last words

This KDDI failure in Japan is not the first of its kind, and it will certainly not be the last. Communication network failures are like a game of pass the parcel: no one knows whether they will be next.

Now, vendors have proposed introducing AI and letting it take over network operations in order to reduce the failure rate. Some vendors, building on network cloudification, perform grayscale upgrades (that is, partial upgrades), which can also significantly reduce network risk. These are all good trends.
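
A grayscale upgrade can be sketched as "upgrade a small slice, check health, then widen". The Python below is a simplified illustration under stated assumptions: the function name, the 5% / 25% / 100% batch sizes, and the `upgrade`, `is_healthy`, and `rollback` callbacks are all hypothetical and stand in for whatever a vendor's tooling actually provides.

```python
def grayscale_upgrade(nodes, upgrade, is_healthy, rollback, batch_percents=(5, 25, 100)):
    """Upgrade `nodes` in growing batches; roll back everything if a check fails.

    Returns True if every node was upgraded, False if the upgrade was rolled back.
    """
    upgraded = []
    for percent in batch_percents:
        # Cumulative number of nodes that should be upgraded after this batch.
        target = max(1, len(nodes) * percent // 100)
        for node in nodes[len(upgraded):target]:
            upgrade(node)
            upgraded.append(node)
        # Check every upgraded node before widening the rollout.
        if not all(is_healthy(node) for node in upgraded):
            for node in reversed(upgraded):
                rollback(node)
            return False
    return True


# Toy usage: node "n3" misbehaves on the new version.
versions = {f"n{i}": "v1" for i in range(20)}
ok = grayscale_upgrade(
    list(versions),
    upgrade=lambda n: versions.update({n: "v2"}),
    is_healthy=lambda n: not (n == "n3" and versions[n] == "v2"),
    rollback=lambda n: versions.update({n: "v1"}),
)
print(ok, sorted(set(versions.values())))
```

In this toy run the problem shows up in the second batch, after only 5 of the 20 nodes have been touched, and all of them are rolled back. That is the point of the approach: a bad version is caught while it affects a small part of the network.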

I think we still have a long way to go in the fight against communication network failures. The road ahead is long, and communications engineers will keep searching high and low for answers.

Okay, that's all for today's article. Thank you for your patience in reading. See you next time!

Thank you!


Copyright notice
This article was written by [Fresh jujube class]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/187/202207061003491563.html