NOC-SLA: Dewu C-End Business Monitoring Practice
2022-06-09 03:48:00 【Dewu Technology】
Original | Dewu Technology - Wooden fish mouse
Preface
As the company's business grows rapidly, the products and applications in our production environment are becoming more complex, and so are their interconnections and dependencies. An exception in any single application can affect system availability and have a global impact. A review of the C-end failures across 2021 showed that response time and fault emergency handling needed to improve, so NOC set out to improve the quality of the existing alarm response and build a new NOC-SLA system.
1. Introduction to the Dewu transaction C-end
1.1 The C-end concept
What is the C-end?
The C-end refers to consumers, that is, individual users (Consumer). As the name implies, a C-end product serves individual users directly. The Dewu C-end is split into two scenarios, "transaction" and "community". It covers online transactions, the community, and algorithms. Its important dependencies are order placement, bidding & inventory, marketing, and goods on the transaction side, the front end and back end of the community domain, and the transaction and community recommendation algorithms.
From a transaction perspective:
- A user logs in, browses goods, opens a product details page, pays, and buys. This whole user-facing flow is the transaction C-end business.
From a community perspective:
- Users log in, post, and browse the community. The social interaction this generates is the community C-end business.

1.2 What happens when the C-end has a problem?
In mid-June 2021, the goods service became abnormal and orders fell continuously because of a technical problem, which hurt users' ordering and purchasing experience.
- The recommendation page, including the category pages, showed a white screen with no data;
- Goods side: tapping a product loaded for a long time, or the product showed as delisted and unavailable for sale;
- Bidding & deposit side: items could not be bid on, or bidding returned errors;
- Supply chain side: recognition of some sorting photos and viewing of product details pages were affected;
- Community side: some community posts loaded slowly or with delay;
- Merchant side: the open platform (dop) was also widely affected;

In the fourth quarter of 2021, a technical problem in a supply chain release (a code bug and abnormal Redis cache parsing) caused trading orders to drop abnormally that day, which hurt users' ordering and purchasing experience.
- Bidding & inventory: the "Buy now" floating layer failed to open;
- Order creation & order payment: the order creation page could not be opened;
- Marketing: transferred discounts failed to be written off;

1.3 Summary
As the company's business grows rapidly, our production environment is becoming more complex, as are the connections and dependencies between systems. An exception in one application can ripple through everything and affect the whole.
2. Historical C-end issues: why build the SLA
2.1 Alarm discovery problems
(1) In the 2021 annual failure analysis, failures affecting the C-end accounted for 34.7% of all failures, and the alarm detection rate for C-end failures was only 42%.

Alarms already cover most of the C-end core links. Across all fault types and major faults, however, alarm coverage still needs to improve: monitoring alarms are scattered, the NOC system's ability to consolidate monitoring and alarms is weak, and alarm quality has no effective closed loop.
(2) In the fault analysis from 2021 to date, NOC's response rates for C-end faults at the 3-minute, 5-minute, and 15-minute marks all need to be strengthened.
C-end scenarios (P1-P2: major faults; P3-P4: general faults)
| Fault level | P1 | P2 | P3 | P4 | Smoke |
| --- | --- | --- | --- | --- | --- |
| Number of faults | 5 | 4 | 19 | 18 | |
| 1-minute alarm discovery rate | 100% | 25% | 47% | 27% | |
| NOC 3-minute response rate | 60% | 50% | 74% | 83% | |
| NOC 5-minute response rate | 20% | 0 | 0 | 0 | |
| Response rate above 5 minutes | 20% | 50% | 26% | 17% | |
2.2 NOC response problems
After a P1 fault with a large asset loss at the end of last year, NOC did not judge and escalate at once when the risk warning came in, and the response was delayed. As a result, the campaign was taken offline late and the asset loss grew. The problems were as follows:
- On-duty staff watch too many groups and process too many messages every day without being able to prioritize, so when a fault occurs they respond slowly or miss messages.
- When a fault occurs, on-duty staff cannot accurately assess the scope of impact, so they cannot find the right owner, and the fault misses its best handling window.
- NOC on-duty staff do not understand the C-end business well enough to make effective judgments.
2.3 Summary
The fault analysis above shows problems in both the alarm discovery rate and NOC's response and judgment during major faults. From fault discovery to fault escalation, NOC staff have no priorities for the problems they handle each day: attention is scattered, alarms are received passively without a specification, and information in fault handling is scattered and lacks abstraction and aggregation, so a problem cannot be traced from a single point to the full picture. Both the monitoring system's coverage of business scenarios and NOC's investment in understanding those scenarios need to improve.

3. SLA overview
The NOC-SLA special project was established for this reason: to consolidate monitoring alarms into the NOC-SLA system, produce SOPs, assign P0/P1 priorities to business scenarios across the company, and comprehensively improve fault discovery. This section introduces the project from three angles: support level, damage type, and business perspective.
3.1 SLA classification
(1) By SLA support level
Based on business importance, we define three SLA support levels: P0 (SLA 3 minutes), P1 (SLA 5 minutes), and P2 (SLA 15 minutes).
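The three support levels can be read as response deadlines. The sketch below is only an illustration of that reading: the 3/5/15-minute windows come from the text, while the function name and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Response windows taken from the article: P0 = 3 min, P1 = 5 min, P2 = 15 min.
SLA_RESPONSE_WINDOW = {
    "P0": timedelta(minutes=3),
    "P1": timedelta(minutes=5),
    "P2": timedelta(minutes=15),
}


def response_within_sla(level, alarm_at, responded_at):
    """Return True if the on-duty response landed inside the SLA window."""
    return responded_at - alarm_at <= SLA_RESPONSE_WINDOW[level]


# Hypothetical alarm time, used only to exercise the function.
alarm = datetime(2022, 3, 2, 10, 0, 0)
print(response_within_sla("P0", alarm, alarm + timedelta(minutes=2, seconds=30)))  # True
print(response_within_sla("P0", alarm, alarm + timedelta(minutes=4)))              # False
print(response_within_sla("P1", alarm, alarm + timedelta(minutes=4)))              # True
```

The same 4-minute response misses a P0 commitment but meets a P1 commitment, which is why the level assigned to a scenario matters as much as the alarm itself.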

(2) By damage type
With reference to our business damage types, we divide SLA guarantee objects into six types: business damage, asset loss, infrastructure, data quality, workplace efficiency, and development and testing. They are explained as follows:

(3) By business perspective
a. Analyze business damage scenarios from the business R&D perspective, for example P0-level scenarios;
- Taking the 【order process】 business direction as an example: product details - "Buy now" floating layer - place order - pay order - cashier;

b. Analyze business damage scenarios from the operations and infrastructure perspective, for example P0-level scenarios;
- Taking the operations infrastructure behind the 【order process】 as an example: user request - Internet routing - gateway - App - business center;

(4) SOP definition
- Redefine the business grading logic so that SLA fault scenarios are matched automatically and linked to the predetermined fault level, catching all faults in one pass and responding quickly;
- Change the original thinking from "addition" to "subtraction": reduce the alarm groups that NOC on-duty staff must watch, separate the group information that needs attention by fault source (monitoring alarm, internal feedback, customer service feedback), and aggregate P0/P1 alarm information in one place. Information outside the commitment is not covered by the SLA;
- Add a maintainer and a responsible business owner to each SLA business scenario.
3.2 Access process
SLA access specification
After the alarm rule page is configured, the owner submits an SLA application, and NOC staff complete the review in one pass. If the review passes, the SLA is guaranteed. If it fails, the rule must be optimized within three days, and the application is returned directly once that deadline passes.
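The review outcome described above can be sketched as a small status function. This is a reading of the text, not the actual NOC tooling: the enum, the function name, and the choice to treat exactly three days as still inside the window are all assumptions.

```python
from datetime import date, timedelta
from enum import Enum


class AccessStatus(Enum):
    GUARANTEED = "guaranteed"                  # review passed, rule is covered by the SLA
    NEEDS_OPTIMIZATION = "needs_optimization"  # review failed, still inside the window
    RETURNED = "returned"                      # review failed and the window expired


# The three-day optimization window comes from the article.
OPTIMIZATION_WINDOW = timedelta(days=3)


def access_status(review_passed, rejected_on, today):
    """Classify an SLA access application after the NOC review."""
    if review_passed:
        return AccessStatus.GUARANTEED
    # Assumption: the application is returned only after more than three days.
    if today - rejected_on > OPTIMIZATION_WINDOW:
        return AccessStatus.RETURNED
    return AccessStatus.NEEDS_OPTIMIZATION


print(access_status(True, None, date(2022, 6, 9)).value)
print(access_status(False, date(2022, 6, 1), date(2022, 6, 3)).value)
print(access_status(False, date(2022, 6, 1), date(2022, 6, 5)).value)
```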

4. Onboarding the C-end to NOC-SLA
Preparation
- Before implementing the scheme, understand the business scenario requirements and focus on each scenario's guarantee level.
- Clarify the core business indicators and how they depend on the overall architecture.
- Check whether the baseline of each scenario being onboarded is reasonable, and make the alarm rules explicit.
- After onboarding, verify the rules against historical faults.

4.1 Business review: requirements
The transaction C-end covers online transactions, the community, and algorithms. Its important dependencies are order placement, bidding & inventory, marketing, and goods on the transaction side, the front end and back end of the community domain, and the transaction and community recommendation algorithms. We sorted out 9 P0 scenarios and 17 P1 scenarios for the C-end, with more than 100 refined alarm rules.
Classification by business importance:
For the C-end business line, for example, scenarios on the core order link are assigned P0, P1, or P2 priority:
- Scenarios that affect order volume are evaluated as P0 (product details, buy now, order creation, and similar scenarios);
- Shopping-guide links that ultimately lead to an order are P1 (my favorites, want to buy, daily coupons).
4.2 Continuous stability guarantee for the transaction C-end
Landing the transaction C-end SLA
During the SLA rollout, after focusing on the indicators of the SLA business monitoring scenarios, we repeatedly ran the core rules and trigger conditions of the C-end business scenarios against historical faults to check whether those faults would have been detected. For different scenarios and different data indicators, we kept refining the baseline calculation formula, and later began trying an intelligent baseline that builds a basic algorithm model from historical fluctuation data, so that P1 and P2 faults can also be found quickly within P0 scenarios.
4.2.1 C-end SLA baseline
(1) At present, the baselines for the whole C-end are configured from the historical peak values of business traffic and confirmed through continuous trial and adjustment.
Before confirming a baseline, always study the fluctuation ratio of historical business traffic and then decide which type of baseline to use. During the SLA rollout, past monitoring data caused some problems: holiday traffic is often more than 50% above the historical threshold, so if a business problem during a peak causes only a modest decline, it is hard to detect. This required attention every week, so we introduced an intelligent baseline.
- Taking the New Year's Day holiday as an example, traffic ran more than 50% above the baseline, and the proportion of abnormal fluctuation reached 112%.

- On working days, values stay within the predicted range, with a fluctuation ratio within 20%.

(2) Intelligent baseline: the Prophet model
- fbprophet is an open-source time series forecasting algorithm from Facebook.
- Using the historical fluctuation of a monitored metric, the forecasting algorithm trains a model that predicts the metric's future trend.
- To ensure the SLA problem discovery rate, availability, and accuracy, NOC staff review false alarms and historical monitoring data once a week. On holidays in particular, the C-end alarm traffic level is generally more than 50% above the normal threshold, and on Tuesday afternoons it occasionally dips 25% below the normal baseline level. Continued business growth and seasonal campaigns such as "618" and "Double 11" keep challenging the reliability of the baseline.
- An intelligent baseline can forecast from a large volume of data, account for holiday effects, and give a reasonable baseline value, ensuring availability and reliability.
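To make the idea of a holiday-aware baseline concrete, here is a minimal standard-library sketch. It is not Prophet and not the team's model: it is a much simpler seasonal baseline (the median of past values for the same weekday and hour, with holiday dates excluded), and the data, function names, and holiday set are made up for illustration.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta
from statistics import median


def seasonal_baseline(history, holidays=frozenset()):
    """Map (weekday, hour) -> median of past values, skipping holiday dates."""
    buckets = defaultdict(list)
    for ts, value in history:
        if ts.date() in holidays:
            continue  # holiday traffic would inflate the normal baseline
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {slot: median(values) for slot, values in buckets.items()}


def deviation(baseline, ts, value):
    """Relative deviation of an observed value from its seasonal baseline."""
    expected = baseline[(ts.weekday(), ts.hour)]
    return (value - expected) / expected


# Four weeks of synthetic hourly order counts starting Monday 2021-12-06:
# 1000 per hour normally, 1500 per hour on the New Year's Day holiday.
NEW_YEAR = date(2022, 1, 1)
start = datetime(2021, 12, 6)
history = []
for h in range(28 * 24):
    ts = start + timedelta(hours=h)
    history.append((ts, 1500 if ts.date() == NEW_YEAR else 1000))

baseline = seasonal_baseline(history, holidays={NEW_YEAR})
# A Saturday-noon reading of 650 against a baseline of 1000.
print(round(deviation(baseline, datetime(2022, 1, 8, 12), 650), 2))  # -0.35
```

A forecasting model such as Prophet goes further by fitting trend and seasonality and by modeling holidays explicitly instead of dropping them, but the comparison step (observed value against an expected value for that time slot) is the same.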
4.2.2 SLA monitoring and alarm optimization
For monitoring alarm configuration, we upgraded the aggregation dimensions of monitoring. Across multiple scenarios, multiple rules, and different business perspectives, application alarms, service alarms, and basic resource alarms are correlated and consolidated. The notification template is clearer, and the abnormal fluctuation chart can be read at a glance.
(1) Information integration
- Confirm the scenario and distinguish the P0 and P1 scenarios described above
- Distinguish the business line type and business domain
- Give each scenario rule an easy-to-understand title, for example: payment orders down 35% compared with the baseline
- Assign a fixed NOC-SLA maintainer and a responsible owner in the business domain
- Mark the business damage type
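The fields listed above could be carried on each rule as structured metadata, so that every notification includes the level, owner, and damage type. The following is a hypothetical sketch: the class, field names, and sample values are illustrative and do not describe the real NOC-SLA schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaAlarmRule:
    level: str            # scenario level, "P0" or "P1"
    business_line: str    # business line type
    business_domain: str  # business domain
    title: str            # easy-to-understand rule title
    maintainer: str       # fixed NOC-SLA maintainer
    owner: str            # responsible owner in the business domain
    damage_type: str      # one of the six damage types

    def notification(self, drop_pct):
        """Render a one-line alert that carries all routing information."""
        return (
            f"[{self.level}] {self.title} (observed drop: {drop_pct:.0f}%) | "
            f"{self.business_line}/{self.business_domain} | "
            f"damage: {self.damage_type} | "
            f"maintainer: {self.maintainer}, owner: {self.owner}"
        )


# Sample values are placeholders, not real people or rule names.
rule = SlaAlarmRule(
    level="P0",
    business_line="Transaction C-end",
    business_domain="Payment",
    title="Payment orders down more than 35% vs. baseline",
    maintainer="noc-on-duty",
    owner="payment-owner",
    damage_type="Business damage",
)
print(rule.notification(40))
```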
(2) SLA-specific monitoring & alarm rules: diversity
- The following improvements remove the need for human judgment, so that response is quick and the message delivery and linkage mechanism is fast.


(3) NOC-SLA dedicated alarm logic
- Handling false alarms from SLA-specific alarm data
Basic description: a delay in VM data caused an error when reading payment order data, which triggered a false payment alarm.
Impact description: the P0-SLA-3m alarm rule "Paid orders down more than 35% compared with the baseline" produced false positives because of the data delay.
The alarm showed a fluctuation ratio of 40%, but neither the monitoring screenshot nor the linked monitoring page showed any abnormal fluctuation.

Optimization: the original alarm detection point was at 12s, but data at 12s was found to be still incomplete and prone to false positives. The 12s detection point has been closed and replaced with a 20s detection point (20s is an added test point, not the final value). After the switch, the alarm delay is expected to be about 25-30 seconds. The following formula gives the NOC-SLA dedicated alarm logic: the value computed by the formula is compared with the threshold. The threshold is never negative, only positive, and whether the change is a rise or a fall is determined by the comparison. For example, for a period-over-period decline of 30, the alarm logic handles the sign automatically.
- The old alarm logic evaluated every 30 seconds. To reduce alarm noise, queries looked back 1 minute, and the alarm engine's deduplication, merging, and suppression added a wait of about 50 seconds, so the whole process was delayed by about 2 minutes.
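The threshold convention described above (positive thresholds only, with the direction decided by the comparison) can be sketched as follows. The original formula is not reproduced in the text, so this is an assumed form based on the rule title "down more than 35% compared with the baseline"; the function names are hypothetical.

```python
def fluctuation_pct(current, baseline):
    """Signed change relative to the baseline, in percent (negative = drop)."""
    return (current - baseline) / baseline * 100


def should_alarm(current, baseline, threshold_pct, direction="drop"):
    """Compare against a positive threshold; the sign is handled here."""
    if threshold_pct <= 0:
        raise ValueError("threshold_pct must be positive")
    change = fluctuation_pct(current, baseline)
    if direction == "drop":
        return -change > threshold_pct
    if direction == "rise":
        return change > threshold_pct
    raise ValueError("direction must be 'drop' or 'rise'")


# Rule: payment orders down more than 35% compared with the baseline.
print(should_alarm(600, 1000, 35))           # True: a 40% drop exceeds 35%
print(should_alarm(700, 1000, 35))           # False: a 30% drop does not
print(should_alarm(1400, 1000, 35, "rise"))  # True: a 40% rise exceeds 35%
```

Keeping the threshold positive means one rule definition reads the same way for drops and rises, and the person configuring it never has to reason about negative percentages.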
4.3 Keeping the SLA fresh and accurate: methods & objects
4.3.1 Service link assurance
(1) NOC & business alignment
- Review the SLA's current health with the business teams every month and improve this "after-sales service".
- Track changes in business iterations (new marketing campaigns, auction items, and so on). These may change P0/P1 scenario definitions. When a business iteration introduces changes, the original monitoring definitions, NOC on-duty arrangements, and emergency procedures must be adjusted accordingly.
(2) NOC maintenance of link data correctness
Regularly review the correctness of link data for the service indicators of (transaction C-end) scenarios, so that scenario dependencies are discovered.
4.3.2 SLA alarm optimization
(1) Analysis and optimization of the SLA discovery rate for faults & smoke
For each fault or smoke event, check whether it was found by the SLA, by another alarm, or by user & employee feedback, and use this to keep raising the SLA alarm discovery rate while avoiding false positives ("no wrongful kills").
(2) SLA alarm optimization
Based on weekly SLA alarm quality monitoring, analyze the points in time at which the alarm trigger rate rises sharply, find the causes, and close the loop so that every alarm is "justified". This increases accuracy, reduces alarm noise, and converges alarms. For example, during low-trading hours, flash sales and similar activities cause a sudden increase in orders. When such monitoring spikes trigger an alarm, the problem is recorded and the alarm is classified.
4.4 Accelerating NOC emergency response
Standardizing and optimizing the SLA emergency process
(1) After NOC-SLA went live and was reviewed, it was connected to SOS (the fault emergency system). When an SLA indicator falls, SOS is triggered in linkage for fast, one-click emergency notification;
(2) Add an emergency response team, made up of NOC group two and an expert group, to provide guidance during emergencies and speed up fault recovery;
(3) Automatic fault escalation: based on the new fault grading, the matching level is determined automatically, and within 1 minute a group is created automatically and an overview of the known fault information is synced to it.

5. Fault discovery in practice
5.1 Alarm timeliness
| | Before | Now | Improvement |
| --- | --- | --- | --- |
| Collection frequency | 1min | 10s | Alarm sensitivity |
| Rule check frequency | 2min | 20s | Faster alarms |
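A rough way to read the table: in the worst case, an anomaly starts just after a collection and the resulting data point arrives just after a rule check, so the detection delay is bounded by roughly the sum of the two intervals. The sketch below applies that simplified model, which is an assumption and ignores query and notification latency.

```python
def worst_case_detection_delay(collect_interval_s, check_interval_s):
    """Upper bound on detection delay: one full collection plus one full check cycle."""
    return collect_interval_s + check_interval_s


before = worst_case_detection_delay(60, 120)  # 1min collection, 2min rule check
after = worst_case_detection_delay(10, 20)    # 10s collection, 20s rule check
print(before, after, before / after)  # 180 30 6.0
```

Under this model the bound drops from about 3 minutes to about 30 seconds, which is in line with the 25-30 second alarm delay the article expects after the detection point change.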
5.2 Accuracy and effectiveness
In February this year, the community's outfit selection service failed because of a bug in new resource calculation code. Thanks to scenario association, a reasonable baseline, and diverse SLA alarm configuration, we detected a fault symptom that previously would have gone unnoticed. At the time, all other online alarms were normal.

5.3 Improving on-duty work
(1) On 3/2, a 【smoke】 event: QPS fell and RT soared on the purchase home page, product details, and "Buy now", caused by an exception in network equipment in Alibaba Cloud Hangzhou (availability zone I). Through the linkage between SOS grading and NOC-SLA, the corresponding fault scenario and level were matched automatically and notifications were sent without human intervention, giving a quick response within 5 minutes.

(2) For user & employee feedback, build a standard TS & NOC reporting process and strengthen the Dewu App user feedback channel;
(3) Watch fewer groups: converge groups and reduce interruptions so that NOC on-duty staff can focus on a limited number of key Feishu groups.
6. Summary and outlook
After several failures with large asset losses and serious failures affecting business availability, we reviewed how to move emergency support from after-the-fact response to early warning, from passive to active. We made a company-wide commitment to launch the NOC-SLA support project, learned from past experience, and resolved to discover, handle, and stop the bleeding quickly, setting targets of P0 (SLA 3 minutes), P1 (SLA 5 minutes), and P2 (SLA 15 minutes). We also sorted out the application levels of each business domain, graded the business link scenarios, and sped up fault escalation through alarm aggregation and SOS linkage. The transaction C-end currently has 9 core P0 scenarios and 17 P1 scenarios in place. That is not yet enough: the SLA must be kept "fresh" to remain sustainable, accurate, and reliable.
From the perspective of smoke discovery, we should keep refining NOC-SLA to extend alarm reach, continue expanding the observable scenarios, and dig into scenarios below P1. From the perspective of prevention, we should find and prevent foreseeable problems in advance, since fast recovery is no substitute for prevention, and keep minor problems from growing into major failures. There is still a long way to go from today's 3min-5min-15min rapid response toward the industry's 1min-5min-10min discovery, localization, and recovery capability, in support of stable production!
* Text / Wooden fish mouse
Follow Dewu Technology for new technical articles every Monday, Wednesday, and Friday at 18:30.
If you found this article helpful, feel free to comment, share, and like.