
NOC-SLA: C-side business monitoring practice at Dewu

2022-06-09 03:48:00 Dewu Tech

Original | Dewu Tech - Wooden fish mouse

Preface
With the company's business growing rapidly, the products and applications in our production environment are becoming more complex, and so are the connections and dependencies between them. An exception in any single application can affect system availability and have a global impact. A review of the C-side failures across 2021 showed that both response time and fault emergency handling needed improvement, so NOC set out to improve the quality of its existing alarm response and built a new NOC-SLA system.
1. Introduction to Dewu's transaction C-side

 

1.1 The C-side concept

What is the C-side?

The C-side refers to consumers, that is, individual users ("C" for Consumer). As the name suggests, a C-side product serves individual users directly. Dewu's C-side has two components, "transaction" and "community". It covers online transactions, community, and algorithms: order placement, bidding & inventory, marketing, and goods in the transaction domain; the front end and back end of the community domain; and transaction recommendation and community recommendation on the algorithm side. All of these are important dependencies.

 

From a transaction perspective:

  • A user logs in, browses goods, views the product details, and pays to buy. This whole user-facing transaction flow is the transaction C-side business.

 

From a community perspective:

  • A user logs in, posts, and browses the community. The social activity generated by these interactions is the community C-side business.

 

1.2 What happens when the C-side has a problem?

In mid-June 2021, a technical problem made the goods service abnormal and orders fell continuously, affecting users' ordering and purchase experience.

 

  • The recommendation page showed a white screen with no data, including the category page;
  • Commodity side: tapping a product loaded for a long time, or the product was shown as off the shelf and unavailable for sale;
  • Bidding & deposit side: items could not be bid on, or bidding returned errors;
  • Supply chain side: recognition of some sorting photos and viewing of the product details page were affected;
  • Community side: some community posts loaded slowly or with delay;
  • Merchant side: the open platform (dop) was also widely affected.

 

In the fourth quarter of 2021, a supply chain release ran into technical problems: a code bug and abnormal Redis cache parsing caused trading orders to drop abnormally that day, affecting users' ordering and purchase experience.

 

  • Bidding & inventory: the "Buy now" floating layer opened abnormally;
  • Order creation & order payment: the order creation page could not be opened;
  • Marketing: writing off the transferred discount failed.

 

1.3 Summary

With the company's business growing rapidly, our production environment is becoming more complex, and so are the connections and dependencies inside it. An exception in a single application can ripple through the whole system and affect the overall situation.

 

2. C-side historical issues: why build an SLA

 

2.1 Alarm problem discovery

 

(1) In the 2021 annual failure analysis, failures affecting the C-side accounted for 34.7% of all failures, and the alarm detection rate for C-side failures was only 42%.

Alarms on the C-side core link are basically in place, but across all fault types and major faults our alarm coverage still needs to improve. Monitoring alarms are scattered, NOC's ability to consolidate system monitoring and alarms is weak, and there is no effective closed loop on alarm quality.

(2) The analysis of C-side faults since 2021 shows that NOC's response rates at the 3-minute, 5-minute, and 15-minute marks all need to be strengthened.

 

 

C-side scenario                  Major failure         General fault         Smoke
Fault level                      P1         P2         P3         P4         Smoke
Number of faults                 5          4          19         18
1-minute alarm discovery rate    100%       25%        47%        27%
NOC 3-minute response rate       60%        50%        74%        83%
NOC 5-minute response rate       20%        0          0          0
Response rate beyond 5 minutes   20%        50%        26%        17%

 

2.2 NOC response problems

In a P1 fault late last year that caused a large asset loss, NOC did not judge at the first opportunity, after the risk warning, whether to escalate, so the response was delayed. As a result the activity was taken offline late and the asset loss grew. The problems included:

  • On-duty staff have too much to watch and too many messages to process each day, and cannot prioritize them. When a fault occurs, they respond slowly or miss messages.
  • When a fault occurs, on-duty staff cannot accurately assess the scope of impact, so they cannot find the right owner, and the fault misses its best handling window.
  • NOC on-duty staff do not understand the C-side business well enough to make effective judgments.

2.3 Summary

The fault analysis above exposes two problems: the alarm discovery rate, and NOC's response when judging major faults. Between fault discovery and fault escalation, NOC on-duty staff have no way to prioritize the issues they handle each day. Their attention is scattered, alarms are received passively without a specification, and information during fault handling is scattered and lacks abstraction and aggregation, so problems are only found by working outward from a single point. Both the monitoring system's coverage of business scenarios and NOC's investment in understanding those scenarios need to improve.

 

3. SLA overview

NOC therefore set up the NOC-SLA special project: bring monitoring alarms into the NOC-SLA system in a unified way, output SOPs, assign P0/P1 priorities to business scenarios across the company, and comprehensively optimize fault discovery. NOC-SLA is introduced here from three angles: support level, damage type, and business perspective.

 

3.1 SLA division

(1) By SLA support level

Based on business importance, we grade SLA support into three levels: P0 (SLA 3 minutes), P1 (SLA 5 minutes), and P2 (SLA 15 minutes).
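The three support levels can be read as response windows measured from the moment an alarm fires. The following is a minimal sketch of that reading, not the team's implementation; the function names and the sample timestamps are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Response windows per support level, as stated in the text.
SLA_RESPONSE_MINUTES = {"P0": 3, "P1": 5, "P2": 15}


def response_deadline(level: str, alarm_time: datetime) -> datetime:
    """Return the latest time by which NOC must respond to an alarm."""
    return alarm_time + timedelta(minutes=SLA_RESPONSE_MINUTES[level])


def met_sla(level: str, alarm_time: datetime, response_time: datetime) -> bool:
    """True if the response landed inside the level's SLA window."""
    return response_time <= response_deadline(level, alarm_time)


# Hypothetical timestamps, for illustration only.
alarm = datetime(2022, 3, 2, 10, 0, 0)
print(met_sla("P0", alarm, datetime(2022, 3, 2, 10, 2, 30)))  # inside 3 minutes
print(met_sla("P0", alarm, datetime(2022, 3, 2, 10, 4, 0)))   # past 3 minutes
```

Keeping the windows in one mapping means a change to a level's commitment is made in a single place.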

 

(2) By damage type

Based on the kinds of damage our business can suffer, we divide SLA guarantee objects into six types: business damage, asset loss, infrastructure, data quality, workplace efficiency, and development and testing.

 

(3) By business perspective

a. Analyze business damage scenarios from the business R&D perspective, for example P0-level scenarios:

  • Taking the [order process] business flow as an example: product details - "Buy now" floating layer - place order - pay for order - cashier.

b. Analyze business damage scenarios from the operations and infrastructure perspective, for example P0-level scenarios:

  • Taking the [order process] infrastructure path as an example: user request - Internet routing - gateway - App - business center.

 

(4) SOP definition

  • Redefine the business grading logic so that a fault is automatically matched to its SLA fault scenario and linked to a predetermined level, catching all faults in one net and enabling a fast response;
  • Change the old way of thinking from "addition" to "subtraction": reduce the alarm groups that NOC on-duty staff must follow, separate the groups that need attention by fault source (monitoring alarms, internal feedback, customer service feedback), and consolidate P0/P1 alarm information in one place. Information outside the commitment is not covered by the SLA;
  • Add a maintainer and the responsible business owner to each SLA business scenario.

 

3.2 Access process

SLA access specification

Once the rule is configured on the alarm page, the owner submits an SLA application and NOC staff review it. If it passes, the SLA is guaranteed. If it does not pass, it must be optimized within three days, and it is returned directly once that deadline passes.

 

4. Onboarding the C-side to NOC-SLA

Preparation

  • When implementing the plan, first understand the requirements of the business scenario and settle the guarantee level for each scenario.
  • Clarify the core business metrics and their dependencies within the overall architecture.
  • Check that each scenario's baseline is reasonable at onboarding, and make the alarm rules explicit.
  • After onboarding, verify the rules against historical faults.

 

4.1 Business review: requirements

The transaction C-side covers online transactions, community, and algorithms: order placement, bidding & inventory, marketing, and goods in the transaction domain; the front end and back end of the community domain; and transaction and community recommendation on the algorithm side. All of these are important dependencies. From them we identified 9 P0 scenarios and 17 P1 scenarios on the C-side and refined 100+ alarm rules.

Classifying business importance:

For the C-side business line, for example, scenarios on the core ordering link are assigned P0, P1, or P2 priority:

  • A scenario is evaluated as P0 based on its effect on order volume (product details, Buy now, order creation, and similar scenarios);
  • Shopping-guide links that ultimately lead users to place an order are P1 (My collection, Want to buy, Daily coupons).

 

4.2 Continuous stability guarantee for the transaction C-side

Landing the SLA on the transaction C-side

During the rollout of the SLA project, once the metrics for each SLA business monitoring scenario were settled, we kept tuning the core rules and trigger conditions for the C-side scenarios and repeatedly replayed historical faults to check that the rules would have caught them. Different scenarios use different data metrics, so we continuously refined the baseline formulas. Later we began trying an intelligent baseline that builds a basic algorithm model from historical fluctuation data, so that P1 and P2 faults can also be found quickly within P0 scenarios.

 

4.2.1 C-side SLA baseline

(1) At present, all C-side baselines are configured from the historical peaks of business traffic and are confirmed through continued trial and adjustment.

Before settling on a baseline, study the historical fluctuation ratio of the business traffic and then decide which type of baseline to use. During the SLA rollout, the historical monitoring data showed some problems: holiday traffic often runs more than 50% above the historical threshold, and if a business problem during a peak causes only a modest decline, it is hard to detect. This required attention every week, so we introduced an intelligent baseline.

  • Taking the New Year's Day holiday as an example, traffic ran more than 50% above the baseline, and the abnormal fluctuation ratio reached 112%.

  • On working days, values stay within the predicted range, with a fluctuation ratio within 20%.
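The weakness of a fixed baseline can be sketched in a few lines. The example below assumes that "fluctuation ratio" means the relative deviation of observed traffic from the baseline; the function names and traffic numbers are hypothetical and chosen only to mirror the 20%, 50%, and 112% figures in the text.

```python
def deviation_ratio(observed: float, baseline: float) -> float:
    """Signed relative deviation of an observed value from its baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (observed - baseline) / baseline


def outside_band(observed: float, baseline: float, band: float) -> bool:
    """True if the observed value leaves the +/- band around the baseline."""
    return abs(deviation_ratio(observed, baseline)) > band


# Hypothetical traffic numbers, for illustration only.
baseline_qps = 1000.0
print(outside_band(1150.0, baseline_qps, band=0.20))  # working day, 15% above
print(outside_band(2120.0, baseline_qps, band=0.50))  # holiday, 112% above
```

With a fixed baseline, healthy holiday traffic sits far outside the band, so a real but modest decline during the peak can still look "high" and go unnoticed.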

 

(2) Intelligent baseline: the Prophet model

  • fbprophet is an open-source time series forecasting algorithm from Facebook.
  • Using the historical monitoring data of a metric, the forecasting algorithm trains a model that predicts the metric's future trend.
  • To safeguard the problem discovery rate, availability, and accuracy of NOC-SLA, NOC staff review false alarms and historical monitoring data once a week. On holidays in particular, C-side traffic generally runs more than 50% above the normal threshold, and on Tuesday afternoons it occasionally dips 25% below the normal baseline level. Growing business volume and seasonal campaigns such as 618 and Double 11 keep challenging the reliability of the baseline.
  • Computed over large volumes of data, the intelligent baseline predicts future values, accounts for holiday effects, and gives a reasonable baseline value, ensuring availability and reliability.
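The idea behind a seasonality-aware baseline can be illustrated without Prophet itself. The sketch below is a deliberately simplified stand-in, not the Prophet model: it takes the median per (weekday, hour) slot and skips holiday dates, whereas Prophet also models trend and holiday effects explicitly. All names and data are hypothetical.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta
from statistics import median


def seasonal_baseline(history, holidays=frozenset()):
    """Median value per (weekday, hour) slot, skipping holiday dates.

    history: iterable of (datetime, value) pairs.
    """
    slots = defaultdict(list)
    for ts, value in history:
        if ts.date() in holidays:
            continue  # holiday traffic would inflate the normal baseline
        slots[(ts.weekday(), ts.hour)].append(value)
    return {slot: median(values) for slot, values in slots.items()}


def expected_value(baseline, ts):
    """Look up the baseline for the slot that ts falls in."""
    return baseline[(ts.weekday(), ts.hour)]


# Hypothetical data: four Tuesdays at 15:00, the last one a holiday spike.
start = datetime(2022, 1, 4, 15)  # a Tuesday
history = [(start + timedelta(weeks=i), v)
           for i, v in enumerate([750, 760, 740, 1600])]
baseline = seasonal_baseline(history, holidays={date(2022, 1, 25)})
print(expected_value(baseline, datetime(2022, 2, 1, 15)))
```

Because each slot has its own expected value, a recurring Tuesday-afternoon dip is part of the baseline rather than an anomaly, and excluding holidays keeps one spike from distorting normal days.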

 

4.2.2 SLA monitoring and alarm optimization

For alarm configuration, we upgraded the aggregation dimensions of monitoring. Across multiple scenarios, multiple rules, and different business perspectives, application alarms, service alarms, and basic-resource alarms are correlated and integrated. The notification template is clearer, and it shows the owner and the abnormal fluctuation chart at a glance.

 

(1) Information integration

  • Confirm the scenario and mark it P0 or P1 as defined above
  • Identify the business line type and business domain
  • Give the scenario's rule a title that is easy to understand, e.g. "Paid orders fell 35% year-on-year against the baseline"
  • Assign a fixed NOC-SLA maintainer and the responsible owner in the business domain
  • Mark the business damage type
 

(2) SLA-specific monitoring and alarm rules: diversity

  • These improvements take manual judgment out of the loop ("dehumanized" judgment), so responses are fast and messages reach the right people quickly through the linkage mechanism.

 

(3) NOC-SLA proprietary alarm logic

  • False alarms from SLA-specific alarm data

Basic description: a delay in VM data caused the alarm to read incorrect paid-order data, producing a false payment alarm.

Impact description: the P0-SLA-3m alarm rule "Paid orders fell more than 35% compared with the baseline" fired falsely because of the data delay.

The alarm showed a fluctuation ratio of 40%, but neither the monitoring screenshot nor the linked monitoring page showed any abnormal fluctuation.

 

Optimization: the original alarm detection point was at 12 s, but data at 12 s was found to still be incomplete, which easily caused false alarms. The 12 s detection point is now closed and replaced by a 20 s detection point (20 s is an added trial point, not the final value). After the switch, the expected alarm delay is about 25-30 seconds. The NOC-SLA proprietary alarm logic compares the value produced by the calculation formula with the threshold. The threshold is never negative, only positive, and whether the metric rose or fell is determined by the comparison. For example, for a 30% period-over-period decline, the alarm logic handles the sign automatically.
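The sign handling described here can be sketched as follows. The original calculation formula is not reproduced in the text, so this assumes a plain percentage change against the baseline; the function names, the `direction` parameter, and the sample numbers are hypothetical.

```python
def change_pct(current: float, baseline: float) -> float:
    """Percentage change of current relative to baseline (negative = decline)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) / baseline * 100.0


def should_alarm(current: float, baseline: float,
                 threshold_pct: float, direction: str) -> bool:
    """Compare the change against a positive threshold.

    direction is "down" or "up"; the sign is handled here, so the
    configured threshold is always a positive number.
    """
    if threshold_pct <= 0:
        raise ValueError("threshold must be positive")
    change = change_pct(current, baseline)
    if direction == "down":
        return -change > threshold_pct
    if direction == "up":
        return change > threshold_pct
    raise ValueError("direction must be 'down' or 'up'")


# Hypothetical numbers: paid orders drop from 1000 to 600 (a 40% decline),
# then from 1000 to 700 (a 30% decline), against a 35% threshold.
print(should_alarm(600, 1000, threshold_pct=35, direction="down"))
print(should_alarm(700, 1000, threshold_pct=35, direction="down"))
```

Keeping thresholds positive and expressing rise or fall as a direction avoids the easy mistake of configuring "-35" for one rule and "35" for another.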

 

  • The old alarm logic evaluated every 30 seconds. To suppress alarm noise, queries were pushed back 1 minute, and the alarm engine then spent about 50 seconds on deduplication, merging, and inhibition waits. The whole process added up to a delay of about 2 minutes.
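A rough tally shows where the roughly two minutes come from. The sketch assumes the three stages simply add up one after another, which is a simplification; the stage names are hypothetical and the durations are the ones quoted for the old logic.

```python
# Stage durations in seconds, taken from the description of the old logic.
OLD_PIPELINE = {
    "evaluation_interval": 30,   # rule evaluated every 30 seconds
    "query_lookback": 60,        # query window pushed back 1 minute
    "engine_wait": 50,           # dedup / merge / inhibit wait, about 50 s
}


def worst_case_delay(stages: dict) -> int:
    """Sum stage durations, assuming they run strictly one after another."""
    return sum(stages.values())


total = worst_case_delay(OLD_PIPELINE)
print(f"{total} s, about {total / 60:.1f} min")
```

Listing the stages separately makes it clear which one to shorten first when chasing a tighter detection target.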

 

4.3 Keeping the SLA fresh and accurate: methods and targets

4.3.1 Service link assurance

(1) NOC and business alignment

  • Review the SLA's current health with the business side every month and improve the "after-sales service".
  • Track whether business iterations change anything (new marketing campaigns, auction items) that could alter the definition of P0/P1 scenarios. When an iteration brings changes, the original definitions, monitoring, NOC duty arrangements, and emergency handling must be adjusted accordingly.

(2) NOC maintenance of link data correctness

Regularly review the correctness of scenario service metrics and link data (transaction C-side) so that scenario dependencies are discovered.

 

4.3.2 SLA alarm optimization

(1) Analysis and optimization of the SLA problem discovery rate for faults and smoke incidents

For each fault or smoke incident, check whether it was discovered by an SLA alarm, by another alarm, or through user or employee feedback, and use the answer to keep improving the SLA alarm discovery rate while following the principle "don't kill by mistake".

(2) SLA alarm optimization

Using the weekly SLA alarm quality monitoring, find the points in time where the alarm trigger rate rises sharply, analyze the causes, and close the loop on every alarm so that each one is "justified". This increases accuracy, reduces alarm noise, and converges alarms. For example, during low trading hours, or when flash sales and similar campaigns cause a sudden jump in orders, monitoring spikes generate alarms; these are recorded and classified.

 

4.4 Accelerating NOC emergency response

Standardizing and optimizing the SLA emergency process

(1) After NOC-SLA went live and was reviewed, it was connected to SOS (the failure emergency system). When an SLA metric falls, it links to SOS for fast emergency handling with one-click dispatch;

(2) An emergency response team was added, containing two NOC groups / the expert group, to give guidance during emergencies and speed up fault recovery;

(3) Automatic fault escalation: based on the new fault classification, the fault is automatically matched to a level, and within 1 minute a group is created automatically and the known fault information is synced to it as an overview.
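The automatic escalation step can be pictured as a lookup from SLA scenario to predetermined level, followed by scheduling the emergency group. This is a sketch under stated assumptions: the scenario keys, the mapping, and the `Escalation` type are hypothetical, and the real SOS integration is not described in the text.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Hypothetical mapping from SLA scenario to predetermined fault level.
SCENARIO_LEVELS = {"pay_order": "P0", "product_detail": "P0", "my_collection": "P1"}
GROUP_CREATION_WINDOW = timedelta(minutes=1)  # group is pulled within 1 minute


@dataclass
class Escalation:
    scenario: str
    level: str
    group_deadline: datetime
    summary: list = field(default_factory=list)


def escalate(scenario: str, alarm_time: datetime, known_facts: list) -> Escalation:
    """Match a scenario to its level and plan the emergency group."""
    level = SCENARIO_LEVELS.get(scenario)
    if level is None:
        raise KeyError(f"scenario {scenario!r} is not covered by the SLA")
    return Escalation(
        scenario=scenario,
        level=level,
        group_deadline=alarm_time + GROUP_CREATION_WINDOW,
        summary=list(known_facts),
    )


plan = escalate("pay_order", datetime(2022, 3, 2, 10, 0), ["paid orders down 40%"])
print(plan.level, plan.group_deadline.isoformat())
```

Failing loudly on an unknown scenario reflects the rule that information outside the commitment is not covered by the SLA.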

 

5. Fault discovery in practice

5.1 Alarm timeliness

                          Before    Now    Improvement
Collection frequency      1min      10s    Alarm sensitivity
Rule check frequency      2min      20s    Rapid alarm

 

5.2 Accuracy and effectiveness

In February this year, the community's clothing selection service failed because of a bug in newly released resource-calculation code. Thanks to scenario correlation, reasonable baselines, and the diversity of SLA alarm configuration, we detected a fault symptom that had previously gone undetected; all other online alarms were normal at the time.

 

5.3 Improving on-duty work

(1) On 3/2, a [smoke] incident occurred: QPS fell and RT soared on the Buy home page, product details, and Buy now, caused by an abnormal network device in Alibaba Cloud Hangzhou (Availability Zone I). Through the linkage between SOS predetermined levels and NOC-SLA, the incident was automatically matched to the corresponding fault scenario and level, and the notice was sent automatically without manual judgment, giving a fast response within 5 minutes.

(2) Build a standard TS & NOC reporting process for user and employee feedback, and strengthen the user feedback channels in the Dewu App;

(3) Reduce group watching: converge groups and cut interruptions so that NOC on-duty staff can focus on a limited number of key Feishu groups.

 

6. Summary and outlook

After several failures with large asset losses and serious failures affecting business availability, we reviewed how to move from emergency support after the fact to early warning beforehand, and from passive to active. We made a commitment to the whole company and launched the NOC-SLA support project, drawing on past experience and resolving to discover, handle, and stop the bleeding quickly, with the targets P0 (SLA 3 minutes), P1 (SLA 5 minutes), and P2 (SLA 15 minutes). At the same time we sorted out the application levels in each business domain and graded the business link scenarios, and used alarm aggregation and the SOS linkage to escalate faults quickly. The transaction C-side currently has 9 core P0 scenarios and 17 P1 scenarios in place. That is not yet good enough: the SLA must be kept "fresh" to remain sustainable, accurate, and reliable.

From the perspective of smoke discovery, we should keep polishing NOC-SLA to extend the reach of its alarms, keep expanding the observable scenarios, and dig into scenarios below P1. From the perspective of prevention, we should find and prevent foreseeable problems in advance, because fast recovery is no substitute for prevention, and keep minor problems from growing into major failures. There is still a long way to go, from today's 3min-5min-15min fast response toward the industry's 1min-5min-10min localization and rapid recovery capability, to help keep production stable!

* Text / Wooden fish mouse

 


 
Copyright notice
This article was written by [Dewu Tech]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/159/202206080947426808.html