How to ensure system stability and achieve emission reduction? Ant Group has these key technologies
2022-06-10 05:14:00 【Alipay Technology】
As a member of the laboratory, Ant Group actively participated in drafting standards and research reports on system stability. At this summit, Shi Shiqun, deputy general manager of Ant Group's digital technology business group, gave a keynote speech titled "Alipay's Double 11 stability assurance experience", sharing the exploration and practical experience of SOFAStack, Ant Group's financial-grade distributed architecture, in the field of system stability assurance.

Here is the speech:
Hello everyone, I am from Ant Group's digital technology business group. Today I will talk about how Alipay safeguards online stability during Double 11.
System stability assurance is a complex piece of systems engineering. From 2004 to 2021, Alipay went through a series of architecture upgrades and iterations, from the unitized architecture to the elastic cloud, and then on to cloud native and green computing. Throughout this process we had to consider capacity and stability as well as cost and efficiency.
The first stage mainly solved the capacity problem. With LDC, the elastic architecture, and OceanBase, capacity can in theory be expanded without limit. At the same time, full-link stress testing verified capacity across the whole link.
The second stage began once payment capacity met the target. The next consideration was how to improve the stability and efficiency of the overall architecture through technological innovation. There are two typical scenarios. One is cloud native: the core idea of a cloud-native architecture is to separate infrastructure from business, which releases the dividends of the infrastructure and significantly improves the speed and efficiency of innovation. A typical case is the adoption of ServiceMesh at Ant. The other is our intelligent monitoring and operations system, which uses data intelligence to speed up emergency response and recovery.
The third stage is green emission reduction. For several years in a row we kept the peak growing steadily while setting a goal of zero cost increase for big promotions. For Double 11 in 2021, our main focus was green emission reduction. Through co-located online and offline deployment, time-sharing scheduling, intelligent AI capacity, and other innovative technologies, we saved 640,000 kWh of electricity and cut carbon emissions by 394 tons.

Next, let me introduce the key technologies behind Alipay's Double 11 promotion.
1. Unitized deployment
The multi-site active-active logical unit architecture, known inside Ant as LDC (Logical Data Center), is a logical partitioning of the IDC (Internet Data Center). It is also the scheme Alipay uses to put “unitized deployment” into practice.
Ensuring the stability of an information system comes down to solving two problems:
First, single-point bottlenecks. Any Internet system that grows to a certain scale will inevitably hit a single-point bottleneck. Going from a single server and a single application, to a single database and a single data center, and then to multiple data centers and multi-site deployment (multi-site active-active), is a process of continually breaking through single-point bottlenecks.
Second, remote disaster recovery, which is the only way to meet financial-grade stability requirements. Multi-site, multi-data-center deployment is the inevitable direction for Internet systems. It raises several key problems, including traffic allocation, data splitting, and latency. These problems can be solved with technology, and the solutions are carried by a deployment architecture. Although more than one deployment option is available, both theoretical research and the architectural practice of some advanced systems regard “unitized deployment” as the best option.

A unit is a self-contained set that can complete all business operations. It contains all the services required by all businesses, plus the data assigned to that unit. A unit is a miniature version of the whole site. It is full-function, because every application is deployed in it, but it does not hold the full data set, because it can only operate on part of the data.
Alipay divides units into three types, RZone, GZone, and CZone, to address traffic allocation, data splitting, and latency:
RZone (Region Zone): the zone that best fits the theoretical definition of a unit. Every RZone is self-contained, has its own data, and can complete all business.
GZone (Global Zone): the global unit. It hosts data and services that cannot be split, which RZones may depend on. There is only one GZone group globally, and only one copy of its data.
CZone (City Zone): a unit deployed per city. It also hosts data and services that cannot be split and that RZones depend on. The difference from GZone is that data or services in a CZone are accessed by RZones frequently, at least once per business operation, whereas GZone is accessed by RZones much less often. CZone is designed specifically to solve cross-region latency.
Based on the LDC architecture, Alipay achieved a true multi-site active-active architecture, with financial-grade 99.99% availability and theoretically unlimited capacity. It successfully supported promotion capacity at the hundreds-of-thousands level and laid a good foundation for the later elastic architecture.
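The division of labor among the three zone types can be illustrated with a small routing sketch. This is not Alipay's implementation: the zone names, the city mapping, and the choice of the last two digits of the user ID as the shard key are all assumptions made for illustration.

```python
# Hypothetical layout: zone names, shard rule, and cities are illustrative only.
RZONES = ["RZ00", "RZ01", "RZ02", "RZ03"]
RZONE_CITY = {"RZ00": "hangzhou", "RZ01": "hangzhou",
              "RZ02": "shanghai", "RZ03": "shanghai"}
CZONES = {"hangzhou": "CZ-HZ", "shanghai": "CZ-SH"}
GZONE = "GZ00"


def rzone_for_user(user_id: str) -> str:
    """Pick the RZone that owns this user's data (assumed shard key: last two digits)."""
    shard = int(user_id[-2:])                    # 00..99
    return RZONES[shard * len(RZONES) // 100]    # split the 100 shards evenly


def route(user_id: str, service_scope: str) -> str:
    """Return the zone that should serve a call.

    service_scope is "sharded" (per-user data), "city" (unsplittable and
    frequently accessed), or "global" (unsplittable and rarely accessed).
    """
    rzone = rzone_for_user(user_id)
    if service_scope == "sharded":
        return rzone
    if service_scope == "city":
        # Stay inside the same city as the caller's RZone to avoid remote latency.
        return CZONES[RZONE_CITY[rzone]]
    if service_scope == "global":
        return GZONE
    raise ValueError(f"unknown scope: {service_scope}")
```

In this sketch a user whose ID ends in 37 lands in `RZ01`, and any city-scoped dependency of that request is served by `CZ-HZ` rather than by a zone in another city.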
2. Elastic architecture
The LDC logical unit architecture just described offers unlimited capacity in theory, but in practice that is often not feasible, for two reasons:
On one hand, the resources the company holds are limited, and as the number of payments grows rapidly, self-held resources hit a bottleneck. On the other hand, Double 11 promotions happen only a few times; holding that many resources over the long term is uneconomical and does not fully capture the dividends of cloud computing.
Building on the LDC architecture, Alipay upgraded to an elastic architecture that provides elasticity at business granularity. Some units are turned into elastic units, which are moved out to the cloud at peak time for rapid capacity expansion. When the promotion is over, these units are moved back to the regular data centers, so resources are used more effectively. All the elasticity logic is encapsulated at the infrastructure layer, so it is transparent to the business. During Double 11 in 2016, this effectively supported a payment peak at the level of more than 100,000 per second, and cost dropped by more than 50% compared with holding the resources ourselves.
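A minimal sketch of the idea, assuming a simple placement table kept by the infrastructure layer. The class, method names, and site names are hypothetical; the point is only that business code asks for a unit and never sees where it currently runs.

```python
class UnitPlacement:
    """Tracks where each logical unit currently runs (hypothetical model)."""

    def __init__(self, home: dict[str, str], elastic: set[str]):
        self.home = dict(home)        # unit -> regular data center
        self.elastic = set(elastic)   # units allowed to move to the cloud
        self.current = dict(home)     # unit -> where traffic is sent now

    def scale_out(self, cloud_site: str) -> list[str]:
        """Move every elastic unit to the cloud site before the peak."""
        moved = sorted(self.elastic)
        for unit in moved:
            self.current[unit] = cloud_site
        return moved

    def scale_in(self) -> None:
        """Return elastic units to their regular data center after the peak."""
        for unit in self.elastic:
            self.current[unit] = self.home[unit]

    def locate(self, unit: str) -> str:
        """Business code only names a unit; placement stays an infrastructure concern."""
        return self.current[unit]
```

Only the units marked elastic change location; the rest keep serving from their regular data center throughout the promotion.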

3. Service mesh
Next, let's look at the service mesh, ServiceMesh, which is also a key technology.
Why ServiceMesh? We have to start with microservices. Many of the problems with microservices relate to service governance, including interdependence between components, difficulty of service control, and platform operations and management. We use a lightweight network proxy to handle communication between microservices. It is deployed as a sidecar, in a separate process within the container. By decoupling infrastructure from business, the infrastructure can be upgraded efficiently. During the promotion period, infrastructure iteration efficiency improved by more than 10 times.
Second, ServiceMesh enables flexible traffic control. All rate limiting and circuit breaking are taken over by ServiceMesh with no changes on the business side, which saves a great deal of R&D cost and SDK integration cost. ServiceMesh currently covers 100% of Alipay's core payment links, at a scale of millions of containers and a peak at the ten-million-QPS level.
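To show why moving rate limiting and circuit breaking into the proxy removes work from the business side, here is a small sketch. It uses a token bucket and a consecutive-failure breaker, which are common choices but are assumptions here; the source does not say which algorithms Ant's mesh uses, and all class names are hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """Token-bucket limiter: refills `rate` tokens per second up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens = capacity
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class CircuitBreaker:
    """Opens after `threshold` consecutive failures; permits a retry after `cooldown` seconds."""

    def __init__(self, threshold: int, cooldown: float,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold, self.cooldown, self.clock = threshold, cooldown, clock
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.cooldown

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()


class Sidecar:
    """Wraps an upstream call so the business function carries no limiter or breaker code."""

    def __init__(self, upstream: Callable[[str], str],
                 limiter: TokenBucket, breaker: CircuitBreaker):
        self.upstream, self.limiter, self.breaker = upstream, limiter, breaker

    def call(self, request: str) -> str:
        if not self.breaker.allow():
            return "rejected: circuit open"
        if not self.limiter.allow():
            return "rejected: rate limited"
        try:
            result = self.upstream(request)
        except Exception:
            self.breaker.record(False)
            return "error: upstream failed"
        self.breaker.record(True)
        return result
```

The business function passed as `upstream` stays unchanged when limits are tuned or the breaker policy is replaced, which is the decoupling the section describes.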

4. Evolution of online full-link stress testing
Stress testing is an extremely important means of capacity verification. The methods discussed so far all keep improving the ability to expand capacity, but we also need a good way to verify that capacity meets expectations, which makes online full-link stress testing critical.
Traditional stress testing has many problems. Traditional partial, single-link stress testing is incomplete: it tests one business at a time, the database layer and the network layer are hard to stress, and it cannot simulate real business conditions. In addition, traditional offline stress testing, simulated stress testing, and online single-machine traffic-diversion stress testing are not accurate enough to assess resources precisely.

For online full-link stress testing, our approach has the following key points:
Core link analysis, building an end-to-end user behavior model. Using big data technology, we build an end-to-end traffic model based on user behavior and back-end links during promotions, and use it for full-link stress test verification.
Reusing production as the stress test environment. A data access proxy routes stress test data onto the production link without affecting normal business data, so the results are very reliable.
Stress test performance analysis and diagnosis. If a problem appears during a stress test, it can be located quickly and optimization suggestions are given. Typical diagnostics cover the network (network quality, bandwidth), applications (memory, CPU hotspots, threads), databases (slow SQL, CPU, memory), infrastructure (containers, processes), and the full link (bottlenecks in distributed links).
Building on years of accumulation, the fidelity of our full-link stress testing exceeds 99%, and the Double 11 promotions of recent years have had zero major failures and zero asset losses.
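The isolation performed by the data access proxy can be sketched as follows. The source does not describe the mechanism, so this sketch assumes one common approach: requests carry a stress-test flag, and the proxy redirects flagged writes to shadow tables. The `shadow_` prefix and the class are hypothetical.

```python
SHADOW_PREFIX = "shadow_"   # hypothetical naming convention for stress-test tables


class DataAccessProxy:
    """Routes reads and writes to a shadow table when the request is stress-test traffic."""

    def __init__(self):
        # In-memory stand-in for a database: table name -> rows.
        self.tables: dict[str, list[dict]] = {}

    def _table_for(self, table: str, is_stress_test: bool) -> str:
        return SHADOW_PREFIX + table if is_stress_test else table

    def insert(self, table: str, row: dict, is_stress_test: bool = False) -> str:
        """Store the row and return the name of the table it actually went to."""
        target = self._table_for(table, is_stress_test)
        self.tables.setdefault(target, []).append(row)
        return target

    def count(self, table: str, is_stress_test: bool = False) -> int:
        return len(self.tables.get(self._table_for(table, is_stress_test), []))
```

Because the redirect happens inside the proxy, the application code path exercised by the test is the same one real payments take, while real business tables never receive test rows.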
5. Intelligent monitoring
Despite all of the work described above, problems are inevitable in the online systems of a complex business, so finding problems quickly, responding quickly, and recovering quickly becomes very important.
At large peaks, monitoring is also a huge challenge. Under large-scale traffic, log volume may reach hundreds of GB per second, and the cleaned data flow may reach tens of TB per minute. Processing these logs effectively is very important.
With CeresDB, Ant's self-developed time-series database engine, and by optimizing collection technology and the stream computing engine, we can basically achieve second-level monitoring and realize 1-minute discovery, 5-minute localization, and 10-minute recovery, ensuring rapid emergency response online.
1-minute discovery: the fault is discovered within 1 minute, and the relevant people are brought into the troubleshooting process.
5-minute localization: the cause of the fault is located within 5 minutes, and a mitigation plan is drawn up.
10-minute recovery: the mitigation plan is carried out within 10 minutes, and the fault is recovered.
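The 1-5-10 targets can be checked mechanically against an incident timeline. The sketch below assumes each incident records, in seconds since the fault began, when it was discovered, located, and recovered; the record shape and function name are hypothetical.

```python
from dataclasses import dataclass

# Targets from the text, in minutes after the fault starts.
TARGETS = {"discovered": 1, "located": 5, "recovered": 10}


@dataclass
class Incident:
    """Timestamps in seconds since the fault began (hypothetical record shape)."""
    discovered: float
    located: float
    recovered: float


def check_1_5_10(incident: Incident) -> dict[str, bool]:
    """Return, per phase, whether the incident met its target."""
    return {
        phase: getattr(incident, phase) <= minutes * 60
        for phase, minutes in TARGETS.items()
    }
```

An incident discovered after 90 seconds fails the first target even if localization still lands inside 5 minutes, which is why second-level monitoring matters for the first step.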

For Double 11 in 2021, our focus shifted from peaks and traffic to green computing, balancing cost and efficiency to keep the technology sustainable. Using a series of innovative techniques, including co-located online and offline deployment, cloud-native time-sharing scheduling, and AI elastic capacity, we achieved global resource scheduling and green computing, saving 640,000 kWh of electricity and cutting carbon emissions by 394 tons.

We have covered many technical capabilities and methods for ensuring system stability. But for any organization, building these capabilities and systems from scratch takes a long time and a great deal of complex work. To better help all industries with digital upgrading and transformation, Ant Group is actively opening up these capabilities.
6. Native distributed database OceanBase
Next, let's look at an important product: OceanBase. OceanBase has been validated internally through 9 years of Double 11, has extensive application experience, and is mature and stable. As a native distributed database, OceanBase offers unlimited scaling and always-online capability, protects data from loss, and recovers from disasters automatically within 30 seconds. It suits large-scale scenarios and industries with high business continuity requirements, as well as industries that need strong consistency, high availability, and high HTAP performance. It already has many successful deployments in finance, government, telecom operators, transportation, energy, and other industries.


This article is from the WeChat official account Alipay Technology (Ant-Techfin).


