Exploring Engineering Optimizations for Auto Scaling
2022-06-24 07:41:00 【Crooked stream】
This article explores the key technical design points behind Tencent Cloud's elastic computing products.
One saying applies, to some extent, in every line of work: those who talk don't know, and those who know don't talk.
I believe you will become the kind of expert who understands the principles, can get things done, and is willing to share.
0x00 Background: elastic computing
Cloud computing is built on virtualization. Virtualization lets people work with peace of mind because it provides both security isolation and efficient use of resources. Running a virtual machine is like swimming in a lane: the lane ropes damp the waves, so we need not care which stroke the swimmer in the next lane is using or how much they splash, and we always move freely, as if we had the whole pool to ourselves.
Back to the subject of this article. Tencent Cloud Auto Scaling (AS) is the core managed service among the cloud's compute products; it efficiently manages the scale-out and scale-in activities of cloud server clusters on behalf of users. It depends on close cooperation with the surrounding products. At the bottom, Tencent Cloud's KVM-based virtualization handles virtual machine management on a single physical host. On top of virtualization, Tencent Cloud's VStation scheduling system handles the management of large-scale distributed clusters of physical machines, the scheduling of virtual machine resources and tasks, live migration, and other core problems, and it directly supports the Cloud Virtual Machine (CVM) product. CVM is in turn the foundation for other compute products such as Auto Scaling, Lighthouse (lightweight application servers), and Batch (batch computing). Auto Scaling integrates the core capabilities of CVM, Cloud Load Balancer (CLB), cloud monitoring, and other services, so that users can flexibly manage cloud server clusters through the cloud API or the console; together these services safeguard the dynamic scaling of users' workloads. Auto Scaling is also a foundation for Tencent Cloud's container service (TKE). The relationships among these services are shown in the figure below:
VStation, the distributed scheduling system underlying the CVM product, manages resources far more effectively in practice than OpenStack did at the time. There are several reasons, and the most important is how its components communicate. Both systems use message queues, but OpenStack uses them to implement RPC: communication between components is inefficient, and it brings in complex concerns such as service registration and discovery (right side of the figure). In VStation (left side of the figure), the message queue acts as a message bus: each microservice component communicates only with the same MQ and does not need to know that the other components exist.
This is less a unique innovation in VStation than a sign that VStation's authors understood message queues more deeply and grasped their essence. Later Tencent Cloud products have borrowed this message bus model in their designs to varying degrees.
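The message bus idea can be illustrated with a small in-process sketch. This is not VStation code: the `MessageBus` class, the topic names, and the `scheduler` and `host_agent` components are all hypothetical, and a real deployment would use an actual message queue rather than direct function calls. The point it tries to show is that each component knows only the bus and a topic, never another component.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict[str, str]], None]


class MessageBus:
    """Toy in-process stand-in for an MQ used as a message bus (hypothetical)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: Dict[str, str]) -> None:
        # Deliver to every handler registered for this topic.
        for handler in list(self._subscribers[topic]):
            handler(message)


bus = MessageBus()
log: List[str] = []


def scheduler(msg: Dict[str, str]) -> None:
    # Hypothetical scheduler component: it only knows the bus and topic names.
    log.append(f"scheduler: placing {msg['vm']}")
    bus.publish("vm.placed", {"vm": msg["vm"], "host": "host-1"})


def host_agent(msg: Dict[str, str]) -> None:
    # Hypothetical host agent: it never references the scheduler directly.
    log.append(f"agent: starting {msg['vm']} on {msg['host']}")


bus.subscribe("vm.requested", scheduler)
bus.subscribe("vm.placed", host_agent)
bus.publish("vm.requested", {"vm": "vm-42"})
```

Adding a third component, for example an auditor that also listens on `vm.placed`, would require no change to the scheduler or the agent, which is the property the RPC-style use of a queue tends to lose.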
Auto Scaling (AS) is a core managed service among cloud compute products. Whether it runs stably and efficiently over the long term directly reflects a cloud provider's technical professionalism and directly affects users' trust in the provider's products. The introduction to AS on the official website boils down to one typical scenario, peak shaving and valley filling: dynamically adjust the amount of resources (mainly the size of the cloud server cluster) according to business load, optimizing resource cost without sacrificing the high availability of the business. In practice, auto scaling can also greatly improve the efficiency of day-to-day cluster management. So when should you use AS? The answer is:
"If your business needs more than one cloud server, you should consider using AS." (Crooked stream)
Auto Scaling mainly addresses the horizontal scaling of compute resources (scale-out), that is, dynamically adjusting the service capacity of a system by increasing or decreasing the number of cloud server instances. Take scale-out as an example. When more guests arrive at a banquet, the host serves additional dishes of each kind (hot dishes, desserts, and so on), and with even more guests adds a whole identical table, rather than making every dish a larger portion. In other words, efficiency improves and overall cost falls through replication of instances and economies of scale.
In the real world, business architecture evolves as the business grows.
It starts with a single service for minimal validation in the first phase:
It then moves to multiple clients, static/dynamic separation, and multiple backend services:
And then to more complex microservices:
Tencent Cloud's CVM and Auto Scaling accompany every stage of a user's business growth and witness the rapid change of users' businesses and architectures. Auto Scaling itself is also a service that keeps growing quickly:
As a compute product that provides foundational services, Auto Scaling has grown rapidly since launch. Typical major customers include Hyperparameters, Xiaohongshu (Little Red Book), VIPKid, Zuoyebang (Homework Help), Mango TV, Litchi micro class, Supercell, and EpicGame, across a wide range of industries. Its user count and number of scaling activities keep growing, and it has stably supported cluster resource management for users at the scale of millions of cores.
So what essential problems does Auto Scaling solve for users? The scale cube described in The Art of Scalability (with its service, replica, and data-sharding dimensions) is widely known, but the general methodology and design patterns the book presents (left side of the figure) are actually more instructive. The trade-off among three goals, stability (high availability), fast launch (time to market), and cost saving, is the real art of engineering, and designing for horizontal scaling (scale-out) is one of the few approaches that serves all three at once (the other two are asynchronous design and automation). Elasticity helps a business be stable, fast, and economical.
Tips: In the objective world there is no such thing as synchronization, only asynchrony; there are no updates, only tearing down and rebuilding. A so-called "synchronous update" is just a human fantasy. Try embracing event-driven design and immutability in your systems, and life gets much simpler.
Elasticity: "As the extension, so the force." Elasticity is the ability of an object to deform under an external force and to return to its original shape when the force is released. A solid object deforms when subjected to external forces; if the material is elastic, the object returns to its original shape when those forces are removed. According to the generalized Hooke's law, stress is directly proportional to strain.
The key problem that Auto Scaling, and indeed cloud services as a whole, must solve is keeping the business running stably so that it continues to create value for society. Two core ideas run through the implementation:
First, be prepared: design for failure, and nothing will fail.
Second, unified decision-making with flexible execution: top-level instructions are carried out level by level, much like "administrative contracting". The key is the separation of mechanism and strategy in the broad sense. Strategy can be decisions, policy, design, scheduling, or algorithms; mechanism can be execution, computation, I/O, task implementation, services, and so on.
0x01 Auto Scaling bottlenecks and solution analysis
Scaling activities keep the business architecture full of elastic "collagen". They include scale-out, scale-in, replacement of unhealthy instances, and so on. The core of Auto Scaling is the design and implementation of scaling activities and the management of their life cycle.
Any work whose quality must be guaranteed needs an established routine. A routine is usually a simple linear sequence of steps, but more complex work needs a workflow, with serial and parallel steps and even branching decisions.
You probably experience such processes every day. Take a simple skincare routine: cleanser, toner, serum, moisturizer, and sunscreen (SPF). These five steps are "linear": each step depends on the previous one finishing. But when you want to look even better, things get more complicated:
First, there are more steps: besides basic skincare there is also foundation, concealer, and finally setting powder;
Second, the steps do not all depend on each other's order: mascara, eyeliner, lip gloss, and contouring can be done in any order and adjusted to taste, and with several makeup artists they can proceed in parallel;
Third, some steps may loop, such as foundation. See the figure below:
Wait, doesn't this feel like a program? Exactly. The business abstraction of auto scaling can be compared with the makeup routine: auto scaling makes an architecture more dynamic and attractive and keeps the business forever young.
How complex are the steps of a scaling activity? Each scaling activity consists of a number of sub-steps. Each step typically involves at least one call to an external microservice or API and can be thought of as an asynchronous I/O task. Some steps run serially and some must run in parallel, and the call chain between them is adjusted dynamically according to runtime results.
The actual steps are several times as many as those shown in the figure, reaching hundreds of steps, and the complexity rises accordingly. Steps are divided into activity-level steps (the blue circles in the figure) and instance-level steps (the green circles). A simple calculation: if 100 scaling activities each add 100 cloud servers, then at the same time (or within a very short window), even in the most naive case shown in the figure, there are tens of thousands of I/O requests, and the actual number of steps is about two orders of magnitude larger. So the difficulty and complexity of implementing scaling activities, that is, of "executing tasks" in the auto scaling context, are of a general kind: background asynchronous task flows; multiple components or microservices calling each other; complex business-specific branching logic; multiple task states (exception, retry, cancel); a large amount of metadata; context isolation between users' tasks; and high-concurrency performance and stability.
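The back-of-the-envelope count above can be reproduced with a short asyncio sketch. Everything here is assumed for illustration: the step names, the choice of two activity-level steps and two instance-level steps, and the `call_api` stand-in, which only yields to the event loop instead of calling a real service.

```python
import asyncio

io_calls = 0  # number of simulated external API calls


async def call_api(name: str) -> str:
    """Stand-in for one asynchronous call to an external service."""
    global io_calls
    io_calls += 1
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return name


async def scale_out_instance(activity_id: int, index: int) -> None:
    # Instance-level steps: serial within one instance.
    await call_api("create_instance")
    await call_api("attach_to_load_balancer")


async def scaling_activity(activity_id: int, instance_count: int) -> None:
    await call_api("check_quota")  # activity-level step
    # Instance-level steps for different instances run in parallel.
    await asyncio.gather(
        *(scale_out_instance(activity_id, i) for i in range(instance_count))
    )
    await call_api("record_result")  # activity-level step


async def main(activities: int, instances: int) -> None:
    await asyncio.gather(
        *(scaling_activity(a, instances) for a in range(activities))
    )


asyncio.run(main(activities=100, instances=100))
print(io_calls)  # 100 * (2 + 2 * 100) = 20200
```

Even with only four kinds of step, 100 activities of 100 instances already produce about twenty thousand I/O calls, which matches the "tens of thousands" estimate; hundreds of real steps push the number up by the further two orders of magnitude mentioned in the text.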
If you have done backend development, do these situations and requirements sound familiar?
Think about it: are such tasks compute-bound or I/O-bound?
0x02 Framework design and details
To achieve a unified decision design, the steps are usually defined as a task flow (workflow). So what does a real flow look like? In the photo, the magnificent Joekulgilkvísl meanders through the highlands of Iceland like a braided snake, flowing between snow-covered rhyolite mountains on both sides:
It is not hard to see:
1. A real flow never runs backwards; it always moves forward, and there is no rollback;
2. A real flow is not a single line; it has tributaries, and the so-called mainstream is just the largest branch (the one the water most probably flows through);
3. Every point along a real flow's path is unique and has its own context. You cannot step into the same river twice.
To sum up: model a flow as a DAG (directed acyclic graph) with context. Do not build the underlying framework on concepts such as rollback/cleanup, retry, or exception. These more complex concepts can exist, but they belong at a higher level of abstraction.
We should also abandon simple linear thinking. Rather than success or failure, it matters more to focus on "where to go next" (Next) and "finishing cleanly" (Done). Success and failure each cover many cases; a failure, for example, can have many causes and consequences. Whether a task succeeds or fails is just one path in the DAG, with no essential difference. Black-or-white binary decisions are often immature and narrow in applicability.
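A minimal sketch of this idea follows, assuming a made-up three-step flow. The step names, the context keys, and the `run` helper are hypothetical and are not taken from the real framework. Each step reads and updates the context and returns only the name of the next step, or `None` when the flow is done; there is no success or failure flag anywhere.

```python
from typing import Callable, Dict, List, Optional

Context = Dict[str, object]
# A step updates the context and says where to go next (None means Done).
Step = Callable[[Context], Optional[str]]


def create_instances(ctx: Context) -> Optional[str]:
    ctx["created"] = ctx["wanted"] if ctx["quota_ok"] else 0
    # Both outcomes are just different edges of the same graph.
    return "attach_lb" if ctx["created"] else "notify"


def attach_lb(ctx: Context) -> Optional[str]:
    ctx["attached"] = ctx["created"]
    return "notify"


def notify(ctx: Context) -> Optional[str]:
    ctx["notified"] = True
    return None  # Done


STEPS: Dict[str, Step] = {
    "create_instances": create_instances,
    "attach_lb": attach_lb,
    "notify": notify,
}


def run(start: str, ctx: Context) -> List[str]:
    """Walk the graph from `start` and return the path that was taken."""
    path: List[str] = []
    step: Optional[str] = start
    while step is not None:
        if step in path:
            raise RuntimeError(f"step {step!r} visited twice: not a DAG")
        path.append(step)
        step = STEPS[step](ctx)
    return path


happy = run("create_instances", {"wanted": 3, "quota_ok": True})
short = run("create_instances", {"wanted": 3, "quota_ok": False})
```

In this sketch `happy` is the path through all three steps and `short` skips `attach_lb`; the engine treats both the same way, which is the point of thinking in terms of Next and Done.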
If you are used to the linear rollback mindset, a simple transformation reduces such a process to a DAG description. In the figure, S2 is the rollback/cleanup step for S1 and S4 is the rollback/cleanup step for S3; the left side shows the linear rollback model and the right side the isomorphic DAG it reduces to.
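One plausible reading of that reduction can be written as an edge table. The figure itself is not available here, so the exact edges below are an assumption: S1 and S3 are the forward steps, a failure of S3 leads to its cleanup S4 and then to S2, and a failure of S1 leads straight to S2. Failures of the cleanup steps themselves are not modeled.

```python
from typing import Dict, List, Tuple

DONE = "Done"

# (step, outcome) -> next step. Rollback is no longer a special mode:
# the cleanup steps S2 and S4 are ordinary nodes reached by ordinary edges.
EDGES: Dict[Tuple[str, str], str] = {
    ("S1", "ok"): "S3",
    ("S1", "failed"): "S2",   # clean up S1
    ("S3", "ok"): DONE,
    ("S3", "failed"): "S4",   # clean up S3 first ...
    ("S4", "ok"): "S2",       # ... then clean up S1
    ("S2", "ok"): DONE,
}


def walk(outcomes: Dict[str, str]) -> List[str]:
    """Return the path taken; steps missing from `outcomes` count as 'ok'."""
    path: List[str] = []
    step = "S1"
    while step != DONE:
        path.append(step)
        step = EDGES[(step, outcomes.get(step, "ok"))]
    return path


all_ok = walk({})
s3_failed = walk({"S3": "failed"})
s1_failed = walk({"S1": "failed"})
```

Every path moves forward and ends at Done, so the engine needs no notion of "going back": what the linear model calls rollback is simply another branch.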
With the unified decision design in place, let's look at flexible and efficient execution. One elegant way of organizing people is the model of an orchestra:
One conductor and several instrument sections (strings, woodwinds, brass, percussion) work together to deliver a harmonious performance.
For the orchestra as a whole, the conductor is the brain (CPU-bound) and the musicians in each section are the hands and feet (I/O-bound). The conductor, too, has his own brain and his own baton-waving hand. They are all actors, and they communicate with one another in an elegant way.
The conductor is the brain that sets the emotion and rhythm of the performance, yet he does not give orders directly to each musician. So what is the truly efficient organizing force behind it?
The figure below is a simplified diagram of the Auto Scaling backend architecture. From the API to the task scheduler, the timed task trigger, and the peripheral components, every part matters. The engine that actually executes scaling activities sits at the end: the Activator service component, shown in red. Note that it is also an MQ task consumer, not much different from the ordinary consumer components in your own business.
Zooming in on the Activator, we see an internal implementation that resembles a symphony orchestra. The part in red is the core of the strategy: the step definitions of scaling activities, that is, the workflow definition, which is the "score" of the whole activity. The Activator's core engine is responsible for executing scaling activities efficiently and managing their life cycle. An analogy:
Process = program + virtual machine (interpreter + runtime + library functions)
Scaling activity = step configuration + execution framework (core engine + handler functions)
Each execution unit is an asynchronous event handler in the Reactor model. An upper-layer "conductor" actor looks up the workflow step table, computes the next task from the results returned by the lower layer, and dispatches it to the lower layer; lower-layer actors communicate with the upper-layer "conductor" over a bus. The design is fractal and feels recursive. This fractal tree is the most natural and efficient form of organization; in theory it can extend to any number of levels, but in practice four levels or fewer are generally enough. In an implementation, the levels can map step by step onto coroutines, threads, processes, nodes, and microservices. The figure above is just a design example of a two-level execution engine within a single thread.
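A two-level version of this arrangement can be sketched in a single thread with asyncio. This is only an illustration under assumptions: the `STEP_TABLE` contents, the `conductor` and `worker` names, and the use of `asyncio.Queue` as the "bus" are all hypothetical, and the real engine is described as using eventfd rather than Python queues.

```python
import asyncio
from typing import Dict, List, Optional

# Hypothetical step table: step name -> next step (None means Done).
STEP_TABLE: Dict[str, Optional[str]] = {
    "create": "attach",
    "attach": "report",
    "report": None,
}


async def worker(inbox: asyncio.Queue, bus: asyncio.Queue) -> None:
    """Lower-level actor: performs one step and reports back on the bus."""
    while True:
        task = await inbox.get()
        if task is None:          # shutdown signal from the conductor
            break
        await asyncio.sleep(0)    # stand-in for the real asynchronous I/O
        await bus.put((task, "ok"))


async def conductor(first_step: str) -> List[str]:
    """Upper-level actor: consults the step table and dispatches tasks."""
    inbox: asyncio.Queue = asyncio.Queue()
    bus: asyncio.Queue = asyncio.Queue()
    worker_task = asyncio.create_task(worker(inbox, bus))

    executed: List[str] = []
    step: Optional[str] = first_step
    while step is not None:
        await inbox.put(step)                 # dispatch to the lower layer
        finished, _result = await bus.get()   # wait for the result event
        executed.append(finished)
        step = STEP_TABLE[finished]           # decide what comes next

    await inbox.put(None)
    await worker_task
    return executed


result = asyncio.run(conductor("create"))
```

The conductor never performs I/O itself and the worker never decides what comes next. Replacing `worker` with another conductor that owns its own workers would give the third level of the fractal tree.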
That concludes the system design practice for now. On performance, two implementation details make a qualitative difference.
Detail 1: the smallest execution units (atomic actors) notify each other through eventfd, which uses kernel interfaces and resources efficiently and delivers high performance. eventfd with epoll can sustain millions of concurrent events on a single node, which makes it well suited to scenarios with high event throughput.
Detail 2: transaction messages are implemented as copy-on-write; immutability saves space and keeps communication safe. Note that we recommend sharing memory by passing messages (message passing), not the other way around.
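As a rough illustration of copy-on-write messages, the sketch below uses a frozen dataclass. The `Message` fields and the `make_message` and `advance` helpers are invented for this example and do not describe the real message format. Changing one field produces a new message object while the read-only payload is shared rather than copied.

```python
from dataclasses import dataclass, replace
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class Message:
    """Immutable message passed between actors (hypothetical fields)."""
    activity_id: str
    step: str
    payload: Mapping[str, object]  # read-only view of the metadata


def make_message(activity_id: str, step: str, payload: dict) -> Message:
    # Copy once at creation, then expose only a read-only view.
    return Message(activity_id, step, MappingProxyType(dict(payload)))


def advance(msg: Message, next_step: str) -> Message:
    # "Copy on write": a new small header, with the payload shared as-is.
    return replace(msg, step=next_step)


m1 = make_message("act-1", "create", {"instances": 3})
m2 = advance(m1, "attach")
```

Here `m1` is untouched after `advance`, so an actor still holding it sees a consistent view, and `m2.payload` is the same object as `m1.payload`, so no metadata is duplicated.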
0x03 Technical thinking and methodology summary
The key point is undoubtedly the separation of mechanism from policy. It is the main prerequisite and an effective method for reconciling unified strategy with flexible execution. It is also one of the most important concepts in operating systems, so technical readers should find it familiar. For backend business logic, the core problems tend to fall into two broad categories:
1. Resource management: large in scale but static; it can be handled through the latter (task management)
2. Task management: the difficulty lies in execution scheduling, complex life-cycle states, high concurrency, and dynamic decisions
For unified decision-making, which requires flexible design and assembly, reliability, and controllability, we orchestrate DAG task flows at the policy layer;
For flexible execution, which requires high performance and customization, we implement the execution layer as layer-by-layer fractal Reactor models.
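The separation can be sketched with a deliberately simple scaling example. The proportional formula in `desired_capacity` is an assumption chosen for illustration, not Tencent Cloud's actual algorithm, and the `Cluster` class only manipulates a list instead of calling a cloud API. The policy is a pure function that decides how many instances there should be; the mechanism knows how to add and remove instances but not why.

```python
import math
from typing import List


def desired_capacity(current: int, avg_cpu: float, target_cpu: float,
                     min_size: int, max_size: int) -> int:
    """Policy: a pure decision with no I/O (illustrative formula only)."""
    if current == 0:
        return min_size
    wanted = math.ceil(current * avg_cpu / target_cpu)
    return max(min_size, min(max_size, wanted))


class Cluster:
    """Mechanism: knows how to add or remove instances, not when to."""

    def __init__(self, size: int) -> None:
        self.instances: List[str] = [f"ins-{i}" for i in range(size)]
        self._next_id = size

    def resize(self, target: int) -> None:
        while len(self.instances) < target:
            self.instances.append(f"ins-{self._next_id}")  # would call an API
            self._next_id += 1
        while len(self.instances) > target:
            self.instances.pop()                           # would call an API


cluster = Cluster(4)
target = desired_capacity(len(cluster.instances), avg_cpu=90.0,
                          target_cpu=60.0, min_size=2, max_size=10)
cluster.resize(target)
```

Swapping in a different policy, such as a schedule-based one, requires no change to `Cluster`, and tuning how instances are created requires no change to the policy.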
System design can be optimized along this line: turn the requirement into a computation problem, then into an I/O problem, then into a problem of compositional design, and finally into a strategy problem. The system becomes more and more valuable as it meets people's needs.
0x04 Summary
The design points and methodologies described in this article have been applied in Tencent Cloud Auto Scaling, Lighthouse (lightweight application servers), and other products. They have effectively supported the rapid iteration of these core businesses and improved development efficiency, and, more importantly, they have made our daily coding happier. If you are interested, you are welcome to join the discussion.
0x05 References
The content of this article comes from the live talk "Exploration of Engineering Optimization Techniques for Tencent Cloud Auto Scaling" given at the 13th China System Architect Conference (SACC 2021)
Extensible task flow framework implementation
Linux eventfd Principle application
Incredible images reveal the stunning beauty of braided rivers
About the author: Crooked stream joined Tencent in 2015 and focuses on the technical exploration of cloud service products. Has been responsible for CVM, Auto Scaling, Lighthouse, GPU, and other products. Long-term interests include high-performance cluster management, distributed task scheduling systems, and full-stack web development.
To be continued , Coming soon ...