An Exploration of Engineering Optimization for Auto Scaling

2022-06-24 07:41:00 Crooked stream

This article takes you through the key technical design points of Tencent Cloud's elastic computing products.

To some extent, one saying applies to every walk of life: those who talk don't know, and those who know don't talk.

But I believe you will become an expert who knows the principles, can get things done, and is willing to share.

0x00 Background of elastic computing

The bottom layer of cloud computing is inseparable from virtualization technology. Virtualization gives people peace of mind because it provides both security isolation and efficient use of resources. Operating a virtual machine is like swimming in a lane: the lane lines block the waves, so we need not care what stroke the swimmer in the next lane is using or how much they splash, and we always feel free, as if we had the whole pool to ourselves.

Lane lines: buoy-based swimming pool "virtualization", put into practice in the water

Back to the main character of this article. Tencent Cloud Auto Scaling (AutoScaling) is the core managed service among cloud computing products; it efficiently manages the scale-out and scale-in activities of cloud server clusters on behalf of users. All of this depends on close cooperation with the surrounding products. At the lower layer, Tencent Cloud's KVM-based virtualization technology handles virtual machine management on a single physical machine. On top of virtualization, Tencent Cloud's VStation scheduling system handles core issues such as managing large-scale distributed physical machine clusters, scheduling virtual machine resources and tasks, and live migration, and it directly supports the implementation of the cloud server product (CVM). CVM in turn is the foundation of other computing products, including Auto Scaling, lightweight application servers (Lighthouse), and batch computing (Batch). Auto Scaling integrates the core capabilities of cloud servers, load balancing (CLB), cloud monitoring, and other services, so that users can flexibly manage cloud server clusters through the cloud API or the console, and together these services safeguard the dynamic scaling of users' businesses. Auto Scaling is also the underlying support for Tencent Cloud's container service (TKE). The relationships among these surrounding services are shown in the figure below:

Tencent Cloud Auto Scaling (AutoScaling) and the related computing products around it

As the distributed scheduling system underneath CVM, VStation performed much better at resource management than the OpenStack of that time. There are several reasons, and the most important is how its components communicate. Both systems use message queues, but OpenStack uses the queue to implement RPC: communication between components is inefficient, and it brings complex management tasks such as service registration and discovery, as shown on the right. VStation's component communication is shown on the left: the message queue acts as a message bus, and each microservice component only needs to talk to the same MQ, without being aware of the other components' existence.

Rather than calling VStation a unique innovation, it is fairer to say that VStation's authors understood message queues more deeply and grasped their essence. Later Tencent Cloud products have borrowed this message bus model in their designs to varying degrees.

VStation's message bus vs. early OpenStack's all-to-all communication
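To make the message bus idea concrete, here is a minimal in-process sketch. It is not VStation's implementation: the `MessageBus` class, the topic names, and the three components are hypothetical, and a real deployment would use an external MQ. The point it illustrates is that each component only subscribes and publishes to the bus and never holds a reference to another component.

```python
from collections import defaultdict, deque


class MessageBus:
    """Toy in-process bus: components publish to topics and never address each other."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self._queue = deque()

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, payload):
        self._queue.append((topic, payload))

    def run(self):
        """Deliver queued messages until the bus is idle; return the delivery count."""
        delivered = 0
        while self._queue:
            topic, payload = self._queue.popleft()
            for handler in self._subscribers[topic]:
                handler(payload)
                delivered += 1
        return delivered


bus = MessageBus()
log = []


# Hypothetical components: each one knows only the bus and its own topics.
def scheduler(msg):
    log.append(("scheduler", msg["vm"]))
    bus.publish("vm.place", {"vm": msg["vm"], "host": "host-1"})


def host_agent(msg):
    log.append(("host_agent", msg["vm"], msg["host"]))
    bus.publish("vm.created", {"vm": msg["vm"]})


def api_frontend(msg):
    log.append(("api_frontend", msg["vm"]))


bus.subscribe("vm.create", scheduler)
bus.subscribe("vm.place", host_agent)
bus.subscribe("vm.created", api_frontend)
bus.publish("vm.create", {"vm": "vm-42"})
delivered = bus.run()
```

Adding a fourth component here means one more `subscribe` call, whereas in a fully meshed design with n components the number of pairwise links grows as n(n-1)/2.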

Auto Scaling (AutoScaling) is a core managed service for cloud computing. Whether it keeps running stably and efficiently directly reflects the technical professionalism of a cloud provider, and directly affects users' trust in the provider's products. The AS introduction on the official website can be summarized as one typical scenario, peak shaving and valley filling: dynamically adjust the size of resources (mainly cloud server clusters) according to business load, optimizing resource cost without sacrificing the high availability (High Availability) of the business. In practice, Auto Scaling can also greatly improve the efficiency of day-to-day cluster management. So when should you use AS? The answer is:

"If your business needs more than one cloud server, you should consider using AS." (Crooked stream)

AS introduction on the official website: peak shaving and valley filling

Auto Scaling (AutoScaling) mainly addresses horizontal scaling of computing resources (Scale-out), that is, dynamically adjusting the service capacity of a system by increasing or decreasing the number of cloud server instances. Take scale-out as an example. When more guests arrive at a banquet, one more dish is served in each category (hot dishes, desserts, and so on), and when even more arrive, a whole identical table is added, rather than making every dish an "extra-large portion". In other words, efficiency improves and overall cost falls through instance replication and scale.

Every component of the business architecture needs to be catered for, just like at a banquet
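As a rough illustration of peak shaving and valley filling through scale-out, the sketch below computes how many identical instances a given load would need. The function name, the per-instance capacity of 500 requests per second, the group size limits, and the sample load values are all assumptions made up for this example; they are not Tencent Cloud parameters.

```python
import math


def desired_instances(load_qps, qps_per_instance, min_size, max_size):
    """Number of identical instances needed for the load, clamped to the group limits."""
    needed = math.ceil(load_qps / qps_per_instance)
    return max(min_size, min(max_size, needed))


# Hypothetical load samples over a day (requests per second).
hourly_load = [200, 150, 900, 2400, 1300, 300]
plan = [desired_instances(q, qps_per_instance=500, min_size=1, max_size=4)
        for q in hourly_load]
```

The capacity follows the load up and down by adding or removing whole instances, and the `max_size` clamp caps the peak at four instances even when the raw calculation asks for five.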

In the real world, business architecture keeps evolving as the business grows.

From minimal validation with a single service in the first phase:

To multiple clients and multiple backend services, with static and dynamic content split:

And then to more complex microservices:

Tencent Cloud's cloud server (CVM) and Auto Scaling (AutoScaling) accompany every stage of a user's business growth and witness the rapid changes in their business and architecture. Of course, the Auto Scaling (AutoScaling) service itself is also a fast-growing business:

Partial statistics of Auto Scaling's scaling activities in its early days on Tencent Cloud (2018-2019)

As a computing product that provides a foundational service, Auto Scaling (AutoScaling) has grown rapidly since its launch. Typical major customers include Hyperparameters, Xiaohongshu (Little Red Book), VIPKid, Zuoyebang (Homework Help), Mango TV, Lizhi Weike (Litchi Micro Class), Supercell, and Epic Games, spanning many industries. Its user count and number of scaling activities keep growing, and it has stably supported cluster resource management at the scale of millions of cores for users.

So what essential problems does Auto Scaling (AutoScaling) solve for users? The scale cube (with service, replica, and data-sharding dimensions) from the book The Art of Scalability is widely known. However, the common methodology and design patterns mentioned in the book, shown in the figure on the left, are actually more instructive. The trade-off among three pursuits, namely stability (high availability), fast launch (Time-to-Market), and cost saving, is the real art in engineering, and designing for horizontal scaling (Scale Out) is one of the few approaches that addresses all three at once (the other two are asynchronous design and automation). Auto Scaling helps our business be stable, fast, and economical.

Tips: In the real, objective world there is no such thing as synchronization, only asynchrony; there are no updates, only rebuilding after tearing down. So-called "synchronous updates" are just a human illusion. Try to embrace event-driven design and immutability in your systems, and life will be much simpler.

Elasticity: as the extension, so the force. Elasticity is the ability of an object to deform under an external force and to return to its original shape when the force is released. A solid object deforms when an external force acts on it; if the material is elastic, the object returns to its original shape once the force is removed. According to the generalized Hooke's law, stress is directly proportional to strain.

The key problem that Auto Scaling, and indeed the whole of cloud services, has to solve is keeping the business alive and stable so that it continues to generate social value. Two core ideas run through the entire implementation:

First, prepare for a rainy day: design for failure, and nothing will fail.

Second, unified decision-making with flexible execution: top-level directives are carried out level by level, in the manner of "administrative subcontracting". The focus is a broad separation of mechanism (mechanism) and strategy (strategy), where strategy can be a decision, policy, design, schedule, or algorithm, and mechanism can be an execution, computation, I/O, task implementation, or service.

0x01 Auto Scaling bottlenecks and solution analysis

Scaling activities are the "collagen" that keeps the business architecture elastic. They include scale-out, scale-in, replacement of unhealthy instances, and so on. The core of Auto Scaling is the design and implementation of scaling activities and the management of their life cycle.

Any work whose quality must be guaranteed needs an established routine (Routine). A routine is usually a simple linear sequence of steps (Step), but more complex work requires a workflow (Workflow), with serial and parallel execution and even branching decisions.

In fact, you probably experience such processes every day. Take a simple skincare routine: cleanser, toner, serum, moisturizer, and sunscreen (SPF). These five steps are "linear": each depends on the previous one being finished. But when you want to go further and enhance your charm, things get complicated:

First, there are more steps. Besides basic skincare, you also need foundation, concealer, and finally setting powder.

Second, the steps no longer depend strictly on one another's order. Eyelashes, eyeliner, lip gloss, and contouring can be done in any order and adjusted to taste, and with several makeup artists they can proceed in parallel.

Third, some steps may loop, such as foundation. See the figure below:

Serial and parallel steps in the makeup "program"

Wait, doesn't that feel like a program? It does. The business abstraction of Auto Scaling can be compared to the makeup process, and using Auto Scaling can likewise make our architecture more dynamic and attractive, keeping our business forever young.

How complex are the steps of a scaling activity? Each scaling activity can be seen as a set of sub-steps (Step). Each Step usually involves at least one call to an external microservice or API, so it can be understood simply as an asynchronous I/O task. Some steps run serially and some necessarily run in parallel, and the call links between them are adjusted dynamically according to runtime results.

Step configuration of a scaling activity (simplified diagram)

Of course, the actual number of steps is already several times what the figure shows, reaching hundreds, and the complexity rises accordingly. Steps are divided into activity-level steps (the blue circles in the figure) and instance-level steps (the green circles in the figure). A quick calculation: with 100 scaling activities, each scaling out 100 cloud servers, then at the same moment (or within a very short time) there are tens of thousands of I/O requests even in the "most naive" case shown in the figure, and the actual number of steps is two orders of magnitude larger. A scaling activity is therefore what "executing a task" means in the Auto Scaling context, and its difficulty and complexity are of a general kind: asynchronous background task flows, calls among multiple components and microservices, complex business-specific branching logic, multiple task states (exception, retry, cancel), large amounts of metadata, context isolation between users' tasks, and high-concurrency performance and stability.
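The estimate above can be reproduced with a small asyncio sketch. It assumes, purely for illustration, two activity-level steps and three instance-level steps per scaling activity; the step names and the `call_service` stand-in are hypothetical, and the real step tables are far larger. Instance-level chains run serially per instance, while instances and activities run concurrently.

```python
import asyncio

io_calls = 0  # counts simulated external calls


async def call_service(name):
    """Stand-in for one external microservice or API call (hypothetical)."""
    global io_calls
    io_calls += 1
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return name


async def scale_out_instance(instance_id):
    # Instance-level steps run serially for a single instance.
    for step in ("create_instance", "wait_running", "attach_load_balancer"):
        await call_service(f"{step}:{instance_id}")


async def scaling_activity(activity_id, instance_count):
    await call_service(f"check_quota:{activity_id}")  # activity-level step
    # Instances are independent, so their step chains run concurrently.
    await asyncio.gather(
        *(scale_out_instance(f"{activity_id}-{i}") for i in range(instance_count))
    )
    await call_service(f"finish:{activity_id}")  # activity-level step


async def main(activities, instances):
    await asyncio.gather(*(scaling_activity(a, instances) for a in range(activities)))


asyncio.run(main(activities=100, instances=100))
```

Under these assumptions the total is 100 × (2 + 100 × 3) = 30,200 simulated I/O calls, which matches the "tens of thousands" order of magnitude even with only five step types.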

If you have done backend business development, do these situations and requirements feel familiar?

The sub-steps of a single scaling activity link together like the segments of a necklace

Think about it: are such tasks compute-bound (Compute Bound) or I/O-bound (I/O Bound)?

0x02 Framework design and details

Unified decision design is usually achieved by defining steps in a task flow (WorkFlow). So what does a real flow look like? In the photo, the grand Joekulgilkvísl winds through the Icelandic highlands like a braided snake, flowing between snow-covered rhyolite mountains on both sides:

The Joekulgilkvísl river in an Icelandic canyon, a real-world flow

It's not hard to see:

1. A real flow never runs against the current. It always moves forward; there is no rollback.

2. A real flow is not a single line. There are always several tributaries, and the so-called main stream is simply the largest one (the one most likely to be taken).

3. Every point along a real flow's path is unique and has its own context. You cannot step into the same river twice.

In summary: abstract a flow as a DAG (directed acyclic graph) with Context. Do not build the underlying framework on concepts such as rollback/cleanup, retry, or exception. These complex concepts may exist, but they belong at a higher level of abstraction.

In addition, we should abandon simple, straight-line thinking. Rather than success (Success) or failure (Failure), it matters more to focus on "where to go next" (Next) and "finishing cleanly" (Done). Success and failure each come in many varieties: a failure, for example, can have many causes and consequences. Whether a task succeeds or fails is just one path (Path) through the DAG, and there is no essential difference between the two. Black-or-white binary decisions are often immature and narrow in their applicability.
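A minimal sketch of the "DAG with Context" idea follows. It is one possible reading of the text, not the author's framework: the `run_flow` function, the step names, and the context keys are hypothetical. Each step handler receives the shared context and returns only the name of the next step or `DONE`; there is no success/failure flag and no rollback primitive in the engine.

```python
DONE = "DONE"


def run_flow(steps, start, context):
    """Walk a DAG of steps: each handler returns the next step name, or DONE."""
    current, path = start, []
    while current != DONE:
        if current in path:
            # A DAG never revisits a node, so a repeat means a bad definition.
            raise ValueError(f"cycle detected at {current!r}")
        path.append(current)
        current = steps[current](context)
    return path


# Hypothetical steps for a scale-out activity.
def create_instances(ctx):
    ctx["created"] = min(ctx["desired"], ctx["quota"])
    # A shortfall is not an "exception": it is simply another branch of the graph.
    return "attach_lb" if ctx["created"] == ctx["desired"] else "release_partial"


def attach_lb(ctx):
    ctx["in_service"] = ctx["created"]
    return DONE


def release_partial(ctx):
    ctx["released"] = ctx.pop("created")
    return DONE


STEPS = {
    "create_instances": create_instances,
    "attach_lb": attach_lb,
    "release_partial": release_partial,
}

ok_ctx = {"desired": 3, "quota": 10}
ok_path = run_flow(STEPS, "create_instances", ok_ctx)

short_ctx = {"desired": 3, "quota": 2}
short_path = run_flow(STEPS, "create_instances", short_ctx)
```

Both runs end at `DONE`; the only difference between the "successful" and the "failed" activity is which path through the graph was taken and what the context holds at the end.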

Of course, if you are used to the linear rollback mindset, a simple transformation reduces (Reduce) such a process to a DAG description. In the figure, S2 is the rollback/cleanup step for S1, and S4 is the rollback/cleanup step for S3; the linear rollback model is on the left, and the reduced, isomorphic DAG is on the right.

Describing a business process with a DAG
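The reduction can be sketched in a few lines. The figure itself is not available here, so the exact edges are an assumption: a failed step goes to its own cleanup step, and each cleanup step continues with the cleanup of the previous step. The function names `reduce_to_dag` and `walk` are hypothetical.

```python
def reduce_to_dag(pairs):
    """Turn [(step, cleanup), ...] into {node: {outcome: next_node}} edges."""
    edges = {}
    for i, (step, cleanup) in enumerate(pairs):
        next_step = pairs[i + 1][0] if i + 1 < len(pairs) else "Done"
        edges[step] = {"next": next_step, "fail": cleanup}
        # After cleaning up this step, continue with the previous step's cleanup.
        edges[cleanup] = {"next": pairs[i - 1][1] if i > 0 else "Done"}
    return edges


def walk(edges, start, failing=()):
    """Follow the edges from start; steps listed in `failing` take their fail edge."""
    node, path = start, []
    while node != "Done":
        path.append(node)
        outcome = "fail" if node in failing else "next"
        node = edges[node][outcome]
    return path


# S2 cleans up S1, S4 cleans up S3, as in the text.
dag = reduce_to_dag([("S1", "S2"), ("S3", "S4")])

happy_path = walk(dag, "S1")
s3_fails = walk(dag, "S1", failing={"S3"})
s1_fails = walk(dag, "S1", failing={"S1"})
```

What used to be "rollback" is now just forward movement along other edges: when S3 fails the walk proceeds S1, S3, S4, S2 and then finishes.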

With unified decision design covered, let's turn to flexible and efficient execution. The scene below shows an elegant way of organizing people, the orchestra model:

Symphony orchestra

One conductor and several instrument sections (strings, woodwinds, brass, percussion) work together to deliver a harmonious performance.

For the orchestra as a whole, the conductor is the brain (CPU Bound) and the musicians in each section are the hands and feet (I/O Bound). The conductor, in turn, has his own brain and his own baton-waving hands. All of them are Actors, and they communicate with one another in an elegant way.

The conductor looks like the brain that sets the emotional rhythm of the performance, yet he does not give orders directly to each musician. So what is the truly efficient organizing force behind it?

Who is really in control: the conductor vs. the score

The figure below is a simplified diagram of the Auto Scaling (AutoScaling) backend architecture. From the API to the task scheduler, the timed task trigger, and the peripheral components, every part matters. The engine that actually executes scaling activities sits at the very end: the Activator service component shown in red. Note that it is also just an MQ task consumer, not much different from the ordinary consumer components in your own business.

Schematic of the Auto Scaling backend service architecture

Zooming in on the Activator, we see an internal implementation that resembles a symphony orchestra. The red part is the core of the strategy: the step definitions of a scaling activity, that is, the WorkFlow definition, which is also the "score" of the whole activity. The Activator's core engine is responsible for executing scaling activities efficiently and managing their life cycle. An analogy:

Process = program + virtual machine (interpreter + runtime + library functions)

Scaling activity = step configuration + execution framework (core engine + handler functions)

Design of the core computing engine for step orchestration

Each execution unit is an asynchronous event handler in the Reactor model. The upper-level "conductor" Actor looks up the workflow step table, computes the next task from the results returned by the lower level, and dispatches it downward; the lower-level Actors communicate with the upper-level "conductor" unit through the bus. There is a fractal (Fractal) design consideration here, which feels like recursion: this fractal tree structure is the most natural and efficient form of organization. Although in theory it can extend to any number of levels, in practice four levels or fewer are generally enough. In an implementation, the levels can be realized step by step from coroutines to threads, processes, nodes, and microservices. The figure above is just a design example of a two-level fractal execution engine within a single thread.

Design inspiration: finite automata, the Actor model, and fractal theory
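The two-level arrangement described above can be sketched as a single-threaded event loop. Everything here is a simplified assumption rather than the Activator's code: the `Conductor` and `Worker` classes, the `STEP_TABLE` contents, and the use of a plain deque as the bus. The conductor only reads results and consults the step table (the "score"); the worker only executes tasks and reports back.

```python
from collections import deque

# Hypothetical workflow definition: (step, result) -> next step, None means finished.
STEP_TABLE = {
    ("start", "ok"): "create_instance",
    ("create_instance", "ok"): "attach_lb",
    ("create_instance", "error"): "report_failure",
    ("attach_lb", "ok"): None,
    ("report_failure", "ok"): None,
}


class Conductor:
    """Upper-level actor: reads results, consults the step table, dispatches tasks."""

    def __init__(self, bus, step_table):
        self.bus, self.step_table, self.finished = bus, step_table, False

    def on_result(self, step, result):
        next_step = self.step_table[(step, result)]
        if next_step is None:
            self.finished = True
        else:
            self.bus.append(("task", next_step))


class Worker:
    """Lower-level actor: executes one task and reports the result on the bus."""

    def __init__(self, bus, handlers):
        self.bus, self.handlers, self.executed = bus, handlers, []

    def on_task(self, step):
        self.executed.append(step)
        self.bus.append(("result", step, self.handlers[step]()))


def run(step_table, handlers):
    bus = deque([("result", "start", "ok")])  # seed event that starts the activity
    conductor, worker = Conductor(bus, step_table), Worker(bus, handlers)
    while bus:  # single-threaded reactor loop
        kind, *payload = bus.popleft()
        if kind == "result":
            conductor.on_result(*payload)
        else:
            worker.on_task(*payload)
    return worker.executed, conductor.finished


ok_handlers = {"create_instance": lambda: "ok", "attach_lb": lambda: "ok",
               "report_failure": lambda: "ok"}
bad_handlers = dict(ok_handlers, create_instance=lambda: "error")

happy = run(STEP_TABLE, ok_handlers)
failed = run(STEP_TABLE, bad_handlers)
```

Changing the behavior of an activity means editing `STEP_TABLE`, not the actors, which is the strategy and mechanism separation the orchestra analogy points at.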

That concludes the introduction to the system design practice. As for performance, here are two more implementation details that made a qualitative difference.

Detail 1: the smallest execution units (atomic Actors) notify one another through eventfd, which makes efficient use of kernel interfaces and resources and ensures high performance. eventfd combined with epoll can sustain millions of concurrent events on a single node, which suits scenarios with high event throughput like this one.

The smallest fractal Actor unit, implemented efficiently on Linux eventfd and epoll
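The sketch below shows the eventfd plus epoll notification pattern using Python's standard library wrappers (`os.eventfd` requires Python 3.10 or later on Linux, so the demo is skipped elsewhere). It only demonstrates the kernel primitive; how the original engine wires eventfd into its Actors is not shown in the article, and the function name is hypothetical.

```python
import os
import select


def demo_eventfd_notifications(n_events=3):
    """Signal n_events through one eventfd and observe them with epoll."""
    efd = os.eventfd(0, os.EFD_NONBLOCK | os.EFD_CLOEXEC)
    ep = select.epoll()
    try:
        ep.register(efd, select.EPOLLIN)
        for _ in range(n_events):
            os.eventfd_write(efd, 1)  # each write adds to the kernel counter
        ready = ep.poll(timeout=1)  # one readiness event, however many writes
        total = 0
        for fd, mask in ready:
            if fd == efd and mask & select.EPOLLIN:
                # Without EFD_SEMAPHORE, a read returns the sum and resets the counter.
                total += os.eventfd_read(efd)
        return len(ready), total
    finally:
        ep.close()
        os.close(efd)


# Only run where the kernel interfaces are available (Linux, Python 3.10+).
HAVE_EVENTFD = hasattr(os, "eventfd") and hasattr(select, "epoll")
result = demo_eventfd_notifications() if HAVE_EVENTFD else None
```

Several notifications coalesce into a single wake-up: the counter accumulates in the kernel, and one read drains it, which is part of why the primitive is cheap under heavy event throughput.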

Detail 2: transaction messages are implemented with Copy-on-Write, using immutability (Immutability) to save space and keep communication safe. Note that we recommend sharing memory by message passing (Message Passing), not the other way around.

Reference implementation of CoW transaction messages: sharing memory through message passing
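One way to picture copy-on-write messages is with an immutable record that is never modified in place. This is an illustrative sketch, not the reference implementation mentioned in the caption: the `Message` fields and the identifiers in it are hypothetical. A "write" produces a new message, and fields that did not change are shared by reference instead of being copied.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Message:
    activity_id: str
    step: str
    instance_ids: tuple  # immutable payload, safe to share between actors
    attempt: int = 0


original = Message("activity-001", "create_instance", ("ins-a", "ins-b"))

# "Write" by copying: only the changed field differs, the tuple itself is shared.
forwarded = replace(original, step="attach_lb")

# In-place mutation is rejected, so a receiver can never corrupt the sender's view.
try:
    original.step = "mutated"
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Because no actor can change a message after sending it, the same payload object can be handed to many receivers without locks, which is the sense in which memory is shared through message passing.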

0x03 Technical thinking and methodology summary

The key point is undoubtedly the separation of mechanism and policy (Separation of mechanism and policy). It is the most important premise and the most effective method for resolving the tension between unified strategy and flexible execution. It is also one of the most important concepts in operating systems, so technical readers should not find it unfamiliar. For backend business logic, the core problems tend to fall into two broad categories:

1. Resource management: characterized by large scale and static state; it can be solved by way of the latter.

2. Task management: the difficulty lies in execution scheduling, complex life-cycle states, high concurrency, and dynamic decisions.

For unified decision-making, which calls for flexible design and assembly together with reliability and control, we use DAG-based task flow orchestration in the policy layer.

For flexible execution, which calls for high performance and customization, we build the execution layer level by level with fractal Reactors.

Process strategy separated from execution mechanism

System design can be optimized along this line: turn the requirement problem into a computation problem, then into an I/O problem, then into a combinatorial design problem, and finally into a strategy problem. Our system then becomes more and more valuable because it meets people's needs.

0x04 Summary

The design points and methodology described in this article have been put into practice in Tencent Cloud Auto Scaling (AutoScaling), lightweight application servers (Lighthouse), and other products. They have not only supported the rapid iteration of these core businesses and improved development efficiency; more importantly, they have made us happier writing code every day. If you are interested, you are welcome to join the discussion.

0x05 References

The content of this article comes from the live talk "Exploring Engineering Optimization Techniques for Tencent Cloud Auto Scaling" given at the 13th China System Architect Conference (SACC 2021)

An extensible task flow framework implementation

Linux eventfd: principles and applications

Incredible images reveal the stunning beauty of braided rivers

About the author: Crooked stream joined Tencent in 2015 and focuses on the technical exploration of cloud service products. The author is responsible for cloud servers, Auto Scaling, lightweight application servers, GPU, and other products, with a long-term focus on high-performance cluster management, distributed task scheduling systems, and full-stack Web development.

To be continued. Coming soon...
