Ways to improve resource utilization on openEuler 01: Introduction

2022-07-07 19:52:00 openEuler

Background

According to a report released by Canalys [1], global spending on cloud infrastructure services grew 34% year on year in the first quarter of 2022, reaching US$55.9 billion. However, several studies have shown that the average CPU utilization of user clusters in data centers worldwide is currently below 20%, which is a huge waste of resources. Improving data center resource utilization is therefore an important problem that urgently needs to be solved [2].

The cause of the problem

The main cause of low resource utilization is a mismatch between tasks and resource allocation. This mismatch takes many forms, for example:

  1. Scheduling systems are siloed per cluster: different jobs use different scheduling systems, so jobs cannot move across a wider pool of clusters and idle resources in other clusters cannot be used effectively.

  2. Lack of diversity in task types: jobs within a cluster are highly homogeneous and concentrate on a subset of resources, so that subset is heavily utilized while the remaining resources sit idle.

  3. Lack of priority-tiered management: either there are no low-priority jobs to fill idle resources, or low-priority jobs exist but the cluster has no tiered control capability, which leads to over-allocation of resources.

  4. Uniform resource types within the cluster: the resources in a cluster share a single overall specification and cannot be scaled flexibly to follow the workload's changing demand for each resource type, which leads to over-allocation of some resources.

Overall, low utilization results from a lack of diversity in tasks and resources within the cluster, combined with the scheduler's weak ability to manage diverse tasks and resources.

Solutions

Co-locate different types of jobs to improve resource utilization in both space and time.

  1. Resource overcommitment (overcommitment in space): idle resources of online services are oversold to offline jobs, improving overall resource utilization.
  2. Off-peak usage (overcommitment in time): idle periods of online services are filled with offline jobs, reducing idle resources.
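
The two ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` fields, the `safety_margin` headroom, and the `idle_threshold` value are hypothetical and are not taken from the openEuler implementation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Node:
    capacity_cores: float          # physical CPU cores on the node
    online_usage_cores: float      # cores currently used by online services
    safety_margin: float = 0.1     # hypothetical fraction of capacity kept free for online bursts


def reclaimable_cores(node: Node) -> float:
    """Overcommitment in space: cores that could be lent to offline jobs right now."""
    reserved = node.capacity_cores * node.safety_margin
    return max(0.0, node.capacity_cores - node.online_usage_cores - reserved)


def offline_windows(hourly_online_util: list[float], idle_threshold: float = 0.3) -> list[int]:
    """Overcommitment in time: hours in which online utilization is below the threshold."""
    return [hour for hour, util in enumerate(hourly_online_util) if util < idle_threshold]


if __name__ == "__main__":
    node = Node(capacity_cores=64, online_usage_cores=16, safety_margin=0.25)
    print(reclaimable_cores(node))               # 32.0
    print(offline_windows([0.1, 0.5, 0.2]))      # [0, 2]
```

A real system would also have to take the lent resources back when online load rises, which is where the QoS challenges discussed in the next section come from.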

Technical challenges

Whether resources are overcommitted in space or in time, there may not be enough of them when workloads peak at the same time, and this can damage the quality of service (QoS) of some services. Improving resource utilization while keeping service QoS intact is the key technical challenge.

In addition, the diversity and complexity of cloud services make it even harder to guarantee service quality:

On the one hand, by how observable their load characteristics are, applications can be divided into white-box, black-box, and gray-box applications. White-box applications are observable by the system, which can obtain their QoS metrics in real time. Black-box applications are not observable by the system, which does not even know what their QoS metric is. Applications whose observability lies between the two are called gray-box applications. Accurately quantifying the service quality of black-box services and locating interference sources is the technical challenge in generalizing these capabilities, and it is also an active research topic in the industry.

On the other hand, by business complexity, workloads can be divided into lightweight applications (such as microservices and function computing), traditional applications (such as monolithic applications), and very large applications (such as HPC/AI). Technical problems such as full-stack collaborative awareness must be overcome to build a universal, unified system.

Solution overview

Following the cause analysis above, deploying and scheduling diverse services/workloads together with diverse resources can significantly improve the flexibility of resource allocation and thereby improve resource utilization. It also brings greater technical challenges: the more services/workloads and resource types are managed, the more complex the dependencies become, and the more complex the system's multi-objective optimization requirements. On this basis, we divide the evolution into the following stages:

L0: Independent deployment: each cluster has its own technology stack and its own resource pool; cluster utilization is low (<20%).

L1: Shared deployment: a unified technology stack enlarges the cluster, services of a single type share resources, and utilization is improved through dynamic elasticity; cluster resource utilization is still fairly low (<30%).

  • Related technologies: technology stack unification, containerization, elastic scaling

L2:「Hybrid deployment」: a unified technology stack enlarges the cluster, services of multiple types share resources, and utilization is improved through overcommitment and isolation technology; cluster resource utilization is high (>40%).

  • Related technologies: resource overcommitment, tiered resource isolation, feedback control

L3:「Generalized hybrid deployment」: the business types supported by hybrid deployment are generalized, so that thousands of black-box services on the public cloud can share resources; QoS-quantified awareness guarantees the service quality of key services.

  • Related technologies: QoS quantification/localization, precise control, QoS-aware scheduling

L4:「Converged deployment」: on top of generalized load types, diverse workloads such as containers, virtual machines, and lightweight runtimes are converged, and complex scenarios such as HPC/AI plus heterogeneous resource awareness are covered, comprehensively improving the overall utilization of all resource types.

  • Related technologies: heterogeneous-resource-aware scheduling, unified scheduling

Of these, L1 and L2 focus mainly on improving cluster CPU utilization, while L3 and L4 generalize the utilization-improvement techniques.

The industry's L2-level exploration on internal workloads has significantly improved the overall utilization of clusters and even whole data centers, but generalization to the public cloud is still at an early stage and is not yet in commercial use.

Following the trend toward future generalized and converged deployment, we have built an evolvable solution for improving resource utilization, as shown in the figure below:

To achieve the best co-location results, control and optimization are needed at several levels of task execution:

「Cluster management」: at the scheduling layer, services that interfere strongly with each other's performance are placed separately, and task combinations are optimized to reduce unnecessary interference.

「Single-node (standalone) management」: the node-level manager detects resource contention in real time and eliminates its impact on key services.

「Resource isolation layer」: tasks are graded by priority and controlled accordingly, guaranteeing the resource needs of high-priority tasks.

Based on the above framework, Huawei has implemented an L2-level solution. The relevant features have been verified inside Huawei and are being rolled out progressively. Important breakthroughs have been made at every layer:

「Cluster management」

  • Predictive scheduling: supports predictive scheduling based on the physical resource utilization of nodes [3], load-balancing scheduling, resource-preemption scheduling, and other features.
  • Feature modeling: a general application profiling and modeling component has been designed and implemented; it automates interference injection, metric collection, and model output.
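
To make the idea of predictive scheduling on physical utilization more concrete, here is a minimal Python sketch. It is not the Volcano or openEuler implementation: the moving-average predictor, the `limit` of 0.8, and all function names are assumptions chosen for illustration.

```python
from __future__ import annotations

from statistics import mean
from typing import Optional


def predict_utilization(history: list[float], window: int = 3) -> float:
    """Predict the next utilization sample as the mean of the last `window` samples (assumed predictor)."""
    return mean(history[-window:])


def pick_node(nodes: dict[str, list[float]], job_demand: float, limit: float = 0.8) -> Optional[str]:
    """Return the node with the lowest projected utilization after placing the job.

    `nodes` maps a node name to its measured utilization history (0.0 to 1.0).
    Nodes whose projected utilization would exceed `limit` are skipped.
    Returns None when no node qualifies.
    """
    best_name: Optional[str] = None
    best_util: Optional[float] = None
    for name, history in nodes.items():
        projected = predict_utilization(history) + job_demand
        if projected > limit:
            continue
        if best_util is None or projected < best_util:
            best_name, best_util = name, projected
    return best_name


if __name__ == "__main__":
    cluster = {
        "node-a": [0.7, 0.7, 0.7],
        "node-b": [0.2, 0.3, 0.4],
        "node-c": [0.5, 0.5, 0.5],
    }
    print(pick_node(cluster, job_demand=0.2))   # node-b
```

The point of the sketch is that the decision uses measured utilization rather than the resources a job nominally requested, which is what allows idle capacity to be found at all.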

「Single-node (standalone) management」

  • QoS quantification: detects service QoS in real time based on a quantitative model and controls interference sources in real time.
  • Topology placement: based on the hardware topology, dynamically arranges service affinity to improve overall performance without changing resource quotas.
  • Power control: higher resource utilization raises the risk that whole-machine power consumption exceeds its limit, so power changes must be monitored in real time and power consumption suppressed in a targeted way.
  • L3/MB control: current hardware provides L3 cache and memory bandwidth (MB) isolation, but software still has to control it dynamically to strike a balance between interference control and resource utilization.
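
The dynamic control mentioned for L3/MB can be pictured as a simple feedback loop. The Python sketch below only computes the next memory bandwidth percentage to grant to offline jobs; the latency-based signal, the step size, and the 80% comfort band are assumptions for illustration, and applying the value to hardware (on Linux this is typically done through the resctrl filesystem) is deliberately left out.

```python
def next_offline_mb_percent(
    current: int,
    online_latency_ms: float,
    target_latency_ms: float,
    step: int = 10,
    floor: int = 10,
    ceiling: int = 100,
) -> int:
    """Return the next memory bandwidth share (percent) for offline jobs.

    Throttle offline jobs when the online service misses its latency target,
    hand bandwidth back when the online service has clear headroom,
    and otherwise leave the setting unchanged.
    """
    if online_latency_ms > target_latency_ms:
        proposed = current - step          # online service is suffering: throttle offline jobs
    elif online_latency_ms < 0.8 * target_latency_ms:
        proposed = current + step          # comfortable headroom: return bandwidth to offline jobs
    else:
        proposed = current                 # inside the comfort band: hold steady
    return max(floor, min(ceiling, proposed))


if __name__ == "__main__":
    print(next_offline_mb_percent(50, online_latency_ms=12, target_latency_ms=10))  # 40
    print(next_offline_mb_percent(50, online_latency_ms=5, target_latency_ms=10))   # 60
```

The dead band between 80% and 100% of the target is one way to avoid oscillation; a production controller would likely need smoothing of the input signal as well.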

「Resource isolation layer」

  • Hierarchical preemption: provides tiered preemption for resources queued by priority, such as CPU, memory, and I/O/network. CPU absolute suppression (which avoids priority inversion) and network preemption performance (<100 ms) are industry-leading.
  • Elastic scheduling: supports elastic scheduling capabilities such as tidal affinity and CPU burst.

We will also gradually open the fine-grained features above in openEuler. You are welcome to use them and to discuss them in the community.
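
As a toy model of priority-based suppression, the Python sketch below grants CPU strictly in priority order, so lower-priority (offline) tasks only receive what higher-priority (online) tasks leave over. The function name and the tuple layout are hypothetical; the real mechanism lives in the kernel scheduler and is not shown here.

```python
from __future__ import annotations


def allocate_by_priority(
    capacity: float,
    demands: list[tuple[str, int, float]],
) -> dict[str, float]:
    """Grant CPU cores in strict priority order.

    Each demand is (task_name, priority, cores_wanted); a lower priority number
    means a more important task. A task is granted at most what remains after
    all more important tasks have been served.
    """
    remaining = capacity
    grants: dict[str, float] = {}
    for name, _priority, wanted in sorted(demands, key=lambda d: d[1]):
        grant = min(wanted, remaining)
        grants[name] = grant
        remaining -= grant
    return grants


if __name__ == "__main__":
    # The online task (priority 0) is served first; the offline task gets the leftover 3 cores.
    print(allocate_by_priority(8, [("offline", 1, 6), ("online", 0, 5)]))
    # {'online': 5, 'offline': 3}
```

If the online demand later grows to 8 cores, rerunning the function grants the offline task nothing, which mirrors the intent of absolute suppression.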

Future plans

We have verified and deployed the hybrid deployment solution in some internal scenarios and have reached the L2 stage. In the short term, we still need to break through the technology for guaranteeing the QoS of black-box services and enter the L3 stage; only at L3 can more users benefit. In the long term, beyond container scenarios, more load types and resource types need better utilization, which requires further technical breakthroughs in cluster scheduling, the OS, and other layers.

This article has briefly introduced our thinking on solutions for improving the utilization of cloud resources. Follow-up articles will cover the isolation, feedback control, and aware scheduling technologies involved in detail. Stay tuned!

References

  1. Canalys. Global cloud services spend hits US$55.9 billion in Q1 2022.
  2. Wang Kangjin, Jia Tong, Li Ying. Survey of research on job scheduling and resource management technology for online/offline colocation. Journal of Software, 2020, 31(10): 3100-3119.
  3. Volcano: an online/offline job colocation management platform that enables intelligent resource management and job scheduling.

Join us

The resource utilization improvement technologies mentioned in this article are developed jointly by the Cloud Native SIG, High Performance Network SIG, Kernel SIG, OpenStack SIG, and Virt SIG, and their source code will be gradually open-sourced in the openEuler community. If you are interested in these technologies, you are welcome to follow and join. You can add the assistant on WeChat to join the WeChat group of the corresponding SIG.



This article is from the WeChat official account openEuler (openEulercommunity).
