Ways to improve openEuler resource utilization 01: Introduction
2022-07-07 19:52:00 【openEuler】
Problem background
According to a report released by Canalys [1], global spending on cloud infrastructure services grew 34% year on year in the first quarter of 2022, reaching US$55.9 billion. However, several studies have shown that the average CPU utilization of user clusters in data centers worldwide is currently below 20%, which is a huge waste of resources. Improving data center resource utilization is therefore an urgent problem [2].
Causes of the problem
The main reason for low resource utilization is a mismatch between tasks and resource allocation. This mismatch takes many forms, for example:
Scheduling systems are isolated per cluster: different jobs use different scheduling systems, so jobs cannot flow across a broader cluster and the idle resources of other clusters cannot be used effectively.
Lack of diversity in task types: the jobs in a cluster are highly homogeneous and concentrate on certain resources, so those resources are heavily utilized while the rest sit idle.
Lack of priority-based tiered management: either there are no low-priority jobs to fill idle resources, or low-priority jobs exist but the cluster has no tiered control capability, which leads to over-allocation of resources.
Single resource type in the cluster: the resources inside the cluster share a single overall specification and cannot scale flexibly to match the business's dynamic demand for each resource type, so some resources are over-allocated.
Overall, the root cause is the lack of diversity of tasks and resources within the cluster, combined with the scheduler's weak ability to manage diverse tasks and resources.
Solutions
Co-locate different types of jobs to improve resource utilization in both the space and time dimensions.
Resource overselling (spatial overselling): the idle resources of online services are oversold to offline jobs, improving overall resource utilization.
Peak staggering (temporal overselling): offline jobs fill the idle periods of online services, reducing idle resources.
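Both forms of overselling come down to the same arithmetic: whatever the online service is not using, minus a safety margin, can be lent to offline jobs. The sketch below assumes that capacity and usage are measured in CPU cores and that a fixed fraction of the node is held back for online bursts; the function names and the `reserve_ratio` parameter are hypothetical and are not taken from openEuler. Applying the same calculation to an online usage profile over time gives peak staggering.

```python
# Minimal sketch of spatial and temporal overselling (illustrative only).
# Assumptions: resources are CPU cores, and a fixed safety reserve
# (reserve_ratio * capacity) is always kept free for online bursts.
# offline_allowance / offline_schedule are hypothetical names.


def offline_allowance(capacity: float, online_usage: float,
                      reserve_ratio: float = 0.1) -> float:
    """Cores that can be oversold to offline jobs at one instant (spatial overselling)."""
    reserve = capacity * reserve_ratio
    return max(0.0, capacity - online_usage - reserve)


def offline_schedule(capacity: float, online_profile: list[float],
                     reserve_ratio: float = 0.1) -> list[float]:
    """Offline allowance per time slot of an online usage profile (peak staggering)."""
    return [offline_allowance(capacity, usage, reserve_ratio) for usage in online_profile]


if __name__ == "__main__":
    # Hypothetical 64-core node whose online service peaks in the middle slots.
    online_profile = [10.0, 8.0, 40.0, 55.0, 30.0, 12.0]
    print(offline_schedule(64.0, online_profile, reserve_ratio=0.125))
```

With a 12.5% reserve, the offline share shrinks to a single core at the online peak and grows back as the online load falls. A real system would derive the reserve from observed burst behavior rather than a constant.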
Technical challenges
Whether overselling is spatial or temporal, resources become insufficient when the co-located workloads peak at the same time, which degrades the quality of service (QoS) of some services. The key technical challenge is to improve resource utilization while keeping service QoS undamaged.
In addition, the diversity and complexity of cloud services make it even harder to guarantee service quality:
On the one hand, by how perceivable their workload characteristics are, applications can be divided into white-box, black-box, and gray-box applications. The system can perceive white-box applications and obtain their QoS metrics in real time. Black-box applications cannot be perceived by the system, which does not even know what the application's QoS metric is. Applications whose perceivability lies between the two are called gray-box applications. Accurately quantifying the service quality of black-box services and locating sources of interference is the technical challenge in generalizing this capability, and it is also a research hotspot in the industry.
On the other hand, by business complexity, workloads can be divided into lightweight applications (such as microservices and function computing), traditional applications (such as monolithic applications), and super applications (such as HPC/AI). Building a universal, unified system requires overcoming technical problems such as full-stack collaborative awareness.
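For black-box services the system has no application-level QoS metric, so a hardware-level proxy is needed. One approach described in the industry, popularized by Google's CPI² work, uses cycles per instruction (CPI) as the proxy. A CPI sample far above the service's own baseline suggests interference, and the co-located task whose CPU usage best correlates with the victim's CPI is the likely source. The article does not say which technique openEuler uses. The sketch below only illustrates that general idea; the threshold `k`, the function names, and the sample data are all hypothetical.

```python
import statistics

# Illustrative CPI-based interference detection for a black-box service.
# Assumption: per-interval CPI samples for the victim and CPU-usage samples
# for each co-located candidate are already collected (e.g. from perf counters).


def is_cpi_outlier(baseline: list[float], sample: float, k: float = 2.0) -> bool:
    """Flag a CPI sample that exceeds baseline mean + k * standard deviation."""
    mean = statistics.mean(baseline)
    std = statistics.pstdev(baseline)
    return sample > mean + k * std


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def likely_antagonist(victim_cpi: list[float], candidates: dict[str, list[float]]) -> str:
    """Return the candidate whose CPU usage correlates most with the victim's CPI."""
    return max(candidates, key=lambda name: pearson(victim_cpi, candidates[name]))


if __name__ == "__main__":
    baseline = [1.0, 1.1, 0.9, 1.0, 1.0]          # hypothetical "healthy" CPI
    print(is_cpi_outlier(baseline, 1.6))          # a CPI spike
    victim = [1.0, 1.2, 1.8, 1.1, 1.9]
    usage = {"batch-a": [2, 3, 8, 2, 9], "batch-b": [5, 5, 4, 6, 5]}
    print(likely_antagonist(victim, usage))
```

Correlation only points at a suspect; a production system would confirm it, for example by briefly throttling the suspect and checking whether the victim's CPI recovers.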
Solution overview
Following the cause analysis above, co-locating and scheduling diversified services/workloads and resources together can significantly improve the flexibility of resource allocation and thus improve resource utilization. However, it also brings greater technical challenges: the more services/workloads are managed and the more resource types there are, the more complex their dependencies become and the more complex the system's multi-objective optimization requirements. On this basis, we divide the evolution into the following stages:
L0: Independent deployment: each cluster has its own technology stack and resource pool; cluster utilization is low (<20%).
L1: Shared deployment: a unified technology stack expands cluster scale, a single type of service shares resources, and utilization is raised through dynamic elasticity; cluster resource utilization remains low (<30%).
Related technologies: technology stack unification, containerization, elastic scaling
L2: 「Mixed deployment」: a unified technology stack expands cluster scale, many types of services share resources, and utilization is raised through overselling and isolation technologies; cluster resource utilization is high (>40%).
Related technologies: resource overselling, hierarchical resource isolation, feedback control
L3: 「Generic hybrid deployment」: the service types in hybrid deployment are generalized, supporting thousands of black-box services sharing resources on the public cloud, while quantitative QoS awareness guarantees the service quality of key services.
Related technologies: QoS quantification and localization, precise control, QoS-aware scheduling
L4: 「Converged deployment」: building on generalized workload types, diverse workloads such as containers, virtual machines, and lightweight runtimes are converged and combined with complex scenarios such as HPC/AI and heterogeneous-resource awareness, comprehensively improving the utilization of all kinds of resources.
Related technologies: heterogeneous-resource-aware scheduling, unified scheduling
Of these stages, L1 and L2 focus mainly on improving cluster CPU utilization, while L3 and L4 generalize the utilization-improvement technologies.
The industry's exploration of L2 for internal services has significantly improved the overall utilization of clusters and even entire data centers, but generalization to the public cloud is still at an early stage and not yet commercially available.
In line with the future trend toward generalized and converged deployment, we have built a sustainable resource utilization solution, as shown in the figure below:
To achieve the best co-location results, control and optimization are needed at multiple levels of task execution:
「Cluster management」: at the scheduling level, services that cause strong performance interference are deployed apart, and optimizing task combinations reduces unnecessary interference.
「Single-node management」: at the single-node level, resource contention is perceived in real time to eliminate its impact on key jobs.
「Resource isolation layer」: tasks are graded by priority so that the resource requirements of high-priority tasks are guaranteed.
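The guarantee at the resource isolation layer can be pictured as strict-priority allocation: each tier is served fully before the next lower tier receives anything. The sketch below assumes a single divisible resource (CPU cores), a hypothetical convention that a larger `priority` number means more important, and a proportional split inside a tier when it cannot be fully served. It models the policy only and is not the kernel mechanism openEuler uses.

```python
from dataclasses import dataclass

# Illustrative strict-priority allocation of one resource (CPU cores).
# Assumption (hypothetical convention): larger priority value = more important.


@dataclass
class Task:
    name: str
    priority: int
    demand: float  # cores requested at this instant


def allocate(capacity: float, tasks: list[Task]) -> dict[str, float]:
    """Serve tiers from highest to lowest priority; split a short tier by demand."""
    remaining = capacity
    result: dict[str, float] = {}
    for prio in sorted({t.priority for t in tasks}, reverse=True):
        tier = [t for t in tasks if t.priority == prio]
        tier_demand = sum(t.demand for t in tier)
        if tier_demand <= remaining:
            for t in tier:
                result[t.name] = t.demand
            remaining -= tier_demand
        else:
            # Not enough left for the whole tier: share the remainder proportionally.
            for t in tier:
                result[t.name] = remaining * t.demand / tier_demand
            remaining = 0.0
    return result


if __name__ == "__main__":
    tasks = [Task("web", 1, 6.0), Task("batch-a", 0, 4.0), Task("batch-b", 0, 4.0)]
    print(allocate(8.0, tasks))
```

When the online task's demand grows to the whole node, the offline tier is squeezed to zero, which is the behavior a priority-based isolation layer aims to enforce.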
Based on the above framework, Huawei has implemented an L2-level solution, and the related features have been verified within Huawei and launched one after another. Important breakthroughs have been made at every level:
「Cluster management」:
Predictive scheduling: supports predictive scheduling based on node physical resource utilization [3], load-balancing scheduling, resource preemption scheduling, and other features.
Feature modeling: a general application-profile modeling component was designed and implemented; it can automatically inject interference, collect metrics, and output models.
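Scheduling on predicted rather than requested utilization can be illustrated with a simple forecaster. The sketch below uses an exponentially weighted moving average (EWMA) of each node's recent physical utilization and places a job on the least-loaded node whose forecast plus the request stays under a limit. EWMA, the 0.8 limit, and all names are assumptions for illustration; the article does not describe the prediction model actually used.

```python
from typing import Optional

# Illustrative predictive placement using an EWMA utilization forecast.
# Assumptions: utilization samples and requests are fractions of a node (0..1);
# alpha and limit are hypothetical tuning values.


def ewma(samples: list[float], alpha: float = 0.5) -> float:
    """One-step forecast: exponentially weighted moving average of the samples."""
    forecast = samples[0]
    for s in samples[1:]:
        forecast = alpha * s + (1 - alpha) * forecast
    return forecast


def pick_node(history: dict[str, list[float]], request: float,
              limit: float = 0.8) -> Optional[str]:
    """Choose the node with the lowest forecast that can absorb the request."""
    best: Optional[str] = None
    best_pred = float("inf")
    for node, samples in history.items():
        pred = ewma(samples)
        if pred + request > limit:
            continue  # predicted to overload this node
        if pred < best_pred:
            best, best_pred = node, pred
    return best


if __name__ == "__main__":
    history = {
        "node-1": [0.5, 0.6, 0.7],
        "node-2": [0.7, 0.4, 0.3],
        "node-3": [0.2, 0.9, 0.9],
    }
    print(pick_node(history, request=0.2))
```

Returning `None` signals that no node is predicted to have room, which is where preemption or queuing would take over.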
「Single-node management」:
QoS quantification: detects service QoS in real time based on a quantitative model and controls sources of interference in real time.
Topology layout: dynamically arranges service affinity according to the hardware topology, improving overall performance without changing resource quotas.
Power control: higher resource utilization increases the risk of excessive whole-machine power consumption, so power changes must be monitored in real time and targeted power suppression applied.
L3/MB control: the underlying hardware already provides L3 cache and memory bandwidth (MB) isolation, but dynamic software control is still needed to balance interference control against resource utilization.
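Real-time interference control and dynamic L3/MB control are both forms of feedback control: measure the online service's QoS, compare it with a target, and adjust the resources given to offline jobs. The sketch below is a plain proportional controller on a CPU quota, assuming the online QoS metric is a latency in milliseconds. The gain, the quota bounds, and the function name are hypothetical and are not openEuler's actual controller.

```python
# Illustrative proportional feedback controller for the offline CPU quota.
# Assumptions: QoS is an online latency (ms) with a target; quota is in cores;
# gain, min_quota and max_quota are hypothetical tuning values.


def next_offline_quota(quota: float, latency_ms: float, target_ms: float,
                       gain: float = 0.5, min_quota: float = 0.0,
                       max_quota: float = 32.0) -> float:
    """Raise the offline quota when there is QoS headroom, cut it when QoS is violated."""
    error = (target_ms - latency_ms) / target_ms   # >0 headroom, <0 violation
    new_quota = quota + gain * error * max_quota   # proportional step
    return min(max_quota, max(min_quota, new_quota))


if __name__ == "__main__":
    quota = 16.0
    # Hypothetical sequence of measured online latencies against an 8 ms target.
    for latency in [4.0, 6.0, 10.0, 12.0, 8.0]:
        quota = next_offline_quota(quota, latency, target_ms=8.0)
        print(f"latency={latency} ms -> offline quota={quota:.2f} cores")
```

A proportional term alone can oscillate or leave a steady-state error, so practical controllers often add integral action, hysteresis, or asymmetric step sizes that cut faster than they grow.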
「Resource isolation layer」:
Hierarchical preemption: provides tiered preemption for prioritized resources such as CPU, memory, and IO/network; capabilities such as absolute CPU suppression (which avoids priority inversion) and network preemption performance (<100 ms) are industry-leading.
Flexible scheduling: supports elastic scheduling capabilities such as tidal affinity and CPU Burst.
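CPU Burst lets a CPU-limited group save up quota it did not use in quiet periods and spend it during a short spike, instead of being throttled at the fixed quota. In mainline Linux (5.14 and later) this is exposed as `cpu.max.burst` in cgroup v2 and `cpu.cfs_burst_us` in cgroup v1; the article does not describe openEuler's implementation. The sketch below is a simplified model of that behavior only: each period the runtime pool is refilled by `quota` and capped at `quota + burst`, and all units (CPU time per period) and values are hypothetical.

```python
# Simplified model of CFS bandwidth control with burst (illustrative only).
# Each period: the group may use up to what is in its runtime pool; the pool is
# then refilled by `quota` and capped at `quota + burst`.


def simulate_cfs_burst(demands: list[int], quota: int, burst: int) -> list[int]:
    """Return the CPU time granted in each period for the given demands."""
    pool = quota
    granted = []
    for demand in demands:
        used = min(demand, pool)
        granted.append(used)
        # Unused runtime carries over, but never beyond quota + burst.
        pool = min(pool - used + quota, quota + burst)
    return granted


if __name__ == "__main__":
    demands = [20, 20, 120, 50]   # e.g. ms of CPU time wanted per 100 ms period
    print("with burst:   ", simulate_cfs_burst(demands, quota=50, burst=50))
    print("without burst:", simulate_cfs_burst(demands, quota=50, burst=0))
```

With burst enabled, the spike in the third period is served up to 100 units from savings accumulated earlier; without it, the group is capped at the 50-unit quota, while the long-run average stays bounded in both cases.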
These fine-grained features will also be open-sourced in openEuler. You are welcome to use them and discuss them in the community.
Future plans
We have now verified and deployed the co-location solution in some internal scenarios, reaching the L2 stage. In the short term, we still need breakthroughs in QoS assurance technology for black-box services to enter the L3 stage; only at L3 can more users benefit. In the long term, beyond container scenarios, more workload types and resource types need better utilization, which requires further technological breakthroughs at the cluster scheduling, OS, and other levels.
This article has briefly introduced our thinking on technical solutions for improving resource utilization in the cloud. Follow-up articles will describe the isolation, feedback control, and perception-aware scheduling technologies involved in detail. Stay tuned!
References
[1] Canalys. Global cloud services spend hits US$55.9 billion in Q1 2022.
[2] Wang Kangjin, Jia Tong, Li Ying. A survey of job scheduling and resource management technologies for online/offline co-located deployment. Journal of Software, 2020, 31(10): 3100-3119.
[3] Volcano: a co-location management platform for online and offline jobs, enabling intelligent resource management and job scheduling.
Join us
The resource utilization technologies mentioned in this article are developed jointly by the Cloud Native SIG, High Performance Network SIG, Kernel SIG, OpenStack SIG, and Virt SIG, and their source code will be gradually open-sourced in the openEuler community. If you are interested in these technologies, you are welcome to follow and join. You can add the community assistant on WeChat to join the corresponding SIG WeChat group.
This article is from the WeChat official account openEuler (openEulercommunity).