2022-08-02 23:13:00 [CSDN Cloud Native]
Guest | Meng Fanjie
Produced by | CSDN Cloud Native
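The 2021 "FinOps Kubernetes Report" from the CNCF (Cloud Native Computing Foundation) shows that after migrating to Kubernetes, 68% of respondents said their organization's compute resource costs had increased, and 36% said costs had jumped by more than 20%. Improving resource utilization to cut costs and raise efficiency has therefore become a focus for enterprises.
But how can resource utilization be improved to reduce costs and raise efficiency? This is a question that cloud-native practitioners should keep discussing and learning about from one another.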
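For this reason, the China Academy of Information and Communications Technology (CAICT), Tencent Cloud, and the FinOps Industry Standards Working Group jointly launched the live-stream series "Driving Force x Cloud Native Speaks Out: Cost Reduction and Efficiency Lecture Hall". In the first session on June 23, 2022, Meng Fanjie, container technology expert at Tencent Cloud and head of FinOps product R&D, shared best-practice cases for cloud-native cost reduction and efficiency improvement.
Meng Fanjie focuses on cloud-native cost optimization. He is the initiator of Crane, an open source cloud-native cost optimization project, and the author of "The Road to Kubernetes in Production Practice" and "Practices for Improving Software R&D Efficiency". This article is compiled from his talk.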
Cost optimization has become a core concern for enterprises moving to the cloud. Tencent sets very high internal requirements for resource utilization. After years of technical exploration and hands-on cost optimization while moving its large volume of self-developed business to the cloud, the results are clear: at an overall scale of 50 million cores, resource utilization after co-location reaches 65%, with cumulative savings of RMB 3 billion.
Cloud-native cost management: status and challenges
The CNCF 2021 annual survey shows that cloud-native adoption has hit a record high: 96% of organizations are already using or evaluating Kubernetes. In other words, these organizations either run Kubernetes in production or are already doing a POC.
Meanwhile, Flexera's "2021 State of the Cloud Report" shows that 30% to 35% of cloud spend is wasted. Analysis of Tencent Cloud public cloud customer data confirms this: resource cost waste in customer clusters is severe, and many customers have asked for higher resource utilization. Sampled data shows average physical machine utilization of 10%, average virtual machine utilization of 12%, and average container utilization of only 14%, far lower than we expected.
In the post-cloud-native era, cost management faces many challenges:
Decentralization: with the rapid growth of cloud-native applications centered on Kubernetes, traditional centralized financial budgeting and IT management are shifting toward business-oriented distributed decision-making;
Rising costs: a CNCF survey shows that, as business grows rapidly, enterprise cloud spend is increasing at an annual rate of 24%;
Dynamic change: the dynamic environment and elasticity of cloud native cause cloud costs to change constantly with business load;
Severe waste: after moving to the cloud, teams lack awareness of resource optimization and still manage resources with a traditional allocation mindset, which leads to severe waste.
The core of cloud cost management is to minimize resource requirements while safeguarding the business. However, gaps in native Kubernetes capabilities lead to resource waste:
Resource configuration (don't know how to configure): inaccurate, experience-based resource configuration leads to heavy waste.
Elasticity (afraid to configure): threshold-based autoscaling lags, so capacity arrives too late for the business.
Business stability (can't configure): when CPU contention occurs, time slices are divided proportionally by cpu.shares, which cannot guarantee stability for latency-sensitive services.
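The cpu.shares point can be illustrated with a minimal Python sketch. It assumes the cgroup v1 convention in which Kubernetes derives cpu.shares from the CPU request (1024 shares per core, with a floor of 2) and a node where every container is CPU-bound at the same time. The pod names and sizes are hypothetical.

```python
def cpu_shares(request_millicores: int) -> int:
    """cgroup v1 cpu.shares derived from a CPU request (1024 per core, floor of 2)."""
    return max(2, request_millicores * 1024 // 1000)


def contended_cpu(requests_millicores: dict, node_cores: float) -> dict:
    """Cores each container receives when all of them are CPU-bound at once."""
    shares = {name: cpu_shares(req) for name, req in requests_millicores.items()}
    total = sum(shares.values())
    return {name: node_cores * s / total for name, s in shares.items()}


# Hypothetical 4-core node: one latency-sensitive service plus three batch jobs.
requests = {
    "latency-sensitive": 1000,
    "batch-a": 1000,
    "batch-b": 1000,
    "batch-c": 1000,
}
allocation = contended_cpu(requests, node_cores=4.0)
for name, cores in allocation.items():
    print(f"{name}: {cores:.2f} cores under contention")
```

Under contention the latency-sensitive service gets the same proportional slice as each batch job. Nothing in cpu.shares alone lets it run ahead of them, which is the stability gap described above.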
Tencent's best practices for cloud-native cost reduction and efficiency are built on the FinOps framework.
FinOps defines a set of cloud financial management rules and best practices. It helps engineering and finance teams, and technical and business teams, work together to make data-driven cost decisions so that the organization gains the greatest benefit. Its principles, roles, maturity levels, phases, and capabilities are shown in the figure below.
Based on this methodology, Tencent open sourced a cost optimization project, Crane (Cloud Resource Analytics and Economics), so that the cloud-native experience and tools from Tencent's self-developed business can help more people. The Crane architecture has the following characteristics:
Prediction first: extensible prediction algorithms;
Optimization as the foundation: prediction-based resource reallocation, cost visualization, and multi-dimensional scaling;
Stability as the root: QoS enhancement based on business priority; interference detection and active avoidance.
The cost reduction and efficiency product architecture built on the Crane framework is shown in the figure below.
In practice, the FinOps team is the core of the cost reduction effort. As a centralized decision-making team, it identifies the root problems, reports up to management (CTO/CFO/COO), and receives a management mandate to drive transformation and platform optimization across all business units. It then breaks the work down into tasks for different distributed execution teams (platform operations, business operations).
The FinOps team's day-to-day work covers cost reduction strategy, waste analysis and identification, goal setting and delivery, rate optimization, business-side optimization, and platform-side optimization.
At the product level, Tencent launched data-driven cost analysis and outcome measurement.
A custom Spec Watcher captures workload changes
Metrics Beats, based on Prometheus, pulls the day's business metrics in the early hours each day
Custom storage optimization for Metrics Beats reduces storage space by orders of magnitude
Offline algorithm evaluation
Prediction algorithm accuracy is evaluated against large volumes of offline data
Clustering and business profiling
Overall status and trends
Best practice example
Take the cluster optimization of one Tencent department as an example. Before optimization, the department's state was:
Uneven node packing rate: nearly half of the clusters had a packing rate below 50%;
Low node utilization: two-thirds of the clusters had peak utilization below 40%;
Low business resource utilization: CPU utilization was 15% and memory utilization was 25%;
Low effective elasticity ratio: only 10% of HPAs actually triggered scaling this year.
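The article does not define these metrics precisely. As a minimal sketch, assume packing rate means total CPU requests divided by total allocatable CPU, and utilization means CPU actually used divided by total allocatable CPU. The node figures below are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    allocatable_cores: float  # CPU the scheduler may hand out on this node
    requested_cores: float    # sum of CPU requests of pods placed on the node
    used_cores: float         # CPU actually consumed at the sampled moment


def packing_rate(nodes: List[Node]) -> float:
    """Share of allocatable CPU that is reserved by pod requests (assumed definition)."""
    return sum(n.requested_cores for n in nodes) / sum(n.allocatable_cores for n in nodes)


def utilization(nodes: List[Node]) -> float:
    """Share of allocatable CPU that is actually in use (assumed definition)."""
    return sum(n.used_cores for n in nodes) / sum(n.allocatable_cores for n in nodes)


# Hypothetical two-node cluster.
nodes = [Node(32, 16, 5.0), Node(32, 12, 3.0)]
print(f"packing rate: {packing_rate(nodes):.1%}")
print(f"utilization:  {utilization(nodes):.1%}")
```

The gap between the two numbers is the reserved-but-idle capacity that the business-side and platform-side optimizations target.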
Investigation showed that this is a common situation. To address these problems, Tencent did the following.
Goal setting and performance publication
To optimize cloud resources for its internal business, Tencent defined a cloud maturity model
The model evaluates each BG's (business group's) use of cloud resources from the platform side and the business side
Overall maturity score = business-side score * 50% + platform-side score * 50%
Scores are aggregated level by level across the job, product, and department dimensions, and the result is used as an evaluation reference metric
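The scoring formula above can be sketched in a few lines of Python. The 50/50 weighting comes from the article. The roll-up rule is an assumption: the article only says scores are aggregated level by level, so a weighted mean is used here purely for illustration, with hypothetical weights such as core counts.

```python
from typing import List, Tuple


def overall_maturity(business_score: float, platform_score: float) -> float:
    """Overall maturity score = business-side score * 50% + platform-side score * 50%."""
    return business_score * 0.5 + platform_score * 0.5


def roll_up(children: List[Tuple[float, float]]) -> float:
    """Aggregate child scores into a parent score.

    `children` holds (score, weight) pairs. A weighted mean is an assumption,
    not a rule confirmed by the article.
    """
    total_weight = sum(weight for _, weight in children)
    return sum(score * weight for score, weight in children) / total_weight


# Hypothetical jobs in one product: (business-side score, cores used as weight).
product_business_score = roll_up([(80.0, 100.0), (60.0, 300.0)])
score = overall_maturity(product_business_score, platform_score=75.0)
print(f"product business-side score: {product_business_score}")
print(f"overall maturity score: {score}")
```

The same `roll_up` call can be repeated from job to product and from product to department.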
Prediction-based trend analysis
Cost and waste identification
Integration with the billing API for cost display
Convenient and reliable optimization execution
Spec optimization with support for in-place resizing
Elasticity recommendation, scheduled elasticity, and predictive elasticity
Three golden curves show the source of the recommended values
Platform-side optimization: node capacity scaling and water level management
Cluster overview visualization
Overall cluster utilization
Node utilization heat map
Node capacity scaling
A node capacity scaling ratio can be defined to amplify a node's allocatable resources and raise the packing rate
Node water level control
A node water level can be defined to cap node utilization
The dynamic scheduler packs nodes based on utilization, keeping actual utilization consistent with target utilization
A configurable compact-first scheduling policy makes it easy to return idle nodes
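How capacity scaling, the water level, and compact-first placement interact can be sketched as follows. This is a simplified illustration, not Crane's scheduler API: the field names, `fits`, `pick_node`, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeState:
    name: str
    allocatable_cores: float
    requested_cores: float  # CPU already reserved by pod requests
    used_cores: float       # CPU actually in use


def fits(node: NodeState, pod_request: float, pod_expected_usage: float,
         scale_ratio: float, water_level: float) -> bool:
    """Check a pod against the amplified capacity and the utilization cap."""
    amplified = node.allocatable_cores * scale_ratio
    within_capacity = node.requested_cores + pod_request <= amplified
    projected = (node.used_cores + pod_expected_usage) / node.allocatable_cores
    return within_capacity and projected <= water_level


def pick_node(nodes: List[NodeState], pod_request: float, pod_expected_usage: float,
              scale_ratio: float, water_level: float) -> Optional[NodeState]:
    """Compact-first: among feasible nodes, prefer the most utilized one."""
    feasible = [n for n in nodes
                if fits(n, pod_request, pod_expected_usage, scale_ratio, water_level)]
    if not feasible:
        return None
    return max(feasible, key=lambda n: n.used_cores / n.allocatable_cores)


nodes = [
    NodeState("node-a", 32, 30, 8),
    NodeState("node-b", 32, 10, 4),
    NodeState("node-c", 32, 31, 20),
]
scaled = pick_node(nodes, 4, 1, scale_ratio=1.5, water_level=0.6)
unscaled = pick_node(nodes, 4, 1, scale_ratio=1.0, water_level=0.6)
print(scaled.name, unscaled.name)
```

With a ratio of 1.5, node-a can accept the pod even though its requests are nearly full, while node-c is rejected because the projected utilization would exceed the 60% water level. Filling busy nodes first leaves lightly loaded nodes free to drain and return.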
Platform-side optimization: business grading and co-location
Platform operations define PriorityClass and the conflict handling policy
Business operations set the priority for their business workloads
Interference detection and active avoidance
Flexible anomaly detection policies
Comprehensive metric coverage
CPI, Steal Time, CPU Utilization, Memory Utilization, Network IO, Disk IO
Active avoidance policies
Absolute CPU preemption for high-priority business
Active eviction of low-priority business
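The detect-then-avoid flow can be sketched in Python under stated assumptions. The thresholds, the rule of three consecutive samples, and the function names are hypothetical and do not reflect Crane's actual detection logic. The sketch covers only the eviction step, not CPU preemption.

```python
from typing import Dict, List, Tuple


def is_anomalous(samples: List[float], threshold: float, consecutive: int = 3) -> bool:
    """True when the last `consecutive` samples all exceed the threshold."""
    recent = samples[-consecutive:]
    return len(recent) == consecutive and all(value > threshold for value in recent)


def choose_victims(pods: List[Tuple[str, int]], count: int = 1) -> List[str]:
    """Pick the lowest-priority pods first; `pods` holds (name, priority) pairs."""
    return [name for name, _ in sorted(pods, key=lambda pod: pod[1])[:count]]


def avoidance_actions(metrics: Dict[str, List[float]],
                      thresholds: Dict[str, float],
                      pods: List[Tuple[str, int]]) -> List[Tuple[str, str]]:
    """Return eviction actions if any monitored metric is anomalous."""
    triggered = [name for name, samples in metrics.items()
                 if is_anomalous(samples, thresholds[name])]
    if not triggered:
        return []
    return [("evict", victim) for victim in choose_victims(pods)]


# Hypothetical node: CPI has stayed above its threshold for three samples.
metrics = {"cpi": [1.1, 2.4, 2.6, 2.8], "steal_time": [0.01, 0.02, 0.01, 0.02]}
thresholds = {"cpi": 2.0, "steal_time": 0.1}
pods = [("web", 1000), ("batch", 10)]
actions = avoidance_actions(metrics, thresholds, pods)
print(actions)
```

Requiring several consecutive samples avoids reacting to a single spike, and sorting by priority keeps the high-priority service in place while the low-priority job is moved away.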
Results of large-scale internal rollout
Rolled out at scale across Tencent's self-developed business
Manages millions of CPU cores
Within one month of full launch, the overall total core count was reduced by 25%
If you also face challenges in cloud-native cost reduction and efficiency, visit our open source project at https://github.com/gocrane/crane and get in touch. We look forward to exchanging cost reduction experience and techniques with industry peers, and to using Crane to help solve your core cost optimization pain points.
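The second session of "Driving Force x Cloud Native Speaks Out: Cost Reduction and Efficiency Lecture Hall" focuses on online/offline co-location across all scenarios, improving K8s GPU resource efficiency, and K8s resource topology-aware scheduling. The talks take place on July 28, August 4, and August 11, from 20:00 to 21:00 each evening.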
