Automated operation and maintenance of large-scale Kubernetes clusters at vivo
2022-06-13 10:15:00 【InfoQ】
1. Background
- Cluster operations are performed manually on the command line ("black screen" operations), which leads to operator errors and configuration differences between clusters.
- The deployment scripts are not versioned, which makes cluster upgrades and configuration changes harder.
- Validating deployment scripts before they go live takes a long time, because there are no dedicated test cases or CI verification.
- The ansible tasks are not split into modules. They should be broken into roles for K8s, etcd, addons and so on, so that each ansible task can be run separately.
- Deployment is mainly binary-based, so a cluster management system has to be maintained in-house. The deployment process is cumbersome and inefficient.
- Component parameters are managed chaotically through command-line flags. A single K8s component can have more than 100 parameters, and they change with every major version.
2. Cluster deployment practice
2.1 Introduction to cluster deployment
- Bootstrap OS
- Preinstall step
- Install Docker
- Install etcd
- Install Kubernetes Master
- Install Kubernetes node
- Configure network plugin
- Install Addons
- Postinstall setup
- 【Maintainability】 Once a component has more than 50 parameters, its configuration becomes hard to manage.
- 【Upgradability】 Versioned configuration parameters are easier to manage across upgrades, because the community does not change the parameters within a major version.
- 【Programmability】 Component configuration can be handled as a (JSON/YAML) object. If the dynamic kubelet configuration option is enabled, modified parameters take effect automatically without restarting the service.
- 【Configurability】 Many kinds of configuration cannot be expressed as key-value pairs.
- Use kubeadm to manage the life cycle of K8s clusters, which reduces the cost of maintaining them.
- Use kubeadm for certificate management. Uploading the certificates to a secret reduces the time spent copying certificates between hosts and regenerating them.
- Use kubeadm's kubeconfig support to generate the admin kubeconfig file.
- Use other kubeadm features such as image management, the upload-config configuration center, and automatic labeling and tainting of control-plane nodes.
- Install the coredns and kube-proxy addons.
- Use ansible's built-in modules to handle the deployment logic.
- Avoid using hostvars.
- Avoid using delegate_to.
- Support running with --limit.
- And so on.
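The maintainability and upgrade points above rest on treating component configuration as a versioned, structured object rather than a list of command-line flags. The following is a minimal sketch of one benefit, assuming the configuration has already been loaded into nested dictionaries: two versions can be diffed key by key before an upgrade. The function name diff_config and all field names are hypothetical and do not come from any real component schema.

```python
def diff_config(old, new, prefix=""):
    """Return {dotted_path: (old_value, new_value)} for every difference.

    Nested dictionaries are compared recursively, so settings that do not
    fit a flat key-value shape are still reported precisely.
    """
    changes = {}
    for key in sorted(set(old) | set(new)):
        path = f"{prefix}{key}"
        before, after = old.get(key), new.get(key)
        if isinstance(before, dict) and isinstance(after, dict):
            changes.update(diff_config(before, after, prefix=f"{path}."))
        elif before != after:
            changes[path] = (before, after)
    return changes


# Hypothetical configuration versions, used only for illustration.
old = {"apiVersion": "example/v1", "maxPods": 110,
       "eviction": {"memory": "100Mi"}}
new = {"apiVersion": "example/v1", "maxPods": 200,
       "eviction": {"memory": "100Mi", "disk": "10%"}}

for path, (before, after) in diff_config(old, new).items():
    print(f"{path}: {before!r} -> {after!r}")
```

A review step in CI could print such a diff for every configuration change, which is much harder to do reliably with flags scattered across service files.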
2.2 CI matrix testing
- ansible-lint
- shellcheck
- yamllint
- syntax-check
- pep8
- Deploy a cluster
- Scale control-plane nodes, compute nodes and etcd out and in
- Upgrade a cluster
- Change etcd, Docker, K8s and addons parameters, and so on
- Check whether kube-apiserver works properly
- Check whether the network between nodes works properly
- Check whether the compute nodes are healthy
- K8s e2e test
- K8s conformance test
- Other tests
- Deploy gitlab-runner in a K8s cluster and connect it to the GitLab repository.
- Deploy the Containerized-Data-Importer (CDI)[4] component in the K8s cluster to create the PVCs that store the virtual machine image files.
- Deploy kubevirt in the K8s cluster to create virtual machines.
- Write gitlab-ci.yaml[5] in the code repository to plan the cluster test matrix.
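A test matrix is the set of all combinations of the dimensions being varied. The sketch below shows the idea in Python rather than in CI configuration syntax. The dimension names and values are hypothetical placeholders and not the matrix the article's pipeline actually uses.

```python
from itertools import product


def build_matrix(dimensions):
    """Expand {dimension: [values]} into one dict per combination."""
    names = list(dimensions)
    combos = product(*(dimensions[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Hypothetical dimensions, chosen only to illustrate the expansion.
dimensions = {
    "operation": ["deploy", "scale", "upgrade"],
    "network_plugin": ["plugin-a", "plugin-b"],
}

jobs = build_matrix(dimensions)
for job in jobs:
    print(job)
print(len(jobs), "jobs")  # 3 operations x 2 plugins = 6 jobs
```

Because the number of jobs is the product of the dimension sizes, each added dimension multiplies CI time, which is one reason to run such a matrix on disposable virtual machines.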

- A developer submits a PR.
- CI is triggered and automatically runs the ansible syntax checks.
- An ansible script creates the namespace, the PVC and the kubevirt virtual machine template, and the virtual machine then runs on K8s. This mainly uses ansible's K8s module[6] to manage the creation and destruction of these resources.
- An ansible script is called to deploy the K8s cluster.
- After the cluster is deployed, functional verification and performance tests are run.
- The kubevirt resources, PVCs and other resources are destroyed: the virtual machines are deleted and the resources released.
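The flow above ends by destroying the virtual machines, and that step should run even when an earlier stage fails, otherwise failed pipelines would leak resources. The following is a minimal sketch of that control flow, assuming each stage is a plain Python callable. The stage names and the run_pipeline helper are hypothetical and stand in for the real ansible invocations.

```python
def run_pipeline(steps, cleanup):
    """Run (name, callable) steps in order and always run cleanup.

    Returns (ok, log), where log records the stages that were started.
    """
    log = []
    ok = True
    try:
        for name, step in steps:
            log.append(name)
            step()
    except Exception as exc:
        # A failing stage stops the pipeline but must not skip cleanup.
        ok = False
        log.append(f"failed: {exc}")
    finally:
        cleanup()
        log.append("cleanup")
    return ok, log


def failing_deploy():
    raise RuntimeError("deploy failed")


# Placeholder stages; a real pipeline would call ansible here.
steps = [
    ("syntax-check", lambda: None),
    ("create-vms", lambda: None),
    ("deploy-cluster", failing_deploy),
    ("verify", lambda: None),
]

ok, log = run_pipeline(steps, cleanup=lambda: None)
print(ok, log)
```

In this run the verify stage is never started, yet cleanup still appears at the end of the log.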

- It provides the standard K8s API, so ansible's K8s module can manage the life cycle of these resources.
- It reuses the scheduling capability of K8s, so resources are managed and controlled.
- It reuses the networking capability of K8s: clusters are isolated by namespace, so their networks do not affect each other.
3. Kubernetes-Operator practice
3.1 Introduction to Operators
- A Kubernetes controller
- Deploys or manages an application, such as a database or etcd
- User-defined application life cycle management
- Deploy
- Upgrade
- Scale out and in
- Backup
- Self-healing
- And so on
3.2 Introduction to the Kubernetes-Operator CRs

3.3 Kubernetes-Operator architecture

- If other service clusters can carry the workloads of the failed cluster, kubernetes-operator takes no action.
- If other service clusters cannot carry the workloads of the failed cluster, the container platform estimates the required resources and calls kubernetes-operator to create a cluster. Creating the clusterDeployment selects physical machines from the standby pool, uses the IP addresses of the machines to be operated on to generate the corresponding inventory and variables, and creates a configmap that is mounted into a job. The job runs the cluster installation ansible script, and business migration starts once the cluster has been deployed successfully.
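The step that turns machine IP addresses into an inventory can be sketched as follows. This is an illustration under stated assumptions, not the operator's actual code: the render_inventory helper, the machine names and the group names are hypothetical, and the addresses come from the documentation range 192.0.2.0/24. Only ansible_host is a standard ansible inventory variable.

```python
def render_inventory(machines):
    """Render (name, ip, role) tuples as INI-style ansible inventory text."""
    groups = {}
    for name, ip, role in machines:
        groups.setdefault(role, []).append(f"{name} ansible_host={ip}")

    lines = []
    for role in sorted(groups):
        lines.append(f"[{role}]")
        lines.extend(groups[role])
        lines.append("")  # blank line between groups
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical machines picked from a standby pool.
machines = [
    ("m1", "192.0.2.10", "master"),
    ("n1", "192.0.2.20", "node"),
    ("e1", "192.0.2.30", "etcd"),
]

print(render_inventory(machines))
```

The resulting text is the kind of content that could be stored in the configmap and mounted into the job that runs the playbook.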
3.4 Kubernetes-Operator execution flow

- The cluster administrator or the container platform creates a ClusterDeployment CR that defines the operation to perform on the cluster.
- The ClusterDeployment controller detects the change and starts processing it.
- It creates the machineSet and the associated machine resources.
- The ClusterInstall controller detects the changes to ClusterDeployment and Machineset, collects the machine resources, and creates a configmap and a job whose parameters specify the ansible yml entry point for the operation: scaling out or in, upgrading, or installing.
- The scheduler detects the pod created by the job and schedules it.
- The scheduler calls the K8s client to update the pod's binding resource.
- kubelet detects the scheduling result, creates the pod, and starts running the ansible playbook.
- The job controller detects the execution status of the job and updates the ClusterDeployment status. Under the default policy, the job controller cleans up the configmap and job resources.
- The NodeHealthy controller detects whether the K8s nodes are ready and synchronizes the machine status.
- The addons controller detects whether the cluster is ready and, if so, installs and upgrades the relevant addons.
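The flow above follows the usual controller pattern: observe the current state, take one step toward the desired state, and record the result in the resource status. The sketch below models only the status transitions as a pure function. The phase names (Pending, Installing, Ready, Failed) and the dictionary layout are hypothetical and are not taken from the operator's real CRD.

```python
def reconcile(deployment, job_status):
    """Return the deployment with its phase advanced by at most one step.

    job_status is the observed state of the ansible job: None while no job
    exists or it is still running, otherwise "Succeeded" or "Failed".
    """
    phase = deployment.get("phase", "Pending")
    if phase == "Pending":
        # A real controller would create machines, the configmap and the job here.
        return {**deployment, "phase": "Installing"}
    if phase == "Installing":
        if job_status == "Succeeded":
            return {**deployment, "phase": "Ready"}
        if job_status == "Failed":
            return {**deployment, "phase": "Failed"}
    # Nothing to do: the job is still running or the phase is terminal.
    return deployment


deployment = {"name": "cluster-a"}
for observed in (None, None, "Succeeded"):
    deployment = reconcile(deployment, observed)
    print(deployment["phase"])
```

Keeping the transition logic free of side effects makes it easy to unit test separately from the K8s client calls.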
4. Summary