How to back up and restore a Kubernetes cluster for disaster recovery (22)

2022-06-12 21:56:00 【wzlinux】

Kubernetes hides the complex details of container orchestration, letting us focus on the application itself without paying much attention to how it is deployed and maintained. Kubernetes also supports multiple replicas, which helps keep our services highly available. The cluster itself needs to be highly available too; for that, see the official documentation on creating highly available clusters with kubeadm.

But that is not enough to let us rest easy. While orchestrating and scheduling containers for us, Kubernetes also stores a lot of critical data, such as the cluster's own key data, secrets, business configuration, and business data. When we use Kubernetes, preparing for disaster recovery is essential: it guards against data loss caused by operator error (for example, a large-scale accidental deletion), natural disasters, unrecoverable disk damage, network failures, and data-center power outages. In severe cases the entire cluster may even become unavailable.

So when using Kubernetes, we should back up the cluster so that it can be restored and rolled back to an earlier stable state.

What needs to be backed up in Kubernetes

Before backing up a Kubernetes cluster, we first need to know what to back up.

Let's start from the overall Kubernetes architecture and look at the components that make up a cluster:

[Figure: Kubernetes cluster architecture, Master node on the left and Node nodes on the right]

As the figure above shows, a Kubernetes cluster can be divided into Master nodes (left) and Node nodes (right).

The Master nodes run the etcd cluster and the main components of the Kubernetes control plane, such as kube-apiserver, kube-controller-manager, kube-scheduler, and (optionally) cloud-controller-manager.

Of these components, all except etcd are stateless services. As long as the etcd data is intact, a problem with any other component can be fixed by restarting it or creating a new instance, with no lasting impact. So on the Master side we only need to back up the data in etcd.

That covers the Master nodes; now let's look at the Node nodes.

Node nodes run kubelet, kube-proxy, and other components. The kubelet is responsible for maintaining each container instance and the storage the containers use. To make data persistent, I recommend storing business-critical data in PVs (Persistent Volumes). For that reason, we also need to back up PVs.

If a node fails, we can add a new node to the cluster to replace it.

Having gone through the Kubernetes architecture, let's see how to back up etcd data and PVs.

Backing up and restoring etcd data

The etcd project also provides backup documentation, which is worth reading if you are interested. Here I have summarized the practical steps so that you can use them as a reference for manual backup and recovery. The certificate paths and the endpoint address in the commands need to be adjusted to match your cluster. The commands are as follows:

# 0. Back up the data
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--key=/etc/kubernetes/pki/etcd/peer.key \
--cert=/etc/kubernetes/pki/etcd/peer.crt \
snapshot save ./new.snapshot.db
# 1. List the etcd cluster members
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/peer.crt \
--key=/etc/kubernetes/pki/etcd/peer.key \
member list
# 2. Stop etcd on ALL nodes (every one of them!)
## If etcd runs as a static pod, moving its manifest away (the command below) stops it
## If etcd is managed by systemd, use: systemctl stop etcd
mv /etc/kubernetes/manifests/etcd.yaml /etc/kubernetes/
# 3. Clean up the data
## On each node in turn, remove the etcd data
rm -rf /var/lib/etcd
# 4. Restore the data
## On each node in turn, restore etcd from the old snapshot
## The values of name, initial-advertise-peer-urls, initial-cluster
## and similar parameters can be taken from the etcd pod's YAML file.
ETCDCTL_API=3 etcdctl snapshot restore ./old.snapshot.db \
--data-dir=/var/lib/etcd \
--name=controlplane \
--initial-advertise-peer-urls=https://172.17.0.18:2380 \
--initial-cluster=controlplane=https://172.17.0.18:2380
# 5. Bring the etcd service back
## On each node in turn, start the etcd service again
mv /etc/kubernetes/etcd.yaml /etc/kubernetes/manifests/
systemctl restart kubelet
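
Before relying on a snapshot for recovery, it is worth checking that the file is actually usable. The sketch below is one possible way to do that, assuming the snapshot was written to ./new.snapshot.db as in step 0 and that etcdctl is on the PATH. The function name verify_snapshot is hypothetical. On etcd 3.5 and later, `etcdctl snapshot status` is, as far as I know, deprecated in favor of `etcdutl snapshot status`, so adjust accordingly.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sanity-check an etcd snapshot file before trusting it.
verify_snapshot() {
  local snapshot=$1
  # A missing or zero-byte file cannot be a valid snapshot.
  if [ ! -s "$snapshot" ]; then
    echo "snapshot $snapshot is missing or empty" >&2
    return 1
  fi
  # Prints the hash, revision, total keys and size of the snapshot.
  ETCDCTL_API=3 etcdctl snapshot status "$snapshot" --write-out=table
}
```

Typical use would be `verify_snapshot ./new.snapshot.db` right after step 0, and again on ./old.snapshot.db before step 2 stops etcd.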


These backups require running commands by hand. If your etcd cluster runs inside the Kubernetes cluster, you can use a scheduled Job (CronJob) like the one below to back up etcd data locally, automatically and periodically (the YAML below takes an etcd backup every minute). CronJobs will be covered in a separate chapter later. The automatic backup manifest is as follows:

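# batch/v1 is available from Kubernetes 1.21; batch/v1beta1 was removed in 1.25.
# On clusters older than 1.21, use batch/v1beta1 instead.
apiVersion: batch/v1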
kind: CronJob
metadata:
  name: backup
  namespace: kube-system
spec:
  # activeDeadlineSeconds: 100
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            # Same image as in /etc/kubernetes/manifests/etcd.yaml
            image: k8s.gcr.io/etcd:3.2.24
            env:
            - name: ETCDCTL_API
              value: "3"
            command: ["/bin/sh"]
            args: ["-c", "etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key snapshot save /backup/etcd-snapshot-$(date +%Y-%m-%d_%H:%M:%S_%Z).db"]
            volumeMounts:
            - mountPath: /etc/kubernetes/pki/etcd
              name: etcd-certs
              readOnly: true
            - mountPath: /backup
              name: backup
          restartPolicy: OnFailure
          hostNetwork: true
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
              type: DirectoryOrCreate
          - name: backup
            hostPath:
              path: /data/backup
              type: DirectoryOrCreate
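
A backup every minute will fill /data/backup quickly, so some form of retention is needed. The following is a minimal sketch, assuming the file naming used by the CronJob above (etcd-snapshot-<timestamp>.db) and a single time zone, so that file names sort chronologically. The function name prune_snapshots and the retention count are illustrative, not part of the original setup.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only the newest N snapshots in a directory.
prune_snapshots() {
  local dir=$1 keep=$2
  # Glob results are sorted by name, which here means oldest first.
  set -- "$dir"/etcd-snapshot-*.db
  # If the glob matched nothing, "$1" is the literal pattern; nothing to do.
  [ -e "$1" ] || return 0
  local remove=$(( $# - keep ))
  # Delete from the front of the list (the oldest files) until N remain.
  while [ "$remove" -gt 0 ]; do
    rm -f -- "$1"
    shift
    remove=$(( remove - 1 ))
  done
}
```

For example, `prune_snapshots /data/backup 60` run on the host would keep roughly the last hour of per-minute snapshots.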

     

Backing up PV data

Backing up PVs is more troublesome. Kubernetes does not provide storage itself; it relies on storage plugins to manage and consume storage. So to back up storage, and PVs in particular, we have to rely on each cloud provider's API to take snapshots.

But the etcd and PV backup procedures above are not very convenient, so I recommend backing up Kubernetes with Velero. Velero is powerful yet easy to operate, and it can help you with the following three things:

Back up and restore Kubernetes clusters.
Migrate clusters.
Copy a cluster's configuration and objects, for example to other development and test clusters.

Velero can also back up an individual Namespace, which is very convenient if you only want to back up certain key services and their data.

With that said, let's look at how Velero backs up Kubernetes.

Backing up Kubernetes with Velero

This is Velero's architecture diagram:

[Figure: Velero architecture]

Velero consists of two parts:

A command-line client that runs locally and performs the etcd and PV backup operations from the command line; you can use it to back up Kubernetes in much the same way as you use kubectl to operate on Kubernetes resources.
A service running in the Kubernetes cluster (BackupController), responsible for carrying out the actual backup and recovery operations.

Let's walk through the process:

The local Velero client sends a backup command, such as velero backup create test-project-s2i --include-namespaces test in the figure, which creates a Backup object through the APIServer.
The BackupController watches for this Backup object and validates it, for example by checking its parameters.
The BackupController queries the APIServer for the relevant data and starts the backup.
The BackupController uploads the queried data to remote object storage.
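
The flow above can be driven from a small script. This is only a sketch: it assumes Velero is already installed in the cluster and the velero client is on the PATH, and the function names backup_namespace and restore_backup are hypothetical wrappers, not part of Velero.

```shell
#!/usr/bin/env bash
# Hypothetical wrappers around the velero CLI.

# Back up one namespace and print the generated backup name.
backup_namespace() {
  local ns=$1
  local name
  name="${ns}-$(date +%Y%m%d%H%M%S)"
  # --wait blocks until the backup finishes instead of returning at once.
  velero backup create "$name" --include-namespaces "$ns" --wait >&2 || return 1
  echo "$name"
}

# Restore everything contained in an existing backup.
restore_backup() {
  velero restore create --from-backup "$1" --wait
}
```

For example, `name=$(backup_namespace test)` creates the Backup object, `velero backup describe "$name"` shows its status, and `restore_backup "$name"` restores from it.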

Velero installs a number of CRDs (Custom Resource Definitions) and associated controllers in Kubernetes, and operations such as backup and recovery are carried out through them. Backing up and restoring the cluster is therefore essentially a matter of operating on these CRDs; the BackupController decides what to do based on them.

Velero supports two CRDs for back-end storage: BackupStorageLocation and VolumeSnapshotLocation.

BackupStorageLocation defines where the data of Kubernetes cluster resources is stored, that is, cluster object data rather than PVC and PV data. The support list shows the back-end storage services currently supported officially and by third parties; these are mainly S3-compatible stores such as AWS S3, Alibaba Cloud OSS, and Minio.
VolumeSnapshotLocation is used to take snapshots of PVs. The snapshot capability is usually provided by services such as Amazon EBS Volumes, Azure Managed Disks, and Google Persistent Disks, and you can pick a cloud vendor's service according to your needs. Alternatively, you can use the dedicated backup tool Restic to back up PV data to Azure Files, Alibaba Cloud OSS, and so on. Alibaba Cloud also provides a Velero-based plugin.

In addition, the BackupController creates other CRDs as it works, mainly for internal logic. See the Alibaba Cloud documentation for further details.

If you do not have Alibaba Cloud OSS, or your cluster is an offline internal cluster, you can set up Minio yourself as an object storage service in place of Alibaba Cloud OSS. See the official documentation for detailed installation and configuration.
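
Because Minio speaks the S3 API, Velero is typically pointed at it through the AWS object-store plugin. The sketch below shows roughly what that installation looks like. The function name install_velero_with_minio is hypothetical, and the Minio URL, bucket name, credentials file, and plugin image tag are all assumptions you must supply for your own environment; check the Velero documentation for the plugin version that matches your Velero release.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: install Velero with a self-hosted Minio as the
# BackupStorageLocation. All four arguments are environment-specific.
install_velero_with_minio() {
  local s3_url=$1 bucket=$2 secret_file=$3 plugin_image=$4
  velero install \
    --provider aws \
    --plugins "$plugin_image" \
    --bucket "$bucket" \
    --secret-file "$secret_file" \
    --use-volume-snapshots=false \
    --backup-location-config "region=minio,s3ForcePathStyle=true,s3Url=${s3_url}"
}
```

Here --use-volume-snapshots=false skips creating a VolumeSnapshotLocation, since Minio only serves as object storage; the secret file is expected to hold the Minio access key and secret key in AWS credentials-file format.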

Summary

In a distributed world, it is hard to guarantee that nothing will ever go wrong. As you deploy more and more services on a Kubernetes cluster, disaster recovery for the cluster and its data becomes essential. In July of this year, GitHub, the code hosting platform many of us use, suffered a Kubernetes failure that led to a serious outage lasting four and a half hours. I therefore recommend backing up critical business data regularly.