etcd: building a highly available etcd cluster

2022-07-05 16:53:00 【Zhang quandan, Foxconn quality inspector】

etcd is simple to use: put, get, and watch are enough to get data flowing. Running it in production, however, raises many questions. First, how do you secure it? Second, how do you build a highly available etcd cluster? Third, how do you back up the data? All of these bear directly on the safety of the whole cluster.

Important etcd member parameters

ETCD_NAME: the name of the node, unique within the cluster

There are several kinds of parameters. The first are the core, most basic ones, which relate to members. Every etcd member has its own name, and the default name is default. So if you start an etcd member without setting any parameters, that member is called default.

ETCD_NAME="etcd-1"
ETCD_NAME="etcd-2"
ETCD_NAME="etcd-3"

ETCD_DATA_DIR: the data directory

etcd ultimately persists its data to disk. By default the directory is named after the member: <member name>.etcd.

ETCD_DATA_DIR="/var/lib/etcd/default.etcd"

ETCD_LISTEN_PEER_URLS: listen address for cluster (peer) communication
ETCD_LISTEN_CLIENT_URLS: listen address for client access

etcd supports two types of URL. Because etcd runs as a cluster, communication between members goes over the peer URL, while requests sent from clients to the etcd server go over the client URL. These are two different kinds of requests carrying different kinds of data, so they are served on different ports.

The separation also makes peer traffic easier to protect. Peer traffic may deserve a higher priority, and later network tuning can target the peer port to safeguard the data at the network level.

So the two are isolated: clients use the client URL, and peers use the peer URL.

ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.31.71:2380"
ETCD_ADVERTISE_CLIENT_URLS="https://192.168.31.71:2379"

Important etcd cluster parameters

ETCD_INITIAL_CLUSTER_STATE: the state in which the member joins the cluster; new means a new cluster, existing means joining an existing cluster

In other words, either create a new cluster or start an etcd instance that joins an existing one.

ETCD_INITIAL_CLUSTER_TOKEN: the token used to initialize the cluster

ETCD_INITIAL_ADVERTISE_PEER_URLS: the peer address advertised to the rest of the cluster
ETCD_ADVERTISE_CLIENT_URLS: the client address advertised to clients

These are the addresses a member announces, that is, the addresses other members and clients should use to reach it.
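To see how the member and cluster parameters fit together, here is a minimal Python sketch that renders the settings for each member of a three-node cluster. The names etcd-1 to etcd-3 and the address 192.168.31.71 come from the examples above; the other two addresses and the token value are made up for illustration. ETCD_INITIAL_CLUSTER is a real etcd setting that this article does not list; it is included on the assumption that the cluster is bootstrapped statically.

```python
# Sketch: render per-member etcd environment settings for a static 3-node cluster.
# Only etcd-1's address appears in the article; the other two are hypothetical.
MEMBERS = {
    "etcd-1": "192.168.31.71",
    "etcd-2": "192.168.31.72",  # assumed address
    "etcd-3": "192.168.31.73",  # assumed address
}


def render_env(name, members, token="etcd-cluster", state="new"):
    """Return the environment settings for one member as a dict.

    The token default is a placeholder, not a value from the article.
    """
    ip = members[name]
    # ETCD_INITIAL_CLUSTER lists every member as name=peer-url, comma separated.
    initial_cluster = ",".join(
        f"{n}=https://{addr}:2380" for n, addr in members.items()
    )
    return {
        "ETCD_NAME": name,
        "ETCD_DATA_DIR": "/var/lib/etcd/default.etcd",
        # Peers talk on 2380, clients on 2379, as in the article's examples.
        "ETCD_LISTEN_PEER_URLS": f"https://{ip}:2380",
        "ETCD_LISTEN_CLIENT_URLS": f"https://{ip}:2379",
        "ETCD_INITIAL_ADVERTISE_PEER_URLS": f"https://{ip}:2380",
        "ETCD_ADVERTISE_CLIENT_URLS": f"https://{ip}:2379",
        "ETCD_INITIAL_CLUSTER": initial_cluster,
        "ETCD_INITIAL_CLUSTER_TOKEN": token,
        "ETCD_INITIAL_CLUSTER_STATE": state,
    }


def format_env(env):
    """Format the settings as KEY="value" lines, one per setting."""
    return "\n".join(f'{key}="{value}"' for key, value in env.items())


if __name__ == "__main__":
    print(format_env(render_env("etcd-1", MEMBERS)))
```

Every member gets the same ETCD_INITIAL_CLUSTER and token; only the name and the addresses differ. A member added to a running cluster later would be rendered with state="existing".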

etcd security parameters

Since peer and client traffic use two separate ports, each port needs its own security guarantee. The most common approach in etcd is mutual TLS, that is, two-way TLS: the client verifies the server when it connects, and the server in turn verifies the client.

Each port has its own TLS configuration parameters, such as the cert, the key, and the client CRL.

To access a TLS-enabled etcd, you have to bring the key, the certificate, and the CA with you.

--cert-file=/opt/etcd/ssl/server.pem \
--key-file=/opt/etcd/ssl/server-key.pem \
--peer-cert-file=/opt/etcd/ssl/server.pem \
--peer-key-file=/opt/etcd/ssl/server-key.pem \
--trusted-ca-file=/opt/etcd/ssl/ca.pem \
--peer-trusted-ca-file=/opt/etcd/ssl/ca.pem \

With the above parameters, you can set up the cluster .
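On the client side, the same files are passed to etcdctl. The following Python sketch only builds the command line; it does not run anything. It assumes the server certificate under /opt/etcd/ssl is also accepted as a client certificate, which the article does not state, and the endpoint is the client address from the earlier example.

```python
import shlex

SSL_DIR = "/opt/etcd/ssl"  # directory used by the flags above


def etcdctl_tls_args(endpoint, ssl_dir=SSL_DIR):
    """Build the etcdctl prefix for talking to a TLS-enabled endpoint."""
    return [
        "etcdctl",
        f"--endpoints={endpoint}",
        f"--cacert={ssl_dir}/ca.pem",       # CA used to verify the server
        f"--cert={ssl_dir}/server.pem",     # assumption: reused as client cert
        f"--key={ssl_dir}/server-key.pem",  # key matching the cert above
    ]


def health_check_command(endpoint):
    """Return the full argument list for `etcdctl endpoint health`."""
    return etcdctl_tls_args(endpoint) + ["endpoint", "health"]


if __name__ == "__main__":
    # Print the command so it can be reviewed before being run by hand.
    print(shlex.join(health_check_command("https://192.168.31.71:2379")))
```

In a real deployment a dedicated client certificate would usually be issued instead of reusing the server one.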

Disaster preparedness

A cluster built with these parameters has multiple members, and those members run on different nodes. If one node breaks, other nodes still hold the data, so the data is not lost.

In this setup the data is safe most of the time, but extreme cases remain. If all members disappear together, the data is lost, and that is unacceptable.

For example, suppose etcd runs as pods. Does each pod have a data disk mounted from outside? If not, and a pod is evicted for some reason, such as writing too much data, everything it stored is gone after the eviction. If all etcd pods are evicted, all the data is lost.

Losing that data can cause very serious problems, because etcd stores all the important information of the Kubernetes cluster.

Take the Calico CNI plugin. It can have its own etcd, which stores all IP allocation information for the cluster. Suppose the cluster runs 100,000 pods and this information is lost: the record of which IP went to which of those 100,000 pods is gone. When a new pod starts, nothing knows whether its IP is already assigned elsewhere, so the same IP is likely to be assigned twice, to a pod on another node or on the same node.

The consequences are severe. IP allocation across the whole cluster becomes a mess, and two pods may end up fighting over one IP. A user who meant to reach one service lands on another. Nobody can accept that outcome.

So etcd data safety is very important. Besides protecting the data with multiple replicas, we also need regular backups.

The advantage of a backup is that even if the instances are lost, you can restore from it. The restored data is not fully up to date: whatever was written between the time the backup was taken and the time of the failure is lost. The more frequent the backups, the less data is lost.

etcd itself provides a snapshot command to take a snapshot of the cluster data, and a restore command to replay the information in the snapshot and recover the data.
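As a rough illustration of a periodic backup, the Python sketch below builds the etcdctl snapshot save and snapshot restore command lines and computes how much data a restore would lose. The backup directory, the restore data directory, and the file naming scheme are hypothetical, and the TLS flags are left out for brevity. Note that newer etcd releases move the restore step to a separate etcdutl tool, so check which one your version expects.

```python
from datetime import datetime, timedelta


def snapshot_path(backup_dir, taken_at):
    """Name the snapshot file after the time it was taken (naming is made up)."""
    return f"{backup_dir}/etcd-snapshot-{taken_at:%Y%m%d-%H%M%S}.db"


def save_command(endpoint, path):
    """Argument list for taking a snapshot from one endpoint."""
    return ["etcdctl", f"--endpoints={endpoint}", "snapshot", "save", path]


def restore_command(path, data_dir):
    """Argument list for restoring a snapshot into a fresh data directory."""
    return ["etcdctl", "snapshot", "restore", path, f"--data-dir={data_dir}"]


def data_loss_window(last_backup, failure_time):
    """Everything written after the last backup is lost on restore."""
    return failure_time - last_backup


if __name__ == "__main__":
    taken = datetime(2022, 7, 5, 16, 53, 0)
    path = snapshot_path("/var/backups/etcd", taken)  # hypothetical directory
    print(" ".join(save_command("https://192.168.31.71:2379", path)))
    print(" ".join(restore_command(path, "/var/lib/etcd/restored.etcd")))
    # A failure six hours after the backup loses six hours of writes.
    print(data_loss_window(taken, taken + timedelta(hours=6)))
```

The last function is the whole argument for backing up more often: the worst-case loss equals the backup interval.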

Building a highly available Kubernetes cluster therefore comes down to how etcd is made highly available, how the apiserver is made highly available, and how the controller manager and the scheduler are handled.

That covers high availability management for etcd.

etcd capacity management and defragmentation

Capacity management

etcd gives some recommendations. A single object should not exceed 1.5 MB. When objects are large, the synchronization overhead and the in-memory snapshot overhead are both very high, which drags down the performance of the whole etcd cluster, so objects should not be too big.

etcd's default capacity is 2 GB, and it recommends not going beyond 8 GB. Production systems are generally set to 8 GB.
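The capacity is set in bytes through etcd's --quota-backend-bytes flag, so the sizes above have to be converted. A small Python sketch, assuming the article's "GB" means GiB (1024^3 bytes):

```python
GIB = 1024 ** 3          # assumption: the sizes above are binary gigabytes
RECOMMENDED_MAX_GIB = 8  # upper bound mentioned in the text


def quota_flag(gib):
    """Return the --quota-backend-bytes flag for a quota given in GiB."""
    if gib <= 0:
        raise ValueError("quota must be positive")
    return f"--quota-backend-bytes={gib * GIB}"


def exceeds_recommendation(gib):
    """True when the quota is larger than the recommended maximum."""
    return gib > RECOMMENDED_MAX_GIB


if __name__ == "__main__":
    print(quota_flag(8))
    print(exceeds_recommendation(16))
```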

Cleaning up disk fragments with defrag

In this test, the etcd storage quota was set to 16 MB and data was written continuously. Once the quota was exceeded, etcd reported an error and the cluster was full. After exceeding the quota, no more data can be written.

The cluster is now in the alarm state: it has run out of space. For a cluster in the alarm state, no write operation can succeed.

defrag can clean up the disk, and it really does free some space, but the no space (NOSPACE) alarm is still set, so writes still fail.

So the alarm has to be cleared first, and only then will writes succeed again. Anyone who operates etcd is bound to run into this: writes fill the disk, the db exceeds its quota, and the cluster enters the alarm state. After cleaning up the data, disarm the alarm before you try to write again.

defrag cleans up disk fragments, but it relies on another operation, namely the compact command. etcd is a multi-version store: in BoltDB, every key carries version information, so a lot of historical versions accumulate. Much of that history is never used, so we want to compress it. The compact command takes a revision, and all versions before that revision are removed, which saves space.

defrag then defragments the disk. Heavy fragmentation leads to low disk utilization, and one defrag run fixes it.
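Putting the steps in order: find the current revision, compact up to it, defragment, then disarm the alarm. The Python sketch below builds that command sequence. It assumes the JSON printed by etcdctl endpoint status --write-out=json is a list whose items carry the revision under Status.header.revision; treat that layout as an assumption and check it against your version. Compacting to the current revision throws away all history, which is a choice made here for simplicity.

```python
import json


def current_revision(status_json):
    """Pick the highest revision out of the endpoint status JSON.

    Assumed layout: [{"Endpoint": ..., "Status": {"header": {"revision": N}}}]
    """
    statuses = json.loads(status_json)
    return max(item["Status"]["header"]["revision"] for item in statuses)


def recovery_commands(revision, endpoint):
    """Commands to run, in order, to get out of the no space alarm."""
    base = ["etcdctl", f"--endpoints={endpoint}"]
    return [
        base + ["compact", str(revision)],  # drop history before this revision
        base + ["defrag"],                  # give the freed space back
        base + ["alarm", "disarm"],         # clear the alarm so writes work
        base + ["alarm", "list"],           # confirm no alarm is left
    ]


if __name__ == "__main__":
    # Hand-written sample status, not real output.
    sample = '[{"Endpoint": "https://192.168.31.71:2379", "Status": {"header": {"revision": 1516}}}]'
    rev = current_revision(sample)
    for cmd in recovery_commands(rev, "https://192.168.31.71:2379"):
        print(" ".join(cmd))
```

defrag works on the endpoints it is pointed at, so in a multi-member cluster the sequence would be given every member's endpoint.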

Later versions of etcd support automatic compaction, so you no longer have to run it by hand. Older versions of etcd needed a lot of operational care: the disk would often fill up again, and someone had to go online, find the cause, and run compact manually. etcd later added support for doing compaction automatically.

Copyright notice
This article was written by [Zhang quandan, Foxconn quality inspector]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/186/202207051620528992.html