Runtime reconfiguration of etcd

2022-06-11 11:25:00 【Cotton wool】

Original text: Runtime reconfiguration of etcd

Runtime reconfiguration

etcd is designed to withstand machine failures. An etcd cluster recovers automatically from temporary failures (for example, machine reboots) and, with N members, tolerates up to (N-1)/2 permanent failures, rounded down. When a member fails permanently, whether from hardware failure or disk corruption, it loses access to the cluster. If the cluster permanently loses more than (N-1)/2 members, it fails catastrophically and loses quorum irrevocably. Once quorum is lost, the cluster cannot reach consensus and therefore cannot accept further updates.
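As a rough illustration of this arithmetic, the following shell sketch tabulates the quorum size and the number of permanent failures tolerated for small clusters. The function names quorum and fault_tolerance are made up for this example and are not part of etcd or etcdctl. Shell integer division rounds down, which matches the (N-1)/2 rule.

```shell
# Hypothetical helpers for illustration only; they do not talk to etcd.
# quorum N: smallest majority of an N-member cluster.
quorum() { echo $(( $1 / 2 + 1 )); }

# fault_tolerance N: permanent failures an N-member cluster can absorb.
fault_tolerance() { echo $(( ($1 - 1) / 2 )); }

for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

The loop shows that a two-member cluster tolerates no failures at all, and that a four-member cluster tolerates no more failures than a three-member one.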

etcd has built-in support for incremental runtime reconfiguration, which lets users update the cluster membership at run time.

Reconfiguration requests can only be processed while a majority of the cluster members are functioning. It is strongly recommended to always run more than two members in production, because removing a member from a two-member cluster is unsafe: if anything fails during the removal, the cluster may be unable to make progress and will need to be restarted from majority failure.

Reconfiguration use cases

Let's walk through some common reasons for reconfiguring a cluster. Most of them simply involve adding or removing members in some combination.

Cycle or upgrade multiple machines

If several cluster members need to move because of planned maintenance (hardware upgrades, network downtime), it is recommended to modify the members one at a time.
Removing the leader is safe, but the cluster is briefly unavailable while a new leader is elected. If the cluster holds more than 50MB of data, it is recommended to migrate the member's data directory.

Change the cluster size

Increasing the cluster size improves failure tolerance and read performance. Because clients can read from any member, adding members raises the total read throughput.

Decreasing the cluster size improves write performance at the cost of resilience. A write must be replicated to a majority of the members before it is considered committed. A smaller cluster has a smaller majority, so each write commits faster.

Replace a failed machine

If a machine fails because of a hardware fault, a corrupted data directory, or some other fatal condition, replace it as soon as possible. A machine that has failed but has not been removed counts against the quorum and reduces the tolerance for further failures.

To replace the machine, follow the instructions for removing the member from the cluster, then add a new member in its place. If the cluster holds more than 50MB of data and the failed member's data directory is still accessible, it is recommended to migrate that data directory.

Restart the cluster from majority failure

If a majority of the cluster is lost, or all of the nodes have changed IP addresses, manual action is needed to recover safely.

The basic recovery steps are: create a new cluster from the old data, force a single member to act as the leader, and finally use runtime configuration to add new members to this new cluster one at a time.

Cluster reconfiguration operations

Before any change, a simple majority (quorum) of etcd members must be available. This is the same requirement as for any other write to etcd.
All cluster changes are made one at a time:

To update a single member's peerURLs, do one update operation
To replace a single member, do an add operation followed by a remove operation
To grow the cluster from 3 to 5 members, do two add operations
To shrink the cluster from 5 to 3 members, do two remove operations
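The one-at-a-time rule can be sketched as a small planning helper. The function resize_plan below is hypothetical: it only prints a plan and never calls etcdctl. It expands a size change into single-member steps, each of which would correspond to one etcdctl member add or member remove.

```shell
# resize_plan FROM TO: print one single-member step per line.
# Hypothetical helper for illustration; it does not touch a real cluster.
resize_plan() {
  current=$1
  target=$2
  # Growing: one "add" per missing member.
  while [ "$current" -lt "$target" ]; do
    echo "add: $current -> $((current + 1))"
    current=$((current + 1))
  done
  # Shrinking: one "remove" per surplus member.
  while [ "$current" -gt "$target" ]; do
    echo "remove: $current -> $((current - 1))"
    current=$((current - 1))
  done
}

resize_plan 3 5
resize_plan 5 3
```

Each printed step should be completed and verified on the live cluster before the next one is attempted.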

All of these examples use the etcdctl command line tool that ships with etcd.
To change membership without etcdctl, use the v2 HTTP members API or the v3 gRPC members API.

Test environment

Name   IP            Status
etcd1  192.168.4.10  Existing member
etcd2  192.168.4.20  Existing member
etcd3  192.168.4.30  Added, updated, or removed

Example /usr/lib/systemd/system/etcd.service unit file on etcd1:

[Unit]
Description=etcd key-value store
Documentation=https://github.com/etcd-io/etcd
After=network.target

[Service]
EnvironmentFile=/etc/etcd/etcd.conf
ExecStart=/usr/bin/etcd
Restart=always

[Install]
WantedBy=multi-user.target

Example /etc/etcd/etcd.conf configuration file on etcd1:

ETCD_NAME=etcd1
ETCD_DATA_DIR=/etc/etcd/data

ETCD_LISTEN_CLIENT_URLS=http://192.168.4.10:2379
ETCD_LISTEN_PEER_URLS=http://192.168.4.10:2380

ETCD_ADVERTISE_CLIENT_URLS=http://192.168.4.10:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=http://192.168.4.10:2380

ETCD_INITIAL_CLUSTER_STATE=new
ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster
ETCD_INITIAL_CLUSTER=etcd1=http://192.168.4.10:2380,etcd2=http://192.168.4.20:2380

ETCD_ENABLE_V2=true

Check the member information:

# etcdctl --endpoints="http://192.168.4.10:2379" member list
39d7dd629f95330e, started, etcd2, http://192.168.4.20:2380, http://192.168.4.20:2379, false
f9b6e5803038fabb, started, etcd1, http://192.168.4.10:2380, http://192.168.4.10:2379, false

View the existing test data on etcd1 and etcd2:

# ETCDCTL_API=2 etcdctl --endpoints="http://192.168.4.10:2379" get /docker-flannel/network/config
{
  "Network": "10.0.0.0/16",
  "SubnetLen": 24,
  "Backend": {
    "Type": "vxlan"
  }
}

# ETCDCTL_API=2 etcdctl --endpoints="http://192.168.4.20:2379" get /docker-flannel/network/config
{
  "Network": "10.0.0.0/16",
  "SubnetLen": 24,
  "Backend": {
    "Type": "vxlan"
  }
}

Add a new member

Adding a member takes two steps:

Add the new member to the cluster through the HTTP members API, the gRPC members API, or the etcdctl member add command.
Start the new member with the new cluster configuration, including the updated member list (the existing members plus the new one).

Use etcdctl to add the new member etcd3 (192.168.4.30) to the cluster, specifying its name and advertised peer URLs (the command can be run against any existing member):

# etcdctl --endpoints="http://192.168.4.10:2379" member add etcd3 --peer-urls=http://192.168.4.30:2380
Member 9825b911c2558475 added to cluster  b9b6bab8c2110fd

ETCD_NAME="etcd3"
ETCD_INITIAL_CLUSTER="etcd2=http://192.168.4.20:2380,etcd3=http://192.168.4.30:2380,etcd1=http://192.168.4.10:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://192.168.4.30:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"

At this point, listing the cluster members again shows http://192.168.4.30:2380 in the unstarted state:

# etcdctl --endpoints="http://192.168.4.10:2379" member list
39d7dd629f95330e, started, etcd2, http://192.168.4.20:2380, http://192.168.4.20:2379, false
9825b911c2558475, unstarted, , http://192.168.4.30:2380, , false
f9b6e5803038fabb, started, etcd1, http://192.168.4.10:2380, http://192.168.4.10:2379, false

etcdctl has informed the cluster about the new member and printed the environment variables needed to start it successfully. Fill in the /etc/etcd/etcd.conf file on the etcd3 machine accordingly:

ETCD_NAME="etcd3"
ETCD_DATA_DIR=/etc/etcd/data

ETCD_LISTEN_CLIENT_URLS=http://192.168.4.30:2379
ETCD_LISTEN_PEER_URLS=http://192.168.4.30:2380

ETCD_ADVERTISE_CLIENT_URLS=http://192.168.4.30:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=http://192.168.4.30:2380

ETCD_INITIAL_CLUSTER="etcd2=http://192.168.4.20:2380,etcd3=http://192.168.4.30:2380,etcd1=http://192.168.4.10:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster

ETCD_ENABLE_V2=true

After writing the file, start the service with systemctl start etcd.service.

The new member runs as part of the cluster and immediately begins catching up with the other members.

When adding multiple members, the best practice is to configure one member at a time and verify that it starts correctly before adding the next.
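One way to verify that a member has started before adding the next is to poll the member list. The sketch below is an illustration under assumptions, not part of etcdctl: the function name wait_started and the variables ENDPOINTS and WAIT_INTERVAL are invented here, and the parsing relies on the comma-separated member list output shown in this article (ID, status, name, peer URLs, client URLs, learner flag).

```shell
# wait_started NAME [TRIES]: succeed once member NAME is listed as "started".
# ENDPOINTS and WAIT_INTERVAL are assumed variables for this sketch.
wait_started() {
  name=$1
  tries=${2:-30}
  while [ "$tries" -gt 0 ]; do
    # Field 2 is the status and field 3 is the member name.
    if etcdctl --endpoints="${ENDPOINTS:-http://192.168.4.10:2379}" member list |
      awk -F', ' -v n="$name" '$2 == "started" && $3 == n { ok = 1 } END { exit !ok }'
    then
      return 0
    fi
    tries=$((tries - 1))
    sleep "${WAIT_INTERVAL:-2}"
  done
  return 1
}
```

For example, running wait_started etcd3 after starting the service on etcd3 would return success once the member reports started, or fail after the retries run out.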

If you add a new member to a one-node cluster, the cluster cannot make progress until the new member starts, because it now needs two members as a majority to agree on consensus. This only affects the interval between the moment etcdctl member add informs the cluster about the new member and the moment the new member successfully connects to the existing one.

Check the health of the three etcd members:

# etcdctl --endpoints=http://192.168.4.10:2379,http://192.168.4.20:2379,http://192.168.4.30:2379 endpoint health
http://192.168.4.10:2379 is healthy: successfully committed proposal: took = 16.900029ms
http://192.168.4.30:2379 is healthy: successfully committed proposal: took = 17.184419ms
http://192.168.4.20:2379 is healthy: successfully committed proposal: took = 10.413913ms

# etcdctl --endpoints="http://192.168.4.10:2379" member list
39d7dd629f95330e, started, etcd2, http://192.168.4.20:2380, http://192.168.4.20:2379, false
9825b911c2558475, started, etcd3, http://192.168.4.30:2380, http://192.168.4.30:2379, false
f9b6e5803038fabb, started, etcd1, http://192.168.4.10:2380, http://192.168.4.10:2379, false

Check whether the data has been synchronized to etcd3:

# ETCDCTL_API=2 etcdctl --endpoints="http://192.168.4.30:2379" get /docker-flannel/network/config
{
  "Network": "10.0.0.0/16",
  "SubnetLen": 24,
  "Backend": {
    "Type": "vxlan"
  }
}

Update a member

Update advertise client URLs

To update a member's advertise client URLs, simply restart the member with the updated client URLs flag (--advertise-client-urls) or environment variable (ETCD_ADVERTISE_CLIENT_URLS). The restarted member publishes the updated URLs itself. A wrongly updated client URL does not affect the health of the etcd cluster.

Update advertise peer URLs

To update a member's advertise peer URLs, first update them explicitly with the member command and then restart the member. The extra step is required because updating peer URLs changes the cluster-wide configuration and can affect the health of the etcd cluster.

To update the peer URLs, first find the target member's ID. Use etcdctl to list all members:

# etcdctl --endpoints="http://192.168.4.10:2379" member list
39d7dd629f95330e, started, etcd2, http://192.168.4.20:2380, http://192.168.4.20:2379, false
9825b911c2558475, started, etcd3, http://192.168.4.30:2380, http://192.168.4.30:2379, false
f9b6e5803038fabb, started, etcd1, http://192.168.4.10:2380, http://192.168.4.10:2379, false
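When scripting this lookup, the ID can be extracted by member name. The helper below is a hypothetical sketch (member_id is not an etcdctl command); it assumes the comma-separated member list format shown above, where the first field is the ID and the third is the name.

```shell
# member_id NAME: read "etcdctl member list" output on stdin
# and print the ID of the member called NAME.
# Example (assumes a reachable cluster):
#   etcdctl --endpoints="http://192.168.4.10:2379" member list | member_id etcd3
member_id() {
  awk -F', ' -v n="$1" '$3 == n { print $1 }'
}
```

A member that has been added but not yet started has an empty name in the list, so this lookup only finds members that have already joined.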

In this example, update member 9825b911c2558475 (etcd3) and change its peerURLs value to http://192.168.4.30:23800:

# etcdctl --endpoints="http://192.168.4.10:2379" member update 9825b911c2558475 --peer-urls=http://192.168.4.30:23800
Member 9825b911c2558475 updated in cluster  b9b6bab8c2110fd

View the member list again:

# etcdctl --endpoints="http://192.168.4.10:2379" member list
39d7dd629f95330e, started, etcd2, http://192.168.4.20:2380, http://192.168.4.20:2379, false
9825b911c2558475, started, etcd3, http://192.168.4.30:23800, http://192.168.4.30:2379, false
f9b6e5803038fabb, started, etcd1, http://192.168.4.10:2380, http://192.168.4.10:2379, false

Remove a member

Suppose the member to remove has ID 9825b911c2558475 (etcd3). Use the remove command to perform the removal:

# etcdctl --endpoints="http://192.168.4.10:2379" member remove 9825b911c2558475
Member 9825b911c2558475 removed from cluster  b9b6bab8c2110fd

The target member then stops itself and prints the removal notice in its log, and the etcd service stops:

etcd: the member has been permanently removed from the cluster

Removing the leader is safe, but the cluster is inactive while a new leader is elected. This period normally lasts for the election timeout plus the voting process.

To see which node is the leader, query the endpoint status. The --cluster flag queries every member of the cluster, so there is no need to list each endpoint with --endpoints:

# etcdctl --endpoints="http://192.168.4.10:2379" endpoint status --cluster -w table
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://192.168.4.30:2379 |  f131e3371e51a36 |  3.4.13 |   25 kB |     false |      false |        62 |     198958 |             198958 |        |
| http://192.168.4.20:2379 | 441a3fcaee433945 |  3.4.13 |   25 kB |      true |      false |        62 |     198958 |             198958 |        |
| http://192.168.4.10:2379 | f9b6e5803038fabb |  3.4.13 |   25 kB |     false |      false |        62 |     198958 |             198958 |        |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Alternatively, list every endpoint explicitly with the --endpoints flag instead of using --cluster:

# etcdctl --endpoints="http://192.168.4.10:2379,http://192.168.4.20:2379,http://192.168.4.30:2379" endpoint status -w table
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://192.168.4.10:2379 | f9b6e5803038fabb |  3.4.13 |   25 kB |     false |      false |        62 |     198963 |             198963 |        |
| http://192.168.4.20:2379 | 441a3fcaee433945 |  3.4.13 |   25 kB |      true |      false |        62 |     198963 |             198963 |        |
| http://192.168.4.30:2379 |  f131e3371e51a36 |  3.4.13 |   25 kB |     false |      false |        62 |     198963 |             198963 |        |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
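For scripts, the leader's endpoint can be pulled out of the table output shown above. The function leader_endpoint is a hypothetical helper written for this article, and it assumes the column order of the table as printed by etcdctl 3.4.13 (ENDPOINT first, IS LEADER fifth). A machine-readable format such as -w json is likely a sturdier choice if the layout changes between versions.

```shell
# leader_endpoint: read "endpoint status -w table" output on stdin
# and print the endpoint whose IS LEADER column is "true".
# Example (assumes a reachable cluster):
#   etcdctl --endpoints="http://192.168.4.10:2379" endpoint status --cluster -w table | leader_endpoint
leader_endpoint() {
  # Splitting on "|" leaves field 1 empty, so ENDPOINT is field 2 and IS LEADER is field 6.
  # Border lines contain no "|" and are skipped by the NF check.
  awk -F'[|]' 'NF >= 7 {
    gsub(/ /, "", $2)
    gsub(/ /, "", $6)
    if ($6 == "true") print $2
  }'
}
```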

Strict reconfiguration check mode (--strict-reconfig-check)

As mentioned above, the best practice for adding new members is to configure one member at a time and verify that it starts correctly before adding more. This step-by-step approach matters because a newly added member that is misconfigured (for example, with incorrect peer URLs) can cost the cluster its quorum. Quorum is lost because the new member counts toward the quorum even when the existing members cannot reach it. Connectivity or operational problems can cause the same loss of quorum.

To avoid this problem, etcd provides the --strict-reconfig-check option. When it is passed to etcd, etcd rejects a reconfiguration request if the number of started members would fall below the quorum of the reconfigured cluster.

Enabling this option is recommended. Older releases left it off by default for compatibility, while recent releases, including the 3.4 line used in this walkthrough, are documented as enabling it by default. The corresponding environment variable is ETCD_STRICT_RECONFIG_CHECK.
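As a configuration sketch, the option could be set in the same /etc/etcd/etcd.conf environment file used earlier in this article. Only the variable name comes from the text above. Using "true" as the enabling value is an assumption, mirroring the ETCD_ENABLE_V2=true line in the example configuration.

```shell
# /etc/etcd/etcd.conf (fragment)
# Reject reconfiguration requests that would leave fewer started members than the quorum.
ETCD_STRICT_RECONFIG_CHECK=true
```

The member has to be restarted for a changed environment file to take effect.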

Reference: https://doczhcn.gitbook.io/etcd/index/index-1/clustering/runtime-configuration#yan-ge-zhong-pei-zhi-jian-cha-mo-shi-strictreconfigcheck
