
Redis Cluster: the ultimate solution?

2022-08-02 11:45:00 tar

Preface

This article is based on the Redis 6.2 source code.

Earlier articles in this series covered Redis master-slave replication and Sentinel mode. Both provide high availability for what is still a single data set, so capacity is limited by the memory of one machine. In addition, because of how Redis persistence works, the memory of a single Redis instance should not be too large.

What is the ultimate solution for distributed storage? Add machines. If one is not enough, add a second and a third, until there are enough to support your business.

So Redis also offers a cluster solution. In essence it spreads the data as evenly as possible across multiple nodes, all of which serve requests at the same time. Cluster mode therefore increases both overall storage capacity and overall throughput.

How is the data sharded: by key range or by key hash? Redis shards by hash. This avoids data skew as far as possible and also makes it quick to locate the node that owns a key.

As we know, the hard part of hashing is the modulus: it is difficult to choose its size. Too small, and you may have to rehash and expand frequently; too large, and the metadata grows unwieldy. Redis fixes the modulus at 16384, which means there are 16384 shards (also called slots) in total.

This is a compromise, neither small nor large. Later operations all work in units of slots: a key's slot number is its hash modulo 16384, and data migration moves whole slots at a time.

The Redis authors recommend that a cluster not exceed 1000 nodes, so even at that size each node still holds about 16 to 17 slots (16384 / 1000 ≈ 16.4).

Each node holds only part of the slots, so how does it know which nodes own the others? This is cluster metadata management: every node records which node each of the 16384 slots maps to, and communication between cluster nodes exists mainly to keep this metadata consistent.

Note that a Redis Cluster client also caches this mapping so that it can send each request straight to the right node. Since cluster nodes can change, the cluster notifies clients so that they notice and update their cache.

How does Redis Cluster provide high availability? With master-slave replication, of course. Note, however, that failover in a cluster no longer needs Sentinel. Why?

Think about it: a single Redis node cannot fail itself over, so Sentinel has to handle it. In a cluster there are multiple nodes that communicate with each other, so when one node fails the others can do the job of a Sentinel group and fail over automatically.


Cluster:

When designing a distributed cluster, we generally consider the following aspects:

  • Metadata storage, such as the mapping between shards and storage nodes
  • Inter-node communication, including information exchange and health status
  • Scaling out and in, which involves data migration
  • High availability: when a node fails, fail over promptly and automatically

To manage the metadata of a distributed service, one option is a centralized design: requests go through a middle tier, and a common coordination component such as zk or etcd stores the metadata. To the user it looks like a single node.

[Figure: centralized metadata management]

The other option is a decentralized design: every node keeps its own copy of the metadata, the nodes exchange information through a dedicated protocol, and a client can connect directly to any node in the cluster.

[Figure: decentralized metadata management]

Redis Cluster is decentralized. Every node maintains a complete copy of the metadata, and the nodes keep exchanging information so that the metadata converges across the whole cluster.

The effect is that whichever node you contact, it can always point you to the right node (even if the request is not its own to serve, it knows who can serve it).

Of course, if every node stores its own copy of the metadata (the slot-to-node mapping), updates take some time to become consistent everywhere, which places higher demands on the efficiency of inter-node communication.

In addition, hashing with a fixed modulus reduces the amount of data that has to be migrated when the cluster grows or shrinks.

High availability still relies on the traditional master-slave model, but failover is no longer carried out by Sentinel. It is negotiated among the cluster nodes, which in essence do Sentinel's job themselves.

Overall, a Redis Cluster looks like this:

[Figure: overall Redis Cluster architecture]

I. Information exchange

When nodes are added to or removed from a cluster, data is migrated in units of slots, so the mapping between some slots and nodes changes. Because every node holds a copy of that mapping, there has to be a way to propagate the change to all nodes.

The slots data structure is simply a bit array: 2048 bytes long, that is 16384 bits, or 2 KB.
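To make the size arithmetic concrete, here is a minimal Python sketch of such a bit array. The function names and the bit order inside each byte are assumptions made for illustration, not taken from the Redis source.

```python
CLUSTER_SLOTS = 16384

def new_slot_bitmap() -> bytearray:
    # 16384 bits packed into 16384 / 8 = 2048 bytes (2 KB)
    return bytearray(CLUSTER_SLOTS // 8)

def set_slot(bitmap: bytearray, slot: int) -> None:
    # Byte index is slot // 8, bit index inside the byte is slot % 8 (assumed layout)
    bitmap[slot >> 3] |= 1 << (slot & 7)

def has_slot(bitmap: bytearray, slot: int) -> bool:
    return bool(bitmap[slot >> 3] & (1 << (slot & 7)))

# A node that owns slots 0 to 5500 sets exactly those bits.
bitmap = new_slot_bitmap()
for s in range(0, 5501):
    set_slot(bitmap, s)

print(len(bitmap))  # 2048
```

Because the whole ownership map of a node fits in 2 KB, it is cheap enough to carry in every message between nodes.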

Cluster nodes pass cluster metadata to each other with PING and PONG messages, which share the same data structure. Each exchange tells the two sides what the other knows, and after several rounds the metadata of the whole cluster converges. This is the Gossip protocol.

1. The Gossip protocol

Redis uses a gossip protocol. As the name suggests, information spreads like a rumor, from one node to ten and onward, until all nodes agree on the metadata.

How does Redis Cluster implement Gossip? Each cluster node keeps information about the other nodes in the cluster, and this list serves as its contact list.

First, the work is driven by a periodic time event. Each time, the node picks 5 nodes at random from its contact list and then selects from that sample the node it has gone longest without hearing from.

It then builds a PING request and tries to contact that node. The message carries the hash slots the sender is responsible for, plus the information it holds about some of the other nodes.

Finally it receives the PONG response, which has essentially the same format as the PING: it contains the hash slots the responder handles and the information it holds about some other nodes. How much information about other nodes is sent is controlled by some parameters.

After one round trip the two sides have exchanged what they know, including what each knows about the rest of the cluster. After a few more rounds like this, the cluster information is essentially consistent everywhere.

You may have noticed that the peers above are chosen at random. What if some node never gets picked? Will it be left out?

Redis Cluster has thought of this too: nodes that have not been contacted for a long time are picked up periodically and go through the same exchange.
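The peer selection described above can be sketched roughly as follows. This is a simplified illustration, not the Redis implementation: `Peer`, `last_pong` and `pick_ping_target` are hypothetical names, and the separate periodic pass for long-silent nodes is left out.

```python
import random
from dataclasses import dataclass

@dataclass
class Peer:
    name: str
    last_pong: float  # time we last received a PONG from this peer

def pick_ping_target(peers, rng, sample_size=5):
    """Sample a few peers at random, then choose the one heard from least recently."""
    candidates = rng.sample(peers, min(sample_size, len(peers)))
    return min(candidates, key=lambda p: p.last_pong)

peers = [Peer("B", 100.0), Peer("C", 40.0), Peer("D", 75.0), Peer("E", 90.0)]
# With fewer than 5 peers every peer is sampled, so the stalest one (C) is chosen.
target = pick_ping_target(peers, random.Random(0))
print(target.name)  # C
```

Sampling first and then taking the stalest peer keeps each round cheap while still biasing the traffic toward nodes whose information is most likely out of date.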

2. Detecting slot migration

Because the cluster is decentralized, clients usually cache the metadata as well (the mapping between hash slots and nodes) so that they can locate the right node quickly and accurately. As a result, a client's cache can easily fall out of date.

Suppose a client sends a request to a node that does not own the hash slot of the key. How does the cluster respond?

To keep clients unaffected by such metadata changes, the cluster provides redirections such as MOVED and ASK. On receiving one, the client redirects the request and, where appropriate, updates its cache. Let's look at each:

1) MOVED

When a node finds that the slot of a key is not its responsibility, it returns a MOVED error to the client, pointing it to the node that currently owns the slot.

The format of the MOVED error is:

MOVED <slot> <ip>:<port>

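Here slot is the slot the key belongs to, and ip and port are the IP address and port of the node responsible for that slot.

When the client receives a MOVED error, it uses the IP address and port in the error to switch to the node responsible for the slot and resends the command it wanted to execute.

A cluster client usually keeps socket connections to several nodes in the cluster, so switching nodes really just means sending the command over a different socket.

If the client does not yet have a connection to the node it is redirected to, it first connects using the IP address and port from the MOVED error and then switches.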
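As an illustration of the client side, the following sketch parses a MOVED error in the format shown above and records the new owner in a local slot cache. `parse_moved` and `slot_cache` are hypothetical names, and the sample address is made up.

```python
from typing import NamedTuple

class Moved(NamedTuple):
    slot: int
    host: str
    port: int

def parse_moved(error: str) -> Moved:
    """Parse 'MOVED <slot> <ip>:<port>' into its three fields."""
    kind, slot, addr = error.split(" ")
    if kind != "MOVED":
        raise ValueError(f"not a MOVED error: {error!r}")
    # rpartition splits on the last colon, so the host part may itself contain colons
    host, _, port = addr.rpartition(":")
    return Moved(int(slot), host, int(port))

slot_cache = {}  # slot -> (host, port), the client's cached view of the cluster
moved = parse_moved("MOVED 3999 127.0.0.1:6381")
slot_cache[moved.slot] = (moved.host, moved.port)
print(slot_cache)  # {3999: ('127.0.0.1', 6381)}
```

A real client would then pick (or open) the socket for that address and resend the original command.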

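2) ASK

During resharding, while a slot is being migrated from a source node to a target node, it can happen that some of the slot's key-value pairs are still on the source node while the rest are already on the target node.

When a client sends the source node a command for a key, and that key belongs to a slot that is being migrated:

  • The source node first looks for the key in its own database. If it finds the key, it executes the command directly.
  • Otherwise the key may already have been migrated to the target node. The source node returns an ASK error that points the client to the target node importing the slot, and the client resends the command there.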
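The two cases above can be sketched as a single decision on the source node. This is a toy model under stated assumptions: `db` stands for the keys still held by the source node, `migrating` maps a slot to the target that is importing it, and the slot number and address in the example are invented rather than computed from the keys.

```python
def handle_on_source(db: dict, migrating: dict, slot: int, key: str):
    """Decide how the source node answers a read for `key` in `slot`."""
    if key in db:
        # Key has not been migrated yet: serve it locally.
        return ("OK", db[key])
    if slot in migrating:
        # Key may already live on the target: redirect this one request.
        ip, port = migrating[slot]
        return ("ERR", f"ASK {slot} {ip}:{port}")
    # Slot is not migrating and the key simply does not exist.
    return ("OK", None)

db = {"a": "1"}                          # "b" has already been moved away
migrating = {42: ("10.0.0.2", 6379)}     # slot 42 is being migrated
print(handle_on_source(db, migrating, 42, "a"))  # ('OK', '1')
print(handle_on_source(db, migrating, 42, "b"))  # ('ERR', 'ASK 42 10.0.0.2:6379')
```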

If a request hits a hash slot that is being migrated and the node cannot find the key, the client receives a response of this form:

ASK <slot> <ip>:<port>

[Figure: ASK redirection during slot migration]

Both ASK and MOVED redirect the client. How do they differ?

MOVED means responsibility for the slot has already been transferred from one node to another. After receiving a MOVED for slot i, the client also refreshes its cached mapping, and later requests for slot i go straight to the node named in the MOVED.

ASK, by contrast, is only a temporary measure used between the two nodes while a slot is migrating. After receiving an ASK for slot i, the client sends only its next command for slot i to the node named in the ASK.

The client cache is not refreshed in this case, so subsequent requests still follow the path "source node -> ASK redirect -> target node".
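The difference shows up clearly in a small client loop. The sketch below is an assumption-laden model, not a real client: `send` is a hypothetical transport callback, the two fake servers stand in for a cluster during and after migration, and the `asking` flag represents the ASKING command that a real Redis client sends before retrying on the target node.

```python
def parse_addr(reply: str):
    host, _, port = reply.split(" ")[2].rpartition(":")
    return host, int(port)

def execute(slot, cache, send, max_redirects=5):
    """Send a request for `slot`, following MOVED and ASK redirections."""
    addr, asking = cache[slot], False
    for _ in range(max_redirects):
        reply = send(addr, asking)
        kind = reply.split(" ", 1)[0]
        if kind == "MOVED":
            addr, asking = parse_addr(reply), False
            cache[slot] = addr          # ownership changed: refresh the cache
        elif kind == "ASK":
            addr, asking = parse_addr(reply), True   # one-shot: cache untouched
        else:
            return reply
    raise RuntimeError("too many redirections")

SRC, DST = ("10.0.0.1", 6379), ("10.0.0.2", 6379)

def during_migration(addr, asking):
    if addr == SRC:
        return "ASK 42 10.0.0.2:6379"   # key already moved to the target
    return "value" if asking else "MOVED 42 10.0.0.1:6379"

def after_migration(addr, asking):
    return "MOVED 42 10.0.0.2:6379" if addr == SRC else "value"

cache_a = {42: SRC}
reply_a = execute(42, cache_a, during_migration)   # cache_a still points at SRC
cache_b = {42: SRC}
reply_b = execute(42, cache_b, after_migration)    # cache_b now points at DST
```

After the ASK case the cache is unchanged, so the next request for slot 42 starts at the source node again; after the MOVED case it goes directly to the new owner.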

II. Data migration

Remember the purpose of a cluster? A single machine runs out of capacity, so we expand to multiple machines and spread the data as evenly as possible across the nodes. How do we support dynamic scaling while migrating as little data as possible?

1. Consistent hashing?

To be clear, the hash scheme Redis Cluster uses is not consistent hashing. The two merely look somewhat alike, which is why it gets its own mention here.

The name consistent hashing is somewhat misleading: the consistency here is not the data consistency between master and slave copies that we usually talk about.

It means that even when nodes are added to or removed from the cluster, the modulus of the hash does not change. The same key therefore hashes to the same value before and after, which keeps data migration to a minimum. How is this done?

For the hash value (hash(key) % d) to stay the same, there is only one way: keep the modulus d fixed. Consistent hashing uses a fixed modulus of 2^32, giving points 0 to 2^32 - 1. These points (slots) are mapped onto a ring, each corresponding to one hash value:

[Figure: hash ring]

In practice, of course, we have a limited number of service nodes, so the slots must be mapped to actual nodes. The rule is that each slot belongs to the nearest node found clockwise, which is its storage server, as shown below:

[Figure: slots mapped clockwise to service nodes on the ring]

That way, adding or removing a node only affects the data between two service nodes, and the other nodes are untouched. For example, suppose we remove service node 3:

[Figure: the ring after removing node 3]

As you can see, the data that sat between node 2 and node 3 has moved to node 4, while node 1 and node 2 are unaffected.

Next you may ask: hash partitioning can have hot keys, which leads to data skew. How is that solved?

Solving data skew comes down to spreading those hot keys across multiple service nodes, that is, breaking them up further.

Consistent hashing does this with virtual nodes: with many more virtual nodes on the ring, hot keys are spread more evenly over them, and the virtual nodes are then mapped to real nodes, so the data on the real nodes ends up more evenly distributed.
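For comparison with what Redis does, here is a minimal consistent-hash ring with virtual nodes. It is a sketch under assumptions: MD5 truncated to 32 bits is an arbitrary choice of hash, and `Ring`, `point` and the `node#i` labels for virtual nodes are hypothetical names.

```python
import bisect
import hashlib

def point(label: str) -> int:
    # Position on a ring of 2**32 points (first 4 bytes of MD5, an arbitrary choice)
    return int.from_bytes(hashlib.md5(label.encode()).digest()[:4], "big")

class Ring:
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self.points = []  # sorted list of (position, real node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each real node is placed on the ring many times as virtual nodes.
        for i in range(self.vnodes):
            bisect.insort(self.points, (point(f"{node}#{i}"), node))

    def remove(self, node):
        self.points = [p for p in self.points if p[1] != node]

    def owner(self, key):
        # First virtual node clockwise from the key's position, wrapping around.
        i = bisect.bisect_left(self.points, (point(key), ""))
        return self.points[i % len(self.points)][1]

ring = Ring(["node1", "node2", "node3", "node4"])
keys = [f"key{i}" for i in range(1000)]
before = {k: ring.owner(k) for k in keys}
ring.remove("node3")
after = {k: ring.owner(k) for k in keys}
moved = {k for k in keys if before[k] != after[k]}
```

Only the keys that lived on node3 change owner. With virtual nodes they are spread over the remaining nodes instead of all landing on one neighbor.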

2. Hash slots (shards):

Redis Cluster uses the CRC16 hash algorithm with a fixed modulus of 16384. These 16384 hash shards are also called hash slots, and they are distributed as evenly as possible across the service nodes.
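The key-to-slot mapping can be reproduced in a few lines of Python. This sketch relies on `binascii.crc_hqx`, which computes the CRC-CCITT polynomial 0x1021 and, with an initial value of 0, gives the XMODEM variant of CRC16 that Redis Cluster uses. Hash tags (the `{...}` part of a key) are deliberately ignored here, and `key_slot` is a hypothetical helper name.

```python
import binascii

CLUSTER_SLOTS = 16384

def crc16(data: bytes) -> int:
    # CRC16 XMODEM: polynomial 0x1021, initial value 0
    return binascii.crc_hqx(data, 0)

def key_slot(key: bytes) -> int:
    # Slot number = CRC16(key) mod 16384 (hash tags not handled in this sketch)
    return crc16(key) % CLUSTER_SLOTS

print(hex(crc16(b"123456789")))  # 0x31c3
print(key_slot(b"foo"))          # 12182
```

Since 16384 is a power of two, the modulo is equivalent to keeping the low 14 bits of the CRC.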

A fixed modulus: isn't that just like consistent hashing? Not so fast, let's read on.

Suppose we have three service nodes. After distributing the slots as evenly as possible, the assignment looks like this:

  • Node A holds hash slots 0 to 5500.
  • Node B holds hash slots 5501 to 11000.
  • Node C holds hash slots 11001 to 16383.

Nodes can be added or removed easily. When we add node D, part of the data on nodes A, B and C is migrated to node D; when we remove node A, its data is migrated to nodes B and C.

Notice the difference? Unlike consistent hashing, in Redis Cluster all nodes take part and only part of each node's data is redistributed, which keeps the data as even as possible across the cluster.

Note that data is migrated in units of hash slots, so all data in one slot moves to a single destination node.
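Here is a rough sketch of what adding node D means for the three-node layout above. The policy used, trimming every existing node down to an equal share and handing the surplus slots to the new node, is an assumption chosen for illustration and is not the algorithm of any particular Redis tool; `add_node` is a hypothetical name.

```python
def add_node(assignment: dict, new_node: str) -> dict:
    """assignment maps node -> list of slots. Returns a rebalanced copy."""
    total = sum(len(slots) for slots in assignment.values())
    target = total // (len(assignment) + 1)   # equal share per node
    result = {new_node: []}
    for node, slots in assignment.items():
        keep = min(len(slots), target)
        result[node] = slots[:keep]
        # Each surplus slot moves as a whole to exactly one destination node.
        result[new_node].extend(slots[keep:])
    return result

assignment = {
    "A": list(range(0, 5501)),        # slots 0 to 5500
    "B": list(range(5501, 11001)),    # slots 5501 to 11000
    "C": list(range(11001, 16384)),   # slots 11001 to 16383
}
rebalanced = add_node(assignment, "D")
sizes = {node: len(slots) for node, slots in sorted(rebalanced.items())}
print(sizes)  # {'A': 4096, 'B': 4096, 'C': 4096, 'D': 4096}
```

Every existing node gives up some slots, but no key changes its slot number: only the slot-to-node mapping changes.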

Let's look at a few diagrams. Suppose we have a cluster with three master nodes (slaves omitted):

[Figure: cluster with three master nodes]

When we add node 4, some hash slots are migrated from node 1, node 2 and node 3 to node 4:

[Figure: slots migrating to the new node 4]

Then, when we remove node 1, its data is migrated to node 2, node 3 and node 4, after which node 1 is removed:

[Figure: slots of node 1 redistributed before removal]

Now let's compare Redis Cluster hash slots (shards) with consistent hashing:

What they share is a fixed modulus. The modulus never changes, which means the set of hash slots is fixed. Whether the cluster grows or shrinks, most slots keep their data where it is, so no large-scale data migration is involved.

The differences are:

  • Modulus: consistent hashing takes values modulo 2^32 (0 to 2^32 - 1), while Redis Cluster uses 16384 (2^14).
  • Scope of data migration: when scaling, consistent hashing only affects the nodes adjacent to the change, while in Redis Cluster all nodes take part so that the data stays as even as possible.

In effect, data in a Redis Cluster is divided more evenly, so data skew is less likely.

Also worth noting: why did Redis Cluster choose 16384 as the modulus? The author gives two reasons:

  • First, when cluster nodes communicate, each message carries the sender's slot configuration as a bitmap, which is exactly 2 KB. A larger bitmap would make inter-node communication consume more bandwidth.
  • Second, a cluster will generally not exceed 1000 master nodes, so 16384 slots are enough.

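III. High availability

A familiar topic. First, the two basic prerequisites for high availability:

  • First, there must be slave nodes holding replicas
  • Second, there must be multiple nodes that take part in voting

Then, how high availability is implemented:

  • First, check the state of cluster nodes periodically
  • Then vote (reach consensus)
  • If the nodes agree that a node has failed, fail over

1. Failure detection

Every node in the cluster periodically sends PING messages to other nodes to check whether they are online.

If the node receiving a PING does not reply with a PONG within the specified time, the sender marks it as suspected offline (possible failure, PFAIL).

Nodes exchange the state of each node in the cluster by sending messages to one another, for example whether a node is online, suspected offline (PFAIL), or offline (FAIL).

When master A learns from a message that master B considers master C to be suspected offline, master A finds the clusterNode structure for master C in its clusterState.nodes dictionary and adds master B's failure report to the fail_reports list of that structure.

If more than half of the masters responsible for slots in a cluster report some master x as suspected offline, x is marked as offline (FAIL). The node that marks x as offline broadcasts a FAIL message about x to the cluster, and every node that receives this message immediately marks x as offline.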
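The promotion from PFAIL to FAIL can be modeled as a simple majority count over failure reports. The sketch below is a simplification under assumptions: `FailureDetector` is a hypothetical class, reports never expire, and the observing node's own opinion is treated like any other report.

```python
class FailureDetector:
    def __init__(self, voting_masters):
        self.voting_masters = set(voting_masters)  # masters responsible for slots
        self.fail_reports = {}                     # suspect -> set of reporters
        self.failed = set()

    def report_pfail(self, reporter, suspect):
        """Record that `reporter` considers `suspect` PFAIL; return True once FAIL."""
        if reporter in self.voting_masters:
            reports = self.fail_reports.setdefault(suspect, set())
            reports.add(reporter)  # a set, so repeated reports count only once
            # "More than half" of the voting masters must agree.
            if len(reports) > len(self.voting_masters) // 2:
                self.failed.add(suspect)
        return suspect in self.failed

detector = FailureDetector(["A", "B", "C", "D", "E"])
print(detector.report_pfail("A", "C"))  # False (1 of 5)
print(detector.report_pfail("A", "C"))  # False (duplicate report)
print(detector.report_pfail("B", "C"))  # False (2 of 5)
print(detector.report_pfail("D", "C"))  # True  (3 of 5 is a majority)
```

At the point where the method first returns True, the real node would broadcast the FAIL message so that the others do not have to collect a majority themselves.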

2. Electing a new master

Once master x is marked as offline, the whole cluster knows within a short time. At that point its slave nodes start campaigning to become the new master.

The cluster has a configuration epoch. It starts at 0 and is incremented by 1 on every failover, much like a version number in software development.

Within a given epoch, each master gets one vote, and it gives that vote to the first slave that asks for it.

Now the slaves start campaigning. The rules are:

  • When a slave finds that its master has been marked as offline, it broadcasts a CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST message to the cluster, asking every master that receives it and has voting rights to vote for this slave.
  • If a master has voting rights (it is responsible for slots) and has not yet voted for another slave, it replies to the requesting slave with a CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK message, indicating that it supports this slave becoming the new master.
  • Each slave taking part in the election receives CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK messages and counts them to determine how many masters support it.
  • If the cluster has N masters with voting rights, a slave that collects at least N / 2 + 1 votes is elected as the new master.
  • Because each master with voting rights can vote only once per configuration epoch, if N masters vote, only one slave can obtain at least N / 2 + 1 votes. This ensures there is only one new master.
  • If no slave collects enough votes within a configuration epoch, the cluster enters a new configuration epoch and holds another election, until a new master is elected.
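The voting rules above can be simulated in a few lines. This is a toy model, not the Redis implementation: `Master`, `request_vote` and `run_election` are hypothetical names, message passing is replaced by a list of (slave, master) requests in arrival order, and all N voting masters are assumed to be reachable.

```python
class Master:
    def __init__(self, name):
        self.name = name
        self.last_vote_epoch = -1   # epoch in which this master last voted

    def request_vote(self, epoch: int) -> bool:
        """Grant the vote only to the first requester in a given epoch."""
        if epoch > self.last_vote_epoch:
            self.last_vote_epoch = epoch
            return True
        return False

def run_election(masters, requests, epoch):
    """requests: (slave_name, master) pairs in the order they arrive."""
    votes = {}
    for slave, master in requests:
        if master.request_vote(epoch):
            votes[slave] = votes.get(slave, 0) + 1
    quorum = len(masters) // 2 + 1          # N / 2 + 1 with integer division
    winners = [s for s, n in votes.items() if n >= quorum]
    return winners[0] if winners else None  # at most one slave can reach quorum

m1, m2, m3 = Master("m1"), Master("m2"), Master("m3")
requests = [("s1", m1), ("s2", m3), ("s1", m2), ("s2", m1), ("s2", m2), ("s1", m3)]
winner = run_election([m1, m2, m3], requests, epoch=1)
print(winner)  # s1
```

Here s1 reaches m1 and m2 first and wins with 2 of 3 votes. If the votes split so that nobody reaches the quorum, the function returns None and the election would be repeated with a higher epoch.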

Note that the slave leader is elected with a method based on the leader election of the Raft algorithm, and the winning slave then carries out the failover.

As mentioned in a previous article, Sentinel elects its leader in a similar way, except that in a Sentinel group it is the Sentinels that elect a leader Sentinel, which then carries out the failover.

So now you see why Redis Cluster does not need a Sentinel group: the masters in the cluster have already taken over the Sentinels' responsibility (the right to vote).

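3. Failover

Once a slave has been elected as the new master, it carries out the actual failover. Let's see what that involves:

  • The elected slave executes SLAVEOF no one and becomes the new master.
  • The new master revokes all slot assignments of the offline master and assigns those slots to itself.
  • The new master broadcasts a PONG message to the cluster, which lets the other nodes know immediately that this node has changed from slave to master and has taken over the slots previously handled by the offline node.
  • The new master starts accepting command requests for the slots it is responsible for, and the failover is complete.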

4. Manual switchover

Besides automatic failover, the cluster also supports manual switchover. When a slave receives the CLUSTER FAILOVER command, it performs a manual switchover as follows:

First, the slave sends an mfstart packet to its master, telling it that a manual switchover is about to start.

Then the master blocks the execution of all client commands. After that, when the master sends ping packets from the periodic function clusterCron, it sets a special flag in the packet header.

When the slave receives a ping from the master and detects the special flag, it reads the master's replication offset from the packet header.

In its own periodic function clusterCron, the slave checks whether the offset it has processed equals the master's replication offset, and starts the switchover once they are equal.

Finally, once the switchover is complete, the old master redirects all blocked client commands to the new master with MOVED. As this process shows, a manual master-slave switchover loses no data and no commands; there is only a brief pause during the switch.


Summary

Cluster is the official distributed clustering solution. Compared with standalone Redis it offers more storage capacity and higher throughput. In other words, the ultimate solution for mass storage is to build a cluster with many machines.

The cluster is decentralized: every node in the cluster, and every client, stores (or rather caches) a copy of the same metadata.

Because the metadata can change, the cluster exchanges it using the Gossip protocol, which spreads information like a rumor, from one node to ten. The more nodes there are, the longer a full synchronization takes.

So this decentralized design is inherently limited in the number of cluster nodes it can handle. It is hard to support very large clusters (10,000+ or 100,000+ nodes), and the official recommendation is not to exceed 1000 nodes. Given that, splitting the business across separate Redis clusters is also a good option.

Next is scaling the cluster, which involves migrating hash slots. To reduce changes in how data hashes, a fixed modulus of 16384 is used.

That way a change in the number of nodes does not affect the hash value of any particular key, and we only need to migrate the hash slots of some nodes to rebalance the cluster, which also helps avoid data skew.

Finally, high availability. All masters in the cluster act as the final arbiters: they have the right to judge failures and to vote for the new master. No extra Sentinels are needed for failover (this work is in effect the same as what Sentinel does).

For high availability we first need enough replicas (slave nodes), then master nodes (with voting rights), together with a proper failure detection mechanism, a consensus election algorithm, and the failover procedure itself.





Copyright notice
This article was written by [tar]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/214/202208021119119067.html