当前位置:网站首页>Analysis of etcd core mechanism
Analysis of etcd core mechanism
2020-11-06 01:23:00 【itread01】
ETCD The whole mechanism
etcd It's a decentralized 、 reliable key-value Storage system , It is suitable for storing key data in distributed systems .
etcd Multiple nodes in the cluster pass through Raft The algorithm completes distributed consistency coordination , The algorithm selects a master node as leader, from leader Responsible for data synchronization and distribution . When leader After a fault occurs, the system will automatically re select another node as leader, And complete the data synchronization again .
etcd Clustering is mainly based on quorum Mechanism , namely : When more than half of the nodes in the cluster are available , Clustering can continue to provide services ,quorum The mechanism is widely used in distributed consistency algorithms , I don't go into detail here .
raft Information updates and etcd The call is based on a two-stage mechanism :
The first stage leader Record log (uncommited); Log copy to follower;follower Respond to , Successful operation , Response client ; The caller calls leader,leader Will kv Data is stored in a log , And use real-time algorithms raft Make a copy of
The second stage leader commit; notice follower; When copied to N+1 After nodes , Local submission , Back to the client , Finally leader Asynchronous notification follower Notice of completion
ETCD The core API analysis
etcd Provided api There are mainly kv Related to 、lease Related to and related to watch, Looking at the source code, we can see that :
kv Related interfaces :
type KV interface { // Put puts a key-value pair into etcd. // Note that key,value can be plain bytes array and string is // an immutable representation of that bytes array. // To get a string of bytes, do string([]byte{0x10, 0x20}). Put(ctx context.Context, key, val string, opts ...OpOption) (*PutResponse, error) // Get retrieves keys. // By default, Get will return the value for "key", if any. // When passed WithRange(end), Get will return the keys in the range [key, end). // When passed WithFromKey(), Get returns keys greater than or equal to key. // When passed WithRev(rev) with rev > 0, Get retrieves keys at the given revision; // if the required revision is compacted, the request will fail with ErrCompacted . // When passed WithLimit(limit), the number of returned keys is bounded by limit. // When passed WithSort(), the keys will be sorted. Get(ctx context.Context, key string, opts ...OpOption) (*GetResponse, error) // Delete deletes a key, or optionally using WithRange(end), [key, end). Delete(ctx context.Context, key string, opts ...OpOption) (*DeleteResponse, error) // Compact compacts etcd KV history before the given rev. Compact(ctx context.Context, rev int64, opts ...CompactOption) (*CompactResponse, error) // Txn creates a transaction. Txn(ctx context.Context) Txn }
There are mainly Put、Get、Delete、Compact、Do and Txn Method ;Put Used to direct to etcd Write messages in the cluster , With key value Storage in the form of ;Get According to key View its corresponding stored in etcd Information in ;Delete By deleting key To delete etcd Information in ;Compact Method is used to compress etcd History of events stored in key value pairs , Avoid the unlimited and continuous growth of event history ;Txn Method to process multiple requests in a single transaction ,etcd The transaction mode is :
if compare
then op
else op
commit
lease Related interfaces :
type Lease interface { // Grant creates a new lease. Grant(ctx context.Context, ttl int64) (*LeaseGrantResponse, error) // Revoke revokes the given lease. Revoke(ctx context.Context, id LeaseID) (*LeaseRevokeResponse, error) // TimeToLive retrieves the lease information of the given lease ID. TimeToLive(ctx context.Context, id LeaseID, opts ...LeaseOption) (*LeaseTimeToLiveResponse, error) // Leases retrieves all leases. Leases(ctx context.Context) (*LeaseLeasesResponse, error) // KeepAlive keeps the given lease alive forever. If the keepalive response // posted to the channel is not consumed immediately, the lease client will // continue sending keep alive requests to the etcd server at least every // second until latest response is consumed. // // The returned "LeaseKeepAliveResponse" channel closes if underlying keep // alive stream is interrupted in some way the client cannot handle itself; // given context "ctx" is canceled or timed out. "LeaseKeepAliveResponse" // from this closed channel is nil. // // If client keep alive loop halts with an unexpected error (e.g. "etcdserver: // no leader") or canceled by the caller (e.g. context.Canceled), the error // is returned. Otherwise, it retries. // // TODO(v4.0): post errors to last keep alive message before closing // (see https://github.com/coreos/etcd/pull/7866) KeepAlive(ctx context.Context, id LeaseID) (<-chan *LeaseKeepAliveResponse, error) // KeepAliveOnce renews the lease once. The response corresponds to the // first message from calling KeepAlive. If the response has a recoverable // error, KeepAliveOnce will retry the RPC with a new keep alive message. // // In most of the cases, Keepalive should be used instead of KeepAliveOnce. KeepAliveOnce(ctx context.Context, id LeaseID) (*LeaseKeepAliveResponse, error) // Close releases all resources Lease keeps for efficient communication // with the etcd server. Close() error }
lease Is a common concept in distributed systems , Used to represent a decentralized lease . Typically , In a decentralized system, it is necessary to detect whether a node is alive or not , We need a lease mechanism .
Grant Method is used to create a lease , When the server is given time to live Did not receive in time keepAlive When the lease expires ;Revoke Rescind a lease , All attached to the lease key Will expire and be deleted ;TimeToLive Get lease information ;KeepAlive By streaming from client to server keep alive Request and streaming from server to client keep alive Answer to maintain the lease ; To detect whether a program is alive in a distributed system , You can create a lease in the program , And periodically call... In the program KeepAlive Methods . If everything goes well , The lease of this node will be consistent , If this program crashes , In the end, the lease will automatically expire , stay etcd in , Allow multiple key Related to the same lease above , Can greatly reduce lease The cost of rearranging objects .
watch Related interfaces :
type Watcher interface { // Watch watches on a key or prefix. The watched events will be returned // through the returned channel. If revisions waiting to be sent over the // watch are compacted, then the watch will be canceled by the server, the // client will post a compacted error watch response, and the channel will close. // If the context "ctx" is canceled or timed out, returned "WatchChan" is closed, // and "WatchResponse" from this closed channel has zero events and nil "Err()". // The context "ctx" MUST be canceled, as soon as watcher is no longer being used, // to release the associated resources. // // If the context is "context.Background/TODO", returned "WatchChan" will // not be closed and block until event is triggered, except when server // returns a non-recoverable error (e.g. ErrCompacted). // For example, when context passed with "WithRequireLeader" and the // connected server has no leader (e.g. due to network partition), // error "etcdserver: no leader" (ErrNoLeader) will be returned, // and then "WatchChan" is closed with non-nil "Err()". // In order to prevent a watch stream being stuck in a partitioned node, // make sure to wrap context with "WithRequireLeader". // // Otherwise, as long as the context has not been canceled or timed out, // watch will retry on other recoverable errors forever until reconnected. // // TODO: explicitly set context error in the last "WatchResponse" message and close channel? // Currently, client contexts are overwritten with "valCtx" that never closes. // TODO(v3.4): configure watch retry policy, limit maximum retry number // (see https://github.com/etcd-io/etcd/issues/8980) Watch(ctx context.Context, key string, opts ...OpOption) WatchChan // RequestProgress requests a progress notify response be sent in all watch channels. RequestProgress(ctx context.Context) error // Close closes the watcher and cancels all watch requests. Close() error }
etcd Of Watch The mechanism can subscribe to... In real time etcd Incremental data updates in ,watch Support the specified individual key, You can also specify a key At the beginning of .Watch Observe what is going to happen or has happened , Both input and output are streams ; The input stream is used to create and cancel observations , Output stream transport events . An observation RPC It can be done in multiple at once key On the scale of observation , And to observe fluidization events for multiple , The entire event history can be viewed from the last compressed revision .
ETCD Data version mechanism
etcd There are mainly term Express leader The term of office of ,revision Represents the version of global data . When clustering happens Leader Switch ,term The value of will be +1, In node failure , perhaps Leader There is a problem with the node network , Or pull it up again after stopping the whole cluster , It's going to happen Leader Switch to ; When the data changes , Including the establishment of 、 modify 、 Delete , Its revision The corresponding city will be +1, Across in the clump Leader Between terms of office ,revision Will keep global monotonically increasing , Any modification in the cluster corresponds to a unique revision, So we can go through revision To support the data MVCC, It can also support data Watch.
For every one of them KeyValue Data node ,etcd Three versions are recorded in :
- The first version is called create_revision, yes KeyValue When it's set up revision;
- The second one is called mod_revision, When the data is manipulated revision;
- The third one version It's just a counter , On behalf of KeyValue How many times has it been modified .
In the same Leader During the term of office , All the modification operations , Its corresponding term Values are always equal , and revision It keeps monotonically increasing . When cluster is restarted , Corresponding to all modification operations term Add to the value 1 了 .
ETCD And MVCC Concurrency control
Speaking of mvcc Everyone is no stranger ,mysql Of innodb Use... In mvcc Realize high concurrency data access , Multi version processing of data , And through the visibility of the transaction, the transaction can see the version of the data it should see , Again , stay etcd Also used in mvcc Concurrency control .
etcd Support for the same Key Initiate multiple data changes , Each data revision corresponds to a version number .etcd The corresponding data of each modification is recorded , That's a key stay etcd There are multiple historical versions in . If you do not specify the version number when querying the data ,etcd Will return to Key The corresponding latest version , At the same time etcd It also supports specifying a version number to query historical data .
etcd Record every change , Use watch When subscribing to materials , Can support from any historical moment ( Appoint revision) Start building a watcher, On the client side with etcd Create a data pipeline between ,etcd Will push from specified revision All data changes started .etcd Provided watch Mechanism guarantees , The Key After the data of the project were revised later , Push to the client immediately through this data pipeline .
Analysis of its original code shows that :
type revision struct { // main is the main revision of a set of changes that happen atomically. main int64 // sub is the the sub revision of a change in a set of changes that happen // atomically. Each change has different increasing sub revision in that // set. sub int64 } func (a revision) GreaterThan(b revision) bool { if a.main > b.main { return true } if a.main < b.main { return false } return a.sub > b.sub }
stay etcd Of mvcc There is one in the implementation revision Structure ,main Represents the transaction of the current operation id, Global self increasing logical time stamp ,sub Represents the child of the current operation within the transaction id, It's self increasing in business , From 0 Start ; Through GreaterThan Method to compare transaction versions .
ETCD Storage data structure
etcd All of the data in is stored in a btree In the data structure of , The btree Stored on disk , And through mmap The method of mapping to memory is used to support fast access ,treeIndex The definition of is as follows :
type treeIndex struct { sync.RWMutex tree *btree.BTree } func newTreeIndex() index { return &treeIndex{ tree: btree.New(32), } }
index In pairs btree There are Put、Get、Revision、Range And Visit etc. , With Put For example , The original code is as follows :
func (ti *treeIndex) Put(key []byte, rev revision) { keyi := &keyIndex{key: key} ti.Lock() defer ti.Unlock() item := ti.tree.Get(keyi) if item == nil { keyi.put(rev.main, rev.sub) ti.tree.ReplaceOrInsert(keyi) return } okeyi := item.(*keyIndex) okeyi.put(rev.main, rev.sub) }
It can be seen from the source code that btree The reading and writing operations of data are completed under lock , So as to ensure the consistency of the data under the concurrency
版权声明
本文为[itread01]所创,转载请带上原文链接,感谢
边栏推荐
- Leetcode's ransom letter
- 关于Kubernetes 与 OAM 构建统一、标准化的应用管理平台知识!(附网盘链接)
- Network security engineer Demo: the original * * is to get your computer administrator rights! 【***】
- 快快使用ModelArts,零基础小白也能玩转AI!
- Face to face Manual Chapter 16: explanation and implementation of fair lock of code peasant association lock and reentrantlock
- 阿里云Q2营收破纪录背后,云的打开方式正在重塑
- Flink的DataSource三部曲之二:内置connector
- PN8162 20W PD快充芯片,PD快充充电器方案
- 6.3 handlerexceptionresolver exception handling (in-depth analysis of SSM and project practice)
- I'm afraid that the spread sequence calculation of arbitrage strategy is not as simple as you think
猜你喜欢
采购供应商系统是什么?采购供应商管理平台解决方案
Grouping operation aligned with specified datum
2019年的一个小目标,成为csdn的博客专家,纪念一下
前端都应懂的入门基础-github基础
EOS创始人BM: UE,UBI,URI有什么区别?
在大规模 Kubernetes 集群上实现高 SLO 的方法
Subordination judgment in structured data
Network security engineer Demo: the original * * is to get your computer administrator rights! 【***】
Computer TCP / IP interview 10 even asked, how many can you withstand?
PHPSHE 短信插件说明
随机推荐
Filecoin主网上线以来Filecoin矿机扇区密封到底是什么意思
Network security engineer Demo: the original * * is to get your computer administrator rights! 【***】
Want to do read-write separation, give you some small experience
Using consult to realize service discovery: instance ID customization
Serilog原始碼解析——使用方法
关于Kubernetes 与 OAM 构建统一、标准化的应用管理平台知识!(附网盘链接)
Vuejs development specification
基於MVC的RESTFul風格API實戰
I've been rejected by the product manager. Why don't you know
Architecture article collection
Natural language processing - wrong word recognition (based on Python) kenlm, pycorrector
TRON智能钱包PHP开发包【零TRX归集】
After reading this article, I understand a lot of webpack scaffolding
一篇文章带你了解CSS 分页实例
Working principle of gradient descent algorithm in machine learning
git rebase的時候捅婁子了,怎麼辦?線上等……
Tool class under JUC package, its name is locksupport! Did you make it?
Let the front-end siege division develop independently from the back-end: Mock.js
Python3 e-learning case 4: writing web proxy
Flink的DataSource三部曲之二:内置connector