当前位置:网站首页>Analysis of etcd core mechanism
Analysis of etcd core mechanism
2020-11-06 01:23:00 【itread01】
ETCD The whole mechanism
etcd It's a decentralized 、 reliable key-value Storage system , It is suitable for storing key data in distributed systems .
etcd Multiple nodes in the cluster pass through Raft The algorithm completes distributed consistency coordination , The algorithm selects a master node as leader, from leader Responsible for data synchronization and distribution . When leader After a fault occurs, the system will automatically re select another node as leader, And complete the data synchronization again .
etcd Clustering is mainly based on quorum Mechanism , namely : When more than half of the nodes in the cluster are available , Clustering can continue to provide services ,quorum The mechanism is widely used in distributed consistency algorithms , I don't go into detail here .
raft Information updates and etcd The call is based on a two-stage mechanism :
The first stage leader Record log (uncommited); Log copy to follower;follower Respond to , Successful operation , Response client ; The caller calls leader,leader Will kv Data is stored in a log , And use real-time algorithms raft Make a copy of
The second stage leader commit; notice follower; When copied to N+1 After nodes , Local submission , Back to the client , Finally leader Asynchronous notification follower Notice of completion
ETCD The core API analysis
etcd Provided api There are mainly kv Related to 、lease Related to and related to watch, Looking at the source code, we can see that :
kv Related interfaces :
type KV interface {
// Put puts a key-value pair into etcd.
// Note that key,value can be plain bytes array and string is
// an immutable representation of that bytes array.
// To get a string of bytes, do string([]byte{0x10, 0x20}).
Put(ctx context.Context, key, val string, opts ...OpOption) (*PutResponse, error)
// Get retrieves keys.
// By default, Get will return the value for "key", if any.
// When passed WithRange(end), Get will return the keys in the range [key, end).
// When passed WithFromKey(), Get returns keys greater than or equal to key.
// When passed WithRev(rev) with rev > 0, Get retrieves keys at the given revision;
// if the required revision is compacted, the request will fail with ErrCompacted .
// When passed WithLimit(limit), the number of returned keys is bounded by limit.
// When passed WithSort(), the keys will be sorted.
Get(ctx context.Context, key string, opts ...OpOption) (*GetResponse, error)
// Delete deletes a key, or optionally using WithRange(end), [key, end).
Delete(ctx context.Context, key string, opts ...OpOption) (*DeleteResponse, error)
// Compact compacts etcd KV history before the given rev.
Compact(ctx context.Context, rev int64, opts ...CompactOption) (*CompactResponse, error)
// Txn creates a transaction.
Txn(ctx context.Context) Txn
}
There are mainly Put、Get、Delete、Compact、Do and Txn Method ;Put Used to direct to etcd Write messages in the cluster , With key value Storage in the form of ;Get According to key View its corresponding stored in etcd Information in ;Delete By deleting key To delete etcd Information in ;Compact Method is used to compress etcd History of events stored in key value pairs , Avoid the unlimited and continuous growth of event history ;Txn Method to process multiple requests in a single transaction ,etcd The transaction mode is :
if compare
then op
else op
commit
lease Related interfaces :
type Lease interface {
// Grant creates a new lease.
Grant(ctx context.Context, ttl int64) (*LeaseGrantResponse, error)
// Revoke revokes the given lease.
Revoke(ctx context.Context, id LeaseID) (*LeaseRevokeResponse, error)
// TimeToLive retrieves the lease information of the given lease ID.
TimeToLive(ctx context.Context, id LeaseID, opts ...LeaseOption) (*LeaseTimeToLiveResponse, error)
// Leases retrieves all leases.
Leases(ctx context.Context) (*LeaseLeasesResponse, error)
// KeepAlive keeps the given lease alive forever. If the keepalive response
// posted to the channel is not consumed immediately, the lease client will
// continue sending keep alive requests to the etcd server at least every
// second until latest response is consumed.
//
// The returned "LeaseKeepAliveResponse" channel closes if underlying keep
// alive stream is interrupted in some way the client cannot handle itself;
// given context "ctx" is canceled or timed out. "LeaseKeepAliveResponse"
// from this closed channel is nil.
//
// If client keep alive loop halts with an unexpected error (e.g. "etcdserver:
// no leader") or canceled by the caller (e.g. context.Canceled), the error
// is returned. Otherwise, it retries.
//
// TODO(v4.0): post errors to last keep alive message before closing
// (see https://github.com/coreos/etcd/pull/7866)
KeepAlive(ctx context.Context, id LeaseID) (<-chan *LeaseKeepAliveResponse, error)
// KeepAliveOnce renews the lease once. The response corresponds to the
// first message from calling KeepAlive. If the response has a recoverable
// error, KeepAliveOnce will retry the RPC with a new keep alive message.
//
// In most of the cases, Keepalive should be used instead of KeepAliveOnce.
KeepAliveOnce(ctx context.Context, id LeaseID) (*LeaseKeepAliveResponse, error)
// Close releases all resources Lease keeps for efficient communication
// with the etcd server.
Close() error
}
lease Is a common concept in distributed systems , Used to represent a decentralized lease . Typically , In a decentralized system, it is necessary to detect whether a node is alive or not , We need a lease mechanism .
Grant Method is used to create a lease , When the server is given time to live Did not receive in time keepAlive When the lease expires ;Revoke Rescind a lease , All attached to the lease key Will expire and be deleted ;TimeToLive Get lease information ;KeepAlive By streaming from client to server keep alive Request and streaming from server to client keep alive Answer to maintain the lease ; To detect whether a program is alive in a distributed system , You can create a lease in the program , And periodically call... In the program KeepAlive Methods . If everything goes well , The lease of this node will be consistent , If this program crashes , In the end, the lease will automatically expire , stay etcd in , Allow multiple key Related to the same lease above , Can greatly reduce lease The cost of rearranging objects .
watch Related interfaces :
type Watcher interface {
// Watch watches on a key or prefix. The watched events will be returned
// through the returned channel. If revisions waiting to be sent over the
// watch are compacted, then the watch will be canceled by the server, the
// client will post a compacted error watch response, and the channel will close.
// If the context "ctx" is canceled or timed out, returned "WatchChan" is closed,
// and "WatchResponse" from this closed channel has zero events and nil "Err()".
// The context "ctx" MUST be canceled, as soon as watcher is no longer being used,
// to release the associated resources.
//
// If the context is "context.Background/TODO", returned "WatchChan" will
// not be closed and block until event is triggered, except when server
// returns a non-recoverable error (e.g. ErrCompacted).
// For example, when context passed with "WithRequireLeader" and the
// connected server has no leader (e.g. due to network partition),
// error "etcdserver: no leader" (ErrNoLeader) will be returned,
// and then "WatchChan" is closed with non-nil "Err()".
// In order to prevent a watch stream being stuck in a partitioned node,
// make sure to wrap context with "WithRequireLeader".
//
// Otherwise, as long as the context has not been canceled or timed out,
// watch will retry on other recoverable errors forever until reconnected.
//
// TODO: explicitly set context error in the last "WatchResponse" message and close channel?
// Currently, client contexts are overwritten with "valCtx" that never closes.
// TODO(v3.4): configure watch retry policy, limit maximum retry number
// (see https://github.com/etcd-io/etcd/issues/8980)
Watch(ctx context.Context, key string, opts ...OpOption) WatchChan
// RequestProgress requests a progress notify response be sent in all watch channels.
RequestProgress(ctx context.Context) error
// Close closes the watcher and cancels all watch requests.
Close() error
}
etcd Of Watch The mechanism can subscribe to... In real time etcd Incremental data updates in ,watch Support the specified individual key, You can also specify a key At the beginning of .Watch Observe what is going to happen or has happened , Both input and output are streams ; The input stream is used to create and cancel observations , Output stream transport events . An observation RPC It can be done in multiple at once key On the scale of observation , And to observe fluidization events for multiple , The entire event history can be viewed from the last compressed revision .
ETCD Data version mechanism
etcd There are mainly term Express leader The term of office of ,revision Represents the version of global data . When clustering happens Leader Switch ,term The value of will be +1, In node failure , perhaps Leader There is a problem with the node network , Or pull it up again after stopping the whole cluster , It's going to happen Leader Switch to ; When the data changes , Including the establishment of 、 modify 、 Delete , Its revision The corresponding city will be +1, Across in the clump Leader Between terms of office ,revision Will keep global monotonically increasing , Any modification in the cluster corresponds to a unique revision, So we can go through revision To support the data MVCC, It can also support data Watch.
For every one of them KeyValue Data node ,etcd Three versions are recorded in :
- The first version is called create_revision, yes KeyValue When it's set up revision;
- The second one is called mod_revision, When the data is manipulated revision;
- The third one version It's just a counter , On behalf of KeyValue How many times has it been modified .
In the same Leader During the term of office , All the modification operations , Its corresponding term Values are always equal , and revision It keeps monotonically increasing . When cluster is restarted , Corresponding to all modification operations term Add to the value 1 了 .
ETCD And MVCC Concurrency control
Speaking of mvcc Everyone is no stranger ,mysql Of innodb Use... In mvcc Realize high concurrency data access , Multi version processing of data , And through the visibility of the transaction, the transaction can see the version of the data it should see , Again , stay etcd Also used in mvcc Concurrency control .
etcd Support for the same Key Initiate multiple data changes , Each data revision corresponds to a version number .etcd The corresponding data of each modification is recorded , That's a key stay etcd There are multiple historical versions in . If you do not specify the version number when querying the data ,etcd Will return to Key The corresponding latest version , At the same time etcd It also supports specifying a version number to query historical data .
etcd Record every change , Use watch When subscribing to materials , Can support from any historical moment ( Appoint revision) Start building a watcher, On the client side with etcd Create a data pipeline between ,etcd Will push from specified revision All data changes started .etcd Provided watch Mechanism guarantees , The Key After the data of the project were revised later , Push to the client immediately through this data pipeline .
Analysis of its original code shows that :
type revision struct {
// main is the main revision of a set of changes that happen atomically.
main int64
// sub is the the sub revision of a change in a set of changes that happen
// atomically. Each change has different increasing sub revision in that
// set.
sub int64
}
func (a revision) GreaterThan(b revision) bool {
if a.main > b.main {
return true
}
if a.main < b.main {
return false
}
return a.sub > b.sub
}
stay etcd Of mvcc There is one in the implementation revision Structure ,main Represents the transaction of the current operation id, Global self increasing logical time stamp ,sub Represents the child of the current operation within the transaction id, It's self increasing in business , From 0 Start ; Through GreaterThan Method to compare transaction versions .
ETCD Storage data structure
etcd All of the data in is stored in a btree In the data structure of , The btree Stored on disk , And through mmap The method of mapping to memory is used to support fast access ,treeIndex The definition of is as follows :
type treeIndex struct {
sync.RWMutex
tree *btree.BTree
}
func newTreeIndex() index {
return &treeIndex{
tree: btree.New(32),
}
}
index In pairs btree There are Put、Get、Revision、Range And Visit etc. , With Put For example , The original code is as follows :
func (ti *treeIndex) Put(key []byte, rev revision) {
keyi := &keyIndex{key: key}
ti.Lock()
defer ti.Unlock()
item := ti.tree.Get(keyi)
if item == nil {
keyi.put(rev.main, rev.sub)
ti.tree.ReplaceOrInsert(keyi)
return
}
okeyi := item.(*keyIndex)
okeyi.put(rev.main, rev.sub)
}
It can be seen from the source code that btree The reading and writing operations of data are completed under lock , So as to ensure the consistency of the data under the concurrency
版权声明
本文为[itread01]所创,转载请带上原文链接,感谢
边栏推荐
- 至联云分享:IPFS/Filecoin值不值得投资?
- Let the front-end siege division develop independently from the back-end: Mock.js
- Skywalking series blog 1 - install stand-alone skywalking
- 2019年的一个小目标,成为csdn的博客专家,纪念一下
- 快快使用ModelArts,零基礎小白也能玩轉AI!
- Leetcode's ransom letter
- 6.2 handleradapter adapter processor (in-depth analysis of SSM and project practice)
- H5 makes its own video player (JS Part 2)
- 做外包真的很难,身为外包的我也无奈叹息。
- keras model.compile Loss function and optimizer
猜你喜欢

华为云“四个可靠”的方法论

vue-codemirror基本用法:实现搜索功能、代码折叠功能、获取编辑器值及时验证

Filecoin主网上线以来Filecoin矿机扇区密封到底是什么意思

一篇文章带你了解CSS3圆角知识

做外包真的很难,身为外包的我也无奈叹息。

Python download module to accelerate the implementation of recording

至联云分享:IPFS/Filecoin值不值得投资?

CCR炒币机器人:“比特币”数字货币的大佬,你不得不了解的知识

快快使用ModelArts,零基础小白也能玩转AI!

Flink的DataSource三部曲之二:内置connector
随机推荐
5.5 controlleradvice notes - SSM in depth analysis and project practice
hadoop 命令总结
6.3 handlerexceptionresolver exception handling (in-depth analysis of SSM and project practice)
Skywalking series blog 5-apm-customize-enhance-plugin
OPTIMIZER_ Trace details
Tool class under JUC package, its name is locksupport! Did you make it?
htmlcss
Deep understanding of common methods of JS array
嘗試從零開始構建我的商城 (二) :使用JWT保護我們的資訊保安,完善Swagger配置
The choice of enterprise database is usually decided by the system architect - the newstack
I think it is necessary to write a general idempotent component
Keyboard entry lottery random draw
Brief introduction and advantages and disadvantages of deepwalk model
Skywalking series blog 2-skywalking using
Flink的DataSource三部曲之二:内置connector
Didi elasticsearch cluster cross version upgrade and platform reconfiguration
Every day we say we need to do performance optimization. What are we optimizing?
Programmer introspection checklist
vue-codemirror基本用法:实现搜索功能、代码折叠功能、获取编辑器值及时验证
Filecoin的经济模型与未来价值是如何支撑FIL币价格破千的