
Analysis of etcd core mechanism


etcd Overall Mechanism

etcd is a distributed, reliable key-value storage system, suitable for storing critical data in a distributed system.

The nodes of an etcd cluster reach distributed consensus through the Raft algorithm. The algorithm elects one node as the leader, and the leader is responsible for synchronizing and distributing data. When the leader fails, the system automatically elects another node as the leader and synchronizes the data again.

The etcd cluster relies mainly on the quorum mechanism: as long as more than half of the nodes in the cluster are available, the cluster can keep providing service; for example, a 5-node cluster remains available as long as any 3 nodes are up. The quorum mechanism is widely used in distributed consensus algorithms and is not covered in detail here.

Raft log updates and etcd calls are based on a two-phase mechanism:

Phase one: the leader records the log entry (uncommitted) and replicates the log to the followers, and the followers acknowledge it. In other words, the caller sends the request to the leader, and the leader stores the kv data as a log entry and replicates it to the followers with the raft algorithm.

Phase two: the leader commits and notifies the followers. Once the entry has been replicated to a quorum (more than half) of the nodes, the leader commits locally and responds to the client; finally, the leader asynchronously notifies the followers that the entry is committed.
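
The following is a highly simplified, self-contained Go sketch of this two-phase flow. It is not the actual raft or etcd implementation; the entry, node and quorum names here are hypothetical and only illustrate the ordering of the two phases:

package main

import "fmt"

// entry is a hypothetical log entry: written uncommitted in phase one,
// marked committed in phase two.
type entry struct {
	key, val  string
	committed bool
}

// node is a hypothetical cluster member holding a log.
type node struct {
	name string
	log  []entry
}

func (n *node) appendEntry(e entry) { n.log = append(n.log, e) }

func main() {
	leader := &node{name: "leader"}
	followers := []*node{{name: "f1"}, {name: "f2"}}
	quorum := (1+len(followers))/2 + 1

	// Phase one: the leader records the entry uncommitted and replicates it;
	// each follower acknowledges the replication.
	e := entry{key: "foo", val: "bar"}
	leader.appendEntry(e)
	acks := 1 // the leader counts itself
	for _, f := range followers {
		f.appendEntry(e)
		acks++
	}

	// Phase two: once a quorum holds the entry, the leader commits locally,
	// answers the client, and asynchronously tells the followers to commit.
	if acks >= quorum {
		leader.log[len(leader.log)-1].committed = true
		fmt.Println("leader committed, client answered")
		for _, f := range followers {
			f.log[len(f.log)-1].committed = true
		}
	}
}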

 

etcd Core API Analysis

The APIs provided by etcd are mainly KV-related, lease-related and watch-related. This can be seen by looking at the client source code.

The KV-related interface:

type KV interface {
	// Put puts a key-value pair into etcd.
	// Note that key,value can be plain bytes array and string is
	// an immutable representation of that bytes array.
	// To get a string of bytes, do string([]byte{0x10, 0x20}).
	Put(ctx context.Context, key, val string, opts ...OpOption) (*PutResponse, error)

	// Get retrieves keys.
	// By default, Get will return the value for "key", if any.
	// When passed WithRange(end), Get will return the keys in the range [key, end).
	// When passed WithFromKey(), Get returns keys greater than or equal to key.
	// When passed WithRev(rev) with rev > 0, Get retrieves keys at the given revision;
	// if the required revision is compacted, the request will fail with ErrCompacted .
	// When passed WithLimit(limit), the number of returned keys is bounded by limit.
	// When passed WithSort(), the keys will be sorted.
	Get(ctx context.Context, key string, opts ...OpOption) (*GetResponse, error)

	// Delete deletes a key, or optionally using WithRange(end), [key, end).
	Delete(ctx context.Context, key string, opts ...OpOption) (*DeleteResponse, error)

	// Compact compacts etcd KV history before the given rev.
	Compact(ctx context.Context, rev int64, opts ...CompactOption) (*CompactResponse, error)

	// Txn creates a transaction.
	Txn(ctx context.Context) Txn
}

The main methods are Put, Get, Delete, Compact, Do and Txn. Put writes data into the etcd cluster, stored in key-value form; Get looks up the information stored in etcd for a given key; Delete removes information from etcd by deleting the key; Compact compacts the event history of etcd's key-value store, so that the event history does not grow without bound; Txn processes multiple requests in a single transaction. etcd's transaction pattern is:

if compare

then op

else op

commit
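
As a reference, here is a minimal sketch of using this KV interface through the Go client, assuming a local etcd listening on 127.0.0.1:2379; the key names are examples, and the import path may differ between client versions:

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	// Put stores a key-value pair, Get reads it back.
	if _, err := cli.Put(ctx, "/demo/key", "v1"); err != nil {
		panic(err)
	}
	getResp, err := cli.Get(ctx, "/demo/key")
	if err != nil {
		panic(err)
	}
	for _, kv := range getResp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}

	// Txn follows the "if compare / then op / else op / commit" pattern above.
	txnResp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.Value("/demo/key"), "=", "v1")).
		Then(clientv3.OpPut("/demo/key", "v2")).
		Else(clientv3.OpGet("/demo/key")).
		Commit()
	if err != nil {
		panic(err)
	}
	fmt.Println("txn succeeded:", txnResp.Succeeded)
}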

The lease-related interface:

type Lease interface {
	// Grant creates a new lease.
	Grant(ctx context.Context, ttl int64) (*LeaseGrantResponse, error)

	// Revoke revokes the given lease.
	Revoke(ctx context.Context, id LeaseID) (*LeaseRevokeResponse, error)

	// TimeToLive retrieves the lease information of the given lease ID.
	TimeToLive(ctx context.Context, id LeaseID, opts ...LeaseOption) (*LeaseTimeToLiveResponse, error)

	// Leases retrieves all leases.
	Leases(ctx context.Context) (*LeaseLeasesResponse, error)

	// KeepAlive keeps the given lease alive forever. If the keepalive response
	// posted to the channel is not consumed immediately, the lease client will
	// continue sending keep alive requests to the etcd server at least every
	// second until latest response is consumed.
	//
	// The returned "LeaseKeepAliveResponse" channel closes if underlying keep
	// alive stream is interrupted in some way the client cannot handle itself;
	// given context "ctx" is canceled or timed out. "LeaseKeepAliveResponse"
	// from this closed channel is nil.
	//
	// If client keep alive loop halts with an unexpected error (e.g. "etcdserver:
	// no leader") or canceled by the caller (e.g. context.Canceled), the error
	// is returned. Otherwise, it retries.
	//
	// TODO(v4.0): post errors to last keep alive message before closing
	// (see https://github.com/coreos/etcd/pull/7866)
	KeepAlive(ctx context.Context, id LeaseID) (<-chan *LeaseKeepAliveResponse, error)

	// KeepAliveOnce renews the lease once. The response corresponds to the
	// first message from calling KeepAlive. If the response has a recoverable
	// error, KeepAliveOnce will retry the RPC with a new keep alive message.
	//
	// In most of the cases, Keepalive should be used instead of KeepAliveOnce.
	KeepAliveOnce(ctx context.Context, id LeaseID) (*LeaseKeepAliveResponse, error)

	// Close releases all resources Lease keeps for efficient communication
	// with the etcd server.
	Close() error
}

A lease is a common concept in distributed systems, representing a distributed lease. Typically, a lease mechanism is needed in a distributed system to detect whether a node is still alive.

Grant creates a lease; if the server does not receive a keepAlive within the given time to live, the lease expires. Revoke revokes a lease, and all keys attached to that lease expire and are deleted. TimeToLive retrieves lease information. KeepAlive maintains the lease through a stream of keep-alive requests from client to server and keep-alive responses from server to client. To detect whether a program is alive in a distributed system, you can create a lease in the program and periodically call KeepAlive. If everything goes well, the node's lease is renewed continuously; if the program crashes, the lease eventually expires automatically. In etcd, multiple keys may be attached to the same lease, which greatly reduces the cost of refreshing lease objects.
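
A minimal sketch of this lease workflow with the Go client, assuming a local etcd endpoint; the key name and TTL are examples only:

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Grant a lease with a 10-second TTL and attach a key to it.
	lease, err := cli.Grant(ctx, 10)
	if err != nil {
		panic(err)
	}
	if _, err := cli.Put(ctx, "/services/node1", "alive", clientv3.WithLease(lease.ID)); err != nil {
		panic(err)
	}

	// KeepAlive keeps renewing the lease; if this process crashes, the lease
	// eventually expires and the attached key is deleted automatically.
	ch, err := cli.KeepAlive(ctx, lease.ID)
	if err != nil {
		panic(err)
	}
	for ka := range ch {
		fmt.Println("lease renewed, remaining ttl:", ka.TTL)
	}
}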

The watch-related interface:

type Watcher interface {
	// Watch watches on a key or prefix. The watched events will be returned
	// through the returned channel. If revisions waiting to be sent over the
	// watch are compacted, then the watch will be canceled by the server, the
	// client will post a compacted error watch response, and the channel will close.
	// If the context "ctx" is canceled or timed out, returned "WatchChan" is closed,
	// and "WatchResponse" from this closed channel has zero events and nil "Err()".
	// The context "ctx" MUST be canceled, as soon as watcher is no longer being used,
	// to release the associated resources.
	//
	// If the context is "context.Background/TODO", returned "WatchChan" will
	// not be closed and block until event is triggered, except when server
	// returns a non-recoverable error (e.g. ErrCompacted).
	// For example, when context passed with "WithRequireLeader" and the
	// connected server has no leader (e.g. due to network partition),
	// error "etcdserver: no leader" (ErrNoLeader) will be returned,
	// and then "WatchChan" is closed with non-nil "Err()".
	// In order to prevent a watch stream being stuck in a partitioned node,
	// make sure to wrap context with "WithRequireLeader".
	//
	// Otherwise, as long as the context has not been canceled or timed out,
	// watch will retry on other recoverable errors forever until reconnected.
	//
	// TODO: explicitly set context error in the last "WatchResponse" message and close channel?
	// Currently, client contexts are overwritten with "valCtx" that never closes.
	// TODO(v3.4): configure watch retry policy, limit maximum retry number
	// (see https://github.com/etcd-io/etcd/issues/8980)
	Watch(ctx context.Context, key string, opts ...OpOption) WatchChan

	// RequestProgress requests a progress notify response be sent in all watch channels.
	RequestProgress(ctx context.Context) error

	// Close closes the watcher and cancels all watch requests.
	Close() error
}

etcd's Watch mechanism allows clients to subscribe in real time to incremental data updates in etcd. A watch can target a single specified key, or all keys sharing a specified prefix. Watch observes events that are about to happen or have already happened; both its input and output are streams: the input stream is used to create and cancel watches, and the output stream delivers the events. A single watch RPC can watch multiple key ranges at once and stream events for all of them, and the entire event history can be watched starting from the last compaction revision.
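
A minimal sketch of subscribing to incremental updates with this Watcher interface, again assuming a local etcd endpoint; the prefix /demo/ is an example:

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Cancel the context when the watcher is no longer needed, so that the
	// associated resources are released.
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Watch every key under the prefix; each WatchResponse carries a batch of events.
	for resp := range cli.Watch(ctx, "/demo/", clientv3.WithPrefix()) {
		for _, ev := range resp.Events {
			fmt.Printf("%s %s = %s\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
		}
	}
}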

 

etcd Data Version Mechanism

etcd mainly has two notions of version: term, which represents the leader's term of office, and revision, which represents the version of the global data. When a leader switch happens in the cluster, the term value is incremented by 1; a leader switch occurs when a node fails, when the leader node has network problems, or when the whole cluster is stopped and started again. Whenever data changes, including create, modify and delete operations, the revision is also incremented by 1. Across leader terms in the cluster, revision keeps increasing monotonically and globally, and every modification in the cluster corresponds to a unique revision. Therefore revision can be used to implement MVCC over the data, and it also supports data Watch.

For each KeyValue data node, etcd records three versions:

  • The first version is create_revision, the revision at which the KeyValue was created;
  • The second is mod_revision, the revision at which the data was last modified;
  • The third, version, is simply a counter that records how many times the KeyValue has been modified.

Within the same leader term, all modification operations carry the same term value, while revision keeps increasing monotonically. After the cluster is restarted, the term value of all subsequent modification operations is increased by 1.
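
These three versions can be observed directly from a Get response. A minimal sketch, again assuming a local etcd endpoint and an example key:

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx := context.Background()
	cli.Put(ctx, "/demo/key", "v1")
	cli.Put(ctx, "/demo/key", "v2") // a second modification of the same key

	resp, err := cli.Get(ctx, "/demo/key")
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Println("create_revision:", kv.CreateRevision) // revision when the key was created
		fmt.Println("mod_revision:", kv.ModRevision)       // revision of the last modification
		fmt.Println("version:", kv.Version)                // number of modifications since creation
	}
	// The response header carries the current cluster-wide revision.
	fmt.Println("cluster revision:", resp.Header.Revision)
}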

 

etcd and MVCC Concurrency Control

MVCC should be familiar to everyone: MySQL's InnoDB uses MVCC to achieve highly concurrent data access, keeping multiple versions of the data and using transaction visibility rules so that each transaction sees exactly the version of the data it should see. Similarly, etcd also uses MVCC for concurrency control.

etcd supports issuing multiple data changes against the same key, and each change is assigned a version number (revision). etcd keeps the data corresponding to every modification, which means a key has multiple historical versions in etcd. If no version number is specified when querying, etcd returns the latest version of the key; etcd also supports querying historical data at a specified version number.

etcd records every change. When subscribing to data with watch, a watcher can be created starting from any historical point (a specified revision), which establishes a data pipeline between the client and etcd; etcd then pushes all data changes starting from the specified revision. The watch mechanism provided by etcd guarantees that whenever that key's data is modified afterwards, the change is immediately pushed to the client through this data pipeline.
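
A minimal sketch of both capabilities, reading a historical version with WithRev and replaying changes from a past revision with a watcher; the endpoint and key names are examples only:

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx := context.Background()
	put1, err := cli.Put(ctx, "/demo/key", "v1")
	if err != nil {
		panic(err)
	}
	cli.Put(ctx, "/demo/key", "v2")

	// Read the key as it was at the revision of the first Put.
	old, err := cli.Get(ctx, "/demo/key", clientv3.WithRev(put1.Header.Revision))
	if err != nil {
		panic(err)
	}
	fmt.Println("historical value:", string(old.Kvs[0].Value)) // prints v1

	// Watch from that past revision: etcd replays every change made since then
	// and keeps pushing new changes through the same pipeline.
	wctx, cancel := context.WithTimeout(ctx, 3*time.Second)
	defer cancel()
	for resp := range cli.Watch(wctx, "/demo/key", clientv3.WithRev(put1.Header.Revision)) {
		for _, ev := range resp.Events {
			fmt.Printf("rev %d: %s = %s\n", ev.Kv.ModRevision, ev.Kv.Key, ev.Kv.Value)
		}
	}
}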

The source code shows:

type revision struct {
	// main is the main revision of a set of changes that happen atomically.
	main int64

	// sub is the sub revision of a change in a set of changes that happen
	// atomically. Each change has different increasing sub revision in that
	// set.
	sub int64
}

func (a revision) GreaterThan(b revision) bool {
	if a.main > b.main {
		return true
	}
	if a.main < b.main {
		return false
	}
	return a.sub > b.sub
}

In etcd's MVCC implementation there is a revision struct: main is the transaction id of the current operation, a globally increasing logical timestamp; sub is the sub-id of the current operation within that transaction, increasing from 0 inside the transaction. The GreaterThan method compares transaction versions.
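
A tiny illustration of this ordering, assuming the revision struct and GreaterThan method above are copied into a local package (in etcd they live inside the mvcc package and are not importable):

package main

import "fmt"

func main() {
	a := revision{main: 5, sub: 0} // first change in transaction 5
	b := revision{main: 5, sub: 1} // second change in the same transaction
	c := revision{main: 6, sub: 0} // a later transaction

	fmt.Println(b.GreaterThan(a)) // true: same main, larger sub
	fmt.Println(c.GreaterThan(b)) // true: larger main always wins
}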

 

etcd Storage Data Structure

etcd's key-value data is kept in a b+tree structure stored on disk and mapped into memory via mmap to support fast access; on top of that, etcd maintains an in-memory btree index, treeIndex, which maps each key to its revisions. treeIndex is defined as follows:

type treeIndex struct {
	sync.RWMutex
	tree *btree.BTree
}

func newTreeIndex() index {
	return &treeIndex{
		tree: btree.New(32),
	}
}

The index provides Put, Get, Revision, Range, Visit and other operations on the btree. Taking Put as an example, the source code is as follows:

func (ti *treeIndex) Put(key []byte, rev revision) {
	keyi := &keyIndex{key: key}

	ti.Lock()
	defer ti.Unlock()
	item := ti.tree.Get(keyi)
	if item == nil {
		keyi.put(rev.main, rev.sub)
		ti.tree.ReplaceOrInsert(keyi)
		return
	}
	okeyi := item.(*keyIndex)
	okeyi.put(rev.main, rev.sub)
}

As can be seen from the source code, read and write operations on the btree are performed while holding the lock, which guarantees data consistency under concurrent access.
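
As a small illustration of how this index is used, here is a sketch that assumes the internal types above (treeIndex, keyIndex, revision) are copied into a local package; in etcd they are internal to the mvcc package and not importable. Writing the same key twice simply adds a second revision to the existing keyIndex, which is what backs the MVCC and Watch mechanisms described earlier:

package main

func main() {
	ti := newTreeIndex()
	ti.Put([]byte("foo"), revision{main: 1, sub: 0}) // first version of "foo"
	ti.Put([]byte("foo"), revision{main: 2, sub: 0}) // second version, appended to the same keyIndex
	ti.Put([]byte("bar"), revision{main: 2, sub: 1}) // another key changed in the same transaction
}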
