MatrixCube Unveiled 102: The Complete Distributed Storage System MatrixKV Implemented in 300 Lines
2022-07-25 22:36:00 【MatrixOrigin】
The previous article introduced the functions and architecture of MatrixCube in detail; MatrixCube is the component that gives the MatrixOne database its distributed capabilities. Today we will experience MatrixCube's functionality hands-on through a simple distributed storage demo.
MatrixKV Project introduction
This demo project is called MatrixKV, and its repository on GitHub is: https://github.com/matrixorigin/matrixkv
MatrixKV is a simple, strongly consistent distributed KV storage system. It uses Pebble as the underlying storage engine and MatrixCube as the distributed component, and defines a minimal read/write request interface. Users can write data through any node and read the data they need from any node.
Readers familiar with the TiDB architecture can think of MatrixKV as roughly equivalent to TiKV+PD, with RocksDB swapped out for Pebble.
This experiment uses Docker to simulate a small MatrixKV cluster, in order to further illustrate MatrixCube's functions and operating mechanisms.
Step 1: Environment Preparation
Tool preparation
This experiment uses docker and docker-compose, so both need to be installed. In general you can simply install Docker Desktop, which bundles the Docker engine, the CLI tools, and the Compose plugin. Official installation packages for the major operating systems are available at: https://www.docker.com/products/docker-desktop/
After installation, you can check that it completed successfully with the following commands; if so, the corresponding versions will be displayed.
```shell
docker -v
docker-compose -v
```

Docker is cross-platform, so this experiment has no particular operating system requirement, but macOS 12+ or CentOS 8+ is recommended (these have been fully verified). This tutorial was run on macOS 12. Note that this test uses a single machine with a single disk, so Prophet's load rebalancing across nodes cannot take effect; node load and data volume will therefore be unbalanced in this test. You can experience that feature properly in a full multi-machine deployment.
Clone Code
Clone the MatrixKV code to your local machine.
```shell
git clone https://github.com/matrixorigin/matrixkv
```

Step 2: MatrixKV Cluster Configuration
In the previous article we mentioned that MatrixCube builds distributed consensus on Raft, so at least three nodes are required as the minimum deployment scale, and the first three nodes act as the scheduling (Prophet) nodes. The small cluster prepared for this experiment has four nodes: three Prophet nodes and one data node. We simulate them on a single machine as Docker containers.
Prophet Node Settings
In the /cfg folder you can see the configuration files for node0 through node3; node0 to node2 are Prophet nodes and node3 is the data node. The Prophet node configuration, using node0 as the example, is as follows:
```toml
# RPC address for the raft-group; nodes send raft messages and snapshots to each other through this address.
addr-raft = "node0:8081"

# Address open to clients; clients interact with Cube through this address to send custom business requests.
addr-client = "node0:8082"

# Cube's data storage directory. Each node reports the storage usage of the disk holding this directory
# to the scheduling node.
dir-data = "/tmp/matrixkv"

[raft]
# Cube batches Raft write requests: multiple writes are merged into a single Raft-Log and submitted in
# one Raft Proposal. This option sets the size of one Proposal and depends on the specific application.
max-entry-bytes = "200MB"

[replication]
# 1. The maximum down time of a replica in a raft-group. Once a replica has been down longer than this,
#    the scheduling node considers it permanently failed and selects a suitable node in the cluster to
#    recreate the replica. If the failed node is restarted later, the scheduling node tells this replica
#    to destroy itself.
# 2. The default is 30 minutes, on the assumption that ordinary failures can be diagnosed and recovered
#    within 30 minutes; anything longer is treated as unrecoverable. For the convenience of this
#    experiment we set it to 15 seconds.
max-peer-down-time = "15s"

[prophet]
# Name of this Prophet scheduling node.
name = "node0"
# External RPC address of this Prophet scheduling node.
rpc-addr = "node0:8083"
# Mark this node as a Prophet node.
prophet-node = true

[prophet.schedule]
# Every node in the Cube cluster periodically sends heartbeats to the scheduling leader. When a node has
# not sent a heartbeat for longer than this, the scheduler marks the node as Down and rebuilds all the
# Shards on that node on other nodes in the cluster. When the node later recovers, every Shard on it
# receives a scheduling message telling it to destroy itself.
# The default is 30 minutes; it is set to 10 seconds here, again for the convenience of the experiment.
max-container-down-time = "10s"

# Prophet embeds an etcd as the component for storing metadata.
[prophet.embed-etcd]
# Cube's Prophet scheduling nodes start one after another. Suppose we have three scheduling nodes node0,
# node1 and node2, and node0 starts first: node0 forms a single-replica etcd, so for node0 the `join`
# parameter is left empty. When node1 starts later, its `join` is set to the etcd peer address of node0.
join = ""
# Client address of the embedded etcd.
client-urls = "http://0.0.0.0:8084"
# Advertised client address of the embedded etcd; if left empty it defaults to `client-urls`.
advertise-client-urls = "http://node0:8084"
# Peer address of the embedded etcd.
peer-urls = "http://0.0.0.0:8085"
# Advertised peer address of the embedded etcd; if left empty it defaults to `peer-urls`.
advertise-peer-urls = "http://node0:8085"

[prophet.replication]
# Maximum number of replicas per Shard. When the Cube scheduler's periodic inspection finds that a
# Shard's replica count does not match this value, it schedules the creation or deletion of replicas.
max-replicas = 3
```

Apart from setting the `join` parameter in the embedded etcd section to the node started before it, the configuration of node1 and node2 is essentially the same as node0.
Data Node Settings
Node3 is the data node and its configuration is simpler: apart from setting prophet-node to false, nothing else needs additional configuration.
```toml
addr-raft = "node3:8081"
addr-client = "node3:8082"
dir-data = "/tmp/matrixkv"

[raft]
max-entry-bytes = "200MB"

[prophet]
name = "node3"
rpc-addr = "node3:8083"
prophet-node = false
external-etcd = [
    "http://node0:8084",
    "http://node1:8084",
    "http://node2:8084",
]
```

Docker-Compose Setup
Docker-compose starts the containers according to docker-compose.yml, in which we need to change each node's data directory to a directory we specify ourselves. Take node0 as an example.
```yaml
node0:
  image: matrixkv
  ports:
    - "8080:8080"
  volumes:
    - ./cfg/node0.toml:/etc/cube.toml
    # /data/node0 needs to be changed to a local directory specified by the user
    - /data/node0:/data/matrixkv
  command:
    - --addr=node0:8080
    - --cfg=/etc/cube.toml
    # shard will split after 1024 bytes
    - --shard-capacity=1024
```

Step 3: Cluster Startup
With these options configured, we can build the image: the MatrixKV code base already provides the Dockerfile for building the image and the Makefile that drives the build process.
Running the make docker command directly under the MatrixKV directory packages the whole of MatrixKV into an image.
```shell
# On macOS x86 or Linux, run the command directly.
# On Apple Silicon (ARM) Macs, change `docker build -t matrixkv -f Dockerfile .` in the Makefile
# to `docker buildx build --platform linux/amd64 -t matrixkv -f Dockerfile .` first.
make docker
```

In addition, users in mainland China may find that downloading dependencies from the default Go module source is too slow. In that case, a Chinese Go proxy setting can be added to the Dockerfile:
```dockerfile
RUN go env -w GOPROXY=https://goproxy.cn,direct
```

Then run docker-compose up to start the four MatrixKV containers with their respective node configurations, forming our four-node cluster node0 through node3.
```shell
docker-compose up
```

In Docker Desktop we should now see all 4 MatrixKV nodes running as containers built from the image.

In the log below, once every node has started listening on port 8080, the cluster is up.

At the same time, we can see that a number of data folders and some initial files have been generated in the data directory we specified.

To shut down the cluster, stop the process in the terminal where you started it, or stop the nodes from the Docker Desktop graphical interface.
Step 4: Read/Write Request Interfaces and Routing
After starting the cluster, we can read and write data against it. MatrixKV wraps several very simple data read/write interfaces:
Writing data (SET):

```shell
curl -X POST -H 'Content-Type: application/json' -d '{"key":"k1","value":"v1"}' http://127.0.0.1:8080/set
```

Reading data (GET):

```shell
curl http://127.0.0.1:8080/get?key=k1
```

Deleting data (DELETE):

```shell
curl -X POST -H 'Content-Type: application/json' -d '{"key":"k1"}' http://127.0.0.1:8080/delete
```
The previous article introduced the Shard Proxy in MatrixCube. This component lets us issue requests from any node of the cluster; whether it is a write, a read, or a delete, the Shard Proxy automatically routes the request to the node that should handle it.
For example, we can write data on node0 and read it back from any of node0 through node3; the result is exactly the same.
```shell
# Send a write request to node0
curl -X POST -H 'Content-Type: application/json' -d '{"key":"k1","value":"v1"}' http://127.0.0.1:8080/set

# Read the value back from node0 through node3
curl http://127.0.0.1:8080/get?key=k1
curl http://127.0.0.1:8081/get?key=k1
curl http://127.0.0.1:8082/get?key=k1
curl http://127.0.0.1:8083/get?key=k1
```
If you run the experiment on larger hardware and with larger read/write volumes, you can also verify more extreme scenarios, for example having multiple clients rapidly read data from every node and checking that every value read after a write is the latest, consistent one. This verifies MatrixCube's strong consistency: data read from any node at any time is guaranteed to be up to date and consistent.
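To make that consistency check easy to repeat, here is a minimal sketch (not part of the MatrixKV repository) that writes a key through node0 and immediately reads it back from all four nodes. It assumes the localhost port mapping 8080-8083 and the /set and /get interfaces shown above.

```go
// consistency_check.go: write a key via node0, then read it back from every node.
// Assumes the four nodes are mapped to localhost ports 8080-8083 as in the
// docker-compose setup above.
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Write k1=v1 through node0.
	body := bytes.NewBufferString(`{"key":"k1","value":"v1"}`)
	resp, err := http.Post("http://127.0.0.1:8080/set", "application/json", body)
	if err != nil {
		panic(err)
	}
	resp.Body.Close()

	// Read k1 back from every node; the Shard Proxy should return the same value everywhere.
	for port := 8080; port <= 8083; port++ {
		url := fmt.Sprintf("http://127.0.0.1:%d/get?key=k1", port)
		r, err := http.Get(url)
		if err != nil {
			fmt.Printf("node on port %d: %v\n", port, err)
			continue
		}
		data, _ := io.ReadAll(r.Body)
		r.Body.Close()
		fmt.Printf("node on port %d: %s\n", port, data)
	}
}
```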
Step 5: Shard Query and Splitting
MatrixCube splits a Shard once the amount of data written reaches a certain size. In MatrixKV we set the Shard size to 1024 bytes, so data written beyond this size triggers a split. MatrixKV provides a simple interface for querying the Shards in the current cluster or on the current node.
```shell
# Shards in the current cluster
curl http://127.0.0.1:8080/shards

# Shards on the current node
curl http://127.0.0.1:8080/shards?local=true
```

After starting the cluster we can see that in the initial state it has only 3 Shards, with ids 4, 6 and 8, stored on node0, node2 and node3.

After we write more than 1024 bytes of data, we can see that the Shards on node0, node2 and node3 have all split: each original Shard has become two new Shards, so the initial 3 Shards have turned into six Shards with ids 11, 12, 13, 15, 16 and 17.
```shell
# test.json contains the test data; its content must strictly follow the key/value format,
# e.g. {"key":"item0","value":"XXXXXXXXXXX"}
curl -X POST -H 'Content-Type: application/json' -d@test.json http://127.0.0.1:8083/set
```
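If you need a payload comfortably larger than the 1024-byte Shard capacity, a small generator like the sketch below can produce test.json; the key name item0 and the 2048-byte value length are arbitrary choices for this experiment, not part of the MatrixKV repository.

```go
// gen_testdata.go: write a test.json whose value is larger than the 1024-byte
// shard capacity, so that posting it triggers a Shard split. The {"key":...,"value":...}
// layout matches the SET interface shown above.
package main

import (
	"encoding/json"
	"os"
	"strings"
)

func main() {
	payload := map[string]string{
		"key":   "item0",
		"value": strings.Repeat("X", 2048), // well beyond the 1024-byte shard capacity
	}
	data, err := json.Marshal(payload)
	if err != nil {
		panic(err)
	}
	if err := os.WriteFile("test.json", data, 0644); err != nil {
		panic(err)
	}
}
```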
At the same time, we can still access the data we wrote from any node.
Step 6: Node Changes and Replica Regeneration
Now let's look at how MatrixCube guarantees high availability. We can manually shut down a single container through Docker Desktop to simulate a machine failure in a real environment.
After writing the larger data set in step 5, the whole system has 6 Shards, each with 3 Replicas. We now shut down node3 manually.

Requests sent to node3 now all fail.
But read requests initiated from other nodes can still return the data; this is the overall high availability of the distributed system at work.

According to our earlier settings, when node3's Store fails to send a heartbeat to Prophet within 10 seconds, Prophet considers the Store offline. Inspecting the current replicas, it finds that all Shards now have only two Replicas each. To satisfy the 3-replica requirement, Prophet automatically looks for an idle node and copies the Shards onto it, node1 in this case. Let's look at the Shards on each node again.
You can see that node1, which previously had no Shards, now holds the same 6 Shards as node0 and node2. This is Prophet's automatic replica generation, which always keeps three replicas in the system to guarantee high availability.
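If you want to watch the replicas being recreated rather than only checking the end state, a small polling sketch like the one below (again not part of the MatrixKV repository, and assuming node0-node2 remain mapped to localhost ports 8080-8082) repeatedly queries each surviving node's local Shard list:

```go
// watch_shards.go: poll the per-node shard listing on the surviving nodes so you
// can watch Prophet recreate the missing replicas after node3 is shut down.
// Assumes node0-node2 are reachable on localhost ports 8080-8082.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	for i := 0; i < 10; i++ {
		for port := 8080; port <= 8082; port++ {
			url := fmt.Sprintf("http://127.0.0.1:%d/shards?local=true", port)
			resp, err := http.Get(url)
			if err != nil {
				fmt.Printf("port %d: %v\n", port, err)
				continue
			}
			data, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			fmt.Printf("port %d: %s\n", port, data)
		}
		fmt.Println("----")
		time.Sleep(5 * time.Second)
	}
}
```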

Besides replica regeneration, if the failed node happened to host the Raft Group leader of some Shard, that Shard's Raft Group holds a new election, elects a new leader, and the new leader then initiates the request to create the missing replica. You can test this yourself and verify it through the log output.
A Quick Look at the MatrixKV Code
Through this experiment we have seen how, with MatrixCube's help, the single-machine KV storage engine Pebble becomes a distributed KV store. The code MatrixKV itself has to implement is very simple: in total there are only 4 Go files, and fewer than 300 lines of code are enough to build all of MatrixKV.
/cmd/matrixkv.go: the entry point of the program; it performs the most basic initialization and starts the service.
/pkg/config/config.go: defines the data structure for MatrixKV's overall configuration.
/pkg/metadata/metadata.go: defines the data structures of the read/write requests that users exchange with MatrixKV.
/pkg/server/server.go: this is the main body of MatrixKV, and it does three main things (a simplified sketch of the storage side follows after this list):
- Define the data structure of the MatrixKV server.
- Define the concrete implementations of the Executors for the Set/Get/Delete requests.
- Use the Pebble library as the single-node storage engine, implement the DataStorage interface specified by MatrixCube, and set the corresponding methods in MatrixCube's Config.
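As an illustration of the storage side only, here is a deliberately simplified, single-node sketch of what the Set/Get/Delete executors ultimately do on top of Pebble. It is not MatrixKV's actual server.go and omits all of the MatrixCube plumbing (Raft replication, Shard routing, the DataStorage interface); it only shows the Pebble calls that back those executors.

```go
// kvstore.go: a simplified, single-node view of the Pebble operations behind
// MatrixKV's Set/Get/Delete executors. MatrixCube's replication and routing
// layers are intentionally left out of this sketch.
package main

import (
	"fmt"

	"github.com/cockroachdb/pebble"
)

type KVStore struct {
	db *pebble.DB
}

func Open(dir string) (*KVStore, error) {
	db, err := pebble.Open(dir, &pebble.Options{})
	if err != nil {
		return nil, err
	}
	return &KVStore{db: db}, nil
}

func (s *KVStore) Close() error { return s.db.Close() }

// Set is what a write executor boils down to: persist the key/value pair.
func (s *KVStore) Set(key, value []byte) error {
	return s.db.Set(key, value, pebble.Sync)
}

// Get is what a read executor boils down to: look the key up and copy the value out.
func (s *KVStore) Get(key []byte) ([]byte, error) {
	v, closer, err := s.db.Get(key)
	if err != nil {
		return nil, err
	}
	defer closer.Close()
	return append([]byte(nil), v...), nil
}

// Delete removes the key.
func (s *KVStore) Delete(key []byte) error {
	return s.db.Delete(key, pebble.Sync)
}

func main() {
	store, err := Open("/tmp/matrixkv-demo")
	if err != nil {
		panic(err)
	}
	defer store.Close()

	if err := store.Set([]byte("k1"), []byte("v1")); err != nil {
		panic(err)
	}
	v, err := store.Get([]byte("k1"))
	if err != nil {
		panic(err)
	}
	fmt.Printf("k1 = %s\n", v)
}
```

In the real server.go, these same Pebble operations sit behind MatrixCube, which handles the Raft replication and Shard routing described earlier before an executor touches the local storage engine.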
Giveaway Time
Please add our assistant on WeChat (WeChat ID: MatrixOrigin001):
- Send a complete screen recording of your first MatrixKV experience to receive a limited-edition MatrixOrigin T-shirt.
- Send a complete screen recording of your first MatrixKV experience and publish it on CSDN to receive a 200 RMB JD gift card plus a limited-edition MatrixOrigin T-shirt.
Summary
As the second part of the MatrixCube series, this experiment with MatrixKV, a custom distributed storage system built on MatrixCube and Pebble, further demonstrated how MatrixCube operates and showed that a complete, strongly consistent distributed storage system can be built quickly with around 300 lines of code. The next installment will bring a deeper dive into the MatrixCube code; stay tuned.