TiDB unsafe recover (the number of failed TiKV nodes is greater than or equal to half the replica count)

2022-06-11 17:33:00 On the way to data communication

I. Background

Item          Count
TiKV nodes    4
Replicas      3

II. Environment preparation

1. Install jq

# ubuntu
apt install jq
# centos
yum install jq

2. Prepare the data

Use sysbench or a script of your own to load test data.

3. Simulate the failure

Delete the data directory of the target TiKV nodes, or force a scale-in of those TiKV nodes.

4. Symptoms

mysql> select * from region_1 limit 10;
ERROR 9005 (HY000): Region is unavailable
ERROR 9002 (HY000): TiKV server timeout

III. Simulated scenarios

1. Two TiKV nodes down

Because there are three replicas and only two TiKV nodes are down, every Region still has at least one surviving replica, so no data is lost. For the recovery procedure, refer to "Three TiKV nodes, two down".
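The reasoning above is majority arithmetic, which can be sketched in a few lines of shell. `classify_region` is a hypothetical helper name used only for illustration; it is not part of any TiDB tool, and the labels it prints are my own shorthand.

```shell
#!/bin/sh
# Hypothetical helper (not part of TiDB tooling): classify a Region by how
# many of its replicas sit on failed stores.
# Usage: classify_region <total_replicas> <failed_replicas>
classify_region() {
    total=$1
    failed=$2
    quorum=$(( total / 2 + 1 ))      # Raft majority, e.g. 2 of 3
    alive=$(( total - failed ))
    if [ "$alive" -ge "$quorum" ]; then
        echo "available"             # majority intact
    elif [ "$alive" -gt 0 ]; then
        echo "majority-lost"         # a replica survives; see remove-fail-stores
    else
        echo "all-lost"              # no replica survives; see recreate-region
    fi
}

classify_region 3 1   # prints: available
classify_region 3 2   # prints: majority-lost
classify_region 3 3   # prints: all-lost
```

With 4 TiKV nodes and 3 replicas, two failed nodes can put a Region in the second case at worst, while three failed nodes can put it in the third.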

2. Three TiKV nodes down

2.1. Find the disconnected stores

# Record the store IDs whose "state_name" is "Disconnected" (mine are 4, 5 and 1253)
tiup ctl:v4.0.13 pd -u http://pd_ip:pd_port store
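Picking the IDs out of a long store listing by eye is error-prone, so jq can do it. The sketch below runs against a made-up sample shaped like the `store` output (a `stores` array whose items carry `store.id` and `store.state_name`); `list_disconnected` is a hypothetical helper name, and it assumes the input is pure JSON with no banner lines in front of it.

```shell
#!/bin/sh
# Made-up sample shaped like the pd-ctl `store` output; real output has more fields.
sample_stores='{"count":4,"stores":[
 {"store":{"id":1,"state_name":"Up"}},
 {"store":{"id":4,"state_name":"Disconnected"}},
 {"store":{"id":5,"state_name":"Disconnected"}},
 {"store":{"id":1253,"state_name":"Disconnected"}}]}'

# Hypothetical helper: read store JSON on stdin, print one Disconnected store ID per line.
list_disconnected() {
    jq -r '.stores[] | select(.store.state_name == "Disconnected") | .store.id'
}

printf '%s\n' "$sample_stores" | list_disconnected
```

On a real cluster the same function could be fed from the `store` command above instead of the sample.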

2.2. Check replica loss

# Regions with at least half of their replicas on the failed stores
region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(1253,4,5) then . else empty end) | length>=$total-length)}'
# Regions with all of their replicas on the failed stores
region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(1253,4,5) then . else empty end) | length>=$total)}'
# For jq usage, see https://asktug.com/t/topic/63086
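To see what the filter does, it helps to run it on a tiny hand-written sample. In the sketch below the sample Regions and the helper names `failed_majority` and `failed_all` are made up for illustration. The first filter compares the number of replicas on failed stores with the number on healthy stores (`length>=$total-length`), which is the form I understand the PD Control documentation to use for "at least half". The second compares it with the total (`length>=$total`).

```shell
#!/bin/sh
# Made-up sample shaped like the pd-ctl `region` output.
sample_regions='{"regions":[
 {"id":10,"peers":[{"store_id":1},{"store_id":2},{"store_id":3}]},
 {"id":11,"peers":[{"store_id":1},{"store_id":4},{"store_id":5}]},
 {"id":12,"peers":[{"store_id":4},{"store_id":5},{"store_id":1253}]}]}'

# Regions with at least half of their replicas on stores 1253, 4 and 5.
failed_majority() {
    jq -c '.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(1253,4,5) then . else empty end) | length>=$total-length)}'
}

# Regions with every replica on stores 1253, 4 and 5.
failed_all() {
    jq -c '.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(1253,4,5) then . else empty end) | length>=$total)}'
}

printf '%s\n' "$sample_regions" | failed_majority   # Regions 11 and 12
printf '%s\n' "$sample_regions" | failed_all        # Region 12 only
```

Region 10 has no replica on a failed store and is filtered out by both; Region 11 has two of three there and only matches the first filter.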

2.3. Disable PD scheduling to avoid anomalies during recovery

# Enter interactive mode
tiup ctl:v4.0.13 pd -u http://pd_ip:pd_port -i
# Note the current values first (config show); they are restored in step 2.7
# Run the following commands one by one
config set region-schedule-limit 0
config set replica-schedule-limit 0
config set leader-schedule-limit 0
config set merge-schedule-limit 0
# Confirm that scheduling has stopped
operator show

2.4. Stop the TiKV processes (otherwise unsafe-recover remove-fail-stores fails to acquire the file lock)

tiup cluster stop cluster_name -R tikv

2.5. Run unsafe-recover remove-fail-stores

2.5.1 Copy tikv-ctl to every TiKV machine that is still healthy

scp /data/tidb/.tiup/components/ctl/v4.0.13/tikv-ctl [email protected]:/home/tidb
scp /data/tidb/.tiup/components/ctl/v4.0.13/tikv-ctl [email protected]:/home/tidb
scp /data/tidb/.tiup/components/ctl/v4.0.13/tikv-ctl [email protected]:/home/tidb

2.5.2 Run the tikv-ctl command

# 4.0.x command: -s takes the failed store IDs, --all-regions applies to every Region; use -r to name specific Regions instead of --all-regions
# unsafe-recover remove-fail-stores removes the failed stores from the peer list of the specified Regions

./tikv-ctl --db /data/tikv/tikv-data28016/db unsafe-recover remove-fail-stores -s 1253,4,5 --all-regions

# 5.x command

./tikv-ctl --data-dir /data/tikv/tikv-data28016 unsafe-recover remove-fail-stores -s 1253,4,5 --all-regions
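The only difference between the two commands is how the data location is passed. A small sketch of that choice, assuming nothing beyond the two forms shown above: `tikv_ctl_data_args` is a hypothetical helper name, and any major version other than 4 or 5 is rejected because this article does not cover it.

```shell
#!/bin/sh
# Hypothetical helper: print the tikv-ctl data-location arguments for a TiKV
# version, following the 4.0.x and 5.x command forms shown in this article.
# Usage: tikv_ctl_data_args <version> <data_dir>
tikv_ctl_data_args() {
    version=$1                 # e.g. v4.0.13 or v5.0.1
    data_dir=$2                # e.g. /data/tikv/tikv-data28016
    major=${version#v}         # drop a leading "v" if present
    major=${major%%.*}         # keep only the major number
    case "$major" in
        4) echo "--db $data_dir/db" ;;
        5) echo "--data-dir $data_dir" ;;
        *) echo "version $version is not covered here" >&2; return 1 ;;
    esac
}

tikv_ctl_data_args v4.0.13 /data/tikv/tikv-data28016   # prints: --db /data/tikv/tikv-data28016/db
tikv_ctl_data_args v5.0.1 /data/tikv/tikv-data28016    # prints: --data-dir /data/tikv/tikv-data28016
```

The same distinction applies to any other tikv-ctl subcommand that works on local data.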

The steps above recover the Regions that lost two replicas. The next step is to recover the Regions that lost all three replicas.

2.6 Repair Regions that lost all three replicas

2.6.1 Inspect the Region

curl http://tidb_ip:10080/regions/1189
{
 "start_key": "dIAAAAAAAAAZ",
 "end_key": "dIAAAAAAAAAb",
 "start_key_hex": "748000000000000019",
 "end_key_hex": "74800000000000001b",
 "region_id": 52,
 "frames": [
  {
   "db_name": "mysql",
   "table_name": "stats_buckets",
   "table_id": 25,
   "is_record": false,
   "index_name": "tbl",
   "index_id": 1
  },
  {
   "db_name": "mysql",
   "table_name": "stats_buckets",
   "table_id": 25,
   "is_record": true
  }
 ]
}
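The hex keys in this output tie the Region to a table, which helps when judging what an empty replacement Region would discard. The sketch below decodes the table ID from such a key. It assumes the key starts with the table prefix `t` (0x74) followed by an 8-byte big-endian integer whose sign bit is flipped, and it expects at least 18 hex characters. `table_id_from_key_hex` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: decode the table ID from a Region key given in hex.
# Assumption: key = 't' (0x74) + 8-byte big-endian integer with the sign bit flipped.
table_id_from_key_hex() {
    hex=$1
    prefix=$(printf '%s' "$hex" | cut -c1-2)
    hi=$(printf '%s' "$hex" | cut -c3-4)      # most significant byte
    lo=$(printf '%s' "$hex" | cut -c5-18)     # remaining 7 bytes
    [ "$prefix" = "74" ] || { echo "not a table key" >&2; return 1; }
    # Flip the sign bit back in the top byte, then append the low 7 bytes.
    echo $(( ((0x$hi ^ 0x80) << 56) | 0x$lo ))
}

table_id_from_key_hex 748000000000000019   # prints: 25
table_id_from_key_hex 74800000000000001b   # prints: 27
```

The first result matches the `table_id` of 25 reported in the frames above.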

2.6.2 Create an empty Region

# v4.x
./tikv-ctl --db /data/tidb/tidb-data/tikv-20160/db recreate-region -p pd_ip:pd_port -r region_id
# v5.x
./tikv-ctl --data-dir /data/tidb/tidb-data/tikv-20160/ recreate-region -p pd_ip:pd_port -r region_id

2.7. Restore PD scheduling

# Enter interactive mode
tiup ctl:v4.0.13 pd -u http://pd_ip:pd_port -i
# Run the following commands one by one (use the values that were in effect before scheduling was disabled)
config set region-schedule-limit 2048
config set replica-schedule-limit 64
config set leader-schedule-limit 4
config set merge-schedule-limit 8
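Since the values to restore are whatever was in effect before step 2.3, it is safer to generate these commands from a saved list than to retype them. A minimal sketch, assuming the limits were saved as "name value" lines; `restore_commands` is a hypothetical helper name and the numbers are the ones used in this article, not recommended defaults.

```shell
#!/bin/sh
# Hypothetical helper: turn saved "name value" lines into pd-ctl commands.
restore_commands() {
    while read -r name value; do
        [ -n "$name" ] || continue           # skip blank lines
        echo "config set $name $value"
    done
}

# Values recorded before scheduling was disabled (taken from this article).
saved_limits='region-schedule-limit 2048
replica-schedule-limit 64
leader-schedule-limit 4
merge-schedule-limit 8'

printf '%s\n' "$saved_limits" | restore_commands
```

The printed lines can be pasted into the interactive pd-ctl session one by one.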

2.8. Start the TiKV cluster

tiup cluster start cluster_name -R tikv

Note: At this point the cluster is accessible again, but some data has been lost. Scale out new TiKV nodes to get back to three replicas.
