TiDB unsafe recover (the number of failed TiKV nodes is greater than or equal to half the number of replicas)
2022-06-11 17:33:00 【On the way to data communication】
I. Background
| Item | Count |
|---|---|
| TiKV nodes | 4 |
| Replicas | 3 |
II. Environment preparation
1. Install jq
# ubuntu
apt install jq
# centos
yum install jq
2. Prepare the data
Use sysbench or write your own script to load test data.
3. Simulate downtime
Delete the data directory of the chosen TiKV nodes, or force scale-in those TiKV nodes.
4. Symptoms
mysql> select * from region_1 limit 10;
ERROR 9005 (HY000): Region is unavailable
ERROR 9002 (HY000): TiKV server timeout
III. Simulation scenarios
1. Two TiKV nodes down
Because there are three replicas and only two TiKV nodes are down, no data is lost. For the recovery method, refer to "three TiKV nodes, two down".
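To make the difference between the two scenarios concrete, here is a minimal sketch of the Raft majority rule applied to a single Region. The function name `classify_region` and its output labels are illustrative only and are not part of any TiDB tool.

```shell
# classify_region TOTAL_REPLICAS LOST_REPLICAS
# Hypothetical helper: prints how a Region is affected when LOST_REPLICAS
# of its TOTAL_REPLICAS peers sit on failed TiKV stores.
classify_region() {
  local total="$1" lost="$2" alive
  alive=$(( total - lost ))
  if [ "$alive" -le 0 ]; then
    echo "all-lost"       # no copy left: the data in this Region is gone
  elif [ $(( alive * 2 )) -le "$total" ]; then
    echo "quorum-lost"    # data survives, but there is no Raft majority
  else
    echo "healthy"        # a majority of peers is still alive
  fi
}

classify_region 3 1   # prints healthy
classify_region 3 2   # prints quorum-lost
classify_region 3 3   # prints all-lost
```

With 4 TiKV nodes and 3 replicas, losing two nodes leaves every Region with at least one surviving peer (at worst `quorum-lost`), while losing three nodes can leave some Regions `all-lost`.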
2. Three TiKV nodes down
2.1. List the disconnected stores
# Record the store IDs whose "state_name" is "Disconnected" (mine are 4, 5, 1253)
tiup ctl:v4.0.13 pd -u http://pd_ip:pd_port store
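Instead of reading the store list by eye, the IDs can be extracted with jq. The sketch below assumes the `store` command prints JSON of the shape `{"stores": [{"store": {"id": ..., "state_name": ...}}]}` and that nothing but JSON reaches standard output; if tiup prints a banner line first, strip it before piping. The function name `disconnected_store_ids` is made up for this example.

```shell
# disconnected_store_ids: reads the JSON printed by pd-ctl `store` on stdin
# and prints the IDs of all stores whose state_name is "Disconnected",
# comma-separated (the same format that tikv-ctl -s takes later on).
disconnected_store_ids() {
  jq -r '[.stores[]
          | select(.store.state_name == "Disconnected")
          | .store.id
          | tostring]
         | join(",")'
}

# Usage against a live cluster (pd_ip and pd_port are placeholders):
#   tiup ctl:v4.0.13 pd -u http://pd_ip:pd_port store | disconnected_store_ids
```

Stores that have been unreachable for a long time may be reported with a different state name, so it is worth checking the raw output once before trusting the filter.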
2.2. Check replica loss
# Run the following in pd-ctl
# List the Regions that have lost at least half of their replicas
region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(1253,4,5) then . else empty end) | length>=$total-length)}'
# List the Regions that have lost all of their replicas
region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(1253,4,5) then . else empty end) | length>=$total)}'
# For jq usage, refer to https://asktug.com/t/topic/63086
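The same check can be tried offline against saved `region` output, which makes the filter logic easier to follow. This is a sketch, assuming the JSON has the shape `{"regions": [{"id": ..., "peers": [{"store_id": ...}]}]}`; the function name `lost_regions` and the `<lost>/<total>` output format are invented for illustration.

```shell
# lost_regions FAILED_STORE_IDS
# Reads the JSON printed by pd-ctl `region` on stdin. For every Region with
# at least half of its peers on the failed stores, prints
# "<region_id> <lost>/<total>". FAILED_STORE_IDS is comma-separated, e.g. 1253,4,5.
lost_regions() {
  jq -r --argjson failed "[$1]" '
    .regions[]
    | [.peers[].store_id] as $stores
    | ($stores | map(select(. as $s | $failed | any(. == $s))) | length) as $lost
    | select($lost * 2 >= ($stores | length))
    | "\(.id) \($lost)/\($stores | length)"'
}

# Usage with output saved from pd-ctl (regions.json is a hypothetical file name):
#   lost_regions 1253,4,5 < regions.json
```

Lines ending in `3/3` are Regions with no surviving replica; those are the ones that need an empty Region to be recreated, while `2/3` Regions can be recovered from the remaining peer.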
2.3. Disable PD scheduling to avoid anomalies during recovery
# Enter interactive mode
tiup ctl:v4.0.13 pd -u http://pd_ip:pd_port -i
# Record the current values first (config show), then run the following commands one by one
config set region-schedule-limit 0
config set replica-schedule-limit 0
config set leader-schedule-limit 0
config set merge-schedule-limit 0
# Check that scheduling has stopped
operator show
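Because the limits have to be restored to their previous values later, it helps to save them before setting them to 0. The sketch below assumes that `config show` prints JSON with the limits under a `schedule` object; the helper names `save_limits` and `limit_commands` and the file name `limits.txt` are hypothetical.

```shell
# The four limits that this procedure sets to 0.
LIMIT_KEYS="region-schedule-limit replica-schedule-limit leader-schedule-limit merge-schedule-limit"

# save_limits: reads `config show` JSON on stdin, prints "key value" lines.
save_limits() {
  local json key
  json=$(cat)
  for key in $LIMIT_KEYS; do
    printf '%s %s\n' "$key" \
      "$(printf '%s' "$json" | jq -r --arg k "$key" '.schedule[$k]')"
  done
}

# limit_commands [OVERRIDE]: reads "key value" lines on stdin and prints one
# `config set` command per line. With OVERRIDE (for example 0), that value
# replaces the saved one.
limit_commands() {
  local override="${1:-}" key value
  while read -r key value; do
    printf 'config set %s %s\n' "$key" "${override:-$value}"
  done
}

# Usage (pd_ip and pd_port are placeholders):
#   tiup ctl:v4.0.13 pd -u http://pd_ip:pd_port config show | save_limits > limits.txt
#   limit_commands 0 < limits.txt   # commands that disable scheduling
#   limit_commands   < limits.txt   # commands that restore the saved values
```

The printed commands can be pasted into the interactive pd-ctl session, which keeps the destructive part of the procedure manual and reviewable.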
2.4. Stop the TiKV processes (otherwise unsafe-recover remove-fail-stores fails to acquire the file lock)
tiup cluster stop cluster_name -R tikv
2.5. Run unsafe-recover remove-fail-stores
2.5.1 Copy tikv-ctl to every TiKV machine that is still healthy
scp /data/tidb/.tiup/components/ctl/v4.0.13/tikv-ctl [email protected]:/home/tidb
scp /data/tidb/.tiup/components/ctl/v4.0.13/tikv-ctl [email protected]:/home/tidb
scp /data/tidb/.tiup/components/ctl/v4.0.13/tikv-ctl [email protected]:/home/tidb
2.5.2 Run the tikv-ctl command
# Command for 4.0.x: -s specifies the failed store IDs, --all-regions applies to every Region; -r can be used to specify Regions instead of --all-regions
# unsafe-recover remove-fail-stores removes the failed stores from the peer list of the specified Regions
./tikv-ctl --db /data/tikv/tikv-data28016/db unsafe-recover remove-fail-stores -s 1253,4,5 --all-regions
# Command for 5.x
./tikv-ctl --data-dir /data/tikv/tikv-data28016 unsafe-recover remove-fail-stores -s 1253,4,5 --all-regions
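The only difference between the two variants is how the data location is passed: 4.0.x points `--db` at the `db` subdirectory, 5.x points `--data-dir` at the data directory itself. A small sketch that prints the right command for a given version follows; the function name `remove_fail_stores_cmd` is hypothetical, and versions other than 4.x and 5.x are rejected because this article does not cover them.

```shell
# remove_fail_stores_cmd VERSION DATA_DIR STORE_IDS
# Prints (does not run) the tikv-ctl command for the given cluster version.
remove_fail_stores_cmd() {
  local version="$1" data_dir="$2" stores="$3" major
  major="${version#v}"      # strip a leading "v"
  major="${major%%.*}"      # keep the major version only
  case "$major" in
    4) printf './tikv-ctl --db %s/db unsafe-recover remove-fail-stores -s %s --all-regions\n' \
         "$data_dir" "$stores" ;;
    5) printf './tikv-ctl --data-dir %s unsafe-recover remove-fail-stores -s %s --all-regions\n' \
         "$data_dir" "$stores" ;;
    *) echo "unsupported version: $version" >&2
       return 1 ;;
  esac
}

remove_fail_stores_cmd v4.0.13 /data/tikv/tikv-data28016 1253,4,5
remove_fail_stores_cmd v5.0.1 /data/tikv/tikv-data28016 1253,4,5
```

Printing the command rather than executing it leaves room to review it on each healthy TiKV machine before it modifies Raft metadata.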
The steps above recover the Regions that lost two replicas. The next step is to recover the Regions that lost all three replicas.
2.6 Repair the Regions that lost all three replicas
2.6.1 Inspect the Region
curl http://tidb_ip:10080/regions/1189
{
  "start_key": "dIAAAAAAAAAZ",
  "end_key": "dIAAAAAAAAAb",
  "start_key_hex": "748000000000000019",
  "end_key_hex": "74800000000000001b",
  "region_id": 52,
  "frames": [
    {
      "db_name": "mysql",
      "table_name": "stats_buckets",
      "table_id": 25,
      "is_record": false,
      "index_name": "tbl",
      "index_id": 1
    },
    {
      "db_name": "mysql",
      "table_name": "stats_buckets",
      "table_id": 25,
      "is_record": true
    }
  ]
}
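When the `frames` list is empty or the TiDB status port is unreachable, the table ID can still be read from `start_key_hex`. The sketch below assumes the usual TiDB key layout: the byte `t` (0x74) followed by an 8-byte big-endian table ID with the sign bit flipped, so non-negative IDs start with 0x80. The function name `table_id_from_key_hex` is hypothetical.

```shell
# table_id_from_key_hex KEY_HEX
# Extracts the table ID from a hex-encoded table key such as start_key_hex.
table_id_from_key_hex() {
  local hex="$1" prefix rest
  prefix=$(printf '%s' "$hex" | cut -c1-4)   # "74" (t) + "80" (flipped sign byte)
  rest=$(printf '%s' "$hex" | cut -c5-18)    # remaining 7 bytes of the table ID
  if [ "$prefix" != "7480" ] || [ "${#rest}" -ne 14 ]; then
    echo "not a table key: $hex" >&2
    return 1
  fi
  echo $(( 0x$rest ))
}

table_id_from_key_hex 748000000000000019   # prints 25
table_id_from_key_hex 74800000000000001b   # prints 27
```

The first result matches the `"table_id": 25` reported in the output above, which gives a quick way to tell which tables are affected by a Region that lost all replicas.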
2.6.2 Create an empty Region
# v4
./tikv-ctl --db /data/tidb/tidb-data/tikv-20160/db recreate-region -p pd_ip:pd_port -r region_id
# v5
./tikv-ctl --data-dir /data/tidb/tidb-data/tikv-20160/ recreate-region -p pd_ip:pd_port -r region_id
2.7. Restore PD scheduling
# Enter interactive mode
tiup ctl:v4.0.13 pd -u http://pd_ip:pd_port -i
# Run the following commands one by one (use the values recorded before scheduling was disabled)
config set region-schedule-limit 2048
config set replica-schedule-limit 64
config set leader-schedule-limit 4
config set merge-schedule-limit 8
2.8. Start the TiKV cluster
tiup cluster start cluster_name -R tikv
Note: At this point the cluster can be accessed normally again, but the data in the Regions that lost all replicas is gone. Scale out TiKV nodes so that every Region has three replicas again.