Troubleshooting the tidb-dm alert DM_sync_process_exists_with_error
2022-07-02 09:32:00 【与数据交流的路上】
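I. Background
A DM replication task raised the DM_sync_process_exists_with_error alert and recovered automatically one minute later, so I decided to track down the cause.
II. Errors observed in the logs
1. DM log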
[2022/06/28 14:31:13.364 +00:00] [ERROR] [db.go:201] ["execute statements failed after retry"] [task=task-name] [unit="binlog replication"] [queries="[sql]"] [arguments="[[]]"] [error="[code=10006:class=database:scope=not-set:level=high], Message: execute statement failed: commit, RawCause: invalid connection"]
2. Upstream MySQL log
2022-06-28T14:31:19.413211Z 28801 [Note] Aborted connection 28801 to db: 'unconnected' user: '***' host: 'ip' (Got an error reading communication packets)
2022-06-28T14:31:22.154980Z 28802 [Note] Aborted connection 28802 to db: 'unconnected' user: '***' host: 'ip' (Got an error reading communication packets)
2022-06-28T14:31:32.158508Z 28804 [Note] Start binlog_dump to master_thread_id(28804) slave_server(429505412), pos(mysql-bin-changelog.103037, 36247149)
2022-06-28T14:31:32.158739Z 28803 [Note] Start binlog_dump to master_thread_id(28803) slave_server(429505202), pos(mysql-bin-changelog.103037, 40373779)
3. Downstream TiDB log
[2022/06/28 14:31:12.419 +00:00] [WARN] [client_batch.go:638] ["wait response is cancelled"] [to=dm_worker_ip:20160] [cause="context canceled"]
[2022/06/28 14:31:12.419 +00:00] [WARN] [client_batch.go:638] ["wait response is cancelled"] [to=dm_worker_ip:20160] [cause="context canceled"]
[2022/06/28 14:31:12.419 +00:00] [WARN] [client_batch.go:638] ["wait response is cancelled"] [to=dm_worker_ip:20160] [cause="context canceled"]
[2022/06/28 14:31:12.419 +00:00] [WARN] [client_batch.go:638] ["wait response is cancelled"] [to=dm_worker_ip:20160] [cause="context canceled"]
[2022/06/28 14:31:12.419 +00:00] [WARN] [client_batch.go:638] ["wait response is cancelled"] [to=dm_worker_ip:20160] [cause="context canceled"]
4. Downstream TiKV log
[2022/06/28 14:31:12.585 +00:00] [WARN] [endpoint.rs:537] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 2641161, leader may Some(id: 2641164 store_id: 5)\" not_leader { region_id: 2641161 leader { id: 2641164 store_id: 5 } }"]
[2022/06/28 14:31:12.585 +00:00] [WARN] [endpoint.rs:537] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 2641165, leader may Some(id: 2641167 store_id: 4)\" not_leader { region_id: 2641165 leader { id: 2641167 store_id: 4 } }"]
[2022/06/28 14:31:12.585 +00:00] [WARN] [endpoint.rs:537] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 2709997, leader may Some(id: 2709999 store_id: 4)\" not_leader { region_id: 2709997 leader { id: 2709999 store_id: 4 } }"]
[2022/06/28 14:31:12.585 +00:00] [WARN] [endpoint.rs:537] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 2839445, leader may Some(id: 2839447 store_id: 4)\" not_leader { region_id: 2839445 leader { id: 2839447 store_id: 4 } }"]
[2022/06/28 14:31:20.400 +00:00] [WARN] [endpoint.rs:537] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 2957169, leader may Some(id: 2957170 store_id: 1)\" not_leader { region_id: 2957169 leader { id: 2957170 store_id: 1 } }"]
[2022/06/28 14:31:20.400 +00:00] [WARN] [endpoint.rs:537] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 2957169, leader may Some(id: 2957170 store_id: 1)\" not_leader { region_id: 2957169 leader { id: 2957170 store_id: 1 } }"]
[2022/06/28 14:31:20.400 +00:00] [WARN] [endpoint.rs:537] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 2957169, leader may Some(id: 2957170 store_id: 1)\" not_leader { region_id: 2957169 leader { id: 2957170 store_id: 1 } }"]
[2022/06/28 14:31:05.617 +00:00] [WARN] [endpoint.rs:537] [error-response] [err="Key is locked (will clean up) primary_lock: 748000000F000 lock_version: 434222311815512066 key: 748000009725552F000 lock_ttl: 3003 txn_size: 1"]
[2022/06/28 14:31:05.634 +00:00] [WARN] [endpoint.rs:537] [error-response] [err="Key is locked (will clean up) primary_lock: 7480000000092F000 lock_version: 434222311815512092 key: 748000000000 lock_ttl: 3018 txn_size: 5"]
[2022/06/28 14:31:15.389 +00:00] [ERROR] [kv.rs:931] ["KvService response batch commands fail"]
[2022/06/28 14:31:15.432 +00:00] [ERROR] [kv.rs:931] ["KvService response batch commands fail"]
5. PD log
[2022/06/28 14:30:55.329 +00:00] [INFO] [operator_controller.go:424] ["add operator"] [region-id=2641161] [operator="\"transfer-hot-read-leader {transfer leader: store 1 to 5} (kind:hot-region,leader, region:2641161(25913,5), createAt:2022-06-28 14:30:55.329497692 +0000 UTC m=+8421773.911777457, startAt:0001-01-01 00:00:00 +0000 UTC, currentStep:0, steps:[transfer leader from store 1 to store 5])\""] ["additional info"=]
[2022/06/28 14:30:55.329 +00:00] [INFO] [operator_controller.go:620] ["send schedule command"] [region-id=2641161] [step="transfer leader from store 1 to store 5"] [source=create]
[2022/06/28 14:30:55.342 +00:00] [INFO] [cluster.go:567] ["leader changed"] [region-id=2641161] [from=1] [to=5]
[2022/06/28 14:30:55.342 +00:00] [INFO] [operator_controller.go:537] ["operator finish"] [region-id=2641161] [takes=12.961676ms] [operator="\"transfer-hot-read-leader {transfer leader: store 1 to 5} (kind:hot-region,leader, region:2641161(25913,5), createAt:2022-06-28 14:30:55.329497692 +0000 UTC m=+8421773.911777457, startAt:2022-06-28 14:30:55.329597613 +0000 UTC m=+8421773.911877386, currentStep:1, steps:[transfer leader from store 1 to store 5]) finished\""] ["additional info"=]
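The PD entries above and the TiKV `not_leader` warnings in section 4 share region IDs (for example region 2641161, whose leader moved from store 1 to store 5). A minimal Python sketch of how that match could be checked mechanically follows. The regular expressions are assumptions derived only from the log excerpts in this article, and the function names are hypothetical helpers, not part of any TiDB tooling.

```python
import re

# Assumption: patterns are modeled on the excerpts in this article only;
# other PD/TiKV versions may format these lines differently.
PD_LEADER_CHANGED = re.compile(
    r'\["leader changed"\] \[region-id=(\d+)\] \[from=(\d+)\] \[to=(\d+)\]'
)
TIKV_NOT_LEADER = re.compile(
    r'not_leader \{ region_id: (\d+) leader \{ id: \d+ store_id: (\d+) \} \}'
)


def pd_leader_changes(lines):
    """Map region id -> (old store, new store) from PD 'leader changed' lines."""
    changes = {}
    for line in lines:
        m = PD_LEADER_CHANGED.search(line)
        if m:
            region, old, new = map(int, m.groups())
            changes[region] = (old, new)
    return changes


def tikv_not_leader(lines):
    """Map region id -> store id that TiKV names as the likely leader."""
    hints = {}
    for line in lines:
        m = TIKV_NOT_LEADER.search(line)
        if m:
            region, store = map(int, m.groups())
            hints[region] = store
    return hints


def correlate(pd_lines, tikv_lines):
    """Regions whose not_leader hint equals the target store of a PD transfer."""
    changes = pd_leader_changes(pd_lines)
    hints = tikv_not_leader(tikv_lines)
    return {
        region: changes[region]
        for region, store in hints.items()
        if region in changes and changes[region][1] == store
    }


# Sample lines copied from the logs in this article.
PD_SAMPLE = [
    '[2022/06/28 14:30:55.342 +00:00] [INFO] [cluster.go:567] ["leader changed"] [region-id=2641161] [from=1] [to=5]',
]
TIKV_SAMPLE = [
    r'[2022/06/28 14:31:12.585 +00:00] [WARN] [endpoint.rs:537] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 2641161, leader may Some(id: 2641164 store_id: 5)\" not_leader { region_id: 2641161 leader { id: 2641164 store_id: 5 } }"]',
    r'[2022/06/28 14:31:12.585 +00:00] [WARN] [endpoint.rs:537] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 2641165, leader may Some(id: 2641167 store_id: 4)\" not_leader { region_id: 2641165 leader { id: 2641167 store_id: 4 } }"]',
]

if __name__ == "__main__":
    for region, (old, new) in correlate(PD_SAMPLE, TIKV_SAMPLE).items():
        print(f"region {region}: leader moved from store {old} to store {new}")
```

Only region 2641161 is reported for these samples, because the PD excerpt in this article covers that region alone; the other regions would need the corresponding PD lines to be matched.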
6. Monitoring: cluster_tidb --> kv errors

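III. Conclusion
The alert was triggered by the dm-worker error `invalid connection`. That error was caused by TiDB reporting `wait response is cancelled`, which in turn was caused by locks and backoff in TiKV. As for why the locks and backoff occurred: the PD log shows that PD scheduled a hot-read leader transfer (transfer-hot-read-leader), which is the key cause of the backoff, while the cause of the locks has to be traced in the business SQL.
Official documentation: the lock conflict description document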
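One way to sanity-check the causal chain is to merge the excerpts from all components into a single timeline. The sketch below is a hedged illustration, not an official tool: `parse_ts` and `timeline` are hypothetical helper names, and the two timestamp patterns are assumptions based on the log formats shown in this article (bracketed `YYYY/MM/DD hh:mm:ss.mmm +00:00` for DM, TiDB, TiKV and PD, and ISO-style `...Z` for MySQL).

```python
import re
from datetime import datetime, timezone

# Assumption: timestamp layouts are taken from the excerpts in this article.
TIDB_TS = re.compile(
    r'^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) ([+-]\d{2}):(\d{2})\]'
)
MYSQL_TS = re.compile(r'^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+)Z')


def parse_ts(line):
    """Return the line's timestamp as an aware UTC datetime, or None."""
    m = TIDB_TS.match(line)
    if m:
        stamp = f"{m.group(1)} {m.group(2)}{m.group(3)}"
        parsed = datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S.%f %z")
        return parsed.astimezone(timezone.utc)
    m = MYSQL_TS.match(line)
    if m:
        parsed = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%f")
        return parsed.replace(tzinfo=timezone.utc)
    return None


def timeline(sources):
    """Merge {component: [lines]} into a list of (time, component, line), oldest first."""
    events = []
    for component, lines in sources.items():
        for line in lines:
            ts = parse_ts(line)
            if ts is not None:  # lines without a recognizable timestamp are skipped
                events.append((ts, component, line))
    return sorted(events)


# One line per component, copied from the logs in this article.
SAMPLE = {
    "dm": ['[2022/06/28 14:31:13.364 +00:00] [ERROR] [db.go:201] ["execute statements failed after retry"] [task=task-name] [unit="binlog replication"] [queries="[sql]"] [arguments="[[]]"] [error="[code=10006:class=database:scope=not-set:level=high], Message: execute statement failed: commit, RawCause: invalid connection"]'],
    "mysql": ["2022-06-28T14:31:19.413211Z 28801 [Note] Aborted connection 28801 to db: 'unconnected' user: '***' host: 'ip' (Got an error reading communication packets)"],
    "tidb": ['[2022/06/28 14:31:12.419 +00:00] [WARN] [client_batch.go:638] ["wait response is cancelled"] [to=dm_worker_ip:20160] [cause="context canceled"]'],
    "tikv": ['[2022/06/28 14:31:05.617 +00:00] [WARN] [endpoint.rs:537] [error-response] [err="Key is locked (will clean up) primary_lock: 748000000F000 lock_version: 434222311815512066 key: 748000009725552F000 lock_ttl: 3003 txn_size: 1"]'],
    "pd": ['[2022/06/28 14:30:55.342 +00:00] [INFO] [cluster.go:567] ["leader changed"] [region-id=2641161] [from=1] [to=5]'],
}

if __name__ == "__main__":
    for ts, component, line in timeline(SAMPLE):
        print(ts.strftime("%H:%M:%S.%f")[:-3], component, line[:80])
```

For these five sample lines the order is PD, TiKV, TiDB, DM, MySQL, which is consistent with the chain described in the conclusion. Matching timestamps alone do not prove causation, so the ordering should be read as supporting evidence only.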