当前位置:网站首页>tidb-dm报警DM_sync_process_exists_with_error排查
tidb-dm报警DM_sync_process_exists_with_error排查
2022-07-05 13:57:00 【TiDB社区干货传送门】
作者: db_user 原文来源:https://tidb.net/blog/c84753ca
一、背景
dm同步任务报警DM_sync_process_exists_with_error,一分钟后自动恢复,想着排查一下原因
二、观测日志报错
1.dm日志
[2022/06/28 14:31:13.364 +00:00] [ERROR] [db.go:201] ["execute statements failed after retry"] [task=task-name] [unit="binlog replication"] [queries="[sql]"] [arguments="[[]]"] [error="[code=10006:class=database:scope=not-set:level=high], Message: execute statement failed: commit, RawCause: invalid connection"]2.上游mysql日志
2022-06-28T14:31:19.413211Z 28801 [Note] Aborted connection 28801 to db: 'unconnected' user: '***' host: 'ip' (Got an error reading communication packets)2022-06-28T14:31:22.154980Z 28802 [Note] Aborted connection 28802 to db: 'unconnected' user: '***' host: 'ip' (Got an error reading communication packets)2022-06-28T14:31:32.158508Z 28804 [Note] Start binlog_dump to master_thread_id(28804) slave_server(429505412), pos(mysql-bin-changelog.103037, 36247149)2022-06-28T14:31:32.158739Z 28803 [Note] Start binlog_dump to master_thread_id(28803) slave_server(429505202), pos(mysql-bin-changelog.103037, 40373779)3.下游tidb日志
[2022/06/28 14:31:12.419 +00:00] [WARN] [client_batch.go:638] ["wait response is cancelled"] [to=dm_worker_ip:20160] [cause="context canceled"][2022/06/28 14:31:12.419 +00:00] [WARN] [client_batch.go:638] ["wait response is cancelled"] [to=dm_worker_ip:20160] [cause="context canceled"][2022/06/28 14:31:12.419 +00:00] [WARN] [client_batch.go:638] ["wait response is cancelled"] [to=dm_worker_ip:20160] [cause="context canceled"][2022/06/28 14:31:12.419 +00:00] [WARN] [client_batch.go:638] ["wait response is cancelled"] [to=dm_worker_ip:20160] [cause="context canceled"][2022/06/28 14:31:12.419 +00:00] [WARN] [client_batch.go:638] ["wait response is cancelled"] [to=dm_worker_ip:20160] [cause="context canceled"]4.下游tikv日志
[2022/06/28 14:31:12.585 +00:00] [WARN] [endpoint.rs:537] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 2641161, leader may Some(id: 2641164 store_id: 5)\" not_leader { region_id: 2641161 leader { id: 2641164 store_id: 5 } }"][2022/06/28 14:31:12.585 +00:00] [WARN] [endpoint.rs:537] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 2641165, leader may Some(id: 2641167 store_id: 4)\" not_leader { region_id: 2641165 leader { id: 2641167 store_id: 4 } }"][2022/06/28 14:31:12.585 +00:00] [WARN] [endpoint.rs:537] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 2709997, leader may Some(id: 2709999 store_id: 4)\" not_leader { region_id: 2709997 leader { id: 2709999 store_id: 4 } }"][2022/06/28 14:31:12.585 +00:00] [WARN] [endpoint.rs:537] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 2839445, leader may Some(id: 2839447 store_id: 4)\" not_leader { region_id: 2839445 leader { id: 2839447 store_id: 4 } }"][2022/06/28 14:31:20.400 +00:00] [WARN] [endpoint.rs:537] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 2957169, leader may Some(id: 2957170 store_id: 1)\" not_leader { region_id: 2957169 leader { id: 2957170 store_id: 1 } }"][2022/06/28 14:31:20.400 +00:00] [WARN] [endpoint.rs:537] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 2957169, leader may Some(id: 2957170 store_id: 1)\" not_leader { region_id: 2957169 leader { id: 2957170 store_id: 1 } }"][2022/06/28 14:31:20.400 +00:00] [WARN] [endpoint.rs:537] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 2957169, leader may Some(id: 2957170 store_id: 1)\" not_leader { region_id: 2957169 leader { id: 2957170 store_id: 1 } }"][2022/06/28 14:31:05.617 +00:00] [WARN] [endpoint.rs:537] [error-response] [err="Key is locked (will clean up) primary_lock: 748000000F000 lock_version: 434222311815512066 key: 748000009725552F000 lock_ttl: 3003 txn_size: 1"][2022/06/28 14:31:05.634 +00:00] [WARN] [endpoint.rs:537] [error-response] [err="Key is locked (will clean up) primary_lock: 7480000000092F000 lock_version: 434222311815512092 key: 748000000000 lock_ttl: 3018 txn_size: 5"][2022/06/28 14:31:15.389 +00:00] [ERROR] [kv.rs:931] ["KvService response batch commands fail"][2022/06/28 14:31:15.432 +00:00] [ERROR] [kv.rs:931] ["KvService response batch commands fail"]5.pd日志
[2022/06/28 14:30:55.329 +00:00] [INFO] [operator_controller.go:424] ["add operator"] [region-id=2641161] [operator="\"transfer-hot-read-leader {transfer leader: store 1 to 5} (kind:hot-region,leader, region:2641161(25913,5), createAt:2022-06-28 14:30:55.329497692 +0000 UTC m=+8421773.911777457, startAt:0001-01-01 00:00:00 +0000 UTC, currentStep:0, steps:[transfer leader from store 1 to store 5])\""] ["additional info"=][2022/06/28 14:30:55.329 +00:00] [INFO] [operator_controller.go:620] ["send schedule command"] [region-id=2641161] [step="transfer leader from store 1 to store 5"] [source=create][2022/06/28 14:30:55.342 +00:00] [INFO] [cluster.go:567] ["leader changed"] [region-id=2641161] [from=1] [to=5][2022/06/28 14:30:55.342 +00:00] [INFO] [operator_controller.go:537] ["operator finish"] [region-id=2641161] [takes=12.961676ms] [operator="\"transfer-hot-read-leader {transfer leader: store 1 to 5} (kind:hot-region,leader, region:2641161(25913,5), createAt:2022-06-28 14:30:55.329497692 +0000 UTC m=+8421773.911777457, startAt:2022-06-28 14:30:55.329597613 +0000 UTC m=+8421773.911877386, currentStep:1, steps:[transfer leader from store 1 to store 5]) finished\""] ["additional info"=]6.监控 cluster_tidb --> kv errors

三、结论
可以看到这个报警的引起是由于dm-worker产生报错invalid connection,而这个报错这是由于tidb出现了wait response is cancelled,而tidb出现了这种问题则是由于tikv出现了锁和backoff导致的,至于为什么出现锁和backoff,可以看到pd的日志对hot-read-leader做了调度,这是产生backoff的关键,而lock的原因则要从业务sql中去查找
官方文档:锁冲突描述文档
边栏推荐
- Basic characteristics and isolation level of transactions
- [machine learning notes] several methods of splitting data into training sets and test sets
- Google EventBus 使用详解
- 神经网络物联网未来发展趋势怎么样
- Ueditor + PHP enables Alibaba cloud OSS upload
- [machine learning notes] how to solve over fitting and under fitting
- 研究生可以不用学英语?只要考研英语或六级分数高!
- 深拷贝真难
- [buuctf.reverse] 152-154
- Zibll theme external chain redirection go page beautification tutorial
猜你喜欢
Jetpack compose introduction to mastery

嵌入式软件架构设计-消息交互
![Primary code audit [no dolls (modification)] assessment](/img/b8/82c32e95d1b72f75823ca91c97138e.jpg)
Primary code audit [no dolls (modification)] assessment

那些考研后才知道的事

Xampp configuring multiple items

清大科越冲刺科创板:年营收2亿 拟募资7.5亿

Laravel framework operation error: no application encryption key has been specified
![[machine learning notes] several methods of splitting data into training sets and test sets](/img/f6/eca239bb4b1764a1495ccd9a868ec1.jpg)
[machine learning notes] several methods of splitting data into training sets and test sets

Kotlin协程利用CoroutineContext实现网络请求失败后重试逻辑
Jetpack Compose入门到精通
随机推荐
嵌入式软件架构设计-消息交互
3W原则[通俗易懂]
研究生可以不用学英语?只要考研英语或六级分数高!
About the problem and solution of 403 error in wampserver
蓝桥杯学习2022.7.5(上午)
ZABBIX monitoring
[public class preview]: basis and practice of video quality evaluation
故障分析 | MySQL 耗尽主机内存一例分析
治臻新能源冲刺科创板:年营收2.2亿 上汽创投是股东
[buuctf.reverse] 152-154
upload (1-6)
Blue Bridge Cup study 2022.7.5 (morning)
Basic characteristics and isolation level of transactions
Zhubo Huangyu: it's really bad not to understand these gold frying skills
asp. Net read TXT file
Requests + BS4 crawl Douban top250 movie information
Kotlin collaboration uses coroutinecontext to implement the retry logic after a network request fails
04_solr7.3之solrJ7.3的使用
LeetCode_2(两数相加)
Scientific running robot pancakeswap clip robot latest detailed tutorial