This is original content from the GreatSQL community and may not be used without authorization. To reprint, please contact the editor and credit the source.

MGR is similar to traditional master-slave replication: in day-to-day operation, the main concerns are the running status of each node and whether transactions on the Secondary nodes are delayed. This article describes how to monitor MGR node status, transaction status, and related metrics.

1. Monitoring node status

Query the performance_schema.replication_group_members table to see the state of each MGR node:

mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST  | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+--------------+-------------+--------------+-------------+----------------+
| group_replication_applier | af39db70-6850-11ec-94c9-00155d064000 | 192.168.6.27 |        4306 | ONLINE       | PRIMARY     | 8.0.25         |
| group_replication_applier | b05c0838-6850-11ec-a06b-00155d064000 | 192.168.6.27 |        4307 | ONLINE       | SECONDARY   | 8.0.25         |
| group_replication_applier | b0f86046-6850-11ec-92fe-00155d064000 | 192.168.6.27 |        4308 | ONLINE       | SECONDARY   | 8.0.25         |
+---------------------------+--------------------------------------+--------------+-------------+--------------+-------------+----------------+

The main columns in the output are explained below:

MEMBER_ID: the server_uuid of each node, which uniquely identifies it. On the command line, MEMBER_ID is passed to UDF calls to specify a node.
MEMBER_ROLE: the role of each node. PRIMARY means the node accepts read-write transactions; SECONDARY means it accepts only read-only transactions. If exactly one node is PRIMARY and the rest are SECONDARY, the group is in single-primary mode; if all nodes are PRIMARY, it is in multi-primary mode.
MEMBER_STATE: the state of each node. Possible states include ONLINE, RECOVERING, OFFLINE, ERROR, and UNREACHABLE, described below.
- ONLINE: the node is in a normal state and can provide service.
- RECOVERING: the node is performing distributed recovery and waiting to join the group. At this point it may be copying data from a donor node via clone, or transferring binlogs.
- OFFLINE: the node is currently offline. Note that a node that is just about to join or rejoin the group may also show as OFFLINE for a very brief moment.
- ERROR: the node is in an error state and cannot function as a group member. A node can enter this state during distributed recovery or while applying transactions. A node in the ERROR state cannot take part in the arbitration of group transactions. While a node is joining or rejoining the group, it may also be shown as ERROR before it completes the compatibility checks and becomes an official MGR node.
- UNREACHABLE: when group communication messages time out, the failure detection mechanism marks the node as suspect, meaning it may be unable to connect to the other nodes, for example after an unexpected disconnection. If one node reports other nodes as UNREACHABLE, some nodes may have been partitioned: the group is split into two or more subsets whose members can communicate with each other, while the subsets cannot communicate with one another.

When a node's state is not ONLINE, raise an alert immediately and investigate what happened.
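
The alert rule above can be sketched in a few lines of Python. This is only a minimal sketch: it assumes the rows of performance_schema.replication_group_members have already been fetched into a list of dicts by whatever client library you use, and the function name non_online_members is hypothetical.

```python
def non_online_members(rows):
    """Return (host, port, state) for every member whose state is not ONLINE.

    `rows` is assumed to be a list of dicts keyed by the column names of
    performance_schema.replication_group_members.
    """
    return [
        (row["MEMBER_HOST"], row["MEMBER_PORT"], row["MEMBER_STATE"])
        for row in rows
        if row["MEMBER_STATE"] != "ONLINE"
    ]


# Sample rows modeled on the output above, with one member changed to
# RECOVERING purely for illustration.
members = [
    {"MEMBER_HOST": "192.168.6.27", "MEMBER_PORT": 4306, "MEMBER_STATE": "ONLINE"},
    {"MEMBER_HOST": "192.168.6.27", "MEMBER_PORT": 4307, "MEMBER_STATE": "RECOVERING"},
    {"MEMBER_HOST": "192.168.6.27", "MEMBER_PORT": 4308, "MEMBER_STATE": "ONLINE"},
]

for host, port, state in non_online_members(members):
    print(f"ALERT: {host}:{port} is {state}")
```

A member that has left the group may simply have no row at all, so it is probably worth comparing the number of rows with the expected group size as well.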

When a node's state changes, or when a node joins or leaves the group, the data in performance_schema.replication_group_members is updated. The nodes exchange and share this status information, so it can be queried on any node.

2. Monitoring MGR transaction status

Another important thing to watch is the transaction status of the Secondary nodes, specifically the sizes of the queue of transactions waiting to be certified and the queue of transactions waiting to be applied. Run the following SQL to view them, paying particular attention to whether COUNT_TRANSACTIONS_IN_QUEUE and COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE are large on the non-Primary nodes:

mysql> SELECT MEMBER_ID AS id, COUNT_TRANSACTIONS_IN_QUEUE AS trx_tobe_verified, COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE AS trx_tobe_applied, COUNT_TRANSACTIONS_CHECKED AS trx_chkd, COUNT_TRANSACTIONS_REMOTE_APPLIED AS trx_done, COUNT_TRANSACTIONS_LOCAL_PROPOSED AS proposed FROM performance_schema.replication_group_member_stats;
+--------------------------------------+-------------------+------------------+----------+----------+----------+
| id                                   | trx_tobe_verified | trx_tobe_applied | trx_chkd | trx_done | proposed |
+--------------------------------------+-------------------+------------------+----------+----------+----------+
| 4ebd3504-11d9-11ec-8f92-70b5e873a570 |                 0 |                0 |   422248 |        6 |   422248 |
| 549b92bf-11d9-11ec-88e1-70b5e873a570 |                 0 |           238391 |   422079 |   183692 |        0 |
| 5596116c-11d9-11ec-8624-70b5e873a570 |              2936 |           238519 |   422115 |   183598 |        0 |
| ed5fe7ba-37c2-11ec-8e12-70b5e873a570 |              2976 |           238123 |   422167 |   184044 |        0 |
+--------------------------------------+-------------------+------------------+----------+----------+----------+

Here, COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE is the size of the queue of transactions waiting to be applied, and COUNT_TRANSACTIONS_IN_QUEUE is the size of the queue of transactions waiting to be certified. If either value is greater than 0, there is some degree of delay.

You can also watch how these two values change over time to see whether the two queues are gradually growing or shrinking, and from that judge whether the Primary node is "running too fast" or the Secondary nodes are "running too slowly".
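
Comparing two samples of the same node is enough to tell which way the backlog is moving. The following is a minimal sketch, assuming each sample is one row of performance_schema.replication_group_member_stats held as a dict; the helper names backlog and backlog_trend and the sample values in `later` are hypothetical.

```python
def backlog(row):
    """Return (to_certify, to_apply) queue sizes for one member_stats row."""
    return (
        int(row["COUNT_TRANSACTIONS_IN_QUEUE"]),
        int(row["COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE"]),
    )


def backlog_trend(previous_row, current_row):
    """Classify how one member's total backlog moved between two samples."""
    before = sum(backlog(previous_row))
    after = sum(backlog(current_row))
    if after > before:
        return "growing"
    if after < before:
        return "shrinking"
    return "steady"


# Two samples of the same Secondary taken some time apart. `earlier` uses
# values from the output above; `later` is invented for illustration.
earlier = {"COUNT_TRANSACTIONS_IN_QUEUE": 2936,
           "COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE": 238519}
later = {"COUNT_TRANSACTIONS_IN_QUEUE": 1200,
         "COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE": 251000}

print(backlog(later), backlog_trend(earlier, later))
```

Summing both queues is a simplification; tracking them separately would show whether certification or applying is the bottleneck.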

As an aside, when flow control is enabled, the flow control mechanism is triggered once either of the two values above exceeds its threshold (group_replication_flow_control_applier_threshold and group_replication_flow_control_certifier_threshold, both 25000 by default).
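
The threshold comparison described above can be written as a simple check. This is a hedged sketch rather than the server's actual flow control logic: it only compares the two queue sizes against the default thresholds mentioned in the text, and the function name exceeds_flow_control_thresholds is hypothetical.

```python
def exceeds_flow_control_thresholds(row, certifier_threshold=25000,
                                    applier_threshold=25000):
    """Return True if either queue in a member_stats row is above its threshold.

    The defaults mirror the default value (25000) of
    group_replication_flow_control_certifier_threshold and
    group_replication_flow_control_applier_threshold; pass the values
    actually configured on your servers if they differ.
    """
    to_certify = int(row["COUNT_TRANSACTIONS_IN_QUEUE"])
    to_apply = int(row["COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE"])
    return to_certify > certifier_threshold or to_apply > applier_threshold


# One Secondary row from the output above: 238519 queued for apply.
row = {"COUNT_TRANSACTIONS_IN_QUEUE": 2936,
       "COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE": 238519}
print(exceeds_flow_control_thresholds(row))
```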

3. Other monitoring

In addition, you can judge the delay from the gap between the transactions received and the transactions already executed:

mysql> SELECT RECEIVED_TRANSACTION_SET FROM performance_schema.replication_connection_status WHERE  channel_name = 'group_replication_applier' UNION ALL SELECT variable_value FROM performance_schema.global_variables WHERE  variable_name = 'gtid_executed'\G
*************************** 1. row ***************************
RECEIVED_TRANSACTION_SET: 6cfb873b-573f-11ec-814a-d08e7908bcb1:1-3124520
*************************** 2. row ***************************
RECEIVED_TRANSACTION_SET: 6cfb873b-573f-11ec-814a-d08e7908bcb1:1-3078139

As you can see, the received transaction GTIDs have reached 3124520, while the local node has executed only up to 3078139, a gap of 46381.
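
The subtraction above can be automated on the client side. The sketch below is an assumption-laden illustration: it handles only the plain uuid:start-end GTID format shown in this output, and the function names parse_gtid_set and count_missing are hypothetical. On the server itself, the built-in GTID_SUBTRACT() function is an alternative.

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-5:7,uuid2:1-3' into {uuid: [(start, end), ...]}."""
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        ranges = result.setdefault(uuid.lower(), [])
        for interval in intervals:
            start, _, end = interval.partition("-")
            # A single transaction is written without a dash, e.g. 'uuid:7'.
            ranges.append((int(start), int(end or start)))
    return result


def count_missing(received, executed):
    """Count transactions present in `received` but not in `executed`."""
    recv = parse_gtid_set(received)
    done = parse_gtid_set(executed)
    missing = 0
    for uuid, ranges in recv.items():
        total = sum(end - start + 1 for start, end in ranges)
        # Intervals inside one GTID set do not overlap each other, so
        # pairwise overlaps can simply be added up.
        overlap = sum(
            max(0, min(e1, e2) - max(s1, s2) + 1)
            for s1, e1 in ranges
            for s2, e2 in done.get(uuid, [])
        )
        missing += total - overlap
    return missing


received = "6cfb873b-573f-11ec-814a-d08e7908bcb1:1-3124520"
executed = "6cfb873b-573f-11ec-814a-d08e7908bcb1:1-3078139"
print(count_missing(received, executed))  # 46381, matching the gap above
```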

By the way, you can continue to pay attention to the change of this difference , Estimate whether the local node can level the delay in time , It will still increase the delay .

In addition, when the original Primary node fails and you want to manually choose a node as the new Primary, first determine which node has executed the most transactions, that is, which has the largest executed GTID set. That node should be preferred.
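
One way to make that comparison is to count the transactions in each candidate's gtid_executed value. This is a rough sketch under stated assumptions: the gtid_executed strings have been collected from each surviving node beforehand, they use the plain uuid:start-end format, and the names gtid_count and pick_new_primary are hypothetical. Comparing counts is only a heuristic; it does not prove that one set contains the other.

```python
def gtid_count(gtid_set):
    """Count the transactions in a GTID set such as 'uuid:1-5:7,uuid2:1-3'."""
    total = 0
    for part in gtid_set.replace("\n", "").split(","):
        _uuid, *intervals = part.strip().split(":")
        for interval in intervals:
            start, _, end = interval.partition("-")
            total += int(end or start) - int(start) + 1
    return total


def pick_new_primary(executed_by_node):
    """Return the node whose gtid_executed holds the most transactions."""
    return max(executed_by_node,
               key=lambda node: gtid_count(executed_by_node[node]))


# Hypothetical gtid_executed values gathered from two surviving nodes.
candidates = {
    "192.168.6.27:4307": "6cfb873b-573f-11ec-814a-d08e7908bcb1:1-3078139",
    "192.168.6.27:4308": "6cfb873b-573f-11ec-814a-d08e7908bcb1:1-3124520",
}
print(pick_new_primary(candidates))
```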

4. Summary

This article covered the key points of MGR monitoring: node status and replication delay, and how to predict whether the replication delay will keep growing or be caught up in time.

Disclaimer

Because of the author's limited expertise, this column inevitably contains mistakes and omissions. Do not copy the commands and methods in this document directly into a production environment. Readers must fully understand them and verify them in a test environment before formal use, to avoid damaging the production environment.

Enjoy GreatSQL :)

About GreatSQL

GreatSQL is a MySQL branch maintained by Wanli Database. It focuses on improving the reliability and performance of MGR, supports the InnoDB parallel query feature, and is a MySQL branch suited to financial-grade applications.