
Troubleshooting DataNodes entering the Stale state

2022-06-23 17:04:00 Java OTA

First, why does a DataNode enter the Stale state?
By default, a DataNode sends a heartbeat to the NameNode every 3 seconds. If the NameNode receives no heartbeat for 30 consecutive seconds, it marks the DataNode as Stale; if no heartbeat arrives for roughly another 10 minutes, the DataNode is marked as Dead.
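These intervals come from HDFS configuration. As a reference sketch, here are the relevant hdfs-site.xml properties with their default values; note that the ~10-minute dead timeout is not a single setting, it is derived as 2 × dfs.namenode.heartbeat.recheck-interval + 10 × dfs.heartbeat.interval (about 10.5 minutes with the defaults):

```xml
<!-- hdfs-site.xml: heartbeat / stale related settings (defaults shown) -->
<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value>          <!-- DataNode heartbeat interval, in seconds -->
</property>
<property>
  <name>dfs.namenode.stale.datanode.interval</name>
  <value>30000</value>      <!-- mark a DataNode stale after 30 s without a heartbeat, in ms -->
</property>
<property>
  <name>dfs.namenode.heartbeat.recheck-interval</name>
  <value>300000</value>     <!-- feeds into the dead-node timeout: 2 * this + 10 * heartbeat interval -->
</property>
```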

The NameNode exposes a JMX metric, hadoop_namenode_numstaledatanodes, which reports the number of DataNodes currently in the Stale state. Normally this value should be 0; if it is not, an alert should fire.
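As a sketch, a Prometheus alerting rule for that condition. The metric name assumes a jmx_exporter setup that exports NumStaleDataNodes under this name; adjust the expression to whatever your exporter actually produces:

```yaml
# Prometheus alerting rule (sketch): fire when any NameNode reports stale DataNodes
groups:
  - name: hdfs
    rules:
      - alert: HdfsStaleDataNodes
        expr: hadoop_namenode_numstaledatanodes > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} reports {{ $value }} stale DataNode(s)"
```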

The DataNode exposes a JMX metric, hadoop_datanode_heartbeatstotalnumops, which counts the heartbeats it has sent. Using the Prometheus function increase(hadoop_datanode_heartbeatstotalnumops[1m]), we can plot the number of heartbeats sent per minute.
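The same metric can also be queried the other way around: a PromQL sketch (again assuming the jmx_exporter naming above) that returns the DataNodes which sent no heartbeats at all in the last minute:

```promql
# DataNodes whose heartbeat counter did not increase over the last minute
increase(hadoop_datanode_heartbeatstotalnumops[1m]) == 0
```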

Monitoring showed that some nodes reported 0 heartbeats during certain periods:

[Figure: per-minute heartbeat count dropping to 0 on one node]

Looking at that DataNode's JVM state over the same period, GC was very frequent, up to 90 collections per minute:

[Figure: GC count per minute]
Checking this node's log turned up a warning message:

[Figure: DirectoryScanner warning log]

The relevant code is org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.scan().
Roughly speaking, the DataNode periodically scans the data blocks on disk and checks whether they are consistent with the block information held in memory. It acquires a lock before starting the comparison; when the lock is released at the end, it checks how long the lock was held, and if that exceeds a threshold (300 ms) it prints the warning log.
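To make the mechanism concrete, here is a minimal, self-contained Java sketch of that general pattern (measure how long a lock was held and warn once it exceeds a threshold). It is not the actual Hadoop implementation; the class, method, and constant names below are made up for illustration:

```java
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch: time how long a lock is held during a long scan and
// warn if the hold time exceeds a threshold, similar in spirit to what
// DirectoryScanner.scan() does around its disk-vs-memory block comparison.
public class TimedLockScan {
    private static final long WARN_THRESHOLD_MS = 300;   // warn if held longer than 300 ms
    private final ReentrantLock lock = new ReentrantLock();

    public void scan() {
        lock.lock();
        long acquiredAt = System.currentTimeMillis();
        try {
            compareDiskBlocksWithMemory();                // long-running work done under the lock
        } finally {
            long heldMs = System.currentTimeMillis() - acquiredAt;
            lock.unlock();
            if (heldMs > WARN_THRESHOLD_MS) {
                System.err.printf("Lock held for %d ms, longer than threshold %d ms%n",
                        heldMs, WARN_THRESHOLD_MS);
            }
        }
    }

    private void compareDiskBlocksWithMemory() {
        // placeholder for the expensive disk-vs-memory block comparison
    }
}
```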
Here the lock was held for 36 s, which is far too long. My guess is that the DataNode's storage layout is to blame: only a single large disk is configured as one data directory, it holds a lot of data and therefore a huge number of blocks, so the comparison takes a long time.
This interval also matches exactly the period in which the DataNode missed heartbeats.
Spot-checking several other points in time where heartbeat sending was abnormal, the same warning log shows up every time.
So it is very likely that this is what is blocking heartbeat sending.

There are corresponding upstream issues:

https://www.mail-archive.com/[email protected]/msg43698.html
https://issues.apache.org/jira/browse/HDFS-16013
https://issues.apache.org/jira/browse/HDFS-15415
The problem is fixed in versions 3.2.2, 3.3.1, and 3.4.0. Besides performance optimizations, the key change is removing the lock around the comparison, so even if the scan is time-consuming, the DataNode's health is no longer affected by a long-held lock.

Upgrading is difficult for us, so for now we will keep observing and see whether this situation causes any larger impact.
Besides upgrading, changing the DataNode to use multiple data directories, each backed by a smaller disk, should also help, as sketched below.
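A sketch of that multi-directory layout in hdfs-site.xml; the mount paths are examples, and each entry should point at a separate, smaller physical disk:

```xml
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data1/hdfs/data,/data2/hdfs/data,/data3/hdfs/data</value>
</property>
```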

