Tracking Down Missing DataNode Data Blocks
2022-07-28 12:55:00 【Heaven has love】
The HDFS Data Block Loss Problem
Problem Discovery
While running a Spark job, the submitted code has to be uploaded to HDFS via YARN so that every node can see the jar package.
But at that moment the job failed with an error: the NameNode had entered safe mode. 【Note: in safe mode HDFS is read-only, you can view files but not create them, so the jar upload failed.】 I went to check the cluster and found that HDFS was down with a block loss rate above 90%, and HBase was down too 【no surprise: once HDFS enters safe mode, the HBase Master process also exits】.
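A quick way to confirm this state is `hdfs dfsadmin -safemode get`, which prints a one-line status. A minimal sketch that checks that line; the sample string stands in for the real command's output, and its exact wording is an assumption:

```shell
# `hdfs dfsadmin -safemode get` prints a status line; the sample below
# stands in for the real output (assumed format).
status='Safe mode is ON'   # e.g. status=$(hdfs dfsadmin -safemode get)

if printf '%s' "$status" | grep -q 'ON'; then
  mode=on    # writes (like the jar upload) will fail
else
  mode=off
fi
echo "safe mode: $mode"
```

An administrator can force the NameNode out with `hdfs dfsadmin -safemode leave`, but with 90% of the blocks missing that would only hide the symptom.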
So the question arose: where did the blocks go?
Problem Localization
Since the root cause was the loss of HDFS data blocks, which triggered everything else, HDFS was the only place to look, so the log-tracing journey began. Most of the error logs said the NameNode had entered safe mode, until I found at the source that this NameNode had been switched over from the standby role, and the other NameNode had clearly gone down even earlier.
I switched logs and kept tracing...
Before the errors appeared, the logs looked like this 【these are INFO-level, not ERROR-level】:
org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker: Scheduling a check for /xxx/xxx/dfs/dn
Then came logs like these:
2019-12-05 16:47:40,269 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService: Deleted BP-402578992-132.46.XXX-XX-XXXX881445150 blk_1093729618_19992285 URI file:/xxx/xxx/dfs/dn/current/BP-402578992-132.46.XXX-XX-XXXX881445150/current/finalized/subdir16/subdir29/blk_1093729618
2019-12-05 16:47:40,272 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService: Deleted BP-402578992-132.46.XXX-XX-XXXX881445150 blk_1093333597_19596217 URI file:/xxx/xxx/dfs/dn/current/BP-402578992-132.46.XXX-XX-XXXX881445150/current/finalized/subdir10/subdir18/blk_1093333597
After these come the errors about the NameNode entering safe mode.
This shows that the DataNode ran a self-check and then removed some block information, and there lies the problem.
Recall the HDFS startup process: each DataNode reports its block locations to the NameNode.
So the problem is on the DataNode side. Why did it lose all its blocks?
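To see what the DataNodes actually reported, `hdfs dfsadmin -report` summarizes live and dead nodes. A sketch that pulls the counts out of a saved report; the sample text, and the exact wording of the report, are assumptions to check against your own output:

```shell
# Parse a saved `hdfs dfsadmin -report` (sample text below is
# illustrative; the real report would come from the cluster).
report='Live datanodes (3):

Name: 10.0.0.1:9866 (dn1)
Name: 10.0.0.2:9866 (dn2)
Name: 10.0.0.3:9866 (dn3)

Dead datanodes (1):

Name: 10.0.0.4:9866 (dn4)'

live=$(printf '%s\n' "$report" | sed -n 's/^Live datanodes (\([0-9]*\)).*/\1/p')
dead=$(printf '%s\n' "$report" | sed -n 's/^Dead datanodes (\([0-9]*\)).*/\1/p')
echo "live=$live dead=$dead"
```

In this incident the nodes were alive but reporting far fewer blocks than expected, which points at their data directories rather than the processes.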
Finding the Cause
The cluster had been expanded in recent days, and the new nodes have 9 fewer hard disks than the original nodes, so the configuration for those nodes may have been changed in the wrong place. I checked the CDH management page, and sure enough:
the original cluster had all 10 hard disks configured for HDFS, but now only one directory was left in the configuration. That was the problem.
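In plain hdfs-site.xml terms (CDH exposes this as the DataNode data-directory setting), the property involved is `dfs.datanode.data.dir`, one entry per physical disk. A sketch with illustrative paths, not the cluster's real ones:

```xml
<!-- hdfs-site.xml: one directory per physical disk (paths illustrative).
     Nodes with different disk counts need different role groups / configs;
     listing 10 directories on a 1-disk node, or 1 directory on a 10-disk
     node, reproduces exactly the problems described in this article. -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data01/dfs/dn,/data02/dfs/dn,/data03/dfs/dn</value>
</property>
```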
New Problems Emerge
I went into the management page and changed the setting back, but I changed every node to 10 hard disks 【that is, 10 data directories】. So a new problem appeared: on the new nodes, the directories without a dedicated disk behind them filled up quickly.
So the cluster went into an alarm state again.
A Small Step Forward
Having found the cause, I deleted the extra data directories on the expanded nodes 【only on those machines; the other nodes were untouched】 and restarted the cluster. The errors decreased, but blocks had indeed been lost again: data written between startup and the discovery of the problem had landed in those directories without attached disks, and deleting the directories deleted that data with them.
Assessing the Impact
Blocks were lost again this time, but the loss rate was below 0.01, so the NameNode did not enter safe mode (the fraction of reported blocks stayed above the safe-mode threshold, `dfs.namenode.safemode.threshold-pct`). Data was genuinely lost though: CDH showed 103 missing blocks.
Checking Which Data Was Lost
Redirect the missing-block report to a log file:
# full fsck; filter out the progress dots and the per-replica lines
hdfs fsck / | egrep -v '^\.+$' | grep -v eplica > log
# List the files with corrupt blocks (fsck requires a path argument)
hdfs fsck / -list-corruptfileblocks
# Check the status of a file
hdfs fsck /path/to/corrupt/file -locations -blocks -files
The missing blocks can then be found in the log file.
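The fsck log can also be filtered down to just the affected file paths. A sketch over a fabricated three-line log; the MISSING/CORRUPT markers reflect typical fsck output, but treat the exact format as an assumption and eyeball your own log:

```shell
# Fabricated fsck output; in practice this is the `log` file produced above.
cat > /tmp/fsck_sample.log <<'EOF'
/hbase/data/default/t1/r1/c/f1: MISSING 1 blocks of total size 1024 B.
/hbase/data/default/t1/r2/c/f2: CORRUPT blockpool BP-1 block blk_1
/user/healthy/file: OK
EOF

# Keep only the damaged entries and strip everything after the path.
grep -E 'MISSING|CORRUPT' /tmp/fsck_sample.log | cut -d: -f1 | sort -u > /tmp/bad_files
cat /tmp/bad_files
```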
Online posts claim you can recover the data like this 【don't believe them】. A rant: everyone copies everyone else's article, and the key part gets copied wrong:
hdfs debug recoverLease -path /hbase/data/default/xxx/7f1eb0a88a2f8f960cbe975ec84905a5/r/c5abbd43022b4b688767a1722bb2e4be -retries 10
But in fact those posts are wrong. Does this command really recover the data? Of course not. So what does it do?
As you know, HDFS is lease-based. In plain terms: every time a client issues a read or write request to HDFS it must first acquire a lease, and it releases the lease when it finishes. This command simply releases a lease manually. If you look the command up in the official documentation, it is described in the last part of the page.
Check whether the lost data can be retrieved from its source or regenerated; if it can, load it in again. If it cannot, there is nothing more to do.
Once you have confirmed which data is lost, delete the broken entries; keeping them is useless. Of course, if your disks can be recovered, then not deleting is another story.
The delete command:
# /xxx/xxx is the corresponding path, e.g. /hbase/data/default/test/region01/c/uuid1
hdfs fsck /xxx/xxx -delete
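With the list of damaged files in hand, the deletion can be scripted rather than typed path by path. A dry-run sketch with a fabricated file list; flip DRY_RUN only after double-checking, since `hdfs fsck <path> -delete` is irreversible:

```shell
DRY_RUN=true   # set to false only after double-checking the list

# Fabricated list; in practice read it from the filtered fsck log.
bad_files='/hbase/data/default/t1/r1/c/f1
/hbase/data/default/t1/r2/c/f2'

cmds=$(printf '%s\n' "$bad_files" | while IFS= read -r f; do
  if "$DRY_RUN"; then
    echo "hdfs fsck $f -delete"    # just print what would run
  else
    hdfs fsck "$f" -delete
  fi
done)
printf '%s\n' "$cmds"
```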
Impact on HBase After the Repair
The missing blocks correspond to StoreFiles in HBase. HBase's metadata was unaffected, so it can still locate every region; but if what a region lost was a StoreFile, then some of that region's data is gone, namely the data that StoreFile held.
Of course, if an entire region were lost rather than a single StoreFile, that would be a different story.