Tracking Down Missing DataNode Data Blocks
2022-07-28 12:55:00 【Heaven has love】
The HDFS Data Block Loss Problem
Problem Discovery
While running a Spark job, the submitted code has to be uploaded to HDFS via YARN so that every node can see the jar package.
But at that moment the job failed with an error: the NameNode had entered safe mode. 【Note: in safe mode HDFS is read-only, you can view files but not create them, so the jar upload failed.】 I went to check the cluster and found that HDFS was down with a block loss rate above 90%, and HBase was down too 【no surprise: once HDFS enters safe mode, the HBase Master process also exits】.
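A quick way to confirm this state is `hdfs dfsadmin -safemode get`, which prints a one-line status. A minimal sketch that checks that line; the sample string stands in for the real command's output, and its exact wording is an assumption:

```shell
# `hdfs dfsadmin -safemode get` prints a status line; the sample below
# stands in for the real output (assumed format).
status='Safe mode is ON'   # e.g. status=$(hdfs dfsadmin -safemode get)

if printf '%s' "$status" | grep -q 'ON'; then
  mode=on    # writes (like the jar upload) will fail
else
  mode=off
fi
echo "safe mode: $mode"
```

An administrator can force the NameNode out with `hdfs dfsadmin -safemode leave`, but with 90% of the blocks missing that would only hide the symptom.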
So the question arose: where did the blocks go?
Problem Localization
Since the root cause was the loss of HDFS data blocks, which triggered everything else, HDFS was the only place to look, so the log-tracing journey began. Most of the error logs said the NameNode had entered safe mode, until I found at the source that this NameNode had been switched over from the standby role, and the other NameNode had clearly gone down even earlier.
I switched logs and kept tracing...
Before the errors appeared, the logs looked like this 【these are INFO-level, not ERROR-level】:
org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker: Scheduling a check for /xxx/xxx/dfs/dn
Then came logs like these:
2019-12-05 16:47:40,269 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService: Deleted BP-402578992-132.46.XXX-XX-XXXX881445150 blk_1093729618_19992285 URI file:/xxx/xxx/dfs/dn/current/BP-402578992-132.46.XXX-XX-XXXX881445150/current/finalized/subdir16/subdir29/blk_1093729618
2019-12-05 16:47:40,272 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService: Deleted BP-402578992-132.46.XXX-XX-XXXX881445150 blk_1093333597_19596217 URI file:/xxx/xxx/dfs/dn/current/BP-402578992-132.46.XXX-XX-XXXX881445150/current/finalized/subdir10/subdir18/blk_1093333597
After these come the errors about the NameNode entering safe mode.
This shows that the DataNode ran a self-check and then removed some block information, and there lies the problem.
Recall the HDFS startup process: each DataNode reports its block locations to the NameNode.
So the problem is on the DataNode side. Why did it lose all its blocks?
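To see what the DataNodes actually reported, `hdfs dfsadmin -report` summarizes live and dead nodes. A sketch that pulls the counts out of a saved report; the sample text, and the exact wording of the report, are assumptions to check against your own output:

```shell
# Parse a saved `hdfs dfsadmin -report` (sample text below is
# illustrative; the real report would come from the cluster).
report='Live datanodes (3):

Name: 10.0.0.1:9866 (dn1)
Name: 10.0.0.2:9866 (dn2)
Name: 10.0.0.3:9866 (dn3)

Dead datanodes (1):

Name: 10.0.0.4:9866 (dn4)'

live=$(printf '%s\n' "$report" | sed -n 's/^Live datanodes (\([0-9]*\)).*/\1/p')
dead=$(printf '%s\n' "$report" | sed -n 's/^Dead datanodes (\([0-9]*\)).*/\1/p')
echo "live=$live dead=$dead"
```

In this incident the nodes were alive but reporting far fewer blocks than expected, which points at their data directories rather than the processes.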
Finding the Cause
The cluster had been expanded in recent days, and the new nodes have 9 fewer hard disks than the original nodes, so the configuration for those nodes may have been changed in the wrong place. I checked the CDH management page, and sure enough:
the original cluster had all 10 hard disks configured for HDFS, but now only one directory was left in the configuration. That was the problem.
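In plain hdfs-site.xml terms (CDH exposes this as the DataNode data-directory setting), the property involved is `dfs.datanode.data.dir`, one entry per physical disk. A sketch with illustrative paths, not the cluster's real ones:

```xml
<!-- hdfs-site.xml: one directory per physical disk (paths illustrative).
     Nodes with different disk counts need different role groups / configs;
     listing 10 directories on a 1-disk node, or 1 directory on a 10-disk
     node, reproduces exactly the problems described in this article. -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data01/dfs/dn,/data02/dfs/dn,/data03/dfs/dn</value>
</property>
```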
New Problems Emerge
I went into the management page and changed the setting back, but I changed every node to 10 hard disks 【that is, 10 data directories】. So a new problem appeared: on the new nodes, the directories without a dedicated disk behind them filled up quickly.
So the cluster went into an alarm state again.
A Small Step Forward
Having found the cause, I deleted the extra data directories on the expanded nodes 【only on those machines; the other nodes were untouched】 and restarted the cluster. The errors decreased, but blocks had indeed been lost again: data written between startup and the discovery of the problem had landed in those directories without attached disks, and deleting the directories deleted that data with them.
Assessing the Impact
Blocks were lost again this time, but the loss rate was below 0.01, so the NameNode did not enter safe mode (the fraction of reported blocks stayed above the safe-mode threshold, `dfs.namenode.safemode.threshold-pct`). Data was genuinely lost though: CDH showed 103 missing blocks.
Checking Which Data Was Lost
Redirect the missing-block report to a log file:
# full fsck; filter out the progress dots and the per-replica lines
hdfs fsck / | egrep -v '^\.+$' | grep -v eplica > log
# List the files with corrupt blocks (fsck requires a path argument)
hdfs fsck / -list-corruptfileblocks
# Check the status of a file
hdfs fsck /path/to/corrupt/file -locations -blocks -files
The missing blocks can then be found in the log file.
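The fsck log can also be filtered down to just the affected file paths. A sketch over a fabricated three-line log; the MISSING/CORRUPT markers reflect typical fsck output, but treat the exact format as an assumption and eyeball your own log:

```shell
# Fabricated fsck output; in practice this is the `log` file produced above.
cat > /tmp/fsck_sample.log <<'EOF'
/hbase/data/default/t1/r1/c/f1: MISSING 1 blocks of total size 1024 B.
/hbase/data/default/t1/r2/c/f2: CORRUPT blockpool BP-1 block blk_1
/user/healthy/file: OK
EOF

# Keep only the damaged entries and strip everything after the path.
grep -E 'MISSING|CORRUPT' /tmp/fsck_sample.log | cut -d: -f1 | sort -u > /tmp/bad_files
cat /tmp/bad_files
```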
Online posts claim you can recover the data like this 【don't believe them】. A rant: everyone copies everyone else's article, and the key part gets copied wrong:
hdfs debug recoverLease -path /hbase/data/default/xxx/7f1eb0a88a2f8f960cbe975ec84905a5/r/c5abbd43022b4b688767a1722bb2e4be -retries 10
But in fact those posts are wrong. Does this command really recover the data? Of course not. So what does it do?
As you know, HDFS is lease-based. In plain terms: every time a client issues a read or write request to HDFS it must first acquire a lease, and it releases the lease when it finishes. This command simply releases a lease manually. If you look the command up in the official documentation, it is described in the last part of the page.
Check whether the lost data can be retrieved from its source or regenerated; if it can, load it in again. If it cannot, there is nothing more to do.
Once you have confirmed which data is lost, delete the broken entries; keeping them is useless. Of course, if your disks can be recovered, then not deleting is another story.
The delete command:
# /xxx/xxx is the corresponding path, e.g. /hbase/data/default/test/region01/c/uuid1
hdfs fsck /xxx/xxx -delete
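With the list of damaged files in hand, the deletion can be scripted rather than typed path by path. A dry-run sketch with a fabricated file list; flip DRY_RUN only after double-checking, since `hdfs fsck <path> -delete` is irreversible:

```shell
DRY_RUN=true   # set to false only after double-checking the list

# Fabricated list; in practice read it from the filtered fsck log.
bad_files='/hbase/data/default/t1/r1/c/f1
/hbase/data/default/t1/r2/c/f2'

cmds=$(printf '%s\n' "$bad_files" | while IFS= read -r f; do
  if "$DRY_RUN"; then
    echo "hdfs fsck $f -delete"    # just print what would run
  else
    hdfs fsck "$f" -delete
  fi
done)
printf '%s\n' "$cmds"
```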
Impact on HBase After the Repair
The missing blocks correspond to StoreFiles in HBase. HBase's metadata was unaffected, so it can still locate every region; but if what a region lost was a StoreFile, then some of that region's data is gone, namely the data that StoreFile held.
Of course, if an entire region were lost rather than a single StoreFile, that would be a different story.