August 24, 2020: What are small files? What problems does a large number of small files cause? How do you handle many small files? (Big Data)
2020-11-06 21:50:00 【Fuda Dajia architect's daily question】
Fogo's answer, 2020-08-24:
1. Small files:
Small files are files whose size is significantly smaller than the HDFS block size (64 MB by default, and 128 MB in Hadoop 2.x).
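As an aside (not part of the original answer), here is a minimal Java sketch that walks an HDFS directory and counts how many files fall far below the block size; the default directory /data and the "quarter of a block" cutoff are arbitrary assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class SmallFileScan {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path(args.length > 0 ? args[0] : "/data"); // hypothetical directory
        long blockSize = fs.getDefaultBlockSize(dir);             // 128 MB on a default Hadoop 2.x cluster
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(dir, true); // recursive listing
        long small = 0, total = 0;
        while (it.hasNext()) {
            LocatedFileStatus st = it.next();
            total++;
            if (st.getLen() < blockSize / 4) { // arbitrary "much smaller than a block" cutoff
                small++;
            }
        }
        System.out.printf("%d of %d files are well below the %d-byte block size%n",
                small, total, blockSize);
    }
}
```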
2. Problems caused by small files:
The small file problem on HDFS:
(1) Every file, directory, and block in HDFS is represented as an object (metadata) in the NameNode's memory, so the NameNode's physical memory caps how much metadata the cluster can hold. Each metadata object takes roughly 150 bytes, so with 10 million small files, each occupying one block, the NameNode needs on the order of 3 GB of memory; 100 million files push that to around 30 GB (a rough estimate is sketched in the code after this list). Keeping 100 million small files is clearly inadvisable.
(2) Handling small files is not what Hadoop was designed for; HDFS is built for streaming access to large data sets (terabyte scale). Storing a large number of small files in HDFS is therefore inefficient: reading them causes a great many seeks and constant hopping from DataNode to DataNode to fetch each small file, which is a very ineffective access pattern and seriously hurts performance.
(3) Processing a large number of small files is far slower than processing the same amount of data stored in large files. Each small file occupies its own slot, and task startup is expensive, so much of the time, even most of it, is spent starting and releasing tasks.
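To make the memory figures in (1) concrete, here is a back-of-the-envelope estimate in Java, using the ~150 bytes per metadata object mentioned above and assuming one block per file:

```java
// Rough NameNode heap estimate: one file object plus one block object per small file,
// each costing roughly 150 bytes of NameNode memory.
public class NameNodeMemoryEstimate {
    static long estimateBytes(long files, long blocksPerFile) {
        long objects = files + files * blocksPerFile;
        return objects * 150L;
    }

    public static void main(String[] args) {
        System.out.printf("10 million small files  -> ~%.1f GB%n", estimateBytes(10_000_000L, 1) / 1e9);
        System.out.printf("100 million small files -> ~%.1f GB%n", estimateBytes(100_000_000L, 1) / 1e9);
    }
}
```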
The small file problem in MapReduce:
A map task normally processes one block of input at a time (with the default FileInputFormat). If the files are very small and there are many of them, each map task handles only a tiny amount of input, a huge number of map tasks are spawned, and every one of them adds bookkeeping overhead.
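The original answer does not cover MapReduce-side mitigations, but a common way to tame the map-task explosion is to pack many small files into one input split with CombineTextInputFormat. A minimal sketch, with the mapper and reducer left at their identity defaults and input/output paths taken from the command line:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFileJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "small-file-job");
        job.setJarByClass(SmallFileJob.class);
        // One split may now span many small files, up to 128 MB of input per map task.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024L);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```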
Why so many small files arise:
Large numbers of small files are produced in at least two scenarios:
(1) The small files are pieces of one larger logical file. Because HDFS only began to support appending to files in the 2.x releases, the common way to save an unbounded file (for example, a log file) before that was to write the data to HDFS in chunks, each chunk becoming its own file.
(2) The files are inherently small. For example, in a large image corpus every picture is a separate file, and there is no good way to merge them into one large file.
Solutions:
The two situations call for different solutions:
(1) For the first case, where a file is made up of many records, you can call HDFS's sync() method (in combination with append) so that a large file is rolled up at regular intervals. Alternatively, you can write a MapReduce program that merges the small files.
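In today's HDFS client API, the sync() mentioned above corresponds to hflush()/hsync() on FSDataOutputStream. A minimal sketch of the append-and-flush idea, assuming a hypothetical path /logs/app.log on a cluster with append enabled:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendingLogWriter {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path log = new Path("/logs/app.log"); // hypothetical path
        // Reopen the existing file for append instead of creating yet another small file.
        FSDataOutputStream out = fs.exists(log) ? fs.append(log) : fs.create(log);
        for (int i = 0; i < 1000; i++) {
            out.writeBytes("record " + i + "\n");
            if (i % 100 == 0) {
                out.hflush(); // modern replacement for sync(); hsync() also forces data to disk
            }
        }
        out.close();
        fs.close();
    }
}
```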
(2) For the second case, some kind of container is needed to group the files. Hadoop offers several options:
① Use HAR files. Hadoop Archives (HAR files) were introduced into HDFS in release 0.18.0 to ease the pressure that large numbers of small files put on NameNode memory. A HAR file works by building a layered file system on top of HDFS. HAR files are created with the hadoop archive command, which actually runs a MapReduce job to pack the small files into a small number of HDFS files. For clients nothing changes: all the original files remain visible and accessible (just through har:// URLs instead of hdfs:// URLs), while the number of files stored in HDFS shrinks.
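For example (the paths and archive name here are hypothetical), creating an archive and then browsing it through the har:// scheme looks like this:

```sh
# Pack everything under /user/logs/2020-08 into one HAR (this launches a MapReduce job),
# then list the archived files through the har:// scheme.
hadoop archive -archiveName logs-2020-08.har -p /user/logs 2020-08 /user/archives
hdfs dfs -ls -R har:///user/archives/logs-2020-08.har
```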
② Use SequenceFile storage, with the file name as the key and the file contents as the value. This works very well in practice. For instance, given the problem of 10,000 small files of roughly 100 KB each, you can write a program that merges them into a single SequenceFile and then process that SequenceFile in streaming fashion (directly, or with MapReduce).
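A minimal sketch of such a merge program, assuming the small files sit in a local directory passed on the command line and writing to a hypothetical output path /data/merged.seq (compression options omitted):

```java
import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilesToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("/data/merged.seq"); // hypothetical output path on HDFS
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (File f : new File(args[0]).listFiles()) { // local directory of small files
                if (!f.isFile()) {
                    continue;
                }
                byte[] content = Files.readAllBytes(f.toPath());
                // File name as key, file contents as value, exactly as described above.
                writer.append(new Text(f.getName()), new BytesWritable(content));
            }
        }
    }
}
```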
③ Use HBase. If you produce very many small files, then depending on the access pattern a different kind of storage may be more appropriate. HBase stores its data in MapFiles (indexed SequenceFiles), which makes it a good choice when you need MapReduce-style streaming analysis with the occasional random lookup.
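A sketch of the write path (not spelled out in the original answer): storing one small file in HBase with its name as the row key, assuming a pre-created table small_files with a column family f. HBase cells are expected to stay small (the client enforces a maximum key-value size by default), which suits the small-file scenario.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SmallFileToHBase {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("small_files"))) {
            byte[] content = Files.readAllBytes(Paths.get(args[0]));
            Put put = new Put(Bytes.toBytes(args[0])); // row key = file name/path
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("data"), content);
            table.put(put);
        }
    }
}
```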