August 24, 2020: What are small files? What problems do many small files cause? How can the small-file problem be solved? (big data)
2020-11-06 21:50:00 【Fuda Dajia architect's daily question】
Fogo's answer 2020-08-24:
1. Small files:
A small file is one whose size is significantly smaller than the HDFS block size (64 MB by default; 128 MB in Hadoop 2.x).
2. Problems with small files:
Problems on the HDFS side:
(1) Every file, directory, and block in HDFS is represented as an object (metadata) in the NameNode's memory, so the total is bounded by the NameNode's physical memory. Each metadata object takes roughly 150 bytes. If there are 10 million small files, each occupying its own block, the NameNode needs about 2 GB of memory; storing 100 million files would need about 20 GB. Keeping 100 million small files is clearly not advisable.
(2) Handling small files is not what Hadoop was designed for: HDFS is designed for streaming access to large data sets (TB scale). Storing a large number of small files in HDFS is therefore inefficient. Reading them causes a large number of seeks and constant hopping from DataNode to DataNode to retrieve each file, which is a very inefficient access pattern and seriously hurts performance.
(3) Processing a large number of small files is far slower than processing the same volume of data stored in large files. Each small file occupies a slot, and task startup is expensive; much, even most, of the time can be spent starting and tearing down tasks.
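The metadata arithmetic in (1) can be sketched as a quick back-of-the-envelope calculation. This is a minimal model, not exact NameNode accounting: the 150-byte figure comes from the text, and the one-object-per-file-plus-one-per-block count is an assumption (which is why it lands near, not exactly on, the ~2 GB figure above).

```python
# Rough NameNode heap estimate: each file, directory, or block is one
# metadata object of ~150 bytes (the figure quoted above).
BYTES_PER_OBJECT = 150

def namenode_heap_bytes(num_files, blocks_per_file=1):
    # One object for the file itself plus one per block.
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

# 10 million single-block files: ~3 GB in this simple model
# (the article's ~2 GB figure uses a slightly leaner per-file count).
print(namenode_heap_bytes(10_000_000) / 1024**3)
```

Whatever the exact constant, the memory cost grows with the number of objects, not the number of bytes stored, which is why many small files hurt.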
Problems on the MapReduce side:
A map task usually processes one block of input at a time (one input split). If the files are very small and there are many of them, each map task processes only a tiny amount of data, and a huge number of map tasks is spawned, each one adding bookkeeping overhead.
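To make the overhead concrete, here is an illustrative calculation comparing the number of map tasks spawned for roughly 1 GB of data stored as one file versus as 10,000 small files. The block size and file sizes are assumptions chosen for the example; real split planning is more involved.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # Hadoop 2.x default block size, 128 MB

def map_tasks(file_size_bytes, num_files):
    # One map task per block, and every file occupies at least one block.
    return num_files * max(1, math.ceil(file_size_bytes / BLOCK_SIZE))

one_big = map_tasks(1024 ** 3, 1)           # one 1 GB file
many_small = map_tasks(100 * 1024, 10_000)  # 10,000 files of 100 KB each
print(one_big, many_small)  # 8 vs 10000
```

Same order of magnitude of data, three orders of magnitude more task startups.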
Why so many small files are produced
At least two scenarios generate large numbers of small files:
(1) The small files are all pieces of one large logical file. Because HDFS has only supported appending to files since version 2.x, a common way to persist an unbounded file (for example, a log file) before that was to write the data to HDFS in chunks, producing many small files.
(2) The files are inherently small. For example, in a large image corpus every picture is a separate file, and there is no natural way to merge them into one big file.
Solutions
The two cases call for different solutions:
(1) For the first case, where one logical file consists of many records, you can call HDFS's sync() method (used together with append) to periodically produce a large file. Alternatively, you can write a MapReduce job that merges the small files.
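The merge idea can be sketched with plain local file I/O. This is a hypothetical stand-in for the HDFS client API: a real job would read and write through the Hadoop FileSystem interface or run as MapReduce, but the concatenation logic is the same.

```python
import os
import tempfile

def merge_small_files(paths, merged_path):
    # Concatenate many small record files into one large file --
    # the same idea a merge job applies at HDFS scale.
    with open(merged_path, "wb") as out:
        for p in paths:
            with open(p, "rb") as src:
                out.write(src.read())

# Demo: merge three tiny "log chunks" into one file.
tmp = tempfile.mkdtemp()
chunks = []
for i in range(3):
    p = os.path.join(tmp, f"chunk-{i}.log")
    with open(p, "wb") as f:
        f.write(f"record {i}\n".encode())
    chunks.append(p)

merged = os.path.join(tmp, "merged.log")
merge_small_files(chunks, merged)
print(open(merged).read())
```

After the merge, the NameNode tracks one file instead of many, at the cost of needing record boundaries (or an index) to locate individual chunks again.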
(2) For the second case, you need some kind of container to group the files. Hadoop offers several options:
① Use HAR files. Hadoop Archives (HAR files) were introduced into HDFS in release 0.18.0 to relieve the pressure that large numbers of small files put on NameNode memory. A HAR file works by building a layered file system on top of HDFS. It is created with the hadoop archive command, which actually runs a MapReduce job to pack the small files into a small number of HDFS files. To the client nothing changes: all the original files remain visible and accessible (just via har:// URLs instead of hdfs:// URLs), while the number of files stored in HDFS is reduced.
② Use SequenceFile storage, with the file name as the key and the file contents as the value. This works very well in practice. For example, for 10,000 small files of 100 KB each, you can write a program that merges them into a single SequenceFile, which can then be processed as a stream (or fed directly to MapReduce).
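Conceptually, a SequenceFile is a flat sequence of key/value records. The dict-based sketch below only illustrates the (file name, file contents) layout a merging program would produce; a real SequenceFile is a binary Hadoop format written with org.apache.hadoop.io.SequenceFile.Writer, not a Python structure.

```python
def pack_as_records(files):
    # files: {filename: bytes}. Produce the (key, value) record stream
    # that a small-file-merging job would write into one SequenceFile.
    return [(name, data) for name, data in sorted(files.items())]

records = pack_as_records({
    "img_002.png": b"<png bytes>",
    "img_001.png": b"<png bytes>",
})
print([k for k, _ in records])  # ['img_001.png', 'img_002.png']
```

Ten thousand 100 KB images become one container file with 10,000 records: one NameNode object instead of ten thousand, while each image stays individually addressable by its key.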
③ Use HBase. If you produce masses of small files, the right store depends on the access pattern. HBase keeps its data in MapFiles (indexed SequenceFiles), so it is a good choice when you need random access and also occasionally run MapReduce-style streaming analysis.
Copyright notice
This article was written by [Fuda Dajia architect's daily question]; please include a link to the original when reposting. Thanks.