August 24, 2020: What are small files? What problems do many small files cause? How can the small-file problem be solved? (big data)
2020-11-06 21:50:00 【Fuda Dajia architect's daily question】
Fogo's answer 2020-08-24:
Know the answer
1. Small files:
Small files are files whose size is significantly smaller than the HDFS block size (64 MB by default; 128 MB by default in Hadoop 2.x).
2. The small-file problem:
The small-file problem in HDFS:
(1) Every file, directory, and block in HDFS is represented as an object (metadata) in the NameNode's memory, so the total is limited by the NameNode's physical memory. Each metadata object takes about 150 bytes. With 10 million small files, each occupying one block, there are about 20 million objects (one file object and one block object per file), so the NameNode needs about 3 GB of memory. Storing 100 million files would take about 30 GB. Clearly, keeping 100 million small files is not advisable.
(2) Handling small files is not a design goal of Hadoop. HDFS is designed for streaming access to large data sets (TB scale), so storing a large number of small files in it is inefficient. Reading many small files causes a large number of seeks and constant hopping from DataNode to DataNode to retrieve each file. This is an ineffective access pattern and seriously hurts performance.
(3) Processing a large number of small files is much slower than processing large files with the same total size. Each small file occupies a slot, and starting tasks is expensive, so most of the time can go to starting and releasing tasks.
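The memory estimate in point (1) can be reproduced with a few lines of arithmetic. This is a rough sketch only: it assumes about 150 bytes per metadata object, counts one file object plus one object per block, and ignores directories. The function names are made up for illustration. Under these assumptions, 10 million single-block files come to about 3 GB; the figure shifts with what is counted per file.

```python
# Rough NameNode memory estimate for a given number of small files.
# Assumption: about 150 bytes per metadata object; directories are ignored.
BYTES_PER_OBJECT = 150


def namenode_memory_bytes(num_files, blocks_per_file=1):
    """Estimate metadata memory: one file object plus one object per block."""
    objects_per_file = 1 + blocks_per_file
    return num_files * objects_per_file * BYTES_PER_OBJECT


def to_gb(num_bytes):
    """Convert bytes to decimal gigabytes (10**9 bytes)."""
    return num_bytes / 10**9


if __name__ == "__main__":
    for n in (10_000_000, 100_000_000):
        print(f"{n:>11,} files -> ~{to_gb(namenode_memory_bytes(n)):.1f} GB")
```

The estimate grows linearly with the file count, which is why reducing the number of files matters more than reducing the total data volume.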
The small-file problem in MapReduce:
A map task typically processes one block of input at a time (one input split). If the files are very small and numerous, each map task handles very little input data and a large number of map tasks are created, each adding its own bookkeeping overhead.
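To see how quickly the task count grows, the sketch below counts map tasks under a simplified model: one task per block, and no split spanning more than one file. The 128 MB block size is the Hadoop 2.x default mentioned earlier; the function name and the example file sizes are assumptions chosen for illustration.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the Hadoop 2.x default block size


def map_task_count(file_sizes, block_size=BLOCK_SIZE):
    """Count map tasks assuming one task per block and no split spanning files."""
    return sum(max(1, math.ceil(size / block_size)) for size in file_sizes)


# 10,000 files of 100 KB each versus the same bytes stored as one large file.
small_files = [100 * 1024] * 10_000
one_big_file = [sum(small_files)]

print(map_task_count(small_files))   # 10000
print(map_task_count(one_big_file))  # 8
```

The same amount of data needs 10,000 tasks in one layout and 8 in the other, so the per-task startup cost dominates in the small-file case.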
Why there are so many small files
A large number of small files is produced in at least two scenarios:
(1) The small files are all pieces of one large logical file. Because HDFS has only supported appending to files since version 2.x, a common way to save unbounded files (for example, log files) before that was to write the data to HDFS in chunks.
(2) The files themselves are small. For example, in a large image corpus each image is a separate file, and there is no good way to merge them into one large file.
Solution
The two scenarios need different solutions:
(1) In the first case, where a file is made up of many records, you can call the HDFS sync() method (used together with append) so that a large file is generated at regular intervals. Alternatively, you can write a MapReduce program that merges the small files.
(2) In the second case, you need a container that groups the files in some way. Hadoop offers several options:
① Use HAR files. Hadoop Archives (HAR files) were introduced into HDFS in version 0.18.0 to ease the pressure that large numbers of small files put on NameNode memory. HAR files work by building a layered file system on top of HDFS. A HAR file is created with the hadoop archive command, which actually runs a MapReduce job to pack the small files into a small number of HDFS files. For the client, nothing changes when using the HAR file system: all the original files remain visible and accessible (using a har:// URL instead of an hdfs:// URL), but the number of files in HDFS is reduced.
② Use SequenceFile storage, with the file name as the key and the file contents as the value. This works very well in practice. For example, given 10,000 small files of 100 KB each, you can write a program that merges them into a single SequenceFile and then process that SequenceFile in a streaming fashion (directly or with MapReduce).
③ Use HBase. If you produce many small files, a different type of storage may be appropriate, depending on the access pattern. HBase stores data in MapFiles (indexed SequenceFiles), so it is a good choice if you need random access alongside MapReduce-style streaming analysis.
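The container idea behind option ② (file name as key, file contents as value) can be sketched in plain Python. This is not the real SequenceFile format and does not use any Hadoop API: the record layout (two 4-byte lengths followed by key and value) and the function names pack_files and iter_records are hypothetical, chosen only to show how many small files collapse into one file that can be read back as a stream.

```python
import struct
from pathlib import Path


def pack_files(paths, container_path):
    """Write each file as a (name, content) record into one container file."""
    with open(container_path, "wb") as out:
        for path in paths:
            key = Path(path).name.encode("utf-8")
            value = Path(path).read_bytes()
            # Hypothetical record layout: key length, value length, key, value.
            out.write(struct.pack(">II", len(key), len(value)))
            out.write(key)
            out.write(value)


def iter_records(container_path):
    """Stream (name, content) pairs back out of the container, in write order."""
    with open(container_path, "rb") as src:
        while True:
            header = src.read(8)
            if not header:
                break  # clean end of file
            key_len, value_len = struct.unpack(">II", header)
            key = src.read(key_len).decode("utf-8")
            value = src.read(value_len)
            yield key, value
```

Calling pack_files on a list of small files produces a single container, and iterating over iter_records yields the original names and contents one record at a time without loading the whole container into memory. A real SequenceFile adds features this sketch omits, such as sync markers and compression.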
Copyright notice
This article was written by [Fuda Dajia architect's daily question]. Please include a link to the original when reposting. Thank you.