
August 24, 2020: What are small files? What problems do large numbers of small files cause? How do you deal with many small files? (big data)

2020-11-06 21:50:00 Fuda Dajia architect's daily question

Fogo's answer, 2020-08-24:
1. Small files:
A small file is a file whose size is significantly smaller than the HDFS block size (64 MB by default, 128 MB by default in Hadoop 2.x).

2. Problems with small files:
Small file problems in HDFS:
(1) Every file, directory, and block in HDFS is represented as an object (metadata) in the NameNode's memory, so the namespace is limited by the NameNode's physical memory. Each metadata object takes roughly 150 bytes as a rule of thumb, so with 10 million small files, each occupying its own block, the NameNode needs about 2 GB of memory; storing 100 million files would push that to about 20 GB. Keeping 100 million small files in HDFS is clearly not advisable.
(2) Handling small files is not what Hadoop was designed for; HDFS is built for streaming access to large data sets (on the TB scale). Storing a large number of small files in HDFS is therefore inefficient: reading them causes a large number of seeks and constant hopping from DataNode to DataNode to retrieve each file, which is a very inefficient access pattern and seriously hurts performance.
(3) Processing a large number of small files is far slower than processing a single large file of the same total size. Each small file occupies a task slot, and task startup is expensive; often most of the time is spent starting and tearing down tasks.

Small file problems in MapReduce:
A map task usually processes one block of input at a time. If the files are very small and there are many of them, each map task processes only a tiny amount of data, the job launches a very large number of map tasks, and every extra map task adds bookkeeping overhead.
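To make this concrete, one common mitigation on the MapReduce side (not mentioned in the original answer) is to pack many small files into each input split with CombineTextInputFormat, so that the number of map tasks no longer equals the number of files. A minimal driver sketch follows; the input and output directories are hypothetical placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFileJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "small-file-job");
        job.setJarByClass(SmallFileJobDriver.class);

        // Pack many small files into each split instead of one map task per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at roughly one HDFS block (128 MB here).
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path("/data/small_files")); // hypothetical input dir
        FileOutputFormat.setOutputPath(job, new Path("/data/out"));       // hypothetical output dir

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```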

  1. Why so many small files appear
    Large numbers of small files arise in at least two scenarios:
    (1) The small files are all pieces of one larger logical file. Because HDFS only gained support for appending to files in the 2.x line, a common way to store unbounded files (such as log files) before that was to write the data to HDFS in chunks, producing many small files.
    (2) The files are inherently small. For example, in a large image corpus every picture is a separate file, and there is no natural way to merge them into one larger file.

  2. Solutions
    The two situations call for different solutions:
    (1) In the first case, where the file is made up of many records, you can call HDFS's sync() method periodically (in combination with append) so that the data keeps accumulating into one large file, or you can write a MapReduce program to merge the small files (a minimal sketch appears after this list).
    (2) In the second case, you need some kind of container to group the files. Hadoop offers several options (illustrative sketches for each appear after this list):
    ① Use HAR files. Hadoop Archives (HAR files) were introduced into HDFS in release 0.18.0 to relieve the pressure that large numbers of small files put on the NameNode's memory. A HAR file works by building a layered file system on top of HDFS. It is created with the hadoop archive command, which actually runs a MapReduce job to pack the small files into a small number of HDFS files. For clients, nothing changes: all the original files remain visible and accessible (just through a har:// URL instead of an hdfs:// URL), but the number of files in HDFS is reduced.
    ② Use SequenceFile storage, with the file name as the key and the file contents as the value. This works very well in practice. For example, given 10,000 small files of roughly 100 KB each, you can write a program that merges them into a single SequenceFile and then process that SequenceFile in a streaming fashion (directly or with MapReduce).
    ③ Use HBase. If you are producing large numbers of small files, then depending on the access pattern a different kind of storage may fit better. HBase stores data in MapFiles (indexed SequenceFiles) and is a good choice if you need MapReduce-style streaming analysis with the occasional random lookup.
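For option (1), a minimal sketch of appending records to one growing HDFS file instead of creating a new small file per record. Note that the sync() call mentioned above comes from the older API; in Hadoop 2.x the corresponding calls are hflush() and hsync(). The target path and record bytes are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLogAppender {
    public static void appendRecord(byte[] record) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path log = new Path("/logs/app/current.log"); // hypothetical target file

        // Keep writing into one large file rather than one small file per record.
        try (FSDataOutputStream out = fs.exists(log) ? fs.append(log) : fs.create(log)) {
            out.write(record);
            out.hflush(); // make the data visible to readers; hsync() also forces it to disk
        }
    }
}
```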
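For option ①, the archive itself is built from the command line with hadoop archive; the sketch below only shows the client side, listing the archived files through a har:// path as if they were still individual files. The archive name and paths are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HarListing {
    public static void main(String[] args) throws Exception {
        // The archive would have been created beforehand with something like:
        //   hadoop archive -archiveName logs.har -p /user/hadoop/small_files /user/hadoop/archives
        Configuration conf = new Configuration();
        Path har = new Path("har:///user/hadoop/archives/logs.har"); // hypothetical archive

        // The har:// scheme exposes the original files; only the number of HDFS objects shrinks.
        FileSystem harFs = har.getFileSystem(conf);
        for (FileStatus status : harFs.listStatus(har)) {
            System.out.println(status.getPath() + "\t" + status.getLen());
        }
    }
}
```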
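For option ②, a minimal sketch that packs every file in a (hypothetical) directory of small files into one SequenceFile, with the file name as the key and the raw bytes as the value:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilesToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputDir = new Path("/user/hadoop/small_files"); // hypothetical source directory
        Path outFile = new Path("/user/hadoop/merged.seq");   // hypothetical merged file

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(outFile),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus status : fs.listStatus(inputDir)) {
                byte[] content = new byte[(int) status.getLen()];
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    IOUtils.readFully(in, content, 0, content.length);
                }
                // Key = original file name, value = the file's raw contents.
                writer.append(new Text(status.getPath().getName()), new BytesWritable(content));
            }
        }
    }
}
```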
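For option ③, a minimal sketch of storing one small file's contents in HBase keyed by its name, assuming a hypothetical table small_files with a column family f:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SmallFileToHBase {
    public static void storeFile(String fileName, byte[] contents) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("small_files"))) { // hypothetical table
            // Row key = file name; the file body lives in one cell of family "f".
            Put put = new Put(Bytes.toBytes(fileName));
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("data"), contents);
            table.put(put);
        }
    }
}
```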



Copyright notice
This article was created by [Fuda Dajia architect's daily question]. Please include a link to the original when reposting. Thank you.