Chapter 8. MapReduce production experience

2022-07-03 06:21:00 【Control the spiritual field】

8.1 Why MapReduce Jobs Run Slowly

The efficiency bottlenecks of a MapReduce program fall into two categories:
1) Machine performance
CPU, memory, disk, network
2) I/O operation optimization
(1) Data skew
(2) Map tasks that run too long, leaving Reduce tasks waiting
(3) Too many small files

8.2 Common MapReduce Tuning Parameters

8.2.1 Map-Stage Tuning

1) Use a custom partitioner to reduce data skew.
Define a class that extends Partitioner and overrides the getPartition method.
2) Reduce the number of spills.
mapreduce.task.io.sort.mb
Size of the shuffle ring buffer. The default is 100 MB; it can be raised to 200 MB.
mapreduce.map.sort.spill.percent
Fill threshold at which the ring buffer spills to disk. The default is 80%; it can be raised to 90%.
3) Increase the number of files merged in each merge pass.
mapreduce.task.io.sort.factor defaults to 10 and can be raised to 20. (This needs more memory; if memory is short, lower the value instead.)
4) Use a Combiner to pre-aggregate map output, provided it does not change the business result.
job.setCombinerClass(xxxReducer.class);
5) To reduce disk I/O, compress map output with Snappy or LZO.
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setClass("mapreduce.map.output.compress.codec",
SnappyCodec.class, CompressionCodec.class); // Snappy is the common choice in production
6) mapreduce.map.memory.mb: the memory limit for a MapTask, 1024 MB by default.
Increase it following the rule of thumb of 1 GB of memory per 128 MB of data.
7) mapreduce.map.java.opts: controls the MapTask heap size. (If the heap is too small, the task fails with java.lang.OutOfMemoryError.)
8) mapreduce.map.cpu.vcores: the number of CPU cores for a MapTask, 1 by default. Increase it for compute-intensive tasks.
9) Retry on failure.
mapreduce.map.maxattempts: the maximum number of attempts for each Map Task. Once a task has been tried this many times without success, the Map Task is considered failed. The default is 4; raise it as appropriate for the performance of the machines.
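
The map-side values suggested above can be collected in one place. The following is a minimal sketch in plain Java, not part of Hadoop: the class MapTuningSketch and its methods are hypothetical names, and setting the heap to 80% of the container is an assumption of this sketch rather than something stated above. It sizes the container with the "1 GB per 128 MB of data" rule and prints the settings as -D options, which a job only picks up if its driver parses generic options (for example through ToolRunner).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: sizes a MapTask container and renders the tuning values as -D options. */
final class MapTuningSketch {

    // Rule of thumb from the text: 1 GB (1024 MB) of task memory per 128 MB of input data.
    static int containerMbFor(long inputSplitMb) {
        long units = Math.max(1, (inputSplitMb + 127) / 128); // round up to whole 128 MB units
        return (int) (units * 1024);
    }

    // Assumption of this sketch: the JVM heap gets 80% of the container,
    // leaving the rest for non-heap memory.
    static String heapOptsFor(int containerMb) {
        return "-Xmx" + (containerMb * 8 / 10) + "m";
    }

    // Collects the map-side values suggested in the list above.
    static Map<String, String> mapSideSettings(long inputSplitMb) {
        int containerMb = containerMbFor(inputSplitMb);
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("mapreduce.task.io.sort.mb", "200");
        settings.put("mapreduce.map.sort.spill.percent", "0.90");
        settings.put("mapreduce.task.io.sort.factor", "20");
        settings.put("mapreduce.map.output.compress", "true");
        settings.put("mapreduce.map.output.compress.codec",
                "org.apache.hadoop.io.compress.SnappyCodec");
        settings.put("mapreduce.map.memory.mb", String.valueOf(containerMb));
        settings.put("mapreduce.map.java.opts", heapOptsFor(containerMb));
        return settings;
    }

    // Renders the settings as space-separated -Dname=value options.
    static String toGenericOptions(Map<String, String> settings) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : settings.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append("-D").append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Example: a MapTask that processes 256 MB of input.
        System.out.println(toGenericOptions(mapSideSettings(256)));
    }
}
```

Keeping the heap below the container size matters because the ring buffer (mapreduce.task.io.sort.mb) is allocated inside the heap, so raising the buffer without raising the heap can itself cause java.lang.OutOfMemoryError.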

8.2.2 Reduce-Stage Tuning

1) mapreduce.reduce.shuffle.parallelcopies: the number of parallel copies each ReduceTask uses to pull data from the Map output. The default is 5; it can be raised to 10.
2) mapreduce.reduce.shuffle.input.buffer.percent
The fraction of the ReduceTask's available memory used for the shuffle buffer. The default is 0.7; it can be raised to 0.8.
3) mapreduce.reduce.shuffle.merge.percent: the buffer fill level at which data starts being written to disk. The default is 0.66; it can be raised to 0.75.
4) mapreduce.reduce.memory.mb: the memory limit for a ReduceTask, 1024 MB by default. Following the rule of thumb of 1 GB of memory per 128 MB of data, raise it as appropriate to 4-6 GB.
5) mapreduce.reduce.java.opts: controls the ReduceTask heap size. (If the heap is too small, the task fails with java.lang.OutOfMemoryError.)
6) mapreduce.reduce.cpu.vcores: the number of CPU cores for a ReduceTask, 1 by default. It can be raised to 2-4.
7) mapreduce.reduce.maxattempts: the maximum number of attempts for each Reduce Task. Once a task has been tried this many times without success, the Reduce Task is considered failed. The default is 4.
8) mapreduce.job.reduce.slowstart.completedmaps: resources are requested for ReduceTasks only after this fraction of MapTasks has completed. The default is 0.05.
9) mapreduce.task.timeout: if a task makes no progress for this long, that is, it reads no new input and writes no output, it is considered blocked. It may be stuck temporarily or stuck forever, so to keep a user program from blocking indefinitely the framework enforces a timeout (in milliseconds). The default is 600000 (10 minutes). If your program takes a long time to process each input record, increase this parameter.
10) If a job does not need a Reduce stage, avoid using one.
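
To make the slow-start fraction in item 8) concrete, here is a small arithmetic sketch. The class and method names are hypothetical, and rounding the threshold up to a whole task is an assumption of this sketch; the exact rounding may differ between Hadoop versions.

```java
/** Hypothetical helper: how many MapTasks must finish before ReduceTasks are requested. */
final class ReduceSlowStartSketch {

    // Assumption of this sketch: threshold = ceil(fraction * totalMaps).
    static int mapsBeforeReducersStart(int totalMaps, double slowStartFraction) {
        if (totalMaps < 0 || slowStartFraction < 0.0 || slowStartFraction > 1.0) {
            throw new IllegalArgumentException("totalMaps must be >= 0 and fraction in [0, 1]");
        }
        return (int) Math.ceil(slowStartFraction * totalMaps);
    }

    public static void main(String[] args) {
        // With the default fraction 0.05 and 100 MapTasks, reducers are requested early.
        System.out.println(mapsBeforeReducersStart(100, 0.05));
        // With fraction 1.0, reducers wait until every MapTask has finished.
        System.out.println(mapsBeforeReducersStart(100, 1.0));
    }
}
```

A low fraction lets reducers start copying map output early, but they hold their containers while waiting; a higher fraction keeps those resources free for MapTasks on a busy cluster.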

8.3 MapReduce Data skew

1) Symptoms of data skew

Frequency skew: the amount of data in one partition is much larger than in the others.
Size skew: some records are much larger than the average.
For example, when 99% of the data has been processed but Reducer3 and Reducer4 are still running, the job is showing data skew.

2) Ways to reduce data skew

(1) First check whether the skew is caused by too many null values.
In production, null values can usually be filtered out directly. If the null values must be kept, use a custom partitioner: scatter the null keys by appending random numbers, then run a second aggregation at the end.
(2) Do as much processing as possible early, in the map stage, for example with a Combiner or MapJoin.
(3) Increase the number of reduce tasks.
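
The null-key approach in (1) can be sketched in plain Java without the Hadoop classes. Everything here is illustrative: the class name, the "__null_" prefix, and the "NULL" output label are hypothetical choices. Only the partition formula mirrors Hadoop's default HashPartitioner; in a real job, that logic would live in a class that extends Partitioner and overrides getPartition.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/** Hypothetical sketch: scatter null keys with a random salt, then merge them in a second pass. */
final class NullKeySaltingSketch {

    static final String SALT_PREFIX = "__null_"; // hypothetical marker for salted null keys
    static final String NULL_LABEL = "NULL";     // hypothetical label for the merged null group

    // Stage 1: give each null or empty key one of `buckets` salted names,
    // so the records no longer all hash to the same reducer.
    static String saltKey(String key, int buckets, Random random) {
        if (key == null || key.isEmpty()) {
            return SALT_PREFIX + random.nextInt(buckets);
        }
        return key;
    }

    // Same formula as Hadoop's default HashPartitioner.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Stage 2: strip the salt and sum the partial counts (the "second aggregation").
    static Map<String, Long> secondAggregation(Map<String, Long> partialCounts) {
        Map<String, Long> merged = new HashMap<>();
        for (Map.Entry<String, Long> e : partialCounts.entrySet()) {
            String key = e.getKey().startsWith(SALT_PREFIX) ? NULL_LABEL : e.getKey();
            merged.merge(key, e.getValue(), Long::sum);
        }
        return merged;
    }

    public static void main(String[] args) {
        Random random = new Random();
        String salted = saltKey(null, 4, random);
        System.out.println(salted + " -> partition " + partitionFor(salted, 3));
    }
}
```

The second pass is cheap because it only sees one partial result per salted bucket, not the original skewed records.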

Original site

Copyright notice
This article was written by [Control the spiritual field]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/184/202207030618181196.html