Chapter 8. MapReduce production experience
2022-07-03 06:21:00 【Control the spiritual field】
8.1 Why MapReduce jobs run slowly
The efficiency bottlenecks of a MapReduce program fall into two areas:
1) Machine performance
CPU, memory, disk, network
2) I/O operation optimization
(1) Data skew
(2) Map tasks running too long, so Reduce tasks wait too long
(3) Too many small files
8.2 Common MapReduce tuning parameters
8.2.1 Map-stage tuning
![Map-stage tuning](/img/31/a29c139772bc82c43840333b5c9a26.png)
1) Use a custom partitioner to reduce data skew
Define a class that extends Partitioner and overrides the getPartition method.
2) Reduce the number of spills
mapreduce.task.io.sort.mb
Size of the shuffle ring buffer. The default is 100 MB; it can be raised to 200 MB.
mapreduce.map.sort.spill.percent
Fill threshold at which the ring buffer spills to disk. The default is 80%; it can be raised to 90%.
3) Increase the number of files merged in each merge pass
mapreduce.task.io.sort.factor: the default is 10; it can be raised to 20 (this needs more memory, so lower it again if memory is insufficient).
4) Use a Combiner before the shuffle whenever it does not change the business result
job.setCombinerClass(xxxReducer.class);
5) To reduce disk I/O, compress the map output with Snappy or LZO (Snappy is the common choice in production)
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setClass("mapreduce.map.output.compress.codec",
SnappyCodec.class, CompressionCodec.class);
6) mapreduce.map.memory.mb: maximum memory for a MapTask, 1024 MB by default.
Increase it following the rule of thumb of 1 GB of memory per 128 MB of input data.
7) mapreduce.map.java.opts: controls the MapTask heap size. (If the heap is too small, the task fails with java.lang.OutOfMemoryError.)
8) mapreduce.map.cpu.vcores: number of CPU cores for a MapTask, 1 by default. Compute-intensive jobs can use more cores.
9) Retry on failure
mapreduce.map.maxattempts: maximum number of attempts for each map task. Once a task has used up this many attempts, it is considered failed. The default is 4; raise it as appropriate for the performance of the machines.
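The effect of items 2) and 3) can be estimated with simple arithmetic. The sketch below is a rough model, not Hadoop code: the class and method names are hypothetical, the 1200 MB of map output per task is an assumed figure, and the model ignores the per-record metadata that the real ring buffer also stores. It treats every fill of the buffer up to the spill threshold as one spill file.

```java
// Rough model of map-side spills; names and the 1200 MB figure are illustrative assumptions.
final class SpillEstimate {

    /** Estimated number of spill files for one map task (integer math, sizes in MB). */
    static long estimateSpills(long mapOutputMb, long sortMb, int spillPercent) {
        long thresholdMb = sortMb * spillPercent / 100; // buffer fill level that triggers a spill
        if (thresholdMb <= 0) {
            throw new IllegalArgumentException("spill threshold must be positive");
        }
        return (mapOutputMb + thresholdMb - 1) / thresholdMb; // ceiling division
    }

    /** True when all spill files can be merged in a single pass. */
    static boolean singleMergePass(long spills, int sortFactor) {
        return spills <= sortFactor;
    }

    public static void main(String[] args) {
        long outputMb = 1200; // hypothetical map output of one task

        long defaultSpills = estimateSpills(outputMb, 100, 80); // sort.mb=100, spill.percent=80%
        long tunedSpills = estimateSpills(outputMb, 200, 90);   // sort.mb=200, spill.percent=90%

        System.out.println("default: " + defaultSpills + " spills, single pass: "
                + singleMergePass(defaultSpills, 10));
        System.out.println("tuned:   " + tunedSpills + " spills, single pass: "
                + singleMergePass(tunedSpills, 20));
    }
}
```

Under these assumptions the defaults give 15 spill files, more than the default merge factor of 10, while the tuned values give 7 spill files that fit in one merge pass.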
8.2.2 Reduce-stage tuning
![Reduce-stage tuning](/img/6d/2c9eece28cbd71cc38d3a70cc70559.png)
1) mapreduce.reduce.shuffle.parallelcopies: number of parallel copies each Reduce task uses to fetch data from the Map tasks. The default is 5; it can be raised to 10.
2) mapreduce.reduce.shuffle.input.buffer.percent
Share of the Reduce task's available memory used for the shuffle buffer. The default is 0.7; it can be raised to 0.8.
3) mapreduce.reduce.shuffle.merge.percent: fraction of the buffer that must fill before data starts being written to disk. The default is 0.66; it can be raised to 0.75.
4) mapreduce.reduce.memory.mb: maximum memory for a ReduceTask, 1024 MB by default. Following the rule of thumb of 1 GB of memory per 128 MB of data, raise it as appropriate, to 4-6 GB.
5) mapreduce.reduce.java.opts: controls the ReduceTask heap size. (If the heap is too small, the task fails with java.lang.OutOfMemoryError.)
6) mapreduce.reduce.cpu.vcores: number of CPU cores for a ReduceTask, 1 by default. It can be raised to 2-4.
7) mapreduce.reduce.maxattempts: maximum number of attempts for each reduce task. Once a task has used up this many attempts, the reduce task is considered failed. The default is 4.
8) mapreduce.job.reduce.slowstart.completedmaps: resources are requested for ReduceTasks only after this fraction of MapTasks has completed. The default is 0.05.
9) mapreduce.task.timeout: if a task makes no progress for this long, that is, it reads no new input and writes no output, it is considered blocked. It may be stuck, possibly forever, so this timeout (in milliseconds) is enforced to stop a user program from blocking without ever exiting. The default is 600000 (10 minutes). If your program takes a long time to process each input record, increase this parameter.
10) If a Reduce stage is not needed, avoid using one.
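How items 2), 3) and 8) interact is easier to see with numbers. The sketch below is plain arithmetic, not Hadoop code: the class and method names are hypothetical, the 1000 MB reduce heap and the 200 map tasks are assumed figures, and the slow-start calculation assumes the fraction is rounded up to a whole number of map tasks.

```java
// Illustrative arithmetic for reduce-side settings; names and input figures are assumptions.
final class ShuffleBufferEstimate {

    /** Memory (MB) reserved for map outputs held in memory during the shuffle. */
    static long shuffleBufferMb(long reduceHeapMb, int inputBufferPercent) {
        return reduceHeapMb * inputBufferPercent / 100;
    }

    /** Buffer usage (MB) at which merging to disk starts. */
    static long mergeThresholdMb(long reduceHeapMb, int inputBufferPercent, int mergePercent) {
        return shuffleBufferMb(reduceHeapMb, inputBufferPercent) * mergePercent / 100;
    }

    /** Completed map tasks required before reduce resources are requested (rounded up). */
    static long mapsBeforeReduceStart(long totalMaps, int slowstartPercent) {
        return (totalMaps * slowstartPercent + 99) / 100;
    }

    public static void main(String[] args) {
        long heapMb = 1000; // hypothetical reduce heap size

        // Defaults: input.buffer.percent=0.70, merge.percent=0.66
        System.out.println(shuffleBufferMb(heapMb, 70) + " MB buffer, merge at "
                + mergeThresholdMb(heapMb, 70, 66) + " MB");
        // Tuned: input.buffer.percent=0.80, merge.percent=0.75
        System.out.println(shuffleBufferMb(heapMb, 80) + " MB buffer, merge at "
                + mergeThresholdMb(heapMb, 80, 75) + " MB");
        // Default slow start of 0.05 with 200 map tasks
        System.out.println(mapsBeforeReduceStart(200, 5) + " maps must finish first");
    }
}
```

With these figures the point at which merging to disk begins moves from 462 MB to 600 MB of buffered map output, so fewer on-disk segments are produced for the same input.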
8.3 MapReduce data skew
1) Symptoms of data skew
Frequency skew: one partition holds far more data than the others.
Size skew: some records are much larger than the average.
As the figure below shows, 99% of the data has been processed but Reducer3 and Reducer4 are still running, which is a sign of data skew.
![Data skew: Reducer3 and Reducer4 still running](/img/66/4fb11d95886edf957c515c65311d2e.png)
2) Ways to reduce data skew
(1) First check whether the skew is caused by too many null values
In production you can often filter the null values out directly. If the null values must be kept, write a custom partitioner that scatters them by adding random numbers, then aggregate a second time.
(2) Do as much as possible early, ideally in the map stage, for example with a Combiner or MapJoin.
(3) Increase the number of reduce tasks
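The null-value approach in (1) can be sketched without a cluster. The code below is a minimal simulation in plain Java, not a Hadoop Partitioner: the class and method names are hypothetical, and String keys stand in for the job's real key type. The hashPartition method uses the same formula as Hadoop's default HashPartitioner; null keys are sent to a random partition instead, and a second aggregation combines the per-partition partial results.

```java
import java.util.Arrays;
import java.util.Random;

// Simulation of scattering null keys across partitions, then aggregating twice.
final class NullKeySalting {

    /** Same formula as Hadoop's default HashPartitioner. */
    static int hashPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Null keys go to a random partition; all other keys are hash-partitioned. */
    static int partition(String key, int numPartitions, Random random) {
        return key == null ? random.nextInt(numPartitions) : hashPartition(key, numPartitions);
    }

    /** Stage 1: each partition (reducer) counts the records it receives. */
    static long[] countPerPartition(String[] keys, int numPartitions, Random random) {
        long[] counts = new long[numPartitions];
        for (String key : keys) {
            counts[partition(key, numPartitions, random)]++;
        }
        return counts;
    }

    /** Stage 2: the second aggregation sums the partial counts. */
    static long total(long[] partials) {
        long sum = 0;
        for (long partial : partials) {
            sum += partial;
        }
        return sum;
    }

    public static void main(String[] args) {
        String[] keys = new String[1000]; // 1000 records that all have a null key
        long[] partials = countPerPartition(keys, 4, new Random(42));
        System.out.println("per partition: " + Arrays.toString(partials));
        System.out.println("total after second aggregation: " + total(partials));
    }
}
```

In a real job the same decision would live in the getPartition method of a custom Partitioner, and the second aggregation would be a follow-up job or step over the partial results.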








