Chapter 8. MapReduce production experience
2022-07-03 06:21:00 【Control the spiritual field】
8.1 Why MapReduce jobs run slowly
The efficiency bottlenecks of a MapReduce program fall into two categories:
1) Machine performance
CPU, memory, disk, network
2) I/O operation optimization
(1) Data skew
(2) Map tasks run too long, so Reduce tasks wait too long
(3) Too many small files
8.2 Common MapReduce tuning parameters
8.2.1 Map stage tuning
![Map stage tuning](/img/31/a29c139772bc82c43840333b5c9a26.png)
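1) Use a custom partitioner to reduce data skew.
Define a class that extends Partitioner (an abstract class in the org.apache.hadoop.mapreduce API, an interface in the older mapred API) and override its getPartition method.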
2) Reduce the number of spills
mapreduce.task.io.sort.mb
Size of the Shuffle ring buffer. The default is 100 MB; it can be raised to 200 MB.
mapreduce.map.sort.spill.percent
Fill threshold at which the ring buffer spills to disk. The default is 80%; it can be raised to 90%.
3) Increase the number of files merged in each merge pass
mapreduce.task.io.sort.factor defaults to 10 and can be raised to 20. (This demands more memory; if memory is insufficient, lower it instead.)
4) Where it does not affect the business result, use a Combiner to pre-aggregate the map output
job.setCombinerClass(xxxReducer.class);
5) To reduce disk I/O, compress the map output with Snappy or LZO
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setClass("mapreduce.map.output.compress.codec",
SnappyCodec.class, CompressionCodec.class); // Snappy is the common choice in production
6) mapreduce.map.memory.mb: maximum memory for a MapTask, 1024 MB by default.
Increase it following the rule of thumb of 1 GB of memory per 128 MB of input data.
7) mapreduce.map.java.opts: controls the MapTask heap size. (If the heap is too small, the task fails with java.lang.OutOfMemoryError.)
8) mapreduce.map.cpu.vcores: number of CPU cores for a MapTask, 1 by default. Compute-intensive jobs can increase it.
9) Retry on failure
mapreduce.map.maxattempts: maximum number of attempts for each Map Task. Once the attempts exceed this value, the Map Task is considered failed. The default is 4. Raise it moderately according to machine performance.
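The effect of items 2) and 3) above can be estimated with simple arithmetic. The following is a rough sketch, not Hadoop's actual accounting: it assumes each spill holds about sort.mb × spill.percent of map output (buffer metadata is ignored) and that every merge round combines at most sort.factor files. The class and method names are hypothetical.

```java
/** Back-of-the-envelope estimates for Map-side spills and merge rounds (illustrative only). */
final class SpillEstimate {
    private SpillEstimate() {}

    /**
     * Approximate number of spill files written by one MapTask.
     * Assumes each spill holds about sortMb * spillPercent of map output.
     */
    static int spillCount(double mapOutputMb, int sortMb, double spillPercent) {
        double perSpillMb = sortMb * spillPercent;
        return (int) Math.ceil(mapOutputMb / perSpillMb);
    }

    /**
     * Approximate number of merge rounds needed to turn spillFiles into one file
     * when at most sortFactor files are merged per round.
     */
    static int mergePasses(int spillFiles, int sortFactor) {
        int passes = 0;
        int files = spillFiles;
        while (files > 1) {
            files = (files + sortFactor - 1) / sortFactor; // ceiling division
            passes++;
        }
        return passes;
    }
}
```

For a hypothetical MapTask that emits 300 MB, `spillCount(300, 100, 0.80)` gives 4 spills with the defaults and `spillCount(300, 200, 0.90)` gives 2 with the raised values. Likewise, 15 spill files need 2 merge rounds at a factor of 10 but only 1 at a factor of 20.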
8.2.2 Reduce stage tuning
![Reduce stage tuning](/img/6d/2c9eece28cbd71cc38d3a70cc70559.png)
1) mapreduce.reduce.shuffle.parallelcopies: number of parallel copies each Reduce uses to pull data from the Map tasks. The default is 5; it can be raised to 10.
2) mapreduce.reduce.shuffle.input.buffer.percent
Size of the shuffle buffer as a fraction of the memory available to the Reduce. The default is 0.7; it can be raised to 0.8.
3) mapreduce.reduce.shuffle.merge.percent: fill level of the buffer at which data starts being written to disk. The default is 0.66; it can be raised to 0.75.
4) mapreduce.reduce.memory.mb: maximum memory for a ReduceTask, 1024 MB by default. Following the rule of thumb of 1 GB of memory per 128 MB of data, raise it as appropriate to 4-6 GB.
5) mapreduce.reduce.java.opts: controls the ReduceTask heap size. (If the heap is too small, the task fails with java.lang.OutOfMemoryError.)
6) mapreduce.reduce.cpu.vcores: number of CPU cores for a ReduceTask, 1 by default. It can be raised to 2-4.
7) mapreduce.reduce.maxattempts: maximum number of attempts for each Reduce Task. Once the attempts exceed this value, the Reduce Task is considered failed. The default is 4.
8) mapreduce.job.reduce.slowstart.completedmaps: resources are requested for ReduceTasks only after this fraction of the MapTasks has completed. The default is 0.05.
9) mapreduce.task.timeout: if a Task makes no progress for this long (it reads no new input and writes no output), it is considered blocked. It may be hung, possibly forever. To keep a user program from blocking indefinitely, a timeout (in milliseconds) is enforced; the default is 600000 (10 minutes). If your program takes a long time to process each input record, increase this parameter.
10) If a Reduce stage is not necessary, avoid it.
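The "1 GB of memory per 128 MB of data" rule in item 4), and the related heap setting in item 5), can be turned into a small calculation. This is a sketch under stated assumptions: the helper names are hypothetical, and the heap ratio is a parameter you choose yourself (0.8 is a common convention, not something the text above prescribes). The heap is kept below the container size because the JVM also needs non-heap memory.

```java
/** Sizing helper for *.memory.mb and *.java.opts (illustrative only). */
final class TaskMemorySizing {
    static final int DEFAULT_CONTAINER_MB = 1024; // default of mapreduce.{map,reduce}.memory.mb
    static final int BLOCK_MB = 128;

    private TaskMemorySizing() {}

    /** Container size under the "1 GB per 128 MB of data" rule, never below the 1024 MB default. */
    static int containerMb(long inputMb) {
        long blocks = (inputMb + BLOCK_MB - 1) / BLOCK_MB; // round up to whole 128 MB units
        return (int) Math.max(DEFAULT_CONTAINER_MB, blocks * 1024);
    }

    /** -Xmx string for *.java.opts, using heapRatio of the container as heap. */
    static String javaOpts(int containerMb, double heapRatio) {
        int heapMb = (int) (containerMb * heapRatio);
        return "-Xmx" + heapMb + "m";
    }
}
```

For a hypothetical ReduceTask that processes 512 MB, `containerMb(512)` returns 4096, which matches the 4-6 GB range above, and `javaOpts(4096, 0.8)` returns `-Xmx3276m`. These values would then be set for mapreduce.reduce.memory.mb and mapreduce.reduce.java.opts respectively.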
8.3 MapReduce data skew
1) Symptoms of data skew
Frequency skew: one partition holds far more data than the others.
Size skew: some records are much larger than the average.
As the figure below shows, 99% of the job has finished while Reducer3 and Reducer4 are still running; this is a typical sign of data skew.
![Reducer3 and Reducer4 still running after the rest of the job has finished](/img/66/4fb11d95886edf957c515c65311d2e.png)
2) Ways to reduce data skew
(1) First check whether the skew is caused by too many null values.
In production, null values can often be filtered out directly. If they must be retained, use a custom partitioner: scatter the null keys by appending random numbers, then run a second aggregation at the end.
(2) Do as much processing as possible ahead of time in the map stage, for example with a Combiner or MapJoin.
(3) Increase the number of reduce tasks.
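The "append a random number, then aggregate a second time" idea from method (1) can be simulated in plain Java without a cluster. The sketch below is illustrative only: the class and method names are hypothetical, the salt is assigned round-robin rather than randomly so the result is reproducible, and `partitionFor` uses the same formula as Hadoop's default HashPartitioner to show where a key would land.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Two-stage aggregation with salted keys (illustrative simulation, no Hadoop required). */
final class SaltedAggregation {
    private SaltedAggregation() {}

    /** Same formula as Hadoop's default HashPartitioner. */
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Stage 1: append a salt so one hot key becomes up to `salts` distinct keys, then count. */
    static Map<String, Long> stageOne(List<String> keys, int salts) {
        Map<String, Long> partial = new HashMap<>();
        int i = 0;
        for (String key : keys) {
            // Round-robin salt for reproducibility; a real job would typically use a random number.
            String salted = key + "#" + (i++ % salts);
            partial.merge(salted, 1L, Long::sum);
        }
        return partial;
    }

    /** Stage 2: strip the salt and sum the partial counts per original key. */
    static Map<String, Long> stageTwo(Map<String, Long> partial) {
        Map<String, Long> total = new HashMap<>();
        for (Map.Entry<String, Long> e : partial.entrySet()) {
            String salted = e.getKey();
            String original = salted.substring(0, salted.lastIndexOf('#'));
            total.merge(original, e.getValue(), Long::sum);
        }
        return total;
    }
}
```

In a real job, stage 1 and stage 2 would be two chained MapReduce jobs. Because the salted variants of a hot key hash differently, they can land on different reducers in the first job, and the second job only has to combine a handful of partial results per key.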