Chapter 8. MapReduce production experience
2022-07-03 06:21:00 【Control the spiritual field】
8.1 Why MapReduce Jobs Run Slowly
The efficiency of a MapReduce program is bottlenecked in two areas:
1) Machine performance
CPU, memory, disk, network
2) I/O operations
(1) Data skew
(2) Map tasks that run too long, leaving Reduce tasks waiting
(3) Too many small files
8.2 Common MapReduce Tuning Parameters
8.2.1 Map Stage Tuning
1) Use a custom partitioner to reduce data skew.
Define a class that extends Partitioner and overrides the getPartition method.
2) Reduce the number of spills to disk
mapreduce.task.io.sort.mb
Size of the Shuffle ring buffer. Default 100 MB; can be raised to 200 MB.
mapreduce.map.sort.spill.percent
Fill threshold at which the ring buffer spills. Default 80%; can be raised to 90%.
3) Increase the number of files merged in each merge pass
mapreduce.task.io.sort.factor: default 10; can be raised to 20. This needs more memory; if memory is insufficient, lower the value instead.
4) Where it does not change the business result, apply a Combiner to pre-aggregate map output
job.setCombinerClass(xxxReducer.class);
5) To reduce disk I/O, compress map output with Snappy or LZO (Snappy is the common choice in production)
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setClass("mapreduce.map.output.compress.codec",
SnappyCodec.class, CompressionCodec.class);
6) mapreduce.map.memory.mb: maximum memory for a MapTask. Default 1024 MB.
Increase it following the rule of thumb of 1 GB of memory per 128 MB of data.
7) mapreduce.map.java.opts: controls the MapTask heap size. If the heap is too small, the task fails with java.lang.OutOfMemoryError.
8) mapreduce.map.cpu.vcores: number of CPU cores for a MapTask. Default 1. Compute-intensive jobs can use more cores.
9) Retry on failure
mapreduce.map.maxattempts: maximum number of attempts for each Map Task. Once this number is exceeded, the Map Task is considered failed. Default 4. Raise it as appropriate for the performance of the machines.
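The map-side settings above can be gathered in one place. The following is a minimal sketch that stores them in a plain java.util.Map so it runs without Hadoop on the classpath; in a real job the same key/value pairs would be set on the job's Configuration. The class name MapSideTuning, its helper methods, and the 80% heap-to-memory ratio are assumptions made for illustration and do not come from the original text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MapSideTuning {
    // Rule of thumb from the text: 1 GB of task memory per 128 MB of input data.
    static int containerMb(long inputMb) {
        long units = Math.max(1, (inputMb + 127) / 128); // round up to whole 128 MB units
        return (int) (units * 1024);
    }

    // Collects the map-side settings discussed above as plain key/value pairs.
    static Map<String, String> forInputMb(long inputMb) {
        int memMb = containerMb(inputMb);
        int heapMb = memMb * 8 / 10; // assumption: heap is about 80% of task memory
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("mapreduce.task.io.sort.mb", "200");          // ring buffer, default 100
        conf.put("mapreduce.map.sort.spill.percent", "0.90");  // spill threshold, default 0.80
        conf.put("mapreduce.task.io.sort.factor", "20");       // merge fan-in, default 10
        conf.put("mapreduce.map.output.compress", "true");
        conf.put("mapreduce.map.output.compress.codec",
                 "org.apache.hadoop.io.compress.SnappyCodec");
        conf.put("mapreduce.map.memory.mb", String.valueOf(memMb));
        conf.put("mapreduce.map.java.opts", "-Xmx" + heapMb + "m");
        conf.put("mapreduce.map.cpu.vcores", "1");
        conf.put("mapreduce.map.maxattempts", "4");
        return conf;
    }

    public static void main(String[] args) {
        // Print the settings for a map task that reads 256 MB of input.
        forInputMb(256).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Keeping the heap below the task memory limit leaves room for non-heap JVM memory; the exact ratio depends on the cluster and should be checked against real job behaviour.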
8.2.2 Reduce Stage Tuning
1) mapreduce.reduce.shuffle.parallelcopies: number of parallel copies each Reduce uses to pull data from the Map side. Default 5; can be raised to 10.
2) mapreduce.reduce.shuffle.input.buffer.percent
Size of the shuffle buffer as a fraction of the memory available to the Reduce. Default 0.7; can be raised to 0.8.
3) mapreduce.reduce.shuffle.merge.percent: fraction of the buffer that must fill before data starts being written to disk. Default 0.66; can be raised to 0.75.
4) mapreduce.reduce.memory.mb: maximum memory for a ReduceTask. Default 1024 MB. Following the rule of thumb of 1 GB of memory per 128 MB of data, increase it as appropriate, to 4-6 GB.
5) mapreduce.reduce.java.opts: controls the ReduceTask heap size. If the heap is too small, the task fails with java.lang.OutOfMemoryError.
6) mapreduce.reduce.cpu.vcores: number of CPU cores for a ReduceTask. Default 1; can be raised to 2-4.
7) mapreduce.reduce.maxattempts: maximum number of attempts for each Reduce Task. Once this number is exceeded, the Reduce Task is considered failed. Default 4.
8) mapreduce.job.reduce.slowstart.completedmaps: resources for ReduceTasks are requested only once the fraction of completed MapTasks reaches this value. Default 0.05.
9) mapreduce.task.timeout: if a Task makes no progress for this long, meaning it reads no new input and writes no output, it is considered blocked. It may be stuck temporarily or stuck forever, so a timeout (in milliseconds) is enforced to keep a user program from blocking indefinitely without exiting. Default 600000 (10 minutes). If your program takes a long time to process each input record, increase this value.
10) If a Reduce stage is not needed, avoid using one.
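The arithmetic behind items 2), 3) and 8) can be sketched as follows. This is a simplified model written for illustration, not Hadoop's own code; the class name ReduceSideModel and its method names are hypothetical, and the model assumes the shuffle buffer is sized relative to the reduce task's heap.

```java
class ReduceSideModel {
    // Shuffle buffer = heap size * mapreduce.reduce.shuffle.input.buffer.percent
    static double shuffleBufferMb(double heapMb, double inputBufferPercent) {
        return heapMb * inputBufferPercent;
    }

    // Merge to disk starts once the buffer holds
    // bufferMb * mapreduce.reduce.shuffle.merge.percent of data.
    static double mergeThresholdMb(double bufferMb, double mergePercent) {
        return bufferMb * mergePercent;
    }

    // Reducers may be requested once the completed fraction of map tasks
    // reaches mapreduce.job.reduce.slowstart.completedmaps.
    static boolean reducersMayStart(int completedMaps, int totalMaps, double slowstart) {
        int needed = (int) Math.ceil(slowstart * totalMaps);
        return completedMaps >= needed;
    }

    public static void main(String[] args) {
        double buffer = shuffleBufferMb(1024, 0.7);     // defaults from the text
        double threshold = mergeThresholdMb(buffer, 0.66);
        System.out.printf("buffer=%.1f MB, merge threshold=%.1f MB%n", buffer, threshold);
        System.out.println("start reducers? " + reducersMayStart(10, 100, 0.05));
    }
}
```

With a 1024 MB heap and the default values, the buffer is roughly 717 MB and merging begins at roughly 473 MB, which shows why raising either percentage only helps when the heap itself is large enough.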
8.3 MapReduce Data Skew
1) Symptoms of data skew
Frequency skew: one partition holds far more data than the others.
Size skew: some records are much larger than the average.
For example, when 99% of the data has been processed but Reducer3 and Reducer4 are still running, that is a sign of data skew.
2) Ways to reduce data skew
(1) First check whether the skew is caused by too many null values.
In production, null values can often be filtered out directly. If the null values must be kept, use a custom partitioner: scatter the null keys by appending random numbers, then aggregate a second time to combine the partial results.
(2) Whatever can be processed early in the map stage is best handled there, for example with a Combiner or MapJoin.
(3) Increase the number of reduce tasks.
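The idea in (1), scattering a hot key with random numbers and then aggregating twice, can be sketched in plain Java without Hadoop. The class SaltedCount, its method names and the "#" separator are hypothetical, and for simplicity the sketch salts every key rather than only the null ones.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class SaltedCount {
    // Stage 1: append a random suffix so one hot key spreads over `buckets` partial keys.
    // In a real job these salted keys would be distributed across different reducers.
    static Map<String, Long> stageOne(List<String> keys, int buckets, Random rnd) {
        Map<String, Long> partial = new HashMap<>();
        for (String k : keys) {
            String salted = k + "#" + rnd.nextInt(buckets);
            partial.merge(salted, 1L, Long::sum);
        }
        return partial;
    }

    // Stage 2: strip the suffix and add the partial counts together.
    static Map<String, Long> stageTwo(Map<String, Long> partial) {
        Map<String, Long> total = new HashMap<>();
        for (Map.Entry<String, Long> e : partial.entrySet()) {
            String salted = e.getKey();
            String original = salted.substring(0, salted.lastIndexOf('#'));
            total.merge(original, e.getValue(), Long::sum);
        }
        return total;
    }
}
```

The second stage sees at most `buckets` partial records per original key, so the work for the hot key in stage 1 is shared instead of landing on a single reducer. The cost is an extra aggregation pass.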