Deduplication and multi-directory output in a single MapReduce job
2022-07-06 18:25:00 【Full stack programmer webmaster】
This post summarizes a problem I ran into in earlier work.
Background: the operations team uses scribe to push logs from the Apache servers, and the same log records get pushed more than once, so the ETL step has to deduplicate them. There is also a requirement to write the output into multiple directories by business type, so that the directories can be attached as partitions and used later. Neither requirement is a problem when handled separately; doing both in a single MapReduce job takes a little skill.
1. The map reads the input data and, after a series of processing steps, emits its output like this:
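// Pick the number of output files for this log type, then hash the record into one of them.
// The 0x7FFFFFFF mask clears the sign bit so the remainder is never negative.
if (ttype.equals("other")) {
    file = (result.toString().hashCode() & 0x7FFFFFFF) % 400;
} else if (ttype.equals("client")) {
    file = (result.toString().hashCode() & 0x7FFFFFFF) % 260;
} else {
    file = (result.toString().hashCode() & 0x7FFFFFFF) % 60;
}
// left = "<type>_<file>", right = the whole record
tp = new TextPair(ttype + "_" + file, result.toString());
context.write(tp, valuet);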
valuet is empty; it carries no data.
I have three types here: other, client, and wap. They identify the platform the log came from, and the output is split into directories by type. result is the whole record.
file is the number that ends up in the final output file name. The hash, the bit mask, and the modulus are there to spread the output evenly across files.
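The key computation above can be tried outside Hadoop. The following is a minimal sketch in plain Java, not the author's mapper: the class name BucketKeySketch, the helper methods, and the sample record are hypothetical, and only the moduli and the masking come from the post.

```java
// Minimal sketch of the bucket-key computation; class and method names are hypothetical.
class BucketKeySketch {

    // Number of output files per log type, using the moduli from the post.
    static int bucketCount(String ttype) {
        if (ttype.equals("other")) {
            return 400;
        } else if (ttype.equals("client")) {
            return 260;
        }
        return 60; // every remaining type, i.e. wap
    }

    // Builds the left half of the key: "<type>_<bucket>".
    static String bucketKey(String ttype, String record) {
        // Masking with 0x7FFFFFFF clears the sign bit, so the remainder is never negative.
        int file = (record.hashCode() & 0x7FFFFFFF) % bucketCount(ttype);
        return ttype + "_" + file;
    }

    public static void main(String[] args) {
        // Made-up sample record with \001 as the field separator.
        String record = "10.0.0.1" + "\001" + "GET /index.html";
        // The same record pushed twice gets the same key both times.
        System.out.println(bucketKey("wap", record));
        System.out.println(bucketKey("wap", record));
    }
}
```

Because the bucket depends only on the record's content, every copy of a duplicated record lands in the same bucket, which is what makes deduplication on the reduce side possible.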
The map output has the structure <key, value> = (ttype + "_" + file, result.toString()), held as the left and right halves of the TextPair. The purpose is to make sure identical records get the same key while also preserving the type. The partitioner works on the left half of the TextPair, that is, on this key, which guarantees that all records destined for the same output file go to the same reduce. One reduce can write several output files, but one output file cannot be written by several reduces, for obvious reasons. This gives roughly 400+260+60=720 output files, each holding about the same amount of data. I set the number of reduces for the job to 240. That number, like the moduli 400, 260, and 60, is based on my own data and is chosen to avoid data skew across reduces.
2. Deduplication in the reduce method:
public void reduce(TextPair key, Iterable<Text> values, Context context) throws IOException, InterruptedException
{
    // The values are deliberately not iterated: one write per distinct key.
    rcfileCols = getRcfileCols(key.getSecond().toString().split("\001"));
    context.write(key.getFirst(), rcfileCols);
}
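There is no iteration over the values: records with the same key form one group, and each group is written only once. Note that the comparator used by the job must not be FirstComparator; it has to compare the whole TextPair (left first, then right). Otherwise all records that share a left value would fall into one group and only one of them would be written. The output file format of my program is RCFile.
3. Multi-directory output: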
job.setOutputFormatClass(WapApacheMutiOutputFormat.class);

public class WapApacheMutiOutputFormat extends RCFileMultipleOutputFormat<Text, BytesRefArrayWritable> {
    Random r = new Random(); // not used in this snippet

    // The key is "<type>_<file>"; the part before "_" becomes the sub-directory.
    protected String generateFileNameForKeyValue(Text key, BytesRefArrayWritable value,
            Configuration conf) {
        String typedir = key.toString().split("_")[0];
        return typedir + "/" + key.toString();
    }
}
The RCFileMultipleOutputFormat used here is a class I wrote myself by extending FileOutputFormat. Most of the work was implementing the RecordWriter.
The final result is deduplicated data files, split into sub-directories.
The key things to understand are the design of the partition key and how reduce works.
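The post does not show RCFileMultipleOutputFormat itself, so the following is only an in-memory sketch of the idea, under the assumption that such a writer keeps one open destination per generated file name. The class MultiDirOutputSketch and its methods are hypothetical, and a TreeMap of lists stands in for the real per-file writers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of routing records to "<type>/<type>_<bucket>" paths.
// This is not the author's RCFileMultipleOutputFormat; all names are hypothetical.
class MultiDirOutputSketch {

    // Same naming rule as generateFileNameForKeyValue in the post.
    static String fileNameFor(String key) {
        String typedir = key.split("_")[0];
        return typedir + "/" + key;
    }

    // Stand-in for the per-file writers a multiple-output RecordWriter would hold.
    private final Map<String, List<String>> files = new TreeMap<>();

    // Lazily creates the destination for a path the first time a record is routed to it.
    void write(String key, String record) {
        files.computeIfAbsent(fileNameFor(key), path -> new ArrayList<>()).add(record);
    }

    Map<String, List<String>> files() {
        return files;
    }

    public static void main(String[] args) {
        MultiDirOutputSketch out = new MultiDirOutputSketch();
        out.write("wap_12", "record A");
        out.write("wap_12", "record B");
        out.write("client_3", "record C");
        // One entry per relative path, grouped under the type directory.
        out.files().forEach((path, records) -> System.out.println(path + " -> " + records.size()));
    }
}
```

The point of the sketch is that a single reduce can feed several files, since the destination is chosen per record from the key.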
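To make the partition key design and the reduce-side grouping concrete, here is a small plain-Java simulation. It is a sketch under assumptions, not Hadoop code: Pair is a hypothetical stand-in for TextPair, partition() mimics a partitioner that looks only at the left half, and a TreeSet ordered on the whole pair plays the role of sorting and grouping before reduce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Simulates "partition on left, compare on the whole pair, write once per group".
// All names here are hypothetical; this is not the job's actual code.
class DedupShuffleSketch {

    // Stand-in for TextPair: left = "<type>_<bucket>", right = the whole record.
    static final class Pair {
        final String left;
        final String right;

        Pair(String left, String right) {
            this.left = left;
            this.right = right;
        }
    }

    // Whole-pair comparison: left first, then right.
    static int compareFull(Pair a, Pair b) {
        int c = a.left.compareTo(b.left);
        return c != 0 ? c : a.right.compareTo(b.right);
    }

    // Partition on the left half only, so one output file never spans two reduces.
    static int partition(Pair key, int numReducers) {
        return (key.left.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Returns, for each simulated reduce, the distinct pairs it would write.
    static List<List<Pair>> run(List<Pair> mapOutput, int numReducers) {
        List<TreeSet<Pair>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            partitions.add(new TreeSet<Pair>(DedupShuffleSketch::compareFull));
        }
        for (Pair p : mapOutput) {
            // Pairs that compare equal collapse to one entry, like one reduce call per group.
            partitions.get(partition(p, numReducers)).add(p);
        }
        List<List<Pair>> written = new ArrayList<>();
        for (TreeSet<Pair> group : partitions) {
            written.add(new ArrayList<>(group));
        }
        return written;
    }

    public static void main(String[] args) {
        List<Pair> mapOutput = Arrays.asList(
                new Pair("wap_1", "record A"),
                new Pair("wap_1", "record A"), // duplicate push of the same record
                new Pair("wap_1", "record B"),
                new Pair("client_5", "record A"));
        List<List<Pair>> written = run(mapOutput, 3);
        for (int i = 0; i < written.size(); i++) {
            for (Pair p : written.get(i)) {
                System.out.println("reduce " + i + ": " + p.left + " | " + p.right);
            }
        }
    }
}
```

If the comparison looked at left only, "record A" and "record B" in bucket wap_1 would collapse into a single entry, which is why the whole pair has to be compared.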
Copyright notice: This is an original blog article and may not be reproduced without the author's consent.
Publisher: Full stack programmer webmaster. When reprinting, please cite the source: https://javaforall.cn/117394.html Link to the original text: https://javaforall.cn