Deduplication and multi-directory output in a single MapReduce job
2022-07-06 18:25:00 【Full stack programmer webmaster】
This post summarizes a problem I ran into at a previous job.
Background: operations uses Scribe to push logs from the Apache servers, and the same records get pushed more than once, so the ETL step has to deduplicate them. The output also has to be split into multiple directories by business type, so that each directory can be mounted as a partition and queried later. Neither requirement is hard when handled separately; doing both in a single MapReduce job takes a little skill.
1. The map reads the input data and, after a series of processing steps, writes its output like this:
// Mask the sign bit so the hash is non-negative, then pick a bucket count per type.
String record = result.toString();
int hash = record.hashCode() & 0x7FFFFFFF;
if (ttype.equals("other")) {
    file = hash % 400;
} else if (ttype.equals("client")) {
    file = hash % 260;
} else { // wap
    file = hash % 60;
}
// Left part: type plus file number. Right part: the whole record.
tp = new TextPair(ttype + "_" + file, record);
context.write(tp, valuet);
valuet is empty; it carries no data.
There are three types here: other, client, and wap. Each names the platform the log came from, and the output is split into directories by type. result is the whole record.
file is the number that ends up in the final output file name. The hash, the bit mask, and the modulus are there to spread records evenly across the output files.
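The bucket choice can be tried out in isolation. The following is a minimal plain-Java sketch, not the author's mapper: the class name `BucketSketch`, the method `fileKey`, and the sample record are made up for illustration, and only the bucket counts (400, 260, 60) and the mask come from the code above.

```java
// Plain-Java sketch of the bucket choice made in the mapper (no Hadoop classes involved).
final class BucketSketch {

    // Bucket count per log type; anything other than "other" and "client" is treated as wap.
    static int bucketsFor(String type) {
        switch (type) {
            case "other":  return 400;
            case "client": return 260;
            default:       return 60;
        }
    }

    // Builds the left part of the key, for example "client_17".
    static String fileKey(String type, String record) {
        // Masking with 0x7FFFFFFF clears the sign bit, so the remainder is never negative.
        int file = (record.hashCode() & 0x7FFFFFFF) % bucketsFor(type);
        return type + "_" + file;
    }

    public static void main(String[] args) {
        String record = "203.0.113.7\u0001GET /index.html\u0001200"; // made-up sample record
        // The same record always produces the same key, so duplicates meet in one place.
        System.out.println(fileKey("client", record));
        System.out.println(fileKey("client", record));
    }
}
```

Because the key depends only on the type and the record content, two copies of one log line always carry the same left part, which is what the later deduplication step relies on.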
The map output key is a TextPair of the form (ttype + "_" + file, result.toString()), and the value is left empty. This design guarantees that identical records get identical keys, while the key also preserves the type. The partitioner works on the left part of the TextPair only, which ensures that all records destined for the same output file go to the same reduce task.
One reduce task can write several output files, but one output file cannot be written by several reduce tasks, for obvious reasons. This gives about 400 + 260 + 60 = 720 output files, each holding roughly the same amount of data. I set the number of reduce tasks for this job to 240. That number, like the moduli 400, 260, and 60, was chosen from my own data to avoid data skew across the reduce tasks.
2. Deduplication in the reduce method:
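public void reduce(TextPair key, Iterable<Text> values, Context context) throws IOException, InterruptedException
{
    // The right part of the key holds the whole record; split it into columns on \001.
    rcfileCols = getRcfileCols(key.getSecond().toString().split("\001"));
    // Write once per key group; the values are deliberately not iterated.
    context.write(key.getFirst(), rcfileCols);
}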
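Whether this reduce removes only duplicates, or drops distinct records as well, depends on how keys are compared when they are grouped. The plain-Java model below is a sketch under that reading and does not use the Hadoop API: `Pair` stands in for TextPair, a `TreeSet` stands in for the sort-and-group phase, and the sample data is invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Plain-Java model of how the key comparison decides what survives deduplication.
final class DedupSketch {

    // Stand-in for TextPair: left = "type_file", right = the whole record.
    static final class Pair {
        final String left;
        final String right;
        Pair(String left, String right) { this.left = left; this.right = right; }
    }

    // Looks at the left part only.
    static final Comparator<Pair> LEFT_ONLY =
            Comparator.comparing((Pair p) -> p.left);

    // Compares left first, then right: the whole-pair comparison.
    static final Comparator<Pair> WHOLE_PAIR =
            Comparator.comparing((Pair p) -> p.left).thenComparing((Pair p) -> p.right);

    // Keeps one record per group of keys that compare equal, like a reduce that writes once per call.
    static List<String> reduceOncePerGroup(List<Pair> mapOutput, Comparator<Pair> grouping) {
        TreeSet<Pair> groups = new TreeSet<>(grouping); // add() ignores keys that compare equal
        groups.addAll(mapOutput);
        List<String> written = new ArrayList<>();
        for (Pair p : groups) {
            written.add(p.right);
        }
        return written;
    }

    public static void main(String[] args) {
        List<Pair> mapOutput = Arrays.asList(
                new Pair("wap_3", "rec A"),
                new Pair("wap_3", "rec B"),
                new Pair("wap_3", "rec A"),   // duplicate of the first record
                new Pair("client_7", "rec C"));
        System.out.println(reduceOncePerGroup(mapOutput, WHOLE_PAIR)); // [rec C, rec A, rec B]
        System.out.println(reduceOncePerGroup(mapOutput, LEFT_ONLY));  // [rec C, rec A]
    }
}
```

With the whole-pair comparison only the duplicate "rec A" disappears. With the left-only comparison "rec B" is lost too, because it shares a file key with "rec A".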
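The values are not iterated: each call to reduce handles one group of identical keys and writes exactly one record. Note that the comparator the job uses here must not be FirstComparator; it has to compare the whole TextPair (left first, then right). If keys were grouped on the left part alone, every record bound for the same file would fall into one group and only one of them would be written.
My program writes its output in RCFile format.
3. Multi-directory output: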
// In the job setup:
job.setOutputFormatClass(WapApacheMutiOutputFormat.class);
// The output format class:
public class WapApacheMutiOutputFormat extends RCFileMultipleOutputFormat<Text, BytesRefArrayWritable> {
    Random r = new Random();
    // The key has the form type_file; the text before "_" names the type directory.
    protected String generateFileNameForKeyValue(Text key, BytesRefArrayWritable value,
            Configuration conf) {
        String typedir = key.toString().split("_")[0];
        return typedir + "/" + key.toString();
    }
}
RCFileMultipleOutputFormat is a class I wrote myself by extending FileOutputFormat. The main work was implementing the RecordWriter.
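The path logic inside generateFileNameForKeyValue can be checked on its own. This is a small sketch of just that string handling; the class name `OutputPathSketch`, the method `relativePath`, and the sample keys are hypothetical.

```java
// Plain-Java sketch of the path built by generateFileNameForKeyValue.
final class OutputPathSketch {

    // The directory is the text before the first "_"; the file name is the whole key.
    static String relativePath(String key) {
        String typeDir = key.split("_")[0];
        return typeDir + "/" + key;
    }

    public static void main(String[] args) {
        System.out.println(relativePath("client_17")); // client/client_17
        System.out.println(relativePath("wap_0"));     // wap/wap_0
    }
}
```

One consequence of splitting on "_" is that a type name containing an underscore would be cut short, so this scheme works as long as type names such as other, client, and wap contain none.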
The final output is deduplicated data files, split into subdirectories.
The key to understanding all of this is the design of the partition key and how reduce works.
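To make the partitioning argument concrete, here is a plain-Java sketch. The author's partitioner is not shown in this post, so the formula below is an assumption: it hashes the left part of the key and takes it modulo the number of reduce tasks, in the style of Hadoop's default hash partitioning, and it uses `String.hashCode` rather than the hash of a Hadoop `Text`. The class and method names are made up.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java model of partitioning on the left part of the key (assumed formula, see above).
final class PartitionSketch {

    // Every record with the same left part ("type_file") lands on the same reduce task.
    static int partition(String left, int numReducers) {
        return (left.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Counts how many distinct output-file keys each reduce task would receive.
    static Map<Integer, Integer> filesPerReducer(int numReducers) {
        String[] types = {"other", "client", "wap"};
        int[] buckets = {400, 260, 60}; // bucket counts from the mapper
        Map<Integer, Integer> counts = new HashMap<>();
        for (int t = 0; t < types.length; t++) {
            for (int f = 0; f < buckets[t]; f++) {
                counts.merge(partition(types[t] + "_" + f, numReducers), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> counts = filesPerReducer(240);
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        System.out.println("file keys: " + total);
        System.out.println("reduce tasks that receive at least one file: " + counts.size());
    }
}
```

Each of the 720 file keys maps to exactly one reduce task, while a reduce task may receive several file keys. How evenly they spread depends on the hash function and on the data, which is why the reducer count and the moduli have to be tuned per data set.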
Copyright notice: This is an original blog article and may not be reproduced without consent.
Publisher: Full stack programmer webmaster. Reprints must cite the source: https://javaforall.cn/117394.html Link to the original text: https://javaforall.cn