Deduplication and multi-folder output in a single MapReduce job
2022-07-06 18:25:00 【Full stack programmer webmaster】
This post summarizes a problem I ran into in earlier work.
Background: the logs that operations pushes from the Apache server through scribe contain repeated records, so the ETL step has to deduplicate them. The output also has to be split into multiple folders by business type, so that each folder can be mounted as a partition and used later. Neither requirement is a problem when handled separately; doing both in one MapReduce job takes a little skill.
1. The map reads the input data and, after a series of processing steps, emits its output like this:
// Non-negative hash of the whole record; identical records always get the same value.
int hash = result.toString().hashCode() & 0x7FFFFFFF;
if (ttype.equals("other")) {
    file = hash % 400;
} else if (ttype.equals("client")) {
    file = hash % 260;
} else {
    file = hash % 60;
}
// Key: left = "<type>_<file>" (the output file), right = the whole record.
tp = new TextPair(ttype + "_" + file, result.toString());
context.write(tp, valuet);
valuet is empty; it carries nothing.
There are three types here: other, client, and wap, each representing the platform a log came from. The output is split into folders by these types. result is the whole record.
file determines the final output file name. It is the record's hash with the sign bit masked off, taken modulo the number of files for that type; the modulo spreads records evenly across the output files.
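The bucket computation described above can be tried outside Hadoop. The following is a minimal sketch in plain Java, not the job's actual mapper: the class name BucketKey and its methods are hypothetical, and only the type names and the moduli 400, 260, and 60 come from the article.

```java
// Sketch only: plain Java, no Hadoop dependency. BucketKey, bucketCount and
// leftKey are hypothetical names; the counts 400/260/60 come from the article.
final class BucketKey {
    private BucketKey() {}

    /** Number of output files for a log source type. */
    static int bucketCount(String type) {
        if ("other".equals(type)) return 400;
        if ("client".equals(type)) return 260;
        return 60; // every other type falls into the last branch, as in the map code
    }

    /** Left half of the map output key: "<type>_<bucket>". */
    static String leftKey(String type, String record) {
        // Masking with 0x7FFFFFFF clears the sign bit, so % never yields a negative bucket.
        int nonNegativeHash = record.hashCode() & 0x7FFFFFFF;
        return type + "_" + (nonNegativeHash % bucketCount(type));
    }

    public static void main(String[] args) {
        String record = "10.0.0.1\001GET /index\001200"; // made-up record for illustration
        System.out.println(leftKey("client", record));
        System.out.println(leftKey("client", record)); // same record, same key
    }
}
```

Because the bucket depends only on the record's content and its type, every copy of a duplicated record ends up with the same left key.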
The map output has the structure <key, value> = (ttype + "_" + file, result.toString()). The purpose is to ensure that identical records get the same key while the key also carries the type. The partitioner works on the left part of the TextPair, that is, ttype + "_" + file, which guarantees that all records to be written to the same output file later go to the same reducer.
One reducer can write several output files, but one output file cannot come from several reducers, for obvious reasons. This gives about 400 + 260 + 60 = 720 output files, each holding roughly the same amount of data. I set the job's reducer count to 240. That number, like the moduli 400, 260, and 60, was chosen from my own data to avoid data skew across reducers.
2. Deduplication in the reduce method:
public void reduce(TextPair key, Iterable<Text> values, Context context) throws IOException, InterruptedException
{
    // values is deliberately not iterated: one write per key group drops the duplicates.
    rcfileCols = getRcfileCols(key.getSecond().toString().split("\001"));
    context.write(key.getFirst(), rcfileCols);
}
The values are not iterated: records with the same key form one group, and each group is written only once. Note that the comparator the job uses here must not be FirstComparator; it has to compare the whole TextPair (left first, then right). My program writes its output in RCFile format.
3. Multi-folder output:
job.setOutputFormatClass(WapApacheMutiOutputFormat.class);

public class WapApacheMutiOutputFormat extends RCFileMultipleOutputFormat<Text, BytesRefArrayWritable> {
    Random r = new Random();

    protected String generateFileNameForKeyValue(Text key, BytesRefArrayWritable value,
            Configuration conf) {
        // The key is "<type>_<file>"; the part before the underscore names the folder.
        String typedir = key.toString().split("_")[0];
        return typedir + "/" + key.toString();
    }
}
RCFileMultipleOutputFormat here is a class I wrote myself by extending FileOutputFormat; it mainly implements the RecordWriter.
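The path mapping done in generateFileNameForKeyValue is plain string handling, so it can be checked in isolation. The sketch below is only an illustration: OutputPathSketch and relativePath are hypothetical names, and it uses String instead of the Hadoop Text type.

```java
// Sketch only: mirrors the string logic of generateFileNameForKeyValue
// without any Hadoop types. OutputPathSketch is a hypothetical name.
final class OutputPathSketch {
    private OutputPathSketch() {}

    /** Maps a left key such as "client_17" to the relative path "client/client_17". */
    static String relativePath(String leftKey) {
        // Everything before the first underscore is the type, used as the folder name.
        String typeDir = leftKey.split("_")[0];
        return typeDir + "/" + leftKey;
    }

    public static void main(String[] args) {
        System.out.println(relativePath("client_17")); // prints client/client_17
        System.out.println(relativePath("wap_0"));     // prints wap/wap_0
    }
}
```

Each type gets its own folder, and the file name inside it is the full left key, so files from different types never collide.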
The final output is deduplicated data files, organized into one folder per type.
The keys to understanding this are the design of the partition key and how reduce groups records.
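To see how partitioning on the left part and comparing on the whole pair work together, the shuffle can be simulated in memory. This is a sketch under stated assumptions, not the job's real partitioner or comparator: DedupShuffleSketch and its methods are hypothetical, pairs are modeled as two-element String arrays, and the partition formula (hash of the left part, sign bit cleared, modulo the reducer count) is an assumption modeled on common hash partitioning.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Sketch only: simulates partitioning and grouping in memory, no Hadoop dependency.
// DedupShuffleSketch and its methods are hypothetical names.
final class DedupShuffleSketch {
    private DedupShuffleSketch() {}

    /** Reducer index derived from the left half of the key only (assumed formula). */
    static int partition(String left, int numReducers) {
        return (left.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Full-pair ordering: compare left first, then right. */
    static int compareFullPair(String[] a, String[] b) {
        int byLeft = a[0].compareTo(b[0]);
        return byLeft != 0 ? byLeft : a[1].compareTo(b[1]);
    }

    /**
     * Routes {left, record} pairs to reducers and keeps one pair per distinct
     * full key, which is the effect of writing once per reduce group.
     */
    static Map<Integer, TreeSet<String[]>> shuffle(List<String[]> pairs, int numReducers) {
        Map<Integer, TreeSet<String[]>> perReducer = new TreeMap<>();
        for (String[] pair : pairs) {
            int reducer = partition(pair[0], numReducers);
            perReducer
                .computeIfAbsent(reducer,
                    k -> new TreeSet<String[]>(DedupShuffleSketch::compareFullPair))
                .add(pair); // add() ignores a pair that compares equal to an existing one
        }
        return perReducer;
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
            new String[] {"wap_1", "record-a"},
            new String[] {"wap_1", "record-a"}, // duplicate, dropped
            new String[] {"wap_1", "record-b"},
            new String[] {"client_5", "record-a"});
        shuffle(pairs, 240).forEach((reducer, group) ->
            System.out.println("reducer " + reducer + ": " + group.size() + " distinct pairs"));
    }
}
```

If the comparison looked only at the left part, as a FirstComparator would, record-a and record-b under wap_1 would collapse into a single group and one of them would be lost, which is why the whole pair has to be compared.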
Copyright notice: this is an original blog post and may not be reproduced without permission.
Publisher: Full stack programmer webmaster. When reprinting, please cite the source: https://javaforall.cn/117394.html Link to the original text: https://javaforall.cn