当前位置:网站首页>Detailed comments on MapReduce instance code on the official website
Detailed comments on MapReduce instance code on the official website
2022-07-03 15:05:00 【Brother Xing plays with the clouds】
introduction
1. This article does not describe MapReduce Introduction , There are many such knowledge online , Please check by yourself
2. The example code of this article comes from the official website
http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
final WordCount v2.0, This code is compared with org.apache.Hadoop.examples.WordCount Be complex and complete , More suitable for MapReduce Template code
3. The purpose of this article is to develop MapReduce My classmates provided a template with detailed comments , You can develop based on this template .
--------------------------------------------------------------------------------
Official website instance code ( A slight change )
WordCount2.java
import java.io.BufferedReader; import java.io.FileReader; import java.io.IOException; import java.net.URI; import java.util.ArrayList; import java.util.HashSet; import java.util.List; import java.util.Set; import java.util.StringTokenizer; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapreduce.Reducer; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; import org.apache.hadoop.mapreduce.Counter; import org.apache.hadoop.util.GenericOptionsParser; import org.apache.hadoop.util.StringUtils; public class WordCount2 { // Log group name MapCounters, The log of INPUT_WORDS static enum MapCounters { INPUT_WORDS } static enum ReduceCounters { OUTPUT_WORDS } // static enum CountersEnum { INPUT_WORDS,OUTPUT_WORDS } // Log group name CountersEnum, The log of INPUT_WORDS and OUTPUT_WORDS public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); // map Output value private Text word = new Text(); // map Output key private boolean caseSensitive; // Case sensitive or not , Read the assignment from the configuration file private Set<String> patternsToSkip = new HashSet<String>(); // Used to save keywords to be filtered , Read the assignment from the configuration file private Configuration conf; private BufferedReader fis; // Save file input stream /** * Whole setup Just two things : 1. Read... In the configuration file wordcount.case.sensitive, Assign a value to caseSensitive Variable * 2. Read... In the configuration file wordcount.skip.patterns, If true, take CacheFiles All files are added to the filtering range */ @Override public void setup(Context context) throws IOException, InterruptedException { conf = context.getConfiguration(); // getBoolean(String name, boolean defaultValue) // obtain name Specify the value of the property , If the attribute is not specified , Or the specified value is invalid , Just use defaultValue return . // Attributes can be passed on the command line -Dpropretyname Appoint , for example -Dwordcount.case.sensitive=true // Attributes can also be found in main Function through job.getConfiguration().setBoolean("wordcount.case.sensitive", // true) Appoint caseSensitive = conf.getBoolean("wordcount.case.sensitive", true); // In the configuration file wordcount.case.sensitive Whether the function is turned on // wordcount.skip.patterns The value of the attribute depends on whether the command line parameter has -skip, Specific logic in main In the method if (conf.getBoolean("wordcount.skip.patterns", false)) { // In the configuration file wordcount.skip.patterns Whether the function is turned on URI[] patternsURIs = Job.getInstance(conf).getCacheFiles(); // getCacheFiles() Method can fetch the cached localization file , In this case, main Set up for (URI patternsURI : patternsURIs) { // every last patternsURI All represent a file Path patternsPath = new Path(patternsURI.getPath()); String patternsFileName = patternsPath.getName().toString(); parseSkipFile(patternsFileName); // Add the file to the filtering range , See... For specific logic parseSkipFile(String // fileName) } } } /** * Add the contents of the specified file into the filtering range * * @param fileName */ private void parseSkipFile(String fileName) { try { fis = new BufferedReader(new FileReader(fileName)); String pattern = null; while ((pattern = fis.readLine()) != null) { // SkipFile Every line of is one that needs to be filtered pattern, for example \! patternsToSkip.add(pattern); } } catch (IOException ioe) { System.err .println("Caught exception while parsing the cached file '" + StringUtils.stringifyException(ioe)); } } @Override public void map(Object key, Text value, Context context) throws IOException, InterruptedException { // there caseSensitive stay setup() Method String line = (caseSensitive) ? value.toString() : value.toString() .toLowerCase(); // If case sensitivity is set , Just leave it as it is , Otherwise, all are converted to lowercase for (String pattern : patternsToSkip) { // All the data will meet patternsToSkip Of pattern All filtered out line = line.replaceAll(pattern, ""); } StringTokenizer itr = new StringTokenizer(line); // take line With \t\n\r\f Delimit the separator while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); // getCounter(String groupName, String counterName) Counter // The name of the enumeration type is the name of the Group , The field of enumeration type is the counter name Counter counter = context.getCounter( MapCounters.class.getName(), MapCounters.INPUT_WORDS.toString()); counter.increment(1); } } } /** * Reducer There are no special upgrade features * * @author Administrator */ public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> { private IntWritable result = new IntWritable(); public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } result.set(sum); context.write(key, result); Counter counter = context.getCounter( ReduceCounters.class.getName(), ReduceCounters.OUTPUT_WORDS.toString()); counter.increment(1); } } public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); GenericOptionsParser optionParser = new GenericOptionsParser(conf, args); /** * The command line syntax is :hadoop command [genericOptions] [application-specific * arguments] getRemainingArgs() All I got was [application-specific arguments] * such as :$ bin/hadoop jar wc.jar WordCount2 -Dwordcount.case.sensitive=true * /user/joe/wordcount/input /user/joe/wordcount/output -skip * /user/joe/wordcount/patterns.txt * getRemainingArgs() What you get is /user/joe/wordcount/input * /user/joe/wordcount/output -skip /user/joe/wordcount/patterns.txt */ String[] remainingArgs = optionParser.getRemainingArgs(); // remainingArgs.length == 2 when , Including input and output paths : ///user/joe/wordcount/input /user/joe/wordcount/output // remainingArgs.length == 4 when , Include input and output paths and skip files : ///user/joe/wordcount/input /user/joe/wordcount/output -skip /user/joe/wordcount/patterns.txt if (!(remainingArgs.length != 2 || remainingArgs.length != 4)) { System.err .println("Usage: wordcount <in> <out> [-skip skipPatternFile]"); System.exit(2); } Job job = Job.getInstance(conf, "word count"); job.setJarByClass(WordCount2.class); job.setMapperClass(TokenizerMapper.class); job.setCombinerClass(IntSumReducer.class); job.setReducerClass(IntSumReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); List<String> otherArgs = new ArrayList<String>(); // except -skip Other parameters except for (int i = 0; i < remainingArgs.length; ++i) { if ("-skip".equals(remainingArgs[i])) { job.addCacheFile(new Path(remainingArgs[++i]).toUri()); // take // -skip // Later parameters , namely skip Schema file url, Add to the localization cache job.getConfiguration().setBoolean("wordcount.skip.patterns", true); // Set up here wordcount.skip.patterns attribute , stay mapper Use in } else { otherArgs.add(remainingArgs[i]); // Will be in addition to -skip // Add other parameters except otherArgs in } } FileInputFormat.addInputPath(job, new Path(otherArgs.get(0))); // otherArgs The first parameter of is the input path FileOutputFormat.setOutputPath(job, new Path(otherArgs.get(1))); // otherArgs The second parameter of is the output path System.exit(job.waitForCompletion(true) ? 0 : 1); } }
边栏推荐
- mmdetection 学习率与batch_size关系
- 什么是Label encoding?one-hot encoding ,label encoding两种编码该如何区分和使用?
- 创业团队如何落地敏捷测试,提升质量效能?丨声网开发者创业讲堂 Vol.03
- 什么是embedding(把物体编码为一个低维稠密向量),pytorch中nn.Embedding原理及使用
- Yolov5进阶之九 目标追踪实例1
- 【可能是全中文网最全】pushgateway入门笔记
- PS tips - draw green earth with a brush
- Container of symfony
- 远程服务器后台挂起 nohup
- NOI OPENJUDGE 1.3(06)
猜你喜欢
[graphics] hair simulation in tressfx
[engine development] in depth GPU and rendering optimization (basic)
[transform] [practice] use pytoch's torch nn. Multiheadattention to realize self attention
[ue4] material and shader permutation
【Transformer】入门篇-哈佛Harvard NLP的原作者在2018年初以逐行实现的形式呈现了论文The Annotated Transformer
远程服务器后台挂起 nohup
ASTC texture compression (adaptive scalable texture compression)
Didi off the shelf! Data security is national security
Besides lying flat, what else can a 27 year old do in life?
How can entrepreneurial teams implement agile testing to improve quality and efficiency? Voice network developer entrepreneurship lecture Vol.03
随机推荐
Write a 2-minute countdown.
Global and Chinese market of solder bars 2022-2028: Research Report on technology, participants, trends, market size and share
Simulation of LS -al command in C language
socket.io搭建分布式Web推送服务器
Global and Chinese market of marketing automation 2022-2028: Research Report on technology, participants, trends, market size and share
Tensor ellipsis (three points) slice
Mmdetection learning rate and batch_ Size relationship
Leetcode sword offer find the number I (nine) in the sorted array
Global and Chinese markets for infrared solutions (for industrial, civil, national defense and security applications) 2022-2028: Research Report on technology, participants, trends, market size and sh
[graphics] real shading in Unreal Engine 4
B2020 points candy
Centos7 deployment sentry redis (with architecture diagram, clear and easy to understand)
Vs+qt multithreading implementation -- run and movetothread
QT program font becomes larger on computers with different resolutions, overflowing controls
Global and Chinese markets for ionization equipment 2022-2028: Research Report on technology, participants, trends, market size and share
[transform] [NLP] first proposed transformer. The 2017 paper "attention is all you need" by Google brain team
Global and Chinese market of lighting control components 2022-2028: Research Report on technology, participants, trends, market size and share
使用Tengine解决负载均衡的Session问题
NOI OPENJUDGE 1.6(09)
PHP GD image upload bypass