MapReduce programming basics
2022-07-01 09:23:00 【Half haochunshui】
(I) Basics of MapReduce programming for word-frequency statistics.
① With the /user/hadoop/input folder in HDFS empty, create the files wordfile1.txt and wordfile2.txt and upload them to the input folder in HDFS.
The content of wordfile1.txt is as follows:
I love Spark
I love Hadoop
The content of wordfile2.txt is as follows:
Hadoop is good
Spark is fast
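For example, the two files could be created in the local Linux file system like this (placing them under /usr/local/hadoop, where the later upload step expects them):
cd /usr/local/hadoop
printf "I love Spark\nI love Hadoop\n" > wordfile1.txt
printf "Hadoop is good\nSpark is fast\n" > wordfile2.txt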
② Start Eclipse. After startup, the dialog shown in the figure below pops up, prompting you to set the workspace. You can simply use the default setting "/home/hadoop/workspace" and click the "OK" button. Because the Linux system is currently logged in as the hadoop user, the default workspace directory is located under the hadoop user's home directory, "/home/hadoop".
③ After Eclipse has started, choose the "File->New->Java Project" menu to start creating a Java project.
④ Enter the project name "WordCount" in the "Project name" field and check "Use default location", so that all files of this Java project are saved under the "/home/hadoop/workspace/WordCount" directory. In the "JRE" tab, you can select the JDK already installed on the current Linux system, such as jdk1.8.0_162. Then click the "Next>" button at the bottom of the dialog to go to the next step.
⑤ In the next step, you need to load the JAR packages required by this Java project; these JAR packages contain the Hadoop-related Java API. They are all located in the Hadoop installation directory of the Linux system, which for this tutorial is "/usr/local/hadoop/share/hadoop". Click the "Libraries" tab, then click the "Add External JARs…" button on the right of the dialog; the interface shown in the figure below pops up.
⑥ In this dialog there is a row of directory buttons at the top (namely "usr", "local", "hadoop", "share", "hadoop", "mapreduce" and "lib"); when you click a directory button, the contents of that directory are listed below.
To write a MapReduce program, you generally need to add the following JAR packages to the Java project:
a. hadoop-common-3.1.3.jar and hadoop-nfs-3.1.3.jar in the "/usr/local/hadoop/share/hadoop/common" directory;
b. all JAR packages under the "/usr/local/hadoop/share/hadoop/common/lib" directory;
c. all JAR packages under the "/usr/local/hadoop/share/hadoop/mapreduce" directory, excluding the jdiff, lib, lib-examples and sources subdirectories.
⑦ Write the Java application, namely WordCount.java. In the "Package Explorer" panel on the left of the Eclipse workbench (shown in the figure below), find the project "WordCount" just created, right-click the project name, and choose "New->Class" from the pop-up menu.
⑧ After choosing the "New->Class" menu, the dialog shown in the figure below appears. In this dialog you only need to enter the name of the new Java class file in the "Name" field; the name "WordCount" is used here. The other settings can be left at their defaults. Then click the "Finish" button at the bottom right of the dialog.
⑨ You can see that Eclipse has automatically created a source file named "WordCount.java" that contains the code "public class WordCount{}". Clear the code in this file and then type in the complete word-frequency statistics program (a sketch is shown below).
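The full listing is not reproduced in this article; a minimal sketch of such a word-count program, based on the standard Hadoop WordCount example (class names and details here are illustrative and may differ from the listing the tutorial intends), is:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
    // Mapper: emit <word, 1> for every token in an input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }
    // Reducer: sum the counts collected for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // summing counts is safe to combine
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // the input and output directories are taken from the command line
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}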
(II) Configure the Eclipse environment and run the word-frequency statistics program.
(1) Compile and package the program
① To compile and run the code written above, click the shortcut button for running programs in the upper part of the Eclipse workbench; when you hover over it, choose "Run As" from the pop-up menu and then "Java Application", as shown in the figure below.
② The dialog shown in the figure below then pops up; click the "OK" button in its lower right corner to start running the program.
③ After the program finishes running, the results are displayed in the "Console" panel at the bottom (as shown in the figure below).
④ Now the Java application can be packaged into a JAR file and deployed to run on the Hadoop platform. The word-frequency statistics program will be placed under the "/usr/local/hadoop/myapp" directory. If the directory does not exist, you can create it with the following commands.
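cd /usr/local/hadoop
mkdir myapp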
⑤ In the "Package Explorer" panel on the left of the Eclipse workbench, right-click the project name "WordCount" and choose "Export" from the pop-up menu, as shown in the figure below.
⑥ The dialog shown in the figure below then pops up; select "Runnable JAR file" in this dialog.
⑦ Then click the "Next>" button; the dialog shown in the figure below pops up. In this dialog, "Launch configuration" sets the main class that runs when the generated JAR package is deployed and started; select the class just written, "WordCount-WordCount", from the drop-down list. "Export destination" sets the directory to which the JAR package is written, for example "/usr/local/hadoop/myapp/WordCount.jar". Under "Library handling", choose "Extract required libraries into generated JAR".
⑧ Then click the "Finish" button; the interface shown in the figure below appears.
⑨ You can ignore the information in this dialog and directly click the "OK" button in the lower right corner to start the packaging process. After packaging finishes, a warning dialog appears, as shown in the figure below.
⑩ You can ignore the information in this dialog as well and directly click the "OK" button in the lower right corner. At this point, the WordCount project has been successfully packaged into WordCount.jar. You can check the generated WordCount.jar file by executing the following command in a Linux terminal; you will see that a WordCount.jar file already exists in the "/usr/local/hadoop/myapp" directory.
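For example:
ls /usr/local/hadoop/myapp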
(2) Run the program
① Before running the program, you need to start Hadoop.
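Assuming the pseudo-distributed installation under "/usr/local/hadoop" used throughout this tutorial, HDFS can be started with, for example:
cd /usr/local/hadoop
./sbin/start-dfs.sh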
② After starting Hadoop, first delete the input and output directories in HDFS that correspond to the current Linux user hadoop (that is, the "/user/hadoop/input" and "/user/hadoop/output" directories in HDFS), to make sure the later steps run without problems.
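For example (messages about a non-existent path can be ignored):
cd /usr/local/hadoop
./bin/hdfs dfs -rm -r input
./bin/hdfs dfs -rm -r output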
③ Then create in HDFS the input directory corresponding to the current Linux user hadoop, namely the "/user/hadoop/input" directory.
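For example (a relative path in HDFS resolves against the user's home directory, /user/hadoop):
cd /usr/local/hadoop
./bin/hdfs dfs -mkdir input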
④ Then upload the two files created earlier in the Linux local file system, wordfile1.txt and wordfile2.txt (both located under the "/usr/local/hadoop" directory and containing a few English sentences), to the "/user/hadoop/input" directory in HDFS.
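For example:
cd /usr/local/hadoop
./bin/hdfs dfs -put ./wordfile1.txt input
./bin/hdfs dfs -put ./wordfile2.txt input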
⑤ If the "/user/hadoop/output" directory already exists in HDFS, delete it with the following command.
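For example:
cd /usr/local/hadoop
./bin/hdfs dfs -rm -r /user/hadoop/output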
⑥ Now the program can be run in the Linux system with the hadoop jar command. When the command finishes and the run completes successfully, the job's summary information is printed on the screen.
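A typical invocation looks like this (assuming, as in the sketch above, that the program takes the input and output directories as its two command-line arguments):
cd /usr/local/hadoop
./bin/hadoop jar ./myapp/WordCount.jar input output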
⑦ At this point, the word-frequency statistics results have been written to the "/user/hadoop/output" directory in HDFS; executing the following command displays the word-frequency results on the screen.
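For example:
cd /usr/local/hadoop
./bin/hdfs dfs -cat output/*
With the two sample files above, the result should look similar to:
Hadoop	2
I	2
Spark	2
fast	1
good	1
is	2
love	2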
At this point, the word-frequency statistics program has run to completion. Note that if you want to run WordCount.jar again, you must first delete the output directory in HDFS; otherwise an error will be reported.
(III) Write a MapReduce program that calculates average scores.
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class Score {
public static class Map extends
Mapper<LongWritable, Text, Text, IntWritable> {
// Implement the map function
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
// Convert each line of the input plain-text file into a String
String line = value.toString();
// First split the input data into lines
StringTokenizer tokenizerArticle = new StringTokenizer(line, "\n");
// Process each line separately
while (tokenizerArticle.hasMoreElements()) {
// Split each line on whitespace
StringTokenizer tokenizerLine = new StringTokenizer(tokenizerArticle.nextToken());
String strName = tokenizerLine.nextToken(); // student name field
String strScore = tokenizerLine.nextToken(); // score field
Text name = new Text(strName);
int scoreInt = Integer.parseInt(strScore);
// Output name and grade
context.write(name, new IntWritable(scoreInt));
}
}
}
public static class Reduce extends
Reducer<Text, IntWritable, Text, IntWritable> {
// Implement the reduce function
public void reduce(Text key, Iterable<IntWritable> values,
Context context) throws IOException, InterruptedException {
int sum = 0;
int count = 0;
Iterator<IntWritable> iterator = values.iterator();
while (iterator.hasNext()) {
sum += iterator.next().get();// Calculate the total score
count++;// Count the total number of subjects
}
int average = sum / count;// Calculate the integer average
context.write(key, new IntWritable(average));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
// "localhost:9000" It needs to be set according to the actual situation
conf.set("mapred.job.tracker", "localhost:9000");
// Input and output directories in the HDFS file system
String[] ioArgs = new String[] {
"input/score", "output" };
String[] otherArgs = new GenericOptionsParser(conf, ioArgs).getRemainingArgs();
if (otherArgs.length != 2) {
System.err.println("Usage: Score Average <in> <out>");
System.exit(2);
}
Job job = Job.getInstance(conf, "Score Average");
job.setJarByClass(Score.class);
// Set the Mapper and Reducer classes
// (the Reducer is not reused as a Combiner here, because averaging partial averages would give a wrong result)
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
// Set output type
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// Split the input data set into splits and provide a RecordReader implementation
job.setInputFormatClass(TextInputFormat.class);
// Provide a RecordWriter implementation, responsible for writing the output data
job.setOutputFormatClass(TextOutputFormat.class);
// Set input and output directories
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
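A minimal sketch of how this program could be tried out, assuming the same installation as above and that the project has been packaged into /usr/local/hadoop/myapp/Score.jar (the file name, paths and sample data below are illustrative, not taken from the original article):

cd /usr/local/hadoop
# hypothetical input: one "name score" pair per line
printf "zhangsan 88\nzhangsan 92\nlisi 75\nlisi 85\n" > score.txt
./bin/hdfs dfs -mkdir -p input/score
./bin/hdfs dfs -put ./score.txt input/score
# the main method defaults to input/score and output as its directories
./bin/hadoop jar ./myapp/Score.jar Score
./bin/hdfs dfs -cat output/*
# expected result: lisi 80 and zhangsan 90 (integer averages)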
(IV) What is the working principle of MapReduce?
The working principle of MapReduce can be described from the perspective of the Client, the JobTracker, and the TaskTracker.

First, the client prepares the MapReduce program and configures it as a MapReduce job; the next step is to start the job. Starting the job means informing the JobTracker that a job is to be run, at which point the JobTracker returns a new job ID to the client and then performs a series of checks. It checks whether the output directory already exists: if it does, the job cannot run normally and the JobTracker returns an error to the client. It then checks whether the input directory exists: if it does not, the same kind of error is thrown; if it does, the JobTracker computes the input splits based on the input, and an error is thrown if the splits cannot be computed. Once all of this is done, the JobTracker allocates the resources the job needs. After obtaining the job ID, the client copies the resources required to run the job to HDFS, including the JAR file into which the MapReduce program has been packaged, the configuration file, and the computed input split information. These files are stored in a folder that the JobTracker creates specifically for this job, named after the job ID. The JAR file is replicated 10 times by default (controlled by the mapred.submit.replication property); the input split information tells the JobTracker how many map tasks should be started for this job. When the resource folder has been created, the client submits the job, informing the JobTracker that the required resources have been written to HDFS and asking it to actually execute the job.
After the resources have been allocated, the JobTracker receives the job-submission request and initializes the job. Initialization mainly means putting the job into an internal queue, where it waits for the job scheduler to schedule it. When the job scheduler selects the job according to its own scheduling algorithm, it creates an object representing the running job (encapsulating the tasks and bookkeeping information) so that the JobTracker can track the job's status and progress. After the job object has been created, the scheduler retrieves the input split information from the HDFS folder, creates one map task for each input split, and assigns the map tasks to TaskTrackers for execution. For map and reduce tasks, each TaskTracker has a fixed number of map slots and reduce slots, determined by the number of cores and the amount of memory on the host. It should be emphasized that map tasks are not assigned to TaskTrackers at random; data locality is taken into account, as discussed later.
Next comes task assignment. The TaskTracker runs a simple loop that sends a heartbeat to the JobTracker every 5 seconds; this interval can be configured by the programmer. The heartbeat is the communication bridge between the JobTracker and the TaskTracker: through it, the JobTracker can monitor whether the TaskTracker is alive and obtain its status and any problems, while the TaskTracker can obtain, from the heartbeat's return value, the instructions the JobTracker gives it. The TaskTracker then fetches the resources needed to run the job, such as the program code, in preparation for actually executing it. Once the tasks have been assigned, they are executed. While tasks are running, the JobTracker monitors the status and progress of each TaskTracker through the heartbeat mechanism and can also compute the status and progress of the whole job, while each TaskTracker also monitors its own status and progress locally. The TaskTracker sends a heartbeat to the JobTracker at regular intervals to report that it is still running; the heartbeat also carries a great deal of information, such as the progress of the current map task. When the JobTracker receives the success report from the last TaskTracker to finish its assigned task, it sets the status of the whole job to successful. When the client then queries the job status (note: this is an asynchronous operation), it receives the notification that the job has completed. If the job fails partway through, MapReduce also has corresponding mechanisms to handle it; generally speaking, as long as the program itself has no bugs introduced by the programmer, the MapReduce error-handling mechanism can ensure that the submitted job completes normally.
(V) In what ways can Hadoop run a MapReduce program?
① Connect the development tool to Hadoop (for example, link Eclipse to Hadoop) and run the program directly.
② Package the MapReduce program into a JAR file and run it on Hadoop (for example, with the hadoop jar command, as shown earlier).