MapReduce Instance (III): Data Deduplication
2022-07-27 16:15:00, by Laugh at Fengyun Road
Implementing data deduplication with MapReduce
Hello everyone, I am Fengyun. Welcome to my blog and my WeChat official account 【Laugh at Fengyun Road】. Let's learn big data technologies together, keep working hard, and meet a better version of ourselves!
Implementation idea
The goal of data deduplication is that any record appearing more than once in the input appears exactly once in the output. In MapReduce, the <key,value> pairs emitted by map are aggregated by the shuffle phase into <key,value-list> pairs before they reach reduce. This suggests a natural approach: let the shuffle deliver all occurrences of the same record to the same reducer, and no matter how many times that record appears, output it only once. Concretely, the record itself should become the reduce input key, and the value-list is irrelevant (it can be set to null). When reduce receives a <key,value-list>, it copies the input key directly to the output key, sets the output value to null, and emits <key,value>.
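The idea above can be illustrated without Hadoop at all. The following is a minimal local sketch (names like `DedupSketch` are my own, not from the original article): a `TreeMap` stands in for the shuffle phase, collapsing identical keys into one entry and sorting them, and the "reduce" loop emits each key exactly once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

public class DedupSketch {
    // Simulate the MapReduce flow locally: "map" emits <line, null>,
    // the TreeMap plays the role of shuffle (identical keys collapse into
    // one entry, keys come out sorted), and "reduce" writes each key once.
    public static List<String> dedup(List<String> lines) {
        TreeMap<String, List<Object>> grouped = new TreeMap<>();
        for (String line : lines) {                        // "map" phase
            grouped.computeIfAbsent(line, k -> new ArrayList<>()).add(null);
        }
        // "reduce" phase: one output record per distinct key
        return new ArrayList<>(grouped.keySet());
    }

    public static void main(String[] args) {
        System.out.println(dedup(Arrays.asList("b", "a", "b", "a", "c")));
        // prints [a, b, c]
    }
}
```

The value-lists are built but never read, mirroring how the real reducer below ignores its `values` argument.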
The MapReduce deduplication flow is shown in the figure below:
Writing the code
Mapper Code
public static class Map extends Mapper<Object, Text, Text, NullWritable> {
    // Reusable Text object that holds the output key for each record
    private static Text newKey = new Text();

    // The map function copies the relevant input field into the output key
    // and emits it directly, with a null value
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        System.out.println(line);
        String[] arr = line.split("\t");
        newKey.set(arr[1]);  // the goods id field becomes the key
        context.write(newKey, NullWritable.get());
        System.out.println(newKey);
    }
}
The Mapper stage uses Hadoop's default job input format. Each input value is split with the split() method; the goods id field extracted from it becomes the key, the value is set to null, and the <key,value> pair is emitted directly.
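To see what `split("\t")` and `arr[1]` do here, consider a small standalone demo (the sample values are hypothetical, in the buyer-id/goods-id shape the article assumes):

```java
public class SplitDemo {
    public static void main(String[] args) {
        // Hypothetical input line: buyer id and goods id separated by a tab
        String line = "10181\t1000481";
        String[] arr = line.split("\t");
        // arr[1] is the goods id field that the Mapper uses as its key
        System.out.println(arr[1]);  // prints 1000481
        // A malformed line with no tab leaves arr with a single element,
        // so arr[1] would throw ArrayIndexOutOfBoundsException; a more
        // defensive mapper would check arr.length before indexing
        if (arr.length > 1) {
            System.out.println("key=" + arr[1]);
        }
    }
}
```

The article's Mapper indexes `arr[1]` unconditionally, so it relies on every input line being well-formed.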
Reducer Code
public static class Reduce extends Reducer<Text, NullWritable, Text, NullWritable> {
    // The reduce function writes each distinct key exactly once,
    // ignoring the (empty) value list
    public void reduce(Text key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        context.write(key, NullWritable.get());
    }
}
The <key,value> pairs emitted by map go through the shuffle phase, where they are aggregated into <key,value-list> pairs and handed to the reduce function. Regardless of how many values each key carries, reduce simply copies the input key to the output key, sets the output value to null, and emits <key,value>.
Complete code
package mapreduce;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class Filter{
public static class Map extends Mapper<Object , Text , Text , NullWritable>{
private static Text newKey=new Text();
public void map(Object key,Text value,Context context) throws IOException, InterruptedException{
String line=value.toString();
System.out.println(line);
String arr[]=line.split("\t");
newKey.set(arr[1]);
context.write(newKey, NullWritable.get());
System.out.println(newKey);
}
}
public static class Reduce extends Reducer<Text, NullWritable, Text, NullWritable>{
public void reduce(Text key,Iterable<NullWritable> values,Context context) throws IOException, InterruptedException{
context.write(key,NullWritable.get());
}
}
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException{
Configuration conf=new Configuration();
System.out.println("start");
Job job = Job.getInstance(conf, "filter");
job.setJarByClass(Filter.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
Path in=new Path("hdfs://localhost:9000/mymapreduce2/in/buyer_favorite1");
Path out=new Path("hdfs://localhost:9000/mymapreduce2/out");
FileInputFormat.addInputPath(job,in);
FileOutputFormat.setOutputPath(job,out);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
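One optional tweak, not in the original code: because this Reducer is idempotent (it merely re-emits each key once), the same class can double as a combiner, collapsing duplicate keys on each mapper's local output before the shuffle and reducing network traffic. It would slot into the job setup in main:

```java
// Run the same Reduce class locally on each mapper's output before the
// shuffle, so duplicate keys are collapsed early. Safe here because
// Reduce just re-emits each key once regardless of how many values it has.
job.setCombinerClass(Reduce.class);
```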
-------------- end ----------------
WeChat official account: scan the QR code below, or search for Laugh at Fengyun Road to follow.