MapReduce instance (VI): inverted index
2022-07-06 09:33:00 【Laugh at Fengyun Road】
Implementing an inverted index with MapReduce
Hello everyone, I am Fengyun. Welcome to my blog and my WeChat official account 【Laugh at Fengyun Road】. In the days ahead, let's learn big data technologies together, keep working hard, and meet a better version of ourselves!
The principle of the inverted index
- The "inverted index" is the most commonly used data structure in document retrieval systems and is widely applied in full-text search engines.
- It stores, for each word (or phrase), a mapping to its locations in a document or a set of documents; in other words, it lets you find documents by their content. Because it inverts the usual direction of "given a document, determine its content", it is called an inverted index (Inverted Index).
- The three pieces of information an inverted index implementation cares about are: the word, the document URL, and the word frequency.
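To make the principle concrete before diving into MapReduce, here is a minimal sketch of an inverted index in plain Java, with no Hadoop dependency. The class name, document names, and contents are made up purely for illustration:

```java
import java.util.*;

public class InvertedIndexSketch {
    // Build a word -> (document -> frequency) mapping from named documents
    public static Map<String, Map<String, Integer>> build(Map<String, String> docs) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().split("\\s+")) {
                index.computeIfAbsent(word, w -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("doc1.txt", "hello world hello");
        docs.put("doc2.txt", "hello hadoop");
        // Look up documents by content: which documents contain "hello", and how often?
        System.out.println(build(docs).get("hello")); // {doc1.txt=2, doc2.txt=1}
    }
}
```

The lookup direction is the point: given a word, the index returns the documents that contain it together with each document's frequency, which is exactly what the MapReduce job below computes at scale.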
Implementation approach
Following the MapReduce programming model, the inverted index is designed as follows:
(1) The Map phase
First, the default TextInputFormat class processes the input files, producing each line's byte offset together with its content. The Map phase must therefore parse each input <key, value> pair to extract the three pieces of information the inverted index needs: the word, the document URL, and the word frequency. The map operation then preprocesses the data it reads, as shown in the figure below.
This raises two problems:
First, a <key, value> pair can carry only two values. Without defining a custom Hadoop data type, two of the three pieces of information must be combined into a single field, used as either the key or the value.
Second, a single Reduce phase cannot both count word frequencies and generate the document list, so a Combine phase must be added to complete the frequency counting.
Here the product ID and the file URL are combined into the key (e.g. "1024600:goods3"), and the word frequency (the number of occurrences of the product ID) is the value. The benefit is that the map-side sort built into the MapReduce framework collects the frequencies of the same word in the same document into a list and passes it to the Combine phase, which performs a WordCount-like aggregation.
(2) The Combine phase
After the map method runs, the Combine phase accumulates the values that share the same key, yielding the frequency of each word in each document, as shown in the figure below. If this output were used directly as the input to the Reduce phase, the Shuffle phase would face a problem: all records for the same word (consisting of the word, the URL, and the frequency) must go to the same Reducer, but the current key does not guarantee that. The key and value therefore have to be reshaped: this time the word (the product ID) becomes the key, and the URL together with the frequency becomes the value (e.g. "goods3:1"). The benefit is that the framework's default HashPartitioner class can then complete the Shuffle phase, sending all records for the same word to the same Reducer.
(3) The Reduce phase
After the two phases above, the Reduce phase only needs to concatenate all the values that share the same key into the format required by the inverted index file; everything else can be handed directly to the MapReduce framework, as shown in the figure below.
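The three phases above can be simulated end to end in plain Java, without Hadoop, to see how the key/value pairs are reshaped at each step. The helper names (mapPhase, combinePhase, reducePhase) and the sample IDs are illustrative only; this is a sketch of the data flow, not of the framework:

```java
import java.util.*;

public class PhaseSimulation {
    // Map phase: each record becomes ("<id>:<file>", "1")
    public static List<String[]> mapPhase(String file, List<String> ids) {
        List<String[]> out = new ArrayList<>();
        for (String id : ids) out.add(new String[]{id + ":" + file, "1"});
        return out;
    }

    // Combine phase: sum the counts per "<id>:<file>" key,
    // then re-key each record to ("<id>", "<file>:<sum>")
    public static Map<String, List<String>> combinePhase(List<String[]> mapped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] kv : mapped) counts.merge(kv[0], Integer.parseInt(kv[1]), Integer::sum);
        Map<String, List<String>> out = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int colon = e.getKey().indexOf(':');
            out.computeIfAbsent(e.getKey().substring(0, colon), k -> new ArrayList<>())
               .add(e.getKey().substring(colon + 1) + ":" + e.getValue());
        }
        return out;
    }

    // Reduce phase: join all "<file>:<count>" entries for one id into a single line
    public static String reducePhase(List<String> values) {
        StringBuilder list = new StringBuilder();
        for (String v : values) list.append(v).append(";");
        return list.toString();
    }

    public static void main(String[] args) {
        List<String[]> mapped = mapPhase("goods3", Arrays.asList("1024600", "1024600"));
        Map<String, List<String>> combined = combinePhase(mapped);
        System.out.println("1024600\t" + reducePhase(combined.get("1024600")));
    }
}
```

Note how the key changes shape between phases: "1024600:goods3" in the map output, then "1024600" after the combiner, which mirrors the re-keying the Hadoop code below performs.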
Writing the code
Map code
As described above, the map method parses each input <key, value> pair, combines the product ID and the file name into the output key (a pair can hold only two fields without a custom Hadoop data type), and emits a count of "1" as the value. The frequency counting itself is deferred to the Combine phase, since a single Reduce phase cannot both count frequencies and generate the document list.
public static class doMapper extends Mapper<Object, Text, Text, Text> {
    private Text myKey = new Text();   // stores the "<id>:<file name>" combination
    private Text myValue = new Text(); // stores the word frequency ("1")

    @Override // implement the map function
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Identify which input file this record came from
        String filePath = ((FileSplit) context.getInputSplit()).getPath().toString();
        if (filePath.contains("goods")) {
            String[] val = value.toString().split("\t");
            int splitIndex = filePath.indexOf("goods");
            myKey.set(val[0] + ":" + filePath.substring(splitIndex));
        } else if (filePath.contains("order")) {
            String[] val = value.toString().split("\t");
            int splitIndex = filePath.indexOf("order");
            myKey.set(val[2] + ":" + filePath.substring(splitIndex));
        } else {
            return; // skip records from unrecognized files instead of emitting a stale key
        }
        myValue.set("1");
        context.write(myKey, myValue);
    }
}
Combiner code
As described above, the combiner accumulates the values for each "<id>:<file>" key to obtain the per-document frequency, then re-keys the record: the product ID alone becomes the key and "<file>:<count>" becomes the value, so that the default HashPartitioner sends all records for the same ID to the same Reducer.
public static class doCombiner extends Reducer<Text, Text, Text, Text> {
    private Text myK = new Text();
    private Text myV = new Text();

    @Override // implement the reduce function
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Sum the counts for this "<id>:<file>" key
        int sum = 0;
        for (Text value : values) {
            sum += Integer.parseInt(value.toString());
        }
        // Re-key: the id alone becomes the key, "<file>:<count>" the value
        int mysplit = key.toString().indexOf(":");
        myK.set(key.toString().substring(0, mysplit));
        myV.set(key.toString().substring(mysplit + 1) + ":" + sum);
        context.write(myK, myV);
    }
}
Reduce code
As described above, the reducer only needs to join all the "<file>:<count>" values for each key into a single document list; the rest is handled by the MapReduce framework.
public static class doReducer extends Reducer<Text, Text, Text, Text> {
    private Text myK = new Text();
    private Text myV = new Text();

    @Override // implement the reduce function
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Join all "<file>:<count>" entries into the document list
        StringBuilder myList = new StringBuilder();
        for (Text value : values) {
            myList.append(value.toString()).append(";");
        }
        myK.set(key);
        myV.set(myList.toString());
        context.write(myK, myV);
    }
}
Complete code
package mapreduce;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MyIndex {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Job job = Job.getInstance();
job.setJobName("InversedIndexTest");
job.setJarByClass(MyIndex.class);
job.setMapperClass(doMapper.class);
job.setCombinerClass(doCombiner.class);
job.setReducerClass(doReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
Path in1 = new Path("hdfs://localhost:9000/mymapreduce9/in/goods3");
Path in2 = new Path("hdfs://localhost:9000/mymapreduce9/in/goods_visit3");
Path in3 = new Path("hdfs://localhost:9000/mymapreduce9/in/order_items3");
Path out = new Path("hdfs://localhost:9000/mymapreduce9/out");
FileInputFormat.addInputPath(job, in1);
FileInputFormat.addInputPath(job, in2);
FileInputFormat.addInputPath(job, in3);
FileOutputFormat.setOutputPath(job, out);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
    public static class doMapper extends Mapper<Object, Text, Text, Text> {
        private Text myKey = new Text();   // "<id>:<file name>" combination
        private Text myValue = new Text(); // word frequency ("1")

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String filePath = ((FileSplit) context.getInputSplit()).getPath().toString();
            if (filePath.contains("goods")) {
                String[] val = value.toString().split("\t");
                int splitIndex = filePath.indexOf("goods");
                myKey.set(val[0] + ":" + filePath.substring(splitIndex));
            } else if (filePath.contains("order")) {
                String[] val = value.toString().split("\t");
                int splitIndex = filePath.indexOf("order");
                myKey.set(val[2] + ":" + filePath.substring(splitIndex));
            } else {
                return; // skip records from unrecognized files instead of emitting a stale key
            }
            myValue.set("1");
            context.write(myKey, myValue);
        }
    }
    public static class doCombiner extends Reducer<Text, Text, Text, Text> {
        private Text myK = new Text();
        private Text myV = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (Text value : values) {
                sum += Integer.parseInt(value.toString());
            }
            int mysplit = key.toString().indexOf(":");
            myK.set(key.toString().substring(0, mysplit));
            myV.set(key.toString().substring(mysplit + 1) + ":" + sum);
            context.write(myK, myV);
        }
    }
    public static class doReducer extends Reducer<Text, Text, Text, Text> {
        private Text myK = new Text();
        private Text myV = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder myList = new StringBuilder();
            for (Text value : values) {
                myList.append(value.toString()).append(";");
            }
            myK.set(key);
            myV.set(myList.toString());
            context.write(myK, myV);
        }
    }
}