An inverted index is one of the most commonly used data structures in document retrieval systems, applied in search engines as a way to look up documents by their content. Because this reverses the usual document-to-words mapping, it is called an inverted index.
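As a minimal illustration, independent of Hadoop and using a hypothetical two-document corpus, the same idea can be built in memory with a plain HashMap that maps each word to the documents containing it and its count per document:

import java.util.HashMap;
import java.util.Map;

public class TinyInvertedIndex {
    public static void main(String[] args) {
        // Hypothetical sample documents: docId -> content
        Map<String, String> docs = new HashMap<>();
        docs.put("1.txt", "MapReduce is simple");
        docs.put("2.txt", "MapReduce is powerful");

        // The inverted index: word -> (docId -> term frequency)
        Map<String, Map<String, Integer>> index = new HashMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().split("\\s+")) {
                index.computeIfAbsent(word, w -> new HashMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        // Prints lines such as: MapReduce -> {1.txt=1, 2.txt=1}
        index.forEach((word, postings) -> System.out.println(word + " -> " + postings));
    }
}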
Analysis and Design
1. Map phase: the map function first parses each input <key, value> pair and extracts the information the index needs: the word, the document URI, and the term frequency. The emitted key is the word joined with the URI; the value is the occurrence count of that word (1 per occurrence).
2. Combine phase: after the map function runs, the combine phase sums the values that share the same key, producing the word's term frequency within each document.
3. Reduce phase: after the two phases above, the reduce phase only has to join the values sharing the same key into the inverted-index file format; everything else is handled by the MapReduce framework itself. The hand trace below walks through all three phases.
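To make the phases concrete, here is a small hand trace on assumed inputs (the file names and contents below are illustrative, not the actual test data; URIs are shortened to bare file names):

Assumed inputs:
  1.txt: "MapReduce is simple"
  2.txt: "MapReduce is powerful"

Map output (one pair per word occurrence):
  <MapReduce:1.txt, 1>  <is:1.txt, 1>  <simple:1.txt, 1>
  <MapReduce:2.txt, 1>  <is:2.txt, 1>  <powerful:2.txt, 1>

Combine output (counts summed per key, then the key re-split into word and URI:count):
  <MapReduce, 1.txt:1>  <MapReduce, 2.txt:1>  <is, 1.txt:1>  <is, 2.txt:1>  ...

Reduce output (the postings for each word joined with ';'):
  MapReduce  1.txt:1;2.txt:1;
  is         1.txt:1;2.txt:1;
  ...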
Implementation: setting up the environment, creating the project, and importing the dependency JARs are not covered here.
package org.apache.hadoop.daopai;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndex {

    public static class InvertedIndexMap extends Mapper<Object, Text, Text, Text> {

        private Text valueInfo = new Text();
        private Text keyInfo = new Text();
        private FileSplit split;

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Get the FileSplit that this <key, value> pair belongs to
            split = (FileSplit) context.getInputSplit();
            StringTokenizer stk = new StringTokenizer(value.toString());
            while (stk.hasMoreElements()) {
                // The key is composed of (word:URI)
                keyInfo.set(stk.nextToken() + ":" + split.getPath().toString());
                // Term frequency: 1 per occurrence
                valueInfo.set("1");
                context.write(keyInfo, valueInfo);
            }
        }
    }

    public static class InvertedIndexCombiner extends Reducer<Text, Text, Text, Text> {

        private Text info = new Text();

        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (Text value : values) {
                sum += Integer.parseInt(value.toString());
            }
            int splitIndex = key.toString().indexOf(":");
            // Reset the value to (URI:term frequency)
            info.set(key.toString().substring(splitIndex + 1) + ":" + sum);
            // Reset the key to the word alone
            key.set(key.toString().substring(0, splitIndex));
            context.write(key, info);
        }
    }

    public static class InvertedIndexReduce extends Reducer<Text, Text, Text, Text> {

        private Text result = new Text();

        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Build the document list for this word
            String fileList = new String();
            for (Text value : values) {
                fileList += value.toString() + ";";
            }
            result.set(fileList);
            context.write(key, result);
        }
    }

    public static void main(String[] args)
            throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "InvertedIndex");
        job.setJarByClass(InvertedIndex.class);

        job.setMapperClass(InvertedIndexMap.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        job.setCombinerClass(InvertedIndexCombiner.class);
        job.setReducerClass(InvertedIndexReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path("./in/invertedindex/"));
        FileOutputFormat.setOutputPath(job, new Path("./out/"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
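Assuming the class is packaged into a jar (the name invertedindex.jar below is arbitrary), the job can be submitted with the standard hadoop jar command:

  hadoop jar invertedindex.jar org.apache.hadoop.daopai.InvertedIndex

One design point worth noting: the combiner here does more than sum counts; it rewrites the key from word:URI to the bare word and the value to URI:count, which is what lets the reducer group postings by word alone. Since Hadoop treats combiners as an optional optimization that may run zero or more times, this classic example implicitly assumes the combiner actually runs.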
After importing the project into Eclipse, use the Hadoop plugin's DFS view: click create new directory to create the directory you need, then click upload files to DFS inside that directory to add the sample documents, and click Refresh; the directory and files you want will appear. Before running, first create a few documents locally (any words or letters inside will do).
For the input documents, I simply created files on my desktop; the out directory and the files under it are produced by the run.
Here is the output of the run. Each line begins with a word from the documents, followed by one entry per document the word appears in: the HDFS path of the document, a colon, and the word's term frequency in that document.
MapReduce hdfs://masters:9000/user/hadoop/in/invertedindex/1.txt:4;hdfs://masters:9000/user/hadoop/in/invertedindex/2.txt:1;
and hdfs://masters:9000/user/hadoop/in/invertedindex/2.txt:1;hdfs://masters:9000/user/hadoop/in/invertedindex/1.txt:1;
apache hdfs://masters:9000/user/hadoop/in/invertedindex/4.txt:2;
bye hdfs://masters:9000/user/hadoop/in/invertedindex/1.txt:1;
conf hdfs://masters:9000/user/hadoop/in/invertedindex/4.txt:1;
hadoop hdfs://masters:9000/user/hadoop/in/invertedindex/4.txt:2;hdfs://masters:9000/user/hadoop/in/invertedindex/1.txt:4;hdfs://masters:9000/user/hadoop/in/invertedindex/2.txt:1;
hbase hdfs://masters:9000/user/hadoop/in/invertedindex/1.txt:1;
hdfs hdfs://masters:9000/user/hadoop/in/invertedindex/1.txt:2;
hello hdfs://masters:9000/user/hadoop/in/invertedindex/1.txt:1;
import hdfs://masters:9000/user/hadoop/in/invertedindex/4.txt:4;
io.IOException; hdfs://masters:9000/user/hadoop/in/invertedindex/4.txt:1;
is hdfs://masters:9000/user/hadoop/in/invertedindex/1.txt:1;hdfs://masters:9000/user/hadoop/in/invertedindex/2.txt:1;
java hdfs://masters:9000/user/hadoop/in/invertedindex/4.txt:2;
org hdfs://masters:9000/user/hadoop/in/invertedindex/4.txt:2;
powerful hdfs://masters:9000/user/hadoop/in/invertedindex/2.txt:1;hdfs://masters:9000/user/hadoop/in/invertedindex/1.txt:1;
simple hdfs://masters:9000/user/hadoop/in/invertedindex/1.txt:1;hdfs://masters:9000/user/hadoop/in/invertedindex/2.txt:1;
util.StringTokenizer; hdfs://masters:9000/user/hadoop/in/invertedindex/4.txt:1;
The problem I ran into here was, once again, paths. I had initially placed the input documents under /user/root/test1/, and the run failed with an error pointing at the path hdfs://masters:9000/user/hadoop/in/invertedindex/ instead. After I moved the input as the message indicated, the run succeeded. The reason is that the relative path ./in/invertedindex/ in the code is resolved against the submitting user's HDFS home directory, /user/hadoop in my case. If anything is unclear, leave a comment and we can discuss and learn together.
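A common way to avoid hard-coded paths altogether (a sketch, not part of the original code) is to read the input and output locations from the command line in main():

  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));

and then pass explicit HDFS paths when submitting, for example: hadoop jar invertedindex.jar org.apache.hadoop.daopai.InvertedIndex /user/hadoop/in/invertedindex /user/hadoop/out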