【发布时间】:2014-11-07 12:32:42
【问题描述】:
我想用hadoop来实现一个简单的搜索引擎。
所以我使用 hadoop 流 api 和 bash 创建了一个倒排索引。它输出一个像这样的文件:
ab (744 1) 1
abbrevi (122 1) 1
abil (51 1) (77 1) (738 1) 3
abl (99 1) (132 1) (536 1) (581 1) (695 1) (763 1) (908 1) (914 1) (986 1) (1114 2) 10
ablat (82 2) (274 2) (553 7) (587 1) (1065 3) (1096 2) (1097 7) (1098 3) (10Sorry if 99 4) (1100 4) (1101 3) (1226 3) (1241 3) (1279 1) 14
about (27 1) (32 1) (39 1) (46 1) (49 2) (56 1) (57 1) (69 2) (77 2) (81
2) (83 2) (113 1) (134 1) (139 2) (140 1) (155 1) (156 2) (162 1) (163 1) (165 2) (171 1) (174 1) (177 1) (193 5) (205 1) (206 3) (212 1) (216 3) (218 1)
(225 2) (249 3) (255 1) (257 1) (262 1) (266 3) (272 6) (273 1) (285 1) (292
2) (313 1) (315 2) (346 2) (368 1) (370 1) (371 1) (372 1) (373 1) (381 2) (391 1) (410 3) (420 1) (452 1) (456 4) (469 1) (479 1) (489 1) (498 3) (511 1)
(518 1) (531 1) (536 1) (548 1) (555 1) (556 1) (560 2) (565 1) (567 1) (572
1) (575 1) (577 1) (589 1) (601 1) (603 1) (610 1) (612 1) (614 1) (620 1) (621 4) (625 3) (626 1) (646 1) (649 1) (651 2) (657 2) (662 1) (679 1) (685 2)
(686 1) (704 2) (706 2) (709 1) (717 2) (721 1) (740 2) (757 2) (759 1) (774
1) (786 1) (792 2) (793 1) (794 2) (796 2) (801 2) (805 1) (806 1) (807 2) (808 2) (811 1) (815 1) (816 1) (829 2) (844 1) (869 1) (876 1) (912 1) (917 1)
(921 1) (927 1) (928 2) (958 1) (976 6) (991 1) (992 2) (993 1) (994 1) (996
1) (999 1) (1000 1) (1002 1) (1004 2) (1006 1) (1040 1) (1092 1) (1095 2) (1104 4) (1105 1) (1115 1) (1143 4) (1156 2) (1162 1) (1164 3) (1165 1) (1166 3) (1169 1) (1191 1)
(1194 1) (1202 1) (1209 1) (1212 1) (1218 1) (1223 1) (1224 1) (1229 1) (1230 1) (1231
1) (1239 1) (1241 1) (1244 1) (1246 1) (1248 1) (1255 2) (1262 1) (1275 2) (1282 1) (1303 1) (1304 1) (1307 1) (1310 3) (1316 1) (1335 1) (1341 1) (1344 1) (1345 1) (1353 1)
(1354 3) (1355 1) (1363 1) (1377 1) 178
这意味着例如单词 ab 在文档编号 744 中仅重复一次。
现在我想使用hadoop流api实现and query searching(这意味着文档应该包含查询中的所有单词)。
那么究竟什么是搜索中的 map 和 reduce 阶段?您能否给我一些提示,我该如何使用流式 API 来实现它? (输入字段应该是什么?),我不知道该怎么做?)
谢谢
【问题讨论】:
标签: search hadoop mapreduce hadoop-streaming