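【Posted】: 2016-02-14 16:38:14
【Question】:
I'm trying out MapReduce in Python with the Dumbo library. Below is my experimental test code; I expect every record the mapper emits to come out of the reducer as well.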
import dumbo

count = 0  # running total of records seen by the reducer

def mapper(key, value):
    fields = value.split("\t")
    myword = fields[0] + "\t" + fields[1]
    yield myword, value

def reducer(key, values):
    for value in values:
        mypid = value
        words = value.split("\t")
        global count
        count = count + 1
    myword = str(count) + "--" + words[1]  # to count total lines in reducer's output records
    yield myword, 1

if __name__ == "__main__":
    dumbo.run(mapper, reducer)
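Below is the Map-Reduce Framework log. I expected "Reduce input records" to equal "Reduce output records", but they differ. Is something wrong with my test code, or am I misunderstanding something about MapReduce? Thanks.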
Map-Reduce Framework
Map input records=405057
Map output records=405057
Map output bytes=107178919
Map output materialized bytes=108467155
Input split bytes=2496
Combine input records=0
Combine output records=0
Reduce input groups=63096
Reduce shuffle bytes=108467155
Reduce input records=405057
Reduce output records=63096
Spilled Records=810114
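Modifying the reducer as follows fixed it (the yield has to sit inside the for loop, so that one record is emitted per input value instead of one per key):
def reducer(key, values):
    global count
    for value in values:
        mypid = value
        words = value.split("\t")
        count = count + 1
        myword = str(count) + "--" + words[1]  # to count total lines in reducer's output records
        yield myword, 1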
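The log already hints at the cause: "Reduce output records" (63096) equals "Reduce input groups" (63096), which is what you get when a reducer emits once per key. The effect can be reproduced locally without Hadoop. This is only a minimal sketch, assuming plain Python 3; `run_local` and the two reducer names are hypothetical helpers made up for illustration and are not part of Dumbo.

```python
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    # Same keying as in the question: the first two tab-separated fields.
    fields = value.split("\t")
    yield fields[0] + "\t" + fields[1], value


def reducer_once_per_key(key, values):
    # yield is outside the loop: one output record per key group.
    last = None
    for value in values:
        last = value.split("\t")[1]
    yield last, 1


def reducer_once_per_value(key, values):
    # yield is inside the loop: one output record per input record.
    for value in values:
        yield value.split("\t")[1], 1


def run_local(lines, reducer):
    """Hypothetical stand-in for the shuffle: sort by key, group, reduce."""
    mapped = [kv for i, line in enumerate(lines) for kv in mapper(i, line)]
    mapped.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(key, (value for _, value in group)))
    return output


lines = ["a\tx\t1", "a\tx\t2", "b\ty\t3"]  # 3 records, 2 distinct keys
per_key = run_local(lines, reducer_once_per_key)
per_value = run_local(lines, reducer_once_per_value)
print(len(per_key), len(per_value))
```

With three input records spread over two keys, the first reducer produces as many records as there are groups and the second as many as there are input records, which mirrors the two counters in the log.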
【Discussion】:
Tags: hadoop reduce records mapper