Why do Reduce input records differ from Reduce output records? Answer

【Question title】: Why are Reduce input records different from Reduce output records?
【Posted】: 2016-02-14 16:38:14
【Question】:

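I am trying out MapReduce in Python with the Dumbo library. Below is my experimental test code; I expected every record emitted by the mapper to show up in the reducer's output.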

import dumbo

# Module-level counter, initialized here so the reducer can increment it.
count = 0

def mapper(key, value):
    # Key each record on its first two tab-separated fields.
    fields = value.split("\t")
    myword = fields[0] + "\t" + fields[1]
    yield myword, value

def reducer(key, values):
    global count
    for value in values:
        mypid = value
        words = value.split("\t")
    count = count + 1
    myword = str(count) + "--" + words[1]  # to count total lines in the reducer's output records
    yield myword, 1

if __name__ == "__main__":
    dumbo.run(mapper, reducer)

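Below is the Map-Reduce Framework section of the job log. I expected "Reduce input records" to equal "Reduce output records", but that is not the case. Is something wrong with my test code, or am I misunderstanding something about MapReduce? Thanks.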

    Map-Reduce Framework
            Map input records=405057
            Map output records=405057
            Map output bytes=107178919
            Map output materialized bytes=108467155
            Input split bytes=2496
            Combine input records=0
            Combine output records=0
            Reduce input groups=63096
            Reduce shuffle bytes=108467155
            Reduce input records=405057
            Reduce output records=63096
            Spilled Records=810114

Changing the reducer as follows, so that it yields inside the loop, fixes it:

def reducer(key, values):
    global count
    for value in values:
        mypid = value
        words = value.split("\t")

        count = count + 1
        myword = str(count) + "--" + words[1]  # to count total lines in the reducer's output records
        yield myword, 1

【Question comments】:

Tags: hadoop reduce records mapper

【Solution 1】:

I expected "Reduce input records" to equal "Reduce output records", but that is not the case.

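I'm not sure why you would expect that. The whole point of a reducer is that it receives one group of values at a time (grouped by the key the mapper emitted), and your reducer emits only one record per group (yield myword, 1). So the only way your "Reduce input records" could equal your "Reduce output records" is if every group contained exactly one record, that is, if the first two fields of each value were unique across your record set. Since that is evidently not the case, your reducer emits fewer records than it receives. Your log shows exactly this: "Reduce input groups" and "Reduce output records" are both 63096.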
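
To make the grouping concrete, here is a minimal local sketch in plain Python that does not use Dumbo or Hadoop at all. The helper run_local is hypothetical: it only imitates the shuffle by sorting the mapper output and grouping it by key. The names reducer_per_group and reducer_per_value and the three sample lines are likewise made up for illustration.

```python
from itertools import groupby


def mapper(key, value):
    # Same keying as in the question: the first two tab-separated fields.
    fields = value.split("\t")
    yield fields[0] + "\t" + fields[1], value


def reducer_per_group(key, values):
    # The yield sits after the loop, so one record is emitted per key group.
    last = None
    for value in values:
        last = value
    yield key, last


def reducer_per_value(key, values):
    # The yield sits inside the loop, so one record is emitted per value.
    for value in values:
        yield key, value


def run_local(lines, mapper, reducer):
    """Hypothetical stand-in for the shuffle: sort by key, group, reduce."""
    mapped = [kv for i, line in enumerate(lines) for kv in mapper(i, line)]
    mapped.sort(key=lambda kv: kv[0])
    out = []
    for key, group in groupby(mapped, key=lambda kv: kv[0]):
        out.extend(reducer(key, (v for _, v in group)))
    return len(mapped), out


# Made-up sample input: two records share the key "a\tx".
lines = ["a\tx\t1", "a\tx\t2", "b\ty\t3"]
n_in, per_group = run_local(lines, mapper, reducer_per_group)
_, per_value = run_local(lines, mapper, reducer_per_value)
print(n_in, len(per_group), len(per_value))  # prints: 3 2 3
```

With three reduce input records in two groups, the first reducer emits two records and the second emits three, which is the same relationship the job counters show at a larger scale.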

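(In fact, this is the usual pattern; it is why a "reducer" is called a "reducer". The name comes from "reduce" in functional languages, which reduces a collection of values to a single value.)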
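
As a small illustration of that functional "reduce", here is a sketch using Python's standard functools.reduce; the list of numbers is made up.

```python
from functools import reduce
from operator import add

# A collection of five values is folded into a single value.
values = [3, 1, 4, 1, 5]
total = reduce(add, values, 0)
print(len(values), "values ->", total)  # prints: 5 values -> 14
```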

【Comments】: