Hadoop Reducer Values in Memory? (Answers)

【Question title】: Hadoop Reducer Values in Memory?
【Posted】: 2012-06-16 23:41:02
【Question】:

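I'm writing a MapReduce job that may end up with a very large number of values in the reducer. I'm concerned that all of these values will be loaded into memory at once.

Does the underlying implementation of Iterable<VALUEIN> values load values into memory only as they are needed? Hadoop: The Definitive Guide seems to suggest this is the case, but it doesn't give a definitive answer.

The reducer's output will be far larger than the values it takes as input, but I believe the output is written to disk as needed.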

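【Comments】:

【Answer 1】:

As other users have noted, the full data set is not loaded into memory. Take a look at some of the mapred-site.xml parameters described in the Apache documentation.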

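mapreduce.reduce.merge.inmem.threshold

Default: 1000. The threshold, expressed as a number of files, for the in-memory merge process.

mapreduce.reduce.shuffle.merge.percent

Default: 0.66. The usage threshold at which an in-memory merge is started, expressed as a percentage of the total memory allocated to storing in-memory map outputs, as defined by mapreduce.reduce.shuffle.input.buffer.percent.

mapreduce.reduce.shuffle.input.buffer.percent

Default: 0.70. The percentage of the maximum heap size allocated to storing map outputs during the shuffle.

mapreduce.reduce.input.buffer.percent

Default: 0. The percentage of memory (relative to the maximum heap size) used to retain map outputs during the reduce. When the shuffle finishes, any map outputs remaining in memory must consume less than this threshold before the reduce can begin.

mapreduce.reduce.shuffle.memory.limit.percent

Default: 0.25. The maximum percentage of the in-memory limit that a single shuffle can consume.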
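
To see how these percentages combine, here is a rough arithmetic sketch using only the defaults quoted above. The class name ShuffleMemoryBudget, its method names, and the 1 GB heap are hypothetical and chosen for illustration; this is not Hadoop code, only the multiplication that the parameter descriptions imply.

```java
public class ShuffleMemoryBudget {
    // Defaults quoted in the answer above.
    static final double INPUT_BUFFER_PERCENT = 0.70; // mapreduce.reduce.shuffle.input.buffer.percent
    static final double MERGE_PERCENT = 0.66;        // mapreduce.reduce.shuffle.merge.percent
    static final double MEMORY_LIMIT_PERCENT = 0.25; // mapreduce.reduce.shuffle.memory.limit.percent

    /** Bytes of heap set aside for holding map outputs during the shuffle. */
    static long shuffleBufferBytes(long maxHeapBytes) {
        return Math.round(maxHeapBytes * INPUT_BUFFER_PERCENT);
    }

    /** Shuffle-buffer usage at which an in-memory merge is started. */
    static long mergeTriggerBytes(long maxHeapBytes) {
        return Math.round(shuffleBufferBytes(maxHeapBytes) * MERGE_PERCENT);
    }

    /** Largest share of the shuffle buffer that a single shuffle may consume. */
    static long singleShuffleLimitBytes(long maxHeapBytes) {
        return Math.round(shuffleBufferBytes(maxHeapBytes) * MEMORY_LIMIT_PERCENT);
    }

    public static void main(String[] args) {
        long heap = 1_000_000_000L; // assumption: a 1 GB (decimal) reducer heap
        System.out.println("shuffle buffer:       " + shuffleBufferBytes(heap));
        System.out.println("merge trigger:        " + mergeTriggerBytes(heap));
        System.out.println("single shuffle limit: " + singleShuffleLimitBytes(heap));
    }
}
```

Under that assumed heap, roughly 700 MB is available for shuffled map outputs, a merge starts once about 462 MB of it is in use, and no single shuffle may take more than about 175 MB. With mapreduce.reduce.input.buffer.percent at its default of 0, none of that data is kept in memory once the reduce itself begins.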

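【Comments】:

【Answer 2】:

Not entirely in memory; part of it comes from disk. Looking at the code, the framework appears to break the Iterable into segments and load them from disk into memory one at a time.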

org.apache.hadoop.mapreduce.task.ReduceContextImpl org.apache.hadoop.mapred.BackupStore
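
The general idea of an Iterable backed by disk can be sketched without Hadoop. The class below is hypothetical (LazyValues and demo are illustrative names) and is not how ReduceContextImpl or BackupStore are written; it only shows that an Iterable can hand out one value at a time while the rest stays in a file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Single-pass Iterable that reads one value per line from a file, on demand. */
public class LazyValues implements Iterable<String>, AutoCloseable {
    private final BufferedReader reader;

    LazyValues(Path file) throws IOException {
        this.reader = Files.newBufferedReader(file);
    }

    // Reads the next line, or returns null at end of file.
    private String readNext() {
        try {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            // Only one value is looked ahead; the rest is still on disk.
            private String pending = readNext();

            @Override
            public boolean hasNext() {
                return pending != null;
            }

            @Override
            public String next() {
                if (pending == null) throw new NoSuchElementException();
                String current = pending;
                pending = readNext();
                return current;
            }
        };
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }

    /** Writes three values to a temp file, then streams them back lazily. */
    static String demo() {
        try {
            Path spill = Files.createTempFile("values", ".txt");
            try {
                Files.write(spill, Arrays.asList("a", "b", "c"));
                StringBuilder seen = new StringBuilder();
                try (LazyValues values = new LazyValues(spill)) {
                    for (String v : values) {
                        if (seen.length() > 0) seen.append(',');
                        seen.append(v);
                    }
                }
                return seen.toString();
            } finally {
                Files.deleteIfExists(spill);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Like the reducer's values, this Iterable can only be walked once: a second call to iterator() continues from wherever the reader currently is rather than starting over.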

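【Comments】:

Can you explain how this answers the question?

【Answer 3】:

You are reading the book correctly. The reducer does not hold all of the values in memory. Instead, as you loop over the Iterable of values, each Object instance is reused, so only one instance is held at any given time.

For example, in the code below the objs ArrayList will have the expected size after the loop, but every element will be the same, because the Text val instance is reused on each iteration.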

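import java.util.ArrayList;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Declared as a nested class inside the job's driver class.
public static class ReducerExample extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) {
        ArrayList<Text> objs = new ArrayList<Text>();
        for (Text val : values) {
            // Pitfall: val is the same reused instance on every iteration,
            // so objs ends up holding many references to one object.
            objs.add(val);
        }
    }
}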

(If for some reason you do need to work with each val later on, make a deep copy and store that instead.)
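
The reuse behaviour and the copy fix can be reproduced without a cluster. The sketch below is a simulation, not Hadoop's iterator: a StringBuilder stands in for the reused Text instance, and ReuseDemo, reusing, keepReferences, and keepCopies are hypothetical names. In a real reducer the equivalent copy would be new Text(val), as the book passage quoted in this answer suggests.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class ReuseDemo {
    /** Iterable whose iterator returns one shared holder, overwritten on each step. */
    static Iterable<StringBuilder> reusing(List<String> data) {
        return () -> new Iterator<StringBuilder>() {
            private final StringBuilder holder = new StringBuilder(); // stand-in for the reused Text
            private int i = 0;

            @Override
            public boolean hasNext() {
                return i < data.size();
            }

            @Override
            public StringBuilder next() {
                if (i >= data.size()) throw new NoSuchElementException();
                holder.setLength(0);          // discard the previous value
                holder.append(data.get(i++)); // load the next one into the same object
                return holder;
            }
        };
    }

    /** Stores the references as-is: every element ends up showing the last value. */
    static List<String> keepReferences(List<String> data) {
        List<StringBuilder> objs = new ArrayList<>();
        for (StringBuilder val : reusing(data)) {
            objs.add(val);
        }
        return render(objs);
    }

    /** Stores a copy of each value: the contents survive later iterations. */
    static List<String> keepCopies(List<String> data) {
        List<StringBuilder> objs = new ArrayList<>();
        for (StringBuilder val : reusing(data)) {
            objs.add(new StringBuilder(val)); // analogous to new Text(val)
        }
        return render(objs);
    }

    private static List<String> render(List<StringBuilder> objs) {
        List<String> out = new ArrayList<>();
        for (StringBuilder sb : objs) {
            out.add(sb.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("a", "b", "c");
        System.out.println(keepReferences(data)); // prints [c, c, c]
        System.out.println(keepCopies(data));     // prints [a, b, c]
    }
}
```

Note that copying every value into a list brings back the memory problem the question is worried about, so it is only appropriate when the number of retained values is known to be small.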

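Of course, even a single value could be larger than memory. In that case, the developer is advised to reduce the data in the preceding Mapper so that individual values do not get that large.

Update: see pages 199-200 of Hadoop: The Definitive Guide, 2nd Edition.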

This code snippet makes it clear that the same key and value objects are used on each
invocation of the map() method -- only their contents are changed (by the reader's
next() method). This can be a surprise to users, who might expect keys and values to be
immutable. This causes problems when a reference to a key or value object is retained
outside the map() method, as its value can change without warning. If you need to do
this, make a copy of the object you want to hold on to. For example, for a Text object,
you can use its copy constructor: new Text(value).

The situation is similar with reducers. In this case, the value objects in the reducer's
iterator are reused, so you need to copy any that you need to retain between calls to
the iterator.

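【Comments】:

I'm confused by your answer. First you say "the reducer does not hold all of the values in memory", which implies that the Iterable loads values as needed. Later you say "even a single instance of the list of values could be larger than memory", which implies that the list of values is loaded into memory first. Can you clarify?
Edited to clarify. I meant that even a single value could be very large, which is unlikely. "The reducer does not hold all of the values in memory" is a true statement. Does that make sense?
Yes. Thanks for the clarification. Do you happen to have a reference for this?
Much appreciated. Thanks.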