How to read an external file in Amazon Elastic MapReduce

【Question title】: How to read an external file in Amazon Elastic MapReduce
【Posted】: 2014-01-31 23:41:45
【Question description】:

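Hi, I'm new to Amazon EMR and Hadoop. I'd like to know how to read an external file (stored in S3) from within an EMR job. For example, I have a file containing a long list of blacklisted strings. While my EMR job is processing my input, how can I have the job read this list of blacklisted strings beforehand so that it can be used during processing?

I tried using the regular Java Scanner class and hardcoding the S3 path to the file, but that didn't seem to work, although I may well be doing it wrong...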

【Question comments】:

Tags: file-io amazon elastic-map-reduce emr

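【Solution 1】:

I would do something like this (sorry, the code is Scala rather than Java, but the idea is the same):

Pass the path as an argument to your main method.
Set it as a property in the configuration: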

val conf = new Configuration()    
conf.set("blacklist.file", args(0))

In the mapper's setup method, read the file:

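var blacklist: List[String] = List()

// Runs once per mapper task, before any call to map()
override def setup(context: Context): Unit = {
  val path = new Path(context.getConfiguration.get("blacklist.file"))
  // Resolve the file system from the path's URI rather than the default one
  val fileSystem = FileSystem.get(path.toUri, context.getConfiguration)
  val in = fileSystem.open(path)
  // Materialize the lines before closing the stream
  try blacklist = scala.io.Source.fromInputStream(in).getLines().toList
  finally in.close()
}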

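【Comments】:

Not sure why, but the formatting isn't working, so here is the second part as a GitHub gist: gist.github.com/4289288
I've edited the post for you. Put an HTML comment after the bulleted lines so that the code is formatted correctly.
I ended up getting this approach to work, though I believe Amar's solution is correct as well.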

【Solution 2】:

It would be better to add this file to the distributed cache, like so:

...
String s3FilePath = args[0];
DistributedCache.addCacheFile(new URI(s3FilePath), conf);
...

Later, in the configure() method of your mapper/reducer, you can do the following:

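...
Path s3FilePath;

@Override
public void configure(JobConf job) {
    try {
        // Local copy of the first file registered with the distributed cache
        s3FilePath = DistributedCache.getLocalCacheFiles(job)[0];
        FileInputStream fstream = new FileInputStream(s3FilePath.toString());
        // Read the file and build a HashMap/List or something which can be accessed from map/reduce methods as desired.
        ...
    } catch (IOException e) {
        // configure() cannot throw checked exceptions, so rethrow unchecked
        throw new RuntimeException(e);
    }
}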
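
The comment inside configure() leaves the parsing step open. As a rough sketch of one way to do it using only the JDK: the helper below turns the opened stream into a set that map()/reduce() could query. The class name BlacklistLoader and its load method are hypothetical, not part of Hadoop or EMR. The file format (one blacklisted string per line, UTF-8, blank lines ignored, surrounding whitespace trimmed) is an assumption, since the question does not specify it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only.
final class BlacklistLoader {

    // Reads one entry per line (assumed format) into a set and closes the stream.
    // I/O failures are rethrown unchecked so the method can be called from
    // configure(), which cannot declare checked exceptions.
    static Set<String> load(InputStream in) {
        Set<String> blacklist = new HashSet<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String entry = line.trim();
                if (!entry.isEmpty()) { // skip blank lines (assumption)
                    blacklist.add(entry);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return blacklist;
    }
}
```

In configure(), the result could be kept in a field, for example blacklist = BlacklistLoader.load(fstream), and map() could then test each value with blacklist.contains(value). A set is used here rather than a list because membership checks happen once per record.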

【Comments】: