Amazon Elastic MapReduce: Output directory

【Question title】: Amazon Elastic MapReduce: Output directory
【Posted】: 2012-07-29 23:51:30
【Question】:

I'm running Amazon's Elastic MapReduce example and keep hitting the following error:

Error launching job, output path already exists.

Here is the command I'm using to run the job:

C:\ruby\elastic-mapreduce-cli>ruby elastic-mapreduce --create --stream \
     --mapper  s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
     --input   s3://elasticmapreduce/samples/wordcount/input \
     --output  [A path to a bucket you own on Amazon S3, such as, s3n://myawsbucket] \
     --reducer aggregate

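Here is where the example comes from.

I'm following Amazon's directions for the output directory. The bucket name is s3n://mp.maptester321mark/. I've gone through all of their suggestions for this problem at this URL.

Here is my credentials.json info: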

{
"access_id": "1234123412",
"private_key": "1234123412",
"keypair": "markkeypair",
"key-pair-file": "C:/Ruby/elastic-mapreduce-cli/markkeypair",
"log_uri": "s3n://mp-mapreduce/",
"region": "us-west-2"
}

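【Question comments】:

Why, oh why, does S3 force us to create a new directory every time?

Tags: hadoop amazon-ec2 amazon-web-services elastic-map-reduce

【Solution 1】:

Hadoop jobs won't clobber directories that already exist. You just need to run: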

hadoop fs -rmr <output_dir>

before your job, or simply use the AWS console to remove the directory.
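
A minimal sketch of that cleanup step, assuming the hadoop command is on the PATH (for example on the cluster's master node) and can reach the S3 path. The function name clear_output_dir is hypothetical and only used for illustration; -rmr is the form given in this answer, and newer Hadoop releases spell it -rm -r.

```shell
# Hypothetical helper: delete the job's output path only if it already exists.
clear_output_dir() {
  out="$1"
  # `hadoop fs -test -e` exits 0 when the path exists.
  if hadoop fs -test -e "$out"; then
    # Recursive delete, as in the answer above.
    hadoop fs -rmr "$out"
  fi
}
```

It would be called right before submitting the job, e.g. clear_output_dir s3n://mp.maptester321mark/output.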

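【Comments】:

I had already deleted the directory before starting the job, but it still throws this error.
Can you confirm that it was actually deleted?
Try a different output directory.
Paul, yes, I confirmed it was deleted. I've tried deleting the directory and re-running. I've also tried using a new directory name (so that it would be created as part of the process).
Very strange. Is it possible that the elastic-mapreduce command runs a step that creates the directory before your job starts? Are you literally specifying "s3n://mp.maptester321mark/" as the output directory? If so, you may need to specify a folder, such as "s3n://mp.maptester321mark/output".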

【Solution 2】:

Use:

--output s3n://mp.maptester321mark/output

Instead of:

--output s3n://mp.maptester321mark/

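I suppose EMR creates the output bucket before running, which means that if you specify --output s3n://mp.maptester321mark/ the output directory / already exists, and that is probably why you get this error.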
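
As a rough illustration of this advice, a small shell check could reject an output URI that names only the bucket root before the job is submitted. The function name has_output_prefix is made up for this sketch and is not part of the elastic-mapreduce CLI.

```shell
# Hypothetical helper: succeeds (exit 0) only when the S3 URI contains a
# folder after the bucket name, e.g. s3n://bucket/output.
has_output_prefix() {
  uri="$1"
  # Drop the scheme (s3:// or s3n://).
  rest="${uri#*://}"
  key=""
  case "$rest" in
    # Drop the bucket name up to the first slash; what remains is the folder.
    */*) key="${rest#*/}" ;;
  esac
  [ -n "$key" ]
}
```

With the paths from this thread, has_output_prefix s3n://mp.maptester321mark/output succeeds, while has_output_prefix s3n://mp.maptester321mark/ fails.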

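【Comments】:

【Solution 3】:

---> If the folder (bucket) already exists, delete it.

---> If you have deleted it and still get the error above, make sure your output path looks like s3n://some_bucket_name/your_output_bucket. If it looks like s3n://your_output_bucket/ instead, that is the problem with EMR, because I think it first creates the bucket from the path (some_bucket_name) and then tries to create (your_output_bucket).

Thanks, Hari

【Comments】: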