MRJob determining if running inline, local, emr or hadoop

【Question title】: MRJob determining if running inline, local, emr or hadoop
【Posted】: 2016-04-23 15:26:37
【Question description】:

I am building on some old code from a few years back using the commoncrawl dataset, running MRJob on EMR. The code uses the following check inside the mapper function of an MRJob subclass to determine whether it is running locally or on emr:

self.options.runner == 'emr'

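This seems to have never worked, or no longer works: self.options.runner is not passed to the tasks, so it is always set to the default value 'inline'. The question is whether there is a way, with the current version of MRJob (v0.5.0), to determine whether the code is running locally or on emr.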

【Question discussion】:

Ran into the same error. The issue is now being tracked at github.com/commoncrawl/cc-mrjob/issues/7

Tags: python hadoop emr mrjob common-crawl

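【Solution 1】:

Thanks to @pykler and @sebastian-nagel for posting information about this issue, because trying to get the Common Crawl Python examples to run on Amazon EMR has been a headache.

In response to the solution posted by @pykler, I believe there is a more idiomatic way, shown in this PDF: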

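from mrjob.job import MRJob

class CCJob(MRJob):
  def configure_options(self):
    super(CCJob, self).configure_options()
    # Forward the runner choice to the tasks so self.options.runner is set there
    self.pass_through_option('--runner')
    self.pass_through_option('-r')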

The rest of the code, i.e. the if self.options.runner in ['emr', 'hadoop'] check, can then stay as it is and should work on EMR; just pass the -r emr option as usual.
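As a minimal sketch of that check, the membership test can be pulled into a small helper so the mapper only asks one question. The names runs_on_cluster and CLUSTER_RUNNERS are hypothetical and not part of mrjob, and the sketch deliberately avoids importing mrjob so it can be run on its own:

```python
# Hypothetical helper, not part of mrjob: the names below are illustrative only.
CLUSTER_RUNNERS = frozenset(['emr', 'hadoop'])


def runs_on_cluster(runner):
    """Return True if the runner name seen by the task is a cluster runner."""
    return runner in CLUSTER_RUNNERS


# Inside a mapper this would be called as runs_on_cluster(self.options.runner),
# assuming the runner option has been passed through to the tasks.
for name in ('inline', 'local', 'hadoop', 'emr'):
    print(name, runs_on_cluster(name))
```

Keeping the runner names in one place means a later change (for example, treating another runner as remote) touches a single line rather than every check.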

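Also, there seems to be a problem when running a script that imports the mrcc module on EMR. I get an ImportError saying the module cannot be found.

To work around this, create a new file containing the code you want to run and replace the from mrcc import CCJob import with the actual code from mrcc.py. This is shown in this fork of the cc-mrjob repository.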

【Discussion】:

【Solution 2】:

I found a solution, but I am still looking for a built-in one if anyone knows of it. You can add a custom passthrough option that gets passed to your tasks, which looks like this:

from mrjob.job import MRJob

class CCJob(MRJob):

  def configure_options(self):
    super(CCJob, self).configure_options()
    # Custom option that is passed through to every task
    self.add_passthrough_option(
      '--platform', default='local', choices=['local', 'remote'],
      help="indicate running remotely")

  def mapper(self, _, line):
    if self.options.platform == 'remote':
      pass  # remote-only logic goes here

and you must pass --platform remote when running remotely.
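To see how such an option behaves without a cluster, here is a standalone sketch using the standard library's optparse, which, as far as I know, is what mrjob 0.5.x builds its option handling on. The function parse_platform is hypothetical and only mimics the option definition above; it does not reproduce how mrjob forwards the option to tasks:

```python
from optparse import OptionParser


def parse_platform(argv):
    """Parse a --platform option defined like the passthrough option above."""
    parser = OptionParser()
    # Same arguments as the add_passthrough_option call; with choices given
    # and no explicit type, optparse treats the option as a 'choice' option.
    parser.add_option(
        '--platform', default='local', choices=['local', 'remote'],
        help="indicate running remotely")
    options, _args = parser.parse_args(argv)
    return options.platform


print(parse_platform([]))                        # option omitted: default applies
print(parse_platform(['--platform', 'remote']))  # option given explicitly
```

The point of the sketch is that forgetting --platform remote silently falls back to 'local', so the flag has to be supplied on every remote run.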

【Discussion】: