How do I set an environment variable in a YARN Spark job?

【Question title】: How do I set an environment variable in a YARN Spark job?
【Posted】: 2014-10-11 03:27:05
【Question description】:

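I am trying to access Accumulo 1.6 from an Apache Spark job (written in Java) by using AccumuloInputFormat with newAPIHadoopRDD. To do this, I have to tell AccumuloInputFormat where to find ZooKeeper by calling the setZooKeeperInstance method. This method takes a ClientConfiguration object, which specifies various relevant properties.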

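I am creating my ClientConfiguration object by calling the static loadDefault method. This method is supposed to look in various places for a client.conf file to load its defaults from. One of the places it is supposed to look is $ACCUMULO_CONF_DIR/client.conf.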
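To check that one lookup location by hand, the path can be resolved from the environment in a few lines. This is only a diagnostic sketch under the assumption stated above (that $ACCUMULO_CONF_DIR/client.conf is one of the candidates). It is not Accumulo's own lookup code, and the class and method names are hypothetical.

```java
import java.io.File;
import java.util.Map;

public class ClientConfLocator {
    // Hypothetical helper: returns $ACCUMULO_CONF_DIR/client.conf for the
    // given environment, or null when the variable is not set at all.
    static File candidate(Map<String, String> env) {
        String dir = env.get("ACCUMULO_CONF_DIR");
        return dir == null ? null : new File(dir, "client.conf");
    }

    public static void main(String[] args) {
        File conf = candidate(System.getenv());
        if (conf == null) {
            System.out.println("ACCUMULO_CONF_DIR is not set in this process");
        } else {
            System.out.println(conf + " exists: " + conf.isFile());
        }
    }
}
```

Run inside the same process that calls loadDefault, this separates "the variable is missing" from "the variable is set but the file is not there".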

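Therefore, I am trying to set the ACCUMULO_CONF_DIR environment variable so that it is visible when Spark runs the job (for reference, I am running in the yarn-cluster deployment mode). I have not yet found a way to do so successfully.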

So far, I have tried:

Calling setExecutorEnv("ACCUMULO_CONF_DIR", "/etc/accumulo/conf") on the SparkConf
Exporting ACCUMULO_CONF_DIR in spark-env.sh
Setting spark.executorEnv.ACCUMULO_CONF_DIR in spark-defaults.conf

None of them worked. When I print the environment before calling setZooKeeperInstance, ACCUMULO_CONF_DIR does not appear.

In case it is relevant, I am using the CDH5 versions of everything.

Here is an example of what I am trying to do (imports and exception handling omitted for brevity):

public class MySparkJob
{
    public static void main(String[] args)
    {
        SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("MySparkJob");
        sparkConf.setExecutorEnv("ACCUMULO_CONF_DIR", "/etc/accumulo/conf");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        Job accumuloJob = Job.getInstance(sc.hadoopConfiguration());
        // Foreach loop to print environment, shows no ACCUMULO_CONF_DIR
        ClientConfiguration accumuloConfiguration = ClientConfiguration.loadDefault();
        AccumuloInputFormat.setZooKeeperInstance(accumuloJob, accumuloConfiguration);
        // Other calls to AccumuloInputFormat static functions to configure it properly.
        JavaPairRDD<Key, Value> accumuloRDD =
            sc.newAPIHadoopRDD(accumuloJob.getConfiguration(),
                               AccumuloInputFormat.class,
                               Key.class,
                               Value.class);
    }
}
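
The "foreach loop to print environment" mentioned in the comment is not shown in the example. A minimal sketch of such a check might look like the following. The class and method names are hypothetical, and the entries are sorted only to make the output easier to scan.

```java
import java.util.Map;
import java.util.TreeMap;

public class EnvDump {
    // Render environment entries as NAME=value lines, sorted by name.
    static String format(Map<String, String> env) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> e : new TreeMap<String, String>(env).entrySet()) {
            out.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Dump the environment of the current JVM process.
        System.out.print(format(System.getenv()));
        System.out.println("ACCUMULO_CONF_DIR set: "
                + (System.getenv("ACCUMULO_CONF_DIR") != null));
    }
}
```

In yarn-cluster mode this output goes to the driver's container log rather than to the console of the submitting machine.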

【Question comments】:

Tags: java apache-spark hadoop-yarn cloudera-cdh accumulo

【Solution 1】:

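I found the answer to this while writing the question (sorry, reputation seekers). The problem is that CDH5 uses Spark 1.0.0 and that I was running the job via YARN. Apparently, YARN mode does not pay any attention to the executor environment and instead uses the environment variable SPARK_YARN_USER_ENV to control its environment. So making sure that SPARK_YARN_USER_ENV contains ACCUMULO_CONF_DIR=/etc/accumulo/conf works, and it makes ACCUMULO_CONF_DIR visible in the environment at the point indicated in the question's source example.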
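
Before submitting the job, it can help to confirm that SPARK_YARN_USER_ENV really carries the entry. The sketch below assumes the value is a comma-separated list of NAME=value pairs (for example JAVA_HOME=/jdk64,FOO=bar). The parser is a hypothetical helper written for illustration, not Spark's own parsing code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class YarnUserEnv {
    // Hypothetical helper: split "A=1,B=2" into an ordered name -> value map.
    // Entries without '=' or with an empty name are skipped.
    static Map<String, String> parse(String raw) {
        Map<String, String> vars = new LinkedHashMap<String, String>();
        if (raw == null || raw.trim().isEmpty()) {
            return vars;
        }
        for (String pair : raw.split(",")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                vars.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        return vars;
    }

    public static void main(String[] args) {
        // Run on the machine that submits the job, before calling spark-submit.
        Map<String, String> vars = parse(System.getenv("SPARK_YARN_USER_ENV"));
        String dir = vars.get("ACCUMULO_CONF_DIR");
        System.out.println(dir == null
                ? "SPARK_YARN_USER_ENV does not define ACCUMULO_CONF_DIR"
                : "ACCUMULO_CONF_DIR will be passed as " + dir);
    }
}
```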

This difference between how standalone mode and YARN mode work resulted in SPARK-1680, which is reported as fixed in Spark 1.1.0.

【Comments】: