Triggering Spark jobs with REST: answers

【Question title】: Triggering Spark jobs with REST
【Posted】: 2015-05-13 14:34:58
【Question description】:

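I have been trying out Apache Spark lately. My question is specifically about triggering Spark jobs. Here I had posted a question on understanding Spark jobs. After getting my hands dirty with jobs, I moved on to my actual requirement.

I have a REST endpoint where I expose an API to trigger jobs; I used Spring 4.0 for the REST implementation. Going forward, I want to implement Jobs as a Service in Spring, where I submit the job programmatically: when the endpoint is hit, I trigger the job with the given parameters. I now have a few design options.

Similar to the job written below, I would maintain several jobs that are called by an abstract class, perhaps named JobScheduler.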

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    /* Can this code be abstracted from the application and written
       as a separate job? My understanding is that the application
       code itself has to have the addJars embedded, which
       sparkContext takes care of internally. */

    SparkConf sparkConf = new SparkConf()
            .setAppName("MyApp")
            .setJars(new String[] { "/path/to/jar/submit/cluster" })
            .setMaster("/url/of/master/node");
    sparkConf.setSparkHome("/path/to/spark/");

    // Schedule jobs inside this application fairly instead of FIFO
    sparkConf.set("spark.scheduler.mode", "FAIR");
    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    // Jobs submitted from this thread go to the "test" pool
    sc.setLocalProperty("spark.scheduler.pool", "test");

    // Application with algorithm, transformations

Extending the point above, have multiple versions of the job handled by a service.
Or else use Spark Job Server to do this.
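The first option (several jobs behind one abstract class) could be sketched roughly as below. This is only an illustration under assumptions: `SparkJobSpec`, `WordCountJob`, `com.example.WordCount` and the `resolve` method are hypothetical names, and the sketch only collects the settings a job needs. It does not create a SparkContext or submit anything.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical base class: one subclass per Spark job the service can trigger.
abstract class SparkJobSpec {
    abstract String name();
    abstract String mainClass();
    abstract String jarPath();

    // Settings shared by all jobs, taken from the snippet in the question.
    Map<String, String> sparkProperties() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("spark.app.name", name());
        props.put("spark.jars", jarPath());
        props.put("spark.scheduler.mode", "FAIR");
        return props;
    }
}

// Example job definition; the class name and main class are made up.
final class WordCountJob extends SparkJobSpec {
    String name() { return "word-count"; }
    String mainClass() { return "com.example.WordCount"; }
    String jarPath() { return "/path/to/jar/submit/cluster"; }
}

// Registry the REST endpoint would consult to map a job name to its definition.
final class JobScheduler {
    private final Map<String, SparkJobSpec> jobs = new LinkedHashMap<>();

    JobScheduler register(SparkJobSpec job) {
        jobs.put(job.name(), job);
        return this;
    }

    SparkJobSpec resolve(String name) {
        SparkJobSpec job = jobs.get(name);
        if (job == null) {
            throw new IllegalArgumentException("Unknown job: " + name);
        }
        return job;
    }
}
```

A controller would then call something like `new JobScheduler().register(new WordCountJob()).resolve("word-count")` and hand the resulting properties to whichever submission mechanism is chosen.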

First, I would like to know which solution is best in this case, both for execution and for scaling.

Note: I am using a Spark standalone cluster. Please help.

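【Question discussion】:

I added the Spring for Apache Hadoop tag to this question. Spring Batch Admin provides a REST API for managing and launching jobs, and I believe Spring for Apache Hadoop provides the ability to launch Spark jobs from Spring Batch...
@MichaelMinella: Thanks for the suggestion, I will definitely look into it.

Tags: rest apache-spark spring-batch job-scheduling spring-data-hadoop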

【Answer 1】:

It turns out Spark has a hidden REST API for submitting a job, checking its status, and killing it.

See a full example here: http://arturmkrtchyan.com/apache-spark-hidden-rest-api
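For readers who cannot open the link, here is a minimal sketch of what the three calls look like, assuming the standalone master's REST submission server is listening on its default port 6066. The JSON field names and URL paths reflect my understanding of that protocol and should be checked against your Spark version (newer releases may need the REST server enabled explicitly). The class name, jar path, and main class are hypothetical. The sketch only builds requests; it does not send them.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.List;
import java.util.stream.Collectors;

// Sketch only: builds requests for the standalone master's REST submission endpoint.
final class SparkRestSubmission {

    // Minimal JSON string escaping (backslashes and double quotes only).
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Body for POST /v1/submissions/create.
    static String createBody(String jarUrl, String mainClass, String sparkVersion,
                             String masterUrl, List<String> appArgs) {
        String args = appArgs.stream().map(SparkRestSubmission::quote)
                .collect(Collectors.joining(", ", "[", "]"));
        return "{"
                + "\"action\": \"CreateSubmissionRequest\", "
                + "\"appResource\": " + quote(jarUrl) + ", "
                + "\"mainClass\": " + quote(mainClass) + ", "
                + "\"appArgs\": " + args + ", "
                + "\"clientSparkVersion\": " + quote(sparkVersion) + ", "
                + "\"environmentVariables\": {\"SPARK_ENV_LOADED\": \"1\"}, "
                + "\"sparkProperties\": {"
                + "\"spark.app.name\": \"MyApp\", "
                + "\"spark.master\": " + quote(masterUrl) + ", "
                + "\"spark.jars\": " + quote(jarUrl) + ", "
                + "\"spark.submit.deployMode\": \"cluster\""
                + "}}";
    }

    static String createUrl(String masterHost) {
        return "http://" + masterHost + ":6066/v1/submissions/create";
    }

    // GET this URL to check the status of a submission.
    static String statusUrl(String masterHost, String submissionId) {
        return "http://" + masterHost + ":6066/v1/submissions/status/" + submissionId;
    }

    // POST (empty body) to this URL to kill a submission.
    static String killUrl(String masterHost, String submissionId) {
        return "http://" + masterHost + ":6066/v1/submissions/kill/" + submissionId;
    }

    static HttpRequest createRequest(String masterHost, String body) {
        return HttpRequest.newBuilder(URI.create(createUrl(masterHost)))
                .header("Content-Type", "application/json;charset=UTF-8")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }
}
```

The request can be sent with `java.net.http.HttpClient`. The response to the create call should contain a submission ID, which is what the status and kill URLs take.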

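【Discussion】:

Sounds interesting. I found this: issues.apache.org/jira/secure/attachment/12696651/… So does it mean Spark itself now exposes this feature?
AFAIK they added it in v1.4, but they haven't advertised it publicly yet.
@ArturMkrtchyan Very interesting option, thanks! What happens if I submit two applications at the same time through the Spark REST API?
The page you linked doesn't really explain anything, because the images on it are dead.
Since the images at the main link are broken, this might help: gist.github.com/arturmkrtchyan/5d8559b2911ac951d34a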

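【Answer 2】:

Just use Spark JobServer: https://github.com/spark-jobserver/spark-jobserver

There are a lot of things to consider when building a service, and Spark JobServer already covers most of them. If you find something that isn't good enough, it should be easy to open a request and contribute code to their system rather than reinventing it from scratch.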

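【Discussion】:

Something to consider before using Spark Job Server: it doesn't support Spark versions after 2.0. Also, look at their commit history - it's not super active.
@VolodymyrBakhmatiuk It is more active than Apache Livy, though.
Spark Job Server has supported Spark 2.2 for a while now.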

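【Answer 3】:

Livy is an open source REST interface for interacting with Apache Spark from anywhere. It supports executing snippets of code or whole programs in a Spark context that runs locally or in Apache Hadoop YARN.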
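To make "executing snippets of code" concrete, here is a rough sketch of the two calls involved in Livy's interactive mode: create a session, then post a statement to it. It assumes a Livy server on its default port 8998 and uses the `/sessions` and `/sessions/{id}/statements` endpoints as I understand them from the Livy REST documentation. The class and method names are hypothetical, and the code only builds the requests.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch only: request builders for Livy's interactive sessions.
final class LivySessionRequests {

    // Minimal JSON string escaping (backslashes, double quotes, newlines).
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\")
                       .replace("\"", "\\\"")
                       .replace("\n", "\\n") + "\"";
    }

    // Body for POST /sessions: asks for a Scala Spark session.
    static String sessionBody() {
        return "{\"kind\": \"spark\"}";
    }

    // Body for POST /sessions/{id}/statements: the snippet to run.
    static String statementBody(String code) {
        return "{\"code\": " + quote(code) + "}";
    }

    static HttpRequest postJson(String url, String json) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    static HttpRequest createSession(String livyUrl) {
        return postJson(livyUrl + "/sessions", sessionBody());
    }

    static HttpRequest runStatement(String livyUrl, int sessionId, String code) {
        return postJson(livyUrl + "/sessions/" + sessionId + "/statements",
                statementBody(code));
    }
}
```

For example, `runStatement("http://localhost:8998", 0, "sc.parallelize(1 to 10).sum()")` targets session 0. The session has to reach an idle state before statements will run, so a real client would poll the session first.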

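【Discussion】:

While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - From Review
You are right, I have updated my answer with more details. Thanks.
Livy has a strange release cycle. They release roughly once a year!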

【Answer 4】:

Here is a good client that you might find helpful: https://github.com/ywilkof/spark-jobs-rest-client

Edit: this answer was given in 2015. There are now options such as Livy.

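【Discussion】:

Do you know whether it is possible to launch two applications at the same time through that client?
Yes, it is possible. The client is just a wrapper around HTTP calls to the Spark Master, so if your setup can handle it, it will work.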

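【Answer 5】:

As contributor Josemy mentioned, I had this requirement too, and I was able to meet it using Livy Server. Here are the steps I took, in the hope that they help someone: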

Download the Livy zip from https://livy.apache.org/download/
Follow the instructions at https://livy.apache.org/get-started/

Upload the zip to a client machine.
Unzip the file.
Check that the following two environment variables exist; if they don't, create them with the right paths:
    export SPARK_HOME=/opt/spark
    export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop

Open port 8998 on the client.

Update $LIVY_HOME/conf/livy.conf with the master details and any other settings needed.
Note: templates are available in $LIVY_HOME/conf
E.g. livy.file.local-dir-whitelist = /home/folder-where-the-jar-will-be-kept/

Run the server:
    $LIVY_HOME/bin/livy-server start

Stop the server:
    $LIVY_HOME/bin/livy-server stop

UI: <client-ip>:8998/ui/

Submit a job: POST http://<your client ip goes here>:8998/batches
{
  "className": "<your class name goes here, with package name>",
  "file": "your jar location",
  "args": ["arg1", "arg2", "arg3"]
}
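Since the original question was about triggering the job from a Java/Spring service, the batch submission could be sketched from Java as follows. This is an illustration only: the class and method names are hypothetical, it uses `java.net.http.HttpClient` (Java 11+) rather than any Livy client library, and the `/batches/{id}/state` polling endpoint is taken from my reading of the Livy REST documentation.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.stream.Collectors;

// Sketch of a Livy batch submission; class and method names are hypothetical.
final class LivyBatchClient {

    // Minimal JSON string escaping (backslashes and double quotes only).
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Builds the same body as the JSON payload: className, file, args.
    static String batchBody(String className, String file, List<String> args) {
        String jsonArgs = args.stream().map(LivyBatchClient::quote)
                .collect(Collectors.joining(", ", "[", "]"));
        return "{\"className\": " + quote(className)
                + ", \"file\": " + quote(file)
                + ", \"args\": " + jsonArgs + "}";
    }

    // POST <livy>/batches starts the job.
    static HttpRequest submitRequest(String livyUrl, String body) {
        return HttpRequest.newBuilder(URI.create(livyUrl + "/batches"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    // GET <livy>/batches/{id}/state polls the state of a submitted batch.
    static HttpRequest stateRequest(String livyUrl, int batchId) {
        return HttpRequest.newBuilder(URI.create(livyUrl + "/batches/" + batchId + "/state"))
                .GET()
                .build();
    }

    // Sends a request and returns the raw response body (JSON text).
    static String send(HttpRequest request) throws Exception {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

A REST controller could call `send(submitRequest(livyUrl, batchBody(className, jarPath, args)))`, read the batch id from the returned JSON, and then poll with `stateRequest` until the batch finishes.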

【Discussion】: