Spark Job submitted - Waiting (TaskSchedulerImpl: Initial job not accepted) - Answers

【Question title】: Spark Job submitted - Waiting (TaskSchedulerImpl: Initial job not accepted)
【Posted】: 2016-11-16 12:29:53
【Question】:

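An API call was made to submit the job. The response states that it is running.

On the cluster UI -

Worker (slave) - worker-20160712083825-172.31.17.189-59433 is Alive

Cores: 1 of 2 used

Memory: 1 GB of 6 GB used

Running Applications

app-20160713130056-0020 - Waiting for 5 hours

Cores - unlimited

Job description of the application

Active Stage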

reduceByKey at /root/wordcount.py:23

Pending Stage

takeOrdered at /root/wordcount.py:26

Running Driver -

stderr log page for driver-20160713130051-0025 

WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

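According to "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources", the slaves have not been started, so there are no resources.

In my case, however, slave 1 is working.

According to "Unable to Execute More than a spark Job "Initial job has not accepted any resources"", I am using deploy mode = cluster (not client), since I have 1 master and 1 slave and the submit API is called via Postman or from anywhere.

The cluster also has cores, RAM, and memory available, yet the job still fails with the error shown in the UI.

According to "TaskSchedulerImpl: Initial job has not accepted any resources;", I assigned the following in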

~/spark-1.5.0/conf/spark-env.sh

Spark environment variables

SPARK_WORKER_INSTANCES=1
SPARK_WORKER_MEMORY=1000m
SPARK_WORKER_CORES=2

and replicated them across the slaves with

sudo /root/spark-ec2/copy-dir /root/spark/conf/spark-env.sh

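All the cases covered in the answers to the questions above applied to my setup, yet none of them led to a solution. Since I am working with the API and Apache Spark, maybe some other kind of help is needed.

Edited July 18, 2016

Wordcount.py - my PySpark application code -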

from pyspark import SparkContext, SparkConf

logFile = "/user/root/In/a.txt"

conf = (SparkConf().set("num-executors", "1"))

sc = SparkContext(master = "spark://ec2-54-209-108-127.compute-1.amazonaws.com:7077", appName = "MyApp", conf = conf)
print("in here")
lines = sc.textFile(logFile)
print("text read")
c = lines.count()
print("lines counted")

Error

Starting job: count at /root/wordcount.py:11
16/07/18 07:46:39 INFO scheduler.DAGScheduler: Got job 0 (count at /root/wordcount.py:11) with 2 output partitions
16/07/18 07:46:39 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (count at /root/wordcount.py:11)
16/07/18 07:46:39 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/07/18 07:46:39 INFO scheduler.DAGScheduler: Missing parents: List()
16/07/18 07:46:39 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (PythonRDD[2] at count at /root/wordcount.py:11), which has no missing parents
16/07/18 07:46:39 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 5.6 KB, free 56.2 KB)
16/07/18 07:46:39 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.4 KB, free 59.7 KB)
16/07/18 07:46:39 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.31.17.189:43684 (size: 3.4 KB, free: 511.5 MB)
16/07/18 07:46:39 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/07/18 07:46:39 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (PythonRDD[2] at count at /root/wordcount.py:11)
16/07/18 07:46:39 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
16/07/18 07:46:54 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

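According to "Spark UI showing 0 cores even when setting cores in App",

the Spark WebUI states that zero cores are used and the application waits indefinitely with no tasks running. The application also uses no memory or cores at runtime and goes into the waiting state immediately on start.

Spark version 1.6.1, Ubuntu, Amazon EC2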
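
The figures quoted in the question can be checked against each other with a few lines of Python. This is only a simplified sketch: `executor_fits` and the dictionary layout are hypothetical names made up for illustration, and the real standalone scheduler takes more into account. The one outside fact used is that Spark's documented default for `spark.executor.memory` is 1g.

```python
def executor_fits(worker, executor_cores, executor_memory_mb):
    """Return True if the worker has enough free cores and memory for one executor."""
    free_cores = worker["cores"] - worker["cores_used"]
    free_memory_mb = worker["memory_mb"] - worker["memory_used_mb"]
    return free_cores >= executor_cores and free_memory_mb >= executor_memory_mb


# Figures reported by the cluster UI in the question: 1 of 2 cores, 1 GB of 6 GB in use.
worker_from_ui = {"cores": 2, "cores_used": 1, "memory_mb": 6 * 1024, "memory_used_mb": 1024}

# Limits from the spark-env.sh in the question: SPARK_WORKER_CORES=2, SPARK_WORKER_MEMORY=1000m.
worker_from_env = {"cores": 2, "cores_used": 0, "memory_mb": 1000, "memory_used_mb": 0}

# Spark's default executor memory (spark.executor.memory=1g), expressed in MB.
DEFAULT_EXECUTOR_MEMORY_MB = 1024

print(executor_fits(worker_from_ui, 1, DEFAULT_EXECUTOR_MEMORY_MB))   # True
print(executor_fits(worker_from_env, 1, DEFAULT_EXECUTOR_MEMORY_MB))  # False
```

Under this model, the worker as shown in the UI has room for a default-sized executor, while a worker that really were limited to 1000m would not. The UI (6 GB) and the spark-env.sh value (1000m) disagree, so it may be worth confirming which setting the running worker actually picked up.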

【Comments】:

Tried running another piece of code, a simple Python application, and the error persists: from pyspark import SparkContext, SparkConf; logFile = "/user/root/In/a.txt"; conf = (SparkConf().set("num-executors", "1")); sc = SparkContext(master = "spark://ec2-54-209-108-127.compute-1.amazonaws.com:7077", appName = "MyApp", conf = conf); textFile = sc.textFile(logFile); wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b); wordCounts.saveAsTextFile("/user/root/In/output.txt")
Can you run it with spark-submit?
Try reducing the per-node memory setting in spark-submit or in the API call
I can't seem to find that setting as far as the API call is concerned
The environment variables on the master are set in /root/spark/conf/spark-env.conf - export SPARK_WORKER_INSTANCES=1; export SPARK_WORKER_CORES=2; export SPARK_WORKER_MEMORY=1000; export HADOOP_HOME="/root/ephemeral-hdfs"; export SPARK_MASTER_IP=ec2-wxyz.compute-1.amazonaws.com; export MASTER=`cat /root/spark-ec2/cluster-url`
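
The suggestion above to run the job through spark-submit with a lower memory setting could look roughly like the sketch below. `build_submit_command` is a hypothetical helper, and the values `512m` and `1` are assumptions picked for illustration, not values from the question. `--master`, `--executor-memory`, and `--total-executor-cores` are standard spark-submit options.

```python
import shlex


def build_submit_command(master_url, app_path, executor_memory="512m", total_cores=1):
    """Build a spark-submit argument list that requests a deliberately small executor."""
    return [
        "spark-submit",
        "--master", master_url,
        "--executor-memory", executor_memory,        # memory per executor
        "--total-executor-cores", str(total_cores),  # cap on cores for the whole application
        app_path,
    ]


# Master URL and script path are the ones that appear in the question.
cmd = build_submit_command(
    "spark://ec2-54-209-108-127.compute-1.amazonaws.com:7077",
    "/root/wordcount.py",
)

# Print a shell-safe command line that can be pasted into a terminal on the master.
print(" ".join(shlex.quote(part) for part in cmd))
```

If the job is accepted with a small request like this, the original request was probably larger than what the worker had free. If it still waits, resources are less likely to be the cause.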

Tags: api apache-spark amazon-ec2

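【Answer 1】:

I had the same problem. Below are the notes I took when it happened.

1:17:46 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

I noticed that it only happens during the first query from the Scala shell, where I run something that fetches data from HDFS.

When the problem occurs, the web UI states that there are no running applications.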

URL: spark://spark1:7077
REST URL: spark://spark1:6066 (cluster mode)
Alive Workers: 4
Cores in use: 26 Total, 26 Used
Memory in use: 52.7 GB Total, 4.0 GB Used
Applications: 0 Running, 0 Completed
Drivers: 0 Running, 0 Completed 
Status: ALIVE

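It seems that something failed to start, but I can't tell exactly what.

However, restarting the cluster a second time sets the Applications value to 1, and everything works well.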

URL: spark://spark1:7077
REST URL: spark://spark1:6066 (cluster mode)
Alive Workers: 4
Cores in use: 26 Total, 26 Used
Memory in use: 52.7 GB Total, 4.0 GB Used
Applications: 1 Running, 0 Completed
Drivers: 0 Running, 0 Completed
Status: ALIVE

I am still investigating. In the meantime, this quick workaround can save time until a final solution is found.

【Comments】:

【Answer 2】:

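You can take a look at my answer to a similar question, "Apache Spark on Mesos: Initial job has not accepted any resources":

While most of the other answers focus on resource allocation (cores, memory) on the Spark slaves, I would like to highlight that a firewall can cause exactly the same issue, especially when you are running Spark on a cloud platform.

If you can find the Spark slaves in the web UI, you have probably opened the standard ports 8080, 8081, 7077, and 4040. However, when you actually run a job, it uses SPARK_WORKER_PORT, spark.driver.port, and spark.blockManager.port, which are randomly assigned by default. If your firewall blocks these ports, the master cannot retrieve any job-specific response from the slaves and returns the error.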
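
One way to make the firewall rules manageable is to pin those ports to fixed values. The sketch below renders `spark-defaults.conf` lines for the two properties named above. `render_spark_defaults` is a hypothetical helper and the port numbers 40000 and 40001 are arbitrary assumptions. `SPARK_WORKER_PORT` is an environment variable, so it would go in spark-env.sh rather than in this file.

```python
def render_spark_defaults(properties):
    """Render a dict of Spark properties as spark-defaults.conf lines (key, space, value)."""
    return "\n".join(f"{key} {value}" for key, value in sorted(properties.items()))


# Hypothetical fixed ports; pick values that your firewall or security group allows.
fixed_ports = {
    "spark.driver.port": 40000,
    "spark.blockManager.port": 40001,
}

print(render_spark_defaults(fixed_ports))
# spark.blockManager.port 40001
# spark.driver.port 40000
```

With the ports fixed, only those specific ports need to be opened between the driver and the workers, instead of the whole ephemeral range.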

You can run a quick test by opening all the ports and seeing whether the slave accepts jobs.

【Comments】:
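
A less invasive check than opening everything is to probe individual ports from the machine that needs to reach them. The following is a minimal sketch using only the Python standard library. `port_is_open` and `check_ports` are hypothetical helper names, and a successful connection only shows that something is listening and reachable at that moment, not that Spark will use that port for the next job.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_ports(host, ports, timeout=2.0):
    """Map each port to True (reachable) or False (blocked, closed, or unreachable)."""
    return {port: port_is_open(host, port, timeout) for port in ports}
```

For example, calling `check_ports(host, [8080, 8081, 7077, 4040])` with the worker or master address as `host` shows which of the standard ports are reachable. The same call can be repeated with whatever values you set for spark.driver.port and spark.blockManager.port.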