【Posted】: 2016-09-21 14:47:24
【Question】:
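I am using Spark Streaming on Google Cloud Dataproc to run a framework (written in Python) that consists of several continuous pipelines. Each pipeline is a single job on Dataproc that basically reads from Kafka queues and writes the transformed output to Bigtable. Together, the pipelines process several gigabytes of data per day across 2 clusters, one with 3 worker nodes and one with 4.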
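Running this Spark Streaming framework on Dataproc had been fairly stable until the beginning of May (May 3rd, to be precise), when we started to frequently hit socket timeout exceptions that terminate our pipelines. It does not seem to be related to the load on the cluster, since the load has not increased significantly. It also happens at random times throughout the day. I have checked for possibly related code changes but could not find any. Moreover, it only seems to occur on the cluster with 4 worker nodes, while the pipelines on the 3-node cluster are very similar and see no timeouts at all. I have already recreated the cluster twice, but the problem remains, and it affects all pipelines running on this Dataproc cluster. The 3-node cluster uses the n1-standard-4 machine type and the troublesome 4-node cluster uses n1-standard-8; other than that, their configurations are identical.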
Example output of a pipeline job execution when the problem occurs and the job terminates:
java.net.SocketTimeoutException: Accept timed out
    at java.net.PlainSocketImpl.socketAccept(Native Method)
    at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
    at java.net.ServerSocket.implAccept(ServerSocket.java:545)
    at java.net.ServerSocket.accept(ServerSocket.java:513)
    at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:645)
16/05/23 14:45:45 ERROR org.apache.spark.streaming.scheduler.JobScheduler: Error running job streaming job 1464014740000 ms.0
org.apache.spark.SparkException: An exception was raised by Python:
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/streaming/util.py", line 65, in call
    r = self.func(t, *rdds)
  File "/tmp/b85990ba-e152-4d5b-8977-fb38915e78c4/transformfwpythonfiles.zip/transformationsframework/StreamManager.py", line 138, in process_kafka_rdd
    .foreach(lambda *args: None)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 747, in foreach
    self.mapPartitions(processPartition).count() # Force evaluation
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1004, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 995, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 869, in fold
    vals = self.mapPartitions(func).collect()
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 772, in collect
    return list(_load_from_socket(port, self._jrdd_deserializer))
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 142, in _load_from_socket
    for item in serializer.load_stream(rf):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 139, in load_stream
    yield self._read_with_length(stream)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 156, in _read_with_length
    length = read_int(stream)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 543, in read_int
    length = stream.read(4)
  File "/usr/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
timeout: timed out
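The stack trace starts in our StreamManager module, in the method process_kafka_rdd, which processes a single discrete RDD within the direct stream of Kafka messages. Our integration of Kafka with Spark Streaming is based on the "direct approach" described at http://spark.apache.org/docs/latest/streaming-kafka-integration.html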
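For reference, the Python half of the traceback above ends in a 4-byte length-prefix read on a local socket that raises a timeout. The following is a minimal standalone sketch of that symptom only, with no Spark involved. The helper name demo_timeout, the 0.2-second timeout, and the silent local peer are assumptions made up for illustration; read_int here is only loosely modeled on the frame of the same name in the trace and is not PySpark's code.

```python
import socket
import struct
import threading


def read_int(stream):
    """Read a 4-byte big-endian signed int (a length prefix) from a file-like stream."""
    data = stream.read(4)
    if len(data) < 4:
        raise EOFError("stream closed before a full length prefix arrived")
    return struct.unpack("!i", data)[0]


def demo_timeout(timeout_seconds=0.2):
    """Connect to a local peer that accepts but never writes, then try to read.

    Returns "timed out" if the read hits the socket timeout, "read ok" otherwise.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    release = threading.Event()

    def silent_peer():
        # Accept the connection, then hold it open without sending any bytes.
        conn, _ = server.accept()
        release.wait()
        conn.close()

    peer = threading.Thread(target=silent_peer)
    peer.daemon = True
    peer.start()

    client = socket.create_connection(("127.0.0.1", port))
    client.settimeout(timeout_seconds)  # hypothetical value, chosen to keep the demo fast
    stream = client.makefile("rb")
    try:
        read_int(stream)
        return "read ok"
    except socket.timeout:
        return "timed out"
    finally:
        release.set()
        stream.close()
        client.close()
        peer.join(1)
        server.close()


if __name__ == "__main__":
    print(demo_timeout())
```

The sketch only shows that a blocked length-prefix read surfaces as a socket timeout when the peer stays silent past the deadline. It says nothing about why the JVM side stopped serving data in the real job.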
【Discussion】:
- How many consumers and partitions did you have when the error occurred?
Tags: pyspark spark-streaming google-cloud-dataproc