Sequential processing within a Spark batch: answers

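【Question title】: Sequential processing within a Spark batch
【Posted】: 2018-10-30 07:41:34
【Question】:

I have a question about sequential processing within a Spark batch. Below is a stylized version of the problem I am trying to solve, kept simple on purpose.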

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Simple Dataframe Processing")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._

val df = spark.read.json("devices.json")

// Displays the content of the DataFrame to stdout
df.show()

// +------------+------------+
// | device-guid|   Operation|
// +------------+------------+
// |1234        |   Add 3    |
// |1234        |   Sub 3    |
// |1234        |   Add 2    |
// |1234        |   Sub 2    |
// |1234        |   Add 1    |
// |1234        |   Sub 1    |
// +------------+------------+


// I have a database with one table that has the following columns:
//   device-guid (primary key)   result


// I would like to take df and, for each row, apply an update to a single DB row,
// adding or subtracting the number given in the Operation column.
// So the result I expect in the DB at the end is a single row:

// device-guid      result
// 1234             0


df.foreach { row =>
  UpdateDB(row) // Update the DB with the row's Operation.
                // Actual method not shown
}

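Suppose I run this on a Spark cluster with YARN that has 5 executors with 2 cores each, spread across 5 worker nodes. What in Spark guarantees that the UpdateDB operations are scheduled and executed sequentially, in the order of the rows in the DataFrame, rather than in parallel?

That is, I always want to end up with 0 in the result column of my DB.

The question in a broader sense is: "What guarantees sequential processing of operations on a DataFrame, even with multiple executors and cores?"

Can you point me to Spark documentation that shows these tasks will be processed in sequence?

Are there any Spark properties that need to be set for this to work?

Regards,

Venkat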

【Question discussion】:

Tags: apache-spark apache-spark-sql scheduled-tasks

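【Solution 1】:

The question in a broader sense is: "What guarantees sequential processing of operations on a DataFrame, even with multiple executors and cores?"

Nothing, short of having no parallelism at all, that is, only a single partition.

A single core may have a similar effect, but it does not guarantee any particular order in which the chunks are processed.

If you really need sequential processing, then you are using the wrong tool for the job.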
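
To illustrate the single-partition case mentioned above, here is a minimal sketch, not a definitive implementation. It assumes the Operation column holds strings such as "Add 3" or "Sub 3", as in the question. `delta` and `updateDB` are hypothetical stand-ins for the asker's `UpdateDB`, which is not shown. `coalesce(1)` puts all rows into one partition, so a single task walks through them one after another. Whether that iteration order matches the order in devices.json depends on how the source was read, so treat the ordering as an assumption rather than something Spark promises.

```scala
import org.apache.spark.sql.{Row, SparkSession}

object SequentialUpdate {
  // Hypothetical helper: turns "Add 3" into +3 and "Sub 3" into -3.
  def delta(operation: String): Int = operation.trim.split("\\s+") match {
    case Array("Add", n) => n.toInt
    case Array("Sub", n) => -(n.toInt)
    case _ => throw new IllegalArgumentException(s"Unknown operation: $operation")
  }

  // Hypothetical stand-in for UpdateDB; a real version would issue a DB update.
  // Here it only prints, and the output lands in the executor's stdout.
  def updateDB(row: Row): Unit = {
    val guid = row.getAs[Any]("device-guid")
    val change = delta(row.getAs[String]("Operation"))
    println(s"device-guid=$guid change=$change")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Sequential updates").getOrCreate()
    val df = spark.read.json("devices.json")

    // One partition means one task, so rows are handled one at a time.
    df.coalesce(1).rdd.foreachPartition { rows =>
      rows.foreach(updateDB)
    }

    spark.stop()
  }
}
```

For a data set small enough to fit on the driver, `df.collect().foreach(SequentialUpdate.updateDB)` is another way to take parallelism out of the picture, at the cost of moving every row to the driver.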

【Discussion】: