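【Question title】:Calculate lower and upper bounds for partition Spark JDBC [duplicate]
【Posted】:2019-04-08 10:15:16
【Question】:

I am using Spark JDBC with Scala to read data from MS SQL Server, and I would like to partition this data by a specified column. I do not want to set the lower and upper bounds for the partition column manually. Can I read some kind of maximum and minimum value for this field and set them as the upper/lower bounds? Also, I want this query to read all the data from the database. Currently the query mechanism looks like this: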

def jdbcOptions() = Map[String,String](
    "driver" -> "db.driver",
    "url" -> "db.url",
    "user" -> "db.user",
    "password" -> "db.password",
    "customSchema" -> "db.custom_schema",
    "dbtable" -> "(select * from TestAllData where dayColumn > 'dayValue') as subq",
    "partitionColumn" -> "db.partitionColumn",
    "lowerBound" -> "1",
    "upperBound" -> "30",
    "numPartitions" -> "5"
)

    val dataDF = sparkSession
      .read
      .format("jdbc")
      .options(jdbcOptions())
      .load()

【Comments】:

  • Hi @Cassie, is db.partitionColumn a numeric column?
  • @AlexandrosBiratsis Yes, the data type of partitionColumn is int

Tags: sql-server scala apache-spark spark-jdbc


【Solution 1】:

If the partition column (db.partitionColumn) is a numeric or date field, you can retrieve the bounds with the following code:

def jdbcBoundOptions() = Map[String,String](
    "driver" -> "db.driver",
    "url" -> "db.url",
    "user" -> "db.user",
    "password" -> "db.password",
    "customSchema" -> "db.custom_schema",
    // SQL Server requires every column of a derived table to be named, hence the aliases
    "dbtable" -> "(select max(db.partitionColumn) as maxValue, min(db.partitionColumn) as minValue from TestAllData where dayColumn > 'dayValue') as subq",
    "numPartitions" -> "1"
)

val boundRow = sparkSession
    .read
    .format("jdbc")
    .options(jdbcBoundOptions())
    .load()
    .first()

// Column 0 holds the maximum and column 1 the minimum, in the order selected by the bounds query
val maxDay = boundRow.getInt(0)
val minDay = boundRow.getInt(1)

Note that numPartitions must be 1 here, and that we do not need to specify any partitioning details in this case, as described in the Spark documentation.
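The bounds read above can be factored into a small helper so that the connection settings are shared and the partitioning keys are left out. This is only a sketch in plain Scala: the object name `BoundsOptions`, its function names, and the aliases `maxValue`/`minValue` are hypothetical, and the table, filter and column are assumed to be trusted SQL fragments.

```scala
object BoundsOptions {
  // Hypothetical helper: builds the derived-table text for the bounds query.
  // The max comes first and the min second, as in the answer's query.
  def boundsSubquery(table: String, filter: String, column: String): String =
    s"(select max($column) as maxValue, min($column) as minValue " +
      s"from $table where $filter) as bounds"

  // Connection settings plus a single-partition read of the bounds subquery.
  // Any partitioning keys already present in `connection` are dropped on purpose.
  def boundsOptions(
      connection: Map[String, String],
      table: String,
      filter: String,
      column: String): Map[String, String] = {
    val partitionKeys = Set("partitionColumn", "lowerBound", "upperBound")
    connection.filterNot { case (key, _) => partitionKeys(key) } ++ Map(
      "dbtable" -> boundsSubquery(table, filter, column),
      "numPartitions" -> "1"
    )
  }
}
```

The resulting map can be passed to `.options(...)` in the same way as `jdbcBoundOptions()`.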

Finally, you can use the retrieved bounds in the original query:

def jdbcOptions() = Map[String,String](
    "driver" -> "db.driver",
    "url" -> "db.url",
    "user" -> "db.user",
    "password" -> "db.password",
    "customSchema" -> "db.custom_schema",
    "dbtable" -> "(select * from TestAllData where dayColumn > 'dayValue') as subq",
    "partitionColumn" -> "db.partitionColumn",
    "lowerBound" -> minDay.toString,
    "upperBound" -> maxDay.toString,
    "numPartitions" -> "5"
)
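On the question of reading all the data: according to the Spark JDBC documentation, lowerBound and upperBound only decide the partition stride and do not filter rows, so every row matched by the dbtable subquery is still returned. The sketch below is a simplified model of how such bounds can be turned into per-partition predicates. It is not Spark's actual code; the object `PartitionSketch` and its `predicates` function are hypothetical, and an integer column is assumed.

```scala
object PartitionSketch {
  // Simplified model: split [lower, upper) into numPartitions ranges of equal stride.
  // The first and last ranges are left open so values outside the bounds are not lost.
  def predicates(column: String, lower: Long, upper: Long, numPartitions: Int): Seq[String] = {
    require(numPartitions >= 2, "this sketch only covers partitioned reads")
    val stride = upper / numPartitions - lower / numPartitions
    require(stride > 0, "bounds are too narrow for this many partitions")
    (0 until numPartitions).map { i =>
      val start = lower + i * stride
      val end = start + stride
      if (i == 0) s"$column < $end OR $column IS NULL"   // also catches NULLs and values below lower
      else if (i == numPartitions - 1) s"$column >= $start" // also catches values above upper
      else s"$column >= $start AND $column < $end"
    }
  }
}
```

With the question's original settings (bounds 1 and 30, 5 partitions) this model gives a stride of 6. Bounds that are far from the real minimum and maximum therefore do not drop rows, they only skew the work toward the first or last partition, which is why retrieving the real bounds helps.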
