【问题标题】:Grouping DataSet of different classes in Spark在 Spark 中对不同类的数据集进行分组
【发布时间】:2018-01-19 06:38:36
【问题描述】:

是否可以将两个不同类的DataSet分组,使得结果为

key -> 数组 ([Class1 实例], [Class2 实例], [Class2 实例])

为了澄清这个问题,这里是简单的 scala 代码。

object DataSetGrouping {

  import org.apache.spark.sql.SparkSession
  import java.sql.Timestamp

  case class Loan(loanId: String, principalAmount: Double)
  case class Payment(loanId: String, paymentAmount: Double, paymentDate: Timestamp)

  def main(args: Array[String]) {

    val spark = SparkSession.builder().master("local").appName("DataSetGrouping").getOrCreate()
    import spark.implicits._

    val loanData = Seq(
      Loan("loan1", 30000),
      Loan("loan2", 60000)).toDS()

    val paymentsData = Seq(
      Payment("loan1", 10000, date("2017-07-31")),
      Payment("loan1", 10000, date("2017-08-31")),
      Payment("loan2", 20000, date("2017-07-31")),
      Payment("loan2", 20000, date("2017-08-31"))).toDS()

    val paymentMap = paymentsData.map(p => (p.loanId, p))
    val loanMap = loanData.map(l => (l.loanId, l))

    paymentMap.show()
    loanMap.show()
  }

  def date(date: String): Timestamp = {
    return java.sql.Timestamp.valueOf(java.time.LocalDateTime.parse(date + "T00:00:00"))
  }

}

是否可以对这两个数据集进行分组,结果如下?

loan1 -> [ Loan("loan1",...), Payment("loan1",...), 付款(“loan1”,...)],

loan2 -> [ Loan("loan2",...), 付款(“loan2”,...),付款(“loan2”,...)]

【问题讨论】:

    标签: scala apache-spark dataset


    【解决方案1】:

    在不与 Kryo EncoderAny 打交道的情况下,您可以获得的最接近的东西可能是这样的:

    paymentsData.groupByKey(_.loanId).mapGroups { 
      case (id, xs) => (id, xs.toSeq) 
    }.toDF("loanID", "payments").join(loanData, Seq("loanID"))
    
    +------+------------------------------------------------------------------------------+---------------+
    |loanID|payments                                                                      |principalAmount|
    +------+------------------------------------------------------------------------------+---------------+
    |loan1 |[[loan1,10000.0,2017-07-31 00:00:00.0], [loan1,10000.0,2017-08-31 00:00:00.0]]|30000.0        |
    |loan2 |[[loan2,20000.0,2017-07-31 00:00:00.0], [loan2,20000.0,2017-08-31 00:00:00.0]]|60000.0        |
    +------+------------------------------------------------------------------------------+---------------+
    

    由于分组而相当昂贵。

    【讨论】:

      猜你喜欢
      • 2015-02-04
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 2021-02-05
      • 2019-09-29
      • 2016-02-13
      • 2015-03-13
      相关资源
      最近更新 更多