【Posted on】: 2018-01-19 06:38:36
【Question】:
Is it possible to group DataSets of two different classes so that the result is
key -> array ([Class1 instance], [Class2 instance], [Class2 instance])?
To clarify the question, here is some simple Scala code.
object DataSetGrouping {
  import org.apache.spark.sql.SparkSession
  import java.sql.Timestamp

  case class Loan(loanId: String, principalAmount: Double)
  case class Payment(loanId: String, paymentAmount: Double, paymentDate: Timestamp)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("DataSetGrouping").getOrCreate()
    import spark.implicits._

    val loanData = Seq(
      Loan("loan1", 30000),
      Loan("loan2", 60000)).toDS()

    val paymentsData = Seq(
      Payment("loan1", 10000, date("2017-07-31")),
      Payment("loan1", 10000, date("2017-08-31")),
      Payment("loan2", 20000, date("2017-07-31")),
      Payment("loan2", 20000, date("2017-08-31"))).toDS()

    // Key each record by its loanId.
    val paymentMap = paymentsData.map(p => (p.loanId, p))
    val loanMap = loanData.map(l => (l.loanId, l))

    paymentMap.show()
    loanMap.show()
  }

  def date(date: String): Timestamp =
    java.sql.Timestamp.valueOf(java.time.LocalDateTime.parse(date + "T00:00:00"))
}
Is it possible to group these two datasets so that the result looks like this?

loan1 -> [ Loan("loan1", ...), Payment("loan1", ...), Payment("loan1", ...) ],
loan2 -> [ Loan("loan2", ...), Payment("loan2", ...), Payment("loan2", ...) ]
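To pin down the target shape, here is a minimal plain-Scala sketch (no Spark) that builds exactly this key -> mixed-array structure by wrapping both classes in `Either` and grouping a tagged union of the two collections. The object name `GroupingSketch` and the `group` helper are illustrative names, not part of any API; in Spark itself the same idea would require a common wrapper type with an encoder before unioning the two Datasets and calling `groupByKey`.

```scala
import java.sql.Timestamp

object GroupingSketch {
  case class Loan(loanId: String, principalAmount: Double)
  case class Payment(loanId: String, paymentAmount: Double, paymentDate: Timestamp)

  // Tag each record with its loanId, wrap it in Either so both classes can
  // share one collection, then group the mixed collection by the key.
  def group(loans: Seq[Loan], payments: Seq[Payment]): Map[String, Seq[Either[Loan, Payment]]] = {
    val tagged: Seq[(String, Either[Loan, Payment])] =
      loans.map(l => l.loanId -> Left(l)) ++
      payments.map(p => p.loanId -> Right(p))
    tagged.groupBy(_._1).map { case (key, pairs) => key -> pairs.map(_._2) }
  }
}
```

Each value in the resulting map is a `Seq[Either[Loan, Payment]]`, i.e. the mixed `[Loan, Payment, Payment]` array sketched above, with `Left` holding the loan and `Right` the payments.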
【Discussion】:
Tags: scala apache-spark dataset