[Posted]: 2015-08-05 21:19:45
[Question]:
I have 2 RDDs. In Spark Scala, how can I join event1001RDD and event2009RDD when they have the same id?
val event1001RDD: schemaRDD = [eventtype,id,location,date1]
[1001,4929102,LOC01,2015-01-20 10:44:39]
[1001,4929103,LOC02,2015-01-20 10:44:39]
[1001,4929104,LOC03,2015-01-20 10:44:39]
val event2009RDD: schemaRDD = [eventtype,id,date1,date2]
[2009,4929101,2015-01-20 20:44:39,2015-01-20 20:44:39]
[2009,4929102,2015-01-20 15:44:39,2015-01-20 21:44:39]
[2009,4929103,2015-01-20 14:44:39,2015-01-20 14:44:39]
[2009,4929105,2015-01-20 20:44:39,2015-01-20 20:44:39]
The expected result would be (unique, sorted by id):
[eventtype,id,location from 1001,date1 from 1001,date1 from 2009,date2 from 2009]
2009,4929101,NULL,NULL,2015-01-20 20:44:39,2015-01-20 20:44:39
1001,4929102,LOC01,2015-01-20 10:44:39,2015-01-20 15:44:39,2015-01-20 21:44:39
1001,4929103,LOC02,2015-01-20 10:44:39,2015-01-20 14:44:39,2015-01-20 14:44:39
1001,4929104,LOC03,2015-01-20 10:44:39,NULL,NULL
2009,4929105,NULL,NULL,2015-01-20 20:44:39,2015-01-20 20:44:39
Note that for id 4929102, 1001 is used as the event type. The 2009 event type is only used when there is no matching id in 1001.
The result can be a flat RDD[String], or an RDD of tuples via aggregateByKey. I just need to iterate over the RDD.
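A sketch of one way to get this output, assuming the two RDDs have been parsed into the case classes below (the class and field names are illustrative, not from the original post). It keys both RDDs by id, does a `fullOuterJoin`, sorts by id, and applies the "1001 wins" rule for the event type:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical row types matching the two schemas shown above.
case class Event1001(eventtype: String, id: String, location: String, date1: String)
case class Event2009(eventtype: String, id: String, date1: String, date2: String)

def mergeEvents(event1001RDD: RDD[Event1001],
                event2009RDD: RDD[Event2009]): RDD[String] = {
  // Key each side by id so the join can line up matching rows.
  val left  = event1001RDD.map(e => (e.id, e))
  val right = event2009RDD.map(e => (e.id, e))

  left.fullOuterJoin(right)   // (id, (Option[Event1001], Option[Event2009]))
    .sortByKey()              // output sorted by id
    .map { case (id, (e1, e2)) =>
      // Prefer the 1001 event type; fall back to 2009 only when 1001 has no match.
      val eventtype = e1.map(_.eventtype).getOrElse(e2.get.eventtype)
      val location  = e1.map(_.location).getOrElse("NULL")
      val date1001  = e1.map(_.date1).getOrElse("NULL")
      val date2009a = e2.map(_.date1).getOrElse("NULL")
      val date2009b = e2.map(_.date2).getOrElse("NULL")
      s"$eventtype,$id,$location,$date1001,$date2009a,$date2009b"
    }
}
```

`fullOuterJoin` (available on pair RDDs since Spark 1.2) keeps ids that appear on either side, wrapping each side in an `Option`, which is what makes the NULL-filling straightforward here.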
Tags: sql hadoop mapreduce apache-spark rdd