What you want to do is: your value (in the input K,V) is an iterable, and you want to sum over its inner keys and return the result as =>
(outer_key (e.g. 1, 2) -> List((inner_key (e.g. "K1", "K2"), summed_value)))
As you can see, the sum is computed over the inner key-value pairs.
We can do this by:
=> first flattening the elements out of each list item
=> making the new key (outer_key, inner_key)
=> summing by (outer_key, inner_key) -> value
=> reshaping the data back to (outer_key -> (inner_key, summed_value))
=> finally grouping again on the outer key
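Before going to Spark, the same four steps can be sketched with plain Python collections (no Spark needed; the variable names here are illustrative, not from the Spark code below):

```python
from collections import defaultdict

data = [(1, [("k1", 4), ("k2", 3), ("k1", 2)]),
        (2, [("k3", 1), ("k3", 8), ("k1", 6)])]

# Steps 1-2: flatten each list and re-key by (outer_key, inner_key)
flat = [((outer, inner), v) for outer, inners in data for inner, v in inners]

# Step 3: sum values per (outer_key, inner_key)
sums = defaultdict(int)
for key, v in flat:
    sums[key] += v

# Steps 4-5: reshape back and regroup as outer_key -> [(inner_key, summed_value)]
result = defaultdict(list)
for (outer, inner), total in sorted(sums.items()):
    result[outer].append((inner, total))

print(dict(result))
# {1: [('k1', 6), ('k2', 3)], 2: [('k1', 6), ('k3', 9)]}
```

Each loop here corresponds to one Spark transformation in the solutions below: the flattening is `flatMap`, the summing is `reduceByKey`, and the regrouping is `groupByKey`.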
I'm less familiar with Python, but I believe simply swapping the Scala collection syntax for Python's is enough; the solution follows the same steps.
SCALA VERSION
scala> val keySeq = Seq((1,List(("K1",4),("K2",3),("K1",2))),
| (2,List(("K3",1),("K3",8),("K1",6))))
keySeq: Seq[(Int, List[(String, Int)])] = List((1,List((K1,4), (K2,3), (K1,2))), (2,List((K3,1), (K3,8), (K1,6))))
scala> val inRdd = sc.parallelize(keySeq)
inRdd: org.apache.spark.rdd.RDD[(Int, List[(String, Int)])] = ParallelCollectionRDD[111] at parallelize at <console>:26
scala> inRdd.take(10)
res64: Array[(Int, List[(String, Int)])] = Array((1,List((K1,4), (K2,3), (K1,2))), (2,List((K3,1), (K3,8), (K1,6))))
// And the solution:
scala> inRdd.flatMap { case (i, l) => l.map { case (k, v) => ((i, k), v) } }.
     |   reduceByKey(_ + _).
     |   map { case ((outer, inner), sum) => (outer, (inner, sum)) }.
     |   groupByKey.
     |   mapValues(_.toList.sorted).
     |   collect()
// RESULT:
res65: Array[(Int, List[(String, Int)])] = Array((1,List((K1,6), (K2,3))), (2,List((K1,6), (K3,9))))
UPDATE => Python solution
>>> data = sc.parallelize([(1,[('k1',4),('k2',3),('k1',2)]),\
... (2,[('k3',1),('k3',8),('k1',6)])])
>>> data.collect()
[(1, [('k1', 4), ('k2', 3), ('k1', 2)]), (2, [('k3', 1), ('k3', 8), ('k1', 6)])]
# Similar operation
>>> data.flatMap(lambda x: [((x[0], y[0]), y[1]) for y in x[1]]) \
...     .reduceByKey(lambda a, b: a + b) \
...     .map(lambda x: (x[0][0], (x[0][1], x[1]))) \
...     .groupByKey().mapValues(list).collect()
# RESULT
[(1, [('k1', 6), ('k2', 3)]), (2, [('k3', 9), ('k1', 6)])]
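Note that, unlike the Scala version, the Python pipeline above does not sort the pairs inside each group (and `groupByKey` makes no ordering guarantee). One way to match the Scala output is to sort after collecting, sketched here in plain Python on the collected result (`res` copies the output above):

```python
res = [(1, [('k1', 6), ('k2', 3)]), (2, [('k3', 9), ('k1', 6)])]

# Sort each group's (inner_key, value) pairs; in PySpark this could also be
# done before collect() with .mapValues(sorted) (an assumption, not shown above)
sorted_res = [(outer, sorted(pairs)) for outer, pairs in res]
print(sorted_res)
# [(1, [('k1', 6), ('k2', 3)]), (2, [('k1', 6), ('k3', 9)])]
```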