Here is one possibility.
Given this input RDD:
val input = sc
.parallelize(Array(
"aa Droomstele 1 8030",
"aa Wikiquote 1 78261",
"aa Special 1 20493",
"aa.b Droomstele 7 4749",
"aa.b Droomstele 1 4751",
"af Blowback 2 16896",
"af Bluff 2 21442",
"en Bloubok 1 0"
))
.map(row => row.split(" "))
The following returns Droomstele, the most frequent value in the second column:
input.map(split => (split(1), 1)) // RDD[("Droomstele", 1), ...]
.reduceByKey(_ + _) // RDD[..., ("Droomstele", 3), ...]
.sortBy(_._2, ascending = false) // RDD[("Droomstele", 3), ...] (descending by count, so Droomstele is first)
.first // ("Droomstele", 3)
._1 // "Droomstele"
Or, slightly faster, because takeOrdered does not need to sort the whole RDD:
input.map(split => (split(1), 1)) // RDD[("Droomstele", 1), ...]
.reduceByKey(_ + _) // RDD[..., ("Droomstele", 3), ...]
.takeOrdered(1)(Ordering[Int].reverse.on(_._2)) // Array[("Droomstele", 3)]
.head // ("Droomstele", 3)
._1 // "Droomstele"
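To sanity-check the expected result without a Spark context, the same counting logic can be sketched with plain Scala collections. This is only an illustration under assumptions: the object name MostFrequentTitle and the names rows, counts and mostFrequent are hypothetical and not part of the original answer, and groupBy plus maxBy stand in for reduceByKey plus takeOrdered.

```scala
// Plain-Scala sketch (no Spark required); names here are hypothetical.
object MostFrequentTitle {
  // Same sample rows as the input RDD above.
  val rows: Seq[String] = Seq(
    "aa Droomstele 1 8030",
    "aa Wikiquote 1 78261",
    "aa Special 1 20493",
    "aa.b Droomstele 7 4749",
    "aa.b Droomstele 1 4751",
    "af Blowback 2 16896",
    "af Bluff 2 21442",
    "en Bloubok 1 0"
  )

  // Count how often each value of the second field (index 1) occurs.
  val counts: Map[String, Int] =
    rows
      .map(_.split(" ")(1))
      .groupBy(identity)
      .map { case (title, occurrences) => (title, occurrences.size) }

  // Pick the title with the largest count.
  val mostFrequent: String = counts.maxBy(_._2)._1

  def main(args: Array[String]): Unit =
    println(mostFrequent)
}
```

With this sample data no other title reaches the top count; if several titles shared it, which one is returned would depend on iteration order, both here and in the Spark versions.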