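【Title】: Spark - most frequent word following a given word
【Posted】: 2018-04-28 22:56:29
【Question】:

I'm learning Scala and trying to figure out how to write a MapReduce program in Scala that finds, for each word in a file, the word that most often follows it. This is what I have so far. It works, but I'd like to actually use map-reduce, and I'm trying to eliminate as many loops as possible.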

    // initialize the list with the first two words
    val list = scala.collection.mutable.MutableList((words.collect()(0), words.collect()(1)))

    for (x <- 1 to (words.collect().length - 2)) {
      // add each (word, next word) pair to the list
      list += ((words.collect()(x), words.collect()(x + 1)))
    }
    val rdd1 = spark.parallelize(list)

    val rdd2 = rdd1.map(word => (word, 1)) // ex: key is (basketball,is), value is 1

    val counter = rdd2.reduceByKey((x, y) => x + y).sortBy(_._2, false) // sort by count, descending

    val result2 = counter.collect()

    print("the most frequent follower for basketball, the, and competitive \n")
    println(" ")

    // call the function for each word of interest
    findFreq("basketball", result2)
    findFreq("the", result2)
    findFreq("competitive", result2)
  }

  // method to find the most frequent follower of a specific word
  def findFreq(str: String, RDD: Array[((String, String), Int)]): Unit = {
    var max = -1

    // first pass: find the highest count among pairs whose first word is str
    for (x <- RDD) {
      if (x._1._1.equals(str) && x._2 > max) {
        max = x._2
      }
    }

    // second pass: display the results
    for (x <- RDD) {
      if (x._1._1.equals(str) && x._2 == max) {
        println("\"" + x._1._1 + "\"" + " is followed by " + "\"" + x._1._2 + "\"" + " " + x._2 + " times.\n")
      }
    }
  }
}

    Tags: scala apache-spark mapreduce


    【Solution 1】:

    Given an RDD of words, you can obtain the word that most frequently follows a given word with a few transformations:

    Step 1: Build an RDD of word pairs using sliding(2)

      .sliding(2)
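
    To see what the windowing step produces, here is a minimal sketch on a plain Scala collection rather than an RDD. The object and method names (SlidingPairsSketch, wordPairs) are hypothetical and only for illustration. Note that on local collections, sliding can yield a shorter final window when the input has fewer than two elements; the pattern match below drops it.

```scala
object SlidingPairsSketch {
  // Hypothetical helper: pair each word with the word that directly follows it.
  def wordPairs(words: Seq[String]): List[(String, String)] =
    words.sliding(2)
      .collect { case Seq(w1, w2) => (w1, w2) } // keep only full windows of two words
      .toList
}
```

    For example, SlidingPairsSketch.wordPairs(Seq("the", "nba", "is")) gives List(("the", "nba"), ("nba", "is")).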
    

    Step 2: Build a pair RDD keyed by (word, w2), then use reduceByKey to count the occurrences of each word pair that starts with the given word

      .collect{ case Array(`word`, w2) => ((word, w2), 1) }
      .reduceByKey( _ + _ )
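
    The backticks around word in the pattern matter: they make the pattern match against the existing value of word instead of binding a fresh variable that would match every pair. A minimal local-collection sketch of the same filtering and counting, using hypothetical names (FollowerCountSketch, followerCounts) and groupBy as a stand-in for reduceByKey:

```scala
object FollowerCountSketch {
  // Hypothetical helper: count how often each word directly follows `word`.
  def followerCounts(word: String, words: Seq[String]): Map[String, Int] =
    words.sliding(2)
      .collect { case Seq(`word`, w2) => w2 } // backticks: compare with the value of `word`
      .toList
      .groupBy(identity)                      // local stand-in for keying and reduceByKey(_ + _)
      .map { case (w2, occurrences) => (w2, occurrences.size) }
}
```

    With the input Seq("the", "nba", "is", "the", "league", "and", "the", "nba") and word "the", this gives Map("nba" -> 2, "league" -> 1).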
    

    Step 3: Re-key the pair RDD by word, then use reduceByKey to keep the word pair with the highest count

      .map{ case ((`word`, w2), c) => (word, (w2, c)) }
      .reduceByKey( (acc, x) => if (x._2 > acc._2) (x._1, x._2) else acc )
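
    The combining function in this step simply keeps whichever (follower, count) pair has the larger count. A minimal sketch of that function applied to a local list, with hypothetical names (MaxFollowerSketch, keepLarger, mostFrequent). One caveat worth keeping in mind: when two followers tie on count, the local version keeps the earlier one, whereas in Spark the winner may depend on the order in which partial results are combined.

```scala
object MaxFollowerSketch {
  // Same shape as the reduceByKey function above: keep the pair with the larger count.
  def keepLarger(acc: (String, Int), x: (String, Int)): (String, Int) =
    if (x._2 > acc._2) x else acc

  // Hypothetical helper: reduce a list of (follower, count) pairs to the most frequent one.
  def mostFrequent(counts: List[(String, Int)]): Option[(String, Int)] =
    counts.reduceOption((a, b) => keepLarger(a, b))
}
```

    For example, MaxFollowerSketch.mostFrequent(List(("league", 1), ("leagues", 2), ("is", 1))) gives Some(("leagues", 2)), and an empty list gives None.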
    

    Putting it all together, with the transformations wrapped in a method:

    import org.apache.spark.sql.functions._
    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.rdd.RDDFunctions._
    
    // load a RDD of words from the text file
    val rdd = sc.textFile("/path/to/basketball.txt")
      .flatMap( _.split("""[\s,.;:!?]+""") )
      .map( _.toLowerCase )
    
    def mostFreq(word: String, rdd: RDD[String]): RDD[(String, (String, Int))] =
      rdd
        .sliding(2)
        .collect{ case Array(`word`, w2) => ((word, w2), 1) }
        .reduceByKey( _ + _ )
        .map{ case ((`word`, w2), c) => (word, (w2, c)) }
        .reduceByKey( (acc, x) => if (x._2 > acc._2) (x._1, x._2) else acc )
    

    Display the word that most frequently follows the given word:

    mostFreq("basketball", rdd).foreach{ case (word, (w2, c)) =>
      println(s"'$word' is followed most frequently by '$w2' for $c times. ")
    }
    // 'basketball' is followed most frequently by 'leagues' for 2 times. 
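
    The question asks for the most frequent follower of every word, not just one. The same count-then-take-the-maximum shape extends to that case by not filtering on a single word. Below is a minimal sketch on a plain Scala collection, assuming the word list fits in memory; the names (AllWordsSketch, mostFreqFollowers) are hypothetical, and an RDD version would use sliding(2) and two reduceByKey steps in place of the groupBy calls.

```scala
object AllWordsSketch {
  // Hypothetical helper: for every word, find the word that follows it most often.
  def mostFreqFollowers(words: Seq[String]): Map[String, (String, Int)] =
    words.sliding(2)
      .collect { case Seq(w1, w2) => (w1, w2) }               // all (word, next word) pairs
      .toList
      .groupBy(identity)                                      // group identical pairs
      .toList
      .map { case ((w1, w2), occ) => (w1, (w2, occ.size)) }   // re-key by the first word
      .groupBy(_._1)
      .map { case (w1, followers) => (w1, followers.map(_._2).maxBy(_._2)) }
}
```

    With the input Seq("the", "nba", "is", "the", "league", "and", "the", "nba"), the entry for "the" is ("nba", 2).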
    

    Sample text file: /path/to/basketball.txt (content from Wikipedia):

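    Basketball is one of the world's most popular and widely viewed sports. The National Basketball Association (NBA) is the most significant professional basketball league in the world in terms of popularity, salaries, talent, and level of competition. Outside North America, the top clubs from national basketball leagues qualify to continental championships such as the Euroleague and the FIBA Americas League. The FIBA Basketball World Cup and Men's Olympic Basketball Tournament are the major international events of the sport and attract top national teams from around the world.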
