【Posted】: 2015-10-08 17:41:32
【Question】:
Suppose these are my documents:
very pleased product . phone lightweight comfortable sound quality good house yard .
quality construction phone base unit good . ample supply cable adapter . plug computer soundcard .
shop unit mail rebate . unit battery pack hold play time strap carr headphone adapter cable perfect digital copy optical. component micro plug stereo connector cable micro plug rca cable .
unit primarily record guitar jam session . input plug provide power plug microphone . decent stereo mic need digital recording performance . mono mode double recording time .
admit like new electronic toy . digital camera not impress .
I want to extract all bigrams and trigrams from every sentence in each document, together with their occurrence counts.
I tried:
import org.apache.spark.{SparkConf, SparkContext}

case class trigram(first: String, second: String, third: String) {
  def mkReplacement(s: String) =
    s.replaceAll(first + " " + second + " " + third, first + "-" + second + "-" + third)
}

def stringToTrigrams(s: String) = {
  val words = s.split(".")
  if (words.size >= 3) {
    words.sliding(3).map(a => trigram(a(0), a(1), a(2)))
  }
  else
    Iterator[trigram]()
}

val conf = new SparkConf()
val sc = new SparkContext(conf)
val data = sc.textFile("docs")
val trigrams = data.flatMap {
  stringToTrigrams
}.collect()
val trigramCounts = trigrams.groupBy(identity).mapValues(_.size)
But it does not show any trigrams?
【Discussion】:
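One likely cause, offered as a guess rather than a confirmed diagnosis: `String.split` takes a regular expression, so `s.split(".")` matches every character and returns an empty array, which means the `words.size >= 3` branch is never taken. Even with the period escaped, the pieces would be whole sentences, not words, so a second split on whitespace seems to be needed. Below is a minimal sketch in plain Scala (no Spark required to try it) of per-sentence n-gram counting. The helper names `sentences`, `ngrams` and `ngramCounts` are hypothetical and not part of the original post.

```scala
// Split a document into sentences on literal periods, then into words.
// String.split takes a regular expression, so the period is escaped.
def sentences(doc: String): List[List[String]] =
  doc.split("\\.").toList
    .map(_.trim)
    .filter(_.nonEmpty)
    .map(_.split("\\s+").toList)

// All n-grams of a single sentence. The size guard matters because
// sliding(n) on a shorter, non-empty list yields one partial window.
def ngrams(words: List[String], n: Int): List[List[String]] =
  if (words.size >= n) words.sliding(n).toList else Nil

// Count n-grams over all documents; n-grams never cross sentence boundaries.
def ngramCounts(docs: Seq[String], n: Int): Map[List[String], Int] =
  docs
    .flatMap(d => sentences(d))
    .flatMap(ws => ngrams(ws, n))
    .groupBy(identity)
    .map { case (gram, occurrences) => gram -> occurrences.size }

// Usage on two of the sample documents from the question.
val docs = Seq(
  "very pleased product . phone lightweight comfortable sound quality good house yard .",
  "admit like new electronic toy . digital camera not impress ."
)
val bigramCounts = ngramCounts(docs, 2)
val trigramCounts = ngramCounts(docs, 3)
```

With Spark, the same helpers could presumably be applied to the RDD directly, for example `data.flatMap(d => sentences(d)).flatMap(ws => ngrams(ws, 3)).map(g => (g, 1)).reduceByKey(_ + _)`, so that counting happens on the cluster instead of after `collect()`.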
-
While it is nice to see my code being reused, an acknowledgement would have been nice (stackoverflow.com/a/30681833/21755)